HC7 & HC8 -- Data Analysis

Below you will find a first exploration of the AirBnB_NYC_2019.csv dataset. The goal is for you to learn the tools for data exploration (many of these you learned in our online labs) and then be able to further explore the data and obtain results to report in our final presentation.

As you know, we must first import the packages we will use to do this:
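For example (pandas for dataframes, matplotlib for the plots later on; the exact import list in the original notebook may differ):

```python
import pandas as pd               # dataframes and summary statistics
import matplotlib.pyplot as plt   # plotting (used in the visualization section)
```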

EXPLORE THE DATA

To begin, we will read the data into a pandas dataframe and find out its size (dimensions/shape)

Alternatively, to find out how many observations (rows) in the dataset:
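A sketch of both steps. Since the full CSV is not included here, the toy DataFrame below stands in for pd.read_csv('AirBnB_NYC_2019.csv'); the column names match the dataset, but the numbers are made up:

```python
import pandas as pd

# In the notebook: df = pd.read_csv('AirBnB_NYC_2019.csv')
# Toy stand-in with the same column names, so the calls below run as-is:
df = pd.DataFrame({
    'name': ['Cozy loft', 'Sunny room', None],
    'host_name': ['Ann', 'Ben', None],
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan'],
    'price': [150, 80, 200],
    'number_of_reviews': [10, 0, 5],
})

print(df.shape)   # (rows, columns)
print(len(df))    # number of observations (rows)
```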

To find out the names of the columns, we can print the header of the dataframe (with no rows)

Alternatively, to see sample data, we can print the header of the dataframe with the first two rows:
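Both header views might look like this (toy data standing in for the full CSV):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Cozy loft', 'Sunny room'],
    'neighbourhood_group': ['Manhattan', 'Brooklyn'],
    'price': [150, 80],
})

print(df.head(0))   # column names only, no rows
print(df.head(2))   # column names plus the first two rows
```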

Let's dig a little deeper and produce some stats

To describe the numerical data:

The function describe() gives the number of values (count), average, standard deviation, minimum and maximum values as well as the 25th, 50th and 75th percentiles, for each numerical column. Alternatively, to retrieve and print the stats for a particular column:
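A sketch of both calls, on toy data (in the notebook, df is the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan', 'Bronx'],
    'price': [150, 80, 200, 55],
    'minimum_nights': [2, 1, 30, 2],
})

print(df.describe())            # count/mean/std/min/25%/50%/75%/max per numerical column
print(df['price'].describe())   # the same stats for the price column only
```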

The describe() function, as called above, provides numerical statistics for numerical data. To get some stats for the categorical data:

The describe() function with include=['O'] argument (that is a capital letter O), will provide the number of values (count), the number of unique values, the most frequent value (top) and the frequency of that value, for each categorical column.

As we did above, if you want to extract the stats for a particular categorical column:
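Both the include=['O'] call and the single-column version, sketched on toy data (the counts from the full dataset, discussed next, will of course differ):

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan', 'Bronx'],
    'price': [150, 80, 200, 55],
})

print(df.describe(include=['O']))            # count/unique/top/freq for each categorical (object) column
print(df['neighbourhood_group'].describe())  # the same four stats for one column
```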

Here we can see that all observations (listings) specify the neighbourhood_group (the count equals the number of rows in the dataset), that there are 5 possible neighbourhood_groups (unique == 5; these are the boroughs), and that Manhattan has the largest number of listings (21,661).

Let's look at another column:
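For example (toy dates; note that describe() skips null entries when counting):

```python
import pandas as pd

df = pd.DataFrame({
    'last_review': ['2019-06-23', None, '2019-05-01', '2019-06-23'],
})

print(df['last_review'].describe())  # count skips nulls; top is the most frequent date
```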

Here we can see that NOT all observations (listings) have an entry for last_review (count is less than number of rows in the dataset), we see that there are 1764 different dates for the last review, and that the date of most "last reviews" (top) was June 23 2019, with 1,413 reviews on that day.

Now we may wonder, what is the latest date for last_review in our dataset? To get the answer, first we convert the date (which is now a string) to a datetime format recognized by pandas, and then we can ask pandas for the latest (max) date:
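A minimal sketch of the conversion and the max query:

```python
import pandas as pd

df = pd.DataFrame({'last_review': ['2019-06-23', None, '2019-05-01']})

# Convert the string column to pandas datetimes, then ask for the latest date
df['last_review'] = pd.to_datetime(df['last_review'])
print(df['last_review'].max())
```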

Since we don't have data for the number of bookings for each listing, from the last_review analysis above we may hypothesize that there were fewer bookings overall after June -- unless summer guests got lazy with reviews :)

CLEAN THE DATA

It is always a good idea to clean your data before you start working. The most common problem you will encounter is missing data. If some observations (rows) do not provide a value for a particular variable (column), you may encounter errors when you try to compute statistics or plot the data.

To check if there are any missing values:
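Sketched on toy data with a few deliberate gaps:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Cozy loft', None, 'Harbor view'],
    'price': [150, 80, 60],
    'reviews_per_month': [1.2, None, None],
})

print(df.isnull().sum())  # number of missing values per column
```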

Here we see that most columns do have entries for all the listings (0 null values), except for:
  - name (the name of the listing): 16 listings don't have a name
  - host_name: 21 hosts did not provide a name
  - last_review and reviews_per_month: 10052 listings seem to omit review data, although they all specify the total number of reviews. Perhaps some of those listings have 0 reviews? Let's find out:
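One way to check, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({'number_of_reviews': [10, 0, 5, 0]})

# Is there at least one listing with zero reviews?
print((df['number_of_reviews'] == 0).any())
```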

Indeed!!! There are listings with 0 reviews. But how many of them?
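value_counts() answers exactly this, sketched here on toy data:

```python
import pandas as pd

df = pd.DataFrame({'number_of_reviews': [10, 0, 5, 0, 0]})

# Each distinct value on the left, how many listings have it on the right
print(df['number_of_reviews'].value_counts())
```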

Here we see on the left the different values in the number_of_reviews column, and on the right the number of listings with that value. And voilà, there are 10052 listings with 0 reviews, and thus null entries for last_review and reviews_per_month for those listings.

So now, let's clean the data by getting rid of the null values.
We can do that by replacing null values in our numerical columns with 0 (we can do that for the date too, now that we converted it):

And replace null values in our categorical data (string values) with the empty string:

Now we can check to make sure we got rid of all null values:
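The three cleaning steps above, sketched end-to-end on toy data (numeric nulls filled with 0, the converted date column filled with the epoch as a placeholder, string nulls filled with the empty string):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Cozy loft', None],
    'host_name': [None, 'Ben'],
    'reviews_per_month': [1.2, None],
    'last_review': ['2019-06-23', None],
})

# Numerical columns: replace nulls with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Date column: convert first, then fill nulls (here with the epoch, 1970-01-01, as a placeholder)
df['last_review'] = pd.to_datetime(df['last_review'])
df['last_review'] = df['last_review'].fillna(pd.Timestamp(0))

# Categorical (string) columns: replace nulls with the empty string
for col in ['name', 'host_name']:
    df[col] = df[col].fillna('')

print(df.isnull().sum())  # every column should now report 0 nulls
```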

And voilà, our data is free of null values!!!

Another sensible thing to do, and one that should always be considered, is anonymizing our data for privacy protection. When working with data, it is always important to ask whether our analysis needs the data that could potentially identify individual subjects, and if not, it is a good idea to remove that data altogether.

In this case the data comes from Kaggle and it has already been anonymized somewhat, but we still have host_id and host_name information. If we will not use that data in our analysis, it is a good idea to remove it:
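Dropping the identifying columns might look like this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'host_id': [101, 102],
    'host_name': ['Ann', 'Ben'],
    'price': [150, 80],
})

# Drop the columns that could identify the host
df = df.drop(columns=['host_id', 'host_name'])
print(df.columns.tolist())
```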

And we can see that we no longer have columns that may identify the host!

VISUALIZE THE DATA

Let's look at the data by borough:

To find out the names of the boroughs (we kind of know this already...):
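For example (toy rows covering the five boroughs):

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan',
                            'Staten Island', 'Bronx', 'Queens'],
})

print(df['neighbourhood_group'].unique())  # the five boroughs
```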

Let's group the data by borough (neighbourhood_group)

Now that the data is grouped by borough in boro_group, let's plot the data to observe the average price per borough:
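Both the grouping and the plot, sketched on toy data (the Agg backend is set only so this sketch runs without a display; a notebook does not need it):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan', 'Bronx'],
    'price': [150, 80, 200, 55],
})

boro_group = df.groupby('neighbourhood_group')
avg_price = boro_group['price'].mean()  # average price per borough
print(avg_price)

avg_price.plot(kind='bar')  # one bar per borough
```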

To save your plot to an image file, you can do the following: leave more space at the bottom of the figure so the borough names on the x axis are fully visible, label the y axis, save the figure, and then clear it so we can make more plots later:
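A sketch of that sequence; the output filename is hypothetical:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt
from pathlib import Path

avg_price = pd.Series([55.0, 80.0, 175.0],
                      index=['Bronx', 'Brooklyn', 'Manhattan'], name='price')
avg_price.plot(kind='bar')

plt.gcf().subplots_adjust(bottom=0.3)    # leave room for the borough names on the x axis
plt.ylabel('Average price (USD)')        # label the y axis
plt.savefig('avg_price_by_borough.png')  # hypothetical filename
plt.clf()                                # clear the figure so we can make more plots later
print(Path('avg_price_by_borough.png').exists())
```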

We can observe that, as we all probably expected, Manhattan has the highest prices.

Let's look at the "minimum number of nights" requirement by borough:

Let's save this plot to a file:
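Analogous to the price plot, both steps might look like this (toy data, hypothetical filename):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan', 'Bronx'],
    'minimum_nights': [2, 5, 30, 6],
})

avg_nights = df.groupby('neighbourhood_group')['minimum_nights'].mean()
avg_nights.plot(kind='bar')
plt.ylabel('Average minimum nights')
plt.savefig('min_nights_by_borough.png')  # hypothetical filename
plt.clf()
```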

We can observe that, on average, Manhattan requires the highest number of minimum nights per stay. All neighborhoods require, on average, stays that are longer than 4 nights.

Now you may wonder, what is the maximum number of minimum_nights in each borough?

Now we can see that there are long-term listings in all boroughs, with Manhattan having the highest minimum stay (more than 3 years!!!)

Let's say I now want to look at data from Staten Island only (note I am selecting rows here!!!)
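Row selection with a boolean mask, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Staten Island', 'Brooklyn', 'Staten Island'],
    'price': [150, 60, 80, 95],
})

# The boolean mask selects ROWS where the condition holds
st = df[df['neighbourhood_group'] == 'Staten Island']
print(len(st))
```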

I now have a new dataframe called st with only the rows where neighbourhood_group == 'Staten Island', and I can see that there are 373 listings for Staten Island.

YOUR TURN!

Now that you know how to explore the data, clean the data, obtain statistics about the data, visualize the data and select a subset of the data based on the value in a particular column (e.g. neighbourhood_group == 'Staten Island'), think about how you want to explore the data for your analysis:

  1. Individually, analyze the data in your neighborhood and compare to the data for the borough.
  2. As a group, think about an overall data-driven discussion to compare your borough to the others.

As you explore your data, keep in mind your observations and conclusions from HC2 and HC3 and see if you can make any connections, or if you find that the data supports those conclusions.