Introduction
Airbnb is an online marketplace that allows property owners to rent out their properties to travelers who are looking for a place to stay for a short period of time, anywhere from 1 to 365 days. Property owners, or hosts, advertise a property for rent through a listing that describes the features of the housing space, such as the number of rooms and the number of beds. Moreover, Airbnb's calendar feature gives hosts full control over when they share their housing space (i.e. property availability), for how long they share it, and the rental price they charge (Airbnb, n.d.).
Our project focuses on predicting rental prices for short-term homestays and experiences during travel. Anyone who plans to travel away from home benefits from knowing roughly what a homestay should cost. Airbnb is a platform through which travelers can book a stay at a property that matches their requirements and the price they can afford to pay. This project aims to help people make data-driven decisions when booking a property for their next trip.
Methodology and Results
Airbnb listing data is available for countries such as New Zealand and Ireland, and for cities such as Toronto, Canada and Seattle, USA. The data is available on the Inside Airbnb website (Inside Airbnb, n.d.). We used the New Zealand dataset for 2021 – 2022, which covers the 365 days starting in December 2021. The dataset contains two files, named listings and calendar. Originally, the listings file contained 36072 rows and 80 columns, and the calendar file contained 14018930 rows and 7 columns. Both files are useful for our analysis, which is divided into three main parts:
● Exploratory Data Analysis
● Preprocessing
● Prediction of Rental Prices
All three parts are inter-related and overlap to some extent (e.g. some pre-processing was done in the Exploratory Data Analysis portion).
Exploratory Data Analysis
Calendar File:
We first performed the Exploratory Data Analysis (EDA) on the calendar file to find out how busy Airbnb is in New Zealand (i.e. how much Airbnb services are used). After importing the essential libraries, we read the calendar file using the pandas library and counted the number of unique listings in New Zealand for 2021 – 2022. We then printed the starting date and the ending date of the calendar, checked for duplicated entries, and found that there were none.

Inference: The earliest date in the calendar is 17th December, 2021 and the latest is 18th December, 2022. We then evaluated the null values, checked the dimensionality of the file, and analyzed how many properties were available to rent.
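The checks described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual notebook code: the small in-memory frame stands in for the real calendar file, the column names (`listing_id`, `date`, `available`, `price`) are assumed to follow the Inside Airbnb calendar export, and all values are made up.

```python
import pandas as pd

# Toy stand-in for the calendar file (assumed column names; illustrative values only).
calendar = pd.DataFrame(
    {
        "listing_id": [101, 101, 102, 102, 103],
        "date": ["2021-12-17", "2021-12-18", "2021-12-17", "2022-12-18", "2022-01-05"],
        "available": ["t", "f", "f", "t", "t"],
        "price": ["$120.00", "$120.00", "$85.00", "$90.00", None],
    }
)
calendar["date"] = pd.to_datetime(calendar["date"])

n_listings = calendar["listing_id"].nunique()    # number of unique listings
first_day = calendar["date"].min()               # starting date of the calendar
last_day = calendar["date"].max()                # ending date of the calendar
n_duplicates = int(calendar.duplicated().sum())  # fully duplicated rows
null_counts = calendar.isnull().sum()            # missing values per column
n_rows, n_cols = calendar.shape                  # dimensionality of the file

print(n_listings, first_day.date(), last_day.date(), n_duplicates, (n_rows, n_cols))
print(null_counts)
```

On the real data, `calendar` would come from reading the calendar file with pandas instead of being built by hand.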

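Inference: f marks calendar entries in which the property is not available, and there are 7616033 such rows. t marks calendar entries in which the property is available for rent, and there are 6402897 such rows. Together they account for all 14018930 rows of the calendar file, so the counts refer to listing-days rather than to distinct properties.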
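A count like the one above is typically obtained with `value_counts`. The sketch below uses a short made-up series in place of the real `available` column (the column name is an assumption), and then redoes the arithmetic on the two totals reported in the text.

```python
import pandas as pd

# Toy availability flags; on the real data this would be calendar["available"].
available = pd.Series(["t", "f", "f", "t", "f", "t", "f"], name="available")

counts = available.value_counts()            # number of rows per flag
share_available = (available == "t").mean()  # fraction of rows flagged as available

# The same share computed from the totals reported in the text.
reported_t, reported_f = 6402897, 7616033
reported_share = reported_t / (reported_t + reported_f)

print(counts.to_dict())
print(round(share_available, 3), round(reported_share, 3))
```

With the reported totals, roughly 45.7 percent of the calendar rows are flagged as available.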
Next, we examined the daily average availability over one year and plotted it on a graph (given in Figure 1). The graph shows that Airbnb services are used the most in February/March in New Zealand.
Then, we removed the dollar sign ‘$’ from the prices and converted the column into a numeric data type so that we could analyze how the average price changes across months (given in Figure 2) and across the days of the week (Figure 3). We found that Airbnb prices shoot up in April, May and June. Moreover, Fridays and Saturdays are about $3 – 4 more expensive than the rest of the week.
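The price cleaning and the averages behind the three figures could look roughly like this. It is a sketch under assumptions: the column names follow the Inside Airbnb calendar export, the values are invented, and stripping the thousands separator "," is an extra step that the text does not mention but that prices such as "$1,200.00" would need.

```python
import pandas as pd

# Toy calendar rows (assumed column names; illustrative values only).
calendar = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-04-01", "2022-04-02", "2022-04-04", "2022-07-05", "2022-07-08"]
        ),
        "available": ["t", "f", "t", "t", "f"],
        "price": ["$1,200.00", "$150.00", "$100.00", "$90.00", "$110.00"],
    }
)

# Remove "$" (and, as an assumption, ",") and convert the column to a numeric dtype.
calendar["price"] = (
    calendar["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Share of available rows per day (the quantity plotted over the year).
daily_availability = (calendar["available"] == "t").groupby(calendar["date"]).mean()

# Average price per calendar month and per day of the week.
monthly_price = calendar.groupby(calendar["date"].dt.month)["price"].mean()
weekday_price = calendar.groupby(calendar["date"].dt.day_name())["price"].mean()

print(monthly_price)
print(weekday_price)
```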

Figure 1

Figure 2

Figure 3
Listing File:
For the Exploratory Data Analysis of the listings file, we first read the file using the pandas library and checked its dimensionality, which was 36072 rows and 81 columns. There were zero duplicated entries. We then grouped the listings by neighbourhood to analyze how many listings each neighbourhood has, and found that Auckland has the highest number of listings. The rest of the exploration is summarized in the following table: the first column describes the analysis, the second column gives the result, and the third column states whether a graph is available in the project's Jupyter Notebook file.
| No. | Exploratory Data Analysis | Results | Graph Available |
| 1 | Distribution of review score rating | Most reviewers leave a high score. The average score is 4.78. | Yes |
| 2 | Relationship of Neighbourhood with Price | South Eastern Ward has the highest number of listings. It also enjoys the highest median price, and Flaxmere has the lowest median price. | Yes |
| 3 | Property type vs. Price | Private room in cave has the highest number of listings. | Yes |
| 4 | Room type vs. Price | Entire home/apt has a higher median price than the other room types. Entire home/apt also has the highest number of listings. | Yes |
| 5 | Median price per bed | The median price per bed is $100. | Yes |
| 6 | Common amenities | Smoke alarm, Wifi, Kitchen and Heating are among the most common amenities. | Yes |
| 7 | Number of beds along with per bedroom | The vast majority of listings have 1 bedroom and 1 bed. | Yes |
| 8 | Number of bathrooms along with per bedroom | It looks like listings with 6 bedrooms and 10 beds have the highest median price. | Yes |
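Several rows of the table come down to a count or a grouped median. The following sketch shows how such summaries might be computed. The column names (`neighbourhood_cleansed`, `room_type`, `price`, `beds`, `review_scores_rating`) are assumed from the Inside Airbnb listings export, and the rows are invented, so the numbers do not correspond to the results reported above.

```python
import pandas as pd

# Toy listings (assumed column names; illustrative values only).
listings = pd.DataFrame(
    {
        "neighbourhood_cleansed": ["A", "A", "B", "B", "B", "C"],
        "room_type": [
            "Entire home/apt", "Private room", "Entire home/apt",
            "Entire home/apt", "Shared room", "Private room",
        ],
        "price": [200.0, 80.0, 150.0, 250.0, 40.0, 60.0],
        "beds": [2, 1, 3, 5, 1, 1],
        "review_scores_rating": [4.9, 4.7, 4.8, None, 4.5, 4.6],
    }
)

# How many listings each neighbourhood has.
listings_per_area = listings["neighbourhood_cleansed"].value_counts()

# Median price per neighbourhood and per room type.
median_price_by_area = (
    listings.groupby("neighbourhood_cleansed")["price"].median().sort_values(ascending=False)
)
median_price_by_room = listings.groupby("room_type")["price"].median()

# Median price per bed and average review score (missing scores are skipped).
median_price_per_bed = (listings["price"] / listings["beds"]).median()
mean_rating = listings["review_scores_rating"].mean()

print(listings_per_area.idxmax(), median_price_by_area.index[0])
print(median_price_by_room.to_dict(), median_price_per_bed, round(mean_rating, 2))
```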
Pre-processing
The pre-processing includes the following stages:
● We removed the dollar sign ‘$’ from the price column in the listings file and checked the descriptive statistics of the price column. We found that the most expensive Airbnb listing in New Zealand costs $53,788/night.
● We checked the distribution of listing prices and removed outliers below the 1% quantile and above the 95% quantile.
● We removed the string values from the bathrooms_text column so that only float values were left.
● Then we checked for correlation (given in figure 4). The number of bedrooms and accommodates seem to be correlated with price.
● We generated a new column from amenities, named ‘Amenities offered’, which holds the total number of amenities offered at a property.
● Then we encoded the columns ‘host_is_superhost’, ‘instant_bookable’, ‘has_availability’, and ‘host_identity_verified’, where every t value in these columns was encoded as 1 and every f value was encoded as 0.
● Finally, we encoded the column ‘room_type’, where Entire home/apartment was given the value 0, Private room 1, Shared room 2 and Hotel room 3.
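The stages listed above could be sketched in pandas as follows. This is an illustration under assumptions, not the project's code: the toy rows are invented, amenities are assumed to be stored as a JSON-style list in a string (as in the Inside Airbnb export), the number in bathrooms_text is pulled out with a regular expression, and only one of the four t/f columns is shown.

```python
import json

import pandas as pd

# Toy listings (illustrative values only).
listings = pd.DataFrame(
    {
        "price": ["$50.00", "$120.00", "$95.00", "$53,788.00", "$10.00", "$150.00"],
        "bathrooms_text": ["1 bath", "1.5 shared baths", "2 baths", "3 baths", "1 private bath", "1 bath"],
        "amenities": [
            '["Wifi", "Kitchen"]', '["Wifi"]', "[]",
            '["Wifi", "Heating", "Smoke alarm"]', '["Kitchen"]', '["Wifi", "Kitchen"]',
        ],
        "host_is_superhost": ["t", "f", "f", "t", "f", "t"],
        "room_type": [
            "Entire home/apt", "Private room", "Shared room",
            "Entire home/apt", "Hotel room", "Private room",
        ],
    }
)

# Price: strip "$" and "," and convert to float.
listings["price"] = listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Outliers: keep rows between the 1% and 95% price quantiles.
low, high = listings["price"].quantile([0.01, 0.95])
clean = listings[listings["price"].between(low, high)].copy()

# bathrooms_text: keep only the numeric part (text without a number becomes NaN).
clean["bathrooms_text"] = (
    clean["bathrooms_text"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# New column: total number of amenities offered at a property.
clean["Amenities offered"] = clean["amenities"].apply(lambda s: len(json.loads(s)))

# t/f columns become 1/0; room types become integer codes.
clean["host_is_superhost"] = clean["host_is_superhost"].map({"t": 1, "f": 0})
room_codes = {"Entire home/apt": 0, "Private room": 1, "Shared room": 2, "Hotel room": 3}
clean["room_type"] = clean["room_type"].map(room_codes)

# Correlation of every numeric column with price.
price_correlation = clean.select_dtypes("number").corr()["price"]
print(clean.drop(columns="amenities"))
print(price_correlation)
```

In this toy example the quantile filter drops the $10 and $53,788 rows and keeps the other four.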

Figure 4
Following is the list of variables selected for analysis:
| No. | Variable Name | Description |
| 1 | Price | Price at which the host publishes a listing. |
| 2 | Host_is_superhost | Whether the host is a superhost*. |
| 3 | Host_identity_verified | Whether the host has completed the verification process. |
| 4 | Instant_bookable | Whether the listing can be booked instantly. |
| 5 | Host_listings_count | Number of active listings of a host. |
| 6 | Host_total_listings_count | Total number of listings by a host. |
| 7 | Bedrooms | Total number of bedrooms of the property. |
| 8 | Bathrooms_text | Total number of bathrooms of the property. |
| 9 | Accommodates | How many people a property can accommodate. |
| 10 | Beds | Total number of beds available on the property listed. |
| 11 | Room_type | Type of property that is available for rental. |
| 12 | Amenities Offered | Number of useful features or services offered by the host/property. |
*Superhost is a designation that Airbnb gives to hosts. Being an Airbnb Superhost is about providing outstanding hospitality, which means being highly-rated, experienced, reliable, and responsive.
Prediction
The preprocessed data is saved as an Excel file and later imported as a data frame in a separate file for prediction, which keeps the code more readable.
The price of listings is taken as our dependent variable, and a natural log is applied to it because all of the independent variables take values in the single or double digits.
Therefore, our model is: ln(Price) = B0 + B1*X + u, where a change in X (the independent variable) by one unit (∆X = 1) is associated with an (exp(B1) – 1)*100 % change in Price.
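As a worked example of this interpretation, the short sketch below evaluates the percentage formula for a hypothetical coefficient and checks it against a direct calculation of the price before and after a one-unit change in X. The values of B0 and B1 are invented for illustration and are not estimates from the project.

```python
import math


def percent_change_in_price(b1: float, delta_x: float = 1.0) -> float:
    """Percent change in Price implied by ln(Price) = B0 + B1*X + u when X changes by delta_x."""
    return (math.exp(b1 * delta_x) - 1.0) * 100.0


# Hypothetical coefficients, for illustration only.
b0, b1 = 4.0, 0.10

# Direct check: predicted price at X = 2 and at X = 3 (error term u set to zero).
price_before = math.exp(b0 + b1 * 2)
price_after = math.exp(b0 + b1 * 3)
direct = (price_after / price_before - 1.0) * 100.0

formula = percent_change_in_price(b1)
print(round(formula, 2), round(direct, 2))
```

With B1 = 0.10, one additional unit of X goes with a price that is about 10.5 percent higher, slightly more than the 10 percent that reading B1 directly as a percentage would suggest.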
The dataset is split into training and testing sets by random sampling without replacement, with about 80 percent of the rows going to the training set and the remaining 20 percent to the testing set.
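One way to obtain such a split with pandas alone is shown below. The text does not say which function was used, so `DataFrame.sample` is an assumption here, and the data frame, its columns, and the random seed are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame with a log-transformed target (illustrative values only).
data = pd.DataFrame(
    {
        "bedrooms": np.arange(100) % 5 + 1,
        "price": np.linspace(40.0, 400.0, 100),
    }
)
data["log_price"] = np.log(data["price"])

# sample() draws without replacement by default; the seed makes the split repeatable.
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)  # the remaining 20 percent of the rows

print(len(train), len(test))
```

Because the test set is built from the rows that were not sampled, the two sets cannot share a row.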
Different Machine Learning algorithms are applied to these sets, and their R2 Train, R2 Test, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are stored in an evaluation data frame to evaluate the performance of each algorithm.
The ML algorithms we use in our prediction are:
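For reference, the four metrics can be written out directly from their definitions. The helper below is a hypothetical illustration using NumPy, not the function used in the project, and the example values are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))  # average absolute error
    mse = np.mean(residuals ** 2)     # average squared error
    rmse = np.sqrt(mse)               # square root of the MSE, in the units of the target
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot        # share of the variance that is explained
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse}


# Made-up log prices and predictions.
metrics = regression_metrics([4.0, 5.0, 6.0, 7.0], [4.5, 5.0, 5.5, 7.0])
print(metrics)
```

Since the target here is the log of the price, MAE, MSE and RMSE are measured on the log scale rather than in dollars.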
- Linear Regression
- Lasso Regression
- Decision Trees
- Random Forest
- Gradient Boosting
- K-Nearest Neighbour
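A comparison loop over these six algorithms might look like the sketch below, assuming scikit-learn is used. The synthetic features and target, the hyperparameters (for example `alpha=0.01` for Lasso), and the random seeds are all placeholders chosen for illustration, so the resulting numbers say nothing about the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed listings; y plays the role of ln(Price).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "bedrooms": rng.integers(1, 6, 300),
        "accommodates": rng.integers(1, 11, 300),
        "room_type": rng.integers(0, 4, 300),
    }
)
y = 4.0 + 0.25 * X["bedrooms"] + 0.05 * X["accommodates"] - 0.2 * X["room_type"]
y = y + rng.normal(0.0, 0.3, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters are placeholders, not the project's settings.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.01),
    "Decision Trees": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "K-Nearest Neighbour": KNeighborsRegressor(n_neighbors=5),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rows.append(
        {
            "Model": name,
            "R2 Train": model.score(X_train, y_train),  # score() returns R-squared
            "R2 Test": model.score(X_test, y_test),
            "MAE": mean_absolute_error(y_test, pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
        }
    )

evaluation = pd.DataFrame(rows).set_index("Model")
print(evaluation.round(3))
```

Keeping R2 Train next to R2 Test makes overfitting easy to spot: a model whose training score is far above its test score has memorized the training rows.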
Analysis:
The results are as follows:

The Gradient Boosting algorithm performed the best: it had the lowest error values and an R-squared of about 0.55, which means it explains roughly fifty-five percent of the variation in the log of the listing price. Because this is a regression problem rather than a classification problem, and we are attempting to predict the exact value of the log of the listing price, the fit is sensitive to small errors, and it is therefore difficult to achieve a high R-squared on the test set. We therefore regard the R-squared achieved through gradient boosting as high for this model.

Faizan Zahid
Associate Consultant Data Analysis