tristansrenaud
- Nov 3, 2020

Finding the Best Hotel in Paris

Updated: Jan 5, 2021

Using data science to recommend hotels based on interests and budget.

Introduction

I would like to take a trip to Paris, which means I need to find a hotel! I would like to find one that is:

Within my budget.
Walking distance to the types of places I'd like to visit. Examples include cafes, nightlife, and entertainment.
Well rated.

I decided to use data science methods to identify the best hotels that meet these constraints.

While I will use myself as an example throughout this article, I decided to generalize the methods so the code can generate a recommendation based on anyone's travel interests and budget.

As such, we will answer the question:

"What is the optimal hotel in Paris given a budget and travel interests?"

Data

The data falls into three categories (sources in parentheses):

Geographical coordinates of Paris. (Nominatim via Geopy) - Used to initialize maps.
Hotel data. (Foursquare Places API and Yelp Fusion API) - Used to identify hotels in Paris.
Venues in proximity to each hotel. (Foursquare Places API) - Used to identify the venues within walking distance to each hotel in Paris.

Let's take a closer look at each category of data:

1. Geographical coordinates of Paris

The geographical coordinates of Paris will be used to center the maps generated using Folium as such:

2. Hotel data

Hotel data is sourced from Foursquare and Yelp. I chose to include Yelp in this analysis because it has pricing information on hotels whereas Foursquare does not.

A combination of hotel information is included, such as ID (unique identifier), address, phone number and URL. This information will keep the data clean and allow others to easily research the hotel and make a reservation.

The dataset started with 433 hotels and ended up with 245 hotels after cleaning the data.

Summary of hotel data:

Sample of hotel data:

3. Venues in proximity to each hotel

Up to 100 venues within 500m (0.31mi) of each hotel were extracted using the Foursquare Places API. Each row represents a venue near a specific hotel.
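To make "within 500m" concrete, here is a minimal sketch of a radius filter using the haversine formula. In practice the API call itself is given the radius and limit, so this is only an illustration of what the constraint means. The hotel, the venues, and the function names are hypothetical and are not taken from the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def venues_within(hotel, venues, radius_m=500, limit=100):
    """Return the names of up to `limit` venues within `radius_m`, nearest first."""
    nearby = []
    for venue in venues:
        d = haversine_m(hotel["lat"], hotel["lon"], venue["lat"], venue["lon"])
        if d <= radius_m:
            nearby.append((d, venue["name"]))
    nearby.sort()
    return [name for _, name in nearby[:limit]]


# Hypothetical sample data (made up for illustration only).
hotel = {"name": "Hotel X", "lat": 48.8566, "lon": 2.3522}
venues = [
    {"name": "Cinema C", "lat": 48.8566, "lon": 2.3570},  # a few hundred metres east
    {"name": "Bar B", "lat": 48.8800, "lon": 2.3522},     # a couple of kilometres north
    {"name": "Cafe A", "lat": 48.8575, "lon": 2.3522},    # roughly 100 m north
]

walkable = venues_within(hotel, venues)
```

With this sample data, only the cafe and the cinema fall inside the 500m radius, and the cafe is listed first because it is closer.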

Sample of venues near each hotel:

Methodology

K-Means Machine Learning

The venues near each hotel were aggregated and one-hot encoded so that k-means clustering could be run on them. The encoded data also lets us see the 10 most common venue types near each hotel:
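As a rough sketch of this step, the snippet below one-hot encodes venue categories, averages the encoded rows per hotel, and ranks the most common categories. Averaging (rather than summing) is my assumption of a common pattern, and the rows and function names are hypothetical. The notebook uses Pandas for this, whereas the sketch is written in plain Python to stay self-contained.

```python
from collections import Counter

# Hypothetical (hotel, venue category) rows, one row per nearby venue.
rows = [
    ("Hotel A", "Cafe"), ("Hotel A", "Cafe"), ("Hotel A", "Bar"),
    ("Hotel B", "French Restaurant"), ("Hotel B", "Cafe"),
]

# Fixed column order for the one-hot vectors.
categories = sorted({category for _, category in rows})


def one_hot(category):
    """Vector with a 1 in the column of `category` and 0 elsewhere."""
    return [1 if category == c else 0 for c in categories]


def hotel_profiles(rows):
    """Average the one-hot rows per hotel, giving the share of each category."""
    sums, counts = {}, Counter()
    for hotel, category in rows:
        acc = sums.setdefault(hotel, [0] * len(categories))
        for i, x in enumerate(one_hot(category)):
            acc[i] += x
        counts[hotel] += 1
    return {h: [x / counts[h] for x in acc] for h, acc in sums.items()}


def top_venues(profile, n=10):
    """Most common categories for one hotel, ties broken alphabetically."""
    ranked = sorted(zip(categories, profile), key=lambda pair: (-pair[1], pair[0]))
    return [category for category, share in ranked[:n] if share > 0]


profiles = hotel_profiles(rows)
top_a = top_venues(profiles["Hotel A"])
top_b = top_venues(profiles["Hotel B"])
```

The per-hotel profiles are what a clustering algorithm would consume, while the ranked lists are what a reader would look at.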

In order to find the ideal cluster count for our hotels, the distortion score and silhouette score elbow points were considered.
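For readers unfamiliar with these two scores, here is a dependency-free sketch of what they measure. Distortion is the sum of squared distances from each point to its cluster centroid (lower is tighter), and the silhouette score compares how close each point is to its own cluster versus the nearest other cluster (closer to 1 is better). The project relies on Yellowbrick and Scikit-learn for this, so the functions and toy points below are illustrative assumptions, not the notebook's code.

```python
import math


def distortion(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = tuple(sum(coord) / len(members) for coord in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two obvious groups, labelled once sensibly and once badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]

good = (distortion(points, good_labels), silhouette_score(points, good_labels))
bad = (distortion(points, bad_labels), silhouette_score(points, bad_labels))
```

The sensible labelling has low distortion and a silhouette close to 1, while the bad labelling has high distortion and a negative silhouette. Repeating this for a range of k values gives the curves in which an elbow is looked for.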

While no elbow point was found for the distortion score, an elbow point at k=11 was found for the silhouette score, as seen below:

Using k=11 for the model, k-means produces the following breakdown of hotels by cluster:

Let's take a look at the hotels on a map, with each color representing a different cluster:

Ranking Hotels

After a cluster is selected, there needs to be a way to rank the hotels within that cluster such that the best hotels are returned in the recommendation.

The visual below plots the Foursquare and Yelp ratings against each other for each hotel. In addition, the top of the plot shows the Foursquare rating distribution, and the right shows the Yelp rating distribution.

Notice how the distributions are shaped differently. The Foursquare distribution is more evenly spread and symmetrical than the Yelp distribution. This means the ratings need to be transformed before they can be averaged together and used to rank the hotels.

To do this, a uniform quantile transformation is applied to the Foursquare and Yelp ratings independently. Then, an average is applied, weighted by the number of ratings on each platform, to calculate final scores.
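The idea can be sketched in a few lines. The snippet below maps each rating to its empirical quantile in [0, 1] using ranks, then combines the two platforms with a count-weighted average. Scikit-learn offers a QuantileTransformer with a uniform output distribution that serves the same purpose. The rank-based version here is a simplified stand-in, and the ratings, counts, and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def uniform_quantiles(values):
    """Map each value to its empirical quantile in [0, 1]; ties share a mid-rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [0.5]
    quantiles = []
    for v in values:
        lo = bisect_left(ordered, v)        # first position holding v
        hi = bisect_right(ordered, v) - 1   # last position holding v
        quantiles.append(((lo + hi) / 2) / (n - 1))
    return quantiles


def combined_score(q_fsq, n_fsq, q_yelp, n_yelp):
    """Average of the two quantiles, weighted by the number of ratings on each platform."""
    total = n_fsq + n_yelp
    if total == 0:
        return None  # no ratings on either platform
    return (q_fsq * n_fsq + q_yelp * n_yelp) / total


# Hypothetical ratings for three hotels (the two platforms use different scales).
fsq_ratings, fsq_counts = [7.0, 8.0, 9.0], [100, 50, 10]
yelp_ratings, yelp_counts = [4.5, 3.0, 4.0], [0, 50, 30]

q_fsq = uniform_quantiles(fsq_ratings)
q_yelp = uniform_quantiles(yelp_ratings)
scores = [
    combined_score(qf, nf, qy, ny)
    for qf, nf, qy, ny in zip(q_fsq, fsq_counts, q_yelp, yelp_counts)
]
```

Because both sets of ratings end up on the same 0 to 1 scale, a platform with many ratings pulls the final score toward its own ranking, while a platform with no ratings for a hotel has no influence at all.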

Top 5 hotels using combined rating:

Results

Now that we have our hotels clustered and a ranking system, it is time to apply my interests and budget and generate a recommendation!

As a reminder, here are the requirements:

Well rated – recommendation will output highest rated hotels
Moderately priced – recommendation will consider hotels with a price point of 2 out of 4 (with 4 being the highest price).
Near venues I would like to visit – the cluster that best matches the scenario’s preference will be used to narrow down hotels.

Next, I take a few moments to decide which cluster best meets my interests using the table below:

Notice how each cluster has its own unique set of venues. For example:

Cluster 2 has a diverse array of restaurants.
Cluster 9 has plenty of shopping venues, such as boutiques.
Cluster 10 has nightlife, with a good deal of bars.

Personally, I am interested in Cluster 0, which is near coffee shops, cafes, pubs, and indie movie theaters.
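Once a cluster and price point are chosen, the drill-down itself is a simple filter and sort. Here is a minimal sketch, assuming each hotel record carries a cluster label, a price point from 1 to 4, and a combined rating score. The records, field names, and function name are hypothetical.

```python
def recommend(hotels, cluster, price, top_n=3):
    """Return the names of the top-rated hotels in the chosen cluster and price point."""
    matches = [h for h in hotels if h["cluster"] == cluster and h["price"] == price]
    matches.sort(key=lambda h: h["score"], reverse=True)  # best score first
    return [h["name"] for h in matches[:top_n]]


# Hypothetical hotel records (made up for illustration only).
hotels = [
    {"name": "Hotel A", "cluster": 0, "price": 2, "score": 0.91},
    {"name": "Hotel B", "cluster": 0, "price": 3, "score": 0.97},   # too expensive
    {"name": "Hotel C", "cluster": 0, "price": 2, "score": 0.75},
    {"name": "Hotel D", "cluster": 10, "price": 2, "score": 0.99},  # wrong cluster
    {"name": "Hotel E", "cluster": 0, "price": 2, "score": 0.83},
    {"name": "Hotel F", "cluster": 0, "price": 2, "score": 0.60},
]

picks = recommend(hotels, cluster=0, price=2)
```

Note that a small cluster may return fewer than three hotels, since the function simply returns whatever matches exist.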

After drilling down, the following 3 hotels are recommended:

Here are the recommended hotels on the map:

Discussion

Observations

Not all clusters are created equal – Cluster size varied from 1 to 36 hotels. If a small cluster is selected, there may not be 3 hotels to return as a recommendation (and the recommended hotels are less likely to be well rated!)
Not all clusters form geographical neighborhoods – If they all created geographical neighborhoods, hotel selection based on existing defined neighborhoods would have been the way to go.
“Hotel” and “French Restaurant” are the most common venues – This makes sense, given that Paris is a city in France. There was concern that including these venues would skew the results. However, the decision was made to keep them because they do, in fact, affect what can be found around each hotel in this report. If all hotel and French restaurant venues were removed, the k-means algorithm might treat a hotel with no hotels or French restaurants nearby as similar to one with many of them nearby.

Potential Future Exploration

Application to other cities – Is the analysis in this report transferable to other cities? If so, how well? Are manipulations required besides the initial information (coordinates of the city of interest and name of city)?
Expand beyond nearby venues – Consider additional parameters such as proximity to public transportation, distance to city center, and distance to specific points of interest (Louvre, Eiffel Tower, etc.). Do they improve the recommendation?
Evaluation, Deployment and Feedback – This report omits the Evaluation, Deployment and Feedback steps found in the Foundational Methodology for Data Science. Unfortunately, they are out of the scope of this report, as they would require physically visiting several hotels and checking whether the nearby venues match the original interests of a tourist.

Constraints and Shortfalls

Covid-19 impact – This report was produced during the Covid-19 epidemic. Consumer and business behavior were irregular and impacted the signals Foursquare uses to provide location data. This data may provide different results once the epidemic is over. Further information on changes in behavior: https://enterprise.foursquare.com/intersections/article/understanding-the-impact-of-covid-19/
Free-tier API limits – One big constraint with this project was sticking to a low volume of API calls. With higher limits, the following improvements can be achieved:
1. More nearby venues – Why stop at 100 nearby venues? Expanding to 200 could help paint a better picture of what is around each hotel.
2. More hotel information – additional useful information can be passed to the user, such as tips, photos, and even events happening at the hotel.

Conclusion

It turns out that data science methods can be used to improve travel recommendations! Specifically, they help when someone is looking for a hotel in a particular city that is within their budget and close to venues that interest them.

By considering hotels in Paris, along with the venues near each one, clusters of hotels can be generated, which helps narrow down the kind of immediate neighborhood a tourist would like around their hotel.

Applying a modified hotel rating system and a price-range filter to the hotels in the selected cluster allows the best hotels to be returned as a recommendation.

Interested in trying this out for yourself? Check out "Project Information" below!

References

Project Information

GitHub Repository: https://github.com/renautri/best-hotel-in-paris
Formal Report: https://github.com/renautri/best-hotel-in-paris/blob/main/Finding%20the%20Best%20Hotel%20in%20Paris%20-%20Report%20-%20Tristan%20S%20Renaud.pdf
Jupyter Notebook (contains all code): https://nbviewer.jupyter.org/github/renautri/best-hotel-in-paris/blob/main/Finding%20the%20Best%20Hotel%20in%20Paris%20-%20Notebook%20-%20Tristan%20S%20Renaud.ipynb

Data Sources

Foursquare Places API: https://developer.foursquare.com/places
Yelp Fusion API: https://www.yelp.com/fusion
Nominatim (OpenStreetMap): https://nominatim.openstreetmap.org/
CARTO (“Positron with labels” basemap): https://carto.com/help/building-maps/basemap-list/

Python Libraries Used

Pandas
NumPy
Folium
Yellowbrick
Scikit-learn
Matplotlib
Json
Requests
GeoPy
Math