An example of data science clustering analysis with Python

This example is part of the Applied Data Science Capstone by IBM.

PRE-FEASIBILITY STUDY FOR THE OPENING OF A MEXICAN RESTAURANT IN THE CITY OF TORONTO, CANADA

Introduction

Since 2010, UNESCO has recognized Mexican cuisine as Intangible Cultural Heritage of Humanity. Building on that recognition, a large restaurant chain in Mexico wants to expand abroad to show the world that Mexican cuisine is not Tex-Mex and to offer consumers the experience of a real Mexican meal made with products imported from Mexico.

The investors have decided to start their internationalization by opening a first restaurant in Toronto, Canada. To that end, they have asked their data scientist in Mexico for a pre-feasibility study to help them choose the best location for an authentic Mexican restaurant in the city.

Data

For this pre-feasibility study we use two data sources. First, we scrape a Wikipedia page to obtain the list of Toronto neighborhoods. Second, we use the Foursquare API to obtain the location data required for the study.

1. List of postal codes of Toronto, Canada:

Description: This is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area.

Link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

2. Foursquare API:

Description: Foursquare is a database of more than 105 million places worldwide and an API that provides location data. Foursquare opens access to its API so that developers can access real-time data generated by the Foursquare app and build applications on top of it.

Link: https://developer.foursquare.com/docs

Methodology

Working in a Jupyter Notebook, the first step is to import the Python libraries needed for the analysis (Fig. 1).

In this study, we need to explore, segment, and cluster the neighborhoods in the city of Toronto. The neighborhood data is not readily available as a ready-to-use dataset, but a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto (Fig. 2).

We therefore scrape the Wikipedia page, wrangle and clean the data, and read it into a pandas dataframe so that it is in a structured format (Fig. 3). Once the data is structured, we can explore and cluster the neighborhoods in the city of Toronto.
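The cleaning step might look roughly like the sketch below. The column names ("Postal Code", "Borough", "Neighbourhood"), the "Not assigned" marker, and the rule of falling back to the borough name are assumptions for illustration, not details taken from the study, and the layout of the live Wikipedia page may differ. The helper `clean_postal_table` and the sample rows are hypothetical.

```python
import pandas as pd

def clean_postal_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a borough and fill in missing neighborhood names."""
    df = raw.rename(columns={"Postal Code": "PostalCode",
                             "Neighbourhood": "Neighborhood"})
    # Assumption: rows without a borough are marked "Not assigned"
    df = df[df["Borough"] != "Not assigned"].copy()
    # Assumption: a missing neighborhood name falls back to the borough name
    missing = df["Neighborhood"] == "Not assigned"
    df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]
    return df.reset_index(drop=True)

# In a notebook the raw table could come from something like
#   raw = pd.read_html(URL)[0]  # needs network access and an HTML parser
# Here a small hand-made sample stands in for the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park", "Not assigned"],
})
postal = clean_postal_table(raw)
print(postal)
```

Keeping the cleaning in a small function makes it easy to rerun if the page is scraped again later.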

Now that we have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data (Fig. 4). We downloaded the geographical coordinates of each Toronto postal code from: http://cocl.us/Geospatial_data

We then merged both dataframes (Fig. 3 + Fig. 4) to obtain a new dataframe (Fig. 5).
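A minimal sketch of such a merge, assuming both tables share a postal code column. The column name `PostalCode` and the sample rows (including the rounded coordinates) are illustrative only.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped neighborhood table
neighborhoods = pd.DataFrame({
    "PostalCode": ["M4E", "M5A", "M6G"],
    "Borough": ["East Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["The Beaches", "Regent Park", "Christie"],
})
# Hypothetical sample rows standing in for the coordinates file
coordinates = pd.DataFrame({
    "PostalCode": ["M5A", "M4E", "M6G"],
    "Latitude": [43.654, 43.676, 43.669],
    "Longitude": [-79.361, -79.293, -79.423],
})

# Inner join on the shared postal code column; row order follows the left table
toronto = neighborhoods.merge(coordinates, on="PostalCode", how="inner")
print(toronto)
```

An inner join silently drops postal codes that appear in only one table, so it is worth comparing the row counts before and after the merge.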

Next, we group the new dataframe by borough, count the neighborhoods in each, and keep four borough groups: central (9 neighborhoods), downtown (18), east (5), and west (6).
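The grouping and counting could be sketched as follows. How the four groups were selected is not stated in the study; filtering on borough names that contain "Toronto" is an assumption made here for illustration, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the merged dataframe
toronto = pd.DataFrame({
    "Borough": ["Downtown Toronto", "Downtown Toronto", "East Toronto",
                "West Toronto", "Central Toronto", "Scarborough"],
    "Neighborhood": ["Regent Park", "Christie", "The Beaches",
                     "Parkdale", "Davisville", "Malvern"],
})

# Assumption: keep only boroughs whose name contains "Toronto"
core = toronto[toronto["Borough"].str.contains("Toronto")]

# Count the neighborhoods in each remaining borough
counts = core.groupby("Borough")["Neighborhood"].count()
print(counts)
```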

Then we get venue data from Foursquare. With our credentials we can call the Foursquare API and retrieve location data from its database to search for nearby venues of a specific type, explore a particular venue, or search for trending venues around a location. To make a request, we need to specify our Client ID and Client Secret in the request URL (Fig. 6).
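As a rough illustration, a request URL with the credentials in the query string can be assembled as below. This sketch assumes the legacy Foursquare v2 `venues/explore` endpoint. The helper name `build_explore_url`, the placeholder credentials, and the default version date, radius, and limit are assumptions, not values from the study.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Build a venues/explore URL (hypothetical helper, assumed defaults)."""
    params = {
        "client_id": client_id,          # credentials go in the query string
        "client_secret": client_secret,
        "v": version,                    # API version date, assumed value
        "ll": f"{lat},{lng}",            # latitude,longitude of the neighborhood
        "radius": radius,                # search radius in meters, assumed value
        "limit": limit,                  # maximum number of venues, assumed value
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 43.654, -79.361)
print(url)
```

The sketch only builds the URL. Sending it (for example with the `requests` library) and parsing the JSON response would be a separate step.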

Then, we get the venues for all neighborhoods in our dataset and count the number of venues per neighborhood. Next, we apply one-hot encoding, a common technique for working with categorical features: most machine learning (ML) algorithms cannot work with categorical data directly, so the venue categories must be converted to numbers (Fig. 7).
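One way to one-hot encode venue categories is `pandas.get_dummies`, sketched here on a few made-up rows. The column name "Venue Category" and the sample venues are assumptions.

```python
import pandas as pd

# Hypothetical venue rows, one per venue returned for a neighborhood
venues = pd.DataFrame({
    "Neighborhood": ["Regent Park", "Regent Park", "Christie", "Christie"],
    "Venue Category": ["Coffee Shop", "Mexican Restaurant", "Park", "Coffee Shop"],
})

# One column per category, 1 where the venue belongs to that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood name back as the first column
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```

Each row still represents a single venue, so exactly one category column per row is set to 1.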

We group the rows by neighborhood, take the mean of the frequency of occurrence of each category, and create a new dataframe that keeps only Mexican restaurants. We then run k-means to cluster the neighborhoods in Toronto into 3 clusters (Fig. 8).
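The grouping and clustering steps can be sketched with scikit-learn's `KMeans`. The sample data, the neighborhood names, and the `n_init` and `random_state` settings are assumptions chosen for illustration. Only the choice of 3 clusters comes from the study.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical one-hot rows, one per venue
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Mexican Restaurant": [1, 1, 0, 0, 1, 0, 0, 0],
    "Coffee Shop": [0, 0, 1, 1, 0, 1, 1, 1],
})

# Mean of each one-hot column = frequency of that category per neighborhood
grouped = onehot.groupby("Neighborhood").mean().reset_index()

# Keep only the Mexican restaurant frequency
mexican = grouped[["Neighborhood", "Mexican Restaurant"]].copy()

# Cluster the neighborhoods on that single feature into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
mexican["Cluster Labels"] = kmeans.fit_predict(mexican[["Mexican Restaurant"]])
print(mexican)
```

The numeric labels that k-means assigns are arbitrary, so clusters should be interpreted by inspecting their members rather than by their label numbers.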

We create a new dataframe that includes the cluster label as well as the top venues for each neighborhood, merge it with the Toronto dataframe to add latitude and longitude for each neighborhood, and finally sort the results by Cluster Labels to visualize and analyze the clusters.
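A simplified sketch of this final assembly is shown below. It keeps only the single most frequent venue category per neighborhood instead of a longer list of top venues, and the frequencies, cluster labels, and coordinates are all made up for illustration.

```python
import pandas as pd

# Hypothetical category frequencies per neighborhood
grouped = pd.DataFrame({
    "Neighborhood": ["Regent Park", "Christie", "Parkdale"],
    "Coffee Shop": [0.5, 0.2, 0.1],
    "Mexican Restaurant": [0.1, 0.0, 0.6],
    "Park": [0.4, 0.8, 0.3],
})
labels = [0, 2, 1]  # hypothetical cluster labels, one per row of `grouped`

# Cluster label plus the most frequent category for each neighborhood
summary = pd.DataFrame({
    "Neighborhood": grouped["Neighborhood"],
    "Cluster Labels": labels,
    "Top Venue": grouped.set_index("Neighborhood").idxmax(axis=1).values,
})

# Hypothetical coordinates for the same neighborhoods
coords = pd.DataFrame({
    "Neighborhood": ["Christie", "Parkdale", "Regent Park"],
    "Latitude": [43.669, 43.640, 43.654],
    "Longitude": [-79.423, -79.437, -79.361],
})

# Add coordinates, then sort by cluster label for inspection
result = (coords.merge(summary, on="Neighborhood")
                .sort_values("Cluster Labels")
                .reset_index(drop=True))
print(result)
```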

Results

The initial dataframe has 11 boroughs and 103 neighborhoods of Toronto, Canada. After cleaning and regrouping the data, we reduce our geospatial data to 38 neighborhoods in 4 boroughs.

These neighborhoods contain 1698 venues in 233 unique categories. After one-hot encoding and regrouping the data, we assign the 1698 venues to the 38 neighborhoods. Filtering the dataframe for Mexican restaurants only, we identified 13 Mexican restaurants, grouped into 3 clusters in the city of Toronto (Fig. 9).

The results from k-means clustering show that we can categorize Toronto neighborhoods into 3 clusters based on how many Mexican restaurants are in each neighborhood. These are the insights:

Cluster 0 (Red): Neighborhoods with 879 venues and few Mexican restaurants.
Cluster 1 (Purple): Neighborhoods with 660 venues and no Mexican restaurants.
Cluster 2 (Green): Neighborhoods with 159 venues and a high number of Mexican restaurants.
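As a quick consistency check, the per-cluster venue counts listed above should add up to the 1698 venues reported for the whole study. The short sketch below verifies that and prints each cluster's share of the venues. The variable names are illustrative.

```python
# Venue totals per cluster as reported in the results
venues_per_cluster = {
    "Cluster 0 (Red)": 879,
    "Cluster 1 (Purple)": 660,
    "Cluster 2 (Green)": 159,
}

# Total across clusters, expected to match the 1698 venues in the study
total = sum(venues_per_cluster.values())

# Share of all venues that falls in each cluster, in percent
shares = {name: round(100 * count / total, 1)
          for name, count in venues_per_cluster.items()}

print("Total venues:", total)
for name, share in shares.items():
    print(f"{name}: {share}% of venues")
```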

Discussion

The red and green clusters are concentrated in downtown Toronto. The purple cluster has no competition, but its neighborhoods appear to be poorly located for this kind of business, so at first the purple spots are not recommended as places to open a Mexican restaurant.

The green cluster is another area to avoid, since it is heavily fragmented by competition in general, both from Mexican restaurants and from restaurants serving other types of food.

Finally, it seems more advisable to start a more in-depth exploration of the red cluster, since it is a good, very busy commercial area with little competition from other Mexican restaurants.

Conclusion

We recommend a second stage of data analysis that includes socioeconomic data for the red cluster, a machine learning model that predicts trends in the area (hot spots), and real estate cost data to support a complementary cost-benefit analysis.
