Supervised Regression Problem

Use machine learning to predict the crowd flow of Taipei Metro — Random Forest

Use Random Forest to predict the crowd flow of Taipei MRT on any given day/time/station in the future.

Harry Cheng
Jul 10, 2021

Like most big cities in the world, Taipei has a very well-structured metro system called the MRT. People can get around the city with ease just by taking the MRT; no scooter or car is needed.

Hypothetically, the general crowd flow of the Taipei MRT should follow some specific patterns, making it ideal time-series data for prediction. In addition, the government publishes MRT operation data as Open Data. Looks like it’s a sign for me to build a forecasting model with it :)

In this article, I will walk you through my entire project and all of my code can be found here. First the outline:

Outline

  1. Problem definition
  2. Data preprocessing
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Modeling
  6. Next Steps

Without further ado, let’s get started!

Problem Definition

I define the problem as: “How do we predict the crowd flow (both people coming in and people going out) of the Taipei MRT on a given day/time/station in the future?”

e.g. How many people will be coming in/out of Taipei Main Station on August 2, 2021?

If this model is built successfully, it might be able to create significant business/social impact. If we look at it from the perspectives of different stakeholders:

  1. Uber/taxi companies: can build a more solid dynamic pricing model.
  2. The government: can build a more efficient transportation system by aligning bus and Ubike (Taiwan’s public shared bike) timetables.
  3. Merchants/shops around MRT stations: can craft better business/operation strategies based on the model’s results.

Data Preprocessing

Downloading data

The first step is to get the raw data, which can easily be found here.

In my project, I only used data from 2018 and 2019 because of the disruption COVID caused in 2021. Of course, we cannot rule out the possibility that similar disruptive events will happen in the future, and frankly, the MRT keeps changing year by year as new stations open. I will address these problems at the end of the article.

After downloading the data (One…Eternity…Later…), I read it into a Jupyter notebook. As you can see below, the data has 170 million rows and 5 columns. Glad that my laptop survived…

The 5 columns are: [Date], [Hour], [Come in station], [Come out station], and [Crowd flow].

What the data looks like
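If you want to follow along, here is a minimal loading sketch. The file pattern and column names below are my assumptions rather than the project’s actual code; specifying compact dtypes helps a 170-million-row frame fit in memory:

```python
import glob
import pandas as pd

# Hypothetical file pattern; adjust to wherever you saved the downloads.
files = sorted(glob.glob("data/mrt_od_*.csv"))

# Compact dtypes keep a ~170-million-row frame from exhausting RAM.
frames = [
    pd.read_csv(
        f,
        header=0,
        names=["Date", "Hour", "In", "Out", "CrowdFlow"],  # assumed column names
        dtype={"Hour": "int8", "CrowdFlow": "int32"},
    )
    for f in files
]
df = pd.concat(frames, ignore_index=True)
print(df.shape)
```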

Processing data

The first thing I do is split the problem into “predict the people coming in” and “predict the people coming out”, and focus on “coming in” first (dropping “coming out”). Aggregating the flow over the exit-station column immediately shrinks the data about 100-fold (roughly the number of stations) without repercussions, since the two sub-problems are symmetric: the same features and model should work for both “coming in” and “coming out”. This article will only cover the prediction of “coming in”; the code for “coming out” can be found on my GitHub (I used the same features and model, and the results were identical).
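Here is a minimal sketch of that aggregation, using the column names assumed above:

```python
# Focus on "coming in": sum the flow over all exit stations, so each row
# becomes (date, hour, entry station, total inflow).
df_in = (
    df.groupby(["Date", "Hour", "In"], as_index=False)["CrowdFlow"]
      .sum()
      .rename(columns={"In": "Station"})
)
```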

The data after processing looks a lot lighter than before.

Data after preprocessing

Exploratory Data Analysis

In this part, I conduct the EDA, drawing on my own knowledge of the Taipei MRT.

Relation between [Date] and [CrowdFlow]

Many traits can be found here. Take 2018 as an example:

  1. Every spike is a week, with weekdays being the top of the spike and weekends being the bottom.
  2. Different months demonstrate a different pattern — some months are busier than others.
  3. Special holidays also matter; an example is Chinese New Year (the lowest point in the graph).

→ Add new features: [DayofWeek], [Month], [Special] (holiday).

The relation between [Station] and [CrowdFlow]

Obviously, there will be busy stations and idle ones.

Chinese station names cannot be shown here
  1. BIG difference between busy and idle stations!

→ We should choose a model that is not easily influenced by outliers.

→ We will decide what to do with the [Station] feature after choosing a model.

Model Choice

Through EDA, I believe that Random Forest is the best model for this set of data.

Random Forest is simple but computation-heavy. It is simple because the mechanism behind it is relatively easy to understand compared to a neural network, which is often referred to as a “black box”. It is computation-heavy because it is an ensemble model consisting of many decision trees.

If you want to understand how Random Forest works, I suggest you watch this video about Decision Trees and this video about Random Forest. Nobody can put it better than Josh Starmer!

But why is Random Forest suitable for this particular dataset? Here are the reasons:

  1. Random Forest is not easily swayed by outliers. The model splits the data into nodes and leaves, and most outliers end up in small leaves.
  2. Random Forest handles non-linear relations well. The features in this model do not relate linearly to crowd flow.
  3. Extrapolation is not a problem with this dataset. One weakness of Random Forest is extrapolation: it cannot handle new data that falls outside the range of the old data. This is unlikely to happen with MRT data, since there will always be only 24 hours in a day and 12 months in a year. There could be new stations, which we will discuss later in this article.

Feature Engineering

Add [DayofWeek], [Month], and [Special] (holiday) features.

[DayofWeek] and [Month] are easy enough: we just extract them from [Date] using pandas’ datetime features. For [Special], I hardcoded the whole thing from the calendar: 0 represents a holiday, 1 a normal day, and 2 an extra workday (make-up day).
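A minimal sketch, assuming [Date] parses cleanly and using a couple of illustrative holiday entries (the real lookup covers the whole two-year calendar):

```python
df_in["Date"] = pd.to_datetime(df_in["Date"])
df_in["DayofWeek"] = df_in["Date"].dt.dayofweek  # Monday = 0 ... Sunday = 6
df_in["Month"] = df_in["Date"].dt.month

# Hardcoded from the calendar: 0 = holiday, 1 = normal day, 2 = extra workday.
special_days = {"2018-02-16": 0, "2018-03-31": 2}  # illustrative entries only
df_in["Special"] = (
    df_in["Date"].dt.strftime("%Y-%m-%d").map(special_days).fillna(1).astype("int8")
)
```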

Use [Daily Average Crowd Flow] to replace [Station]

After trying different ways to engineer [Station] (e.g. one-hot encoding), this turned out to be the best approach.

[Daily average crowd flow] is each station’s total crowd flow in the training data divided by the number of days in the training data. It works so well because it fully captures the difference in “hotness” between stations.

Before doing so, however, I have to split the data into training, validation, and test sets in chronological order, so that I don’t mistakenly use validation or test data to engineer the feature. I didn’t use the typical random train-test split because the MRT data is a time series: the validation and test data should lie in the future compared to the training data.

  1. 60% of the data is training data (all of 2018, plus January and February 2019).
  2. 20% is validation data (March to July 2019).
  3. 20% is test data (August 2019 through the end of the year).
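In code, the chronological split is just three date filters at the boundaries above:

```python
# Chronological split: train | validation | test.
train = df_in[df_in["Date"] < "2019-03-01"].copy()
valid = df_in[(df_in["Date"] >= "2019-03-01") & (df_in["Date"] < "2019-08-01")].copy()
test  = df_in[df_in["Date"] >= "2019-08-01"].copy()
```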

Next, we compute each station’s total crowd flow in the training data, turn it into a dictionary, and divide each sum by the total number of days (440).

Finally, convert [Station] column into [Daily Average Crowd Flow], and the data is all set!
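Here is a sketch of that encoding, computed from the training set only and then mapped onto all three splits:

```python
n_days = 440  # total number of days in the training data

# Per-station daily average, learned from the training set only.
station_avg = (train.groupby("Station")["CrowdFlow"].sum() / n_days).to_dict()

# Stations unseen in training would map to NaN here (the extrapolation caveat above).
for split in (train, valid, test):
    split["DailyAvgFlow"] = split["Station"].map(station_avg)

features = ["Hour", "DayofWeek", "Month", "Special", "DailyAvgFlow"]
```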

Modeling

Training Model

With the data all cleaned up, I fitted the model. Note that the hyperparameter choices here are the result of my own tuning, but a model with only default settings can also produce great results, which is a strength of Random Forest.
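The tuned hyperparameters aren’t reproduced here, so the values in this sketch are placeholders:

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameter values are illustrative placeholders, not the tuned ones.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(train[features], train["CrowdFlow"])
preds = model.predict(valid[features])
```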

Result

The model actually performs pretty well! Let’s use some metrics to evaluate it.

R Square

Represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.

— Investopedia

According to sklearn documentation: “The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.”

My model produced an R Square of 0.986, which means the chosen features explain nearly all of the variance in crowd flow.

RMSE (Root Mean Squared Error)

You can think of RMSE as the typical error per prediction. Note that the errors are squared before being averaged, which makes RMSE sensitive to outliers. Since the MRT data has many extreme outliers, the RMSE score is likely inflated.

My model has an RMSE score of 162, which seems high at first glance, but let’s compare it to the baseline.

Baseline: a model that predicts crowd flow based solely on the average crowd flow of the training data. Its RMSE score is 1306.

Wow, huge difference! Looks like our model is actually doing something.
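For reference, here is how these numbers can be computed with scikit-learn; the baseline simply predicts the training-set mean everywhere:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = valid["CrowdFlow"]
print("R^2:", r2_score(y_true, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_true, preds)))

# Baseline: always predict the training-set mean.
baseline = np.full(len(y_true), train["CrowdFlow"].mean())
print("Baseline RMSE:", np.sqrt(mean_squared_error(y_true, baseline)))
```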

Evaluate 20 random predictions

You can see from here that, out of 20 predictions, only 2 have errors above 162; most are below 50. RMSE is indeed inflated by outliers.

Visualization

Last but not least, I picked a random time in the future (validation set) and visualized the result. As you can see, the model predictions are very accurate.

The blue line is actual data, the red dots are predicted data. If the dot is on the line, it generally means the prediction is accurate.
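A sketch of this kind of plot; the specific date and station are placeholders (the real data stores station names in Chinese):

```python
import matplotlib.pyplot as plt

# Hypothetical choice: one station on one validation day.
mask = (valid["Date"] == "2019-05-15") & (valid["Station"] == "Taipei Main Station")
day = valid[mask].sort_values("Hour")

plt.plot(day["Hour"], day["CrowdFlow"], label="actual")  # blue line
plt.scatter(day["Hour"], model.predict(day[features]),
            color="red", label="predicted")              # red dots
plt.xlabel("Hour")
plt.ylabel("Crowd flow")
plt.legend()
plt.show()
```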

Next Steps

The practicality of the model

As good as the model might seem, that is partly because the Taipei MRT didn’t change much during 2018 and 2019. If we used data from 2021, the model would be way off, because it cannot predict the effect of COVID. The model would also break down when new stations open.

The best way to cope with these problems is to build a dynamic model that continuously takes in new data and adapts to change. Every day, the model should ingest the newest data and retrain. Although it might suffer some short-term adaptation pain right after a big change, with the correct adjustments it should adapt fairly quickly.

Model Optimization

Can we make the current model better? One way is hyperparameter tuning; another is switching to a different model. Due to hardware constraints, I didn’t do large-scale tuning, but I believe its effect would be limited, given how well Random Forest already performs out of the box.
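For anyone with the hardware, a randomized search is one way to explore the space; the grid below is illustrative, not something used in this project:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Illustrative search space only.
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_dist,
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=3),  # respects chronological order
    scoring="neg_root_mean_squared_error",
)
search.fit(train[features], train["CrowdFlow"])
print(search.best_params_)
```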

Model Deployment

After all the training and tuning, it is time to find an application for this model, customize it according to commercial need, and deploy it! (Uber plz sponsor me)

Conclusion

Although the model looks good now, it’s still a long way from deployment. After deployment, we will still need to create business strategies to make it valuable. Many things still need to be done.

Anyways, it’s been a fun journey :) Huge thanks to all my friends who gave me advice, and to the government for publishing such awesome data.

If you have any thoughts on the project, feel free to discuss them with me below!

Give me some claps and follow if you like this! You can also connect with me on LinkedIn or FB. See you next time!

