NYC Taxi Fare Prediction
Machine learning applied to taxi fare prediction in New York City. The data comes from the Kaggle challenge: New York City Taxi Fare Prediction.
Find all the relevant code and Jupyter notebooks in the GitHub repository.
Main experiments and results
See the GitHub repo for the complete analysis and methodology. This section reports the main neural networks that were employed and the final results.
The networks are simple fully connected ones with ReLU non-linearities. The baseline is a degenerate network, just a single linear layer, with access to the following features:
- Pickup coordinates
- Dropoff coordinates
- Passenger count
- Pickup datetime (month, week, hour)
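A single linear layer over these features is equivalent to linear regression. A minimal sketch, assuming an 8-feature layout (pickup lat/lon, dropoff lat/lon, passenger count, month, week, hour; the exact ordering is an assumption, not taken from the repo):

```python
import numpy as np

# Baseline sketch: one weight per feature plus a bias, i.e. linear regression.
rng = np.random.default_rng(0)
w = rng.normal(size=8)  # one weight per input feature
b = 0.0                 # bias term

def predict_fare(x):
    """x: (n_rides, 8) feature matrix -> (n_rides,) fare estimates."""
    return x @ w + b

x = rng.normal(size=(4, 8))
print(predict_fare(x).shape)  # (4,)
```

In a real training setup the weights would of course be fitted by minimizing a regression loss rather than drawn at random.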
The main non-linear experiments use two similar networks with the same shape (fully connected, ReLU non-linearities).
More features are provided in this case, some of them engineered based on the analysis (see analysis.ipynb):
- Pickup coordinates
- Dropoff coordinates
- Passenger count
- Pickup datetime (year, month, day of week, hour)
- Boolean: is the pickup datetime after September 2012? (see the analysis for the reason)
- Boolean: is the pickup datetime during the weekend?
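The datetime-derived features above can be computed directly from the pickup timestamp. A sketch with pandas, where the column names and the September 2012 cutoff date are assumptions based on the feature list (the analysis notebook motivates the cutoff):

```python
import pandas as pd

# Two example rides; column name follows the Kaggle dataset schema.
df = pd.DataFrame({"pickup_datetime": pd.to_datetime(
    ["2011-05-02 08:30:00", "2013-07-06 23:10:00"])})

dt = df["pickup_datetime"].dt
df["year"] = dt.year
df["month"] = dt.month
df["day_of_week"] = dt.dayofweek          # Monday = 0 ... Sunday = 6
df["hour"] = dt.hour
# Boolean flags from the analysis: rides after September 2012 and weekend rides.
df["after_sep_2012"] = df["pickup_datetime"] >= pd.Timestamp("2012-09-01")
df["is_weekend"] = dt.dayofweek >= 5      # Saturday or Sunday

print(df[["year", "hour", "after_sep_2012", "is_weekend"]])
```

Cyclic encodings (e.g. sin/cos of the hour) are a common alternative, but the plain integer features match the list above.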
The only difference between the two models is an extra feature: the approximate travel time of the ride. The travel time is estimated on the NYC road graph, obtained by running osmnx.ipynb. However, this graph covers only the urban area of the city, and since some rides in the dataset fall outside it, the feature could not be computed for them. For this reason, two models are trained and combined: one with access to the travel-time feature, which predicts all urban rides, and one without it, which predicts all suburban rides.
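At prediction time the two models can be combined by routing each ride on whether a travel-time estimate exists. A minimal sketch, assuming travel time is `NaN` for rides outside the graph coverage (the function and model names here are illustrative, not from the repo):

```python
import numpy as np

def combined_predict(features, travel_time, urban_model, suburban_model):
    """Route each ride to the appropriate model.

    features:    (n, d) feature matrix shared by both models
    travel_time: (n,) estimated travel time, NaN outside the urban graph
    """
    urban_mask = ~np.isnan(travel_time)
    preds = np.empty(len(features))
    # Urban model sees the features plus the extra travel-time column.
    urban_input = np.column_stack([features[urban_mask],
                                   travel_time[urban_mask]])
    preds[urban_mask] = urban_model(urban_input)
    # Suburban model sees only the shared features.
    preds[~urban_mask] = suburban_model(features[~urban_mask])
    return preds

# Toy usage with stand-in "models": ride 2 has no travel-time estimate.
feats = np.array([[1.0], [2.0], [3.0]])
tt = np.array([5.0, np.nan, 7.0])
out = combined_predict(feats, tt,
                       urban_model=lambda x: x.sum(axis=1),
                       suburban_model=lambda x: x.sum(axis=1))
print(out)  # [ 6.  2. 10.]
```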
Results
For the train-time results and evaluations, see neural_models.ipynb. Below are the results of submitting our experiments to the Kaggle leaderboard of the competition:
Even though the competition has long ended, our best score would place us within the top 30% of the leaderboard. The best score (MLP2048 + MLP2048) refers to the double-network setup described above.
Credits
Work by Francesco Mistri and Michele Faedi.