I analyzed which taxi pickup locations in New York City yielded the highest tip amounts from passengers during the morning rush hour (7 am to 9 am) in January 2016. The resulting map visualization can help taxi drivers decide where to pick up passengers during those hours to maximize their earnings.
Data
I used the New York City TLC (Taxi and Limousine Commission) Trip Record Data from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. As the dataset for each month is large (millions of rows, gigabytes in size), I only downloaded and analyzed the yellow taxi data for January 2016, available at https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv (1.59 GB).
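Because the file is large, it can help to load only the columns the analysis needs. The following is a minimal sketch, not part of the original script: it writes a tiny stand-in CSV (the rows and the reduced set of columns are made up for illustration) and uses the `colClasses` argument of `read.csv`, where `"NULL"` skips a column and `NA` lets R infer its type.

```r
# Columns used later in the analysis
keep <- c("tpep_pickup_datetime", "pickup_longitude", "pickup_latitude", "tip_amount")

# Tiny stand-in for the real file (made-up rows, illustration only)
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "VendorID,tpep_pickup_datetime,pickup_longitude,pickup_latitude,fare_amount,tip_amount",
  "1,2016-01-01 07:15:00,-73.99,40.75,12.5,2.5",
  "2,2016-01-01 12:30:00,-73.78,40.64,52.0,10.0"
), sample_path)

# Read only the header row to learn the column names
header <- names(read.csv(sample_path, nrows = 1))

# "NULL" tells read.csv to skip a column; NA means "detect the type"
col_classes <- ifelse(header %in% keep, NA, "NULL")

df_small <- read.csv(sample_path, colClasses = col_classes, stringsAsFactors = FALSE)
print(names(df_small))
```

Pointing `sample_path` at the downloaded file should apply the same idea to the full dataset, at the cost of one extra pass to read the header.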
R Script
# Author: KS Lee (aisklogy)
# Date: 11 Mar 2017
# Objective: Figure out where in New York City to pick up passengers who will give the most tips during morning rush hour

# Taxi trip records by NYC Taxi and Limousine Commission (TLC)
# Website: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
# Direct link to dataset: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv
# What the columns in this dataset mean: http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

# ====================================
# load libraries
# ====================================

# library that provides the hour() function to easily extract the hour of a given character timestamp
install.packages("lubridate")
library(lubridate)

# library that provides map plotting functionalities
install.packages("ggmap")
library(ggmap)

# library for winsorize function
install.packages("robustHD")
library(robustHD)

# ====================================
# load dataset
# ====================================

# load CSV file
setwd("/path/to/data")
df = read.csv("yellow_tripdata_2016-01.csv")
print(paste("There are", dim(df)[1], "rows in the dataset")) # 10,906,858

# ====================================
# prep data
# ====================================

# extract rows with pickup time that is within morning rush hour (my definition: 7-9am)
df_rushhour = df[hour(df$tpep_pickup_datetime) >= 7 & hour(df$tpep_pickup_datetime) < 9,
                 c("tpep_pickup_datetime", "pickup_longitude", "pickup_latitude", "tip_amount")]

# remove empty longitude/latitudes which are just invalid entries
df_rushhour = df_rushhour[df_rushhour$pickup_latitude!=0 & df_rushhour$pickup_longitude!=0,]

# remove outliers from longitude and latitude
outliers = 100
outliers_perc = outliers / dim(df_rushhour)[1]
lat_quantile = quantile(df_rushhour$pickup_latitude, probs=c(outliers_perc, 1-outliers_perc))
long_quantile = quantile(df_rushhour$pickup_longitude, probs=c(outliers_perc, 1-outliers_perc))
df_rushhour = df_rushhour[df_rushhour$pickup_latitude >= lat_quantile[1] &
                          df_rushhour$pickup_latitude <= lat_quantile[2] &
                          df_rushhour$pickup_longitude >= long_quantile[1] &
                          df_rushhour$pickup_longitude <= long_quantile[2],]

# remove negative tip amounts which do not make sense
df_rushhour = df_rushhour[df_rushhour$tip_amount>=0,]

# clamp extreme tip_amount values, just so that it is easier to visualize on map
df_rushhour$tip_amount_winsorized = winsorize(df_rushhour$tip_amount, const=10)

# ====================================
# visualize data
# ====================================

# plot the entire lat/long range from our dataset
lon.range = range(df_rushhour$pickup_longitude)
lat.range = range(df_rushhour$pickup_latitude)
pad = 0.05
map = get_map(location=c(lon.range[1]-pad,lat.range[1]-pad,lon.range[2]+pad,lat.range[2]+pad),
              source="google", maptype="roadmap")
tip_norm = df_rushhour$tip_amount_winsorized/max(df_rushhour$tip_amount_winsorized)
ggmap(map, fullpage=TRUE) +
  geom_point(aes(x=pickup_longitude, y=pickup_latitude), data=df_rushhour,
             colour="red", alpha=tip_norm*0.003, size=tip_norm*4)

# zoom into Manhattan area
latlong = geocode("Manhattan, New York")
map = get_map(location=c(lon=latlong[1,1], lat=latlong[1,2]-0.02), zoom=13,
              source="google", maptype="roadmap")
ggmap(map, fullpage=TRUE) +
  geom_point(aes(x=pickup_longitude, y=pickup_latitude), data=df_rushhour,
             colour="red", alpha=tip_norm*0.007, size=tip_norm*7)
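The coordinate-trimming step in the script (drop roughly the 100 most extreme values at each end of the latitude and longitude ranges) can be tried in isolation. The sketch below uses base R only and synthetic coordinates with a few deliberately wrong values injected. The helper name `in_central_range` and all the numbers are assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)
n <- 10000

# Synthetic pickups scattered around a made-up city centre
pts <- data.frame(
  pickup_longitude = rnorm(n, mean = -73.98, sd = 0.03),
  pickup_latitude  = rnorm(n, mean = 40.75, sd = 0.03)
)

# Inject a few obviously wrong coordinates
pts$pickup_longitude[1:5] <- c(-120, -80, -60, 10, 100)
pts$pickup_latitude[6:10] <- c(-30, 5, 60, 80, 89)

# Hypothetical helper: TRUE for values inside the central quantile range,
# i.e. after cutting about n_extreme values from each tail
in_central_range <- function(x, n_extreme) {
  p <- n_extreme / length(x)
  q <- unname(quantile(x, probs = c(p, 1 - p)))
  x >= q[1] & x <= q[2]
}

keep_rows <- in_central_range(pts$pickup_longitude, 10) &
  in_central_range(pts$pickup_latitude, 10)
trimmed <- pts[keep_rows, ]

print(nrow(pts) - nrow(trimmed))        # number of rows dropped
print(range(trimmed$pickup_longitude))  # range after trimming
```

Cutting a fixed count per tail, rather than a fixed percentage, keeps the number of discarded rows small no matter how many trips fall in the rush-hour window.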
Results
The red areas are concentrated in three places: JFK Airport, LaGuardia Airport, and the part of Manhattan directly southwest of Central Park.
I plotted a second map that zooms into Manhattan to see exactly where the hotspots are: Times Square, Rockefeller Center, the area around the Museum of Modern Art, and so on. These hotspots make sense, since they are among the most popular areas in New York.
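The maps show hotspots visually; a numeric cross-check is to bin pickups into grid cells and rank the cells by tips collected. The following is a minimal sketch of that idea, not the method used above. It runs on synthetic data with two made-up clusters, and the 0.01-degree cell size and all column names other than those from the dataset are assumptions.

```r
set.seed(1)

# Synthetic trips: two made-up clusters with different typical tips
trips <- data.frame(
  pickup_longitude = c(rnorm(500, -73.990, 0.002), rnorm(500, -73.780, 0.002)),
  pickup_latitude  = c(rnorm(500, 40.760, 0.002), rnorm(500, 40.640, 0.002)),
  tip_amount       = c(rexp(500, rate = 1 / 2), rexp(500, rate = 1 / 6))
)

# Assign each pickup to a grid cell (cell size in degrees is an assumption)
cell_size <- 0.01
trips$cell_x <- round(trips$pickup_longitude / cell_size)
trips$cell_y <- round(trips$pickup_latitude / cell_size)

# Total tips and number of pickups per cell
total <- aggregate(tip_amount ~ cell_x + cell_y, data = trips, FUN = sum)
names(total)[3] <- "total_tip"
count <- aggregate(tip_amount ~ cell_x + cell_y, data = trips, FUN = length)
names(count)[3] <- "pickups"

by_cell <- merge(total, count, by = c("cell_x", "cell_y"))
by_cell$mean_tip   <- by_cell$total_tip / by_cell$pickups
by_cell$center_lon <- by_cell$cell_x * cell_size
by_cell$center_lat <- by_cell$cell_y * cell_size

# Rank cells by total tips, highest first
by_cell <- by_cell[order(-by_cell$total_tip), ]
print(head(by_cell, 5))
```

Ranking by `total_tip` mirrors what the map shows (many overlapping points or large tips both make a region redder), while `mean_tip` separates busy areas from generous ones.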