After gathering and cleaning data on rental bikes in Cologne, we already looked at map visualizations. Here, we will dive into the data more deeply. Using descriptive statistics we will try to find interesting patterns. For this, we will use the pandas library. It is great for manipulating and analyzing data. Moreover, we will use matplotlib to visualize some of our findings.
import numpy as np import pandas as pd import matplotlib.pyplot as plt from geopy.distance import vincenty # define style of plots %matplotlib inline plt.style.use('ggplot') plt.rcParams["figure.figsize"] = (8,6)
After setting up the environment, we read in the data. Following, we take a first glance at the structure:
BIKES_COMBINED = "../data/bikedata_10min-int-observ_Feb-March.csv" bikes = pd.read_csv(BIKES_COMBINED) bikes.scrape_datetime = pd.to_datetime(bikes.scrape_datetime) bikes.head(8)
|0||22460||Sun||1981722||50.933528||6.995009||BIKE 22460||2017-02-05 00:00:02|
|1||22134||Sun||2118881||50.896996||6.980310||BIKE 22134||2017-02-05 00:00:02|
|2||21911||Sun||2361566||50.967579||6.936491||BIKE 21911||2017-02-05 00:00:02|
|3||21522||Sun||2380019||50.975012||7.007270||BIKE 21522||2017-02-05 00:00:02|
|4||21914||Sun||2380653||50.970199||6.900565||BIKE 21914||2017-02-05 00:00:02|
|5||21296||Sun||2384241||50.965448||6.919332||BIKE 21296||2017-02-05 00:00:02|
|6||22165||Sun||2384381||50.965436||6.919535||BIKE 22165||2017-02-05 00:00:02|
|7||21103||Sun||2388611||50.948214||6.989416||BIKE 21103||2017-02-05 00:00:02|
We have one observation for all bike locations (lat and lon) every ten minutes. The bikes can be identified by their
bike_id is missing at one observation point it means that the bike is currently not available. This is the case when a bike is in use. Alternatively, the bike (or the GPS) could be broken or under maintenance.
If between two observation points the coordinates have changed, the bike has been used. Accordingly, we can create a dummy variable to indicate this. We do this by grouping the data by
bike_id and applying a function to every row in the
lat column. The function checks if the difference between the current and previous row (i.e. the current and previous observation) is greater than zero:
# Compute dummy (0/1) for location changes: did coordinates change between two observations? bikes['change'] = bikes.groupby('bike_id')['lat'].apply(lambda x: abs(x.diff()) > 0).astype(int)
Now, that we know a bike was used let's find out what distance it traveled. Because we only have the start and end point of the journey this will be a rough estimate. We can expect the actual distance to be significantly longer. To achieve this, first we create a new column. It contains the coordinates shifted by one observation period for each bike. Then, we create the index
m that includes all rows in our data where the bike has been used. Using this, we can compute the distance between the actual coordinates and the shifted coordinates. This is done by applying the
vincenty function from the
geopy package to each row in
# Compute Trip Distance when coords changed bikes['coords'] = bikes[['lat','lon']].values.tolist() bikes['coords_shifted'] = bikes.groupby('bike_id')['coords'].shift() m = bikes['coords_shifted'].notnull() & bikes['change'] == 1 bikes.loc[m, 'trip_distance'] = bikes.loc[m, :].apply( lambda x: vincenty(x['coords'], x['coords_shifted']).kilometers, axis=1)
As a next step, we can start to look at some descriptives for the usage and distance variables we've just defined. Let's take a look at the usage first:
# Compute some descriptives statistics on bike usage length_observ_period = bikes.scrape_datetime.dt.dayofyear.max() - bikes.scrape_datetime.dt.dayofyear.min() num_bikes = len(bikes.groupby('bike_id')) num_observ_per_bike = bikes.groupby(['bike_id']).size() usage_mean_daily = np.mean(bikes.groupby(bikes.scrape_datetime.dt.dayofyear).change.sum() / num_bikes) usage_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).change.sum() / length_observ_period usage_by_hour_day_per_week = bikes.groupby( [bikes.scrape_datetime.dt.hour,'scrape_weekday' ]).change.sum() / length_observ_period usage_by_weekday = bikes.groupby(['scrape_weekday']).change.sum() / length_observ_period * 7
print("Days in dataset: ", length_observ_period) print("Number of bikes: ", num_bikes) print("Mean usage per day: ", usage_mean_daily) # print("Number of observations per bike: ", num_observ_per_bike ) print("Average usage by time: ", usage_by_hour) print("Average usage by time and weekday: ", usage_by_hour_day_per_week) print("Average usage by weekday: ", usage_by_weekday)
Days in dataset: 40 Number of bikes: 1230 Mean usage per day: 1.7067023597065238 Average usage by time: scrape_datetime 0 42.725 1 32.025 2 26.425 3 19.275 4 16.325 5 11.800 6 25.100 7 56.500 8 109.200 9 115.400 10 91.525 11 88.675 12 112.075 13 135.650 14 135.825 15 138.325 16 159.625 17 184.250 18 185.200 19 147.025 20 113.750 21 78.925 22 69.525 23 56.575 Name: change, dtype: float64 Average usage by time and weekday: scrape_datetime scrape_weekday 0 Fri 7.125 Mon 5.150 Sat 7.400 Sun 6.950 Thu 5.550 Tue 4.875 Wed 5.675 1 Fri 4.100 Mon 2.725 Sat 7.000 Sun 7.375 Thu 3.725 Tue 3.100 Wed 4.000 2 Fri 3.600 Mon 1.900 Sat 6.475 Sun 7.575 Thu 2.950 Tue 1.925 Wed 2.000 3 Fri 2.100 Mon 1.325 Sat 5.575 Sun 5.850 Thu 1.775 Tue 1.325 Wed 1.325 4 Fri 2.025 Mon 1.500 ... 19 Tue 24.825 Wed 20.475 20 Fri 19.050 Mon 17.125 Sat 9.500 Sun 10.800 Thu 20.600 Tue 20.000 Wed 16.675 21 Fri 14.200 Mon 11.950 Sat 8.275 Sun 7.800 Thu 14.475 Tue 10.550 Wed 11.675 22 Fri 12.175 Mon 10.075 Sat 6.550 Sun 6.800 Thu 12.250 Tue 11.975 Wed 9.700 23 Fri 9.275 Mon 7.325 Sat 5.275 Sun 5.725 Thu 10.425 Tue 9.600 Wed 8.950 Name: change, Length: 168, dtype: float64 Average usage by weekday: scrape_weekday Fri 2514.575 Mon 2264.675 Sat 1582.175 Sun 1518.125 Thu 2545.900 Tue 2403.450 Wed 2233.175 Name: change, dtype: float64
During our observation period of 40 days we have 1230 different bikes on the streets. On average a bike is used 1.7 times a day. Hence, there are about 2100 trips made each day. The usage by hour, weekday and hour and weekday combined hints at some interesting patterns. This becomes more clear when we visualize these figures.
Taking advantage of the possibility to directly plot pandas DataFrames and Series with
pyplot we look at some graphs:
ax = num_observ_per_bike.hist(bins=int(num_bikes/20)) ax.set_title("Total observations") ax.set_xlabel("number of observations") ax.set_ylabel("number of bikes")
Text(0,0.5,'number of bikes')
We see that there are about 50 bikes which have almost no observations in our data. Also a few others have less than 3000. Above 3000 there is a steady increase. Probably, bikes with very few observations have broken down at some point at the beginning. However, bikes that have been used a lot will also have fewer observations. Consequently, there are quite a few bikes that seem to be used very rarely.
Next, we look at patterns between weekdays:
weekday_index = ["Mon","Tue","Wed","Thu","Fri","Sat","Sun"] ax = usage_by_weekday.reindex(weekday_index).plot() ax.set_title("Usage of bikes by weekday") ax.set_xlabel("weekday") ax.set_ylabel("bikes used") ax.set_xticks(range(len(weekday_index))) ax.set_xticklabels(weekday_index)
[Text(0,0,'Mon'), Text(0,0,'Tue'), Text(0,0,'Wed'), Text(0,0,'Thu'), Text(0,0,'Fri'), Text(0,0,'Sat'), Text(0,0,'Sun')]
Thursdays and Fridays are the most popular rental days. More than 2500 trips are made on average on these days. On the weekends there is a massive drop. Usage is almost 40% lower compared to the weekdays. This indicates that commuting between home and work (or school / university) is the dominant use case for rental bikes.
Let's follow up by looking at how rentals change by time of the day:
ax = usage_by_hour.plot() plt.xticks(range(0,24)[::2]) ax.set_title("Usage of bikes by time") ax.set_xlabel("time") ax.set_ylabel("bikes used")
The graph aboves shows the usage of bikes by hour of the day for an average day. We see that the usage takes off at about 6 a.m. in the morning. It climbs to its first peak at around 9 and then drops somewhat. This corresponds well with people going to work. Then, at about 11 a.m. usage starts to increase steadily. The maximum is reached at about 6 p.m. Again, this fits the fact that people use the bikes to return back home after work. Thereafter, there is an even decline until the minimum at around 5 a.m.
Let's see if there is a difference in this graph between weekdays and weekends:
# Compute Usage by time for weekdays vs weekends usage_by_hour_wd = bikes[bikes.scrape_datetime.dt.weekday <= 5].groupby( bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 5/7) usage_by_hour_we = bikes[bikes.scrape_datetime.dt.weekday > 5].groupby( bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 2/7)
plt.subplot(2, 1, 1) ax = usage_by_hour_wd.plot() plt.xticks(range(0,24)[::2]) ax.set_title("Usage of bikes by time\nWeekdays") ax.set_xlabel("time") ax.set_ylabel("bikes used") plt.subplot(2, 1, 2) ax = usage_by_hour_we.plot() plt.xticks(range(0,24)[::2]) ax.set_title("Weekends") ax.set_xlabel("time") ax.set_ylabel("bikes used") plt.tight_layout()
A few interesting points can be taken from this comparison. On the weekends, bike rentals in the morning begin to increase two hours later. Also, instead of the two peaks during weekdays there is only one peak. This speaks for the fact that commuting to work does not play a major role on weekends. Moreover, there is an increase of night time rentals between 11 p.m. and 2 a.m. This is consistent with a leisure oriented use of the bikes.
Moving on, we investigate the traveled distance by first looking at some descriptives:
# Compute descriptive stats on distance distance_total = bikes.trip_distance.sum() distance_by_bike = bikes.trip_distance.sum() / num_bikes distance_by_day = distance_total / length_observ_period distance_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).trip_distance.sum() / length_observ_period distance_by_weekday = bikes.groupby(['scrape_weekday']).trip_distance.sum() / length_observ_period * 7
print("Total distance: ", distance_total) print("Total Distance per bike: ", distance_by_bike) print("Total distance per day: ", distance_by_day) print("Average distance per hour: ", distance_by_hour) print("Average distance per weekday: ", distance_by_weekday)
Total distance: 120799.78858713029 Total Distance per bike: 98.21121023343926 Total distance per day: 3019.994714678257 Average distance per hour: scrape_datetime 0 63.158457 1 47.041078 2 44.051801 3 30.702031 4 28.278160 5 15.714868 6 34.531916 7 80.342073 8 158.829917 9 178.581351 10 128.936648 11 120.809741 12 137.800258 13 167.985125 14 177.385833 15 194.355843 16 225.010710 17 270.925226 18 270.701171 19 202.776271 20 154.563156 21 110.757030 22 94.588138 23 82.167915 Name: trip_distance, dtype: float64 Average distance per weekday: scrape_weekday Fri 3531.804911 Mon 3063.002726 Sat 2195.461015 Sun 2226.991564 Thu 3654.011372 Tue 3331.189834 Wed 3137.501581 Name: trip_distance, dtype: float64
Here, we look at the traveled distance during the 40 day period at hand. All bikes taken together have been rode for more than 120.799km. Traveled by car, this would sum up to about 15 tons of CO² emitted. On average, each bike was used for a total of about 100km or 2,5km a day. In general, there are no big surprises here. The overall picture is equivalent to what we've seen before.
What might be interesting, though, is calculating the distance per trip. So let's give it a shot:
# plt.figure(figsize=(8,10)) plt.subplot(2, 1, 1) ax = usage_by_hour.plot() plt.xticks(range(0,24)[::2]) ax.set_title("Usage of bikes by time") ax.set_xlabel("time") ax.set_ylabel("bikes used") distance_per_trip_by_hour = distance_by_hour / usage_by_hour plt.subplot(2, 1, 2) ax = distance_per_trip_by_hour.plot() plt.xticks(range(0,24)[::2]) ax.set_title("Average trip distance by time") ax.set_xlabel("time") ax.set_ylabel("av trip distance") plt.tight_layout()
Above, we compare the average trip distance with the average number of rentals by time. A striking finding is the increase in trip distance observed between 22 p.m. and 4 a.m. While not many people rent bikes during this timespan those who do tend to ride for above average distances.
This concludes the first statistical analysis of the data. But there is so much more that we can look at. Consequently, in a follow up post I'll start applying machine learning to the data. Especially, in combination with supplemental data sources there are much more insights to be gained!