After gathering and cleaning data on rental bikes in Cologne, we already looked at map visualizations. Here, we will dive deeper into the data. Using descriptive statistics, we will try to find interesting patterns. For this we will use the pandas library, which is great for manipulating and analyzing data. Moreover, we will use matplotlib to visualize some of our findings.

You can find the data used in this article here. This post was written as a Jupyter notebook, which you can also find on GitHub.
Let's get started by setting up things:

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import vincenty
# define style of plots
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (8,6)


After setting up the environment, we read in the data and take a first glance at its structure:

In [2]:
BIKES_COMBINED = "../data/bikedata_10min-int-observ_Feb-March.csv"
bikes = pd.read_csv(BIKES_COMBINED)
bikes.scrape_datetime = pd.to_datetime(bikes.scrape_datetime)
bikes.head(8)

Out[2]:
bike_id scrape_weekday u_id lat lon bike_name scrape_datetime
0 22460 Sun 1981722 50.933528 6.995009 BIKE 22460 2017-02-05 00:00:02
1 22134 Sun 2118881 50.896996 6.980310 BIKE 22134 2017-02-05 00:00:02
2 21911 Sun 2361566 50.967579 6.936491 BIKE 21911 2017-02-05 00:00:02
3 21522 Sun 2380019 50.975012 7.007270 BIKE 21522 2017-02-05 00:00:02
4 21914 Sun 2380653 50.970199 6.900565 BIKE 21914 2017-02-05 00:00:02
5 21296 Sun 2384241 50.965448 6.919332 BIKE 21296 2017-02-05 00:00:02
6 22165 Sun 2384381 50.965436 6.919535 BIKE 22165 2017-02-05 00:00:02
7 21103 Sun 2388611 50.948214 6.989416 BIKE 21103 2017-02-05 00:00:02

We have one observation of every bike's location (lat and lon) every ten minutes. Bikes can be identified by their bike_id. If a bike_id is missing at an observation point, the bike is currently not available: either it is in use, or the bike (or its GPS) is broken or under maintenance.
If the coordinates have changed between two observation points, the bike has been used. Accordingly, we can create a dummy variable to indicate this. We do so by grouping the data by bike_id and applying a function to the lat column that checks whether the difference between the current and previous row (i.e. the current and previous observation) is greater than zero:

In [3]:
# Compute dummy (0/1) for location changes: did coordinates change between two observations?
bikes['change'] = bikes.groupby('bike_id')['lat'].apply(lambda x: abs(x.diff()) > 0).astype(int)
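To see what this dummy encodes, here is a minimal sketch on synthetic data (hypothetical values, not from the dataset), using the equivalent `groupby(...).diff()` form:

```python
import pandas as pd

# Toy data: one bike observed four times; its latitude changes
# between the second and third observation (hypothetical values).
toy = pd.DataFrame({
    'bike_id': [1, 1, 1, 1],
    'lat': [50.93, 50.93, 50.96, 50.96],
})

# diff() within each bike: NaN for the first row, then row-to-row change;
# any nonzero change flags a trip (NaN > 0 is False, so the first row is 0)
toy['change'] = toy.groupby('bike_id')['lat'].diff().abs().gt(0).astype(int)

print(toy['change'].tolist())  # [0, 0, 1, 0]
```

Only the third observation, where the coordinates moved, gets flagged.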


Now that we know when a bike was used, let's find out what distance it traveled. Because we only have the start and end point of each trip, this will be a rough estimate; the actual distance will typically be significantly longer. First, we create a new column containing the coordinates shifted by one observation period for each bike. Then we build a boolean mask m that selects all rows where the bike has been used. Using this mask, we compute the distance between the current coordinates and the shifted coordinates by applying the vincenty function from the geopy package to each selected row:

In [4]:
# Compute trip distance when coordinates changed
bikes['coords'] = bikes[['lat', 'lon']].values.tolist()
bikes['coords_shifted'] = bikes.groupby('bike_id')['coords'].shift()
# Parentheses matter here: & binds tighter than ==
m = bikes['coords_shifted'].notnull() & (bikes['change'] == 1)

bikes.loc[m, 'trip_distance'] = bikes.loc[m, :].apply(
    lambda x: vincenty(x['coords'], x['coords_shifted']).kilometers, axis=1)
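As a rough cross-check of these geodesic distances, a haversine formula (which treats the Earth as a sphere, a simplification of Vincenty's ellipsoidal method) can be written with the standard library alone; the two coordinates below are bike positions taken from the sample rows above:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) pairs a and b."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # mean Earth radius ~6371 km

# Two bike positions from the sample rows above
print(round(haversine_km((50.933528, 6.995009), (50.896996, 6.980310)), 2))
```

This yields roughly 4.2 km; at city scale, haversine and Vincenty distances differ by well under one percent.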


As a next step, we can start to look at some descriptives for the usage and distance variables we've just defined. Let's take a look at the usage first:

In [23]:
# Compute some descriptive statistics on bike usage
length_observ_period = bikes.scrape_datetime.dt.dayofyear.max() - bikes.scrape_datetime.dt.dayofyear.min()
num_bikes = len(bikes.groupby('bike_id'))
num_observ_per_bike = bikes.groupby(['bike_id']).size()
usage_mean_daily = np.mean(bikes.groupby(bikes.scrape_datetime.dt.dayofyear).change.sum() / num_bikes)
usage_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).change.sum() / length_observ_period
usage_by_hour_day_per_week = bikes.groupby(
    [bikes.scrape_datetime.dt.hour, 'scrape_weekday']).change.sum() / length_observ_period
usage_by_weekday = bikes.groupby(['scrape_weekday']).change.sum() / length_observ_period * 7
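The `dt.hour` accessor used above buckets timestamps by clock hour across all days; a minimal sketch on synthetic observations (hypothetical values):

```python
import pandas as pd

# Four synthetic observations: two in hour 8, two in hour 17
df = pd.DataFrame({
    'scrape_datetime': pd.to_datetime([
        '2017-02-05 08:00', '2017-02-06 08:10',
        '2017-02-05 17:00', '2017-02-06 17:10']),
    'change': [1, 0, 1, 1],
})

# Group by hour of day, regardless of date, and sum the usage dummy
by_hour = df.groupby(df.scrape_datetime.dt.hour).change.sum()
print(by_hour.to_dict())  # {8: 1, 17: 2}
```

Dividing such a sum by the number of observed days, as in the cell above, turns it into a per-day average.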

In [43]:
print("Days in dataset: ", length_observ_period)
print("Number of bikes: ", num_bikes)
print("Mean usage per day: ", usage_mean_daily)
# print("Number of observations per bike: ", num_observ_per_bike )
print("Average usage by time: ", usage_by_hour)
print("Average usage by time and weekday: ", usage_by_hour_day_per_week)
print("Average usage by weekday: ", usage_by_weekday)

Days in dataset:  40
Number of bikes:  1230
Mean usage per day:  1.7067023597065238
Average usage by time:  scrape_datetime
0      42.725
1      32.025
2      26.425
3      19.275
4      16.325
5      11.800
6      25.100
7      56.500
8     109.200
9     115.400
10     91.525
11     88.675
12    112.075
13    135.650
14    135.825
15    138.325
16    159.625
17    184.250
18    185.200
19    147.025
20    113.750
21     78.925
22     69.525
23     56.575
Name: change, dtype: float64
Average usage by time and weekday:  scrape_datetime  scrape_weekday
0                Fri                7.125
Mon                5.150
Sat                7.400
Sun                6.950
Thu                5.550
Tue                4.875
Wed                5.675
1                Fri                4.100
Mon                2.725
Sat                7.000
Sun                7.375
Thu                3.725
Tue                3.100
Wed                4.000
2                Fri                3.600
Mon                1.900
Sat                6.475
Sun                7.575
Thu                2.950
Tue                1.925
Wed                2.000
3                Fri                2.100
Mon                1.325
Sat                5.575
Sun                5.850
Thu                1.775
Tue                1.325
Wed                1.325
4                Fri                2.025
Mon                1.500
...
19               Tue               24.825
Wed               20.475
20               Fri               19.050
Mon               17.125
Sat                9.500
Sun               10.800
Thu               20.600
Tue               20.000
Wed               16.675
21               Fri               14.200
Mon               11.950
Sat                8.275
Sun                7.800
Thu               14.475
Tue               10.550
Wed               11.675
22               Fri               12.175
Mon               10.075
Sat                6.550
Sun                6.800
Thu               12.250
Tue               11.975
Wed                9.700
23               Fri                9.275
Mon                7.325
Sat                5.275
Sun                5.725
Thu               10.425
Tue                9.600
Wed                8.950
Name: change, Length: 168, dtype: float64
Average usage by weekday:  scrape_weekday
Fri    2514.575
Mon    2264.675
Sat    1582.175
Sun    1518.125
Thu    2545.900
Tue    2403.450
Wed    2233.175
Name: change, dtype: float64


During our observation period of 40 days, we have 1,230 different bikes on the streets. On average, a bike is used 1.7 times a day; hence, about 2,100 trips are made each day. The usage by hour, by weekday, and by hour and weekday combined hints at some interesting patterns. These become clearer when we visualize the figures.
Taking advantage of the ability to plot pandas DataFrames and Series directly with pyplot, we look at some graphs:

In [37]:
ax = num_observ_per_bike.hist(bins=int(num_bikes/20))
ax.set_title("Total observations")
ax.set_xlabel("number of observations")
ax.set_ylabel("number of bikes")

Out[37]:
Text(0,0.5,'number of bikes')

We see that there are about 50 bikes with almost no observations in our data, and a few others with less than 3,000. Above 3,000 there is a steady increase. Bikes with very few observations probably broke down early in the period. Note, however, that heavily used bikes also have fewer observations, since a bike in use does not appear in the data. Consequently, quite a few bikes seem to be used very rarely.
Next, we look at patterns between weekdays:

In [38]:
weekday_index = ["Mon","Tue","Wed","Thu","Fri","Sat","Sun"]
ax = usage_by_weekday.reindex(weekday_index).plot()
ax.set_title("Usage of bikes by weekday")
ax.set_xlabel("weekday")
ax.set_ylabel("bikes used")
ax.set_xticks(range(len(weekday_index)))
ax.set_xticklabels(weekday_index)

Out[38]:
[Text(0,0,'Mon'),
Text(0,0,'Tue'),
Text(0,0,'Wed'),
Text(0,0,'Thu'),
Text(0,0,'Fri'),
Text(0,0,'Sat'),
Text(0,0,'Sun')]

Thursdays and Fridays are the most popular rental days, with more than 2,500 trips made on average. On the weekends there is a massive drop: usage is almost 40% lower than on weekdays. This indicates that commuting between home and work (or school / university) is the dominant use case for rental bikes.
Let's follow up by looking at how rentals change by time of the day:

In [39]:
ax = usage_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")

Out[39]:
Text(0,0.5,'bikes used')

The graph above shows bike usage by hour for an average day. Usage takes off at about 6 a.m., climbs to a first peak at around 9, and then drops somewhat. This corresponds well with people commuting to work. At about 11 a.m. usage starts to increase steadily, reaching its maximum at about 6 p.m. Again, this fits people riding back home after work. Thereafter, there is a steady decline until the minimum at around 5 a.m.
Let's see if there is a difference in this graph between weekdays and weekends:

In [28]:
# Compute usage by time for weekdays vs. weekends (dt.weekday: Mon=0 ... Sun=6)
usage_by_hour_wd = bikes[bikes.scrape_datetime.dt.weekday < 5].groupby(
    bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 5/7)
usage_by_hour_we = bikes[bikes.scrape_datetime.dt.weekday >= 5].groupby(
    bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 2/7)
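A caveat when splitting by `dt.weekday`: pandas numbers days Monday = 0 through Sunday = 6, so Saturday and Sunday are the values 5 and 6. A quick check on a known week (the dataset starts on Sunday, 2017-02-05, so the following Monday is 2017-02-06):

```python
import pandas as pd

# Monday 2017-02-06 through Sunday 2017-02-12
days = pd.Series(pd.date_range('2017-02-06', periods=7, freq='D'))

print(days.dt.weekday.tolist())         # [0, 1, 2, 3, 4, 5, 6]
print((days.dt.weekday >= 5).tolist())  # weekend mask: only the last two are True
```

Getting this boundary wrong silently shifts Saturday into the weekday group and distorts both averages.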

In [40]:
plt.subplot(2, 1, 1)
ax = usage_by_hour_wd.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time\nWeekdays")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")

plt.subplot(2, 1, 2)
ax = usage_by_hour_we.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Weekends")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")

plt.tight_layout()


A few interesting points can be taken from this comparison. On the weekends, bike rentals in the morning begin to increase two hours later. Also, instead of the two peaks seen on weekdays, there is only one. This supports the idea that commuting to work plays no major role on weekends. Moreover, there is an increase in night-time rentals between 11 p.m. and 2 a.m., consistent with a more leisure-oriented use of the bikes.
Moving on, we investigate the traveled distance by first looking at some descriptives:

In [41]:
# Compute descriptive stats on distance
distance_total = bikes.trip_distance.sum()
distance_by_bike = bikes.trip_distance.sum() / num_bikes
distance_by_day = distance_total / length_observ_period
distance_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).trip_distance.sum() / length_observ_period
distance_by_weekday  = bikes.groupby(['scrape_weekday']).trip_distance.sum() / length_observ_period * 7

In [42]:
print("Total distance: ", distance_total)
print("Total Distance per bike: ", distance_by_bike)
print("Total distance per day: ", distance_by_day)
print("Average distance per hour: ", distance_by_hour)
print("Average distance per weekday: ", distance_by_weekday)

Total distance:  120799.78858713029
Total Distance per bike:  98.21121023343926
Total distance per day:  3019.994714678257
Average distance per hour:  scrape_datetime
0      63.158457
1      47.041078
2      44.051801
3      30.702031
4      28.278160
5      15.714868
6      34.531916
7      80.342073
8     158.829917
9     178.581351
10    128.936648
11    120.809741
12    137.800258
13    167.985125
14    177.385833
15    194.355843
16    225.010710
17    270.925226
18    270.701171
19    202.776271
20    154.563156
21    110.757030
22     94.588138
23     82.167915
Name: trip_distance, dtype: float64
Average distance per weekday:  scrape_weekday
Fri    3531.804911
Mon    3063.002726
Sat    2195.461015
Sun    2226.991564
Thu    3654.011372
Tue    3331.189834
Wed    3137.501581
Name: trip_distance, dtype: float64


Here, we look at the distance traveled during the 40-day period at hand. Taken together, the bikes have been ridden for more than 120,000 km. Traveled by car, this would amount to roughly 15 tonnes of CO₂ emitted. On average, each bike was used for a total of about 100 km, or 2.5 km a day. In general, there are no big surprises here; the overall picture matches what we've seen before.
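The CO₂ figure can be reproduced with back-of-the-envelope arithmetic, assuming an average car emission of about 120 g CO₂ per km (an assumed value for illustration, not part of the data):

```python
total_km = 120_800   # total distance from the data, rounded
g_per_km = 120       # assumed average car emission in g CO2 per km

tonnes = total_km * g_per_km / 1_000_000  # grams -> tonnes
print(round(tonnes, 1))  # 14.5, i.e. about 15 tonnes
```

The result is of course only as good as the per-kilometer emission factor chosen.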
What might be interesting, though, is calculating the distance per trip. So let's give it a shot:

In [44]:
# plt.figure(figsize=(8,10))
plt.subplot(2, 1, 1)
ax = usage_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")

distance_per_trip_by_hour = distance_by_hour / usage_by_hour
plt.subplot(2, 1, 2)
ax = distance_per_trip_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Average trip distance by time")
ax.set_xlabel("time")
ax.set_ylabel("av trip distance")

plt.tight_layout()


Above, we compare the average trip distance with the average number of rentals by time of day. A striking finding is the increase in trip distance between 10 p.m. and 4 a.m.: while not many people rent bikes during this timespan, those who do tend to ride above-average distances.

This concludes the first statistical analysis of the data, but there is much more to look at. In a follow-up post I'll start applying machine learning to the data. Especially in combination with supplemental data sources, there are many more insights to be gained!