After gathering and cleaning data on rental bikes in Cologne, we already looked at map visualizations. Here, we will dive into the data more deeply. Using descriptive statistics we will try to find interesting patterns. For this, we will use the pandas library. It is great for manipulating and analyzing data. Moreover, we will use matplotlib to visualize some of our findings.

You can find the data used in this article here. This post was written as a jupyter notebook. You can also find it at github.

Let's get started by setting up things:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import vincenty
# define style of plots
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (8,6)
```

After setting up the environment, we read in the data. Following, we take a first glance at the structure:

```
BIKES_COMBINED = "../data/bikedata_10min-int-observ_Feb-March.csv"
bikes = pd.read_csv(BIKES_COMBINED)
bikes.scrape_datetime = pd.to_datetime(bikes.scrape_datetime)
bikes.head(8)
```

We have one observation for all bike locations (lat and lon) every ten minutes. The bikes can be identified by their `bike_id`

.
If a `bike_id`

is missing at one observation point it means that the bike is currently **not** available. This is the case when a bike is in use. Alternatively, the bike (or the GPS) could be broken or under maintenance.

If between two observation points the coordinates have changed, the bike has been used. Accordingly, we can create a dummy variable to indicate this. We do this by grouping the data by `bike_id`

and applying a function to every row in the `lat`

column. The function checks if the difference between the current and previous row (i.e. the current and previous observation) is greater than zero:

```
# Compute dummy (0/1) for location changes: did coordinates change between two observations?
bikes['change'] = bikes.groupby('bike_id')['lat'].apply(lambda x: abs(x.diff()) > 0).astype(int)
```

Now, that we know a bike was used let's find out what distance it traveled. Because we only have the start and end point of the journey this will be a rough estimate. We can expect the actual distance to be significantly longer. To achieve this, first we create a new column. It contains the coordinates shifted by one observation period for each bike. Then, we create the index `m`

that includes all rows in our data where the bike has been used. Using this, we can compute the distance between the actual coordinates and the shifted coordinates. This is done by applying the `vincenty`

function from the `geopy`

package to each row in `m`

:

```
# Compute Trip Distance when coords changed
bikes['coords'] = bikes[['lat','lon']].values.tolist()
bikes['coords_shifted'] = bikes.groupby('bike_id')['coords'].shift()
m = bikes['coords_shifted'].notnull() & bikes['change'] == 1
bikes.loc[m, 'trip_distance'] = bikes.loc[m, :].apply(
lambda x: vincenty(x['coords'], x['coords_shifted']).kilometers, axis=1)
```

As a next step, we can start to look at some descriptives for the usage and distance variables we've just defined. Let's take a look at the usage first:

```
# Compute some descriptives statistics on bike usage
length_observ_period = bikes.scrape_datetime.dt.dayofyear.max() - bikes.scrape_datetime.dt.dayofyear.min()
num_bikes = len(bikes.groupby('bike_id'))
num_observ_per_bike = bikes.groupby(['bike_id']).size()
usage_mean_daily = np.mean(bikes.groupby(bikes.scrape_datetime.dt.dayofyear).change.sum() / num_bikes)
usage_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).change.sum() / length_observ_period
usage_by_hour_day_per_week = bikes.groupby(
[bikes.scrape_datetime.dt.hour,'scrape_weekday' ]).change.sum() / length_observ_period
usage_by_weekday = bikes.groupby(['scrape_weekday']).change.sum() / length_observ_period * 7
```

```
print("Days in dataset: ", length_observ_period)
print("Number of bikes: ", num_bikes)
print("Mean usage per day: ", usage_mean_daily)
# print("Number of observations per bike: ", num_observ_per_bike )
print("Average usage by time: ", usage_by_hour)
print("Average usage by time and weekday: ", usage_by_hour_day_per_week)
print("Average usage by weekday: ", usage_by_weekday)
```

During our observation period of 40 days we have 1230 different bikes on the streets. On average a bike is used 1.7 times a day. Hence, there are about 2100 trips made each day. The usage by hour, weekday and hour and weekday combined hints at some interesting patterns. This becomes more clear when we visualize these figures.

Taking advantage of the possibility to directly plot pandas DataFrames and Series with `pyplot`

we look at some graphs:

```
ax = num_observ_per_bike.hist(bins=int(num_bikes/20))
ax.set_title("Total observations")
ax.set_xlabel("number of observations")
ax.set_ylabel("number of bikes")
```

We see that there are about 50 bikes which have almost no observations in our data. Also a few others have less than 3000. Above 3000 there is a steady increase. Probably, bikes with very few observations have broken down at some point at the beginning. However, bikes that have been used a lot will also have fewer observations. Consequently, there are quite a few bikes that seem to be used very rarely.

Next, we look at patterns between weekdays:

```
weekday_index = ["Mon","Tue","Wed","Thu","Fri","Sat","Sun"]
ax = usage_by_weekday.reindex(weekday_index).plot()
ax.set_title("Usage of bikes by weekday")
ax.set_xlabel("weekday")
ax.set_ylabel("bikes used")
ax.set_xticks(range(len(weekday_index)))
ax.set_xticklabels(weekday_index)
```

Thursdays and Fridays are the most popular rental days. More than 2500 trips are made on average on these days. On the weekends there is a massive drop. Usage is almost 40% lower compared to the weekdays. This indicates that commuting between home and work (or school / university) is the dominant use case for rental bikes.

Let's follow up by looking at how rentals change by time of the day:

```
ax = usage_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")
```

The graph aboves shows the usage of bikes by hour of the day for an average day. We see that the usage takes off at about 6 a.m. in the morning. It climbs to its first peak at around 9 and then drops somewhat. This corresponds well with people going to work. Then, at about 11 a.m. usage starts to increase steadily. The maximum is reached at about 6 p.m. Again, this fits the fact that people use the bikes to return back home after work. Thereafter, there is an even decline until the minimum at around 5 a.m.

Let's see if there is a difference in this graph between weekdays and weekends:

```
# Compute Usage by time for weekdays vs weekends
usage_by_hour_wd = bikes[bikes.scrape_datetime.dt.weekday <= 5].groupby(
bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 5/7)
usage_by_hour_we = bikes[bikes.scrape_datetime.dt.weekday > 5].groupby(
bikes.scrape_datetime.dt.hour).change.sum() / (length_observ_period * 2/7)
```

```
plt.subplot(2, 1, 1)
ax = usage_by_hour_wd.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time\nWeekdays")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")
plt.subplot(2, 1, 2)
ax = usage_by_hour_we.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Weekends")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")
plt.tight_layout()
```

A few interesting points can be taken from this comparison. On the weekends, bike rentals in the morning begin to increase two hours later. Also, instead of the two peaks during weekdays there is only one peak. This speaks for the fact that commuting to work does not play a major role on weekends. Moreover, there is an increase of night time rentals between 11 p.m. and 2 a.m. This is consistent with a leisure oriented use of the bikes.

Moving on, we investigate the traveled distance by first looking at some descriptives:

```
# Compute descriptive stats on distance
distance_total = bikes.trip_distance.sum()
distance_by_bike = bikes.trip_distance.sum() / num_bikes
distance_by_day = distance_total / length_observ_period
distance_by_hour = bikes.groupby(bikes.scrape_datetime.dt.hour).trip_distance.sum() / length_observ_period
distance_by_weekday = bikes.groupby(['scrape_weekday']).trip_distance.sum() / length_observ_period * 7
```

```
print("Total distance: ", distance_total)
print("Total Distance per bike: ", distance_by_bike)
print("Total distance per day: ", distance_by_day)
print("Average distance per hour: ", distance_by_hour)
print("Average distance per weekday: ", distance_by_weekday)
```

Here, we look at the traveled distance during the 40 day period at hand. All bikes taken together have been rode for more than 120.799km. Traveled by car, this would sum up to about 15 tons of CO² emitted. On average, each bike was used for a total of about 100km or 2,5km a day. In general, there are no big surprises here. The overall picture is equivalent to what we've seen before.

What might be interesting, though, is calculating the distance per trip. So let's give it a shot:

```
# plt.figure(figsize=(8,10))
plt.subplot(2, 1, 1)
ax = usage_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Usage of bikes by time")
ax.set_xlabel("time")
ax.set_ylabel("bikes used")
distance_per_trip_by_hour = distance_by_hour / usage_by_hour
plt.subplot(2, 1, 2)
ax = distance_per_trip_by_hour.plot()
plt.xticks(range(0,24)[::2])
ax.set_title("Average trip distance by time")
ax.set_xlabel("time")
ax.set_ylabel("av trip distance")
plt.tight_layout()
```

Above, we compare the average trip distance with the average number of rentals by time. A striking finding is the increase in trip distance observed between 22 p.m. and 4 a.m. While not many people rent bikes during this timespan those who do tend to ride for above average distances.

This concludes the first statistical analysis of the data. But there is so much more that we can look at. Consequently, in a follow up post I'll start applying machine learning to the data. Especially, in combination with supplemental data sources there are much more insights to be gained!

## Comments

comments powered by Disqus