Motivation

Tinder is a huge phenomenon in the online dating world. Because of its massive user base, it potentially offers a lot of data that is exciting to analyze. A general overview of Tinder can be found in this article, which mainly looks at key business figures and user surveys:

Tinder usage in the UK

Source: Survey by Weareflint


However, there are only sparse resources looking at Tinder app data at the user level. One reason is that the data is not easy to gather. One approach is to ask Tinder for your own data. This process was used in this inspiring analysis, which focuses on matching rates and messaging between users. Another way is to create profiles and automatically collect data on your own by using the undocumented Tinder API. This method was used in a paper which is summarized neatly in this blogpost. The paper also focused on the matching and messaging behavior of users. Lastly, this post summarizes findings from the biographies of male and female Tinder profiles from Sydney.

In the following, we will complement and expand on previous analyses of Tinder data. Using a unique, extensive dataset, we will apply descriptive statistics, natural language processing and visualizations in order to uncover patterns on Tinder. In this first analysis we will focus on insights from profiles we observe during swiping. In a follow-up post we will then look at novel findings from a field experiment on Tinder. The results will reveal new insights regarding liking behavior and patterns in matching and messaging of users.

Data collection

The dataset was gathered using two bots making use of the unofficial Tinder API. The bots used two almost identical male profiles aged 29 to swipe in Germany over the course of four weeks. After each week, the location was set to the city center of one of the following cities, all among the five largest in Germany: Berlin, Frankfurt, Hamburg and Munich. The distance filter was set to 16 km, the age filter to 20-40 and the search preference to women.
Each bot encountered about 300 profiles per day. The profile data was returned in JSON format in batches of 10-30 profiles per response.
Unfortunately, I won't be able to share the dataset because doing so is in a gray area. Check out this post to learn about the many legal issues that come with such datasets.

Setting up things

In the following, I will share my data analysis of the dataset using a Jupyter Notebook. So, let's get started by first importing the packages we will use and setting some options:

In [117]:
# coding: utf-8
import os
import json
import pandas as pd
import numpy as np
import nltk
import textblob
import datetime
from wordcloud import WordCloud
from PIL import Image
from IPython.display import Markdown as md
from pandas import json_normalize  # pandas >= 1.0; formerly pandas.io.json.json_normalize
import hvplot.pandas
from bokeh.io import output_notebook
##
output_notebook()
pd.set_option('display.max_columns', 100)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


Most packages are the basic stack for any data analysis. In addition, we will use the wonderful hvplot library for visualization. Until now I was overwhelmed by the vast choice of visualization libraries in Python (here is a great read on that). This ends with hvplot which comes out of the PyViz initiative. It is a high-level library with a concise syntax that produces not only aesthetic but also interactive plots. Among others, it smoothly works on pandas DataFrames.
With json_normalize we can easily create flat tables from deeply nested JSON files. The Natural Language Toolkit (nltk) and TextBlob will be used to deal with language and text. And finally, wordcloud does what it says.
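
To illustrate what json_normalize does, here is a minimal sketch on two made-up records. The field names mimic the column names shown further down, but the values and the exact nesting are assumptions and not actual API output.

```python
import pandas as pd

# Hypothetical, heavily simplified records; real responses contain many more fields
records = [
    {"_id": "a1", "name": "Anna",
     "instagram": {"username": "anna_ig", "media_count": 12},
     "photos": [{"id": "p1"}, {"id": "p2"}]},
    {"_id": "b2", "name": "Bea", "photos": [{"id": "p3"}]},
]

# Nested dicts become dotted column names; lists are kept as-is inside the cells
flat = pd.json_normalize(records)
print(sorted(flat.columns))

# Keys missing in a record (no linked Instagram here) end up as NaN
print(flat.loc[1, "instagram.username"])
```

This is why columns such as instagram.media_count contain mostly NaN and why the photos column holds lists.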

After reading in the .json files and applying some minor preprocessing, the result is a DataFrame. It contains all relevant pieces of information in its columns. Some columns, e.g. photos, are deeply nested in the original JSON. Therefore, their cells contain lists. For now, we don't need them, so we'll just ignore that fact. A first peek at the data shows this:

In [24]:
profiles.head()
Out[24]:
_id bio birth_date birth_date_info common_friend_count common_friends common_like_count common_likes connection_count content_hash distance_mi gender group_matched hide_age hide_distance instagram.completed_initial_fetch instagram.last_fetch_time instagram.media_count instagram.photos instagram.profile_picture instagram.username is_traveling jobs name photos ping_time s_number schools show_gender_on_profile spotify_theme_track.album.id spotify_theme_track.album.images spotify_theme_track.album.name spotify_theme_track.artists spotify_theme_track.id spotify_theme_track.name spotify_theme_track.preview_url spotify_theme_track.uri teaser.string teaser.type teasers scrape_time city bot spotify_top_artists custom_gender is_super_like
0 536df31115b9cb4d570010bf 1991-07-05T19:20:22.207Z fuzzy birthdate active, not displaying real bi... 0 [] 0 [] 0 934sQJIXPhvNHpMuXJf81s8oHpkFDQhn8sekIp4fg7tZqhJJ 2 1 False NaN NaN NaN NaN NaN NaN NaN NaN NaN [] Pauline [{'crop_info': {'user': {'height_pct': 1.0, 'w... 2014-12-09T00:00:00.000Z 35163144 [] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN [] 2019-07-02 19:20:22 frankfurt 1 NaN NaN NaN
1 5d0798f48a5236150004cd0c ...liebt Mode,Autos,nicht oberflächlich, große... 1987-07-05T19:20:22.207Z fuzzy birthdate active, not displaying real bi... 0 [] 0 [] 0 j5AiztQNCn2fmQseVTkJsrVHDEcNJiMofz7TeUVpTgahQ7 2 1 False NaN NaN NaN NaN NaN NaN NaN NaN NaN [] Caro [{'crop_info': {'algo': {'height_pct': 0.15717... 2014-12-09T00:00:00.000Z 771689874 [] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN [] 2019-07-02 19:20:22 frankfurt 1 NaN NaN NaN
2 5a91a76dbd1aef380c118ea6 1991-07-05T19:20:22.207Z fuzzy birthdate active, not displaying real bi... 0 [] 0 [] 0 77xIpateugSgwh0QHD9trmfOJf9YtYOHXdcvgs5PcJ9uAQ 6 1 False NaN NaN NaN NaN NaN NaN NaN NaN NaN [] Jennifer [{'crop_info': {'algo': {'height_pct': 0.50207... 2014-12-09T00:00:00.000Z 488436299 [] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN [] 2019-07-02 19:20:22 frankfurt 1 NaN NaN NaN
3 5ba1326f74b4ef521e54b951 1990-07-05T19:20:22.207Z fuzzy birthdate active, not displaying real bi... 0 [] 0 [] 0 4rcqRTA5HOdtzCR0C2pHYji8QTNJU8ioAuwJC3lT1oc9e 3 1 False NaN NaN NaN NaN NaN NaN NaN NaN NaN [{'title': {'name': 'Managerin'}}] Alexandra [{'extension': 'jpg', 'fileName': '27eb3237-20... 2014-12-09T00:00:00.000Z 629764160 [] NaN NaN NaN NaN NaN NaN NaN NaN NaN Managerin job [{'string': 'Managerin', 'type': 'job'}] 2019-07-02 19:20:22 frankfurt 1 NaN NaN NaN
4 588fb22f5250a85b0a4b14dc Match me if you can ✌️\nAny American guys here... 1988-07-05T19:20:22.207Z fuzzy birthdate active, not displaying real bi... 0 [] 0 [] 0 w7gs9Fmwuki78tpauMjfepIYjt6cD7TbULnuN8I7bHoP 5 1 False False False True 2019-06-09T16:50:17.006Z 573.0 [{'image': 'https://scontent.cdninstagram.com/... https://scontent.cdninstagram.com/vp/f669206f3... Tinder False [] Maria [{'extension': 'jpg', 'fileName': '7011f614-48... 2014-12-09T00:00:00.000Z 316845661 [] NaN 5w1Q754bGLBaWg7R4rggwM [{'height': 640, 'url': 'https://i.scdn.co/ima... Existensia [{'id': '7nR8HLYnGB0qaVv8869zN2', 'name': 'Mis... 3FVkxus3g1Qq4vd0nIS2za Existensia https://p.scdn.co/mp3-preview/b2e835ab5a2319a2... spotify:track:3FVkxus3g1Qq4vd0nIS2za NaN [{'string': '573 Instagram Photos', 'type': 'i... 2019-07-02 19:20:22 frankfurt 1 NaN NaN NaN


Basically, we have all the info that makes up a Tinder profile. Moreover, we have some additional data which might not be obvious when using the app. For example, the hide_age and hide_distance variables indicate whether the person has a premium account (those are premium features). Usually, they are NaN, but for paying users they are either True or False. Paying users can have either a Tinder Plus or a Tinder Gold subscription. In addition, teaser.string and teaser.type are empty for most profiles. In some cases they are not. I would guess that this indicates profiles showing up in the top picks part of the app.
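
Under that reading (an assumption based on the observed data, not on documented API behavior), a premium flag can be derived by checking whether hide_age is present at all. A small sketch with made-up values; the column name is_premium is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the profiles table; NaN means the field was absent in the response
toy = pd.DataFrame({"hide_age": [np.nan, False, np.nan, True, np.nan]}, dtype=object)

# Any non-missing value (True or False) is interpreted as a premium account
toy["is_premium"] = toy["hide_age"].notna()
share_premium = toy["is_premium"].mean() * 100
print(f"{share_premium:.1f}% premium")
```

Counting non-missing values is also what Series.count() does, which is how the premium share is computed in the next section.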

Some general figures

Let's see how many profiles there are in the data. Also, we'll check how many profiles we've encountered multiple times while swiping. For that, we'll look at the number of duplicates. Moreover, let's see what fraction of people are paying premium users:

In [43]:
num_profiles_1 = len(profiles[profiles['bot']==1])
num_profiles_2 = len(profiles[profiles['bot']==2])
num_profiles = num_profiles_1 + num_profiles_2

num_dups_total = len(profiles[profiles.duplicated(['_id'])].sort_values('_id'))

num_dups_1 = len(profiles[(profiles.duplicated(['_id', 'bot'])) &
                          (profiles['bot']==1)].sort_values('_id'))
num_dups_2 = len(profiles[(profiles.duplicated(['_id', 'bot'])) &
                          (profiles['bot']==2)].sort_values('_id'))

share_dups_total = num_dups_total / num_profiles * 100
share_dups_1 =  num_dups_1 / num_profiles_1 * 100
share_dups_2 =  num_dups_2 / num_profiles_2 * 100

share_premium = profiles['hide_age'].count() / len(profiles['hide_age']) * 100
Out[148]:

In total, we observed 16673 female profiles during swiping. The first bot encountered 8428 and the second 8245. Of those, only 0.6% each were encountered more than once by the same bot. In conclusion, if you don't swipe excessively in the same area, it is very improbable that you will see a person twice after passing on them. In 12.5% of the cases a profile was suggested to both our bots. Given the number of profiles observed in total, this shows that the total user base must be huge for the cities we swiped in. Our next interesting finding is the 8.1% of premium users encountered in our sample. I would expect this fraction to be even higher for men. In conclusion, Tinder seems to be very successful at getting users to pay for better chances in the matching game.
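
The duplicate counts rely on DataFrame.duplicated, which flags every repeat of a key after its first occurrence. A toy example with invented IDs shows how the choice of key columns changes the count:

```python
import pandas as pd

# Invented observations: profile "x" was seen by both bots, "y" twice by bot 1
toy = pd.DataFrame({
    "_id": ["x", "y", "x", "y", "z"],
    "bot": [1, 1, 2, 1, 2],
})

# Repeats of the same profile, regardless of which bot saw it
dups_any_bot = toy.duplicated(["_id"]).sum()
# Repeats of the same profile within the same bot only
dups_same_bot = toy.duplicated(["_id", "bot"]).sum()
print(dups_any_bot, dups_same_bot)
```

Note that the first count also includes repeats within a single bot.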

I'm old enough to be ...

Next, we drop the duplicates and start looking at the data in more depth. We begin by calculating the age of the profiles and visualizing its distribution:

In [51]:
profiles = profiles.drop_duplicates(['_id'], keep='last').reset_index(drop=True)
In [96]:
# Deal with time and calculate age - original analysis done on 2019-07-01
# (pin NOW to that date to reproduce the figures shown here)
NOW = pd.Timestamp.now(tz='UTC')
profiles['birth_date'] = pd.to_datetime(profiles['birth_date'], utc=True)
# Approximate age in full years; the ambiguous 'Y' unit of np.timedelta64
# is not supported by newer pandas versions, so we divide by days per year
profiles['age'] = np.floor((NOW - profiles['birth_date']).dt.days / 365.25)
In [142]:
# Plot ages by city - click legend to toggle city 
age_plot = pd.crosstab(profiles['age'], profiles['city'], normalize='columns')\
    .apply(lambda x: x * 100).stack().rename('Profile age distribution by city')\
    .hvplot.bar()
age_plot.groupby('city').opts(alpha=0.4, width=700, ylabel='profiles', yformatter='%.0f%%')\
    .overlay()
Out[142]:


The distribution of ages we encounter comes close to a normal distribution with a mean of 28. There are some minor differences across cities. Ages in Berlin are more concentrated around the mean, while in Frankfurt they are more evenly distributed. However, this is not a plain representation of the respective female Tinder user base. It is biased because there is an age filter which works both ways: Tinder will only display profiles that are within your defined limits and for which you are within their respective age limits as well. Apparently, women prefer to swipe on men similar in age. Still, the right skew of the distribution indicates that older men are preferred to younger ones.
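
The two-way filter can be made concrete with a small sketch. The function below illustrates the rule as described in this post; it is not Tinder's actual implementation, and the name is_shown as well as the candidate preferences are made up.

```python
def is_shown(viewer_age, viewer_range, profile_age, profile_range):
    """Return True if both age filters accept the other person."""
    v_min, v_max = viewer_range
    p_min, p_max = profile_range
    viewer_accepts = v_min <= profile_age <= v_max
    profile_accepts = p_min <= viewer_age <= p_max
    return viewer_accepts and profile_accepts

# Bot settings from the data collection: age 29, age filter 20-40
bot_age, bot_range = 29, (20, 40)

# Hypothetical profiles as (age, own age filter)
candidates = [(25, (24, 32)), (25, (30, 40)), (38, (27, 45)), (45, (25, 35))]
visible = [age for age, rng in candidates if is_shown(bot_age, bot_range, age, rng)]
print(visible)
```

The second 25-year-old is hidden although she is within the bot's limits, because a 29-year-old falls outside her own filter. The observed age distribution therefore reflects the women's preferences as well as the bot's.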

Say my name, say my name....

Tinder shows the first names of its users. Using that, let's look at the most popular names that we came across:

In [11]:
from collections import Counter
# Table with topN profile names
wordcount = Counter(profiles['name'])
pd.DataFrame(wordcount.most_common(50), columns=['name', 'count']).hvplot.table()
Out[11]:


As expected, very common German names are overrepresented in our sample. Julia definitely takes home the trophy by being by far the most common name we observe.

It's all about the pictures

Your looks contribute vastly to how attractive you are perceived to be, especially when other information is lacking. That's more true on Tinder than anywhere else and the reason why success here is all about your pictures. This is why you can upload up to ten profile pictures. Profiles with only one picture often seem fishy. But ten pictures? Really? That might come off as desperate. So, how many is too many? Let us investigate how females decide here:

In [141]:
# Count and viz Number of Photos
profiles['num_profiles_photos'] = profiles['photos'].map(len)
profiles['num_profiles_photos'].describe()

profiles.groupby('num_profiles_photos')['_id'].count().transform(lambda x: x / x.sum() * 100)\
        .rename('rel. freq.').hvplot\
        .bar(xlabel='pictures', ylabel='profiles', width=700,
             title='Number of profile pictures', yformatter='%.0f%%')
Out[141]:
count    14590.000000
mean         4.644208
std          2.226438
min          1.000000
25%          3.000000
50%          4.000000
75%          6.000000
max         10.000000
Name: num_profiles_photos, dtype: float64
Out[141]:


Looking at the interquartile range (IQR), we see that females mostly tend to upload between three and six pictures. Almost nobody seems so desperate as to make use of all ten pictures. However, quite a few settle for a single picture. Does this have anything to do with age? Let's check:

In [149]:
profiles[['age', 'num_profiles_photos']].corr()
Out[149]:
age num_profiles_photos
age 1.000000 0.057172
num_profiles_photos 0.057172 1.000000


A basic correlation between age and num_profiles_photos shows virtually no link (r ≈ 0.06). However, this is a perfect lesson on why one should be wary of correlations. We can only conclude that there is no linear link between those variables. But what about non-linear dependencies? A truly interesting pattern emerges when we plot the average number of pictures per age in a scatter plot:

In [146]:
profiles.groupby(['age'])['num_profiles_photos'].mean().rename('Average number of profile pics')\
        .hvplot.scatter(width=700)
Out[146]:


We get an almost perfect inverted U-shape: you can clearly see how the average number of pictures rises with age, peaks around 30 and then drops quickly. For me, this pattern is somewhat surprising. I would have expected younger people to be more involved in their profiles and thus to share more pictures.
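
Why the correlation coefficient misses such a pattern can be seen on synthetic data. The numbers below are made up to form a symmetric inverted U around age 30; they are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

ages = list(range(20, 41))
# Synthetic inverted U: peaks at 30 and falls off symmetrically on both sides
pics = [6 - 0.03 * (age - 30) ** 2 for age in ages]

r_full = pearson(ages, pics)              # rising and falling halves cancel out
r_rising = pearson(ages[:11], pics[:11])  # ages 20-30 only: strongly positive
print(round(r_full, 3), round(r_rising, 3))
```

Over the full range the coefficient is essentially zero even though the two variables are perfectly dependent, which is why plotting the data is worthwhile.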

We can further investigate these patterns by looking at Instagram (IG). As users can link their IG account with Tinder to share even more pictures we have some data on that as well. We can investigate the frequency of linked IG accounts and also the number of IG pictures by age. Here, again, I expect younger people to be more willing to share. Let's see if my hunch is right this time:

In [152]:
profiles['instagram.media_count'] = profiles['instagram.media_count'].fillna(0)
profiles['instagram.media_count'].describe()

IG_max = profiles['instagram.media_count'].max()
IG_share = profiles[profiles['instagram.media_count'] > 0]['_id'].count()\
    / profiles['instagram.media_count'].count() * 100
IG_excessive = profiles[profiles['instagram.media_count'] > 1000]['_id'].count()

profiles['instagram.media_count'].rename('Number of IG pictures')\
    .hvplot.hist(bins=50, xlabel='pictures',
                 title='Histogram for number of IG pictures', width=700)
Out[152]:
count    14590.000000
mean        25.268677
std        112.344774
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       2545.000000
Name: instagram.media_count, dtype: float64
Out[152]:
Out[147]:


Surprisingly, only 12% of women link their IG account. But amongst those who do, there are 32 females with more than 1000 pictures. One hits the jackpot with 2545. That seems a little bit excessive... Fortunately, this exaggerated use of IG can only be observed in less than 1% of our sample. So most females we encounter still seem to have a life besides that. In terms of correlation with age: there is none. So again, my intuition has been proven wrong.

A picture is worth a thousand words. But still ...

Obviously, pictures are the most important feature of a Tinder profile. Also, age plays an important role because of the age filter. But there is one more piece to the puzzle: the biography text (bio). While some don't use it at all, others seem to put a lot of care into it. The text can be used to describe oneself, to state expectations or in some cases just to be funny:

How to use text in your Tinder profile

Source: Reddit


For that, you have a 500-character limit. Let's see what we can learn from the female bios:

In [78]:
# Calc some stats on number of chars
profiles['bio_num_chars'] = profiles['bio'].str.len()
profiles['bio_num_chars'].describe()
bio_chars_mean = profiles['bio_num_chars'].mean()
bio_text_yes = len(profiles[profiles['bio_num_chars'] > 0])
bio_text_100 = len(profiles[profiles['bio_num_chars'] > 100])
bio_text_share_no = (1 - (bio_text_yes / len(profiles['bio_num_chars']))) * 100
bio_text_share_100 = bio_text_100 / len(profiles['bio_num_chars']) * 100
Out[78]:
count    14590.000000
mean        59.100274
std        100.689832
min          0.000000
25%          0.000000
50%         12.000000
75%         72.000000
max        500.000000
Name: bio_num_chars, dtype: float64
Out[82]:

In 41% of the cases females didn't use the biography at all. The average female observed has around 59 characters in her bio. And only 19.6% seem to put some emphasis on the text by using more than 100 characters. These findings suggest that text only plays a minor role on Tinder profiles. However, while pictures are naturally essential, text might play a more subtle part. For example, emojis (or hashtags) are often used to describe one's preferences in a very character-efficient way. This strategy is in line with communication in other online channels like Twitter or WhatsApp. Hence, we'll take a look at emojis and hashtags later on.

What can we learn from the content of biography texts? To answer this, we will need to dive into Natural Language Processing (NLP). For this, we will use the nltk and TextBlob libraries. Some informative introductions on the topic can be found here and here. They describe all methods applied here.
We start by looking at the most common words. For that, we create a string containing all bio texts. Next, we need to get rid of very common words (stopwords). After that, we can count the occurrences of the remaining words:

In [ ]:
from textblob import TextBlob
profiles['bio'] = profiles['bio'].fillna('').str.lower()
# Create a string with all bio texts
bio_text_dirty = ' '.join(profiles['bio'])
bio_text = TextBlob(bio_text_dirty)
In [84]:
# Filter out English AND German stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop.extend(stopwords.words('german'))
bio_text_words = ' '.join(word for word in bio_text.words if word not in stop)
In [104]:
# Count word occurrences and show a table of the 100 most frequent words
wordcount = Counter(TextBlob(bio_text_words).words)
words_top100 = pd.DataFrame(wordcount.most_common(100), columns=['word', 'count'])
words_top100.hvplot.table()
Out[104]:
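
The same two steps, stopword removal followed by counting, can be sketched on a toy example with the standard library only. The three bios and the tiny stopword set are invented for illustration; the analysis above uses the full nltk stopword lists.

```python
import re
from collections import Counter

# Invented bios and a deliberately tiny stopword set (English and German mixed)
bios = ["I love travel and wine", "Ich liebe Reisen und Wein", "love Berlin and love coffee"]
stop_words = {"i", "and", "ich", "und"}

# Lowercase, split into word tokens, drop stopwords
tokens = [
    word
    for bio in bios
    for word in re.findall(r"\w+", bio.lower())
    if word not in stop_words
]
word_counts = Counter(tokens)
print(word_counts.most_common(1))
```

Using a set for the stopwords keeps the membership test fast, which starts to matter once all bios are joined into one long text.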


We can also visualize our word frequencies. The classic way to do this is using a wordcloud. The package we use has a nice feature that allows you to define the outlines of your wordcloud. As an homage to Tinder we use this to make it look like a flame:

In [129]:
import matplotlib.pyplot as plt
mask = np.array(Image.open('./flame.png'))

wordcloud = WordCloud(
                background_color='white', stopwords=stop, mask = mask, 
                max_words=60, max_font_size=60, scale=3, random_state=1
            ).generate(str(bio_text_words))
plt.figure(figsize=(6,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Out[129]:
(-0.5, 1535.5, 1535.5, -0.5)


So, what do we see here? Well, people like to show where they are from, especially if that is Berlin or Hamburg. That's why the cities we swiped in are very common. No big surprise here. More interestingly, we find the words ons, ig and love. What about the most popular hashtags?

In [93]:
# Get hashtags from uncleaned text
bio_text_dirty = bio_text_dirty.replace('\n', ' ')
hashtags = [word for word in bio_text_dirty.split(' ') if word.startswith('#')]
num_hashtags = Counter(hashtags)
# Table of the 100 most frequent hashtags
hashtags_top100 = pd.DataFrame(num_hashtags.most_common(100), columns=['hashtag', 'count'])
hashtags_top100.hvplot.table()
Out[93]:


It seems that people like to get creative with their hashtags and use whatever they feel like. Because of that the tags are not very repetitive. The reason might be that hashtags don't really serve a purpose on Tinder. But here, as well as in the words investigated above we see that English words are pretty common. So, before we further investigate frequent words we'll look at the share of German vs. English profile texts:

In [94]:
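# Get a sample of all profiles with text and guess language
# Note: detect_language() calls the Google Translate API and is deprecated
# in newer TextBlob releases
# Select ['bio'] as a list to get a DataFrame, so that 'lang' becomes a column
profiles_sample = profiles.loc[profiles['bio'].str.len() > 0, ['bio']]\
                          .sample(frac=0.1, random_state=1)
profiles_sample['lang'] = profiles_sample['bio']\
                            .map(lambda x: TextBlob(x).detect_language()
                                 if len(x) > 3 else None)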
In [113]:
(profiles_sample['lang'].value_counts(normalize=True).rename('rel. freq.') * 100)\
    .hvplot.bar(width=700, title="Profile text languages",
                xlabel='language', ylabel='profiles', yformatter='%.0f%%')
Out[113]:


The detect_language method of TextBlob works using the Google Translate API. Thus, it takes quite some time and there is also a limit on the amount of text we can submit. Consequently, we only take a random sample (10%) of our data to work on. Also, we should keep in mind that it returns only a best guess. Hence, the results might be inaccurate especially when the text sample is short.
In our sample we see that English is used almost as often as German. While we observe a lot of different languages besides that, their share is minuscule. Only Spanish appears in more than 1% of the biographies.

Without context, the most common words from above are not very meaningful. What should we conclude from the word ons (one night stand)? Is Tinder really used for hook-ups only? Or do women rather express that they are not looking for that at all? We'll find out soon:

In [195]:
# Words surrounding the word of interest
words_closeby = []
for sentence in bio_text.sentences:
    words = sentence.words
    for pos, word in enumerate(words):
        # Only keep occurrences with a neighbor on both sides; enumerate also
        # handles repeated words and avoids wrapping around with index -1
        if word == 'ons' and 0 < pos < len(words) - 1:
            words_closeby.append((words[pos - 1], words[pos + 1]))
most_common_words = Counter(words_closeby).most_common(10)
most_common_words
Out[195]:
[(('keine', 'oder'), 17),
 (('an', 'oder'), 9),
 (('an', 'f'), 8),
 (('no', 'i'), 8),
 (('keine', 'und'), 5),
 (('keine', 'keine'), 4),
 (('no', 'no'), 4),
 (('no', '🚫'), 4),
 (('no', 'or'), 4),
 (('no', 'if'), 3)]

The picture is clear: in most cases the texts state that they are in fact not interested in one night stands. Good thing we checked the context before drawing any conclusions.

Finally, we want to take a look at the emojis used in bios. We have seen before that they seem to be popular as they can carry a lot of meaning with few characters. With the help of the emoji package we can easily identify them in texts:

In [114]:
# Extract all emojis and count occurrences
# (emoji.EMOJI_UNICODE is only available in older releases of the emoji package)
import emoji
emojis_unicode = []
for word in TextBlob(bio_text_words).words:
    if word in emoji.EMOJI_UNICODE.values():
        emojis_unicode.append(emoji.emojize(word))

wordcount = Counter(emojis_unicode)
emoji_df = pd.DataFrame(wordcount.most_common(10), columns=['emoji','count'])
In [138]:
emoji_df.hvplot.bar(x='emoji', y='count', invert=True, flip_yaxis=True,
                    title='Top 10 emoji occurrences', xlabel='')\
        .opts(fontsize={"yticks": 15, "title": 18}, width=700)
Out[138]:


Again, it becomes clear that people often share their location. That's why the pin emoji is regularly used. Moreover, the German flag is commonly used, probably to indicate the spoken language. Smoking seems to be pretty unpopular, while wine and traveling (globe emoji) seem to be well liked by many women. Also, photography (or linking one's Instagram) seems to be popular.

Coming up next

This concludes our analysis of the Tinder female profiles data. If you are curious, there are some more things to explore in this data. Here are some suggestions:

  • Schools
  • Jobs
  • Most popular Artists and Tracks for profiles with linked Spotify
  • Scrape profile pictures and start going crazy with AI

In an upcoming post I will dive deeper into the Tinder game: I'll present the results of a unique Tinder field experiment. By analyzing swiping and matching patterns, I'll present novel findings regarding discrimination in online dating.

