What I Learned from Scraping 15k Data Science Articles on Medium
Have you ever wondered what factors make an article receive a high number of claps? As a data science writer, I also wonder:
- What is the average number of claps? Some articles I came across have 100 or even 1000 claps. Is that a typical number of claps for a Data Science article?
- Which titles are most used by data science articles?
- What is the ideal reading time for a good article?
- Does publishing on weekdays earn more claps than publishing on weekends?
To answer these questions, I scraped all data science articles on Medium published within the last year.
To scrape Medium, I used the excellent repository from Harrison Jansma, with slight changes to the packages to deal with errors in the requirements. I chose 6 tags related to data science:
- Data science
- Machine learning
- Data visualization
- Big data
The articles were published between July 2019 and July 2020. It took me 4 to 5 hours to scrape all of these tags, but I got good data ready for cleaning and analysis. I merged the data from the 6 tags together and added a Tag column showing which tag each article belongs to.
If you want to play with the data and follow along with the article, you can download the data here:
or use Datapane Blob to get direct access to the data:
Save the data as medium, import the libraries, convert the string nan to real null values, and drop duplicates. We will keep the same articles with different tags for now.
```python
import datapane as dp
import numpy as np
import pandas as pd

# Replace the string 'nan' with real null values
medium = medium.replace('nan', np.nan)

# Drop duplicates
medium = medium.drop_duplicates(subset=['Title', 'Subtitle', 'Author', 'Year',
                                        'Month', 'Day', 'Tag'])
```
Now take a look at what we got:
What I Found
Which tags are most popular among data science-related tags?
Since different rows may just be the same articles with different tags, we will drop these articles to make sure we have only unique articles in our data.
```python
>>> # Number of duplicated articles with different tags
>>> medium.duplicated(subset=['Title', 'Subtitle', 'Author',
...                           'Year', 'Month', 'Day']).sum()
38516

>>> # Drop duplicates
>>> medium = medium.drop_duplicates(subset=['Title', 'Subtitle', 'Author',
...                                         'Year', 'Month', 'Day'])
```
From looking at the data, we can see that the number of comments on Medium is often very low. But how low exactly?
No article has more than 1 comment, and 96.1% of them have no comments at all!
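A quick check like the one below reproduces this kind of percentage. This is a minimal sketch on toy rows; the `Comment` column name is an assumption about the scraped frame, not confirmed by the original notebook.

```python
import pandas as pd

# Toy stand-in for the scraped data; the real frame has one row per
# article and (assumed) a numeric 'Comment' column.
medium = pd.DataFrame({'Comment': [0, 0, 0, 1, 0]})

# Share of articles with no comments at all
no_comment_pct = (medium['Comment'] == 0).mean() * 100
print(f"{no_comment_pct:.1f}% of articles have no comment")
```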
It is surprising that even articles with a high number of claps receive few comments. The hidden comment tab on Medium may discourage readers from commenting; a more visible comment section might increase the number of comments.
What is the average number of claps for an article? First, convert clap strings such as 1.5k to numbers such as 1500, then summarize the column.
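One way to sketch that conversion, on toy values; the helper name `claps_to_number` is mine, not from the original notebook:

```python
import pandas as pd

def claps_to_number(clap):
    """Convert Medium clap strings such as '1.5K' to numbers like 1500.0."""
    clap = str(clap)
    if clap.lower().endswith('k'):
        return float(clap[:-1]) * 1000
    return float(clap)

# Toy clap strings as they appear on Medium
claps = pd.Series(['1.5K', '55', '26K', '3']).apply(claps_to_number)
print(claps.describe())
```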
The average is 55. Not so bad. But look at the 50th percentile: it is 3! And the max is 26,000. This seems to be highly skewed data. Let’s double-check with a histogram.
Since the data is large and the number of claps is highly skewed, we sort the data by number of claps and plot the first 80k instances.
```python
import plotly.express as px

claps = px.histogram(medium.sort_values(by='Claps')[:80000],
                     x='Claps',
                     title='Number of Claps')
claps.show()
```
From the plot, we can see that the majority of clap counts fall in the range 0 to 10. When dealing with highly skewed data like this, it is better to capture the ‘middle’ of the data using the median rather than the mean!
If you are a data science writer who feels discouraged because you have 0–10 claps, you should feel OK, because this is pretty typical!
Reading Time vs Claps
You may have heard that the length of reading time can affect how much an audience likes an article. Let’s test it out:
There seems to be only a weak correlation between claps and reading time:
```python
>>> medium.corr().loc['Reading_Time', 'Claps']
0.1301349558669967
```
But one thing to notice is that the articles with a long reading time have a really low number of claps, while the articles with a high number of claps tend to have short reading times.
Let’s find out the average reading time for the articles with the top 25% of claps:
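A sketch of that filter using pandas `quantile`, on toy numbers; the `Claps` and `Reading_Time` column names are assumptions carried over from the correlation call above:

```python
import pandas as pd

# Toy stand-in; the real frame has numeric Claps and Reading_Time columns
medium = pd.DataFrame({
    'Claps':        [0, 2, 3, 10, 150, 800, 1200, 2600],
    'Reading_Time': [12, 3, 4, 5, 6, 7, 8, 9],
})

# Keep the top 25% of articles by claps, then summarize their reading time
threshold = medium['Claps'].quantile(0.75)
top = medium[medium['Claps'] >= threshold]
print(top['Reading_Time'].mean(), top['Reading_Time'].std())
```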
The average reading time of these well-received articles is around 6.6 minutes, with a standard deviation of 3.9. This suggests that articles with a reading time from roughly 2.7 to 10.5 minutes are ideal.
What is the typical publishing frequency of a data science writer? We use groupby() to group the data by author and count the total number of articles each author published within the last year.
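A minimal version of that aggregation, on toy author names standing in for the thousands of real ones:

```python
import pandas as pd

# Toy stand-in for the scraped frame's Author column
medium = pd.DataFrame({'Author': ['A', 'A', 'B', 'C', 'A', 'B']})

# Articles per author over the year, most prolific first
articles_per_author = (medium.groupby('Author').size()
                       .sort_values(ascending=False))
print(articles_per_author)
print("Median articles per author:", articles_per_author.median())
```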
I was curious where I stand in this ranking, so I used a simple snippet to find out:
```python
>>> author_rank = medium.Author.value_counts().index
>>> 100 - (list(author_rank).index('Khuyen Tran') + 1) / len(author_rank) * 100
99.85684944295761
```
I am in the 99.86th percentile of authors who published data science articles last year. Considering I only started writing in December 2019, this is a rewarding finding for the effort I put into my articles every week.
Find the median number of articles an author published last year:
This means the typical author published just one article in the past year.
Which publication publishes data science articles most frequently among all publications on Medium?
From the charts, the top four most active data science publications within the last year are:
- Towards Data Science
- Analytics Vidhya
- The Startup
- Data Driven Investor
Let’s take one step further and find out what percentage of all articles the top 1% of publications post:
```python
>>> sum(publication_groupby.sort_values(by='Year', ascending=False)
...     .head(int(len(publication_groupby) * 0.01)).Year) / sum(publication_groupby.Year)
```
The top 1% of publications posted 62% of the articles last year!
What is the trend of data science articles? Has the number of data science articles remained stable or changed within the last year?
The number of data science articles was fairly stable until March, then increased significantly from March to the present. Since this is about the same time the coronavirus hit many countries, could staying at home have given writers more time to write? Or has data science become a more attractive topic to readers and writers?
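The monthly counts behind a trend plot like this can be computed with a groupby on the Year and Month columns (toy rows here; the column names come from the dedup calls earlier):

```python
import pandas as pd

# Toy stand-in; the real frame covers July 2019 to July 2020
medium = pd.DataFrame({
    'Year':  [2019, 2019, 2020, 2020, 2020, 2020],
    'Month': [7, 12, 3, 4, 4, 5],
})

# Number of articles published per (year, month)
trend = medium.groupby(['Year', 'Month']).size()
print(trend)
```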
Day of the Week
What is the day that the authors prefer to publish their articles?
There are more articles published on weekdays than weekends. Is it because there are more readers if the article is published on weekdays?
Not quite. The number of claps looks similar for weekdays and weekends.
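One way to sketch that comparison is to build a weekend flag from the Year/Month/Day columns, then compare average claps. This is a toy illustration, not the original notebook's code:

```python
import pandas as pd

# Toy rows; 2020-07-06/07 are weekdays, 2020-07-11/12 a weekend
medium = pd.DataFrame({
    'Year':  [2020, 2020, 2020, 2020],
    'Month': [7, 7, 7, 7],
    'Day':   [6, 7, 11, 12],
    'Claps': [10, 20, 15, 25],
})

# Assemble dates, then flag Saturday (5) and Sunday (6)
dates = pd.to_datetime(medium[['Year', 'Month', 'Day']].rename(columns=str.lower))
medium['Weekend'] = dates.dt.dayofweek >= 5
print(medium.groupby('Weekend')['Claps'].mean())
```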
What are the most used titles of data science articles?
For the 6,488 missing titles, I used the URL to recover the title, since the URL contains the name of the article. For example:
https://towardsdatascience.com/to-become-a-better-data-scientist-you-need-to-think-like-a-programmer-18d0a00994dc?source=your_stories_page---------------------------
From this URL, the title is: to become a better data scientist you need to think like a programmer
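A sketch of that recovery; the helper name is mine, and the regex for stripping the trailing article id is my assumption about Medium's URL scheme:

```python
import re

def title_from_url(url):
    """Recover a readable title from a Medium URL slug."""
    slug = url.split('/')[-1].split('?')[0]      # strip domain and query string
    slug = re.sub(r'-[0-9a-f]{8,}$', '', slug)   # drop the trailing article id
    return slug.replace('-', ' ')

url = ('https://towardsdatascience.com/to-become-a-better-data-scientist'
       '-you-need-to-think-like-a-programmer-18d0a00994dc?source=your_stories_page')
print(title_from_url(url))
```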
Then we can process the text, combine the subtitle and title, and visualize them with a word cloud:
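The original visualizes with a word cloud; the underlying counting step can be sketched with `collections.Counter` on toy titles:

```python
import re
from collections import Counter

# Toy titles standing in for the combined Title + Subtitle text
titles = [
    'Machine Learning with Python',
    'A Gentle Introduction to Machine Learning',
    'Data Science for Everyone',
]

# Lowercase, tokenize into words, and count
words = re.findall(r'[a-z]+', ' '.join(titles).lower())
print(Counter(words).most_common(3))
```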
The most popular words are what we would expect from data science article titles: machine learning, python, data science, algorithm, data scientist, data analyst, etc.
Key takeaways from this article:
- A majority of articles have no comment.
- It is totally normal to have a low number of claps. In fact, that should be expected.
- There is no magic number for reading time, but the ideal reading time should not be too long. A lengthy article could scare the audience away.
- Many authors prefer to publish on weekdays, but they do not necessarily gain more claps by doing so.
- A typical author publishes one article per year.
- Some publications publish significantly more data science articles than others.
- The most popular words in data science articles are machine learning, python, data science, algorithm, data scientist, data analyst, etc.
I hope this article has given you interesting insights into data science articles. I encourage you to play with the data and use it to your advantage, either to gain more claps for your articles or to find the next data science article you should read.
The notebook for this article can be found here.
Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed about my latest data science articles like these: