Let’s Scrape a Blog! (Part 1)

One thing I’ve been considering lately is what kind of intelligence you could gain from scraping a blog and analyzing the data. To test this out, this post is the first in a series where I’ll scrape a blog and try to squeeze out every last bit of useful or interesting intelligence I possibly can.

I’ll start off simple, but down the road I plan to use more advanced machine learning and natural language processing techniques to see what additional information these tools can uncover. I’m keeping all my analysis in a Jupyter notebook you can find on GitHub here.

The target site I’ll be using for my analysis is my all-time favourite blog: Marginal Revolution. I have been following this blog pretty much daily since 2005 when I started my undergraduate degree. It’s run by the economists Tyler Cowen and Alex Tabarrok, who are personal heroes of mine.

Why scrape a blog?

For me, scraping Marginal Revolution was just something I did for kicks. Since I’m so interested in the blog’s content, I wanted to be able to do very customized searches of blog posts that would not be possible through the blog’s built-in search feature.

But there are reasons other than “just for fun” that you might want to scrape a blog. For example, maybe the blog is a competitor or in an industry you’re researching. Maybe you want to find out:

  • Roughly how many people read / comment on the blog
  • Blogging strategy in terms of number, type, and timing of posts
  • Which types of posts produce the most discussion / comments / controversy
  • What notable people read the blog (i.e. seeing if they comment in the comments section)
  • Whether any of these patterns have changed over time

…and I’m sure there are more possibilities.

Very brief overview of how the scraper works

My goal with the scraper was to get each individual post from the Marginal Revolution website. Marginal Revolution was fairly easy to scrape since the list of posts by month provided a predictable URL structure that made it possible to gather the links for each individual post across the entire website. With the full list of links, it was then simply a matter of making a request to each of these URLs and saving the resulting blog post HTML to disk. The scraper ultimately gathered 23,342 posts.
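For the curious, here’s a minimal sketch of that process using the requests and BeautifulSoup libraries. The year range, archive URL pattern, and CSS selector below are placeholder assumptions to illustrate the approach, not the exact code I ran:

import os
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://marginalrevolution.com/marginalrevolution'
os.makedirs('posts', exist_ok=True)

# Step 1: walk the monthly archive pages and collect the link to every post.
post_links = []
for year in range(2003, 2017):
    for month in range(1, 13):
        archive = requests.get('{}/{}/{:02d}'.format(BASE_URL, year, month))
        soup = BeautifulSoup(archive.text, 'html.parser')
        # Placeholder selector -- the real one depends on the archive page markup
        for link in soup.select('h2.entry-title a'):
            post_links.append(link['href'])
        time.sleep(10)  # generous delay between requests

# Step 2: fetch each post and save the raw HTML to disk.
for url in post_links:
    page = requests.get(url)
    filename = url.rstrip('/').split('/')[-1]
    with open(os.path.join('posts', filename), 'w') as f:
        f.write(page.text)
    time.sleep(10)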

The final step was to extract the relevant information from each HTML file and clean the data. I did this with the Python BeautifulSoup library to parse the HTML, then used pandas for further data cleaning and feature generation. The final result was a nice CSV file:
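Roughly, the extraction step looked like this. The selectors and column names here are placeholders that would need to be matched to the blog’s actual post template:

import os
import pandas as pd
from bs4 import BeautifulSoup

rows = []
for filename in os.listdir('posts'):
    with open(os.path.join('posts', filename)) as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    # Placeholder selectors -- adjust to match the blog's actual post template
    rows.append({
        'title': soup.select_one('h1.entry-title').get_text(strip=True),
        'author': soup.select_one('a[rel="author"]').get_text(strip=True),
        'date': soup.select_one('time.entry-date')['datetime'],
        'num_comments': len(soup.select('li.comment')),
        'link': soup.select_one('link[rel="canonical"]')['href'],
    })

df = pd.DataFrame(rows)
df['date'] = pd.to_datetime(df['date'])
df.to_csv('mr_posts.csv', index=False)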

My scraper had a generous delay between requests so that I didn’t create a burden on the website. As you would expect, it took a very long time to gather all the posts – I ran it slowly over a period of about three weeks.

Initial Analysis

Oftentimes when reading Marginal Revolution, I would want to search in ways that the built-in search feature wouldn’t allow. For example, I know that Marginal Revolution has had a few guest posts over the years, but they are difficult to find with the search feature because of the sheer volume of posts. Also, many guest posters are mentioned by name in the regular daily posts by Tyler and Alex, further complicating the search.

With all the posts scraped, it was easy to figure out everyone who has posted on the site and how many posts they’ve written:
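With the CSV loaded into pandas, it’s a one-liner (the column names follow the hypothetical extraction sketch above):

import pandas as pd

df = pd.read_csv('mr_posts.csv')

# Number of posts per author
print(df['author'].value_counts())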

Obviously it’s totally dominated by Tyler Cowen and Alex Tabarrok, as any reader of the blog would expect, but the plot reveals some interesting authors that I had no idea posted on Marginal Revolution.

Now, say I want to look at all the posts by Tim Harford. It’s just a simple filter operation to get all the links and check them out (15 of them):
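Assuming a link column was saved during extraction (as in the sketch above), the filter looks something like this:

# Pull the saved links for a single author
harford_links = df.loc[df['author'] == 'Tim Harford', 'link']
for url in harford_links:
    print(url)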

http://marginalrevolution.com/marginalrevolution/2005/07/dear_economist_-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/using_cartoons_.html
http://marginalrevolution.com/marginalrevolution/2005/07/marginal_revolu-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/we_shall_see_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/markets_in_ever_6-2.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_2.html
http://marginalrevolution.com/marginalrevolution/2005/07/john_kay_on_cli.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_1.html
http://marginalrevolution.com/marginalrevolution/2005/07/choosing_whethe.html
http://marginalrevolution.com/marginalrevolution/2005/07/what_is_the_rig-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/a_critic_on_cri.html
http://marginalrevolution.com/marginalrevolution/2005/07/red_tape_and_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/should_londoner.html
http://marginalrevolution.com/marginalrevolution/2005/07/risky_business_1.html

I also looked at the amount of discussion generated by each author:
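This is a simple groupby on the same DataFrame (again assuming the hypothetical num_comments column from the extraction sketch):

# Post count plus average and total comments per author
comment_stats = df.groupby('author')['num_comments'].agg(['count', 'mean', 'sum'])
print(comment_stats.sort_values('mean', ascending=False))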

Note that some of these authors only posted once or twice, which would skew their results. Also, some posted in the blog’s early years, when there appear to be few comments (e.g. Tim Harford in 2005). It’s interesting to see that Alex’s posts on average seem to generate slightly more comments. Of course, the total amount of discussion / engagement is way higher for Tyler, given that he posts about 5 times as much as Alex.

Another easy thing to do is examine posting times to get an idea of each author’s blogging habits. Each blog post includes the time of publication down to the minute.
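Extracting the hour of publication and plotting a histogram per author is straightforward; here’s a sketch, again using the assumed column names:

import matplotlib.pyplot as plt

# Distribution of posting hour for the two main authors
df['hour'] = pd.to_datetime(df['date']).dt.hour
for author in ['Tyler Cowen', 'Alex Tabarrok']:
    df.loc[df['author'] == author, 'hour'].plot.hist(bins=24, alpha=0.5, label=author)
plt.xlabel('Hour of day')
plt.legend()
plt.show()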

Looking at the time of the post reveals some clear patterns. Tyler Cowen is most likely to post in the morning, around 7 am, although he is also likely to post in the early afternoon. 

Alex Tabarrok clearly has a much more rigid blogging schedule. Almost all of his posts are published around 7 am.

You can also get an idea of the writing techniques and habits of the blog authors. I’m barely scratching the surface of what’s possible here, but as a start I simply looked at the number of characters in each headline. The headline is the most important part of a blog post, since it determines whether the reader will continue reading.
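A sketch of the headline-length calculation, using the same assumed DataFrame:

# Headline length in characters
df['headline_length'] = df['title'].str.len()

# Average headline length by author
print(df.groupby('author')['headline_length'].mean().sort_values(ascending=False))

# The ten longest headlines
print(df.nlargest(10, 'headline_length')[['title', 'headline_length']])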

The longest headline in Marginal Revolution is 117 characters: “The Icelandic Stock Exchange fell by 76% in early trading as it re-opened after closing for two days last week.” The table below shows the average headline length for each of the blog authors. Tyler tends to use longer headlines than Alex.

Interestingly, when I read through the top 10 longest headlines, I noticed one called “Browse every book hyperlink ever posted on Marginal Revolution (is this the second best website ever?)”. Clearly I’m not the first person to have scraped Marginal Revolution!

My goal now is to figure out what to do with this data to make the 3rd best website ever…

Addendum: In the comments, the creator of Marginal Revolution Books points to the GitHub repository for his website.

Setting up Email Updates for your Scraper using Python and a Gmail Account

Very often when building web scrapers (and lots of other scripts), you’ll run into one of these situations:

  • You want to send the program’s results to someone else
  • You’re running the script on a remote server and you want automatic, real-time reports on results (e.g. updates on price information from an online retailer, an update indicating a competing company has made changes to their job openings site)

One easy and effective solution is to have your web scraping scripts automatically email their results to you (or anyone else that’s interested).

It turns out this is extremely easy to do in Python. All you need is a Gmail account and you can piggyback on Google’s Simple Mail Transfer Protocol (SMTP) servers. I’ve found this technique really useful, especially for a recent project I created to send myself and my family monthly financial updates from a program that does some customized calculations on our Mint account data.

The first step is importing the built-in Python packages that will do most of the work for us:

import smtplib
from email.mime.text import MIMEText

smtplib is the built-in Python SMTP protocol client that allows us to connect to our email account and send mail via SMTP.

The MIMEText class is used to define the contents of the email. MIME (Multipurpose Internet Mail Extensions) is a standard for formatting files to be sent over the internet so they can be viewed in a browser or email application. It’s been around for ages and basically allows you to send things other than plain ASCII text over email, such as audio, video, and images. The example below is for sending an email that contains HTML.

Here is example code to build your MIME email:

sender = 'your_email@email.com'
receivers = ['recipient1@recipient.com', 'recipient2@recipient.com']
body_of_email = 'String of html to display in the email'
msg = MIMEText(body_of_email, 'html')
msg['Subject'] = 'Subject line goes here'
msg['From'] = sender
msg['To'] = ','.join(receivers)

The MIMEText object takes in the email message as a string and also specifies that the message has an html “subtype”. See this site for a useful list of MIME media types and the corresponding subtypes. Check out the Python email.mime docs for other classes available to send other types of MIME messages (e.g. MIMEAudio, MIMEImage).

Next, we connect to the Gmail SMTP server with host ‘smtp.gmail.com’ and port 465, log in with your Gmail account credentials, and send it off:

s = smtplib.SMTP_SSL(host='smtp.gmail.com', port=465)
s.login(user='your_username', password='your_password')
s.sendmail(sender, receivers, msg.as_string())
s.quit()

Heads up: notice that the list of email recipients needs to be expressed as a single comma-separated string in the assignment to msg['To'], but as a Python list when passed to s.sendmail(sender, receivers, msg.as_string()). (For quite a while, I was banging my head against the wall trying to figure out why the message was only sending to the first recipient or not sending at all, and this was the source of the error. I finally came across this StackExchange post, which solved the problem.)

As a last step, you need to change your Gmail account settings to allow access to “less secure apps” so your Python script can access your account and send emails from it (see instructions here). A scraper running on your computer or another machine is considered “less secure” because your application is considered a third party and it is sending your credentials directly to Gmail to gain access. Instead, third party applications should be using an authorization mechanism like OAuth to gain access to aspects of your account (see discussion here).

Of course, you don’t have to worry about your own application accessing your account, since you know it isn’t acting maliciously. However, if other untrusted applications can do this, they may store your login credentials without telling you or do other nasty things. So, allowing access from less secure apps makes your Gmail account a little less secure.

If you’re not comfortable turning on access to less secure apps on your personal Gmail account, one option is to create a second Gmail account solely for the purpose of sending emails from your applications. That way, if that account is compromised for some reason due to less secure app access being turned on, the attacker would only be able to see sent mail from the scraper.