Let’s Scrape a Blog! (Part 1)

One thing I’ve been considering lately is what kind of intelligence you could gain from scraping a blog and analyzing the data. To test this out, this is the first in a series of posts where I’ll scrape a blog and try to squeeze out every last bit of useful or interesting intelligence I possibly can.

I’ll start off simple, but down the road I plan to use more advanced techniques in machine learning and natural language processing techniques to see what additional information these tools can uncover. I’m keeping all my analysis on a Jupyter notebook you can find on Github here.

The target site I’ll be using for my analysis is my all-time favourite blog: Marginal Revolution. I have been following this blog pretty much daily since 2005 when I started my undergraduate degree. It’s run by the economists Tyler Cowen and Alex Tabarrok, who are personal heroes of mine.

Why scrape a blog?

For me, scraping Marginal Revolution was just something I did for kicks. Since I’m so interested in the content of the blog, I want to be able to do very customized searches of blog posts that would not be possible through the blog’s built-in search feature.

But there are reasons other than “just for fun” that you might want to scrape a blog. For example, maybe the blog is a competitor or in an industry you’re researching. Maybe you want to find out:

  • Roughly how many people read / comment on the blog
  • Blogging strategy in terms of number, type, and timing of posts
  • Which types of posts produce the most discussion / comments / controversy
  • What notable people read the blog (i.e. seeing if they comment in the comments section)
  • Analyzing trends over time to determine if things have changed

…and I’m sure there are more possibilities.

Very brief overview of how the scraper works

My goal with the scraper was to get each individual post from the Marginal Revolution website. Marginal Revolution was fairly easy to scrape since the list of posts by month provided a predictable URL structure that made it possible to gather the links for each individual post across the entire website. With the full list of links, it was then simply a matter of making a request to each of these URLs and saving the resulting blog post HTML to disk. The scraper ultimately gathered 23,342 posts.

The final step was to extract the information of relevance through each HTML file and conduct data cleaning. I did this with the python BeautifulSoup library to parse the html and then pandas to do some further data cleaning and feature generation. The final result was a nice csv file:

My scraper had a generous delay between requests so I didn’t create a burden on the website. As you would expect, the scraper took a very long time to run to get all the posts – I ran it slowly over a period of about 3 weeks.

Initial Analysis

Often times when reading Marginal Revolution, I would want to search in ways that the built-in search feature wouldn’t allow. For example, I know that Marginal Revolution has had a few guest posts over the years, but they are difficult to find with the search feature because of the sheer volume of posts. Also, many people guest posting are often mentioned in the regular daily posts by Tyler and Alex, further complicating the search.

With all the posts scraped, figuring out who has all posted on the site and how many posts they’ve done was easy:

Obviously it’s totally dominated by Tyler Cowen and Alex Tabarrok, as any reader of the blog would expect, but the plot reveals some interesting authors that I had no idea posted on Marginal Revolution.

Now, say I want to look at all the posts by Tim Harford. It’s just a simple filter operation to get all the links and check them out (15 of them):

http://marginalrevolution.com/marginalrevolution/2005/07/dear_economist_-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/using_cartoons_.html
http://marginalrevolution.com/marginalrevolution/2005/07/marginal_revolu-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/we_shall_see_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/markets_in_ever_6-2.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_2.html
http://marginalrevolution.com/marginalrevolution/2005/07/john_kay_on_cli.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_1.html
http://marginalrevolution.com/marginalrevolution/2005/07/choosing_whethe.html
http://marginalrevolution.com/marginalrevolution/2005/07/what_is_the_rig-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/a_critic_on_cri.html
http://marginalrevolution.com/marginalrevolution/2005/07/red_tape_and_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/should_londoner.html
http://marginalrevolution.com/marginalrevolution/2005/07/risky_business_1.html

I also looked at the amount of discussion generated by each author:

Note that some of these authors only posted once or twice which would skew their results. Also, some posted in the blog’s early years where there appear to be few comments (e.g. Tim Harford in 2005). Interesting to see that Alex’s posts on average seem to generate slightly more comments. Of course, the total amount of discussion / engagement is way higher for Tyler, given that he posts about 5 times as much as Alex.

Another easy thing to do is examine the time of post to get an idea of the blogging habits of each of the authors. Each blog post includes the time of publication down to the minute. 

Looking at the time of the post reveals some clear patterns. Tyler Cowen is most likely to post in the morning, around 7 am, although he is also likely to post in the early afternoon. 

Alex Tabarrok clearly has a much more rigid blogging schedule. Almost all of his posts are published around 7 am.

You can also get an idea of the writing techniques and writing habits of the blog authors. I’m barely scratching the surface of what’s possible here, but as a start I simply looked at the number of characters in the headline. The headline is the most important part of a blog post as it determines whether the reader will continue to read.

The longest headline in Marginal Revolution is 117 characters: The Icelandic Stock Exchange fell by 76% in early trading as it re-opened after closing for two days last week.” The table below shows the different average headline length for each of the blog authors. Tyler tends to use longer headlines than Alex.

Interestingly, when I read through the top 10 longest headlines, I noticed one called: “Browse every book hyperlink ever posted on Marginal Revolution (is this the second best website ever?)Clearly I’m not the first person to have scraped Marginal Revolution!

My goal now is to figure out what to do with this data to make the 3rd best website ever…

Addendum: In the comments, the creator of Marginal Revolution Books points to the github repository for his website.

Adventures in Open Data Hacking – Winnipeg Tree Data!

As a data guy, I’m pretty excited to see my city making a strong commitment to open data. The other day, I was sifting through some of this stuff to see what I could play with and what kind of interesting data mash-ups I could create with it. Very soon, my prayers were answered, with the Winnipeg tree inventory – yes, that’s right, the City of Winnipeg keeps a detailed database of all 300,000+ public trees located in the City, including botanical name, common name, tree size (diameter), and precise GPS coordinates!

After chuckling to myself in amusement for a while about how awesome this is, I dug into the data. Read on to see the results. 

You can find the code I used to create the visualizations here. For those curious, I used Python, along with a couple of amazing geographic mapping packages: GeoPandas (incorporates geographic data types into the Pandas package) and Folium (allows you to write Leaflet.js maps using Python code).

Most common trees

Turns out that, by far, the most common trees in the city are Green Ash and American Elm. These two types represent almost half of the trees in the database. Check out the plot below to see the top 10 tree types in the City.

Rarest trees

As shown in the previous plot, there are a few types of trees that totally dominate. Looking at the other end of the spectrum, there are quite a few tree types that are extremely rare, with only one or two of them found across the entire city. The map below shows the location of the 50 rarest trees (click on the tree to see the tree type). A valuable resource for all you rare tree hunters.

Biggest trees

The map below shows the 50 biggest trees in Winnipeg (by diameter). The size of each of the green dots represents the size of the tree. As you can see in the map, apparently there is a monster American Elm in Transcona that I may have to check out next time I’m the area.

Neighbourhoods with the most trees

Next, I thought I would mash-up the tree data with neighbourhood boundary data also available from the City of Winnipeg website to see what the tree situation is like for each neighbourhood.

Looking at the total number of trees, Pulberry is in the lead, followed by Kildonan Park, River Park South, and Linden Woods.

Here are the 10 neighbourhoods with the fewest trees.

I also put together a choropleth map to visualize at a glance the total number of trees in each neighbourhood.

These measurements aren’t totally fair, since bigger neighbourhoods will naturally have more trees. A better measure of how tree-filled a neighbourhood is the tree density, or the number of trees per square kilometres. The map below shows the tree density of all the neighbourhoods in the city.

Looking at the top ten neighbourhoods in terms of tree density, many are (not surprisingly) parks (Kildonan Park is the most densely treed, by far).

The plot below shows the 10 neighbourhoods with the lowest tree densities. Interestingly, Assiniboine Park is near the bottom. The only explanation I can think of is that the trees there are privately owned (the data only includes public trees).

Other ideas?

There are a lot of other ways you could slice, dice, and mash-up the data with other sources to get more interesting results. Here’s a few things I thought of:

  • Zooming in on one neighbourhood and show a dot map or heat map of all the trees to show the distribution.
  • Looking at  percentage of the trees that come from parks and filtering out park trees.
  • Developing a “tree index” for each neighbourhood (like something you would see on a real estate ad to describe the neighbourhood).
  • Examining the most common types of trees in each neighbourhood.

Comment below if you have any suggestions / requests!