How to Find Underrated People on Twitter with TURI (Twitter Underrated Index)

One of the best skills that you can develop is the ability to find talented people before anyone else does.

This great advice from Tyler Cowen (economist and blogger at Marginal Revolution) got me thinking: What are some strategies for finding talented but underrated people?

One possible source is Twitter. For a long time, I didn’t “get” Twitter, but after following Michael Nielsen’s advice I’m officially a convert. The key is carefully selecting the list of people you follow. If you do this well, your Twitter feed becomes a constant stream of valuable information and interesting people.

If you look carefully, you can find a lot of highly underrated people on Twitter, i.e. incredibly smart people that put out valuable and interesting content, but have a smaller following that you would expect.

These are the kinds of people that are the best to follow: you get access to insights that a lot of other people are not getting (since not many people are following them), and they are more likely to respond to queries or engage in discussion (since they have a smaller following to manage).

One option for finding these people is trial and error, but I wanted to see if it’s possible to quantify how underrated people are on Twitter and automate the process for finding good people to follow.

I call this the TURI (Twitter Underrated Index), because hey, it needs a name and acronyms make things sound so official.

Components of TURI

The index has three main components: Growth, Influence of Followers, and the Number of Followers.

Growth (G): The number of followers a user has per unit of content they have published (i.e. per tweet).

A user that is growing their Twitter following quickly suggests that they are underrated. It implies they are putting out quality content and people are starting to notice rapidly. The way I measure this is the number of followers a person acquires per Tweet.

Another possible measurement of growth is the number of followers the user has acquired per unit of time (i.e. number of followers divided by the length of time the Twitter account has existed). However, there are a couple of problems with this option:

  • Twitter accounts can be dormant for years. For example, someone might start an account but not tweet for 5 years and then put out great content. Measuring growth in terms of time would unfairly punish these people.
  • A person may have a strategy of Tweeting constantly. Some of the content results in followers, but the overall quality is still low. We are looking for people that publish great content, not necessarily people that put out a lot of content.

Influence of Followers (IF): The average number of the user’s follower’s followers.

In my opinion, the influence of a person’s following is the most important factor determining whether they are underrated on Twitter. Here’s a few reasons why:

  • Influential people are, on average, better judges of good content.
  • Influential people are more selective in who they decide to follow, especially if Twitter is an important part of their online “brand”.
  • Influential people tend to engage with or are in some way related to other high quality people in their offline personal lives, and these people are more likely to appear in their Twitter feed even if they are not widely known or appreciated yet.

I’m somewhat biased toward this measure because, from my own personal experience, it has worked out really well when I browse through people who are followed by influential people on my feed. If I see someone followed by Tyler Cowen, Alex Tabarrok, Russ Roberts, Patrick Collison, and Marc Andreessen, and yet they only have 5,000 followers, then I’m pretty confident that person is currently underrated.

After some consideration, I believe the best way to measure the influence of a user’s followers with the data available in the Twitter API is taking the average number of the user’s follower’s followers.

I mulled over the possibility of using the median rather than the average, but decided against it: If someone with 1 million followers follows someone with 50 followers, I want to know more about that person, even though their TURI is high only because of that one highly influential follower. Outliers are good – we’re looking for diamonds in the rough.

Total Number of Followers (NF): The total number of followers the user currently has.

Our very definition of “underrated” in this context is when a user does not have as many followers as you expect, so total number of followers is obviously going to play an important role in TURI.

So to summarize the main idea behind TURI: if a person has a large number of “influential” followers, is growing their number of followers quickly relative to the volume of content they put out, and they still have a small number of total followers, then they are likely underrated.

Defining the Index

For any user i, we can calculate their Twitter Underrated Index (TURI) as:

TURI_i = \frac{G_iIF_i}{NF_i}

Where G is growth, IF is influence of followers, and NF is the number of followers.

This formula has the general relationships we are looking for: high growth in users for each unit of content, highly influential followers, and a low number of total followers all push TURI upward.

We can simplify the equation by rewriting G = NF / T, where T is the total number of tweets for the user. Cancelling out some terms, this gives us our final version of the index:

TURI_i = \frac{IF_i}{T_i}

In other words, our index of how under-rated a person is on twitter is given by the average number of i’s followers followers per tweet of user i.

Before calculating TURI for a group of users, there are a couple of pre-processing steps you will likely want to take to improve the results:

  1. Filter out verified accounts. One of the shortcomings of TURI is that a user’s growth / trajectory (i.e. G) will be very high if they are already a celebrity. These people typically have a large number of followers per Tweet not because of the content they put out, but because they’ve attained fame already elsewhere. Luckily, Twitter has a feature called a “verified account”, which applies “if it is determined to be an account of public interest. Typically this includes accounts maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas”. This is a prime group to filter out, because we are not looking for well-known people.
  2. Filter users by number of followers: There are a few reasons why you might want to only calculate TURI for users that have a following within some range (e.g. between 500 and 1,000):
    • Although there may be situations where a person with 500,000 followers is underrated, but this seems unlikely to be the kind of person you’re looking for so not worth the API resources.  
    • Filtering by some upper follower threshold mitigates the risk of including celebrities without verified accounts.
    • You limit the number of calls you make to the API. The most costly operation in terms of API calls is figuring out the influence of followers. The more followers a person has, the more API calls required to calculate TURI.

Trying out TURI

To test out the index, I calculated its value on a subset of 49 people that Tyler Cowen follows who have 1,000 or fewer followers (Tyler blogs at my favourite blog, inspired this project, and has good taste in people to follow).

The graph below illustrates TURI for these users (not including 4 accounts that were not accessible due to privacy settings). 

As you can see, one user (@xgabaix) is a significant outlier. Looking a bit more closely at the profile, this is Xavier Gabaix, a well known economist at Harvard University. His TURI is so high because he has several very influential followers and he has not tweeted yet. 

So did TURI fail here? I don’t think so, since this is very likely someone to follow if he was actually Tweeting. However, it does seem a little strange to put someone at the top of the list that doesn’t actually have any Twitter content.

So, I filtered again for users that have published at least 20 tweets:

The following chart looks solely at the various user’s IF (Influence of their Followers). Interestingly, another user @shanagement has the most influential followers by far. However, they rank in third place for overall TURI since they tweet significantly more than @davidbrooks13 or @davidhgillen.

Limitations

Of course, TURI has some shortcomings:

  • Difficult to tell how well TURI works: The measures are based on intuition and there is obviously no “ground truth” data about how underrated twitter users actually are. So, we don’t really have a data-based method for seeing how well the index works and improving it systematically. For example, you might question is whether Growth G should be included in the index at all. I think there’s a good argument for it: if people get followers quickly per unit of content there must be something about that content that draws others. But, on the other hand, maybe they aren’t truly underrated. Maybe the truly underrated people have good content but your average Twitter user underestimates them even after reading a few posts. People don’t always know high quality when the see it.
  • It takes a fairly long time to calculate TURI: This is due to Twitter rate limits of API requests. For example, calculating TURI for 49 Twitter users above took about an hour. It would take even longer for people with larger followings (remember, I only focused on people with 1,000 or fewer followers). So, if you want to do a large batch of people, it’s probably a good idea run this on a server on an ongoing basis and store user and TURI information into a database. 

Other ideas?

There are many many different ways you could potentially specify an index like this. Leave a comment or reach out to me on Twitter or email if you have any suggestions.

One other possible tweak is accounting for the number of people that the user follows. I notice that some Twitter users seem to have a strategy of following a huge number of people in the hopes of being followed back. In these cases, their following strategy may be the cause of their high growth and not the quality of their content. One solution is to adjust the TURI by multiplying by (Number of User’s Followers) / (Number of People the User Follows). This would penalize people that, for example, have 15,000 followers but follow 15,000 people themselves.

Technical Details

You can find the code I used to interact with the API and calculate TURI here. The code uses the python-twitter package, which provides a nice way of interacting with the Twitter API, abstracting away annoying details so you don’t have to deal with them (e.g. authenticating with OAuth, dealing with rate limits).

Digging into Data Science Tools: Using Plotly’s Dash to Build Interactive Dashboards

Dash is a tool designed for building interactive dashboards and web applications using only Python (no CSS, HTML, or JavaScript required). I came across Dash while surveying options for building dashboards and reporting tools in my current position as Data Scientist with the City of Winnipeg Transportation Division.

Why use Dash?

Some of the lowest-hanging fruit working as a data science involves simply getting the data in front of the right people in a way that is easily digestible and actionable. No fancy ML, just presenting data to the right people at the right time in the right format. Dashboards are often a great way of doing that.

However, they can be a pain to build. Off-the-shelf dashboard tools like Tableau and PowerBI can be expensive and limiting (you are constrained to the features that they choose to include in their software). Developing dashboards from scratch as a web application is also a pain, since that often requires writing CSS, HTML, and JavaScript: as a data scientist, your focus is typically programming in python or R and your job typically does not require being an expert in JavaScript.

Dash strikes a perfect middle ground for a data scientist, providing way more flexibility and customizability than off-the-shelf tools at a much lower cost (i.e. free), while still having an easy-to-use API that a python-oriented data scientist should not have difficulty grasping.

A Demo Dash app of Blog Post Data from Marginal Revolution

To become familiar with Dash and see what it can do, I started out with a simple dashboard using some of the the scraped blog post data from Marginal Revolution (see this post on the MR scraper project). You can find the dashboard here, which provides an interactive chart of the number of posts on Marginal Revolution over time for each author. You can choose any of the authors in the drop down box, and the relevant data shows up in the chart:

I have to say, I can’t believe how easy it was to build this dashboard. In total, it’s only 33 lines of python code (you can find the code here. It was also easy to deploy to heroku, following this guide.

How Dash Works

I highly recommend going through the Dash tutorial here, which walks you through app creation and the main components of Dash. Here I’ll give a high level overview of some key points.

Dash apps have two components: a layout component which describes how the application looks visually, and an interactivity component which specifies how your dashboard interacts with the user and responds to input.

Layout Component

The layout defines what the dashboard looks like. When defining the layout of your application, you create a hierarchical “tree” of components, and each component will come from one of the following two main classes:

  • dash_html_components: This includes a component to represent every HTML tag. E.g. dash_html_components.H1(children = ‘Hello Dash’) adds <h1>Hello Dash</h1> to your dashboard.
  • dash_core_components: This describes higher-level components that combine HTML, JavaScript, and CSS created through the React.js library. These components have an interactive element to them (e.g. graphs with tooltips). One of the key classes within dash_core_components is Graph. Graph components represent data visualizations created using plotly.js (Plotly is the company that created Dash). Graph provides 35 different chart types to meet your needs. Another nice feature included within dash_core_components is the ability to include Markdown, which is often more convenient for presenting text based copy in your dashboards. There are many other options available to you in the dash_core_components library to use in your dashboard (e.g. sliders, checkboxes, date ranges, interactive tables, tabs, and more – see the gallery here).

To define the layout for a page and the data contained within it, you nest these various components as appropriate. You assign this nested list to app.layout, where app = dash.Dash(). To see an example of how these components fit together and how they are rendered in the browser, see the getting started page on the Dash tutorial.

Interactivity Component

Another key feature of Dash is its ability to respond to user input and change the visualization accordingly (dashboards aren’t really of much use without this feature). Dash makes this functionality quite easy, although there is a bit too much to go into great detail in this blog post. Read part 2 of the Dash user guide.

In short, Dash lets you add interactivity by adding an @app.callback decorator to a Python function you write which takes the input from one of the layout components and returns outputs to send to another layout component that you want to change according to the new input.

You can have multiple inputs and multiple outputs. For example, if you want a graph that gives the user the option of filtering by multiple variables in your dataset, you can use multiple input filters. The function that you apply the @app.callback decorator to automatically fires whenever there are any changes to any of the inputs.

Deploying to Production

Dash uses the python Flask web application framework under the hood. Flask is an awesome, lightweight web application framework for Python that is definitely worth knowing (and relatively easy to learn given how stripped down it is). You can access the underlying Flask app instance using app.server where app is the instance of your dash app (i.e. app = Dash()).

Since the Dash app is essentially a Flask app under the hood, deploying a Dash app is the same as deploying a Flask app, and there are many guides online for deploying flask apps in a variety of situations. The Dash website provides some more details about deployment as well as steps for deploying to Heroku.

Comment below if you have any questions or if you want to share your experience using Dash!

Further Resources

Dash User Guide and Documentation
Plotly.py Documentation and Gallery
Dash Community Forum

Let’s Scrape a Blog! (Part 1)

One thing I’ve been considering lately is what kind of intelligence you could gain from scraping a blog and analyzing the data. To test this out, this is the first in a series of posts where I’ll scrape a blog and try to squeeze out every last bit of useful or interesting intelligence I possibly can.

I’ll start off simple, but down the road I plan to use more advanced techniques in machine learning and natural language processing techniques to see what additional information these tools can uncover. I’m keeping all my analysis on a Jupyter notebook you can find on Github here.

The target site I’ll be using for my analysis is my all-time favourite blog: Marginal Revolution. I have been following this blog pretty much daily since 2005 when I started my undergraduate degree. It’s run by the economists Tyler Cowen and Alex Tabarrok, who are personal heroes of mine.

Why scrape a blog?

For me, scraping Marginal Revolution was just something I did for kicks. Since I’m so interested in the content of the blog, I want to be able to do very customized searches of blog posts that would not be possible through the blog’s built-in search feature.

But there are reasons other than “just for fun” that you might want to scrape a blog. For example, maybe the blog is a competitor or in an industry you’re researching. Maybe you want to find out:

  • Roughly how many people read / comment on the blog
  • Blogging strategy in terms of number, type, and timing of posts
  • Which types of posts produce the most discussion / comments / controversy
  • What notable people read the blog (i.e. seeing if they comment in the comments section)
  • Analyzing trends over time to determine if things have changed

…and I’m sure there are more possibilities.

Very brief overview of how the scraper works

My goal with the scraper was to get each individual post from the Marginal Revolution website. Marginal Revolution was fairly easy to scrape since the list of posts by month provided a predictable URL structure that made it possible to gather the links for each individual post across the entire website. With the full list of links, it was then simply a matter of making a request to each of these URLs and saving the resulting blog post HTML to disk. The scraper ultimately gathered 23,342 posts.

The final step was to extract the information of relevance through each HTML file and conduct data cleaning. I did this with the python BeautifulSoup library to parse the html and then pandas to do some further data cleaning and feature generation. The final result was a nice csv file:

My scraper had a generous delay between requests so I didn’t create a burden on the website. As you would expect, the scraper took a very long time to run to get all the posts – I ran it slowly over a period of about 3 weeks.

Initial Analysis

Often times when reading Marginal Revolution, I would want to search in ways that the built-in search feature wouldn’t allow. For example, I know that Marginal Revolution has had a few guest posts over the years, but they are difficult to find with the search feature because of the sheer volume of posts. Also, many people guest posting are often mentioned in the regular daily posts by Tyler and Alex, further complicating the search.

With all the posts scraped, figuring out who has all posted on the site and how many posts they’ve done was easy:

Obviously it’s totally dominated by Tyler Cowen and Alex Tabarrok, as any reader of the blog would expect, but the plot reveals some interesting authors that I had no idea posted on Marginal Revolution.

Now, say I want to look at all the posts by Tim Harford. It’s just a simple filter operation to get all the links and check them out (15 of them):

http://marginalrevolution.com/marginalrevolution/2005/07/dear_economist_-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/using_cartoons_.html
http://marginalrevolution.com/marginalrevolution/2005/07/marginal_revolu-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/we_shall_see_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/markets_in_ever_6-2.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_2.html
http://marginalrevolution.com/marginalrevolution/2005/07/john_kay_on_cli.html
http://marginalrevolution.com/marginalrevolution/2005/12/seasonal_advice_1.html
http://marginalrevolution.com/marginalrevolution/2005/07/choosing_whethe.html
http://marginalrevolution.com/marginalrevolution/2005/07/what_is_the_rig-2.html
http://marginalrevolution.com/marginalrevolution/2005/07/a_critic_on_cri.html
http://marginalrevolution.com/marginalrevolution/2005/07/red_tape_and_ho.html
http://marginalrevolution.com/marginalrevolution/2005/07/should_londoner.html
http://marginalrevolution.com/marginalrevolution/2005/07/risky_business_1.html

I also looked at the amount of discussion generated by each author:

Note that some of these authors only posted once or twice which would skew their results. Also, some posted in the blog’s early years where there appear to be few comments (e.g. Tim Harford in 2005). Interesting to see that Alex’s posts on average seem to generate slightly more comments. Of course, the total amount of discussion / engagement is way higher for Tyler, given that he posts about 5 times as much as Alex.

Another easy thing to do is examine the time of post to get an idea of the blogging habits of each of the authors. Each blog post includes the time of publication down to the minute. 

Looking at the time of the post reveals some clear patterns. Tyler Cowen is most likely to post in the morning, around 7 am, although he is also likely to post in the early afternoon. 

Alex Tabarrok clearly has a much more rigid blogging schedule. Almost all of his posts are published around 7 am.

You can also get an idea of the writing techniques and writing habits of the blog authors. I’m barely scratching the surface of what’s possible here, but as a start I simply looked at the number of characters in the headline. The headline is the most important part of a blog post as it determines whether the reader will continue to read.

The longest headline in Marginal Revolution is 117 characters: The Icelandic Stock Exchange fell by 76% in early trading as it re-opened after closing for two days last week.” The table below shows the different average headline length for each of the blog authors. Tyler tends to use longer headlines than Alex.

Interestingly, when I read through the top 10 longest headlines, I noticed one called: “Browse every book hyperlink ever posted on Marginal Revolution (is this the second best website ever?)Clearly I’m not the first person to have scraped Marginal Revolution!

My goal now is to figure out what to do with this data to make the 3rd best website ever…

Addendum: In the comments, the creator of Marginal Revolution Books points to the github repository for his website.

Adventures in Open Data Hacking – Winnipeg Tree Data!

As a data guy, I’m pretty excited to see my city making a strong commitment to open data. The other day, I was sifting through some of this stuff to see what I could play with and what kind of interesting data mash-ups I could create with it. Very soon, my prayers were answered, with the Winnipeg tree inventory – yes, that’s right, the City of Winnipeg keeps a detailed database of all 300,000+ public trees located in the City, including botanical name, common name, tree size (diameter), and precise GPS coordinates!

After chuckling to myself in amusement for a while about how awesome this is, I dug into the data. Read on to see the results. 

You can find the code I used to create the visualizations here. For those curious, I used Python, along with a couple of amazing geographic mapping packages: GeoPandas (incorporates geographic data types into the Pandas package) and Folium (allows you to write Leaflet.js maps using Python code).

Most common trees

Turns out that, by far, the most common trees in the city are Green Ash and American Elm. These two types represent almost half of the trees in the database. Check out the plot below to see the top 10 tree types in the City.

Rarest trees

As shown in the previous plot, there are a few types of trees that totally dominate. Looking at the other end of the spectrum, there are quite a few tree types that are extremely rare, with only one or two of them found across the entire city. The map below shows the location of the 50 rarest trees (click on the tree to see the tree type). A valuable resource for all you rare tree hunters.

Biggest trees

The map below shows the 50 biggest trees in Winnipeg (by diameter). The size of each of the green dots represents the size of the tree. As you can see in the map, apparently there is a monster American Elm in Transcona that I may have to check out next time I’m the area.

Neighbourhoods with the most trees

Next, I thought I would mash-up the tree data with neighbourhood boundary data also available from the City of Winnipeg website to see what the tree situation is like for each neighbourhood.

Looking at the total number of trees, Pulberry is in the lead, followed by Kildonan Park, River Park South, and Linden Woods.

Here are the 10 neighbourhoods with the fewest trees.

I also put together a choropleth map to visualize at a glance the total number of trees in each neighbourhood.

These measurements aren’t totally fair, since bigger neighbourhoods will naturally have more trees. A better measure of how tree-filled a neighbourhood is the tree density, or the number of trees per square kilometres. The map below shows the tree density of all the neighbourhoods in the city.

Looking at the top ten neighbourhoods in terms of tree density, many are (not surprisingly) parks (Kildonan Park is the most densely treed, by far).

The plot below shows the 10 neighbourhoods with the lowest tree densities. Interestingly, Assiniboine Park is near the bottom. The only explanation I can think of is that the trees there are privately owned (the data only includes public trees).

Other ideas?

There are a lot of other ways you could slice, dice, and mash-up the data with other sources to get more interesting results. Here’s a few things I thought of:

  • Zooming in on one neighbourhood and show a dot map or heat map of all the trees to show the distribution.
  • Looking at  percentage of the trees that come from parks and filtering out park trees.
  • Developing a “tree index” for each neighbourhood (like something you would see on a real estate ad to describe the neighbourhood).
  • Examining the most common types of trees in each neighbourhood.

Comment below if you have any suggestions / requests!

How to Build an Automated, Large-Scale Fax Survey Campaign using Python, docx-mailmerge, and Phaxio

You can find the full code for this project here:
https://github.com/marknagelberg/fax-survey-with-phaxio

For most people, fax is an antiquated, old-school technology that has basically no relevance to their lives at all. Most people nowadays probably don’t even know how to use a fax machine and would be annoyed if they were forced to.

But fax is by no means dead: a 2017 poll by Spiceworks suggests that approximately 89% of companies  still use fax machines, including 62% that still use physical fax devices. Fax is still very popular in particular industries, such as medicine. If you need to reach these businesses, fax is something to consider.

One of the main problems with physical fax machines is they are labour intensive. This can become an issue in scenarios where you need to send a lot of faxes or you need to send them in some automated way.

For example, suppose you’re conducting a survey of businesses and the only contact information is a fax number. Let’s say you have 5,000 people in your sample and each survey has unique information for each respondent (for example, their name or a unique ID value to track responses).

What do you do? At first glance, it looks like the only option is to print out each survey for each person on the sample, and then feed each survey by hand into the fax machine. If each survey takes a minute to fax, getting these out would mean 83 hours of work, or 2 full work weeks for a single person, plus printing costs (not to mention low morale). Not good!

Luckily, if you have some programming skills, there is a much better alternative: programmable fax APIs. I’m going to take you through an example of using one of these services called Phaxio to send out faxes for a hypothetical survey campaign. Here’s the scenario:

  • You have a sample of potential respondents in a CSV file that you want to fax information to which includes their fax number, name, address, and ID.
  • You have a template in a word document of the survey that you want to send. The survey needs to be slightly different for each person in the sample, including their particular name, address, and ID.

Solving this problem involves two main steps: 1) creating the documents to fax and 2) faxing them out.

Creating the documents to fax (using docx-mailmerge in Python)

Let’s say you have two people in the sample like this (it’s straightforward to generalize this to 500 or 5000 or higher sample sizes):

Picture1

And suppose the template of the survey you want to fax looks something like this:

Picture1

To get the documents ready to fax in Phaxio, we need to create separate documents for each person in the sample, each with the appropriate variable information (i.e. ID, Address, and Name) filled in.

As a first step, you need to open up your fax template document in Microsoft Word and add the appropriate mail merge fields to the document. This actually was not as straightforward as I expected, but this guide from Practical Business Python blog was helpful. First you want to select the “Insert Field” button which can be found within the “Insert” tab:

Picture1

Then, you’ll be taken to a pop-up where you’ll want to select the category “Mail Merge” and the field name “MergeField”, and write the desired name of your field. Repeat this step for all fields you need (in our case, three times: once for Name, Address, and ID).

Picture1

This produces the mergeable fields that you can copy and paste throughout the document as needed. You can tell it worked when there is a greyed out box that appears when you click your cursor over the field:

Picture1

Now that your Word document template is prepped, you are ready to do the mail merge to create a unique document for each person in the sample. Since we need to separate each merged document into a separate file, this cannot be done in Microsoft Word (the standard mail merge in Word combines all the documents into one big file).

To do this, there is a handy little library called docx-mailmerge built exactly for the purpose of doing custom mail merges in Microsoft Word with Python.  To install the library, type:

$ pip install docx-mailmerge

Using docx-mailmerge, loading the csv sample values into the Word document is simply a matter of calling the merge function, which contains each merge field in the Word document as an argument and allows you to assign these fields the string values of sample information you want to fill in:

from mailmerge import MailMerge
import pandas as pd
import os

def merge_documents(sample_file, template_file, folder):
    df = pd.read_csv(sample_file)

    for index, row in df.iterrows():
        with MailMerge(template_file) as document:
            document.merge(Name = str(row['Name']),
                       Address = str(row['Address']),ID = str(row['ID']))
            document.write(os.path.join(folder,&amp;nbsp;
                       str(row['Fax Number']) + '.docx'))

This produces a folder full of all the documents that you want to fax with the variable information filled and the fax number included as the filename.

Faxing the documents out with Phaxio

To get started with Phaxio, you first need to create an account (https://www.phaxio.com/). Once you’ve done that, install Phaxio’s Python client library, which makes it easy to interact with its API:

$ pip install phaxio

The code below for the send_fax function takes in your personal Phaxio key and secret (provided by Phaxio when you create your account), the fax number, the file to fax. It sets a time delay of 1.5 seconds in between faxes (the Phaxio rate limit is 1 request per second).

from phaxio import PhaxioApi
import time

def send_fax(key, secret, fax_number, filename):
    time.sleep(1.5)
    api = PhaxioApi(key, secret)
    response = api.Fax.send(to = fax_number,
                             files = filename)

With this function, sending out your faxes is just a matter of iterating through the Word files you created in the mail merge and passing them to send_fax (along with the fax_number, which appears in the filename).

To Come: Using the Twilio Programmable Fax API

Although Phaxio is a fantastic tool, it is considerably more expensive than the fax API provided by Twilio (https://www.twilio.com/fax): the Phaxio API costs 7 cents per page, while the Twilio fax API only costs 1 cent per page. Although this seems like a minor difference given the low numbers involved, this really adds up if you are sending a lot of faxes. For example, say you are sending out a survey via fax to 5,000 potential respondents and the survey is 5 pages long. Using Phaxio, this would cost $1,750, while Twilio would cost only $250. The more documents you fax, the wider this price gap grows.

However, you do pay a price for using Twilio due to added complexity: Twilio does not let you pass in binary files into it’s API – you must pass in a URL to the file you want to fax and cannot just point to the documents on your local disk. This would be a great feature for Twilio to have (take note, Twilio!), but since it doesn’t yet exist, you have to configure a server that provides the documents you want to fax to Twilio. Doing this securely is by no means a straightforward task, but I’ll cover this in a future post…Stay tuned!

Setting up Email Updates for your Scraper using Python and a Gmail Account

Very often when building web scrapers (and lots of other scripts), you’ll run into one of these situations:

  • You want to send the program’s results to someone else
  • You’re running the script on a remote server and you want automatic, real-time reports on results (e.g. updates on price information from an online retailer, an update indicating a competing company has made changes to their job openings site)

One easy and effective solution is to have your web scraping scripts automatically email their results to you (or anyone else that’s interested).

It turns out this is extremely easy to do in Python. All you need is a Gmail account and you can piggyback on Google’s Simple Mail Transfer Protocol (SMTP) servers. I’ve found this technique really useful, especially for a recent project I created to send myself and my family monthly financial updates from a program that does some customized calculations on our Mint account data.

The first step is importing the built-in Python packages that will do most of the work for us:

import smtplib
from email.mime.text import MIMEText

smtplib is the built-in Python SMTP protocol client that allows us to connect to our email account and send mail via SMTP.

The MIMEText class is used to define the contents of the email. MIME (Multipurpose Internet Mail Extensions) is a standard for formatting files to be sent over the internet so they can be viewed in a browser or email application. It’s been around for ages and it basically allows you to send stuff other than ASCII text over email, such as audio, video, images, and other good stuff. The example below is for sending an email that contains HTML.

Here is example code to build your MIME email:

sender = 'your_email@email.com'
receivers = ['recipient1@recipient.com', 'recipient2@recipient.com']
body_of_email = 'String of html to display in the email'
msg = MIMEText(body_of_email, 'html')
msg['Subject'] = 'Subject line goes here'
msg['From'] = sender
msg['To'] = ','.join(receivers)

The MIMEText object takes in the email message as a string and also specifies that the message has an html “subtype”. See this site for a useful list of MIME media types and the corresponding subtypes. Check out the Python email.mime docs for other classes available to send other types of MIME messages (e.g. MIMEAudio, MIMEImage).

Next, we connect to the Gmail SMTP server with host ‘smtp.gmail.com’ and port 465, login with your Gmail account credentials, and send it off:

s = smtplib.SMTP_SSL(host = 'smtp.gmail.com', port = 465)
s.login(user = 'your_username', password = ‘your_password')
s.sendmail(sender, receivers, msg.as_string())
s.quit()

Heads up: notice that the list of email recipients needs to be expressed as a string in the assignment to msg[‘From’] (with each email separated by a comma), and expressed as a Python list when specified in smtplib object s.sendmail(sender, receivers, msg.as_string(). (For quite a while, I was banging my head against the wall trying to figure out why the message was only sending to the first recipient or not sending at all, and this was the source of the error. I finally came across this StackExchange post which solved the problem.)

As a last step, you need to change your Gmail account settings to allow access to “less secure apps” so your Python script can access your account and send emails from it (see instructions here). A scraper running on your computer or another machine is considered “less secure” because your application is considered a third party and it is sending your credentials directly to Gmail to gain access. Instead, third party applications should be using an authorization mechanism like OAuth to gain access to aspects of your account (see discussion here).

Of course, you don’t have to worry about your own application accessing your account since you know it isn’t acting maliciously. However, if other untrusted applications can do this, they may store your login credentials without telling you or doing other nasty things.  So, allowing access from less secure apps makes your Gmail account a little less secure.

If you’re not comfortable turning on access to less secure apps on your personal Gmail account, one option is to create a second Gmail account solely for the purpose of sending emails from your applications. That way, if that account is compromised for some reason due to less secure app access being turned on, the attacker would only be able to see sent mail from the scraper.