Syncing your Jupyter Notebook Charts to Microsoft Word Reports

Here’s the situation: You’re doing a big data analysis in your Jupyter Notebook. You’ve got tons of charts and you want to report on them. Ideally, you’d create your final report in the Jupyter notebook itself, with all its fancy markdown features and the ability to keep your code and reporting all in the same place. But here’s the rub: most people still want Word document reports, and don’t care about your code, reproducibility, etc. When reporting it’s important to give people the information in a format most useful to them.

So you’ve got tons of charts and graphs that you want to put in the Word report – how do you keep the two in sync? What if your charts change slightly (e.g. changing the styling of every chart in your report)? You’re stuck copying and pasting charts from your notebook, which is a manual, time-consuming, and error-prone process.

In this post, I’ll show you my solution to this problem. It involves the following steps:

  • Saving the chart images from Jupyter Notebook to your desktop in code.
  • Preparing your Word Document report, referencing the image names that you saved in your desktop in the appropriate location in your report.
  • Loading the images into a new version of your Word Document.

Saving Chart Images from Jupyter Notebook to your Desktop

The first step is to gather the images you want to load into your report by saving them from Jupyter Notebook to image files on your hard drive. For this post, I’ll be using the data and analysis produced in my “Lets Scrape A Blog” post from a few months ago, where I scraped my favourite blog Marginal Revolution and did some simple analyses.

That analysis used matplotlib to produce some simple charts of results. To save these images to your desktop, matplotlib provides a useful function called savefig. For example, one chart produced in that analysis looks at the number of blog posts by author:

The following code produces this chart and saves it to a file named ‘num_posts_by_author_plot.png’ in the folder ‘report_images’.
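A minimal sketch of what this can look like, assuming made-up post counts in place of the scraped data (the author labels and numbers below are hypothetical, and the Agg backend is only selected so the snippet also runs outside a notebook):

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical post counts; the real analysis computes these from the scraped data.
posts_by_author = {"Author A": 120, "Author B": 45, "Author C": 12}

# Create the folder that holds all report images if it does not exist yet.
os.makedirs("report_images", exist_ok=True)

fig, ax = plt.subplots()
ax.bar(list(posts_by_author.keys()), list(posts_by_author.values()))
ax.set_title("Number of Posts by Author")
ax.set_ylabel("Number of posts")

# The file name (before '.png') is the name the Word template will refer to.
num_posts_by_author_plot = os.path.join("report_images", "num_posts_by_author_plot.png")
fig.savefig(num_posts_by_author_plot, bbox_inches="tight")
plt.close(fig)
```

The bbox_inches="tight" argument is optional; it trims the extra whitespace around the chart, which tends to look better once the image is in a document.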

A few pointers about this step:

  • Make sure you give your images useful, descriptive names. This helps ensure you place the proper reference in the Word document. I personally like following the convention of giving the plot image the same name as the plot object in your code.
  • Your images must have unique names, or else the first image will be overwritten by the second image.
  • To stay organized, store the images in a separate folder designed specifically for the purpose of holding your report images.

Repeating similar code for my other charts, I’m left with 5 chart images in my report_images folder:

Preparing the Word Document Report with Image References

There is a popular Microsoft Word document package out there for Python called python-docx, which is a great library for manipulating Word Documents with your Python code. However, its API does not easily allow you to insert an image in the middle of the document.

What we really need is something like Jinja2 for Word Docs: a package where you can specify special placeholder values in your document and have images loaded in automatically. Well, luckily exactly such a package exists: python-docx-template. It is built on top of python-docx and Jinja2 and lets you use Jinja2-like syntax in your Word Documents (see my other post on using Jinja2 templates to create PDF Reports).

To get images into Word using python-docx-template, it’s simple: you just have to use the usual Jinja2 syntax {{ image_variable }} within your Word document. In my case, I had six images, and the Word template I put together for testing looked something like this:

For the system to work, you have to use variable names within {{ }} that align with the image names (before ‘.png’) and the plot variable names in your Jupyter notebook.
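Because the text inside {{ }} is treated as an expression, a file name like num-posts.png would not give you a usable variable name. Here is a small sketch of a check you could run over your image folder before building the template. The helper name template_variables is my own for illustration, not part of python-docx-template:

```python
from pathlib import Path


def template_variables(image_dir):
    """Map each PNG in image_dir to the variable name its template placeholder must use."""
    variables = {}
    for path in sorted(Path(image_dir).glob("*.png")):
        name = path.stem  # file name before '.png'
        if not name.isidentifier():
            raise ValueError(f"{path.name!r} cannot be used as a template variable name")
        variables[name] = path
    return variables
```

Calling template_variables("report_images") gives you a dict whose keys are exactly the names to put inside {{ }} in the Word document.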

Loading the Images Into your Document

The final and most important step is to get all the images into your template. To do this, the code roughly follows these steps: load your Word document template, load the images from your image directory as a dict of InlineImage objects, render the images in the Word document, and save the image-loaded version to a new filename.

Here is the code that does this:
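A minimal sketch of such a script, assuming python-docx-template (imported as docxtpl) is installed and every image is a PNG. The function names and the 6 inch image width are my own choices for illustration:

```python
import sys
from pathlib import Path


def find_images(image_dir):
    """Return {variable_name: image_path} for every PNG in image_dir."""
    return {path.stem: str(path) for path in sorted(Path(image_dir).glob("*.png"))}


def load_images(template_path, output_path, image_dir):
    # Imported here so find_images() can be used without python-docx-template installed.
    from docx.shared import Inches
    from docxtpl import DocxTemplate, InlineImage

    template = DocxTemplate(template_path)
    # One InlineImage per file, keyed by the name used inside {{ }} in the template.
    context = {
        name: InlineImage(template, path, width=Inches(6))  # width is an assumption
        for name, path in find_images(image_dir).items()
    }
    template.render(context)
    template.save(output_path)


# Only runs when exactly three command line arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 4:
    load_images(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved as load_images.py, this sketch is called with the template, the output document, and the image directory, in that order.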

To run the code, you need to specify the Word template document, the new Word document name which will contain the images, and the directory where your images are stored. 

python load_images.py <template Word Doc Filename> <image-loaded Word Doc filename> <image directory name>

In my case, I used the following command, which takes in the template template.docx, produces the image-loaded Word Document result.docx, and grabs the images from the folder report_images:

python load_images.py template.docx result.docx report_images

And voila, the image-loaded Word Document looks like this:

You can find the code I used to create this guide on Github here.

Building a Rare Event Reporting System in Python (Part 1 – Crunching Historical Data to Determine What is Unusual)

Here’s the situation: you have a dataset where each row represents some type of event and the time it occurred. You want to build a system to let you know if there were unusual events occurring in some time period (a common application). You also want your users to be able to explore the events in more detail when they get an alert and understand where the event sits in the history of the data that has been collected.

After starting a task similar to this at work, I broke it down into the following parts (each of which will receive a separate blog post):

  1. Using historical data to understand what is “unusual” (Part 1 – this post).
  2. Reporting “unusual” events taken from a live feed of your data (Part 2): Developing a system that monitors a live feed of your data, compares it to historical trends, and sends out an email alert when the number of incidents exceeds some threshold.
  3. Developing a front end for users to sign up for email alerts (Part 3): Developing a system for signing up with email to receive the updates and specify how often you want to see them (i.e. specifying how “unusual” an event you want to see).
  4. Creating interactive dashboards (Part 4): Developing dashboards complementing the alert system where users can look up current status of data and how it compares to historical data.

The Data: Obviously, this scenario applies to many, many datasets, but I have to pick something specific for this series. So, I’ll be using this dataset on air quality in the City of Winnipeg from Winnipeg’s open data portal. This data is perfect because it’s updated regularly (every 5 minutes), and there is historical data as well as a live feed API (you can find documentation on the API here). We’ll be using the historical data to help figure out what’s “unusual” and the live data as the basis for alerts, comparing its values against the historical data.

Data Exploration and Pre-Processing

After downloading our air quality dataset, we see it looks like this:

As you can see, there are different measurements in different rows, including Temperature, Humidity, and PM2.5 Particulates. Let’s filter our dataset to only contain information on PM2.5 Particulates (we only really care about air pollution in this application).

Looking at the values in MeasurementValue, we see it is measured in units of micrograms per cubic metre (ug/m3) and all values are positive integers. Plotting the histogram, you can see that almost all of the values are below 1,000, with a relatively small number of outliers around 5,000.

Counting the number of Measurement Values and filtering for lower values reveals that the vast majority of the observations are less than 100.

Looking at the Measurement Values greater than 100 confirms that there are no observations between 500 and 5,000, with a sudden spike around 5,000.

I don’t have domain expertise related to air quality measurement and the nature of the sensors used to collect this data; however, the fact that there are no observations in this wide range suggests that these are measurement errors. Exporting the Measurement Values to a table shows that there is one observation with 673 ug/m3 and then the next highest observation suddenly jumps by almost an order of magnitude to 4892 ug/m3.

Given the clear cutoff from these outliers, we’ll set a threshold and define an observation as a measurement error if it is greater than 1000 ug/m3. So as a first step in our analysis, we’ll take out these outliers.

Once these have been filtered out, we’ll do a bit of aggregation across longer time periods. This helps further smooth out other one-off outliers that are likely measurement errors rather than a real increase in overall air pollution in the city.

To do this, we put together a dataset that groups data into hourly chunks and takes the average particulate matter over that period. Plotting this out shows the average hourly particulate matter over the historical time period provided is between 0 and 120 ug/m3.
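The filtering and hourly aggregation steps can be sketched in pandas as follows. This uses a tiny made-up table rather than the real download; MeasurementValue is the column named above, while ObservationTime and MeasurementType are column names I am assuming for illustration:

```python
import pandas as pd

# Synthetic stand-in for the air quality data. "ObservationTime" and
# "MeasurementType" are assumed column names, not taken from the real dataset.
df = pd.DataFrame({
    "ObservationTime": pd.to_datetime([
        "2019-01-01 00:05", "2019-01-01 00:35", "2019-01-01 01:10",
        "2019-01-01 01:40", "2019-01-01 01:50",
    ]),
    "MeasurementType": ["PM2.5", "PM2.5", "PM2.5", "PM2.5", "Temperature"],
    "MeasurementValue": [10, 20, 4892, 30, -5],
})

ERROR_THRESHOLD = 1000  # ug/m3; anything above this is treated as a measurement error

# Keep only particulate readings, then drop the suspected measurement errors.
pm25 = df[df["MeasurementType"] == "PM2.5"]
pm25 = pm25[pm25["MeasurementValue"] <= ERROR_THRESHOLD]

# Group the readings into hourly chunks and average each chunk.
hrly_avg_aqp = (
    pm25.set_index("ObservationTime")["MeasurementValue"]
    .resample("h")
    .mean()
)
print(hrly_avg_aqp)
```

In this toy data the 4892 reading is dropped before averaging, so it no longer inflates the average for its hour.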

Determining Alert Thresholds

From this hourly dataset of average particulate levels, here’s the general approach we’ll take to determine what is “unusual”: we’ll define some percentile threshold where we want to be alerted, and call anything above this threshold “unusual”.

Python’s pandas package has a nice Series.quantile() function to help out with this: you pass in the quantile you’re interested in and it pops out the value in your data representing that quantile. For example, if you want to see the value representing the 95th percentile in our hourly average air quality data (hrly_avg_aqp), then you would write this:
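As a small illustration of the call, here is a sketch that uses the numbers 1 to 100 as a stand-in for the hourly averages (the real series comes from the pre-processing above):

```python
import pandas as pd

# Stand-in for the hourly averages: the values 1, 2, ..., 100 (ug/m3).
hrly_avg_aqp = pd.Series(range(1, 101), dtype=float)

# quantile() takes a fraction between 0 and 1, so the 95th percentile is 0.95.
p95 = hrly_avg_aqp.quantile(0.95)
print(p95)
```

Running the same quantile(0.95) call on the real hourly data gives the figure quoted next.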

The 95th percentile in the dataset is 32.47 ug/m3.

The next logical question is: what percentile threshold should we use to send alerts?

One way to think about this is how often you would like, or expect, to see emails about these events. Being alerted about 99th percentile events seems like it might be rare enough. However, remember that we are collecting data by the hour, which means that if you set a 99th percentile criterion, you would receive an email in about 1 out of every 100 hours, or about once every four days. For our application, this seems like far too many alerts: we want to hear about the rare air quality events (e.g. once-a-year events).

As a solution, we’ll leave it to the user to input about how rare an event they want to receive alerts for: once-in-two-years events, once-in-a-year events, six-in-a-year events, or once-in-a-month events. Since we’re looking at hourly data, if a user specifies the number of alerts they want to receive per year (n_alerts_per_year), the calculation of the appropriate percentile threshold is:

percentile_threshold = [1 - n_alerts_per_year  / Total Number of Observations in a Year] * 100 = [1 - n_alerts_per_year  / (365 * 24)] * 100

So now we have all the fundamental pieces we need: when the user specifies n_alerts_per_year, we calculate the percentile threshold using the calculation above. Because Series.quantile() expects a fraction between 0 and 1 rather than a percentage, we divide by 100 when calculating the air quality value that should set off an alert for that user:

hrly_avg_aqp.quantile(percentile_threshold / 100)
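To make the arithmetic concrete, here is a short sketch of the percentile calculation for the four alert frequencies mentioned above. The function name and the input check are my own additions:

```python
HOURS_PER_YEAR = 365 * 24  # one hourly average per hour


def percentile_threshold(n_alerts_per_year):
    """Percentile (0-100) above which roughly n_alerts_per_year hourly values fall."""
    if not 0 < n_alerts_per_year < HOURS_PER_YEAR:
        raise ValueError("n_alerts_per_year must be between 0 and 8760")
    return (1 - n_alerts_per_year / HOURS_PER_YEAR) * 100


for label, n in [("once in two years", 0.5), ("once a year", 1),
                 ("six a year", 6), ("once a month", 12)]:
    pct = percentile_threshold(n)
    # Series.quantile() wants a fraction, hence pct / 100.
    print(f"{label}: percentile {pct:.4f}, quantile argument {pct / 100:.6f}")
```

Even the most frequent option here (once a month) sits well above the 99th percentile, which shows how rare these events are on an hourly scale.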

Here’s an overview of our data pre-processing steps for our unusual event application.

What’s Next?

Now that we have the basic data in place for understanding what levels of air quality are “unusual”, in Part 2 of the series we’ll write code to ingest the JSON data feed of air quality readings and report “unusual” events based on the thresholds developed in this post: a system that regularly monitors the live feed, compares it with the historical data, and sends out an email whenever the readings exceed a percentile threshold.

Getting Started with Airflow Using Docker

Lately I’ve been reading intensively on data engineering after being inspired by this great article by Robert Chang providing an introduction to the field.  The underlying message of the article really resonated with me: when most people think of data science they immediately think about the stuff being done by very mature tech companies like Google or Twitter, like deploying uber-sophisticated machine learning models all the time.

However, many organizations are not at the stage where these kinds of models make sense as a top priority. This is because, to build and deploy these kinds of models efficiently and effectively, you need to have foundational data infrastructure in place that you can build the models on. Yes, you can develop a machine learning model with the data you have in your organization, but you have to ask: how long did it take you to do it, is your work repeatable / automatable, and are you able to deploy or actually use your solution in a meaningful and reliable way? This is where data engineering comes in: it’s all about building the data warehouses and ETL pipelines (extract-transform-load) that provide the fundamental plumbing required to do everything else.

One tool that keeps coming up in my research on data engineering is Apache Airflow, which is “a platform to programmatically author, schedule and monitor workflows”. Essentially, Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a particular order, and monitor / manage all of your tasks. It’s becoming very popular among data engineers / data scientists as a great tool for orchestrating ETL pipelines and monitoring them as they run.

In this post, I’ll give a really brief overview of some key concepts in Airflow and then show a step-by-step deployment of Airflow in a Docker container.

Key Airflow Concepts

Before we get into deploying Airflow, there are a few basic concepts to introduce. See this page in the Airflow docs, which goes through these in greater detail and describes additional concepts as well.

Directed Acyclic Graph (DAG): A DAG is a collection of the tasks you want to run, along with the relationships and dependencies between the tasks. DAGs can be expressed visually as a graph with nodes and edges, where the nodes represent tasks and the edges represent dependencies between tasks (i.e. the order in which the tasks must run). Essentially, DAGs represent the workflow that you want to orchestrate and monitor in Airflow. They are “acyclic”, which means that the graph has no cycles – in English, this means your workflows must have a beginning and an end (if there were a cycle, the workflow would be stuck in an infinite loop).

Operators: Operators represent what is actually done in the tasks that compose a DAG workflow. Specifically, an operator represents a single task in a DAG. Airflow provides a lot of pre-defined classes with tons of flexibility about what you can run as tasks. This includes classes for very common tasks, like BashOperator, PythonOperator, EmailOperator, OracleOperator, etc. On top of the multitude of operator classes available, Airflow provides the ability to define your own operators. As a result, a task in your DAG can do almost anything you want, and you can schedule and monitor it using Airflow.

Tasks: A running instance of an operator. During the instantiation, you can define specific parameters associated with the operator and the parameterized task becomes a node in a DAG.
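To get a feel for why the “acyclic” part matters, here is a small sketch using only Python’s standard library (graphlib, available in Python 3.9+). This is not Airflow’s API, and the task names are hypothetical; it just shows that tasks with dependencies have a valid run order only when there are no cycles:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical ETL workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "send_report": {"load"},
}

# A valid run order always exists for an acyclic graph.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)

# Making the first task depend on the last one creates a cycle.
cyclic = dict(workflow, extract={"send_report"})
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("cycle detected: this workflow could never finish")
```

Airflow does this kind of dependency resolution for you; the point of the sketch is only that a cycle leaves no task that can run first.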

Deploying Airflow with Docker and Running your First DAG

The rest of this post focuses on deploying Airflow with Docker, and it assumes you are somewhat familiar with Docker or have read my previous article on getting started with Docker.

As a first step, you obviously need to have Docker installed and have a Docker Hub account. Once you do that, go to Docker Hub and search “Airflow” in the list of repositories, which produces a bunch of results. We’ll be using the second one: puckel/docker-airflow which has over 1 million pulls and almost 100 stars. You can find the documentation for this repo here. You can find the github repo associated with this container here.

So, all you have to do to get this pre-made container running Apache Airflow is type:

docker pull puckel/docker-airflow

And after a few short moments, you have a Docker image installed for running Airflow in a Docker container. You can see your image was downloaded by typing:

docker images

Now that you have the image downloaded, you can create a running container with the following command:

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Once you do that, Airflow is running on your machine, and you can visit the UI by visiting http://localhost:8080/admin/

On the command line, you can find the container name by running:

docker ps

You can jump into your running container’s command line using the command:

docker exec -ti <container name> bash

So in my case, my container was automatically named competent_vaughan by docker, so I ran the following to get into my container’s command line:

docker exec -ti competent_vaughan bash

Running a DAG

So your container is up and running. Now, how do we start defining DAGs?

In Airflow, DAG definition files are Python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by defining the script and simply adding it to a folder ‘dags’ within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is:

/usr/local/airflow/dags

The thing is, you don’t want to jump into your container and add the DAG definition files directly in there. One reason is that the minimal version of Linux installed in the container doesn’t even have a text editor. But a more important reason is that jumping in containers and editing them is considered bad practice and “hacky” in Docker, because you can no longer build the image your container runs on from your Dockerfile.

Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine and the Docker container. Anything you add to the local directory will show up in the directory you connect it with in the container. In our case, we’ll create a volume that maps the directory on our local machine where we’ll hold DAG definitions to the location where Airflow reads them in the container, with the following command:

docker run -d -p 8080:8080 -v /path/to/dags/on/your/local/machine/:/usr/local/airflow/dags  puckel/docker-airflow webserver

The DAG we’ll add can be found in this repo created by Manasi Dalvi. The DAG is called Helloworld and you can find the DAG definition file here. (Also see this YouTube video where she provides an introduction to Airflow and shows this DAG in action.)

To add it to Airflow, copy Helloworld.py to /path/to/dags/on/your/local/machine. After waiting a couple of minutes, refresh your Airflow GUI and voila, you should see the new DAG Helloworld:

You can test individual tasks in your DAG by entering into the container and running the command airflow test. First, you enter into your container using the docker exec command described earlier. Once you’re in, you can see all of your dags by running airflow list_dags. Below you can see the result, and our Helloworld DAG is at the top of the list:

One useful command you can run on the command line before you run your full DAG is the airflow test command, which allows you to test individual tasks in your DAG and logs the output to the command line. You specify a date / time and it simulates the run at that time. The command doesn’t bother with dependencies and doesn’t communicate state (running, success, failed, …) to the database, so you won’t see the results of the test in the Airflow GUI. So, with our Helloworld DAG, you could run a test on task_1:

airflow test Helloworld task_1 2015-06-01

Note that when I do this, it appears to run without error; however, I’m not getting any logs output to the console. If anyone has any suggestions about why this may be the case, let me know. 

You can run the backfill command, specifying a start date and an end date to run the Helloworld DAG for those dates. In the example below, I run the dag 7 times, each day from June 1 – June 7, 2015:

airflow backfill Helloworld -s 2015-06-01 -e 2015-06-07

When you run this, you can see the following in the Airflow GUI, which shows the success of the individual tasks and each of the runs of the DAG.

Deploying and Maintaining a Web App Part 3: Adding Tests with Pytest

Testing is an important part of development, including developing web applications. Among other benefits, tests increase your confidence that your web application is doing what you expect, and provide a basis for preventing bugs in an automated way when you make changes to your code.

Since testing is such an integral part of web application maintenance and deployment, in this Part 3 of our app-deploy project we’ll put in some basic tests to see how they can be implemented in flask applications and so we will eventually see how tests fit into the deployment workflow. We’ll be using the pytest python library as our testing framework.

To follow along, you can clone the repository:

git clone https://github.com/marknagelberg/app-deploy.git

And then go to the appropriate location in the code with the following command which includes all of the changes made in this post:

git checkout ed2ed02f8b3db

Finally, create the python environment with:

conda env create -f environment.yml

First, a Slight Upgrade to our App

Right now, when our user enters information into the app’s simple web form, the data is simply added to the database and nothing changes from the user’s perspective. To make testing a bit more interesting, we first add a little bit of additional functionality: our app will now print out a list of all the existing names in the database on the main page.

First we add a few lines to app/app.py so that it queries the database to get all names, and then sends the list of names to the template.

Then, we update the template to print out these names.

Now, when you enter in new names, they are listed out for you, like so:

Pytest Installation and Set-up

To start off, let’s activate our app’s conda environment and install pytest:

source activate app-deploy
conda install pytest

Our tests will be stored in a top level directory in a folder that we’ll call `tests`. Our tests will be stored in *.py files within this folder. These files must either be named like test_*.py or *_test.py: this is how you tell pytest that these files represent tests (in other words, this is how pytest does “test discovery”).

As a reminder, your top level directory should look something like this:

So, let’s add a file called test_app.py in /tests.

touch tests/test_app.py

To run your tests, you simply enter ‘pytest’ in the command line. No tests have been added to test_app.py yet, so when you run ‘pytest’ now, the result should look like this:


A couple of other small preliminaries to get testing set up:

  • Set WTF_CSRF_ENABLED configuration variable equal to False in the testing configuration found in config.py. This is required for tests to run properly. Keep in mind that you normally want to have this set equal to True when you’re running your application in production, since this protects your forms against Cross Site Request Forgery (CSRF) attacks.
  • Add __init__.py in your /tests directory.

Adding Tests

The great thing about pytest is that it allows you to write tests in a very concise, pythonic way using assert statements. Suppose you have the following file which defines a function that takes the square of a number. You want to test this. Adding a couple of tests is as easy as this:

Then, you run your tests by typing `pytest` in the command line when you’re in the app’s main directory. This returns the following result (one test pass, one test fail):


Another great feature of pytest is the ability to produce “fixtures”, which provide tests with the pre-initialized objects they require to run. In our case, a great use of fixtures is initializing the test database or the test application instance.

Pytest fixtures have some advantages over the usual setup and teardown functions used in other testing frameworks. One is the ability to run fixtures using different “scopes”, allowing you to do the setup / teardown operation at different times when you run your test to maximize test efficiency. The options for scope here include function, class, module, and session. So, for example, a fixture in the function scope means that the fixture object is invoked once for each test function you define in pytest.

So, say all your tests need to be able to access the Flask test client application instance. Using a module-scoped fixture, the application instance is only created once at the very beginning when you run your tests, rather than being created and destroyed once for every test function.

For our small app, we’ll create 3 fixtures:

  • A database record in our Name table
  • The Flask application instance
  • The test database

The following code creates a fixture for one database record in our Name table which will make this object available to all our tests.

As you can see, you register fixtures using the pytest.fixture decorator. The decorated function returns the fixture object that you want your tests to access. You access this fixture object in your tests by supplying its name as an argument to the test function. The code below shows a test we can add to test_app.py that accesses the new_name fixture and ensures it has the value we expect.
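Here is a sketch of the fixture and test pattern together. To keep it self-contained, a small dataclass stands in for the app’s Name model (in the real project you would import the model from the app instead), and the module scope is an assumption on my part:

```python
from dataclasses import dataclass

import pytest


@dataclass
class Name:
    """Hypothetical stand-in for the app's Name database model."""
    name: str


@pytest.fixture(scope="module")
def new_name():
    # Created once per test module and handed to every test that asks for it.
    return Name(name="Mark")


def test_new_name(new_name):
    # pytest matches the argument name to the fixture of the same name.
    assert new_name.name == "Mark"
```

In a real project the fixture would usually live in tests/conftest.py so that every test file can use it without importing anything.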

We can take advantage of the work we did in Part 2 of the series using our application factory pattern to easily create a fixture for our application instance with our testing configuration options. The code below does this by creating an application instance with testing configuration and returning its test client (the test client provides a simple interface to the application where we can trigger requests to the application and track cookies).

Note that the part after the yield statement represents the “teardown” part of the test: this is where the application instance is removed when the tests are done running.

We also need to create a similar fixture for our database. The following code creates the database and initializes the data with create_all(), adds one Name “Mark” and then cleans up the database after the tests are complete.

Finally, we add a few more tests to use all of our new fixtures, including tests to make sure our application instance exists and ensure the HTML output produced contains data we expect. 

(Note that “Mark” appears on the main page since this record was added to the database as part of the fixture.)

Now, in your main app directory, you can run your tests by simply typing “pytest” in the command line. The result for our code here should look like this:

The three dots ‘…’ indicate that there were three tests and they each passed (if a test failed, one of these dots would appear as an F).

Aside: during the creation of this post, I was puzzled to see that my code was running in the flask production environment, despite my configuration and environment variables clearly indicating that I was in development. Turns out there is another environment variable that needs to be set called FLASK_ENV, which defaults to “production”, turning off debug mode in Flask and throwing a warning when you run your application on the Flask development server. To fix this, run `export FLASK_ENV="development"` on the command line. Note that this must be set as an environment variable: adding it in config.py doesn’t work. I’ve updated this information in Part 2. For further information in the Flask documentation, see here.

Resources

https://www.patricksoftwareblog.com/testing-a-flask-application-using-pytest/

https://piotr.banaszkiewicz.org/blog/2014/02/22/how-to-bite-flask-sqlalchemy-and-pytest-all-at-once/

Digging into Data Science Tools: Docker

Docker is a tool for creating and managing “containers” which are like little virtual machines where you can run your code. A Docker container is like a little Linux OS, preinstalled with everything you need to run your web app, machine learning model, script, or any other code you write.

Docker containers are like a really lightweight version of virtual machines. They use way less computer resources than a virtual machine, and can spin up in seconds rather than minutes. (The reason for this performance improvement is Docker containers share the kernel of the host machine, whereas virtual machines run a separate OS with a separate kernel for every virtual machine.)

Aly Sivji provides a great comparison of Docker containers to shipping containers. Shipping containers improved efficiency of logistics by standardizing the design: they all operate the same way and we have standardized infrastructure for dealing with them, and as a result you can ship them regardless of transportation type (truck, train, or boat) and logistics company (all are aware of shipping containers and mold to their standards). In a similar way, Docker provides a standardized software container which you can pass into different environments and be confident they’ll run as you expect.  

Brief Overview of How Docker Works

To give you a really high-level overview of how Docker works, first let’s define three big Docker-related terms – “Dockerfile”, “Image”, and “Container”:

  • Dockerfile: A text file you write to build the Docker “image” that you need (see definition of image below). You can think of the Dockerfile like a wrapper around the Linux command line: the commands that you would use to set up a Linux system on the command line have equivalents which you can place in a Dockerfile. “Building” the Dockerfile produces an image that represents a Linux machine that’s in the exact state that you need. You can learn all about the ins-and-outs of the syntax and commands at the Dockerfile reference page. To get an idea of what Dockerfiles look like, here is a Dockerfile you would use to create an image that starts from the Ubuntu 15.04 Linux distribution, copies all the files from your application to /app in the image, runs the make command on /app within your image’s Linux command line, and then finally runs the python file defined in /app/app.py when a container starts:
FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py
  • Image: A “snapshot” of the environment that you want the containers to run. The images include all you need to run your code, such as code dependencies (e.g. python venv or conda environment) and system dependencies (e.g. server, database). You “build” images from Dockerfiles which define everything the image should include. You then use these images to create containers.
  • Container: An “instance” of the image, similar to how objects are instances of classes in object oriented programming. You create (or “run” using Docker language) containers from images. You can think of containers as running the “virtual machine” defined by your image.

To sum up these three main concepts: you write a Dockerfile to “build” the image that you need, which represents the snapshot of your system at a point in time. From this image, you can then “run” one or more containers with that image.

Here are a few other useful terms to know:

  • Volume: “Shared folders” that let a docker container see a folder on your host machine (very useful for development, so your container is automatically updated with your code changes). Volumes also allow one docker container to see data in another container. Volumes can be “persistent” (the volume continues to exist after the container is stopped) or “ephemeral” (the volume disappears as soon as the container is stopped).
  • Container Orchestration: When you first start using Docker, you’ll probably just spin up one container at a time. However, you’ll soon find that you want to have multiple containers, each running using a different image with different configurations. For example, a common use of Docker is deployment of applications as “microservices”, where each Docker container represents an individual microservice that interacts with your other microservices to deliver your application. Since it can get very unwieldy to manage multiple containers manually, there are “container orchestration” tools that automate tasks such as starting up all your containers, automatically restarting failing containers, connecting containers together so they can see each other, and distributing containers across multiple computers. Examples of tools in this space include docker-compose and Kubernetes.
  • Docker Daemon / Docker Client: The Docker Daemon must be running on the machine where you want to run containers (could be on your local or remote machine). The Docker Client is the front-end command line interface you use to interact with Docker, connect to the Docker Daemon, and tell it what to do. It’s through the Docker Client that you run commands to build images from Dockerfiles, create containers from images, and do other Docker-related tasks.

Why is Docker useful to Data Scientists?

You might be thinking “Oh god, another tool for me to learn on top of the millions of other things I have to keep on top of? Is it worth my time to learn it? Will this technology even exist in a couple years?”

I think the answer is, yes, this is definitely a worthwhile tool for you to add to your data science toolbox.

To help illustrate, here is a list of reasons for using Docker as a data scientist, many of which are discussed in Michaelangelo D’Agostino’s “Docker for Data Scientists” talk as well as this Lynda course from Arthur Ulfeldt:

  • Creating 100% Reproducible Data Analysis: Reproducibility is increasingly recognized as critical for both methodological and legal reasons. When you’re doing analysis, you want others to be able to verify your work. Jupyter notebooks and Python virtual environments are a big help, but you’re out of luck if you have critical system dependencies. Docker ensures you’re running your code in exactly the same way every time, with the same OS and system libraries.
  • Documentation: As mentioned above, the basis for building docker containers is a “Dockerfile”, which is a line by line description of all the stuff that needs to exist in your image / container. Reading this file gives you (and anyone else that needs to deploy your code) a great understanding about what exactly is running on the container.
  • Isolation: Using Docker helps ensure that your tools don’t conflict with one another. By running them in separate containers, you’ll know that you can run Python 2, Python 3, and R and these pieces of software will not interfere with each other.
  • Gain DevOps powers: in the words of Michaelangelo D’Agostino, “Docker Democratizes DevOps”, since it opens up opportunities that used to be available only to systems / DevOps experts:
    • Docker allows you to more easily “sidestep” DevOps / system administration if you aren’t interested, since someone can create a container for you and all you have to do is run it. Similarly, if you like working with Docker, you can create a container for less technically savvy coworkers that lets them run things easily in the environment they need.
    • Docker provides the ability to build Docker images starting from existing images. You can find many of these on DockerHub, which holds thousands of pre-built Dockerfiles and images. So if you’re running a well-known application (or even an obscure one), there is often a Dockerfile already available that can give you a tremendous running start to deploy your project. This includes “official” Docker repositories for many tools, such as ubuntu, postgres, nginx, wordpress, python, and many more.
    • Using Docker helps you work with your IT / DevOps colleagues, since you can do your Data Science work in a container, and simply pass it over to DevOps as a black box that they can run without having to know everything about your model.

Here are a few examples of applications relevant to data science that you might try out with Docker:

  • Create an ultra-portable, custom development workflow: Build a personal development environment in a Dockerfile, so you can access your workflow immediately on any machine with Docker installed. Simply load up the image wherever you are, on whatever machine you’re on, and your entire work environment is there: everything you need to do your job, and how you want to do your job.
  • Create development, testing, staging, and production environments: Rest assured that your code will run as you expect, and create staging environments identical to production so you know that when you push to production, you’re going to be OK.
  • Reproduce your Jupyter notebook on any machine: Create a container that runs everything you need for your Jupyter Notebook data analysis, so you can pass it along to other researchers / colleagues and know that it will run on their machine. As great as Jupyter Notebooks are for doing analysis, they tend to suffer from the “it works on my machine” issue, and Docker can solve this issue.

For more inspiration, check out Civis Analytics’ Michaelangelo D’Agostino describing the Docker containers they use (start at the 18:08 mark). This includes containers specialized for survey processing, R shiny apps and other dashboards, Bayesian time series modeling and poll aggregation, as well as general purpose R/Python containers that have all the common packages needed for staff.

Further Resources

If you’re serious about starting to use Docker, I highly recommend the Lynda Course Learning Docker by Arthur Ulfeldt as a starting point. It’s well-explained and concise (only about 3 hours of video in total).

Here are a few other useful resources you might want to check out:

Creating PDF Reports with Python, Pdfkit, and Jinja2 Templates

Once in a while as a data scientist, you may need to create PDF reports of your analyses. This seems somewhat “old school” nowadays, but here are a couple of situations where you might want to consider it:

  • You need to make reports that are easily printable. People often want “hard copies” of particular reports they are running and don’t want to reproduce everything they did in an interactive dashboard.
  • You need to match existing reporting formats: If you’re replacing a legacy reporting system, it’s often a good idea to try to match existing reporting methods as your first step. This means that if the legacy system used PDF reporting, then you should strongly consider creating this functionality in the replacement system. This is often important for getting buy-in from people comfortable with the old system.

I recently needed to do PDF reporting in a work assignment. The particular solution I came up with uses two main tools: Jinja2, to generate the HTML for the reports, and pdfkit, to convert that HTML into PDFs.

We’ll install our required packages with the following commands:

pip install pdfkit
pip install Jinja2

Note that you also need to install a tool called wkhtmltopdf for pdfkit to work.

Primer on Jinja2 Templates

Jinja2 is a great tool to become familiar with, especially if you do web development in Python. In short, it lets you automatically generate text documents by programmatically filling in placeholder values that you assign to text file templates. It’s a very flexible tool, used widely in Python web applications to generate HTML for users. You can think of it like super high-powered string substitution.

We’ll be using Jinja2 to generate HTML files of our reports that we will convert into PDFs with other tools. Keep in mind that Jinja2 can come in handy for other reporting applications, like sending automated emails or creating reports in other text file formats.

There are two main components of working with Jinja2:

  • Creating the text file Jinja2 templates that contain placeholder values. In these templates, you can use a variety of Jinja2 syntax features that allow you to adjust the look of the file and how it loads the placeholder data.
  • Writing the python code that assigns the placeholder values to your Jinja2 templates and renders a new text string according to these values.

Let’s create a simple template just as an illustration. This template will simply be a text file that prints out the value of a name. All you have to do is create a text file (let’s call it name.txt). Then in this file, simply add one line:

Your name is: {{ name }}

Here, ‘name’ is the name of the python variable that we’ll pass into the template, which holds the string placeholder that we want to include in the template.

Now that we have our template created, we need to write the python code that fills in the placeholder values in the template with what you need. You do this with the render function. Say, we want to create a version of the template where the name is “Mark”. Then write the following code:
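A minimal sketch of that rendering step is shown here. It assumes name.txt contains the single line shown earlier; to stay self-contained, the sketch first writes that file into a temporary folder, which you would skip if you already created the file by hand. Apart from outputText, the variable names are illustrative.

```python
import tempfile
from pathlib import Path

import jinja2

# Create name.txt in a scratch folder so the sketch runs on its own
# (skip this step if you already created the file by hand).
template_dir = Path(tempfile.mkdtemp())
(template_dir / "name.txt").write_text("Your name is: {{ name }}")

# Tell Jinja2 where to look for template files, then load ours.
templateLoader = jinja2.FileSystemLoader(searchpath=str(template_dir))
templateEnv = jinja2.Environment(loader=templateLoader)
template = templateEnv.get_template("name.txt")

# Fill in the {{ name }} placeholder with the value we want.
outputText = template.render(name="Mark")
print(outputText)
```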

Now, outputText holds a string of the template where {{ name }} is now equal to “Mark”. You can confirm this by writing the following on the command line:

The arguments to template.render() are the placeholder variables contained in the template along with what you want to assign them to:

template.render(placeholder_variable_in_template1=value_you_want_it_assigned1, placeholder_variable_in_template2=value_you_want_it_assigned2, ..., placeholder_variable_in_templateN=value_you_want_it_assignedN)

There is much, much more you can do with Jinja2 templates. For example, we have only shown how to render a simple variable here, but Jinja2 allows more complex expressions, such as for loops, if-else statements, and template inheritance. Another useful fact about Jinja2 templates is that you can pass in arbitrary python objects like lists, dictionaries, or pandas data frames and use those objects directly in the template. Check out the Jinja2 Template Designer Documentation for a full list of features. I also highly recommend the book Flask Web Development: Developing Web Applications with Python, which includes an excellent guide on Jinja2 templates (which are the built-in template engine for the Flask web development framework).

Creating PDF Reports

Let’s say you want to print PDFs of tables that show the growth of a bank account. Each table shows the year-by-year growth of $100, $500, $20,000, and $50,000. Each separate PDF report uses a different interest rate to calculate the growth. We’ll need 10 different reports, which print tables with 1%, 2%, 3%, …, 10% interest rates, respectively.

Let’s first define the pandas DataFrames that we need.

data_frames contains 10 dictionaries, each of which contains a data frame and the interest rate used to produce that data frame.
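One way such a list could be built is sketched here. The 10-year horizon, annual compounding, and the dictionary keys "df" and "interest_rate" are assumptions made for illustration, not details taken from the original analysis.

```python
import pandas as pd

starting_amounts = [100, 500, 20000, 50000]
years = range(1, 11)                              # assumption: 10-year horizon
interest_rates = [r / 100 for r in range(1, 11)]  # 1% through 10%

data_frames = []
for rate in interest_rates:
    # One column per starting amount, one row per year, compounded annually.
    table = {
        f"${amount:,}": [round(amount * (1 + rate) ** year, 2) for year in years]
        for amount in starting_amounts
    }
    df = pd.DataFrame(table, index=pd.Index(years, name="Year"))
    # Keep the rate next to the table so the report can display it.
    data_frames.append({"df": df, "interest_rate": rate})

print(data_frames[0]["df"].head())
```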

Next, we create the template file. We will generate one report for each of the 10 data frames above, and generate them by passing each data frame to the template along with the interest rate used. 

After creating this template, we then write the following code to produce 10 HTML files for our reports.
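A rough sketch of that rendering loop follows. The inline template, the rate_label and table placeholder names, the numbered output filenames, and the small stand-in data frames are all assumptions for illustration; a real report template would live in its own file and could be much richer.

```python
import tempfile
from pathlib import Path

import jinja2
import pandas as pd


def growth_table(rate):
    """Small stand-in for one of the report data frames."""
    years = [1, 2, 3]
    return pd.DataFrame(
        {"$100": [round(100 * (1 + rate) ** y, 2) for y in years]},
        index=pd.Index(years, name="Year"),
    )


data_frames = [{"df": growth_table(r), "interest_rate": r} for r in (0.01, 0.02)]

# Hypothetical report template with two placeholders.
template = jinja2.Template(
    "<html><body>"
    "<h1>Growth at {{ rate_label }} interest</h1>"
    "{{ table }}"
    "</body></html>"
)

output_dir = Path(tempfile.mkdtemp())  # assumption: write to a scratch folder
html_paths = []
for i, entry in enumerate(data_frames, start=1):
    html = template.render(
        rate_label=f"{entry['interest_rate']:.0%}",  # e.g. 0.01 becomes "1%"
        table=entry["df"].to_html(),                 # data frame as an HTML table
    )
    path = output_dir / f"{i}.html"
    path.write_text(html)
    html_paths.append(path)
```

Passing the table in as ready-made HTML keeps the template trivial; you could instead pass the data frame itself and loop over its rows inside the template.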

Our HTML reports now look something like this:

As a final step, we need to convert these HTML files to PDFs. To do this, we use pdfkit. All you have to do is iterate through your HTML files and apply a single line of pdfkit code to each file to convert it into a PDF.
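As a sketch of that loop: pdfkit.from_file(input_path, output_path) is the one-line conversion, and it needs the wkhtmltopdf binary installed. The helper name convert_html_reports and its converter parameter are hypothetical; the parameter only exists so the loop can be exercised without wkhtmltopdf.

```python
from pathlib import Path


def convert_html_reports(html_dir, converter=None):
    """Convert each .html file in html_dir to a PDF with the same base name."""
    if converter is None:
        # Imported lazily so the function can be defined (and tested with a
        # stand-in converter) on machines without pdfkit / wkhtmltopdf.
        import pdfkit
        converter = pdfkit.from_file

    pdf_paths = []
    for html_path in sorted(Path(html_dir).glob("*.html")):
        pdf_path = html_path.with_suffix(".pdf")
        converter(str(html_path), str(pdf_path))  # the single pdfkit line
        pdf_paths.append(pdf_path)
    return pdf_paths
```

With pdfkit and wkhtmltopdf installed, calling convert_html_reports on the folder that holds your HTML reports would write 1.pdf next to 1.html, and so on.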

All of this code combined will pop out the following HTML files with PDF versions:

You can then click on 1.pdf to see that we’re getting the results we’re looking for.

We’ve given a very stripped down example of how you can create reports using python in an automated way. With more work, you can develop much more sophisticated reports limited only by what’s possible with HTML.

Deploying and Maintaining a Web App Part 2: Setting up the Database and App Configuration

It is good practice when developing a web application to set up different environments for deploying changes. The number and nature of environments that are used can vary, but we’ll be using the following commonplace architecture:

  • Development Environment: Where you develop the web application, typically on your local machine.
  • Testing Environment: Where you perform tests to help ensure the quality of your application.
  • Staging Environment: Where you deploy the application for the purpose of conducting “integration tests” that test the broader functionality of your web app.
  • Production Environment: Where you deploy your web app to your users.

In this Part 2 of our web app deployment and maintenance project, we’ll begin to set up some of these environments, including a database with an extremely simple schema as well as some additions to our application to incorporate the database and configure the app.

To follow along, you can clone the repository:

git clone https://github.com/marknagelberg/app-deploy.git

And then go to the appropriate location in the code changes with the following command:

git checkout 75521eb0e3d

Finally, create the python environment with:

conda env create -f environment.yml

Installing Additional Python Packages

To start off, we’ll install a couple of additional packages to our app’s Conda environment:

  • Flask-SQLAlchemy: SQLAlchemy is a fantastic Python library that provides a way of interacting with your database purely in python, without having to write SQL query strings. Another benefit of using SQLAlchemy and similar ORMs is the ability to easily swap out databases if you decide you want to change in the future. Flask-SQLAlchemy is an extension for Flask that makes it easier to incorporate SQLAlchemy into Flask apps.
  • Flask-Migrate: A Flask extension that provides a lightweight wrapper around the Alembic database migration framework. This will come in handy later when we need to make changes to our database schema after it’s already been deployed to production.
  • Flask-WTF: A Flask extension that makes it easy to work with web forms.
  • psycopg2: An adapter for PostgreSQL databases in Python. PostgreSQL will be our database of choice and this package is required to connect it up to our application.

To install these packages, first start up the Conda environment with the following command:

source activate app-deploy

Then install the packages above with the following commands:

pip install Flask-SQLAlchemy

pip install flask-migrate

pip install Flask-WTF

conda install psycopg2

Installing and Setting Up the PostgreSQL Databases

Next, we need to install PostgreSQL. You can find the proper install for your system here.

Postgres comes with a handy front end terminal program called psql, which allows you to examine your database’s tables and records using SQL queries. We first use psql to initialize two of the databases we’ll be working with for our program: the development database and the test database.

First enter into psql by typing psql into the terminal:

From here, you can run a bunch of useful commands. For example \l lists all the databases you currently have in postgres:

For now, we’ll need to use psql to create two databases that we’ll call app_deploy_development and app_deploy_test. This is as simple as running the two commands below.

CREATE DATABASE app_deploy_development;
CREATE DATABASE app_deploy_test;

Running the command \l shows that these two databases have been created:

You can then exit postgres with the command \q or simply press control-d.

To be able to connect to this database from Flask / SQL Alchemy, we need the URL string associated with the database. In PostgreSQL, this takes the form:

postgresql://username:password@hostname:port/database
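As an illustration of that format, the sketch below assembles the string from its parts. The helper name postgres_url and the credentials are made up for the example; percent-encoding the username and password is a precaution for credentials that contain characters such as “@” or “/”, which would otherwise confuse the URL parser.

```python
from urllib.parse import quote


def postgres_url(username, password, hostname, port, database):
    """Assemble a PostgreSQL connection URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(username, safe='')}:{quote(password, safe='')}"
        f"@{hostname}:{port}/{database}"
    )


# Placeholder credentials for illustration only.
url = postgres_url("mark", "p@ss/word", "localhost", 5432, "app_deploy_development")
print(url)
```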

We’ll eventually be running the development and test databases on localhost, so the hostname is ‘localhost’, the port is 5432 (this is a postgres default) and the database is ‘app_deploy_development’. You can confirm the port Postgres is running on by running the following command:

Configuring our Flask App

Now we’ll make some changes to the app so that the database is actually used.

First, we will begin to incorporate the application environments that we need. The main way to do this in Flask is through the configuration variables, which are stored in the app.config variable (where app is the instance of your application).

The simplistic version of app.config is treated like a simple dictionary (e.g. you can set app.config['SECRET_KEY'] = 'some_secret'), but this does not account for the fact that you need to work with different sets of configurations depending on how you’re running your application. Instead, a more flexible way to set up the app.config variable is to use a hierarchy of configuration classes, defined like this:
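A minimal sketch of such a hierarchy is shown here. SECRET_KEY, SQLALCHEMY_DATABASE_URI, DEV_DATABASE_URL, and the 'development' and 'testing' names come from this post; the TEST_DATABASE_URL and DATABASE_URL variable names, the 'production' and 'default' entries, and the extra settings are assumptions.

```python
import os


class Config:
    """Settings shared by every environment."""
    SECRET_KEY = os.environ.get("SECRET_KEY")
    SQLALCHEMY_TRACK_MODIFICATIONS = False


class DevelopmentConfig(Config):
    DEBUG = True
    SQLALCHEMY_DATABASE_URI = os.environ.get("DEV_DATABASE_URL")


class TestingConfig(Config):
    TESTING = True
    # Assumed variable name for the test database URL.
    SQLALCHEMY_DATABASE_URI = os.environ.get("TEST_DATABASE_URL")


class ProductionConfig(Config):
    # Assumed variable name for the production database URL.
    SQLALCHEMY_DATABASE_URI = os.environ.get("DATABASE_URL")


# Maps a configuration name (e.g. the value of FLASK_CONFIG) to a class.
config = {
    "development": DevelopmentConfig,
    "testing": TestingConfig,
    "production": ProductionConfig,
    "default": DevelopmentConfig,
}
```

Each subclass inherits the shared settings and overrides only what differs, so adding a new environment means adding one small class and one dictionary entry.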

SQLALCHEMY_DATABASE_URI is a configuration variable required by Flask-SQLAlchemy that tells the extension where to find the database. Since database URLs are sensitive (they contain username and password information that you would not want to share), their values are read from your system’s environment variables rather than stored in version control (you define these using the export command). Add the configuration code above to a file called config.py in the app’s top directory.

Refactoring our App to use the Application Factory Pattern

We need to do some restructuring of our application to account for the database and implement the “application factory” pattern in Flask that allows us to provide different app instances with different configurations. We first move our app.py file into the /app folder. We also create an app/templates/ folder to store the front-end template that holds the form for entering names and the “hello world” message.

We also create /app/__init__.py:

This file stores our factory function ‘create_app’ that returns instances of the application. create_app takes a config_name argument, allowing us to create an app with the desired configuration option (e.g. create_app(‘development’), create_app(‘testing’), etc.). The configuration is assigned to the app within the factory function using the Flask app.config.from_object function along with the dictionary of configuration classes defined in config.py.
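A stripped-down sketch of the factory idea follows. It is not the repository’s actual __init__.py: the two configuration classes are inline stand-ins for config.py, and the database and blueprint registration are only marked with a comment.

```python
from flask import Flask


class DevelopmentConfig:
    DEBUG = True


class TestingConfig:
    TESTING = True


# Stand-in for the dictionary of configuration classes in config.py.
config = {"development": DevelopmentConfig, "testing": TestingConfig}


def create_app(config_name):
    """Application factory: build a new app using the named configuration."""
    app = Flask(__name__)
    app.config.from_object(config[config_name])
    # Extensions (e.g. the database) and blueprints would be registered here.
    return app


dev_app = create_app("development")
test_app = create_app("testing")
print(dev_app.config["DEBUG"], test_app.config["TESTING"])
```

Because every call builds a fresh Flask object, a test suite can create its own app with the testing configuration without touching the development one.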

Adding the Database to our App

Rather than having just “Hello World”, we want to have our application make use of the database in some minimal way. So, we add a form where the user can enter names which are stored in the database inside a single table, consisting of a single column holding the names.

To define the database schema, we create a file called models.py:

This file imports our database instance db. Here we define a ‘name’ string column, and an integer id that serves as a primary key (your table must have a primary key).

We also create a file create_app.py in the app’s top directory:

This file is tasked with creating the instances of app from the create_app function defined in app/__init__.py. This function is what allows us to choose a particular configuration, which we define in an environment variable on our system called FLASK_CONFIG. It is also a place where we can do additional configuring of our newly generated application instance (e.g. creating commands for your app with the flask command line program by decorating functions with @app.cli.command()). This file may seem a little unnecessary now but it will come in handy later.

We also need to insert a form for the user to enter data. We’ll use the assistance of the great Flask extension, Flask-WTF. With Flask-WTF, you define forms as classes that inherit from FlaskForm. Each of the class variables represents a field in the form that takes in certain types of values. Here our form is quite simple, as it has only a single field where the user can enter names. There is also a ‘submit’ button which does exactly what you expect.

Now that we have a form defined, we can define the new front end template index.html.

In our main application, we need to pass our NameForm class to render this template, so we add the following code to app.py. Also note that, because we used the application factory pattern, we must now use “Blueprints” for our view functions rather than defining routes directly on the app instance in a single file (blueprints are basically objects that can hold a bunch of an app’s routes and code and can then be registered with the application in the app’s factory function).

Running our Updated App With the Database

To run the application, first define the following environment variables:

  • FLASK_APP, which tells the flask command line program where the flask application is located so that it can run it:
export FLASK_APP=create_app.py
  • FLASK_CONFIG, which is the environment variable that tells the program what environment your app should be running in. As a reminder, this value is used in create_app.py and references one of the keys in the dict found in config.py.
export FLASK_CONFIG=development
  • DEV_DATABASE_URL, which points to your development database. As a reminder, this is assigned to the configuration variable SQLALCHEMY_DATABASE_URI in config.py.
export DEV_DATABASE_URL="postgresql://<insert username>:<insert password>@localhost:5432/app_deploy_development"

Before you run the application the first time, you have to initialize the database. You only have to do this once, using the db.create_all() method provided by Flask-SQLAlchemy. From the top application directory run:

flask shell
>>> from app import db
>>> db.create_all()

Then you can exit the python interpreter with quit() or control-d.

Finally, run the application with:

flask run

Now, when you visit http://127.0.0.1:5000/, you should see this:

Your app should now be running with a live development database in the background ready to store whatever is entered into the form. To make sure it’s working, enter a few names into the form (for my test, I typed in “Foobar”, “Mark”, and “Joanna”). Then, in the terminal from the top directory in your app, write the command “flask shell” to run the python interpreter in the context of your app. This throws you into the python interpreter. We’ll then query the database to make sure our names exist:

Looks like we’re good to go! Just to be sure, I loop through the names to make sure they’re the values we expect:

Workflow to Deploy and Maintain a Web App

In this series, I’m going to work through the process of deploying and maintaining a simple web application. The goal is for you to get an understanding of all the steps that may be involved in deploying a web app and maintaining it in production with modern tools and techniques.

To follow along in this series, it would help to have an understanding of the command line, python programming, and basic web application development. I’ll be using the Flask web development framework. You’re in my target audience if you’ve built simple Flask applications running locally, and you now want to deploy them and use related best practices, such as implementing development, staging, and production environments, using docker containers, and making changes to the application via continuous integration / continuous deployment (CI / CD) tools.

The focus here is not on the web application itself –  rather, it’s on the process of deployment and maintenance. I’m going to get into all the nuts and bolts related to getting the app running live and maintaining it in a smart way. Along the way, we’ll learn about the best practices around deploying a web application that I find are not readily available in many tutorials.

Table of contents:

Part 1 – Setting up the server (currently reading)
Part 2 – Setting up the Database and App Configuration
Part 3 – Adding Tests with Pytest
… To come, as I finish the series 🙂

Getting Started

As the basis for our adventure, we’ll start with the simplest Flask app possible: the hello world app from the Flask website.

To get the full code for this application, run:

git clone https://github.com/marknagelberg/app-deploy.git

You need to install Conda, which is the tool I use for managing python environments (see my related post here). The environment information is stored in environment.yml in the repository.

To create the python environment from the environment.yml file, run the following command in the main project directory (i.e. app-deploy/):

conda env create -f environment.yml

The conda environment is called ‘app-deploy’. To run the environment after you’ve created it, simply type (on Mac or Linux):

source activate app-deploy

Or on Windows:

activate app-deploy

For each part of this series, I will provide a git checkout command that will allow you to see the code in the state relevant to that part of the series.

To start off with the initial, most basic version of the application, run:

git checkout 75521eb0e3

The main document in the repository is a file called app.py that looks like this:

First activate the environment, which will add ‘(app-deploy)’ to your command line to indicate you are in the environment, like this:

Now that you are in the appropriate python environment, you can run:

python app.py

This runs Flask’s development server and serves the app at http://127.0.0.1:5000/.

If you visit the site in the browser, you should see this:

Part 1 – Setting up the Server

As a first step in this process, we’re going to spin up a server that will ultimately run the web application in production.

An aside for data scientists: Running and configuring web servers is a valuable skill for data scientists. You often need to do this to run data-driven applications you build for others (you don’t want to be running these things on your desktop computer). Furthermore, there are many valuable tools and services out there for data scientists and their coworkers that are designed to run on a server, rather than as a desktop application (e.g. Apache Superset, which is a free and open source business intelligence dashboard like Tableau or PowerBI). Servers also come in handy for your own personal use even if you have no plans to deploy a web application. For example, your server can run scripts that automatically create reports and send you email notifications, or you can run web scrapers that collect data while you sleep.

Back in the day, you needed an on-site physical server to do this. Thankfully, modern cloud computing makes it incredibly easy and cheap to run your own server. There are a few good options out there, but in this series, I’m going to use a DigitalOcean “droplet”.

First, create an account with DigitalOcean. Then, choose the option to create a droplet:

Then, choose the operating system that you want your server to run on. I went with Ubuntu, since it’s what I’m most familiar with. It has a large user base, and as a result, there are lots of online tutorials / documentation to help you troubleshoot when things go wrong.

Then choose the memory and CPU power of your server. I went with the smallest / cheapest one. It’s easy to “resize” your droplet later on if you realize you need more juice.

Now choose a datacenter region – this is the location where your server will run. I suggest picking a data center closest to you, or closest to where your users will be. For me, that’s their Toronto data center.

Finally, you have to specify an SSH key. This will allow you to use the ssh command line program to log into your server remotely (you’ll be working with your server entirely through the command line). DigitalOcean gives you the option to enter it right inside their online dashboard interface as a step in creating the droplet. They also provide this useful guide. Here’s what I did on my Mac.

  1. Run the command:
ssh-keygen

This generates two files that make up the public / private key pair you’ll need to authenticate via SSH. Save these files in your ~/.ssh directory (this is the directory where the ssh program will automatically look when you attempt to log into the server). You’ll also get an option to protect the private key with a passphrase, which is recommended as protection against your laptop being stolen and someone getting access to the files.

id_rsa and id_rsa.pub are the default names for the private and public key files, respectively. Since these already existed in my .ssh directory, I chose different filenames: app-deploy and app-deploy.pub.

Once you create these files, you then need to copy the contents of the public key (i.e. the file ending in ‘.pub’) into the “New SSH Key” form that DigitalOcean provides when setting up your droplet:

After entering in your public key, you can click the button to create your server.

Finally, to log in to my server I run:

ssh -i ~/.ssh/app-deploy root@server_ip_address

If you chose the default filename (id_rsa and id_rsa.pub), then you only need to run:

ssh root@server_ip_address

And congratulations – you are now logged into your own personal server running in the cloud!

Using Python to Figure out Sample Sizes for your Study

It’s common wisdom among data scientists that 80% of your time is spent cleaning data, while 20% is the actual analysis.

There’s a similar issue when doing an empirical research study: typically, there’s tons of work to do up front before you get to the fun part (i.e. seeing and interpreting results).

One important up front activity in empirical research is figuring out the sample size you need. This is crucial, since it significantly impacts the cost of your study and the reliability of your results. Collect too much sample: you’ve wasted money and time. Collect too little: your results may be useless.

Understanding the sample size you need depends on the statistical test you plan to use. If it’s a straightforward test, then finding the desired sample size can be just a matter of plugging numbers into an equation. However, it can be more involved, in which case a programming language like Python can make life easier. In this post, I’ll go through one of these more difficult cases.

Here’s the scenario: you are doing a study on a marketing effort that’s intended to increase the proportion of women entering your store (say, a change in signage). Suppose you want to know whether the change actually increased the proportion of women walking through. You’re planning on collecting the data before and after you change the signs and determine if there’s a difference. You’ll be using a two-proportion Z test for comparing the two proportions. You’re unsure how long you’ll need to collect the data to get reliable results – you first have to figure out how much sample you need!

Overview of the Two Proportion Z test

The first step in determining the required sample size is understanding the statistical test you’ll be using. The two sample Z test for proportions determines whether a population proportion p1 is equal to another population proportion p2. In our example, p1 and p2 are the proportion of women entering the store before and after the marketing change (respectively), and we want to see whether there was a statistically significant increase in p2 over p1, i.e. p2 > p1.

The test evaluates the null hypothesis: p1 – p2 = 0. The test statistic we use to test this null hypothesis is:

Z = \frac{p_2 - p_1}{\sqrt{p*(1-p*)(\frac{1}{n_1} + \frac{1}{n_2})}}

Where p* is the proportion of “successes” (i.e. women entering the store) in the two samples combined. I.e.

p* = \frac{n_1p_1 + n_2p_2}{n_1 + n_2}

Z is approximately normally distributed (i.e. ~N(0, 1)), so given a Z score for two proportions, you can look up its value against the normal distribution to see the likelihood of that value occurring by chance.
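As a small illustration of that lookup, the sketch below converts a Z score into a one-sided p-value using the standard library’s NormalDist (chosen here for convenience; scipy.stats.norm would work equally well). The function name is hypothetical.

```python
from statistics import NormalDist


def one_sided_p_value(z):
    """Probability of seeing a Z score at least this large under N(0, 1)."""
    return 1 - NormalDist().cdf(z)


# A Z score of about 1.645 sits right at the conventional 5% threshold
# for a one-sided test.
p = one_sided_p_value(1.645)
print(round(p, 3))
```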

So how do we figure out the sample size we need? It depends on a few factors:

  • The confidence level: How confident do we need to be to ensure the results didn’t occur by chance? For a given difference in results, detecting it with higher confidence requires more sample. Typical choices here include 95% or 99% confidence, although these are just conventions.
  • The percentage difference that we want to be able to detect: The smaller the differences you want to be able to detect, the more sample will be required.
  • The absolute values of the probabilities you want to detect differences on: This is a little trickier and somewhat unique to the particular test we’re working with. It turns out that, for example, detecting a difference between 50% and 51% requires a different sample size than detecting a difference between 80% and 81%. In other words, the sample size required is a function of p1, not just p1 – p2.
  • The distribution of the data results: Say that you want to compare proportions within sub-groups (in our case, say you subdivide proportion of women by age group). This means that you need the sample to be big enough within each subgroup to get statistically significant comparisons. You often don’t know how the sample will pan out within each of these groups (it may be much harder to get sample for some). There are at least a couple of alternatives for you here: i) you could assume sample is distributed uniformly across subgroups ii) you can run a preliminary test (e.g. sit outside the store for half a day to get preliminary proportions of women entering for each age group).
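To see why the absolute level of the proportions matters (the third factor above), here is a small sketch that evaluates the p*(1 – p*) term from the denominator of Z for two pairs of proportions that are each one percentage point apart. The function name is my own, and equal sample sizes are assumed so that p* is just the average of p1 and p2.

```python
def pooled_variance_term(p1, p2):
    """Return p*(1 - p*) assuming n1 = n2, so p* = (p1 + p2) / 2."""
    p_star = (p1 + p2) / 2
    return p_star * (1 - p_star)


near_half = pooled_variance_term(0.50, 0.51)
near_top = pooled_variance_term(0.80, 0.81)

print(f"50% vs 51%: {near_half:.6f}")
print(f"80% vs 81%: {near_top:.6f}")
```

The term is about 0.250 near 50% and about 0.157 near 80%. For a fixed difference and a fixed target Z, the required n scales with this term, so the 80% vs 81% comparison needs a bit under two-thirds of the sample that 50% vs 51% does.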

So, how do you figure out sample sizes when there are so many factors at play? 

Figuring out Possibilities for Sample Sizes with Python

Ultimately, we want to make sure we’re able to detect a difference between p1 and p2 when it exists. So, let’s assume you know the “true” difference that exists between p1 and p2. Then, we can look at sample size requirements for various confidence levels and absolute levels of p1.

We need a way of figuring out Z, so we can determine whether a given sample size provides statistically significant results, so let’s define a function that returns the Z value given p1, p2, n1, and n2.
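A minimal sketch of such a function, following the Z formula above term by term. The name z_calc and the argument order are my own choices for illustration.

```python
import math


def z_calc(p1, p2, n1, n2):
    """Two-proportion Z statistic for proportions p1, p2 and sample sizes n1, n2."""
    # Pooled proportion of successes across both samples (p* in the text).
    p_star = (n1 * p1 + n2 * p2) / (n1 + n2)
    standard_error = math.sqrt(p_star * (1 - p_star) * (1 / n1 + 1 / n2))
    return (p2 - p1) / standard_error


# Example: 50% before, 55% after, 1,000 people in each sample.
print(round(z_calc(0.50, 0.55, 1000, 1000), 2))  # 2.24
```

Note that the sketch does not guard against a pooled proportion of exactly 0 or 1, where the denominator is zero.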

Then, we can define a function that returns the sample required, given p1 (the before probability), pdiff (i.e. p2 – p1), and alpha (the significance level, i.e. 1 minus the confidence level, which is the threshold the p-value is compared against). For simplicity we’ll just assume that n1 = n2. If you know in advance that n1 will be about a quarter of the size of n2, then it’s trivial to incorporate this into the function. However, you typically don’t know this in advance and in our scenario an equal sample assumption seems reasonable.

The function is fairly simplistic: it counts n up from 1 until n is large enough that the probability of seeing a statistic that large (i.e. the p-value) is less than alpha (at which point we would reject the null hypothesis that p1 = p2). The function uses the normal distribution available from the scipy library to calculate the p-value and compare it to alpha.
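Here is a sketch of that counting approach. Two things in it are assumptions on my part: it uses a one-sided p-value (since we only care about an increase in p2 over p1), and it swaps in statistics.NormalDist from the Python standard library as a stand-in for scipy’s normal distribution so the block runs without third-party packages. With scipy installed, scipy.stats.norm.cdf(z) would play the same role. The function names are illustrative.

```python
import math
from statistics import NormalDist


def z_calc(p1, p2, n1, n2):
    """Two-proportion Z statistic, using the pooled proportion p*."""
    p_star = (n1 * p1 + n2 * p2) / (n1 + n2)
    return (p2 - p1) / math.sqrt(p_star * (1 - p_star) * (1 / n1 + 1 / n2))


def sample_required(p1, p_diff, alpha):
    """Smallest n (with n1 = n2 = n) whose one-sided p-value is below alpha."""
    if p_diff <= 0:
        raise ValueError("p_diff must be positive")
    standard_normal = NormalDist()  # stdlib stand-in for scipy.stats.norm
    n = 1
    while True:
        z = z_calc(p1, p1 + p_diff, n, n)
        p_value = 1 - standard_normal.cdf(z)  # P(Z >= z) under the null
        if p_value < alpha:
            return n
        n += 1


# Per-group sample needed to detect a 5-point increase from 50% at alpha = 0.05.
print(sample_required(p1=0.50, p_diff=0.05, alpha=0.05))
```

The return value is the size of each sample, so the total number of observations across both periods is twice that.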

These functions we’ve defined provide the main tools we need to determine minimum sample levels required.

As mentioned earlier, one complication is that the sample required to detect a difference between p1 and p2 depends on the absolute level of p1. So, the first question we want to answer is “what p1 requires the biggest sample size to detect a given difference with p2?” Figuring this out gives you a sample size that is sufficient for any p1: if you calculate the sample for the p1 with the highest required sample, you know it’ll be enough for any other p1.

Let’s say we want to be able to detect a 5% difference at a 95% confidence level, and we need to find the p1 that gives us the largest sample required. We first generate a list in Python of all the p1 to look at, from 0% to 95%, and then use the sample_required function for each p1 to calculate the sample.
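A sketch of that step is below. So that the block runs on its own, it repeats the counting logic in a compact form (one-sided p-value, n1 = n2, and the standard library’s NormalDist standing in for scipy, all of which are my assumptions). The 5-point step between p1 values is also an assumption.

```python
import math
from statistics import NormalDist


def sample_required(p1, p_diff, alpha):
    """Smallest n (n1 = n2 = n) giving a one-sided p-value below alpha."""
    p_star = p1 + p_diff / 2  # pooled proportion when n1 = n2
    standard_normal = NormalDist()
    n = 1
    while True:
        z = p_diff / math.sqrt(p_star * (1 - p_star) * 2 / n)
        if 1 - standard_normal.cdf(z) < alpha:
            return n
        n += 1


# Candidate p1 values from 0% to 95% in 5-point steps.
p1s = [pct / 100 for pct in range(0, 100, 5)]
samples = [sample_required(p1, p_diff=0.05, alpha=0.05) for p1 in p1s]

for p1, n in zip(p1s, samples):
    print(f"p1 = {p1:.0%}: n = {n}")
```

The required n is largest where the pooled proportion is closest to 50% and falls away toward either extreme.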

Then, we plot the data with the following code.

Which produces this plot:

This plot makes it clear that p1 = 50% produces the highest sample sizes.

Using this information, let’s say we want to calculate the sample sizes required to detect differences between p1 and p2 where p2 – p1 is between 2% and 10%, and confidence levels are 95% or 99%. To ensure we get a sample large enough, we know to set p1 = 50%. We first write the code to build up the data frame to plot.
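One way this step could look is sketched below. It collects one record per combination of confidence level and difference, using the same compact counting function as before (one-sided p-value, n1 = n2, standard library NormalDist in place of scipy, all assumptions of mine). The column names are illustrative, and the resulting list of dicts can be passed to pandas.DataFrame(records) if you want an actual data frame.

```python
import math
from statistics import NormalDist


def sample_required(p1, p_diff, alpha):
    """Smallest n (n1 = n2 = n) giving a one-sided p-value below alpha."""
    p_star = p1 + p_diff / 2  # pooled proportion when n1 = n2
    standard_normal = NormalDist()
    n = 1
    while True:
        z = p_diff / math.sqrt(p_star * (1 - p_star) * 2 / n)
        if 1 - standard_normal.cdf(z) < alpha:
            return n
        n += 1


records = []
for alpha in (0.05, 0.01):
    for pct in range(2, 11):  # differences of 2% through 10%
        p_diff = pct / 100
        records.append({
            "confidence": f"{1 - alpha:.0%}",
            "p_diff": p_diff,
            "sample": sample_required(0.5, p_diff, alpha),  # per-group n
        })

for record in records:
    print(record)
```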

Then we write the following code to plot the data with Seaborn.

The final result is this plot:

This shows the minimum sample required to detect probability differences between 2% and 10%, for both 95% and 99% confidence levels. So, for example, detecting a difference of 2% at 95% confidence level requires a sample of ~3,500. Because the function sets n1 = n2 = n, that figure is the size of each sample rather than the combined total. So, in our example, you would need about 3,500 people walking into the store before the marketing intervention, and about 3,500 people after, to detect a 2% difference in probabilities at a 95% confidence level.

Conclusion

The example shows how Python can be a very useful tool for performing “back of the envelope” calculations, such as estimates of required sample sizes for tests where this determination is not straightforward. These calculations can save you a lot of time and money, especially when you’re thinking about collecting your own data for a research project.

How to Find Underrated People on Twitter with TURI (Twitter Underrated Index)

One of the best skills that you can develop is the ability to find talented people before anyone else does.

This great advice from Tyler Cowen (economist and blogger at Marginal Revolution) got me thinking: What are some strategies for finding talented but underrated people?

One possible source is Twitter. For a long time, I didn’t “get” Twitter, but after following Michael Nielsen’s advice I’m officially a convert. The key is carefully selecting the list of people you follow. If you do this well, your Twitter feed becomes a constant stream of valuable information and interesting people.

If you look carefully, you can find a lot of highly underrated people on Twitter, i.e. incredibly smart people that put out valuable and interesting content, but have a smaller following than you would expect.

These are the kinds of people that are the best to follow: you get access to insights that a lot of other people are not getting (since not many people are following them), and they are more likely to respond to queries or engage in discussion (since they have a smaller following to manage).

One option for finding these people is trial and error, but I wanted to see if it’s possible to quantify how underrated people are on Twitter and automate the process for finding good people to follow.

I call this the TURI (Twitter Underrated Index), because hey, it needs a name and acronyms make things sound so official.

Components of TURI

The index has three main components: Growth, Influence of Followers, and the Number of Followers.

Growth (G): The number of followers a user has per unit of content they have published (i.e. per tweet).

A user that is growing their Twitter following quickly suggests that they are underrated. It implies they are putting out quality content and people are starting to notice rapidly. The way I measure this is the number of followers a person acquires per Tweet.

Another possible measurement of growth is the number of followers the user has acquired per unit of time (i.e. number of followers divided by the length of time the Twitter account has existed). However, there are a couple of problems with this option:

  • Twitter accounts can be dormant for years. For example, someone might start an account but not tweet for 5 years and then put out great content. Measuring growth in terms of time would unfairly punish these people.
  • A person may have a strategy of Tweeting constantly. Some of the content results in followers, but the overall quality is still low. We are looking for people that publish great content, not necessarily people that put out a lot of content.

Influence of Followers (IF): The average number of the user’s followers’ followers.

In my opinion, the influence of a person’s following is the most important factor determining whether they are underrated on Twitter. Here are a few reasons why:

  • Influential people are, on average, better judges of good content.
  • Influential people are more selective in who they decide to follow, especially if Twitter is an important part of their online “brand”.
  • Influential people tend to engage with or are in some way related to other high quality people in their offline personal lives, and these people are more likely to appear in their Twitter feed even if they are not widely known or appreciated yet.

I’m somewhat biased toward this measure because, from my own personal experience, it has worked out really well when I browse through people who are followed by influential people on my feed. If I see someone followed by Tyler Cowen, Alex Tabarrok, Russ Roberts, Patrick Collison, and Marc Andreessen, and yet they only have 5,000 followers, then I’m pretty confident that person is currently underrated.

After some consideration, I believe the best way to measure the influence of a user’s followers with the data available in the Twitter API is taking the average number of the user’s followers’ followers.

I mulled over the possibility of using the median rather than the average, but decided against it: If someone with 1 million followers follows someone with 50 followers, I want to know more about that person, even though their TURI is high only because of that one highly influential follower. Outliers are good – we’re looking for diamonds in the rough.
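A quick illustration of the difference, using made-up numbers: an account with 50 followers, 49 of whom have 200 followers each and one of whom has a million.

```python
from statistics import mean, median

# Hypothetical follower counts for the 50 followers of one small account.
followers_of_followers = [200] * 49 + [1_000_000]

print(mean(followers_of_followers))    # pulled up by the one influential follower
print(median(followers_of_followers))  # ignores that follower entirely
```

The average comes out at 20,196 while the median stays at 200, so only the average flags this account as worth a closer look.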

Total Number of Followers (NF): The total number of followers the user currently has.

Our very definition of “underrated” in this context is when a user does not have as many followers as you would expect, so total number of followers is obviously going to play an important role in TURI.

So to summarize the main idea behind TURI: if a person has a large number of “influential” followers, is growing their number of followers quickly relative to the volume of content they put out, and they still have a small number of total followers, then they are likely underrated.

Defining the Index

For any user i, we can calculate their Twitter Underrated Index (TURI) as:

TURI_i = \frac{G_iIF_i}{NF_i}

Where G is growth, IF is influence of followers, and NF is the number of followers.

This formula has the general relationships we are looking for: high growth in followers for each unit of content, highly influential followers, and a low number of total followers all push TURI upward.

We can simplify the equation by rewriting G = NF / T, where T is the total number of tweets for the user. Cancelling out some terms, this gives us our final version of the index:

TURI_i = \frac{IF_i}{T_i}

In other words, our index of how underrated a person is on Twitter is given by the average number of user i’s followers’ followers per tweet of user i.
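As a rough sketch, the simplified index can be computed from two inputs: the follower counts of each of the user’s followers, and the user’s tweet count. The function name and the handling of edge cases are my assumptions. In particular, how an account with zero tweets should be scored is not specified above, so returning infinity here is just one possible choice.

```python
import math
from statistics import mean


def turi(follower_counts_of_followers, tweet_count):
    """TURI = IF / T: average followers-of-followers per tweet."""
    if not follower_counts_of_followers:
        return 0.0  # assumption: no followers means no measurable influence
    influence = mean(follower_counts_of_followers)  # IF
    if tweet_count == 0:
        return math.inf  # assumption: IF / T is undefined without tweets
    return influence / tweet_count


# Hypothetical user: three followers with 100, 300 and 5,000 followers, 60 tweets.
print(turi([100, 300, 5000], 60))  # 30.0
```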

Before calculating TURI for a group of users, there are a couple of pre-processing steps you will likely want to take to improve the results:

  1. Filter out verified accounts. One of the shortcomings of TURI is that a user’s growth / trajectory (i.e. G) will be very high if they are already a celebrity. These people typically have a large number of followers per Tweet not because of the content they put out, but because they’ve attained fame already elsewhere. Luckily, Twitter has a feature called a “verified account”, which applies “if it is determined to be an account of public interest. Typically this includes accounts maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas”. This is a prime group to filter out, because we are not looking for well-known people.
  2. Filter users by number of followers: There are a few reasons why you might want to only calculate TURI for users that have a following within some range (e.g. between 500 and 1,000):
    • There may be situations where a person with 500,000 followers is underrated, but this seems unlikely to be the kind of person you’re looking for, so it’s not worth the API resources.
    • Filtering by some upper follower threshold mitigates the risk of including celebrities without verified accounts.
    • You limit the number of calls you make to the API. The most costly operation in terms of API calls is figuring out the influence of followers. The more followers a person has, the more API calls required to calculate TURI.
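The two filters above could be sketched like this, treating each user as a plain dict. The keys verified and followers_count follow the field names of the Twitter API user object, but the function name, the sample users, and the default bounds (taken from the 500 to 1,000 example) are illustrative.

```python
def eligible_for_turi(user, min_followers=500, max_followers=1000):
    """Keep unverified users whose follower count is within the given range."""
    in_range = min_followers <= user["followers_count"] <= max_followers
    return not user["verified"] and in_range


# Hypothetical users for illustration only.
users = [
    {"screen_name": "small_account", "verified": False, "followers_count": 800},
    {"screen_name": "celebrity", "verified": True, "followers_count": 900},
    {"screen_name": "big_account", "verified": False, "followers_count": 500_000},
]

candidates = [u["screen_name"] for u in users if eligible_for_turi(u)]
print(candidates)  # ['small_account']
```

Filtering before computing the influence of followers keeps the expensive API calls limited to accounts you actually care about.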

Trying out TURI

To test out the index, I calculated its value on a subset of 49 people that Tyler Cowen follows who have 1,000 or fewer followers (Tyler blogs at my favourite blog, inspired this project, and has good taste in people to follow).

The graph below illustrates TURI for these users (not including 4 accounts that were not accessible due to privacy settings). 

As you can see, one user (@xgabaix) is a significant outlier. Looking a bit more closely at the profile, this is Xavier Gabaix, a well-known economist at Harvard University. His TURI is so high because he has several very influential followers and he has not tweeted yet.

So did TURI fail here? I don’t think so, since this is very likely someone to follow if he were actually Tweeting. However, it does seem a little strange to put someone at the top of the list that doesn’t actually have any Twitter content.

So, I filtered again for users that have published at least 20 tweets:

The following chart looks solely at the various users’ IF (Influence of their Followers). Interestingly, another user, @shanagement, has the most influential followers by far. However, they rank in third place for overall TURI since they tweet significantly more than @davidbrooks13 or @davidhgillen.

Limitations

Of course, TURI has some shortcomings:

  • Difficult to tell how well TURI works: The measures are based on intuition and there is obviously no “ground truth” data about how underrated Twitter users actually are. So, we don’t really have a data-based method for seeing how well the index works and improving it systematically. For example, you might question whether Growth G should be included in the index at all. I think there’s a good argument for it: if people get followers quickly per unit of content there must be something about that content that draws others. But, on the other hand, maybe they aren’t truly underrated. Maybe the truly underrated people have good content but your average Twitter user underestimates them even after reading a few posts. People don’t always know high quality when they see it.
  • It takes a fairly long time to calculate TURI: This is due to Twitter rate limits on API requests. For example, calculating TURI for the 49 Twitter users above took about an hour. It would take even longer for people with larger followings (remember, I only focused on people with 1,000 or fewer followers). So, if you want to do a large batch of people, it’s probably a good idea to run this on a server on an ongoing basis and store user and TURI information in a database.

Other ideas?

There are many different ways you could potentially specify an index like this. Leave a comment or reach out to me on Twitter or email if you have any suggestions.

One other possible tweak is accounting for the number of people that the user follows. I notice that some Twitter users seem to have a strategy of following a huge number of people in the hopes of being followed back. In these cases, their following strategy may be the cause of their high growth and not the quality of their content. One solution is to adjust the TURI by multiplying by (Number of User’s Followers) / (Number of People the User Follows). This would penalize people that, for example, have 15,000 followers but follow 15,000 people themselves.
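A sketch of that adjustment is below. The function name is hypothetical, and leaving the score unchanged when the user follows nobody is my own assumption, since the ratio is undefined in that case.

```python
def adjusted_turi(turi, followers_count, following_count):
    """Scale TURI by the ratio of followers to accounts followed."""
    if following_count == 0:
        return turi  # assumption: leave the score unadjusted
    return turi * followers_count / following_count


# Two hypothetical users with the same raw TURI of 30.
print(adjusted_turi(30.0, 15_000, 15_000))  # 30.0 (follows as many as follow them)
print(adjusted_turi(30.0, 5_000, 500))      # 300.0 (ten followers per account followed)
```

The follow-back strategist is not pushed down in absolute terms, but drops in the ranking relative to users who attract followers without following many people themselves.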

Technical Details

You can find the code I used to interact with the API and calculate TURI here. The code uses the python-twitter package, which provides a nice way of interacting with the Twitter API, abstracting away annoying details so you don’t have to deal with them (e.g. authenticating with OAuth, dealing with rate limits).