Getting Started with Airflow Using Docker

Lately I’ve been reading intensively on data engineering after being inspired by this great article by Robert Chang providing an introduction to the field.  The underlying message of the article really resonated with me: when most people think of data science they immediately think about the stuff being done by very mature tech companies like Google or Twitter, like deploying uber-sophisticated machine learning models all the time.

However, many organizations are not at the stage where these kind of models makes sense as a top priority. This is because, to build and deploy these kind of models efficiently and effectively, you need to have foundation data infrastructure in place that you can build the models on. Yes, you can develop a machine learning model with the data you have in your organization, but you have to ask: how long did it take you to do it, is your work repeatable / automatable, and are you able to deploy or actually use your solution in a meaningful and reliable way? This is where data engineering comes in: it’s all about building the data warehouses and ETL pipelines (extract-transform-load) that provide the fundamental plumbing required to do everything else.

One tool that keeps coming up in my research on data engineering is Apache Airflow, which is “a platform to programmatically author, schedule and monitor workflows”. Essentially, Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a particular order, and monitor / manage all of your tasks. It’s becoming very popular among data engineers / data scientists as a great tool for orchestrating ETL pipelines and monitor them as they run.

In this post, I’ll give a really brief overview of some key concepts in Airflow and then show a step-by-step deployment of Airflow in a Docker container.

Key Airflow Concepts

Before we get into deploying Airflow, there are a few basic concepts to introduce. See this page in the Airflow docs which go through these in greater detail and describe additional concepts as well.

Directed Acyclic Graph (DAG): A DAG is a collection of the tasks you want to run, along with the relationships and dependencies between the tasks. DAGs can be expressed visually as a graph with nodes and edges, where the nodes represent tasks and the edges represent dependencies between tasks (i.e. the order in which the tasks must run). Essentially, DAGs represent the workflow that you want to orchestrate and monitor in Airflow. They are “acyclic”, which means that the graph has no cycles – in English, this means means your workflows must have a beginning and an end (if there was a cycle, the workflow would be stuck in an infinite loop).

Operators: Operators represent what is actually done in the tasks that compose a DAG workflow. Specifically, an operator represents a single task in a DAG. Airflow provides a lot of pre-defined classes with tons of flexibility about what you can run as tasks. This includes classes for very common tasks, like BashOperator, PythonOperator, EmailOperator, OracleOperator, etc. On top of the multitude of operator classes available, Airflow provides the ability to define your own operators. As a result, a task in your DAG can do almost anything you want, and you can schedule and monitor it using Airflow.

Tasks: A running instance of an operator. During the instantiation, you can define specific parameters associated with the operator and the parameterized task becomes a node in a DAG.

Deploying Airflow with Docker and Running your First DAG

This rest of this post focuses on deploying Airflow with docker and it assumes you are somewhat familiar with Docker or you have read my previous article on getting started with Docker.

As a first step, you obviously need to have Docker installed and have a Docker Hub account. Once you do that, go to Docker Hub and search “Airflow” in the list of repositories, which produces a bunch of results. We’ll be using the second one: puckel/docker-airflow which has over 1 million pulls and almost 100 stars. You can find the documentation for this repo here. You can find the github repo associated with this container here.

So, all you have to do to get this pre-made container running Apache Airflow is type:

docker pull puckel/docker-airflow

And after a few short moments, you have a Docker image installed for running Airflow in a Docker container. You can see your image was downloaded by typing:

docker images

Now that you have the image downloaded, you can create a running container with the following command:

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Once you do that, Airflow is running on your machine, and you can visit the UI by visiting http://localhost:8080/admin/

On the command line, you can find the container name by running:

docker ps

You can jump into your running container’s command line using the command:

docker exec -ti <container name> bash

So in my case, my container was automatically named competent_vaughan by docker, so I ran the following to get into my container’s command line:

Running a DAG

So your container is up and running. Now, how do we start defining DAGs?

In Airflow, DAGs definition files are python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by defining the script and simply adding it to a folder ‘dags’ within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is: /usr/local/airflow/dags The thing is, you don’t want to jump into your container and add the DAG definition files directly in there. One reason is that the minimal version of Linux installed in the container doesn’t even have a text editor. But a more important reason is that jumping in containers and editing them is considered bad practice and “hacky” in Docker, because you can no longer build the image your container runs on from your Dockerfile. Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine with the Docker container. Anything you add to your local container will be added to the directory you connect it with in Docker. In our case, we’ll create a volume that maps the directory on our local machine where we’ll hold DAG definitions, and the locations where Airflow reads them on the container with the following command: docker run -d -p 8080:8080 -v /path/to/dags/on/your/local/machine/:/usr/local/airflow/dags puckel/docker-airflow webserver The DAG we’ll add can be found in this repo created by Manasi Dalvi. The DAG is called Helloworld and you can find the DAG definition file here. (Also see this YouTube video where she provides an introduction to Airflow and shows this DAG in action.) To add it to Airflow, copy Helloworld.py to /path/to/dags/on/your/local/machineAfter waiting a couple of minutes, refreshed your Airflow GUI and voila, you should see the new DAG Helloworld: You can test individual tasks in your DAG by entering into the container and running the command airflow test. First, you enter into your container using the docker exec command described earlier. Once you’re in, you can see all of your dags by running airflow list_dags. Below you can see the result, and our Helloworld DAG is at the top of the list: One useful command you can run on the command line before you run your full DAG is the airflow test command, which allows you to test individual tests as part of your DAG and logs the output to the command line. You specify a date / time and it simulates the run at that time. The command doesn’t bother with dependencies and doesn’t communicate state (running, success, failed, …) to the database, so you won’t see the results of the test in the Airflow GUI. So, with our Helloworld DAG, you could run a test on task_1 airflow test Helloworld task_1 2015-06-01 Note that when I do this, it appears to run without error; however, I’m not getting any logs output to the console. If anyone has any suggestions about why this may be the case, let me know. You can run the backfill command, specifying a start date and an end date to run the Helloworld DAG for those dates. In the example below, I run the dag 7 times, each day from June 1 – June 7, 2015: When you run this, you can see the following in the Airflow GUI, which shows the success of the individual tasks and each of the runs of the DAG. Resources Digging into Data Science Tools: Docker Docker is a tool for creating and managing “containers” which are like little virtual machines where you can run your code. A Docker container is like a little Linux OS, preinstalled with everything you need to run your web app, machine learning model, script, or any other code you write. Docker containers are like a really lightweight version of virtual machines. They use way less computer resources than a virtual machine, and can spin up in seconds rather than minutes. (The reason for this performance improvement is Docker containers share the kernel of the host machine, whereas virtual machines run a separate OS with a separate kernel for every virtual machine.) Aly Sivji provides a great comparison of Docker containers to shipping containers. Shipping containers improved efficiency of logistics by standardizing the design: they all operate the same way and we have standardized infrastructure for dealing with them, and as a result you can ship them regardless of transportation type (truck, train, or boat) and logistics company (all are aware of shipping containers and mold to their standards). In a similar way, Docker provides a standardized software container which you can pass into different environments and be confident they’ll run as you expect. Brief Overview of How Docker Works To give you a really high-level overview of how Docker works, first let’s define three big Docker-related terms – “Dockerfile”, “Image”, and “Container”: • Dockerfile: A text file you write to build the Docker “image” that you need (see definition of image below). You can think of the Dockerfile like a wrapper around the Linux command line: the commands that you would use to set up a Linux system on the command line have equivalents which you can place in a docker file. “Building” the Dockerfile produces an image that represents a Linux machine that’s in the exact state that you need. You can learn all about the ins-and-outs of the syntax and commands at the Dockerfile reference page. To get an idea of what Dockerfiles look like, here is a Dockerfile you would use to create an image that has the Ubuntu 15.04 Linux distribution, copy all the files from your application to ./app in the image, run the make command on /app within your image’s Linux command line, and then finally run the python file defined in /app/app.py: FROM ubuntu:15.04 COPY . /app RUN make /app CMD python /app/app.py • Image: A “snapshot” of the environment that you want the containers to run. The images include all you need to run your code, such as code dependencies (e.g. python venv or conda environment) and system dependencies (e.g. server, database). You “build” images from Dockerfiles which define everything the image should include. You then use these images to create containers. • Container: An “instance” of the image, similar to how objects are instances of classes in object oriented programming. You create (or “run” using Docker language) containers from images. You can think of containers as a running the “virtual machine” defined by your image. To sum up these three main concepts: you write a Dockerfile to “build” the image that you need, which represents the snapshot of your system at a point in time. From this image, you can then “run” one or more containers with that image. Here are a few other useful terms to know: • Volume: “Shared folders” that lets a docker container see the folder on your host machine (very useful for development, so your container is automatically updated with your code changes). Volumes also allow one docker container to see data in another container. Volumes can be “persistent” (the volume continues to exist after the container is stopped) or “ephemeral” (the volume disappears as soon as the container is stopped). • Container Orchestration: When you first start using Docker, you’ll probably just spin up one container at a time. However, you’ll soon find that you want to have multiple containers, each running using a different image with different configurations. For example, a common use of Docker is deployment of applications as “microservices”, where each Docker container represents an individual microservice that interacts with your other microservices to deliver your application. Since it can get very unwieldy to manage multiple containers manually, there are “container orchestration” tools that automate tasks such as starting up all your containers, automatically restarting failing containers, connecting containers together so they can see each other, and distributing containers across multiple computers. Examples of tools in this space include docker-compose and Kubernetes. • Docker Daemon / Docker Client: The Docker Daemon must be running on the machine where you want to run containers (could be on your local or remote machine). The Docker Client is front-end command line interface to interact with Docker, connect to the Docker Daemon, and tell it what to do. It’s through the Docker client where you run commands to build images from Dockerfiles, create containers from images, and do other Docker-related tasks. Why is Docker useful to Data Scientists? You might be thinking “Oh god, another tool for me to learn on top of the millions of other things I have to keep on top of? Is it worth my time to learn it? Will this technology even exist in a couple years? I think the answer is, yes, this is definitely a worthwhile tool for you to add to your data science toolbox. To help illustrate, here is a list of reasons for using Docker as a data scientist, many of which are discussed in Michael D’agostino’s “Docker for Data Scientists” talk as well as this Lynda course from Arthur Ulfeldt: • Creating 100% Reproducible Data Analysis: Reproducibility is increasingly recognized as critical for both methodological and legal reasons. When you’re doing analysis, you want others to be able to verify your work. Jupyter notebooks and Python virtual environments are a big help, but you’re out of luck if you have critical system dependencies. Docker ensures you’re running your code in exactly the same way every time, with the same OS and system libraries. • Documentation: As mentioned above, the basis for building docker containers is a “Dockerfile”, which is a line by line description of all the stuff that needs to exist in your image / container. Reading this file gives you (and anyone else that needs to deploy your code) a great understanding about what exactly is running on the container. • Isolation: Using Docker helps ensure that your tools don’t conflict with one another. By running them in separate containers, you’ll know that you can run Python 2, Python 3, and R and these pieces of software will not interfere with each other. • Gain DevOps powers: in the words of Michaelangelo D’Agostino, “Docker Democratizes DevOps”, since it opens up opportunities to people that used to only available to systems / DevOps experts: • Docker allows you to more easily “sidestep” DevOps / system administration if you aren’t interested, since someone can create a container for you and all you have to do it run it. Similarly, if you like working with Docker, you can create a container less technically savvy coworkers that lets them run things easily in the environment they need. • Docker provides the ability to build docker containers starting from existing containers. You can find many of these on DockerHub, which holds thousands of pre-built Dockerfiles and images. So if you’re running a well-known application (or even obscure applications), there is often a Dockerfile already available that can give you a tremendous running start to deploy your project. This includes “official” Docker repositories for many tools, such as ubuntu, postgres, nginx, wordpress, python, and much more. • Using Docker helps you work with your IT / DevOps colleagues, since you can do your Data Science work in a container, and simply pass it over to DevOps as a black box that they can run without having to know everything about your model. Here are a few examples of applications relevant to data science where you might try out with Docker: • Create an ultra-portable, custom development workflow: Build a personal development environment in a Dockerfile, so you can access your workflow immediately on any machine with Docker installed. Simply load up the image wherever you are, on whatever machine you’re on, and your entire work environment is there: everything you need to do your job, and how you want to do your job. • Create development, testing, staging, and production environments: Rest assured that your code will run as you expect and become able to create staging environments identical to production so you know when you push to production, you’re going to be OK. • Reproduce your Jupyter notebook on any machine: Create a container that runs everything you need for your Jupyter Notebook data analysis, so you can pass it along to other researchers / colleagues and know that it will run on their machine. As great as Jupyter Notebooks are for doing analysis, they tend to suffer from the “it works on my machine” issue, and Docker can solve this issue. For more inspiration, check out Civis Analytics Michaelangelo D’Agostino describe the Docker containers they use (start at the 18:08 mark). This includes containers specialized for survey processing, R shiny apps and other dashboards, Bayesian time series modeling and poll aggregation, as well as general purpose R/Python packages that have all the common packages needed for staff. Further Resources If you’re serious about starting to use Docker, I highly recommend the Lynda Course Learning Docker by Arthur Ulfeldt as a starting point. It’s well-explained and concise (only about 3 hours of video in total). Here are a few other useful resources you might want to check out: Creating PDF Reports with Python, Pdfkit, and Jinja2 Templates Once in a while as a data scientist, you may need to create PDF reports of your analyses. This seems somewhat “old school” nowadays, but here are a couple situations why you might want to consider it: • You need to make reports that are easily printable. People often want “hard copies” of particular reports they are running and don’t want to reproduce everything they did in an interactive dashboard. • You need to match existing reporting formats: If you’re replacing a legacy reporting system, it’s often a good idea to try to match existing reporting methods as your first step. This means that if the legacy system used PDF reporting, then you should strongly consider creating this functionality in the replacement system. This is often important for getting buy-in from people comfortable with the old system. I recently needed to do PDF reporting in a work assignment. The particular solution I came up with uses two main tools: We’ll install our required packages with the following commands: pip install pdfkit pip install Jinja2  Note that you also need to install a tool called wkhtmltopdf for pdfkit to work. Primer on Jinja2 Templates Jinja2 is a great tool to become familiar with, especially if you do web development in Python. In short, it lets you automatically generate text documents by programmatically filling in placeholder values that you assign to text file templates. It’s a very flexible tool, used widely in Python web applications to generate HTML for users. You can think of it like super high-powered string substitution. We’ll be using Jinja2 to generate HTML files of our reports that we will convert into PDFs with other tools. Keep in mind that Jinja2 can come in handy for other reporting applications, like sending automated emails or creating reports in other text file formats. There are two main components of working with Jinja2: • Creating the text file Jinja2 templates that contain placeholder values. In these templates, you can use a variety of Jinja2 syntax features that allow you to adjust the look of the file and how it loads the placeholder data. • Writing the python code that assigns the placeholder values to your Jinja2 templates and renders a new text string according to these values. Let’s create a simple template just as an illustration. This template will simply be a text file that prints out the value of a name. All you have to do it create a text file (let’s call it name.txt). Then in this file, simply add one line: Your name is: {{ name }} Here, ‘name’ is the name of the python variable that we’ll pass into the template, which holds the string placeholder that we want to include in the template. Now that we have our template created, we need to write the python code that fills in the placeholder values in the template with what you need. You do this with the render function. Say, we want to create a version of the template where the name is “Mark”. Then write the following code: Now, outputText holds a string of the template where {{ name }} is now equal to “Mark”. You can confirm this by writing the following on the command line: The arguments to template.render() are the placeholder variables contained in the template along with what you want to assign them to: template.render(placeholder_variable_in_template1=value_you_want_it_assigned1, placeholder_variable_in_template2=value_you_want_it_assigned2, ..., placeholder_variable_in_templateN=value_you_want_it_assignedN) There is much much more you can to with Jinja2 templates. For example, we have only shown how to render a simple variable here but Jinja2 allows more complex expressions, such as for loops, if-else statements, and template inheritance. Another useful fact about Jinja2 templates is you can pass in arbitrary python objects like lists, dictionaries, or pandas data frames and you are able to use the objects directly in the template. Check out Jinja2 Template Designer Documentation for a full list of features. I also highly recommend the book Flask Web Development: Developing Web Applications with Python which includes an excellent guide on Jinja2 templates (which are the built-in template engine for the Flask web development framework). Creating PDF Reports Let’s say you want to print PDFs of tables that show the growth of a bank account. Each table shows the growth rate year by year of$100, $500,$20,000, and \$50,000 dollars. Each separate pdf report uses a different interest rate to calculate the growth rate. We’ll need 10 different reports, each of which prints tables with 1%, 2%, 3%, …, 10% interest rates, respectively.

Lets first define the Pandas Dataframes that we need.

data_frames contains 10 dictionaries, each of which contain the data frame and the interest rate used to produce that data frame.

Next, we create the template file. We will generate one report for each of the 10 data frames above, and generate them by passing each data frame to the template along with the interest rate used.

After creating this template, we then write the following code to produce 10 HTML files for our reports.

Our HTML reports now look something like this:

As a final step, we need to convert these HTML files to PDFs. To do this, we use pdfkit. All you have to do is iterate through your HTML files and then use a single line of code from pdfkit to each file to convert it into a pdf.

All of this code combined will pop out the following HTML files with PDF versions:

You can then click on 1.pdf to see that we’re getting the results we’re looking for.

We’ve given a very stripped down example of how you can create reports using python in an automated way. With more work, you can develop much more sophisticated reports limited only by what’s possible with HTML.

How to Find Underrated People on Twitter with TURI (Twitter Underrated Index)

One of the best skills that you can develop is the ability to find talented people before anyone else does.

This great advice from Tyler Cowen (economist and blogger at Marginal Revolution) got me thinking: What are some strategies for finding talented but underrated people?

One possible source is Twitter. For a long time, I didn’t “get” Twitter, but after following Michael Nielsen’s advice I’m officially a convert. The key is carefully selecting the list of people you follow. If you do this well, your Twitter feed becomes a constant stream of valuable information and interesting people.

If you look carefully, you can find a lot of highly underrated people on Twitter, i.e. incredibly smart people that put out valuable and interesting content, but have a smaller following that you would expect.

These are the kinds of people that are the best to follow: you get access to insights that a lot of other people are not getting (since not many people are following them), and they are more likely to respond to queries or engage in discussion (since they have a smaller following to manage).

One option for finding these people is trial and error, but I wanted to see if it’s possible to quantify how underrated people are on Twitter and automate the process for finding good people to follow.

I call this the TURI (Twitter Underrated Index), because hey, it needs a name and acronyms make things sound so official.

Components of TURI

The index has three main components: Growth, Influence of Followers, and the Number of Followers.

Growth (G): The number of followers a user has per unit of content they have published (i.e. per tweet).

A user that is growing their Twitter following quickly suggests that they are underrated. It implies they are putting out quality content and people are starting to notice rapidly. The way I measure this is the number of followers a person acquires per Tweet.

Another possible measurement of growth is the number of followers the user has acquired per unit of time (i.e. number of followers divided by the length of time the Twitter account has existed). However, there are a couple of problems with this option:

• Twitter accounts can be dormant for years. For example, someone might start an account but not tweet for 5 years and then put out great content. Measuring growth in terms of time would unfairly punish these people.
• A person may have a strategy of Tweeting constantly. Some of the content results in followers, but the overall quality is still low. We are looking for people that publish great content, not necessarily people that put out a lot of content.

Influence of Followers (IF): The average number of the user’s follower’s followers.

In my opinion, the influence of a person’s following is the most important factor determining whether they are underrated on Twitter. Here’s a few reasons why:

• Influential people are, on average, better judges of good content.
• Influential people are more selective in who they decide to follow, especially if Twitter is an important part of their online “brand”.
• Influential people tend to engage with or are in some way related to other high quality people in their offline personal lives, and these people are more likely to appear in their Twitter feed even if they are not widely known or appreciated yet.

I’m somewhat biased toward this measure because, from my own personal experience, it has worked out really well when I browse through people who are followed by influential people on my feed. If I see someone followed by Tyler Cowen, Alex Tabarrok, Russ Roberts, Patrick Collison, and Marc Andreessen, and yet they only have 5,000 followers, then I’m pretty confident that person is currently underrated.

After some consideration, I believe the best way to measure the influence of a user’s followers with the data available in the Twitter API is taking the average number of the user’s follower’s followers.

I mulled over the possibility of using the median rather than the average, but decided against it: If someone with 1 million followers follows someone with 50 followers, I want to know more about that person, even though their TURI is high only because of that one highly influential follower. Outliers are good – we’re looking for diamonds in the rough.

Total Number of Followers (NF): The total number of followers the user currently has.

Our very definition of “underrated” in this context is when a user does not have as many followers as you expect, so total number of followers is obviously going to play an important role in TURI.

So to summarize the main idea behind TURI: if a person has a large number of “influential” followers, is growing their number of followers quickly relative to the volume of content they put out, and they still have a small number of total followers, then they are likely underrated.

Defining the Index

For any user i, we can calculate their Twitter Underrated Index (TURI) as:

$TURI_i = \frac{G_iIF_i}{NF_i}$

Where G is growth, IF is influence of followers, and NF is the number of followers.

This formula has the general relationships we are looking for: high growth in users for each unit of content, highly influential followers, and a low number of total followers all push TURI upward.

We can simplify the equation by rewriting G = NF / T, where T is the total number of tweets for the user. Cancelling out some terms, this gives us our final version of the index:

$TURI_i = \frac{IF_i}{T_i}$

In other words, our index of how under-rated a person is on twitter is given by the average number of i’s followers followers per tweet of user i.

Before calculating TURI for a group of users, there are a couple of pre-processing steps you will likely want to take to improve the results:

1. Filter out verified accounts. One of the shortcomings of TURI is that a user’s growth / trajectory (i.e. G) will be very high if they are already a celebrity. These people typically have a large number of followers per Tweet not because of the content they put out, but because they’ve attained fame already elsewhere. Luckily, Twitter has a feature called a “verified account”, which applies “if it is determined to be an account of public interest. Typically this includes accounts maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas”. This is a prime group to filter out, because we are not looking for well-known people.
2. Filter users by number of followers: There are a few reasons why you might want to only calculate TURI for users that have a following within some range (e.g. between 500 and 1,000):
• Although there may be situations where a person with 500,000 followers is underrated, but this seems unlikely to be the kind of person you’re looking for so not worth the API resources.
• Filtering by some upper follower threshold mitigates the risk of including celebrities without verified accounts.
• You limit the number of calls you make to the API. The most costly operation in terms of API calls is figuring out the influence of followers. The more followers a person has, the more API calls required to calculate TURI.

Trying out TURI

To test out the index, I calculated its value on a subset of 49 people that Tyler Cowen follows who have 1,000 or fewer followers (Tyler blogs at my favourite blog, inspired this project, and has good taste in people to follow).

The graph below illustrates TURI for these users (not including 4 accounts that were not accessible due to privacy settings).

As you can see, one user (@xgabaix) is a significant outlier. Looking a bit more closely at the profile, this is Xavier Gabaix, a well known economist at Harvard University. His TURI is so high because he has several very influential followers and he has not tweeted yet.

So did TURI fail here? I don’t think so, since this is very likely someone to follow if he was actually Tweeting. However, it does seem a little strange to put someone at the top of the list that doesn’t actually have any Twitter content.

So, I filtered again for users that have published at least 20 tweets:

The following chart looks solely at the various user’s IF (Influence of their Followers). Interestingly, another user @shanagement has the most influential followers by far. However, they rank in third place for overall TURI since they tweet significantly more than @davidbrooks13 or @davidhgillen.

Limitations

Of course, TURI has some shortcomings:

• Difficult to tell how well TURI works: The measures are based on intuition and there is obviously no “ground truth” data about how underrated twitter users actually are. So, we don’t really have a data-based method for seeing how well the index works and improving it systematically. For example, you might question is whether Growth G should be included in the index at all. I think there’s a good argument for it: if people get followers quickly per unit of content there must be something about that content that draws others. But, on the other hand, maybe they aren’t truly underrated. Maybe the truly underrated people have good content but your average Twitter user underestimates them even after reading a few posts. People don’t always know high quality when the see it.
• It takes a fairly long time to calculate TURI: This is due to Twitter rate limits of API requests. For example, calculating TURI for 49 Twitter users above took about an hour. It would take even longer for people with larger followings (remember, I only focused on people with 1,000 or fewer followers). So, if you want to do a large batch of people, it’s probably a good idea run this on a server on an ongoing basis and store user and TURI information into a database.

Other ideas?

There are many many different ways you could potentially specify an index like this. Leave a comment or reach out to me on Twitter or email if you have any suggestions.

One other possible tweak is accounting for the number of people that the user follows. I notice that some Twitter users seem to have a strategy of following a huge number of people in the hopes of being followed back. In these cases, their following strategy may be the cause of their high growth and not the quality of their content. One solution is to adjust the TURI by multiplying by (Number of User’s Followers) / (Number of People the User Follows). This would penalize people that, for example, have 15,000 followers but follow 15,000 people themselves.

Technical Details

You can find the code I used to interact with the API and calculate TURI here. The code uses the python-twitter package, which provides a nice way of interacting with the Twitter API, abstracting away annoying details so you don’t have to deal with them (e.g. authenticating with OAuth, dealing with rate limits).

Digging into Data Science Tools: Using Plotly’s Dash to Build Interactive Dashboards

Dash is a tool designed for building interactive dashboards and web applications using only Python (no CSS, HTML, or JavaScript required). I came across Dash while surveying options for building dashboards and reporting tools in my current position as Data Scientist with the City of Winnipeg Transportation Division.

Why use Dash?

Some of the lowest-hanging fruit working as a data science involves simply getting the data in front of the right people in a way that is easily digestible and actionable. No fancy ML, just presenting data to the right people at the right time in the right format. Dashboards are often a great way of doing that.

However, they can be a pain to build. Off-the-shelf dashboard tools like Tableau and PowerBI can be expensive and limiting (you are constrained to the features that they choose to include in their software). Developing dashboards from scratch as a web application is also a pain, since that often requires writing CSS, HTML, and JavaScript: as a data scientist, your focus is typically programming in python or R and your job typically does not require being an expert in JavaScript.

Dash strikes a perfect middle ground for a data scientist, providing way more flexibility and customizability than off-the-shelf tools at a much lower cost (i.e. free), while still having an easy-to-use API that a python-oriented data scientist should not have difficulty grasping.

A Demo Dash app of Blog Post Data from Marginal Revolution

To become familiar with Dash and see what it can do, I started out with a simple dashboard using some of the the scraped blog post data from Marginal Revolution (see this post on the MR scraper project). You can find the dashboard here, which provides an interactive chart of the number of posts on Marginal Revolution over time for each author. You can choose any of the authors in the drop down box, and the relevant data shows up in the chart:

I have to say, I can’t believe how easy it was to build this dashboard. In total, it’s only 33 lines of python code (you can find the code here. It was also easy to deploy to heroku, following this guide.

How Dash Works

I highly recommend going through the Dash tutorial here, which walks you through app creation and the main components of Dash. Here I’ll give a high level overview of some key points.

Dash apps have two components: a layout component which describes how the application looks visually, and an interactivity component which specifies how your dashboard interacts with the user and responds to input.

Layout Component

The layout defines what the dashboard looks like. When defining the layout of your application, you create a hierarchical “tree” of components, and each component will come from one of the following two main classes:

• dash_html_components: This includes a component to represent every HTML tag. E.g. dash_html_components.H1(children = ‘Hello Dash’) adds <h1>Hello Dash</h1> to your dashboard.
• dash_core_components: This describes higher-level components that combine HTML, JavaScript, and CSS created through the React.js library. These components have an interactive element to them (e.g. graphs with tooltips). One of the key classes within dash_core_components is Graph. Graph components represent data visualizations created using plotly.js (Plotly is the company that created Dash). Graph provides 35 different chart types to meet your needs. Another nice feature included within dash_core_components is the ability to include Markdown, which is often more convenient for presenting text based copy in your dashboards. There are many other options available to you in the dash_core_components library to use in your dashboard (e.g. sliders, checkboxes, date ranges, interactive tables, tabs, and more – see the gallery here).

To define the layout for a page and the data contained within it, you nest these various components as appropriate. You assign this nested list to app.layout, where app = dash.Dash(). To see an example of how these components fit together and how they are rendered in the browser, see the getting started page on the Dash tutorial.

Interactivity Component

Another key feature of Dash is its ability to respond to user input and change the visualization accordingly (dashboards aren’t really of much use without this feature). Dash makes this functionality quite easy, although there is a bit too much to go into great detail in this blog post. Read part 2 of the Dash user guide.

In short, Dash lets you add interactivity by adding an @app.callback decorator to a Python function you write which takes the input from one of the layout components and returns outputs to send to another layout component that you want to change according to the new input.

You can have multiple inputs and multiple outputs. For example, if you want a graph that gives the user the option of filtering by multiple variables in your dataset, you can use multiple input filters. The function that you apply the @app.callback decorator to automatically fires whenever there are any changes to any of the inputs.

Deploying to Production

Dash uses the python Flask web application framework under the hood. Flask is an awesome, lightweight web application framework for Python that is definitely worth knowing (and relatively easy to learn given how stripped down it is). You can access the underlying Flask app instance using app.server where app is the instance of your dash app (i.e. app = Dash()).

Since the Dash app is essentially a Flask app under the hood, deploying a Dash app is the same as deploying a Flask app, and there are many guides online for deploying flask apps in a variety of situations. The Dash website provides some more details about deployment as well as steps for deploying to Heroku.

Comment below if you have any questions or if you want to share your experience using Dash!

Further Resources

Big Transportation Data for Big Cities Conference: My Takeaways

For a long time, I’ve been interested in transportation and urban economics. When I was doing my Masters, I planned to specialize in these areas if I continued on to a PhD. So, when I saw a job position open for Data Scientist at the City of Winnipeg Transportation Assets Division, I didn’t have to spend much more than two seconds considering whether I would apply.

Well, a few months have passed and I’m happy to announce that I was successful: I’m starting the position this week. To say I’m excited is a huge understatement. The Division has been doing very great things with the recent development of the Transportation Management Centre (TMC) and I’m looking forward to being a part of these cutting edge efforts to improve the City’s transportation system.

To get up to speed, I’ve been looking through various sources to get an idea what municipalities have been up to in this space. I was pointed to the Big Transportation Data for Big Cities Conference, which took place in 2016 in Toronto and involved transportation leaders from 18 big cities across North America. The presentations are all available online and are a great source to understand the kind of transportation data cities are collecting, how they’re using it, possibilities for future use, and challenges that remain.

How cities are using transportation data

Municipalities are collecting unprecedented amounts of data and working to apply  it in a variety of ways. Steve Buckley from the City of Toronto Transportation Services provides a useful categorization of the main areas of use for city transportation data: describing, evaluating, operating, predicting, and planning.

Describing (Understanding)

A fundamental application of the transportation data flowing into municipalities is simply to provide situational awareness about what is actually happening on the ground. This understanding is a prerequisite to all other forms of data use.

In the past, this was hard and expensive, but with widespread GPS, mobile applications, wireless communication technology, and inexpensive sensors, this kind of descriptive data is becoming cheaper to collect, easier to collect, and more detailed.

There appears to still be a lot of “low hanging fruit” for improving safety and congestion by simply having more detailed data and observing what is actually happening on the ground. For example, one particularly interesting presentation from Nat Gale from the Los Angeles Department of Transportation points out that only 6% of their streets account for 65% of deaths and serious injuries for people walking and biking (obviously prime targets for safety improvements). His presentation goes on to describe how they installed a simple and inexpensive “scramble” pedestrian crossing at one of the most dangerous intersections in the city (Hollywood / Highland) and this appears to have increased the safety of the intersection dramatically.

Evaluating (Measuring)

While descriptive data is crucial, it is not sufficient. You also need to understand what is most important in the data (i.e. key performance indicators) and have reliable ways of figuring out whether an intervention (e.g. light timing change) actually produced better results.

Along these lines, one particularly interesting presentation was from Dan Howard (San Francisco Municipal Transportation Agency) on their use of transit arrival and departure data to determine transit travel times (no GPS data required). Using this data, they can compare travel times before and after interventions, and understand the source of delays by simply examining the statistical distribution of travel times (e.g. lognormal distribution means good schedule adherence, normal distribution implies random events affect travel times, and multiple peaks indicate intersection / signal issues).

Operating

A key theme throughout many of the presentations is the potential benefits of being able to get traffic data in real time. For example, several municipalities have live real-time camera observations, weather data, and mobile application data (among other sources). These sources can provide real-time insight into operational improvements, such as real time congestion and light timing adjustment, traffic officer deployment planning, construction management, and detecting equipment / mechanical failures.

Predicting

The improved detail of data, the real-time nature of the data, and evaluation techniques come together to enable a variety of valuable predictive analytics allowing municipalities to take proactive response (e.g. determining the locations at highest risk of congestion or accidents and preventing accidents before they happen).

Planning (Prioritizing investments)

With improved data and improved insights from the data, municipalities can do better planning of investments to yield the highest value in terms of some target (e.g. commute times, accidents).

Municipalities are starting to capitalize on the benefits of open data

One common thread throughout many of the presentations is the benefits of opening up city data to the public, third parties, and other government departments. Although this is not without its challenges, there are many potential benefits.

Personally, as a data-oriented person, I’m particularly gung-ho about opening data up to the public, as long as the data does not infringe on anyone’s privacy and the cost of making the data public is not too high. I feel like this should be almost a moral imperative of public institutions – if you’re collecting public data, then the public should be able to access that data (again, after considering privacy concerns and resource constraints).

But there are much more selfish reasons other than moral principle for cities to open up the data, and based on these presentations, municipalities starting to understand these benefits.

One important advantage is by making the data public, you create opportunities for others to do analysis or write software applications that your organization simply does not have the resources to do. For example, it may not be a core competency of a transportation department to build, deploy, and maintain mobile applications. However, many people want something like this to exist, and making transit schedules accessible through a public API facilitates others to do this work. In these cases, the municipality plays the role of enabler.

Another thing to consider is that people can be quite ingenious and figure out things to do with the data that you never dreamed of. By making the data public, you can crowdsource the ingenuity and resourcefulness of citizens for the benefit of the public. Municipalities can do this not only by opening the data, but also by hosting public events such as urban data challenges or open data hackathons. Sara Diamond from OCAD University went through several examples of clever visualizations and related projects resulting from open transit data.

Another advantage of opening data is that it promotes collaboration with other municipalities and other departments within a single municipality. Opening the data builds competencies that can come in handy even if the data is not made public: for example, it may help a municipality share critical transportation data with other departments (e.g. emergency response teams).

This collaborative approach seems central for many municipalities in the conference. For example, Abraham Emmanuel from the City of Chicago talks about the City’s Transportation Management Center, which is working to “develop an integrated and modular system that can be accessed from anywhere on the City network” and “create interfaces with external systems to collect and share data” (where “external systems” can include the Chicago Transit Authority, Utilities, Third Parties, and others).

Municipalities are opening up to open source

Increasingly, municipalities are beginning to understand the value of open source software and incorporating it into their operations. Bibiana McHugh from TriMet Portland provides a useful comparison of the advantages of proprietary software versus open source software, with open source providing more control, fostering innovation / competition, resulting in a broader user and developer base, and the low entry costs.

Catherine Lawson from the The University at Albany Visualization and Informatics Lab (AVAIL) similarly presents benefits of open source, noting advantages such as defensible outputs (open platforms allow for 3rd party verification of output) and trustworthiness (open platforms can lead to a robust shared confidence in outcomes). In contrast, the advantages of proprietary models include alignment with procurement processes and the fact that it is the traditional, (currently) best-understood model.

Perhaps the best illustration of open source in action is given in Holly Krambeck’s (World Bank) presentation showing how open-source solutions can “leapfrog” traditional intelligent transportation systems in resource-constrained cities. She talks about the OpenTraffic program where “data providers” (e.g. taxi hailing companies) collect GPS location data from mobile devices host an open-source application called “Traffic Engine” that translates the raw GPS data into anonymized traffic statistics. These are sent to an server, pooled with other data providers statistics, and served with an API for users to access the data. OpenTraffic is built using fully open-source software and you can find a detailed report of how the project works here.

I think this is very exciting not just for the municipalities that reap the benefits of open source, but for programmers who now have the opportunity to build a reputation for themselves and their city, all while contributing a public good that benefits everyone.

Challenges

Of course, there are challenges that come along with the opportunities of producing large scale, highly detailed transportation data. Mark Fox from the University of Toronto Transportation Research Institute has an extremely useful presentation outlining some of the main challenges often associated with open city data. These include:

• Granularity (datasets often have different level of aggregation),
• Completeness: important to think carefully about what to open to the public and having a reason behind opening it
• Interoperability: datasets across different departments may describe similar things but may not be comparable due to slightly differing schemas / data types)
• Complexity: the data presented may be very complex and thus the public presentation of that data is limited
• Reliability: whenever you collect data, there are questions of the reliability of the data that limit the ability to use it and apply it.
• Empowerment: This is an interesting challenge I had not considered, which refers to the the incentives often built into government organizations to avoid failure at all costs and not engage in any risk-taking through experimentation. There also may tend to be a focus on short-term delivery of political goals and a lack of a long-term strategy of innovation.

Ann Cavoukian from Ryerson University (and formerly the Information and Privacy Commissioner for Ontario) adds privacy to this list of challenges. Her presentation focuses entirely on these issues, along with “Privacy by Design” standards to help mitigate these risks. She points out that extensive data collection and analytics can lead to “expanded surveillance, increasing the risk of unauthorized use and disclosure, on a scale previously unimaginable”. With recent privacy and data breach scandals from Equifax and Facebook since this presentation took place, I assume these issues are even more at the forefront of municipalities’ concerns with respect to transportation data.

Why Data Scientists Should Join Toastmasters

Public speaking used to be a big sore spot for me. I was able to do it, but I truly hated it and it caused me a great deal of grief. When I knew I had to speak it would basically ruin the chunk of time from when I knew I would have to speak to when I did it. And don’t get me started on impromptu speaking – whenever something like that would pop up, I would feel pure terror.

Things got a bit better once I entered the working world and had to speak somewhat regularly, giving presentations to clients and staff and participating in meetings. But still, I hated it. I thought I was no good and had a lot of anxiety associated with it.

A little over a year ago, I finally had enough and decided I needed to do something about it. I joined the Venio Dictum toastmasters group in Winnipeg. After only a few months, I started to become much more relaxed and at ease when speaking. Today, one year later, I actually look forward to giving speeches and facing the challenge of impromptu speaking. A year ago, if you told me I would feel this way now, I wouldn’t have believed you.

Imagine being the type of person that volunteers to address a crowd or give a toast off the cuff. Imagine looking forward to speaking at a wedding, meeting, or other event. Imagine being totally comfortable in one of those “networking event” situations where you enter a room crowded full of people you don’t know. A year ago, I used to think people that enjoyed this stuff were from another planet. Now, I understand this attitude and I feel like I’m getting there myself.

Why should a Data Scientist care about speaking skills?

• Communication is a critical part of the job

Yes, a huge part of being a data scientist is having skills in mathematics, statistics, machine learning, programming, and having domain expertise.

However, technical skills are not anywhere close to the entire picture. You might be fantastic at data analysis, but if you aren’t able to communicate your results to others, your work is basically useless. You’ll wind up producing great analysis that ultimately never get implemented or even considered at all because you failed to properly explain its value to anyone.

Speaking is a huge part of communication, so you need to be good at it (the other big area of communication is writing, but that’s a topic for another day).

To get to the next level in your career (and to get a data scientist job in the first place), it really helps to be a confident and persuasive speaker.

Job interviews are a great example. When you apply for a job, there will always be an interview component where you’ll have to speak and answer questions you have not prepared for in advance. Even if your resume and portfolio look great, it’s going to be hard for an employer to hire you if you bomb the interview.

This also applies to promotions from within your current company. Advanced positions typically require rely more on communication and management skills like speaking and less on specific technical skills.

• Network / connection building

Improved speaking doesn’t just make your presentations better: it makes your day-to-day communications with colleagues and acquaintances better too. You’ll become a better conversationalist and a better listener.

As a data scientist, you’ll likely be working with multiple teams within your organization and outside your organization. You will need to gain their trust and support, and better speaking helps you do that.

• It makes you a better thinker / learner

The motto of my toastmasters club is “better listening, thinking, and speaking” because a huge part of speaking is learning how to organize your thoughts in a clear package so they are ready for delivery to your audience. As George Horace Latimer said in his book Letters from a Self Made Man to his Merchant Son:

“There’s a vast difference between having a carload of miscellaneous facts sloshing around loose in your head and getting all mixed up in transit, and carrying the same assortment properly boxed and crated for convenient handling and immediate delivery.”

Having a lot of disparate facts in your head is not very useful if they are not organized in a way that lets you easily access them when the time is right. Preparing a speech forces you to organize your thoughts, create a coherent narrative, and understand the principles underlying the ideas that you’re trying to communicate.

This habit of understanding the underlying rules and principles behind what you learn is referred to by psychologists as “structure building” or “rule learning”. As described in the book Make it Stick, people who do this as a habit are more successful learners than people that take everything they learn at face value, never extracting principles that can be applied to new situations. Public speaking cultivates this skill.

This is particularly important for data scientists, given the incredibly diverse range of subjects we are required to develop expertise in and the constantly evolving nature of our field. To manage this firehose of information, we must have efficient learning habits.

One great thing about Toastmasters is you can give a speech on any topic you want. So go ahead, give a speech on deep reinforcement learning to help solidify your understanding of the topic (but explain in a way that your grandmother could understand).

• Speaking is a fundamental skill that will impact your life in many other ways

Speaking is a great example of a highly transferable skill that pays off no matter what you decide to do. Since it deeply pervades everything we do in our personal and professional lives, the ROI for improving is tremendous. (In my opinion, some other skills that fall into this category include writing, sales, and negotiation.)

Consider all the non-professional situations in your life where you speak to others (e.g. your spouse, kids, parents, family, acquaintances, community groups). Toastmasters will make all of these interactions more fruitful.

Suppose you decide data science isn’t right for you. Well, you can be close to 100% sure the speaking skills you develop through Toastmasters will still be valuable whatever you decide to do instead.

In terms of the 80-20 rule, working on your public speaking is definitely part of the 20% that yields 80% of the results. It’s worth your time.

How Does Toastmasters Work?

Although each club may do things a little differently, all use the same fundamental building blocks to make you a better speaker:

• Roles: At each meeting, there are a list of possible “roles” that you can play. Each of these roles trains you in a different public speaking skill. For example, the “grammarian” observes everyones language and eloquence, “gruntmaster” monitors all the “ums”, “uhs”, “likes”, “you knows” etc. There is a “Table Topics Master” role where you propose random questions to members they have not prepared for in advance and they have to do an impromptu speech about it for two minutes (an incredibly valuable training exercise, especially if you fear impromptu speaking). Here are the complete list of roles and descriptions of the roles in my club.
• Prepared speeches: Of course, there are also tons of opportunities to provide prepared speeches. Toastmasters provides you with manuals listing various speeches to prepare that give you practice in different aspects of public speaking. You do these speeches at your own pace.
• Evaluations: Possibly the most valuable feature of Toastmasters is that everyone’s performance is evaluated by others. In a Toastmasters group, you’ll often have a subset of members that are very skilled and experienced speakers (my club has several members that have been with the club 25+ years), and the feedback you get can be invaluable. It’s crucial for improvement, and it’s something you don’t usually get when speaking in your day-to-day life.

Go out and find a local Toastmasters group and at check it out as a guest to see how it works up close. They will be more than happy to have you. You owe it to yourself to at least give it a try. If your experience is anything like mine, you’ll be kicking yourself for not starting earlier (and by “earlier” I mean high school or even sooner – it’s definitely something I’m going to encourage my daughter to do).