Getting Started with Airflow Using Docker

Lately I’ve been reading intensively on data engineering after being inspired by this great article by Robert Chang providing an introduction to the field. The underlying message of the article really resonated with me: when most people think of data science, they immediately think of the work being done at very mature tech companies like Google or Twitter, such as deploying uber-sophisticated machine learning models all the time.

However, many organizations are not at the stage where these kinds of models make sense as a top priority. This is because, to build and deploy these kinds of models efficiently and effectively, you need foundational data infrastructure in place that you can build the models on. Yes, you can develop a machine learning model with the data you have in your organization, but you have to ask: how long did it take you to do it, is your work repeatable / automatable, and are you able to deploy or actually use your solution in a meaningful and reliable way? This is where data engineering comes in: it’s all about building the data warehouses and ETL (extract-transform-load) pipelines that provide the fundamental plumbing required to do everything else.

One tool that keeps coming up in my research on data engineering is Apache Airflow, which is “a platform to programmatically author, schedule and monitor workflows”. Essentially, Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a particular order, and monitor / manage all of your tasks. It’s becoming very popular among data engineers / data scientists as a great tool for orchestrating ETL pipelines and monitoring them as they run.

In this post, I’ll give a really brief overview of some key concepts in Airflow and then show a step-by-step deployment of Airflow in a Docker container.

Key Airflow Concepts

Before we get into deploying Airflow, there are a few basic concepts to introduce. See this page in the Airflow docs, which goes through these in greater detail and describes additional concepts as well.

Directed Acyclic Graph (DAG): A DAG is a collection of the tasks you want to run, along with the relationships and dependencies between the tasks. DAGs can be expressed visually as a graph with nodes and edges, where the nodes represent tasks and the edges represent dependencies between tasks (i.e. the order in which the tasks must run). Essentially, DAGs represent the workflow that you want to orchestrate and monitor in Airflow. They are “acyclic”, which means that the graph has no cycles – in English, this means your workflows must have a beginning and an end (if there was a cycle, the workflow would be stuck in an infinite loop).

Operators: Operators represent what is actually done in the tasks that compose a DAG workflow. Specifically, an operator represents a single task in a DAG. Airflow provides a lot of pre-defined classes with tons of flexibility about what you can run as tasks. This includes classes for very common tasks, like BashOperator, PythonOperator, EmailOperator, OracleOperator, etc. On top of the multitude of operator classes available, Airflow provides the ability to define your own operators. As a result, a task in your DAG can do almost anything you want, and you can schedule and monitor it using Airflow.

Tasks: A running instance of an operator. During the instantiation, you can define specific parameters associated with the operator and the parameterized task becomes a node in a DAG.

Deploying Airflow with Docker and Running your First DAG

The rest of this post focuses on deploying Airflow with Docker, and it assumes you are somewhat familiar with Docker or have read my previous article on getting started with Docker.

As a first step, you obviously need to have Docker installed and have a Docker Hub account. Once you do that, go to Docker Hub and search “Airflow” in the list of repositories, which produces a bunch of results. We’ll be using the second one: puckel/docker-airflow, which has over 1 million pulls and almost 100 stars. You can find the documentation for this repo here, and the associated GitHub repo here.

So, all you have to do to get this pre-made container running Apache Airflow is type:

docker pull puckel/docker-airflow

And after a few short moments, you have a Docker image installed for running Airflow in a Docker container. You can see your image was downloaded by typing:

docker images

Now that you have the image downloaded, you can create a running container with the following command:

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Once you do that, Airflow is running on your machine, and you can visit the UI at http://localhost:8080/admin/

On the command line, you can find the container name by running:

docker ps

You can jump into your running container’s command line using the command:

docker exec -ti <container name> bash

In my case, my container was automatically named competent_vaughan by Docker, so I ran the following to get into its command line:
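docker exec -ti competent_vaughan bash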

Running a DAG

So your container is up and running. Now, how do we start defining DAGs?

In Airflow, DAG definition files are python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by defining the script and simply adding it to the ‘dags’ folder within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is:

/usr/local/airflow/dags

The thing is, you don’t want to jump into your container and add the DAG definition files directly in there. One reason is that the minimal version of Linux installed in the container doesn’t even have a text editor. But a more important reason is that jumping into containers and editing them is considered bad practice and “hacky” in Docker, because the changes you make can no longer be reproduced by building the image your container runs on from your Dockerfile.

Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine and the Docker container. Anything you add to the directory on your local machine will show up in the directory you connect it with in the container. In our case, we’ll create a volume that maps the directory on our local machine where we’ll hold DAG definitions to the location where Airflow reads them in the container, with the following command:

docker run -d -p 8080:8080 -v /path/to/dags/on/your/local/machine/:/usr/local/airflow/dags puckel/docker-airflow webserver

The DAG we’ll add can be found in this repo created by Manasi Dalvi. The DAG is called Helloworld and you can find the DAG definition file here. (Also see this YouTube video where she provides an introduction to Airflow and shows this DAG in action.)
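If you’re curious what a DAG definition like this looks like, here is a minimal sketch using the Airflow 1.x API that ships in this container (the actual Helloworld.py in the linked repo may differ in its details):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_hello():
    return "Hello world!"

dag = DAG("Helloworld",
          default_args={"owner": "airflow", "start_date": datetime(2015, 6, 1)},
          schedule_interval="@daily")

# Two trivial tasks; task_2 runs only after task_1 succeeds.
task_1 = PythonOperator(task_id="task_1", python_callable=print_hello, dag=dag)
task_2 = PythonOperator(task_id="task_2", python_callable=print_hello, dag=dag)
task_1 >> task_2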

To add it to Airflow, copy Helloworld.py to /path/to/dags/on/your/local/machine. After waiting a couple of minutes, refresh your Airflow GUI and voila, you should see the new DAG Helloworld:

You can test individual tasks in your DAG by entering the container and running the command airflow test. First, enter your container using the docker exec command described earlier. Once you’re in, you can see all of your DAGs by running airflow list_dags. Below you can see the result, and our Helloworld DAG is at the top of the list:

One useful command you can run on the command line before you run your full DAG is the airflow test command, which allows you to test individual tasks in your DAG and logs the output to the command line. You specify a date / time and it simulates the run at that time. The command doesn’t bother with dependencies and doesn’t communicate state (running, success, failed, …) to the database, so you won’t see the results of the test in the Airflow GUI. So, with our Helloworld DAG, you could run a test on task_1:

airflow test Helloworld task_1 2015-06-01

Note that when I do this, it appears to run without error; however, I’m not getting any logs output to the console. If anyone has any suggestions about why this may be the case, let me know. 

You can also run the backfill command, specifying a start date and an end date, to run the Helloworld DAG for all of those dates. In the example below, I run the DAG 7 times, once for each day from June 1 – June 7, 2015:
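airflow backfill Helloworld -s 2015-06-01 -e 2015-06-07

(This is the Airflow 1.x CLI syntax that ships with this container; newer Airflow versions use a slightly different command.)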

When you run this, you can see the following in the Airflow GUI, which shows the success of the individual tasks and each of the runs of the DAG.

Resources

Deploying and Maintaining a Web App Part 3: Adding Tests with Pytest

Testing is an important part of development, including the development of web applications. Among other benefits, tests increase your confidence that your web application is doing what you expect, and provide a basis for preventing bugs in an automated way when you make changes to your code.

Since testing is such an integral part of web application maintenance and deployment, in this Part 3 of our app-deploy project we’ll put in some basic tests to see how they can be implemented in Flask applications, and eventually how tests fit into the deployment workflow. We’ll be using the pytest python library as our testing framework.

To follow along, you can clone the repository:

git clone https://github.com/marknagelberg/app-deploy.git

And then go to the appropriate location in the code with the following command which includes all of the changes made in this post:

git checkout ed2ed02f8b3db

Finally, create the python environment with:

conda env create -f environment.yml

First, a Slight Upgrade to our App

Right now, when our user enters information into the app’s simple web form, the data is simply added to the database and nothing changes from the user’s perspective. To make testing a bit more interesting, we first add a little bit of additional functionality: our app will now print out a list of all the existing names in the database on the main page.

First we add a few lines to app/app.py so that it queries the database to get all names, and then sends the list of names to the template.

Then, we update the template to print out these names.
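As a rough, self-contained sketch of both steps, here is the idea compressed into a single file; the real app uses the Flask-SQLAlchemy models, forms, and separate template files set up in Parts 1 and 2, so the details there will differ:

from flask import Flask, render_template_string
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///names.db"
db = SQLAlchemy(app)

class Name(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(64))

with app.app_context():
    db.create_all()

@app.route("/")
def index():
    # Query every Name record and hand the list to the template, which
    # loops over it with a Jinja {% for %} block.
    names = Name.query.all()
    return render_template_string(
        "<ul>{% for n in names %}<li>{{ n.name }}</li>{% endfor %}</ul>",
        names=names)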

Now, when you enter in new names, they are listed out for you, like so:

Pytest Installation and Set-up

To start off, let’s activate our app’s conda environment and install pytest:

source activate app-deploy
conda install pytest

Our tests will be stored in a top-level folder that we’ll call `tests`, in *.py files within that folder. These files must be named like test_*.py or *_test.py: this is how you tell pytest that these files contain tests (in other words, this is how pytest does “test discovery”).

As a reminder, your top level directory should look something like this:

So, let’s add a file called test_app.py in /tests.

touch tests/test_app.py

To run your tests, you simply enter ‘pytest’ in the command line. No tests have been added to test_app.py yet, so when you run ‘pytest’ now, the result should look like this:


A couple of other small preliminaries to get testing set up:

  • Set the WTF_CSRF_ENABLED configuration variable to False in the testing configuration found in config.py (see the sketch after this list). This is required for tests to run properly. Keep in mind that you normally want this set to True when you’re running your application in production, since it protects your forms against Cross Site Request Forgery (CSRF) attacks.
  • Add __init__.py in your /tests directory.
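For the first item, the relevant part of config.py looks roughly like this; the base Config class and its other settings come from Part 2 and are only stubbed out here:

class Config:
    # Common settings from Part 2 (database URI, secret key, etc.) go here.
    pass

class TestingConfig(Config):
    TESTING = True
    # Disable CSRF protection so tests can submit forms without a token.
    # Keep this set to True in production to guard against CSRF attacks.
    WTF_CSRF_ENABLED = False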

Adding Tests

The great thing about pytest is that it allows you to write tests in a very concise, pythonic way using assert statements. Suppose you have a file which defines a function that takes the square of a number, and you want to test this function. Adding a couple of tests is as easy as this:
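A sketch of what that file and its tests might look like, combined here into one test module, with one test that passes and one that deliberately fails:

def square(x):
    return x * x

def test_square_of_three():
    assert square(3) == 9       # passes

def test_square_of_two():
    assert square(2) == 5       # deliberately wrong, so pytest reports a failure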

Then, you run your tests by typing `pytest` in the command line when you’re in the app’s main directory. This returns the following result (one test passes, one test fails):


Another great feature of pytest is the ability to create “fixtures”, which provide your tests with the pre-initialized objects you need to run them. In our case, a great use of fixtures is initializing the test database or the test application instance.

Pytest fixtures have some advantages over the usual setup and teardown functions used in other testing frameworks. One is the ability to run fixtures using different “scopes”, allowing you to do the setup / teardown operation at different times when you run your test to maximize test efficiency. The options for scope here include function, class, module, and session. So, for example, a fixture in the function scope means that the fixture object is invoked once for each test function you define in pytest.

So, say all your tests need to be able to access the Flask test client application instance. Using a module-scoped fixture, the application instance is only created once at the very beginning of the test run, rather than being created and destroyed once for every test function.

For our small app, we’ll create 3 fixtures:

  • A database record in our Name table
  • The Flask application instance
  • The test database

The following code creates a fixture for one database record in our Name table which will make this object available to all our tests.
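A sketch of what this fixture might look like in tests/test_app.py; the import path for the Name model and the value “Mark” are assumptions based on how the app is described in this series:

import pytest
from app.models import Name   # assumption: wherever the Name model actually lives

@pytest.fixture(scope="module")
def new_name():
    # A single Name record that tests can inspect.
    return Name(name="Mark")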

As you can see, you register fixtures using the pytest.fixture decorator. This function returns the fixture object that you want your tests to access. You access this fixture object in your tests by supplying it as an argument to the test function. The code below shows a test we can add to test_app.py that accesses the new_name fixture and ensures it has the value we expect.
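A sketch of such a test:

def test_new_name(new_name):
    # Pytest injects the fixture simply because it is named as an argument.
    assert new_name.name == "Mark"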

We can take advantage of the work we did in Part 2 of the series with the application factory pattern to easily create a fixture for our application instance with our testing configuration options. The code below does this by creating an application instance with the testing configuration and returning its test client (the test client provides a simple interface to the application where we can send requests and track cookies).
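A sketch of this fixture, assuming the factory from Part 2 is called create_app and takes the name of a configuration:

import pytest
from app import create_app    # assumption: the application factory from Part 2

@pytest.fixture(scope="module")
def test_client():
    flask_app = create_app("testing")
    client = flask_app.test_client()
    ctx = flask_app.app_context()
    ctx.push()
    yield client              # the tests run at this point
    # Everything after the yield is teardown: pop the application
    # context once the tests are done.
    ctx.pop()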

Note that the part after the yield statement represents the “teardown” part of the test: this is where the application instance is removed when the tests are done running.

We also need to create a similar fixture for our database. The following code creates the database and initializes the data with create_all(), adds one Name “Mark” and then cleans up the database after the tests are complete.
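A sketch of the database fixture, assuming the Flask-SQLAlchemy db object and the Name model can be imported from the app package (it also depends on the application fixture above so that an application context is active):

import pytest
from app import db            # assumption: the Flask-SQLAlchemy instance
from app.models import Name   # assumption: wherever the Name model lives

@pytest.fixture(scope="module")
def test_database(test_client):
    # Create the tables and add one record to test against.
    db.create_all()
    db.session.add(Name(name="Mark"))
    db.session.commit()
    yield db                  # the tests run at this point
    # Teardown: clean up the session and drop the tables afterwards.
    db.session.remove()
    db.drop_all()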

Recall that we upgraded the app earlier so that it prints out a list of all the existing names in the database on the main page; this gives us something more interesting to write system tests against.

Finally, we add a few more tests that use all of our new fixtures, including tests to make sure our application instance exists and that the HTML output produced contains the data we expect.
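Sketches of those tests, written against the fixtures above:

def test_app_exists(test_client):
    assert test_client is not None

def test_home_page_lists_names(test_client, test_database):
    response = test_client.get("/")
    assert response.status_code == 200
    # "Mark" was added by the database fixture, so it should appear
    # in the rendered HTML.
    assert b"Mark" in response.data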

(Note that “Mark” appears on the main page since this record was added to the database as part of the fixture.)

Now, in your main app directory, you can run your tests by simply typing “pytest” in the command line. The result for our code here should look like this:

The three dots ‘…’ indicate that there were three tests and they each passed (if a test had failed, one of these dots would appear as an F).

Aside: during the creation of this post, I was puzzled to see that my code was running in the Flask production environment, despite my configuration and environment variables clearly indicating that I was in development. It turns out there is another environment variable that needs to be set called FLASK_ENV, which defaults to “production”, turning off debug mode in Flask and throwing a warning when you run your application on the Flask development server. To fix this, run `export FLASK_ENV="development"` on the command line. Note that this must be set as an environment variable: adding it in config.py doesn’t work. I’ve updated this information in Part 2. For further information, see the Flask documentation here.

Resources

https://www.patricksoftwareblog.com/testing-a-flask-application-using-pytest/

https://piotr.banaszkiewicz.org/blog/2014/02/22/how-to-bite-flask-sqlalchemy-and-pytest-all-at-once/

Digging into Data Science Tools: Docker

Docker is a tool for creating and managing “containers” which are like little virtual machines where you can run your code. A Docker container is like a little Linux OS, preinstalled with everything you need to run your web app, machine learning model, script, or any other code you write.

Docker containers are like a really lightweight version of virtual machines. They use way less computer resources than a virtual machine, and can spin up in seconds rather than minutes. (The reason for this performance improvement is Docker containers share the kernel of the host machine, whereas virtual machines run a separate OS with a separate kernel for every virtual machine.)

Aly Sivji provides a great comparison of Docker containers to shipping containers. Shipping containers improved efficiency of logistics by standardizing the design: they all operate the same way and we have standardized infrastructure for dealing with them, and as a result you can ship them regardless of transportation type (truck, train, or boat) and logistics company (all are aware of shipping containers and mold to their standards). In a similar way, Docker provides a standardized software container which you can pass into different environments and be confident they’ll run as you expect.  

Brief Overview of How Docker Works

To give you a really high-level overview of how Docker works, first let’s define three big Docker-related terms – “Dockerfile”, “Image”, and “Container”:

  • Dockerfile: A text file you write to build the Docker “image” that you need (see the definition of image below). You can think of the Dockerfile as a wrapper around the Linux command line: the commands that you would use to set up a Linux system on the command line have equivalents which you can place in a Dockerfile. “Building” the Dockerfile produces an image that represents a Linux machine that’s in the exact state that you need. You can learn all about the ins-and-outs of the syntax and commands at the Dockerfile reference page. To get an idea of what Dockerfiles look like, here is a Dockerfile you would use to create an image that has the Ubuntu 15.04 Linux distribution, copy all the files from your application to /app in the image, run the make command on /app within your image’s Linux command line, and then finally run the python file defined in /app/app.py:
FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py
  • Image: A “snapshot” of the environment that you want your containers to run in. The image includes everything you need to run your code, such as code dependencies (e.g. a python venv or conda environment) and system dependencies (e.g. a server or database). You “build” images from Dockerfiles, which define everything the image should include. You then use these images to create containers.
  • Container: An “instance” of an image, similar to how objects are instances of classes in object-oriented programming. You create (or “run”, in Docker language) containers from images. You can think of a container as running the “virtual machine” defined by your image.

To sum up these three main concepts: you write a Dockerfile to “build” the image that you need, which represents the snapshot of your system at a point in time. From this image, you can then “run” one or more containers with that image.
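In terms of actual commands, that loop looks something like this (my-image is just a placeholder name):

docker build -t my-image .
docker run my-image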

Here are a few other useful terms to know:

  • Volume: “Shared folders” that let a Docker container see a folder on your host machine (very useful for development, so your container is automatically updated with your code changes). Volumes also allow one Docker container to see data in another container. Volumes can be “persistent” (the volume continues to exist after the container is stopped) or “ephemeral” (the volume disappears as soon as the container is stopped).
  • Container Orchestration: When you first start using Docker, you’ll probably just spin up one container at a time. However, you’ll soon find that you want to have multiple containers, each running using a different image with different configurations. For example, a common use of Docker is deployment of applications as “microservices”, where each Docker container represents an individual microservice that interacts with your other microservices to deliver your application. Since it can get very unwieldy to manage multiple containers manually, there are “container orchestration” tools that automate tasks such as starting up all your containers, automatically restarting failing containers, connecting containers together so they can see each other, and distributing containers across multiple computers. Examples of tools in this space include docker-compose and Kubernetes.
  • Docker Daemon / Docker Client: The Docker Daemon must be running on the machine where you want to run containers (this could be your local machine or a remote one). The Docker Client is the front-end command line interface you use to interact with Docker, connect to the Docker Daemon, and tell it what to do. It’s through the Docker Client that you run commands to build images from Dockerfiles, create containers from images, and do other Docker-related tasks.

Why is Docker useful to Data Scientists?

You might be thinking “Oh god, another tool for me to learn on top of the millions of other things I have to keep on top of? Is it worth my time to learn it? Will this technology even exist in a couple years?”

I think the answer is, yes, this is definitely a worthwhile tool for you to add to your data science toolbox.

To help illustrate, here is a list of reasons for using Docker as a data scientist, many of which are discussed in Michaelangelo D’Agostino’s “Docker for Data Scientists” talk as well as this Lynda course from Arthur Ulfeldt:

  • Creating 100% Reproducible Data Analysis: Reproducibility is increasingly recognized as critical for both methodological and legal reasons. When you’re doing analysis, you want others to be able to verify your work. Jupyter notebooks and Python virtual environments are a big help, but you’re out of luck if you have critical system dependencies. Docker ensures you’re running your code in exactly the same way every time, with the same OS and system libraries.
  • Documentation: As mentioned above, the basis for building docker containers is a “Dockerfile”, which is a line by line description of all the stuff that needs to exist in your image / container. Reading this file gives you (and anyone else that needs to deploy your code) a great understanding about what exactly is running on the container.
  • Isolation: Using Docker helps ensure that your tools don’t conflict with one another. By running them in separate containers, you’ll know that you can run Python 2, Python 3, and R and these pieces of software will not interfere with each other.
  • Gain DevOps powers: in the words of Michaelangelo D’Agostino, “Docker Democratizes DevOps”, since it opens up opportunities that used to be available only to systems / DevOps experts:
    • Docker allows you to more easily “sidestep” DevOps / system administration if you aren’t interested, since someone can create a container for you and all you have to do is run it. Similarly, if you like working with Docker, you can create a container for less technically savvy coworkers that lets them run things easily in the environment they need.
    • Docker provides the ability to build docker containers starting from existing containers. You can find many of these on DockerHub, which holds thousands of pre-built Dockerfiles and images. So if you’re running a well-known application (or even obscure applications), there is often a Dockerfile already available that can give you a tremendous running start to deploy your project. This includes “official” Docker repositories for many tools, such as ubuntu, postgres, nginx, wordpress, python, and much more.
    • Using Docker helps you work with your IT / DevOps colleagues, since you can do your Data Science work in a container, and simply pass it over to DevOps as a black box that they can run without having to know everything about your model.

Here are a few examples of applications relevant to data science that you might try out with Docker:

  • Create an ultra-portable, custom development workflow: Build a personal development environment in a Dockerfile, so you can access your workflow immediately on any machine with Docker installed. Simply load up the image wherever you are, on whatever machine you’re on, and your entire work environment is there: everything you need to do your job, and how you want to do your job.
  • Create development, testing, staging, and production environments: Rest assured that your code will run as you expect, and create staging environments identical to production so you know that when you push to production, you’re going to be OK.
  • Reproduce your Jupyter notebook on any machine: Create a container that runs everything you need for your Jupyter Notebook data analysis, so you can pass it along to other researchers / colleagues and know that it will run on their machine. As great as Jupyter Notebooks are for doing analysis, they tend to suffer from the “it works on my machine” issue, and Docker can solve this issue.

For more inspiration, check out Civis Analytics’ Michaelangelo D’Agostino describing the Docker containers they use (start at the 18:08 mark). These include containers specialized for survey processing, R Shiny apps and other dashboards, Bayesian time series modeling and poll aggregation, as well as general purpose R / Python containers that include all the common packages staff need.

Further Resources

If you’re serious about starting to use Docker, I highly recommend the Lynda Course Learning Docker by Arthur Ulfeldt as a starting point. It’s well-explained and concise (only about 3 hours of video in total).

Here are a few other useful resources you might want to check out:

Workflow to Deploy and Maintain a Web App

In this series, I’m going to work through the process of deploying and maintaining a simple web application. The goal is for you to get an understanding of all the steps that may be involved in deploying a web app and maintaining it in production with modern tools and techniques.

To follow along in this series, it would help to have an understanding of the command line, python programming, and basic web application development. I’ll be using the Flask web development framework. You’re in my target audience if you’ve built simple Flask applications running locally, and you now want to deploy them and use related best practices, such as implementing development, staging, and production environments, using Docker containers, and making changes to the application via continuous integration / continuous deployment (CI / CD) tools.

The focus here is not on the web application itself –  rather, it’s on the process of deployment and maintenance. I’m going to get into all the nuts and bolts related to getting the app running live and maintaining it in a smart way. Along the way, we’ll learn about the best practices around deploying a web application that I find are not readily available in many tutorials.

Table of contents:

Part 1 – Setting up the server (currently reading)
Part 2 – Setting up the Database and App Configuration
Part 3 – Adding Tests with Pytest
… To come, as I finish the series 🙂

Getting Started

As the basis for our adventure, we’ll start with the simplest Flask app possible: the hello world app from the Flask website.

To get the full code for this application, run:

git clone https://github.com/marknagelberg/app-deploy.git

You need to install Conda, which is the tool I use for managing python environments (see my related post here). The environment information is stored in environment.yml in the repository.

To create the python environment from the environment.yml file, run the following command in the main project directory (i.e. app-deploy/):

conda env create -f environment.yml

The conda environment is called ‘app-deploy’. To activate the environment after you’ve created it, simply type (on Mac or Linux):

source activate app-deploy

Or on Windows:

activate app-deploy

For each part of this series, I will provide a git checkout command that will allow you to see the code in the state relevant to that part of the series.

To start off with the initial, most basic version of the application, run:

git checkout 75521eb0e3

The main document in the repository is a file called app.py that looks like this:
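# app.py -- the hello world app from the Flask website (the file in the
# repo may differ slightly in its details).
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "Hello, World!"

if __name__ == "__main__":
    app.run()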

First activate the environment, which will add ‘(app-deploy)’ to your command line to indicate you are in the environment, like this:
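source activate app-deploy
(app-deploy) $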

Now that you are in the appropriate python environment, you can run:

python app.py

Which runs Flask’s development server and serves the app at http://127.0.0.1:5000/.

If you visit the site in the browser, you should see this:

Part 1 – Setting up the Server

As a first step in this process, we’re going to spin up a server that will ultimately run the web application in production.

An aside for data scientists: Running and configuring web servers is a valuable skill for data scientists. You often need to do this to run data-driven applications you build for others (you don’t want to be running these things on your desktop computer). Furthermore, there are many valuable tools and services out there for data scientists and their coworkers that are designed to run on a server, rather than as a desktop application (e.g. Apache Superset, which is a free and open source business intelligence dashboard like Tableau or PowerBI). Servers also come in handy for your own personal use even if you have no plans to deploy a web application. For example, your server can run scripts that automatically create reports and send you email notifications, or you can run web scrapers that collect data while you sleep.

Back in the day, you needed an on-site physical server to do this. Thankfully, modern cloud computing makes it incredibly easy and cheap to run your own server. There are a few good options out there, but in this series, I’m going to use a Digital Ocean “droplet”. 

First, create an account with Digital Ocean. Then, choose the option to create a droplet:

Then, choose the operating system that you want your server to run. I went with Ubuntu, since it’s what I’m most familiar with. It has a large user base, and as a result, there are lots of online tutorials and documentation to help you troubleshoot when things go wrong.

Then choose the memory and CPU power of your server. I went with the smallest / cheapest one. It’s easy to “resize” your droplet later on if you realize you need more juice.

Now choose a datacenter region – this is the location where your server will run. I suggest picking a data center closest to you, or closest to where your users will be. For me, that’s their Toronto data center.

Finally, you have to specify an SSH key. This will allow you to use the ssh command line program to log into your server remotely (you’ll be working with your server entirely through the command line). DigitalOcean gives you the option to enter it right inside their online dashboard interface as a step in creating the droplet. They also provide this useful guide. Here’s what I did on my Mac.

  1. Run the command:
ssh-keygen

This generates two files which represent the public / private key pair that you’ll need to authenticate via SSH. Save these files in your ~/.ssh directory (this is the directory where the ssh program will automatically look when you attempt to log into the server). You’ll also get the option to add a passphrase to the key, which is recommended as protection in case your laptop is stolen and someone gets access to the files.

id_rsa and id_rsa.pub are the default names for these public / private key pairs. Since these already exist in my .ssh directory, I chose a different filename: app_deploy and app_deploy.pub.

Once you create these files, you then need to copy the contents of the public key (i.e. the file ending in ‘.pub’) into the “New SSH Key” form that Digital Ocean provides when setting up your droplet:

After entering in your public key, you can click the button to create your server.

Finally, to log in to my server I run:

ssh -i ~/.ssh/app_deploy root@server_ip_address

If you chose the default filename (id_rsa and id_rsa.pub), then you only need to run:

ssh root@server_ip_address

And congratulations – you are now logged into your own personal server running in the cloud!