## Building a Rare Event Reporting System in Python (Part 1 – Crunching Historical Data to Determine What is Unusual)

Here’s the situation: you have a dataset where each row represents some type of event and the time it occurred. You want to build a system to let you know if there were unusual events occurring in some time period (a common application). You also want your users to be able to explore the events in more detail when they get an alert and understand where the event sits in the history of the data that has been collected.

After starting a task similar to this at work, I broke it down into the following parts (each of which will receive a separate blog post):

1. Using historical data to understand what is “unusual” (Part 1 – this post).
2. Reporting “unusual” events taken from a live feed of your data (Part 2): Developing a system that monitors a live feed of your data, compares it to historical trends, and sends out an email alert when the number of incidents exceeds some threshold.
3. Developing a front end for users to sign up for email alerts (Part 3): Developing a system for signing up with email to receive the updates and specify how often you want to see them (i.e. specifying how “unusual” an event you want to see).
4. Creating interactive dashboards (Part 4): Developing dashboards complementing the alert system where users can look up current status of data and how it compares to historical data.

The Data: Obviously, this scenario applies to many, many datasets, but I have to pick something specific for this series. So, I’ll be using this dataset on air quality in the City of Winnipeg from Winnipeg’s open data portal. This data is perfect because it’s updated regularly (every 5 minutes), and there is historical data as well as a live feed API (you can find documentation on the API here). We’ll be using the historical data to help figure out what’s “unusual” and the live data as the basis for alerts, comparing its values against the historical values.

### Data Exploration and Pre-Processing

After downloading our air quality dataset, we see it looks like this:

As you can see, there are different measurements in different rows, including Temperature, Humidity, and PM2.5 Particulates. Let’s filter our dataset to only contain information on PM2.5 Particulates (we only really care about air pollution in this application).

Looking at the values in MeasurementValue, we see it is measured in units of micrograms per cubic metre (ug/m3) and all values are positive integers. Plotting the histogram, you can see that almost all of the values are below 1,000, with a relatively small number of outliers around 5,000.

Counting the number of Measurement Values and filtering for lower values reveals that the vast majority of the observations are less than 100.

Looking at the Measurement Values greater than 100 confirms that there are no observations between roughly 700 and 4,900, with a sudden spike around 5,000.

I don’t have domain expertise related to air quality measurement and the nature of the sensors used to collect this data; however, the fact that there are no observations in this wide range suggests that these are measurement errors. Exporting the Measurement Values to a table shows that there is one observation with 673 ug/m3 and then the next highest observation suddenly jumps by almost an order of magnitude to 4892 ug/m3.

Given the clear cutoff from these outliers, we’ll set a threshold and define an observation as a measurement error if it is greater than 1000 ug/m3. So as a first step in our analysis, we’ll take out these outliers.

Once these have been filtered out, we’ll do a bit of aggregation across longer time periods. This helps further smooth out other one-off outliers that are likely measurement errors rather than a real change in overall air quality in the city.

To do this, we put together a dataset that groups data into hourly chunks and takes the average particulate matter over that period. Plotting this out shows the average hourly particulate matter over the historical time period provided is between 0 and 120 ug/m3.
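The filtering and hourly aggregation steps above can be sketched with pandas roughly as follows. This runs on a tiny made-up table rather than the real export. MeasurementValue and the name hrly_avg_aqp come from this post, while the ObservationTime and MeasurementType column names and the ERROR_CUTOFF constant are assumptions for illustration.

```python
import pandas as pd

# Synthetic stand-in for the air quality export. "ObservationTime" and
# "MeasurementType" are assumed column names, not taken from the real data.
raw = pd.DataFrame({
    "ObservationTime": pd.to_datetime([
        "2018-06-01 00:05", "2018-06-01 00:10", "2018-06-01 00:40",
        "2018-06-01 01:15", "2018-06-01 01:20", "2018-06-01 01:50",
    ]),
    "MeasurementType": ["PM2.5", "PM2.5", "Temperature",
                        "PM2.5", "PM2.5", "PM2.5"],
    "MeasurementValue": [10, 14, 21, 4892, 30, 20],
})

ERROR_CUTOFF = 1000  # ug/m3; readings above this are treated as sensor errors

# Keep only PM2.5 rows, then drop readings above the error cutoff.
pm25 = raw[raw["MeasurementType"] == "PM2.5"]
pm25 = pm25[pm25["MeasurementValue"] <= ERROR_CUTOFF]

# Group into hourly chunks and average the particulate readings in each hour.
hrly_avg_aqp = (
    pm25.set_index("ObservationTime")["MeasurementValue"]
        .resample(pd.Timedelta(hours=1))
        .mean()
)
print(hrly_avg_aqp)
```

Filtering before averaging matters here: a single 4892 ug/m3 reading left in place would dominate the mean for its hour.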

### Determining Alert Thresholds

From this hourly dataset of average particulate levels, here’s the general approach we’ll take to determine what is “unusual”: we’ll define some percentile threshold where we want to be alerted, and call anything above this threshold “unusual”.

Python’s pandas package has a nice Series.quantile() function to help out with this: you pass in the quantile you’re interested in (as a fraction between 0 and 1) and it pops out the value in your data representing that quantile. For example, if you want to see the value representing the 95th percentile in our hourly average air quality data (hrly_avg_aqp), then you would write hrly_avg_aqp.quantile(0.95).

The 95th percentile in the dataset is 32.47 ug/m3.

The next logical question is: what percentile threshold should we use to send alerts?

As a solution, we’ll leave it to the user to input about how rare an event they want to receive alerts for: once-in-two-years events, once-in-a-year events, six-in-a-year events, or once-in-a-month events. Since we’re looking at hourly data, if a user specifies the number of alerts they want to receive per year (n_alerts_per_year), the calculation of the appropriate percentile threshold is:

percentile_threshold = [1 - n_alerts_per_year  / Total Number of Observations in a Year] * 100 = [1 - n_alerts_per_year  / (365 * 24)] * 100

So now we have all the fundamental pieces of data we need: when the user specifies n_alerts_per_year, we calculate the percentile threshold using the calculation above, and then finally, we can calculate the threshold value of air quality that should set off an alert for that user. Because quantile() expects a fraction between 0 and 1 rather than a percentage, we divide by 100:

hrly_avg_aqp.quantile(percentile_threshold / 100)
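To make the arithmetic concrete, here is a minimal sketch of the threshold calculation. The helper names percentile_threshold() and alert_threshold() are made up for this example, and the series is a toy sequence of hourly values rather than real readings.

```python
import pandas as pd

HOURS_PER_YEAR = 365 * 24  # hourly observations in a (non-leap) year


def percentile_threshold(n_alerts_per_year):
    """Percentile (0-100) above which roughly n_alerts_per_year hours fall."""
    return (1 - n_alerts_per_year / HOURS_PER_YEAR) * 100


def alert_threshold(hrly_avg, n_alerts_per_year):
    """Particulate level that should trigger an alert for this user."""
    # Series.quantile() expects a fraction in [0, 1], so divide by 100.
    return hrly_avg.quantile(percentile_threshold(n_alerts_per_year) / 100)


# Toy series: one year of hourly values 1, 2, ..., 8760 (not real readings).
hrly_avg_aqp = pd.Series(range(1, HOURS_PER_YEAR + 1), dtype=float)

for label, n in [("once in two years", 0.5), ("once a year", 1),
                 ("six a year", 6), ("once a month", 12)]:
    print(label, round(percentile_threshold(n), 4),
          alert_threshold(hrly_avg_aqp, n))
```

The rarer the event a user asks for, the closer the percentile gets to 100 and the higher the alert threshold.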

Here’s an overview of our data pre-processing steps for our unusual event application.

### What’s Next?

Now that we have the basic data in place for understanding what levels of air quality are “unusual”, in Part 2 of the series we’ll write code to ingest the JSON data feed of air quality readings and report “unusual” events based on the thresholds developed in this post: a system that regularly monitors the live feed, compares the readings with historical data, and sends out an email whenever they exceed a percentile threshold.

## Deploying and Maintaining a Web App Part 2: Setting up the Database and App Configuration

It is good practice when developing a web application to set up different environments for deploying changes. The number and nature of environments that are used can vary, but we’ll be using the following commonplace architecture:

• Development Environment: Where you develop the web application, typically on your local machine.
• Testing Environment: Where you perform tests to help ensure the quality of your application.
• Staging Environment: Where you deploy the application for the purpose of conducting “integration tests” that test the broader functionality of your web app.
• Production Environment: Where you deploy your web app to your users.

In this Part 2 of our web app deployment and maintenance project, we’ll begin to set up some of these environments, including a database with an extremely simple schema as well as some additions to our application to incorporate the database and configure the app.

To follow along, you can clone the repository:

git clone https://github.com/marknagelberg/app-deploy.git

And then go to the appropriate location in the code changes with the following command:

git checkout 75521eb0e3d

Finally, create the python environment with:

conda env create -f environment.yml

### Installing Additional Python Packages

To start off, we’ll install a couple of additional packages to our app’s Conda environment:

• Flask-SQLAlchemy: SQLAlchemy is a fantastic Python library that provides a way of interacting with your database purely in python, without having to write SQL query strings. Another benefit of using SQLAlchemy and similar ORMs is the ability to easily swap out databases if you decide you want to change in the future. Flask-SQLAlchemy is an extension for Flask that makes it easier to incorporate SQLAlchemy into Flask apps.
• Flask-Migrate: A Flask extension that provides a lightweight wrapper around the Alembic database migration framework. This will come in handy later when we need to make changes to our database schema after it’s already been deployed to production.
• Flask-WTF: A Flask extension that makes it easy to work with web forms.
• psycopg2: An adapter for PostgreSQL databases in Python. PostgreSQL will be our database of choice and this package is required to connect it up to our application.

To install these packages, first start up the Conda environment with the following command:

source activate app-deploy

Then install the packages above with the following commands:

pip install Flask-SQLAlchemy

pip install Flask-Migrate

pip install Flask-WTF

conda install psycopg2

### Installing and Setting Up the PostgreSQL Databases

Next, we need to install PostgreSQL. You can find the proper install for your system here.

Postgres comes with a handy front end terminal program called psql, which allows you to examine your database’s tables and records using SQL queries. We first use psql to initialize two of the databases we’ll be working with for our program: the development database and the test database.

First enter into psql by typing psql into the terminal:

From here, you can run a bunch of useful commands. For example \l lists all the databases you currently have in postgres:

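For now, we’ll need to use psql to create two databases that we’ll call app_deploy_development and app_deploy_test. This is as simple as running two commands at the psql prompt: CREATE DATABASE app_deploy_development; and CREATE DATABASE app_deploy_test; (the trailing semicolons are required).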

Running the command \l shows that these two databases have been created:

You can then exit postgres with the command \q or simply press control-d.

To be able to connect to this database from Flask / SQLAlchemy, we need the URL string associated with the database. In PostgreSQL, this takes the form:

postgresql://username:password@hostname:port/database
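As an illustration, a URL in this form can be assembled in Python along the following lines. The postgres_url() helper and the example credentials are hypothetical. The percent-encoding step is there because a password containing characters like "@" or ":" would otherwise be read as URL delimiters.

```python
from urllib.parse import quote_plus


def postgres_url(username, password, database, hostname="localhost", port=5432):
    """Assemble a PostgreSQL connection URL of the form shown above."""
    # Percent-encode the credentials so characters like "@" or ":" in a
    # password cannot be mistaken for URL delimiters.
    return (
        f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/{database}"
    )


# Made-up credentials, purely for illustration.
dev_url = postgres_url("some_user", "p@ss:word", "app_deploy_development")
print(dev_url)
```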

We’ll eventually be running the development and test databases on localhost, so the hostname is ‘localhost’, the port is 5432 (this is a postgres default) and database is ’app_deploy_development’. You can confirm the port Postgres is running on by running the following command:

### Configuring our Flask App

Now we’ll make some changes to the app so that the database is actually used.

First we will begin to incorporate the application environments that we need. The main way to do this in Flask is through the configuration variables, which are stored in the app.config variable (where app is the instance of your application).

The simplistic version of app.config is treated like a simple dictionary (e.g. you can set app.config[‘SECRET_KEY’] = ‘some_secret’), but this does not account for the fact that you need to work with different sets of configurations depending on how you’re running your application. Instead, a more flexible way to set up the app.config variable is to use a hierarchy of configuration classes, defined like this:
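A minimal sketch of such a hierarchy might look like the following. Only DEV_DATABASE_URL, SQLALCHEMY_DATABASE_URI, SECRET_KEY and the 'development' and 'testing' names appear in this post. The class names, the TEST_DATABASE_URL variable and the placeholder secret are assumptions for illustration.

```python
import os


class Config:
    """Settings shared by every environment."""
    # The fallback value is a placeholder for local use only (assumption).
    SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")
    SQLALCHEMY_TRACK_MODIFICATIONS = False


class DevelopmentConfig(Config):
    DEBUG = True
    SQLALCHEMY_DATABASE_URI = os.environ.get("DEV_DATABASE_URL")


class TestingConfig(Config):
    TESTING = True
    # TEST_DATABASE_URL is an assumed name, by analogy with DEV_DATABASE_URL.
    SQLALCHEMY_DATABASE_URI = os.environ.get("TEST_DATABASE_URL")


# Maps the names passed to create_app() onto configuration classes.
config = {
    "development": DevelopmentConfig,
    "testing": TestingConfig,
}
```

Each subclass only states what differs from the base class, so shared settings live in one place and staging or production classes can be added later in the same way.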

SQLALCHEMY_DATABASE_URI is a configuration variable required by Flask-SQLAlchemy to tell the extension where to find the database. Since the database URLs are sensitive (they contain username and password information that you would not want to share), they are read from your system’s environment variables rather than being included in version control (you define these using the export command). Add the configuration code above to a file called config.py in the app’s top directory.

### Refactoring our App to use the Application Factory Pattern

We need to do some restructuring of our application to account for the database and implement the “application factory” pattern in Flask that allows us to provide different app instances with different configurations. We first move our app.py file into the /app folder. We also create an app/templates/ folder to store the front-end template that will hold the form for entering names and the “hello world” message.

We also create /app/__init__.py:

This file stores our factory function ‘create_app’ that returns instances of the application. create_app takes a config_name argument, allowing us to create an app with the desired configuration option (e.g. create_app(‘development’), create_app(‘testing’), etc.). The configuration is assigned to the app within the factory function using the Flask app.config.from_object function along with the dictionary of configuration classes defined in config.py.
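Here is a stripped-down sketch of the factory idea, not the exact contents of app/__init__.py. To keep it self-contained, two tiny stand-in configuration classes are defined inline instead of being imported from config.py, and the database and blueprint setup is left as a comment.

```python
from flask import Flask


class DevelopmentConfig:
    DEBUG = True


class TestingConfig:
    TESTING = True


# Stand-in for the dictionary of configuration classes defined in config.py.
config = {"development": DevelopmentConfig, "testing": TestingConfig}


def create_app(config_name):
    """Application factory: build a fresh app for the named configuration."""
    app = Flask(__name__)
    # Copy the upper-case attributes of the chosen class into app.config.
    app.config.from_object(config[config_name])
    # Extensions (e.g. db.init_app(app)) and blueprints would be set up here.
    return app


dev_app = create_app("development")
test_app = create_app("testing")
print(dev_app.config["DEBUG"], test_app.config["TESTING"])
```

Because each call returns a new instance, tests can build an app with the testing configuration without touching the development one.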

### Adding the Database to our App

Rather than having just “Hello World”, we want to have our application make use of the database in some minimal way. So, we add a form where the user can enter names which are stored in the database inside a single table, consisting of a single column holding the names.

To define the database schema, we create a file called models.py:

This file imports our database instance db. Here we define a ‘name’ string column, and an integer id that serves as a primary key (your table must have a primary key).

We also create a file create_app.py in the app’s top directory:

This file is tasked with creating the instances of app from the create_app function defined in app/__init__.py. This function is what allows us to choose a particular configuration, which we define in an environment variable on our system called FLASK_CONFIG. It is also a place where we can do additional configuring of our newly generated application instance (e.g. creating commands for your app with the flask command line program by decorating functions with @app.cli.command()). This file may seem a little unnecessary now but it will come in handy later.

We also need to insert a form for the user to enter in data. We’ll use the assistance of the great Flask extension, Flask-WTF. With Flask-WTF, you define forms as classes that inherit from FlaskForm. Each of the class variables represents a field in the form that takes in certain types of values. Here our form is quite simple as it only takes a single field where the user can enter in names. There is also a ‘submit’ button which provides exactly what you expect.

Now that we have a form defined, we can define the new front end template index.html.

In our main application, we need to pass our NameForm class to render this template so we add the following code to app.py. Also note that, because of the way that we used the application factory pattern, we must now use “Blueprints” for our view functions rather than defining routes in a single file directly on the app instance (blueprints are basically objects that can hold a bunch of an app’s routes and code and then can be registered with the application in the app’s factory function).
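To show how a blueprint fits together with the factory, here is a simplified sketch. The blueprint name main is an assumption, and the view returns plain text instead of building a NameForm and rendering index.html, so that the example runs without templates or a database.

```python
from flask import Blueprint, Flask

# A blueprint collects routes without needing an app instance at import time.
main = Blueprint("main", __name__)


@main.route("/")
def index():
    # The real view would build a NameForm and render index.html instead.
    return "Hello World"


def create_app():
    app = Flask(__name__)
    # Registration attaches the blueprint's routes to this particular app.
    app.register_blueprint(main)
    return app


app = create_app()
response = app.test_client().get("/")
print(response.status_code, response.get_data(as_text=True))
```

The route is declared against the blueprint rather than the app, which is what lets the factory create the app later and still pick up the views.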

### Running our Updated App With the Database

To run the application first define environmental variables:

• FLASK_APP, which tells the flask command line program where the flask application is located so that it can run it:
export FLASK_APP=create_app.py
• FLASK_CONFIG, which is the environmental variable that tells the program what environment your app should be running in. As a reminder, this value is used in create_app.py and references one of the values in the dict found in config.py.
export FLASK_CONFIG=development
• DEV_DATABASE_URL, which points to your development database. As a reminder, this is assigned to the configuration variable SQLALCHEMY_DATABASE_URI in config.py.
export DEV_DATABASE_URL="postgresql://<insert username>:<insert password>@localhost:5432/app_deploy_development"

Before you run the application the first time, you have to initialize the database. You only have to do this once, using the db.create_all() method provided by Flask-SQLAlchemy. From the top application directory run:

flask shell
>> from app import db
>> db.create_all()

Then you can exit the python interpreter with quit() or control-d.

Finally, run the application with:

flask run

Now, when you visit http://127.0.0.1:5000/, you should see this:

Your app should now be running with a live development database in the background ready to store whatever is entered into the form. To make sure it’s working, enter a few names into the form (for my test, I typed in “Foobar”, “Mark”, and “Joanna”). Then, in the terminal from the top directory in your app, write the command “flask shell” to run the python interpreter in the context of your app. This throws you into the python interpreter. We’ll then query the database to make sure our names exist:

Looks like we’re good to go! Just to be sure, I loop through the names to make sure they’re the values we expect:

## Using Python to Figure out Sample Sizes for your Study

It’s common wisdom among data scientists that 80% of your time is spent cleaning data, while 20% is the actual analysis.

There’s a similar issue when doing an empirical research study: typically, there’s tons of work to do up front before you get to the fun part (i.e. seeing and interpreting results).

One important up front activity in empirical research is figuring out the sample size you need. This is crucial, since it significantly impacts the cost of your study and the reliability of your results. Collect too much sample: you’ve wasted money and time. Collect too little: your results may be useless.

Understanding the sample size you need depends on the statistical test you plan to use. If it’s a straightforward test, then finding the desired sample size can be just a matter of plugging numbers into an equation. However, it can be more involved, in which case a programming language like Python can make life easier. In this post, I’ll go through one of these more difficult cases.

Here’s the scenario: you are doing a study on a marketing effort that’s intended to increase the proportion of women entering your store (say, a change in signage). Suppose you want to know whether the change actually increased the proportion of women walking through. You’re planning on collecting the data before and after you change the signs and determine if there’s a difference. You’ll be using a two-proportion Z test for comparing the two proportions. You’re unsure how long you’ll need to collect the data to get reliable results – you first have to figure out how much sample you need!

### Overview of the Two Proportion Z test

The first step in determining the required sample size is understanding the statistical test you’ll be using. The two sample Z test for proportions determines whether a population proportion p1 is equal to another population proportion p2. In our example, p1 and p2 are the proportion of women entering the store before and after the marketing change (respectively), and we want to see whether there was a statistically significant increase in p2 over p1, i.e. p2 > p1.

The test tests the null hypothesis: p1 – p2 = 0. The test statistic we use to test this null hypothesis is:

$Z = \frac{p_2 - p_1}{\sqrt{p^*(1-p^*)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Where p* is the proportion of “successes” (i.e. women entering the store) in the two samples combined. I.e.

$p^* = \frac{n_1p_1 + n_2p_2}{n_1 + n_2}$

Under the null hypothesis, Z is approximately normally distributed (i.e. ~N(0, 1)), so given a Z score for two proportions, you can look up its value against the normal distribution to see the likelihood of that value occurring by chance.

So how to figure out the sample size we need? It depends on a few factors:

• The confidence level: How confident do we need to be to ensure the results didn’t occur by chance? For a given difference in results, detecting it with higher confidence requires more sample. Typical choices here include 95% or 99% confidence, although these are just conventions.
• The percentage difference that we want to be able to detect: The smaller the differences you want to be able to detect, the more sample will be required.
• The absolute values of the probabilities you want to detect differences on: This is a little trickier and somewhat unique to the particular test we’re working with. It turns out that, for example, detecting a difference between 50% and 51% requires a different sample size than detecting a difference between 80% and 81%. In other words, the sample size required is a function of p1, not just p1 – p2.
• The distribution of the data results: Say that you want to compare proportions within sub-groups (in our case, say you subdivide proportion of women by age group). This means that you need the sample to be big enough within each subgroup to get statistically significant comparisons. You often don’t know how the sample will pan out within each of these groups (it may be much harder to get sample for some). There are at least a couple of alternatives for you here: i) you could assume sample is distributed uniformly across subgroups ii) you can run a preliminary test (e.g. sit outside the store for half a day to get preliminary proportions of women entering for each age group).

So, how do you figure out sample sizes when there are so many factors at play?

### Figuring out Possibilities for Sample Sizes with Python

Ultimately, we want to make sure we’re able to detect a difference between p1 and p2 when it exists. So, let’s assume you know the “true” difference that exists between p1 and p2. Then, we can look at sample size requirements for various confidence levels and absolute levels of p1.

We need a way of figuring out Z, so we can determine whether a given sample size provides statistically significant results, so let’s define a function that returns the Z value given p1, p2, n1, and n2.

Then, we can define a function that returns the sample required, given p1 (the before probability), pdiff (i.e. p2 – p1), and alpha (the significance level, i.e. 1 minus the confidence level; we reject the null hypothesis when the p-value falls below it). For simplicity we’ll just assume that n1 = n2. If you know in advance that n1 will have about a quarter of the size of n2, then it’s trivial to incorporate this into the function. However, you typically don’t know this in advance and in our scenario an equal sample assumption seems reasonable.

The function is fairly simplistic: it counts n up from 1 until n gets large enough that the probability of the test statistic being that large (i.e. the p-value) is less than alpha (in this case, we would reject the null hypothesis that p1 = p2). The function uses the normal distribution available from the scipy library to calculate the p value and compare it to alpha.
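The two functions described above could be sketched like this. The names z_calc and p_diff are choices made for this example (sample_required is the name used in this post), and the p-value is computed one-sided on the assumption that we only care about an increase (p2 > p1).

```python
import math

from scipy.stats import norm


def z_calc(p1, p2, n1, n2):
    """Z statistic for the two-proportion test, using the pooled proportion."""
    p_star = (n1 * p1 + n2 * p2) / (n1 + n2)
    std_err = math.sqrt(p_star * (1 - p_star) * (1 / n1 + 1 / n2))
    return (p2 - p1) / std_err


def sample_required(p1, p_diff, alpha):
    """Smallest n (= n1 = n2) at which a true difference p_diff is significant."""
    if p_diff <= 0:
        raise ValueError("p_diff must be positive")
    n = 1
    while True:
        z = z_calc(p1, p1 + p_diff, n1=n, n2=n)
        p_value = norm.sf(z)  # one-sided: P(Z > z) under the null
        if p_value < alpha:
            return n
        n += 1


print(sample_required(p1=0.5, p_diff=0.05, alpha=0.05))
```

Note that the value returned is the size of each sample, since the loop sets n1 = n2 = n.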

These functions we’ve defined provide the main tools we need to determine minimum sample levels required.

As mentioned earlier, one complication to deal with is the fact that the sample required to determine differences between p1 and p2 depends on the absolute level of p1. So, the first question we want to answer is “what p1 would require the biggest sample size to detect a given difference with p2?” Figuring this out gives you a sample size that is sufficient for any p1: if you calculate the sample for the p1 with the highest required sample, you know it’ll be enough for any other p1.

Let’s say we want to be able to detect a 5% difference at a 95% confidence level, and we need to find the p1 that gives us the largest sample required. We first generate a list in Python of all the p1 to look at, from 0% to 95%, and then use the sample_required function for each p1 to calculate the sample.
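A sketch of that sweep is below, using condensed versions of the two functions described earlier so the example runs on its own. The 1% step size and the variable names are assumptions, and the p-value is again one-sided. With these settings the largest requirement should land close to p1 = 50%.

```python
import math

from scipy.stats import norm


def z_calc(p1, p2, n1, n2):
    """Two-proportion Z statistic with the pooled proportion p*."""
    p_star = (n1 * p1 + n2 * p2) / (n1 + n2)
    return (p2 - p1) / math.sqrt(p_star * (1 - p_star) * (1 / n1 + 1 / n2))


def sample_required(p1, p_diff, alpha):
    """Count n up from 1 until the one-sided p-value drops below alpha."""
    n = 1
    while norm.sf(z_calc(p1, p1 + p_diff, n, n)) >= alpha:
        n += 1
    return n


p_diff, alpha = 0.05, 0.05          # 5% difference, 95% confidence
p1s = [x / 100 for x in range(96)]  # p1 from 0% to 95% in 1% steps
samples = [sample_required(p1, p_diff, alpha) for p1 in p1s]

# The p1 needing the most sample is the conservative one to plan around.
worst_p1 = p1s[samples.index(max(samples))]
print(worst_p1, max(samples))
```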

Then, we plot the data with the following code.

Which produces this plot:

This plot makes it clear that p1 = 50% produces the highest sample sizes.

Using this information, let’s say we want to calculate the sample sizes required to detect differences between p1 and p2 where p2 – p1 is between 2% and 10%, and confidence levels are 95% or 99%. To ensure we get a sample large enough, we know to set p1 = 50%. We first write the code to build up the data frame to plot.

Then we write the following code to plot the data with Seaborn.

The final result is this plot:

This shows the minimum sample required to detect probability differences between 2% and 10%, for both 95% and 99% confidence levels. So, for example, detecting a difference of 2% at 95% confidence level requires n of ~3,500. Because the function assumes n1 = n2 = n, that means n1 = n2 = ~3,500. So, in our example, you would need about 3,500 people walking into the store before the marketing intervention, and 3,500 people after, to detect a 2% difference in probabilities at a 95% confidence level.

### Conclusion

The example shows how Python can be a very useful tool for performing “back of the envelope” calculations, such as estimates of required sample sizes for tests where this determination is not straightforward. These calculations can save you a lot of time and money, especially when you’re thinking about collecting your own data for a research project.

## Overview of Python (and non-Python) Mapping Tools for Data Scientists

Very often, data needs to be understood on a geographic basis. As a result, data scientists should be familiar with the main mapping tools at their disposal.

I have done quite a bit of mapping before, but given its central nature in my current position as a Transportation Assets Data Scientist with the City of Winnipeg, it quickly became clear that I need to do a careful survey of the geographic mapping landscape.

There is a dizzying array of options out there. Given my main programming language is python, I’ll start with tools in that ecosystem, but then I’ll move on to other software tools available for geographic analysis.

### Mapping Tools in Python

#### GeoPandas

GeoPandas is a fantastic library that makes munging geographic data in Python easy.

At its core, it is essentially pandas (a must-know library for any data scientist working with python). In fact, it is actually built on top of pandas, with data structures like “GeoSeries” and “GeoDataFrame” that extend the equivalent pandas data structures with useful geographic data crunching features. So, you get all the goodness of pandas, with geographic capabilities baked in.

GeoPandas works its magic by combining the capabilities of several existing geographic data analysis libraries that are each worth being familiar with. These include shapely and fiona, along with built-in geographic mapping capabilities through descartes and matplotlib.

You can read spatial data into GeoPandas just like pandas, and GeoPandas works with the geographic data formats you would expect, such as GeoJSON and ESRI Shapefiles. Once you have the data loaded, you can easily change projections, conduct geometric manipulations, aggregate data geographically, merge data with spatial joins, and conduct geocoding (which relies on the geocoding package geopy).

#### Basemap

Basemap is a geographic plotting library built on top of matplotlib, which is the granddaddy of python plotting libraries. Similarly to matplotlib, basemap is quite powerful and flexible, but at the cost of being somewhat time consuming and cumbersome to get the map you want.

Another notable issue is that Basemap recently came under new management in 2016 and is going to be replaced by Cartopy (described below). Although Basemap will be maintained until 2020, the Matplotlib website indicates that all development efforts are now focused on Cartopy and users should switch to Cartopy. So, if you’re planning on using Basemap, consider using…..

#### Cartopy

Cartopy provides geometric transformation capabilities as well as mapping capabilities. Similarly to Basemap, Cartopy exposes an interface to matplotlib to create maps on your data. You get a lot of mapping power and flexibility coming from matplotlib, but the downside is similar: creating a nice looking map tends to be relatively more involved compared to other options, with more code and tinkering required to get what you want.

#### geoplotlib

Geoplotlib is another geographic mapping option for Python that appears to be a highly flexible and powerful tool, allowing static map creation, as well as animated visualizations, and interactive maps. I have never used this library, and it appears to be relatively new, but it might be one to keep an eye out for in the future.

#### gmplot

gmplot allows you to easily plot polygons, lines, and points on google maps, using a “matplotlib-like interface”. This allows you to quickly and easily plot your data and piggy-back on the interactivity inherent to Google Maps. Plots available include polygons with fills, drop pins, scatter points, grid lines, and heatmaps. It appears to be a great option for quick and simple interactive maps.

#### Mapnik

Mapnik is a toolkit written in C++ (with Python bindings) to produce serious mapping applications. It is aimed primarily at developing these mapping applications on the web. It appears to be a heavy duty tool that powers a lot of maps you see on the web today, including OpenStreetMap and MapBox.

#### Folium

Folium lets you tap into the popular leaflet.js framework for creating interactive maps, without having to write a single line of JavaScript. It’s a great library that I have used quite often in recent months (I used folium to generate all the visualizations for my Winnipeg tree data blog post).

Folium allows you to map points, lines, and polygons, produce choropleth maps and heat maps, create map layers (that users can enable or disable themselves), and produce pop-up tooltips for your geographic data (bonus: these tooltips support html so you can really customize them to make them look nice). There is also a good amount of customization possible with the markers and lines used on your map.

Overall, Folium strikes a great balance between features, customizability, and ease of programming.

#### Plotly

Plotly is a company offering a large suite of online data analytics and visualization tools. The focus of Plotly is providing frameworks that make it easier to present visualizations on the web. You write your code in Python (or R), talking to a plotly library and the visualizations are rendered using the extremely powerful D3.js library. For a taste of what is possible, check out their website, which showcases a bunch of mapping possibilities.

On top of the charts and mapping tools, they have a bunch of additional related products of interest on their website that are worth checking out. One that I’m particularly interested in is Dash, which allows you to create responsive data-driven web applications (mainly designed for dashboards) using only Python (no JavaScript or HTML required). This is something I’m definitely going to check out and will probably produce a “diving into data science” post in the near future.

#### Bokeh

Bokeh is a library specializing in interactive visualizations presented in the browser. This includes geographic data and maps. Similarly to Dash, there is also the possibility of using Bokeh to create interactive web applications that update data in real time and respond to user input (it does this with a “Bokeh Server”).

### Other Tools for Mapping

There are obviously a huge number of mapping tools outside of the python ecosystem. Here is a brief summary of a few that you might want to check out. Keep in mind that there are tons and tons of tools out there that are missing from this list. These are just some of the tools that I’m somewhat familiar with.

#### Kepler.gl

Kepler is a web-based application that allows you to explore geodata. It’s a brand spanking new tool released in late May 2018 by Uber. You can use the software live on its website – Kepler is a client-side application with no server backend, so all the data resides on your local machine / browser. It’s not just for use on the Kepler website, however; you can install the application and run it on localhost or a server, and you can also embed it into your existing web applications.

The program has some great features, with most of the basic features you would expect in an interactive mapping application, plus some really great bonus features such as split maps and playback. The examples displayed on the website are quite beautiful and the software is easy to use (accessible to non-programmers).

This user guide provides more information on Kepler, what it can do, and how to do it. I’m very much looking forward to checking it out for some upcoming projects.

Mapbox

Mapbox provides a suite of tools related to mapping, aimed at developers to help you create applications that use maps and spatial analysis. It provides a range of services, from creating a nice map for your website to helping you build a geoprocessing application.

This overview provides a good idea of what’s possible. Mapbox provides a bunch of basemap layers and allows you to customize your maps, add your own data, and build web and mobile applications. It also provides options for extending the functionality of web apps with tools like geocoding, directions, spatial analysis, and other features. Although Mapbox is not a free service, they do seem to have generous free API call limits (see their pricing here).

Heavy Duty GIS Applications

Not included here but also really important for people doing mapping are the full-fledged GIS applications such as ArcGIS and QGIS. These are extremely powerful tools that are worth knowing as a geodata analyst. Note that ArcGIS is quite expensive; however, it is an industry standard, so it’s worth knowing. QGIS is also fairly commonly used and has the advantage of being free and open source.

Any glaring omissions in this post? Let me know in the comments below or send me an email.

Further resources

Basemap

Mapnik

Folium

Bokeh

## Digging into Data Science Tools: Anaconda

In the Digging into Data Science Series, I dive into specific tools and technologies used in data science and provide a list of resources you can find to learn more yourself. I keep posts in this series updated on a regular basis as I learn more about the technology or find new resources. Post Last updated on May 28, 2018

If you’re familiar at all with Data Science, you have probably heard of Anaconda. Anaconda is a distribution of Python (and R) targeted towards people doing data science. It’s totally free, open-source, and runs on Windows, Linux, and Mac.

Up until recently, I basically treated my installation of Anaconda as a vanilla python install and was totally ignorant about the unique features and benefits it provides (big mistake). I decided to finally do some research into what Anaconda does and how to get the most out of it.

### What is Anaconda and why should you use it?

Anaconda is much more than simply a distribution of Python. It provides two main features to make your life way easier as a data scientist: 1) pre-installed packages and 2) a package and environment manager called Conda.

### Pre-installed packages

One of the main selling points of Anaconda is that it comes with over 250 popular data science packages pre-installed. This includes popular packages such as NumPy, SciPy, Pandas, Bokeh, and Matplotlib, among many others. So, instead of installing Python and running pip install a bunch of times for each of the packages you need, you can just install Anaconda and be fairly confident that most of what you’ll need for your project will be there.
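If you want to see which of these packages are actually available in whatever Python you’re running, here is a minimal sketch using only the standard library. The function name `available_packages` is my own, and the list of import names is just the handful mentioned above; swap in whatever your project needs.

```python
from importlib.util import find_spec


def available_packages(names):
    """Map each top-level import name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names for a few of the packages mentioned in this post
    wanted = ["numpy", "scipy", "pandas", "bokeh", "matplotlib"]
    for name, present in available_packages(wanted).items():
        print(f"{name:12s} {'installed' if present else 'missing'}")
```

On a fresh Anaconda install I would expect all of these to show up as installed, while a vanilla Python install will likely report most of them as missing.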

Anaconda has also created an “R Essentials” bundle of packages for the R language as well, which is another reason to use Anaconda if you are an R programmer or expect to have to do some development in R.

In addition to the packages, it comes with other useful data science tools preinstalled:

• iPython / Jupyter notebooks: I’ll be doing a separate “Digging into Data Science Tools” post on this later, but Jupyter notebooks are an incredibly useful tool for exploring data in Python and sharing explorations with others (including non-Python programmers). You work in the notebook right in your web browser and can add python code, markdown, and inline images. When you’re ready to share, you can save your results to PDF or HTML.
• Spyder: Spyder is a powerful IDE for python. Although I personally don’t use it, I’ve heard it’s quite good, particularly for programmers who are used to working with tools like RStudio.

In addition to preventing you from having to do pip install a million times, having all this stuff pre-installed is also super useful if you’re teaching a class or workshop where everyone needs to have the same environment. You can just have them all install Anaconda, and you’ll know exactly what tools they have available (no need to worry about what OS they’re running, no need to troubleshoot issues for each person’s particular system).

### Package and environment management with Conda

Anaconda comes with an amazing open source package manager and environment manager called Conda.

Conda as a package manager

A package manager is a tool that helps you find packages, install packages, and manage dependencies across packages (i.e. packages often require certain other packages to be installed, and a package manager handles all this messiness for you).
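To make the “manage dependencies” part a bit more concrete, here is a toy sketch of one piece of the job: working out an install order so that every package comes after the packages it requires. The dependency table is hypothetical and heavily simplified, and `install_order` is my own helper name. Real package managers also have to reconcile version constraints, which this sketch ignores entirely.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified data: package -> packages it requires
DEPENDENCIES = {
    "pandas": {"numpy", "python-dateutil"},
    "scipy": {"numpy"},
    "matplotlib": {"numpy"},
    "python-dateutil": {"six"},
    "numpy": set(),
    "six": set(),
}


def install_order(requested, dependencies):
    """Return the requested packages plus everything they need, dependencies first."""
    needed = {}
    stack = list(requested)
    while stack:
        package = stack.pop()
        if package in needed:
            continue  # already collected
        needed[package] = dependencies.get(package, set())
        stack.extend(needed[package])
    # static_order() yields each package only after all of its requirements
    return list(TopologicalSorter(needed).static_order())


order = install_order(["pandas"], DEPENDENCIES)
print(order)
```

Asking for pandas alone pulls in numpy, python-dateutil, and six, and leaves out scipy and matplotlib because nothing requested needs them.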

Probably the most popular package manager for Python is its built-in tool called pip. So, why would you want to use Conda over pip?

• Conda is really good at making sure you have the proper dependencies installed for data science packages. Researching for this blog post, I came across many stories of people having a horrible time installing important and widely used packages with pip (especially on Windows). The fundamental issue seems to be that many scientific packages in Python have external dependencies on libraries in other languages like C and pip does not always handle this well. Since Conda is a general-purpose package management system (i.e. not just a python package management system), it can easily install python packages that have external dependencies written in other languages (e.g. NumPy, SciPy, Matplotlib).
• There are many open source packages available to install through Conda which are not available through pip.
• You can use Conda to install and manage different versions of python (python itself is treated as just another package).
• You can still use pip while using Conda if you cannot find a package through Conda.

Together, Conda’s package management and the pre-installed packages make Anaconda particularly attractive to newcomers to Python, since they significantly lower the amount of pain and struggle required to get stuff running (especially on Windows).

Conda as an environment manager

An “environment” is basically just a collection of packages along with the version of python you’re using. An environment manager is a tool that sets up particular environments you need for particular applications. Managing your environment avoids many headaches.

For example, suppose you have an application that’s working perfectly. Then, you update your packages and it no longer works because of some “breaking change” to one of the packages. With an environment manager, you set up your environment with particular versions of the packages and ensure that the packages are compatible with the application.
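Here is a rough sketch of the idea behind pinning versions: compare the versions your application expects against what is actually installed and report any drift. This is not how Conda does it internally; it is just an illustration using the standard library. The function name `find_mismatches` and the pinned versions are made up for the example.

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+


def find_mismatches(pinned, get_version=version):
    """Compare pinned versions with what is installed.

    Returns {package: (expected, found)} for every package that differs;
    found is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins, for illustration only
    pinned = {"numpy": "1.14.3", "pandas": "0.23.0"}
    for name, (expected, found) in find_mismatches(pinned).items():
        print(f"{name}: expected {expected}, found {found}")
```

An environment manager saves you from running checks like this by hand: the environment itself records the exact versions, so the application always runs against the packages it was tested with.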

Environments also have huge benefits when sending your application to someone else to run (they are able to run the program with the same environment on their system) and deploying applications to production (the environment on your local development machine has to be the same as the production server running the application).

If you’re in the Python world, you’re probably familiar with the popular environment manager virtualenv. So why use Conda for environment management over virtualenv?

• Conda environments can manage different versions of python. In contrast, virtualenv must be associated with the specific version of python you’re running.
• Conda still gives you access to pip, and pip packages are still tracked in Conda environments.
• As mentioned earlier, Conda is better than pip at handling external dependencies of scientific computing packages.

For a great introduction on managing python environments and packages using Conda, see this awesome blog post by Gergely Szerovay. It explains why you need environments and basics of how to manage them in Conda. Environments can be a somewhat confusing topic, and like a lot of things in programming, there are some up-front costs in learning how to use them, but they will ultimately save tons of time and prevent many headaches.
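To give a flavour of what this looks like in practice, here is a small sketch that builds two common Conda commands from Python: one to create an environment with a specific Python version, and one to export an environment’s package list. The helper names, the environment name “legacy-app”, and the versions are all hypothetical. The sketch only prints the commands rather than running them, so it is safe to try even without Conda installed.

```python
import shlex


def create_env_command(name, python_version, packages=()):
    """Build the argument list for `conda create` with a specific Python version."""
    return ["conda", "create", "--yes", "--name", name,
            f"python={python_version}", *packages]


def export_env_command(name):
    """Build the argument list for exporting an environment's package list."""
    return ["conda", "env", "export", "--name", name]


if __name__ == "__main__":
    # "legacy-app" and the versions below are hypothetical examples
    commands = [
        create_env_command("legacy-app", "2.7", ["numpy", "pandas"]),
        export_env_command("legacy-app"),
    ]
    for command in commands:
        print(shlex.join(command))  # shlex.join needs Python 3.8+
    # To actually run one, assuming conda is on your PATH:
    #     subprocess.run(command, check=True)
```

Most of the time you would just type these commands into a terminal; wrapping them in Python is only worthwhile when you want to script environment setup for several projects.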

### Bonus: no admin privileges required

In Anaconda, installations and updates of packages are independent of system libraries and administrative privileges. For people working on their personal laptop, this may not seem like a big deal, but if you are working on a company machine where you don’t have access to admin privileges, this is crucial. Imagine having to run to IT whenever you wanted to install a new Python package – it would be totally miserable, and it’s not a problem to be underestimated.

## Why Data Scientists Should Join Toastmasters

Public speaking used to be a big sore spot for me. I was able to do it, but I truly hated it and it caused me a great deal of grief. When I knew I had to speak it would basically ruin the chunk of time from when I knew I would have to speak to when I did it. And don’t get me started on impromptu speaking – whenever something like that would pop up, I would feel pure terror.

Things got a bit better once I entered the working world and had to speak somewhat regularly, giving presentations to clients and staff and participating in meetings. But still, I hated it. I thought I was no good and had a lot of anxiety associated with it.

A little over a year ago, I finally had enough and decided I needed to do something about it. I joined the Venio Dictum toastmasters group in Winnipeg. After only a few months, I started to become much more relaxed and at ease when speaking. Today, one year later, I actually look forward to giving speeches and facing the challenge of impromptu speaking. A year ago, if you told me I would feel this way now, I wouldn’t have believed you.

Imagine being the type of person that volunteers to address a crowd or give a toast off the cuff. Imagine looking forward to speaking at a wedding, meeting, or other event. Imagine being totally comfortable in one of those “networking event” situations where you enter a room crowded full of people you don’t know. A year ago, I used to think people that enjoyed this stuff were from another planet. Now, I understand this attitude and I feel like I’m getting there myself.

#### Why should a Data Scientist care about speaking skills?

• Communication is a critical part of the job

Yes, a huge part of being a data scientist is having skills in mathematics, statistics, machine learning, programming, and having domain expertise.

However, technical skills are not anywhere close to the entire picture. You might be fantastic at data analysis, but if you aren’t able to communicate your results to others, your work is basically useless. You’ll wind up producing great analyses that ultimately never get implemented or even considered at all because you failed to properly explain their value to anyone.

Speaking is a huge part of communication, so you need to be good at it (the other big area of communication is writing, but that’s a topic for another day).

To get to the next level in your career (and to get a data scientist job in the first place), it really helps to be a confident and persuasive speaker.

Job interviews are a great example. When you apply for a job, there will always be an interview component where you’ll have to speak and answer questions you have not prepared for in advance. Even if your resume and portfolio look great, it’s going to be hard for an employer to hire you if you bomb the interview.

This also applies to promotions from within your current company. Advanced positions typically rely more on communication and management skills like speaking and less on specific technical skills.

• Network / connection building

Improved speaking doesn’t just make your presentations better: it makes your day-to-day communications with colleagues and acquaintances better too. You’ll become a better conversationalist and a better listener.

As a data scientist, you’ll likely be working with multiple teams within your organization and outside your organization. You will need to gain their trust and support, and better speaking helps you do that.

• It makes you a better thinker / learner

The motto of my Toastmasters club is “better listening, thinking, and speaking” because a huge part of speaking is learning how to organize your thoughts in a clear package so they are ready for delivery to your audience. As George Horace Lorimer said in his book Letters from a Self-Made Merchant to His Son:

“There’s a vast difference between having a carload of miscellaneous facts sloshing around loose in your head and getting all mixed up in transit, and carrying the same assortment properly boxed and crated for convenient handling and immediate delivery.”

Having a lot of disparate facts in your head is not very useful if they are not organized in a way that lets you easily access them when the time is right. Preparing a speech forces you to organize your thoughts, create a coherent narrative, and understand the principles underlying the ideas that you’re trying to communicate.

This habit of understanding the underlying rules and principles behind what you learn is referred to by psychologists as “structure building” or “rule learning”. As described in the book Make it Stick, people who do this as a habit are more successful learners than people that take everything they learn at face value, never extracting principles that can be applied to new situations. Public speaking cultivates this skill.

This is particularly important for data scientists, given the incredibly diverse range of subjects we are required to develop expertise in and the constantly evolving nature of our field. To manage this firehose of information, we must have efficient learning habits.

One great thing about Toastmasters is you can give a speech on any topic you want. So go ahead, give a speech on deep reinforcement learning to help solidify your understanding of the topic (but explain in a way that your grandmother could understand).

• Speaking is a fundamental skill that will impact your life in many other ways

Speaking is a great example of a highly transferable skill that pays off no matter what you decide to do. Since it deeply pervades everything we do in our personal and professional lives, the ROI for improving is tremendous. (In my opinion, some other skills that fall into this category include writing, sales, and negotiation.)

Consider all the non-professional situations in your life where you speak to others (e.g. your spouse, kids, parents, family, acquaintances, community groups). Toastmasters will make all of these interactions more fruitful.

Suppose you decide data science isn’t right for you. Well, you can be close to 100% sure the speaking skills you develop through Toastmasters will still be valuable whatever you decide to do instead.

In terms of the 80-20 rule, working on your public speaking is definitely part of the 20% that yields 80% of the results. It’s worth your time.

#### How Does Toastmasters Work?

Although each club may do things a little differently, all use the same fundamental building blocks to make you a better speaker:

• Roles: At each meeting, there is a list of possible “roles” that you can play. Each of these roles trains you in a different public speaking skill. For example, the “grammarian” observes everyone’s language and eloquence, and the “gruntmaster” monitors all the “ums”, “uhs”, “likes”, “you knows”, etc. There is a “Table Topics Master” role where you pose random questions that members have not prepared for in advance, and they have to give an impromptu two-minute speech in response (an incredibly valuable training exercise, especially if you fear impromptu speaking). Here is the complete list of roles and descriptions of the roles in my club.
• Prepared speeches: Of course, there are also tons of opportunities to provide prepared speeches. Toastmasters provides you with manuals listing various speeches to prepare that give you practice in different aspects of public speaking. You do these speeches at your own pace.
• Evaluations: Possibly the most valuable feature of Toastmasters is that everyone’s performance is evaluated by others. In a Toastmasters group, you’ll often have a subset of members that are very skilled and experienced speakers (my club has several members that have been with the club 25+ years), and the feedback you get can be invaluable. It’s crucial for improvement, and it’s something you don’t usually get when speaking in your day-to-day life.

Go out and find a local Toastmasters group and check it out as a guest to see how it works up close. They will be more than happy to have you. You owe it to yourself to at least give it a try. If your experience is anything like mine, you’ll be kicking yourself for not starting earlier (and by “earlier” I mean high school or even sooner – it’s definitely something I’m going to encourage my daughter to do).

## Creating a Commonplace Book with Google Drive

Here’s a problem I’ve had for a long time: I would invest a lot of time into reading a great book, then inevitably as time passed the insights I gained would slowly disappear from my mind. This was pretty discouraging to me and seemed to defeat the purpose of reading the book in the first place.

So, I was very excited to come across this blog post by Ryan Holiday on keeping a “commonplace book”: your own personal repository of important insights from the various sources you encounter throughout your life.

The source of these insights can come from anywhere, like books, blogs, speeches, interactions you’ve had with others, interesting situations you’ve been in, jokes, personal life stories, ideas, etc. I’ve been doing it for about a year and have added 220 notes and counting.

There are a bunch of benefits from keeping a commonplace book:

• You can easily review it and go back to categories of ideas as you encounter challenges in your life, and your commonplace book will have all the most important insights for you at your disposal, ready to go. For example, if you have a challenge with one of your kids, you have your “parenting” category in your commonplace book ready to go to provide you with support.
• You improve your reading skills by consolidating and condensing the most important and relevant material from your sources, as it forces you to think more carefully about what you’re reading and what it means.
• By reviewing and reflecting on your commonplace book entries, you improve your writing skills and increase your memory and comprehension about the materials you’ve read.
• You can use the quotes from your commonplace book to enhance your own writing. For example, I’ve noticed Ryan Holiday’s writing liberally uses quotes from other sources. These often come from an extremely wide variety of sources and are really effective at supporting the points he is making in his writing. This is clearly the result of his voracious reading habits and commonplace book note-taking.

A commonplace book is like an investment that grows and grows over time. Much like a stock or bond, the earlier you invest, the bigger the payoff.

## My commonplace book system

There are a million different ways that you can develop a commonplace book. There’s obviously no one “correct” way to do it, but hopefully my personal commonplace book system gives you some inspiration.

My system uses Google Docs, using a template designed to maximize my retention and reflection on the commonplace book notes and make the commonplace book easily searchable. The system also uses the Google Drive API to send myself daily emails each morning with commonplace book notes to review.

As much as possible, I’ve designed the system to take advantage of scientifically proven learning and retention techniques, including testing and recalling, spaced repetition, varied practice, and elaboration.

Aside: I highly recommend the book Make it Stick, which outlines the best evidence-supported study and learning methods and debunks a lot of common misconceptions. For example, re-reading passages and highlighting are horribly inefficient learning techniques.

#### My Template for Notes

Here is the template that I use for each of my commonplace book notes:

I make one of these notes for each important point or insight that I come across. For a good book packed with useful information, I’ll probably create about 10-20 of these notes. Each of the components of this template has a specific purpose:

• Title of Note: This typically labels the content of the note in some way to trigger my memory about what the passage is about. I try to make it somewhat vague so I don’t give away all the content. Then, when I review my notes, I’ll read only the title and then look away to try to recall what the passage is about. This is a way of incorporating testing and recall into my review. It helps improve retention and memory of the note and makes it more likely that I’ll have it in mind, ready to apply when the time is right.
• Content: The main content of the note – usually a quotation, but not always.
• Notes: A place where I can connect the content with existing knowledge and add any personal ideas or insights. Doing this kind of elaboration helps with understanding and retention.
• Citation information: The author, source, page, and url so I know exactly where each passage came from and can look it up or cite it easily if necessary. This also makes the notes way more searchable. Google Drive has great built-in search features (as you would expect), making it easy to find notes from a particular author, book, or tag.

Here’s an example of a note I took from the business and management book The Effective Executive by Peter Drucker. I added this quote to my commonplace book because it had actually never really occurred to me that a job could be poorly designed and unfit for humans. Seemed like a good insight to keep in mind as a prospective employee and if I’m ever in the position of creating job positions myself.

#### The Folder System

I divide my notes into folders and subfolders related to the topic. Often, one note could fit in more than one folder. To solve this problem, I make sure to write tags in the filename, and then randomly pick one of the relevant folders to put it in. Using tags, I can rely on search more easily for notes applicable to multiple topics. So if something belongs to both “Business” and “Productivity”, I can just add it to productivity and make sure I add both the business and productivity tags.

The great thing about having your notes in Google Drive is that you can take advantage of Google’s powerful search feature to find exactly what you need. For example, you can see below the options available in the search feature. You can search by file type, folder location, filename, and contents. With the structured note templates, you can find pretty much anything you need at the snap of a finger. For example, finding all the notes from Peter Drucker is simply a matter of writing in “Author: Peter Drucker” in the “Item Name” option shown below.

#### Daily Reviews Using Python and the Google Drive API

If you don’t have any programming knowledge, this commonplace book system will still serve you well and you don’t have to read on. But since this is a blog about hacking APIs and open data for the purposes of automation and competitive intelligence, there is more to my commonplace book system than simply adding documents to a Google Drive folder.

The commonplace book notes aren’t of much use if they’re just sitting in Google Docs unused, so I wanted to create a system of regular and automated review. Specifically, I wanted to receive an email every morning with 5 randomly selected notes and review these notes as part of my daily routine. This helps incorporate the learning techniques of testing, spaced repetition, and varied practice. Regular review emails also give me a chance to edit notes if there are things I want to change or add.

You can find all of the code for this review system here. Here is an overview of what it does:

• Selects 5 documents at random across all the files in your commonplace book folder and subfolders
• Builds an email template with links to the five randomly selected commonplace book notes. It does this using the Jinja2 template engine.
• Sends the email of commonplace book notes to review to yourself (and any other recipients you want). See this previous blog post on how to write programs to send automated email updates.
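The steps above can be sketched roughly as follows. This is not the code from my repository: to keep it runnable with just the standard library, it uses `string.Template` as a stand-in for Jinja2, works on a made-up list of notes instead of calling the Google Drive API, and stops short of actually sending anything. The function names and the example addresses are placeholders.

```python
import random
from email.message import EmailMessage
from html import escape
from string import Template

# Stand-in for the Jinja2 template; string.Template keeps this stdlib-only
ITEM = Template('<li><a href="$url">$title</a></li>')


def pick_notes(notes, k=5, rng=random):
    """Choose up to k notes at random, without repeats."""
    return rng.sample(notes, min(k, len(notes)))


def build_email(notes, sender, recipients):
    """Build an HTML email with one link per selected note."""
    items = "\n".join(
        ITEM.substitute(url=escape(note["url"]), title=escape(note["title"]))
        for note in notes
    )
    message = EmailMessage()
    message["Subject"] = "Commonplace book notes to review"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message.set_content("Your email client does not display HTML.")
    message.add_alternative(f"<ul>\n{items}\n</ul>", subtype="html")
    return message


if __name__ == "__main__":
    # Hypothetical notes; the real system would list files via the Drive API
    notes = [{"title": f"Note {i}", "url": f"https://example.com/note/{i}"}
             for i in range(12)]
    message = build_email(pick_notes(notes), "me@example.com", ["me@example.com"])
    print(message["Subject"], "->", message["To"])
```

Sending the resulting message is a separate step (for example with `smtplib`), which is covered in the earlier post on automated email updates.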

I run the code on a DigitalOcean droplet, set on a cron job to run build_email.py at 7 am every day.

Before you try to run the code, you should follow steps 1 and 2 in this Python quick-start guide to turn on the Google Drive API and install the Google Drive client library. This will produce client_secret.json that you will need to place in the top directory of the code.

You’ll also need to make a few substitutions to placeholders in the code:

• Enter in your Gmail email and password in the file email_user_pass.json.
• Enter in the email that will be sending the email update and the list of recipients in the file emails.json.
• In build_email.py, you need to provide a value for COMMONPLACE_BOOK_FOLDER_ID. You can find this by looking in the URL when you navigate to your commonplace book folder in Google Drive.
• Install any of the required packages in build_email.py
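If you would rather not copy the folder ID out of the address bar by hand, here is a small sketch that pulls it out of the URL. It assumes the folder URL contains a `folders` path segment followed by the ID (the shape I see in my browser, but treat that as an assumption), and the ID in the example is a placeholder, not a real folder.

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "folders" not in parts:
        raise ValueError("no 'folders' segment in URL")
    index = parts.index("folders")
    if index + 1 >= len(parts):
        raise ValueError("URL ends at 'folders' with no ID after it")
    return parts[index + 1]


# Placeholder ID, not a real folder
COMMONPLACE_BOOK_FOLDER_ID = folder_id_from_url(
    "https://drive.google.com/drive/folders/EXAMPLE_FOLDER_ID"
)
print(COMMONPLACE_BOOK_FOLDER_ID)
```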

You can customize the way that the email looks by modifying templates/email.html.

Note that this code is useful not just for this commonplace book system, but any system where you need to receive automated email updates that randomly select files in your Google Drive.

#### Just do it

Start your commonplace book now. Even if you don’t know Python and can’t do the automated email update stuff, it doesn’t matter. That part is just icing on the cake, and there are other ways you can review your commonplace book notes.

Trust me, you won’t regret taking the time to do this. The only thing you’ll regret is the fact that you didn’t start doing it 20 years ago.