Blog

Notes on Developing Transformative Tools for Thought

Occasionally, special tools come along that amplify our minds, enabling new kinds of thought not otherwise possible. Computers, writing, speaking, and the printing press are all examples of these “Tools for Thought” that expand human potential.

This essay from Andy Matuschak and Michael Nielsen explores whether we can accelerate the development of these kinds of tools. They also provide a taste of what such tools might look like with their prototype “mnemonic medium”: an interactive essay on quantum mechanics called “Quantum Country” with embedded flashcards, combined with a spaced repetition system delivered through email follow-ups.

Their essay is a must-read for anyone interested in spaced repetition or productive learning. There are several points I found thought-provoking, and I believe each of these insights indicates a need for a new Tool for Thought for flashcard development and sharing.

Spaced repetition creates exponential returns to studying

Based on Quantum Country user data, Matuschak and Nielsen estimate that devoting only 50% more time to spaced repetition after reading the essay resulted in users recalling the key points for months or years. 

In other words, relatively small investments in spaced repetition after reading an article produce outsized results – more evidence to place on top of the mountain of research suggesting spaced repetition works.

Good flashcard development is difficult

Matuschak and Nielsen note that it takes a surprising amount of skill and time to build quality flashcards, especially for abstract concepts. This is probably a big reason why most people fail to adopt spaced-repetition tools like Anki. Since flashcard development is a skill that takes significant time and effort to build, new users tend to add cards in ways that inevitably lead to frustration and failure.

This may partially explain the efficacy of Quantum Country: the authors are experts in both quantum mechanics and flashcard development – a rare but essential combination of skills for their essay to work. 

Flashcards written by others can be usable

Some people in the spaced repetition community don’t believe in using flashcards created by others, and with good reason. They’re often poorly written. They’re idiosyncratic. They’re missing crucial context that you don’t have if you haven’t read the original source material. I used to be one of these non-believers.

But the effectiveness of the Quantum Country essay suggests that shared flashcards can work well. This saves users the burden of flashcard creation, and it prevents the frustration new users experience due to poor flashcard-building skills and limited domain knowledge.

Matuschak and Nielsen hypothesize that the quality of their flashcards is what makes this work. I agree, but I have a few more hypotheses: 

  • Their flashcards are introduced in a logical progression as users read the essay. In contrast, shared decks in Anki shuffle cards randomly and are not encoded with dependency information.
  • Their flashcards are clearly connected to a source (i.e. the essay), providing important context for the user.
  • Users learn the material before they review flashcards. This is in line with the common wisdom that flashcards don’t work if you don’t already understand the material – they are a tool for retention, not learning. Aside: is this common wisdom true? I’m not so sure. Socrates taught using Q and A, so why can’t you teach a subject entirely with flashcards? If it is possible, what are the prerequisites to making it work? 

Elaborative encoding

Matuschak and Nielsen point to elaborative encoding as another learning technique shown to be extremely powerful for memory. Essentially, it means connecting new ideas you want to remember with old ideas you already know well, giving your brain a fast path to the new information.

Remember this concept while developing your flashcards. Whenever you add a new card, think about what you already know well and how you can connect this to the new knowledge.

A New Tool for Thought?

Matuschak and Nielsen’s article has renewed my interest in a tool for thought idea I’ve been pondering for quite a while: a platform for collaborative flashcard development and sharing. I believe such a tool, if properly developed, could address the issues that limit the use of spaced repetition:

  • Spaced repetition practitioners currently need to develop their own flashcards, which requires a significant amount of time, domain expertise, and flashcard-building skill. There needs to be a place where experts can create shared flashcards, and there should be a proper incentive structure encouraging creators to improve these flashcards over time.
  • Flashcards are not clearly connected to original sources. Spaced repetition practitioners should be able to pull up pre-built flashcards for a source document they are working through. 
  • Current tools do not provide information that links flashcards together (other than knowing that two flashcards are part of the same deck or share a tag). At the very least, flashcards should have a notion of “depends on” or “prerequisite to”. This would make shared decks more useful by showing the intended progression of knowledge. It would also aid elaborative encoding (e.g. examining cards you’ve reviewed and linking them to cards “nearby” in a knowledge graph). A rough sketch of what such dependency information could look like follows this list.
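To make the dependency idea concrete, here is a minimal sketch. All the names and structure here are hypothetical: no existing tool works this way, as far as I know.

# Hypothetical sketch: representing "depends on" links between flashcards.
from dataclasses import dataclass, field

@dataclass
class Flashcard:
    card_id: str
    front: str
    back: str
    source: str                                     # the document the card came from
    depends_on: list = field(default_factory=list)  # ids of prerequisite cards

def ready_to_introduce(card, learned_ids):
    """A card is only introduced once all of its prerequisites have been learned."""
    return all(dep in learned_ids for dep in card.depends_on)

basic = Flashcard("qc-1", "What is a qubit?", "...", source="Quantum Country")
advanced = Flashcard("qc-2", "What is superposition?", "...",
                     source="Quantum Country", depends_on=["qc-1"])

print(ready_to_introduce(advanced, learned_ids={"qc-1"}))  # True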

I strongly believe a tool like this needs to exist, as you may have guessed if you noticed the Download my Brain feature I built for this site, which provides a platform for sharing my personal Anki decks. I have started work on a more general tool for collaborative flashcard construction and sharing, and I will keep you posted once I have something ready for production.

Thanks to Andy Matuschak and Michael Nielsen for the inspiration to follow this path.

The Hidden Power of Compounding (and 4 Ideas for Harnessing it)

Ways to take advantage of this powerful and often-overlooked force for improvement 

Ever wonder how successful people reach such heights? Think of a wildly successful person you admire. How did they get there? 

The typical answers are hard work, innate gifts (personality, natural ability) and luck. These factors play a role, but the most important factor is left out: they leverage the power of compounding.

What is Compounding?

For compounding to occur, only two things are required:

  • Growth: Something must be growing by some percentage each year
  • Time: The growth process happens over multiple years

You’ve probably heard of a specific type of compounding: compound interest. In this context, compounding means that growing your money by some percentage every year eventually snowballs into huge results if you give it enough time. 

For example, say you are able to earn 7% per year on your money. That doesn’t sound like a lot: if you invest $1000, that’s $70 in a year. Seems pretty modest.

Yes, you’ll earn $70 in year 1, but in year 2 you earn another $70 (7% of $1,000) plus 7% on the extra $70 your account gained in year 1. As time passes, that 7% per year pumps out bigger and bigger piles of cash:

  • In Year 10: You’ve doubled the size of your account ($1,967)
  • In Year 20: You’ve quadrupled the size of your account ($3,870)
  • In Year 30: Your account is now almost 8 times your original investment of $1,000 ($7,612)
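For the curious, here is a quick sketch of the arithmetic behind those numbers:

# Compound growth: $1,000 invested at 7% per year.
principal, rate = 1000, 0.07

for year in (10, 20, 30):
    balance = principal * (1 + rate) ** year
    print(f"Year {year}: ${balance:,.0f}")

# Year 10: $1,967
# Year 20: $3,870
# Year 30: $7,612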

One lesson from this exercise is to save and invest some of your money. Your future self will thank you. 

Another lesson is the earlier you start, the better. More years means more opportunity for exponential growth. 

But there’s an interesting corollary to this exercise that is less obvious, and much more exciting: compounding isn’t just something that happens to your bank account – it applies to many other areas of life. 

“We’re all sort of blundering fools, but if you just get some rate of improvement and let it keep compounding, you can do pretty well…You always want to be on some sort of curve where you’re compounding.”

– Tyler Cowen, on David Perell’s Podcast

All you need for compounding is growth and time – there is nothing about dollars or bank accounts mentioned in that definition. If you can get some percentage growth rate in some area of interest every year, you’ll eventually reach heights you never dreamed possible.

Think about some area where you want to excel. As an example, let’s say you’re in sales. If you improved at sales by 10% every year, you would be twice as good in 7 years, four times as good in 15 years, ten times as good in 25 years, and seventeen times as good in 30 years. In any given year, you’re not making tremendous improvements, but over time persistence leads to tremendous outcomes.

Also, keep in mind that although 10% growth is a great rate for a bank account, who’s to say this is a good rate of growth for your sales career? Maybe a reasonable growth rate is much higher. If that’s the case, you can expect much more dramatic results. 

I believe this is how extraordinary people like Elon Musk reach rarefied heights: achieving a high growth rate in an area (e.g. managing a private space company) through intense focus and then relentlessly persisting to maintain that high growth rate over many years.

4 Ideas for Better Compounding

There are lots of ways you can compound in an area you want to improve. Look for anything that 1) gives you growth by some percentage or 2) helps you maintain that growth over multiple years. Some of the obvious ideas here include reading, taking lessons, attending talks, working with a mentor, and just simply doing work in the area you want to compound.

That being said, I have a few other tips that can both increase your percentage growth rate (ideas #2 and #3) and ensure you stick to it over the years (ideas #1 and #4). 

Idea #1 – Have a plan

The key to compounding is consistent, focused effort over multiple years. It’s hard to do that without being clear about where you want to improve. If you don’t have clear goals, you’ll forget them or lose discipline, stifling your compounding efforts.

So, I recommend writing down the areas you would like to compound over time. Check back on this list regularly (I check weekly) and make sure that every year you’re making some effort to improve your abilities by some percentage.

For example, here’s a list of areas I’m personally focused on compounding over time:

  • Data science
    • Statistics / math
    • Programming
  • Communication
    • Writing
    • Speaking
    • Sales / persuasion
  • Managing teams / project management
  • Personal brand / developing followers on my blog and email list
  • Health
  • Relationships
  • Cooking
  • Drumming

I keep a Google document of this list and keep track of specific things that I am doing within each category to propel myself forward. 

Idea #2 – Use spaced repetition

You’ve probably had this experience: you read a book or take a course and you want to retain and apply as much of it as you can. Inevitably, the precious knowledge slowly exits your mind, and 6 months later it feels like you didn’t learn anything in the first place.

Spaced repetition is a solution to this problem.

Spaced repetition means quizzing yourself on knowledge at increasing intervals of time. It’s extremely effective and time-efficient. I have flashcards in my spaced repetition system that I just answered and will not be quizzed on again for 2 years. This spacing allows you to hold tens of thousands of flashcards in your mind while only reviewing tens of flashcards a day.
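To see why the daily review load stays so small, here is a toy illustration of how intervals might grow. Real tools like Anki and Supermemo use more sophisticated, per-card scheduling; the doubling factor here is purely illustrative.

# Toy illustration: each successful review roughly multiplies the next interval.
interval_days = 1
for review in range(1, 11):
    print(f"Review {review}: next review in {interval_days} days")
    interval_days *= 2  # real schedulers adapt this factor per card

# After about 10 successful reviews, the next interval is measured in years,
# which is why a mature collection needs only a handful of reviews per day.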

For more information about the science behind spaced repetition, check out this great summary by Gwern Branwen.

I have been a long-time user of spaced-repetition tools for committing things to long-term memory. During my university years, I wrote cards in Supermemo for all of my courses, and it was the secret weapon to my performance. Today, I use Anki, and it continues to be a key tool for compounding my knowledge in data science and retaining the vast amount of information required for success in the field.

I’m so excited about spaced repetition, I even have a feature on my website that I built to share my data science Anki decks with the world, organized by source: Download my Brain! I’m also running a Spaced Repetition Newsletter that will provide the latest news and links related to Anki, tips on using Anki, and ideas related to spaced repetition and productive learning.

Idea #3 – Keep a commonplace book

A commonplace book is a place where you store wisdom or interesting things you’ve read, heard from others, or thought about yourself. I’ve written previously about the benefits of keeping a commonplace book and a summary of my personal commonplace book system, which involves a spaced-repetition-esque component (I wrote code to email myself 5 randomly selected commonplace book notes from my collection). Lots of others have written about them as well (here is a good article from Ryan Holiday, which is where I first heard about this idea).
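The emailing piece is easy to reproduce. Here is a minimal sketch of the idea (not my actual code; the file name, addresses, and SMTP details are placeholders):

# Sketch: email yourself 5 randomly selected commonplace book notes.
import random
import smtplib
from email.message import EmailMessage

# Assumes notes are stored one per line in a text file (placeholder path).
with open("commonplace_notes.txt") as f:
    notes = [line.strip() for line in f if line.strip()]

selection = random.sample(notes, k=min(5, len(notes)))

msg = EmailMessage()
msg["Subject"] = "Your 5 commonplace book notes"
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"
msg.set_content("\n\n".join(selection))

# Placeholder SMTP server and credentials.
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("me@example.com", "app-password")
    server.send_message(msg)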

A commonplace book is similar to spaced repetition in that it’s a tool to help you retain more of the important things that you learn. It also provides you with a body of material you can draw from, which is particularly valuable if you are a writer. 

I personally use my commonplace book system to hold long-form wisdom that doesn’t lend itself well to flashcard quizzing.

Idea #4 – Use commitment contracts

Once you have a plan for compounding in your improvement areas, you need to follow through with that plan. Over the years I’ve tried many strategies to stay motivated and stick to a plan, and the most effective tool I’ve found is the commitment contract. I’ve used commitment contracts to read and write more than I ever have in the past, despite being busier than ever with a full-time job and a 3-year-old at home.

Commitment contracts are agreements where you commit to doing something, and failing leads to actual consequences. For example, I have a commitment contract to spend a minimum number of hours each week on various areas of interest. I track my time during the week, report back each week, and if I’m below my target I’m penalized $5. 

I have found this surprisingly effective, especially considering the low stakes. It doesn’t take much of a penalty to provide enough motivation to follow through with your targets. I think this may be tapping into our inherent loss-aversion: we are irrationally repulsed by a loss, much more than the size of the loss would indicate. Of course, you can make your failure penalties higher if you want, but I personally don’t find it necessary. 

The specific tool that I use for commitment contracts is StickK. I highly recommend it. 

Conclusion

I think these tips will help you out, but the most important point to remember is compounding only requires two things: growth and time. Growth ensures you make progress, and time is what allows you to reach heights you never thought possible. In any given year, you may not be making huge improvements, but over time your persistence leads to huge outcomes.

If you want to commit this article to long-term memory, download the Anki deck I put together for it here.

I also gave a speech based on this blog post at my local Toastmasters group, which you can find here:

Spaced Repetition for Efficient Learning by Gwern Branwen

If you’re interested in spaced repetition and haven’t read it yet, Gwern Branwen’s essay is a fantastic review, covering research into the benefits of testing, what spaced repetition is used for, software, and other observations. There are lots of resources here for people who want to know the research behind why spaced repetition works, including many studies on the effects of testing and the effects of spacing your learning.

Here are a few points I found most interesting in Gwern’s essay:

People underestimate the benefits of spaced repetition: Gwern references fascinating research on how students and teachers grossly underestimate how much better spaced repetition is compared to cramming for learning. 

Spaced repetition as a tool for maintenance of knowledge, not gaining new knowledge: According to some of the research referenced in the essay, spaced repetition doesn’t teach you new skills or abilities. Rather, it just helps you maintain existing skills.

  • “Skills like gymnastics and music performance raise an important point about the testing effect and spaced repetition: they are for the maintenance of memories or skills, they do not increase it beyond what was already learned. If one is a gifted amateur when one starts reviewing, one remains a gifted amateur.”
  • I’m resistant to the idea that you can’t learn using flashcards. I agree that if new flashcards are thrown at you randomly, that is a very inefficient way to learn. But what if flashcards are presented to you in a specific order as you learn them, and “advanced” cards are not shown to you until you’ve learned the prerequisite “beginner” cards? I don’t know the research on this, but it strikes me as potentially a better way to learn than reading, if the cards are formulated properly. To my knowledge, none of the spaced repetition tools out there (e.g. Supermemo, Anki) allow for this kind of “card dependency” – if you know of a tool that does this, let me know.

Tradeoff between lookup time and mental space: A key question you have to ask yourself when using spaced repetition is: what information should I add to my spaced repetition system? Why add anything at all when you can just look it up on the internet?

  • The answer is that lookup costs can be large, especially for information you use a lot. Often when you want to apply your knowledge, there isn’t enough time to look it up, or the lookup time impedes your thinking. An extreme example is trying to recall an important piece of knowledge in a job interview – good luck pulling out your phone in front of the interviewer to get the right answer.
  • To figure out what to add, you have to strike a balance between lookup costs and the cost of adding the item to your spaced repetition system and reviewing it. Gwern offers a 5-minute rule as a criterion for deciding what to add: if you expect to spend more than 5 minutes over your lifetime looking something up, or if not having the knowledge in your head will somehow cost you more than 5 minutes, then it’s worth adding to your spaced repetition system.

Idea – dynamically generated cards: Gwern offers some interesting ideas about the possibility of dynamically generated cards. For example, a card could teach multiplication by randomly generating numbers to multiply. Similar ideas apply to chess, go, programming, grammar, and more.
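None of the mainstream tools support this out of the box as far as I know, but the core idea is easy to sketch:

# Sketch: a dynamically generated multiplication "card".
import random

def multiplication_card():
    a, b = random.randint(2, 12), random.randint(2, 12)
    return f"What is {a} x {b}?", a * b

question, answer = multiplication_card()
print(question)
# ...the user attempts an answer...
print(f"Answer: {answer}")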

How to Make Decisions: A Mental Framework and 5 Common Failures

Whether you like it or not, you are faced with a barrage of decisions every day. Human beings are essentially decision machines, and the quality of your life hinges on the choices you make. The better you are at making them, the better your life will be.

Some decisions are low stakes. Sometimes, the stakes are tremendous. What should I let my kids do? Where should I invest? Should I switch jobs? Should I ride my bike or take a car? Should I move to that neighbourhood? Should I reach out to the family member I haven’t talked to in ages? Should I cross the street? 

It’s valuable to have a mental decision framework that guides you to increase your decision quality and decrease the stress of the process. It’s also valuable to be aware of some common decision-making traps to avoid. 

The Framework

The mental framework I use for making decisions looks like this:

As you can probably tell, this mental framework is heavily influenced by economics (specifically, cost-benefit analysis which is a decision framework used by economists). 

At the highest level, making a decision in this framework involves 3 steps:

  1. Estimate the benefits
  2. Estimate the costs
  3. If the benefits outweigh the costs, do it. If not, don’t. When you’re choosing between multiple courses of action, pick the one with the highest benefit-to-cost ratio 

Steps #1 and #2 each break down into understanding two sub-components: probability and impact.

  • Probability is the chance that the benefit or cost happens
  • Impact is how good the benefit is or how bad the cost is if it happens

Higher probability or higher impact means you should place more weight on that benefit or cost. 
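In expected-value terms, the framework boils down to a couple of multiplications. Here is a minimal sketch in Python; the numbers are made up for illustration.

# Sketch: weigh each benefit and cost by probability x impact (expected value).
benefits = [
    {"name": "win the grant", "probability": 0.05, "impact": 100_000},
]
costs = [
    {"name": "time spent writing the application", "probability": 1.0, "impact": 2_000},
]

expected_benefit = sum(b["probability"] * b["impact"] for b in benefits)
expected_cost = sum(c["probability"] * c["impact"] for c in costs)

print(expected_benefit, expected_cost)  # 5000.0 2000.0
print("Do it" if expected_benefit > expected_cost else "Don't")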

Most of us probably use a similar framework in our heads without even knowing it, but spelling it out like this helps clarify tough decisions and exposes hidden flaws in our decision-making.

5 Common Decision-Making Failures 

This model may seem simple at first glance, but it’s deceptively tough to apply. This is because we’re human beings, not computers or all-knowing gods. We will never come close to applying this decision-making model perfectly. But even though we’ll never reach perfection, there are many ways to improve. 

Here are five common failures you can keep an eye out for.

Common Failure #1: Loss aversion

Loss aversion means that we tend to hate losses much more than we love gains. As a result of this tendency, you irrationally underestimate benefits and overestimate costs in your decision-making. 

This leads to particularly skewed decisions when considering benefits that are huge, but low probability (in other words, high payoff but a chance of a “loss”). 

For example, say you’re applying for a big grant and you estimate your probability of getting it is less than 5%. Many would instantly choose not to write the application, without a second thought, because the chances are slim. It’s true that winning the grant is unlikely, but the payoff is huge. Applying might still be worthwhile and shouldn’t be dismissed without consideration.

Or consider Elon Musk. When he started SpaceX, he didn’t say “well, the chance of building a profitable, successful space company is low, so let’s just forget it”. Instead, he thought “the probability of success with this company is only about 10 percent, but the upside if I’m successful is tremendous so I’m going to try”. 

Always remember that the emotional toll of losing is often irrationally high and doesn’t represent the actual cost of losing.

Common Failure #2: Not valuing your time

I remember in university, there was occasionally a “free pancake day”. Students would line up out the door for them, waiting about an hour in line to get these “free” pancakes. 

The pancakes are only “free” if you don’t value your time, which is a huge mistake. 

I personally believe you should value your time at least as highly as your equivalent hourly salary at work. You may disagree with that number, but the point is that your time is absolutely worth something. This was one of the key takeaways from my economics background and probably the best piece of advice I can give.

Common Failure #3: Focusing only on money and measurable things

We often ignore benefits or costs that are not monetary, and in general we have a difficult time considering things that are not easy to measure. Here are some difficult-to-measure areas where there are often huge costs or benefits:

  • Health and wellness (e.g “I could never ride a bicycle because it’s too dangerous”)
  • Relationships with loved ones (e.g. cutting ties with a family member over a $100 dispute)
  • Education, skills, and learning (e.g. “our company doesn’t have the skills to perform this core business task so let’s just contract it out”)
  • Culture / morale (e.g. “We can’t fire Sarah. Sure, she’s a toxic, unethical person and makes everyone in the office miserable, but damn is she ever good at making widgets!”)

Common Failure #4: Not considering the costs of inaction

No matter what you do, you are making a decision. Choosing to not do anything or to “not make a decision” is itself a decision. As a result, the benefits and costs of inaction must be considered alongside the benefits and costs of all other courses of action. 

Common Failure #5: Ignoring hidden benefits and costs

Many benefits and costs are commonly ignored because they are less obvious. For example:

  • Transporting yourself by bicycle leads to improvements to your physical appearance, mood, strength, focus, and energy. 
  • Many people think the cost of owning a home is just a monthly mortgage payment. Hidden costs include property taxes, maintenance, and insurance, among many other expenses.
  • Many people think the cost of driving a car is just the cost of gas. Hidden costs here include insurance, parking, and maintenance.

Keep an eye out for costs and benefits that may be lurking slightly under the surface.

Conclusion

Decisions are hard. You’ll never make perfect decisions, but if you have a solid mental framework at your side and have some awareness of common mistakes, you can get better. More importantly, you can relieve some stress and feel more confident that you’re doing the best you can with what you have.

If you want to commit this article to long-term memory, download the Anki deck I put together for it here.

Take your Python Skills to the Next Level With Fluent Python

You’ve been programming in Python for a while, and although you know your way around dicts, lists, tuples, sets, functions, and classes, you have a feeling your Python knowledge is not where it should be. You have heard about “pythonic” code and yours falls short. You’re an intermediate Python programmer. You want to move to advanced. 

This intermediate zone is a common place for data scientists to sit. So many of us are self-taught programmers using our programming languages as a means to a data analysis end. As a result, we are less likely to take a deep dive into our programming language and we’re less exposed to the best practices of software engineering.

If this is where you are in your Python journey, then I highly recommend you take a look at the book “Fluent Python” by Luciano Ramalho. As an intermediate Python programmer, I’m the target audience for this book and it exceeded my expectations. 

The book consists of 21 chapters divided into five broad sections:

  • Data Structures: Covers the fundamental data structures within Python and their intricacies. This includes “sequences, mappings, and sets, as well as the str versus bytes split”. Some of this will be review for intermediate Python users, but it covers the topics in-depth and explains the counterintuitive twists of some of these data structures.
  • Functions as Objects: Describes the implications of functions being “first class objects” in Python (which essentially means functions can be passed around as arguments and returned from other functions). This influences how some design patterns are implemented (or don’t need to be implemented) in Python. First class functions are related to the powerful function decorator feature, which is also covered in this section (see the small sketch after this list).
  • Object-Oriented Idioms: A dive into various concepts related to object-oriented Python programming, including object references, mutability, and recycling, using protocols and abstract base classes, inheritance, and operator overloading, among other topics.
  • Control Flow: Covers a variety of control flow structures and concepts in Python, including iterables, context managers, coroutines, and concurrency. 
  • Metaprogramming: Metaprogramming in Python means creating or customizing classes at runtime. This section covers dynamic attributes and properties, descriptors, and class metaprogramming (including creating or modifying classes with functions, class decorators, and metaclasses). 
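For a taste of the “Functions as Objects” material, here is a small sketch of first class functions and a decorator. This is my own example, not one from the book.

# Functions are objects: they can be passed around and returned like any value.
def shout(text):
    return text.upper() + "!"

def apply_twice(func, value):
    return func(func(value))

print(apply_twice(shout, "hi"))  # HI!!

# A decorator is a function that takes a function and returns a new one.
def logged(func):
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args}")
        return func(*args, **kwargs)
    return wrapper

@logged
def add(a, b):
    return a + b

print(add(2, 3))  # prints the log line, then 5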

There are a few things I really liked about this book:

  • Advanced Target Market: There are lots of materials for Python beginners, but not nearly as many self-contained, organized resources that explain advanced material for intermediate users. This book helps fill that gap. 
  • Lots of great recommended resources: Each chapter has a “further reading” section with references to resources to expand your knowledge on the topic. The author not only links to these resources, but provides a brief summary and how they fit into the chapter’s material. Working through some of these references will be my game plan for the next several years as I work to propel my Python skills further. 
  • It’s opinionated and entertaining: The end of each chapter has a “soapbox” section providing an entertaining, informative, and opinionated aside about the subject. Reading these sections feels like having a chat over a beer with a top expert in the field. They provide context and inject some passion into topics that can be dry. These opinionated sections are clearly quarantined from the rest of the book, so it never feels like the author’s personal tastes are pushed on you. I wish every programming book had a section like this.
  • It pushes your understanding. I thought I was quite knowledgeable about Python and expected to work through this book quickly. I was wrong – it was slow reading and I plan on doing another deep read-through in 2020 to absorb more of the material. 

So check it out. Also, check out the Anki deck I created as a complement to the book, which should help you commit the insights you want to remember to long-term memory (here is a short guide on how to use my data science Anki decks).

Fluent Python Chapter Overview

Part 1: Prologue

  • Chapter 1: The Python Data Model. Provides background on the data model that makes Python such a great language to code in, allowing experienced Python programmers to anticipate features in new packages / APIs before even looking at the documentation.

Part 2: Data Structures

  • Chapter 2: An Array of Sequences. Sequences include strings, bytes, lists, and tuples, among other objects, and are one of the most powerful concepts in Python. Python provides a common interface for iterating, slicing, sorting, and concatenating these objects, and understanding the sequence types available and how to use them is key to writing Pythonic code.
  • Chapter 3: Dictionaries and Sets. Dictionaries are not only widely used by Python programmers – they are widely used in the actual implementation code of the language. This chapter goes in-depth into the intricacies of dictionaries so you can make the best use of them. 
  • Chapter 4: Text versus Bytes. Python 3 made some big changes in its treatment of strings and bytes, getting rid of implicit conversion of bytes to Unicode. This chapter covers the details to understand Python 3 Unicode strings, binary sequences and encodings to convert from one to the other. 

Part 3: Functions as Objects

  • Chapter 5: First Class Functions. Python treats functions as first class objects, which means you can pass them around as objects. For example, you can pass a function as an argument to a function or return a function as the return value of another function. This chapter explores the implications of this power.
  • Chapter 6: Design Patterns with First Class Functions. Many of the design patterns discussed in classic design patterns books change when you’re dealing with a dynamic programming language like Python. This chapter explores those patterns and how they look in Python’s first class function environment. 
  • Chapter 7: Function Decorators and Closures. Decorators are a powerful feature in Python that lets you augment the behaviour of functions. 

Part 4: Object-Oriented Idioms

  • Chapter 8: Object References, Mutability, and Recycling. This chapter covers details around references, object identity, value, aliasing, garbage collection, and the del command.
  • Chapter 9: A Pythonic Object. A continuation of Chapter 1, providing hands-on examples of the power of the Python data model and giving you a taste of how to build a “Pythonic” object by building a class representing mathematical vectors.
  • Chapter 10: Sequence Hacking, Hashing, and Slicing. An expansion of the Pythonic Vector object built in Chapter 9, adding support for a variety of standard Python sequence operations, such as slicing.
  • Chapter 11: Interfaces – From Protocols to ABCs. Building interfaces in a language like Python is typically different than in C++ or Java. Specifically, Abstract Base Classes (ABCs) are less common in Python, and the approach to interfaces is typically less strict, relying on “duck typing” and “protocols”. This chapter describes the various approaches to defining an interface in Python and offers food for thought about when to use each approach.
  • Chapter 12: Inheritance – For Good or For Worse. Subclassing, with a focus on multiple inheritance, and the difficulties of subclassing the built-in types in Python.
  • Chapter 13: Operator Overloading – Doing it Right. Operator overloading lets your objects work with Python operators such as |, +, &, -, ~, and more. This chapter covers some restrictions Python places on overloading these operators and shows how to properly overload the various types of operators available.
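For a flavour of the operator overloading covered in Chapter 13, here is a tiny example of my own (not the book’s Vector class):

# Overloading + and == on a simple 2D vector.
class Vec2:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vec2(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vec2({self.x}, {self.y})"

print(Vec2(1, 2) + Vec2(3, 4))   # Vec2(4, 6)
print(Vec2(1, 2) == Vec2(1, 2))  # True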

Part 5: Control Flow

  • Chapter 14: Iterables, Iterators, and Generators. The iterator pattern from the classic design pattern texts is built into Python, so you never have to implement it yourself. This chapter studies iterables and related constructs in Python.
  • Chapter 15: Context Managers and else Blocks. The bulk of this chapter covers the with statement (and related concepts), which is a powerful tool for automatically setting up a temporary context and tearing it down after you’re done with it (e.g. opening / closing a file or database). A small sketch follows this list.
  • Chapter 16: Coroutines. This chapter describes coroutines, how they evolved from generators, and how to work with them, including an example of using coroutines for discrete event simulation (simulating a taxi fleet). 
  • Chapter 17: Concurrency with Futures. This chapter teaches the concept of futures as “objects representing the asynchronous execution of an operation”. It also focuses on the concurrent.futures library, for which futures are a foundational concept.
  • Chapter 18: Concurrency with asyncio. A dive into the asyncio package that implements concurrency and is part of the standard library. 
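Here is the context manager sketch promised above (my own toy example, not one from the book):

# A context manager sets up a temporary context and tears it down afterwards.
from contextlib import contextmanager
import time

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield                        # the body of the with-block runs here
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

with timed("sum of squares"):
    total = sum(i * i for i in range(1_000_000))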

Part 6: Metaprogramming

  • Chapter 19: Dynamic Attributes and Properties. In programming languages like Java, it’s considered bad practice to let clients directly access a class’s public attributes. In Python, this is actually a good idea thanks to properties and dynamic attributes that can control attribute access (see the sketch after this list).
  • Chapter 20: Attribute Descriptors. Descriptors are like properties since they let you define access logic for attributes; however, descriptors let you generalize and reuse the access logic across multiple attributes.
  • Chapter 21: Class Metaprogramming. Metaprogramming in Python means creating or customizing classes at runtime. Python allows you to do this by creating classes with functions, inspecting or changing classes with class decorators, and using metaclasses to create whole new categories of classes. 
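A tiny sketch of the property idea from Chapter 19, again using my own example:

# A public attribute can later become a property without breaking callers.
class Account:
    def __init__(self, balance):
        self._balance = balance

    @property
    def balance(self):
        return self._balance

    @balance.setter
    def balance(self, value):
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value

acct = Account(100)
acct.balance = 50    # goes through the setter and its validation
print(acct.balance)  # 50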

You can access an Anki flashcard deck I created to accompany Fluent Python here.

Syncing your Jupyter Notebook Charts to Microsoft Word Reports

Here’s the situation: You’re doing a big data analysis in your Jupyter Notebook. You’ve got tons of charts and you want to report on them. Ideally, you’d create your final report in the Jupyter notebook itself, with all its fancy markdown features and the ability to keep your code and reporting all in the same place. But here’s the rub: most people still want Word document reports, and don’t care about your code, reproducibility, etc. When reporting it’s important to give people the information in a format most useful to them.

So you’ve got tons of charts and graphs that you want to put in the Word report – how do you keep the two in sync? What if your charts change slightly (e.g. changing the styling of every chart in your report)? You’re stuck copying and pasting charts from your notebook, which is a manual, time-consuming, and error prone process.

In this post, I’ll show you my solution to this problem. It involves the following steps:

  • Saving the chart images from Jupyter Notebook to your desktop in code.
  • Preparing your Word Document report, referencing the image names that you saved in your desktop in the appropriate location in your report.
  • Loading the images into a new version of your Word Document.

Saving Chart Images from Jupyter Notebook to your Desktop

The first step is to gather the images you want to load into your report by saving them from Jupyter Notebook to image files on your hard drive. For this post, I’ll be using the data and analysis produced in my “Lets Scrape A Blog” post from a few months ago, where I scraped my favourite blog Marginal Revolution and did some simple analyses.

That analysis used matplotlib to produce some simple charts of results. To save these images to your desktop, matplotlib provides a useful function called savefig. For example, one chart produced in that analysis looks at the number of blog posts by author:

The following code produces this chart and saves it to a file named ‘num_posts_by_author_plot.png’ in the folder ‘report_images’.

https://gist.github.com/marknagelberg/8bf83aedb5922a325cd673dec362871e
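The gist above contains the actual code from that analysis. The general pattern looks something like this; the DataFrame here is a stand-in, not the real scraped data.

# Sketch: plot post counts by author and save the figure for the report.
import os
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data standing in for the scraped Marginal Revolution posts.
df = pd.DataFrame({"author": ["Tyler Cowen"] * 5 + ["Alex Tabarrok"] * 2})

os.makedirs("report_images", exist_ok=True)

num_posts_by_author_plot = df["author"].value_counts().plot(kind="bar")
num_posts_by_author_plot.set_ylabel("Number of posts")

plt.tight_layout()
plt.savefig(os.path.join("report_images", "num_posts_by_author_plot.png"))
plt.close()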

A few pointers about this step:

  • Make sure you give your images useful, descriptive names. This helps ensure you place the proper reference in the Word document. I personally like following the convention of giving the plot image the same name as the plot object in your code.
  • Your images must have unique names or else the first image will be overwritten by the second image
  • To stay organized, store the images in a separate folder designed specifically for the purpose of holding your report images.

Repeating similar code for my other charts, I’m left with 5 chart images in my report_images folder:

Preparing the Word Document Report with Image References

There is a popular Microsoft Word document package for Python called python-docx, which is a great library for manipulating Word documents with your Python code. However, its API does not easily allow you to insert an image in the middle of a document.

What we really need is something like Jinja2 for Word docs: a package where you can specify special placeholder values in your document and then automatically load images in. Well, luckily, exactly such a package exists: python-docx-template. It is built on top of python-docx and Jinja2 and lets you use Jinja2-like syntax in your Word documents (see my other post on using Jinja2 templates to create PDF reports).

To get images into Word using python-docx-template, it’s simple: you just have to use the usual Jinja2 syntax {{ image_variable }} within your Word document. In my case, I had six images, and the Word template I put together for testing looked something like this:

For the system to work, you have to use variable names within {{ }} that align with the image names (before ‘.png’) and the plot variable names in your Jupyter notebook.

Loading the Images Into your Document

The final and most important step is to get all the images in your template. To do this, the code roughly follows the following steps: load your Word document template, load the images from your image directory as a dict of InlineImage objects, render the images in the Word document, and save the loaded-image version to a new filename.

Here is the code that does this:

https://gist.github.com/marknagelberg/8895f9f60a3f696b4e24169f8134d33f
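The gist holds the script I use; the core of it looks roughly like the following. The paths, the image width, and the assumption that image file names match the template variables are illustrative.

# Sketch: load every image in a folder into the Word template via python-docx-template.
import os
import sys
from docxtpl import DocxTemplate, InlineImage
from docx.shared import Mm

template_path, output_path, image_dir = sys.argv[1], sys.argv[2], sys.argv[3]

doc = DocxTemplate(template_path)

# Map each image name (minus '.png') to an InlineImage for the template context.
context = {
    os.path.splitext(name)[0]: InlineImage(doc, os.path.join(image_dir, name), width=Mm(120))
    for name in os.listdir(image_dir)
    if name.endswith(".png")
}

doc.render(context)
doc.save(output_path)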

To run the code, you need to specify the Word template document, the new Word document name which will contain the images, and the directory where your images are stored. 

python load_images.py <template Word Doc Filename> <image-loaded Word Doc filename> <image directory name>

In my case, I used the following command, which takes in the template template.docx, produced the image-loaded Word Document result.docx, and grabs the images from the folder report_images:

python load_images.py template.docx result.docx report_images

And voila, the image-loaded Word Document looks like this:

You can find the code I used to create this guide on Github here.

Building a Rare Event Reporting System in Python (Part 1 – Crunching Historical Data to Determine What is Unusual)

Here’s the situation: you have a dataset where each row represents some type of event and the time it occurred. You want to build a system that lets you know when unusual events occur in some time period (a common application). You also want your users to be able to explore the events in more detail when they get an alert and understand where the event sits in the history of the data collected so far.

After starting a task similar to this at work, I broke it down into the following parts (each of which will receive a separate blog post):

  1. Using historical data to understand what is “unusual” (Part 1 – this post): Crunching historical data to determine the thresholds that define an unusual event.
  2. Reporting “unusual” events taken from a live feed of your data (Part 2): Developing a system that monitors a live feed of your data, compares it to historical trends, and sends out an email alert when the number of incidents exceeds some threshold.
  3. Developing a front end for users to sign up for email alerts (Part 3): Developing a system for signing up with email to receive the updates and specify how often you want to see them (i.e. specifying how “unusual” an event you want to see).
  4. Creating interactive dashboards (Part 4): Developing dashboards complementing the alert system where users can look up current status of data and how it compares to historical data.

The Data: Obviously, this scenario applies to many, many datasets, but I have to pick something specific for this series. So, I’ll be using this dataset on air quality in the City of Winnipeg from Winnipeg’s open data portal. This data is perfect because it’s updated regularly (every 5 minutes), and there is historical data as well as a live feed API (you can find documentation on the API here). We’ll be using the historical data to figure out what’s “unusual” and the live feed as the basis for alerts, comparing its values against the historical record.

Data Exploration and Pre-Processing

After downloading our air quality dataset, we see it looks like this:

As you can see, there are different measurements in different rows, including Temperature, Humidity, and PM2.5 Particulates. Let’s filter our dataset to only contain information on PM2.5 Particulates (we only really care about air pollution in this application).

Looking at the values in MeasurementValue, we see it is measured in units of micrograms per cubic metre (ug/m3) and all values are positive integers. Plotting the histogram, you can see that almost all of the values are below 1,000, with a relatively small number of outliers around 5000.

Counting the number of Measurement Values and filtering for lower values reveals that the vast majority of the observations are less than 100.

Looking at the Measurement Values greater than 100 confirms that there are no observations between 500 and 5,000, with a sudden spike around 5,000.

I don’t have domain expertise related to air quality measurement or the nature of the sensors used to collect this data; however, the fact that there are no observations in this wide range suggests that these are measurement errors. Exporting the Measurement Values to a table shows that there is one observation at 673 ug/m3, and then the next highest observation suddenly jumps by almost an order of magnitude to 4892 ug/m3.

Given the clear cutoff from these outliers, we’ll set a threshold and define an observation as a measurement error if it is greater than 1000 ug/m3. So as a first step in our analysis, we’ll take out these outliers.

Once these have been filtered out, we’ll do a bit of aggregation across longer time periods. This helps further smooth out other one-off outliers that are likely measurement errors rather than a real change in the city’s overall air quality.

To do this, we put together a dataset that groups data into hourly chunks and takes the average particulate matter over that period. Plotting this out shows the average hourly particulate matter over the historical time period provided is between 0 and 120 ug/m3.

Determining Alert Thresholds

From this hourly dataset of average particulate levels, here’s the general approach we’ll take to determine what is “unusual”: we’ll define some percentile threshold where we want to be alerted, and call anything above this threshold “unusual”.

Python’s pandas package has a nice Series.quantile() function to help with this: you pass in the quantile you’re interested in and it returns the value in your data representing that quantile. For example, if you want the value representing the 95th percentile of our hourly average air quality data (hrly_avg_aqp), you would write something like this:
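A minimal version of that call, assuming hrly_avg_aqp is a pandas Series of the hourly averages:

# quantile() takes a fraction between 0 and 1, so 0.95 is the 95th percentile.
hrly_avg_aqp.quantile(0.95)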

The 95th percentile in the dataset is 32.47 ug/m3.

The next logical question is: what percentile threshold should we use to send alerts?

One way to think about this is how often you would like, or expect, to see emails about these events. Being alerted about 99th percentile events seems like it might be rare enough. However, remember that we are collecting data by the hour, which means that with a 99th percentile criterion you would receive an email in about 1 out of every 100 hours, or about once every four days. For our application, this seems like far too many alerts: we want to hear about the rare air quality events (e.g. once-a-year events).

As a solution, we’ll leave it to the user to specify how rare an event they want to be alerted about: once-in-two-years events, once-a-year events, six-times-a-year events, or once-a-month events. Since we’re looking at hourly data, if a user specifies the number of alerts they want to receive per year (n_alerts_per_year), the calculation of the appropriate percentile threshold is:

percentile_threshold = [1 - n_alerts_per_year / (total observations in a year)] * 100 = [1 - n_alerts_per_year / (365 * 24)] * 100

So now we have all the fundamental pieces of data we need: when the user specifies n_alerts_per_year, we calculate the percentile threshold using the calculation above, and then finally we can calculate the threshold value of air quality that should set off an alert for that user with:

hrly_avg_aqp.quantile(percentile_threshold)
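Putting the pieces together, a sketch of the full calculation might look like this. Note that pandas’ quantile() expects a fraction between 0 and 1, so the percentile is divided by 100 before the call.

# Sketch: from desired alerts per year to an air quality alert threshold.
def alert_threshold(hrly_avg_aqp, n_alerts_per_year):
    obs_per_year = 365 * 24  # hourly observations
    percentile_threshold = (1 - n_alerts_per_year / obs_per_year) * 100
    return hrly_avg_aqp.quantile(percentile_threshold / 100)

# e.g. for a user who wants roughly one alert per year:
# threshold = alert_threshold(hrly_avg_aqp, n_alerts_per_year=1)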

Here’s an overview of our data pre-processing steps for our unusual event application.

What’s Next?

Now that we have the basic data in place for understanding what levels of air quality are “unusual”, in Part 2 of the series we’ll write the code that ingests the live JSON feed of air quality readings, compares the number of incidents against the historical thresholds developed here, and sends out an email alert whenever the readings exceed a user’s percentile threshold.

Getting Started with Airflow Using Docker

Lately I’ve been reading intensively on data engineering after being inspired by this great article by Robert Chang providing an introduction to the field. The underlying message of the article really resonated with me: when most people think of data science, they immediately think about the stuff being done by very mature tech companies like Google or Twitter, like deploying uber-sophisticated machine learning models all the time.

However, many organizations are not at the stage where these kinds of models make sense as a top priority. This is because, to build and deploy these kinds of models efficiently and effectively, you need foundational data infrastructure in place that you can build the models on. Yes, you can develop a machine learning model with the data you have in your organization, but you have to ask: how long did it take you to do it, is your work repeatable / automatable, and are you able to deploy or actually use your solution in a meaningful and reliable way? This is where data engineering comes in: it’s all about building the data warehouses and ETL (extract-transform-load) pipelines that provide the fundamental plumbing required to do everything else.

One tool that keeps coming up in my research on data engineering is Apache Airflow, which is “a platform to programmatically author, schedule and monitor workflows”. Essentially, Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a particular order, and monitor / manage all of your tasks. It’s becoming very popular among data engineers and data scientists as a great tool for orchestrating ETL pipelines and monitoring them as they run.

In this post, I’ll give a really brief overview of some key concepts in Airflow and then show a step-by-step deployment of Airflow in a Docker container.

Key Airflow Concepts

Before we get into deploying Airflow, there are a few basic concepts to introduce. See this page in the Airflow docs, which goes through these in greater detail and describes additional concepts as well.

Directed Acyclic Graph (DAG): A DAG is a collection of the tasks you want to run, along with the relationships and dependencies between the tasks. DAGs can be expressed visually as a graph with nodes and edges, where the nodes represent tasks and the edges represent dependencies between tasks (i.e. the order in which the tasks must run). Essentially, DAGs represent the workflow that you want to orchestrate and monitor in Airflow. They are “acyclic”, which means that the graph has no cycles – in English, this means your workflows must have a beginning and an end (if there were a cycle, the workflow would be stuck in an infinite loop).

Operators: Operators represent what is actually done in the tasks that compose a DAG workflow. Specifically, an operator represents a single task in a DAG. Airflow provides a lot of pre-defined classes with tons of flexibility about what you can run as tasks. This includes classes for very common tasks, like BashOperator, PythonOperator, EmailOperator, OracleOperator, etc. On top of the multitude of operator classes available, Airflow provides the ability to define your own operators. As a result, a task in your DAG can do almost anything you want, and you can schedule and monitor it using Airflow.

Tasks: A running instance of an operator. During instantiation, you define specific parameters for the operator, and the parameterized task becomes a node in a DAG.
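To make these concepts concrete, here is a minimal sketch of a DAG definition file. It uses the Airflow 1.x import paths that match the Docker image used below, and it is an illustrative example rather than the Helloworld DAG referenced later.

# Minimal DAG: two tasks that run daily, with task_2 depending on task_1.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

task_1 = BashOperator(task_id="task_1", bash_command="echo 'hello'", dag=dag)
task_2 = BashOperator(task_id="task_2", bash_command="echo 'world'", dag=dag)

task_1 >> task_2  # task_2 runs only after task_1 succeeds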

Deploying Airflow with Docker and Running your First DAG

The rest of this post focuses on deploying Airflow with Docker, and it assumes you are somewhat familiar with Docker or have read my previous article on getting started with Docker.

As a first step, you obviously need to have Docker installed and have a Docker Hub account. Once you do that, go to Docker Hub and search “Airflow” in the list of repositories, which produces a bunch of results. We’ll be using the second one: puckel/docker-airflow which has over 1 million pulls and almost 100 stars. You can find the documentation for this repo here. You can find the github repo associated with this container here.

So, all you have to do to get this pre-made container running Apache Airflow is type:

docker pull puckel/docker-airflow

And after a few short moments, you have a Docker image installed for running Airflow in a Docker container. You can see your image was downloaded by typing:

docker images

Now that you have the image downloaded, you can create a running container with the following command:

docker run -d -p 8080:8080 puckel/docker-airflow webserver

Once you do that, Airflow is running on your machine, and you can visit the UI by visiting http://localhost:8080/admin/

On the command line, you can find the container name by running:

docker ps

You can jump into your running container’s command line using the command:

docker exec -ti <container name> bash

So in my case, my container was automatically named competent_vaughan by docker, so I ran the following to get into my container’s command line:
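docker exec -ti competent_vaughan bash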

Running a DAG

So your container is up and running. Now, how do we start defining DAGs?

In Airflow, DAG definition files are Python scripts (“configuration as code” is one of the advantages of Airflow). You create a DAG by writing the definition script and simply adding it to the ‘dags’ folder within the $AIRFLOW_HOME directory. In our case, the directory we need to add DAGs to in the container is:

/usr/local/airflow/dags

The thing is, you don’t want to jump into your container and add the DAG definition files directly in there. One reason is that the minimal version of Linux installed in the container doesn’t even have a text editor. But a more important reason is that jumping into containers and editing them is considered bad practice and “hacky” in Docker, because you can no longer build the image your container runs on from your Dockerfile alone.

Instead, one solution is to use “volumes”, which allow you to share a directory between your local machine and the Docker container. Anything you add to that directory on your local machine will appear in the directory you connect it to in the container. In our case, we’ll create a volume that maps the directory on our local machine where we’ll hold DAG definitions to the location where Airflow reads them in the container, with the following command:

docker run -d -p 8080:8080 -v /path/to/dags/on/your/local/machine/:/usr/local/airflow/dags  puckel/docker-airflow webserver

The DAG we’ll add can be found in this repo created by Manasi Dalvi. The DAG is called Helloworld and you can find the DAG definition file here. (Also see this YouTube video where she provides an introduction to Airflow and shows this DAG in action.)

To add it to Airflow, copy Helloworld.py to /path/to/dags/on/your/local/machine. After waiting a couple of minutes, refresh your Airflow GUI and voila, you should see the new DAG Helloworld:

You can test individual tasks in your DAG by entering the container and running the airflow test command. First, enter your container using the docker exec command described earlier. Once you’re in, you can see all of your DAGs by running airflow list_dags. Below you can see the result, with our Helloworld DAG at the top of the list:

One useful command you can run on the command line before you run your full DAG is the airflow test command, which allows you to test individual tasks in your DAG and logs the output to the command line. You specify a date / time and it simulates the run at that time. The command doesn’t bother with dependencies and doesn’t communicate state (running, success, failed, …) to the database, so you won’t see the results of the test in the Airflow GUI. So, with our Helloworld DAG, you could run a test on task_1:

airflow test Helloworld task_1 2015-06-01

Note that when I do this, it appears to run without error; however, I’m not getting any logs output to the console. If anyone has any suggestions about why this may be the case, let me know. 

You can also run the backfill command, specifying a start date and an end date, to run the Helloworld DAG for those dates. In the example below, I run the DAG 7 times, once per day from June 1 to June 7, 2015, with a command along these lines:
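airflow backfill -s 2015-06-01 -e 2015-06-07 Helloworld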

When you run this, you can see the following in the Airflow GUI, which shows the success of the individual tasks and each of the runs of the DAG.

Resources

Deploying and Maintaining a Web App Part 3: Adding Tests with Pytest

Testing is an important part of development, including the development of web applications. Among other benefits, tests increase your confidence that your web application is doing what you expect, and they provide a basis for preventing bugs in an automated way when you make changes to your code.

Since testing is such an integral part of web application maintenance and deployment, in this Part 3 of our app-deploy project we’ll put in some basic tests to see how they can be implemented in Flask applications, and eventually how tests fit into the deployment workflow. We’ll be using the pytest Python library as our testing framework.

To follow along, you can clone the repository:

git clone https://github.com/marknagelberg/app-deploy.git

And then go to the appropriate point in the code, which includes all of the changes made in this post, with the following command:

git checkout ed2ed02f8b3db

Finally, create the python environment with:

conda env create -f environment.yml

First, a Slight Upgrade to our App

Right now, when a user enters information into the app’s simple web form, the data is simply added to the database and nothing changes from the user’s perspective. To make testing a bit more interesting, we first add a little bit of additional functionality: our app will now print out a list of all the existing names in the database on the main page.

First we add a few lines to app/app.py so that it queries the database to get all names, and then sends the list of names to the template.

https://gist.github.com/marknagelberg/f4cb543ca596d383362ef9383b03ee6a
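
If you’d rather not click through to the gist, here is a minimal, self-contained sketch of the idea; the model, field, and template names are assumptions for illustration rather than the exact code in the repo:

from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///names.db'
db = SQLAlchemy(app)

class Name(db.Model):
    id = db.Column(db.Integer, primary_key=True)       # primary key
    name = db.Column(db.String(64))                     # the name entered in the form

@app.route('/')
def index():
    names = Name.query.all()                            # fetch every stored Name record
    return render_template('index.html', names=names)   # hand the list to the template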

Then, we update the template to print out these names.

https://gist.github.com/marknagelberg/5b2237cc9157bdbe84b9b3283d0ba133

Now, when you enter new names, they are listed out for you, like so:

Pytest Installation and Set-up

To start off, let’s activate our app’s conda environment and install pytest:

source activate app-deploy
conda install pytest

Our tests will be stored in *.py files in a top-level folder that we’ll call `tests`. These files must be named either test_*.py or *_test.py: this is how you tell pytest that these files contain tests (in other words, this is how pytest does “test discovery”).

As a reminder, your top level directory should look something like this:

So, let’s add a file called test_app.py in /test.

touch test/test_app.py

To run your tests, you simply enter ‘pytest’ on the command line. No tests have been added to test_app.py yet, so when you run ‘pytest’ now, the result should look like this:


A couple of other small preliminaries to get testing set up:

  • Set the WTF_CSRF_ENABLED configuration variable to False in the testing configuration found in config.py. This is required for the tests to run properly. Keep in mind that you normally want this set to True when you’re running your application in production, since it protects your forms against Cross Site Request Forgery (CSRF) attacks.
  • Add __init__.py in your /tests directory.

Adding Tests

The great thing about pytest is that it allows you to write tests in a very concise, pythonic way using assert statements. Suppose you have a file that defines a function to compute the square of a number, and you want to test it. Adding a couple of tests is as easy as this:

https://gist.github.com/marknagelberg/aa99d1be3dca7149c12f0be984a0542a
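
As a sketch of what that looks like (the function and test names here are illustrative, not necessarily those in the gist):

# test_square.py
def square(x):
    return x * x

def test_square_positive():
    assert square(3) == 9      # passes

def test_square_negative():
    assert square(-2) == 2     # deliberately fails: square(-2) is actually 4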

Then, you run your tests by typing `pytest` on the command line from the app’s main directory. This returns the following result (one test passes, one test fails):


Another great feature of pytest is the ability to define “fixtures”, which provide tests with the pre-initialized objects they need to run. In our case, a great use of fixtures is initializing the test database or the test application instance.

Pytest fixtures have some advantages over the usual setup and teardown functions used in other testing frameworks. One is the ability to run fixtures with different “scopes”, which let you control when the setup / teardown work happens so you can maximize test efficiency. The scope options are function, class, module, and session. So, for example, a fixture with function scope is created once for each test function you define.

So, say all your tests need access to the Flask test client application instance. With module scope, the application instance is created only once, at the start of the test module, rather than being created and destroyed once for every test function.
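
For illustration, here is a minimal sketch of how a scope is declared on a fixture (the resource here is just a stand-in object):

import pytest

@pytest.fixture(scope="module")
def expensive_resource():
    resource = {"ready": True}   # imagine an app instance or database connection
    yield resource               # every test in the module shares this one object
    resource.clear()             # teardown runs once, after the module's tests finish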

For our small app, we’ll create 3 fixtures:

  • A database record in our Name table
  • The Flask application instance
  • The test database

The following code creates a fixture for a single database record in our Name table, making this object available to all our tests.

https://gist.github.com/marknagelberg/c7d5c71bd8e35c452939ba2192910aec

As you can see, you register fixtures using the pytest.fixture decorator. The decorated function returns the fixture object that you want your tests to access. You access this fixture object in your tests by supplying it as an argument to the test function. The code below shows a test we can add to test_app.py that accesses the new_name fixture and ensures it has the value we expect.

https://gist.github.com/marknagelberg/9fa2ff9679370cb0a088864ace29df89
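
Put together, the pattern looks roughly like this; the Name model, its name field, and the import path are assumptions based on this post rather than the exact gist contents:

import pytest
from app.models import Name    # assumed location of the Name model

@pytest.fixture(scope="module")
def new_name():
    return Name(name="Mark")   # the fixture object handed to tests

def test_new_name(new_name):
    # pytest injects the fixture simply because the argument name matches
    assert new_name.name == "Mark"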

We can take advantage of the work we did in Part 2 of the series using our application factory pattern to easily create a fixture for our application instance with our testing configuration options. The code below does this by creating an application instance with the testing configuration and returning its test client (the test client provides a simple interface to the application where we can send requests and track cookies).

https://gist.github.com/marknagelberg/ecd07825d53088492a29f248482e6a5e

Note that the part after the yield statement represents the “teardown” part of the test: this is where the application instance is removed when the tests are done running.
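
A rough sketch of that fixture, assuming the factory function from Part 2 is called create_app and accepts a 'testing' configuration name:

import pytest
from app import create_app     # assumed application factory from Part 2

@pytest.fixture(scope="module")
def test_client():
    app = create_app('testing')     # build the app with the testing config
    client = app.test_client()      # Flask's built-in test client
    ctx = app.app_context()
    ctx.push()                      # make the app context available to tests
    yield client                    # tests run here
    ctx.pop()                       # teardown: remove the app context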

We also need to create a similar fixture for our database. The following code creates the database tables with create_all(), adds one Name record (“Mark”), and then cleans up the database after the tests are complete.

https://gist.github.com/marknagelberg/1ca20b79e6bdf0931d9be3b869735958
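
A sketch of that fixture might look like the following; the db object and Name model imports are assumptions, and it relies on an application context being active (for example, pushed by the application fixture above):

import pytest
from app import db             # assumed SQLAlchemy instance
from app.models import Name    # assumed Name model

@pytest.fixture(scope="module")
def init_database():
    db.create_all()                       # build the tables
    db.session.add(Name(name="Mark"))     # seed a single record
    db.session.commit()
    yield db                              # tests run here
    db.drop_all()                         # teardown: remove the test tables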


Finally, we add a few more tests that use all of our new fixtures, including tests to make sure our application instance responds and that the HTML output contains the data we expect.

https://gist.github.com/marknagelberg/a1ec5be4825d4498bc7c3cdefb9d919f
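
The kinds of checks these tests make boil down to something like this (fixture and test names follow the sketches above and are assumptions):

def test_home_page(test_client, init_database):
    response = test_client.get('/')         # request the main page
    assert response.status_code == 200      # the application instance responds
    assert b'Mark' in response.data         # the seeded name appears in the HTML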

(Note that “Mark” appears on the main page since this record was added to the database as part of the fixture.)

Now, in your main app directory, you can run your tests by simply typing “pytest” in the command line. The result for our code here should look like this:

The three dots ‘…’ indicate that there were three tests and each passed (if a test fails, the corresponding dot appears as an F instead).

Aside: during the creation of this post, I was puzzled to see that my code was running in the Flask production environment, despite my configuration and environment variables clearly indicating that I was in development. It turns out there is another environment variable that needs to be set, FLASK_ENV, which defaults to “production”, turning off debug mode and producing a warning when you run your application on the Flask development server. To fix this, run `export FLASK_ENV=development` on the command line. Note that this must be set as an environment variable: adding it in config.py doesn’t work. I’ve updated this information in Part 2. For further information, see the Flask documentation here.

Resources

https://www.patricksoftwareblog.com/testing-a-flask-application-using-pytest/

https://piotr.banaszkiewicz.org/blog/2014/02/22/how-to-bite-flask-sqlalchemy-and-pytest-all-at-once/

Digging into Data Science Tools: Docker

Docker is a tool for creating and managing “containers” which are like little virtual machines where you can run your code. A Docker container is like a little Linux OS, preinstalled with everything you need to run your web app, machine learning model, script, or any other code you write.

Docker containers are like a really lightweight version of virtual machines. They use far fewer computer resources than a virtual machine and can spin up in seconds rather than minutes. (The reason for this performance improvement is that Docker containers share the kernel of the host machine, whereas each virtual machine runs a separate OS with its own kernel.)

Aly Sivji provides a great comparison of Docker containers to shipping containers. Shipping containers improved the efficiency of logistics by standardizing the design: they all operate the same way, and we have standardized infrastructure for dealing with them. As a result, you can ship them regardless of transportation type (truck, train, or boat) and logistics company (all are aware of shipping containers and build to their standards). In a similar way, Docker provides a standardized software container which you can pass into different environments and be confident it will run as you expect.

Brief Overview of How Docker Works

To give you a really high-level overview of how Docker works, first let’s define three big Docker-related terms – “Dockerfile”, “Image”, and “Container”:

  • Dockerfile: A text file you write to build the Docker “image” that you need (see the definition of image below). You can think of the Dockerfile as a wrapper around the Linux command line: the commands that you would use to set up a Linux system on the command line have equivalents which you can place in a Dockerfile. “Building” the Dockerfile produces an image that represents a Linux machine in the exact state you need. You can learn all about the ins and outs of the syntax and commands at the Dockerfile reference page. To get an idea of what Dockerfiles look like, here is one that creates an image based on the Ubuntu 15.04 Linux distribution, copies all the files from your application into /app in the image, runs the make command on /app within the image’s Linux command line, and then finally runs the Python file defined in /app/app.py:
FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py
  • Image: A “snapshot” of the environment that you want your containers to run. An image includes everything you need to run your code, such as code dependencies (e.g. a Python venv or conda environment) and system dependencies (e.g. a server or database). You “build” images from Dockerfiles, which define everything the image should include. You then use these images to create containers.
  • Container: An “instance” of the image, similar to how objects are instances of classes in object-oriented programming. You create (or “run”, in Docker language) containers from images. You can think of a container as a running “virtual machine” defined by your image.

To sum up these three main concepts: you write a Dockerfile to “build” the image that you need, which represents a snapshot of your system at a point in time. From that image, you can then “run” one or more containers.
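
In command form, that workflow boils down to two steps (the image name here is arbitrary):

docker build -t my-image .    # build an image from the Dockerfile in the current directory
docker run my-image           # run a container from that image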

Here are a few other useful terms to know:

  • Volume: “Shared folders” that let a Docker container see a folder on your host machine (very useful for development, so your container is automatically updated with your code changes). Volumes also allow one Docker container to see data in another container. Volumes can be “persistent” (the volume continues to exist after the container is stopped) or “ephemeral” (the volume disappears as soon as the container is stopped).
  • Container Orchestration: When you first start using Docker, you’ll probably just spin up one container at a time. However, you’ll soon find that you want multiple containers, each running a different image with a different configuration. For example, a common use of Docker is deploying applications as “microservices”, where each Docker container represents an individual microservice that interacts with your other microservices to deliver your application. Since it can get very unwieldy to manage multiple containers manually, there are “container orchestration” tools that automate tasks such as starting up all your containers, automatically restarting failing containers, connecting containers together so they can see each other, and distributing containers across multiple computers. Examples of tools in this space include docker-compose and Kubernetes.
  • Docker Daemon / Docker Client: The Docker Daemon must be running on the machine where you want to run containers (this could be your local machine or a remote machine). The Docker Client is the front-end command-line interface used to interact with Docker: it connects to the Docker Daemon and tells it what to do. It’s through the Docker Client that you run commands to build images from Dockerfiles, create containers from images, and do other Docker-related tasks.

Why is Docker useful to Data Scientists?

You might be thinking “Oh god, another tool for me to learn on top of the millions of other things I have to keep on top of? Is it worth my time to learn it? Will this technology even exist in a couple of years?”

I think the answer is, yes, this is definitely a worthwhile tool for you to add to your data science toolbox.

To help illustrate, here is a list of reasons for using Docker as a data scientist, many of which are discussed in Michaelangelo D’Agostino’s “Docker for Data Scientists” talk as well as this Lynda course from Arthur Ulfeldt:

  • Creating 100% Reproducible Data Analysis: Reproducibility is increasingly recognized as critical for both methodological and legal reasons. When you’re doing analysis, you want others to be able to verify your work. Jupyter notebooks and Python virtual environments are a big help, but you’re out of luck if you have critical system dependencies. Docker ensures you’re running your code in exactly the same way every time, with the same OS and system libraries.
  • Documentation: As mentioned above, the basis for building docker containers is a “Dockerfile”, which is a line by line description of all the stuff that needs to exist in your image / container. Reading this file gives you (and anyone else that needs to deploy your code) a great understanding about what exactly is running on the container.
  • Isolation: Using Docker helps ensure that your tools don’t conflict with one another. By running them in separate containers, you’ll know that you can run Python 2, Python 3, and R and these pieces of software will not interfere with each other.
  • Gain DevOps powers: in the words of Michaelangelo D’Agostino, “Docker Democratizes DevOps”, since it opens up opportunities that used to be available only to systems / DevOps experts:
    • Docker allows you to more easily “sidestep” DevOps / system administration if you aren’t interested, since someone can create a container for you and all you have to do is run it. Similarly, if you like working with Docker, you can create a container for less technically savvy coworkers that lets them run things easily in the environment they need.
    • Docker provides the ability to build docker containers starting from existing containers. You can find many of these on DockerHub, which holds thousands of pre-built Dockerfiles and images. So if you’re running a well-known application (or even obscure applications), there is often a Dockerfile already available that can give you a tremendous running start to deploy your project. This includes “official” Docker repositories for many tools, such as ubuntu, postgres, nginx, wordpress, python, and much more.
    • Using Docker helps you work with your IT / DevOps colleagues, since you can do your Data Science work in a container, and simply pass it over to DevOps as a black box that they can run without having to know everything about your model.

Here are a few examples of data-science-relevant applications you might try out with Docker:

  • Create an ultra-portable, custom development workflow: Build a personal development environment in a Dockerfile, so you can access your workflow immediately on any machine with Docker installed. Simply load up the image wherever you are, on whatever machine you’re on, and your entire work environment is there: everything you need to do your job, and how you want to do your job.
  • Create development, testing, staging, and production environments: Rest assured that your code will run as you expect, and create staging environments identical to production so you know that when you push to production, you’re going to be OK.
  • Reproduce your Jupyter notebook on any machine: Create a container that runs everything you need for your Jupyter Notebook data analysis, so you can pass it along to other researchers / colleagues and know that it will run on their machine. As great as Jupyter Notebooks are for doing analysis, they tend to suffer from the “it works on my machine” problem, and Docker can solve it (see the example command just after this list).
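
As a concrete illustration of that last point, the Jupyter project publishes prebuilt images, so something like the following starts a ready-made notebook server (image name shown for illustration):

docker run -p 8888:8888 jupyter/scipy-notebook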

For more inspiration, check out Civis Analytics’ Michaelangelo D’Agostino describing the Docker containers they use (starting at the 18:08 mark). These include containers specialized for survey processing, R Shiny apps and other dashboards, Bayesian time series modeling and poll aggregation, as well as general-purpose R/Python containers with all the common packages staff need.

Further Resources

If you’re serious about starting to use Docker, I highly recommend the Lynda course Learning Docker by Arthur Ulfeldt as a starting point. It’s well explained and concise (only about 3 hours of video in total). I created a set of Anki flashcards from this course, which you can access here. I also recommend the book Docker Deep Dive by Nigel Poulton; I created Anki flashcards from that book as well, which you can access here.

Here are a few other useful resources you might want to check out: