Digging into Data Science Tools: Anaconda

In the Digging into Data Science Series, I dive into specific tools and technologies used in data science and provide a list of resources you can find to learn more yourself. I keep posts in this series updated on a regular basis as I learn more about the technology or find new resources. Post Last updated on May 28, 2018

If you’re familiar at all with Data Science, you have probably heard of Anaconda. Anaconda is a distribution of Python (and R) targeted towards people doing data science. It’s totally free, open-source, and runs on Windows, Linux, and Mac.

Up until recently, I basically treated my installation of Anaconda as a vanilla python install and was totally ignorant about the unique features and benefits it provides (big mistake). I decided to finally do some research into what Anaconda does and how to get the most out of it.

What is Anaconda and why should you use it?

Anaconda is much more than simply a distribution of Python. It provides two main features to make your life way easier as a data scientist: 1) pre-installed packages and 2) a package and environment manager called Conda.

Pre-installed packages

One of the main selling points of Anaconda is that is comes with over 250 popular data science packages pre-installed. This includes popular packages such as NumPy, SciPy, Pandas, Bokeh, Matplotlib, among many others. So, instead of installing python and running pip install a bunch of times for each of the packages you need, you can just install Anaconda and be fairly confident that  most what you’ll need for your project will be there.

Anaconda has also created an “R Essentials” bundle of packages for the R language as well, which is another reason to use Anaconda if you are an R programmer or expect to have to do some development in R.

In addition to the packages, it comes with other useful data science tools preinstalled:

  • iPython / Jupyter notebooks: I’ll be doing a separate “Digging into Data Science Tools” post on this later, but Jupyter notebooks are an incredibly useful tool for exploring data in Python and sharing explorations with others (including non-Python programmers). You work in the notebook right in your web browser and can add python code, markdown, and inline images. When you’re ready to share, you can save your results to PDF or HTML.
  • Spyder: Spyder is a powerful IDE for python. Although I personally don’t use it, I’ve heard it’s quite good, particularly for programmers who are used to working with tools like RStudio.

In addition to preventing you from having to do pip install a million times, having all this stuff pre-installed is also super useful you’re teaching a class or workshop where everyone needs to have the same environment. You can just have them all install Anaconda, and you’ll know exactly what tools they have available (no need to worry about what OS they’re running, no need to troubleshoot issues for each person’s particular system).

Package and environment management with Conda

Anaconda comes with an amazing open source package manager and environment manager called Conda.

Conda as a package manager

A package manager is a tool that helps you find packages, install packages, and manage dependencies across packages (i.e. packages often require certain other packages to be installed, and a package manager handles all this messiness for you).

Probably the most popular package manager for python is it’s built-in tool called pip. So, why would you want to use Conda over pip?

  • Conda is really good at making sure you have the proper dependencies installed for data science packages. Researching for this blog post, I came across many stories of people having a horrible time installing important and widely used packages with pip (especially on Windows). The fundamental issue seems to be that many scientific packages in Python have external dependencies on libraries in other languages like C and pip does not always handle this well. Since Conda is a general-purpose package management system (i.e. not just a python package management system), it can easily install python packages that have external dependencies written in other languages (e.g. NumPy, SciPy, Matplotlib).
  • There are many open source packages available to install through Conda which are not available through pip.
  • You can use Conda to install and manage different versions of python (python itself is treated as just another package).
  • You can still use pip while using Conda if you cannot find a package through Conda.

Together, Conda’s package management and along with the pre-installed packages makes Anaconda particularly attractive to newcomers to python, since it significantly lowers the amount of pain and struggle required to get stuff running (especially on Windows).

Conda as an environment manager

An “environment” is basically just a collection of packages along with the version of python you’re using. An environment manager is a tool that sets up particular environments you need for particular applications. Managing your environment avoids many headaches.

For example, suppose you have an application that’s working perfectly. Then, you update your packages and it no longer works because of some “breaking change” to one of the packages. With an environment manager, you set up your environment with particular versions of the packages and ensure that the packages are compatible with the application.

Environments also have huge benefits when sending your application to someone else to run (they are able to run the program with the same environment on their system) and deploying applications to production (the environment on your local development machine has to be the same as the production server running the application).

If you’re in the python world, you’re probably familiar with the built-in environment manager virtualenv. So why use Conda for environment management over virtualenv?

  • Conda environments can manage different versions of python. In contrast, virtualenv must be associated with the specific version of python you’re running.
  • Conda still gives you access to pip and pip packages are still tracked in Conda environments
  • As mentioned earlier, Conda is better than pip at handling external dependencies of scientific computing packages.

For a great introduction on managing python environments and packages using Conda, see this awesome blog post by Gergely Szerovay. It explains why you need environments and basics of how to manage them in Conda. Environments can be a somewhat confusing topic, and like a lot of things in programming, there are some up-front costs in learning how to use them, but they will ultimately save tons of time and prevent many headaches.

Bonus: no admin privileges required

In Anaconda, installations and updates of packages are independent of system library or administrative privileges. For people working on their personal laptop, this may not seem like a big deal, but if you are working on a company machine where you don’t have access to admin privileges, this is crucial. Imagine having to run to IT whenever you wanted to install a new python package – It would be totally miserable and it’s not a problem to be underestimated.

Further resources / sources

For access to my shared Anki deck and Roam Research notes knowledge base as well as regular updates on tips and ideas about spaced repetition and improving your learning productivity, join "Download Mark's Brain".

One thought on “Digging into Data Science Tools: Anaconda”

Leave a Reply

Your email address will not be published. Required fields are marked *