Webops for Python, part 2: the how-to

In part 1 of this 2-part series we used a comic strip to depict Python programmers and web operations folk working together to figure out how to deploy some scientific computing to an e-commerce site. Joking aside, let's describe exactly what we were trying to accomplish, and how we did it.

Our goal was to get our Python web services into continuous deployment to our production servers, in a manner as close as possible to what we do with our other code. Our systems team had set a high bar in that area. To be more specific, we wanted to:

package some opensource Python modules, including some that come with wrapped C and C++ code
build an application that imports those modules
configure servers, FreeBSD and Linux, without build tools installed, to run the application
deploy the app.

Modules, build, configuration, deployment. We think about these things separately, and that helps us operate effectively at a large scale. But there is a danger in mapping out such a multi-part process: it might become unwieldy. Day in and day out, I have a more immediate goal than operating at a large scale: I need to keep my development teams productive. 'Productive', in our group, means 'able to evaluate new ideas and tools quickly'. So I want a frictionless process for developers, but I don't want them working in a way so different from what we do in production that they will have to do a major redesign, when it's time to go from the lab to the real world.

In order to have a workable system, I needed to do three things, which have led me to the three recommendations/how-tos in this article:

Install appropriate versions of Python across systems from dev to prod.
Adopt a coherent way of packaging and using Python modules.
Develop a Python-specific module of our push tool.

Configuration management is also an important component of the total system, but for that we are using an off-the-shelf tool, Puppet, and there's nothing particular to Python about the way we use it, so I will not be describing that in any detail.

Throughout this article I will be referring both to Puppet and to our homegrown Wayfair push tool, which we use for code deployment. In smaller-scale operations I use Fabric for both of these purposes. Fabric is written in Python, and it's a great place to start for this kind of thing. If you can already use ssh and install a Python module, Fabric takes about 5 minutes to learn. If your infrastructure grows to the point where Fabric doesn't work well any more, congratulations! For configuration management you can choose from Puppet, Chef, and other tools, and there are deployment-tool frameworks out there as well.

Choosing a version of Python and installing it

I believe these instructions would work pretty well for all 2.x and maybe 3.x Pythons, but it seems like a no-brainer to me to settle on Python 2.7. It is in the sweet spot right now for availability of quality opensource libraries and mature testing components. 2.6 is relatively deficient in the latter, and the good stuff does not all work in 3.x.

I don't believe there's any need for a how-to for installing Python 2.7 on Debian, Ubuntu, MacOS, FreeBSD, and most other unices. However, if you're using a Linux that packages software with rpm/yum, such as any Red Hat derivative, and the system Python is older than 2.7, you have a problem, because you cannot upgrade the system Python without wrecking yum. In that case, I recommend this srpm for CentOS 5 by Nathan Milford, which does not interfere with the system Python or yum. We did this on our CentOS boxes, and we now have the Python we want at /usr/local/bin/python, happily coexisting with the /usr/bin/python that yum needs. Only very slight modifications to his .spec file were needed to get it to work on CentOS/RHEL 6.x. Here's the patch file:

72a73,74
> Provides: python-abi = %{libvers}
> Provides: python(abi) = %{libvers}

When we're ready to put a new version of Python into production, we definitely want an OS-level package for it. On the other hand, *after* Python is installed, although we *could* build setuptools the way he suggests, as an rpm, we have chosen to build and deploy a setuptools egg, just like all the other modules. Make sure you use the right Python to do that, if you have more than one installed! The setuptools guys, realizing they would otherwise have a chicken-before-the-egg problem, make an eggsecutable egg, so you can just run it from the shell.

I strongly prefer to build and install everything besides Python itself as an egg, not an OS-level package: setuptools, virtualenv, pip, the MySQL driver, everything. The reason is simple. I and some of my developers need to tinker with Python compilation options and point-release versions, and experiment with major version upgrades. We want to be able to compile a new Python on a development machine, try out a new feature, run a test suite on our existing code to see how much of a big deal an upgrade is going to be, etc. While we're doing that, we do not want to get bogged down in debs, rpms, or pkgs. And if we had to build those types of packages for every module we're using, it would take us much longer than it does with eggs. I don't even want to build a Python rpm for that kind of experiment, let alone build rpms for a dozen modules and install them as root. I can avoid all of that with the procedures outlined below.

Python module packaging

Python module packaging is a hodgepodge, a dog's breakfast, an embarrassment. Let's give a brief history:

At first, no uniform packaging at all
Python 2: 'distutils'.
Setuptools, building on distutils and providing the 'easy_install' command, which works with either source or pre-built binary/zip (egg-file) distributions. But there is no 'uninstall' command. "They must be joking!" They are not.
Pip, another distutils derivative, and a 'replacement' for setuptools, in many ways superior to it, with 'uninstall' and version checking. But it can't handle egg files. "They must be joking!" They are not.
Distribute, a fork of setuptools, allegedly handling the Python 2-3 transition better. This looks very promising, but I'm not quite sure where they're headed with their roadmap. Distribute guys: please make sure we can always install from egg files, even non-eggsecutable ones. Not everybody wants gcc on production servers! And I hope when you say 'easy_install is going to be deprecated ! use Pip !', you don't mean you're losing egg support.

This list does not even address the question of how to roll C and C++ libraries into a Python module. There are a *lot* of ways to do that. We'll gloss over that a bit and say that Cython is currently our favorite way, and we want to start a movement of winning hearts and minds over to using it for all libraries. If the hearts-and-minds approach doesn't work, I'm thinking cajoling, bribery, public shaming, any means necessary. But that doesn't really matter for the purpose under discussion. We can use Python-C(++) hybrids whether or not they are built with Cython.

Here is our process for downloading a 3rd-party module from PyPI or elsewhere, and incorporating it into our build and deployment process. For most libraries, it takes about 2 minutes. Sometimes there are dependencies to sort out, but even for the hairiest science projects, we're usually done in a short time.

First, download something, e.g. python-dateutil-2.1.tar.gz, from PyPI or wherever its publicly available source code lives. Then do:
tar zxvf python-dateutil-2.1.tar.gz
cd python-dateutil-2.1
ls setup.py
echo $?

If that shows '0' (i.e. no errors, setup.py is present) then we can almost certainly use a distutils derivative; if something else (an exit code >0), then not. Basic distutils support is almost always present, but even the edge cases do not pose much of a problem in my experience, at least on widely used platforms. Now do this:
python setup.py bdist --help-formats | grep egg; echo $?
If that shows '0', proceed. If something else, take a handy file we have checked into our source code repository, and which is conventionally called 'setupegg.py', and copy it into your working directory. This file or something like it is distributed with a lot of Python projects. Its contents are as follows:
#!/usr/bin/env python
"""Wrapper to run setup.py using setuptools."""
import setuptools
execfile('setup.py')

If you don't know what /usr/bin/env is, look it up and start using it. It will help with 'virtualenv' later on.

Now do this ($setupfile is either setup.py or setupegg.py, depending on the previous steps):
python $setupfile bdist_egg
In a new 'dist' subdirectory, there should now be a file with one of two types of names:

${MODULENAME}-${MODULEVERSION}-py${PYTHONVERSION}.egg, for 100%-pure Python modules, or
${MODULENAME}-${MODULEVERSION}-py${PYTHONVERSION}-${OS}-${ARCHITECTURE}.egg, for modules that link to C or other code that makes shared objects and the like.

Some examples are python_dateutil-2.1-py2.7.egg (pure Python), numpy-1.6.1-py2.7-macosx-10.4-x86_64.egg, psycopg2-2.4.5-py2.7-linux-x86_64.egg, Twisted-12.0.0-py2.7-freebsd-9.0-RELEASE-amd64.egg.
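If you want a script to tell the two kinds of egg apart, the naming scheme is regular enough to parse. What follows is only a sketch: the function names are mine, not part of setuptools, and it assumes, as in the examples above, that the module name and version contain no hyphens of their own.

```python
import re

# Pattern for name-version-pyX.Y[-platform].egg, as in the examples above.
# Assumption: name and version contain no hyphens of their own.
EGG_NAME_RE = re.compile(
    r"^(?P<name>[^-]+)"
    r"-(?P<version>[^-]+)"
    r"-py(?P<pyver>\d+\.\d+)"
    r"(?:-(?P<platform>.+))?"
    r"\.egg$"
)


def parse_egg_filename(filename):
    """Split an egg file name into name, version, pyver and platform.

    platform is None for pure-Python eggs.
    """
    match = EGG_NAME_RE.match(filename)
    if match is None:
        raise ValueError("not a recognisable egg file name: %r" % filename)
    return match.groupdict()


def is_pure_python(filename):
    """True if the egg file name carries no OS/architecture suffix."""
    return parse_egg_filename(filename)["platform"] is None
```

With this, the psycopg2 egg above reports its platform as linux-x86_64, which is a handy thing to check before copying an egg to a FreeBSD box.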

Individual module packagers do not always do this in the exact standard way. You may have to read the README, INSTALL or BUILD files to figure out how to build an egg and supply the dependencies. In only one case, in all the libraries I have evaluated and wanted to install in the last year, did I have to modify setup.py or any other file, in order to build an egg I wanted. I'll be submitting a simple patch to that author pretty soon.

So now our module builds, and maybe it's a candidate drop-in replacement, and major improvement, for some code that is very important to us. Does it run? That's an easy question to answer... if you have an effective set of automated tests. That's a big if. And of course if you bite off more than you can chew, you may become mired in tests that fail but actually aren't a problem. And then you'll start ignoring them, and by then your good intentions will have taken you well down the road to Hell. But for Heaven's sake, do some testing: no compiler has your back in this environment. If you're not doing automated testing already, just get started, with the most basic test we can think of: can we do all the imports our code is trying to do? I have a script that finds all the import statements in a project, and attempts to run them in the target environment. If that returns an error, the build fails or the push is aborted.
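Here is roughly what such an import check can look like. This is a minimal sketch, not our actual script: the function names are made up for illustration, relative imports are skipped on the assumption that they refer to the project's own code, and it has to be run with the interpreter of the target environment for the result to mean anything.

```python
import ast
import importlib
import os


def find_imported_modules(source):
    """Return the set of module names imported by a piece of source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name)
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0): assumed to be project code.
            if node.level == 0 and node.module:
                names.add(node.module)
    return names


def collect_project_imports(root):
    """Walk a project directory and gather imports from every .py file."""
    names = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py"):
                with open(os.path.join(dirpath, filename)) as handle:
                    names |= find_imported_modules(handle.read())
    return names


def failed_imports(module_names):
    """Try each import in the current environment; return {name: error}."""
    failures = {}
    for name in sorted(module_names):
        try:
            importlib.import_module(name)
        except Exception as exc:  # a missing shared library also lands here
            failures[name] = repr(exc)
    return failures
```

A wrapper around this would call failed_imports(collect_project_imports(project_dir)) and exit with a non-zero status if anything comes back, which is enough to fail a build or abort a push.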

If you want to be able to push code that imports your new module to production without any additional fuss, you'll want to build the module on a machine that's identical, dependencies-outside-Python-wise, to your production servers, with the possible exception that you have build tools, header files, etc., on this machine, and not in production. I make a point of building for all our platforms, when I make a module available for the rest of the group to use. I also make a note of lower-level dependencies, and have Puppet ensure that the packages are installed on the relevant boxes. To stick with the yum-oriented example, if we're doing 'yum install atlas-devel' in our build environment, we will do 'yum install atlas' in production.

Now we have our egg. So we copy it into a directory where all the other eggs live. Since the namespace of Python projects, unlike say the Java project world of the Maven repositories, is flat rather than hierarchical, a single directory is the right thing here. Then we (deep breath!) commit it to version control. Yes, I know, built code in version control: no good! If you can't stand it, set up a build server, commit a script that does the exact steps for building the module, and run that. In my defense, I'll just say that I got to a rock-solid production environment faster this way than I otherwise would have, and we can always make our process dogmatically correct later if we wish.
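If you go the build-script route, the manual steps above translate fairly directly. The following is a sketch under a few assumptions: the function names are hypothetical, the source has already been unpacked into source_dir, and setupegg.py has been copied in if it turns out to be needed. It mirrors the 'grep egg' check rather than doing anything cleverer.

```python
import os
import subprocess
import sys


def supports_bdist_egg(help_formats_output):
    """Same check as 'bdist --help-formats | grep egg', on captured text."""
    return "egg" in help_formats_output


def choose_setup_file(source_dir, python=sys.executable):
    """Return 'setup.py' if it can build eggs itself, else 'setupegg.py'."""
    if not os.path.isfile(os.path.join(source_dir, "setup.py")):
        raise IOError("no setup.py in %s" % source_dir)
    proc = subprocess.Popen(
        [python, "setup.py", "bdist", "--help-formats"],
        cwd=source_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output = proc.communicate()[0].decode("utf-8", "replace")
    return "setup.py" if supports_bdist_egg(output) else "setupegg.py"


def build_egg(source_dir, python=sys.executable):
    """Run bdist_egg and return the names of the eggs found in dist/."""
    setup_file = choose_setup_file(source_dir, python)
    subprocess.check_call([python, setup_file, "bdist_egg"], cwd=source_dir)
    dist_dir = os.path.join(source_dir, "dist")
    return sorted(f for f in os.listdir(dist_dir) if f.endswith(".egg"))
```

Passing the path of a specific interpreter as the python argument is the scripted version of making sure you use the right Python when more than one is installed.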

Then we rsync the egg directory over to a web server on our internal network, that all our servers can see. Jenkins is a good tool for watching checkins and doing something like this, but you can just write a cron job that does 'svn update' or 'git pull' or whatever, and expose that with a web server, or rsync from there.

A word about PyPI and EggBasket, which are, respectively, the worldwide Python community library of publicly available projects, and a caching proxy for the same: I'm sure they can be made to work. When I was working primarily in Java, this was exactly the way I used to do things, with a combination of the publicly available Maven repositories, ant/ivy/maven, and Archiva. But I'm passing for now. To start doing that, I would need to convince myself that my production deployments can actually depend on my local EggBasket, and that such a setup would not download unverified things from the internet during a build or push. For now, I don't want PyPI on my critical path, any more than I wanted CPAN there back in the day.

Deploying Python code into an environment with supporting modules loaded

Now we have all the eggs we need in a place where we can find them, and our puppetmasters and puppet daemons are busily preparing the relevant servers for them. We are ready to modify our Python code, import the new module, do some development, and prepare a release for production. We can't have our servers getting confused about which libraries, or which versions of libraries, are installed in the environment where they want to run the code. We will also need to be able to roll back in case it all goes awry. For this we create a virtual environment, with the tool virtualenv. Creating a new environment with the right libraries goes something like this:
virtualenv v1234
source v1234/bin/activate
easy_install http://egg-hosting-webserver/packages/egg/modulename-moduleversion-py2.7.egg
(several more like that)
Then we bring the first server in the pool out of the load balancer, bring the service down, create a new virtualenv with the right libraries, switch a symbolic link from the old one to the new one, bring the service back up, run some tests, and put it back in the load balancer. If that all works, we do the same for the next one, and repeat until finished. If a mixed state is undesirable, we can do half-and-half instead of a rolling deployment.
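Our push tool is homegrown and I won't reproduce it here, but the shape of the procedure can be sketched in a few lines. Everything below is illustrative: the function names are invented, and the four callables passed to rolling_deploy stand in for whatever your load balancer, service manager and test suite actually require. The symlink switch assumes a POSIX system, where renaming one link over another replaces it in a single step.

```python
import os


def switch_symlink(link_path, new_target):
    """Point link_path at new_target, replacing any existing link."""
    tmp_link = link_path + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # On POSIX, rename() swaps the new link in for the old one atomically.
    os.rename(tmp_link, link_path)


def rolling_deploy(servers, take_out, upgrade, smoke_test, put_back):
    """Upgrade one server at a time, stopping at the first failure.

    take_out, upgrade and put_back are callables taking a server name;
    smoke_test returns True if the upgraded server looks healthy.
    Returns (servers_done, failed_server_or_None). A failed server is
    left out of the load balancer so it can be rolled back by hand.
    """
    done = []
    for server in servers:
        take_out(server)
        upgrade(server)  # stop service, build virtualenv, switch link, start
        if not smoke_test(server):
            return done, server
        put_back(server)
        done.append(server)
    return done, None
```

Rolling back is then the same switch_symlink call pointed at the previous virtualenv, which is why we keep the old environment around rather than modifying it in place.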

We happen to be using Tornado, and playing around with Twisted and gunicorn, as the conduit for delivering our payloads to our front end, but the techniques I have described work just as well for Django-based web sites and all manner of Python applications.

That's it. We'll write some more in future, about some of the science projects we have been able to deploy in this way.