The Architecture of Open Source Applications (Volume 1)
Python Packaging

Tarek Ziadé

14.1. Introduction

There are two schools of thought when it comes to installing applications. The first, common to Windows and Mac OS X, is that applications should be self-contained, and their installation should not depend on anything else. This philosophy simplifies the management of applications: each application is its own standalone "appliance", and installing and removing them should not disturb the rest of the OS. If the application needs an uncommon library, that library is included in the application's distribution.

The second school, which is the norm for Linux-based systems, treats software as a collection of small self-contained units called packages. Libraries are bundled into packages, and any given library package might depend on other packages. Installing an application might involve finding and installing particular versions of dozens of other libraries. These dependencies are usually fetched from a central repository that contains thousands of packages. This philosophy is why Linux distributions use complex package management systems like dpkg and RPM to track dependencies and prevent installation of two applications that use incompatible versions of the same library.

There are pros and cons to each approach. Having a highly modular system where every piece can be updated or replaced makes management easier, because each library is present in a single place, and all applications that use it benefit when it is updated. For instance, a security fix in a particular library will reach all applications that use it at once, whereas if an application ships with its own library, that security fix will be more complex to deploy, especially if different applications use different versions of the library.

But that modularity is seen as a drawback by some developers, because they're not in control of their applications and dependencies. It is easier for them to provide a standalone software appliance to be sure that the application environment is stable and not subject to "dependency hell" during system upgrades.

Self-contained applications also make the developer's life easier when she needs to support several operating systems. Some projects go so far as to release portable applications that remove any interaction with the hosting system by working in a self-contained directory, even for log files.

Python's packaging system was intended to make the second philosophy—multiple dependencies for each install—as developer-, admin-, packager-, and user-friendly as possible. Unfortunately it had (and has) a variety of flaws which caused or allowed all kinds of problems: unintuitive version schemes, mishandled data files, difficulty re-packaging, and more. Three years ago I and a group of other Pythoneers decided to reinvent it to address these problems. We call ourselves the Fellowship of the Packaging, and this chapter describes the problems we have been trying to fix, and what our solution looks like.

Terminology

In Python a package is a directory containing Python files. Python files are called modules. That definition makes the usage of the word "package" a bit vague since it is also used by many systems to refer to a release of a project.

Python developers themselves are sometimes vague about this. One way to remove this ambiguity is to use the term "Python packages" when we talk about a directory containing Python modules. The term "release" is used to define one version of a project, and the term "distribution" defines a source or a binary distribution of a release as something like a tarball or zip file.

14.2. The Burden of the Python Developer

Most Python programmers want their programs to be usable in any environment. They also usually want to use a mix of standard Python libraries and system-dependent libraries. But unless you package your application separately for every existing packaging system, you are doomed to provide Python-specific releases—a Python-specific release is a release aimed to be installed within a Python installation no matter what the underlying operating system is—and hope that your dependencies will themselves be available on every target system.

Sometimes, this is simply impossible. For example, Plone (a full-fledged Python-powered CMS) uses hundreds of small pure Python libraries that are not always available as packages in every packaging system out there. This means that Plone must ship everything that it needs in a portable application. To do this, it uses zc.buildout, which collects all its dependencies and creates a portable application that will run on any system within a single directory. It is effectively a binary release, since any piece of C code will be compiled in place.

This is a big win for developers: they just have to describe their dependencies using the Python standards described below and use zc.buildout to release their application. But as discussed earlier, this type of release sets up a fortress within the system, which most Linux sysadmins will hate. Windows admins won't mind, but those managing CentOS or Debian will, because those systems base their management on the assumption that every file in the system is registered, classified, and known to admin tools.

Those admins will want to repackage your application according to their own standards. The question we need to answer is, "Can Python have a packaging system that can be automatically translated into other packaging systems?" If so, one application or library can be installed on any system without requiring extra packaging work. Here, "automatically" doesn't necessarily mean that the work should be fully done by a script: RPM or dpkg packagers will tell you that's impossible—they always need to add some specifics in the projects they repackage. They'll also tell you that they often have a hard time re-packaging a piece of code because its developers were not aware of a few basic packaging rules.

Here's one example of what you can do to annoy packagers using the existing Python packaging system: release a library called "MathUtils" with the version name "Fumanchu". The brilliant mathematician who wrote the library found it amusing to use his cats' names for his project versions. But how can a packager know that "Fumanchu" is his second cat's name, and that the first one was called "Phil", so that the "Fumanchu" version comes after the "Phil" one?

This may sound extreme, but it can happen with today's tools and standards. The worst thing is that tools like easy_install or pip use their own non-standard registry to keep track of installed files, and will sort the "Fumanchu" and "Phil" versions alphanumerically.
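A quick illustration of how alphanumeric sorting goes wrong here (the session below is ours, not taken from any particular tool):

>>> sorted(['Fumanchu', 'Phil'])  # "Phil" was actually released first
['Fumanchu', 'Phil']

An installer comparing the two strings would therefore conclude that "Phil" is the later version.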

Another problem is how to handle data files. For example, what if your application uses an SQLite database? If you put it inside your package directory, your application might fail because the system forbids you to write in that part of the tree. Doing this will also compromise the assumptions Linux systems make about where application data is for backups (/var).

In the real world, system administrators need to be able to place your files where they want without breaking your application, and you need to tell them what those files are. So let's rephrase the question: is it possible to have a packaging system in Python that can provide all the information needed to repackage an application with any third-party packaging system out there without having to read the code, and make everyone happy?

14.3. The Current Architecture of Packaging

The Distutils package that comes with the Python standard library is riddled with the problems described above. Since it's the standard, people either live with it and its flaws, or use more advanced tools like Setuptools, which adds features on top of it, or Distribute, a fork of Setuptools. There's also Pip, a more advanced installer, which relies on Setuptools.

However, these newer tools are all based on Distutils and inherit its problems. Attempts were made to fix Distutils in place, but the code is so deeply used by other tools that any change to it, even its internals, is a potential regression in the whole Python packaging ecosystem.

We therefore decided to freeze Distutils and start the development of Distutils2 from the same code base, without worrying too much about backward compatibility. To understand what changed and why, let's have a closer look at Distutils.

14.3.1. Distutils Basics and Design Flaws

Distutils contains commands, each of which is a class with a run method that can be called with some options. Distutils also provides a Distribution class that contains global values every command can look at.

To use Distutils, a developer adds a single Python module to a project, conventionally called setup.py. This module contains a call to Distutils' main entry point: the setup function. This function can take many options, which are held by a Distribution instance and used by commands. Here's an example that defines a few standard options like the name and version of the project, and a list of modules it contains:

from distutils.core import setup

setup(name='MyProject', version='1.0', py_modules=['mycode'])

This module can then be used to run Distutils commands like sdist, which creates a source distribution in an archive and places it in a dist directory:

$ python setup.py sdist

Using the same script, you can install the project using the install command:

$ python setup.py install

Distutils provides other commands, such as:

  • bdist, to create a binary distribution;
  • register, to register a project with an online repository; and
  • upload, to push source or binary distributions to an online repository.

It will also let you get information about the project via other command line options.

So installing a project or getting information about it is always done by invoking Distutils through this file. For example, to find out the name of the project:

$ python setup.py --name
MyProject

setup.py is therefore how everyone interacts with the project, whether to build, package, publish, or install it. The developer describes the content of his project through options passed to a function, and uses that file for all his packaging tasks. The file is also used by installers to install the project on a target system.

Figure 14.1: Setup

Having a single Python module used for packaging, releasing, and installing a project is one of Distutils' main flaws. For example, if you want to get the name from the lxml project, setup.py will do a lot of things besides returning a simple string as expected:

$ python setup.py --name
Building lxml version 2.2.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c'
needs to be available.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/lxml

It might even fail to work on some projects, since developers make the assumption that setup.py is used only to install, and that other Distutils features are only used by them during development. The multiple roles of the setup.py script can easily cause confusion.

14.3.2. Metadata and PyPI

When Distutils builds a distribution, it creates a Metadata file that follows the standard described in PEP 314[1]. It contains a static version of all the usual metadata, like the name of the project or the version of the release. The main metadata fields are:

  • Name: the name of the project;
  • Version: the version of the release;
  • Summary and Description: a one-line summary and a longer description;
  • Home-page: the URL of the project;
  • Author: the author's name;
  • Classifier: categories that describe the project, using a fixed list of values;
  • Requires, Provides, and Obsoletes: module-level dependency information.

These fields are for the most part easy to map to equivalents in other packaging systems.

The Python Package Index (PyPI)[2], a central repository of packages like CPAN, is able to register projects and publish releases via Distutils' register and upload commands. register builds the Metadata file and sends it to PyPI, allowing people and tools—like installers—to browse it via web pages or via web services.

Figure 14.2: The PyPI Repository

You can browse projects by Classifiers, and get the author name and project URL. Meanwhile, Requires can be used to define dependencies on Python modules. The requires option can be used to add a Requires metadata element to the project:

from distutils.core import setup

setup(name='foo', version='1.0', requires=['ldap'])

Defining a dependency on the ldap module is purely declarative: no tools or installers ensure that such a module exists. This would be satisfactory if Python defined requirements at the module level through a require keyword like Perl does. Then it would just be a matter of the installers browsing the dependencies at PyPI and installing them; that's basically what CPAN does. But that's not possible in Python since a module named ldap can exist in any Python project. Since Distutils allows people to release projects that can contain several packages and modules, this metadata field is not useful at all.

Another flaw of Metadata files is that they are created by a Python script, so they are specific to the platform they are executed in. For example, a project that provides features specific to Windows could define its setup.py as:

from distutils.core import setup

setup(name='foo', version='1.0', requires=['win32com'])

But this assumes that the project only works under Windows, even if it provides portable features. One way to solve this is to make the requires option specific to Windows:

from distutils.core import setup
import sys
if sys.platform == 'win32':
    setup(name='foo', version='1.0', requires=['win32com'])
else:
    setup(name='foo', version='1.0')

This actually makes the issue worse. Remember, the script is used to build source archives that are then released to the world via PyPI. This means that the static Metadata file sent to PyPI is dependent on the platform that was used to compile it. In other words, there is no way to indicate statically in the metadata field that it is platform-specific.

14.3.3. Architecture of PyPI

Figure 14.3: PyPI Workflow

As indicated earlier, PyPI is a central index of Python projects where people can browse existing projects by category or register their own work. Source or binary distributions can be uploaded and added to an existing project, and then downloaded for installation or study. PyPI also offers web services that can be used by tools like installers.

Registering Projects and Uploading Distributions

Registering a project with PyPI is done with the Distutils register command. It builds a POST request containing the metadata of the project, whatever its version is. The request requires an Authorization header, as PyPI uses Basic Authentication to make sure every registered project is associated with a user that has first registered with PyPI. Credentials are kept in the local Distutils configuration or typed at the prompt every time a register command is invoked. An example of its use is:

$ python setup.py register
running register
Registering MPTools to http://pypi.python.org/pypi
Server response (200): OK

Each registered project gets a web page with an HTML version of the metadata, and packagers can upload distributions to PyPI using upload:

$ python setup.py sdist upload
running sdist
…
running upload
Submitting dist/mopytools-0.1.tar.gz to http://pypi.python.org/pypi
Server response (200): OK

It's also possible to point users to another location via the Download-URL metadata field rather than uploading files directly to PyPI.

Querying PyPI

Besides the HTML pages PyPI publishes for web users, it provides two services that tools can use to browse the content: the Simple Index protocol and the XML-RPC APIs.

The Simple Index protocol starts at http://pypi.python.org/simple/, a plain HTML page that contains relative links to every registered project:

<html><head><title>Simple Index</title></head><body>
⋮    ⋮    ⋮
<a href='MontyLingua/'>MontyLingua</a><br/>
<a href='mootiro_web/'>mootiro_web</a><br/>
<a href='Mopidy/'>Mopidy</a><br/>
<a href='mopowg/'>mopowg</a><br/>
<a href='MOPPY/'>MOPPY</a><br/>
<a href='MPTools/'>MPTools</a><br/>
<a href='morbid/'>morbid</a><br/>
<a href='Morelia/'>Morelia</a><br/>
<a href='morse/'>morse</a><br/>
⋮    ⋮    ⋮
</body></html>

For example, the MPTools project has a MPTools/ link, which means that the project exists in the index. The page it points to contains a list of all the links related to the project:

  • links for every distribution stored at PyPI
  • links for every Home URL defined in the Metadata, for each version of the project registered
  • links for every Download-URL defined in the Metadata, for each version as well.

The page for MPTools contains:

<html><head><title>Links for MPTools</title></head>
<body><h1>Links for MPTools</h1>
<a href="../../packages/source/M/MPTools/MPTools-0.1.tar.gz">MPTools-0.1.tar.gz</a><br/>
<a href="http://bitbucket.org/tarek/mopytools" rel="homepage">0.1 home_page</a><br/>
</body></html>

Tools like installers that want to find distributions of a project can look for it in the index page, or simply check if http://pypi.python.org/simple/PROJECT_NAME/ exists.
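For instance, here is a minimal sketch of that existence check, using Python 2's urllib2 to match the era of the other examples (the helper name is ours):

import urllib2

def project_exists(name):
    # The Simple Index answers with a 404 for unregistered projects.
    try:
        urllib2.urlopen('http://pypi.python.org/simple/%s/' % name)
    except urllib2.HTTPError, error:
        if error.code == 404:
            return False
        raise
    return True

print(project_exists('MPTools'))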

This protocol has two main limitations. First, PyPI is a single server right now, and while people usually have local copies of its content, we have experienced several downtimes in the past two years that have paralyzed developers who rely on installers that browse PyPI to get all the dependencies a project requires when it is built. For instance, building a Plone application will generate several hundred queries to PyPI to get all the required bits, so PyPI can act as a single point of failure.

Second, when the distributions are not stored at PyPI and a Download-URL link is provided in the Simple Index page, installers have to follow that link and hope that the location will be up and will really contain the release. This indirection weakens any Simple Index-based process.

The Simple Index protocol's goal is to give installers a list of links they can use to install a project. The project metadata is not published there; instead, there are XML-RPC methods to get extra information about registered projects:

>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('MPTools')
['0.1']
>>> pprint.pprint(client.release_urls('MPTools', '0.1'))
[{'comment_text': '',
'downloads': 28,
'filename': 'MPTools-0.1.tar.gz',
'has_sig': False,
'md5_digest': '6b06752d62c4bffe1fb65cd5c9b7111a',
'packagetype': 'sdist',
'python_version': 'source',
'size': 3684,
'upload_time': <DateTime '20110204T09:37:12' at f4da28>,
'url': 'http://pypi.python.org/packages/source/M/MPTools/MPTools-0.1.tar.gz'}]
>>> pprint.pprint(client.release_data('MPTools', '0.1'))
{'author': 'Tarek Ziade',
'author_email': 'tarek@mozilla.com',
'classifiers': [],
'description': 'UNKNOWN',
'download_url': 'UNKNOWN',
'home_page': 'http://bitbucket.org/tarek/mopytools',
'keywords': None,
'license': 'UNKNOWN',
'maintainer': None,
'maintainer_email': None,
'name': 'MPTools',
'package_url': 'http://pypi.python.org/pypi/MPTools',
'platform': 'UNKNOWN',
'release_url': 'http://pypi.python.org/pypi/MPTools/0.1',
'requires_python': None,
'stable_version': None,
'summary': 'Set of tools to build Mozilla Services apps',
'version': '0.1'}

The issue with this approach is that some of the data that the XML-RPC APIs are publishing could have been stored as static files and published in the Simple Index page to simplify the work of client tools. That would also avoid the extra work PyPI has to do to handle those queries. It's fine to have non-static data like the number of downloads per distribution published in a specialized web service, but it does not make sense to have to use two different services to get all static data about a project.

14.3.4. Architecture of a Python Installation

If you install a Python project using python setup.py install, Distutils—which is included in the standard library—will copy the files onto your system.

Ever since Python 2.5, the metadata file has been copied alongside the modules and packages as project-version.egg-info. For example, the virtualenv project could have a virtualenv-1.4.9.egg-info file. These metadata files can be considered a database of installed projects, since it's possible to iterate over them and build a list of projects with their versions. However, the Distutils installer does not record the list of files it installs on the system. In other words, there is no way to remove all the files that were copied onto the system. This is a shame, since the install command has a --record option that can be used to record all installed files in a text file. However, this option is not used by default, and Distutils' documentation barely mentions it.
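For instance, recording the installed files looks like this (the output file name is arbitrary):

$ python setup.py install --record installed-files.txt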

14.3.5. Setuptools, Pip and the Like

As mentioned in the introduction, some projects tried to fix some of the problems with Distutils, with varying degrees of success.

The Dependencies Issue

PyPI allowed developers to publish Python projects that could include several modules organized into Python packages. But at the same time, projects could define module-level dependencies via Requires. Both ideas are reasonable, but their combination is not.

The right thing to do was to have project-level dependencies, which is exactly what Setuptools added as a feature on top of Distutils. It also provided a script called easy_install to automatically fetch and install dependencies by looking for them on PyPI. In practice, module-level dependencies were never really used, and people jumped on Setuptools' extensions. But since these features were added in options specific to Setuptools, and ignored by Distutils and PyPI, Setuptools effectively created its own standard and became a hack on top of a bad design.

easy_install therefore needs to download the archive of the project and run its setup.py script again to get the metadata it needs, and it has to do this again for every dependency. The dependency graph is built bit by bit after each download.

Even if the new metadata were accepted by PyPI and browsable online, easy_install would still need to download all the archives because, as said earlier, metadata published at PyPI is specific to the platform that was used to upload it, which can differ from the target platform. But this ability to install a project and its dependencies was good enough in 90% of cases, and was a great feature to have. So Setuptools became widely used, although it still suffers from other problems:

  • If a dependency install fails, there is no rollback and the system can end up in a broken state.
  • The dependency graph is built on the fly during installation, so if a dependency conflict is encountered the system can end up in a broken state as well.

The Uninstall Issue

Setuptools did not provide an uninstaller, even though its custom metadata could have contained a file listing the installed files. Pip, on the other hand, extended Setuptools' metadata to record installed files, and is therefore able to uninstall. But that's yet another custom set of metadata, which means that a single Python installation may contain up to four different flavours of metadata for each installed project:

  • Distutils' egg-info, which is a single metadata file.
  • Setuptools' egg-info, which is a directory containing the metadata and extra Setuptools specific options.
  • Pip's egg-info, which is an extended version of the previous.
  • Whatever the hosting packaging system creates.

14.3.6. What About Data Files?

In Distutils, data files can be installed anywhere on the system. If you define some package data files in your setup.py script like this:

setup(…,
  packages=['mypkg'],
  package_dir={'mypkg': 'src/mypkg'},
  package_data={'mypkg': ['data/*.dat']},
  )

then all files with the .dat extension in the mypkg project will be included in the distribution and eventually installed along with the Python modules in the Python installation.

For data files that need to be installed outside the Python distribution, there's another option that stores files in the archive but puts them in defined locations:

setup(…,
    data_files=[('bitmaps', ['bm/b1.gif', 'bm/b2.gif']),
                ('config', ['cfg/data.cfg']),
                ('/etc/init.d', ['init-script'])]
    )

This is terrible news for OS packagers for several reasons:

  • The data files are not described in the metadata, so packagers have to read setup.py and sometimes the project's code to find out where they will land.
  • Target paths like /etc/init.d are hardcoded by the developer, even though they are platform-specific.
  • The project's code makes assumptions about the final location of these files, so they cannot simply be moved elsewhere.

A packager who needs to repackage a project with such a file has no choice but to patch the setup.py file so that it works as expected for her platform. To do that, she must review the code and change every line that uses those files, since the developer made an assumption about their location. Setuptools and Pip did not improve this.

14.4. Improved Standards

So we ended up with a mixed-up and confusing packaging environment, where everything is driven by a single Python module, with incomplete metadata and no way to describe everything a project contains. Here's what we're doing to make things better.

14.4.1. Metadata

The first step is to fix our Metadata standard. PEP 345 defines a new version that includes:

  • a saner way to define versions,
  • project-level dependencies, and
  • a static way to define platform-specific values.

Version

One goal of the metadata standard is to make sure that all tools that operate on Python projects are able to classify them the same way. For versions, it means that every tool should be able to know that "1.1" comes after "1.0". But if projects have custom versioning schemes, this becomes much harder.

The only way to ensure consistent versioning is to publish a standard that projects will have to follow. The scheme we chose is a classical sequence-based scheme. As defined in PEP 386, its format is:

N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]

where:

  • N is an integer. You can use as many Ns as you want and separate them by dots, as long as there are at least two (MAJOR.MINOR).
  • a, b, c and rc are alpha, beta and release candidate markers. They are followed by an integer. Release candidates have two markers because we wanted the scheme to be compatible with Python, which uses rc. But we find c simpler.
  • dev followed by a number is a dev marker.
  • post followed by a number is a post-release marker.

Depending on the project release process, dev or post markers can be used for all intermediate versions between two final releases. Most processes use dev markers.

Following this scheme, PEP 386 defines a strict ordering:

  • alpha < beta < rc < final
  • dev < non-dev < post, where non-dev can be an alpha, beta, rc or final

Here's a full ordering example:

1.0a1 < 1.0a2.dev456 < 1.0a2 < 1.0a2.1.dev456
  < 1.0a2.1 < 1.0b1.dev456 < 1.0b2 < 1.0b2.post345
    < 1.0c1.dev456 < 1.0c1 < 1.0.dev456 < 1.0
      < 1.0.post456.dev34 < 1.0.post456

The goal of this scheme is to make it easy for other packaging systems to translate Python projects' versions into their own schemes. PyPI now rejects any project that uploads PEP 345 metadata with version numbers that don't follow PEP 386.
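For illustration, here's how this ordering looks with Distutils2's version module, which implements PEP 386 (a quick session using its NormalizedVersion class):

>>> from distutils2.version import NormalizedVersion as V
>>> V('1.0a1') < V('1.0a2.dev456') < V('1.0a2')
True
>>> V('1.0.dev456') < V('1.0') < V('1.0.post456')
True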

Dependencies

PEP 345 defines three new fields that replace the PEP 314 fields Requires, Provides, and Obsoletes. Those fields are Requires-Dist, Provides-Dist, and Obsoletes-Dist, and they can be used multiple times in the metadata.

For Requires-Dist, each entry contains a string naming some other Distutils project required by this distribution. The format of a requirement string is identical to that of a Distutils project name (e.g., as found in the Name field), optionally followed by a version declaration within parentheses. These Distutils project names should correspond to names as found at PyPI, and version declarations must follow the rules described in PEP 386. Some examples are:

Requires-Dist: pkginfo
Requires-Dist: PasteDeploy
Requires-Dist: zope.interface (>3.5.0)

Provides-Dist is used to define extra names contained in the project. It's useful when a project wants to merge with another project. For example, the ZODB project can include the transaction project and state:

Provides-Dist: transaction

Obsoletes-Dist is useful to mark another project as an obsolete version:

Obsoletes-Dist: OldName

Environment Markers

An environment marker is a condition about the execution environment that can be added at the end of a field after a semicolon. Some examples are:

Requires-Dist: pywin32 (>1.0); sys.platform == 'win32'
Obsoletes-Dist: pywin31; sys.platform == 'win32'
Requires-Dist: foo (1,!=1.3); platform.machine == 'i386'
Requires-Dist: bar; python_version == '2.4' or python_version == '2.5'
Requires-External: libxslt; 'linux' in sys.platform

The micro-language for environment markers is deliberately kept simple enough for non-Python programmers to understand: it compares strings with the == and in operators (and their opposites), and allows the usual Boolean combinations. The fields in PEP 345 that can use this marker are:

  • Requires-Python
  • Requires-External
  • Requires-Dist
  • Provides-Dist
  • Obsoletes-Dist
  • Classifier
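To illustrate the semantics, here is a deliberately naive marker evaluator (a sketch only; real tools should parse the micro-language rather than substitute and eval strings):

import os
import platform
import sys

def evaluate_marker(marker):
    # The values the marker micro-language can refer to.
    environment = {
        'sys.platform': sys.platform,
        'python_version': '%s.%s' % sys.version_info[:2],
        'platform.machine': platform.machine(),
        'os.name': os.name,
    }
    # Substitute each name with its quoted value, then evaluate the
    # resulting Boolean expression.
    for name, value in environment.items():
        marker = marker.replace(name, repr(value))
    return eval(marker, {'__builtins__': {}})

print(evaluate_marker("sys.platform == 'win32'"))
print(evaluate_marker("'linux' in sys.platform"))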

14.4.2. What's Installed?

Having a single installation format shared among all Python tools is mandatory for interoperability. If we want Installer A to detect that Installer B has previously installed project Foo, they both need to share and update the same database of installed projects.

Of course, users should ideally use a single installer in their system, but they may want to switch to a newer installer that has specific features. For instance, Mac OS X ships Setuptools, so users automatically have the easy_install script. If they want to switch to a newer tool, they will need it to be backward compatible with the previous one.

Another problem when using a Python installer on a platform that has a packaging system like RPM is that there is no way to inform the system that a project is being installed. What's worse, even if the Python installer could somehow ping the central packaging system, we would need a mapping between the Python metadata and the system metadata. The name of the project, for instance, may differ between the two. That can occur for several reasons. The most common one is a name conflict: another project outside the Python world already uses the same name for its RPM. Another cause is the use of a python prefix that breaks the naming conventions of the platform. For example, if you name your project foo-python, chances are high that the Fedora RPM will be called python-foo.

One way to avoid this problem is to leave the global Python installation alone, managed by the central packaging system, and work in an isolated environment. Tools like Virtualenv allow this.

In any case, we do need a single installation format in Python, because interoperability is also a concern for other packaging systems when they themselves install Python projects. Once a third-party packaging system has registered a newly installed project in its own database on the system, it needs to generate the right metadata for the Python installation itself, so projects appear to be installed to Python installers or any APIs that query the Python installation.

The metadata mapping issue can be addressed in that case: since an RPM knows which Python projects it wraps, it can generate the proper Python-level metadata. For instance, it knows that python26-webob is called WebOb in the PyPI ecosystem.

Back to our standard: PEP 376 defines a standard for installed packages whose format is quite similar to those used by Setuptools and Pip. This structure is a directory with a dist-info extension that contains:

  • METADATA: the metadata, as described in PEP 345;
  • RECORD: the list of installed files, with their checksums and sizes;
  • INSTALLER: the name of the tool used to install the project;
  • REQUESTED: a marker file whose presence indicates that the project was installed explicitly rather than as a dependency.

Once all tools out there understand this format, we'll be able to manage projects in Python without depending on a particular installer and its features. Also, since PEP 376 defines the metadata as a directory, it will be easy to add new files to extend it. As a matter of fact, a new metadata file called RESOURCES, described in the next section, might be added in the near future without modifying PEP 376. Eventually, if this new file turns out to be useful for all tools, it will be added to the PEP.
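As a sketch of what browsing this database could look like, here is a loop over a site-packages directory (the path and helper name are ours; the real query APIs are meant to live in pkgutil):

import os

def installed_projects(site_packages):
    # Each PEP 376 entry is a directory named <name>-<version>.dist-info.
    for entry in os.listdir(site_packages):
        if entry.endswith('.dist-info'):
            name, version = entry[:-len('.dist-info')].rsplit('-', 1)
            yield name, version

for name, version in installed_projects('/usr/lib/python2.6/site-packages'):
    print('%s %s' % (name, version))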

14.4.3. Architecture of Data Files

As described earlier, we need to let the packager decide where to put data files during installation without breaking the developer's code. At the same time, the developer must be able to work with data files without having to worry about their location. Our solution is the usual one: indirection.

Using Data Files

Suppose your MPTools application needs to work with a configuration file. The developer will put that file in a Python package and use __file__ to reach it:

import os

here = os.path.dirname(__file__)
cfg = open(os.path.join(here, 'config', 'mopy.cfg'))

This implies that configuration files are installed like code, and that the developer must place them alongside her code: in this example, in a subdirectory called config.

The new architecture of data files we have designed uses the project tree as the root of all files, and allows access to any file in the tree, whether it is located in a Python package or a simple directory. This allows developers to create a dedicated directory for data files and access them using pkgutil.open:

import os
import pkgutil

# Open the file located in config/mopy.cfg in the MPTools project
cfg = pkgutil.open('MPTools', 'config/mopy.cfg')

pkgutil.open looks for the project metadata and checks whether it contains a RESOURCES file. This is a simple mapping of project-relative paths to the locations where the files live on the system:

config/mopy.cfg {confdir}/{distribution.name}

Here the {confdir} variable points to the system's configuration directory, and {distribution.name} contains the name of the Python project as found in the metadata.
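Here is a sketch of the lookup such an API could perform, assuming the installer wrote final, expanded paths into RESOURCES (the helper is hypothetical, not the actual Distutils2 API):

import os

def find_resource(metadata_dir, resource_path):
    # RESOURCES holds one "source destination" pair per line.
    for line in open(os.path.join(metadata_dir, 'RESOURCES')):
        source, destination = line.split()
        if source == resource_path:
            return destination
    raise IOError('%s is not a registered data file' % resource_path)

# e.g. find_resource('MPTools-0.1.dist-info', 'config/mopy.cfg')
# could return '/etc/MPTools/mopy.cfg'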

Figure 14.4: Finding a File

As long as this RESOURCES metadata file is created at installation time, the API will find the location of mopy.cfg for the developer. And since config/mopy.cfg is a path relative to the project tree, we can also offer a development mode where the metadata for the project is generated in place and added to the lookup paths for pkgutil.

Declaring Data Files

In practice, a project can define where its data files should land by defining a mapper in its setup.cfg file. A mapper is a list of (glob-style pattern, target) tuples. Each pattern matches one or more files in the project tree, while the target is an installation path that may contain variables in brackets. For example, MPTools's setup.cfg could look like this:

[files]
resources =
        config/mopy.cfg {confdir}/{application.name}/
        images/*.jpg    {datadir}/{application.name}/

The sysconfig module will provide and document a specific list of variables that can be used, along with default values for each platform; for example, {confdir} is /etc on Linux. Installers can therefore use this mapper in conjunction with sysconfig at installation time to decide where the files should be placed. Eventually, they will generate the RESOURCES file mentioned earlier in the installed metadata, so pkgutil can find the files.
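To make the mechanics concrete, here is a sketch of the expansion an installer might perform (the variable values below are illustrative stand-ins for what sysconfig would supply):

import glob
import os

# Illustrative values only; sysconfig is meant to document the real
# per-platform defaults.
variables = {
    '{confdir}': '/etc',
    '{datadir}': '/usr/share',
    '{application.name}': 'MPTools',
}

def expand(mapper):
    # Turn (glob pattern, target) pairs into the concrete
    # (source file, installed path) pairs recorded in RESOURCES.
    for pattern, target in mapper:
        for name, value in variables.items():
            target = target.replace(name, value)
        for source in glob.glob(pattern):
            yield source, os.path.join(target, os.path.basename(source))

mapper = [('config/mopy.cfg', '{confdir}/{application.name}/'),
          ('images/*.jpg', '{datadir}/{application.name}/')]
for source, destination in expand(mapper):
    print('%s -> %s' % (source, destination))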

Figure 14.5: Installer

14.4.4. PyPI Improvements

I said earlier that PyPI is effectively a single point of failure. PEP 381 addresses this problem by defining a mirroring protocol, so that users can fall back to alternative servers when PyPI is down. The goal is to allow members of the community to run mirrors around the world.

Figure 14.6: Mirroring

The mirror list is provided as a list of host names of the form X.pypi.python.org, where X is in the sequence a,b,c,…,aa,ab,…. a.pypi.python.org is the master server and mirrors start with b. A CNAME record last.pypi.python.org points to the last host name so clients that are using PyPI can get the list of the mirrors by looking at the CNAME.

For example, this call tells us that the last mirror is h.pypi.python.org, meaning that PyPI currently has 7 mirrors (b through h):

>>> import socket
>>> socket.gethostbyname_ex('last.pypi.python.org')[0]
'h.pypi.python.org'

Potentially, this protocol allows clients to redirect requests to the nearest mirror by locating the mirrors from their IP addresses, and also to fall back to the next mirror if a mirror or the master server is down. The mirroring protocol itself is more complex than a simple rsync because we wanted to keep download statistics accurate and provide minimal security.

Synchronization

Mirrors must reduce the amount of data transferred between the central server and the mirror. To achieve this, they must use the changelog PyPI XML-RPC call, and only re-fetch the packages that have changed since the last synchronization. For each package P, they must copy the documents /simple/P/ and /serversig/P.

If a package is deleted on the central server, they must delete the package and all associated files. To detect modification of package files, they may cache the file's ETag, and may request skipping it using the If-None-Match header. Once the synchronization is over, the mirror changes its /last-modified to the current date.
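Here is a sketch of that incremental loop, using PyPI's changelog XML-RPC method (the update actions are just printed here; a real mirror would fetch and delete files):

import time
import xmlrpclib

client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
last_sync = int(time.time()) - 24 * 60 * 60  # pretend we synchronized a day ago

# changelog() returns (name, version, timestamp, action) tuples describing
# everything that changed on PyPI since the given UNIX timestamp.
for name, version, timestamp, action in client.changelog(last_sync):
    if action == 'remove':
        print('delete %s and all its files' % name)
    else:
        print('refetch /simple/%s/ and /serversig/%s' % (name, name))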

Statistics Propagation

When you download a release from any of the mirrors, the protocol ensures that the download hit is transmitted to the master PyPI server, then to other mirrors. Doing this ensures that people or tools browsing PyPI to find out how many times a release was downloaded will get a value summed across all mirrors.

Statistics are grouped into daily and weekly CSV files in the stats directory at the central PyPI itself. Each mirror needs to provide a local-stats directory that contains its own statistics. Each file provides the number of downloads for each archive, grouped by user agents. The central server visits mirrors daily to collect those statistics, and merges them back into the global stats directory, so each mirror must keep /local-stats up to date at least once a day.

Mirror Authenticity

With any distributed mirroring system, clients may want to verify that the mirrored copies are authentic. Some of the possible threats include:

  • the central index may be compromised
  • the mirrors might be tampered with
  • a man-in-the-middle attack between the central index and the end user, or between a mirror and the end user

To detect the first attack, package authors need to sign their packages using PGP keys, so that users can verify that the package comes from the author they trust. The mirroring protocol itself only addresses the second threat, though some attempt is made to detect man-in-the-middle attacks.

The central index provides a DSA key at the URL /serverkey, in the PEM format as generated by openssl dsa -pubout[3]. This URL must not be mirrored, and clients must fetch the official serverkey from PyPI directly, or use the copy that came with the PyPI client software. Mirrors should still download the key so that they can detect a key rollover.

For each package, a mirrored signature is provided at /serversig/package. This is the DSA signature of the parallel URL /simple/package, in DER form, using SHA-1 with DSA[4].

Clients using a mirror need to perform the following steps to verify a package:

  1. Download the /simple page, and compute its SHA-1 hash.
  2. Compute the DSA signature of that hash.
  3. Download the corresponding /serversig, and compare it byte for byte with the value computed in step 2.
  4. Compute and verify (against the /simple page) the MD5 hashes of all files they download from the mirror.
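For step 4, the expected MD5 digest has historically been embedded in each distribution link on the /simple page as a #md5= fragment. A minimal sketch of the check (the helper name is ours):

import hashlib
import urllib2

def verify_download(url, expected_md5):
    # Compare the archive's MD5 digest with the one advertised on the
    # project's /simple page.
    data = urllib2.urlopen(url).read()
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise ValueError('tampered or corrupted download: %s' % url)
    return data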

Verification is not needed when downloading from the central index, and clients should not do it, to reduce the computation overhead.

About once a year, the key will be replaced with a new one. Mirrors will have to re-fetch all /serversig pages. Clients using mirrors need to find a trusted copy of the new server key. One way to obtain one is to download it from https://pypi.python.org/serverkey. To detect man-in-the-middle attacks, clients need to verify the SSL server certificate, which will be signed by the CACert authority.

14.5. Implementation Details

The implementation of most of the improvements described in the previous section is taking place in Distutils2. The setup.py file is not used anymore, and a project is completely described in setup.cfg, a static .ini-like file. By doing this, we make it easier for packagers to change the behavior of a project installation without having to deal with Python code. Here's an example of such a file:

[metadata]
name = MPTools
version = 0.1
author = Tarek Ziade
author-email = tarek@mozilla.com
summary = Set of tools to build Mozilla Services apps
description-file = README
home-page = http://bitbucket.org/tarek/pypi2rpm
project-url: Repository, http://hg.mozilla.org/services/server-devtools
classifier = Development Status :: 3 - Alpha
    License :: OSI Approved :: Mozilla Public License 1.1 (MPL 1.1)
[files]
packages =
        mopytools
        mopytools.tests

extra_files =
        setup.py
        README
        build.py
        _build.py

resources =
    etc/mopytools.cfg {confdir}/mopytools
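
One practical consequence of the static format: any tool can read the project description without executing any code from the project. A minimal sketch:

from ConfigParser import RawConfigParser  # configparser in Python 3

parser = RawConfigParser()
parser.read('setup.cfg')
print(parser.get('metadata', 'name'))
print(parser.get('metadata', 'version'))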

Distutils2 uses this configuration file to:

  • generate, on any platform, the PEP 345 metadata describing the release;
  • collect the files that belong in source and binary distributions; and
  • find out, via the resources mapper, where data files should be installed.

Distutils2 also implements the new version scheme (PEP 386) via its version module.

The INSTALL-DB implementation will find its way to the standard library in Python 3.3 and will be in the pkgutil module. In the interim, a version of this module exists in Distutils2 for immediate use. The provided APIs will let us browse an installation and know exactly what's installed.

These APIs are the basis for some neat Distutils2 features:

  • browsing an installation to list every installed project, with its version and metadata; and
  • uninstalling a project, since the files it installed are recorded at installation time.

14.6. Lessons Learned

14.6.1. It's All About PEPs

Changing an architecture as wide and complex as Python packaging needs to be done carefully, by changing standards through the PEP process. And changing or adding a PEP takes, in my experience, around a year.

One mistake the community made along the way was to deliver tools that solved some issues by extending the Metadata and the way Python applications were installed without trying to change the impacted PEPs.

In other words, depending on the tool you used—the standard library's Distutils or Setuptools—applications were installed differently. The problems were solved for the part of the community that used these new tools, but more problems were added for the rest of the world. OS packagers, for instance, had to face several Python standards: the official, documented standard and the de facto standard imposed by Setuptools.

But in the meantime, Setuptools had the opportunity to experiment with some innovations at a realistic scale (the whole community) and at a very fast pace, and the feedback was invaluable. We were able to write new PEPs with more confidence about what worked and what did not, and maybe it would have been impossible to do so differently. So it's all about detecting when third-party tools are contributing innovations that solve problems, and should ignite a PEP change.

14.6.2. A Package that Enters the Standard Library Has One Foot in the Grave

I am paraphrasing Guido van Rossum in the section title, but that's one aspect of the batteries-included philosophy of Python that greatly impacts our efforts.

Distutils is part of the standard library, and Distutils2 will soon be. A package that lives in the standard library is very hard to evolve. There are, of course, deprecation processes, where you can kill or change an API after two minor versions of Python. But once an API is published, it's going to stay there for years.

So any change you make to a package in the standard library that is not a bug fix is a potential disturbance for the ecosystem. So when you're making important changes, you have to create a new package.

I learned this the hard way with Distutils, since I eventually had to revert all the changes I had made to it over more than a year and create Distutils2. In the future, if our standards change again in a drastic way, chances are high that we will start a standalone Distutils3 project first, unless the standard library is released on its own at some point.

14.6.3. Backward Compatibility

Changing the way packaging works in Python is a very long process: the Python ecosystem contains so many projects based on older packaging tools that there is and will be a lot of resistance to change. (Reaching consensus on some of the topics discussed in this chapter took several years, rather than the few months I originally expected.) As with Python 3, it will take years before all projects switch to the new standard.

That's why everything we are doing has to be backward-compatible with all previous tools, installations and standards, which makes the implementation of Distutils2 a wicked problem.

For example, if a project that uses the new standards depends on another project that doesn't use them yet, we can't stop the installation process by telling the end user that the dependency is in an unknown format!

The INSTALL-DB implementation, for instance, contains compatibility code to browse projects installed by the original Distutils, Pip, Distribute, or Setuptools. Distutils2 is also able to install projects created by the original Distutils, by converting their metadata on the fly.

14.7. References and Contributions

Some sections in this chapter were directly taken from the various PEP documents we wrote for packaging. You can find the original documents at http://python.org:

  • PEP 314: Metadata for Python Software Packages v1.1
  • PEP 345: Metadata for Python Software Packages 1.2
  • PEP 376: Database of Installed Python Distributions
  • PEP 381: Mirroring Infrastructure for PyPI
  • PEP 386: Changing the version comparison module in Distutils

I would like to thank all the people who are working on packaging; you will find their names in every PEP I've mentioned. I would also like to give special thanks to all the members of The Fellowship of the Packaging. Also, thanks to Alexis Metaireau, Toshio Kuratomi, Holger Krekel and Stefane Fermigier for their feedback on this chapter.

The projects that were discussed in this chapter are:

  • Distutils
  • Distutils2
  • Setuptools
  • Distribute
  • Pip
  • Virtualenv
  • zc.buildout

Footnotes

  1. The Python Enhancement Proposals, or PEPs, that we refer to are summarized at the end of this chapter
  2. Formerly known as the CheeseShop.
  3. I.e., RFC 3280 SubjectPublicKeyInfo, with the algorithm 1.3.14.3.2.12.
  4. I.e., as a RFC 3279 Dsa-Sig-Value, created by algorithm 1.2.840.10040.4.3.