There are two schools of thought when it comes to installing applications. The first, common to Windows and Mac OS X, is that applications should be self-contained, and their installation should not depend on anything else. This philosophy simplifies the management of applications: each application is its own standalone "appliance", and installing and removing them should not disturb the rest of the OS. If the application needs an uncommon library, that library is included in the application's distribution.
The second school, which is the norm for Linux-based systems, treats software as a collection of small self-contained units called packages. Libraries are bundled into packages, and any given library package might depend on other packages. Installing an application might involve finding and installing particular versions of dozens of other libraries. These dependencies are usually fetched from a central repository that contains thousands of packages. This philosophy is why Linux distributions use complex package management systems like dpkg and RPM to track dependencies and prevent installation of two applications that use incompatible versions of the same library.
There are pros and cons to each approach. Having a highly modular system where every piece can be updated or replaced makes management easier, because each library is present in a single place, and all applications that use it benefit when it is updated. For instance, a security fix in a particular library will reach all applications that use it at once, whereas if an application ships with its own library, that security fix will be more complex to deploy, especially if different applications use different versions of the library.
But that modularity is seen as a drawback by some developers, because they're not in control of their applications and dependencies. It is easier for them to provide a standalone software appliance to be sure that the application environment is stable and not subject to "dependency hell" during system upgrades.
Self-contained applications also make the developer's life easier when she needs to support several operating systems. Some projects go so far as to release portable applications that remove any interaction with the hosting system by working in a self-contained directory, even for log files.
Python's packaging system was intended to make the second philosophy—multiple dependencies for each install—as developer-, admin-, packager-, and user-friendly as possible. Unfortunately it had (and has) a variety of flaws which caused or allowed all kinds of problems: unintuitive version schemes, mishandled data files, difficulty re-packaging, and more. Three years ago, a group of other Pythoneers and I decided to reinvent it to address these problems. We call ourselves the Fellowship of the Packaging, and this chapter describes the problems we have been trying to fix, and what our solution looks like.
Terminology
In Python a package is a directory containing Python files. Python files are called modules. That definition makes the usage of the word "package" a bit vague since it is also used by many systems to refer to a release of a project.
Python developers themselves are sometimes vague about this. One way to remove this ambiguity is to use the term "Python packages" when we talk about a directory containing Python modules. The term "release" is used to define one version of a project, and the term "distribution" defines a source or a binary distribution of a release as something like a tarball or zip file.
Most Python programmers want their programs to be usable in any environment. They also usually want to use a mix of standard Python libraries and system-dependent libraries. But unless you package your application separately for every existing packaging system, you are doomed to provide Python-specific releases—a Python-specific release is a release intended to be installed within a Python installation, no matter what the underlying operating system is—and hope that they can be installed, and repackaged, everywhere they are needed.
Sometimes, this is simply impossible. For example, Plone (a full-fledged Python-powered CMS) uses hundreds of small pure Python libraries that are not always available as packages in every packaging system out there. This means that Plone must ship everything that it needs in a portable application. To do this, it uses zc.buildout, which collects all its dependencies and creates a portable application that will run on any system within a single directory. It is effectively a binary release, since any piece of C code will be compiled in place.
This is a big win for developers: they just have to describe their dependencies using the Python standards described below and use zc.buildout to release their application. But as discussed earlier, this type of release sets up a fortress within the system, which most Linux sysadmins will hate. Windows admins won't mind, but those managing CentOS or Debian will, because those systems base their management on the assumption that every file in the system is registered, classified, and known to admin tools.
Those admins will want to repackage your application according to their own standards. The question we need to answer is, "Can Python have a packaging system that can be automatically translated into other packaging systems?" If so, one application or library can be installed on any system without requiring extra packaging work. Here, "automatically" doesn't necessarily mean that the work should be fully done by a script: RPM or dpkg packagers will tell you that's impossible—they always need to add some specifics to the projects they repackage. They'll also tell you that they often have a hard time re-packaging a piece of code because its developers were not aware of a few basic packaging rules.
Here's one example of what you can do to annoy packagers using the existing Python packaging system: release a library called "MathUtils" with the version name "Fumanchu". The brilliant mathematician who wrote the library found it amusing to use his cats' names for his project versions. But how can a packager know that "Fumanchu" is his second cat's name, and that the first one was called "Phil", so that the "Fumanchu" version comes after the "Phil" one?
This may sound extreme, but it can happen with today's tools and standards. The worst thing is that tools like easy_install or pip use their own non-standard registry to keep track of installed files, and will sort the "Fumanchu" and "Phil" versions alphanumerically.
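To see how badly alphanumeric sorting behaves here, a plain string comparison (roughly what such registries fall back to) is enough:

>>> 'Fumanchu' < 'Phil'   # 'Fumanchu' wrongly sorts as the older release
True
>>> '1.10' < '1.9'        # even numeric-looking versions misorder as strings
True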
Another problem is how to handle data files. For example, what if your application uses an SQLite database? If you put it inside your package directory, your application might fail because the system forbids you to write in that part of the tree. Doing this will also compromise the assumptions Linux systems make about where application data is for backups (/var).
In the real world, system administrators need to be able to place your files where they want without breaking your application, and you need to tell them what those files are. So let's rephrase the question: is it possible to have a packaging system in Python that can provide all the information needed to repackage an application with any third-party packaging system out there without having to read the code, and make everyone happy?
The Distutils package that comes with the Python standard library is riddled with the problems described above. Since it's the standard, people either live with it and its flaws, or use more advanced tools like Setuptools, which adds features on top of it, or Distribute, a fork of Setuptools. There's also Pip, a more advanced installer, that relies on Setuptools.
However, these newer tools are all based on Distutils and inherit its problems. Attempts were made to fix Distutils in place, but the code is so deeply used by other tools that any change to it, even to its internals, is a potential regression in the whole Python packaging ecosystem.
We therefore decided to freeze Distutils and start the development of Distutils2 from the same code base, without worrying too much about backward compatibility. To understand what changed and why, let's have a closer look at Distutils.
Distutils contains commands, each of which is a class with a run method that can be called with some options. Distutils also provides a Distribution class that contains global values every command can look at.
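For instance, here is a minimal sketch of what a custom command looks like; the hello command and its output are made up for illustration, but the class layout is the one Distutils expects:

from distutils.core import Command

class hello(Command):
    description = "print a greeting using the distribution metadata"
    user_options = []  # this sketch takes no command-line options

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        # every command can read the global Distribution instance
        print 'Hello from %s' % self.distribution.get_name()

Passing cmdclass={'hello': hello} to the setup function described below would make it available as python setup.py hello.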
To use Distutils, a developer adds a single Python module to a project, conventionally called setup.py. This module contains a call to Distutils' main entry point: the setup function. This function can take many options, which are held by a Distribution instance and used by commands. Here's an example that defines a few standard options like the name and version of the project, and a list of modules it contains:
from distutils.core import setup

setup(name='MyProject', version='1.0', py_modules=['mycode'])
This module can then be used to run Distutils commands like sdist, which creates a source distribution in an archive and places it in a dist directory:
$ python setup.py sdist
Using the same script, you can install the project using the install command:
$ python setup.py install
Distutils provides other commands such as:

- upload, to upload a distribution to an online repository;
- register, to register the metadata of a project in an online repository without necessarily uploading a distribution;
- bdist, to create a binary distribution; and
- bdist_msi, to create a .msi file for Windows.

It will also let you get information about the project via other command-line options.
So installing a project or getting information about it is always done by invoking Distutils through this file. For example, to find out the name of the project:
$ python setup.py --name
MyProject
setup.py is therefore how everyone interacts with the project, whether to build, package, publish, or install it. The developer describes the content of his project through options passed to a function, and uses that file for all his packaging tasks. The file is also used by installers to install the project on a target system.
Figure 14.1: Setup
Having a single Python module used for packaging, releasing, and installing a project is one of Distutils' main flaws. For example, if you want to get the name of the lxml project, setup.py will do a lot of things besides returning a simple string as expected:
$ python setup.py --name
Building lxml version 2.2.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxslt 1.1.26
Building against libxml2/libxslt in the following directory: /usr/lib/lxml
It might even fail to work on some projects, since developers make the assumption that setup.py is used only to install, and that other Distutils features are only used by them during development. The multiple roles of the setup.py script can easily cause confusion.
When Distutils builds a distribution, it creates a Metadata file that follows the standard described in PEP 314. It contains a static version of all the usual metadata, like the name of the project or the version of the release. The main metadata fields are:
- Name: The name of the project.
- Version: The version of the release.
- Summary: A one-line description.
- Description: A detailed description.
- Home-Page: The URL of the project.
- Author: The author name.
- Classifiers: Classifiers for the project. Python provides a list of classifiers for the license, the maturity of the release (beta, alpha, final), etc.
- Requires, Provides, and Obsoletes: Used to define dependencies with modules.

These fields are for the most part easy to map to equivalents in other packaging systems.
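To make this concrete, here is a sketch of what such a file (written as PKG-INFO inside a source distribution) might contain for the MPTools project used as an example later in this chapter; the values are illustrative:

Metadata-Version: 1.0
Name: MPTools
Version: 0.1
Summary: Set of tools to build Mozilla Services apps
Home-page: http://bitbucket.org/tarek/mopytools
Author: Tarek Ziade
Author-email: tarek@mozilla.com
License: UNKNOWN
Description: UNKNOWN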
The Python Package Index (PyPI), a central repository of packages like CPAN, is able to register projects and publish releases via Distutils' register and upload commands. register builds the Metadata file and sends it to PyPI, allowing people and tools—like installers—to browse them via web pages or via web services.
Figure 14.2: The PyPI Repository
You can browse projects by Classifiers, and get the author name and project URL. Meanwhile, Requires can be used to define dependencies on Python modules. The requires option can be used to add a Requires metadata element to the project:
from distutils.core import setup

setup(name='foo', version='1.0', requires=['ldap'])
Defining a dependency on the ldap module is purely declarative: no tools or installers ensure that such a module exists. This would be satisfactory if Python defined requirements at the module level through a require keyword like Perl does. Then it would just be a matter of the installers browsing the dependencies at PyPI and installing them; that's basically what CPAN does. But that's not possible in Python, since a module named ldap can exist in any Python project. Since Distutils allows people to release projects that can contain several packages and modules, this metadata field is not useful at all.
Another flaw of Metadata files is that they are created by a Python script, so they are specific to the platform they are executed on. For example, a project that provides features specific to Windows could define its setup.py as:
from distutils.core import setup

setup(name='foo', version='1.0', requires=['win32com'])
But this assumes that the project only works under Windows, even if it provides portable features. One way to solve this is to make the requires option specific to Windows:
from distutils.core import setup
import sys

if sys.platform == 'win32':
    setup(name='foo', version='1.0', requires=['win32com'])
else:
    setup(name='foo', version='1.0')
This actually makes the issue worse. Remember, the script is used to build source archives that are then released to the world via PyPI. This means that the static Metadata file sent to PyPI is dependent on the platform that was used to compile it. In other words, there is no way to indicate statically in the metadata field that it is platform-specific.
Figure 14.3: PyPI Workflow
As indicated earlier, PyPI is a central index of Python projects where people can browse existing projects by category or register their own work. Source or binary distributions can be uploaded and added to an existing project, and then downloaded for installation or study. PyPI also offers web services that can be used by tools like installers.
Registering a project to PyPI is done with the Distutils register command. It builds a POST request containing the metadata of the project, whatever its version is. The request requires an Authorization header, as PyPI uses Basic Authentication to make sure every registered project is associated with a user that has first registered with PyPI. Credentials are kept in the local Distutils configuration or typed in at the prompt every time a register command is invoked. An example of its use is:
$ python setup.py register
running register
Registering MPTools to http://pypi.python.org/pypi
Server response (200): OK
Each registered project gets a web page with an HTML version of the metadata, and packagers can upload distributions to PyPI using upload:
$ python setup.py sdist upload
running sdist
…
running upload
Submitting dist/mopytools-0.1.tar.gz to http://pypi.python.org/pypi
Server response (200): OK
It's also possible to point users to another location via the Download-URL metadata field rather than uploading files directly to PyPI.
Besides the HTML pages PyPI publishes for web users, it provides two services that tools can use to browse the content: the Simple Index protocol and the XML-RPC APIs.
The Simple Index protocol starts at http://pypi.python.org/simple/, a plain HTML page that contains relative links to every registered project:
<html><head><title>Simple Index</title></head><body>
⋮
<a href='MontyLingua/'>MontyLingua</a><br/>
<a href='mootiro_web/'>mootiro_web</a><br/>
<a href='Mopidy/'>Mopidy</a><br/>
<a href='mopowg/'>mopowg</a><br/>
<a href='MOPPY/'>MOPPY</a><br/>
<a href='MPTools/'>MPTools</a><br/>
<a href='morbid/'>morbid</a><br/>
<a href='Morelia/'>Morelia</a><br/>
<a href='morse/'>morse</a><br/>
⋮
</body></html>
For example, the MPTools project has a MPTools/ link, which means that the project exists in the index. The page it points at contains a list of all the links related to the project:

- links to every distribution stored at PyPI, for each version of the project registered;
- links defined in the Metadata, such as the project's home page, for each version as well.

The page for MPTools contains:
<html><head><title>Links for MPTools</title></head>
<body><h1>Links for MPTools</h1>
<a href="../../packages/source/M/MPTools/MPTools-0.1.tar.gz">MPTools-0.1.tar.gz</a><br/>
<a href="http://bitbucket.org/tarek/mopytools" rel="homepage">0.1 home_page</a><br/>
</body></html>
Tools like installers that want to find distributions of a project can look for it in the index page, or simply check if http://pypi.python.org/simple/PROJECT_NAME/ exists.
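For instance, here is a rough sketch of that existence check; project_exists is a made-up helper, and a real installer would go on to parse the links in the returned page:

import urllib2

def project_exists(name):
    # A 200 response on the project's Simple Index page means the
    # project is registered; PyPI answers with a 404 otherwise.
    try:
        urllib2.urlopen('http://pypi.python.org/simple/%s/' % name)
        return True
    except urllib2.HTTPError:
        return False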
This protocol has two main limitations. First, PyPI is a single server right now, and while people usually have local copies of its content, we have experienced several downtimes in the past two years that paralyzed developers whose installers constantly browse PyPI to fetch all the dependencies a project requires when it is built. For instance, building a Plone application will generate several hundred queries at PyPI to get all the required bits, so PyPI may act as a single point of failure.
Second, when the distributions are not stored at PyPI and a Download-URL link is provided in the Simple Index page, installers have to follow that link and hope that the location will be up and will really contain the release. This indirection weakens any Simple Index-based process.
The Simple Index protocol's goal is to give installers a list of links they can use to install a project. The project metadata is not published there; instead, there are XML-RPC methods to get extra information about registered projects:
>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('MPTools')
['0.1']
>>> pprint.pprint(client.release_urls('MPTools', '0.1'))
[{'comment_text': '',
  'downloads': 28,
  'filename': 'MPTools-0.1.tar.gz',
  'has_sig': False,
  'md5_digest': '6b06752d62c4bffe1fb65cd5c9b7111a',
  'packagetype': 'sdist',
  'python_version': 'source',
  'size': 3684,
  'upload_time': <DateTime '20110204T09:37:12' at f4da28>,
  'url': 'http://pypi.python.org/packages/source/M/MPTools/MPTools-0.1.tar.gz'}]
>>> pprint.pprint(client.release_data('MPTools', '0.1'))
{'author': 'Tarek Ziade',
 'author_email': 'tarek@mozilla.com',
 'classifiers': [],
 'description': 'UNKNOWN',
 'download_url': 'UNKNOWN',
 'home_page': 'http://bitbucket.org/tarek/mopytools',
 'keywords': None,
 'license': 'UNKNOWN',
 'maintainer': None,
 'maintainer_email': None,
 'name': 'MPTools',
 'package_url': 'http://pypi.python.org/pypi/MPTools',
 'platform': 'UNKNOWN',
 'release_url': 'http://pypi.python.org/pypi/MPTools/0.1',
 'requires_python': None,
 'stable_version': None,
 'summary': 'Set of tools to build Mozilla Services apps',
 'version': '0.1'}
The issue with this approach is that some of the data that the XML-RPC APIs are publishing could have been stored as static files and published in the Simple Index page to simplify the work of client tools. That would also avoid the extra work PyPI has to do to handle those queries. It's fine to have non-static data like the number of downloads per distribution published in a specialized web service, but it does not make sense to have to use two different services to get all static data about a project.
If you install a Python project using python setup.py install, Distutils—which is included in the standard library—will copy the files onto your system:

- The Python packages and modules land in the Python installation directory: under Ubuntu in /usr/local/lib/python2.6/dist-packages/ and under Fedora in /usr/local/lib/python2.6/site-packages/.
- The scripts defined in the project land in a bin directory on the system. Depending on the platform, this could be /usr/local/bin or a bin directory specific to the Python installation.

Ever since Python 2.5, the metadata file is copied alongside the modules and packages as project-version.egg-info. For example, the virtualenv project could have a virtualenv-1.4.9.egg-info file. These metadata files can be considered a database of installed projects, since it's possible to iterate over them and build a list of projects with their versions.
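As a rough sketch (this is not an official API), you could walk a site-packages directory and parse those egg-info names yourself:

import os
from distutils.sysconfig import get_python_lib

# Treat the *.egg-info entries left in site-packages as a crude
# database of installed projects and their versions.
sitedir = get_python_lib()
for entry in os.listdir(sitedir):
    if entry.endswith('.egg-info'):
        name, _, version = entry[:-len('.egg-info')].partition('-')
        print '%s %s' % (name, version)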
However, the Distutils installer does not record the list of files it installs on the system. In other words, there is no way to remove all files that were copied in the system. This is a shame since the install command has a --record option that can be used to record all installed files in a text file. However, this option is not used by default and Distutils' documentation barely mentions it.
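For the record, using it is a one-liner; the output file name is arbitrary:

$ python setup.py install --record installed_files.txt

installed_files.txt will then contain one line per installed file, which is exactly what an uninstaller would need.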
As mentioned in the introduction, some projects have tried to fix some of the problems with Distutils, with varying degrees of success. PyPI allowed developers to publish Python projects that could include several modules organized into Python packages. But at the same time, projects could define module-level dependencies via Requires.
Both ideas are reasonable, but their combination is not.
The right thing to do was to have project-level dependencies, which is exactly what Setuptools added as a feature on top of Distutils. It also provided a script called easy_install to automatically fetch and install dependencies by looking for them on PyPI. In practice, module-level dependency was never really used, and people jumped on Setuptools' extensions. But since these features were added in options specific to Setuptools, and ignored by Distutils or PyPI, Setuptools effectively created its own standard and became a hack on top of a bad design.
easy_install therefore needs to download the archive of the project and run its setup.py script again to get the metadata it needs, and it has to do this again for every dependency. The dependency graph is built bit by bit after each download.
Even if the new metadata was accepted by PyPI and browsable online, easy_install would still need to download all archives because, as said earlier, metadata published at PyPI is specific to the platform that was used to upload it, which can differ from the target platform. But this ability to install a project and its dependencies was good enough in 90% of the cases and was a great feature to have. So Setuptools became widely used, although it still suffers from other problems:
Setuptools did not provide an uninstaller, even though its custom metadata could have contained a file listing the installed files. Pip, on the other hand, extended Setuptools' metadata to record installed files, and is therefore able to uninstall. But that's yet another custom set of metadata, which means that a single Python installation may contain up to four different flavours of metadata for each installed project:

- Distutils' egg-info, which is a single metadata file;
- Setuptools' egg-info, which is a directory containing the metadata and extra Setuptools-specific options;
- Pip's egg-info, which is an extended version of the previous.

In Distutils, data files can be installed anywhere on the system. If you define some package data files in the setup.py script like this:
setup(…,
      packages=['mypkg'],
      package_dir={'mypkg': 'src/mypkg'},
      package_data={'mypkg': ['data/*.dat']},
)
then all files with the .dat extension in the mypkg project will be included in the distribution and eventually installed along with the Python modules in the Python installation.
For data files that need to be installed outside the Python distribution, there's another option that stores files in the archive but puts them in defined locations:
setup(…,
      data_files=[('bitmaps', ['bm/b1.gif', 'bm/b2.gif']),
                  ('config', ['cfg/data.cfg']),
                  ('/etc/init.d', ['init-script'])]
)
This is terrible news for OS packagers, for several reasons:

- Data files are not part of the metadata, so packagers have to read setup.py and sometimes dive into the project's code.
- There is no way to mark what each file is for: documentation, man pages, and everything else are all treated the same way.

A packager who needs to repackage a project with such a file has no choice but to patch the setup.py file so that it works as expected for her platform. To do that, she must review the code and change every line that uses those files, since the developer made an assumption about their location. Setuptools and Pip did not improve this.
So we ended up with a mixed-up and confusing packaging environment, where everything is driven by a single Python module, with incomplete metadata and no way to describe everything a project contains. Here's what we're doing to make things better.
The first step is to fix our Metadata standard. PEP 345 defines a new version that includes:

- a saner way to define versions,
- project-level dependencies, and
- a static way to define platform-specific values.
One goal of the metadata standard is to make sure that all tools that operate on Python projects are able to classify them the same way. For versions, it means that every tool should be able to know that "1.1" comes after "1.0". But if projects have custom versioning schemes, this becomes much harder.
The only way to ensure consistent versioning is to publish a standard that projects will have to follow. The scheme we chose is a classical sequence-based scheme. As defined in PEP 386, its format is:
N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]
where:

- N is an integer; you can use as many N segments as you want, separated by dots, as long as there are at least two (MAJOR.MINOR);
- a, b, c and rc mark alpha, beta and release candidate versions;
- dev followed by a number is a development marker; and
- post followed by a number is a post-release marker.

Depending on the project release process, dev or post markers can be used for all intermediate versions between two final releases. Most processes use dev markers.
Following this scheme, PEP 386 defines a strict ordering: alphas come before betas, betas before release candidates, and release candidates before the final version; a dev-marked version orders just before the version it leads up to, and a post-marked version just after it.
Here's a full ordering example:
1.0a1 < 1.0a2.dev456 < 1.0a2 < 1.0a2.1.dev456 < 1.0a2.1 < 1.0b1.dev456 < 1.0b2 < 1.0b2.post345 < 1.0c1.dev456 < 1.0c1 < 1.0.dev456 < 1.0 < 1.0.post456.dev34 < 1.0.post456
The goal of this scheme is to make it easy for other packaging systems to translate Python projects' versions into their own schemes. PyPI now rejects any projects that upload PEP 345 metadata with version numbers that don't follow PEP 386.
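Distutils2 ships a reference implementation of this scheme in its version module (described later in this chapter); assuming its NormalizedVersion class, checking the ordering interactively looks like this:

>>> from distutils2.version import NormalizedVersion as V
>>> V('1.0a1') < V('1.0a2.dev456') < V('1.0a2')
True
>>> V('1.0.dev456') < V('1.0') < V('1.0.post456')
True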
PEP 345 defines three new fields that replace the PEP 314 fields Requires, Provides, and Obsoletes. Those fields are Requires-Dist, Provides-Dist, and Obsoletes-Dist, and can be used multiple times in the metadata.
For Requires-Dist, each entry contains a string naming some other Distutils project required by this distribution. The format of a requirement string is identical to that of a Distutils project name (e.g., as found in the Name field), optionally followed by a version declaration within parentheses. These Distutils project names should correspond to names as found at PyPI, and version declarations must follow the rules described in PEP 386. Some examples are:
Requires-Dist: pkginfo
Requires-Dist: PasteDeploy
Requires-Dist: zope.interface (>3.5.0)
Provides-Dist is used to define extra names contained in the project. It's useful when a project wants to merge with another project. For example, the ZODB project can include the transaction project and state:
Provides-Dist: transaction
Obsoletes-Dist is useful to mark another project as an obsolete version:
Obsoletes-Dist: OldName
An environment marker is a condition that can be added at the end of a field, after a semicolon, to restrict it to a particular execution environment. Some examples are:
Requires-Dist: pywin32 (>1.0); sys.platform == 'win32'
Obsoletes-Dist: pywin31; sys.platform == 'win32'
Requires-Dist: foo (1,!=1.3); platform.machine == 'i386'
Requires-Dist: bar; python_version == '2.4' or python_version == '2.5'
Requires-External: libxslt; 'linux' in sys.platform
The micro-language for environment markers is deliberately kept simple enough for non-Python programmers to understand: it compares strings with the == and in operators (and their opposites), and allows the usual Boolean combinations. The fields in PEP 345 that can use this marker are:

- Requires-Python
- Requires-External
- Requires-Dist
- Provides-Dist
- Obsoletes-Dist
- Classifier
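To give a feel for how small the micro-language is, here is a naive sketch of an evaluator; evaluate_marker is a made-up helper, and a real implementation parses the expression rather than substituting strings and calling eval():

import platform
import sys

def evaluate_marker(marker):
    # Variables the micro-language knows about, with their values in
    # the current environment.
    env = {
        'sys.platform': sys.platform,
        'platform.machine': platform.machine(),
        'python_version': '%s.%s' % sys.version_info[:2],
    }
    # Substitute each variable with a quoted literal, then evaluate
    # the resulting Boolean expression with no builtins available.
    expr = marker
    for name, value in env.items():
        expr = expr.replace(name, repr(value))
    return eval(expr, {'__builtins__': {}}, {})

With this sketch, evaluate_marker("python_version == '2.4' or python_version == '2.5'") returns False on a Python 2.6 interpreter.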
Having a single installation format shared among all Python tools is mandatory for interoperability. If we want Installer A to detect that Installer B has previously installed project Foo, they both need to share and update the same database of installed projects.
Of course, users should ideally use a single installer on their system, but they may want to switch to a newer installer that has specific features. For instance, Mac OS X ships Setuptools, so users automatically have the easy_install script. If they want to switch to a newer tool, they will need it to be backward compatible with the previous one.
Another problem when using a Python installer on a platform that has a packaging system like RPM is that there is no way to inform the system that a project is being installed. What's worse, even if the Python installer could somehow ping the central packaging system, we would need to have a mapping between the Python metadata and the system metadata. The name of the project, for instance, may be different for each. That can occur for several reasons. The most common one is a name conflict: another project outside the Python land already uses the same name for its RPM. Another cause is that the name used includes a python prefix that breaks the conventions of the platform. For example, if you name your project foo-python, chances are high that the Fedora RPM will be called python-foo.
One way to avoid this problem is to leave the global Python installation alone, managed by the central packaging system, and work in an isolated environment. Tools like Virtualenv allow this.
In any case, we do need to have a single installation format in Python, because interoperability is also a concern for other packaging systems when they themselves install Python projects. Once a third-party packaging system has registered a newly installed project in its own database on the system, it needs to generate the right metadata for the Python installation itself, so projects appear to be installed to Python installers or any APIs that query the Python installation.
The metadata mapping issue can be addressed in that case: since an RPM knows which Python projects it wraps, it can generate the proper Python-level metadata. For instance, it knows that python26-webob is called WebOb in the PyPI ecosystem.
Back to our standard: PEP 376 defines a standard for installed packages whose format is quite similar to those used by Setuptools and Pip. This structure is a directory with a dist-info extension that contains:

- METADATA: the metadata, as described in PEP 345, PEP 314 and PEP 241.
- RECORD: the list of installed files in a csv-like format.
- INSTALLER: the name of the tool used to install the project.
- REQUESTED: the presence of this file indicates that the project installation was explicitly requested (i.e., not installed as a dependency).
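As a rough sketch of what tools can then do with it, here is how an installer might list every installed project and count the files it owns by reading each RECORD file (the layout is assumed to be as described above):

import csv
import os
from distutils.sysconfig import get_python_lib

# Walk site-packages looking for PEP 376 dist-info directories, and
# read each RECORD file to see which files the project installed.
sitedir = get_python_lib()
for entry in os.listdir(sitedir):
    if entry.endswith('.dist-info'):
        record = os.path.join(sitedir, entry, 'RECORD')
        with open(record) as f:
            files = [row[0] for row in csv.reader(f) if row]
        print '%s: %d files' % (entry, len(files))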
Once all tools out there understand this format, we'll be able to manage projects in Python without depending on a particular installer and its features. Also, since PEP 376 defines the metadata as a directory, it will be easy to add new files to extend it. As a matter of fact, a new metadata file called RESOURCES, described in the next section, might be added in the near future without modifying PEP 376. Eventually, if this new file turns out to be useful for all tools, it will be added to the PEP.
As described earlier, we need to let the packager decide where to put data files during installation without breaking the developer's code. At the same time, the developer must be able to work with data files without having to worry about their location. Our solution is the usual one: indirection.
Suppose your MPTools application needs to work with a configuration file. The developer will put that file in a Python package and use __file__ to reach it:
import os

here = os.path.dirname(__file__)
cfg = open(os.path.join(here, 'config', 'mopy.cfg'))
This implies that configuration files are installed like code, and that the developer must place them alongside her code: in this example, in a subdirectory called config.
The new architecture of data files we have designed uses the project tree as the root of all files, and allows access to any file in the tree, whether it is located in a Python package or a simple directory. This allows developers to create a dedicated directory for data files and access them using pkgutil.open:
import pkgutil

# Open the file located in config/mopy.cfg in the MPTools project
cfg = pkgutil.open('MPTools', 'config/mopy.cfg')
pkgutil.open looks for the project metadata and checks whether it contains a RESOURCES file. This is a simple map of files to locations that the system may contain:
config/mopy.cfg {confdir}/{distribution.name}
Here the {confdir} variable points to the system's configuration directory, and {distribution.name} contains the name of the Python project as found in the metadata.
Figure 14.4: Finding a File
As long as this RESOURCES metadata file is created at installation time, the API will find the location of mopy.cfg for the developer. And since config/mopy.cfg is the path relative to the project tree, it means that we can also offer a development mode where the metadata for the project is generated in-place and added to the lookup paths for pkgutil.
In practice, a project can define where data files should land by defining a mapper in its setup.cfg file. A mapper is a list of (glob-style pattern, target) tuples. Each pattern points to one or several files in the project tree, while the target is an installation path that may contain variables in brackets. For example, MPTools's setup.cfg could look like this:
[files]
resources =
    config/mopy.cfg {confdir}/{application.name}/
    images/*.jpg {datadir}/{application.name}/
The sysconfig module will provide and document a specific list of variables that can be used, and default values for each platform. For example, {confdir} is /etc on Linux. Installers can therefore use this mapper in conjunction with sysconfig at installation time to know where the files should be placed. Eventually, they will generate the RESOURCES file mentioned earlier in the installed metadata, so pkgutil can find the files again.
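Here is a naive sketch of that installer-side step; the variable values and the expand helper are illustrative, since the real values come from sysconfig for the current platform:

import glob
import os

# Illustrative platform values; a real installer reads these from sysconfig.
VARS = {'confdir': '/etc',
        'datadir': '/usr/share',
        'application.name': 'MPTools'}

def expand(target):
    # Replace each {variable} in an installation path with its value.
    for name, value in VARS.items():
        target = target.replace('{%s}' % name, value)
    return target

# The mapper from setup.cfg: (glob-style pattern, target) pairs.
mapper = [('config/mopy.cfg', '{confdir}/{application.name}/'),
          ('images/*.jpg', '{datadir}/{application.name}/')]

for pattern, target in mapper:
    for path in glob.glob(pattern):
        print '%s -> %s' % (path, os.path.join(expand(target), os.path.basename(path)))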
Figure 14.5: Installer
I said earlier that PyPI is effectively a single point of failure. PEP 381 addresses this problem by defining a mirroring protocol so that users can fall back to alternative servers when PyPI is down. The goal is to allow members of the community to run mirrors around the world.
Figure 14.6: Mirroring
The mirror list is provided as a list of host names of the form X.pypi.python.org, where X is in the sequence a,b,c,…,aa,ab,…. a.pypi.python.org is the master server, and mirrors start with b. A CNAME record last.pypi.python.org points to the last host name, so clients that are using PyPI can get the list of the mirrors by looking at the CNAME.
For example, this call tells us that the last mirror is h.pypi.python.org, meaning that PyPI currently has seven mirrors (b through h):
>>> import socket
>>> socket.gethostbyname_ex('last.pypi.python.org')[0]
'h.pypi.python.org'
Potentially, this protocol allows clients to redirect requests to the nearest mirror by localizing mirrors by their IPs, and also to fall back to the next mirror if a mirror or the master server is down. The mirroring protocol itself is more complex than a simple rsync, because we wanted to keep download statistics accurate and provide minimal security.
Mirrors must reduce the amount of data transferred between the central server and the mirror. To achieve that, they must use the changelog PyPI XML-RPC call, and only refetch the packages that have changed since the last time. For each package P, they must copy the documents /simple/P/ and /serversig/P. If a package is deleted on the central server, they must delete the package and all associated files. To detect modification of package files, they may cache the file's ETag, and may request skipping it using the If-None-Match header. Once the synchronization is over, the mirror changes its /last-modified to the current date.
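A minimal sketch of that incremental step, using the changelog XML-RPC call (the timestamp is illustrative):

import xmlrpclib

# Ask PyPI which packages changed since the last synchronization, so
# the mirror only refetches what it needs.
client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
last_sync = 1296800000  # Unix timestamp of the previous run (illustrative)
for name, version, timestamp, action in client.changelog(last_sync):
    print '%s %s: %s' % (name, version, action)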
When you download a release from any of the mirrors, the protocol ensures that the download hit is transmitted to the master PyPI server, then to other mirrors. Doing this ensures that people or tools browsing PyPI to find out how many times a release was downloaded will get a value summed across all mirrors.
Statistics are grouped into daily and weekly CSV files in the stats directory at the central PyPI itself. Each mirror needs to provide a local-stats directory that contains its own statistics. Each file provides the number of downloads for each archive, grouped by user agents. The central server visits mirrors daily to collect those statistics, and merges them back into the global stats directory, so each mirror must keep /local-stats up-to-date at least once a day.
With any distributed mirroring system, clients may want to verify that the mirrored copies are authentic. Some of the possible threats include:

- the central index could be compromised,
- the mirrors could be tampered with, and
- a man-in-the-middle attack could occur between the central index and a mirror, or between a mirror and an end user.
To detect the first attack, package authors need to sign their packages using PGP keys, so that users can verify that the package comes from the author they trust. The mirroring protocol itself only addresses the second threat, though some attempt is made to detect man-in-the-middle attacks.
The central index provides a DSA key at the URL /serverkey, in the PEM format as generated by openssl dsa -pubout. This URL must not be mirrored, and clients must fetch the official serverkey from PyPI directly, or use the copy that came with the PyPI client software. Mirrors should still download the key so that they can detect a key rollover.
For each package, a mirrored signature is provided at /serversig/package. This is the DSA signature of the parallel URL /simple/package, in DER form, using SHA-1 with DSA.
Clients using a mirror need to perform the following steps to verify a package:

1. Download the corresponding /simple page, and compute its SHA-1 hash.
2. Compute the DSA signature of that hash.
3. Download the corresponding /serversig, and compare it byte for byte with the value computed in step 2.
4. Compute and verify (against the /simple page) the MD5 hashes of all files they download from the mirror.

Verification is not needed when downloading from the central index, and clients should not do it, to reduce the computation overhead.
About once a year, the key will be replaced with a new one. Mirrors will have to re-fetch all /serversig pages. Clients using mirrors need to find a trusted copy of the new server key. One way to obtain one is to download it from https://pypi.python.org/serverkey. To detect man-in-the-middle attacks, clients need to verify the SSL server certificate, which will be signed by the CACert authority.
The implementation of most of the improvements described in the previous sections is taking place in Distutils2. The setup.py file is not used anymore, and a project is completely described in setup.cfg, a static .ini-like file. By doing this, we make it easier for packagers to change the behavior of a project installation without having to deal with Python code. Here's an example of such a file:
[metadata]
name = MPTools
version = 0.1
author = Tarek Ziade
author-email = tarek@mozilla.com
summary = Set of tools to build Mozilla Services apps
description-file = README
home-page = http://bitbucket.org/tarek/pypi2rpm
project-url: Repository, http://hg.mozilla.org/services/server-devtools
classifier = Development Status :: 3 - Alpha
    License :: OSI Approved :: Mozilla Public License 1.1 (MPL 1.1)
[files]
packages = mopytools
           mopytools.tests
extra_files = setup.py
              README
              build.py
              _build.py
resources = etc/mopytools.cfg {confdir}/mopytools
Distutils2 uses this configuration file to:

- generate META-1.2 metadata files that can be used for various actions, like registering at PyPI;
- run any package management command, like sdist;
- install a Distutils2-based project.

Distutils2 also implements VERSION via its version module.
The INSTALL-DB implementation will find its way into the standard library in Python 3.3 and will be in the pkgutil module. In the interim, a version of this module exists in Distutils2 for immediate use. The provided APIs will let us browse an installation and know exactly what's installed.
These APIs are the basis for some neat Distutils2 features.
Changing an architecture as wide and complex as Python packaging needs to be done carefully, by changing standards through a PEP process. And changing or adding a PEP takes, in my experience, around a year.
One mistake the community made along the way was to deliver tools that solved some issues by extending the metadata and the way Python applications were installed, without trying to change the impacted PEPs.
In other words, depending on the tool you used, the standard library Distutils or Setuptools, applications were installed differently. The problems were solved for one part of the community that used these new tools, but added more problems for the rest of the world.
OS packagers, for instance, had to face two Python standards: the official documented standard and the de facto standard imposed by Setuptools.
But in the meantime, Setuptools had the opportunity to experiment at a realistic scale (the whole community) with some innovations, at a very fast pace, and the feedback was invaluable. We were able to write new PEPs with more confidence about what worked and what did not, and maybe it would have been impossible to do so differently.
So it's all about detecting when third-party tools are contributing innovations that solve problems, and should therefore ignite a PEP change.
I am paraphrasing Guido van Rossum in the section title, but that's one aspect of the batteries-included philosophy of Python that strongly impacts our efforts.
Distutils is part of the standard library, and Distutils2 will soon be. A package that's in the standard library is very hard to evolve. There are of course deprecation processes, where you can kill or change an API after two minor versions of Python. But once an API is published, it's going to stay there for years.
So any change you make in a package in the standard library that is not a bug fix is a potential disturbance for the ecosystem. So when you're making important changes, you have to create a new package.
I learned this the hard way with Distutils, since I eventually had to revert all the changes I had made to it over more than a year, and create Distutils2. In the future, if our standards change again in a drastic way, chances are high that we will start a standalone Distutils3 project first, unless the standard library is released on its own at some point.
Changing the way packaging works in Python is a very long process: the Python ecosystem contains so many projects based on older packaging tools that there is and will be a lot of resistance to change. (Reaching consensus on some of the topics discussed in this chapter took several years, rather than the few months I originally expected.) As with Python 3, it will take years before all projects switch to the new standard.
That's why everything we are doing has to be backward-compatible with all previous tools, installations and standards, which makes the implementation of Distutils2 a wicked problem.
For example, if a project that uses the new standards depends on another project that doesn't use them yet, we can't stop the installation process by telling the end user that the dependency is in an unknown format!
The INSTALL-DB implementation, for instance, contains compatibility code to browse projects installed by the original Distutils, Pip, Distribute, or Setuptools. Distutils2 is also able to install projects created by the original Distutils, by converting their metadata on the fly.
Some sections in this chapter were taken directly from the various PEP documents we wrote for packaging. You can find the original documents at http://python.org:
http://python.org/peps/pep-0214.html
http://python.org/peps/pep-0314.html
http://python.org/peps/pep-0345.html
http://python.org/peps/pep-0376.html
http://python.org/peps/pep-0381.html
http://python.org/peps/pep-0386.html
I would like to thank all the people who are working on packaging; you will find their names in every PEP I've mentioned. I would also like to give special thanks to all members of The Fellowship of the Packaging. Also, thanks to Alexis Metaireau, Toshio Kuratomi, Holger Krekel and Stefane Fermigier for their feedback on this chapter.
The projects that were discussed in this chapter are:
- Distutils: http://docs.python.org/distutils
- Distutils2: http://packages.python.org/Distutils2
- Distribute: http://packages.python.org/distribute
- Setuptools: http://pypi.python.org/pypi/setuptools
- Pip: http://pypi.python.org/pypi/pip
- Virtualenv: http://pypi.python.org/pypi/virtualenv