Differences between revisions 3 and 13 (spanning 10 versions)
Revision 3 as of 2008-10-12 14:47:34
Size: 4528
Editor: 158
Comment: use ints instead of a float for the mirror reliability counter feature
Revision 13 as of 2009-02-15 11:22:29
Size: 7937
Editor: tarek
Comment:
Line 3: Line 3:
 * PEP: 374  * PEP: xx
Line 14: Line 14:
== Motivation == == Motivations ==
Line 18: Line 18:
and Buildout make intensive usage of PyPI.

For people making extensive use of PyPI, it can act as a single point
and zc.buildout make intensive usage of PyPI.

For people making intensive use of PyPI, it can act as a single point
Line 24: Line 24:
  The motivation of this PEP is to set up a registering
mechanism in PyPI in order to list all the public PyPI mirrors
and to provide an event based system where all mirrors get
informed via RPC when a package has been uploaded, modified,
or removed, so they can eventually sync themselves.

This PEP describes:

 * Mirror listing and registering
 * Ping mechanism

In order to make the system more reliable, this PEP describes:

- the mirror listing and registering at PyPI

- the pages a public mirror should maintain.
  these pages will be used by PyPI, in order to get
  hit counts and the last modified date.

- how a mirror should synchronize with PyPI

- how a client can implement a fail-over mechanism

- a contact form for Package maintainers
Line 38: Line 41:
A new HTML page will be added at http://pypi.python.org/mirrors A new text page will be added at http://pypi.python.org/mirrors
Line 46: Line 49:
<html>
  <head><title>PyPI mirrors</title></head>
  <body>
  
    <h1>PyPI mirrors</h1>
    
    <p>
    If you want to register a new mirror, send an email
    to the catalog-SIG@python.org with:
    </p>

    <ol>
        <li> The url of your mirror.</li>
        <li> The name and email of the maintainer.</li>
        <li> The url to ping when PyPI is updated.</li>
    </ol>

    <p>
    The registering is done manually and to become a
    mirror, you need to strictly follow the package index
    API defined here:
    </p>
    
    <p>
    http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api
    </p>
    
    <ul id="mirror-links">
        <li><a href="http://example.com/pypi">Mirror #1</a></li>
        <li><a href="http://example2.com/pypi">Mirror #2</a></li>
    </ul>

  </body>
</html>
# PyPI mirrors
#
# If you want to register a new mirror, send an email
# to the catalog-SIG@python.org with:
#
# - The urls of your mirror:
# - the root of the server
# - the index page
# - the last modified page
# - the local stats page
# - the global stats page
# - the mirrors page
#
# - The name and email of the maintainer.
#
# The registering is done manually and to become a
# mirror, you need to strictly follow the mirror protocol
# described here:
#
# http://wiki.python.org/PEP_374
#
# root,index,last-modified,local-stats,stats,mirrors
http://example.com/pypi,index,last-modified,local-stats,stats,mirrors
http://example2.com/pypi,index,last-modified,local-stats,stats,mirrors
Line 83: Line 76:
added in a dedicated SQL table in the PyPI application.

The mirror list page is a simple html page that can be browsed
added to the mirror list in the PyPI application after it
has been checked for compliance with the mirroring rules.

The mirror list page is a simple text page that can be browsed
Line 87: Line 81:

== Ping mechanism ==

Every time a package is uploaded, removed or modified at PyPI, the server
will ping all registered mirrors via XML-RPC to tell
them that something has changed.

The XML-RPC request will use the following pseudo-code:
Other package indexes that are not mirrors of PyPI are not added to the
mirror list in PyPI. They can, however, provide the
same mirror list mechanism themselves for their own mirrors.

== Special pages a mirror needs to provide ==

A mirror needs to provide four pages in addition to the index page:

 * last-modified
 * local-stats
 * stats
 * mirrors
 
=== Last modified date ===

CPAN uses a freshness date system where the date of each mirror's last
synchronisation is made available.

For PyPI, each mirror needs to maintain a URL serving simple text content
that gives the date of the mirror's last synchronisation.

The date is given in GMT, using the ISO 8601 format
(http://en.wikipedia.org/wiki/ISO_8601).

Each mirror is responsible for maintaining its last modified date.

By convention, this page should be reachable at "/last-modified".

=== Local statistics ===

Each mirror is responsible for counting all the downloads
made from it. PyPI uses these counts to sum up all
downloads and display the grand total.

This page is a CSV-like page, with a header on the first
line. It needs to follow PEP 305 (http://www.python.org/dev/peps/pep-0305/#id19).
In short, it should be readable by the Python csv module.

The fields in this file are:

 * package: the distutils id of the package.
 * filename: the filename that has been downloaded.
 * useragent: the User-Agent of the client that has downloaded the package.
 * count: the number of downloads.

The page will look like this:
Line 97: Line 129:

    >>> import socket
    >>> import xmlrpclib
    >>> MIRROR_TIMEOUT = 2
    >>> # apply the timeout to every socket opened below
    >>> socket.setdefaulttimeout(MIRROR_TIMEOUT)
    >>> for mirror_url in mirror_urls:
    ...     mirror = xmlrpclib.ServerProxy(mirror_url)
    ...     try:
    ...         mirror.removed_packages(removed_names)
    ...         mirror.modified_packages(modified_names)
    ...     except (socket.timeout, xmlrpclib.ProtocolError):
    ...         # a slow or broken mirror must not block PyPI
    ...         log_failure(mirror_url)
# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
...
Line 112: Line 134:
 * The removed_names and modified_names are the distutils names of the
 module distributions.

 * The MIRROR_TIMEOUT prevents PyPI from slowing down every time
 it calls the mirrors. The mirrors are responsible for doing the
 synchronisation job and must do it asynchronously from that
 call, which has to remain a ping for the sake of performance.
 The mirror decides what to do with the information it gets from
 PyPI.

 * Every time a mirror fails, the failure is logged in the SQL
 database, where we keep a "reliability ratio". It starts at 100
 and decreases after each failure. It is reset to 100
 every 100 calls. If the ratio drops below 75 during these 100
 calls, the mirror is declared unreliable, and a
 mail is sent to the distutils mailing list. The mirror will
 eventually be removed if the problem persists.

== Costs ==

Someone has to manage the list of mirrors. This work should
not take too much time. I am willing to be that maintainer if
the people that maintain the server don't have the time, or
don't trust me.

== Implementation ==
The counting starts the day the mirror is launched.

By convention, this page should be reachable at "/local-stats", but any URL relative to the mirror root can be given.

=== Statistics page ===

PyPI and each mirror are responsible for providing the grand total
page at "/stats". PyPI computes this page daily
by reading the local stats of all mirrors and summing them.

Mirrors should therefore not try to rebuild this stats page, but simply
fetch PyPI's copy during each synchronization.

It has the same structure as the local-stats page.

By convention, this page should be reachable at "/stats".

=== Mirrors listing page ===

Like /stats, each mirror should fetch the /mirrors
page from PyPI and serve it.

By convention, this page should be reachable at "/mirrors".

== How a mirror should synchronize with PyPI ==

A mirroring protocol called Simple Index was described
and implemented by Martin v. Loewis and Jim Fulton, based on
how easy_install works. This section summarizes it
and gives a few relevant links, plus a short part about
the User-Agent header.

=== The mirroring protocol ===

XXX changelog, pje link + to be defined

=== User-agent request header ===

So that PyPI can differentiate the actions taken by its
clients, all mirroring software should send a specific
user agent name.

This also applies to clients such as:

 * zc.buildout
 * setuptools
 * pyinstaller
 * etc.

XXX user agent registering mechanism at PyPI ?

== How a client can use PyPI and its mirrors ==
Line 141: Line 189:

== Third-party applications ==

A sample module will be written and published to demonstrate how a mirror
can implement the XML-RPC APIs in Python.
== Fail-over mechanism ==

Clients that are browsing PyPI should be able to use
a fail-over mechanism when PyPI is not responding.

This can be done by parsing the /mirrors page of PyPI
or the one located on any PyPI mirror.

It is up to the client to decide which mirror should
be used, depending on its geographical location and
its responsiveness.

This PEP does not describe how this fail-over
mechanism should work, but clients are strongly encouraged
to use the nearest mirror.

Clients that could currently use this mechanism:

 * setuptools
 * zc.buildout (through setuptools)
 * pyinstaller

== Extra package indexes ==

Some packages will obviously not be uploaded
to PyPI, either because they are private or because
the project maintainer runs their own server where people
get the project's packages. It is nevertheless strongly
encouraged that a public package index follow the PyPI
and distutils protocols. In other words, the "register"
and "upload" commands should be compatible with any
package index server out there.

Software compatible with PyPI and distutils:

 * PloneSoftwareCenter
 * EggBasket

== Merging several indexes ==

When a client needs to get some packages from several
distinct indexes, it should be able to use each one of them
as a potential source of packages. Different indexes
should be defined as a sorted list for the client to
look for a package.

Each independent index can of course provide a list of
its mirrors, if the /mirrors page is available.

That permits all combinations at client level, for a reliable
packaging system with all levels of privacy.

== Other PyPI enhancements ==

XXX

=== Contact form for Package maintainers ===

A form reachable from the package page will be added,
where a registered user can submit a message to the package
owner. This is to be used when someone wants to take over
the distutils id name, or when someone (like a packager
for example) would like to reach the package owner
for some questions.

XXX isn't the mail in the metadata enough?
XXX the original wish here was to force the package owner
to upload an sdist.

Mirroring infrastructure in PyPI

  • PEP: xx
  • Title: Mirroring infrastructure in PyPI
  • Author: Tarek Ziadé
  • Discussions-To: Catalog SIG
  • Status: Draft
  • Python-Version: 2.6

Abstract

This PEP describes a mirroring infrastructure for PyPI.

Motivations

PyPI hosts over 4000 projects and is used daily by people building applications. Tools such as easy_install and zc.buildout, in particular, make intensive use of PyPI.

For people making intensive use of PyPI, it can act as a single point of failure. People have started to set up mirrors, both private and public. Those mirrors are active mirrors, which means that they browse PyPI themselves to stay in sync.

In order to make the system more reliable, this PEP describes:

  • the mirror listing and registering at PyPI
  • the pages a public mirror should maintain; PyPI will use these pages to get hit counts and the last modified date
  • how a mirror should synchronize with PyPI
  • how a client can implement a fail-over mechanism
  • a contact form for package maintainers

Mirror listing and registering

A new text page will be added at http://pypi.python.org/mirrors that can be browsed like the simple index. This page lists the registered mirrors, one per line.

Each line gives the root URL of a mirror, followed by the relative paths of its index and special pages. The page will look like this:

# PyPI mirrors
#
# If you want to register a new mirror, send an email
# to catalog-SIG@python.org with:
#
# - The URLs of your mirror:
#   - the root of the server
#   - the index page
#   - the last modified page
#   - the local stats page
#   - the global stats page
#   - the mirrors page
#
# - The name and email of the maintainer.
#
# Registration is done manually. To become a mirror,
# you need to strictly follow the mirror protocol
# described here:
#
#    http://wiki.python.org/PEP_374
#
# root,index,last-modified,local-stats,stats,mirrors
http://example.com/pypi,index,last-modified,local-stats,stats,mirrors
http://example2.com/pypi,index,last-modified,local-stats,stats,mirrors

When a mirror is proposed on the mailing list, it is manually added to the mirror list in the PyPI application once it has been checked for compliance with the mirroring rules.

The mirror list page is a simple text page that can be read by any tool that wants to get the list of registered mirrors. Other package indexes that are not mirrors of PyPI are not added to the mirror list in PyPI. They can, however, provide the same mirror list mechanism themselves for their own mirrors.
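As an illustration of how a tool might read this page, here is a minimal sketch in Python. It assumes the comma-separated layout shown in the sample above, that lines starting with "#" are comments, and that every column after the root is a path relative to it. The function name parse_mirrors_page and the FIELDS tuple are hypothetical, not part of this proposal.

```python
import csv

# Field names copied from the header comment of the sample page.
FIELDS = ("root", "index", "last-modified", "local-stats", "stats", "mirrors")


def parse_mirrors_page(text):
    """Return one dict of absolute URLs per mirror listed in the page."""
    # Drop blank lines and '#' comment lines before handing rows to csv.
    rows = (line for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#"))
    mirrors = []
    for row in csv.reader(rows):
        if len(row) != len(FIELDS):
            continue  # ignore malformed entries instead of failing
        root = row[0].rstrip("/")
        entry = {"root": root}
        for name, path in zip(FIELDS[1:], row[1:]):
            # Assumption: non-root columns are paths relative to the root.
            entry[name] = root + "/" + path.lstrip("/")
        mirrors.append(entry)
    return mirrors


SAMPLE = """\
# PyPI mirrors
# root,index,last-modified,local-stats,stats,mirrors
http://example.com/pypi,index,last-modified,local-stats,stats,mirrors
http://example2.com/pypi,index,last-modified,local-stats,stats,mirrors
"""

mirrors = parse_mirrors_page(SAMPLE)
print(mirrors[0]["last-modified"])  # http://example.com/pypi/last-modified
```

Skipping malformed lines rather than raising keeps a client usable if a single entry in the list is broken.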

Special pages a mirror needs to provide

A mirror needs to provide four pages in addition to the index page:

  • last-modified
  • local-stats
  • stats
  • mirrors

Last modified date

CPAN uses a freshness date system where the date of each mirror's last synchronisation is made available.

For PyPI, each mirror needs to maintain a URL serving simple text content that gives the date of the mirror's last synchronisation.

The date is given in GMT, using the ISO 8601 format (http://en.wikipedia.org/wiki/ISO_8601).

Each mirror is responsible for maintaining its last modified date.

By convention, this page should be reachable at "/last-modified".
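A client or PyPI itself could use this page to decide whether a mirror is fresh enough. The sketch below is only an illustration: the draft says "ISO 8601" without choosing a variant, so the three accepted layouts, the one-day threshold, and the names parse_last_modified and is_stale are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_last_modified(text):
    """Parse the page body as an ISO 8601 date expressed in GMT."""
    value = text.strip()
    # Assumed layouts; the draft does not say which ISO 8601 form is used.
    for layout in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            parsed = datetime.strptime(value, layout)
        except ValueError:
            continue
        # The page is defined as GMT, so attach the UTC timezone explicitly.
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("unrecognised last-modified value: %r" % value)


def is_stale(last_modified, now, max_age=timedelta(days=1)):
    """Return True when the mirror has not synced within max_age."""
    return now - last_modified > max_age


synced = parse_last_modified("2009-02-15T11:22:29\n")
now = datetime(2009, 2, 17, tzinfo=timezone.utc)
print(is_stale(synced, now))  # True
```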

Local statistics

Each mirror is responsible for counting all the downloads made from it. PyPI uses these counts to sum up all downloads and display the grand total.

This page is a CSV-like page, with a header on the first line. It needs to follow PEP 305 (http://www.python.org/dev/peps/pep-0305/#id19). In short, it should be readable by the Python csv module.

The fields in this file are:

  • package: the distutils id of the package.
  • filename: the filename that has been downloaded.
  • useragent: the User-Agent of the client that has downloaded the package.
  • count: the number of downloads.

The page will look like this:

# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
...

The counting starts the day the mirror is launched.

By convention, this page should be reachable at "/local-stats", but any URL relative to the mirror root can be given.
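Reading such a page with the Python csv module could look like the sketch below. It assumes the header line starts with "#" as in the sample, and that any line without exactly four fields (such as a blank line) can be ignored. The name read_local_stats is hypothetical.

```python
import csv
import io


def read_local_stats(text):
    """Yield (package, filename, useragent, count) tuples from a stats page."""
    for row in csv.reader(io.StringIO(text)):
        # Skip blank rows, the '#'-prefixed header and other non-data lines.
        if len(row) != 4 or row[0].startswith("#"):
            continue
        package, filename, useragent, count = row
        yield package, filename, useragent, int(count)


SAMPLE = """\
# package,filename,useragent,count
zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142
zc.buildout,zc.buildout-1.6.0.tgz,"Agent, with comma",3
"""

rows = list(read_local_stats(SAMPLE))
print(rows[0])  # ('zc.buildout', 'zc.buildout-1.6.0.tgz', 'MyAgent', 142)
```

Using the csv module rather than splitting on commas matters here because a User-Agent string may itself contain commas, in which case the field has to be quoted.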

Statistics page

PyPI and each mirror are responsible for providing the grand total page at "/stats". PyPI computes this page daily by reading the local stats of all mirrors and summing them.

Mirrors should therefore not try to rebuild this stats page, but simply fetch PyPI's copy during each synchronization.

It has the same structure as the local-stats page.

By convention, this page should be reachable at "/stats".
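The daily aggregation on the PyPI side could be sketched as follows. This is not the PyPI implementation: it assumes the local-stats pages have already been downloaded as text, that rows are keyed by (package, filename, useragent), and the names sum_local_stats and render_stats are hypothetical.

```python
import csv
import io
from collections import Counter


def sum_local_stats(pages):
    """Add up the download counts found in several local-stats pages."""
    totals = Counter()
    for text in pages:
        for row in csv.reader(io.StringIO(text)):
            # Skip blank rows, the '#'-prefixed header and malformed lines.
            if len(row) != 4 or row[0].startswith("#"):
                continue
            package, filename, useragent, count = row
            totals[(package, filename, useragent)] += int(count)
    return totals


def render_stats(totals):
    """Render the totals with the same structure as a local-stats page."""
    out = io.StringIO()
    out.write("# package,filename,useragent,count\n")
    writer = csv.writer(out, lineterminator="\n")
    for key in sorted(totals):
        writer.writerow(list(key) + [totals[key]])
    return out.getvalue()


HEADER = "# package,filename,useragent,count\n"
mirror_a = HEADER + "zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,142\n"
mirror_b = HEADER + "zc.buildout,zc.buildout-1.6.0.tgz,MyAgent,8\n"

print(render_stats(sum_local_stats([mirror_a, mirror_b])))
```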

Mirrors listing page

Like /stats, each mirror should fetch the /mirrors page from PyPI and serve it.

By convention, this page should be reachable at "/mirrors".

How a mirror should synchronize with PyPI

A mirroring protocol called Simple Index was described and implemented by Martin v. Loewis and Jim Fulton, based on how easy_install works. This section summarizes it and gives a few relevant links, plus a short part about the User-Agent header.

The mirroring protocol

XXX changelog, pje link + to be defined

User-agent request header

So that PyPI can differentiate the actions taken by its clients, all mirroring software should send a specific user agent name.

This also applies to clients such as:

  • zc.buildout
  • setuptools
  • pyinstaller
  • etc.
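For illustration, a mirroring tool written in Python could set its own User-Agent header as sketched below. The agent string "example-mirror/0.1" and the helper build_request are made up for this example, since no naming scheme has been defined yet.

```python
import urllib.request

# Hypothetical agent name; this draft does not define a naming scheme.
USER_AGENT = "example-mirror/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the calling tool to the server."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://pypi.python.org/simple/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))  # example-mirror/0.1
```

The request can then be passed to urllib.request.urlopen when the tool actually contacts the server.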

XXX user agent registering mechanism at PyPI ?

How a client can use PyPI and its mirrors

XXX

Fail-over mechanism

Clients that are browsing PyPI should be able to use a fail-over mechanism when PyPI is not responding.

This can be done by parsing the /mirrors page of PyPI or the one located on any PyPI mirror.

It is up to the client to decide which mirror should be used, depending on its geographical location and its responsiveness.

This PEP does not describe how this fail-over mechanism should work, but clients are strongly encouraged to use the nearest mirror.

Clients that could currently use this mechanism:

  • setuptools
  • zc.buildout (through setuptools)
  • pyinstaller
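Since the fail-over strategy is left to clients, the following is only one possible sketch: try PyPI first, then each mirror in the order given, and return the first successful response. The function name fetch_with_failover, the "simple/zc.buildout/" path and the injectable fetch callable are assumptions made for the example. Ordering the server list by proximity or responsiveness is left to the caller.

```python
import urllib.request


def fetch_with_failover(path, servers, fetch=None, timeout=5):
    """Try each server in order; return (server, body) for the first success."""
    if fetch is None:
        def fetch(url):
            # Default fetcher: a plain HTTP GET with a short timeout.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
    errors = []
    for server in servers:
        url = server.rstrip("/") + "/" + path.lstrip("/")
        try:
            return server, fetch(url)
        except OSError as exc:
            # URLError and socket timeouts are both subclasses of OSError.
            errors.append((server, exc))
    raise OSError("all servers failed: %r" % errors)


def fake_fetch(url):
    """Stand-in fetcher that simulates PyPI being down."""
    if url.startswith("http://pypi.python.org"):
        raise OSError("simulated outage")
    return b"<html>index</html>"


servers = ["http://pypi.python.org", "http://example.com/pypi"]
server, body = fetch_with_failover("simple/zc.buildout/", servers,
                                   fetch=fake_fetch)
print(server)  # http://example.com/pypi
```

Passing the fetcher as a parameter keeps the ordering logic separate from the network code, so the fail-over behaviour can be exercised without a live server.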

Extra package indexes

Some packages will obviously not be uploaded to PyPI, either because they are private or because the project maintainer runs their own server where people get the project's packages. It is nevertheless strongly encouraged that a public package index follow the PyPI and distutils protocols. In other words, the "register" and "upload" commands should be compatible with any package index server out there.

Software compatible with PyPI and distutils:

  • PloneSoftwareCenter
  • EggBasket

Merging several indexes

When a client needs to get packages from several distinct indexes, it should be able to use each of them as a potential source of packages. The indexes should be defined as an ordered list that the client walks through when looking for a package.

Each independent index can of course provide a list of its mirrors, if the /mirrors page is available.

This permits all combinations at the client level, for a reliable packaging system with all levels of privacy.
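One way a client might walk such an ordered list is sketched below. Each index is tried in order, followed by its own mirrors, before moving on to the next index. The names candidate_sources and find_package, the has_package callable, the example URLs and the catalogue contents are all hypothetical.

```python
def candidate_sources(indexes):
    """Yield every server to try: each index in order, then its mirrors."""
    for index_url, mirror_urls in indexes:
        yield index_url
        for mirror_url in mirror_urls:
            yield mirror_url


def find_package(name, indexes, has_package):
    """Return the first source that provides the package, or None."""
    for source in candidate_sources(indexes):
        if has_package(source, name):
            return source
    return None


# Hypothetical setup: a private index listed first, then PyPI and a mirror.
INDEXES = [
    ("http://private.example.com/index", []),
    ("http://pypi.python.org", ["http://example.com/pypi"]),
]
CATALOGUE = {
    "http://private.example.com/index": {"internal.tools"},
    "http://example.com/pypi": {"zc.buildout"},
}


def has_package(source, name):
    """Stand-in lookup; a real client would query the index over HTTP."""
    return name in CATALOGUE.get(source, set())


print(find_package("zc.buildout", INDEXES, has_package))
# http://example.com/pypi
```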

Other PyPI enhancements

XXX

Contact form for Package maintainers

A form reachable from the package page will be added, where a registered user can submit a message to the package owner. This is to be used when someone wants to take over the distutils id name, or when someone (a packager, for example) would like to reach the package owner with questions.

XXX isn't the mail in the metadata enough?
XXX the original wish here was to force the package owner to upload an sdist.

Mirroring infrastructure (last edited 2009-02-15 18:16:54 by PaulBoddie)
