Cloud PyPI Proposal
This is the third revision of the proposal text which was accepted by the PSF board in their December 2010 meeting.
PSF-Proposal: 100 Title: Move PyPI static data to the cloud for better availability Version: 3 Last-Modified: 2010-08-16 Author: email@example.com (Marc-AndrÃ© Lemburg) Discussions-To: firstname.lastname@example.org Status: Draft Type: Informational Created: 2010-06-14 Post-History: Proposal: Move PyPI static data to the cloud for better availability ======================================================================== Motivation ---------- PyPI has in recent months seen several outages with the index being unavailable to both users using the web GUI interface as well as package administration tools such as easy_install from setuptools. As more and more Python applications rely on tools such as easy_install for direct installation, or zc.buildout to manage the complete software configuration cycle, the PyPI infrastructure receives more and more attention from the Python community. While we just started to have monitoring for PyPI installed and the first reports of outtages are now being sent to the catalog-sig mailing list, we don't have reliable numbers available yet. However, the number of discussions about PyPI outtages in the mailing lists has increased to a point where we cannot simply ignore those complaints anymore. In order to maintain its credibility as software repository, to support the many different projects relying on the PyPI infrastructure and the many users who rely on the simplified installation process enabled by PyPI, the PSF needs to take action and move the essential parts of PyPI to a more robust infrastructure that provides: * scalability * 24/7 outsourced system administration management * redundant storage * geo-localized fast and reliable access Current Situation ----------------- PyPI is currently run from a single server hosted in The Netherlands (ximinez.python.org). This server is run by a very small team of sys admin. PyPI itself has in recent months been mostly maintained by one developer: Martin von LÃ¶wis. Projects are underway to enhance PyPI in various ways, including a proposal to add external mirroring (PEP 381), but these are still not fully finalized and require implementation changes/new features in the existing client tools. According to Martin, the server side features of PEP 381, including a few undocumented extensions to provide package signatures, are already implemented. However, without client tools to make use of them, this is not going to change the current situation for existing PyPI users. Furthermore those client tools enhancements would first have to get adopted by PyPI users by either replacing their client tools with updated versions or switching to new client tools, which is likely going to take months to years. Existing client tool users won't see an immediate improvement. Usage ----- PyPI provides four different mechanisms for accessing the stored information: * a web GUI that is meant for use by humans * an RPC interface which is mostly used for uploading new content * a semi-static /simple package listing, used by setuptools * a static area /packages for package download files and documentation, used by both the web GUI and setuptools The /simple package listing is dump of all packages in PyPI using a simple HTML page with links to sub-pages for each package. These sub-pages provide links to download files and external references. External tools like easy_install only use the /simple package listing together with the hosted package download files. While the /simple package listing is currently dynamically created from the database in real-time, this is not really needed for normal operation. A static copy created every 10-20 minutes would provide the same level of service in much the same way. Moving static data to a CDN --------------------------- Under the proposal the static information stored in PyPI (meta-information as well as package download files and documentation) is moved to a content delivery network (CDN). For this purpose, the /simple package listing is replaced with a static copy that is recreated every 10-20 minutes using a cronjob on the PyPI server. At the same intervals, another script will scan the package and documentation files under /packages for updates and upload any changes to the CDN for neartime availability. By using a CDN the PSF will enable and provide: * high availability of the static PyPI content * offload management to the CDN * enable geo-localized downloads, i.e. the files are hosted on a nearby server * faster downloads * more reliability and scalability * move away from a single point of failure setup Note that the proposal does not cover distribution of the dynamic parts of PyPI. As a result uploads to PyPI may still fail if the PyPI server goes down. However, these dynamic parts are currently not being used by the existing package installation tools. Choice of CDN: Amazon Cloudfront -------------------------------- To keep the costs low for the PSF, Amazon Cloudfront appears to be the bext choice for CDN. Cloudfront is supported by a set of Python libraries (e.g. Amazon S3 lib and boto), upload scripts are readily available and can easily be customized. http://www.saltycrane.com/blog/2008/12/card-store-project-4-notes-using-amazons-cloudfront/ Other CDNs, such as Akamai, are either more expensive or require custom integration. Availability of Python-based tools is not always given, in fact, accessing such information is difficult for most of the proprietary CDNs. Cloudfront: quality of service ------------------------------ Amazon Cloudfront uses S3 as basis for the service, S3 has been around for years and has a very stable uptime: http://www.readwriteweb.com/archives/amazon_s3_exceeds_9999_percent_uptime.php Cloudfront itself has been around since Nov 2008. Amazon still uses the web 2.0 "beta" marketing term on it. You can check their current online status using this panel: http://status.aws.amazon.com/ Apart from the gained availability and outsourced management, we'd also get faster downloads in most parts of the world, due to the local caching Cloudfront is applying. This caching can be used to further increase the availability, since we can control the expiry time of those local copies. So in summary, we are replacing a single point of failure with an N server fail-over system (with N being the number of edge caching servers they use). How Cloudfront works -------------------- Cloudfront uses Amazon's S3 storage system which is based on "buckets". These can store any number of files in a directory-like structure. The only limit is a 5GB per file limit - more than enough for any PyPI package file. Cloudfront provides a domain for each registered S3 bucket via a "distribution" which is then made available through local cache servers in various locations around the world. The management of which server to use for an incoming request is transparently handled by Amazon. Once uploaded to the S3 bucket, the files will be distributed to the cache servers on demand and as necessary. Each edge server server maintains a cache of requested files and refetches the files after an expiry time which can be defined when uploading the file to the bucket. To simplify things on our side, we'll setup a CNAME DNS alias for the Cloudfront domain issued by Amazon to our bucket: pypi-static.python.org. IN CNAME d32z1yuk7jeryy.cloudfront.net. In the unlikely event of a longer downtime of the whole Amazon Cloudfront system, our system administrators could then easily change the DNS alias pypi-static.python.org to point back to the PyPI server until the Cloudfront problem is rectified. For more details, please see the Cloudfront documentation and FAQ: http://aws.amazon.com/documentation/cloudfront/ http://aws.amazon.com/cloudfront/faqs/ Integration ----------- In order to keep the number of changes to existing client side tools and PyPI itself to a minimum, the installation will try to be as transparent to both the server and the client side as possible. This requires on the server side: * few, if any changes to the PyPI code base * simple scripts, driven by cronjobs * a simple distributed redirection setup to avoid having to change client side tools On the client side: * no need to change the existing URL http://pypi.python.org/simple to access PyPI * redirects are already supported by setuptools via urllib2 Note that we are avoiding creating a lock-in situation by moving the data to a CDN, since the needed configuration changes on the server side can easily be rolled back to the current setup, without affecting the client side. Server side: upload cronjobs ---------------------------- Since the /simple index tree is currently being created dynamically, we'd need to create static copies of it at regular intervals in order to upload the content to the S3 bucket. This can easily be done using tools such as wget or curl or using a custom Python script that hooks directly into the PyPI database (and reuses the code for generating the /simple tree). Both the static copy of the /simple tree and the static files uploaded to /packages then need to be uploaded or updated in the S3 bucket by a cronjob running every 10-20 minutes. In a second phase of the project, we could extend PyPI to automatically push updates to Cloudfront whenever a new file is uploaded or the package data changes. Server side: downloads statistics --------------------------------- The next step would then be to configure access logs: http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html and add a cronjob to download them to the PyPI server. Since the format is a bit different than the Apache log format used by the PyPI software, we'd have two options: 1. convert the Cloudfront format to Apache format and simply append the converted logs to the local log files 2. write a Cloudfront log file reader and add it to the apache_count_dist.py script that updates the download counts on the web GUI Both options require no more than a few hours to implement and test. Server side: redirection setup ------------------------------ Since PyPI wasn't designed to be put on a CDN, it mixes static file URL paths with dynamic access ones, e.g. dynamic: http://pypi.python.org/pypi (and a few others) static: http://pypi.python.org/simple http://pypi.python.org/packages To move part of the URL path tree to a CDN, which works based on domains, we will need to provide a URL redirection setup that redirects client side tools to the new location. As Martin von Loewis mentioned, this will require distributing the redirection setup to more than just one server as well. Fortunately, this is not difficult to do: it requires a preconfigured lighttpd (*) setup running on N different servers which then all provide the necessary redirections (and nothing more): dynamic: http://pypi.python.org/ -> http://ximinez.python.org/pypi http://pypi.python.org/pypi -> http://ximinez.python.org/pypi (and possibly a few others) static: http://pypi.python.org/simple -> http://pypi-static.python.org/simple http://pypi.python.org/packages -> http://pypi-static.python.org/packages (note: pypi-static.python.org is a CNAME alias for the Cloudfront domain issued to the S3 bucket where we upload the data) The pypi.python.org domain would then have to be setup to map to multiple IP addresses via DNS round-robin, one entry for each redirection server, e.g. pypi.python.org. IN A 126.96.36.199 pypi.python.org. IN A 188.8.131.52 pypi.python.org. IN A 184.108.40.206 pypi.python.org. IN A 220.127.116.11 Redirection servers could be run on all PSF server machines, and, to increase availability, on PSF partner servers as well. It should be noted that current client side PyPI tools do not support automatic retry, so there still is a chance that the redirection server they pick on first try will fail. The user would then just have to retry the download to get a new server address. Automatic retry would, of course, create a better user experience, but this requires a few small changes in the existing PyPI client tools. (*) lighttpd is a lightwheight and fast HTTP server. It's easy to setup, doesn't require a lot of resources on the server machine and runs stable. Long-term changes ----------------- While enabling the above redirection setup, we should also start working on changing PyPI and the client tools to use two new domains which then cleanly separate the static CDN file access from the dynamic PyPI server access: pypi.python.org pypi-static.python.org Such a transition on the client side is expected to take at least a few years. After that, the redirection service can be shut down or used to distribute and scale the dynamic PyPI service parts. Future improvements ------------------- We could replace the cronjob system with a trigger based system that uploads changes as soon as the PyPI server receives them. Side-effects ------------ Restarts of the PyPI server, network outages, or hardware failures would not affect the static copies of the PyPI on the CDN. setuptools, easy_install, pip, zc.buildout, etc. would continue to work. The S3 bucket would serve as additional backup for the files on PyPI. Later integration with Amazon EC2 (their virtual server offering) would easily be possible for more scalability and reduced system administration load. We don't have to worry about issues such as mirror servers having out-of-date data. Manipulation of packages, e.g. to introduce trojans, is also minimized, since the Cloudfront edge servers get their data straight from the S3 bucket. By moving from a database query driven /simple index to a static file system based one, we increase the responsiveness of the PyPI index. Additionally, we could easily deliver a database snapshot of the PyPI database as SQLite file for client side tools to use for local queries, further reducing the need for client-server roundtrips and query overhead on the server side. Costs ----- Amazon charges for S3 and Cloudfront storage, transfer and access. The costs vary depending on location. http://aws.amazon.com/cloudfront/#pricing http://aws.amazon.com/s3/#pricing To get an idea of the costs, we'd have to take a closer look at the PyPI web stats: http://pypi.python.org/webstats/usage_201005.html In May 2010, PyPI transferred 819GB data and had to handle 22mio requests. Using the AWS monthly calculator this gives roughly (I used 37KB as average object size and 35% US, 35% EU, 10% HK, 10% JP as basis): USD 132 per month, or about USD 1,584 per year for Cloudfront. For the S3 storage, the costs amount to roughly USD 30 per month, or USD 360 per year (100GB storage, 50GB traffic in, 100GB traffic out, 1000 PUT requests, 1mio GET requests). Total costs are an estimated USD 1944 per year. Refinancing the costs --------------------- Since PyPI is being used as essential resource by many important Python projects (Zope, Plone, Django, etc.), it's fair to ask the respective foundations and the general Python community for donations to help refinance the administration costs. A prominent donation button should go the PyPI page with a text explaining how PyPI is being hosted and why donations are necessary. We may also be able to directly ask for donations from the above foundations. Details of this are currently being evaluated by the PSF board (there are some issues related to our non-profit status that make this more complicated than it appears at first). Unlike other less visible PSF activities, providing and running PyPI is a real tangible service to the community, creating more incentive for Python users, including companies relying on the PyPI service, to donate to the PSF. Overall, we should be able to refinance the costs of this improved service level, perhaps even generate more donations than needed to fund other PSF activities. Effort ------ Given that most of the tools are readily available, setting up the servers shouldn't take more than 2-3 developer days for developers who've worked with Amazon S3 and Cloudfront before, including testing. It is expected that we'll find volunteers to implement the necessary changes. If not, Marc-Andre Lemburg will start working on implementing the necessary changes, as time permits. Competing with PEP 381 ---------------------- A few PEP 381 developers have stated that this proposal would limit the interest in PEP 381 implementations and argue that the proposal would compete with their proposed strategy. Just to clarify, this proposal does not try to compete with the mirror proposal outlined in PEP 381. Instead it focuses on a readily available solution that can be implemented in a few days and only requires little additional system administration. In order to further underline this, the proposal will be presented to the board for approval in their August board meeting (currently scheduled for August 16), giving the PEP 381 developers more time to work and improve their PEP 381 client implementations. If the PEP 381 infrastructure gets rolled out, both the external mirrors and the cloud mirrors can work side-by-side, so there is no conflict.