PyPI Testing Infrastructure

Create a server that tests uploaded packages on PyPI.

GSoC 2011

General designs

Terminology

  • Raw data: the data generated by task execution.
  • Report: an evaluation of the different features/attributes of the data.
  • Task: an execution which produces raw data and "output", e.g. build, install, unittest, pylint...

Environment part

  • The environment part of the project is responsible for creating an abstraction for the execution part. It handles delivery of distributions (and their dependencies) to the execution part, so that tests can be run on them. It handles all the protocols required to communicate with the PyPI repository and with the different architectures used in the project. It subscribes to uploaded packages from PyPI so that they can be tested (the testing itself is done by the execution part). It is also responsible for setting up the environment required for testing and for delivering the packages to the execution part.

Architecture
  • Master–Slave architecture where the master dispatches jobs to the slave and the slave executes them.
  • The communication between master and slave happens through an API called the command API.
  • The slave communicates with the VM: it sends the distributions required for testing and receives raw data (after the distributions have been installed and the tests run) using another API called the raw data API.
  • Tests are run on VMs and each VM is handled by a slave.

Raw Data API
  • The task is to build a raw data API for the communication between the VM and the slave (a sketch follows this list).
  • The raw data API handles sending the data to the corresponding VMs.
  • The raw data API also handles sending the raw data produced on the VM (after the execution part has finished) back to the slave.
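
The wire format of the raw data API is not fixed anywhere on this page, so the following is only a minimal sketch, assuming a JSON-over-socket exchange between the slave and the VM; the class and method names (RawDataChannel, send_distributions, receive_raw_data) are hypothetical.

    import json
    import socket

    class RawDataChannel(object):
        """Hypothetical slave-side end of the raw data API.

        It pushes distribution archives into the VM and later collects the
        raw data produced by the execution part.
        """

        def __init__(self, vm_host, vm_port):
            self.sock = socket.create_connection((vm_host, vm_port))

        def send_distributions(self, archive_paths):
            # Send each archive as a length-prefixed blob so the VM knows
            # where one file ends and the next begins.
            for path in archive_paths:
                with open(path, 'rb') as fp:
                    payload = fp.read()
                header = json.dumps({'name': path, 'size': len(payload)})
                self.sock.sendall(header.encode('utf-8') + b'\n')
                self.sock.sendall(payload)

        def receive_raw_data(self):
            # The VM answers with a single JSON document describing the
            # output of every task it ran; read until the VM closes.
            chunks = []
            while True:
                chunk = self.sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
            return json.loads(b''.join(chunks).decode('utf-8'))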

Command API
  • The task is to build a command API for the communication between the Master and the Slave (a sketch follows this list).
  • The command API handles the task requests issued by the Master and assigns them to the Slave.
  • Task requests can involve different configurations to be made on a VM, which distributions are to be tested, etc.
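
A task request could then be a small structured message; the sketch below assumes JSON, and the field names (vm_config, distributions, tasks) are made up for illustration, not part of an existing PyTI API.

    import json

    def make_task_request(distribution, python_version, tasks):
        """Build a hypothetical command API task request.

        The Master serialises this and hands it to a Slave, which
        configures a VM accordingly and runs the requested tasks.
        """
        request = {
            'vm_config': {
                'python_version': python_version,  # e.g. '2.5'
                'network_access': False,           # VMs are cut off from the network
            },
            'distributions': [distribution],       # what to test
            'tasks': tasks,                        # e.g. ['install', 'unittest', 'pylint']
        }
        return json.dumps(request)

    # Example:
    # message = make_task_request('Foo-1.0.tar.gz', '2.5', ['install', 'unittest'])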

Slave

The slave performs the following tasks (a minimal sketch follows the list):

  • Initialises an isolated VM and configures it using the configuration provided by the API call.
  • Communicates with the PyPI repository to get the distributions to be tested.
  • Gets the distribution to be tested from the repository, computes its dependencies and also fetches those dependencies from the repository.
  • Passes all the packages to the VM.
  • Receives the raw data from the VM.
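
Putting those steps together, one slave job could look roughly like the sketch below. It reuses the hypothetical RawDataChannel from the raw data API sketch above; start_vm and fetch_archives are passed in as callables because this page does not say how VMs are launched or how archives are downloaded.

    def handle_job(task_request, start_vm, fetch_archives):
        """Hypothetical slave workflow for a single task request."""
        # 1. Start an isolated VM configured as the Master asked.
        vm = start_vm(task_request['vm_config'])

        # 2. Fetch the distribution and its dependencies from PyPI;
        #    the VM itself has no network access, so this happens here.
        archives = fetch_archives(task_request['distributions'])

        # 3. Push all the archives into the VM over the raw data API.
        channel = RawDataChannel(vm.host, vm.port)
        channel.send_distributions(archives)

        # 4. Wait for the execution part to finish and collect the raw data.
        raw_data = channel.receive_raw_data()

        # 5. Hand the raw data back to the Master.
        return raw_data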

Master
  • The Master subscribes to uploaded packages on PyPI (see the sketch after this list).
  • It dispatches jobs to the Slave using the command API.
  • It receives the test results from the Slave.
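
The Master side can be sketched just as loosely; the loop below reuses make_task_request from the command API sketch, and both the uploaded_packages feed and the slaves pool (objects with a run() method) are assumptions, since the page does not specify how the PyPI subscription is delivered.

    def master_loop(uploaded_packages, slaves):
        """Hypothetical master loop: dispatch one job per uploaded package."""
        results = []
        for index, package in enumerate(uploaded_packages):
            request = make_task_request(package, '2.5',
                                        ['install', 'unittest', 'pylint'])
            slave = slaves[index % len(slaves)]   # naive round-robin dispatch
            results.append(slave.run(request))    # test results / raw data come back here
        return results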

Execution part

Detailed work

During the project's preparation discussions, we split the project into two parts which depend on each other. This proposal covers one of them: the execution part. The other part is the establishment of a clean environment (inside a VM instance): the environment part.

Work can be divided into three parts, which I will detail below:

  • Create an execution manager.
  • Write the essential tasks.
  • Write the common tasks.

Global Architecture

Even though this proposal concerns the execution part of PyTI, the choices made during the preparation discussions have an influence on the work; I will detail these choices and their consequences below:

Tests will be executed inside a virtual machine instance because we need to run source code from untrusted sources. We chose to cut off network access in order to avoid sending mail, denial-of-service attacks and the like. This choice can be problematic if tests need access to the network, but good practice recommends mocking it. If it turns out that several distributions cannot mock network access, we will have to find a way to control network access more precisely.

As the VM instance will not have network access, the program that starts the instance must prepare everything the instance will need: the distribution archive, of course, but also the archives of the distribution's dependencies. The dependency computation must be done before starting the VM instance; this part is detailed below in the "Dependency setup" part.
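
As a rough illustration of that preparation step, the snippet below downloads a distribution together with its dependencies into a local directory before the VM is started. Shelling out to "pip download" is an assumption made here for brevity; the original plan predates that command and may rely on distutils2 or its own resolver instead.

    import subprocess

    def prefetch(distribution, dest_dir):
        """Download `distribution` and all of its dependencies into
        `dest_dir`, so the archives can be copied into a VM that has
        no network access."""
        subprocess.check_call([
            'pip', 'download',
            '--dest', dest_dir,
            distribution,
        ])

    # Example: prefetch('SomeDistribution==1.0', '/tmp/pyti-archives')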

The last point is about the raw data API: tasks will generate raw data (see the Terminology section) and these data have to be sent to the slave. This API belongs to the complementary part of the project, but as the execution part will use it, it should be designed with everyone's participation.

Execution manager

Running the tests on package content may be split into different independent tasks. These tasks can be written independently, but they all have to be executed. They cannot be performed in an arbitrary order, as one task may depend on another, and if a task fails the system should not perform the tasks that depend on the one that failed.

This is the role of the execution manager: it manages all the different tasks and executes them in the right order, making sure that each task succeeds. The execution manager must handle scenarios like the following:

Task 'A' depends on 'B' and 'B' depends on 'C'. The execution order must be: 'C', 'B' and finally 'A'. If task 'B' fails, the execution manager must mark 'A' as skipped because of an error in one of its dependencies. The execution manager must also provide a way for tasks to exchange data, so we can imagine that tasks depend on data instead of on other tasks. Moreover, as tasks generate raw data, it makes more sense to manage raw data dependencies than task dependencies. For example:

Task 'A' needs the data 'D'. Task 'B' produces 'D' during its execution. The execution order must be: 'B' then 'A'.
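
A minimal sketch of such an execution manager, using the data-dependency model just described: each task declares which data it needs and which data it produces, and a task ends up skipped when some data it needs was never produced because an earlier task failed or was skipped. The Task class and its fields are illustrative only.

    class Task(object):
        """Illustrative task: `needs`/`produces` are lists of data names."""
        def __init__(self, name, func, needs=(), produces=()):
            self.name, self.func = name, func
            self.needs, self.produces = list(needs), list(produces)

    def run_tasks(tasks):
        """Run producers before consumers; skip tasks whose data never arrives."""
        data = {}        # data name -> value produced so far
        results = {}     # task name -> 'success' | 'failed' | 'skipped'
        pending = list(tasks)
        while pending:
            runnable = [t for t in pending if all(n in data for n in t.needs)]
            if not runnable:
                # Nothing can run any more: the remaining tasks miss some data
                # because a producer failed or was itself skipped.
                for task in pending:
                    results[task.name] = 'skipped'
                break
            for task in runnable:
                pending.remove(task)
                try:
                    produced = task.func(**{n: data[n] for n in task.needs})
                    results[task.name] = 'success'
                    data.update(zip(task.produces, produced))
                except Exception:
                    results[task.name] = 'failed'
        return results

    # Example matching the text: B produces 'D', A needs 'D', so B runs
    # before A; if B raises, A is marked as 'skipped'.
    # tasks = [Task('A', lambda D: (), needs=['D']),
    #          Task('B', lambda: ('raw data',), produces=['D'])]
    # print(run_tasks(tasks))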

Essential tasks

Some of the tasks are essential and must be implemented as soon as possible.

The second one is the installation. It is one of the first tasks that will be executed, and it is the most important task because it is one of the parts that motivated the entire project. Indeed, the old packaging libraries designed the setup script as a traditional Python module, which causes problems: packagers can put arbitrary valid Python code in their script and nothing prevents them from adding an "os.system('rm -Rf /')" statement. The idea is to allow this but to trace all accesses to the file system, the network and system calls. Detecting such behaviour can be done with a tracing tool such as strace or SystemTap. Using these tools produces raw data that we must process to make it readable for humans and usable for detecting harmful behaviour.
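
As a rough illustration of that tracing idea (a sketch only, not the project's actual tooling), the installation could be wrapped in strace like this; the trace file then becomes the raw data of the installation task.

    import subprocess

    def traced_install(setup_dir, trace_file):
        """Run `python setup.py install` under strace, recording file,
        network and process-related system calls."""
        cmd = [
            'strace',
            '-f',                         # follow children spawned by the setup script
            '-o', trace_file,             # the trace written here is the task's raw data
            '-e', 'trace=file,network,process',
            'python', 'setup.py', 'install', '--user',
        ]
        return subprocess.call(cmd, cwd=setup_dir)

    # The trace can later be scanned for harmful behaviour, e.g. writes
    # outside the build directory or unexpected connect() calls.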

The third one is the build: some Python distributions include non-Python source code. This non-Python source code must be compiled before the distribution is installed, and the compilation can fail. Currently the packaging libraries take care of this, but it can be difficult to obtain detailed raw data from them. So we need to study how these libraries work and how we can get raw data out of them; if that is not possible, we should improve the build part of these libraries.

Common tasks

Moreover, some tasks can be considered standard, such as the following:

A test execution task, collecting the results of unittest and doctest runs, is a very common task and should also be included in the execution. Test execution may also include code coverage measurement. Test execution raises some problems, as there is more than one testing library; the work must include an analysis of the existing testing libraries and a task for each of them. This task may also rely on external dependencies (specific databases, libraries that are not easy to install, ...) and we need to choose which external dependencies we will install on our VM instances.

Another common task is the quality check. Quality checking is a vast subject, so we will start with pylint and pychecker. Other quality tools can be added case by case, depending on the features they bring.
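
A small sketch of how a quality-check task could be wired up, running pylint as a subprocess and keeping its textual output as raw data (the report side would then extract the score and messages). This is an assumption about the plumbing, not existing PyTI code.

    import subprocess

    def pylint_task(package_dir):
        """Run pylint over an unpacked distribution; return its output as raw data."""
        proc = subprocess.Popen(
            ['pylint', package_dir],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        out, err = proc.communicate()
        return {
            'tool': 'pylint',
            'exit_status': proc.returncode,   # pylint encodes message categories in its exit status
            'raw_output': out.decode('utf-8', 'replace'),
            'errors': err.decode('utf-8', 'replace'),
        }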

A final task that can be added if we have time is a performance task. The idea is to measure the time taken by the other tasks. This measurement is not very important, but it could be implemented if we have time and if the community really wants it.

copy content of proposal <here>

Existing solutions

Task Manager

What we need:

  • Dependencies between tasks
  • Mark tasks as skipped if at least one dependency has failed or been skipped

Existing solutions:

  • Pony-build (integrated)
  • Apycot/narval (integrated)
  • Buildbot (integrated)
  • PythonTasks (separated)

report your findings <here>

Weekly meeting report

2011-05-12

Blog

IRC
  • #python-testing

Mailing List
  • General mailing list: pyti@librelist.com
  • Specific: distutils-sig / python-testing / catalog-sig / python-dev, depending on the case

Source
  • Vanilla Mercurial + BitBucket

Python Version
  • Python 2.5; if there is a need for new features, using a different Python version can be discussed, up to Python 2.6.

Wiki

Weekly Meeting

For the next week (Boris):

Code: Finish some little things on PyTasks (needed for a project) and compare it to the task managers included in tools like waff and narval (not sure about pony-build). Design: Will read some of my bookmarks about distributed computing. Communication: Write a summary about the task manager and about which protocol we can use for the API.

For the next week (Yeswanth):

Code: Not much this week (will work on reading about Python coding conventions). Other tools/design: Will go through Condor or other equivalents and see how they fit in this project. Communication: Report on Condor (http://www.cs.wisc.edu/condor/) or other equivalents.
