Differences between revisions 6 and 7

A Restructured Standard Library

Although many new language features have been introduced to Python, and many new modules have been added to the standard library over the years, the structure of the standard library has remained relatively static throughout most of Python's lifetime. Those additions have nevertheless made it relatively difficult, even for experienced developers, to select the appropriate library facilities. For example:

Does one choose urllib or urllib2 to open connections to remote resources?
Is popen2, commands or subprocess the best module to choose to manage spawned processes? What are the trade-offs?
Why is URL parsing done in urlparse and not in urllib?

The simplicity of the Python standard library's layout, in comparison to the "aggressively hierarchical" layout of the standard Java APIs, for example, was once a persuasive argument in Python's favour. But now that a large number of overlapping modules and packages has reduced the library's coherency relative to Java's API proliferation (see java.sun.com for details), it seems appropriate to reorganise the library's layout in order to promote a more memorable and intuitive structure that can be more coherently documented.

A Note on Backward Compatibility

One argument against reorganising the standard library is that "if you ignore them, they won't bother you" - that is, the presence of many apparently haphazardly named modules is not a problem unless you need to import many of them. Fortunately, this observation can be used to work in favour of a reorganisation: the old module and package names can be retained in addition to a new layout; existing software will continue to work by importing modules via their old names; improved documentation can focus on the new layout; and reference material describing the old layout could also be provided to assist those working with older software. One disadvantage, however, might be the additional space requirement of two different library layouts.
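As a minimal sketch of how old and new names might coexist, the following registers a synthetic package in sys.modules whose submodules are simply the existing modules. The helper install_alias is hypothetical, and the names archive, tar and zip are borrowed from the tentative hierarchy proposed later on this page; a real reorganisation could equally make the old names the aliases.

```python
import sys
import types
import tarfile
import zipfile


def install_alias(package_name, aliases):
    """Register a synthetic package whose submodules are existing modules.

    Hypothetical helper for illustration only; it is not part of the
    standard library.
    """
    package = types.ModuleType(package_name)
    package.__path__ = []  # an empty __path__ marks the module as a package
    sys.modules[package_name] = package
    for short_name, module in aliases.items():
        setattr(package, short_name, module)
        sys.modules[package_name + "." + short_name] = module
    return package


# The new names point at the very same module objects as the old names.
install_alias("archive", {"tar": tarfile, "zip": zipfile})

import archive.tar
from archive import zip as archive_zip
```

Because both names refer to one module object, no code is duplicated in memory; the space cost mentioned above would arise only if the two layouts were shipped as separate files.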

Potential Areas of Improvement

The following sections present observations about the current situation and possible recommendations for future editions of the standard library.

Activities, Grouping and Redundancy

The current standard library employs many modules as siblings at the top level of a relatively shallow namespace hierarchy. Many modules have been introduced to remedy, augment or partially replace existing modules, leading to problems of redundancy and incoherency. However, a policy of preserving APIs which resemble "system", "native" or "platform" APIs has also been maintained, leading to the provision of numerous functions and abstractions in modules such as os, select, mmap, errno, getopt, and so on.

Overlapping Module Groups

The following groups of modules exhibit overlapping functionality:

email, rfc822, mimetools, mimify, multifile
HTMLParser, htmllib (see below)
commands, subprocess, popen2
urllib, urllib2 (see below)
datetime, time, calendar

Modules in the above groups would be consolidated either within a single module or organised into a more intuitive package layout in a restructured standard library.

Functional Module Groups

The following groups of modules may intentionally provide similar functionality through different implementations, or may provide complementary functionality that belongs within a common "functional group":

StringIO, cStringIO (different implementations)
UserDict, UserList, UserString (common theme: built-in type emulation)
base64, binhex, binascii, quopri, uu (common theme: encodings)
HTMLParser, sgmllib, htmllib (common theme: HTML/SGML parsing)
anydbm, whichdb, dbm, gdbm, dbhash, bsddb, dumbdbm (common theme: database file access)
cPickle, pickle, copy_reg, marshal, shelve, pickletools (differing implementations, common theme: persistence)
optparse, getopt (common theme: command line options)
readline, rlcompleter, cmd, shlex, code (common theme: interpreter I/O)
codeop, compiler, py_compile, compileall (common theme: code generation)
pwd, spwd, grp, crypt, nis (common theme: authentication)
asyncore, asynchat, wsgiref, BaseHTTPServer, SimpleHTTPServer, CGIHTTPServer, SimpleXMLRPCServer, DocXMLRPCServer (common theme: network and Web programming)
asynchat, urllib, urllib2, urlparse, httplib, ftplib, gopherlib, poplib, imaplib, nntplib, smtplib, telnetlib, xmlrpclib (common theme: network client programming)
audioop, aifc, chunk, sunau, wave, sndhdr (common theme: audio)
imageop, rgbimg, imghdr (common theme: images)
textwrap, formatter (common theme: text formatting)
zlib, gzip, bz2, zipfile, tarfile (common theme: archiving)
hashlib, hmac, md5, sha (common theme: cryptography)
mailcap, robotparser, netrc, ConfigParser (common theme: configuration, albeit with specific and generic cases)

Modules in the above groups would be placed in intuitively named packages, possibly with improved names.
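One visible cost of "different implementations" living under different names is the fallback import that callers have to write themselves. The sketch below shows the familiar idiom for the cPickle and pickle pair; the record contents are made up for illustration.

```python
try:
    import cPickle as pickle  # C implementation, available on Python 2 only
except ImportError:
    # On Python 3 this is the only name, and it selects the C accelerator itself.
    import pickle

# Illustrative data only.
record = {"module": "pickle", "combines": ["cPickle", "copy_reg", "pickletools"]}
restored = pickle.loads(pickle.dumps(record))
```

A combined module, as suggested for the persistence group, would let callers write a single import and leave the choice of implementation to the library.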

Recommendations

Just as the current standard library documentation divides the modules into particular groups, albeit with only moderate success, the above functional groupings could be used to define package boundaries that are more useful in distinguishing between different activities. A cursory review of the above could suggest the following set of packages:

archive
audio
authentication (or account)
client
compiler
commandline
configuration (or config)
cryptography (or crypto)
database
encoding
image
interpreter
parsing
persistence
server
text
types

The names employed above may not be entirely suitable, and due to the ambiguity of certain category names, it might be appropriate to establish packages with certain names (eg. parsing) within other packages (eg. compiler), thus providing a level of context (eg. Python source code parsing, as opposed to HTML/SGML parsing). One inspiration for top-level category names could be the MIME media type hierarchy which uses such names as application, audio, image, text, and so on, although enthusiasm for replicating that hierarchy would need to be restrained in areas beyond file type handling.

The issue remains of providing access to "system", "native" or "platform" APIs, especially since developers with a systems programming background may wish to make use of such APIs in preference to others, possibly to implement other abstractions or to maintain compatibility with (or resemblance to) other works. We may decide to retain a package for such APIs and not to remove them entirely from the standard library, despite the duplication of functionality that this might suggest.

Naming

The current standard library employs a number of naming conventions:

string, calendar (simple singular words)
types, collections (simple plural words)
struct, re, repr (foreshortenings and acronyms)
textwrap, unicodedata (combinations of simple words)
stringprep, fpformat (combinations of foreshortenings and acronyms)
difflib, httplib (lib-suffixed names, often using foreshortenings and acronyms)
StringIO, UserDict (mixed-cased variants of combinations)
rfc822, netrc (specific references to specifications or technical details)
copy_reg, dummy_thread (combinations involving underscores)

Recommendations

In order to simplify the recollection process, names should follow a consistent naming scheme, arguably favouring descriptive names which mention the nature of the activity supported. We might decide to permit only lower-case characters, together with numbers (only where absolutely necessary), although this can often appear confusing with acronyms and word combinations (eg. stringio, cstringio). However, since the use of acronyms may potentially be relegated to the level of class names, we may at that level employ mixed-case class names, along with upper-case acronyms as apparently tolerated by PEP 8 "Style Guide for Python Code" (http://www.python.org/dev/peps/pep-0008/). Thus, StringIO.StringIO would not become stringio.StringIO, but perhaps something like stringfile.StringIO or something even more descriptive.

In some situations it may be advisable to retain technical names instead of employing names which obscure the purpose of the module. For example, the base64 module refers to a specific kind of encoding, but any invented descriptive name for this module may prove to be verbose and yet fail to accurately communicate the same information.
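The "lower-case characters, together with numbers" rule can be expressed as a simple check. This is only a sketch: the function name follows_scheme and the exact pattern (a leading letter, then letters or digits) are assumptions, since the recommendation above does not fix the details.

```python
import re

# Assumed reading of the rule: a lower-case letter followed by
# lower-case letters or digits, with no underscores or capitals.
_LOWERCASE_NAME = re.compile(r"[a-z][a-z0-9]*")


def follows_scheme(name):
    """Return True if a module name uses only lower-case letters and digits."""
    return _LOWERCASE_NAME.fullmatch(name) is not None


# Module names taken from the list of current conventions above.
examples = ["string", "textwrap", "rfc822", "base64",
            "StringIO", "UserDict", "copy_reg", "dummy_thread"]
nonconforming = [name for name in examples if not follows_scheme(name)]
```

Technical names such as base64 and rfc822 pass this check, which fits the suggestion that they may be worth retaining.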

Module Functionality

The diversity of module naming provides an "archaeological" guide to the accumulation processes operating within the standard library, yet more fundamental changes in style, recommended practices and techniques exist within the code of the modules themselves. Since the results of such differing implementation techniques manifest themselves as differently organised class hierarchies or interaction patterns, users of standard library modules must often master styles of usage which are unnecessarily complicated for the task at hand or which diverge from previously accepted abstractions for similar tasks.

However, for certain kinds of tasks it is appropriate to employ differing approaches and thus expose differing representations to users. For example, the choice of XML parsing module may involve trade-offs with respect to resource usage, convenience and performance, and no single approach is likely to satisfy the needs of all users.

Styles of Organisation/Interaction in Modules

The following styles of class organisation or interaction patterns appear in the standard library:

Abstract superclass plus handler subclasses (sgmllib)
Abstract superclass plus handler subclasses and separate processor classes (compiler, BaseHTTPServer)
Static mix-in hierarchies (SocketServer)
Dynamic mix-in hierarchies (urllib2)
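To make the first style concrete, here is a minimal sketch of the "abstract superclass plus handler subclasses" pattern. It assumes Python 3, where the HTMLParser module mentioned on this page is available as html.parser; the LinkCollector class and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Handler subclass: the superclass drives parsing and calls back into us."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p>See <a href="http://www.python.org/">Python</a>.</p>')
collector.close()
```

The user never calls handle_starttag directly; control stays with the superclass, which is what distinguishes this style from one where responses are processed explicitly.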

The following styles of behaviour configuration are employed in the standard library:

Module-level globals to change module function behaviour (calendar, urllib)
Functionality registration mechanisms (copy_reg, xml.dom)
Environment variable access (urllib2)
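The first configuration style can be contrasted with an object-based alternative using the calendar module. This sketch sets the module-level first weekday, restores it, and then holds the same setting in an explicit TextCalendar object instead; the variable names are illustrative.

```python
import calendar

# Style 1: a module-level global changes the behaviour of module functions.
previous = calendar.firstweekday()
calendar.setfirstweekday(calendar.SUNDAY)
global_first_day = calendar.firstweekday()
calendar.setfirstweekday(previous)  # restore, since the setting is shared

# Style 2: the same setting held by an explicit object, affecting nobody else.
sunday_calendar = calendar.TextCalendar(firstweekday=calendar.SUNDAY)
local_first_day = sunday_calendar.firstweekday
```

The need to save and restore the previous value in the first style is exactly the kind of hidden coupling that module-level configuration introduces.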

Recommendations

Clearly, a diversity of patterns, mechanisms and styles is necessary to provide different approaches to particular tasks (as noted above). However, the revision of certain approaches and the subsequent "archaeological" accumulation of modules suggests that contributors have not been able to settle, at least initially, on a style widely regarded as being satisfactory to many standard library users.

An interesting example of evolving styles, as well as a number of peculiarities in the APIs provided, can be found in the urllib and urllib2 modules. Here, a moderately simple initial API has evolved into a more complicated (and presumably more powerful) subsequent API, but despite the conveniences provided in "loading up" the configured objects with specific handler functionality in advance, an alternative might involve "flattening" the style of interactions by having users process responses explicitly using separate objects or functions.
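The "loading up" style can be sketched without any network access. The example assumes Python 3, where the urllib2 API is available as urllib.request; it builds an opener configured in advance with a cookie handler and then only inspects which handlers were installed.

```python
import urllib.request
from http.cookiejar import CookieJar

# Configure the opener in advance with the handlers it should apply.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The opener now carries its behaviour implicitly in a chain of handlers.
handler_names = [type(handler).__name__ for handler in opener.handlers]
```

Every later call to opener.open would pass through this chain. A "flattened" alternative, which this page only hints at, would instead have the caller apply cookie handling to a response explicitly.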

Proposals

The most natural starting point for the definition of a restructured standard library is the package hierarchy itself. Taking the grouping recommendations into consideration, in order to identify broad categories, and taking the naming recommendations into account, we might define a more complete hierarchy:

account (or authentication)
- groups (replaces grp)
- nis
- passwords (replaces pwd, spwd)
archive
- bz2
- gzip
- tar (replaces tarfile)
- zip (replaces zipfile)
audio
- aiff (replaces aifc)
- au (replaces sunau)
- chunk
- header (replaces sndhdr)
- raw (replaces audioop)
- wav (replaces wave)
client
- ftp (replaces ftplib)
- gopher (replaces gopherlib)
- http (replaces httplib)
- mail
  - imap (replaces imaplib)
  - pop (replaces poplib)
- nntp (replaces nntplib)
- smtp (replaces smtplib)
- telnet (replaces telnetlib)
- url (replaces urllib, urllib2)
- xmlrpc (replaces xmlrpclib)
compiler
- code (replaces parts of compiler, codeop, py_compile, compileall)
- parsing (contains compiler parsing functions)
commandline
- getopt
- options (replaces optparse)
config
- mailcap
- netrc
- generic (replaces ConfigParser)
- robots (replaces robotparser)
crypto
- hash (replaces hashlib)
- hmac
- md5
- sha
database
- anydbm (includes whichdb)
- bsddb
- bsddbm (replaces dbhash?)
- dbm
- dumbdbm
- gdbm
datetime (includes calendar, sched?)
decimal
email
- message (replaces email, rfc822, mimetools, mimify, multifile)
- mailbox (includes mailbox contents)
  - mh (replaces mhlib)
encoding
- base64
- binascii
- binhex
- quopri
- uu
image
- header (replaces imghdr)
- raw (replaces imageop)
- rgb (replaces rgbimg)
interpreter
- generic (replaces cmd)
- python (replaces code)
- readline (includes rlcompleter)
math
- complex (replaces cmath)
- operator
net
- async (includes parts of asyncore, asynchat)
- url (replaces urlparse)
- xdr (replaces xdrlib)
persistence
- marshal
- pickle (replaces/combines cPickle, pickle, copy_reg, pickletools)
- shelve
random
server
- http (replaces/combines BaseHTTPServer, SimpleHTTPServer)
  - cgi (replaces CGIHTTPServer)
  - wsgi (replaces wsgiref)
  - xmlrpc (replaces/combines SimpleXMLRPCServer, DocXMLRPCServer)
system (contains "native" or "platform" APIs)
- mmap
- select (could be part of the network package)
- stat (replaces stat, statvfs)
- tempfile
- time (may be obsolete)
- zlib (could be part of a compression package)
text
- csv
- formatters
  - formatter
  - textwrap
- html (replaces/combines htmllib, HTMLParser)
- sgml (replaces sgmllib)
- shellsyntax (replaces shlex)
- xml (contains the top-level xml package)
types (augments types, contains UserDict as dict, UserList as list, UserString as str)
- array
- collections
- functional (replaces functools)
- heapqueue (replaces heapq)
- iterators (replaces itertools)
- mutex
- queue (replaces Queue)
- set (replaces sets)
- weakref

Additional Categorisation

Here, additional categorisation is introduced in order to distinguish between categories in different contexts. For example, http packages appear in both the client and server top-level packages. Instead of dividing the previously identified http category in this way, we might have decided to preserve a single http package and divide it into client and server subpackages. However, as suggested above, we regard the client and server categorisations as being more important than one of many technologies that may be relevant to both of these categorisations.

Difficult Categories and Packages

Some categories may be established at the top level despite their nature suggesting a placement in some other category. For example, the email package could in certain respects be placed in either the archive or text packages, but since this might appear counterintuitive to different users of the package, a separate placement hopefully eliminates confusion and gives the package a deservedly more prominent status in the library.

Some modules or sections of functionality can be awkward to categorise. For example, the processing of URLs as attempted by the urlparse module could be placed in various networking categories or in some other category, since URIs/URLs are also used in contexts unrelated to networking and the Internet (eg. in XML namespaces and RDF identifiers). A compromise may therefore be necessary, placing a proposed url module in the net top-level package, for example.
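The point that URL processing is useful outside networking can be illustrated by splitting identifiers that are never fetched. This sketch assumes Python 3, where the functions of urlparse live in urllib.parse; the two URIs are a well-known XML namespace and RDF property identifier.

```python
from urllib.parse import urlsplit

# An XML namespace name: an identifier, not something to download.
xhtml = urlsplit("http://www.w3.org/1999/xhtml")

# An RDF identifier, where the fragment names the property itself.
rdf_type = urlsplit("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

namespace_parts = (xhtml.scheme, xhtml.netloc, xhtml.path)
rdf_parts = (rdf_type.path, rdf_type.fragment)
```

No connection is opened at any point, which is why a purely networking-oriented home for such a module remains a compromise.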

System Packages

As noted above, a special "system" ("native" or "platform") package could be established. Care should be taken, however, to avoid filling such a package with other packages that really ought to be disassembled, reorganised or recategorised.

Editorial Notes

This is currently a draft, featuring a number of points that should be discussed rather than being interpreted as a final opinion or a final set of recommendations. -- PaulBoddie

Open Issues

Should there be a top-level package representing the entire standard distribution, e.g., "std"? For example: from std.database import anydbm
- I think the proposed hierarchy (which is obviously tentative) should be at the top level, although the risk of name collisions with independent packages and the issue of how "selfish" the standard library should be ought to be worked out. -- PaulBoddie

CodingProjectIdeas/StandardLibrary/RestructuredStandardLibrary (last edited 2008-11-15 14:00:36 by localhost)

Revision 6 as of 2006-10-09 20:25:30
  Size: 16908
  Editor: SkipMontanaro
  Comment:
Revision 7 as of 2006-10-09 21:39:46
  Size: 17172
  Editor: PaulBoddie
  Comment: Respond to open issue #1.
