Differences between revisions 35 and 36

A Python Graph API?

This wiki page is a resource for some brainstorming around the possibility of a Python Graph API in the form of an informational PEP, similar to [http://www.python.org/peps/pep-0249.html PEP 249, the Python DB API]. The goal would be, in other words, to define how a graph (or various kinds of graphs) would be expected to behave (possibly from different perspectives) in order to increase interoperability among graph algorithms. [http://numeric.scipy.org/array_interface.html The numeric array interface], recently developed by the Numeric Python community to increase interoperability between array-handling software, illustrates the general idea.

A comp.lang.python thread discussing this (with some useful ideas) may be found [http://groups-beta.google.com/group/comp.lang.python/browse_thread/thread/cbca60cb36be39ed/313113af1aa077af here].

This is not about plotting; for a definition of the kind of graph we're talking about, see [http://mathworld.wolfram.com/Graph.html here].

If you have interest in this, please feel free to add your ideas here. If you'd like to be named as a contributor in the case that some "standard" should emerge, please add your name to the list of contributors. (You might also want to create a user name and log in when doing changes, or possibly just state your name when you edit stuff, so one can see who did what.)

A simple use-case

Just to illustrate the point of this discussion, here is a simple example:

Let's say I have some special hardware that is able to search efficiently for complex patterns in byte sequences. I somehow use this to find various relationships between the genomes of different organisms, or between different genes in a given genome. This implicitly defines a graph, but I cannot put the graph explicitly in the form of a data structure that Python can manipulate, and, hence, I cannot use a ready-made graph library. So, I instead write an API to my system so that it can be probed and manipulated by other code as if it were a graph data structure.

Now, someone else might have written some nice graph code for, say, an approximated bipartite clique cover (or whatever) and I want to use that. In an ideal world, the clique partition code would be written using only an abstract notion of what a graph is (i.e. a protocol or interface/API) and I have implemented my hardware-based graph using the same graph notion/protocol. In this case I can simply write a small program that drops my hardware-based graph right into the clique partition code and it will work.

This ideal world scenario would only be possible, though, if some abstract graph notion or protocol is defined, so that both graph implementors and graph algorithm implementors can use it. The goal of this little project is, of course, to define such a protocol/API.

Note that the goal is not to implement a specific graph representation or a set of graph algorithms. The goal is only to mediate between developers of the two.

Preliminary ideas

General thoughts

Ideas so far include basing the API on the standard notion of using adjacency maps (such as dicts of neighbor lists), as described in [http://www.python.org/doc/essays/graphs.html Guido's well-known essay] and used in [http://www.ics.uci.edu/~eppstein/PADS these examples by David Eppstein], and using object adaptation (through the adapt() function, as described in [http://www.python.org/peps/pep-0246.html PEP 246] and as used in [http://peak.telecommunity.com/PyProtocols.html PyProtocols]) to allow access to graphs through various perspectives (such as adjacency maps, incidence maps, adjacency arrays, edge lists...). This would also allow graphs implementations with completely different interfaces to be adapted to a standard interface, and thus be used by generic graph algorithms.

An example of an existing graph library for Python, written by István Albert, may be found [http://pygraphlib.sourceforge.net/ here].

[http://networkx.lanl.gov NetworkX] is another example of a graph library in Python.

[http://gadfly.sourceforge.net/kjbuckets.html kjbuckets] is a C extension and very fast, but imcomplete.

[http://cvs.zope.org/Packages/pypes/ pypes] appears never to have been released officially.

Bruno Preiss offers a very complete but somewhat unPythonic Graph type as part of an online [http://www.brpreiss.com/books/opus7/html/page519.html book on data structures in Python]. (Downloadable at the Opus7 package.)

[http://www.ece.arizona.edu/~denny/python_nest/graph_lib.py Nathan Denny's graph library] appears to be a fairly simplistic implementation.

[http://www.bioinformatics.ucla.edu/pygr/ Pygr] is billed as a "Python graph database framework for bioinformatics", and clearly aims to be more than just a graph library.

[http://www.zpr.uni-koeln.de/~gato/ Gato], the Graph Animation Toolkit, is a visual tool intended to teach graph algorithms. It contains a fully functional graph library.

[http://sourceforge.net/projects/pynetwork/ pynetwork]

[http://boost.org/libs/graph/doc/python.html BOOST] - how about BOOST? it seems to be well developed and has already Python bindings

[http://alpha-leonis.lids.mit.edu/nlp/pygraph/ pygraph], Beracah Yankama's pure python a/cyclic tk grapher for tk contains spring and hierarchical node placement algorithms, along with some basic classes for creating nodes & edges.

One possible underlying technology for graphs would be adjacency matrices using [http://www.stsci.edu/resources/software_hardware/numarray numarray]. This approach would need to use sparse matrices to be practical for large graphs.

Add your ideas here (Go wild, folks...)

I think that the graph representation should be a redundant one, that allows the fast traversals, quick node, edge and neighbour lookups as well as fast iterations over all nodes or all edges. This will of course come at the expense of storage and/or graph initialization efficieny. For example in the python graph representation mentioned above (modeled after LEDA) the nodes and the edges are stored in a separate dictionaries. Each edge_id maps to a tuple of the head_id and the tail_id (the nodes) that form the edge. (Edge_ids are automatically created as they are added.) At the same time each node_id maps to a tuple of two lists corresponding to incoming and outgoing edges. (Istvan)

IMO, one of the points of a standard interface is to let people choose their own representations. (Magnus).

I believe that just as most people don't care how lists and dictionaries work behind the scenes they won't care (or want to have to deal with) picking a graph representation. The two representation that are listed in just about every computer science book, the adjacency matrix and the adjecency list representations are just about useless in the real world. These are what I would call a node centric view of the graph. It turns out in many probelms edges have far more utility.Graphs are a very rich methaphor and one rarely uses them in the stripped down version. What we need is a representation that is simple, fast and efficient enough in every case. (Istvan)

This is not about giving people a graph library, but letting people who write graph representations and people who write graph algorithms interoperate more easily, by giving them a common interface. If people want to use a ready-made library, that's fine. The point is that people (like me) might want to represent graphs completely differently from the "textbook" version (in fact, compact graph representations is what my current research is about). It's the same with the DB API; by fixing a standard API, people can write programs that use databases without having to worry about which database or database library they use. I've added a use-case above to clear this up a bit. (Magnus)

To write any graph algorithm, you need a set of standard methods. As far as I know, there are only two ways of getting this - either agree on a fixed interface which implementations must follow (the DB API approach) or define an interface and require that implementations can be adapted to it (the adaptation approach). I don't like Andrew Dalke's suggestion of code-generating template systems, it seems to me that this would make writing algorithms far too hard, and where multiple graph representations are in use, result in code explosion (a well-known problem with C++ templates). My preference is for adaptation. It's ideal for this sort of thing, the only downside is that it isn't standard. But getting PEP 246 accepted would be easier with more use cases - avoiding its use is self-defeating. Better to lobby for its use, on the basis that we need it for this type of situation. (PaulMoore)

I share your view here. One possibility is to specify how a graph can be adapted to the (or one of) the standard interface(s) as part of the (hypothetical) Graph PEP, using adapter/factory functions. This can then be the same mechanism used by adapt(), but we don't have to require its use. (Or we could use it, of course...)

On the other hand, it isn't really necessary for us (if we go down this road) to specify how such adaptation should be done. We could define the graph interface, and graph libraries could supply their own adapter functions or wrappers. (Magnus)

I also think that one important use case for a generalized graph api would be tree representations. Any standard graph representation should as a subset present a standard tree, with the classic algorithms being portable between them. (As a minimum, pre-, in-, and post-order traversal; preferably depth-first and breadth-first searches.) (VanL)

Andrew Dalke [http://groups.google.com/groups?selm=yyq0d.3041%24xA1.1224%40newsread3.news.pas.earthlink.net&output=gplain suggests] using a template system that can transform code using a standard API into code using a custom API. This differs from a solution using object adaptation in that it works at the source level, and thus can have some performance advantages.

Perhaps we should try to keep the edge and node attributes/decorations separate from the GraphAPI. The API should specify that each node and edge store a pointer. The pointer would be set to an object that handles all decorations and manipulations of those decorations. The graph algorithms would work on the graph structure itself -- not the decorations -- and the user would need to interface those results to the decorations. (dschult)

(Nick) I think that to be the most general the API should support hypergraphs, with regular graphs being a subclass. A hypergraph is defined by two sets of objects (a.k.a. nodes & edges). So if we have 'nodes' A={a1,a2,a3} and 'edges' B={B1,B2,B3} and then we make make a map d = {a1: (B1,B2), a2: (B2), a3: (B1,B2,B3)} Then G(A,B,d) is the hypergraph. The representation can be inverted too, so that B becomes the nodes and A the edges (if you start out with an incidence matrix with actors A down the rows and events they attend B across the columns the first form is the ties between the actors and the second is the ties between the This hypergraph has a As said before, a graph is a special case of a hypergraph where the number of points 'in' each edge is 2 (i.e. len(d[Bi])=2 for all i); and of course, a directed graph is a special case of that where d[Bi][0] is taken to be the sender and d[Bi][1] the receiver.

As far as the API goes, it should be be both graph-theoretic and matrix-(uh?)-theoretic at once, because both representations are equally useful (though if I had to pick I'd pick graph because that is what this API is for, networks, not matricies).

It has to support attributes on the nodes/edges _with ease_, to be able to support valued/signed graphs, as well as to provide a convienient hook for interfacing to other things (for example, colouring and labeling nodes/edges in generated diagrams)

The nodes/edges could be arbitrary python objects. Besides just allowing python's natural expressive freedom to be made full use of, this would solve the attributes problem with ease: An implementation wrinkle would be that the map would have to distinguish between seperate nodes/edges even though the actual items might have identical values, such as in the case of a signed graph where all edges are either -1 or +1 (perhaps index them when they get entered, and store them as (obj, index)?). This is basically dschult's suggestion, but I want to flesh out the bit about "it's up to the user..."; for example, if a library was written to calculate various measures on the graph it would be important that the values of the edges are numeric. (/Nick)

Also: +1 for standard API so that algorithms can be described in a standard way. Also: I see that the requirement about allowing "Object" has already been raised on usenet Also: I missed a type of graph, the multigraph, where more than one type of relation is defined. You can fake this with multiple separate graphs, and in many ways I prefer this (then you can attach descriptive data to each relation...) but I can see that some algorithms might be more naturally expressed as operating on a multigraph. (Nick)

(Multigraphs are mentioned. - Magnus)

Specifics

Put potential specifics here, that is, what functionality an API must cover. For now, it is completely permissible to have mutually exclusive items here, as nothing is anywhere near fixed yet.

Static graphs:

Node membership: Is a node part of the graph?
Node iteration: Iterate over all the nodes.
Edge membership: Is an edge part of the graph?
Edge iteration: Iterate over all the edges.
Adjacency iteration: Iterate over all the neighbors of a node.
Incidence iteration: Iterate over all the edges incident to a node (out- and in-edges separately?)
Node decoration: labels, weights, colors, capacities (can be done with external maps)
Edge decoration: labels, weights, colors, capacities

(dschult) would add:

Size of Graph: Report the number of edges in the graph.
Order of Graph: Report the number of nodes in the graph.
Degree of node: Report the number of neighbors for a node (or sum of edge weights incident on the node).

"well this is tricky; the edges aren't neccessarily numbers, remember? (nick)"

(I wasn't talking about numbers. "The number of X" means "how many Xs" there are. - Magnus)

Accessing incident edges based on their labels can also be useful (e.g. in FSAs). "again tricky because it requires defining that a node/edge ""must"" have a label. Does this mean having class LabeledGraph(Graph)? (nick)"

(The access here would, of course, only be available if the edges or labels do indeed have labels; not otherwise. - Magnus)

Dynamic graphs:

Node addition
Node deletion
Edge addition
Edge deletion
Node decoration modification
Edge decoration modification
Edge contraction (Can't this be done using the add and delete above? Do we want to get this specialized? (dschult))
Node and edge hiding?

"these all seem to revolve around the notion of a graph-update operation of some sort. So then the question is how to record what specific operation to perform? Also, do we just record which event comes before the others or do we attach a time to each? (nick)"

(Not sure what you mean here. Why would we need to record anything like this? Why not simply modify the graph? That was the idea... - Magnus)

It may be that a given representation only supports a subset of the given operations, and that subset might not necessarily be easy to predict. For example, even though it might seem that neighbor iteration and adjacency testing go hand in hand, for implicit graph representation they are quite separate. For such a representation, you can only check whether two given nodes are neighbors; to iterate over all the neighbors of a node, you'd have to iterate over all the nodes in the graph.

Basic set operations (and more advanced operations, such as cartesian product) might be useful for graphs, but needn't be part of the standard.

Should there be separate functionality for directed and undirected graphs? Should all graphs be chain graphs (which allow both directed and undirected edges)? Should two-way directed edge be treated as an undirected edge (and vice versa)? "Chain graphs are the most general, so we must have them. However, would they be a superset or simply next to normal graphs? (nick)"

(Chain graphs are a generalization of both directed and undirected graphs. - Magnus)

How do we handle the proliferation of other graph classes (such as multigraphs, pseudographs, ...), all of which are only minor tweaks of the basic structure? How about more radical departures such as hypergraphs (which allow an arbitrary number of nodes per edge)? "oops, didn't read this far. Well you know my answer: make every graph a hypergraph. (Nick)"

One possibility would be to base the API definition on the [http://www.gupro.de/GXL Graph eXchange Langage] (GXL). It is derived from many separate graph representation languages and seems quite complete (it even supports hypergraphs and hierarchical graphs!), logical, and well-thought-out. The language itself consists of an XML and a UML notation, but it is based on a clearly defined [http://www.gupro.de/GXL/GraphModel/graphModel.html object model/ADT], which we could adapt to a Python interface. I'm not sure about how (or whether) they model graph modification (although I think there is some support for transformations of some kind) but that could probably be done as a simple, logical extension. We might not want to support all of the GXL model, of course. We might even want to define several layers of standard compliance, with Layer 0 being only the current de facto standard graph API, for example. (Magnus)

With the goal of designing an API for Python in mind, I would think one of the first decisions that needs to be made is whether to use multiple classes (ala Vertex, Edge, Graph) or just one (ala Graph); also, are those classes going to be "Pythonic" in that they provide various methods in terms of Python operators or not (let's call the "not" alternative "Textbookish")? It seems to me that without settling this first, discussing the details of the API further is moot. But maybe I am wrong? (Peter)

No, defining Vertex and Edge interfaces makes working with graphs just that much more difficult. That's the Java way, don't go there. (Nick)

Well, the "Java way" can also be the "OO way" and thus could also be the "Python way". Theoretically a Graph G is defined a set of vertices and a set of Edges. G=(V,E). (Nick V.)

The only requirement on a vertex is that it can be represented in a set, which I think maps to supporting testing for equality and hashing. In Python, this does not require a dedicated Vertex class or interface.

I tend to prefer/develop graph implementations that support representing the nodes with arbitrary objects:

g=Graph()

some_object=SomeObject(*args1, **kwds1 )

some_object2=SomeObject(*args2, **kwds2)

g.add_vertex( some_object1 )

g.add_vertex( some_object2 )

new_edge=g.add_edge( some_object1, some_object2)

So that I do not have to maintain a mapping from the objects (that I care about) to some integers that uniquely label the nodes (which I don't care about). (/David McNamara)

Vertices and Edges as Separate Objects

1. I think a Vertex and an Edge as objects are meaninful and useful. One should be able to create v1=Vertex(1) and v2=Vertex(2) and then e12=Edge(v1,v2). Then do G.add_edge(e12), which should be no different than doing G.add_vertex(v1); G.add_vertex(v2); G.add_edge(e12). In this respect this is a departure from the NetworkX idea of using integers as nodes and 2-tuples of integers as edges, having the user map back and forth between nodes and data.

2. Whatever the repsentation used, the graph could expose a set of nodes and edges like graph.nodes and graph.edges. A new graph could be built from an existing set of nodes and edges: newG=Graph(G.nodes,G.edges). .nodes and .edges could be provided on-the-fly by property descriptors, regardless if a matrix or adj. list is used for internal representation. If the graph is connected, the graph could be exactly represented by G.edges

3. A Vertex object would have an .id attribute that could uniquely identify it, and provide an eq and hash magic methods. This way it could be inserted into a set. It could also a have a label or reference to some data. Perhaps there could be a "tag" attribute which would be similar to what the Tk Canvas has. A tag could be anything like a color value, a 'visited' flag.

4. A Vertex object should not reference its edges. Edges should reference its vertices, because that is what an 'edge' is -- a relation between 2 vertices. So Edges will contain references to its nodes. An Edge object could have a 'from_vertex' and a 'to_vertex' attributes, a data attribute which could be the weight for example, and also a "tag" attribute like the Vertex.

5. Providing a way to obtain edges to neighbour vertices in this scheme would ammount to an exhaustive search of the edges set. Therefore edges would be held in a nested dict like edges[v1][v2]=edge12 . To quickly access the edges to neighbours of v1 would be G.edges[v1].values() (Nick V.)

6. We must note that Edges and Vertices are defined as sets in standar graph theory, so, we must also think how to joind the graph data structure to the standard python (set) builtin data type. (Jose Antonio Martin H.)

Hypergaphs

As far as hypergraphs are concerned, a hypergraph is equivalent to a bi-partite graph, with two different set of nodes. One represents the regualar nodes, the other a set of edges (<- these are not edges of the Graph itself, but separate Edge(s) created earlier. So it is probably not necessary to include a separate mechanism for hypergraphs. (Nick V.)

Contributors

(Add your name to this list if you contribute.)

Magnus Lie Hetland
Istvan Albert
Paul Moore
Van Lindberg
Dan Schult
Peter Froehlich
Aric Hagberg
Nick Guenther
Nick Vatamaniuc
Jose Antonio Martin H
Beracah Yankama
David McNamara

(See also the Usenet discussion above for more contributors.)

-  ⇤ ← Revision 35 as of 2007-08-09 13:35:48 → 
  Size: 21940
  Editor: uwal206
  Comment:
+   ← Revision 36 as of 2007-08-14 16:57:39 → ⇥
  Size: 22685
  Editor: 12
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 250:
+The only requirement on a vertex is that it can be represented in a set, which I think maps to supporting testing for equality and hashing.  In Python, this does not require a dedicated Vertex class or interface.

I tend to prefer/develop graph implementations that support representing the nodes with arbitrary objects:

g=Graph()

some_object=SomeObject(*args1, **kwds1 )

some_object2=SomeObject(*args2, **kwds2)

g.add_vertex( some_object1 ) 

g.add_vertex( some_object2 )

new_edge=g.add_edge( some_object1, some_object2)

So that I do not have to maintain a mapping from the objects (that I care about) to some integers that uniquely label the nodes (which I don't care about). (/David McNamara)
-Line 284:
+Line 303:
+  * David McNamara

Page

User