Author (Edward C. Jones) writes:

SeeGramWrap parses a piece of C code and the resulting parse tree is output in man and machine readable form. The result can be used for program transformations. Since a particular trnsformation algorithm may not require all the information present in the tree, the user can select what to output.

This program has been written and tested only under linux.

Thanks to John Mitchell and Monty Zukowski for "cgram.tgz". Every parser generator need to have a good C grammar. Also thanks to Terrence Parr for ANTLR.

Note: Also uploaded

CONTEXT

A number of large C libraries have been wrapped so they can be called by Python. The wrapping code is repetitive and there may be a lot of it so methods have been developed for automated wrapping.

The best-known approach is SWIG. For complex wrappings, SWIG requires the writing of typemaps, an unintuitive process where pieces of C code you write are spliced into the wrapper code generated by SWIG.

Another wrapper related approach is Pyrex. Pyrex has its own repetitive boilerplate that has to be written. But the Pyrex boilerplate is so straightforward that it can be taught algorithmically.

I think that the Pyrex boilerplate is so straightforward that it can be machine generated. Therefore I have been sporatically developing software to do this. A thoroughly buggy version of this is on my web page. It is called "cgram.tar.gz" (The name will be changed). Look at it but don't use it. SeeGramWrap is a major revision of the front end of "cgram.tar.gz".

I think the automatic-wrapper program can be made to work. It might be easier to use than SWIG. It is still a lot of work to prepare complex C header files. What we have is really a "program transformation" or "tree transformation" problem.

I think some of the issues are:

1. Since parser generators have a long and steep learning curve, I prefer to use them as black boxes which generate parsers which output results that I can analyze using Python. The parser created by a parser generator should output trees in two formats: one easy to look at and another that a program can easily read. For examples, see below.

2. I find trees very easy to work with. I want the trees to be front and center and highly visible. I prefer to "manipulate a tree" rather than "fire a rule".

3. The most common type of C macro has a type as one of its arguments:

     #define CAST(x, type) (type *) x

How can these be automatically wrapped for Python which is a dynamically typed language?

TECHNICAL OVERVIEW

I use some C grammars associated with ANTLR. The grammar package is called cgram. See http://www.antlr.org/resources.html.

In "cgram" there is a java program "TestThrough.java" which parses C code into an AST then runs a tree grammar on the AST and outputs the original code. The tree grammar is named "GnuCEmitter.g". I work with this grammar because the terminal tokens are printed in the correct order. I modified the grammar turning it into a template. A piece of the original "GnuCEmitter.g" is:

typeQualifier
         :       a:"const"                       { print( a ); }
         |       b:"volatile"                    { print( b ); }
         ;

The modified version is:

typeQualifier
         :       a:"const"                       { <@ a @> }
         |       b:"volatile"                    { <@ b @> }
         ;

In this template, strings of the form "<@ ... @>" will each be replaced by a set of print statements. Moreover the entire rule will be wrapped by prints. The template is used in "emitter/insert_prints.py". If "insert_prints.py" is run the result is:

typeQualifier
   { if ( inputState.guessing==0 ) {
           print(Open);
           print("typeQualifier");
        }
     }
         :  (
                 a:"const"        {  print(Open); print("typeQualifier.0"); print( a ); print(Close); }
         |       b:"volatile"     {  print(Open); print("typeQualifier.1"); print( b ); print(Close); }
            )
   { currentOutput.print(Close + MyTokenSep); }
         ;

If the original C program , "temp2.c", is

     char* s = "ab";

The output of the modified emitter grammar is "temp2.c.data":

     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     externalList             declarator               expr
     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     externalDef              pointerGroup             primaryExpr
     <<OPEN>>                 <<OPEN>>                 <<OPEN>>
     declaration              pointerGroup.0           stringConst
     <<OPEN>>                 *                        <<OPEN>>
     declSpecifiers           <<CLOSE>>                stringConst.0
     <<OPEN>>                 <<CLOSE>>                "ab"
     typeSpecifier            <<OPEN>>                 <<CLOSE>>
     <<OPEN>>                 declarator.0             <<CLOSE>>
     typeSpecifier.1          s                        <<CLOSE>>
     char                     <<CLOSE>>                <<CLOSE>>
     <<CLOSE>>                <<CLOSE>>                <<CLOSE>>
     <<CLOSE>>                <<OPEN>>                 <<CLOSE>>
     <<CLOSE>>                initDecl.0               <<CLOSE>>
     <<OPEN>>                 =                        ;
     initDeclList             <<CLOSE>>                <<CLOSE>>
     <<OPEN>>                 <<OPEN>>                 <<CLOSE>>
     initDecl                 initializer              <<CLOSE>>

This output can be processed by "tree.py" to produce "temp2.c.nest"

(externalList,
   (externalDef,
     (declaration,
       (declSpecifiers,
         (typeSpecifier,
           (typeSpecifier.1, |char|))),
       (initDeclList,
         (initDecl,
           (declarator,
             (pointerGroup,
               (pointerGroup.0, |*|)),
             (declarator.0, |s|)),
           (initDecl.0, |=|),
           (initializer,
             (expr,
               (primaryExpr,
                 (stringConst,
                   (stringConst.0, |"ab"|))))))), |;|)))

or "temp2.c.src":

     char * s = "ab" ;

If "temp2.c.src" is put through the entire process itself we get "temp2.c.src.src" which is identical to "temp2.c.src". This test is done by "docheck.py".

In the ".data" or ".nest" files the tokens from the original C code are in the correct order. It is easy to recover

     ('char', '*', 's', '=', '"ab"', ';')

SeeGramWrap (last edited 2008-11-15 14:00:14 by localhost)

Unable to edit the page? See the FrontPage for instructions.