srcML: A document-oriented XML representation of source code

posted by: LeoraJJohnson




The new version of the srcML toolkit includes features for direct evaluation of XPath queries through the toolkit (no external tools needed) and for efficient XSLT transformation of entire projects stored in the srcML archive format (an entire project in a single srcML file). You can try these features out using the trunk build of the toolkit.


srcML is a combination of source code (text) and selective AST information (tags) in a single XML document:

<unit xmlns="" xmlns:cpp="" language="C++" filename="ex.cpp">
<comment type="line">// copy the input to the output</comment>
<while>while <condition>(<expr><name><name>std</name>::<name>cin</name></name> &gt;&gt; <name>n</name></expr>)</condition>
  <expr_stmt><expr><name><name>std</name>::<name>cout</name></name> &lt;&lt; <name>n</name> &lt;&lt; '\n'</expr>;</expr_stmt></while>
</unit>

The focus is to construct a document representation in XML instead of a more traditional data representation of the source code. The representation of source code as semi-structured text supports a programmer-centric rather than a compiler-centric view, providing full access to the source code at the lexical, documentary (e.g., comments, white space), structural (e.g., classes, functions), and syntactic (e.g., statement) levels.

What follows is an explanation of the characteristics of the srcML format and the accompanying toolkit. The srcML Toolkit includes src2srcml and srcml2src (GPL License) for conversion to and from srcML. The research publications, primarily on the format, are listed at the end.



Preservation of all source-code text, e.g., comments, formatting (white space), and preprocessor directives, in the original document ordering, allowing full access to the source code at the lexical and documentary levels, with an equivalent forward and reverse mapping between source code and srcML. These elements are identified for further processing by development environments and program-comprehension tools.
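
As a rough illustration of the reverse mapping, the sketch below strips the tags from a srcML fragment and checks that the original text comes back unchanged. It uses Python's standard xml.etree.ElementTree rather than the srcML toolkit, and the markup is a hand-written copy of the example above with the namespace declarations left out; both choices are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# The C++ text we expect to recover, including the comment and indentation.
SOURCE = (
    "// copy the input to the output\n"
    "while (std::cin >> n)\n"
    "  std::cout << n << '\\n';\n"
)

# Hand-written srcML-style markup for SOURCE (namespace declarations omitted).
SRCML = (
    '<unit language="C++" filename="ex.cpp">'
    '<comment type="line">// copy the input to the output</comment>\n'
    "<while>while <condition>(<expr><name><name>std</name>::<name>cin</name></name>"
    " &gt;&gt; <name>n</name></expr>)</condition>\n"
    "  <expr_stmt><expr><name><name>std</name>::<name>cout</name></name>"
    " &lt;&lt; <name>n</name> &lt;&lt; '\\n'</expr>;</expr_stmt></while>\n"
    "</unit>"
)


def srcml_to_text(srcml: str) -> str:
    """Drop every tag and keep the character data in document order."""
    root = ET.fromstring(srcml)
    return "".join(root.itertext())


recovered = srcml_to_text(SRCML)
print(recovered == SOURCE)
```

Because all source text lives in character data and none in attributes, concatenating the text nodes is enough to get the source back in this sketch.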

Tags for comments, preprocessor directives, statements, and other syntax allow source code to be accessed through XML at the documentary, structural, and syntactic levels. These levels can be addressed using XPath, e.g., /unit/while/condition. Round-trip transformation (i.e., source code to srcML to source code) can utilize XML transformation languages and tools.
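
The path /unit/while/condition can be tried out with any XPath-capable library. The sketch below uses the limited XPath subset in Python's xml.etree.ElementTree, which is an assumption of this example and not part of the srcML toolkit. The markup is hand-written without namespace declarations; a real srcML document declares a default namespace, so element names would then need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Hand-written srcML-style markup (namespace declarations omitted).
SRCML = (
    '<unit language="C++" filename="ex.cpp">'
    '<comment type="line">// copy the input to the output</comment>\n'
    "<while>while <condition>(<expr><name><name>std</name>::<name>cin</name></name>"
    " &gt;&gt; <name>n</name></expr>)</condition>\n"
    "  <expr_stmt><expr><name><name>std</name>::<name>cout</name></name>"
    " &lt;&lt; <name>n</name> &lt;&lt; '\\n'</expr>;</expr_stmt></while>\n"
    "</unit>"
)

root = ET.fromstring(SRCML)  # root is the <unit> element

# ElementTree paths are relative to the root element, so the absolute
# path /unit/while/condition is written as ./while/condition here.
conditions = root.findall("./while/condition")

# The source text of each condition is the concatenation of its text nodes.
condition_texts = ["".join(c.itertext()) for c in conditions]
print(condition_texts)  # ['(std::cin >> n)']

# Simple (non-nested) names anywhere in the unit, in document order.
simple_names = [n.text for n in root.iter("name") if len(n) == 0]
print(simple_names)
```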

Opportunistic use of XML technologies: addressing with XPath, querying with XPath and XQuery, transformation with DOM, SAX, JDOM, XOM, TextReader, XSLT, and STX, and validation with schema languages DTD and RelaxNG. The srcML format is not tied to any specific XML technology and should be compatible with any XML tools and standards developed in the future.
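
To show that the same document works with an event-based API as well as a tree-based one, here is a minimal SAX sketch that extracts comments while streaming. It relies on Python's standard xml.sax module; the CommentCollector class is a hypothetical helper written for this example, and the markup is again hand-written without namespace declarations.

```python
import xml.sax

# Hand-written srcML-style markup (namespace declarations omitted).
SRCML = (
    '<unit language="C++" filename="ex.cpp">'
    '<comment type="line">// copy the input to the output</comment>\n'
    "<while>while <condition>(<expr><name><name>std</name>::<name>cin</name></name>"
    " &gt;&gt; <name>n</name></expr>)</condition>\n"
    "  <expr_stmt><expr><name><name>std</name>::<name>cout</name></name>"
    " &lt;&lt; <name>n</name> &lt;&lt; '\\n'</expr>;</expr_stmt></while>\n"
    "</unit>"
)


class CommentCollector(xml.sax.ContentHandler):
    """Collects the text of every <comment> element as it streams past."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._parts = None  # text chunks while inside a comment, else None

    def startElement(self, name, attrs):
        if name == "comment":
            self._parts = []

    def characters(self, content):
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == "comment" and self._parts is not None:
            self.comments.append("".join(self._parts))
            self._parts = None


handler = CommentCollector()
xml.sax.parseString(SRCML.encode("utf-8"), handler)
print(handler.comments)  # ['// copy the input to the output']
```

A streaming handler like this never holds the whole document in memory, which matters for the large single-file archives described further down.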

Representation and toolkit robust to source-code irregularities, e.g., uncompilable code, code fragments, single statements, and single files, with representation based on local document information only, i.e., no symbol table is used. Parser based on the concept of Island Grammars for robustness. Complete handling of encoding issues (e.g., ISO-8859-1, UTF-8).

Scalable storage and translation with reasonable file sizes, typically less than 4 times the size of the corresponding text file. The source-code-to-srcML translator (src2srcml) is a stream parser that supports event interfaces, with a translation speed of over 10 KLOC per second.

File and directory aware with metadata at the file level, i.e., language, file location, and version information. Compound format allows for multiple source-code files in one srcML document, e.g., storing the entire Linux kernel in a single srcML file.
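
A compound document can be split back into its files by walking the per-file elements. The sketch below assumes that each file is stored as a unit element nested directly inside an outer unit and carries language and filename attributes, as in the single-file example above; the archive content is invented for the illustration, and the files_in_archive helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-file archive: an outer <unit> wrapping one <unit> per file.
ARCHIVE = (
    "<unit>"
    '<unit language="C++" filename="a.cpp">'
    '<comment type="line">// file a</comment>\n'
    "<expr_stmt><expr><name>x</name></expr>;</expr_stmt>\n"
    "</unit>"
    '<unit language="Java" filename="B.java">'
    '<comment type="line">// file B</comment>\n'
    "</unit>"
    "</unit>"
)


def files_in_archive(srcml: str) -> dict:
    """Map each nested unit's filename to its recovered source text."""
    root = ET.fromstring(srcml)
    return {
        unit.get("filename"): "".join(unit.itertext())
        for unit in root.findall("./unit")
    }


files = files_in_archive(ARCHIVE)
languages = [u.get("language") for u in ET.fromstring(ARCHIVE).findall("./unit")]
for name, text in files.items():
    print(name, repr(text))
```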

Extensible format by adding attributes on existing elements and extending the element set. XML translation on the srcML format permits further refinement of parsing and markup.
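
One way to picture this kind of extension is a small pass that adds an attribute to existing elements without touching the source text. In the sketch below the words attribute and the annotate_comments helper are invented for this example and are not part of the srcML element set; the markup is hand-written without namespace declarations.

```python
import xml.etree.ElementTree as ET

# Hand-written srcML-style markup (namespace declarations omitted).
SRCML = (
    '<unit language="C++" filename="ex.cpp">'
    '<comment type="line">// copy the input to the output</comment>\n'
    "<expr_stmt><expr><name>n</name></expr>;</expr_stmt>\n"
    "</unit>"
)


def annotate_comments(srcml: str) -> ET.Element:
    """Add a hypothetical 'words' attribute to every <comment> element."""
    root = ET.fromstring(srcml)
    for comment in root.iter("comment"):
        text = "".join(comment.itertext())
        comment.set("words", str(len(text.split())))
    return root


annotated = annotate_comments(SRCML)
original_text = "".join(ET.fromstring(SRCML).itertext())
annotated_text = "".join(annotated.itertext())

print(annotated.find("comment").attrib)
print(annotated_text == original_text)
```

Since the extra information sits in an attribute, the text recovered from the annotated document is the same as before the pass.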


srcML Toolkit

The srcML toolkit includes src2srcml, a translator from source code to srcML, and srcml2src, a translator from srcML to source code. Actively developed, it currently supports C, C++, and Java, and is under a GPL license. The beta-Nov-09-2006 release is the most recent. Note: A very recent trunk build is available.

The major changes include many Java bug fixes, complete markup of struct definitions in typedefs, more control of namespaces, a new index element, and new options for extensions, information, and control. The extensions include markup for literal values and operators. More information can be found in the NEWS file.

  • NEWS, CHANGES, src2srcml man page, srcml2src man page
  • Linux (libxml2 version)
  • Windows: [zip], src2srcml.exe, srcml2src.exe. Note: Internet Explorer has problems downloading zip files.
  • MacOS (libxml2 version)
  • Source: [zip] [tar.gz]
  • DTD: C/C++ srcML, C/C++/Java srcML



The need for a format to permit easy extraction of comments, preprocessor directives, and other information from unprocessed source code was identified by Dr. Jonathan Maletic and Dr. Andrian Marcus. The srcML format was initially created by Dr. Michael Collard and Jonathan Maletic. The translator from source code to the srcML format and the completion of the set of srcML elements were the work of Huzefa Kagdi, with parsing infrastructure support by Michael Collard.

The srcML format and toolkit are currently developed and maintained by Michael Collard and continue to be used in research by SDML and others.



  • Addressing Source Code Using srcML by Collard, M.L.,
    IEEE International Workshop on Program Comprehension Working Session: Textual Views of Source Code to Support Comprehension (IWPC’05)
    St. Louis, Missouri, USA, May 15, 2005, 3 pages
  • Document-Oriented Source Code Transformation using XML by Collard, M.L., Maletic, J.I.,
    Proceedings of the 1st International Workshop on Software Evolution Transformation (SET’04)
    Delft, The Netherlands, November 9, 2004, pp. 11-14
  • Supporting Source Code Difference Analysis by Maletic, J.I., Collard, M.L.,
    Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM’04)
    Chicago, Illinois, September 11-17, 2004, pp. 210-219
  • Leveraging XML Technologies in Developing Program Analysis Tools by Maletic, J.I., Collard, M.L., Kagdi, H.,
    Proceedings of the 4th International Workshop on Adoption-Centric Software Engineering (ACSE’04)
    Edinburgh, Scotland, May 25, 2004, pp. 80-85
  • An Infrastructure to Support Meta-Differencing and Refactoring of Source Code by Collard, M.L.,
    Proceedings of the 18th IEEE International Conference on Automated Software Engineering (ASE’03)
    Montreal, Quebec, October 6-10, 2003, pp. 377-380
  • An XML-Based Lightweight C++ Fact Extractor by Collard, M.L., Kagdi, H., Maletic, J.I.,
    Proceedings of the 11th IEEE International Workshop on Program Comprehension (IWPC’03)
    Portland, Oregon, May 10-11, 2003, pp. 134-143
  • Supporting Document and Data Views of Source Code by Collard, M.L., Maletic, J.I., Marcus, A.,
    Proceedings of the 2nd ACM Symposium on Document Engineering (DocEng’02)
    McLean, Virginia, November 8-9, 2002, pp. 34-41
  • Source Code Files as Structured Documents by Maletic, J.I., Collard, M.L., Marcus, A.,
    Proceedings of the 10th IEEE International Workshop on Program Comprehension (IWPC’02)
    Paris, France, June 27-29, 2002, pp. 289-292

