Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

sandialabs/parsegen-cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ParseGen

ParseGen is a C++17 library for working with ASCII text formats. For example, it can parse subsets of formats like XML and YAML. ParseGen's utility is that it is a toolbox for defining new text languages and parsing them.

A presentation on ParseGen can be downloaded here

Theory

ParseGen is based on classical language theory including regular languages, regular expressions, finite automata, context-free grammars, and LALR(1) parsing. Tokens are described by regular expressions and languages are expressed as context-free grammars. ParseGen use finite automaton theory to build table-based lexers, and uses a fast algorithm from scientific literature to build a table-based shift-reduce parser for a context-free grammar.

While we acknowledge that many text formats today are not based cleanly on context-free language theory, we believe there is a niche to be filled by a library like ParseGen that is still based on classical theory.

Implementation

The key contribution of ParseGen to the software community is to implement these parser generators are pure C++ objects and functions, as opposed to tools like Flex and Bison which have custom input formats describing languages and output generated source code that must then be re-compiled to obtain a parser. A language can be built at runtime as a C++ object in ParseGen and then a parser object can be constructed for that language and executed repeatedly to parse C++ streams. This offers maximum flexibility in the workflows users can employ.

Frequently Asked Questions

  1. How do I build it?

ParseGen uses the CMake build system and tries to be a "standard modern CMake package" as much as possible.

  1. Where do I start with the API?

The two most important classes in ParseGen are the language, which fully describes a text language, and the parser, which parses a language according to user-defined rules that react to syntactic constructs observed. The file src/parsegen_calc.cpp is a great introductory example that builds a command-line calculator app using ParseGen.

  1. Why C++?

C++ is the language of choice in the HPC community that originated this code.

  1. Why C++17?

We use the std::any feature of C++17 to allow users to return any object as the result of parsing some text.

  1. Why ASCII only, why not Unicode?

So far avoiding Unicode support has allowed a simple design and none of the formats we target really need Unicode. However, we welcome any contributions that move us towards Unicode support.

At Sandia, ParseGen is SCR# 2564.0

Morty Proxy This is a proxified and sanitized view of the page, visit original site.