ParseGen

ParseGen is a C++17 library for working with ASCII text formats. For example, it can parse subsets of formats like XML and YAML. Its main use is as a toolbox for defining new text languages and parsing them.

A presentation on ParseGen can be downloaded here

Theory

ParseGen is based on classical language theory, including regular languages, regular expressions, finite automata, context-free grammars, and LALR(1) parsing. Tokens are described by regular expressions and languages are expressed as context-free grammars. ParseGen uses finite automaton theory to build table-based lexers, and uses a fast algorithm from the scientific literature to build a table-based shift-reduce parser for a context-free grammar.
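To make the idea of a table-based lexer concrete, here is a minimal sketch of a hand-written finite automaton driven by a transition table. It is not ParseGen's implementation or API: the names `token_kind`, `classify`, `transitions`, and `tokenize` are hypothetical, and the table is written by hand rather than generated from regular expressions.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token kinds for this sketch (not ParseGen names).
enum class token_kind { integer, identifier };

struct token {
  token_kind kind;
  std::string text;
};

// Character classes index the columns of the transition table.
enum char_class { cc_digit = 0, cc_letter = 1, cc_other = 2 };

inline char_class classify(char c) {
  unsigned char u = static_cast<unsigned char>(c);
  if (std::isdigit(u)) return cc_digit;
  if (std::isalpha(u) || c == '_') return cc_letter;
  return cc_other;
}

// Rows are states: 0 = start, 1 = inside integer, 2 = inside identifier.
// An entry of -1 means "no transition", which ends the current token.
constexpr int transitions[3][3] = {
    /* start      */ {1, 2, -1},
    /* integer    */ {1, -1, -1},
    /* identifier */ {2, 2, -1},
};

inline bool is_accepting(int state) { return state == 1 || state == 2; }

// Splits the input into tokens, skipping single spaces between them.
std::vector<token> tokenize(std::string_view input) {
  std::vector<token> out;
  std::size_t i = 0;
  while (i < input.size()) {
    if (input[i] == ' ') {
      ++i;
      continue;
    }
    int state = 0;
    std::size_t j = i;
    // Follow table transitions for as long as one exists.
    while (j < input.size()) {
      int next = transitions[state][classify(input[j])];
      if (next < 0) break;
      state = next;
      ++j;
    }
    if (!is_accepting(state)) {
      throw std::runtime_error("unexpected character in input");
    }
    assert(j > i);  // an accepting state implies at least one character was consumed
    out.push_back({state == 1 ? token_kind::integer : token_kind::identifier,
                   std::string(input.substr(i, j - i))});
    i = j;
  }
  return out;
}
```

Calling `tokenize("x1 42")` yields an identifier token `x1` followed by an integer token `42`. A generated lexer would build the equivalent of `transitions` automatically from the token regular expressions.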

While we acknowledge that many text formats today are not based cleanly on context-free language theory, we believe there is a niche to be filled by a library like ParseGen that is still based on classical theory.

Implementation

The key contribution of ParseGen to the software community is to implement these parser generators as pure C++ objects and functions. Tools like Flex and Bison, by contrast, take custom input formats describing languages and output generated source code that must then be compiled to obtain a parser. In ParseGen, a language can be built at runtime as a C++ object, and a parser object can then be constructed for that language and executed repeatedly to parse C++ streams. This offers maximum flexibility in the workflows users can employ.
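The workflow described above can be illustrated with a small stand-in. The sketch below does not use ParseGen's API: `toy_language`, `toy_token_rule`, `toy_lexer`, and `make_assignment_language` are hypothetical names, and it uses `std::regex` with first-match-wins tokenizing in place of generated tables. It only shows the shape of the idea: a language description is ordinary data assembled at runtime, and a reusable object built from it is run on several streams with no code generation step.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not ParseGen's classes.
struct toy_token_rule {
  std::string name;
  std::string regex;
};

struct toy_language {
  std::vector<toy_token_rule> tokens;  // filled in at runtime
};

class toy_lexer {
 public:
  // Compiles every token rule once, when the lexer object is constructed.
  explicit toy_lexer(toy_language const& lang) {
    for (auto const& rule : lang.tokens) {
      assert(!rule.regex.empty());
      compiled_.emplace_back(rule.name, std::regex(rule.regex));
    }
  }

  // Returns the names of the tokens found in the stream, skipping spaces
  // and newlines. The first rule that matches at the current position wins.
  std::vector<std::string> run(std::istream& stream) const {
    std::string text((std::istreambuf_iterator<char>(stream)),
                     std::istreambuf_iterator<char>());
    std::vector<std::string> names;
    auto it = text.cbegin();
    while (it != text.cend()) {
      if (*it == ' ' || *it == '\n') {
        ++it;
        continue;
      }
      bool matched = false;
      for (auto const& [name, re] : compiled_) {
        std::smatch m;
        if (std::regex_search(it, text.cend(), m, re,
                              std::regex_constants::match_continuous) &&
            m.length(0) > 0) {
          names.push_back(name);
          it += m.length(0);
          matched = true;
          break;
        }
      }
      if (!matched) throw std::runtime_error("no token rule matches input");
    }
    return names;
  }

 private:
  std::vector<std::pair<std::string, std::regex>> compiled_;
};

// Builds a tiny language description at runtime.
toy_language make_assignment_language() {
  toy_language lang;
  lang.tokens.push_back({"number", "[0-9]+"});
  lang.tokens.push_back({"name", "[A-Za-z_][A-Za-z0-9_]*"});
  lang.tokens.push_back({"equals", "="});
  return lang;
}
```

A single `toy_lexer` constructed from `make_assignment_language()` can be run on one `std::istringstream` holding `x = 42` and then on another holding `y1 = z`, returning `name, equals, number` and `name, equals, name` respectively.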

Frequently Asked Questions

How do I build it?

ParseGen uses the CMake build system and tries to be a "standard modern CMake package" as much as possible.

Where do I start with the API?

The two most important classes in ParseGen are the language, which fully describes a text language, and the parser, which parses text in that language according to user-defined rules that react to the syntactic constructs it observes. The file src/parsegen_calc.cpp is a great introductory example that builds a command-line calculator app using ParseGen.

Why C++?

C++ is the language of choice in the HPC community that originated this code.

Why C++17?

We use the std::any feature of C++17 to allow users to return any object as the result of parsing some text.
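As a rough illustration of why `std::any` is useful here, the sketch below shows callbacks that each return a different type wrapped in `std::any`. The function names and signatures (`shift_number`, `reduce_sum`, `reduce_pair`) are hypothetical and are not ParseGen's actual callback interface; only `std::any` and `std::any_cast` are standard C++17.

```cpp
#include <any>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical callback: turn the text of a number token into a value.
std::any shift_number(std::string const& text) {
  return std::stod(text);  // stored in the std::any as a double
}

// Hypothetical callback for a rule shaped like: expr '+' expr.
// rhs holds one std::any per symbol on the right-hand side.
std::any reduce_sum(std::vector<std::any>& rhs) {
  assert(rhs.size() == 3);
  double a = std::any_cast<double>(rhs.at(0));
  double b = std::any_cast<double>(rhs.at(2));
  return a + b;
}

// Hypothetical callback for a rule shaped like: expr ',' expr.
// It returns a different type, which std::any carries equally well.
std::any reduce_pair(std::vector<std::any>& rhs) {
  assert(rhs.size() == 3);
  double a = std::any_cast<double>(rhs.at(0));
  double b = std::any_cast<double>(rhs.at(2));
  return std::vector<double>{a, b};
}
```

The caller recovers the concrete type with `std::any_cast`, which throws `std::bad_any_cast` if the requested type does not match what was stored.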

Why ASCII only, why not Unicode?

So far, avoiding Unicode support has allowed a simple design, and none of the formats we target really need Unicode. However, we welcome any contributions that move us towards Unicode support.

At Sandia, ParseGen is SCR# 2564.0

