TRegex - Truffle Regular Expression Language

This Truffle language represents classic regular expressions. It treats a given regular expression as a "program" that you can execute to obtain a regular expression matcher object, which in turn can be used to perform regex searches via Truffle interop.

The expected syntax is options/regex/flags, where options is a comma-separated list of key-value pairs which affect how the regex is interpreted, and /regex/flags is equivalent to the popular regular expression literal format found in e.g. JavaScript or Ruby.
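As an illustration of the options/regex/flags layout, the following minimal sketch assembles such a source string from its three parts. The class `TRegexSourceString` and its `build` method are hypothetical helpers introduced here and are not part of TRegex; the sketch also assumes that each option is written as `key=value`, as in the parsing example in this document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public final class TRegexSourceString {

    // Hypothetical helper (not a TRegex API): joins the options with commas
    // and appends "/regex/flags" to form "options/regex/flags".
    public static String build(Map<String, String> options, String pattern, String flags) {
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<String, String> option : options.entrySet()) {
            // Assumption: every option is spelled key=value.
            joined.add(option.getKey() + "=" + option.getValue());
        }
        return joined + "/" + pattern + "/" + flags;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("Flavor", "ECMAScript");
        options.put("Encoding", "UTF-16");
        System.out.println(build(options, "(a|(b))c", "i"));
    }
}
```

The helper does no escaping or validation; it only shows where the separators go.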

Parsing

When parsing a regular expression, TRegex returns a Truffle CallTarget. Calling it has one of the following outcomes:

  • it returns a (Truffle) null value, indicating that TRegex cannot handle the given regex.
  • it returns a "compiled regex" object (RegexObject), which can be used to match the given regex.
  • it throws a Truffle exception of type PARSE_ERROR, indicating a syntax error in the regex.

An example of how to parse a regular expression:

Source source = Source.newBuilder("regex", "Flavor=ECMAScript/(a|(b))c/i", "myRegex").mimeType("application/tregex").internal(true).build();
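// Initialized to null so that the variable is definitely assigned after the try block.
Object regex = null;
try {
    // parseInternal returns a CallTarget; calling it yields a RegexObject or a Truffle null value.
    regex = getContext().getEnv().parseInternal(source).call();
} catch (AbstractTruffleException e) {
    if (InteropLibrary.getUncached().getExceptionType(e) == ExceptionType.PARSE_ERROR) {
        // handle parser error
    } else {
        // fatal error, this should never happen
    }
}
// regex is still a Java null here if parsing failed; only ask interop about real values.
if (regex != null && InteropLibrary.getUncached().isNull(regex)) {
    // regex is not supported by TRegex, fall back to a different regex engine
}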

The compiled regex object

A RegexObject represents a compiled regular expression that can be used to match against input strings. It exposes the following properties:

  • pattern: the source string of the compiled regular expression.
  • flags: an object representing the set of flags passed to the regular expression compiler, depending on the flavor of regular expressions used.
  • groupCount: the number of capture groups present in the regular expression, including group 0.
  • groups: a map of all named capture groups to their respective group number, or a null value if the expression does not contain named capture groups.
  • exec: an executable method that matches the compiled regular expression against a string. The method accepts two parameters:
    • input: the character sequence to search in. This may either be a Java String, or a Truffle Object that behaves like a char-array.
    • fromIndex: the position to start searching from.
    • The return value is a RegexResult object.

The result object

A RegexResult object represents the result of matching a regular expression against a string. It is returned by a RegexObject's exec method and has the following members:

  • boolean isMatch: true if a match was found, false otherwise.
  • int getStart(int groupNumber): returns the position where the capture group with the given number begins. Capture group number 0 denotes the boundaries of the entire match. If the given capture group did not take part in the match, the returned value is -1. If no match was found at all, the returned value is undefined.
  • int getEnd(int groupNumber): returns the position where the capture group with the given number ends. Capture group number 0 denotes the boundaries of the entire match. If the given capture group did not take part in the match, the returned value is -1. If no match was found at all, the returned value is undefined.

Compiled regex usage example in pseudocode:

regex = <matcher from previous example>
assert(regex.pattern == "(a|(b))c")
assert(regex.flags.ignoreCase == true)
assert(regex.groupCount == 3)

result = regex.exec("xacy", 0)
assert(result.isMatch == true)
assertEquals([result.getStart(0), result.getEnd(0)], [ 1,  3])
assertEquals([result.getStart(1), result.getEnd(1)], [ 1,  2])
assertEquals([result.getStart(2), result.getEnd(2)], [-1, -1])

result2 = regex.exec("xxx", 0)
assert(result2.isMatch == false)
// result2.getStart(...) and result2.getEnd(...) are undefined
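
The group-boundary convention in the pseudocode above can be tried out with plain `java.util.regex`, which reports the same values for this input. This is only a stand-in to illustrate the semantics: it does not use TRegex, and the class `CaptureGroupBounds` and its `bounds` method are hypothetical names introduced for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class CaptureGroupBounds {

    // Returns {start, end} of the given group for the first match found at or
    // after fromIndex, or null if the pattern does not match at all.
    // Uses java.util.regex as a stand-in, not TRegex.
    public static int[] bounds(Pattern pattern, String input, int fromIndex, int group) {
        Matcher matcher = pattern.matcher(input);
        if (!matcher.find(fromIndex)) {
            return null;
        }
        // start/end return -1 for a group that did not take part in the match.
        return new int[]{matcher.start(group), matcher.end(group)};
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(a|(b))c", Pattern.CASE_INSENSITIVE);
        for (int group = 0; group <= 2; group++) {
            int[] b = bounds(pattern, "xacy", 0, group);
            System.out.println("group " + group + ": [" + b[0] + ", " + b[1] + "]");
        }
        System.out.println("match in \"xxx\": " + (bounds(pattern, "xxx", 0, 0) != null));
    }
}
```

One difference to keep in mind: `Matcher.groupCount()` does not count group 0 and returns 2 for this pattern, whereas the groupCount property described above includes group 0 and is therefore 3.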

Available options

These options define how TRegex should interpret a given regular expression:

User options

  • Flavor: specifies the regex dialect to use. Possible values:
    • ECMAScript: ECMAScript/JavaScript syntax (default).
    • Python: Python 3 syntax.
    • Ruby: Ruby syntax.
  • Encoding: specifies the string encoding to match against. Possible values:
    • UTF-8
    • UTF-16 (default)
    • UTF-32
    • LATIN-1
    • BYTES (equivalent to LATIN-1)
  • Validate: don't generate a regex matcher object, just check the regex for syntax errors.
  • U180EWhitespace: treat 0x180E MONGOLIAN VOWEL SEPARATOR as part of \s. This is a legacy feature for languages using a Unicode standard older than 6.3, such as ECMAScript 6 and older.

Performance tuning options

  • UTF16ExplodeAstralSymbols: generate one DFA state per (16-bit) char instead of one per code point. This may improve performance in certain scenarios, but increases the likelihood of DFA state explosion.
  • AlwaysEager: do not generate any lazy regex matchers (lazy in the sense that they may lazily compute properties of a RegexResult).
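
For background on the char versus code point distinction that UTF16ExplodeAstralSymbols refers to, the following small sketch shows that a symbol outside the Basic Multilingual Plane (an "astral" symbol) occupies two 16-bit chars in UTF-16 but is a single code point. It says nothing about how TRegex builds its DFA; the class `AstralSymbolWidth` and its `utf16Units` method are hypothetical names used only for illustration.

```java
public final class AstralSymbolWidth {

    // Number of 16-bit chars needed to encode the given code point in UTF-16.
    public static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int astral = 0x1F600; // a code point outside the Basic Multilingual Plane
        String text = new String(Character.toChars(astral));

        System.out.println("16-bit chars: " + text.length());
        System.out.println("code points:  " + text.codePointCount(0, text.length()));
        // The two chars form a surrogate pair.
        System.out.println("high surrogate first: " + Character.isHighSurrogate(text.charAt(0)));
        System.out.println("low surrogate second: " + Character.isLowSurrogate(text.charAt(1)));
    }
}
```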

Debugging options

  • RegressionTestMode: exercise all supported regex matcher variants, and check if they produce the same results.
  • DumpAutomata: dump all generated parser trees, NFAs, and DFAs to disk. This generates debugging dumps of the most relevant data structures in JSON, GraphViz, and LaTeX format.
  • StepExecution: dump tracing information about all DFA matcher runs.

All options except Flavor and Encoding are boolean and false by default.
