Can regex be compiled into efficient machine code?

Question

Most languages that have regex have a regex parsing library that interprets the regex at runtime and matches them to strings. I am a fan of eliminating as much runtime overhead as possible, and as such, sometimes write string matching functions by hand instead of using regex.

What would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand? Why do most languages not do this? What would be the advantages or disadvantages of having regex as a core language feature as opposed to a library function?

I’m not too familiar with it, but Swift’s regexen are compiled to a bytecode at compile time, so you could look into how that works. — Bbrk24, Commented May 26, 2023 at 19:25
Are you asking specifically about compiling specialized code per compile-time-known regex pattern and sidestepping a regex engine for those? Or is this question also about techniques that use a generalized regex engine? — starball, Commented May 28, 2023 at 18:16
@starball Compiling a compile-time-known regex pattern into a string searching algorithm that behaves equivalently basically. Should be more efficient than parsing regex strings at runtime. That's the idea. — CPlus, Commented May 28, 2023 at 18:18
Please always edit clarifications into your question post instead of hiding them in the comments! Comments are for soliciting clarifications, not for providing them. — starball, Commented May 28, 2023 at 18:29
Efficient machine code is a high bar; there are many more regex parsing libraries that compile to inefficient machine code. — user1030, Commented Oct 13, 2023 at 15:15

kaya3 · Accepted Answer · 2023-05-26 20:21:25Z

Most languages that have regex have a regex parsing library that interprets the regex at runtime and matches them to strings.

This is mostly right, but "interpret" isn't really accurate. Regexes in most mainstream languages are compiled to a form which is efficient to check. The compilation happens at runtime, but each regex only needs to be compiled once and can be tested many times. For example, in Java you call Pattern.compile, and in Python it's re.compile. Even APIs which accept regexes as strings often cache the compiled form of the regex, so that it only needs to be compiled once for the whole run of the program.
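
As a small illustration of the compile-once, match-many pattern, here is a sketch in Python. The names WORD, count_words and count_words_uncompiled are made up for this example; re.compile and re.findall are the standard library calls.

```python
import re

# Compile the pattern once; the resulting object holds the compiled form
# and can be reused for any number of matches.
WORD = re.compile(r"[A-Za-z]+")

def count_words(lines):
    """Count word-like tokens across many lines with one compiled pattern."""
    return sum(len(WORD.findall(line)) for line in lines)

def count_words_uncompiled(lines):
    """Same result, but passing the pattern as a string each time.

    The module-level helpers look the compiled form up in an internal
    cache, so the pattern is not re-parsed on every call.
    """
    return sum(len(re.findall(r"[A-Za-z]+", line)) for line in lines)
```

Both functions return the same counts; the second one pays for a cache lookup on each call instead of holding the compiled object directly.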

That said, the compiled form of a regex is still generally not quite as fast as a direct compilation to machine code would be.

what would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?

The classical way to compile a regular expression is to convert it to a deterministic finite automaton (DFA) ─ a state machine. Compiling a regular expression to a DFA involves a few steps, but here's a typical procedure:

First build a non-deterministic finite automaton (NFA) using Thompson's construction,
Then convert the NFA to a DFA using the powerset construction,
Then, optionally, convert the DFA into a minimal equivalent DFA.
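
The middle step can be sketched in a few lines of Python. This is only an illustration under some assumptions: the NFA for (a|b)*abb is written out by hand rather than produced by Thompson's construction, the dictionary layout and all function names are invented for the example, and no minimisation is done.

```python
from collections import deque

EPS = None  # label used for epsilon (empty) transitions

def eps_closure(nfa, states):
    """All NFA states reachable from `states` via epsilon transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def powerset_construction(nfa, start, accepting, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_accepting).

    Each DFA state is a frozenset of NFA states.
    """
    d_start = eps_closure(nfa, {start})
    dfa, queue, seen = {}, deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            moved = set()
            for s in current:
                moved.update(nfa.get((s, ch), ()))
            target = eps_closure(nfa, moved)
            dfa[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    d_accepting = {s for s in seen if s & accepting}
    return dfa, d_start, d_accepting

def dfa_accepts(dfa, d_start, d_accepting, text):
    """Run the DFA; `text` must only contain symbols from the alphabet."""
    state = d_start
    for ch in text:
        state = dfa[(state, ch)]
    return state in d_accepting

# Hand-built NFA for (a|b)*abb, with one epsilon edge from "s" into state 0.
NFA = {
    ("s", EPS): {0},
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
DFA, D_START, D_ACCEPT = powerset_construction(NFA, "s", {3}, "ab")
```

Only the subsets that are actually reachable from the start state get built, which is why the result is usually far smaller than the full powerset.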

The resulting DFA can then evaluate the regular expression on a string of length n in O(n) time with a low coefficient ─ potentially just a single array lookup per character in the input string ─ regardless of how complicated the regular expression is. However, the DFA itself may take up a lot more memory for more complicated regular expressions.
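
To make the "array lookup per character" point concrete, here is a sketch of a table-driven matcher in Python for the pattern [0-9]+(\.[0-9]+)? on byte strings. The DFA is written by hand, and the state numbering, table layout and the name is_decimal are assumptions of this example. It uses two small lookups per byte (character class, then transition); a table with one column per byte value would reduce that to one, at the cost of a larger table.

```python
# Character classes: every byte is mapped to one of three columns.
DIGIT, DOT, OTHER = 0, 1, 2
CLASS_OF = [OTHER] * 256
for b in b"0123456789":
    CLASS_OF[b] = DIGIT
CLASS_OF[ord(".")] = DOT

# States: 0 = start, 1 = integer digits, 2 = just saw the dot,
#         3 = fraction digits, 4 = dead (no match possible any more).
TABLE = (
    (1, 4, 4),  # state 0
    (1, 2, 4),  # state 1
    (3, 4, 4),  # state 2
    (3, 4, 4),  # state 3
    (4, 4, 4),  # state 4
)
ACCEPTING = (False, True, False, True, False)

def is_decimal(data: bytes) -> bool:
    """Return True if the whole input matches [0-9]+(\\.[0-9]+)?."""
    state = 0
    for byte in data:
        state = TABLE[state][CLASS_OF[byte]]
    return ACCEPTING[state]
```

The loop body does not depend on how complicated the original regex was; only the size of TABLE does.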

Why do most languages not do this?

Because the classical approach has two really significant downsides which make it inapplicable for many real regexes. Firstly, it only works for regexes that are truly regular expressions in the formal sense, i.e. they recognise a regular language. But many features of modern regex engines allow regexes which are not true regular expressions ─ particularly backreferences. Secondly, in the worst case the resulting DFA is exponentially large in the length of the regex, so while the DFA is very efficient to execute, it is not at all efficient in terms of code size.
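
A standard textbook example of the blow-up is the family (a|b)*a(a|b){n-1}, "the n-th symbol from the end is a". Its NFA has n + 1 states, but the DFA has to remember the last n symbols. The following Python sketch counts the subsets reached by the powerset construction for that family; the bitmask encoding and the function name are choices made for this example only.

```python
def reachable_dfa_states(n):
    """Count DFA states the powerset construction reaches for
    (a|b)*a(a|b){n-1}.

    NFA states are 0..n: state 0 loops on both symbols and also steps to
    state 1 on 'a'; states 1..n-1 step to the next state on either symbol;
    state n is accepting and has no outgoing transitions.
    A subset of NFA states is stored as a bitmask (bit i = state i active).
    """
    full = (1 << (n + 1)) - 1
    start = 1  # only state 0 is active
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for ch in "ab":
            # Advance states 1..n-1 by one, drop whatever falls off the
            # end, and keep state 0 active because it loops on itself.
            nxt = (((subset & ~1) << 1) & full) | 1
            if ch == "a":
                nxt |= 2  # state 0 may additionally step to state 1
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)
```

Calling it for n = 1, 2, 3, 4, 5 gives 2, 4, 8, 16, 32: each extra position in the regex doubles the number of DFA states.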

That's not to say that there aren't other potential approaches to compiling regular expressions into machine code, without going via NFAs and DFAs. But:

Those other approaches aren't nearly as well-known;
The current state of the art (i.e. compiling at runtime to a form which enables efficient execution) can be pretty efficient regardless, so there's probably not that much room for improvement;
Compiling regexes to machine code at runtime requires some way of hooking that machine code into the program while it's executing, and most language implementations don't have a mechanism for this;
On the other hand, compiling regexes to machine code at compile-time would require the regex implementation to be part of the compiler, rather than part of the standard library, and wouldn't support dynamic construction of regexes at runtime.

starball · Accepted Answer · 2023-05-27 03:53:24Z

Can regex be compiled into efficient machine code?

In general, and absent any context about a specific language? Yes, it can. The deeper question is really about the context (design goals) and costs (working with tradeoffs).

At the library-level, one example is Hana Dusikova's compile-time regular expressions in C++. See also its WG21 proposal paper, GitHub repo, and website with links to various conference talks. It leverages C++'s compile-time facilities. I think it hits close to what you're talking about. From its proposal paper:

The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don’t need a runtime RE parser engine as the pattern is known during compilation in many common use cases. I think this breaks C++’s promise of “don’t pay for what you don’t use.” If the RE is known at compile time, the pattern should be checked during the compilation. The design of std::regex doesn’t allow for this as the RE input is a runtime string and syntax errors are reported as exceptions.

Notice a couple of things:

The proposal addresses a core design goal of the language.
The language already had features powerful enough to support an implementation outside the standard library.

What would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?

The answer is not so black and white. There can be space and time efficiency tradeoffs. Now that you have contextual information at each site that can be used for specialized code-generation, how much code do you inline? What do you inline and what don't you inline? What do you extract to common procedures and what impact does that have on costs from function calls? Inlining isn't always the best; there can be cases where more code sharing results in better instruction-cache usage. How much code gets generated? In what way does the runtime data cost relate to the inputs? How do all those costs compare to the costs of a generalized regex engine? At what point does generating optimized code for known-at-compile-time regexes become more costly in code size than just having a regex engine in the runtime environment?

If implemented at the library level (assuming that you care), the user would need a way to tell the library what they want to optimize for and which tradeoff they want, and the language would need to provide facilities powerful enough for library-level code to estimate the space and time costs. Even then, if the language specification and the language implementations are fairly separate, you'd hit a limit on how accurate those estimates can be with respect to what the implementation actually does (e.g. what it compiles to).

Why do most languages not do this?

Not all languages prioritize efficiency of codegen as a design goal enough to want this. That's a perfectly valid design choice. You really can't be everything. In fact, the general overarching trend in languages over time seems to be to move away from the hardware, losing some of its benefits, in favor of higher-level, more abstract languages and runtimes.

And a lot of runtime models are not really geared towards doing heavy compile-time optimization / specialization. For example, Java bytecode has a limited set of supported instructions, and that can further limit what kinds of code you can generate. At that point, it can become a conversation about adding deeper features to the language or its components (which has its own costs), or just shrugging your shoulders and leaving it to runtime implementations to detect patterns and optimize what they do under the hood (which has its limitations).

Sometimes it's just a matter of nobody caring enough or having enough time to put the work into implementing or specifying it yet. Things don't just magically appear, and a lot of people who work on language design are not doing that as their day-job. Life can get in the way. For languages that aren't designed by just one person, collaborating on things and making decisions as a group has its own challenges as well, such as just meeting at the same time (timezone things), agreeing on the same priorities, dealing with conflicts in effects on different use-cases, etc.

What would be the advantages or disadvantages of having regex as a core language feature as opposed to a library function?

As stated above, you could have the people implementing the language's compilers or interpreter optimizers implement the time and space tradeoff optimizations, instead of having something at the library level attempt those optimizations within its limitations, without that information ever propagating to the compilers or interpreter optimizers.

WhiteMist · Accepted Answer · 2023-07-12 16:56:00Z

What would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?

As @9072997 pointed out, regexes are only a builtin feature of a few languages like Perl, Ruby or Awk.

If regexes are a core feature of the language, then the implementation can treat regexes like any other language construct, such as control flow constructs, and compile regexes at compile time as it does with the rest of the program. Whether the language compiles to bytecode or native code, or whether the regex is compiled to regex bytecode (then interpreted by the regex engine/VM) or native code, is irrelevant in this matter.

If regexes are not a core feature of the language and are instead provided by a library, however, this need not be a strict impediment to compiling regexes at compile time. It could still be achieved if the language allowed arbitrary execution at compile time (Lisp macros can, for example, but compile-time execution does not need to be done through macros). Since most regex libraries can compile regexes (again, whether to regex bytecode or native code), you could call the regex_compile() function of your regex library at compile time. If regex_compile() returned a function pointer to native code and the host language were natively compiled, like C or Go, this function could then be statically linked into the executable.

If the language has macros, the regex could be transformed into regular program code by the macros. The macros could transform a regex expressed as an S-expression, but could also process a regex contained in a string. At least one regex library in Lisp does this.

coredump · Accepted Answer · 2023-07-11 09:31:09Z

If you restrict yourself to pure regular expressions (not Perl-like extended expressions), it is easier to write efficient code. A possible way to compile them is to first translate them into code in your language and compile that instead.
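
As a rough sketch of that idea, the following Python code turns a DFA into Python source text and then compiles that source. The DFA for ab*c is written by hand here, and generate_matcher_source, its dictionary format and the generated function name are all invented for the example. Python compiles the result to its own bytecode rather than machine code, so this only illustrates the translation step; in a natively compiled language the same generated code would go through the normal compiler.

```python
def generate_matcher_source(name, transitions, start, accepting):
    """Emit Python source for a function that runs the given DFA.

    `transitions` maps state -> {character: next_state}; a missing edge
    means "reject". Assumes `transitions` has at least one state.
    """
    lines = [f"def {name}(text):",
             f"    state = {start!r}",
             "    for ch in text:"]
    keyword = "if"
    for state, edges in transitions.items():
        lines.append(f"        {keyword} state == {state!r}:")
        if not edges:
            # No outgoing edges: any further input is a mismatch.
            lines.append("            return False")
        else:
            inner = "if"
            for ch, target in edges.items():
                lines.append(
                    f"            {inner} ch == {ch!r}: state = {target!r}")
                inner = "elif"
            lines.append("            else: return False")
        keyword = "elif"
    lines.append("        else: return False")
    lines.append(f"    return state in {set(accepting)!r}")
    return "\n".join(lines)

# Hand-written DFA for ab*c: 0 --a--> 1, 1 --b--> 1, 1 --c--> 2 (accepting).
SOURCE = generate_matcher_source(
    "match_ab_star_c",
    {0: {"a": 1}, 1: {"b": 1, "c": 2}, 2: {}},
    start=0,
    accepting={2},
)

# Compile the generated source and pull the function out of its namespace.
namespace = {}
exec(compile(SOURCE, "<generated>", "exec"), namespace)
match_ab_star_c = namespace["match_ab_star_c"]
```

Printing SOURCE shows an ordinary function made of nested if/elif branches, which is roughly what a hand-written matcher for the same pattern would look like.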

There are interesting ways to compile regular expressions; see for example Regular-expression derivatives reexamined. An implementation of that approach in Common Lisp is one-more-re-nightmare.
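
For a feel of what derivatives are, here is a minimal matcher in Python. It is not taken from the paper or from one-more-re-nightmare; the tuple-based AST and all function names are assumptions of this sketch. The derivative of a regex r with respect to a character c is a regex matching whatever may follow c in a string matched by r, so matching means taking one derivative per input character and checking whether the result accepts the empty string.

```python
# Regex AST as tuples: ("empty",) matches nothing, ("eps",) matches "",
# ("chr", c), ("cat", r, s), ("alt", r, s), ("star", r).
EMPTY, EPS = ("empty",), ("eps",)

def cat(r, s):
    """Concatenation with basic simplifications."""
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def alt(r, s):
    """Alternation with basic simplifications."""
    if r == EMPTY:
        return s
    if s == EMPTY or r == s:
        return r
    return ("alt", r, s)

def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # alt

def derive(r, c):
    """Derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return alt(derive(r[1], c), derive(r[2], c))
    if tag == "star":
        return cat(derive(r[1], c), r)
    # cat: derive the left part; if it can be empty, the right part may
    # also consume c.
    left = cat(derive(r[1], c), r[2])
    return alt(left, derive(r[2], c)) if nullable(r[1]) else left

def matches(r, text):
    for c in text:
        r = derive(r, c)
    return nullable(r)

# (a|b)*abb
AB = alt(("chr", "a"), ("chr", "b"))
PATTERN = cat(star(AB), cat(("chr", "a"), cat(("chr", "b"), ("chr", "b"))))
```

A compiler built on this idea would precompute the distinct derivatives ahead of time and use them as DFA states, rather than deriving on every input character as this sketch does.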

note: meta.stackexchange.com/a/8259/997587 — starball, Commented Jul 11, 2023 at 9:37

Audrius Meškauskas · Accepted Answer · 2023-07-11 12:58:51Z

Yes, there is the algorithm proposed by R. Baeza-Yates and G. H. Gonnet in 1996 (see this article by Technical University of Munich).

This algorithm describes how to compile regular expressions into modified suffix trees, a well-known compiled structure for efficient approximate-match search. Classic suffix trees, known since 1973, only provide approximate search for a simple string.

9072997 · Accepted Answer · 2023-10-13 14:47:09Z

Runtime-compiled regex is pretty common, as others have pointed out. Intel has a library that compiles regex to high-performance machine code, complete with processor-specific optimizations (i.e. you can compile for generic x86 for maximum compatibility or enable advanced features like AVX512 for maximum performance). It even has a format for serializing these compiled patterns and loading them so they don't need to be compiled at runtime, similar to what you might be used to with GPU shaders.

As to why compile-time regex compilation is not a core language feature, I think it's because regex is not a core language feature in most languages. Regex is really only a "core language feature" in languages like Perl that specialize in manipulating text, and these languages tend not to be focused on performance. Regex is complex, and most languages try to contain this complexity in a library, even if it is part of the standard library.

As a final note, it's worth being aware that for simple patterns, memory bandwidth will rob you of all the efficiency gains of a high performance regex engine. It doesn't matter how fast you can match text if you can't get that text into your CPU fast enough.

EDIT: I just found out C# has this feature, so that is a notable exception.

[GeneratedRegex(@"^\s+", RegexOptions.Multiline)]
private static partial Regex IndentationRegex();
