Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Specify and regularise non-functional bounded repetitions #596

Copy link
Copy link
Open
@masklinn

Description

@masklinn
Issue body actions

6e65445 converted a bunch of unbounded repetitions (* and +) to bounded ({0,...} and {1,...}). These bounded repetitions are non-functional, in the sense that they're there to mitigate catastrophic backtracking, they're not otherwise relevant to the pattern's correctness.

Now while it's great (or at least useful) for backtracking engines, they're not the only ones on the scene especially since the rise of redos / catastrophic backtracking concerns: the re2 C++ library (with bindings in many languages), Go's regexp and Rust's regex (which has C bindings) are all non-backtracking engines, and by definition don't suffer from catastrophic backtracking1.

The problem is that they need to keep track of the upper bound (in order to fail if it's exceeded), which has a cost: compiling the regex .* in re2 allocates 650 bytes, .{0,100} allocates 4800, .{0,300} allocates 11420. Matching it also adds a small cost (about 10% in regex in my small experiments). Now obviously for such a small regex on its own it's not a big concern, but with uap-core's regexes.yaml having a thousand regexes many more complicated than that it starts to add up.

Now initially I thought that was not a big deal because those repetitions could just be pre-processed back into unbounded repetitions and bob's your uncle. The issue is, a few functional bounded repetitions exist, there may be others but so far I've found that ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit) fails its test if it's converted to ; {0,2}(moto)(.*)(?: Build|\) AppleWebKit):

com.abc.mobile.production/5.7.17/Mozilla/5.0 (Linux; Android 9; moto e6 (XT2005DL) Build/PPBS29.73-81-5-14-5-4-2-6-7; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/100.0.4896.79 Mobile Safari/537.36

the original version captures e6 (XT2005DL), while the relaxed version captures e6 (XT2005DL) Build/PPBS29.73-81-5-14-5-4-2-6-7; wv.

The issue is that {0,50} is one of the ranges which was used in 6e65445: there were 4 occurrences in the file before that commit, it added 21 to 25, and there are 32 now. It's unclear how many of those extra 7 are functional (aside from the one I know is).

What this means is that it's difficult to pre-process regexes back to unbounded repetitions for the benefit of non-backtracking engines, it can either be done heuristically with the risk of breaking things, or not at all and we have to leave with unnecessary memory footprint.

It would be nice if these non-functional bounded repetitions used peculiar and easily recognisable ranges, such that those could be safely reverted for memory optimisation purposes e.g. maybe the upper bound could end in 86 or 93, bounds which seem unlikely to be picked either "at random" or for a real reason, and those would signal to maintainers that those ranges are fair game for unbounding. Or maybe all the ranges added by 6e65445 could be bumped to an upper bound of 100 which would become the threshold between functional and non-functional.

Footnotes

  1. there are also hybrid backtracking engines which are very resilient to the issue, I never managed to trip the Spencer engine used by postgresql

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      Morty Proxy This is a proxified and sanitized view of the page, visit original site.