Best regex alternative for Go

22fc626e7c4f93f752e7c4bc0c06a8d5.png

Introduction

“Don’t use regular expressions, or you’ll have two problems instead of one” – or so the experts say. But what’s left for those of us who want to efficiently search through tons of patterns?

Of course, there are some cool solutions to this specific problem, such as Ragel or re2c. However, for my project, mastering these fine techniques seems impractical for the time being.

In this article, we will look at alternatives to the standard regular expression library in Go and benchmark their speed and memory consumption. We will also consider their differences from a practical perspective.

Regular solution

I found the following working alternatives to Go’s default regexp package for pattern matching (the version used in the benchmarks is given in parentheses):

  • go-re2 (1.3.0) – a drop-in replacement for the default regexp package. It uses the C++ RE2 library to improve performance on large inputs or complex expressions;

  • regexp2 (1.10.0) – a feature-rich regular expression engine for Go. It lacks the runtime guarantees of the built-in regexp package, but is compatible with Perl5 and .NET;

  • go-pcre (1.0.0) – provides support for Perl-compatible regular expressions via libpcre or libpcre++. JIT compilation is available, making this fork much faster than its counterparts. The drawback is that it requires the libpcre3-dev dependency;

  • rure-go (regex 1.9.3) – bindings to Rust’s regex engine via CGo. The drawback is that the Rust library has to be compiled separately;

  • gohs (1.2.2 + hs5.4.1) – bindings to Hyperscan, a regular expression engine designed for high performance, implemented as a library with a simple C API. It also requires a third-party dependency to be compiled and linked;

  • go-yara – bindings to YARA, a tool for identifying and classifying malware samples. While YARA supports patterns and regular expressions, its regex functionality is very limited, so I will not include this library in the tests below.

Existing baseline

Before comparing the solutions above, it’s worth showing just how slow the standard regular expression library in Go is. I found a project whose author compared the performance of the standard regex engines of various languages. The benchmark repeatedly runs three regular expressions against a predefined text. Go ranks third in this benchmark… from the end!

ce8c5b1e97cf9327fbe9f8d8331921c3.png

As far as I know, benchmarking regex engines is not the easiest subject, since you need to know the implementation and algorithm details to make a proper comparison.

Looking at other benchmarks, I can highlight the following:

  • Benchmarks – compares the standard engine and an optimized variant for each language. Here Go is no longer at the bottom, but it’s still far from ideal… Of course, the optimized variant doesn’t use the native library but go-pcre, a wrapper for PCRE.

  • Performance comparison of regular expression engines – Comparison of different regular expression engines (PCRE, PCRE-DFA, TRE, Oniguruma, RE2, PCRE-JIT).

  • Comparison of regular expression engines – here the author attempts to compare engines using different regular expressions, which can complicate things depending on each engine’s implementation. In this benchmark, the author’s top three engines are Hyperscan, PCRE (with JIT compilation), and Rust’s regex (which rure-go wraps).

8881340415056d2c341f7ac2027e1d4c.png

Benchmark #1

Now let’s compare the alternatives with the default regular expression engines of other languages, and see how much faster they are than Go’s default regex. To do this, I updated the project mentioned above, adding code for the new libraries. This is what I got after running it on my machine:

ba96359a7037da2422687986e741c7d3.png

Still, you can see that some libraries’ regular expressions are much faster! We even surpassed Rust – by using Go libraries that wrap Rust libraries. Perhaps this is what the author of that solution is trying to explain in his repository.

So almost all the alternative solutions are 8–130 times faster – except regexp2, which is slower than the standard library.

Benchmark #2

1. Questions

While studying the existing benchmarks and the results of Benchmark #1, I still lacked answers to the following questions:

  • How fast can the above libraries handle large files?

  • How fast is processing when grouping regular expressions?

  • How fast is it to process regular expressions that have no matches in the text?

  • How much memory do different libraries use?

  • How many regular expressions can I compile using grouping?

2. Baseline differences

To answer these questions, I wrote a small benchmark program that can be used to compare the speed and memory usage of different regular expression engines. If you want to test yourself or evaluate the correctness of the method used, here is the code.

The following features of this benchmark are worth mentioning:

  • In the test below, I used 5 different regular expressions:

allRegexps["email"] = `(?P<name>[-\w\d\.]+?)(?:\s+at\s+|\s*@\s*|\s*(?:[\[\]@]){3}\s*)(?P<host>[-\w\d\.]*?)\s*(?:dot|\.|(?:[\[\]dot\.]){3,5})\s*(?P<domain>\w+)`
allRegexps["bitcoin"] = `\b([13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-zAC-HJ-NP-Z02-9]{11,71})`
allRegexps["ssn"] = `\d{3}-\d{2}-\d{4}`
allRegexps["uri"] = `[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?`
allRegexps["tel"] = `\+\d{1,4}?[-.\s]?\(?\d{1,3}?\)?[-.\s]?\d{1,4}[-.\s]?\d{1,4}[-.\s]?\d{1,9}`

Ideally, I should have used tricky regular expressions, as other benchmark authors do, to probe the “weak spots” of each algorithm. But I don’t know much about the engines’ internals, so I used general-purpose regexes instead. This should still allow evaluating the libraries from a practical point of view.

  • Instead of a static file, we will use a string containing the matches that is repeated multiple times in memory to simulate files of different sizes:

var data = bytes.Repeat([]byte("[email protected] nümbr=+71112223334 SSN:123-45-6789 http://1.1.1.1 3FZbgi29cpjq2GjdwV8eyHuJJnkLtktZc5 Й"), config.repeatScanTimes)
  • In addition to running the regular expressions sequentially over the data, there will also be separate runs with the regular expressions combined into named groups, where they take the form:

`(?P<name_1>regexp_1)|(?P<name_2>regexp_2)|...|(?P<name_N>regexp_N)`

By the way, Hyperscan has a special feature: it can build a database of regular expressions once and then run it against the data. I will use this approach in the benchmark.

Unlike Benchmark #1, for each regular expression I will measure only the time to find matches, excluding compilation time. In the end, we get results for each library and each regular expression in the following form:

Generate data...
Test data size: 100.00MB
Run RURE:
  [bitcoin] count=1000000, mem=16007.26KB, time=2.11075s
  [ssn] count=1000000, mem=16007.26KB, time=62.074ms
  [uri] count=1000000, mem=16007.26KB, time=69.186ms
  [tel] count=1000000, mem=16007.26KB, time=83.101ms
  [email] count=1000000, mem=16007.26KB, time=172.915ms
Total. Counted: 5000000, Memory: 80.04MB, Duration: 2.498027s
...

As a result, we have the following data:

bc14d960be1449b021de7f59b08ab9f7.png

The following graph shows the time for all regular expressions in sequential mode and using grouping to process 100MB of data:

4edf0f4931f55dbfa18b45c5320bc1c2.png

Conclusions:

  • Grouping does improve execution speed significantly, but in some cases it can make things worse :);

  • The fastest in sequential processing is Rure; with grouping, Re2;

  • Some regular expressions (such as the email one) cause problems in certain libraries (matching is slow in Regexp2 and PCRE);

  • It’s now hard to claim that some solutions are 180x faster than the standard library – the maximum gain here is 8–9x.

3. Unmatched regular expressions

In the previous case, we simulated the ideal situation where there was always a match in the data. But what if a regex has no matches in the text – how much will that affect performance?

In this test, I added 5 additional modified SSN regular expressions that do not match the data. Here SSN is the `\d{3}-\d{2}-\d{4}` regular expression, and the non-matching variant is `\d{3}-\d{2}-\d{4}1`. Not much difference, right? But let’s see how it affects the time it takes to find all matches:

deb4536da065d1ca77ca9b4939acc947.png

The following graph shows the time required to process all 10 regular expressions (sorted by Non-matching processing time):

1f9d7a6fb97d2727bcf50ab6cd94c2f1.png

Conclusions:

  • Same as before: the fastest in sequential processing is Rure; with grouped expressions, Re2;

  • PCRE again stands out: processing non-matching regular expressions in sequential mode takes twice as long;

  • Some algorithms are much faster when there are no matches (Re2, Hyperscan);

4. Memory consumption

Now let’s see how much memory the different solutions consume when processing a 100MB file. Below, I’ve provided the results for each individual regular expression along with the total amount of memory consumed:

a1112c5ea08d3e35b0e97510ec148423.png

The following graph shows the memory used by the libraries to process 10 regular expressions (as in the previous test), sorted by non-matching time:

e7d23cfba2f924ac7b7ec3d20fd48824.png

Conclusions:

  • What’s surprising about Rure is its almost zero memory consumption;

  • Regexp2 is very resource intensive, consuming more memory than its competitors;

  • Re2, although fast, is not the most memory-efficient solution;

  • Go’s regular expressions have a bright side: they are relatively cheap in sequential mode.

5. Maximum number of regular expressions

The main questions appear to have been answered. Now let’s look at the maximum number of regular expressions that can be compiled with the different solutions. In this case, we will take a single regular expression and repeat it multiple times in a group. For this I will use the URI regular expression:

`[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?`

Next are the results of compiling the regular expressions and the memory they use. The number in the first row is the number of expressions in the URI group:

a932846fcbcf707628463b44afa738c9.png

Summary:
  • As we have seen, some solutions have limits on the size of compiled regular expressions;

  • Hyperscan not only allows the use of a large number of regular expressions, but also uses minimal memory to compile regular expressions;

  • Regexp2 and Go Regex have comparable memory consumption and also allow compilation of large numbers of regular expressions;

  • Re2 consumes the most memory at compile time.

Conclusion

I hope this overview of alternative regular expression solutions for Go has been helpful. Based on the data provided, everyone can draw their own conclusions and choose the regex solution that best fits their situation.

Over time, we could compare these libraries in more detail – the algorithms they use, their best and worst sides. Here I just wanted to show the difference between them in the most general case.

So: for maximum speedup, I recommend considering rure-go; if you need the simplest installation without dependencies, go-re2. When processing large numbers of regexes, Hyperscan is a good choice. Also, don’t forget the cost of CGo calls when using certain libraries.
