Demystifying Regular Expressions, a Powerful String-Matching Tool

A regular expression (often abbreviated as regex, regexp, or RE in code) is a powerful tool for matching, finding, and replacing text. By matching strings against a specific pattern, it lets us automate text processing. Regular expressions are widely used across programming languages for text processing, data analysis, web crawling, and similar tasks, where they help us filter, manipulate, and format text precisely.

Regular expressions also turn up in everyday tasks. For example, when processing phone numbers, we can use a regular expression to verify that a number is in the correct format. Chinese mobile phone numbers consist of 11 digits: the first digit is 1 and the second digit is usually 3-9. The following regular expression matches such numbers:

/^1[3-9]\d{9}$/

With this regular expression, we can check whether a phone number has the expected format and reject malformed input.
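As a minimal sketch, the same check could be written in Python roughly as follows. The function name is_valid_mobile and the sample numbers are made up for illustration. Python has no / delimiters, so the pattern is passed as a raw string.

```python
import re

# Same pattern as above, written as a Python raw string.
# re.ASCII restricts \d to 0-9 (by default Python's \d also accepts other Unicode digits).
MOBILE_RE = re.compile(r"^1[3-9]\d{9}$", re.ASCII)

def is_valid_mobile(number: str) -> bool:
    """Return True if `number` is exactly 11 digits, starts with 1,
    and has a second digit in the range 3-9."""
    # fullmatch requires the whole string to match, so a trailing
    # newline (which `$` alone would tolerate in Python) is rejected.
    return MOBILE_RE.fullmatch(number) is not None

print(is_valid_mobile("13812345678"))  # True
print(is_valid_mobile("12812345678"))  # False: second digit is 2
print(is_valid_mobile("1381234567"))   # False: only 10 digits
```

A format check like this only confirms the shape of the input; it cannot tell whether the number is actually in service.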

What is a regular expression

In theoretical computer science, every regular expression corresponds to a finite automaton (also called a state machine) that accepts the language specified by the expression. Thompson’s construction algorithm transforms a regular expression into an equivalent nondeterministic finite automaton (NFA). Conversely, for every finite automaton there is a regular expression that describes the language the automaton accepts. Such an expression can be generated by Kleene’s algorithm or by a procedure analogous to Gaussian elimination.

A well-known application of regular expressions is the search-and-replace function in text editors, first implemented by computer pioneer Ken Thompson (one of the developers of the UNIX operating system) in the line-oriented editor QED in the 1960s. This function finds a specific string in a text and replaces it with any other string as needed.

How regular expressions work

A regular expression can consist of ordinary (literal) characters only, such as abc, or of a combination of ordinary characters and metacharacters, such as ab*c. Metacharacters describe the structure or arrangement of characters, for example whether a character must be at the beginning of a line, or whether it may appear once or several times. The two examples just mentioned work as follows:

  • abc: The simple pattern abc requires an exact match. In other words, the expression finds every string that contains the characters “abc” in exactly that order. For example, it matches the “abc” in “a abc d” and in “abcoulomb”.
  • ab*c: Patterns with special characters behave differently. The asterisk means that the preceding character, b, may occur any number of times, including zero. The expression therefore finds a substring that begins with “a”, ends with “c”, and has nothing but b’s in between. So “ac”, “abc”, and “abbbbc” match, and so does the “abbc” in “cbb abbc ba”.
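The difference between the two patterns can be checked with a short Python sketch. The helper first_match and the sample strings are chosen here for illustration only.

```python
import re

def first_match(pattern: str, text: str):
    """Return the first substring of `text` matching `pattern`, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

samples = ["a abc d", "abcoulomb", "ac", "abbbbc", "cbb abbc ba", "ab"]

for text in samples:
    # Column 1: input, column 2: match for abc, column 3: match for ab*c
    print(repr(text), first_match(r"abc", text), first_match(r"ab*c", text))
```

The literal pattern abc finds nothing in “ac” or “abbbbc”, while ab*c matches both, because * allows zero or more b’s.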

A regular expression can also be linked to a specific operation, such as the “replace” operation mentioned above. The operation is carried out whenever the expression matches, as in the examples above. The edge rules of Youpaiyun CDN support similar scenarios: they match strings with regular expressions and then perform rewrites, redirects, access control, rate limiting, and other operations.

Challenges of using regular expressions

Mastering regular expressions improves our programming and text processing skills and lets us process large amounts of data and text more efficiently. However, learning and using them comes with some challenges.

  • Complexity: Regular expressions are complex and have a steep learning curve. Writing and understanding complex expressions can require a lot of time and experience.
  • Matching efficiency: A poorly written regular expression can be slow, especially when processing large amounts of data.
  • Unreadability: Complex regular expressions can be difficult to read, which makes maintenance and debugging difficult.
  • Learning cost: Regular expressions have a large syntax with many special characters, and it takes practice to use them proficiently.

When writing regular expressions, the most important thing is to master the following core concepts:

  • Metacharacters: characters with a special meaning, including the dot, backslash, square brackets, asterisk, question mark, and so on, which are used to match a specific character or set of characters.
  • Escape characters: A backslash escapes a special character so that it matches the character itself rather than carrying its special meaning.
  • Quantifier: Specifies how many times the preceding character or subexpression may occur. For example, * means zero or more times, + means one or more times, and ? means zero or one time.
  • Alternation: The pipe symbol (|) indicates that any one of several patterns may match.
  • Character class: Specifies a character or set of characters, such as \d for digits and \w for letters, digits, or the underscore.
  • Assertion: Specifies a position rather than a character or character set. For example, ^ represents the beginning of the line, and $ represents the end of the line.
  • Parentheses: Group several patterns into a larger pattern and determine the order in which they are applied.

Mastering these core concepts will allow you to write more accurate and complex regular expressions to solve a variety of text processing problems.
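Each of these concepts can be tried out in isolation. The following Python sketch pairs one small pattern with one sample text per concept; the patterns, the sample texts, and the helper demo are all made up for illustration.

```python
import re

# concept -> (pattern, sample text); all samples are illustrative only
EXAMPLES = {
    "escape":          (r"3\.14",   "3.14 and 3x14"),       # \. is a literal dot
    "quantifier":      (r"go+d",    "gd god good"),         # + means one or more o's
    "alternation":     (r"cat|dog", "cat, dog, bird"),      # either alternative
    "character class": (r"\d+",     "order 66, room 101"),  # runs of digits
    "assertion":       (r"^\w+",    "hello world"),         # word at the start only
    "group":           (r"(ab)+",   "ababab cd"),           # repeat the group "ab"
}

def demo(concept: str) -> list:
    """Return every full match of the concept's pattern in its sample text."""
    pattern, text = EXAMPLES[concept]
    return [m.group() for m in re.finditer(pattern, text)]

for name in EXAMPLES:
    print(f"{name:16} {demo(name)}")
```

Note that without the backslash, the pattern 3.14 would also match “3x14”, because an unescaped dot matches any character.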

Which syntax rules apply to regular expressions

Regular expressions are available in many languages and formats, such as Perl, Python, Ruby, JavaScript, XML, and HTML, but their purpose can vary greatly. In JavaScript, for example, regular expression patterns are used with the search(), match(), and replace() string methods, while in XML documents expressions are used to restrict element content. The core syntax, however, is largely the same whether the expression appears in a programming language or a markup language.

A regular expression can be made up of up to three parts, although not every language uses all of them:

  • Pattern (expression): Consists of metacharacters, ordinary characters, and special characters, and describes the text to be matched. A pattern can consist of simple characters only or of a combination of simple and special characters.
  • Delimiters: Distinguish the regular expression from the surrounding text. A commonly used delimiter is the slash (/), but other characters can also serve as delimiters.
  • Modifiers: Change the behavior of the regular expression. Common modifiers include i (ignore case), m (multiline mode), s (the dot matches any character, including newlines), and x (ignore whitespace in the pattern).
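As a rough illustration of the four modifiers, the sketch below uses Python, where there are no delimiters and the modifiers are passed as flags (re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE). The sample text and variable names are made up.

```python
import re

text = "Hello\nworld"

# i: ignore case
ignore_case = re.findall(r"hello", text, re.IGNORECASE)

# m: ^ and $ match at the start and end of every line, not just the whole string
single_line = re.findall(r"^\w+$", text)
multi_line = re.findall(r"^\w+$", text, re.MULTILINE)

# s: the dot also matches newline characters
dot_default = re.findall(r"Hello.world", text)
dot_all = re.findall(r"Hello.world", text, re.DOTALL)

# x: whitespace and comments inside the pattern are ignored
year_month = re.compile(r"""
    \d{4}   # four-digit year
    -       # literal hyphen
    \d{2}   # two-digit month
""", re.VERBOSE)
verbose_match = year_month.search("released 2024-05").group()

print(ignore_case, single_line, multi_line, dot_default, dot_all, verbose_match)
```

The verbose form is mainly a readability aid: it matches the same strings as \d{4}-\d{2} written on one line.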

The following are some typical syntax symbols used in expressions, together with their functions:

  • []: Specifies a character set that matches any one character listed inside the square brackets. A set can contain single characters, several characters, character ranges, and so on.
  • (): A capture group that captures and saves the text matched by a sub-pattern for later use, for example to extract substrings or to refer to them in a replacement.
  • -: A hyphen that indicates a range inside a character set, as in [a-z].
  • ^: Inside a character set, ^ negates the set; as an assertion, ^ matches the beginning of a line.
  • $: Matches the end of the string.
  • .: A metacharacter that matches any single character except newline characters (\n, \r).
  • *: A quantifier meaning that the preceding character or subexpression occurs zero or more times.
  • +: A quantifier meaning that the preceding character or subexpression occurs one or more times.
  • ?: A quantifier meaning that the preceding character or subexpression occurs zero times or once.
  • {n}: A quantifier meaning that the preceding character or subexpression occurs exactly n times.
  • {n,m}: A quantifier meaning that the preceding character or subexpression occurs at least n and at most m times.
  • {n,}: A quantifier meaning that the preceding character or subexpression occurs at least n times.
  • \b: A boundary assertion that matches at the beginning or end of a word, that is, at a position with a word character on one side and a non-word character (such as a space or punctuation mark) or the edge of the string on the other.
  • \B: A boundary assertion that is the opposite of \b. It matches any position that is not a word boundary, for example a position inside a word with word characters on both sides.
  • \d: A character class that matches any decimal digit. Equivalent to [0-9].
  • \D: A negated character class that matches any character that is not a digit. Equivalent to [^0-9].
  • \w: A character class that matches a word character. Word characters are letters, digits, and the underscore: [a-zA-Z_0-9].
  • \W: A negated character class that matches any character that is not a word character. Equivalent to [^a-zA-Z_0-9].
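A few of these symbols are easier to grasp when run against sample text. The following Python sketch is illustrative only; the sample strings and variable names are made up.

```python
import re

words = "cat concat cats"
numbers = "1 22 333 4444"

whole_word = re.findall(r"\bcat\b", words)     # "cat" as a complete word only
inside_word = re.findall(r"\Bcat", words)      # "cat" preceded by a word character
optional_s = re.findall(r"cats?", words)       # "cat" with an optional trailing "s"

exactly_two = re.findall(r"\d{2}", "2024")     # exactly two digits per match
two_to_three = re.findall(r"\d{2,3}", numbers) # between two and three digits
three_plus = re.findall(r"\d{3,}", numbers)    # three digits or more

negated_set = re.findall(r"[^a-c]", "abcde")   # any character except a, b, c
non_word = re.findall(r"\W", "a_b c!")         # underscore counts as a word character
non_digit = re.findall(r"\D+", "ab12cd")       # runs of non-digit characters

print(whole_word, inside_word, optional_s)
print(exactly_two, two_to_three, three_plus)
print(negated_set, non_word, non_digit)
```

Note how \d{2,3} is greedy: on “4444” it takes three digits and leaves a single “4”, which is too short to match on its own.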

The above covers only the basics. Regular expressions are highly flexible and are available for text processing in everything from simple text editors to complex development tools. As mentioned earlier, the edge rule feature of Youpaiyun CDN uses regular expressions to extract strings. The following examples show what this makes possible.

Application of regular expressions in Youpaiyun CDN

Example 1: Directory and parameter rewriting

The goal is to convert the request URL into a dynamic URL with parameters. For example, the requested URL is:

http://example.com/pay/25/8/...

The CDN edge node needs to convert it into the following request:

http://example.com/pay.php?productid=25&categoryid=8...

Here, the pattern needs to capture the numeric directory segments into variables such as $1 and $2, as in the following rule:

"rule": "/pay.php?productid=$1&categoryid=$2",
"pattern": "^pay/([0-9]+)/([0-9]+)/(.*?).html$"

Rule explanation: When the parsed URL matches the pattern ^pay/([0-9]+)/([0-9]+)/(.*?).html$, the request is directed to /pay.php?productid=$1&categoryid=$2.

In other words, http://example.com/pay/25/8/… is rewritten to http://example.com/pay.php?productid=25&categoryid=8…
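The effect of this rule can be approximated locally with Python’s re module. This is only a sketch and not the CDN’s actual rule engine: it assumes the pattern is applied to the path without its leading slash, it writes the quantifiers as [0-9]+, it uses \1 and \2 where the rule uses $1 and $2, and the file name index.html is a hypothetical stand-in for the elided part of the URL.

```python
import re

# Pattern and target from the rule; $1/$2 become \1/\2 in Python's syntax.
PATTERN = r"^pay/([0-9]+)/([0-9]+)/(.*?).html$"
REPLACEMENT = r"/pay.php?productid=\1&categoryid=\2"

def rewrite(path: str) -> str:
    """Return the rewritten path, or the original path if the pattern does not match."""
    new_path, count = re.subn(PATTERN, REPLACEMENT, path)
    return new_path if count else path

print(rewrite("pay/25/8/index.html"))   # /pay.php?productid=25&categoryid=8
print(rewrite("pay/abc/8/index.html"))  # unchanged: "abc" is not numeric
```

The third group, (.*?), is captured but not used in the target. The unescaped dot before html matches any character, so a stricter variant would escape it as \.html.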

Example 2: File name rewriting

pattern: /(.*)/playlist.m3u8$
rule: /$1'.m3u8'

Rule explanation: When the requested address is http://domain/app/stream/playlist.m3u8, it is rewritten to http://domain/app/stream.m3u8.

Application scenarios: In live streaming, when the client cannot be upgraded or upgrading it is inconvenient, you can rewrite the URL on the CDN instead and change /stream/playlist.m3u8 to /stream.m3u8. Here, app is the publishing point and stream is the stream name.
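The capture can be reproduced in Python as a rough check. This sketch assumes that the rule /$1'.m3u8' simply appends the literal string .m3u8 to the captured group, which is what the rule explanation suggests; the function name is made up, and the code is not the CDN’s own implementation.

```python
import re

PATTERN = r"/(.*)/playlist.m3u8$"

def rewrite_playlist(path: str) -> str:
    """Rewrite /<app>/<stream>/playlist.m3u8 to /<app>/<stream>.m3u8."""
    # (.*) is greedy, so it captures everything up to the last "/playlist.m3u8".
    return re.sub(PATTERN, r"/\1.m3u8", path)

print(rewrite_playlist("/app/stream/playlist.m3u8"))  # /app/stream.m3u8
print(rewrite_playlist("/app/stream/other.m3u8"))     # unchanged
```

Because the group is greedy, deeper paths such as /a/b/c/playlist.m3u8 are also covered: everything between the first slash and the final /playlist.m3u8 ends up in $1.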

Example 3: URL rate limiting

Suppose the requested URL is http://test.example.com/mp4/4E10F356C0FEAD359C33DC5901307461-10.mp4 and downloads of this type of file need to be rate-limited. The requirement is: no limit for the first 20 MB, and a limit of 800 KB/s after that. The rule can be written like this:

"rule": "$WHEN($1, $EQ($_HOST, 'test.example.com'))$LIMIT_RATE_AFTER(20, m)$LIMIT_RATE(800, k)",
"pattern": "^(/).+-10.mp4$"

Rule explanation: When $1 is true and the request HOST is test.example.com, the first 20 MB is delivered without a limit, and the rest of the download is limited to 800 KB/s.
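To get a feel for this rule, the sketch below checks the pattern against the example path in Python and estimates how long the rate-limited part of a download would take. It is not the CDN’s implementation: the quantifier is written as .+, the function names are made up, and the conversion 1 MB = 1024 KB is an assumption.

```python
import re

PATTERN = r"^(/).+-10.mp4$"  # pattern from the rule, quantifier written as .+
FREE_MB = 20                 # from $LIMIT_RATE_AFTER(20, m)
RATE_KB_PER_S = 800          # from $LIMIT_RATE(800, k)

def matches_rule(path: str) -> bool:
    """True if the path matches the pattern and group 1 captured the leading '/'."""
    m = re.match(PATTERN, path)
    return m is not None and m.group(1) == "/"

def throttled_seconds(size_mb: float) -> float:
    """Seconds spent in the rate-limited phase, assuming 1 MB = 1024 KB."""
    limited_kb = max(size_mb - FREE_MB, 0) * 1024
    return limited_kb / RATE_KB_PER_S

print(matches_rule("/mp4/4E10F356C0FEAD359C33DC5901307461-10.mp4"))  # True
print(matches_rule("/mp4/4E10F356C0FEAD359C33DC5901307461-11.mp4"))  # False
print(throttled_seconds(100))  # 80 MB at 800 KB/s
```

Under these assumptions, a 100 MB file spends 80 × 1024 / 800 = 102.4 seconds in the limited phase, in addition to the time needed for the unrestricted first 20 MB.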

By combining regular expressions with processing operations, the edge rule feature of Youpaiyun CDN can help you simplify content distribution logic and improve the access experience for end users. Rules are quick to deploy and simple to configure, which can greatly reduce implementation costs. Developers of websites and web applications, as well as security engineers, can quickly create edge rule sets to improve website security and distribution performance. For details, see the documentation: https://help.upyun.com/docs/edgerules/

Finally, let me recommend a useful tool for testing regular expressions: https://regex101.com/. It lets you quickly check which strings match a pattern and explains each part of the pattern in detail, which makes writing and testing expressions much more convenient.