HALCON: tuple_regexp_match

tuple_regexp_match (Operator). I searched for a long time and couldn't find the information I wanted, so I worked through the reference myself. If I got something wrong, please feel free to comment. Your feedback is welcome, thank you!

Name

tuple_regexp_match – Extract substrings using regular expressions.

Signature

tuple_regexp_match( : : Data, Expression : Matches)

Description

tuple_regexp_match applies the regular expression in Expression to one or more input strings in Data, and in each case returns the first matching substring in Matches. Normally, one output string is returned for each input string, the output string being empty if no match was found. However, if the regular expression contains capturing groups (see below), the behavior depends on the number of input strings: If there is only a single input string, the result is a tuple of all captured submatches. If there are multiple input strings, the output strings represent the matched pattern of the first capturing group.

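
The output convention described above can be sketched in Python with the standard `re` module. This is only an illustration of the shape of the result, not HALCON code: `regexp_match_sketch` is a hypothetical helper, Python's regex engine is not the one HALCON uses, and the behavior for "single input, capturing groups, no match" is an assumption because the text does not spell it out.

```python
import re


def regexp_match_sketch(data, expression):
    """Hypothetical helper (not a HALCON API) imitating the described output shape."""
    pattern = re.compile(expression)
    # Accept a single string or a sequence of strings, like Data.
    inputs = [data] if isinstance(data, str) else list(data)
    results = []
    for text in inputs:
        match = pattern.search(text)  # first match, scanning left to right
        if pattern.groups == 0:
            # No capturing groups: the whole match, or '' if nothing matched.
            results.append(match.group(0) if match else "")
        elif len(inputs) == 1:
            # Single input with groups: every captured submatch.
            # Assumption: on no match, return one '' per group.
            groups = match.groups() if match else ("",) * pattern.groups
            results.extend(g or "" for g in groups)
        else:
            # Several inputs with groups: only the first group per input.
            results.append((match.group(1) or "") if match else "")
    return results


whole = regexp_match_sketch("abba", "b+a*")
all_groups = regexp_match_sketch("mydir/img001.bmp", "img(.*)[.](.*)")
first_groups = regexp_match_sketch(["img123", "img124"], "img(.*)")
```

Here `whole` holds the full match for a pattern without groups, `all_groups` holds both captured parts for a single input, and `first_groups` holds one entry per input string.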

A summary of regular expression syntax is provided here. Basically, each character in the regular expression represents a literal to match, except for the following symbols which have a special meaning (the described syntax is compatible with Perl):

^        Matches start of string
$        Matches end of string (a trailing newline is allowed)
.        Matches any character except newline
[...]    Matches any character literal listed in the brackets.
         If the first character is a '^', this matches any character
         except those in the list. You can use the '-' character as
         in '[A-Z0-9]' to select character ranges. Other characters
         lose their special meaning in brackets, except '\'.
         Within these brackets it is possible to use the following
         POSIX character classes (note that the additional brackets are
         needed):
         [:alnum:]    alphabetic and numeric characters
         [:alpha:]    alphabetic characters
         [:blank:]    space and tab
         [:cntrl:]    control characters
         [:digit:]    digits
         [:graph:]    non-blank characters (excludes spaces and control characters)
         [:lower:]    lowercase alphabetic characters
         [:print:]    like [:graph:] but including spaces
         [:punct:]    punctuation characters
         [:space:]    all whitespace characters ([:blank:], newline, ...)
         [:upper:]    uppercase alphabetic characters
         [:xdigit:]   digits allowed in hexadecimal numbers (0-9a-fA-F)

*        Allows 0 or more repetitions of preceding literal or group
+        Allows 1 or more repetitions
?        Allows 0 or 1 repetitions
{n,m}    Allows n to m repetitions
{n}      Allows exactly n repetitions

The repeat quantifiers above are greedy by default, i.e., they attempt to maximize the length of the match. Appending ? to a quantifier makes it look for a minimal match instead, e.g., +?
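
To make the difference between greedy and minimal matching concrete, here is a small Python sketch using the standard `re` module. The quantifier syntax is the same Perl-style syntax described above, but the engine is Python's, not HALCON's, and the sample strings are made up for illustration.

```python
import re

text = "<a><b>"

# Greedy: '.+' grabs as much as possible, so the match runs to the last '>'.
greedy = re.search("<.+>", text).group(0)

# Minimal: '.+?' stops at the first '>' that lets the pattern succeed.
lazy = re.search("<.+?>", text).group(0)

# The same applies to counted repetitions.
greedy_count = re.search("[0-9]{2,3}", "12345").group(0)
lazy_count = re.search("[0-9]{2,3}?", "12345").group(0)

print(greedy, lazy, greedy_count, lazy_count)
```

The greedy pattern returns the whole string, while the minimal one returns only the first tag; likewise `{2,3}` takes three digits and `{2,3}?` takes two.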

|        Separates alternative matching expressions.
( )      Groups a subpattern and creates a capturing group.
         The substrings captured by this group will be stored separately.
(?: )    Groups a subpattern without creating a capturing group
(?= )    Positive lookahead (required condition to the right of the match)
(?! )    Negative lookahead (forbidden condition to the right of the match)
(?<= )   Positive lookbehind (required condition to the left of the match)
(?<! )   Negative lookbehind (forbidden condition to the left of the match)
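
The grouping and lookaround constructs listed above are easier to tell apart side by side. The following Python sketch uses the standard `re` module, which supports the same constructs; it is an analogy to illustrate the syntax, not HALCON code, and the file name is an invented example.

```python
import re

text = "img001.bmp"

# Two capturing groups: both parts are stored separately.
capturing = re.search("(img)([0-9]+)", text).groups()

# (?: ) groups without capturing, so only the digits are stored.
non_capturing = re.search("(?:img)([0-9]+)", text).groups()

# Lookahead: digits must be followed by '.bmp', which is not part of the match.
lookahead = re.search("[0-9]+(?=[.]bmp)", text).group(0)

# Lookbehind: digits must be preceded by 'img', which is not part of the match.
lookbehind = re.search("(?<=img)[0-9]+", text).group(0)

# Negative lookahead: 'img' must NOT be followed by a digit, so nothing matches.
negative_lookahead = re.search("img(?![0-9])", text)

# Alternation: either extension is accepted.
alternation = re.search("bmp|png", text).group(0)
```

Note that the lookahead and lookbehind results contain only the digits: the condition is checked but never included in the matched substring.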

\        Escapes any special symbol to treat it as a literal. Note that
         some host languages like HDevelop and C/C++ already use the backslash
         as a general escape character. In this case, '\\.' matches a
         literal dot while '\\\\' matches a literal backslash.
         Furthermore, there are some special codes (the capitalized
         version of each denoting the negation):
         \d,\D    Matches a digit
         \w,\W    Matches a letter, digit or underscore
         \s,\S    Matches a white space character
         \b,\B    Matches a word boundary
If the specified expression is syntactically incorrect, you will receive an error stating that the value of control parameter 2 is wrong. Additional details are displayed in a message box if set_system('do_low_error', 'true') is set and in HDevelop's Output Console.
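
The double-escaping issue and the special codes can be tried out in Python, which has the same problem as HDevelop and C/C++: the string literal consumes one level of backslashes before the regex engine sees the pattern. This sketch uses Python's `re` module as a stand-in. The error type shown (`re.error`) is Python's and only mirrors the idea that an invalid expression is rejected; it is not the HALCON error.

```python
import re

# In a normal string literal the backslash is doubled; a raw string avoids that.
same_pattern = ("\\." == r"\.")

literal_dot = re.search("\\.", "a.b").group(0)          # regex \. : a literal dot
literal_backslash = re.search("\\\\", "C:\\tmp").group(0)  # regex \\ : a literal backslash

digits = re.search(r"\d+", "img001").group(0)       # \d matches digits
non_digits = re.search(r"\D+", "img001").group(0)   # \D is the negation
word_start = re.search(r"\bcat\b", "concat cat").start()  # whole word only

# A syntactically incorrect expression is rejected before any matching happens.
try:
    re.compile("(")
    compiled = True
except re.error:
    compiled = False
```

The word-boundary example skips the "cat" inside "concat" and matches only the standalone word.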

Furthermore, you can set some options by passing a string tuple for Expression. In this case, the first element is used as the expression, and each additional element is treated as an option.

'ignore_case': Perform case-insensitive matching

'multiline': '^' and '$' match start and end of individual lines

'dot_matches_all': Allow the '.' character to also match newlines

'newline_lf', 'newline_crlf', 'newline_cr': Specify the encoding of newlines in the input data. The default is LF on all systems. Although files on Windows usually use CRLF as the line break, the read operators return just '\n' (the same as LF) for every line break when a file is read into memory.
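
The effect of the first three options can be illustrated with the corresponding flags of Python's `re` module. This is an analogy only: in HALCON the options are passed as extra elements of the Expression tuple, as described above, whereas `re.IGNORECASE`, `re.MULTILINE` and `re.DOTALL` are Python names, and the sample text is invented.

```python
import re

text = "First line\nsecond LINE"

# Like 'ignore_case': letter case is ignored.
ignore_case = re.search("line", "LINE", re.IGNORECASE).group(0)

# Without 'multiline', '^' only matches at the very start of the string.
default_anchor = re.search("^second", text)
# Like 'multiline': '^' also matches at the start of each line.
multiline = re.search("^second", text, re.MULTILINE).group(0)

# Without 'dot_matches_all', '.' stops at the newline.
default_dot = re.search("First.*LINE", text)
# Like 'dot_matches_all': '.' may cross the newline.
dot_all = re.search("First.*LINE", text, re.DOTALL).group(0)
```

Both default searches return nothing, while the flagged versions succeed, which is the same distinction the 'multiline' and 'dot_matches_all' options make.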

For general information about string operations see Tuple / String Operations.

If the input parameter Data is an empty tuple, the operator returns an empty tuple. If Expression is an empty tuple, an exception is raised.

Unicode code points versus bytes

Regular expression matching operates on Unicode code points. One Unicode code point may be composed of multiple bytes in the UTF-8 string. If regular expression matching should only match on bytes, this operator can be switched to byte mode with set_system('tsp_tuple_string_operator_mode', 'byte'). If 'filename_encoding' is set to 'locale' (legacy), this operator always uses the byte mode.
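
The difference between matching on code points and matching on bytes shows up as soon as a non-ASCII character is involved. The Python sketch below contrasts a `str` pattern (code points) with a `bytes` pattern (raw UTF-8 bytes). It is meant only to illustrate the concept; it does not call HALCON or change any HALCON setting.

```python
import re

text = "\u00e9"                 # 'é': one code point
encoded = text.encode("utf-8")  # two bytes in UTF-8

# Code point mode: '.' consumes the whole character.
code_point_match = re.search(".", text).group(0)

# Byte mode: '.' consumes only the first byte of the character.
byte_match = re.search(b".", encoded).group(0)

print(len(text), len(encoded), code_point_match, byte_match)
```

In byte mode a single '.' splits the character in the middle, which is why patterns that count characters behave differently between the two modes.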

HDevelop In-line Operation

HDevelop provides an in-line operation for tuple_regexp_match, which can be used in an expression in the following syntax:

Matches := regexp_match(Data, Expression)

Execution Information

Multithreading type: independent (runs in parallel even with exclusive operators).
Multithreading scope: global (may be called from any thread).
Processed without parallelization.

Parameters

Data (input_control) string(-array) → (string)
Input strings to match.
Expression (input_control) string(-array) → (string)
Regular expression.
Default value: '.*'
Suggested values: '.*', 'ignore_case', 'multiline', 'dot_matches_all', 'newline_lf', 'newline_crlf', 'newline_cr'
Matches (output_control) string(-array) → (string)
Found matches.