Python regular expressions

Contents

  • Regular expression syntax
    • Grouping
    • Flags
  • Function
    • Compile
    • Search
    • Separation
    • Find
    • Replace
  • Regular expression object
  • Match object

Regular expression syntax

.: By default matches any character except a newline; with the re.S (re.DOTALL) flag it matches any character, including a newline.
^: Matches the beginning of the string. In multi-line mode (flag re.M or re.MULTILINE) it also matches at the beginning of each line.
$: Matches the end of the string, or just before a newline at the end of the string. Like ^, it considers only the whole string by default; under re.M or re.MULTILINE it also matches at the end of each line.
*: Matches 0 or more occurrences of the preceding character.
+: Matches 1 or more occurrences of the preceding character.
?: This symbol has two purposes:

  • Matches 0 or 1 occurrence of the preceding character.
  • By default the three symbols *, + and ? match greedily; adding a ? after them (*?, +?, ??) makes the match non-greedy.

{m}: Matches exactly m occurrences of the preceding character. If the string contains more than m of them in a row, a pattern such as a{3}b cannot match from the first character (re.match fails on 'aaaab'), but re.search still finds 'aaab' starting at the second character.
{m,n}: Matches from m to n occurrences of the preceding character. If n is omitted, {m,} sets no upper bound. For example, a{4,}b matches 'aaaab' and also 'aaaaab' (any number of a's greater than or equal to m), but it does not match 'aaab'.
{m,n}?: The non-greedy form, which matches as few occurrences as possible. On the string 'aaaaaa', a{3,5} matches five characters while a{3,5}? matches only three. Note that r'a{2,4}b' and r'a{2,4}?b' both match the complete 'aaaab': the match starts at the leftmost possible position and has to continue until the b, so the two forms give the same result here.
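
The behaviour of the counted and non-greedy quantifiers can be checked with a short sketch. The sample strings and variable names are illustrative assumptions.

```python
import re

# {m}: re.match is anchored at position 0, re.search is not.
anchored = re.match(r"a{3}b", "aaaab")    # None: after three a's comes another a
floating = re.search(r"a{3}b", "aaaab")   # finds 'aaab' starting at index 1

# Greedy takes as many repetitions as allowed, non-greedy as few as possible.
greedy = re.search(r"a{3,5}", "aaaaaa").group()
lazy = re.search(r"a{3,5}?", "aaaaaa").group()

# With a trailing b both forms must run up to the b, so the results agree.
greedy_b = re.search(r"a{2,4}b", "aaaab").group()
lazy_b = re.search(r"a{2,4}?b", "aaaab").group()

# The same applies to *, + and ?: compare .* with .*? on markup-like text.
tag_greedy = re.search(r"<.*>", "<a><b>").group()
tag_lazy = re.search(r"<.*?>", "<a><b>").group()

print(anchored, floating.span())  # None (1, 5)
print(greedy, lazy)               # aaaaa aaa
print(greedy_b, lazy_b)           # aaaab aaaab
print(tag_greedy, tag_lazy)       # <a><b> <a>
```

The non-greedy form only makes a visible difference when nothing after the quantifier forces it to keep consuming characters.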

[]: A character set

  • Matches any single character listed in the set.
  • Use - to specify a range, such as [a-zA-Z], which matches any character in the range a to z or A to Z.

|: A|B, matches A or B
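
A small sketch of character sets and alternation, using made-up sample strings:

```python
import re

# A range inside [] matches one character; + repeats it.
letters = re.findall(r"[a-zA-Z]+", "abc123DEF")

# A set and alternation can be combined in one pattern.
colours = re.findall(r"gr[ae]y|green", "grey gray green blue")

print(letters)  # ['abc', 'DEF']
print(colours)  # ['grey', 'gray', 'green']
```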

Grouping

  1. All extension notations start with (?. They are usually used to mark the pattern before a match is evaluated, for example to implement a lookahead or a conditional check.
  2. Although these notations also use (), only (?P<name>...) creates a matching group; the others do not create a group.
  3. When capturing groups are nested, as in "(xx(xx)xx)", groups are numbered by the position of their opening parenthesis, so the inner group gets a higher number than the outer one.

(...): Specify a capturing group
(?:...): Specifies a non-capturing group: it matches but does not capture the result and is not assigned a group number, so its content cannot be retrieved from the match object afterwards.
(?P<name>...) / (?P=name): (?P<name>...) gives the matching group a name, and its content can be accessed with group('name'); (?P=name) is a backreference that matches the same text as the group called name.
(?aiLmsux): Matches no text itself; the letters set the flags re.A, re.I, re.L, re.M, re.S, re.U and re.X for the pattern. For example, (?i)xxx makes the match case-insensitive. This is useful when you want to specify a flag inside the regular expression itself rather than passing it to re.compile.
(?#...): No matching, only used as a comment
(?=...): Matches only if followed by ... (positive lookahead); the text looked at is not consumed.
(?!...): Matches only if not followed by ... (negative lookahead).
(?<=...): Matches only if preceded by ... (positive lookbehind).
(?<!...): Matches only if not preceded by ... (negative lookbehind).
(?(id/name)yes-pat|no-pat): If the group with the given id or name has matched, tries to match yes-pat, otherwise no-pat.
\number: Refers back to the matching group with that number. Matching groups are numbered from 1.

r'(^\s*(\w+)(::)?(?(3)\2|))'
# If matching group 3 (::) matched, the text of matching group 2 (\w+) must follow; otherwise nothing more is required
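
The grouping notations can be tried out as follows. This is only a sketch: the patterns, sample strings and variable names are assumptions chosen for illustration.

```python
import re

# Nested capturing groups are numbered by their opening parenthesis.
nested = re.match(r"((a)(b))", "ab").groups()

# (?:...) groups without capturing, so only one group is reported.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

# (?P<name>...) names a group; (?P=name) refers back to its text.
named = re.search(r"(?P<key>\w+)=(?P<value>\d+)", "width=20")
repeated = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine").group()

# Lookahead and lookbehind check the context without consuming it.
py_files = re.findall(r"\w+(?=\.py)", "a.py b.txt c.py")
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $30 for 12 items")

# Conditional: a closing > is required only if group 1 (the <) matched.
cond = re.compile(r"^(<)?\w+(?(1)>)$")
cond_results = [bool(cond.match(s)) for s in ("<tag>", "tag", "<tag", "tag>")]

print(nested)                               # ('ab', 'a', 'b')
print(non_capturing)                        # ('c',)
print(named.group("key"), named.group(2))   # width 20
print(repeated)                             # is is
print(py_files, unpriced)                   # ['a', 'c'] ['12']
print(cond_results)                         # [True, True, False, False]
```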

\A: only matches the beginning of the string
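\b: Matches the empty string at the beginning or end of a word (a word boundary)
\B: Matches the empty string only where there is no word boundary; the opposite of \b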
\d / \D: \d matches any decimal digit; \D matches any character that is not a decimal digit
\s / \S: \s matches any whitespace character (including \n, newline, and \t, tab); \S matches any character that is not whitespace
\w / \W: For Unicode (str) patterns, \w matches digits, letters and the underscore, including letters in other languages such as Chinese characters; it does not match punctuation such as ';' or '\', nor whitespace. If the ASCII flag is set, \w matches only [a-zA-Z0-9_]. \W is the opposite of \w.
\Z: only matches the end of the string
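
The special sequences can be compared side by side in a short sketch; the sample strings are assumptions made for illustration.

```python
import re

text = "cat concat cat9 bobcat"

# \b requires a word boundary, \B requires that there is none.
whole_words = re.findall(r"\bcat\b", text)
inside_words = re.findall(r"\Bcat", text)

# \d matches digits, \s matches whitespace such as space, tab and newline.
digits = re.findall(r"\d+", "a1 b22 c333")
fields = re.split(r"\s+", "a b\tc\nd")

# \w is Unicode-aware for str patterns unless re.ASCII is given.
unicode_words = re.findall(r"\w+", "héllo 中文_1; x")
ascii_words = re.findall(r"\w+", "héllo 中文_1; x", re.ASCII)

# \A and \Z ignore re.MULTILINE and a trailing newline, unlike ^ and $.
starts_a = re.findall(r"\Aab", "ab\nab", re.M)
starts_caret = re.findall(r"^ab", "ab\nab", re.M)
end_z = re.search(r"end\Z", "the end\n")
end_dollar = re.search(r"end$", "the end\n")

print(whole_words, inside_words)   # ['cat'] ['cat', 'cat']
print(digits, fields)              # ['1', '22', '333'] ['a', 'b', 'c', 'd']
print(unicode_words)               # ['héllo', '中文_1', 'x']
print(ascii_words)                 # ['h', 'llo', '_1', 'x']
print(starts_a, starts_caret)      # ['ab'] ['ab', 'ab']
print(end_z, end_dollar.group())   # None end
```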

flags

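re.A / re.ASCII: Makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only
re.I / re.IGNORECASE: Case-insensitive matching; both forms are acceptable
re.S / re.DOTALL: Makes "." match any character including the newline; mainly used so that "." also matches "\n"
re.M / re.MULTILINE: Makes ^ and $ match at the beginning and end of each line, not only of the whole string
re.L / re.LOCALE: Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale; can only be used with bytes patterns
re.DEBUG: Displays debug information about the compiled expression
re.X / re.VERBOSE: Allows whitespace and # comments inside the pattern so that it can be laid out more readably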

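Special characters that need to be escaped with a backslash to be matched literally include + and *, and likewise the other metacharacters: . ^ $ ? { } [ ] \ | ( )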
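
The effect of the most common flags, and of escaping, can be sketched like this. The sample text and names are assumptions; re.escape is the standard-library helper that escapes special characters automatically.

```python
import re

text = "first line\nSecond line"

# re.M lets ^ match at the start of every line.
first_words = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.M)

# re.S lets "." match the newline as well.
no_dotall = re.search(r"line.Second", text)
dotall = re.search(r"line.Second", text, re.S)

# re.I ignores case; (?i) sets the same flag inside the pattern.
with_flag = re.findall(r"second", text, re.I)
inline = re.findall(r"(?i)second", text)

# Flags are combined with |.
combined = re.findall(r"^second", text, re.I | re.M)

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
""", re.X)

# Escaping makes + and * literal; re.escape does it for a whole string.
literal = re.fullmatch(re.escape("a+b*c"), "a+b*c")

print(first_words, line_starts)   # ['first'] ['first', 'Second']
print(no_dotall, dotall.group() == "line\nSecond")   # None True
print(with_flag, inline, combined)   # ['Second'] ['Second'] ['Second']
print(bool(number.fullmatch("3.14")), bool(number.fullmatch("3.")))   # True False
print(literal is not None)   # True
```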

Function

Compile

compile(pattern, flags=0): Compiles the regular expression pattern and returns a regular expression object. Pre-compiling is recommended when a pattern is used repeatedly, but it is not required. With a compiled object, call its methods; without one, call the module-level functions. The function names and the method names are the same.
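
A minimal sketch of the two calling styles, with an assumed sample string. The compiled object's methods take optional pos and endpos arguments where the module-level functions take flags.

```python
import re

word = re.compile(r"\w+")

# Method on the compiled object and module-level function give the same result.
via_method = word.findall("one two")
via_function = re.findall(r"\w+", "one two")

# Pattern.match can start matching at a given position.
from_pos = word.match("  one", 2)

print(via_method, via_function)   # ['one', 'two'] ['one', 'two']
print(from_pos.group())           # one
```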

Search

match(pattern, string, flags=0): Tries to match the pattern at the beginning of the string only; a match that could occur later in the string is not found. If the match succeeds, a match object is returned; otherwise None is returned. Even in re.MULTILINE mode, match only matches at the beginning of the string, not at the beginning of each line.

# The following match succeeds and a match object is returned.
>>> re.match(r'\w+', 'hello123=@')
<re.Match object; span=(0, 8), match='hello123'>
# If the string does not begin with a letter, digit or underscore, the match fails: None is returned and nothing is printed.
>>> re.match(r'\w+', r'\(\)hello123=@')

search(pattern, string, flags=0): Scans the string from left to right for the first location where the pattern matches and returns a match object. If no match is found, None is returned. Note: only the first match from the left is returned; the search does not continue after it.

# With search, the second example above returns a match object
>>> re.search(r'\w+', r'\(\)hello123=@')
<re.Match object; span=(4, 12), match='hello123'>

fullmatch(pattern, string, flags=0): Different from match and search, a match object will be returned only if the entire string matches, otherwise None will be returned.

>>> re.fullmatch(r'\w+', 'hello123=@') # Returns None
>>> re.fullmatch(r'\w+=@', 'hello123=@') # Returns the match object
<re.Match object; span=(0, 10), match='hello123=@'>
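
The three functions can be compared on one multi-line string. This is a sketch with an assumed sample string; it also shows that re.MULTILINE does not change where match starts.

```python
import re

text = "first\nsecond"

# match is tied to position 0, even with re.M.
at_start = re.match(r"second", text, re.M)

# search with ^ and re.M finds the beginning of the second line.
line_start = re.search(r"^second", text, re.M)

# fullmatch requires the whole string to match.
partial = re.fullmatch(r"\w+", text)
complete = re.fullmatch(r"\w+\n\w+", text)

print(at_start)            # None
print(line_start.span())   # (6, 12)
print(partial, complete.group() == text)   # None True
```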

Separation

split(pattern, string, maxsplit=0, flags=0): Splits the string at each occurrence of pattern and returns a list of the resulting parts. If the pattern contains a capturing group, the text matched by the group also appears as an element of the list. maxsplit is the maximum number of splits: if it is non-zero, at most maxsplit splits are made and the remainder of the string is returned as the last element of the list.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']

Find

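findall(pattern, string, flags=0): Scans the string from left to right and returns all matches as a list, in the order found. Unlike match or search, this function traverses the entire string and finds every non-overlapping substring that matches pattern.
If the pattern has no capturing group, the list contains the matched strings; with one capturing group it contains the text of that group; with several capturing groups it contains tuples of the group texts.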

>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest') # No capturing group, returns a list of all matched strings
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10') # Two capturing groups, returns a list of tuples
[('width', '20'), ('height', '10')]
>>> re.findall(r'(\w+)', 'set width=20 and height=10') # One capturing group, returns a list of strings
['set', 'width', '20', 'and', 'height', '10']

finditer(pattern, string, flags=0): Like findall, scans the string from left to right and finds the matches in order, but returns an iterator that yields a match object for each match rather than a list of strings.
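
A short sketch of finditer, reusing the sample string from the findall examples. Because each item is a match object, positions are available as well as text.

```python
import re

text = "set width=20 and height=10"

pairs = []
for m in re.finditer(r"(\w+)=(\d+)", text):
    # Each m is a match object, so groups and spans can both be read.
    pairs.append((m.group(1), int(m.group(2)), m.span()))

print(pairs)  # [('width', 20, (4, 12)), ('height', 10, (17, 26))]
```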

Replace

sub(pattern, repl, string, count=0, flags=0): Replaces the matches of pattern in string with repl and returns the resulting string. The original string remains unchanged. If count is non-zero, at most count replacements are made.
For more complex situations, repl can be a callback function that receives the match object as its argument and returns the replacement string.

subn(pattern, repl, string, count=0, flags=0): Same as sub, but returns a two-element tuple containing the new string and the number of replacements made.
An open question remains: how can several different regular expressions be replaced at once?
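
The replacement functions can be sketched as follows. The sample data and helper names (double, replacements, combined) are hypothetical. The last part shows one possible way to apply several literal replacements in a single pass, offered as an assumption rather than the only answer to the question above.

```python
import re

text = "a1 b22 c333"

# String replacement, optionally limited by count.
replaced_all = re.sub(r"\d+", "N", text)
replaced_one = re.sub(r"\d+", "N", text, count=1)

# Group references in the replacement string.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

# Callback replacement: receives the match object, returns a string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

# subn also reports how many replacements were made.
result, count = re.subn(r"\d+", "N", text)

# One possible approach to several literal replacements in one pass:
# join the escaped keys with | and look up each match in a dict.
replacements = {"cat": "dog", "red": "blue"}
combined = re.compile("|".join(re.escape(key) for key in replacements))
multi = combined.sub(lambda m: replacements[m.group()], "red cat")

print(replaced_all, "|", replaced_one)   # aN bN cN | aN b22 c333
print(swapped, doubled)                  # value=key a2 b44 c666
print(result, count)                     # aN bN cN 3
print(multi)                             # blue dog
```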

Regular expression object

In addition to the above methods, the regular expression object compiled by compile also has the following methods or attributes:

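  • Pattern.flags: The flags the pattern was compiled with
  • Pattern.groups: Number of capturing groups
  • Pattern.groupindex: Mapping from the names of groups defined with (?P<name>...) to their group numbers
  • Pattern.pattern: The pattern string the object was compiled from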
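
These attributes can be inspected directly on a compiled object. The pattern below is an assumed example.

```python
import re

p = re.compile(r"(?P<key>\w+)=(\d+)", re.I)

source = p.pattern              # the original pattern string
group_count = p.groups          # number of capturing groups
names = dict(p.groupindex)      # group name -> group number
ignores_case = bool(p.flags & re.I)

print(source)        # (?P<key>\w+)=(\d+)
print(group_count)   # 2
print(names)         # {'key': 1}
print(ignores_case)  # True
```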

match object

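The object returned by a successful search or match call
Match.expand(template): Returns the string obtained by substituting backreferences such as \1 or \g<name> in template, as sub() does.
Match.group([group1, ...]): Returns one or more subgroups of the match. With no argument, or with 0, it returns the entire matched string; with one argument it returns that subgroup as a string; with several arguments it returns a tuple with one item per argument. Subgroups are numbered from 1. If a group did not take part in the match, None is returned for it. For a named group, the name string can be used as the argument. An invalid group number or name raises IndexError.
Match.__getitem__(g): Identical to Match.group(g), so m[0] or m['name'] can be used.
Match.groups(default=None): Returns a tuple containing all subgroups, with default used for groups that did not take part in the match. If the pattern has no subgroups, group() still returns the entire match, but groups() returns an empty tuple.
Match.groupdict(default=None): Returns a dictionary of all named subgroups, keyed by group name. Only groups named with (?P<name>...) appear in it.
Match.start([group])/Match.end([group]): The start and end indices of the substring matched by group (default 0, the whole match); -1 if the group exists but did not take part in the match.
Match.span([group]): The tuple (start(group), end(group)).
Match.pos: The pos value passed to the search() or match() method of the regular expression object.
Match.endpos: The endpos value passed to the search() or match() method of the regular expression object.
Match.lastindex: The integer index of the last matched capturing group, or None if no group was matched.
Match.lastgroup: The name of the last matched capturing group, or None if it has no name or no group was matched.
Match.re: The regular expression object whose method produced this match.
Match.string: The string passed to match() or search().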
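
The match object's methods and attributes can be explored on a single match. The pattern and input string are assumptions; the optional third group (;) is there only to show what happens when a group does not take part in the match.

```python
import re

m = re.search(r"(?P<key>\w+)=(?P<value>\d+)(;)?", "set width=20 and")

whole = m.group()                 # same as m.group(0) and m[0]
both = m.group(1, 2)              # several arguments give a tuple
by_name = m["key"]                # __getitem__ accepts names too
all_groups = m.groups()           # unmatched group 3 becomes None
filled = m.groups(default="")     # ... or the given default
named = m.groupdict()
positions = (m.span(), m.start("value"), m.end("value"), m.start(3))
last = (m.lastindex, m.lastgroup)
expanded = m.expand(r"\g<value> is \g<key>")
window = (m.pos, m.endpos)

# A group number that does not exist raises IndexError.
try:
    m.group(9)
    raised = False
except IndexError:
    raised = True

print(whole, both, by_name)   # width=20 ('width', '20') width
print(all_groups, filled)     # ('width', '20', None) ('width', '20', '')
print(named)                  # {'key': 'width', 'value': '20'}
print(positions)              # ((4, 12), 10, 12, -1)
print(last, expanded)         # (2, 'value') 20 is width
print(window, raised)         # (0, 16) True
```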
