12. Strings and regular expressions

Use regular expressions

Related knowledge about regular expressions

When writing programs or web pages that process strings, you often need to find strings that conform to some complex rules. Regular expressions are a tool for describing these rules. In other words, a regular expression defines a matching pattern for strings (how to check whether a string contains a part that matches a pattern, or how to extract or replace that part). If you have used file search in the Windows operating system with the wildcards * and ? in file names, regular expressions are a similar tool for text matching, but they are more powerful than wildcards and describe your needs more precisely. The price is that writing a regular expression is much more complicated than typing a wildcard; anything that gives you a benefit comes at a cost, just like learning a programming language. For example, you can write a regular expression to find every string that starts with 0, followed by 2-3 digits, then a hyphen "-", and finally 7 or 8 digits (such as 028-12345678 or 0813-7654321). Isn't that the format of a domestic landline number?

In the beginning, computers were built for mathematical operations, and the information they processed was mostly numerical. Today, most of the information we handle in our daily work is text, and we want computers to recognize and process text that matches certain patterns, which is why regular expressions are so important. Almost all programming languages today support regular expressions, and Python supports them through the re module in the standard library.
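
As a rough illustration of the landline rule described above (a leading 0, 2-3 more digits, a hyphen, then 7 or 8 digits), here is a minimal sketch. The name LANDLINE and the sample text are made up for this example, and the pattern is deliberately loose: it does not check what surrounds the number.

```python
import re

# Assumed pattern for illustration: 0, then 2-3 digits, a hyphen, then 7-8 digits
LANDLINE = re.compile(r'0\d{2,3}-\d{7,8}')

# Hypothetical sample text
text = 'Call 028-12345678 or 0813-7654321, not 12-345.'
print(LANDLINE.findall(text))  # ['028-12345678', '0813-7654321']
```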

Consider the following problem: we have obtained a string from somewhere (maybe a text file, or a news article on the Internet) and want to find the mobile phone numbers and landline numbers in it. We can define a mobile phone number as an 11-digit number (though not just any 11 digits, because you have never seen a mobile phone number like "25012345678") and a landline number as a string that follows the pattern described above. Without regular expressions, this task would be cumbersome.

For background on regular expressions, you can read a well-known blog post called "Regular Expressions 30-Minute Introductory Tutorial". After reading it, you should be able to understand the following table, which is our brief summary of some basic symbols in regular expressions.

Symbol          Explanation                                               Example             Notes
.               Match any character except a newline                      b.t                 Matches bat / but / b#t / b1t, etc.
\w              Match a letter, digit, or underscore                      b\wt                Matches bat / b1t / b_t, etc., but not b#t
\s              Match a whitespace character (\r, \n, \t, etc.)           love\syou           Matches love you
\d              Match a digit                                             \d\d                Matches 01 / 23 / 99, etc.
\b              Match a word boundary                                     \bThe\b
^               Match the beginning of the string                         ^The                Matches a string starting with The
$               Match the end of the string                               \.exe$              Matches a string ending with .exe
\W              Match a character that is not a letter, digit, or _       b\Wt                Matches b#t / b@t, etc., but not but / b1t / b_t
\S              Match a non-whitespace character                          love\Syou           Matches love#you, etc., but not love you
\D              Match a non-digit character                               \d\D                Matches 9a / 3# / 0F, etc.
\B              Match a position that is not a word boundary              \Bio\B
[]              Match any single character in the set                     [aeiou]             Matches any vowel
[^]             Match any single character not in the set                 [^aeiou]            Matches any character that is not a vowel
*               Match 0 or more times                                     \w*
+               Match 1 or more times                                     \w+
?               Match 0 or 1 time                                         \w?
{N}             Match exactly N times                                     \w{3}
{M,}            Match at least M times                                    \w{3,}
{M,N}           Match at least M and at most N times                      \w{3,6}
|               Alternation (branch)                                      foo|bar             Matches foo or bar
(?#)            Comment
(exp)           Match exp and capture it in an automatically numbered group
(?P<name>exp)   Match exp and capture it in a group named name                                Python syntax; some engines write (?<name>exp)
(?:exp)         Match exp without capturing the matched text
(?=exp)         Match the position just before exp                        \b\w+(?=ing)        Matches danc in I'm dancing
(?<=exp)        Match the position just after exp                         (?<=\bdanc)\w+\b    Matches the first ing in I love dancing and reading
(?!exp)         Match a position not followed by exp
(?<!exp)        Match a position not preceded by exp
*?              Repeat any number of times, but as few as possible        a.*b / a.*?b        On aabab, a.*b matches the whole string aabab; a.*?b matches aab and ab
+?              Repeat 1 or more times, but as few as possible
??              Repeat 0 or 1 time, but as few as possible
{M,N}?          Repeat M to N times, but as few as possible
{M,}?           Repeat at least M times, but as few as possible
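
A few rows of the table are easier to grasp when run. The sketch below tries the greedy and lazy quantifiers, the two lookaround examples, and a named group, using Python's re module. The group names area and number and the sample telephone string are made up for illustration.

```python
import re

# Greedy vs. lazy repetition on the string used in the table
print(re.findall(r'a.*b', 'aabab'))    # ['aabab']
print(re.findall(r'a.*?b', 'aabab'))   # ['aab', 'ab']

# Lookahead: word characters that are followed by "ing"
print(re.findall(r'\b\w+(?=ing)', "I'm dancing"))  # ['danc']

# Lookbehind: word characters that are preceded by "danc"
print(re.findall(r'(?<=\bdanc)\w+\b', 'I love dancing and reading'))  # ['ing']

# Named groups (hypothetical group names "area" and "number")
m = re.search(r'(?P<area>0\d{2,3})-(?P<number>\d{7,8})', 'Tel: 028-12345678')
print(m.group('area'), m.group('number'))  # 028 12345678
```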

Note: If the character you need to match is a special character in regular expressions, you can escape it with \. For example, to match a decimal point, write \., because a bare . matches any character. Similarly, to match parentheses you must write \( and \); otherwise the parentheses are treated as a group in the regular expression.
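
The effect of escaping can be checked with a short sketch. The sample strings are made up; re.escape is the standard-library helper that escapes special characters in a literal string for you.

```python
import re

# An unescaped dot matches any character, so '3x14' also matches
print(re.findall(r'3.14', '3.14 3x14'))    # ['3.14', '3x14']

# An escaped dot matches only a literal decimal point
print(re.findall(r'3\.14', '3.14 3x14'))   # ['3.14']

# Escaped parentheses match literal parentheses instead of forming a group
print(re.findall(r'\(\d+\)', 'f(42) and g(7)'))  # ['(42)', '(7)']

# re.escape builds the escaped pattern from a literal string
print(re.findall(re.escape('(1+1)'), 'f(1+1) = 2'))  # ['(1+1)']
```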

Python’s support for regular expressions

Python provides the re module to support regular expression related operations. The following are the core functions in the re module.

Function                                      Description
compile(pattern, flags=0)                     Compile a regular expression and return a regular expression object
match(pattern, string, flags=0)               Match the pattern at the start of the string; return a match object on success, otherwise None
search(pattern, string, flags=0)              Find the first occurrence of the pattern in the string; return a match object on success, otherwise None
split(pattern, string, maxsplit=0, flags=0)   Split the string at the delimiters matched by the pattern and return a list
sub(pattern, repl, string, count=0, flags=0)  Replace the parts of the string that match the pattern with repl; count limits the number of replacements
fullmatch(pattern, string, flags=0)           Whole-string version of match (the pattern must match from the beginning to the end of the string)
findall(pattern, string, flags=0)             Find all parts of the string that match the pattern and return them as a list of strings
finditer(pattern, string, flags=0)            Find all parts of the string that match the pattern and return an iterator of match objects
purge()                                       Clear the cache of implicitly compiled regular expressions
re.I / re.IGNORECASE                          Flag: ignore case when matching
re.M / re.MULTILINE                           Flag: multi-line matching

Note: In actual development, the functions of the re module listed above can also be replaced by the corresponding methods of a regular expression object. If a regular expression needs to be used repeatedly, it is wiser to compile it first with the compile function and reuse the resulting regular expression object.
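
As a small sketch of the point above, the following compares a module-level call with the equivalent method on a compiled object, and also shows how match, search and fullmatch differ. The pattern name digits and the sample strings are made up for illustration.

```python
import re

# Hypothetical pattern for illustration: one or more digits
digits = re.compile(r'\d+')
text = 'abc123def456'

# Module-level function and compiled-object method give the same result
print(re.findall(r'\d+', text))   # ['123', '456']
print(digits.findall(text))       # ['123', '456']

# match only tries at the start of the string, search scans forward
print(digits.match(text))             # None
print(digits.search(text).group())    # 123

# fullmatch requires the whole string to match
print(digits.fullmatch('2024') is not None)  # True
print(digits.fullmatch('2024a'))             # None
```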

Below we will tell you how to use regular expressions in Python through a series of examples.

Example 1: Verify whether the entered user name and QQ number are valid and provide corresponding prompt information.
"""
Verify whether the entered user name and QQ number are valid and give corresponding prompt information

Requirements: The username must consist of letters, digits or underscores and be 6 to 20 characters long. The QQ number must consist of 5 to 12 digits and its first digit cannot be 0.
"""
import re


def main():
    username = input('Please enter username: ')
    qq = input('Please enter your QQ number: ')
    # The first parameter of the match function is a regular expression string or regular expression object
    # The second parameter is the string object to be matched with the regular expression
    m1 = re.match(r'^[0-9a-zA-Z_]{6,20}$', username)
    if not m1:
        print('Please enter a valid username.')
    m2 = re.match(r'^[1-9]\d{4,11}$', qq)
    if not m2:
        print('Please enter a valid QQ number.')
    if m1 and m2:
        print('The information you entered is valid!')


if __name__ == '__main__':
    main()

Tip: The regular expressions above are written as raw strings (an r is added in front of the string literal). In a raw string, every character keeps its literal meaning; put more directly, a raw string has no escape sequences. Because regular expressions contain many metacharacters and places that need escaping, without a raw string you would have to write each backslash as \\. For example, \d, which represents a digit, would have to be written as \\d, which is both inconvenient to write and hard to read.
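
The difference between raw and ordinary string literals can be seen directly in a short sketch. The variable names raw and escaped and the sample sentence are made up for illustration.

```python
import re

# The same pattern written as a raw string and as an ordinary string
raw = r'\d+\.\d+'
escaped = '\\d+\\.\\d+'
print(raw == escaped)  # True

# In an ordinary string '\n' is one newline character;
# in a raw string it is a backslash followed by the letter n
print(len('\n'), len(r'\n'))  # 1 2

print(re.findall(raw, 'pi is about 3.14'))  # ['3.14']
```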

Example 2: Extract domestic mobile phone numbers from a piece of text.

The picture below shows the mobile phone number ranges launched by three domestic operators as of the end of 2017.

import re


def main():
    # Create a regular expression object; the lookbehind and lookahead ensure that no digit appears directly before or after the mobile phone number
    pattern = re.compile(r'(?<=\D)1[34578]\d{9}(?=\D)')
    sentence = '''
    Important things have been said 8130123456789 times. My mobile phone number is this beautiful number 13512346789.
    It's not 15600998765, it's also 110 or 119. Wang Dachui's mobile phone number is 15600998765.
    '''
    # Find all matches and save them in a list
    mylist = re.findall(pattern, sentence)
    print(mylist)
    print('--------Gorgeous dividing line--------')
    # Take out the matching object through the iterator and obtain the matching content
    for temp in pattern.finditer(sentence):
        print(temp.group())
    print('--------Gorgeous dividing line--------')
    # Specify the search position through the search function to find all matches
    m = pattern.search(sentence)
    while m:
        print(m.group())
        m = pattern.search(sentence, m.end())


if __name__ == '__main__':
    main()

Note: The regular expression above for matching domestic mobile phone numbers is not good enough. For example, the only valid prefixes starting with 14 are 145 and 147, and the expression does not take this into account. To match domestic mobile phone numbers, a better regular expression is: (?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D). Numbers starting with 19 and 16 also seem to have appeared in China recently, but we do not consider them for now.
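
The stricter expression can be tried on a small sample. This is only a sketch: the names MOBILE and MOBILE_AT_EDGES and the sample sentence are made up. One thing worth noticing is that (?<=\D) and (?=\D) each require an actual non-digit character, so a number at the very start or end of the string is missed. The second pattern is a possible variation (not from the text above) that uses negative lookarounds to cover that case.

```python
import re

# The improved pattern from the note above
MOBILE = re.compile(
    r'(?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D)'
)

# Hypothetical sample text
sentence = 'Call 13512346789, not 14212345678 or 8130123456789. Backup: 14512345678!'
print(MOBILE.findall(sentence))  # ['13512346789', '14512345678']

# Assumed variation: "not preceded/followed by a digit" also works at string edges
MOBILE_AT_EDGES = re.compile(
    r'(?<!\d)(?:1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?!\d)'
)
print(MOBILE.findall('13512346789'))           # []
print(MOBILE_AT_EDGES.findall('13512346789'))  # ['13512346789']
```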

Example 3: Replace bad content in string
import re


def main():
    sentence = "Are you stupid? I'll fuck you. Fuck you."
    # Replace each listed word with '*', ignoring case
    purified = re.sub(r'fuck|shit|stupid|shabi', '*', sentence, flags=re.IGNORECASE)
    print(purified)  # Are you *? I'll * you. * you.


if __name__ == '__main__':
    main()

Note: The regular expression functions of the re module take a flags parameter, which sets the matching flags of the regular expression. You can use it to specify whether to ignore case, whether to perform multi-line matching, whether to display debugging information, and so on. To specify multiple values for flags, combine them with the bitwise OR operator, for example flags=re.I | re.M.
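
To see what combining flags does, here is a minimal sketch that runs the same pattern with no flags, with each flag alone, and with both combined. The three-line sample text is made up for illustration.

```python
import re

# Hypothetical three-line sample text
text = 'The cat\nthe dog\nA bird'

# No flags: ^ matches only at the start of the string, and case matters
print(re.findall(r'^the', text))                       # []

# re.I alone: case is ignored, but ^ still matches only at the start
print(re.findall(r'^the', text, flags=re.I))           # ['The']

# re.M alone: ^ matches at the start of every line, case still matters
print(re.findall(r'^the', text, flags=re.M))           # ['the']

# Both flags combined with bitwise OR
print(re.findall(r'^the', text, flags=re.I | re.M))    # ['The', 'the']
```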

Example 4: Split long string
import re


def main():
    poem = ('There is bright moonlight in front of the window, '
            'it is suspected to be frost on the ground. '
            'Raise your head to look at the bright moon, '
            'lower your head to think about your hometown. ')
    # Split on a comma or period plus any whitespace that follows it
    sentence_list = re.split(r'[,.]\s*', poem)
    # Drop the empty string left after the final period
    sentence_list = [s for s in sentence_list if s]
    print(sentence_list)
    # ['There is bright moonlight in front of the window',
    #  'it is suspected to be frost on the ground',
    #  'Raise your head to look at the bright moon',
    #  'lower your head to think about your hometown']


if __name__ == '__main__':
    main()

Afterword

If you want to develop crawler applications, regular expressions are a very good assistant, because they help us quickly find a specified pattern in the code of a web page and extract the information we need. Of course, for beginners it may not be easy to write a correct and appropriate regular expression (although some commonly used regular expressions can be found directly on the Internet), so when actually developing crawler applications, many people choose Beautiful Soup or lxml for matching and information extraction. The former is simple and convenient but performs poorly; the latter is easy to use and performs well, but is a little troublesome to install. We will introduce these topics in the later crawler section.