Table of Contents
1. Overview
2. Metacharacters
Basic metacharacters:
Repeating metacharacters:
Positional metacharacters:
Other metacharacters
Escapes:
3. Commonly used regular expressions
4. Methods of re module
5. Advanced use of regular expressions
.*?
Pattern modifiers
6. Regular parsing data demo
1. Overview
A regular expression (often shortened to regex) is an expression that describes the arrangement rules of a piece of text.
Regular expressions are not part of the Python language itself. They are a powerful text-processing tool that is independent of any one programming language and is used to handle complex text. Regular expressions have their own rule grammar and their own processing engine. Once we write rules (patterns) in that grammar, the engine can search text by pattern, and can also split, replace, and otherwise manipulate text by pattern. Programming languages generally drive the engine through a library. For example, Python provides the re module in the standard library, and the third-party regex module is also available.
Regular expressions are generally less efficient than built-in string operations for simple text processing, but they are far more expressive.
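As a rough illustration of that trade-off, the sketch below splits the same text once with a built-in string method and once with a regular expression. The sample string and variable names are made up for this example.

```python
import re

# Hypothetical sample text with mixed separators (not from the article).
line = "apple, banana;cherry  melon"

# A built-in string method handles one fixed separator simply.
by_comma = line.split(", ")

# A regular expression can describe "comma, semicolon, or whitespace, one or more times".
by_pattern = re.split(r"[,;\s]+", line)

print(by_comma)    # ['apple', 'banana;cherry  melon']
print(by_pattern)  # ['apple', 'banana', 'cherry', 'melon']
```

When one fixed separator is enough, the string method is the simpler choice; the pattern earns its cost once the rule becomes fuzzy.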
Regular expressions predate Perl, but the syntax used by most programming languages today largely follows the conventions that Perl popularized. Therefore, much of what we learn about regular expressions in Python carries over, with some dialect differences, to languages such as Java, PHP, Go, and JavaScript, and to many SQL databases.
For crawlers, regular expressions are the most basic data-parsing tool.
● Regular expressions operate on strings or text; the operations come down to splitting, matching, finding, and replacing.
● Online regular expression testing tool: http://tool.chinaz.com/regex/
2. Metacharacters
Metacharacters are characters that have a special meaning in a pattern.
Python provides modules for processing regular expressions, including the re module of the standard library and the third-party module regex.
After importing the `re` module, you can start using regular expressions.
Basic metacharacters:
Metacharacter | Description |
. | Called the wildcard (or wildcard metacharacter); matches any single atom except the newline character \n |
[] | Matches any one atom that appears inside the square brackets |
[^atom] | Matches any one atom that does not appear inside the square brackets; the ^ negates the set |
Usage examples are as follows:
```python
import re

# Wildcard: . matches any single atom except the newline character \n
ret = re.findall(".", "a,b,c,d,e")
ret1 = re.findall("a.b", "a,b,c,d,e,acb,abb,a\nb,a\tb")
print("ret:", ret)
print("ret1:", ret1)
print("*" * 40)

# Character set: [] matches any one atom that appears inside the square brackets
ret3 = re.findall("[ace]", "a,b,c,d,e")
ret4 = re.findall("a[ace]f", "af,abbf,acef,aef")
print("ret3:", ret3)
print("ret4:", ret4)
ret5 = re.findall("[a-z]", "a,b,c,d,e")  # all lowercase letters
ret6 = re.findall("[0-9]", "1,2,3,4,5")  # all digits
ret7 = re.findall("[a-zA-Z]", "a,b,c,d,e,A,B,C")  # all lowercase and uppercase letters
ret8 = re.findall("[a-z0-9A-Z]", "a,b,c,d,e,1,2,3,4,E,F,G")  # all letters and digits
print("ret5:", ret5)
print("ret6:", ret6)
print("ret7:", ret7)
print("ret8:", ret8)
print("*" * 40)

# \d matches digits, one atom at a time
ret9 = re.findall(r"\d", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret9:", ret9)
# \w matches word characters, one atom at a time
ret10 = re.findall(r"\w", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret10:", ret10)
# Negated character set: every atom that is not 0-9
ret11 = re.findall("[^0-9]", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret11:", ret11)

# Output:
# ret: ['a', ',', 'b', ',', 'c', ',', 'd', ',', 'e']
# ret1: ['a,b', 'acb', 'abb', 'a\tb']
# ****************************************
# ret3: ['a', 'c', 'e']
# ret4: ['aef']
# ret5: ['a', 'b', 'c', 'd', 'e']
# ret6: ['1', '2', '3', '4', '5']
# ret7: ['a', 'b', 'c', 'd', 'e', 'A', 'B', 'C']
# ret8: ['a', 'b', 'c', 'd', 'e', '1', '2', '3', '4', 'E', 'F', 'G']
# ****************************************
# ret9: ['1', '2', '3', '4']
# ret10: ['a', 'b', 'c', 'd', 'e', '1', '2', '3', '4', 'E', 'F', 'G']
# ret11: ['a', ',', 'b', ',', 'c', ',', 'd', ',', 'e', ',', ',', ',', ',', ',', 'E', ',', 'F', ',', 'G']
```
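One detail worth knowing about character sets is that most metacharacters lose their special meaning inside square brackets. A minimal sketch, using made-up sample strings:

```python
import re

# Inside [], the dot is a literal dot, not a wildcard.
dots = re.findall(r"[.]", "a.b,c.d")
print(dots)    # ['.', '.']

# A hyphen placed first or last in the set is a literal hyphen, not a range.
marks = re.findall(r"[+-]", "1+2-3")
print(marks)   # ['+', '-']

# A caret that is not the first character of the set is a literal caret.
carets = re.findall(r"[a^]", "a^b")
print(carets)  # ['a', '^']
```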
Repeating metacharacters:
Metacharacter | Description |
+ | Called the plus greedy symbol; specifies that the atom on the left appears 1 or more times |
* | Called the asterisk greedy symbol; specifies that the atom on the left appears 0 or more times |
? | Specifies that the atom on the left appears 0 or 1 times; written after another quantifier (for example +? or *?), it makes that quantifier non-greedy |
{n,m} | Called the quantity range greedy operator; specifies how many times the atom on the left appears. There are four forms: {n}, {n,}, {,m}, {n,m}, where n and m must be non-negative integers |
The code example is as follows:
```python
import re

# Repeating metacharacters: + * {} ?

# \d alone matches one digit at a time, so 234 and 888 are split up
ret = re.findall(r"\d", "a,b,234,D,6,888")
print("ret:", ret)
# +: 1 to unlimited times (greedy by default), so whole numbers come out intact
ret1 = re.findall(r"\d+", "a,b,234,D,6,888")
print("ret1:", ret1)
print("*" * 50)

# ? has two functions:
#   1. Cancel greed: write ? after another quantifier, for example +?
#   2. Mean "0 or 1 times" when it directly follows an atom
ret2 = re.findall(r"\d+?", "a,b,234,D,6,888")  # non-greedy: same effect as \d here
print("ret2:", ret2)
ret3 = re.findall("abc?", "abc,abcc,abe,ab")
print("ret3:", ret3)
print("*" * 50)

# *: 0 to unlimited times
ret4 = re.findall(r"\w+", "apple,banana,orange,melon")
print("ret4:", ret4)
ret5 = re.findall(r"\w?", "apple,banana,orange,melon")  # like \w, plus empty matches
print("ret5:", ret5)
ret6 = re.findall(r"\w*", "apple,banana,orange,melon")  # 0 times produces empty strings
print("ret6:", ret6)
print("*" * 50)

# {n,m}: from n to m times; a single number {n} means exactly n times
ret7 = re.findall(r"\w{6}?", "apple,banana,orange,melon")  # runs of 6 word characters (? changes nothing for a fixed count)
print("ret7:", ret7)
ret8 = re.findall(r"\w{1,5}?", "abc,abcc")  # non-greedy: takes the minimum, 1 character
print("ret8:", ret8)

# Output:
# ret: ['2', '3', '4', '6', '8', '8', '8']
# ret1: ['234', '6', '888']
# **************************************************
# ret2: ['2', '3', '4', '6', '8', '8', '8']
# ret3: ['abc', 'abc', 'ab', 'ab']
# **************************************************
# ret4: ['apple', 'banana', 'orange', 'melon']
# ret5: ['a', 'p', 'p', 'l', 'e', '', 'b', 'a', 'n', 'a', 'n', 'a', '', 'o', 'r', 'a', 'n', 'g', 'e', '', 'm', 'e', 'l', 'o', 'n', '']
# ret6: ['apple', '', 'banana', '', 'orange', '', 'melon', '']
# **************************************************
# ret7: ['banana', 'orange']
# ret8: ['a', 'b', 'c', 'a', 'b', 'c', 'c']
```
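The table lists four forms of the quantity range operator. The sketch below tries each form against a few made-up sample strings; `samples` and the helper `matching` are names invented for this illustration, and `re.fullmatch` is used so that each string must match in full.

```python
import re

# Hypothetical sample strings: zero to four repetitions of "a".
samples = ["", "a", "aa", "aaa", "aaaa"]


def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


exactly_two = matching(r"a{2}")     # {n}: exactly n times
two_or_more = matching(r"a{2,}")    # {n,}: at least n times
at_most_two = matching(r"a{,2}")    # {,m}: from 0 up to m times
two_to_three = matching(r"a{2,3}")  # {n,m}: from n to m times

print(exactly_two)   # ['aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']
print(at_most_two)   # ['', 'a', 'aa']
print(two_to_three)  # ['aa', 'aaa']
```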
Positional metacharacters:
Metacharacter | Description |
^ | Called the start boundary character or start anchor; matches the start position of a line |
$ | Called the end boundary character or end anchor; matches the end position of a line |
The code example is as follows:
```python
import re

# Positional metacharacters: ^ $

# Get the leading number. If the string does not start with a digit, an empty list is returned.
ret = re.findall(r"^\d+", "34,banana,255,orange,65536")
print("ret:", ret)
# Get the trailing run of word characters
ret1 = re.findall(r"\w+$", "peach,34,banana,255,orange,65536")
print("ret1:", ret1)

# Example: when matching web paths, a fixed prefix plus \d+ matches whatever number follows.
ret2 = re.findall(r"/goods/food/\d+", "/goods/food/1003")
print("ret2:", ret2)
# Anything that contains "/goods/food/\d+" matches, even in the middle of a longer string
ret3 = re.findall(r"/goods/food/\d+", "server/app01/goods/food/1003")
print("ret3:", ret3)
# Anchoring both ends requires the whole string to match, so this returns an empty list
ret4 = re.findall(r"^/goods/food/\d+$", "server/app01/goods/food/1003")
print("ret4:", ret4)

# Output:
# ret: ['34']
# ret1: ['65536']
# ret2: ['/goods/food/1003']
# ret3: ['/goods/food/1003']
# ret4: []
```
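Anchoring a pattern with both ^ and $ is a common way to require that the whole string match. As a sketch of an alternative, the standard library's `re.fullmatch` does a similar job without the anchors; the variable names below are invented for this example. One difference worth noting is that $ also matches just before a trailing newline.

```python
import re

pattern = r"/goods/food/\d+"

anchored_ok = bool(re.search("^" + pattern + "$", "/goods/food/1003"))
full_ok = bool(re.fullmatch(pattern, "/goods/food/1003"))
full_prefixed = bool(re.fullmatch(pattern, "server/app01/goods/food/1003"))
print(anchored_ok, full_ok, full_prefixed)  # True True False

# $ also matches just before a trailing newline; fullmatch does not accept it.
anchored_newline = bool(re.search("^" + pattern + "$", "/goods/food/1003\n"))
full_newline = bool(re.fullmatch(pattern, "/goods/food/1003\n"))
print(anchored_newline, full_newline)  # True False
```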
Other metacharacters
Metacharacter | Description |
| | Alternation: matches either the atom or pattern on its left or the one on its right |
() | Groups atoms or patterns so they can be operated on as a whole, and captures (extracts) the text matched by the group |
The code example is as follows:
```python
import re

# Other metacharacters:
#   |   alternation: match either the pattern on its left or the one on its right
#   ()  grouping with capture: findall returns only the captured text;
#       use (?:...) to group without capturing

# Extract 5-letter words, comparing different approaches.
# Words of any length between commas; each match consumes its trailing comma,
# so the word right after a match is skipped
ret = re.findall(r",\w+,", ",apple,banana,peach,orange,melon,")
print('ret:', ret)
# 5-character words, commas included
ret1 = re.findall(r",\w{5},", ",apple,banana,peach,orange,melon,")
print('ret1:', ret1)
# 5-character words only, without the commas; the best approach
ret2 = re.findall(r",(\w{5}),", ",apple,banana,peach,orange,melon,")
print('ret2:', ret2)
# Output:
# ret: [',apple,', ',peach,', ',melon,']
# ret1: [',apple,', ',peach,', ',melon,']
# ret2: ['apple', 'peach', 'melon']

# Extract email addresses, comparing different approaches.
# The addresses in the input strings appear redacted as "[email protected]" in the source;
# substitute real addresses to reproduce the output shown.
# All email addresses
ret3 = re.findall(r"\w+@\w+\.com", "[email protected],[email protected],....")
print('ret3:', ret3)
# The QQ number (the part before @) of each QQ mailbox
ret4 = re.findall(r"(\w+)@qq\.com", "[email protected],[email protected],....")
print('ret4:', ret4)
# All QQ and 163 mailboxes; the best approach
ret5 = re.findall(r"(?:\w+)@(?:qq|163)\.com", "[email protected],[email protected],....")
print('ret5:', ret5)
# Output:
# ret3: ['[email protected]', '[email protected]']
# ret4: ['234xyz']
# ret5: ['[email protected]', '[email protected]']
```
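How `re.findall` reports its results depends on the number of capturing groups in the pattern. The following sketch, with a made-up `text` string, shows the three cases side by side.

```python
import re

text = "apple=3,banana=12,melon=7"  # hypothetical sample data

# No group: findall returns each whole match.
whole = re.findall(r"\w+=\d+", text)
print(whole)   # ['apple=3', 'banana=12', 'melon=7']

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)   # [('apple', '3'), ('banana', '12'), ('melon', '7')]

# Alternation inside a non-capturing group: the whole match is returned again.
picked = re.findall(r"(?:apple|melon)=\d+", text)
print(picked)  # ['apple=3', 'melon=7']
```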
Escapes:
● Two functions of the escape character (\):
1. Give some ordinary characters a special function
2. Remove the special function of metacharacters so that they match literally
● Different letters after the backslash have different functions
The summary is as follows:
Metacharacter | Description |
\d | Matches a digit atom; equivalent to `[0-9]` for ASCII text |
\D | Matches a non-digit atom; equivalent to `[^0-9]` or `[^\d]` |
\w | Matches a word atom, including the underscore; equivalent to `[A-Za-z0-9_]` for ASCII text |
\W | Matches any non-word atom; equivalent to `[^A-Za-z0-9_]` or `[^\w]` |
\n | Matches a newline character |
\s | Matches any whitespace atom, including space, tab, form feed, and so on; equivalent to `[ \f\n\r\t\v]` |
\S | Matches any non-whitespace atom; equivalent to `[^ \f\n\r\t\v]` or `[^\s]` |
\b | Matches a word boundary, the position between a word atom and a non-word atom (or the start or end of the string) |
\B | Matches a non-word boundary, that is, any position where \b does not match |
\t | Matches a tab character |
Here is an example of the \b escape, which can be somewhat confusing:
```python
import re

# Escapes (\) have two functions:
#   1. Give some ordinary characters a special function
#   2. Remove the special function of metacharacters

# \b: word boundary
txt = "my name is nana.nihao,nana"
ret = re.findall(r"\bna", txt)
print(ret)
ret1 = re.findall(r"\bna\w+", txt)
print(ret1)

# Output:
# ['na', 'na', 'na']
# ['name', 'nana', 'nana']
```
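The example above covers only the first function of the backslash. The sketch below illustrates the second one (matching a metacharacter literally), the \B escape, and why the r prefix matters for \b. The `prices` string and the variable names are made up for this illustration; `re.escape` is a standard library helper.

```python
import re

prices = "3.14 and 3x14"  # hypothetical sample text

loose = re.findall(r"3.14", prices)    # unescaped . is a wildcard
strict = re.findall(r"3\.14", prices)  # \. matches only a literal dot
print(loose)   # ['3.14', '3x14']
print(strict)  # ['3.14']

# re.escape adds the backslashes needed to match a string literally.
auto = re.findall(re.escape("3.14"), prices)
print(auto)    # ['3.14']

txt = "my name is nana.nihao,nana"

# \B: "na" that does not start at a word boundary (the second half of "nana").
inner = re.findall(r"\Bna", txt)
print(inner)   # ['na', 'na']

# Without the r prefix, Python turns "\b" into a backspace character
# before the regular expression engine sees it, so nothing matches.
no_raw = re.findall("\bna", txt)
print(no_raw)  # []
```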
3. Commonly used regular expressions
● raw string: a string literal prefixed with r, in which backslashes are kept exactly as written.
A pattern often works with or without the r prefix, but without it Python may interpret sequences such as \b itself before the regular expression engine sees them, so to be on the safe side, add r.
● At work, regular expressions are generally used for validating data and user input, for crawlers, for operations and maintenance log analysis, and so on.
To validate data entered by the user, patterns such as the following are common:
Scenario | Regular Expression |
Username | ^[a-z0-9_-]{3,16}$ |
Password | ^[a-z0-9_-]{6,18}$ |
Mobile phone number | ^(?:\+86)?1[3-9]\d{9}$ |
Hexadecimal value of color | ^#?([a-f0-9]{6}|[a-f0-9]{3})$ |
Email | ^[a-z\d]+(\.[a-z\d]+)*@([\da-z](-[\da-z])?)+\.[a-z]+$ |
URL | ^(?:https:\/\/|http:\/\/)?([\da-z\.-]+)\.([a-z\.]+).\w+$ |
IP address | ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) |
HTML tag | ^<([a-z]+)([^<]+)*(?:>(.*)<\/\1> |
Chinese character range under utf-8 encoding | ^[\?-\?]+$ |
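As a minimal sketch of how such patterns might be used for validation, the code below compiles three of the patterns from the table and checks a few made-up inputs. The constant names, the helper `is_valid`, and the sample values are all invented for this illustration.

```python
import re

# Patterns copied from the table above.
USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")
PHONE = re.compile(r"^(?:\+86)?1[3-9]\d{9}$")
HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")


def is_valid(pattern, value):
    """Return True when the value matches the anchored pattern."""
    return pattern.match(value) is not None


checks = [
    is_valid(USERNAME, "moluo_01"),     # True
    is_valid(USERNAME, "ab"),           # False: shorter than 3 characters
    is_valid(PHONE, "+8613928835900"),  # True
    is_valid(PHONE, "12345678901"),     # False: second digit must be 3-9
    is_valid(HEX_COLOR, "#ff00aa"),     # True
    is_valid(HEX_COLOR, "#ff00a"),      # False: 5 hex digits
]
print(checks)  # [True, False, True, False, True, False]
```

These patterns only accept lowercase letters; passing `re.I` when compiling is one way to accept uppercase input as well.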
4. Methods of re module
Regular expressions are not part of Python's core syntax. In Python a regular expression is written as an ordinary string, and we pass that string to a function in the re module, which hands it to the regular expression engine. The engine compiles the string into an actual pattern and uses it to process the text.
The `re` module provides a set of regular processing functions that allow us to search for matches in a string:
Function | Description |
findall | Finds all matches of the pattern in the text and returns them as a list |
search | Finds the first match of the pattern at **any position** in the string. Returns a re.Match object if one exists, otherwise None |
match | Checks whether the **start position** of the string matches the pattern. Returns a re.Match object if it does, otherwise None |
split | Splits the string at matches of the pattern and returns the pieces as a list |
sub/subn | Finds matches of the pattern in the string and replaces one or more of them with other content |
compile | Compiles a pattern into a reusable pattern object, so the same rule can be applied repeatedly without rewriting it |
The code example is as follows:
```python
import re

# findall: return every match as a list
ret = re.findall(r"\d+", "apple 122 peach 34")
print("ret:", ret)
# Output:
# ret: ['122', '34']

# search: return a Match object for the first match anywhere in the string
ret1 = re.search(r"\d+", "apple 122 peach 34")
print("ret1:", ret1)
print("ret1:", ret1.group())
# Output:
# ret1: <re.Match object; span=(6, 9), match='122'>
# ret1: 122

# search combined with named groups: name the phone number and email address to extract.
# The address in the input string appears redacted as "[email protected]" in the source;
# substitute a numeric QQ address before running. The span shown depends on the exact input.
rel = re.search(r"(?P<tel>1[3-9]\d{9}).*?(?P<email>\d+@qq\.com)",
                "My mobile number is 13928835900, my email is [email protected]")
print("rel:", rel)
print(rel.group("tel"))
print(rel.group("email"))
# Output:
# rel: <re.Match object; span=(7, 34), match='13928835900, my email is [email protected]'>
# 13928835900
# [email protected]

# match: like search, but the match must begin at the start of the string
# rel2 does not start with a phone number, so match returns None
rel2 = re.match(r"^1[3-9]\d{9}.*?", "My mobile phone number is 13928835900, and my other mobile phone number is 13711112255")
print("rel2:", rel2)
# rel3 starts with a phone number
rel3 = re.match(r"^1[3-9]\d{9}.*?", "13928835900, my other mobile phone number is 13711112255")
print("rel3:", rel3.group())
# Output:
# rel2: None
# rel3: 13928835900
```
```python
import re

# split: split a string at each match
txt = "my name is moluo"
ret = re.split(r"\s", txt)
print(ret)
# Output:
# ['my', 'name', 'is', 'moluo']

# sub/subn: replace matches with the given text
s = "12 23 45 67 "
# Replace every number with hello
ret1 = re.sub(r"\d+", "hello", s)
print("ret1:", ret1)
# Replace only the first 2 numbers
ret2 = re.sub(r"\d+", "hello", s, count=2)
print("ret2:", ret2)
# Output:
# ret1: hello hello hello hello
# ret2: hello hello 45 67

# compile: compile a pattern once and reuse it
s1 = "12 apple 34 peach 77 banana"
rl = re.findall(r"\d+", s1)
s2 = "18 apple 39 peach 99 banana"
rl2 = re.findall(r"\d+", s2)
# The findall calls above repeat the same rule each time.
# Defining the rule once and reusing it is simpler and saves effort.
reg = re.compile(r"\d+")  # define the rule
print("rl:", reg.findall(s1))  # call the compiled rule
print("rl2:", reg.findall(s2))
# Output:
# rl: ['12', '34', '77']
# rl2: ['18', '39', '99']
```
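The function table also lists subn, which the examples above do not exercise. The sketch below shows it together with two other standard `re` features: a function as the replacement in `re.sub`, and the `maxsplit` argument of `re.split`. The variable names are made up for this illustration.

```python
import re

s = "12 23 45 67"

# subn works like sub but also reports how many replacements were made.
result, count = re.subn(r"\d+", "hello", s)
print(result, count)  # hello hello hello hello 4

# The replacement can be a function that receives each Match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), s)
print(doubled)        # 24 46 90 134

# maxsplit limits how many splits are performed.
parts = re.split(r"\s", "my name is moluo", maxsplit=1)
print(parts)          # ['my', 'name is moluo']
```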
5. Advanced use of regular expressions
.*?
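.*? combines three symbols: . (any atom), * (0 or more times), and ? (non-greedy). Together they match as few characters as possible up to whatever follows in the pattern, which makes the combination very useful for extracting delimited content.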
Get a feel for it with the following code examples:
```python
import re

# Advanced regular expressions: .*?

# Extract every <...> item in text and compare the effect of each pattern:
text = '<12> <xyz> <!@#$%> <1a!#e2> <>'
ret = re.findall("<.+>", text)    # greedy: runs to the last >
print("ret:", ret)
ret1 = re.findall("<.+?>", text)  # non-greedy, at least one character inside
print("ret1:", ret1)
ret2 = re.findall("<.*?>", text)  # non-greedy, empty content allowed
print("ret2:", ret2)

# Output:
# ret: ['<12> <xyz> <!@#$%> <1a!#e2> <>']
# ret1: ['<12>', '<xyz>', '<!@#$%>', '<1a!#e2>']
# ret2: ['<12>', '<xyz>', '<!@#$%>', '<1a!#e2>', '<>']
```
Pattern modifiers
Pattern modifiers, also called regular modifiers, enhance or add functionality to a regular pattern.
Modifier | Variable provided by the re module | Description |
i | re.I | Makes the pattern case-insensitive |
m | re.M | Multi-line mode: ^ and $ match at the start and end of every line, not only of the whole text |
s | re.S | Lets the wildcard . match every atom, including the newline \n |
The code example is as follows:
```python
import re

# Advanced regular expressions: pattern modifiers, .*? combined with re.S

# Extract the contents of every <...>:
text = """
<12
>
<x
yz>
<!@#$%>
<1a!#
e2>
<>
"""
ret = re.findall("<.*?>", text)         # . stops at newlines
print("ret:", ret)
ret1 = re.findall("<.*?>", text, re.S)  # . also matches newlines
print("ret1:", ret1)

# Output:
# ret: ['<!@#$%>', '<>']
# ret1: ['<12\n>', '<x\nyz>', '<!@#$%>', '<1a!#\ne2>', '<>']
```
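The example above exercises only re.S. The sketch below does the same for re.I and re.M, and shows that flags can be combined with the | operator. The `text` string and variable names are made up for this illustration.

```python
import re

text = "Apple 1\nbanana 2\nAPPLE 3"  # hypothetical three-line sample

# re.I: ignore case.
apples = re.findall(r"apple", text, re.I)
print(apples)      # ['Apple', 'APPLE']

# Without re.M, ^ matches only at the start of the whole text.
first_only = re.findall(r"^\w+", text)
print(first_only)  # ['Apple']

# With re.M, ^ matches at the start of every line.
each_line = re.findall(r"^\w+", text, re.M)
print(each_line)   # ['Apple', 'banana', 'APPLE']

# Flags can be combined with |.
combined = re.findall(r"^apple \d$", text, re.I | re.M)
print(combined)    # ['Apple 1', 'APPLE 3']
```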
6. Regular parsing data demo
Taking the top 250 Douban movies as an example, we apply what we have learned about regular expressions to extract the 25 movie names on the first page.
Website: https://www.douban.com/doulist/134462233/
Inspecting the page shows that each movie name is nested inside several div tags.
At beginner level, the processing steps are simple:
1. Paste all of the returned HTML content into an HTML file (douban.html in the code below)
2. Create a new .py file, read the HTML file, and then extract the movie names
```python
import re

with open("douban.html", encoding="utf-8") as f:
    s = f.read()

# Extract all movie names
ret = re.findall(r'<div class="title">\s+<a.*?>(.*?)</a>', s, re.S)
for movie in ret:
    movie_name = re.sub(r'<.*?>', '', movie)  # drop any tags nested inside the link
    print(movie_name.strip())
print(len(ret))

# Output:
# The Shawshank Redemption The Shawshank Redemption
# Farewell My Concubine
# Forrest Gump
# Titanic
# This killer is not too cold Léon
# Beautiful life La vita è bella
# Spirited Away
# Schindler's List Schindler's List
# Inception
# Hachi: A Dog's Tale
# Interstellar
# The Truman Show The Truman Show
# The sea pianist La leggenda del pianista sull'oceano
# Three Idiots Make Trouble in Bollywood 3 Idiots
# WALL·E
# The spring of the cattle herding class Les choristes
# Infernal Affairs Infernal Affairs
# Zootopia
# The Marriage of the Great Sage in Journey to the West The Romance of the Cinderella in the Finale of Journey to the West
# Furnace ?
# The Godfather
# Witness for the Prosecution Witness for the Prosecution
# When happiness knocks on the door The Pursuit of Happyness
# Thrilling Flipped
# Intouchables
# 25
```
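To try the extraction without saving a page first, the sketch below runs the same pattern against a small inline fragment. The fragment is hypothetical: it is only shaped like what the pattern expects (a `<div class="title">` followed by whitespace and a link), and the real page markup may differ. The names `html`, `title_pattern`, and `names` are made up for this illustration.

```python
import re

# Hypothetical fragment shaped like what the pattern expects; not real page data.
html = """
<div class="title">
    <a href="https://example.com/1" target="_blank">
        Movie One
    </a>
</div>
<div class="title">
    <a href="https://example.com/2" target="_blank">
        <span>Tag</span> Movie Two
    </a>
</div>
"""

# Compile once; re.S lets . cross the line breaks inside each link.
title_pattern = re.compile(r'<div class="title">\s+<a.*?>(.*?)</a>', re.S)

names = []
for raw in title_pattern.findall(html):
    cleaned = re.sub(r"<.*?>", "", raw)  # drop tags nested inside the link
    names.append(cleaned.strip())

print(names)  # ['Movie One', 'Tag Movie Two']
```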