Web page data extraction — regular expressions

Table of Contents

1. Overview

2. Metacharacters

Basic metacharacters:

Repeating metacharacters:

Positional metacharacters:

Other metacharacters

Escapes:

3. Commonly used regular expressions

4. Methods of re module

5. Advanced use of regular expressions

.*?

Pattern modifiers

6. Regular parsing data demo

1. Overview

A regular expression (often shortened to regex) is an expression that describes the pattern, or arrangement rules, of a piece of text.

Regular expressions are not part of the Python language itself. They are a powerful, language-independent tool for processing complex text, with their own rule grammar and their own processing engine. Once we write rules (patterns) in that grammar, the engine can perform fuzzy searching, splitting, replacement, and other complex operations on text based on those rules, which lets developers process text information as they wish. The engine is generally driven from a programming language. In Python, for example, the standard library provides the re module, and the third-party regex module is also available.

Regular expressions are generally less efficient than built-in string operations for processing text, but they are far more powerful.

Regular expression syntax was popularized largely by the Perl language, and most other programming languages follow Perl-style syntax when they provide regular expression support. What we learn with Python's regular expressions therefore carries over, with minor differences, to Java, PHP, Go, JavaScript, SQL, and other languages.

For crawlers, regular expressions are the most basic data parsing tool.

● Regular expressions operate on strings or text: nothing more than splitting, matching, finding, and replacing.

● Online regular expression testing tool: http://tool.chinaz.com/regex/
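As a rough sketch of those four operations with Python's re module (the sample string and variable names are made up for illustration):

```python
import re

# Hypothetical sample text, not taken from a real page
line = "id=42; name=alice; id=7"

first_name = re.search(r"name=(\w+)", line).group(1)  # matching: first hit only
all_ids = re.findall(r"id=(\d+)", line)                # finding: every hit
fields = re.split(r";\s*", line)                       # splitting on a pattern
masked = re.sub(r"\d+", "#", line)                     # replacing every number

print(first_name, all_ids, fields, masked, sep="\n")
```

Each function is covered in more detail in the later sections.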

2. Metacharacters

Metacharacters are characters that have a special meaning in a pattern.

Python provides modules for processing regular expressions, including the re module of the standard library and the third-party module regex.

After importing the `re` module, you can start using regular expressions.

Basic metacharacters:

Metacharacter	Description
.	The wildcard (also called the universal wildcard): matches any single character except the newline \n
[]	Matches any one atom that appears inside the square brackets
[^atom]	Matches any one atom that does not appear inside the square brackets (negated character set)

Usage examples are as follows:

import re

# Wildcard: . matches any single character except the newline character \n

ret = re.findall(".", "a,b,c,d,e")
ret1 = re.findall("a.b", "a,b,c,d,e,acb,abb,a\nb,a\tb")
print("ret:",ret)
print("ret1:",ret1)
print("*" * 40)

#Character set: [] matches any atom appearing in a square bracket
ret3 = re.findall("[ace]", "a,b,c,d,e")
ret4 = re.findall("a[ace]f", "af,abbf,acef,aef")
print("ret3:",ret3)
print("ret4:",ret4)
ret5 = re.findall("[a-z]", "a,b,c,d,e") #Extract all lowercase letters
ret6 = re.findall("[0-9]", "1,2,3,4,5") #Extract all numbers
ret7 = re.findall("[a-zA-Z]", "a,b,c,d,e,A,B,C") #Extract all lowercase and uppercase letters
ret8 = re.findall("[a-z0-9A-Z]", "a,b,c,d,e,1,2,3,4,E,F,G") #Extract all upper/lower case letters + numbers
print("ret5:",ret5)
print("ret6:",ret6)
print("ret7:",ret7)
print("ret8:",ret8)
print("*" * 40)


# \d matches a digit, one character at a time
ret9 = re.findall(r"\d", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret9:",ret9)
# \w matches a word character (letter, digit or underscore), one character at a time
ret10 = re.findall(r"\w", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret10:",ret10)

# Negated character set: match every character that is not in 0-9
ret11 = re.findall("[^0-9]", "a,b,c,d,e,1,2,3,4,E,F,G")
print("ret11:",ret11)


#Output:
ret: ['a', ',', 'b', ',', 'c', ',', 'd', ',', 'e']
ret1: ['a,b', 'acb', 'abb', 'a\tb']
****************************************
ret3: ['a', 'c', 'e']
ret4: ['aef']
ret5: ['a', 'b', 'c', 'd', 'e']
ret6: ['1', '2', '3', '4', '5']
ret7: ['a', 'b', 'c', 'd', 'e', 'A', 'B', 'C']
ret8: ['a', 'b', 'c', 'd', 'e', '1', '2', '3', '4', 'E', 'F', 'G']
****************************************
ret9: ['1', '2', '3', '4']
ret10: ['a', 'b', 'c', 'd', 'e', '1', '2', '3', '4', 'E', 'F', 'G']
ret11: ['a', ',', 'b', ',', 'c', ',', 'd', ',', 'e', ',', ',', ',', ',', ',', 'E', ',', 'F', ',', 'G']

Repeating metacharacters:

Metacharacter	Description
+	The plus quantifier (greedy): the atom on its left appears 1 or more times
*	The asterisk quantifier (greedy): the atom on its left appears 0 or more times
?	The question mark: the atom on its left appears 0 or 1 times; placed after another quantifier, it makes that quantifier non-greedy
{n,m}	The range quantifier (greedy): specifies how many times the atom on its left appears. It has four forms, {n}, {n,}, {,m} and {n,m}, where n and m must be non-negative integers

The code example is as follows:

import re

# Repeating metacharacters: + * {} ?
# Get all complete numbers:
ret = re.findall(r"\d", "a,b,234,D,6,888")
print("ret:",ret) # 234 and 888 are split into single digits

# +: 1 or more times (greedy by default), so whole numbers are extracted as expected
ret1 = re.findall(r"\d+", "a,b,234,D,6,888")
print("ret1:",ret1)
print("*" * 50)

'''
The two functions of ?:
1. Cancel greediness: write ? after another quantifier, for example +?
2. Quantify: 0 times or 1 time
'''
ret2 = re.findall(r"\d+?", "a,b,234,D,6,888") # Non-greedy matching (same result as \d here)
print("ret2:",ret2)
ret3 = re.findall("abc?","abc,abcc,abe,ab")
print("ret3:",ret3)
print("*" * 50)

# *: 0 or more times
ret4 = re.findall(r"\w+", "apple,banana,orange,melon")
print("ret4:",ret4)
ret5 = re.findall(r"\w?", "apple,banana,orange,melon") # Like \w, but 0 times also yields empty strings
print("ret5:",ret5)
ret6 = re.findall(r"\w*", "apple,banana,orange,melon") # 0 times yields empty strings
print("ret6:",ret6)
print("*" * 50)

'''
{n,m}: from n times to m times, a fixed range
A single number also works: extract the 6-character words
'''
ret7 = re.findall(r"\w{6}?", "apple,banana,orange,melon")
print("ret7:",ret7)
ret8 = re.findall(r"\w{1,5}?", "abc,abcc")
print("ret8:",ret8)


#Output:
ret: ['2', '3', '4', '6', '8', '8', '8']
ret1: ['234', '6', '888']
***************************************************
ret2: ['2', '3', '4', '6', '8', '8', '8']
ret3: ['abc', 'abc', 'ab', 'ab']
***************************************************
ret4: ['apple', 'banana', 'orange', 'melon']
ret5: ['a', 'p', 'p', 'l', 'e', '', 'b', 'a', 'n', 'a', 'n', 'a', '', 'o', 'r', 'a', 'n', 'g', 'e', '', 'm', 'e', 'l', 'o', 'n', '']
ret6: ['apple', '', 'banana', '', 'orange', '', 'melon', '']
***************************************************
ret7: ['banana', 'orange']
ret8: ['a', 'b', 'c', 'a', 'b', 'c', 'c']
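The listing above only exercises {n} and {n,m}. A minimal sketch of the other range forms follows; the sample string is made up, and re.fullmatch is used for the {,m} case only to keep the result easy to read:

```python
import re

sample = "1 22 333 4444"  # hypothetical input

exactly_three = re.findall(r"\d{3}", sample)    # {n}: exactly n times
two_or_more = re.findall(r"\d{2,}", sample)     # {n,}: at least n times
two_to_three = re.findall(r"\d{2,3}", sample)   # {n,m}: between n and m times

# {,m}: at most m times (0 is allowed); fullmatch requires the whole string to match
fits = re.fullmatch(r"\d{,2}", "22") is not None
too_long = re.fullmatch(r"\d{,2}", "333") is not None

print(exactly_three, two_or_more, two_to_three, fits, too_long)
```

Note that \d{3} finds '444' inside 4444, because findall looks for the pattern anywhere in the text rather than for whole numbers.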

Positional metacharacters:

Metacharacter	Description
^	The start anchor (start boundary character): matches the beginning of a line
$	The end anchor (end boundary character): matches the end of a line

The code example is as follows:

import re

# Positional metacharacters: ^ $
# Get the number at the start of the string. If the string does not start with a digit, an empty list is returned.
ret = re.findall(r"^\d+", "34,banana,255,orange,65536")
print("ret:",ret)

# Get the word at the end of the string
ret1 = re.findall(r"\w+$", "peach,34,banana,255,orange,65536")
print("ret1:",ret1)

# Example: when extracting web page data, fix the leading part of the path and match whatever number follows it.
ret2 = re.findall(r"/goods/food/\d+","/goods/food/1003")
print("ret2:",ret2)

# Anything that satisfies "/goods/food/\d+" is matched, wherever it appears in the string
ret3 = re.findall(r"/goods/food/\d+","server/app01/goods/food/1003")
print("ret3:",ret3)

# Anchoring both the start and the end requires the whole string to match, so an empty list is returned
ret4 = re.findall(r"^/goods/food/\d+$","server/app01/goods/food/1003")
print("ret4:",ret4)



#Output:
ret: ['34']
ret1: ['65536']
ret2: ['/goods/food/1003']
ret3: ['/goods/food/1003']
ret4: []
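One subtlety of $ is worth a small sketch: in Python's re module, $ also matches just before a newline at the very end of the string. When the whole string must match exactly, re.fullmatch is a stricter alternative. The path below reuses the example above with a trailing newline added for illustration:

```python
import re

path = "/goods/food/1003\n"  # note the trailing newline

# $ matches just before the final newline, so this still succeeds
anchored = re.findall(r"^/goods/food/\d+$", path)

# fullmatch requires the pattern to cover the entire string, newline included
strict = re.fullmatch(r"/goods/food/\d+", path)

print(anchored)
print(strict)
```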

Other metacharacters

Metacharacter	Description
|	Alternation: matches any one of the specified atoms or patterns
()	Grouping: captures and extracts a pattern, and lets the group be treated as a single unit

The code example is as follows:

import re

'''Other metacharacters:
|: alternation, match any one of the specified atoms or patterns.
(): captures a pattern, so that only the captured data is returned. Capturing can be turned off with (?:)
'''
# Extract 5-letter words, comparing different approaches:
# Words of any length with a comma on each side; matches cannot overlap, so a word whose leading comma was already consumed is skipped
ret = re.findall(r",\w+,", ",apple,banana,peach,orange,melon,")
print('ret:',ret)

# 5-character words with a comma on each side
ret1 = re.findall(r",\w{5},", ",apple,banana,peach,orange,melon,")
print('ret1:',ret1)

# 5-character words without the commas, words only; the best approach
ret2 = re.findall(r",(\w{5}),", ",apple,banana,peach,orange,melon,")
print('ret2:',ret2)

#Output:
ret: [',apple,', ',peach,', ',melon,']
ret1: [',apple,', ',peach,', ',melon,']
ret2: ['apple', 'peach', 'melon']




# Extract email addresses, comparing different approaches:
# Match all .com addresses
ret3 = re.findall(r"\w+@\w+\.com", "[email protected],[email protected],....")
print('ret3:',ret3)

# Extract the QQ number from each QQ mailbox address
ret4 = re.findall(r"(\w+)@qq\.com", "[email protected],[email protected],....")
print('ret4:',ret4)

# Match all QQ and 163 mailbox addresses; the best approach
ret5 = re.findall(r"(?:\w+)@(?:qq|163)\.com", "[email protected],[email protected],....")
print('ret5:',ret5)


#Output:
ret3: ['[email protected]', '[email protected]']
ret4: ['234xyz']
ret5: ['[email protected]', '[email protected]']
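The addresses in the listing above are obscured, so here is a self-contained sketch of the same three patterns. The addresses are invented for illustration and are not the ones the original example used:

```python
import re

# Made-up addresses for illustration only
text = "10001@qq.com,user_a@163.com,someone@example.com"

all_com = re.findall(r"\w+@\w+\.com", text)                # every .com address
qq_numbers = re.findall(r"(\w+)@qq\.com", text)            # only the captured QQ number
qq_or_163 = re.findall(r"(?:\w+)@(?:qq|163)\.com", text)   # non-capturing groups: full address
capturing = re.findall(r"\w+@(qq|163)\.com", text)         # capturing group: only the provider

print(all_com, qq_numbers, qq_or_163, capturing, sep="\n")
```

The last two lines show why (?:) matters: with a capturing group, findall returns only the group's content instead of the whole match.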

Escapes:

● The escape character (\) has two functions:

1. Give special functions to certain ordinary characters

2. Remove the special function from special symbols

● Combined with different letters, it has different functions

The summary is as follows:

Metacharacter	Description
\d	Matches a digit atom. Equivalent to `[0-9]` for ASCII text
\D	Matches a non-digit atom. Equivalent to `[^0-9]` or `[^\d]`
\w	Matches a word atom, including the underscore. Equivalent to `[A-Za-z0-9_]` for ASCII text
\W	Matches any non-word character. Equivalent to `[^A-Za-z0-9_]` or `[^\w]`
\n	Matches a newline character
\s	Matches any whitespace atom, including space, tab, form feed, etc. Equivalent to `[ \f\n\r\t\v]`
\S	Matches any non-whitespace atom. Equivalent to `[^ \f\n\r\t\v]` or `[^\s]`
\b	Matches a word boundary, the position between a word character and a non-word character (or the start or end of the string)
\B	Matches a position that is not a word boundary
\t	Matches a tab character

Take an example of the somewhat convoluted “\b” escape character:

import re

'''
Escapes (backslash):
Two functions:
1. Give special functions to certain ordinary characters
2. Remove the special function from special symbols
'''
# \b: the word boundary effect:
txt = "my name is nana.nihao,nana"
ret = re.findall(r"\bna",txt)
print(ret)
ret1 = re.findall(r"\bna\w+",txt)
print(ret1)
print(ret1)


#Output:
['na', 'na', 'na']
['name', 'nana', 'nana']
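Two related details can be sketched in the same way: \B is the opposite of \b, and for str patterns in Python 3 \d matches any Unicode decimal digit unless the re.ASCII flag is given. The sample strings below are made up for illustration:

```python
import re

txt = "my name is nana"

# \B: "na" that is NOT at the start of a word (only the second "na" in "nana")
inside_word = re.findall(r"\Bna", txt)

# A string containing a full-width digit two (U+FF12) between ASCII digits
digits = "1" + "\uff12" + "3"
unicode_digits = re.findall(r"\d", digits)          # Unicode-aware by default
ascii_digits = re.findall(r"\d", digits, re.ASCII)  # restricted to [0-9]

print(inside_word, unicode_digits, ascii_digits)
```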

3. Commonly used regular expressions

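● A raw string (a string literal written with an r prefix) passes backslashes to the regular expression engine unchanged;

Many patterns run with or without the r prefix, but without it Python may interpret an escape itself (for example, "\b" becomes a backspace character), so to be on the safe side, always add r.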
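A small sketch of why the r prefix matters, using \b as the example (the sample text is made up):

```python
import re

txt = "my name is nana"

plain = "\bna"   # Python turns \b into a backspace character: 3 characters in total
raw = r"\bna"    # the backslash reaches the regex engine intact: 4 characters

print(len(plain), len(raw))
print(re.findall(plain, txt))  # looks for backspace + "na", finds nothing
print(re.findall(raw, txt))    # looks for "na" at a word boundary
```

No error is raised in either case, which is exactly what makes the missing prefix easy to overlook.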

● At work, regular expressions are generally used for data validation, checking user input, crawlers, operations log analysis, and so on.

To verify data entered by the user:

Scenario	Regular Expression
Username	^[a-z0-9_-]{3,16}$
Password	^[a-z0-9_-]{6,18}$
Mobile phone number	^(?:\+86)?1[3-9]\d{9}$
Hexadecimal value of color	^#?([a-f0-9]{6}|[a-f0-9]{3})$
E-mail	^[a-z\d]+(\.[a-z\d]+)*@([\da-z](-[\da-z])?)+\.[a-z]+$
URL	^(?:https:\/\/|http:\/\/)?([\da-z\.-]+)\.([a-z\.]+).\w+$
IP address	((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
HTML tag	^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$
Chinese character range under utf-8 encoding	^[\u4e00-\u9fa5]+$
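As a minimal sketch of how such patterns might be used for input validation, the following adapts the username, mobile phone number and color rows of the table. The PATTERNS dictionary and the is_valid helper are hypothetical names introduced only for this example:

```python
import re

# Patterns adapted from the table; the names below are illustrative assumptions
PATTERNS = {
    "username": re.compile(r"^[a-z0-9_-]{3,16}$"),
    "mobile": re.compile(r"^(?:\+86)?1[3-9]\d{9}$"),
    "hex_color": re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$"),
}

def is_valid(kind: str, value: str) -> bool:
    """Return True if value matches the pattern registered under kind."""
    return PATTERNS[kind].match(value) is not None

print(is_valid("username", "my_user-01"))    # within 3-16 allowed characters
print(is_valid("username", "ab"))            # too short
print(is_valid("mobile", "+8613928835900"))  # optional +86 prefix
print(is_valid("hex_color", "#a3c1"))        # neither 3 nor 6 hex digits
```

Real validation rules usually need more care than a single pattern (for example, these patterns accept lowercase letters only), so treat this as a starting point.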

4. Methods of re module

Python has no regular expression syntax built into the language itself. A regular expression in Python is an ordinary string: we pass it to a function of the re module, which hands it to the regular expression engine, and the engine compiles the string into a real regular expression that processes the text.

The `re` module provides a set of regular processing functions that allow us to search for matches in a string:

Function	Description
findall	Finds all matches of the pattern in the text and returns them as a list
search	Finds the first match of the pattern at any position in the string. Returns a re.Match object if one exists, otherwise None
match	Checks whether the start of the string matches the pattern. Returns a re.Match object if it does, otherwise None
split	Splits the string by the pattern and returns the pieces as a list
sub/subn	Finds matches of the pattern in the string and replaces one or more of them with other content
compile	Compiles a pattern into a reusable object, so the same rule can be called directly many times

The code example is as follows:

import re
# findall method:
ret = re.findall(r"\d+","apple 122 peach 34")
print("ret:",ret)
#Output:
ret: ['122', '34']


# search: returns a Match object for the first match
ret1 = re.search(r"\d+","apple 122 peach 34")
print("ret1:",ret1)
print("ret1:",ret1.group())
#Output:
ret1: <re.Match object; span=(6, 9), match='122'>
ret1: 122



# search combined with named groups: give names to the mobile phone number and email address to be extracted
rel = re.search(r"(?P<tel>1[3-9]\d{9}).*?(?P<email>\d+@qq\.com)", "My mobile number is 13928835900, my email is [email protected]")
print("rel:",rel)
print(rel.group("tel"))
print(rel.group("email"))
#Output:
rel: <re.Match object; span=(7, 34), match='13928835900, my email is [email protected]'>
13928835900
[email protected]


# match method: like search, but with an implicit ^ anchor at the start
# rel2: the string does not start with 1, so None is returned.
rel2 = re.match(r"^1[3-9]\d{9}.*?", "My mobile phone number is 13928835900, and my other mobile phone number is 13711112255")
print("rel2:",rel2)

# rel3: the string starts with 1
rel3 = re.match(r"^1[3-9]\d{9}.*?", "13928835900, my other mobile phone number is 13711112255")
print("rel3:",rel3.group())


#Output:
rel2: None
rel3: 13928835900

import re
# split method: string splitting
txt = "my name is moluo"
ret = re.split(r"\s", txt)
print(ret)
#Output:
['my', 'name', 'is', 'moluo']


# sub/subn: replace matches with the given text
s = "12 23 45 67 "
# Replace all numbers with hello
ret1 = re.sub(r"\d+","hello",s)
print("ret1:",ret1)


# Replace only the first 2 numbers
ret2 = re.sub(r"\d+","hello",s,count=2)
print("ret2:",ret2)
#Output:
ret1: hello hello hello hello
ret2: hello hello 45 67


# compile method: compile
s1 = "12 apple 34 peach 77 banana"
rl = re.findall(r"\d+",s1)

s2 = "18 apple 39 peach 99 banana"
rl2 = re.findall(r"\d+",s2)
'''
As shown above, the pattern has to be written out again for every findall call.
Instead, we can compile the rule once and reuse it, which is simpler and saves effort.
'''
reg = re.compile(r"\d+") # Define the rule
print("rl:",reg.findall(s1)) # Call the compiled rule
print("rl2:",reg.findall(s2))
#Output:
rl: ['12', '34', '77']
rl2: ['18', '39', '99']
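The table lists subn next to sub, but the examples only call sub. A minimal sketch of subn on a compiled pattern follows; it returns the new string together with the number of replacements made (the variable names are illustrative):

```python
import re

s = "12 23 45 67"
number = re.compile(r"\d+")  # compiled once, reused below

# subn returns a tuple: (new_string, number_of_replacements)
replaced_all, n_all = number.subn("hello", s)
replaced_two, n_two = number.subn("hello", s, count=2)

print(replaced_all, n_all)
print(replaced_two, n_two)
```

The count is handy in crawlers and log processing when you want to know whether a substitution actually changed anything.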

5. Advanced use of regular expressions

.*?

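.*?: these three symbols are extremely powerful when put together. The . matches any character except a newline, * repeats it 0 or more times, and the trailing ? makes the repetition non-greedy, so the match stops at the earliest position where the rest of the pattern can match.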

Get a feel for it with the following code examples:

import re

'''
Advanced regular expressions: .*?
'''
# Extract every <...> item in text, comparing the results:
text = '<12> <xyz> <!@#$%> <1a!#e2> <>'
ret = re.findall("<.+>", text)
print("ret:",ret)

ret1 = re.findall("<.+?>", text)
print("ret1:",ret1)

ret2 = re.findall("<.*?>", text)
print("ret2:",ret2)


#Output:
ret: ['<12> <xyz> <!@#$%> <1a!#e2> <>']
ret1: ['<12>', '<xyz>', '<!@#$%>', '<1a!#e2>']
ret2: ['<12>', '<xyz>', '<!@#$%>', '<1a!#e2>', '<>']
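A common alternative to the non-greedy .*? is a negated character set. The sketch below is a swapped-in technique for comparison, not something the original example uses: [^>]* means "any run of characters that are not >", so it stops at the first > by construction and also crosses newlines.

```python
import re

text = '<12> <xyz> <!@#$%> <1a!#e2> <>'

lazy = re.findall(r"<.*?>", text)       # non-greedy wildcard
negated = re.findall(r"<[^>]*>", text)  # negated character set

# [^>] also matches newlines, so no extra flag is needed for multi-line items
multiline = re.findall(r"<[^>]*>", "<12\n> <x\n yz>")

print(lazy == negated)
print(multiline)
```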

Pattern modifiers

Pattern modifiers, also called regular modifiers, are designed to enhance or add functionality to regular patterns.

Modifier	Variable provided by the re module	Description
i	re.I	Makes the pattern case-insensitive
m	re.M	Multi-line mode: ^ and $ also match at the start and end of each line in multi-line text
s	re.S	Lets the wildcard . match every character, including the newline \n

The code example is as follows:

import re

'''
Advanced regular expressions: pattern modifiers
.*? with re.S
'''
# Extract the contents of all <>:
text = """
<12
>

 <x
 yz>

 <!@#$%>

 <1a!#
 e2>

 <>
"""

ret = re.findall("<.*?>", text)
print("ret:",ret)
ret1 = re.findall("<.*?>", text, re.S)
print("ret1:",ret1)

#Output:
ret: ['<!@#$%>', '<>']
ret1: ['<12\n>', '<x\n yz>', '<!@#$%>', '<1a!#\n e2>', '<>']
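The example above covers re.S only. A short sketch of re.I and re.M, using made-up sample text, might look like this:

```python
import re

# re.I: ignore case
words = re.findall(r"python", "Python PYTHON python", re.I)

# re.M: ^ and $ apply to every line, not just the whole string
text = "apple 1\nbanana 2\ncherry 3"
first_only = re.findall(r"^\w+", text)        # without re.M: start of the string only
every_line = re.findall(r"^\w+", text, re.M)  # with re.M: start of each line
line_ends = re.findall(r"\d$", text, re.M)    # with re.M: end of each line

# Flags can be combined with |
combined = re.findall(r"^B\w+", text, re.I | re.M)

print(words, first_only, every_line, line_ends, combined, sep="\n")
```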

6. Regular parsing data demo

Taking the Douban Top 250 movie list as an example, we apply what we have learned about regular expressions to extract the 25 movie names on the first page.

Website: https://www.douban.com/doulist/134462233/

The movie name is nested inside several div tags.

Beginner level, simple processing steps:

1. Save all the returned HTML content to an HTML file

2. Create a new .py file, read the HTML file, and then start extracting movie names

import re
with open("douban.html",encoding="utf-8") as f:
    s = f.read()
# Extract all movie names

ret = re.findall(r'<div class="title">\s+<a.*?>(.*?)</a>', s, re.S)
for movie in ret:
    movie_name = re.sub(r'<.*?>', '', movie)
    print(movie_name.strip())
print(len(ret))


#Output:
The Shawshank Redemption The Shawshank Redemption
Farewell My Concubine
Forrest Gump
Titanic
This killer is not too cold Léon
Beautiful life La vita è bella
Spirited Away
Schindler's List Schindler &#39;s List
Inception
Hachi: A Dog &#39;s Tale
Interstellar
The Truman Show The Truman Show
The sea pianist La leggenda del pianista sull &#39;oceano
Three Idiots Make Trouble in Bollywood 3 Idiots
WALL·E
The spring of the cattle herding class Les choristes
Infernal Affairs Infernal Affairs
Zootopia
The Marriage of the Great Sage in Journey to the West The Romance of the Cinderella in the Finale of Journey to the West
Furnace ?
The Godfather
Witness for the Prosecution Witness for the Prosecution
When happiness knocks on the door The Pursuit of Happyness
Thrilling Flipped
Intouchables
25
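Some names in the output still contain HTML entities such as &#39;. One possible cleanup, sketched here with the standard library's html.unescape, decodes them after extraction. The HTML snippet and its URLs are invented stand-ins for douban.html, assuming the same div/a structure as the pattern above:

```python
import html
import re

# Hypothetical snippet mimicking the structure the pattern expects
sample = '''
<div class="title">
    <a href="https://example.com/1" target="_blank">
        Schindler&#39;s List
    </a>
</div>
<div class="title">
    <a href="https://example.com/2">Forrest Gump</a>
</div>
'''

pattern = re.compile(r'<div class="title">\s+<a.*?>(.*?)</a>', re.S)

# Decode entities such as &#39; and trim the surrounding whitespace
names = [html.unescape(raw).strip() for raw in pattern.findall(sample)]

for name in names:
    print(name)
```

For anything beyond a quick exercise, an HTML parser is usually more robust than regular expressions, but that is outside the scope of this demo.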