Beautiful Soup 4
Overview
Beautiful Soup is a Python library for parsing HTML and XML documents. It provides convenient functions for extracting and manipulating data, making it easy to pull tags, text content, and attribute values out of web pages.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
Parsing HTML with Beautiful Soup is straightforward: the API is user-friendly, and multiple parsers are supported.
Documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Main features
Flexible parsing method:
Beautiful Soup supports multiple parsers, including the html.parser parser in the Python standard library, as well as third-party libraries lxml and html5lib. In this way, we can choose the appropriate parser for processing according to our needs.
Simple and intuitive API:
Beautiful Soup provides a concise and friendly API that makes parsing HTML documents very easy. We can use concise methods to select specific tags, get the text content in tags, extract attribute values, etc.
Powerful document traversal capabilities:
Through Beautiful Soup, we can traverse the entire HTML document tree, access, modify or delete each node, and even quickly locate the required node through nested selectors.
Tolerance for broken HTML:
Beautiful Soup can handle broken HTML documents, such as automatically correcting unclosed tags, automatically adding missing tags, etc., making data extraction more stable and reliable.
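As a minimal sketch of this fault tolerance (the fragment below is a made-up example), even the standard library parser closes a dangling tag at the end of the document:

```python
from bs4 import BeautifulSoup

# A fragment with an unclosed <p> tag (made-up example)
broken = "<p>Hello"
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the dangling tag automatically
print(soup)  # <p>Hello</p>
```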
Parser
Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it also supports several third-party libraries.
Supported parsers:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; reasonably tolerant of broken documents | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2 |
lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; strong fault tolerance | Requires installing the C library lxml |
lxml XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | The only supported XML parser; very fast | Requires installing the C library lxml |
html5lib | `BeautifulSoup(markup, "html5lib")` | Best fault tolerance; parses pages the way a browser does and produces valid HTML5 | Very slow; pure-Python external dependency |
It can be seen that:
The lxml parser can parse both HTML and XML documents, and it is fast and fault-tolerant, so it is the recommended choice. To use it, pass "lxml" as the second argument when initializing BeautifulSoup.
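A minimal sketch of selecting a parser via the second argument (the markup is a made-up example; `html.parser` is used here so the snippet runs without extra installs, but `"lxml"` works the same way once installed):

```python
from bs4 import BeautifulSoup

markup = "<html><body><p>demo</p></body></html>"

# The second argument names the parser; swap in "lxml" if it is installed
soup = BeautifulSoup(markup, "html.parser")
print(soup.p.get_text())  # demo
```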
Basic use of Beautiful Soup 4
Installing the libraries

```shell
pip install beautifulsoup4
pip install lxml
```
Create an HTML file
You can get a document object by passing a document to the BeautifulSoup constructor; you can pass in a string or an open file handle. Here, create a test.html file to build a document object from.
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
    <ul>
        <li class="class01"><span index="1">H1</span></li>
        <li class="class02"><span index="2" class="span2">H2</span></li>
        <li class="class03"><span index="3">H3</span></li>
    </ul>
</div>
</body>
</html>
```
Basic usage
```python
# Import the module
from bs4 import BeautifulSoup

# Create a BeautifulSoup object; there are two ways to do it.
# 1. From a string; the second argument specifies the parser
# soup = BeautifulSoup("html", 'lxml')
# 2. From a file
soup = BeautifulSoup(open('test.html'), 'lxml')

# Pretty-print the document
# print(soup.prettify())

# Get a tag element; by default the first match is returned
print(soup.li)

# Use .contents or .children to get child elements
# .contents returns a list
print(soup.ul.contents)
# .children returns an iterator
print(soup.li.children)

# Get an element's text content
print(soup.title.get_text())

# Get an element's attribute value; by default the first matching element is used
print(soup.li.get('class'))
```
Running it produces the following output:

```
<li class="class01"><span index="1">H1</span></li>
['\n', <li class="class01"><span index="1">H1</span></li>, '\n', <li class="class02"><span class="span2" index="2">H2</span></li>, '\n', <li class="class03"><span index="3">H3</span></li>, '\n']
<list_iterator object at 0x000001C18E475F10>
Title
['class01']
```
Object types in Beautiful Soup 4
Beautiful Soup converts a complex HTML document into a tree structure in which each node is a Python object. All objects can be summarized into four types:
Tag, NavigableString, BeautifulSoup, Comment
Tag object
In Beautiful Soup, a Tag object represents a tag element in an HTML or XML document and corresponds directly to a tag in the original markup. It carries the tag's name, attributes, and content, and provides a variety of methods to read, modify, and operate on this information.
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```
Common properties and methods of Tag objects
Attributes | Description | Example |
---|---|---|
name attribute | Get the name of the tag | `tag_name = tag.name` |
string attribute | Get the text content within the tag | `tag_text = tag.string` |
attrs attribute | Get the tag's attributes as a dictionary | `tag_attrs = tag.attrs` |
get() method | Get an attribute value by attribute name | `attr_value = tag.get('attribute_name')` |
find() method | Find and return the first matching child tag element | `child_tag = tag.find('tag_name')` |
find_all() method | Find and return all matching child tag elements as a list | `child_tags = tag.find_all('tag_name')` |
parent attribute | Get the current tag's parent tag | `parent_tag = tag.parent` |
parents attribute | Get all ancestor tags of the current tag as a generator | `for parent in tag.parents: print(parent)` |
children attribute | Get the direct child tags of the current tag as a generator | `for child in tag.children: print(child)` |
next_sibling attribute | Get the next sibling tag of the current tag | `next_sibling_tag = tag.next_sibling` |
previous_sibling attribute | Get the previous sibling tag of the current tag | `previous_sibling_tag = tag.previous_sibling` |
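To make the table concrete, here is a short sketch exercising several of these attributes against the `<ul>` fragment from test.html (inlined so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

# The <ul> fragment from test.html, inlined for self-containment
html = """
<ul>
    <li class="class01"><span index="1">H1</span></li>
    <li class="class02"><span index="2" class="span2">H2</span></li>
    <li class="class03"><span index="3">H3</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li")
print(li.name)          # li
print(li.attrs)         # {'class': ['class01']}
print(li.get("class"))  # ['class01']
print(li.span.string)   # H1
print(li.parent.name)   # ul
```

Note that `class` is a multi-valued attribute, so its value comes back as a list.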
NavigableString object
The NavigableString object is a data type in the Beautiful Soup library that represents plain text content in an HTML or XML document. It inherits from Python's built-in string type but adds features that make it well suited to working with text inside a document.
Suppose we have the following HTML code snippet:
```html
<p>This is a <b>beautiful</b> day.</p>
```
Use Beautiful Soup to parse it into a document object:
```python
from bs4 import BeautifulSoup

html = '<p>This is a <b>beautiful</b> day.</p>'
soup = BeautifulSoup(html, 'html.parser')
```
Get the first piece of text inside the `<p>` tag, which is actually a NavigableString object. (Note: `p_tag.string` returns None here, because the `<p>` tag has more than one child, so the first child node is taken instead.)

```python
p_tag = soup.find('p')
content = p_tag.contents[0]
print(content)  # Output: This is a
print(type(content))  # Output: <class 'bs4.element.NavigableString'>
```

You can also perform some operations on NavigableString objects, such as getting the text content, replacing the text, and stripping whitespace characters:

```python
# Get the text content
text = content.strip()
print(text)  # Output: This is a

# Replace the text
content.replace_with('Hello')
print(p_tag)  # Output: <p>Hello<b>beautiful</b> day.</p>

# Remove whitespace characters
text_without_spaces = p_tag.contents[0].strip()
print(text_without_spaces)  # Output: Hello
```
Attributes | Description | Example |
---|---|---|
string attribute | Get the text content of the NavigableString object | `text = navigable_string.string` |
replace_with() method | Replace the current NavigableString object with another string or object | `navigable_string.replace_with(new_string)` |
strip() method | Remove whitespace characters from both ends of the string | `stripped_text = navigable_string.strip()` |
parent attribute | Get the parent node the NavigableString object belongs to (usually a Tag object) | `parent_tag = navigable_string.parent` |
next_sibling attribute | Get the next sibling node of the NavigableString object | `next_sibling = navigable_string.next_sibling` |
previous_sibling attribute | Get the previous sibling node of the NavigableString object | `previous_sibling = navigable_string.previous_sibling` |
BeautifulSoup object
The BeautifulSoup object is the core object of the Beautiful Soup library and is used to parse and traverse HTML or XML documents.
Commonly used methods:
find(name, attrs, recursive, string, **kwargs): find the first tag matching the given tag name, attributes, text content, etc.
find_all(name, attrs, recursive, string, limit, **kwargs): find all matching tags and return them as a list
select(css_selector): find matching tags using CSS selector syntax and return them as a list
prettify(): output the whole document nicely formatted, including tags, text, and indentation
has_attr(name): check whether the current tag has the named attribute; returns a Boolean
get_text(): get the text content of the current tag and all of its descendants as a single string
Common properties:
soup.title: get the first <title> tag in the document
soup.head: get the <head> tag in the document
soup.body: get the <body> tag in the document
soup.find_all('tag'): get all matching <tag> tags in the document as a list
soup.text: get the plain text content of the whole document (tags removed)
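A short sketch of these convenience attributes on a made-up document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)               # <title>Demo</title>
print(soup.title.string)        # Demo
print(len(soup.find_all("p")))  # 2
print(soup.text)                # the document's plain text, tags removed
```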
Comment object
Comment object is a special type of object in the Beautiful Soup library, used to represent comment content in HTML or XML documents.
When parsing an HTML or XML document, Beautiful Soup represents the comment content as a Comment object. A comment is a special element in a document that is used to add notes, explanations, or temporarily delete part of the content. Comment objects can be automatically recognized and processed by the parser of the Beautiful Soup library.
Example of accessing and processing comment content in an HTML document:
```python
from bs4 import BeautifulSoup

# Create an HTML string containing a comment
html = "<html><body><!-- This is a comment --> <p>Hello, World!</p></body></html>"

# Parse the HTML document
soup = BeautifulSoup(html, 'html.parser')

# The comment is the first child of <body>, not a sibling of it
comment = soup.body.contents[0]

# Use the type() function to check the comment object's type
print(type(comment))  # Output: <class 'bs4.element.Comment'>

# Use the .string attribute to get the comment content
comment_content = comment.string
print(comment_content)  # Output: This is a comment
```
Search document tree
Beautiful Soup provides a variety of ways to find and locate elements in HTML documents.
Method selector
Use Beautiful Soup's find() or find_all() method to select elements by tag name, by attribute name and value, by text content, and so on.
The difference between the two:
find() returns the first element that matches the conditions
find_all() returns a list of all matching elements
For example, `soup.find('div')` returns the first div tag element, and `soup.find_all('a')` returns all a tag elements.
1. Find elements by tag or tag list
```python
# Find the first div tag element
soup.find('div')
# Find all a tag elements
soup.find_all('a')
# Find all li tags
soup.find_all('li')
# Find all a tags and b tags
soup.find_all(['a', 'b'])
```
2. Find elements through regular expressions
```python
import re

# Search for tags whose names start with "sp"
print(soup.find_all(re.compile("^sp")))
```
3. Find elements by attributes
```python
find = soup.find_all(attrs={"attribute_name": "value"})
print(find)
```

```python
# Find the first a tag element that has an href attribute
soup.find('a', href=True)
# Find all div tag elements that have a class attribute
soup.find_all('div', class_=True)
```
4. Find elements by text content
```python
# Find the first string whose text is exactly "Hello"
soup.find(text='Hello')
# Find all strings whose text is exactly "World"
soup.find_all(text="World")
```
5. Find elements through keyword parameters
```python
soup.find_all(id='id01')
```
6. Mix it up
```python
soup.find_all(
    'tag_name',
    attrs={"attribute_name": "value"},
    text="content"
)
```
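Putting the variants above together, here is a runnable sketch against the `<ul>` fragment from test.html (inlined for self-containment):

```python
import re
from bs4 import BeautifulSoup

html = """
<ul>
    <li class="class01"><span index="1">H1</span></li>
    <li class="class02"><span index="2" class="span2">H2</span></li>
    <li class="class03"><span index="3">H3</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name plus attribute filter
print(soup.find_all("li", attrs={"class": "class02"}))

# Regular expression: tags whose name starts with "sp"
print([t.name for t in soup.find_all(re.compile("^sp"))])  # ['span', 'span', 'span']

# Keyword argument filter on the index attribute
print(soup.find("span", index="2").get_text())  # H2

# Text filter (exact match)
print(soup.find_all(string="H3"))  # ['H3']
```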
CSS selector
Use Beautiful Soup's select() method to find elements with CSS selectors. CSS selectors are a powerful and flexible way to target elements based on tag names, class names, IDs, attributes, and combinations thereof.
1. Class selector
To find elements by class name, use the `.` symbol followed by the class name.

```python
soup.select('.className')
```
2. ID selector
To find an element by ID, use the `#` symbol followed by the ID.

```python
soup.select('#id')
```
3. Tag selector
To find elements by tag name, use the tag name directly.

```python
soup.select('p')
```
4. Attribute selector
To find elements with a specific attribute and value, use the `[attribute="value"]` format.

```python
soup.select('[attribute="value"]')
soup.select('[href="example.com"]')
soup.select('a[href="http://baidu.com"]')
```
5. Combination selector
Multiple selectors can be combined for more precise searches
```python
# Return all a tag elements inside div elements with the class class01
soup.select('div.class01 a')
# Return all a tag elements inside any div element
soup.select('div a')
```
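A runnable sketch combining the selector types above on a small made-up fragment:

```python
from bs4 import BeautifulSoup

html = """
<div class="class01">
    <ul>
        <li id="id01"><a href="http://example.com">link</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select(".class01")[0].name)                   # div  (class selector)
print(soup.select("#id01")[0].name)                      # li   (ID selector)
print(len(soup.select('a[href="http://example.com"]')))  # 1    (attribute selector)
print(soup.select("div.class01 a")[0].get_text())        # link (combination)
```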
Associated selection
In the process of searching for elements, sometimes you cannot reach the desired node in one step. Instead, you select a node first and then, using it as a starting point, select its child nodes, parent node, sibling nodes, and so on.
In Beautiful Soup, you can perform associated selection by using CSS selector syntax. You need to use the select() method of Beautiful Soup, which allows you to use CSS selectors to select elements.
1. Descendant selector (space): selects all descendant elements under the specified element.

```css
div a         /* Select all a elements under a div element */
.container p  /* Select all p elements under elements with the class container */
```
2. Child selector (>): selects elements that are direct children of the specified element.

```css
div > a        /* Select a elements that are direct children of a div element */
.container > p /* Select p elements that are direct children of elements with the class container */
```
3. Adjacent sibling selector (+): selects the sibling element immediately following the specified element.

```css
h1 + p         /* Select the p element immediately following an h1 element */
.container + p /* Select the p element immediately following an element with the class container */
```
4. General sibling selector (~): selects all following siblings of the specified element.

```css
h1 ~ p         /* Select all p siblings that follow an h1 element */
.container ~ p /* Select all p siblings that follow an element with the class container */
```
5. Mixed selectors: multiple selectors of different types can be combined to select specific elements.

```css
div, p      /* Select all div elements and all p elements */
.cls1.cls2  /* Select elements that have both the class cls1 and the class cls2 */
```
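These combinators also work through Beautiful Soup's select() method (CSS selector support is provided by the soupsieve package in Beautiful Soup 4.7+); a small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<div>
    <h1>Title</h1>
    <p>first</p>
    <p>second</p>
    <span>not a p</span>
    <p>third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print([p.get_text() for p in soup.select("div > p")])  # ['first', 'second', 'third']
print([p.get_text() for p in soup.select("h1 + p")])   # ['first']
print([p.get_text() for p in soup.select("h1 ~ p")])   # ['first', 'second', 'third']
```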
Traverse the document tree
In Beautiful Soup, traversing the document tree is a common way to access and process the individual nodes of an HTML or XML document.
Child nodes and parent nodes of tags
contents: gets all child nodes of a Tag and returns them as a list

```python
print(tag.contents)
# Use a list index to get a single element
print(tag.contents[1])
```
children: gets all child nodes of a Tag and returns a generator

```python
for child in tag.children:
    print(child)
```
The parent attribute gets the parent node of the tag:

```python
parent_tag = tag.parent
```
Sibling nodes of the tag
You can use the .next_sibling and .previous_sibling attributes to get the next or previous sibling node of a tag.

```python
next_sibling = tag.next_sibling
previous_sibling = tag.previous_sibling
```
Recursively traverse the document tree
You can use the .find() and .find_all() methods to recursively search for matching tags in the document tree.
You can use the .descendants generator iterator to traverse all descendant nodes of the document tree.
```python
# Recursively search the document tree for matching tags
for tag in soup.find_all('a'):
    print(tag)

# Traverse all descendant nodes of the document tree
for descendant in soup.descendants:
    print(descendant)
```
Traverse tag attributes
You can use the .attrs attribute to get all attributes of a tag and iterate over them.
```python
for attr in tag.attrs:
    print(attr)
```
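A small sketch of iterating .attrs; note that multi-valued attributes such as class come back as lists (made-up markup below):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1" class="external" href="http://example.com">x</a>',
    "html.parser",
)
tag = soup.a

# .attrs is a dict, so both names and values are available
for name, value in tag.attrs.items():
    print(name, "=", value)
# id = link1
# class = ['external']
# href = http://example.com
```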