Beautiful Soup4 data analysis and extraction


Overview

Beautiful Soup is a Python library for parsing HTML and XML documents. It provides convenient functions for extracting and manipulating data from web pages, such as tag names, text content, and attribute values.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

Parsing HTML with Beautiful Soup is straightforward: the API is user-friendly and multiple parsers are supported.

Documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Main features

Flexible parsing method:

Beautiful Soup supports multiple parsers, including the html.parser module in the Python standard library as well as the third-party libraries lxml and html5lib, so we can choose whichever parser best fits our needs.

Simple and intuitive API:

Beautiful Soup provides a concise and friendly API that makes parsing HTML documents very easy. We can use concise methods to select specific tags, get the text content in tags, extract attribute values, etc.

Powerful document traversal capabilities:

Through Beautiful Soup, we can traverse the entire HTML document tree, access, modify or delete each node, and even quickly locate the required node through nested selectors.

Tolerance for broken HTML:

Beautiful Soup can handle broken HTML documents, such as automatically correcting unclosed tags, automatically adding missing tags, etc., making data extraction more stable and reliable.
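As a minimal sketch of this tolerance, using the standard library's html.parser (other parsers may repair broken markup slightly differently):

```python
from bs4 import BeautifulSoup

# The <p> tag is never closed in the input...
broken_html = "<p>Hello"
soup = BeautifulSoup(broken_html, "html.parser")

# ...but Beautiful Soup still builds a complete tree and
# closes the tag when the document is serialized
print(soup)  # <p>Hello</p>
```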

Parser

Beautiful Soup relies on a parser when processing documents. In addition to the HTML parser in the Python standard library, it also supports several third-party parsers.

Supported parsers:

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; reasonable fault tolerance | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; strong fault tolerance | Requires installing the C-based lxml library |
| lxml XML parser | BeautifulSoup(markup, ["lxml-xml"]) or BeautifulSoup(markup, "xml") | The only parser that supports XML; very fast | Requires installing the C-based lxml library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Very slow; external Python dependency |

It can be seen that the lxml parser can parse both HTML and XML documents, and is fast and fault-tolerant, so it is the recommended choice. To use lxml, pass 'lxml' as the second parameter when initializing BeautifulSoup.

Basic use of Beautiful Soup4

Installing the libraries

pip install beautifulsoup4

pip install lxml

Create HTML file

By passing a document into the constructor of BeautifulSoup, you can get a document object. You can pass in a string or a file.

Create a test.html file here to create a document object.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
    <ul>
         <li class="class01"><span index="1">H1</span></li>
         <li class="class02"><span index="2" class="span2">H2</span></li>
         <li class="class03"><span index="3">H3</span></li>
     </ul>
 </div>
</body>
</html>

Basic usage

# Import the module
from bs4 import BeautifulSoup

# Create a BeautifulSoup object; there are 2 ways to create it
# 1. From a string; the second parameter specifies the parser
# soup = BeautifulSoup("<html>...</html>", 'lxml')
# 2. From a file object
soup = BeautifulSoup(open('test.html'), 'lxml')
# Pretty-print the parsed document
# print(soup.prettify())

# Access a tag by name; by default the first matching element is returned
print(soup.li)

# Use .contents or .children to get child elements
# Returns a list
print(soup.ul.contents)
# Returns an iterator
print(soup.li.children)

# Get an element's text content
print(soup.title.get_text())

# Get an attribute value; by default from the first matching element
print(soup.li.get('class'))

The output is as follows:

<li class="class01"><span index="1">H1</span></li>

['\n', <li class="class01"><span index="1">H1</span></li>, '\n', <li class="class02"><span class="span2" index="2">H2</span></li>, '\n', <li class="class03"><span index="3">H3</span></li>, '\n']

<list_iterator object at 0x000001C18E475F10>

Title

['class01']

Object types of Beautiful Soup4

Beautiful Soup converts complex HTML documents into a complex tree structure. Each node is a Python object. All objects can be summarized into 4 types: Tag, NavigableString, BeautifulSoup, Comment

Tag object

In Beautiful Soup, Tag objects represent tag elements in HTML or XML documents; a Tag object corresponds directly to a tag in the original document. A Tag object carries the tag's name, attributes, and content, and provides a variety of methods to read, modify, and operate on that information.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Common properties and methods of Tag objects


| Attribute / Method | Description | Example |
| --- | --- | --- |
| name | The tag's name | tag_name = tag.name |
| string | The text content inside the tag | tag_text = tag.string |
| attrs | The tag's attributes, as a dictionary | tag_attrs = tag.attrs |
| get() | Get an attribute value by attribute name | attr_value = tag.get('attribute_name') |
| find() | Find and return the first matching child tag | child_tag = tag.find('tag_name') |
| find_all() | Find and return all matching child tags, as a list | child_tags = tag.find_all('tag_name') |
| parent | The tag's parent tag | parent_tag = tag.parent |
| parents | All ancestor tags of the current tag, as a generator | for parent in tag.parents: print(parent) |
| children | The tag's direct children, as a generator | for child in tag.children: print(child) |
| next_sibling | The tag's next sibling | next_sibling_tag = tag.next_sibling |
| previous_sibling | The tag's previous sibling | previous_sibling_tag = tag.previous_sibling |
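A short sketch exercising several of these members, reusing the <ul> markup from the test.html example above (inlined here so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

# Inline copy of the list from test.html
html = '''
<ul>
    <li class="class01"><span index="1">H1</span></li>
    <li class="class02"><span index="2" class="span2">H2</span></li>
    <li class="class03"><span index="3">H3</span></li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("li")           # first <li>
print(tag.name)                 # li
print(tag.attrs)                # {'class': ['class01']}
print(tag.get("class"))         # ['class01']
print(tag.find("span").string)  # H1
print(tag.parent.name)          # ul
print(len(soup.find_all("li"))) # 3
```

Note that class is a multi-valued attribute in HTML, so tag.get("class") returns a list rather than a plain string.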

NavigableString object

NavigableString object is a data type in the Beautiful Soup library, used to represent plain text content in HTML or XML documents. It inherits from Python’s basic string type, but has additional functions and features that make it suitable for processing textual content in documents.

Suppose we have the following HTML code snippet:

<p>This is a <b>beautiful</b> day.</p>

Use Beautiful Soup to parse it into a document object:

from bs4 import BeautifulSoup

html = '<p>This is a <b>beautiful</b> day.</p>'
soup = BeautifulSoup(html, 'html.parser')

Get the content of the <p> tag, which is actually a NavigableString object. Note that p_tag.string would return None here, because the <p> tag has more than one child, so we take the first child (the text node before <b>) directly:

p_tag = soup.find('p')
content = p_tag.contents[0]  # the text node before <b>
print(content) # Output: This is a
print(type(content)) # Output: <class 'bs4.element.NavigableString'>

You can also perform operations on NavigableString objects, such as removing whitespace characters and replacing text:

# Get the text content with surrounding whitespace removed
text = p_tag.contents[0].strip()
print(text) # Output: This is a

# Replace the text node (again using contents[0], since p_tag.string is None here)
p_tag.contents[0].replace_with('Hello')
print(p_tag) # Output: <p>Hello<b>beautiful</b> day.</p>

# After the replacement, the first child is the new text
print(p_tag.contents[0]) # Output: Hello


| Attribute / Method | Description | Example |
| --- | --- | --- |
| string | The text content of the NavigableString object | text = navigable_string.string |
| replace_with() | Replace the current NavigableString with another string or object | navigable_string.replace_with(new_string) |
| strip() | Remove whitespace from both ends of the string | stripped_text = navigable_string.strip() |
| parent | The parent node the NavigableString belongs to (usually a Tag object) | parent_tag = navigable_string.parent |
| next_sibling | The next sibling node | next_sibling = navigable_string.next_sibling |
| previous_sibling | The previous sibling node | previous_sibling = navigable_string.previous_sibling |

BeautifulSoup object

The BeautifulSoup object is the core object of the Beautiful Soup library and is used to parse and traverse HTML or XML documents.

Commonly used methods:

find(name, attrs, recursive, string, **kwargs): Find the first matching tag based on the specified tag name, attributes, text content, etc.

find_all(name, attrs, recursive, string, limit, **kwargs): Find all matching tags based on the specified tag name, attributes, text content, etc., and return a list

select(css_selector): Use CSS selector syntax to find matching tags and return a list

prettify(): Return the entire document as a nicely formatted string, with tags, text, and indentation

has_attr(name): Check whether the current tag has the specified attribute name and return a Boolean value

get_text(): Get the text content of the current tag and all its child tags, returned as a string

Common properties:

soup.title: Get the first <title> tag in the document

soup.head: Get the <head> tag in the document

soup.body: Get the <body> tag in the document

soup.find_all('tag'): Get all matching <tag> tags in the document and return a list

soup.text: Get the plain text content in the entire document (remove tags, etc.)
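A brief sketch of these methods and properties against a small made-up document:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body><p id="p1">First</p><p class="note">Second</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                           # <title>Demo</title>
print(soup.find("p", id="p1").get_text())   # First
print(len(soup.find_all("p")))              # 2
print(soup.select("p.note")[0].get_text())  # Second
```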

Comment object

Comment object is a special type of object in the Beautiful Soup library, used to represent comment content in HTML or XML documents.

When parsing an HTML or XML document, Beautiful Soup represents the comment content as a Comment object. A comment is a special element in a document that is used to add notes, explanations, or temporarily delete part of the content. Comment objects can be automatically recognized and processed by the parser of the Beautiful Soup library.

Example of accessing and processing comment content in an HTML document:

from bs4 import BeautifulSoup

# Create an HTML string containing comments
html = "<html><body><!-- This is a comment --> <p>Hello, World!</p></body></html>"

# Parse HTML document
soup = BeautifulSoup(html, 'html.parser')

# Get the comment node: it is the first child of <body>
comment = soup.body.contents[0]
print(type(comment)) # Output <class 'bs4.element.Comment'>

# Use the `.string` attribute to get the comment content
comment_content = comment.string
print(comment_content) # Output This is a comment

Search document tree

Beautiful Soup provides a variety of ways to find and locate elements in HTML documents.

Method selector

Use Beautiful Soup's find() or find_all() methods to select elements by tag name, by attribute name and value, by text content, or by combinations of these.

The difference between the two:

find returns the first element that meets the conditions

find_all returns a list of all elements that meet the criteria

For example: soup.find('div') will return the first div tag element, soup.find_all('a') will return all a tag elements.

1. Find elements by tag or tag list

# Find the element of the first div tag
soup.find('div')

# Find all a tag elements
soup.find_all('a')

# Find all li tags
soup.find_all('li')

# Find all a tags and b tags
soup.find_all(['a','b'])

2. Find elements through regular expressions

# Find tags whose names start with "sp" (matches <span> here)
import re
print(soup.find_all(re.compile("^sp")))

3. Find elements by attributes

find = soup.find_all(
    attrs={
        "attribute_name": "value"
    }
)
print(find)

# Find the first a tag element with href attribute
soup.find('a', href=True)

# Find all div tag elements with class attributes
soup.find_all('div', class_=True)

4. Find elements by text content

Note that these searches match text nodes exactly and return the matching NavigableString objects, not their enclosing tags. In newer versions of Beautiful Soup the parameter is named string rather than text.

# Find the first text node whose content is "Hello"
soup.find(string='Hello')

# Find all text nodes whose content is "World"
soup.find_all(string="World")
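A runnable sketch (markup invented for illustration) showing that the match is the text node itself, from which the enclosing tag can then be reached via .parent:

```python
from bs4 import BeautifulSoup

html = "<p>Hello</p><p>World</p><p>World</p>"
soup = BeautifulSoup(html, "html.parser")

# The result is the matching string, not its enclosing tag
result = soup.find(string="Hello")
print(result)              # Hello
print(result.parent.name)  # p

print(soup.find_all(string="World"))  # ['World', 'World']
```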

5. Find elements through keyword parameters

soup.find_all(id='id01')

6. Combining conditions

soup.find_all(
    'tag_name',
    attrs={
        "attribute_name": "value"
    },
    string="content"
)
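For instance, combining all three conditions against a small made-up list (using string=, the newer name for the text parameter):

```python
from bs4 import BeautifulSoup

html = '''
<ul>
    <li class="item">one</li>
    <li class="item">two</li>
    <li class="other">two</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Tag name + attributes + text combined in one call:
# only the <li> that satisfies all three conditions matches
match = soup.find_all("li", attrs={"class": "item"}, string="two")
print(match)  # [<li class="item">two</li>]
```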

CSS selector

Use Beautiful Soup's select() method to find elements through CSS selectors. CSS selectors are a powerful and flexible way to target elements based on tag names, class names, IDs, attributes, and combinations thereof.

1. Class Selector

To find elements by class name, prefix the class name with the . symbol.

soup.select('.className')

2. ID selector

To find an element by ID, use the # symbol plus the ID to find an element.

soup.select('#id')

3. Tag selector

To find elements by tag name, use the tag name directly.

soup.select('p')

4. Attribute selector

Find elements by attribute: use the [attribute="value"] format to find elements with a specific attribute and value.

soup.select('[attribute="value"]')

soup.select('[href="example.com"]')

soup.select('a[href="http://baidu.com"]')

5. Combination selector

Multiple selectors can be combined for more precise searches

# Return all a tag elements within the div element with a specific class name
soup.select('div.class01 a')

soup.select('div a')
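The two selectors above can be tried against a small snippet (div.class01 is made-up markup for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="class01"><a href="#1">inside</a></div>
<div><a href="#2">plain</a></div>
'''
soup = BeautifulSoup(html, "html.parser")

# Only the link inside the div with class "class01"
print(soup.select("div.class01 a"))  # [<a href="#1">inside</a>]
# Every link inside any div
print(len(soup.select("div a")))     # 2
```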


Associated selection

In the process of element search, sometimes you cannot get the desired node element in one step. You need to select a certain node element, and then use this node as the basis to select its child nodes, parent nodes, sibling nodes, etc.

In Beautiful Soup, you can perform associated selection by using CSS selector syntax. You need to use the select() method of Beautiful Soup, which allows you to use CSS selectors to select elements.

1. Descendant selector (space): You can select all descendant elements under the specified element.

div a /* Select all a elements under the div element */

.container p /* Select all p elements under the element named container */

2. Direct descendant selector (>): You can select the direct descendant elements of the specified element.

div > a /* Select the a element that is a direct descendant of the div element */

.container > p /* Select p elements that are direct descendants of the element named container */

3. Adjacent sibling selector (+): You can select the next sibling element immediately adjacent to the specified element.

h1 + p /* Select the sibling p element immediately after the h1 element */

.container + p /* Select the sibling p element immediately following the element named container */

4. Universal sibling selector (~): can select all subsequent elements at the same level as the specified element.

h1 ~ p /* Select all p elements at the same level as the h1 element */

.container ~ p /* Select all p elements at the same level as the element named container */

5. Mixed selectors: You can combine multiple selectors of different types to select specific elements

div, p /* Select all div elements and all p elements */

.cls1.cls2 /* Select elements that have both class cls1 and class cls2 */
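These CSS rules can be exercised from Python through select(); the markup below is a made-up snippet for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <h1>Title</h1>
    <p>first</p>
    <p>second</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Direct descendant selector
print(len(soup.select("div > p")))          # 2
# Adjacent sibling: only the <p> immediately after <h1>
print(soup.select("h1 + p")[0].get_text())  # first
# Universal sibling: every later <p> at the same level as <h1>
print(len(soup.select("h1 ~ p")))           # 2
```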

Traverse the document tree

In Beautiful Soup, traversing the document tree is a common operation for accessing and processing individual nodes of an HTML or XML document

Child nodes and parent nodes of tags

contents: Get all child nodes of Tag and return a list

print(tag.contents)
# Use a list index to get a single element
print(tag.contents[1])

children: Get all child nodes of Tag and return a generator

for child in tag.children:
    print(child)

The parent attribute gets the parent node of the tag

parent_tag = tag.parent

Sibling nodes of the tag

You can use the .next_sibling and .previous_sibling properties to get the next or previous sibling node of a tag.

next_sibling = tag.next_sibling
previous_sibling = tag.previous_sibling
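One caveat worth knowing: whitespace between tags is itself a text node, so .next_sibling often returns a newline rather than the next tag; the find_next_sibling() method skips over text nodes. A small sketch:

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li>\n<li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")
# The literal newline between the tags is itself a sibling node
print(repr(first.next_sibling))              # '\n'
# find_next_sibling() skips text nodes and returns the next tag
print(first.find_next_sibling("li").string)  # two
```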

Recursively traverse the document tree

You can use the .find() and .find_all() methods to recursively search for matching tags in the document tree.

You can use the .descendants generator iterator to traverse all descendant nodes of the document tree.

for tag in soup.find_all('a'):
    print(tag)

# Iterate over every descendant node of a tag
for descendant in tag.descendants:
    print(descendant)

Traverse tag attributes

You can use the .attrs attribute to get all attributes of a tag and iterate over them.

for attr in tag.attrs:
    print(attr)
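As a minimal sketch (the <a> tag here is made up), .attrs is a plain dict, so the usual dict iteration tools apply:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/home" id="link1" class="nav">Home</a>',
                     "html.parser")
tag = soup.a

# .attrs is a plain dict, so normal dict iteration works
for name, value in tag.attrs.items():
    print(name, value)
# href /home
# id link1
# class ['nav']
```

Note again that multi-valued attributes like class come back as lists.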

