Python10-Use the urllib module to process URLs

  • 1.urllib library description
  • 2.urllib.request
    • 2.1urlopen
    • 2.2urlretrieve
    • 2.3Request
    • 2.4 Example
  • 3.urllib.parse
    • 3.1urlparse
    • 3.2urlunparse
    • 3.3urlencode
    • 3.4quote
    • 3.5unquote
    • 3.6 Example

1.urllib library description

urllib is a package in the Python standard library that provides functions for processing URLs (Uniform Resource Locators). It contains several submodules: urllib.request (opening and reading URLs), urllib.parse (parsing URLs), urllib.error (the exceptions raised by urllib.request), and urllib.robotparser (parsing robots.txt files).
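Of the four submodules, urllib.robotparser is the only one not covered later in this article, so here is a minimal sketch of it. The robots.txt rules and the example.com URLs are made up for illustration, and the rules are passed in as a list of lines so that the snippet runs without network access.

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied inline instead of being downloaded.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(robots_lines)  # parse() accepts an iterable of lines

# can_fetch(useragent, url) reports whether the rules allow the fetch.
allowed_public = parser.can_fetch("*", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("*", "https://www.example.com/private/data.html")

print(allowed_public)   # Output: True
print(allowed_private)  # Output: False
```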

2.urllib.request

This submodule provides functionality for opening and reading URLs. Use urlopen() to open a URL and read its content, urlretrieve() to download a file, and Request to construct an HTTP request object that is then sent with urlopen().

2.1urlopen

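urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
  • Opens the specified URL and returns a file-like response object whose contents can be read using the read() method.
  • Parameters:
    • url: URL to open. Can be a string or a Request object.
    • data: Optional parameter, the data to be sent to the URL. It must be bytes (a file-like object or an iterable of bytes also works); a str has to be encoded first. An HTTP request that carries data is sent as POST by default.
    • timeout: Optional parameter, the timeout in seconds for blocking operations such as the connection attempt.
    • cafile: Optional parameter, specifying the path of the CA certificate file.
    • capath: Optional parameter, specifying the path to the CA certificate directory.
    • cadefault: Optional parameter, specifying whether to use the default CA certificate. It is ignored.
    • context: Optional parameter, an ssl.SSLContext instance describing the SSL options.
    • Note: cafile, capath and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13. Use context instead.
  • Return value: Returns a response object, a file-like object, whose contents can be read using the read() method. For HTTP and HTTPS URLs it is an http.client.HTTPResponse.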
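The following is a minimal sketch of calling urlopen() with a timeout and reading the response as a context manager. To keep it runnable without a network connection it opens a data: URL (an assumption made for this example; an http or https URL is used the same way), and the variable names are illustrative.

```python
import urllib.request

# A data: URL carries its own payload, so no network access is needed.
url = 'data:text/plain;charset=utf-8,Hello%20urllib'

# The response is a file-like object and can be used in a with statement.
with urllib.request.urlopen(url, timeout=5) as response:
    body = response.read()                                # raw bytes
    content_type = response.headers.get_content_type()
    charset = response.headers.get_content_charset()

# read() returns bytes; decode them using the charset the response declares.
text = body.decode(charset or 'utf-8')

print(content_type)  # Output: text/plain
print(body)          # Output: b'Hello urllib'
print(text)          # Output: Hello urllib
```

Decoding with the charset declared by the response, instead of hard-coding 'utf-8', avoids garbled text when a server uses another encoding.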

2.2urlretrieve

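urlretrieve(url, filename=None, reporthook=None, data=None)
  • Downloads the content of the specified URL and saves it to a local file.
  • Parameters:
    • url: URL to download.
    • filename: Optional parameter, the path of the file to save to. If not provided, the content is saved to a temporary file whose name is generated automatically.
    • reporthook: Optional parameter, a callback function that can be used to display download progress. It is called once when the connection is established and once after each block is read, with three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the file.
    • data: Optional parameter, the data to be sent to the URL. It must be bytes.
  • Return value: A tuple (filename, headers) containing the local file name and the server response headers.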
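As a rough illustration of filename and reporthook, the sketch below copies a local file through a file: URL so that it does not depend on the network. The file names and the report() helper are hypothetical; with an http URL the call looks the same.

```python
import tempfile
import urllib.request
from pathlib import Path

progress = []

def report(block_num, block_size, total_size):
    """Hypothetical progress callback: record each call made by urlretrieve."""
    progress.append((block_num, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    # Create a small source file and expose it as a file: URL.
    source = Path(tmp) / 'source.txt'
    source.write_bytes(b'sample payload')
    target = Path(tmp) / 'copy.txt'

    saved_name, headers = urllib.request.urlretrieve(
        source.as_uri(), filename=str(target), reporthook=report
    )
    copied = target.read_bytes()

print(saved_name == str(target))  # Output: True
print(copied)                     # Output: b'sample payload'
print(progress[0][0])             # Output: 0 (first call, before any block is read)
print(progress[-1][2])            # Output: 14 (total size in bytes)
```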

2.3Request

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
  • Constructs an HTTP request object that holds the URL, data, and request headers, and that can be passed to the urlopen() method.
  • Parameters:
    • url: URL to request.
    • data: Optional parameter, the data to be sent to the URL. It must be bytes (a file-like object or an iterable of bytes also works); a str has to be encoded first.
    • headers: Optional parameter, dictionary of request headers to be sent.
    • origin_req_host: Optional parameter, the host name of the original request, used for cookie handling.
    • unverifiable: Optional parameter indicating whether the request is unverifiable, that is, whether the user had no chance to approve it. The default is False.
    • method: Optional parameter, specifies the request method, such as GET, POST, etc. If omitted, the method is POST when data is given and GET otherwise.
  • Return value: a Request object that can be passed to the urlopen() method.
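A short sketch of how the constructor arguments show up on the Request object. Nothing is sent here; the example.com URLs and form fields are placeholders chosen for illustration.

```python
import urllib.request
from urllib.parse import urlencode

# data must be bytes, so the encoded query string is converted with encode().
payload = urlencode({'key1': 'value1', 'key2': 'value2'}).encode('ascii')

req = urllib.request.Request(
    'https://www.example.com/submit',        # placeholder URL
    data=payload,
    headers={'User-Agent': 'Mozilla/5.0'},
)

method = req.get_method()                    # POST, because data was supplied
user_agent = req.get_header('User-agent')    # header names are stored capitalized

# An explicit method overrides the default.
put_req = urllib.request.Request(
    'https://www.example.com/item', data=payload, method='PUT'
)

print(method)                # Output: POST
print(req.data)              # Output: b'key1=value1&key2=value2'
print(user_agent)            # Output: Mozilla/5.0
print(put_req.get_method())  # Output: PUT
```

Note that Request stores header names in capitalized form, which is why the lookup uses 'User-agent' instead of 'User-Agent'.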

2.4 Example

import urllib.request

f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200))

f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200).decode())

f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200).decode('utf-8'))
'''
b'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="'

<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="

<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="
'''

import urllib.request

# Create a Request object
url = 'http://www.baidu.com'
req = urllib.request.Request(url)

# Set request headers
req.add_header('User-Agent', 'Mozilla/5.0')

# Optional: Set request method
req.method = 'POST'

# Optional: Set request data (must be bytes)
data = b'key1=value1&key2=value2'
req.data = data

# Send request and get response
response = urllib.request.urlopen(req)

# Read the response content
content = response.read()

# Print response content
print(content)

urllib.request.Request constructs an HTTP request object that carries the request URL, data, request headers, and request method. Passing the Request object to urlopen() sends the HTTP request and returns a response object. In the example, the content of the response is read with the read() method and printed.

3.urllib.parse

urllib.parse provides functions for parsing URLs, building URLs, and processing query strings.

3.1urlparse

urlparse(urlstring, scheme='', allow_fragments=True)
  • Parses a URL string and returns a named tuple containing the result. The parts can be accessed by index or through attributes such as scheme (protocol), netloc (host and port), path, params, query, and fragment.
  • Parameters:
    • urlstring: URL string to parse.
    • scheme: Optional parameter, the default protocol. It is used only if urlstring does not contain a protocol part.
    • allow_fragments: Optional parameter indicating whether to parse fragment identifiers in URLs. If False, the fragment is left as part of the preceding component and the fragment attribute is an empty string.
  • Return value: A named tuple (ParseResult) containing the six parsed URL parts.
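To make the parameters concrete, here is a small sketch that looks at every part of a parsed URL and at the effect of scheme and allow_fragments. The URLs are invented for the example.

```python
from urllib.parse import urlparse

result = urlparse('https://user@www.example.com:8443/path/to/page;v=1?x=1&y=2#top')

print(result.scheme)    # Output: https
print(result.netloc)    # Output: user@www.example.com:8443
print(result.hostname)  # Output: www.example.com
print(result.port)      # Output: 8443
print(result.path)      # Output: /path/to/page
print(result.params)    # Output: v=1
print(result.query)     # Output: x=1&y=2
print(result.fragment)  # Output: top

# The scheme argument is only a fallback for URLs that do not name a protocol.
relative = urlparse('//www.example.com/index.html', scheme='http')
print(relative.scheme)  # Output: http
print(relative.netloc)  # Output: www.example.com

# With allow_fragments=False the '#top' part stays in the path.
no_frag = urlparse('https://www.example.com/page#top', allow_fragments=False)
print(no_frag.path)            # Output: /page#top
print(repr(no_frag.fragment))  # Output: ''
```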

3.2urlunparse

urlunparse(parts)
  • Reassembles a tuple containing the parts of a URL into a URL string.
  • Parameters: A tuple containing the parts of the URL in the order (scheme, netloc, path, params, query, fragment).
  • Return value: the reassembled URL string.
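Because urlparse() returns a named tuple, its result can be handed straight back to urlunparse(). The sketch below, using a made-up URL, changes the query and drops the fragment with the named tuple's _replace() method before reassembling.

```python
from urllib.parse import urlparse, urlunparse

original = 'https://www.example.com/path/to/page?param1=value1#fragment'
parts = urlparse(original)

# _replace returns a modified copy; the original tuple is unchanged.
updated = parts._replace(query='param1=value1&param2=value2', fragment='')
rebuilt = urlunparse(updated)

print(rebuilt)
# Output: https://www.example.com/path/to/page?param1=value1&param2=value2

# Parsing and reassembling without changes gives back the same URL here.
round_trip = urlunparse(parts)
print(round_trip == original)  # Output: True
```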

3.3urlencode

urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
  • Converts a dictionary or a sequence of two-element tuples into a URL-encoded query string.
  • Parameters:
    • query: The query parameters to be encoded, given as a dictionary or a sequence of (key, value) tuples.
    • doseq: Optional parameter. If True, a value that is itself a sequence is expanded into one key=value pair per element. If False, the value is converted to a string as a whole.
    • safe: Optional parameter, specifying characters that do not need to be encoded.
    • encoding: Optional parameter, specifies the encoding method.
    • errors: Optional parameter, specifying how to handle encoding errors.
    • quote_via: Optional parameter, specifying the quoting function. The default is quote_plus, which encodes spaces as '+'.
  • Return value: URL-encoded query string.
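The effect of doseq, quote_via, and safe is easiest to see side by side. The following sketch uses invented keys and values.

```python
from urllib.parse import urlencode, quote

# Default: quote_plus is used, so a space becomes '+'.
basic = urlencode({'q': 'hello world', 'lang': 'en'})
print(basic)     # Output: q=hello+world&lang=en

# A list of tuples allows the same key to appear more than once.
repeated = urlencode([('tag', 'a'), ('tag', 'b')])
print(repeated)  # Output: tag=a&tag=b

# doseq=True expands a sequence value into one pair per element.
expanded = urlencode({'tag': ['a', 'b']}, doseq=True)
print(expanded)  # Output: tag=a&tag=b

# quote_via=quote encodes a space as %20 instead of '+'.
percent = urlencode({'q': 'hello world'}, quote_via=quote)
print(percent)   # Output: q=hello%20world

# safe lists characters that are left unencoded.
kept = urlencode({'path': '/a/b'}, safe='/')
print(kept)      # Output: path=/a/b
```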

3.4quote

quote(string, safe='/', encoding=None, errors=None)
  • Percent-encodes special characters in a URL component. Letters, digits, and the characters _.-~ are never encoded.
  • Parameters:
    • string: The string to be encoded.
    • safe: Optional parameter, specifying additional characters that do not need to be encoded. The default is '/'.
    • encoding: Optional parameter, specifies the encoding method. The default is UTF-8.
    • errors: Optional parameter, specifying how to handle encoding errors.
  • Return value: encoded string.
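A brief sketch of how the safe parameter changes the result of quote(). The sample strings are arbitrary.

```python
from urllib.parse import quote

# By default '/' is considered safe, which suits path components.
default_quoted = quote('/docs/my file.txt')
print(default_quoted)  # Output: /docs/my%20file.txt

# safe='' encodes the slashes too, e.g. for a value placed inside a query string.
fully_quoted = quote('/docs/my file.txt', safe='')
print(fully_quoted)    # Output: %2Fdocs%2Fmy%20file.txt

# Non-ASCII characters are encoded as their UTF-8 bytes.
non_ascii = quote('café')
print(non_ascii)       # Output: caf%C3%A9

# Query delimiters are encoded as well.
delimiters = quote('a&b=c')
print(delimiters)      # Output: a%26b%3Dc
```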

3.5unquote

unquote(string, encoding='utf-8', errors='replace')
  • Decode a URL-encoded string.
  • Parameters:
    • string: The string to be decoded.
    • encoding: Optional parameter, specifying the decoding method.
    • errors: Optional parameter, specifying the decoding error handling method.
  • Return value: decoded string.
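The sketch below illustrates the encoding and errors parameters of unquote(). It also uses unquote_plus(), a related function in urllib.parse that is not described above and that additionally turns '+' into a space. The sample strings are arbitrary.

```python
from urllib.parse import unquote, unquote_plus

# UTF-8 is the default encoding.
utf8_text = unquote('caf%C3%A9')
print(utf8_text)    # Output: café

# %E9 is 'é' in Latin-1, so the encoding has to be given explicitly.
latin1_text = unquote('caf%E9', encoding='latin-1')
print(latin1_text)  # Output: café

# With the default errors='replace', an invalid UTF-8 byte becomes U+FFFD.
replaced = unquote('caf%E9')
print(replaced == 'caf\ufffd')  # Output: True

# unquote leaves '+' alone; unquote_plus converts it to a space.
plus_kept = unquote('hello+world')
plus_decoded = unquote_plus('hello+world')
print(plus_kept)     # Output: hello+world
print(plus_decoded)  # Output: hello world
```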

3.6 Example

Parse URL:

from urllib.parse import urlparse

url = 'https://www.example.com/path/to/page?param1=value1&param2=value2#fragment'

# Parse URL
parsed_url = urlparse(url)

print(parsed_url.scheme) # Output: https
print(parsed_url.netloc) # Output: www.example.com
print(parsed_url.path) # Output: /path/to/page

Build URL:

from urllib.parse import urlunparse

parts = ('https', 'www.example.com', '/path/to/page', '', 'param1=value1&param2=value2', 'fragment')

# Build URL
url = urlunparse(parts)

print(url) # Output: https://www.example.com/path/to/page?param1=value1&param2=value2#fragment

Encode a query string:

from urllib.parse import urlencode

params = {
    'param1': 'value1',
    'param2': 'value2'
}

# Encode query string
encoded_params = urlencode(params)

print(encoded_params) # Output: param1=value1&param2=value2

Decode the query string:

from urllib.parse import parse_qs

query_string = 'param1=value1&param2=value2'

# Decode query string
decoded_params = parse_qs(query_string)

print(decoded_params) # Output: {'param1': ['value1'], 'param2': ['value2']}

URL encoding/decoding:

from urllib.parse import quote, unquote

string = 'Hello World!'

# URL encoding
encoded_string = quote(string)

print(encoded_string) # Output: Hello%20World%21

# URL decoding
decoded_string = unquote(encoded_string)

print(decoded_string) # Output: Hello World!