Python10-Use the urllib module to process URLs
- 1. urllib library description
- 2. urllib.request
  - 2.1 urlopen
  - 2.2 urlretrieve
  - 2.3 Request
  - 2.4 Example
- 3. urllib.parse
  - 3.1 urlparse
  - 3.2 urlunparse
  - 3.3 urlencode
  - 3.4 quote
  - 3.5 unquote
  - 3.6 Example
1. urllib library description
urllib is a package in the Python standard library that provides functions for working with URLs (Uniform Resource Locators). It contains several submodules: urllib.request (opening and reading URLs), urllib.parse (parsing URLs), urllib.error (the exceptions raised by urllib.request), and urllib.robotparser (parsing robots.txt files).
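As a small illustration of how urllib.error relates to urllib.request, the sketch below asks urlopen() for a file:// URL that is guaranteed not to exist and catches the resulting URLError. The file name missing.txt and the variable names are assumptions made for this example; a file:// URL is used only so that the sketch runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp_dir:
    # as_uri() turns an absolute path into a file:// URL.
    # "missing.txt" is a hypothetical name; the file is never created.
    missing_url = (Path(tmp_dir) / "missing.txt").as_uri()
    try:
        urlopen(missing_url)
        outcome = "opened"
    except URLError as exc:
        # exc.reason carries the underlying cause of the failure
        outcome = "URLError"
        reason = exc.reason

print(outcome)
```

For HTTP requests the same pattern applies; urllib.error.HTTPError is a subclass of URLError, so it should be caught first when the status code matters.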
2. urllib.request
This submodule provides functionality for opening and reading URLs. Use urlopen() to open a URL and read its content, urlretrieve() to download a file, and Request to construct an HTTP request object that is then sent by passing it to urlopen().
2.1 urlopen
urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
- Opens the specified URL and returns a file-like response object whose contents can be read with the read() method.
- Parameters:
  - url: the URL to open. Can be a string or a Request object.
  - data: optional; the data to send to the server. It must be a bytes object (or a file-like object or an iterable of bytes), not a str. For HTTP, supplying data turns the request into a POST.
  - timeout: optional; the timeout in seconds for blocking operations such as the connection attempt.
  - cafile: optional; the path to a file of CA certificates.
  - capath: optional; the path to a directory of CA certificates.
  - cadefault: optional; ignored.
  - context: optional; an ssl.SSLContext object describing the SSL options.
  - Note: cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.
- Return value: a file-like response object whose contents can be read with the read() method. For HTTP and HTTPS URLs this is an http.client.HTTPResponse object.
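The following is a minimal sketch of the open-read-decode pattern, using the response as a context manager so that it is closed automatically. To keep it runnable without network access it opens a data: URL, which embeds its own payload; that URL and the variable names are assumptions for the demonstration, and an http:// or https:// URL can be substituted.

```python
from urllib.request import urlopen

# A data: URL carries its content inline, so no network is needed (demo assumption).
url = "data:text/plain;charset=utf-8,Hello%20urllib"

with urlopen(url, timeout=10) as response:
    raw = response.read()  # read() always returns bytes
    # Use the charset declared in the headers, falling back to UTF-8
    charset = response.headers.get_content_charset() or "utf-8"
    text = raw.decode(charset)

print(raw)
print(text)
```

Reading the charset from the headers avoids hard-coding an encoding that the server may not actually use.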
2.2 urlretrieve
urlretrieve(url, filename=None, reporthook=None, data=None)
- Downloads the content of the specified URL and saves it to a local file.
- Parameters:
  - url: the URL to download.
  - filename: optional; the path of the file to save to. If it is not provided, the content is saved to a temporary file with a generated name.
  - reporthook: optional; a callback for reporting download progress. It is called once when the connection is established and once after each block is read, with three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the file (-1 if unknown).
  - data: optional; the data to send to the URL. It must be a bytes object.
- Return value: a tuple (filename, headers) containing the local file name and the server response headers.
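A short sketch of urlretrieve() with a progress callback is shown below. The data: URL, the file name hello.txt, and the report function are assumptions chosen so that the example runs offline; with a real download the callback would be invoked once per block.

```python
import os
import tempfile
from urllib.request import urlretrieve

progress = []

def report(block_num, block_size, total_size):
    # Called once before the transfer and once after each block is read
    progress.append((block_num, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "hello.txt")  # hypothetical file name
    saved_path, headers = urlretrieve(
        "data:text/plain;charset=utf-8,hello%20urllib",  # offline demo URL
        filename=target,
        reporthook=report,
    )
    with open(saved_path, "rb") as fh:
        saved_bytes = fh.read()

print(saved_bytes)
print(progress)
```

Because filename is given explicitly, the returned path is the same as target; without it, the function would return the name of a temporary file instead.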
2.3 Request
Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- Constructs an HTTP request object that carries the URL, data, and request headers and can be passed to urlopen().
- Parameters:
  - url: the URL to request.
  - data: optional; the data to send to the server. It must be a bytes object (or a file-like object or an iterable of bytes), not a str.
  - headers: optional; a dictionary of request headers to send.
  - origin_req_host: optional; the host name of the original request.
  - unverifiable: optional; indicates whether the request is unverifiable, that is, one the user had no option to approve. Defaults to False.
  - method: optional; the request method, such as GET or POST.
- Return value: a Request object that can be passed to urlopen().
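A Request object can be built and inspected without sending anything, which makes it easy to see how data and headers affect it. The sketch below is illustrative only: the URL http://www.example.com/submit and the form fields are placeholders, not part of the original text.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form data must be bytes, so the encoded query string is converted explicitly
form = urlencode({"key1": "value1", "key2": "value2"}).encode("ascii")

req = Request(
    "http://www.example.com/submit",  # placeholder URL
    data=form,
    headers={"User-Agent": "Mozilla/5.0"},
)
plain_req = Request("http://www.example.com/submit")

print(req.get_method())              # POST, because data was supplied
print(plain_req.get_method())        # GET, because there is no data
print(req.get_header("User-agent"))  # header names are stored capitalized
print(req.data)
```

Passing method explicitly (for example method='PUT') overrides the default chosen from the presence of data.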
2.4 Example
```python
import urllib.request

# read() returns bytes
f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200))

# decode() converts the bytes to str (UTF-8 by default)
f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200).decode())

# The encoding can also be given explicitly
f = urllib.request.urlopen('http://www.baidu.com')
print(f.read(200).decode('utf-8'))

'''
b'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="'
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="
'''
```
```python
import urllib.request

# Create a Request object
url = 'http://www.baidu.com'
req = urllib.request.Request(url)

# Set request headers
req.add_header('User-Agent', 'Mozilla/5.0')

# Optional: set the request method
req.method = 'POST'

# Optional: set the request data (must be bytes)
data = b'key1=value1&key2=value2'
req.data = data

# Send the request and get the response
response = urllib.request.urlopen(req)

# Read the response content
content = response.read()

# Print the response content
print(content)
```
urllib.request.Request is used to construct an HTTP request object. With the Request class you can set the request URL, data, request headers, and other information. Passing the Request object to urlopen() sends the HTTP request and returns a response object; in the example, its contents are read with the read() method and printed.
3. urllib.parse
urllib.parse provides functions for parsing URLs, building URLs, and processing query strings.
3.1 urlparse
urlparse(urlstring, scheme='', allow_fragments=True)
- Parses a URL string and returns a named tuple whose parts can be accessed as attributes, such as scheme (the protocol), netloc (the host), and path.
- Parameters:
  - urlstring: the URL string to parse.
  - scheme: optional; the default scheme, used only if urlstring does not contain a scheme.
  - allow_fragments: optional; whether to parse fragment identifiers. If false, the fragment is left as part of the preceding component.
- Return value: a named tuple with six components: scheme, netloc, path, params, query, and fragment.
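The effect of the scheme and allow_fragments parameters is easiest to see side by side. This is a small sketch; the URL //www.example.com/path?x=1#top is an assumed example that deliberately omits the scheme.

```python
from urllib.parse import urlparse

# A scheme-relative URL: it starts with // and names no scheme itself
url = "//www.example.com/path?x=1#top"

with_default = urlparse(url, scheme="https")
no_fragments = urlparse(url, scheme="https", allow_fragments=False)

print(with_default.scheme)    # https (taken from the default)
print(with_default.netloc)    # www.example.com
print(with_default.query)     # x=1
print(with_default.fragment)  # top

# With allow_fragments=False the '#top' part stays attached to the query
print(no_fragments.query)     # x=1#top
print(no_fragments.fragment)  # (empty string)
```

Note that the leading // matters: without it, urlparse() treats www.example.com as part of the path rather than as the netloc.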
3.2 urlunparse
urlunparse(parts)
- Reassembles a tuple containing the parts of a URL into a URL string.
- Parameters:
  - parts: a tuple (or any six-item iterable) containing the parts of the URL in the order (scheme, netloc, path, params, query, fragment).
- Return value: the reassembled URL string.
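Because urlparse() returns a named tuple in exactly the order urlunparse() expects, a common pattern is to parse a URL, replace one component, and reassemble it. The sketch below assumes an example URL and a new query string page=2 purely for illustration.

```python
from urllib.parse import urlparse, urlunparse

original = "https://www.example.com/path/to/page?param1=value1#fragment"

parsed = urlparse(original)
# Named tuples are immutable, so _replace() returns a modified copy
modified = parsed._replace(query="page=2")

rebuilt = urlunparse(modified)
print(rebuilt)

# An unchanged parse result round-trips back to the original string
print(urlunparse(parsed) == original)
```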
3.3 urlencode
urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
- Converts a dictionary or a sequence of two-element tuples into a URL-encoded query string.
- Parameters:
  - query: the query parameters to encode, given as a dictionary or a sequence of two-element tuples.
  - doseq: optional; if True, a value that is itself a sequence produces one key=value pair per element. If False, the whole value is converted to a string and encoded as a single value.
  - safe: optional; characters that should not be encoded.
  - encoding: optional; the encoding used to convert strings to bytes before quoting.
  - errors: optional; how encoding errors are handled.
  - quote_via: optional; the quoting function to use. The default is quote_plus, which encodes spaces as +.
- Return value: the URL-encoded query string.
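The sketch below illustrates the doseq and quote_via parameters. The parameter names q and tag are made up for this example.

```python
from urllib.parse import quote, urlencode

params = {"q": "hello world", "tag": ["a", "b"]}

# doseq=True expands the list into one key=value pair per element
expanded = urlencode(params, doseq=True)
print(expanded)

# The default quote_plus encodes a space as '+';
# passing quote_via=quote encodes it as %20 instead
with_percent = urlencode({"q": "hello world"}, quote_via=quote)
print(with_percent)
```

Which space encoding is appropriate depends on what the receiving server expects; + is the conventional form for HTML form submissions.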
3.4 quote
quote(string, safe='/', encoding=None, errors=None)
- Replaces special characters in a string with %xx escapes. Letters, digits, and the characters _.-~ are never encoded.
- Parameters:
  - string: the string to encode.
  - safe: optional; additional characters that should not be encoded. The default is '/'.
  - encoding: optional; the encoding used for non-ASCII characters. The default is UTF-8.
  - errors: optional; how encoding errors are handled. The default is 'strict'.
- Return value: the encoded string.
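A brief sketch of how the safe parameter changes the result of quote(); the sample strings are assumptions chosen for illustration.

```python
from urllib.parse import quote

path = "/docs/read me.txt"

# By default '/' is considered safe and is left alone
default_quoted = quote(path)
print(default_quoted)

# With safe='' the slashes are escaped as well, which is useful
# when the string is a single path segment or a query value
fully_quoted = quote(path, safe="")
print(fully_quoted)

# Non-ASCII characters are encoded as UTF-8 bytes by default
accented = quote("café")
print(accented)
```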
3.5 unquote
unquote(string, encoding='utf-8', errors='replace')
- Decodes a URL-encoded string by replacing %xx escapes with the characters they represent.
- Parameters:
  - string: the string to decode.
  - encoding: optional; the encoding used to decode the escaped bytes.
  - errors: optional; how decoding errors are handled. With the default 'replace', invalid sequences are replaced by a placeholder character.
- Return value: the decoded string.
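The following sketch shows the encoding parameter of unquote() and how it differs from unquote_plus(), a related function in urllib.parse that also converts + to a space. The sample strings are assumptions for illustration.

```python
from urllib.parse import unquote, unquote_plus

# Escapes are decoded as UTF-8 by default
utf8_text = unquote("caf%C3%A9")
print(utf8_text)

# A different encoding can be requested for legacy data
latin1_text = unquote("caf%E9", encoding="latin-1")
print(latin1_text)

# unquote() leaves '+' untouched; unquote_plus() turns it into a space
kept_plus = unquote("a+b%20c")
plus_as_space = unquote_plus("a+b%20c")
print(kept_plus)
print(plus_as_space)
```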
3.6 Example
Parse URL:
```python
from urllib.parse import urlparse

url = 'https://www.example.com/path/to/page?param1=value1&param2=value2#fragment'

# Parse URL
parsed_url = urlparse(url)
print(parsed_url.scheme)  # Output: https
print(parsed_url.netloc)  # Output: www.example.com
print(parsed_url.path)    # Output: /path/to/page
```
Build URL:
```python
from urllib.parse import urlunparse

parts = ('https', 'www.example.com', '/path/to/page', '', 'param1=value1&param2=value2', 'fragment')

# Build URL
url = urlunparse(parts)
print(url)  # Output: https://www.example.com/path/to/page?param1=value1&param2=value2#fragment
```
Encode a query string:
```python
from urllib.parse import urlencode

params = {
    'param1': 'value1',
    'param2': 'value2'
}

# Encode query string
encoded_params = urlencode(params)
print(encoded_params)  # Output: param1=value1&param2=value2
```
Decode the query string:
```python
from urllib.parse import parse_qs

query_string = 'param1=value1&param2=value2'

# Decode query string
decoded_params = parse_qs(query_string)
print(decoded_params)  # Output: {'param1': ['value1'], 'param2': ['value2']}
```
URL encoding/decoding:
```python
from urllib.parse import quote, unquote

string = 'Hello World!'

# URL encoding
encoded_string = quote(string)
print(encoded_string)  # Output: Hello%20World%21

# URL decoding
decoded_string = unquote(encoded_string)
print(decoded_string)  # Output: Hello World!
```