Learn Python from 0 - 65

Python urllib-2

Simulating header information

When crawling web pages, we usually need to simulate request headers (the header information a browser sends). For this we use the urllib.request.Request class (a small usage sketch follows the parameter list below):

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
  • url : The URL address.
  • data : Additional data to send to the server, defaults to None.
  • headers : HTTP request headers, as a dictionary.
  • origin_req_host : The host of the original request, as an IP address or domain name.
  • unverifiable : Rarely used; indicates whether the request is unverifiable, defaults to False.
  • method : The request method, such as GET, POST, DELETE, PUT, etc.
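For instance, a minimal sketch of building a Request by hand and inspecting it; the example.com URL, the byte-string body, and the User-Agent value are just placeholders for illustration:

import urllib.request

# Build a request object without sending it (example.com is only a placeholder URL)
req = urllib.request.Request(
    'https://www.example.com/post',
    data=b'hello=world',                       # bytes body; supplying data makes the default method POST
    headers={'User-Agent': 'my-crawler/1.0'},  # placeholder User-Agent
    method='POST'
)
print(req.get_method())   # POST
print(req.full_url)       # https://www.example.com/post
print(req.headers)        # the headers dictionary stored on the request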
## Example - py3_urllib_test.py file code
import urllib.request
import urllib.parse

url = 'https://www.runoob.com/?s='  # Rookie Tutorial search page
keyword = 'Python Tutorial'
key_code = urllib.parse.quote(keyword)  # URL-encode the keyword
url_all = url + key_code
header = {
    'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}  # request headers
request = urllib.request.Request(url_all, headers=header)
response = urllib.request.urlopen(request).read()

fh = open("./urllib_test_runoob_search.html", "wb")  # write the file to the current directory
fh.write(response)
fh.close()

Executing the above Python code generates a urllib_test_runoob_search.html file in the current directory; open it (a browser works) to see the search results page for the keyword.

Forms can also transfer data via POST. Let's create a form first; the code is as follows, using PHP to receive the submitted form data:

## Example - py3_urllib_test.php file code:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Rookie tutorial (runoob.com) urllib POST test</title>
</head>
<body>
<form action="" method="post" name="myForm">
    Name: <input type="text" name="name"><br>
    Tag: <input type="text" name="tag"><br>
    <input type="submit" value="submit">
</form>

<?php
// Use PHP to get the data submitted by the form; you can replace it with another language
if(isset($_POST['name']) && $_POST['tag']) {
   echo $_POST["name"] . ', ' . $_POST['tag'];
}
?>
</body>
</html>
The Python code that submits the form data is as follows:

import urllib.request
import urllib.parse

url = 'https://www.runoob.com/try/py3/py3_urllib_test.php'  # the form page to submit to
data = {'name':'RUNOOB', 'tag':'Rookie Tutorial'}  # data to submit
header = {
    'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}  # request headers
data = urllib.parse.urlencode(data).encode('utf8')  # encode the parameters; urllib.parse.parse_qs can decode them again
request = urllib.request.Request(url, data, header)  # build the request
response = urllib.request.urlopen(request).read()  # read the result

fh = open("./urllib_test_post_runoob.html", "wb")  # write the file to the current directory
fh.write(response)
fh.close()

Executing the above code submits the form data to the py3_urllib_test.php file and writes the output to the urllib_test_post_runoob.html file.

Open the urllib_test_post_runoob.html file (a browser works) to see the result echoed back by the PHP script.
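As a side note on the encoding step above, here is a minimal sketch of the round trip: urlencode builds the request body, and parse_qs turns such a string back into a dictionary.

from urllib.parse import urlencode, parse_qs

data = {'name': 'RUNOOB', 'tag': 'Rookie Tutorial'}
body = urlencode(data)    # 'name=RUNOOB&tag=Rookie+Tutorial'
print(body)
print(parse_qs(body))     # {'name': ['RUNOOB'], 'tag': ['Rookie Tutorial']}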

urllib.error

The urllib.error module defines exception classes for exceptions raised by urllib.request, and the base exception class is URLError.

urllib.error defines two exception classes: URLError and HTTPError.

URLError is a subclass of OSError. It (or a subclass of it) is raised when a handler runs into a problem, and its reason attribute holds the cause of the error.

HTTPError is a subclass of URLError, used for special HTTP error cases such as authentication requests. Its code attribute is the HTTP status code, reason is the cause of the error, and headers holds the HTTP response headers of the request that triggered the HTTPError.

Fetch a non-existent web page and handle the exception:

import urllib.request
import urllib.error

myURL1 = urllib.request.urlopen("https://www.runoob.com/")
print(myURL1.getcode())   # 200

try:
    myURL2 = urllib.request.urlopen("https://www.runoob.com/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404) # 404
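The example above only checks the status code. As a sketch of the other attributes described earlier, this variant (using the same made-up non-existent page) also catches URLError; HTTPError must be caught first because it is a subclass of URLError:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("https://www.runoob.com/no.html")
except urllib.error.HTTPError as e:
    print(e.code)      # HTTP status code, e.g. 404
    print(e.reason)    # reason phrase, e.g. Not Found
    print(e.headers)   # response headers of the failed request
except urllib.error.URLError as e:
    print(e.reason)    # e.g. a socket error when the host cannot be reached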

urllib.parse

urllib.parse is used to parse URLs. Its main function has the following signature:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring is the URL string to parse, and scheme is the default protocol scheme, used only when the URL itself does not specify one.

If the allow_fragments parameter is False, fragment identifiers are not recognized; they are instead parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value.

## Example

from urllib.parse import urlparse

o = urlparse("https://www.runoob.com/?s=python + tutorial")
print(o)

The output of the above example is:

ParseResult(scheme='https', netloc='www.runoob.com', path='/', params='', query='s=python + tutorial', fragment='')

As can be seen from the result, the return value is a named tuple containing 6 components: the protocol (scheme), network location, path, parameters, query, and fragment.

We can read the protocol content directly:

from urllib.parse import urlparse

o = urlparse("https://www.runoob.com/?s=python + tutorial")
print(o.scheme)

The output of the above example is:

https

The full list of attributes is as follows:

| Attribute | Index | Value | Value if not present |
| --------- | ----- | ----- | -------------------- |
| scheme    | 0     | URL scheme (protocol) | scheme parameter |
| netloc    | 1     | Network location part | empty string |
| path      | 2     | Hierarchical path | empty string |
| params    | 3     | Parameters for the last path element | empty string |
| query     | 4     | Query component | empty string |
| fragment  | 5     | Fragment identifier | empty string |
| username  |       | User name | None |
| password  |       | Password | None |
| hostname  |       | Host name (lowercase) | None |
| port      |       | Port number as an integer, if present | None |
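A minimal sketch of these attributes, including the allow_fragments behaviour described above; the URL is made up for illustration:

from urllib.parse import urlparse

o = urlparse("https://www.example.com:8080/path;params?s=python#section")
print(o.port)       # 8080 (an integer)
print(o.hostname)   # www.example.com
print(o.fragment)   # section

# With allow_fragments=False the fragment text stays in the preceding component
o2 = urlparse("https://www.example.com/path#section", allow_fragments=False)
print(o2.fragment)  # '' (empty string)
print(o2.path)      # /path#section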

urllib.robotparser

urllib.robotparser is used to parse robots.txt files.

robots.txt (the filename is always lowercase) is a robots-exclusion file stored in the root directory of a website, usually used to tell search engines the rules for crawling the site.

urllib.robotparser provides the RobotFileParser class, the syntax is as follows:

class urllib.robotparser.RobotFileParser(url='')

This class provides some methods that can read and parse robots.txt files:

  • set_url(url) – Sets the URL of the robots.txt file.
  • read() – Reads a robots.txt URL and feeds it into the parser.
  • parse(lines) – parses the lines argument.
  • can_fetch(useragent, url) – Returns True if the useragent is allowed to fetch the url according to the rules in the parsed robots.txt file.
  • mtime() – Returns the last time the robots.txt file was fetched. This is useful for long-running web crawlers that need to periodically check for updates to the robots.txt file.
  • modified() – Sets the last time a robots.txt file was fetched to the current time.
  • crawl_delay(useragent) – Returns the Crawl-delay parameter from robots.txt for the specified useragent. Returns None if this parameter does not exist or does not apply to the specified useragent, or if there is a syntax error in the robots.txt entry for this parameter.
  • request_rate(useragent) – Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds) . Returns None if this parameter does not exist or does not apply to the specified useragent, or if there is a syntax error in the robots.txt entry for this parameter.
  • site_maps() – Returns the contents of the Sitemap parameter from robots.txt as a list(). Returns None if this parameter does not exist or if there is a syntax error in the robots.txt entry for this parameter.
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
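Finally, a short sketch of how this is typically combined with urllib.request, checking robots.txt before fetching a page; the example.com site and path are just placeholders:

import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()

url = "https://www.example.com/some/page.html"     # placeholder page
if rp.can_fetch("*", url):
    html = urllib.request.urlopen(url).read()      # only fetch when robots.txt allows it
else:
    print("Blocked by robots.txt:", url)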