Foreword
- 1. What is DrissionPage?
- 2. scrapy + DrissionPage crawls 51job
- 1. Create a scrapy project
- 2. Rewrite middlewares.py
- 3. Write a_51job.py
- Summary
Foreword
When crawling website data, we often run into encrypted fields or various kinds of captchas. Using requests directly means spending a lot of time on JS reverse engineering, while automation tools such as selenium are easily detected. Is there a middle ground that offers the speed of requests together with the simple operation of selenium? DrissionPage may be what you are looking for.
1. What is DrissionPage?
Overview: DrissionPage is a Python-based web automation tool. It can control a browser, send and receive packets, and combine the two in a single object, giving you the convenience of browser automation together with the efficiency of requests. It is powerful, ships with many user-friendly designs and convenience features, and its syntax is concise and elegant, keeping code short and beginner-friendly.
The library uses a fully self-developed kernel, has many practical functions built in, and integrates and optimizes common operations. Compared with selenium, it has the following advantages:
- No webdriver fingerprint
- No need to download a separate driver for each browser version
- Runs faster
- Can find elements across iframes without switching in and out
- Treats an iframe as an ordinary element; once you have it, you can search inside it directly, which makes the logic clearer
- Can operate multiple browser tabs at the same time, even inactive ones, without switching
- Can read images straight from the browser cache to save them, no need to click Save via a GUI
- Can screenshot the entire webpage, including parts outside the viewport (supported by browsers above version 90)
- Can handle shadow-root elements in non-open states
More usage and details are not covered here; see the official DrissionPage documentation for the full picture.
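As a quick taste, here is a minimal sketch of the two modes side by side, based on DrissionPage's documented entry points (example.com is just a stand-in URL):

```python
# Minimal sketch of DrissionPage's two modes; example.com is a stand-in URL.
from DrissionPage import ChromiumPage, SessionPage

browser = ChromiumPage()            # drives a real Chromium browser
browser.get('https://example.com')  # load the page in a tab
print(browser.title)                # read data from the live page

session = SessionPage()             # requests-style mode, no browser involved
session.get('https://example.com')  # plain HTTP request, much faster
print(session.title)
```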
2. scrapy + DrissionPage crawls 51job
Preparation tools:
- Python 3.7 (in PyCharm)
- scrapy 2.9.0
- DrissionPage 3.2.30
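Both libraries are on PyPI; pinning the versions above would look like this:

```
pip install scrapy==2.9.0
pip install DrissionPage==3.2.30
```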
1. Create a scrapy project
Enter the following commands, in order, in the PyCharm terminal:
```
scrapy startproject _51job
cd _51job
scrapy genspider a_51job www.51job.com
```
The created files are as follows:
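This is the standard layout scrapy generates, with the spider file added by genspider:

```
_51job/
├── scrapy.cfg
└── _51job/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── a_51job.py
```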
2. Rewrite middlewares.py
Because scrapy's built-in downloader sends plain HTTP requests, we rewrite the downloader middleware so that pages are fetched through DrissionPage instead:
```python
import random
import time

from scrapy.http import HtmlResponse
from DrissionPage.common import ActionChains


class DrissionPageMiddleware:
    def process_request(self, request, spider):
        """Fetch the page with DrissionPage.

        :param request:
        :param spider:
        :return:
        """
        url = request.url
        spider.edge.get(url)            # request the page, like selenium's get()
        spider.edge.wait.load_start(3)  # wait for loading
        html = spider.edge.html         # page html, like selenium's page_source
        return HtmlResponse(url=url, body=html, request=request, encoding='utf-8')

    def process_response(self, request, response, spider):
        """Handle the slider captcha.

        :param request:
        :param response:
        :param spider:
        :return:
        """
        url = request.url
        while spider.edge.s_ele('#nc_1_n1z'):  # the slider captcha is present
            spider.edge.clear_cache()          # clear the cache
            ac = ActionChains(spider.edge)
            # jitter the mouse a few times so the track looks human
            for _ in range(random.randint(10, 20)):
                ac.move(random.randint(-20, 20), random.randint(-10, 10))
            ac.move_to('#nc_1_n1z')  # move onto the slider
            ac.hold().move(300)      # hold and drag it to the right
            time.sleep(2)
            spider.edge.get(url)
            time.sleep(2)
            html = spider.edge.html
            if 'sliding verification page' not in html:  # verification succeeded
                return HtmlResponse(url=url, body=html, request=request, encoding='utf-8')
            # verification failed: clear the cache and reload before retrying
            spider.edge.clear_cache()
            spider.edge.get(url)
        return response
```
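For scrapy to actually call this middleware, it has to be enabled in settings.py. A minimal sketch, assuming the _51job project created above (543 is just a middle-of-the-range priority):

```python
# settings.py (sketch): register the custom downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "_51job.middlewares.DrissionPageMiddleware": 543,
}
# One request at a time, since a single browser instance does the fetching.
CONCURRENT_REQUESTS = 1
```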
3. Write a_51job.py
```python
import random
import time

import scrapy
from scrapy import cmdline
from DrissionPage import ChromiumPage
from DrissionPage.errors import ElementNotFoundError


class A51jobSpider(scrapy.Spider):
    name = "_51job"
    # allowed_domains = ["www.51job.com"]
    # Put a search link here, such as:
    # https://we.51job.com/pc/search?jobArea=190200&keyword=java&searchType=2&sortType=0&metro=
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.edge = ChromiumPage()  # instantiate the browser
        self.all_urls = []
        self.isdatail = False  # crawl the detail pages
        self.isbasic = True    # crawl basic info from the search pages

    def get_all_want_info(self, response):
        """Get all job offers."""
        last_page = int(self.edge.eles('.number')[-1].text)  # total number of search pages
        for page_num in range(1, last_page + 1):  # turn the pages
            print(f'------Crawling page {page_num}---------')
            self.edge.ele('#jump_page').input(page_num)  # enter the page number
            self.edge.ele('.jumpPage').click()           # click to jump
            time.sleep(3)
            # get the basic recruitment information
            if self.isbasic:
                page_want_info = []
                job_names = self.edge.s_eles('.jname at')      # job names
                times = self.edge.s_eles('.time')              # release times
                sals = self.edge.s_eles('.sal')                # salaries
                requrire = self.edge.s_eles('.d at')           # recruitment requirements
                tags = self.edge.s_eles('.tags')               # benefits
                company_names = self.edge.s_eles('.cname at')  # companies
                company_class = self.edge.s_eles('.dc at')     # company types
                domain_class = self.edge.s_eles('.int at')     # industries
                for base_want_info in zip(company_names, job_names, times, sals,
                                          tags, requrire, company_class, domain_class):
                    want_info = [i.text for i in base_want_info]
                    page_want_info.append(want_info)
                print(page_want_info)
            # get the job details
            if self.isdatail:
                for i in self.edge.s_eles('.el'):  # collect the recruitment urls
                    self.all_urls.append(i.attr('href'))
                random.shuffle(self.all_urls)
                for url in self.all_urls:  # get the recruitment information
                    yield scrapy.Request(url=url, callback=self.parse)
                    try:
                        job_name = self.edge.s_ele(
                            'xpath:/html/body/div[2]/div[2]/div[2]/div/div[1]/h1').text  # position
                        compensation = self.edge.s_ele(
                            'xpath:/html/body/div[2]/div[2]/div[2]/div/div[1]/strong').text  # salary
                        required_info = self.edge.s_ele('.msg ltype').text  # requirements
                        address = self.edge.s_ele(
                            'xpath:/html/body/div[2]/div[2]/div[3]/div[2]/div/p').text  # work address
                        company_info = self.edge.s_ele('.tmsg inbox').text  # company information
                        print(job_name, address, compensation)
                    except ElementNotFoundError:
                        pass
                self.edge.get(self.start_urls[0])
                self.all_urls = []

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.get_all_want_info)

    def parse(self, response, **kwargs):
        pass


if __name__ == '__main__':
    cmdline.execute("scrapy crawl _51job".split())
```
If you want to filter the search results by condition, you can simply control the browser and click the filters directly, so I won't go into detail here.
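For example, a sketch of driving a search filter inside the spider (the selectors below are hypothetical; inspect the live page for the real ones):

```python
# Sketch only: '#keywordInput' and '.filter-item' are hypothetical selectors.
self.edge.ele('#keywordInput').input('java')  # type a keyword
self.edge.ele('.filter-item').click()         # click a condition filter
self.edge.wait.load_start(3)                  # wait for the results to load
```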
Summary
This article used scrapy + DrissionPage to crawl 51job recruitment information. Because DrissionPage drives the user's existing browser, there is no separate driver to download and no automation fingerprints to patch over, and the pass rate on 51job's slider captcha is close to 100%. Crawling speed is also good; it is a powerful and simple tool.