It’s already 2023 and you’re still using Selenium? How to use Scrapy + DrissionPage to crawl 51job and pass the slider captcha

Foreword

  • 1. What is DrissionPage?
  • 2. Scrapy + DrissionPage crawls 51job
  • 1. Create a Scrapy project
  • 2. Rewrite middlewares.py
  • 3. Write a_51job.py
  • Summary

Foreword

When crawling website data, we often run into encrypted data or various kinds of captchas. Sending requests directly with the requests library means spending a lot of time on JS reverse engineering, while automation tools such as Selenium are easily detected by websites. Is there a middle ground that combines the speed of requests with the simplicity of Selenium? DrissionPage may be what you need.

1. What is DrissionPage?

Overview: DrissionPage is a Python-based web automation tool. It can control the browser and also send and receive packets directly, combining the two in a single library, so you get the convenience of browser automation together with the efficiency of requests. It is powerful, ships with many user-friendly designs and convenience features, and its syntax is concise and elegant: the amount of code is small and it is friendly to newcomers.

The library uses a fully self-developed kernel, has many practical functions built in, and integrates and optimizes common operations. Compared with Selenium, it has the following advantages:

  • No webdriver fingerprint, so it is harder to detect
  • No need to download a different driver for each browser version
  • Runs faster
  • Can find elements across iframes without switching in and out
  • Treats an iframe as an ordinary element; once you have it you can search inside it directly, which keeps the logic clearer
  • Can operate multiple browser tabs at the same time, even inactive ones, without switching
  • Can read images directly from the browser cache to save them, without clicking “save” through the GUI
  • Can take screenshots of the entire page, including parts outside the viewport (browsers above version 90)
  • Can handle shadow roots that are not in the open state

More usage is not covered here; see the official DrissionPage documentation for details.
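As a taste of the syntax, here is a minimal sketch of browser control with DrissionPage (the URL and selector are placeholders, not from any real page):

from DrissionPage import ChromiumPage

page = ChromiumPage()                 # launch / take over a Chromium-based browser
page.get('https://www.example.com')   # navigate, similar to Selenium's get()
title = page.ele('tag:h1').text       # locate an element with DrissionPage's locator syntax
print(title)
print(page.html[:200])                # the page source is available directly as .html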

2. Scrapy + DrissionPage crawls 51job

Preparation tools:

PyCharm (Python 3.7)

Scrapy 2.9.0

DrissionPage 3.2.30
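If the two libraries are not installed yet, they can be pulled in with pip, pinning the versions listed above:

pip install scrapy==2.9.0 DrissionPage==3.2.30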

1. Create a scrapy project

Enter the following commands in order in the PyCharm terminal:

scrapy startproject _51job
cd _51job
scrapy genspider a_51job www.51job.com

The created files are as follows:
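The original screenshot of the generated project is not reproduced here; with the commands above, a freshly created project follows the standard Scrapy layout:

_51job/
├── scrapy.cfg
└── _51job/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── a_51job.py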

2. Rewrite middlewares.py

Because Scrapy sends requests with its own downloader by default, we need to rewrite the downloader middleware so that pages are fetched with DrissionPage instead.

import random
import time

from scrapy.http import HtmlResponse
from DrissionPage.common import ActionChains


class DrissionPageMiddleware:

    def process_request(self, request, spider):
        """
        Fetch the page with DrissionPage instead of Scrapy's downloader
        :param request:
        :param spider:
        :return:
        """
        url = request.url
        spider.edge.get(url)  # load the page, same idea as Selenium's get()
        spider.edge.wait.load_start(3)  # wait for the page to start loading
        html = spider.edge.html  # page source, same idea as Selenium's page_source
        return HtmlResponse(url=url, body=html, request=request, encoding='utf-8')

    def process_response(self, request, response, spider):
        """
        Handle the slider captcha
        :param request:
        :param response:
        :param spider:
        :return:
        """
        url = request.url
        while spider.edge.s_ele('#nc_1_n1z'):  # the slider captcha is present
            spider.edge.clear_cache()  # clear the cache
            ac = ActionChains(spider.edge)
            for _ in range(random.randint(10, 20)):  # jitter the mouse a little to look human
                ac.move(random.randint(-20, 20), random.randint(-10, 10))
            ac.move_to('#nc_1_n1z')  # move onto the slider
            ac.hold().move(300)  # press and drag it to the right
            time.sleep(2)
            spider.edge.get(url)
            time.sleep(2)
            html = spider.edge.html
            if 'sliding verification page' not in html:  # verification succeeded
                return HtmlResponse(url=url, body=html, request=request, encoding='utf-8')
            else:  # verification failed, reset and try again
                spider.edge.clear_cache()
                spider.edge.get(url)
        return response
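For the middleware to take effect it also has to be enabled in settings.py. A minimal sketch, assuming the project and class names used above (the priority value 543 is just a typical choice):

# settings.py of the _51job project
DOWNLOADER_MIDDLEWARES = {
    "_51job.middlewares.DrissionPageMiddleware": 543,
}
# usually also needed so robots.txt rules do not filter out the requests
ROBOTSTXT_OBEY = False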

3. Write a_51job.py

import random
import time

import scrapy
from DrissionPage import ChromiumPage
from DrissionPage.errors import ElementNotFoundError


class A51jobSpider(scrapy.Spider):
    name = "_51job"
    # allowed_domains = ["www.51job.com"]
    start_urls = []  # search link, e.g.: https://we.51job.com/pc/search?jobArea=190200&keyword=java&searchType=2&sortType=0&metro=

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.edge = ChromiumPage()  # instantiate the browser
        self.all_urls = []
        self.is_detail = False  # whether to scrape detail-page information
        self.is_basic = True  # whether to scrape basic information from the search page

    def get_all_want_info(self, response):
        """
        Get all job postings
        :return:
        """
        last_page = int(self.edge.eles('.number')[-1].text)  # total number of result pages
        for page_num in range(1, last_page + 1):  # page through the results
            print(f'------Crawling page {page_num}---------')
            self.edge.ele('#jump_page').input(page_num)  # enter the page number
            self.edge.ele('.jumpPage').click()  # click to jump
            time.sleep(3)
            # basic recruitment information from the search page
            if self.is_basic:
                page_want_info = []
                job_names = self.edge.s_eles('.jname at')  # job title
                times = self.edge.s_eles('.time')  # release time
                sals = self.edge.s_eles('.sal')  # salary
                requires = self.edge.s_eles('.d at')  # job requirements
                tags = self.edge.s_eles('.tags')  # benefits
                company_names = self.edge.s_eles('.cname at')  # company
                company_class = self.edge.s_eles('.dc at')  # company type
                domain_class = self.edge.s_eles('.int at')  # industry
                for base_want_info in zip(company_names, job_names, times, sals, tags, requires, company_class,
                                          domain_class):
                    want_info = []
                    for i in base_want_info:
                        want_info.append(i.text)
                    page_want_info.append(want_info)
                print(page_want_info)
            # detail-page information
            if self.is_detail:
                for i in self.edge.s_eles('.el'):  # collect the posting urls
                    self.all_urls.append(i.attr('href'))
                random.shuffle(self.all_urls)
                for url in self.all_urls:  # visit each posting
                    yield scrapy.Request(url=url, callback=self.parse)
                    try:
                        job_name = self.edge.s_ele('xpath:/html/body/div[2]/div[2]/div[2]/div/div[1]/h1').text  # job title
                        compensation = self.edge.s_ele(
                            'xpath:/html/body/div[2]/div[2]/div[2]/div/div[1]/strong').text  # salary
                        required_info = self.edge.s_ele('.msg ltype').text  # job requirements
                        address = self.edge.s_ele('xpath:/html/body/div[2]/div[2]/div[3]/div[2]/div/p').text  # work address
                        company_info = self.edge.s_ele('.tmsg inbox').text  # company information
                        print(job_name, address, compensation)
                    except ElementNotFoundError:
                        pass
                self.edge.get(self.start_urls[0])
                self.all_urls = []

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.get_all_want_info
        )

    def parse(self, response, **kwargs):
        pass


from scrapy import cmdline

if __name__ == '__main__':  # run the spider; alternatively run `scrapy crawl _51job` in the terminal
    cmdline.execute("scrapy crawl _51job".split())

If you want to search by specific conditions, you can simply control the browser directly and click the filters on the page, so that is not covered in detail here.
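As a rough illustration only, applying a filter can be driven the same way as the rest of the spider, by clicking elements through the ChromiumPage instance. The locator below is a made-up placeholder, not taken from 51job's actual page:

from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://we.51job.com/pc/search?jobArea=190200&keyword=java&searchType=2&sortType=0&metro=')
# '.area-filter' is a hypothetical locator: replace it with the real locator of the filter you want
page.ele('.area-filter').click()
page.wait.load_start(3)  # wait for the filtered results to start loading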

Summary

This article used Scrapy + DrissionPage to crawl 51job recruitment information. Because DrissionPage drives the browser directly, there is no separate driver to download and none of the usual automation fingerprints to hide, and in practice the pass rate on 51job's slider captcha is close to 100%. Crawling speed is also decent, making it a powerful yet simple tool.