Preface
Recently I had a small crawling task, so I want to summarize some basic usage of the scrapy framework to deepen my own impression. I am used to running crawlers from single script files, but when the amount of data is very large and the requirements on stability and efficiency are relatively high, scrapy is the more appropriate choice. Scrapy is an asynchronous framework in which requests are non-blocking. The same thing can be implemented in a single-file script, but the code turns out ugly and hard to maintain: a few days later you have forgotten how it works, and a complicated flow is very difficult to debug. I do not use scrapy much myself, but I think it is very good, with mature middleware support, a convenient downloader, and high stability and efficiency. On the other hand, its running process is a bit complicated and hard to understand, and debugging bugs in an asynchronous framework is troublesome.
Initialize scrapy
First, install the scrapy and selenium packages:
pip install scrapy
pip install selenium
Initialize the framework:
scrapy startproject testSpider
Then enter the project directory and create a new spider:
cd testSpider
scrapy genspider myspider example.com
Take a look at the directory structure:
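The generated project typically looks like this (the standard Scrapy layout; minor differences between versions are possible):

```
testSpider/
├── scrapy.cfg
└── testSpider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── myspider.py
```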
Basic use of selenium
Selenium Preface
Today I will only cover the basic use of selenium; the workflow of the scrapy framework will be summarized later. Why use selenium inside scrapy? Because the interfaces of some target sites are very hard to reproduce through request analysis: they usually carry obfuscated parameters that get the request intercepted. When you run into such anti-crawler measures, you have to analyze the site's JavaScript and work out what each parameter means. That process is complicated and the workload is large; it is essential knowledge for advanced crawler work and requires some JavaScript reverse-engineering skill. For example, Ruishu Information is famous in the industry as a top-tier player in this space: its JavaScript obfuscation technology is used on many financial and government websites. I only understand a little of it myself.
With selenium you can bypass some heavily protected interfaces and still obtain the important information. The usual pattern is to use selenium to simulate logging in through the interface that carries the anti-crawler measures, grab the cookies after login, and then call the post-login interfaces, which have no such protection.
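A minimal sketch of that hand-off (the function name, URLs, and login steps here are placeholders of mine, not part of this article's project):

```python
import scrapy
from selenium import webdriver


def login_and_build_request(login_url: str, api_url: str) -> scrapy.Request:
    """Log in with a real browser, then reuse its cookies in a scrapy request."""
    driver = webdriver.Chrome()
    driver.get(login_url)
    # ... fill in the login form with driver.find_element(...) and submit ...

    # selenium returns a list of dicts like {'name': ..., 'value': ...};
    # scrapy's Request accepts that list directly, or a plain {name: value} dict
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    driver.quit()
    return scrapy.Request(api_url, cookies=cookies)
```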
Download Driver
Using selenium requires a browser driver that matches your browser version. My browser is Chrome, so I first checked my Chrome version and downloaded the matching chromedriver. I then put the driver executable into the browser's installation directory and added that directory to the PATH environment variable.
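A quick way to verify that the driver is found through the environment variable is a minimal launch test (my own sanity check, not from the original article):

```python
from selenium import webdriver

# if this opens a browser window and prints the page title,
# chromedriver is installed and reachable through PATH
driver = webdriver.Chrome()
driver.get('https://example.com/')
print(driver.title)
driver.quit()
```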
Key code
The key code lives in testSpider/testSpider/spiders/myspider.py; the generated file currently looks like this:
```python
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
```
As the title of this article suggests, I only need to write code in the file above. If you want to use other crawling approaches, you also need to change the settings in testSpider/testSpider/settings.py. If you are interested, you can refer to my earlier article: Use the Scrapy framework to crawl V2ex to see what programmers are discussing during the Mid-Autumn Festival.
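For reference, these are the kinds of switches in settings.py that usually matter for this sort of crawl (illustrative values of my own, not taken from the original project):

```python
# testSpider/testSpider/settings.py (excerpt, illustrative values)
ROBOTSTXT_OBEY = False   # login-protected APIs are usually disallowed by robots.txt
COOKIES_ENABLED = True   # required so cookies passed to Request are actually sent
DOWNLOAD_DELAY = 1       # throttle requests to be gentle with the target site
```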
Let me show the code directly. I picked the Qiniu Cloud login because it is relatively simple, has few steps, and suits a tutorial. The details are explained in the comments:
```python
import scrapy
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    # Pay attention to the list of domains allowed to be crawled. I stepped into
    # a pit here: after trying for a long time I found that the framework's
    # default parse callback was never called. You must write the full domain of
    # the request URL here, not just the bare top-level domain.
    allowed_domains = ['portal.qiniu.com']
    start_urls = ['http://example.com/']
    user_name = '********@**.com'
    password = '********'
    chrome_options = Options()
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)  # initialize the Chrome driver
    driver.implicitly_wait(20)
    headers = {
        'authority': 'portal.qiniu.com',
        'accept': '*/*',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'referer': 'https://portal.qiniu.com/certificate/ssl',
        'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
    }  # request headers
    cookie = {}

    def __del__(self):
        self.driver.close()

    def parse(self, response, *args, **kwargs):
        print('Default parse callback, content returned by the interface:')
        print(response.json())

    def start_requests(self):
        self.driver.get(url='https://sso.qiniu.com/')  # open the login page directly
        user_input = self.driver.find_element(By.ID, 'email')  # locate the user name box
        user_input.send_keys(self.user_name)  # type the user name
        password_input = self.driver.find_element(By.ID, 'password')  # locate the password box
        password_input.send_keys(self.password)  # type the password
        self.driver.find_element(By.ID, 'login-button').click()  # log in
        try:
            # wait for the page to jump after login, giving up after 60 seconds
            WebDriverWait(self.driver, 60).until(
                EC.visibility_of_element_located((By.CLASS_NAME, "user-plane-entry")))
        except Exception:
            print('Login timed out, failed')  # waited more than 60 seconds
            self.driver.quit()
            return  # without a login there is nothing to crawl
        self.cookie = self.driver.get_cookies()  # cookies come back as name/value pairs
        print(self.cookie)
        print(self.headers)
        # yield the request into the asynchronous task queue
        yield scrapy.Request(url='https://portal.qiniu.com/api/gaea/billboard/list?status=1',
                             callback=self.parse,
                             cookies=self.cookie,
                             headers=self.headers)
```
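With the spider saved, it is run from the project root in the usual way:

scrapy crawl myspider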
Look at the effect in the log:
The content returned by the interface is printed correctly. If the interface is requested directly, without the login cookies, an error is returned instead.
Summary
Using selenium inside scrapy is a very common scenario, and today's post is only a brief summary. I will record and share the difficulties I encounter later, one by one. Stay tuned.