Table of Contents
1. Create a Scrapy project
2. Parse data with XPath
3. Save data through pipelines
4. Middleware
1. Create a Scrapy project
1. Create a folder named C06, then enter the following commands in the terminal:
2. Install Scrapy: pip install scrapy
3. Go to the folder: cd C06
4. Create the project: scrapy startproject C06L02 (C06L02 is the project name)
5. Switch to C06L02: cd C06L02/C06L02
Switch to spiders: cd spiders
6. Generate a crawler and specify the start URL: scrapy genspider app https://product.cheshi.com/rank/2-0-0-0-1/
(To crawl an entire site with CrawlSpider instead: scrapy genspider -t crawl app http://seller.cheshi.com/jinan/)
7. Check that the start URL in the newly generated crawler file (app.py) matches the link you intend to crawl
8. Run the crawler file: scrapy crawl app
9. To suppress log output, add the following to settings.py: LOG_LEVEL = "ERROR"
To ignore the robots.txt protocol, add the following to settings.py: ROBOTSTXT_OBEY = False
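For reference, the two settings would look like this in settings.py (a minimal sketch; the rest of the generated file stays as-is):

# settings.py
LOG_LEVEL = "ERROR"        # only show errors in the console
ROBOTSTXT_OBEY = False     # do not obey the site's robots.txt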
10. The app.py code of a minimal Scrapy project is as follows:
import scrapy

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    # Note: the attribute must be start_urls, not started_urls
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        print(response.text)
If CrawlSpider is used to crawl the entire site:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AppSpider(CrawlSpider):
    name = "app"
    allowed_domains = ["seller.cheshi.com"]
    start_urls = ["http://seller.cheshi.com/jinan/"]
    rules = (
        Rule(
            LinkExtractor(allow=r"seller.cheshi.com/\d+", deny=r"seller.cheshi.com/\d+/.+"),
            callback="parse_item",
            follow=True,
        ),
    )

    # A CrawlSpider must not override parse(); the callback name above
    # has to match this method
    def parse_item(self, response):
        print(response.url)
11. User-Agent configuration: uncomment the USER_AGENT entry in the settings.py file and fill in the required value, for example:
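A minimal example of what the uncommented line might look like (the User-Agent string below is a placeholder; substitute the value of a real browser):

# settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"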
2. Parse data with XPath
Modify the parse function in the app.py file
import scrapy

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            # title and price are put to use in the pipelines section below
            title = car.xpath('./div[@class="m_detail"]//a/text()').get()
            price = car.xpath('./div[@class="m_detail"]//b/text()').get()
If paginated crawling is implemented, the code is as follows (the C06L10Item model it imports is defined after the snippet):

import scrapy
from ..items import C06L10Item

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["book.douban.com"]
    start_urls = ["http://book.douban.com/latest"]

    def parse(self, response):
        books = response.xpath('//ul[@class="chart-dashed-list"]/li')
        for book in books:
            link = book.xpath('.//h2/a/@href').get()
            yield scrapy.Request(url=link, callback=self.parse_details)
        # The "next page" link sits in the 4th span once a "previous page"
        # link appears, and in the 3rd span on the first page
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[4]/span[4]/a/@href').get()
        if next_url is not None:
            next_url = response.urljoin(next_url)
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
        else:
            next_url = response.xpath('//*[@id="content"]/div/div[1]/div[4]/span[3]/a/@href').get()
            next_url = response.urljoin(next_url)
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)

    def parse_details(self, response):
        item = C06L10Item()
        item["title"] = response.xpath('//*[@id="wrapper"]/h1/span/text()').get()
        item["publisher"] = response.xpath('//*[@id="info"]/a[1]/text()').get()
        yield item
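The C06L10Item model imported above is not shown in the original text; judging from the fields used in parse_details, its definition in items.py would presumably be:

# items.py (assumed from the fields used above)
import scrapy

class C06L10Item(scrapy.Item):
    title = scrapy.Field()
    publisher = scrapy.Field()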
3. Save data through pipelines
1. Define the data model in the items.py file
import scrapy

class C06L04Item(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
2. Add the following code to the app.py file
import scrapy
from ..items import C06L04Item

class AppSpider(scrapy.Spider):
    name = "app"
    allowed_domains = ["product.cheshi.com"]
    start_urls = ["http://product.cheshi.com/rank/2-0-0-0-1/"]

    def parse(self, response):
        cars = response.xpath('//ul[@class="condition_list_con"]/li')
        for car in cars:
            # Create a fresh item for each car so yielded items are independent
            item = C06L04Item()
            item["title"] = car.xpath('./div[@class="m_detail"]//a/text()').get()
            item["price"] = car.xpath('./div[@class="m_detail"]//b/text()').get()
            yield item
3. Uncomment the ITEM_PIPELINES block in the settings.py file; it needs no modification (see the snippet below).
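After uncommenting, the block should read roughly as follows (assuming the project is named C06L04, to match the pipeline class used below; the number is the pipeline's priority):

# settings.py
ITEM_PIPELINES = {
    "C06L04.pipelines.C06L04Pipeline": 300,
}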
4. Modify the pipelines.py file code
from itemadapter import ItemAdapter

class C06L04Pipeline:
    def process_item(self, item, spider):
        # print(item["title"], item["price"])
        return item
If you want to save the items to a file, add the following code:

from itemadapter import ItemAdapter

class C06L04Pipeline:
    def __init__(self):
        self.f = open("data.txt", "w")

    def process_item(self, item, spider):
        self.f.write(item["title"] + item["price"] + "\n")
        return item

    def __del__(self):
        # Close the file when the pipeline object is destroyed
        self.f.close()
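Closing the file in __del__ works, but Scrapy pipelines also provide explicit open_spider/close_spider hooks, which make the file's lifetime easier to reason about; an equivalent sketch:

from itemadapter import ItemAdapter

class C06L04Pipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.f = open("data.txt", "w")

    def process_item(self, item, spider):
        self.f.write(item["title"] + item["price"] + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; guarantees the file is closed
        self.f.close()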
To store the data in MongoDB instead, use the following code:

from itemadapter import ItemAdapter
import pymongo

class C06L04Pipeline:
    def __init__(self):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["cheshi"]
        self.col = self.db["cars"]

    def process_item(self, item, spider):
        res = self.col.insert_one(dict(item))
        print(res.inserted_id)
        return item

    def __del__(self):
        print("end")
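Note that pymongo must be installed first (pip install pymongo), and this pipeline only runs if ITEM_PIPELINES in settings.py points at it, as in step 3 above.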
4. Middleware
1. Middleware applications: random User-Agent, proxy IP, driving Selenium, adding cookies
2. Dynamic User-Agent
Uncomment the DOWNLOADER_MIDDLEWARES setting in the settings.py file, as shown below:
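After uncommenting, settings.py should contain roughly the following (the middleware class name follows the project name generated by scrapy startproject; C06L02 is assumed here):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "C06L02.middlewares.C06L02DownloaderMiddleware": 543,
}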
Add the following code to the middlewares.py file (only the modified part is shown):
import random

# This method goes inside the downloader middleware class that
# scrapy startproject generated in middlewares.py
def process_request(self, request, spider):
    # Candidate User-Agent values; the strings are placeholders for real
    # browser UA strings. Note that the header value itself must not
    # contain the "User-Agent:" prefix.
    uas = [
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Mxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    ]
    request.headers["User-Agent"] = random.choice(uas)
3. Proxy IP
The specific operations are omitted here; see, for example, the Kuaidaili documentation center (Tunnel Proxy → Python → Scrapy) for concrete usage. A rough sketch follows.
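As an illustration only (the proxy URL below is a placeholder, not a working endpoint), a tunnel proxy is typically attached in the same process_request hook by setting request.meta["proxy"]:

# middlewares.py — inside the downloader middleware class
def process_request(self, request, spider):
    # Placeholder address; substitute the tunnel endpoint and credentials
    # from your proxy provider's documentation
    request.meta["proxy"] = "http://username:password@tunnel.example.com:15818"
    return None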