Table of Contents
- Background
- Create a Scrapy project
- An uncomfortable start
- Specify the type
- Modify the template and use it
- Run Scrapy
Background
Is there really a Python problem that the almighty PyCharm cannot solve???
Create a Scrapy project
PyCharm has no option to create a Scrapy project directly, so create one from the command line.
Install scrapy
pip install scrapy
Check the version
scrapy version
If you can see the version number, the installation succeeded.
Create a scrapy project
scrapy startproject yourprojectname
Create a crawler as prompted (the project here was created as asd):
cd asd
scrapy genspider example example.com
The crawler is created successfully.
An uncomfortable start
Open this project in PyCharm, and you will find that the parse function in the crawler is highlighted in gray with the warning:
Signature of method 'ExampleSpider.parse()' does not match signature of the base method in class 'Spider'
Take a look at how parse is defined in the parent class, since we are overriding a parent-class method; just Ctrl+click it in PyCharm.
Jumping into the parent class, you can see that it has a **kwargs parameter.
Add the **kwargs parameter to your own crawler's parse, and the gray warning disappears; it no longer hurts to look at.
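The warning is purely about signature compatibility. A minimal pure-Python analogue (hypothetical classes, not Scrapy itself) shows why adding **kwargs makes the override match the base method:

```python
import inspect

# The base class declares parse(self, response, **kwargs), so an override
# without **kwargs has a narrower signature and PyCharm flags it.
class Spider:
    def parse(self, response, **kwargs):
        raise NotImplementedError

class BadSpider(Spider):
    def parse(self, response):  # missing **kwargs -> signature mismatch warning
        return response

class GoodSpider(Spider):
    def parse(self, response, **kwargs):  # matches the base signature
        return response

base = str(inspect.signature(Spider.parse))
print(base)                                              # (self, response, **kwargs)
print(str(inspect.signature(GoodSpider.parse)) == base)  # True
print(str(inspect.signature(BadSpider.parse)) == base)   # False
```

Either signature still works at runtime when Scrapy calls it with just a response; the **kwargs only widens the override to accept whatever the base class's contract allows.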
Specify the type
But there is still a problem: the gray warning was merely ugly to look at, while having no code completion is genuinely painful; typing . brings up nothing.
Run the spider and print the response's type; you can see it is scrapy.http.response.html.HtmlResponse.
scrapy crawl example # example is the name of your crawler, which is the name attribute in the class
If you don't want to see so much noisy log output, set the log level in settings.py:
LOG_LEVEL = 'WARNING'
Re-run to see the effect; the world is quiet.
Now that you know its type, simply annotate it.
from scrapy.http.response import Response

def parse(self, response: Response, **kwargs):
Now the . completion shows up; comfortable.
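The annotation changes nothing at runtime; it only tells the IDE (and introspection tools) what the parameter is, which is where the completion comes from. A small sketch with a stand-in class (hypothetical, not Scrapy's Response) illustrates this:

```python
from typing import get_type_hints

class Response:  # stand-in for scrapy.http.response.Response, for illustration only
    def css(self, query: str) -> str:
        return f"selecting {query}"

def parse(response: Response, **kwargs):
    # Because of the annotation, the IDE knows response offers .css() etc.
    return response.css("title::text")

print(get_type_hints(parse)["response"] is Response)  # True
print(parse(Response()))                              # selecting title::text
```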
Modify the template and use it
Finally there are normal code completions, but you can't write this by hand every time. Checking the genspider command, you find that the -t parameter lets you use a custom template.
Normally, crawlers are created from the basic template.
Find this folder under your interpreter's path:
\Python39\Lib\site-packages\scrapy\templates\spiders
The template files are located there.
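If you'd rather not browse site-packages by hand, a short sketch can locate the installed package's templates folder programmatically (the templates/spiders subpath is taken from the directory layout shown above):

```python
import importlib.util
import os

# Resolve the scrapy package directory from the current interpreter,
# so the path always matches the active (virtual) environment.
spec = importlib.util.find_spec("scrapy")
if spec is not None and spec.origin:
    templates_dir = os.path.join(os.path.dirname(spec.origin), "templates", "spiders")
else:
    templates_dir = None  # scrapy is not installed in this interpreter

print(templates_dir or "scrapy is not installed in this interpreter")
```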
If you use PyCharm, there is an even faster way that also locates the interpreter directly, so you don't end up in a different virtual environment's path:
import the scrapy package, hold down Ctrl on the keyboard, and left-click it to jump to its source code.
The click lands in __init__.py; it actually doesn't matter where you jump, as long as it belongs to the scrapy package.
From there you can find the template files, view their contents, and right-click to open the containing folder.
I chose to copy the template out (you could also edit it in place, in which case you don't need genspider -t to specify the template).
mytemplate.tmpl (call it whatever you like):
import scrapy
from scrapy.http.response import Response


class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = ["$url"]

    def parse(self, response: Response, **kwargs):
        pass
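The $classname, $name, $domain, and $url placeholders use the same $-syntax as Python's standard string.Template. A simplified sketch of the substitution genspider performs (not Scrapy's actual code) shows how the template becomes a spider:

```python
from string import Template

# Fill template placeholders the way a genspider-style tool would.
tmpl = Template('class $classname(scrapy.Spider):\n    name = "$name"')
rendered = tmpl.substitute(classname="TestSpider", name="test")
print(rendered)
# class TestSpider(scrapy.Spider):
#     name = "test"
```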
Then you can create crawlers from the custom template:
# scrapy genspider -t <template name> <crawler name> <domain>
scrapy genspider -t mytemplate test test.com
You can see that the newly created crawler is fine, with comfortable code completion.
Note that the template name must be spelled correctly, otherwise an error is reported.
If you forget your template name, you can list the templates to check:
scrapy genspider --list
Run Scrapy
As mentioned above, starting Scrapy requires the following command in the terminal:
scrapy crawl example # example is the name of your crawler, which is the name attribute in the class
So troublesome? I'm using the almighty PyCharm; surely there is an easier way to run and debug with breakpoints?
Can I right-click to run or debug, or use the almighty Ctrl+Shift+F10?
Of course you can.
In the project root, at the same level as scrapy.cfg, create a main.py file (name it whatever you like) and write the following code.
main.py
from scrapy.cmdline import execute
import os
import sys

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'example'])  # replace the last argument with your own crawler's name
Run this file directly to run Scrapy.
After running it once, you can re-run with Ctrl+F5.
You can also debug with breakpoints; you can see that execution stops at the breakpoint.
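The main.py above hard-codes the crawler name example; a small variant (a hypothetical helper, not part of Scrapy) takes the name from the command line so one main.py can run any spider, e.g. python main.py test:

```python
import sys

def build_command(argv):
    """Build the argv list for scrapy.cmdline.execute from this script's argv."""
    spider = argv[1] if len(argv) > 1 else "example"  # default spider name
    return ["scrapy", "crawl", spider]

print(build_command(["main.py", "test"]))  # ['scrapy', 'crawl', 'test']
# In main.py you would then call: execute(build_command(sys.argv))
```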