Proxy pool construction
IP proxy
- Each device has its own IP address
- Your computer has an IP address ---> you visit a website ---> you visit too frequently ---> the site blocks your IP
- Paid proxies: reliable and stable ---> the provider exposes an API
- Free proxies: unstable ---> write your own API around a pool
- Open source: https://github.com/jhao104/proxy_pool
  - Free proxies ---> crawl free proxies ---> verify them ---> save them to redis
  - A flask web service ---> hit an interface to get a random IP
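The flask side of the pool can be sketched as a single endpoint that hands back a random verified proxy. This is a minimal sketch, not proxy_pool's actual code: a plain Python set stands in for the redis store, and the sample addresses are placeholders.

```python
import random

from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the redis store the real project uses; a plain set keeps
# the sketch self-contained (key names and addresses are made up)
PROXIES = {'110.243.5.163:8888', '27.42.168.46:55481'}

@app.route('/get')
def get_proxy():
    if not PROXIES:
        return jsonify({'error': 'pool is empty'}), 404
    # Hand back one random proxy from the pool
    return jsonify({'proxy': random.choice(sorted(PROXIES))})

if __name__ == '__main__':
    app.run(port=5010)
```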
Building steps
- git clone [email protected]:jhao104/proxy_pool.git
- Open the project in PyCharm
- Install dependencies (in a virtual environment): pip install -r requirements.txt
- Modify the configuration file: DB_CONN = 'redis://127.0.0.1:6379/0'
- Run the scheduler and the web program
  - Start the scheduler: python proxyPool.py schedule
  - Start the webApi service: python proxyPool.py server
API introduction
After starting the web service, the API service is enabled by default at http://127.0.0.1:5010.

| api | method | description | params |
|---|---|---|---|
| / | GET | API introduction | none |
| /get | GET | get a random proxy | optional: ?type=https filters proxies that support https |
| /pop | GET | get and delete a proxy | optional: ?type=https filters proxies that support https |
| /all | GET | get all proxies | optional: ?type=https filters proxies that support https |
| /count | GET | view the proxy count | none |
| /delete | GET | delete a proxy | ?proxy=host:ip |
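The endpoints above are plain GET requests, so a thin client wrapper is straightforward; a minimal sketch (the base URL assumes a local deployment, and the helper names are my own):

```python
import requests

BASE = 'http://127.0.0.1:5010'  # assumes a local proxy_pool deployment

def get_proxy(https_only=False):
    # /get returns JSON like {"proxy": "host:port", ...}
    params = {'type': 'https'} if https_only else {}
    return requests.get(BASE + '/get/', params=params).json().get('proxy')

def delete_proxy(proxy):
    # /delete removes a proxy that turned out to be dead
    requests.get(BASE + '/delete/', params={'proxy': proxy})

def proxy_count():
    return requests.get(BASE + '/count/').json()
```

In practice you would call `delete_proxy` whenever a proxy obtained from `get_proxy` fails, so the pool cleans itself up over time.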
HTTP and HTTPS proxies
- Use an http proxy to access http addresses
- Use an https proxy to access https addresses
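In requests this distinction shows up in the proxies dict: the key names the target scheme, the value names the proxy that handles it. A sketch with placeholder addresses:

```python
# The scheme of the *target URL* picks which entry requests uses;
# both addresses below are placeholders, not live proxies
proxies = {
    'http': 'http://110.243.5.163:8888',   # used for http:// targets
    'https': 'http://27.42.168.46:55481',  # used for https:// targets
}
# requests.get('https://example.com', proxies=proxies) would go through
# the 'https' entry; an http:// URL would use the 'http' entry
```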
Proxy pool usage
Build a django backend for testing
Steps
- Write a django app that returns the visitor's IP whenever it is accessed
- Deploy it on the public network: python manage.py runserver 0.0.0.0:8000
- Use a proxy to test locally:

```python
import requests

# Ask the pool for an http proxy
res = requests.get('http://192.168.1.252:5010/get/?type=http').json()['proxy']
proxies = {
    'http': res,
}
print(proxies)
# The target address is http, so we use an http proxy
response = requests.get('http://139.155.203.196:8080/', proxies=proxies)
print(response.text)
Supplement
Proxies come in transparent and high-anonymity varieties:
- Transparent: the server can still see the user's real IP
- High anonymity: the visitor's real IP is hidden and the server cannot see it
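Server-side, the difference is visible in the forwarding headers: a transparent proxy leaks the real IP in something like X-Forwarded-For, while a high-anonymity one sends nothing that distinguishes it from a direct visit. A rough classification sketch (header conventions vary between proxies, so treat this as illustrative):

```python
def classify_visit(headers, real_ip):
    """What the *server* can infer about a visitor from its headers."""
    forwarded = headers.get('X-Forwarded-For', '')
    if real_ip in forwarded:
        return 'transparent'    # proxy leaked the real IP in a header
    if forwarded or headers.get('Via'):
        return 'anonymous'      # proxy visible, but real IP hidden
    return 'high anonymity'     # indistinguishable from a direct visit
```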
Crawling a video website
Goal: crawl the videos from https://www.pearvideo.com/ and save them locally.
The request address is: https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0
```python
import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0')
# Parse the video addresses ---> regex
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
for video in video_list:
    video_id = video.split('_')[-1]
    url = 'https://www.pearvideo.com/' + video
    header = {
        'Referer': url
    }
    res_json = requests.get(
        f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.14435938848299434',
        headers=header).json()
    mp4_url = res_json['videoInfo']['videos']['srcUrl']
    real_mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)
    # Save the video locally
    res_video = requests.get(real_mp4_url)
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for line in res_video.iter_content(1024):
            f.write(line)
```
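The srcUrl returned by videoStatus.jsp is a decoy: its filename starts with a fake timestamp segment, and the replace call above swaps that segment for cont-<video_id> to obtain the real address. The rewrite in isolation, with a made-up sample URL to show the shape:

```python
video_id = '1742368'
# Sample decoy URL invented for illustration; the real one comes from the API
mp4_url = ('https://video.pearvideo.com/mp4/adshort/20210101/'
           '1609459200000-15732687_adpkg-ad_hd.mp4')

# Take the filename, grab the segment before the first '-',
# and replace it with 'cont-<video_id>'
fake_prefix = mp4_url.split('/')[-1].split('-')[0]
real_mp4_url = mp4_url.replace(fake_prefix, 'cont-%s' % video_id)
print(real_mp4_url)
```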
Crawling news
```python
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# Find all ul tags on the page whose class is "article"
soup = BeautifulSoup(res.text, 'html.parser')
ul_list = soup.find_all(name='ul', class_='article')
for ul in ul_list:
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:  # some li tags are ad slots without an h3
            title = h3.text
            url = 'https:' + li.find(name='a')['href']
            desc = li.find(name='p').text
            read_count = li.find(name='em').text
            img = li.find(name='img')['src']
            print(f'''Article title: {title}
Article address: {url}
Article summary: {desc}
Article read count: {read_count}
Article picture: {img}''')
```
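The find/find_all pattern above can be exercised offline against a toy snippet shaped like the Autohome markup (the HTML below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the structure the crawler expects
html = '''
<ul class="article">
  <li>
    <a href="//www.autohome.com.cn/news/1/fake.html">
      <h3>Sample title</h3>
      <p>Sample summary</p>
      <em>1024</em>
    </a>
  </li>
  <li><a href="//ad">ad block without h3</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
items = []
for ul in soup.find_all(name='ul', class_='article'):
    for li in ul.find_all(name='li'):
        h3 = li.find(name='h3')
        if not h3:  # skip ad slots that have no title
            continue
        items.append({
            'title': h3.text,
            'url': 'https:' + li.find(name='a')['href'],
            'desc': li.find(name='p').text,
            'read_count': li.find(name='em').text,
        })
print(items)
```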