Proxy pool construction, proxy pool use, crawling a video website, crawling news

Proxy pool construction

IP proxy

  • Each device will have its own IP address
  • The computer has an IP address ---> visit a website ---> visit too frequently ---> the site blocks the IP

  1. Paid: reliable and stable ---> the vendor provides an API

  2. Free: unstable ---> write your own API to serve them

    • Open source: https://github.com/jhao104/proxy_pool
    • Free proxies ---> crawl free-proxy sites ---> verify each proxy ---> save to redis (a sketch of this pipeline follows the list)
    • flask builds the web service ---> hit an interface and get a random IP
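
The pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code; the test URL, timeout, Redis key name, and the hard-coded sample proxies are all assumptions:

import requests
import redis
from flask import Flask

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def verify(proxy):
    # Keep a proxy only if a test request through it succeeds quickly
    try:
        requests.get('http://httpbin.org/ip', proxies={'http': proxy}, timeout=3)
        return True
    except requests.RequestException:
        return False

for proxy in ['1.2.3.4:8080', '5.6.7.8:3128']:  # in reality, crawled from free-proxy sites
    if verify(proxy):
        r.sadd('proxies', proxy)  # save to redis

# flask side: one interface that hands out a random IP
app = Flask(__name__)

@app.route('/get')
def get():
    proxy = r.srandmember('proxies')
    return proxy.decode() if proxy else 'no proxy available'

if __name__ == '__main__':
    app.run()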

Building steps

  1. git clone git@github.com:jhao104/proxy_pool.git

  2. Open it in PyCharm

  3. Install dependencies: create a virtual environment, then pip install -r requirements.txt

  4. Modify the configuration file (setting.py): DB_CONN = 'redis://127.0.0.1:6379/0'

  5. Run the scheduler and the web program

    • Start the scheduler: python proxyPool.py schedule
    • Start the webApi service: python proxyPool.py server
  6. api introduction
    After the web service starts, the API is served at http://127.0.0.1:5010 by default.

    api      method  description              params
    /        GET     api introduction         None
    /get     GET     get a random proxy       optional: ?type=https filters proxies that support https
    /pop     GET     get and delete a proxy   optional: ?type=https filters proxies that support https
    /all     GET     get all proxies          optional: ?type=https filters proxies that support https
    /count   GET     view the proxy count     None
    /delete  GET     delete a proxy           ?proxy=host:ip
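
With both programs running, the pool is consumed over these endpoints. A short usage sketch against the default host and port (the helper names here are just conventions, not part of the project):

import requests

def get_proxy():
    # /get returns a random proxy as json, e.g. {"proxy": "1.2.3.4:8080", ...}
    return requests.get('http://127.0.0.1:5010/get/').json()

def delete_proxy(proxy):
    # /delete drops a proxy that no longer works
    requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(proxy))

print(get_proxy())
print(requests.get('http://127.0.0.1:5010/count/').json())  # size of the pool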

http and https proxies
– Use an http proxy to access http addresses
– Use an https proxy to access https addresses
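
In requests this distinction is just the key in the proxies dict; the scheme of the target URL decides which entry gets used (the proxy addresses below are placeholders):

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',   # used when the target URL is http://...
    'https': 'http://10.10.1.11:1080',  # used when the target URL is https://...
}
requests.get('http://httpbin.org/ip', proxies=proxies)  # goes through proxies['http']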

Proxy pool usage

Build a django backend for testing

Steps
  1. Write a django view that returns the visitor's IP on every request (a minimal sketch follows these steps)
  2. Deploy it on the public network ---> python manage.py runserver 0.0.0.0:8000

  3. Use the proxy to test locally

    import requests

    # Ask the local proxy pool for a random http proxy
    res = requests.get('http://192.168.1.252:5010/get/?type=http').json()['proxy']
    proxies = {
        'http': res,
    }
    print(proxies)
    # The target address is http, so use an http proxy
    response = requests.get('http://139.155.203.196:8080/', proxies=proxies)
    print(response.text)
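
For step 1, a view like the following is enough. This is a minimal sketch; the view name and URL wiring are made up for illustration:

# views.py -- echo the caller's IP (names are illustrative)
from django.http import HttpResponse

def ip(request):
    # Behind a transparent proxy, the real IP shows up in X-Forwarded-For;
    # REMOTE_ADDR is whoever connected directly (possibly the proxy itself)
    client_ip = request.META.get('HTTP_X_FORWARDED_FOR', request.META.get('REMOTE_ADDR'))
    return HttpResponse(client_ip)

# urls.py -- route it, e.g.:
# from django.urls import path
# urlpatterns = [path('', views.ip)]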
    
Supplement

Proxies come in transparent and high-anonymity varieties
Transparent: the server can still see the user's real IP
High anonymity: the visitor's real IP is hidden and the server cannot see it
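
One way to check which kind you have: httpbin echoes back the headers it received, and a transparent proxy leaks the real address in X-Forwarded-For. A sketch (the proxy address is a placeholder):

import requests

proxies = {'http': 'http://1.2.3.4:8080'}  # placeholder proxy
resp = requests.get('http://httpbin.org/headers', proxies=proxies)
# Transparent proxy: X-Forwarded-For (or Via) contains your real IP
# High anonymity: those headers are absent
print(resp.json()['headers'])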

Crawling a video website

Goal: crawl videos from https://www.pearvideo.com/ and save them locally

The request address is: https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0

import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0')

# Parse the video addresses ---> regular expressions
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)

for video in video_list:
    video_id = video.split('_')[-1]
    url = 'https://www.pearvideo.com/' + video
    # videoStatus.jsp rejects requests without a matching Referer
    header = {
        'Referer': url
    }
    res_json = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.14435938848299434',
                            headers=header).json()
    mp4_url = res_json['videoInfo']['videos']['srcUrl']
    # srcUrl is a dummy address; swap its leading timestamp segment for cont-<video_id>
    real_mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)

    # Save the video locally
    res_video = requests.get(real_mp4_url)
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for line in res_video.iter_content(1024):
            f.write(line)
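
The replace on srcUrl is the trick here: the JSON returns a dummy address whose file name begins with a timestamp, and swapping that segment for cont-<video_id> produces the playable URL. A worked example with illustrative values:

# Illustrative values, not a real response
mp4_url = 'https://video.pearvideo.com/mp4/adshort/20231001/1696150000000-15438956_adpkg-ad_hd.mp4'
video_id = '1748044'

fake_segment = mp4_url.split('/')[-1].split('-')[0]  # '1696150000000'
real_mp4_url = mp4_url.replace(fake_segment, 'cont-%s' % video_id)
print(real_mp4_url)
# https://video.pearvideo.com/mp4/adshort/20231001/cont-1748044-15438956_adpkg-ad_hd.mp4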

Crawling news

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')

# Find all ul tags on the page whose class is 'article'
soup = BeautifulSoup(res.text, 'html.parser')
# bs4 search
ul_list = soup.find_all(name='ul', class_='article')

for ul in ul_list:
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:  # skip li tags that are not articles
            title = h3.text
            url = 'https:' + li.find(name='a')['href']
            desc = li.find(name='p').text
            read_count = li.find(name='em').text
            img = li.find(name='img')['src']
            print(f'''Article title: {title}
            Article address: {url}
            Article summary: {desc}
            Article read count: {read_count}
            Article picture: {img}''')