Proxy pool construction, proxy pool use, crawling a video website, crawling news

Proxy pool construction

IP proxy

  • Each device will have its own IP address
  • The computer has an IP address ---> visit a website ---> visit too frequently ---> the site blocks the IP

  1. Paid: reliable and stable ---> the vendor provides an API

  2. Free: unstable ---> write your own API to serve them

    • Open source: https://github.com/jhao104/proxy_pool
    • Free proxies ---> crawl free-proxy sites ---> verify each proxy ---> save to redis (a sketch of this pipeline follows the list)
    • flask builds the web service ---> hit an interface and get a random IP
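
The pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code; the test URL, timeout, Redis key name, and the hard-coded sample proxies are all assumptions:

import requests
import redis
from flask import Flask

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

def verify(proxy):
    # Keep a proxy only if a test request through it succeeds quickly
    try:
        requests.get('http://httpbin.org/ip', proxies={'http': proxy}, timeout=3)
        return True
    except requests.RequestException:
        return False

for proxy in ['1.2.3.4:8080', '5.6.7.8:3128']:  # in reality, crawled from free-proxy sites
    if verify(proxy):
        r.sadd('proxies', proxy)  # save to redis

# flask side: one interface that hands out a random IP
app = Flask(__name__)

@app.route('/get')
def get():
    proxy = r.srandmember('proxies')
    return proxy.decode() if proxy else 'no proxy available'

if __name__ == '__main__':
    app.run()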

Building steps

  1. git clone git@github.com:jhao104/proxy_pool.git

  2. Open it in PyCharm

  3. Install dependencies: create a virtual environment, then pip install -r requirements.txt

  4. Modify the configuration file (setting.py): DB_CONN = 'redis://127.0.0.1:6379/0'

  5. Run the scheduler and the web program

    • Start the scheduler: python proxyPool.py schedule
    • Start the webApi service: python proxyPool.py server
  6. api introduction
    After the web service starts, the API is served at http://127.0.0.1:5010 by default.

    api      method  description              params
    /        GET     api introduction         None
    /get     GET     get a random proxy       optional: ?type=https filters proxies that support https
    /pop     GET     get and delete a proxy   optional: ?type=https filters proxies that support https
    /all     GET     get all proxies          optional: ?type=https filters proxies that support https
    /count   GET     view the proxy count     None
    /delete  GET     delete a proxy           ?proxy=host:ip
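
With both programs running, the pool is consumed over these endpoints. A short usage sketch against the default host and port (the helper names here are just conventions, not part of the project):

import requests

def get_proxy():
    # /get returns a random proxy as json, e.g. {"proxy": "1.2.3.4:8080", ...}
    return requests.get('http://127.0.0.1:5010/get/').json()

def delete_proxy(proxy):
    # /delete drops a proxy that no longer works
    requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(proxy))

print(get_proxy())
print(requests.get('http://127.0.0.1:5010/count/').json())  # size of the pool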

http and https proxies
– Use an http proxy to access http addresses
– Use an https proxy to access https addresses
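
In requests this distinction is just the key in the proxies dict; the scheme of the target URL decides which entry gets used (the proxy addresses below are placeholders):

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',   # used when the target URL is http://...
    'https': 'http://10.10.1.11:1080',  # used when the target URL is https://...
}
requests.get('http://httpbin.org/ip', proxies=proxies)  # goes through proxies['http']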

Proxy pool usage

Build a django backend for testing

Steps
  1. Write a django view that returns the visitor's IP on every request (a minimal sketch follows these steps)
  2. Deploy it on the public network ---> python manage.py runserver 0.0.0.0:8000

  3. Use the proxy to test locally

    import requests

    # Ask the local proxy pool for a random http proxy
    res = requests.get('http://192.168.1.252:5010/get/?type=http').json()['proxy']
    proxies = {
        'http': res,
    }
    print(proxies)
    # The target address is http, so use an http proxy
    response = requests.get('http://139.155.203.196:8080/', proxies=proxies)
    print(response.text)
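
For step 1, a view like the following is enough. This is a minimal sketch; the view name and URL wiring are made up for illustration:

# views.py -- echo the caller's IP (names are illustrative)
from django.http import HttpResponse

def ip(request):
    # Behind a transparent proxy, the real IP shows up in X-Forwarded-For;
    # REMOTE_ADDR is whoever connected directly (possibly the proxy itself)
    client_ip = request.META.get('HTTP_X_FORWARDED_FOR', request.META.get('REMOTE_ADDR'))
    return HttpResponse(client_ip)

# urls.py -- route it, e.g.:
# from django.urls import path
# urlpatterns = [path('', views.ip)]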
    
Supplement

Proxies come in transparent and high-anonymity varieties
Transparent: the server can still see the user's real IP
High anonymity: the visitor's real IP is hidden and the server cannot see it
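
One way to check which kind you have: httpbin echoes back the headers it received, and a transparent proxy leaks the real address in X-Forwarded-For. A sketch (the proxy address is a placeholder):

import requests

proxies = {'http': 'http://1.2.3.4:8080'}  # placeholder proxy
resp = requests.get('http://httpbin.org/headers', proxies=proxies)
# Transparent proxy: X-Forwarded-For (or Via) contains your real IP
# High anonymity: those headers are absent
print(resp.json()['headers'])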

Crawling a video website

Goal: crawl videos from https://www.pearvideo.com/ and save them locally

The request address is: https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0

import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0')

# Parse the video addresses ---> regular expressions
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)

for video in video_list:
    video_id = video.split('_')[-1]
    url = 'https://www.pearvideo.com/' + video
    # videoStatus.jsp rejects requests without a matching Referer
    header = {
        'Referer': url
    }
    res_json = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.14435938848299434',
                            headers=header).json()
    mp4_url = res_json['videoInfo']['videos']['srcUrl']
    # srcUrl is a dummy address; swap its leading timestamp segment for cont-<video_id>
    real_mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)

    # Save the video locally
    res_video = requests.get(real_mp4_url)
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for line in res_video.iter_content(1024):
            f.write(line)
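
The replace on srcUrl is the trick here: the JSON returns a dummy address whose file name begins with a timestamp, and swapping that segment for cont-<video_id> produces the playable URL. A worked example with illustrative values:

# Illustrative values, not a real response
mp4_url = 'https://video.pearvideo.com/mp4/adshort/20231001/1696150000000-15438956_adpkg-ad_hd.mp4'
video_id = '1748044'

fake_segment = mp4_url.split('/')[-1].split('-')[0]  # '1696150000000'
real_mp4_url = mp4_url.replace(fake_segment, 'cont-%s' % video_id)
print(real_mp4_url)
# https://video.pearvideo.com/mp4/adshort/20231001/cont-1748044-15438956_adpkg-ad_hd.mp4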

Crawling news

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')

# Find all ul tags on the page whose class is 'article'
soup = BeautifulSoup(res.text, 'html.parser')
# bs4 search
ul_list = soup.find_all(name='ul', class_='article')

for ul in ul_list:
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:  # skip li tags that are not articles
            title = h3.text
            url = 'https:' + li.find(name='a')['href']
            desc = li.find(name='p').text
            read_count = li.find(name='em').text
            img = li.find(name='img')['src']
            print(f'''Article title: {title}
            Article address: {url}
            Article summary: {desc}
            Article read count: {read_count}
            Article picture: {img}''')