Using Python to crawl attraction reviews from the Ctrip travel website

Project Introduction
Problems Solved
Full Code

Crawl attraction review data from Ctrip.com, using Selenium to drive the Edge browser and scrape the page's text data.

Ctrip's review data is relatively easy to crawl. Unlike Dianping, which requires you to log in and pass verification, you only need to find the link of the page you want to crawl to get the text data you want.

Here I have to mention a problem encountered during crawling, which concerns headed mode versus headless mode. First, let's introduce what headed mode and headless mode are:

Headed mode and headless mode refer to whether the web crawler displays a browser interface while it runs.

Headed mode means that the web crawler displays the browser interface during execution. You can watch page loads, clicks and other operations as the crawl runs, and you can intervene and debug manually. Headed mode is generally used in the development and debugging stages, making it easy to observe what the crawler is doing.

Headless mode means the crawler does not display a browser interface during execution. All operations run in the background and do not interfere with the user's normal work. Headless mode is generally used for actual crawling tasks, since it improves crawling efficiency and reduces resource consumption.

In short, the difference between headed mode and headless mode is whether a browser interface is displayed: headed mode suits development and debugging, headless mode suits actual crawling tasks.
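To make the difference concrete, here is a minimal sketch of switching between the two modes with Selenium's Edge options (the make_driver helper is just for illustration):

from selenium import webdriver
from selenium.webdriver.edge.options import Options

def make_driver(headless=True):
    opt = Options()
    if headless:
        opt.add_argument("--headless")               # no visible browser window
        opt.add_argument("--window-size=1920,1080")  # fix the viewport size
    return webdriver.Edge(options=opt)

driver = make_driver(headless=False)  # headed: watch the crawl for debugging
driver.get('https://you.ctrip.com/sight/daocheng342/11875.html')
driver.quit()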

Problems with headless mode:

1. In headless mode, browser information is missing, or the default browser information carries crawler traces, so the session gets recognized as a bot and the crawl fails (a common mitigation is sketched after this list).

2. When the page loads dynamically, controls are sometimes laid out according to the window size. If the window is too small, some controls may fail to load.
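For problem 1, one common mitigation, which my script below does not actually use, is to give the headless browser a normal user-agent string so it carries fewer crawler traces (the UA value here is only an example; copy the real value from a browser you use):

from selenium.webdriver.edge.options import Options

opt = Options()
opt.add_argument("--headless")
# Example UA string only; replace it with your own browser's user agent.
opt.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                 "Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0")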

That is why, after crawling twenty-odd pages, I would suddenly get an error like “the element cannot be found and cannot be clicked,” or around page thirty be told that an element could not be found and some list was empty, which was very annoying.

My attempts to solve this problem:

1: Keep the page alive longer so the server can respond fully, and simulate a scroll-down to load the content that is not yet displayed below:

def to_the_buttom():
    # Scroll the comment container to the bottom to trigger lazy loading
    js = 'document.getElementsByClassName("search-body left_is_mini")[0].scrollTop=10000'
    driver.execute_script(js)

def to_the_top():
    # Scroll back to the top of the document
    js = "var q=document.documentElement.scrollTop=0"
    driver.execute_script(js)

def to_deal_question():
    # Give the page time to load, then scroll down so hidden controls get laid out
    driver.implicitly_wait(10)
    time.sleep(3)
    to_the_buttom()
    time.sleep(3)

def to_view():
    # Scroll the pager's "next page" link into view before clicking it
    driver.implicitly_wait(10)
    to_the_buttom()
    time.sleep(3)
    button = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[7]/a')
    driver.execute_script("arguments[0].scrollIntoView();", button)

2: Use webdriver from the Selenium library to instantiate a Microsoft Edge browser driver and set a few options.

opt = Options()
opt.add_argument("--headless")               # run without a visible window
opt.add_argument("--window-size=1920,1080")  # fix the viewport so controls lay out correctly
opt.add_argument('--start-maximized')        # only takes effect in headed mode
driver = webdriver.Edge(options=opt)
url = 'https://you.ctrip.com/sight/daocheng342/11875.html'
driver.get(url)
# driver.maximize_window()
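If you want to confirm that the headless window really got the requested size, you can print it right after start-up:

# Should report the viewport set via --window-size, e.g. {'width': 1920, 'height': 1080}
print(driver.get_window_size())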

Then you can happily fetch all 300 pages of comments.

Finally, I also used the jieba library to do some word-frequency analysis, to see what visitors to Daocheng Yading focus on.

Full code:

Data-crawling section:

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.options import Options
import time

def to_the_buttom():
    # Scroll the comment container to the bottom to trigger lazy loading
    js = 'document.getElementsByClassName("search-body left_is_mini")[0].scrollTop=10000'
    driver.execute_script(js)

def to_the_top():
    # Scroll back to the top of the document
    js = "var q=document.documentElement.scrollTop=0"
    driver.execute_script(js)

def to_deal_question():
    # Give the page time to load, then scroll down so hidden controls get laid out
    driver.implicitly_wait(10)
    time.sleep(3)
    to_the_buttom()
    time.sleep(3)

def to_view():
    # Scroll the pager's "next page" link into view before clicking it
    driver.implicitly_wait(10)
    to_the_buttom()
    time.sleep(3)
    button = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[7]/a')
    driver.execute_script("arguments[0].scrollIntoView();", button)

opt = Options()
opt.add_argument("--headless")               # run without a visible window
opt.add_argument("--window-size=1920,1080")  # fix the viewport so controls lay out correctly
opt.add_argument('--start-maximized')        # only takes effect in headed mode
driver = webdriver.Edge(options=opt)
url = 'https://you.ctrip.com/sight/daocheng342/11875.html'
driver.get(url)
# driver.maximize_window()

# add_argument() adds a command-line switch to the browser

print(1)  # progress marker: setup finished
with open("dao_chen_ya_ding.txt", "a", encoding='utf-8') as f:
    for y in range(2, 5):        # pages 2-4: the pager shows direct page links
        time.sleep(3)
        # to_deal_question()
        for x in range(10):      # 10 comments per page
            text = driver.find_elements(By.CLASS_NAME, "commentDetail")[x].text
            f.write(text)
            f.write("\n")
        # Move to the page-y link and click it, then locate it again and click directly
        el = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[{}]/a'.format(y))
        ActionChains(driver).move_to_element(el).click().perform()
        button = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[{}]/a'.format(y))
        button.click()
        print(y)
with open("dao_chen_ya_ding.txt", "a", encoding='utf-8') as f:
    for y in range(5, 300):      # from page 5 on, li[7] is always the "next page" link
        time.sleep(3)
        # to_deal_question()
        # to_view()
        for x in range(10):      # 10 comments per page
            text = driver.find_elements(By.CLASS_NAME, "commentDetail")[x].text
            f.write(text)
            f.write("\n")
        el = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[7]/a')  # the "next page" link
        ActionChains(driver).move_to_element(el).click().perform()
        button = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[7]/a')
        button.click()
        print(y)


driver.quit()  # end the session and close the browser
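If the “element cannot be found” error still appears intermittently, one option is to wrap the pager click in a retry loop. This is only a sketch, not part of the script above; click_next_page is a hypothetical helper:

from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException

def click_next_page(driver, retries=3):
    # Hypothetical helper: retry the "next page" click a few times,
    # scrolling the comment container down before each attempt.
    for attempt in range(retries):
        try:
            to_the_buttom()   # re-trigger lazy loading first
            el = driver.find_element(By.XPATH, '//*[@id="commentModule"]/div[6]/ul/li[7]/a')
            el.click()
            return True
        except (NoSuchElementException, ElementClickInterceptedException):
            time.sleep(3)     # wait and try again
    return False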

Word-frequency analysis section:

import jieba

# Load the stop-word list (one word per line); also treat newlines as stop words
stopwords = [line.strip() for line in open('hit_stopwords.txt', encoding='utf-8').readlines()]
stopwords.append("\n")
# print(stopwords)
f1 = open('dao_chen_ya_ding_1.txt', 'r', encoding='utf-8')
code = []
for i in f1.read().strip().split(' '):
    words = jieba.lcut(i)          # segment each chunk into words
    code += words
d = {}
for word in code:
    if word not in stopwords:
        d[word] = d.get(word, 0) + 1   # count word frequencies
ls = list(d.items())
ls.sort(key=lambda s: s[-1], reverse=True)  # sort by frequency, descending
print(ls)
f1.close()
with open("dao_chen_ya_ding_1_results.txt", "a", encoding='utf-8') as f:
    for i in range(20):            # keep the 20 most frequent words
        f.write(str(ls[i]))
        f.write("\n")

The stop-word list is used to remove punctuation marks, special characters and modal particles; it is provided in another article on my homepage.

If this article helped you, please give it a like~

stopwords address:

Python removes punctuation marks from text_Remove special characters-CSDN Blog