Article directory

- Preface
- Chinese
  - Data crawling
    - Crawling interface
    - Crawling code
  - Data cleaning
  - Data analysis
  - Experimental results
- English
  - Data crawling
    - Crawling interface
    - Dynamic crawling
  - Data cleaning
  - Data analysis
  - Experimental results
- Conclusion
Preface

- This article crawls Chinese and English corpora, computes the word entropy of each language, and verifies Zipf's law.
- GitHub: ShiyuNee/python-spider (github.com)
Chinese
Data crawling
This experiment crawls the text of the Four Great Classical Novels and, based on it, performs Chinese text analysis, computes the entropy, and verifies Zipf's law.

- Crawled website: https://5000yan.com/
- Water Margin is used as an example to illustrate the crawling process.
Crawling interface

- We need to find the `url` of every chapter of Water Margin through this page, so that we can fetch each chapter's content.
- Notice that each chapter is a `li` with `class=menu-item`, and these items are contained in a `ul` with `class=paiban`. By extracting these items we obtain the `url` of every chapter.
- Taking Chapter 1 as an example, the page looks as follows.
- As you can see, all the text is contained in the `div` with `class=grap`, so we only need to extract the text of every `div` inside it and splice the pieces together to get the full chapter text.
Crawling code
```python
def get_book(url, out_path):
    root_url = url
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36'}  # Chrome browser
    page_text = requests.get(root_url, headers=headers).content.decode()
    soup1 = BeautifulSoup(page_text, 'lxml')
    res_list = []
    # Get the urls of all chapters
    tag_list = soup1.find(class_='paiban').find_all(class_='menu-item')
    url_list = [item.find('a')['href'] for item in tag_list]
    for item in url_list:  # Extract the content of each chapter
        chapter_page = requests.get(item, headers=headers).content.decode()
        chapter_soup = BeautifulSoup(chapter_page, 'lxml')
        res = ''
        chapter_content = chapter_soup.find(class_='grap')
        if chapter_content is None:  # find() returns None instead of raising
            raise ValueError(f'no grap in the page {item}')
        chapter_text = chapter_content.find_all('div')
        for div_item in chapter_text:
            res += div_item.text.strip()
        res_list.append({'text': res})
    write_jsonl(res_list, out_path)
```
- We fetch each page with `requests` using a Chrome `User-Agent` header, parse it with the `BeautifulSoup` library to extract the text of each book, and save the results locally.
Data cleaning
- The text contains bracketed notes giving the pinyin and an explanation of the main text. These notes are not needed, so we first remove the content inside brackets. Note that the brackets may be full-width Chinese ones.

```python
def filter_cn(text):
    # Remove bracketed notes; covers full-width and half-width brackets
    return re.sub(u"(.*?)|\{.*?\}|\[.*?\]|【.*?】|\(.*?\)", "", text)
```
- Use the `jieba` library to segment the Chinese sentences into words

```python
def tokenize(text):
    return jieba.cut(text)
```
- Delete punctuation from the tokens after segmentation

```python
def remove_punc(text):
    puncs = string.punctuation + "“”,。?、‘’:!;"  # ASCII plus full-width Chinese punctuation
    new_text = ''.join([item for item in text if item not in puncs])
    return new_text
```
- Remove garbled characters, keeping only Chinese characters and digits

```python
def get_cn_and_number(text):
    # Keep only CJK characters (\u4e00-\u9fa5) and digits
    return re.sub(u"([^\u4e00-\u9fa50-9])", "", text)
The overall process code is as follows:

```python
def collect_data(data_list: list):
    voc = defaultdict(int)
    for data in data_list:
        for idx in range(len(data)):
            filtered_data = filter_cn(data[idx]['text'])
            tokenized_data = tokenize(filtered_data)
            for item in tokenized_data:
                k = remove_punc(item)
                k = get_cn_and_number(k)
                if k != '':
                    voc[k] += 1
    return voc
```
Data analysis
For the collected dictionary (key: word, value: number of occurrences of that word), compute the entropy of Chinese and verify Zipf's law.
- Entropy calculation

```python
def compute_entropy(data: dict):
    # H = -sum(p * ln p) over the word distribution (natural log, so nats)
    cnt = 0
    total_num = sum(list(data.values()))
    print(total_num)
    for k, v in data.items():
        p = v / total_num
        cnt += -p * math.log(p)
    print(cnt)
```
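For intuition, here is the same computation on a toy distribution, in a slightly modified version that returns the value instead of printing it. A uniform distribution over four tokens should give exactly ln 4 ≈ 1.3863 nats (the vocabulary below is made up):

```python
import math

def compute_entropy(data: dict) -> float:
    # Shannon entropy H = -sum(p * ln p); natural log, so the unit is nats
    total_num = sum(data.values())
    return sum(-(v / total_num) * math.log(v / total_num) for v in data.values())

# Four equally frequent tokens => H = ln 4
uniform = {'a': 5, 'b': 5, 'c': 5, 'd': 5}
h = compute_entropy(uniform)
```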
- Zipf's law verification (because there are many distinct words, only the 200 most frequent are plotted so the curve stays legible)

```python
def zip_law(data: dict):
    cnt_list = data.values()
    sorted_cnt = sorted(enumerate(cnt_list), reverse=True, key=lambda x: x[1])
    plot_y = [item[1] for item in sorted_cnt[:200]]
    x = range(len(plot_y))
    plot_x = [item + 1 for item in x]
    plt.plot(plot_x, plot_y)
    plt.show()
```
Experimental results
- Journey to the West
  - Entropy: 8.2221 (364,221 tokens in total)
- Journey to the West + Water Margin
  - Entropy: 8.5814 (836,392 tokens in total)
- Journey to the West + Water Margin + Romance of the Three Kingdoms
  - Entropy: 8.8769 (1,120,315 tokens in total)
- Journey to the West + Water Margin + Romance of the Three Kingdoms + Dream of Red Mansions
  - Entropy: 8.7349 (1,585,796 tokens in total)
English
Data crawling
This experiment crawls books from an English reading website, computes statistics and the entropy of the crawled content, and verifies Zipf's law.

- Crawled website: Bilingual Books in English | AnyLang
- The Little Prince is used as an example to introduce the crawling process.
Crawling interface
- We need to find the `url` of every book through this page, and then fetch the content of each book.
- Notice that each book's `url` sits in a `span` with `class=field-content`; the link itself is an `a` with `class=ajax-link` inside that span. By extracting these items we obtain the `url` of every book.
- Taking The Little Prince as an example, the page looks as follows.
- As you can see, all the text is contained in the `div`s whose class matches `page n*`, so we only need to extract the text within every such `div`.
Dynamic crawling
It should be noted that English books contain less text, so we need to crawl many of them. However, this page only loads more books as you scroll down, so we must crawl it dynamically.
- Use `selenium` to drive a `Chrome` browser and simulate the scroll operation; here we scroll 5 times

```python
def down_ope(url):
    driver = webdriver.Chrome()  # Choose the browser driver that fits your setup
    driver.get(url)              # The URL of the site to crawl
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
    return driver
```
- Pass the page content held by `driver` to `BeautifulSoup`

```python
soup1 = BeautifulSoup(driver.page_source, 'lxml')
books = soup1.find_all(class_='field-content')
```
The overall code is:

```python
def get_en_book(url, out_dir):
    root_url = url + '/en/books/en'
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36'}  # Chrome browser
    driver = down_ope(root_url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    books = soup1.find_all(class_='field-content')
    book_url = [item.a['href'] for item in books]
    for item in book_url:
        if item[-4:] != 'read':
            continue
        out_path = out_dir + item.split('/')[-2] + '.jsonl'
        time.sleep(2)
        try:
            book_text = requests.get(url + item, headers=headers).content.decode()
        except:
            continue
        soup2 = BeautifulSoup(book_text, 'lxml')
        res_list = []
        sec_list = soup2.find_all('div', class_=re.compile('page n.*'))
        for sec in sec_list:
            res = ""
            sec_content = sec.find_all('p')
            for p_content in sec_content:
                text = p_content.text.strip()
                if text != '':
                    res += text
            res_list.append({'text': res})
        write_jsonl(res_list, out_path)
```
Data cleaning
- Use the `nltk` library for tokenization

```python
def tokenize_en(text):
    sen_tok = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(item) for item in sen_tok]
    tokens = []
    for temp_tokens in word_tokens:
        for tok in temp_tokens:
            tokens.append(tok.lower())
    return tokens
```
- Remove punctuation marks from the tokens

```python
def remove_punc(text):
    puncs = string.punctuation + "“”,。?、‘’:!;"  # same helper as in the Chinese section
    new_text = ''.join([item for item in text if item not in puncs])
    return new_text
```
- Use a regular expression to keep only English letters

```python
def get_en(text):
    return re.sub(r"[^a-zA-Z ]+", '', text)
```
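A quick check of the two helpers on a made-up token (the input is invented, and `remove_punc` is simplified here to ASCII punctuation only):

```python
import re
import string

def remove_punc(text):
    # Strip ASCII punctuation characters from a token
    return ''.join([c for c in text if c not in string.punctuation])

def get_en(text):
    # Keep only English letters and spaces
    return re.sub(r"[^a-zA-Z ]+", '', text)

token = get_en(remove_punc("don't!"))  # apostrophe and bang stripped
```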
The overall process code is as follows:

```python
def collect_data_en(data_list: list):
    voc = defaultdict(int)
    for data in data_list:
        for idx in range(len(data)):
            tokenized_data = tokenize_en(data[idx]['text'])
            for item in tokenized_data:
                k = remove_punc(item)
                k = get_en(k)
                if k != '':
                    voc[k] += 1
    return voc
```
Data analysis
The analysis code is the same as in the Chinese part: both take the dictionary produced by data cleaning, compute the entropy, and plot the figures that verify Zipf's law.
Experimental results
- 10 books (1,365,212 tokens)
  - Entropy: 6.8537
- 30 books (3,076,942 tokens)
  - Entropy: 6.9168
- 60 books (4,737,396 tokens)
  - Entropy: 6.9164
Conclusion
From the Chinese and English analyses it is easy to see that the entropy of Chinese words is greater than that of English words, and both tend to grow as the corpus grows.
- The value of the entropy depends heavily on the `tokenizer` and on the data preprocessing.
- Different amounts of data, different `tokenizer`s, and different processing methods may lead to different conclusions.
We verified Zipf's law on the three differently sized corpora for both Chinese and English.
- Zipf's law: a word's (character's) frequency in the corpus is inversely proportional to its frequency rank.
- If Zipf's law holds:
  - Plotting rank (`Order`) against frequency (`Count`) directly gives an inverse-proportional curve.
  - Plotting log rank (`Log Order`) against log frequency (`Log Count`) gives a straight line.
  - Because of the long-tailed distribution, only the `top 1000` most frequent tokens are plotted, to keep the analysis readable.
- As the plots show, Zipf's law clearly holds.
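The straight-line claim can also be checked numerically without plotting. In the sketch below the word counts are drawn from an ideal Zipf distribution (synthetic data, not our crawled counts), and a least-squares fit in log-log space recovers a slope close to -1:

```python
import math

# Ideal Zipfian counts: frequency proportional to 1 / rank (synthetic data)
counts = [round(100000 / rank) for rank in range(1, 201)]

# Least-squares slope of log(count) vs. log(rank)
xs = [math.log(rank) for rank in range(1, 201)]
ys = [math.log(c) for c in counts]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
# For perfect Zipf data the slope is -1; rounding makes it only approximate
```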