Article directory

- Preface
- Chinese
  - Data crawling
    - Crawling interface
    - Crawling code
  - Data cleaning
  - Data analysis
  - Experimental results
- English
  - Data crawling
    - Crawling interface
    - Dynamic crawling
  - Data cleaning
  - Data analysis
  - Experimental results
- Conclusion
Preface

- This article crawls Chinese and English corpora, computes the word entropy of each language, and verifies Zipf's law.
- GitHub: ShiyuNee/python-spider (github.com)
Chinese
Data crawling
This experiment crawls the text of the Four Great Classical Novels and, based on it, performs Chinese text analysis, computes the entropy, and verifies Zipf's law.

- Crawled website: https://5000yan.com/
- Water Margin is used as an example to illustrate the crawling process.
Crawling interface

- We need to find the `url` of every chapter of Water Margin through this page, so that we can fetch each chapter's content.
- Notice that each chapter is a `li` with `class=menu-item`, and these items are contained in a `ul` with `class=paiban`. By extracting these items we obtain the `url` of every chapter.
- Taking Chapter 1 as an example, the page looks as follows.
- As you can see, all the text is contained in the `div` with `class=grap`, so we only need to extract the text of every `div` inside it and splice the pieces together to get the full chapter text.
Crawling code
```python
def get_book(url, out_path):
    root_url = url
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36'}  # Chrome browser
    page_text = requests.get(root_url, headers=headers).content.decode()
    soup1 = BeautifulSoup(page_text, 'lxml')
    res_list = []
    # Get the urls of all chapters
    tag_list = soup1.find(class_='paiban').find_all(class_='menu-item')
    url_list = [item.find('a')['href'] for item in tag_list]
    for item in url_list:  # Extract the content of each chapter
        chapter_page = requests.get(item, headers=headers).content.decode()
        chapter_soup = BeautifulSoup(chapter_page, 'lxml')
        res = ''
        chapter_content = chapter_soup.find(class_='grap')
        if chapter_content is None:  # find() returns None instead of raising
            raise ValueError(f'no grap in the page {item}')
        chapter_text = chapter_content.find_all('div')
        for div_item in chapter_text:
            res += div_item.text.strip()
        res_list.append({'text': res})
    write_jsonl(res_list, out_path)
```
- We fetch each page with `requests` using a Chrome `User-Agent` header, parse it with the `BeautifulSoup` library to extract the text of each book, and save the results locally.
Data cleaning
- The text contains bracketed notes giving the pinyin and an explanation of the main text. These notes are not needed, so we first remove the content inside brackets. Note that the brackets may be full-width Chinese ones.

```python
def filter_cn(text):
    # Remove bracketed notes; covers full-width and half-width brackets
    return re.sub(u"(.*?)|\{.*?\}|\[.*?\]|【.*?】|\(.*?\)", "", text)
```
- Use the `jieba` library to segment the Chinese sentences into words

```python
def tokenize(text):
    return jieba.cut(text)
```
- Delete punctuation from the tokens after segmentation

```python
def remove_punc(text):
    puncs = string.punctuation + "“”,。?、‘’:!;"  # ASCII plus full-width Chinese punctuation
    new_text = ''.join([item for item in text if item not in puncs])
    return new_text
```
- Remove garbled characters, keeping only Chinese characters and digits

```python
def get_cn_and_number(text):
    # Keep only CJK characters (\u4e00-\u9fa5) and digits
    return re.sub(u"([^\u4e00-\u9fa50-9])", "", text)
The overall process code is as follows:

```python
def collect_data(data_list: list):
    voc = defaultdict(int)
    for data in data_list:
        for idx in range(len(data)):
            filtered_data = filter_cn(data[idx]['text'])
            tokenized_data = tokenize(filtered_data)
            for item in tokenized_data:
                k = remove_punc(item)
                k = get_cn_and_number(k)
                if k != '':
                    voc[k] += 1
    return voc
```
Data analysis
For the collected dictionary (key: word, value: number of occurrences of that word), compute the entropy of Chinese and verify Zipf's law.
- Entropy calculation

```python
def compute_entropy(data: dict):
    # H = -sum(p * ln p) over the word distribution (natural log, so nats)
    cnt = 0
    total_num = sum(list(data.values()))
    print(total_num)
    for k, v in data.items():
        p = v / total_num
        cnt += -p * math.log(p)
    print(cnt)
```
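For intuition, here is the same computation on a toy distribution, in a slightly modified version that returns the value instead of printing it. A uniform distribution over four tokens should give exactly ln 4 ≈ 1.3863 nats (the vocabulary below is made up):

```python
import math

def compute_entropy(data: dict) -> float:
    # Shannon entropy H = -sum(p * ln p); natural log, so the unit is nats
    total_num = sum(data.values())
    return sum(-(v / total_num) * math.log(v / total_num) for v in data.values())

# Four equally frequent tokens => H = ln 4
uniform = {'a': 5, 'b': 5, 'c': 5, 'd': 5}
h = compute_entropy(uniform)
```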
- Zipf's law verification (because there are many distinct words, only the 200 most frequent are plotted so the curve stays legible)

```python
def zip_law(data: dict):
    cnt_list = data.values()
    sorted_cnt = sorted(enumerate(cnt_list), reverse=True, key=lambda x: x[1])
    plot_y = [item[1] for item in sorted_cnt[:200]]
    x = range(len(plot_y))
    plot_x = [item + 1 for item in x]
    plt.plot(plot_x, plot_y)
    plt.show()
```
Experimental results
- Journey to the West
  - Entropy: 8.2221 (364,221 tokens in total)
- Journey to the West + Water Margin
  - Entropy: 8.5814 (836,392 tokens in total)
- Journey to the West + Water Margin + Romance of the Three Kingdoms
  - Entropy: 8.8769 (1,120,315 tokens in total)
- Journey to the West + Water Margin + Romance of the Three Kingdoms + Dream of Red Mansions
  - Entropy: 8.7349 (1,585,796 tokens in total)
English
Data crawling
This experiment crawls books from an English reading website, computes statistics and the entropy of the crawled content, and verifies Zipf's law.

- Crawled website: Bilingual Books in English | AnyLang
- The Little Prince is used as an example to introduce the crawling process.
Crawling interface
- We need to find the `url` of every book through this page, and then fetch the content of each book.
- Notice that each book's `url` sits in a `span` with `class=field-content`; the link itself is an `a` with `class=ajax-link` inside that span. By extracting these items we obtain the `url` of every book.
- Taking The Little Prince as an example, the page looks as follows.
- As you can see, all the text is contained in the `div`s whose class matches `page n*`, so we only need to extract the text within every such `div`.
Dynamic crawling
It should be noted that English books contain less text, so we need to crawl many of them. However, this page only loads more books as you scroll down, so we must crawl it dynamically.
- Use `selenium` to drive a `Chrome` browser and simulate the scroll operation; here we scroll 5 times

```python
def down_ope(url):
    driver = webdriver.Chrome()  # Choose the browser driver that fits your setup
    driver.get(url)              # The URL of the site to crawl
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
    return driver
```
- Pass the page content held by `driver` to `BeautifulSoup`

```python
soup1 = BeautifulSoup(driver.page_source, 'lxml')
books = soup1.find_all(class_='field-content')
```
The overall code is:

```python
def get_en_book(url, out_dir):
    root_url = url + '/en/books/en'
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36'}  # Chrome browser
    driver = down_ope(root_url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    books = soup1.find_all(class_='field-content')
    book_url = [item.a['href'] for item in books]
    for item in book_url:
        if item[-4:] != 'read':
            continue
        out_path = out_dir + item.split('/')[-2] + '.jsonl'
        time.sleep(2)
        try:
            book_text = requests.get(url + item, headers=headers).content.decode()
        except:
            continue
        soup2 = BeautifulSoup(book_text, 'lxml')
        res_list = []
        sec_list = soup2.find_all('div', class_=re.compile('page n.*'))
        for sec in sec_list:
            res = ""
            sec_content = sec.find_all('p')
            for p_content in sec_content:
                text = p_content.text.strip()
                if text != '':
                    res += text
            res_list.append({'text': res})
        write_jsonl(res_list, out_path)
```
Data cleaning
- Use the `nltk` library for tokenization

```python
def tokenize_en(text):
    sen_tok = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(item) for item in sen_tok]
    tokens = []
    for temp_tokens in word_tokens:
        for tok in temp_tokens:
            tokens.append(tok.lower())
    return tokens
```
- Remove punctuation marks from the tokens

```python
def remove_punc(text):
    puncs = string.punctuation + "“”,。?、‘’:!;"  # same helper as in the Chinese section
    new_text = ''.join([item for item in text if item not in puncs])
    return new_text
```
- Use a regular expression to keep only English letters

```python
def get_en(text):
    return re.sub(r"[^a-zA-Z ]+", '', text)
```
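A quick check of the two helpers on a made-up token (the input is invented, and `remove_punc` is simplified here to ASCII punctuation only):

```python
import re
import string

def remove_punc(text):
    # Strip ASCII punctuation characters from a token
    return ''.join([c for c in text if c not in string.punctuation])

def get_en(text):
    # Keep only English letters and spaces
    return re.sub(r"[^a-zA-Z ]+", '', text)

token = get_en(remove_punc("don't!"))  # apostrophe and bang stripped
```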
The overall process code is as follows:

```python
def collect_data_en(data_list: list):
    voc = defaultdict(int)
    for data in data_list:
        for idx in range(len(data)):
            tokenized_data = tokenize_en(data[idx]['text'])
            for item in tokenized_data:
                k = remove_punc(item)
                k = get_en(k)
                if k != '':
                    voc[k] += 1
    return voc
```
Data analysis
The analysis code is the same as in the Chinese part: both take the dictionary produced by data cleaning, compute the entropy, and plot the figures that verify Zipf's law.
Experimental results
- 10 books (1,365,212 tokens)
  - Entropy: 6.8537
- 30 books (3,076,942 tokens)
  - Entropy: 6.9168
- 60 books (4,737,396 tokens)
  - Entropy: 6.9164
Conclusion
From the Chinese and English analyses it is easy to see that the entropy of Chinese words is greater than that of English words, and both tend to grow as the corpus grows.
- The value of the entropy depends heavily on the `tokenizer` and on the data preprocessing.
- Different amounts of data, different `tokenizer`s, and different processing methods may lead to different conclusions.
We verified Zipf's law on the three differently sized corpora for both Chinese and English.
- Zipf's law: a word's (character's) frequency in the corpus is inversely proportional to its frequency rank.
- If Zipf's law holds:
  - Plotting rank (`Order`) against frequency (`Count`) directly gives an inverse-proportional curve.
  - Plotting log rank (`Log Order`) against log frequency (`Log Count`) gives a straight line.
  - Because of the long-tailed distribution, only the `top 1000` most frequent tokens are plotted, to keep the analysis readable.
- As the plots show, Zipf's law clearly holds.
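The straight-line claim can also be checked numerically without plotting. In the sketch below the word counts are drawn from an ideal Zipf distribution (synthetic data, not our crawled counts), and a least-squares fit in log-log space recovers a slope close to -1:

```python
import math

# Ideal Zipfian counts: frequency proportional to 1 / rank (synthetic data)
counts = [round(100000 / rank) for rank in range(1, 201)]

# Least-squares slope of log(count) vs. log(rank)
xs = [math.log(rank) for rank in range(1, 201)]
ys = [math.log(c) for c in counts]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
# For perfect Zipf data the slope is -1; rounding makes it only approximate
```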