About the use of the Scrapy framework and news recommendation

Overall design of the news recommendation system

1. Using the Scrapy framework to crawl Sina news data

First, create a new scrapy project

scrapy startproject <project name>

The project structure is as follows:
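The original screenshot of the structure is not reproduced here. For reference, a project created with scrapy startproject newsSpirder (the project name assumed throughout this article), together with the extra files added later in the article, looks roughly like this:

newsSpirder/
├── scrapy.cfg
├── main.py                    # added later, runs several crawlers together
└── newsSpirder/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── tool.py                # added later, helper functions (process_str, process_html, process_url)
    └── spiders/
        ├── __init__.py
        └── newsSpider_fun.py  # the spider created below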

Some of the files in the listing above (for example main.py and tool.py) are created later in this article, so don’t worry if they are not there right after creating the project.

Next, let us take a look at the data flow of the Scrapy framework.


Scrapy’s data flow is controlled by the execution engine. The process is as follows:

1. The engine gets the initial requests from the spider and starts crawling.
2. The engine asks the scheduler for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader through the downloader middleware, which downloads the page from the network.
5. Once the downloader finishes downloading the page, it returns the response to the engine.
6. The engine passes the downloader’s response through the middleware to the spider for processing.
7. The spider processes the response and returns the scraped items and any new requests to the engine, again through the middleware.
8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler, which schedules the next request to fetch.
9. The process repeats (from step 1) until there are no more requests to crawl.
Engine
The engine is responsible for controlling the data flow between all of the components and for triggering events when certain actions occur.

Scheduler
The scheduler receives requests from the engine, puts them into a queue, and feeds them back to the engine when the engine asks for the next request.

Downloader
The downloader fetches web pages for the requests sent by the engine and returns the responses to the engine.

Spider
The spider processes the responses that the engine hands back from the downloader and returns the results to the engine as scraped items and as new requests (URLs) that match its rules.

Item pipeline
The item pipeline is responsible for processing the items extracted by the spider and for persisting the data, for example by storing it in a database or a file.

Downloader middleware
The downloader middleware sits between the engine and the downloader. It works as a hook (plug-in) that can intercept and modify requests, handle the download itself, and pass the responses back to the engine.

Spider middleware
The spider middleware sits between the engine and the spider. It also works as a hook (plug-in) and can post-process the responses sent to the spider as well as the items and new requests returned to the engine.
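To make these roles concrete, here is a minimal, self-contained spider sketch (not part of the project below): the requests it yields go through the engine to the scheduler and downloader, and the items it yields go through the engine to the item pipeline.

import scrapy

class MinimalSpider(scrapy.Spider):
    name = 'minimal_example'
    start_urls = ['https://ent.sina.com.cn/']  # initial request handed to the engine

    def parse(self, response):
        # the engine delivers the downloader's response here (steps 5-6 above)
        for href in response.css('a::attr(href)').getall()[:5]:
            yield {'url': response.urljoin(href)}  # scraped item -> item pipeline
            # new request -> scheduler, to be downloaded later
            yield scrapy.Request(response.urljoin(href), callback=self.parse)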
This overview is based on the original post at https://blog.csdn.net/Yuyh131/article/details/83651875

A practical example
1. The items.py file

After creating the project, the first file to look at is items.py. It acts like an entity class describing the data you want to crawl.

import scrapy


class NewsspirderItem(scrapy.Item):
    table_name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    mainpage = scrapy.Field()
    comments = scrapy.Field()
    origin_url = scrapy.Field()
    keywords = scrapy.Field()
    contentHtml = scrapy.Field()
    author = scrapy.Field()
    # video_url = scrapy.Field()
    # pic_url = scrapy.Field()
    type = scrapy.Field()
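The spider and the pipelines below also import a second item class, NewsUrl, from the same items.py, which is not shown above. Judging from the fields used later (table_name, url, handle, type, time), it would look roughly like this:

class NewsUrl(scrapy.Item):
    table_name = scrapy.Field()
    url = scrapy.Field()
    handle = scrapy.Field()  # set to 0 in the spider (presumably "not yet processed")
    type = scrapy.Field()
    time = scrapy.Field()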
2. Creating the spider file

Next, go into the spiders directory and create a new crawler file, newsSpider_fun.py.

Part of the code is as follows:

from copy import deepcopy

import scrapy

from newsSpirder import settings
from newsSpirder.items import NewsspirderItem
from newsSpirder.items import NewsUrl
import json
import re
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}


# This spider crawls entertainment, sports and culture news

class newsSpider_fun(scrapy.Spider):
    name = 'newsSpider_fun'
    allowed_domains = ['sina.com.cn', 'sina.cn']
    base_url = ['https://ent.sina.com.cn/',        # entertainment (type 1)
                'http://sports.sina.com.cn/',      # sports (type 2)
                'https://cul.news.sina.com.cn/'    # culture (type 3)
                ]
    start_urls = []

    def start_requests(self):
        for i in range(0, settings.PAGE):
            urls = [
             'https://interface.sina.cn/pc_api/public_news_data.d.json?callback=jQuery111203513011348270476_1675935832328&cids=209211&pdps=PDPS000000060130,PDPS000000066866&smartFlow=&type=std_news,std_slide,std_video&pageSize=20&top_id=hencxtu1691422,hencxtu5974075,hencxtu5919005,hencxtu5908111&mod=nt_culture0&cTime=1483200000&up=' + str(
                    i) + '&action=1&tm=1675935836&_=1675935832332'  # culture news feed
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse, headers=headers, dont_filter=False)

    def parse(self, response):
        data = response.body.decode('utf8')

        data = re.findall(r'[(](.*)[)]', str(data).replace('\n', ''))
        contents = list(filter(None, json.loads(data[0])['data']))
        for content in contents:
            item = NewsspirderItem()
            item['table_name'] = 'news_newdetail'
            item["url"] = content['url'] # link
            item["date"] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(content['ctime'])) # Date
            item["title"] = content['title'] # title
            try:
                item["origin_url"] = content['orgUrl'] # Original link
                item["keywords"] = content['labels'] # Keywords
                if str(content['category']).find('Entertainment') != -1:
                    item["type"] = '1' # type
                elif str(content['category']).find('Sports') != -1:
                    item["type"] = '2'
                else:
                    item['type'] = '-1'
            except:
                try:
                    item["keywords"] = content['tags']
                except:
                    item["keywords"] = ""
                item['type'] = '3'
                item["origin_url"] = item['url']

            item["author"] = content['author'] # Author
            yield scrapy.Request(item["url"], callback=self.parse_item2, meta={"item": deepcopy(item)}, headers=headers, dont_filter=False)

    def parse_item2(self, response):
        url_item = NewsUrl()
        url_item['table_name'] = 'news_urlcollect'
        news_item = response.meta['item']
        news_item["mainpage"] = ''.join(response.xpath('//*[@id="article"]//text()').extract())
        content_html = ''.join(response.xpath('//*[@id="article"]').extract())
        news_item["contentHtml"] = content_html
        if str(content_html) != '':
            yield news_item
            url_item["url"] = news_item["url"]
            url_item["handle"] = 0
            url_item["type"] = news_item['type']
            url_item["time"] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            yield url_item

A detailed explanation of this file:

name is the name used when starting the crawler, and allowed_domains lists the domains the crawler is allowed to visit.

start_requests defines the links of the site to be crawled.

yield scrapy.Request issues the request and calls the parse function back when the response arrives.

At that point you can process the crawled page inside the parse function.

When we need to go down into the article (third-level) page, we can issue another request inside parse and set parse_item2 as the callback.

When parse_item2 needs the item built in parse, we only have to pass it through the request's meta parameter, as in the code above,

and then read it back from response.meta in the second function.

Note: Scrapy deduplicates request URLs by default. Passing dont_filter=True to a request disables this deduplication for that request.
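In short, the pattern can be reduced to the following sketch (the spider name and start URL are placeholders, not part of the project):

import scrapy
from newsSpirder.items import NewsspirderItem

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['https://news.sina.com.cn/']  # placeholder URL

    def parse(self, response):
        item = NewsspirderItem()
        item['title'] = response.css('title::text').get()
        # pass the partially filled item to the next callback via meta;
        # dont_filter=True lets this request through even though the URL was already visited
        yield scrapy.Request(response.url, callback=self.parse_detail,
                             meta={'item': item}, dont_filter=True)

    def parse_detail(self, response):
        item = response.meta['item']  # read the item back out in the second callback
        item['mainpage'] = ''.join(response.xpath('//*[@id="article"]//text()').extract())
        yield item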

3. The pipelines.py file

Part of the code is as follows:

import copy

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

from newsSpirder.items import NewsspirderItem, NewsUrl
from newsSpirder.tool import process_str, process_url
from newsSpirder.tool import process_html
import MySQLdb
from .settings import mysql_host, mysql_db, mysql_user, mysql_passwd, mysql_port


class NewsspirderPipeline: # Process the crawled information
    def process_item(self, item, spider):
        item['url'] = process_url(item['url'])
        if type(item) == NewsspirderItem:
            if item['author'] == '':
                item['author'] = 'Unknown'
            item['mainpage'] = process_str(item['mainpage'])
            item['contentHtml'] = process_html(item['contentHtml'])
            item['comments'] = 'Test'
            item['keywords'] = str(item['keywords'])
        return copy.deepcopy(item)


class CheckPipeline: # Remove duplicates
    """check item, and drop the duplicate one"""

    def __init__(self):
        self.names_seen = set()
        self.url_seen = set()

    def process_item(self, item, spider):
        if type(item) == NewsspirderItem:
            if item['title']:
                if item['title'] in self.names_seen:
                    raise DropItem("Duplicate item found: %s" % item)
                else:
                    self.names_seen.add(item['title'])
                    return item
            else:
                raise DropItem("Missing title in %s" % item)
        elif type(item) == NewsUrl:
            if item['url']:
                if item['url'] in self.url_seen:
                    raise DropItem("Duplicate item found: %s" % item)
                else:
                    self.url_seen.add(item['url'])
                    return item
            else:
                raise DropItem("Missing url in %s" % item)

For the pipelines to run, they must be registered with priorities in the settings file:

ITEM_PIPELINES = {
   'newsSpirder.pipelines.NewsspirderPipeline': 50,
   'newsSpirder.pipelines.CheckPipeline': 75,
   'newsSpirder.pipelines.NewssqlPipeline': 100
}

Pipelines with smaller numbers process the item first.

With the priorities above, an item first goes through NewsspirderPipeline, then CheckPipeline, then NewssqlPipeline.

Saving to the database is handled by the third class, NewssqlPipeline, which is not discussed in detail here.

Since this crawl uses two item types, the spider yields two different items, and both kinds are passed through the pipelines. When saving to the MySQL database, each item's table_name field determines which table it is written to, so the different items end up in different database tables automatically.
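The author's actual NewssqlPipeline is not shown in the article; a minimal sketch of such a pipeline, assuming the database table and column names match the item fields exactly, could look like this:

import MySQLdb
from itemadapter import ItemAdapter
from .settings import mysql_host, mysql_db, mysql_user, mysql_passwd, mysql_port

class NewssqlPipeline:
    def open_spider(self, spider):
        # one connection per spider run
        self.conn = MySQLdb.connect(host=mysql_host, port=mysql_port, user=mysql_user,
                                    passwd=mysql_passwd, db=mysql_db, charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        data = ItemAdapter(item).asdict()
        table = data.pop('table_name')  # each item type carries its target table
        columns = ', '.join(data.keys())
        placeholders = ', '.join(['%s'] * len(data))
        sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, columns, placeholders)
        self.cursor.execute(sql, list(data.values()))
        self.conn.commit()
        return item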

The command to run a single crawler is:

scrapy crawl <crawler name>

If you want to run several crawlers together, you can create a new main.py file to launch them.
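The original main.py is not included in this excerpt; a common way to do this with Scrapy's CrawlerProcess is sketched below (only newsSpider_fun is taken from this article, any additional spider names are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('newsSpider_fun')                   # spider name as defined in the spider class
# process.crawl('another_spider')                 # add further spiders here (placeholder)
process.start()                                   # blocks until all crawls finish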

There is a small pitfall here.

Because Scrapy is asynchronous, duplicate data can appear when saving to the database. This can be handled with a uniqueness constraint in the database and by dropping duplicate items in the pipeline; the deduplication code is the CheckPipeline shown above.

2. Selection of recommendation algorithm

1. Principle of item-based collaborative filtering algorithm (ItemCF)

The ItemCF algorithm is currently one of the most widely used algorithms in the industry. The recommendation algorithms of Amazon, Netflix, and YouTube are all based on ItemCF.
You may have had this experience when shopping online: you order a mobile phone in an online mall, and once the order is completed the page recommends a phone case for the same model. You will probably click through, browse, and buy one. This is the ItemCF algorithm quietly at work behind the scenes. ItemCF recommends items that are similar to items the user has liked before: because you bought a phone, the algorithm calculated that the phone case is highly similar to the phone, and so it recommended the case to you. It looks a lot like the UserCF algorithm, except that instead of computing the similarity between users, we compute the similarity between items.

From the above description, we can know that the main steps of the ItemCF algorithm are as follows:

  1. Calculate the similarity between items
  2. Generate a recommendation list for users based on the similarity of items and the user’s historical behavior

So the first question before us is how to calculate the similarity between items. We need to pay special attention here:

The ItemCF algorithm does not directly calculate the similarity based on the attributes of the item itself, but calculates the similarity between items by analyzing the user’s behavior.

What does that mean? For example, a mobile phone and a mobile phone case have little in common beyond their shape, so it seems impossible to compute their similarity directly. But look at the problem from another angle: if many users buy a mobile phone and a mobile phone case together, can we not consider the phone and the case to be similar?
This leads to the calculation formula of item similarity:
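The original post presents the formula as an image. In its commonly used form (as in the reference linked below), the similarity between items i and j is

w(i, j) = |N(i) ∩ N(j)| / sqrt(|N(i)| · |N(j)|)

where N(i) is the set of users who have interacted with item i: the numerator counts the users who interacted with both items, and the denominator penalises items that are simply very popular.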
The original article illustrates this part with many figures, which are not reproduced here; see the reference link below for the full derivation and examples.
Reference links:

Item-based collaborative filtering algorithm (ItemCF) principle and code practice – Jianshu (jianshu.com)
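To make the two steps of the algorithm concrete, here is a compact, self-contained Python sketch of ItemCF. The {user: set of item ids} input format and all names are illustrative, not this project's actual code:

import math
from collections import defaultdict

def item_similarity(user_items):
    """Compute w[i][j] = |N(i) & N(j)| / sqrt(|N(i)| * |N(j)|) from user behaviour."""
    co_counts = defaultdict(lambda: defaultdict(int))  # co-occurrence counts
    item_counts = defaultdict(int)                     # |N(i)|
    for items in user_items.values():
        for i in items:
            item_counts[i] += 1
            for j in items:
                if i != j:
                    co_counts[i][j] += 1
    sim = defaultdict(dict)
    for i, related in co_counts.items():
        for j, cij in related.items():
            sim[i][j] = cij / math.sqrt(item_counts[i] * item_counts[j])
    return sim

def recommend(user, user_items, sim, k=10, n=5):
    """Score unseen items by similarity to the user's history (top-k neighbours per item)."""
    scores = defaultdict(float)
    seen = user_items[user]
    for i in seen:
        neighbours = sorted(sim.get(i, {}).items(), key=lambda x: x[1], reverse=True)[:k]
        for j, wij in neighbours:
            if j not in seen:
                scores[j] += wij
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n]

# toy usage: users who read both news A and B make A and B similar
behaviour = {'u1': {'A', 'B'}, 'u2': {'A', 'B', 'C'}, 'u3': {'B', 'C'}}
print(recommend('u1', behaviour, item_similarity(behaviour)))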