Crawling double-color ball (Shuangseqiu) lottery data with Python and the Scrapy framework

1. While scrolling through WeChat Moments today, I saw some lottery data and decided to follow the trend myself (and leave the rest to luck)

After a Baidu search, the website I decided to crawl is https://caipiao.ip138.com/shuangseqiu/

Analysis: design the database table according to the page layout so the crawled data is easy to save: a date field, six red-ball fields, and one blue-ball field

DROP TABLE IF EXISTS `shuangseqiu`;
CREATE TABLE `shuangseqiu` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `openDate` date NOT NULL COMMENT 'date',
  `red1` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 1',
  `red2` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 2',
  `red3` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 3',
  `red4` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 4',
  `red5` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 5',
  `red6` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'red ball 6',
  `blue` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'blue ball',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 342 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_bin ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;

2. Install Python: download the Windows installer from the official website and click through the setup wizard

3. After installation, open cmd and run pip install scrapy to install the Scrapy framework

4. Once the framework is installed, run scrapy startproject caipiao to create a new lottery project
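
Taken together, the command-line steps so far look like this (scrapy version is just an optional sanity check):

pip install scrapy
scrapy version          # confirm the install worked
scrapy startproject caipiao
cd caipiao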

5. Enter the spiders directory and run scrapy genspider shuangseqiu "https://caipiao.ip138.com/shuangseqiu/" to generate a new double-color-ball spider; the resulting project structure is shown below
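
The generated layout typically looks like this (a sketch; exact files can vary slightly between Scrapy versions):

caipiao/
    scrapy.cfg              # deployment configuration
    caipiao/
        __init__.py
        items.py            # item field definitions (step 6)
        middlewares.py
        pipelines.py        # database-saving logic (step 7)
        settings.py         # project settings (steps 7 and 11)
        spiders/
            __init__.py
            shuangseqiu.py  # the spider generated above (step 8)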

6. Define the fields to crawl and store in items.py

import scrapy


class ShuangseqiuItem(scrapy.Item):
    # define the fields for your item here like:
    openDate = scrapy.Field()
    red1 = scrapy.Field()
    red2 = scrapy.Field()
    red3 = scrapy.Field()
    red4 = scrapy.Field()
    red5 = scrapy.Field()
    red6 = scrapy.Field()
    blue = scrapy.Field()

7. Write the database-saving logic in pipelines.py, then register the pipeline and add the database connection configuration in settings.py

settings.py is configured as follows

ITEM_PIPELINES = {
   "caipiao.pipelines.ShuangsequiuscrapyPipeline": 300,
}

MYSQL_HOST = '192.168.XXX.XXX'
MYSQL_DBNAME = 'database name'
MYSQL_USER = 'user'
MYSQL_PASSWD = 'password'

The content of the pipelines.py file is as follows

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import pymysql
from caipiao import settings

class ShuangseqiuscrapyPipeline:
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8mb4',  # match the table's utf8mb4 character set
            use_unicode=True)

        # Cursor used to execute inserts, deletes and updates
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # Delete any existing row for this draw date first,
            # so re-crawling the same page does not create duplicates
            self.cursor.execute(
                """delete from shuangseqiu where openDate=%s""",
                (item['openDate'],))
            # Insert the new row
            self.cursor.execute(
                """insert into shuangseqiu(openDate,red1,red2,red3,red4,red5,red6,blue)
                values (%s,%s,%s,%s,%s,%s,%s,%s)""",
                (item['openDate'],
                 item['red1'],
                 item['red2'],
                 item['red3'],
                 item['red4'],
                 item['red5'],
                 item['red6'],
                 item['blue']))

            # Commit the sql statements
            self.connect.commit()

        except Exception as error:
            # Print the error when something goes wrong
            print(error)
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider finishes
        self.connect.close()
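
Before wiring the pipeline into a crawl, it can be worth checking the connection parameters on their own; a quick sketch using the same placeholder credentials from settings.py:

import pymysql

# Quick standalone sanity check of the MySQL connection and table
connect = pymysql.connect(host='192.168.XXX.XXX', db='database name',
                          user='user', passwd='password', charset='utf8mb4')
cursor = connect.cursor()
cursor.execute("SELECT COUNT(*) FROM shuangseqiu")
print(cursor.fetchone())  # e.g. (0,) for an empty table
connect.close()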

8. Write the crawling logic in spiders/shuangseqiu.py. If you don't know how to work out a node's XPath, right-click the node in the browser's developer tools and choose Copy -> Copy full XPath

import scrapy

from caipiao.items import ShuangseqiuItem


class ShuangseqiuSpider(scrapy.Spider):
    name = "shuangseqiu"
    allowed_domains = ["caipiao.ip138.com"]
    start_urls = ["https://caipiao.ip138.com/shuangseqiu/"]

    def parse(self, response):
        print(response.text)
        # Get the list of historical draws
        shuangseqiuList = response.xpath("//div[@class='module mod-panel']//div[@class='panel']//tbody/tr")
        for li in shuangseqiuList:
            item = ShuangseqiuItem()
            # Get the draw date
            item["openDate"] = li.xpath('td[1]/span/text()')[0].extract()
            # Get the winning numbers: six red balls followed by the blue ball
            balls = li.xpath('td[3]/span/text()')
            item["red1"] = balls[0].extract()
            item["red2"] = balls[1].extract()
            item["red3"] = balls[2].extract()
            item["red4"] = balls[3].extract()
            item["red5"] = balls[4].extract()
            item["red6"] = balls[5].extract()
            item["blue"] = balls[6].extract()

            print(item)
            yield item
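
If an XPath expression doesn't match what you expect, it can be tried interactively in scrapy shell before running the full spider, for example:

scrapy shell "https://caipiao.ip138.com/shuangseqiu/"
>>> rows = response.xpath("//div[@class='module mod-panel']//div[@class='panel']//tbody/tr")
>>> rows[0].xpath('td[1]/span/text()').get()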

9. Add a run.py file (at the project root, next to scrapy.cfg) so the crawl command can be run from the IDE

from scrapy import cmdline


name = 'shuangseqiu'
cmd = 'scrapy crawl {0}'.format(name)
cmdline.execute(cmd.split())
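
As an aside, Scrapy also exposes a CrawlerProcess API that can start the spider without going through the command line; a minimal sketch using the project's own settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from caipiao.spiders.shuangseqiu import ShuangseqiuSpider

# Load the project's settings.py and run the spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl(ShuangseqiuSpider)
process.start()  # blocks until the crawl finishes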

10. Executing run.py throws an error

11. A quick Baidu search pointed to the fix: Scrapy obeys robots.txt by default and identifies itself with a Scrapy user agent, which sites often block. After modifying settings.py as follows and rerunning run.py, the crawl succeeded

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"

ROBOTSTXT_OBEY = False

12. Querying the table in the database shows the data was saved successfully

13. Take the data and analyze it; surely the grand prize isn't far off ~~~~. Below are a few simple data-analysis SQL queries

-- Count the most frequent numbers in each position
SELECT red1,count(red1) FROM `shuangseqiu` group by red1 order by count(red1) desc;

SELECT red2,count(red2) FROM `shuangseqiu` group by red2 order by count(red2) desc;

SELECT red3,count(red3) FROM `shuangseqiu` group by red3 order by count(red3) desc;

SELECT red4,count(red4) FROM `shuangseqiu` group by red4 order by count(red4) desc;

SELECT red5,count(red5) FROM `shuangseqiu` group by red5 order by count(red5) desc;

SELECT red6,count(red6) FROM `shuangseqiu` group by red6 order by count(red6) desc;

SELECT blue,count(blue) FROM `shuangseqiu` group by blue order by count(blue) desc;


-- Count the most frequent numbers per day of the week; %w gives 0-6 for Sunday through Saturday
SELECT DATE_FORMAT(openDate, '%w'),red1,count(red1) FROM `shuangseqiu` group by red1,DATE_FORMAT(openDate, '%w') order by DATE_FORMAT(openDate, '%w') asc,count(red1) desc;
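
The same kind of frequency analysis can also be done in Python; here is a minimal pandas sketch (pandas is an extra dependency, and the connection parameters are the placeholders from settings.py):

import pandas as pd
import pymysql

# Connection parameters are the same placeholders used in settings.py
connect = pymysql.connect(host='192.168.XXX.XXX', db='database name',
                          user='user', passwd='password', charset='utf8mb4')
cursor = connect.cursor()
cursor.execute("SELECT red1,red2,red3,red4,red5,red6,blue FROM shuangseqiu")
columns = ['red1', 'red2', 'red3', 'red4', 'red5', 'red6', 'blue']
df = pd.DataFrame(cursor.fetchall(), columns=columns)
connect.close()

# Frequency of each number across all six red-ball positions combined
reds = pd.concat([df[col] for col in columns[:6]])
print(reds.value_counts().head(10))

# Most common blue balls
print(df['blue'].value_counts().head(5))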

14. And that's it ~~~~~~