“Data Mining” Experiment 1: File Operation

1. Experimental purpose

Proficient in the application of the built-in function open();

Understand the impact of string encoding format on text file operations;

Proficient in the usage of the context management statement with;

Understand how the standard library json reads and writes JSON files;

Understand how the extension libraries python-docx, openpyxl, and python-pptx operate on Office documents.

2. Experimental content

1. Merge the contents of two .txt files, and write the multiple lines of the two files to the result file alternately. If one file has less content, write the remaining content of the other file to the end of the result file.

1.1Experimental preparation: Prepare two Documents 1.txt 2.txt

1.2Experiment code:

def mergeTxt(txtFiles):
    with open('result.txt', 'w', encoding='utf-8') as fp:
        with open(txtFiles[0], 'r + ') as fp1, open(txtFiles[1], 'r + ') as fp2:
            # When the lines of two files can be read, enter the loop
            while True:
                line1 = fp1.readline()
                line2 = fp2.readline()
                # If there is content to read
                if line1 or line2:
                    if line1:# line1 has content
                        fp.write(line1)
                    if line2:# line2 has content
                        fp.write(line2)
                # If there is no content to read, break out of the loop
                else:
                    break

txtFiles = ['1.txt', '2.txt']
mergeTxt(txtFiles)

1.3Experimental results:

View the result.txt file:

2. Complete the code: write the following list of house information into the JSON file (house information.json), and then read and output the information.

2.1Experiment code:

import json
information = [
    {'Community name': 'Community A', 'Average price': 8000, 'Monthly transaction volume': 20},
    {'Community name': 'Community B', 'Average price': 8500, 'Monthly transaction volume': 35},
    {'Community name': 'Community C', 'Average price': 7800, 'Monthly transaction volume': 50},
    {'Community name': 'Community D', 'Average price': 12000, 'Monthly transaction volume': 18}]
#At this time information is a list type

with open("housing information.json", "w") as f:
    # Convert data type to string and store in json file
    json.dump(information, f)
    print("File writing completed!")

with open("housing information.json", 'r') as load_f:
    # Convert file opening from string to data type
    load_dict = json.load(load_f)
    print(load_dict)

2.2Experimental results:

3. Write a program to simulate and generate the turnover data of a hotel during its 100 consecutive days of trial operation starting from January 1, 2020, and write it into a CSV file. There are two columns in the file, the first column is the date, the second column is the turnover, and the first column of the file is the header or field name. Assume that the hotel’s base turnover on the first day is 500 yuan, and it increases by 5 yuan every day. In addition, it will randomly increase by 5 to 50 yuan every day.

3.1 Experiment code:

import csv
import random
import datetime

with open('turnover.csv', 'w', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\
')
    # Write field names to the csv file
    writer.writerow(['Date', 'Turnover'])
    # Generate data on January 1, 2020
    date = datetime.date(2020,1,1)
    # Randomly generate 100 turnover data
    for i in range(100):
        # Sales will increase by 5 yuan every day, and in addition, it will increase randomly by 5 to 50 yuan every day.
        amount = 500 + i*5 + random.randint(5, 50)
        writer.writerow([str(date), amount])
        # Accumulate and generate consecutive dates starting from January 1, 2020
        date = date + datetime.timedelta(days=1)

3.2Experimental results:

View the “turnover.csv” file results:

4. Write a program to read and output the red text in the word document.

4.1Experiment preparation:

Prepare the contents of the 1.docx file:

4.2Experiment code:

from docx import Document # Need to download the python-docx library
from docx.shared import RGBColor
def CheckRed(fn):
    document = Document(fn)
    # Traverse the paragraphs of the document
    for p in document.paragraphs:
        # Traverse the character blocks of each segment
        for r in p.runs:
            # Determine whether the RGB value of the character block is red (255, 0, 0)
            if r.font.color.rgb == RGBColor(255, 0, 0):
                print(r.text)
    # Traverse the table
    for table in document.tables:
        # Traverse each row of each table:
        for row in table.rows:
            # Traverse the cells in each row:
            for cell in row.cells:
                # Traverse the contents of each cell:
                for p in cell.paragraphs:
                    for r in p.runs:
                        if r.font.color.rgb == RGBColor(255, 0, 0):
                            print(r.text)
CheckRed('1.docx')

4.3Experimental results:

5. The file “Supermarket Sales.xlsx” records the sales of each employee in a supermarket at different times and at different counters from March 1 to 5, 2019. Write a program to read the data in this file and count the total sales of each employee, the total sales of each period, and the total sales of each counter.

5.1Experiment preparation:

Period: 8:00-14:00 14:00-20:00

Counter: fresh fruits, daily necessities, snacks and drinks

Employees: 1001 Xiao Wang 1002 Xiao Li 1003 Xiao Chen 1004 Xiao Liu 1005 Xiao Zhang 1006 Xiao Du

5.2Experiment code:

from openpyxl import load_workbook
def Count():
    # Declare three dictionaries to store the total sales of employees, time periods, and counters respectively.
    staff = dict()
    periods = dict()
    counters = dict()
    # Read xlsx document
    ws = load_workbook('Supermarket turnover.xlsx').worksheets[0]
    # Traverse each row in the table
    for index, row in enumerate(ws.rows):
        # Need to skip the first line
        if index == 0:
            continue
        # Get employee name, time period, counter, sales and other information
        _, name, _, period, counter, money = map(lambda cell: cell.value, row)
        # Accumulate sales based on employee names
        staffs[name] = staffs.get(name, 0) + money
        #Accumulate sales according to different time periods
        periods[period] = periods.get(period, 0) + money
        # Accumulate sales according to different counters
        counters[counter] = counters.get(counter, 0) + money
    print("Total sales per employee:")
    print(staffs)
    print("Total sales for each period:")
    print(periods)
    print("Total sales per counter:")
    print(counters)
Count()

5.3Experimental results:

6. Write a program to operate the Excel file, read the contents of the file “Everyone’s Specialties.xlsx”, count the contents of columns A-H, and append a column at the end to summarize everyone’s specialties.

6.1Experiment code:

import random
from openpyxl import load_workbook

# Function: Randomly generate table data
def Random():
    wb = load_workbook("Everyone's expertise.xlsx")
    for w1 in wb:
        w1 = wb.active
        wb.remove(w1)
    sheet = wb.create_sheet("Everyone's Specialties")

    list_name = ["Li Ming", "Wang Ping", "Liu Jing", "Chen Hong", "Zhao Qiang"]
    sheet.append(["Name", "Football", "Basketball", "Table Tennis", "Swimming", "Singing", "Dancing", "Eating"])# Add the header content of the first row

    # Randomly generate "yes" for each person's specialty and insert the results into the table
    for i in list_name:
        temp = []
        temp.append(i)
        # For each specialty, use the choice function to randomly select "yes" and " "
        for j in range(7):
            temp.append(random.choice("Yes "))
        sheet.append(temp)
    wb.save("Everyone's expertise.xlsx")


# Function: summarize all specialties
def Summarize():
    wb = load_workbook("Everyone's expertise.xlsx")
    w = wb.active
    w["I1"] = "All specialties"
    list_sheet = [ "B", "C", "D", "E", "F", "G", "H"]
    list_column = ["football", "basketball", "table tennis", "swimming", "singing", "dancing", "eating"]
    # Traverse each student
    for i in range(2, w.max_row + 1):
        temp = []
        # Traverse the specialties of each column
        for j in range(len(list_sheet)):
            # Check whether the value in column j and row i is "Yes"
            if w[list_sheet[j] + str(i)].value == "Yes":
                temp.append(list_column[j])
        # Add the summary information of "all specialties" in column I and row i, separated by commas
        w["I" + str(i)] = ",".join(temp)
    wb.save("Everyone's Specialties (Summary Version).xlsx")

Random()
Summarize()

6.2Experimental results:

Contents of the “Everyone’s Specialties.xlsx” file randomly generated by the code:

Contents of the summarized result file “Everyone’s Specialties (Summary Version).xlsx”:

3. Experiment summary

In this experiment, I learned to use some libraries of Python to operate files. I completed 6 questions in total. The overall difficulty of the questions was not very difficult. I successfully obtained the expected results. I also learned various operations on files. The overall gain was very great. big. In addition, I also encountered some problems during the experiment, which were successfully solved after searching for information and consulting classmates. The records are as follows:

1. For question 4, find the red text in the document

In the program, I divided the text in the document into paragraph text and table text and traversed and searched respectively. There is no problem when searching for paragraph text, but when using the judgment statement if cell.font.color.rgb == RGBColor(255, 0, 0) when searching for table text, an error is reported: cell does not have a font attribute. After consulting my classmates, I found that the content in the cell can be regarded as “paragraph text” and traversed again. After this operation, the red text in the table was successfully found.

2. For question 6, summarize everyone’s strengths

When the program is running, an error occurs when zipfile.BadZipFile: File is not a zip file. The reason for the search is: the xlsx file cannot be created directly in the pycharm file, and it will be recognized as an invalid file. A real table file must be created in WPS.

When the program is running, an error message ‘Workbook’ object has no attribute ‘append’ is displayed. The reason is: data cannot be added directly to the workbook file. A new sheet must be created first. The corresponding statement is sheet = wb.create_sheet(“Everyone’s specialty”) , and then you can insert data into the sheet.

The program runs successfully, but the content of the “Everyone’s Specialty.xlsx” file is empty. To find the solution on CSDN, you need to add the following statement: for w1 in wb: w1 = wb.active wb.remove(w1)

3. For file data

At first, for the file data used in questions 5 and 6, I typed the data directly in Excel. Later, I tried to use python code to generate file data. I found that the latter can better train me to use the python library to analyze files. The ability to perform operations also enabled me to learn a lot of new knowledge about file operations.