Data Mining Experiment 1, Data Preprocessing

1. Purpose of the experiment:
(1) Become familiar with VC++ programming tools and implement data cube construction and online analytical processing (OLAP) algorithms.
(2) Browse the data to be processed; find possible noise, missing values, and inconsistencies in the attributes of each dimension; and propose specific data cleaning, data transformation, and data integration algorithms to address the problems found.
(3) Use VC++ programming tools to write programs that implement data cleaning, data transformation, data integration, and related functions.
(4) Debug the entire program to obtain clean, consistent, integrated data, and select parameters suitable for global optimization.
(5) Write an experiment report.
2. Experimental principle:

  1. Data preprocessing
    Real-world databases are highly susceptible to noisy, missing, and inconsistent data. To improve the quality of the data, and thus the quality of the mining results, a large number of data preprocessing techniques have been developed. Common methods include data cleaning, data integration, data transformation, and data reduction. Applying these techniques before mining greatly improves the quality of the resulting data mining models and reduces the time required for the actual mining.
  2. Data cleaning
    Data cleaning routines “clean” the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies.
  3. Data integration
    Data integration combines data from multiple sources into a consistent data store, such as a data warehouse or data cube.
  4. Data transformation
    Through smoothing, aggregation, data generalization, normalization, and similar operations, the data is converted into a form suitable for data mining.
  5. Data reduction
    Data reduction produces a compressed representation of the dataset that is much smaller but yields the same (or nearly the same) analytical results. Commonly used data reduction strategies include data cube aggregation, dimensionality reduction, data compression, and numerosity reduction.

3. Experiment content:

  1. Experimental content
  1. Use VC++ programming tools to write programs that implement data cleaning, data transformation, data integration, and related functions, and describe the main preprocessing steps and methods in the experiment report.
  2. Produce clean, consistent, integrated data.
  3. In the experiment report, describe the purpose of each main program segment.
  2. Experimental procedure
  1. Carefully study and review the data to identify attributes or dimensions that should be included in the analysis, and discover errors, unusual values, and inconsistencies in some transaction records.
  2. Perform data cleaning to deal with missing values, noisy data, and inconsistent data.
    For example:
    1. A missing value in the date can be recovered from the unified serial number.
    2. The purchase quantity cannot be negative.
  3. Carry out data integration, data transformation, and data reduction: integrate data from multiple sources, reduce or avoid redundancy and inconsistency in the resulting data, and transform the data into a form suitable for mining.
    For example:
    1. After data cleaning, the purchase quantity, sales price, and total amount are found to be interrelated, so the total amount can be removed.
    2. The date formats of the three data files differ and should be unified into a single date format.
    3. The store number duplicates the POS machine number, so one of them can be removed.
    4. Additional: the serial numbers of the items in the same shopping basket should be in increasing order.
  3. Block diagram

  4. Key code

#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <cstdlib>
using namespace std;

class Sales {
    public: // 1. Define the sales record class
        string serial;  // shopping-basket serial number
        int market;     // store number
        int posno;      // POS machine number
        string date;
        int sn;         // item serial number within the basket
        int id;
        float num;      // purchase quantity
        float price;
        float total;
        friend bool operator<(const Sales &a, const Sales &b)
        { // comparison used for sorting: order records by item serial number
            return a.sn < b.sn;
        }
};

int main()
{
    ofstream outfile("fl.txt", ofstream::app); // 2. Create/append the output txt file
    if (!outfile)
    {
        cout << "open error!" << endl;
        exit(1);
    }
    char name[50];
    ifstream infile;
    cout << "Enter the txt file name to be opened: 1019.txt, 1020.txt, 1021.txt" << endl;
    cin >> name;
    infile.open(name, ios::in);
    if (infile.fail())
    {
        cout << "error open!" << endl;
        exit(1);
    }

    Sales sal[13000];
    int sal_size = 0;
    string temp;
    getline(infile, temp); // skip the header line
    // 3. Read the records from the txt file
    while (sal_size < 13000 &&
           infile >> sal[sal_size].serial >> sal[sal_size].market
                  >> sal[sal_size].posno >> sal[sal_size].date
                  >> sal[sal_size].sn >> sal[sal_size].id
                  >> sal[sal_size].num >> sal[sal_size].price
                  >> sal[sal_size].total)
    {
        sal_size++;
    }
    cout << "The length of document " << name << " is: " << sal_size << endl;
    sort(sal, sal + sal_size); // the item serial numbers in the same shopping basket should be increasing
    for (int i = 0; i < sal_size; i++) { // 4. Process the data
        if (sal[i].num < 0) // 4(1) the purchase quantity cannot be negative
        {
            sal[i].num = -sal[i].num;
        }
        // 4(2) unify the date format as "yyyy year mm month dd day",
        // taking the date from the first 8 digits of the serial number
        sal[i].date = sal[i].serial.substr(0, 4) + "year"
                    + sal[i].serial.substr(4, 2) + "month"
                    + sal[i].serial.substr(6, 2) + "day";
        // 4(3) save the processed data and drop the redundant columns:
        // the store number duplicates the POS machine number, and the
        // quantity, price, and total are interrelated, so the POS machine
        // number and total amount are removed
        outfile << sal[i].serial << "\t" << sal[i].market << "\t" << sal[i].date << "\t"
                << sal[i].sn << "\t" << sal[i].id << "\t"
                << sal[i].num << "\t" << sal[i].price << endl;
    }
    cout << "The length of the document fl.txt is: " << sal_size << endl;
    infile.close(); // 5. Close the txt files
    outfile.close();
    system("pause");
    return 0;
}

4. Experimental results:

  1. Experimental data

    1. 1019.txt

    2. 1020.txt

    3. 1021.txt

  2. Processing results

1019.txt processing results


1020.txt processing results


1021.txt processing results

With this program, the different date formats of the three data files are unified into a single format, the total amount and POS machine number columns are successfully removed, and the serial numbers of the items in the same shopping basket are arranged in increasing order.

3. Experimental conclusion
Raw data often contains missing values, duplicate values, outliers, or erroneous values. Such data is usually called “dirty data” and needs to be cleaned. Sometimes the original variables do not meet the requirements of the analysis, so the data must first be processed; this is data preprocessing. The main purpose of data cleaning and preprocessing is to improve data quality and thereby the reliability of the mining results, which makes it an essential step in the data mining process.
