Contents
1. Feature scaling
2. Predict CO2 value
3. df column index extension
1. Feature scaling
Feature scaling is useful when features are measured in different units. Different units produce values of very different magnitudes, which can hurt the performance of some machine learning algorithms. For example, if one feature is measured in inches and another in kilograms, their numerical magnitudes cannot be compared directly. Feature scaling maps the value ranges of such features onto a common scale so that they can be used together more effectively.
It also helps when comparing different attributes. For example, it is hard to compare a displacement of 1.0 with a vehicle weight of 790, but once both are scaled to comparable values, it is easy to see how large one value is relative to the other.
There are various ways to scale data; here we use a method called standardization.
Standardization uses the following formula: z = (x - u) / s
where z is the new value, x is the original value, u is the mean, and s is the standard deviation.
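The formula can be tried out by hand before reaching for a library. The following is a minimal sketch using only the Python standard library; the `standardize` helper and the five sample weights are illustrative assumptions, not part of the original dataset code. It uses the population standard deviation (`pstdev`, which divides by n), the same convention scikit-learn's `StandardScaler` follows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the z-score (x - u) / s for every x in values."""
    u = fmean(values)    # mean of the column
    s = pstdev(values)   # population standard deviation (divides by n)
    return [(x - u) / s for x in values]


# Illustrative weights in kg; a made-up sample, not the full dataset.
weights = [790, 1160, 929, 865, 1140]
z = standardize(weights)
print([round(v, 2) for v in z])
```

After standardization the values have a mean of 0 and a standard deviation of 1, which is what makes columns with different units comparable.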
Car | Model | Volume | Weight | CO2 |
---|---|---|---|---|
Toyota | Aygo | 1.0 | 790 | 99 |
Taking the Weight column from the dataset above, the first value is 790, and its scaled value is:
(790 - 1292.23) / 238.74 = -2.1
Taking the Volume column from the dataset above, the first value is 1.0, and its scaled value is:
(1.0 - 1.61) / 0.38 = -1.59 (computed with the unrounded mean and standard deviation; the rounded figures shown here give about -1.61)
Instead of comparing 790 with 1.0, we can now compare -2.1 with -1.59.
Scale all values in the Weight and Volume columns:

```python
import pandas
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler() object
scale = StandardScaler()

# Raw string so the backslashes in the Windows path are not read as escapes
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")
X = df[['Weight', 'Volume']]

# fit_transform(X) is a scikit-learn estimator method that fits the scaler
# to the data and transforms it in a single step
scaledX = scale.fit_transform(X)
print(scaledX)
```
Output:

```
[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]
 [-1.78973979 -1.85409913]
 [-0.63784641 -0.28970299]
 [-1.52166278 -1.59336644]
 [-0.76769621 -0.55043568]
 [ 0.3046118  -0.28970299]
 [-0.7551301  -0.28970299]
 [-0.59595938 -0.0289703 ]
 [-1.30803892 -1.33263375]
 [-1.26615189 -0.81116837]
 [-0.7551301  -1.59336644]
 [-0.16871166 -0.0289703 ]
 [ 0.14125238 -0.0289703 ]
 [ 0.15800719 -0.0289703 ]
 [ 0.3046118  -0.0289703 ]
 [-0.05142797  1.53542584]
 [-0.72580918 -0.0289703 ]
 [ 0.14962979  1.01396046]
 [ 1.2219378  -0.0289703 ]
 [ 0.5685001   1.01396046]
 [ 0.3046118   1.27469315]
 [ 0.51404696 -0.0289703 ]
 [ 0.51404696  1.01396046]
 [ 0.72348212 -0.28970299]
 [ 0.8281997   1.01396046]
 [ 1.81254495  1.01396046]
 [ 0.96642691 -0.0289703 ]
 [ 1.72877089  1.01396046]
 [ 1.30990057  1.27469315]
 [ 1.90050772  1.01396046]
 [-0.23991961 -0.0289703 ]
 [ 0.40932938 -0.0289703 ]
 [ 0.47215993 -0.0289703 ]
 [ 0.4302729   2.31762392]]
```
The output above shows the scaled values of the two columns. The first row is -2.1 and -1.59, which matches our calculations.
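To see where these numbers come from, the scaler's result can be compared with the formula applied by hand. This is a minimal sketch on a hypothetical five-row sample (the values are made up; only the column names mirror the ones used above). It relies on the fact that `StandardScaler` uses the population standard deviation, so the pandas call needs `ddof=0`; the pandas default of `ddof=1` would give slightly different numbers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical five-row sample; not the real cars file.
df = pd.DataFrame({
    "Weight": [790, 1160, 929, 865, 1140],
    "Volume": [1.0, 1.2, 1.0, 0.9, 1.5],
})
X = df[["Weight", "Volume"]]

scaler = StandardScaler()
scaled = scaler.fit_transform(X)

# Manual z = (x - u) / s with the population standard deviation (ddof=0).
manual = (X - X.mean()) / X.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))  # True
print(scaler.mean_)   # per-column means learned by fit
print(scaler.scale_)  # per-column standard deviations learned by fit
```

The fitted scaler keeps the means and standard deviations in `mean_` and `scale_`, which is why it can later apply the same scaling to new values.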
2. Predict CO2 value
The task in the Multiple Regression chapter is to predict the CO2 emissions of a car given only its weight and displacement.
Once the dataset has been scaled, the same scaling must also be applied to any new values before predicting.
Predict the CO2 emissions of a 1.3 liter car weighing 2300 kg:
Code before scaling (as used in the Multiple Regression chapter):
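```python
import pandas
from sklearn import linear_model

# Raw string so the backslashes in the Windows path are not read as escapes
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emissions of a car weighing 2300 kg with a 1.3 liter engine.
# The Volume column in this file is in liters (first row: 1.0), so the
# displacement is passed as 1.3 rather than 1300 ccm.
predictedCO2 = regr.predict([[2300, 1.3]])
print(predictedCO2)
```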
Code after scaling
```python
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Raw string so the backslashes in the Windows path are not read as escapes
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")
X = df[['Weight', 'Volume']]  # select several columns by passing a list of names
y = df['CO2']

# fit_transform() combines fit() and transform(): fit() computes the mean and
# standard deviation of X, then transform() standardizes X into scaledX
scaledX = scale.fit_transform(X)

# Instantiate the LinearRegression object regr and train it on the scaled data
regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# transform() standardizes the new sample with the statistics learned from X.
# The input must be a two-dimensional array, even for a single sample.
# The return value is a NumPy array holding the standardized data.
scaled = scale.transform([[2300, 1.3]])
print([scaled[0]])  # the standardized sample wrapped in a list
print(scaled[0])    # approximately [ 4.22 -0.81]

# predict() expects a two-dimensional input in which each row is a sample and
# each column is a feature. scaled already has that shape, so [scaled[0]] is
# equivalent to passing scaled directly.
predictedCO2 = regr.predict([scaled[0]])
print(predictedCO2)
```
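However, in actual testing the predicted values before and after scaling were not the same. The likely cause is a unit mismatch rather than the scaling itself: the Volume column in this file is in liters, so giving the unscaled model 1300 (ccm) and the scaled model 1.3 (liters) describes two different cars. Given the same inputs, ordinary least-squares regression should predict the same CO2 value with or without standardization, up to floating-point rounding.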
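One way to check whether scaling by itself changes the prediction is to fit the same model twice, once on raw features and once on standardized features, and feed both the same new sample. The sketch below does this on synthetic data (all values are hypothetical and generated in the code, not read from the cars file). It uses scikit-learn's `make_pipeline` so that the scaler fitted on the training data is reapplied automatically inside `predict()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): weight in kg, volume in liters, CO2 as target.
rng = np.random.default_rng(0)
weight = rng.uniform(800, 1800, size=30)
volume = rng.uniform(0.9, 2.5, size=30)
X = np.column_stack([weight, volume])
y = 80 + 0.008 * weight + 8 * volume + rng.normal(0, 2, size=30)

new_car = [[2300, 1.3]]  # same units as the training columns

# Model trained on the raw features.
plain = LinearRegression().fit(X, y)

# Pipeline: the scaler is fitted on X and reapplied automatically in predict().
piped = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

print(plain.predict(new_car), piped.predict(new_car))
print(np.allclose(plain.predict(new_car), piped.predict(new_car)))  # True
```

If the two numbers differ in a real test, it is worth checking that both models receive the new sample in the same units as the training columns, for example 1.3 liters rather than 1300 ccm.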
3. df column index extension
A DataFrame can only be indexed by column name if it has named columns.
If the data has no columns yet, it can be stored as a one-dimensional or two-dimensional array and then converted to a DataFrame.
For example, if the data is a one-dimensional array, you can call `reshape` on a `numpy` array to turn it into a two-dimensional array, and then use the `pd.DataFrame()` function to convert it to a DataFrame:
```python
import numpy as np
import pandas as pd

# Convert a one-dimensional array to a two-dimensional array.
# In reshape(-1, 1), -1 lets NumPy work out the number of rows automatically,
# and 1 means the result has a single column.
data = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# Convert the 2D array to a DataFrame
df = pd.DataFrame(data, columns=['Value'])
```
In this example, we first convert the one-dimensional array `[1, 2, 3, 4, 5]` into a two-dimensional array with a single column using `reshape(-1, 1)`, and then use the `pd.DataFrame()` function to convert it to a DataFrame `df` with a column named `Value`.
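The direction of the reshape matters when the array is a single sample rather than a single column. As a small sketch (the `sample` values are the 2300 kg / 1.3 liter car from the prediction example; the variable names are illustrative), `reshape(1, -1)` produces one row with several features, which is the shape a scaler or model expects for one sample, while `reshape(-1, 1)` produces one column:

```python
import numpy as np

sample = np.array([2300, 1.3])     # one car: weight, volume

as_row = sample.reshape(1, -1)     # 1 sample, 2 features
as_column = sample.reshape(-1, 1)  # 2 samples, 1 feature

print(as_row.shape)     # (1, 2)
print(as_column.shape)  # (2, 1)
```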
If the data is a two-dimensional array, it can be directly converted to a DataFrame using the `pd.DataFrame()` function, e.g.
```python
import pandas as pd

# Define a two-dimensional array
data = [[1, 2], [3, 4], [5, 6]]

# Convert the 2D array to a DataFrame
df = pd.DataFrame(data, columns=['Value1', 'Value2'])
```
In this example, we define a two-dimensional array `data`, which contains two columns of data. We use the `pd.DataFrame()` function to convert it into a DataFrame `df` with column names `Value1` and `Value2`
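The same conversion is handy for the array returned by a scaler, which is a plain NumPy array without column labels. A minimal sketch, assuming a small stand-in array with made-up values and the column names `Weight` and `Volume`:

```python
import numpy as np
import pandas as pd

# Stand-in for the array returned by a scaler; the values are made up.
scaled_values = np.array([[-2.1, -1.59], [-0.55, -1.07], [-1.52, -1.59]])

# A NumPy array has no column labels, so scaled_values["Weight"] would fail.
scaled_df = pd.DataFrame(scaled_values, columns=["Weight", "Volume"])

print(scaled_df["Weight"])              # single brackets: a Series
print(scaled_df[["Weight", "Volume"]])  # double brackets: a DataFrame
```

Single brackets with one name return a Series, while double brackets with a list of names return a DataFrame, which is the form used for `X` in the regression code.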