Feature scaling (Scale Features), predicting CO2 values with scaled features, df column index extension

Contents

1. Feature scaling

2. Predict CO2 value

3. df column index extension


1. Feature scaling

Feature scaling is useful when features are measured in different units. Features in different units usually differ in numerical magnitude as well, which can hurt the performance of some machine learning algorithms. For example, if one feature is measured in inches and another in kilograms, their raw values cannot be compared directly. Feature scaling maps the values of the different features onto a comparable range so that they can be used together more effectively.

It also helps when comparing different attributes. It is hard to compare a displacement of 1.0 with a vehicle weight of 790, but once both are scaled to comparable values it is easy to see how one value relates to the other.

There are various ways to scale data; here we use a method called standardization.

Standardization uses the following formula: z = (x - u) / s

where z is the new value, x is the original value, u is the mean, and s is the standard deviation.

Car     Model  Volume  Weight  CO2
Toyota  Aygo   1.0     790     99

Taking the Weight column of the dataset above, the first value is 790, and its scaled value is:

(790 - 1292.23) / 238.74 ≈ -2.1

Taking the Volume column of the dataset above, the first value is 1.0, and its scaled value is:

(1.0 - 1.61) / 0.38 ≈ -1.59

(The rounded mean and standard deviation shown here give about -1.61; the unrounded values give -1.59.)

Instead of comparing 790 with 1.0, we can now compare -2.1 with -1.59.
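The arithmetic above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `standardize` and the short `weights` list are made up for illustration and are not taken from cars.csv. It uses the population standard deviation (dividing by n), which is what the formula above assumes.

```python
from statistics import fmean, pstdev

def standardize(value, mean, std):
    """Return the z-score of value: (value - mean) / std."""
    return (value - mean) / std

# Rounded mean and standard deviation quoted in the text above.
z_weight = standardize(790, 1292.23, 238.74)
print(round(z_weight, 2))  # -2.1

# Hypothetical mini-column, only to show where the mean and std come from.
weights = [790, 1160, 929, 865, 1140]
mean = fmean(weights)
std = pstdev(weights)  # population standard deviation (divides by n)
scaled = [standardize(w, mean, std) for w in weights]
print([round(z, 2) for z in scaled])
```

After standardization the column has a mean of 0 and a standard deviation of 1, whatever its original unit was.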

Scale all values in the Weight and Volume columns:

import pandas
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler() object
scale = StandardScaler()

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")

X = df[['Weight', 'Volume']]

# fit_transform(X) fits the scaler to X (computes each column's mean and standard deviation) and transforms X in a single call
scaledX = scale.fit_transform(X)

print(scaledX)

Output:

[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]
 [-1.78973979 -1.85409913]
 [-0.63784641 -0.28970299]
 [-1.52166278 -1.59336644]
 [-0.76769621 -0.55043568]
 [ 0.3046118  -0.28970299]
 [-0.7551301  -0.28970299]
 [-0.59595938 -0.0289703 ]
 [-1.30803892 -1.33263375]
 [-1.26615189 -0.81116837]
 [-0.7551301  -1.59336644]
 [-0.16871166 -0.0289703 ]
 [ 0.14125238 -0.0289703 ]
 [ 0.15800719 -0.0289703 ]
 [ 0.3046118  -0.0289703 ]
 [-0.05142797  1.53542584]
 [-0.72580918 -0.0289703 ]
 [ 0.14962979  1.01396046]
 [ 1.2219378  -0.0289703 ]
 [ 0.5685001   1.01396046]
 [ 0.3046118   1.27469315]
 [ 0.51404696 -0.0289703 ]
 [ 0.51404696  1.01396046]
 [ 0.72348212 -0.28970299]
 [ 0.8281997   1.01396046]
 [ 1.81254495  1.01396046]
 [ 0.96642691 -0.0289703 ]
 [ 1.72877089  1.01396046]
 [ 1.30990057  1.27469315]
 [ 1.90050772  1.01396046]
 [-0.23991961 -0.0289703 ]
 [ 0.40932938 -0.0289703 ]
 [ 0.47215993 -0.0289703 ]
 [ 0.4302729   2.31762392]]

These are the values of the two columns after scaling. The first row, -2.1 and -1.59, matches our calculation.
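With its default settings, StandardScaler standardizes each column separately, using that column's mean and population standard deviation. The same column-wise calculation can be sketched with NumPy alone. The rows below are hypothetical values chosen for illustration, not the contents of cars.csv.

```python
import numpy as np

# Hypothetical rows of [Weight, Volume]; not taken from cars.csv.
X = np.array([
    [790.0, 1.0],
    [1160.0, 1.2],
    [929.0, 1.0],
    [1365.0, 1.6],
    [1705.0, 2.2],
])

col_mean = X.mean(axis=0)  # one mean per column
col_std = X.std(axis=0)    # ddof=0, i.e. population standard deviation
scaled = (X - col_mean) / col_std

print(scaled)
print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

Because each column is centered and scaled with its own statistics, Weight and Volume end up on the same scale even though their raw magnitudes differ by orders of magnitude.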

2. Predict CO2 value

The task in the Multiple Regression chapter is to predict the CO2 emissions of a car given only its weight and displacement.

Once the dataset has been scaled, the same scaling must also be applied to any new values before predicting.

Predict the CO2 emissions of a 1.3 liter car weighing 2300 kg.

Code without scaling (as used in the multiple regression chapter):

import pandas
from sklearn import linear_model

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emissions of a car with a weight of 2300 kg and a displacement of 1300 ccm.
# The values must use the same units as the Weight and Volume columns in the file.
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)

Code with scaling:

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pandas.read_csv(r"C:\Users\ml\Desktop\cars.csv")

X = df[['Weight', 'Volume']]  # Pass a list of column names to select several columns
y = df['CO2']

# fit_transform() combines fit() and transform(): fit() computes statistics such as
# the mean and standard deviation of each column of X, and transform() uses them to
# standardize X, giving scaledX
scaledX = scale.fit_transform(X)

# Instantiate a LinearRegression object, regr, and train it on the scaled data
regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# transform() standardizes the new data [[2300, 1.3]] with the mean and standard
# deviation learned above. The input must be a two-dimensional array even for a
# single sample, and the values must use the same units as the training columns.
# The return value is a NumPy array holding the standardized data.
scaled = scale.transform([[2300, 1.3]])

print([scaled[0]])  # [array([ 4.22104928, -4.19730382])]

print(scaled[0])  # [ 4.22104928 -4.19730382]

# predict() expects a two-dimensional array in which each row is a sample and each
# column is a feature. scaled[0] is a single one-dimensional row, so it is wrapped
# in a list, [scaled[0]], to form a two-dimensional input with one row.
predictedCO2 = regr.predict([scaled[0]])
print(predictedCO2)
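The shapes involved in that last step can be checked with NumPy alone. In this small sketch the array `scaled` is typed in by hand from the printed values, as a stand-in for the result of `scale.transform()`, so no fitted scaler is needed.

```python
import numpy as np

# Stand-in for the array returned by scale.transform([[2300, 1.3]]).
scaled = np.array([[4.22104928, -4.19730382]])

print(scaled.shape)     # (1, 2): one sample, two features
print(scaled[0].shape)  # (2,): a single row is one-dimensional

# Wrapping the row in a list restores the two-dimensional shape.
wrapped = np.array([scaled[0]])
print(wrapped.shape)                    # (1, 2)
print(np.array_equal(wrapped, scaled))  # True

# reshape(1, -1) is another way to turn one sample into a 2-D array.
row = scaled[0].reshape(1, -1)
print(row.shape)                        # (1, 2)
```

Since `scaled` is already two-dimensional, passing it directly, as in `regr.predict(scaled)`, should be equivalent to passing `[scaled[0]]`.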

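In my own test, however, the predictions with and without scaling were not the same. Standardizing the features should not change the predictions of an ordinary least-squares linear regression, so a difference most likely means the two versions received inputs in different units: the first passes the displacement as 1300 (ccm), the second as 1.3 (liters). The scaled value of -4.197 printed for 1.3 suggests that the Volume column in this file is stored in ccm, in which case the scaled version should be given 1300 as well.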
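Whether scaling changes the prediction can be checked on synthetic data. The sketch below does not use cars.csv or scikit-learn: it fits a least-squares model with NumPy's `np.linalg.lstsq` as a stand-in for `LinearRegression`, and the data, coefficients, and helper names (`fit_ols`, `predict`) are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns are [weight_kg, volume_ccm].
X = np.column_stack([
    rng.uniform(800, 1800, size=30),
    rng.uniform(900, 2500, size=30),
])
y = 80 + 0.0075 * X[:, 0] + 0.0078 * X[:, 1] + rng.normal(0, 1, size=30)

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict(coef, features):
    """Apply the fitted coefficients (intercept first) to new rows."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

mean, std = X.mean(axis=0), X.std(axis=0)
new_car = np.array([[2300.0, 1300.0]])  # same units as the training columns

raw_pred = predict(fit_ols(X, y), new_car)
scaled_pred = predict(fit_ols((X - mean) / std, y), (new_car - mean) / std)
print(raw_pred, scaled_pred)  # the two predictions agree

# Passing the volume in liters to a model trained on ccm gives a different result.
wrong_units = np.array([[2300.0, 1.3]])
mismatch_pred = predict(fit_ols((X - mean) / std, y), (wrong_units - mean) / std)
print(mismatch_pred)
```

With consistent units the raw and scaled models return the same value up to floating-point error; only the input given in the wrong unit produces a different prediction.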

3. df column index extension

Columns cannot be selected by name if the data has no named columns.

In that case, store the data as a one-dimensional or two-dimensional array and then convert it to a DataFrame.

For example, if the data is a one-dimensional array, you can use the `reshape` method of a NumPy array to turn it into a two-dimensional array, and then use `pd.DataFrame()` to convert it to a DataFrame:

import numpy as np
import pandas as pd

# Convert a one-dimensional array to a two-dimensional array.
# In reshape(-1, 1), -1 tells NumPy to work out the number of rows automatically, and 1 means one column
data = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# Convert 2D array to DataFrame
df = pd.DataFrame(data, columns=['Value'])

In this example, we first convert the one-dimensional array `[1, 2, 3, 4, 5]` into a two-dimensional array with a single column using `reshape(-1, 1)`, and then use `pd.DataFrame()` to convert it to a DataFrame `df` with one column named `Value`.

If the data is a two-dimensional array, it can be directly converted to a DataFrame using the `pd.DataFrame()` function, e.g.

import pandas as pd

# Define two-dimensional data as a list of rows
data = [[1, 2], [3, 4], [5, 6]]

# Convert 2D array to DataFrame
df = pd.DataFrame(data, columns=['Value1', 'Value2'])

In this example, we define two-dimensional data `data` (a list of rows) with two columns. We use `pd.DataFrame()` to convert it into a DataFrame `df` with the column names `Value1` and `Value2`.
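Once the columns have names, they can be selected the same way as `df[['Weight', 'Volume']]` in the regression code. The short sketch below reuses the example data and shows the difference between selecting with a single name and with a list of names. The variable names `single`, `subset`, and `unnamed` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['Value1', 'Value2'])

single = df['Value1']              # one label gives a Series
subset = df[['Value1', 'Value2']]  # a list of labels gives a DataFrame

print(type(single).__name__)  # Series
print(type(subset).__name__)  # DataFrame
print(subset.shape)           # (3, 2)

# Without explicit names, pandas labels the columns 0, 1, ...
unnamed = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
print(list(unnamed.columns))  # [0, 1]
```

A list of names keeps the result two-dimensional, which is the shape scikit-learn estimators expect for their feature matrix.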