Machine failure prediction: the decisive moment in the next 24 hours!

1. Background introduction

The focus of this competition is to predict whether a machine will fail within the next 24 hours. The data includes various features related to machine performance, such as temperature, vibration, power consumption, and sensor readings. The target variable is binary: it indicates whether the machine will fail (1) or not (0) within the next 24 hours. The goal is to develop an accurate model that predicts machine failures from the provided features. This is an important problem: in industrial settings, predicting machine failure in advance can help prevent costly downtime and repairs. You may use any machine learning method of your choice. Note, however, that the dataset contains only numerical features, so text-based feature engineering techniques may not be applicable. Also, make sure to handle missing values appropriately before training your model. Good luck, and enjoy solving this interesting problem.
Official link: Binary Classification of Machine Failures
Dataset introduction:
The data description is shown in the table below, listing each column together with an explanation of what it contains:

Column title              Column description
id                        Unique identifier used to index and reference each record
Product ID                Type letter followed by an identifier number
Type                      Type of machine (L/M/H). Knowing the machine type can provide insight into its operation, which may relate to the probability of failure
Air temperature [K]       Ambient temperature around the machine, measured in Kelvin. The machine may behave differently under different environmental conditions
Process temperature [K]   Temperature of the process the machine is running, in Kelvin. Certain processes may increase the likelihood of overheating and failure
Rotational speed [rpm]    Speed at which the machine operates, in revolutions per minute (rpm). Higher speeds may cause increased wear
Torque [Nm]               Force causing the machine's rotation, in Newton meters (Nm). Higher torque may indicate higher load and greater risk of failure
Tool wear [min]           Degree of wear experienced by the tool, measured in minutes. Higher tool wear may indicate the need for maintenance
Machine failure           Target variable: binary indicator of whether the machine failed (1) or did not fail (0)
TWF                       Machine failure caused by tool wear
HDF                       Machine failure due to heat dissipation issues
PWF                       Machine failure due to power-related issues
OSF                       Machine failure due to overstrain
RNF                       Machine failure due to unspecified random issues

Evaluation metric, AUC:
AUC is the area under the ROC curve. It is a model evaluation metric that applies only to binary classification models. The larger the AUC, the more likely the model ranks examples correctly. The probabilistic interpretation of AUC is: "AUC is the probability that the model scores a randomly selected positive example higher than a randomly selected negative example."

from sklearn.metrics import roc_auc_score
y_true = [0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_scores)
print(auc)
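
To make the probabilistic interpretation concrete, the small sketch below counts, over all (positive, negative) pairs, how often the positive example receives the higher score (ties counting as half). On the toy data above it reproduces the roc_auc_score result exactly.

import numpy as np

y_true = np.array([0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
pos = y_scores[y_true == 1]  # scores of the positive examples
neg = y_scores[y_true == 0]  # scores of the negative examples
# Fraction of (positive, negative) pairs where the positive is ranked higher
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(sum(pairs) / len(pairs))  # same value as roc_auc_score above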

2. Data loading

# Training data
import pandas as pd

train = pd.read_csv("./data/train.csv")
train.head()  # View the first five rows
   id Product ID Type  Air temperature [K]  Process temperature [K]  Rotational speed [rpm]  Torque [Nm]  Tool wear [min]  Machine failure  TWF  HDF  PWF  OSF  RNF
0   0     L50096    L                300.6                    309.6                    1596         36.1              140                0    0    0    0    0    0
1   1     M20343    M                302.6                    312.1                    1759         29.1              200                0    0    0    0    0    0
2   2     L49454    L                299.3                    308.5                    1805         26.5               25                0    0    0    0    0    0
3   3     L53355    L                301.0                    310.9                    1524         44.3              197                0    0    0    0    0    0
4   4     M24050    M                298.0                    309.0                    1641         35.4               34                0    0    0    0    0    0
# View the column names
train.columns
Index(['id', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Machine failure', 'TWF', 'HDF', 'PWF', 'OSF',
       'RNF'],
      dtype='object')
# View summary information: dtypes, non-null counts, and memory usage
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136429 entries, 0 to 136428
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   id                       136429 non-null  int64
 1   Product ID               136429 non-null  object
 2   Type                     136429 non-null  object
 3   Air temperature [K]      136429 non-null  float64
 4   Process temperature [K]  136429 non-null  float64
 5   Rotational speed [rpm]   136429 non-null  int64
 6   Torque [Nm]              136429 non-null  float64
 7   Tool wear [min]          136429 non-null  int64
 8   Machine failure          136429 non-null  int64
 9   TWF                      136429 non-null  int64
 10  HDF                      136429 non-null  int64
 11  PWF                      136429 non-null  int64
 12  OSF                      136429 non-null  int64
 13  RNF                      136429 non-null  int64
dtypes: float64(3), int64(9), object(2)
memory usage: 14.6+ MB
# View the data distribution. train.describe() summarizes the statistics of the dataset:
# for each numeric column it returns the count, mean, standard deviation, minimum, quartiles, and maximum
train.describe()

# Check for missing values: returns, for each column, the number of missing entries
train.isna().sum()

id 0
Product ID 0
Type 0
Air temperature [K] 0
Process temperature [K] 0
Rotational speed [rpm] 0
Torque [Nm] 0
Tool wear [min] 0
Machine failure 0
TWF 0
HDF 0
PWF 0
OSF 0
RNF 0
dtype: int64
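
The missing-value counts are all zero, so no imputation is needed for this dataset. For completeness, a minimal sketch of what handling missing values could look like if the sums above were non-zero (median imputation is one common choice; the guard makes this a no-op here):

from sklearn.impute import SimpleImputer

# Hypothetical: impute numeric columns with the median if any values were missing
numeric_cols = train.select_dtypes(include="number").columns
if train[numeric_cols].isna().any().any():
    imputer = SimpleImputer(strategy="median")
    train[numeric_cols] = imputer.fit_transform(train[numeric_cols])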

# Test set
test = pd.read_csv("./data/test.csv")
test

# Reorder the columns so that the target, Machine failure, comes last
# (note: the names must match the real column names exactly, or reindex silently creates all-NaN columns)
train = train.reindex(columns=["id", 'Product ID', 'Type', "Air temperature [K]", "Process temperature [K]", "Rotational speed [rpm]", "Torque [Nm]",
                               "Tool wear [min]", "TWF", "HDF", "PWF", "OSF", "RNF", "Machine failure"])
train

3. Data analysis

# Categorical features (.copy() avoids SettingWithCopyWarning when Product ID is modified later)
categorical_features = train[['Product ID', 'Type', 'Machine failure', 'TWF', 'HDF',
                              'PWF', 'OSF', 'RNF']].copy()

# Numerical features
num_features = train[['Air temperature [K]', 'Process temperature [K]',
                      'Rotational speed [rpm]', 'Torque [Nm]',
                      'Tool wear [min]']]
categorical_features

num_features

4. Data processing

# Use Python's str.maketrans() to build a translation table that maps every ASCII letter to None (deletion).
# translate() then removes each letter from the input string, and the function returns the result.
def remove_letters(input_string):
    translation_table = str.maketrans('', '', 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
    result_string = input_string.translate(translation_table)
    return result_string
#eg:
# input_string = "Hello, World!"
# result_string = remove_letters(input_string)
# print(result_string) # Output: ", !"

# Remove the letters from Product ID, keeping only the numeric part
cleaned_column = [remove_letters(value) for value in train['Product ID']]
cleaned_column_test = [remove_letters(value) for value in test['Product ID']]

train['Product ID'] = cleaned_column
test['Product ID'] = cleaned_column_test
train

cleaned_cat_features = [remove_letters(value) for value in categorical_features['Product ID']]
categorical_features['Product ID'] = cleaned_cat_features
categorical_features
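
As an aside, the remove_letters loop can be replaced by pandas' vectorized string methods; a minimal equivalent sketch (re-running it on the already-cleaned columns is a harmless no-op):

# Vectorized alternative: strip all ASCII letters from Product ID in one call
train['Product ID'] = train['Product ID'].str.replace(r'[A-Za-z]', '', regex=True)
test['Product ID'] = test['Product ID'].str.replace(r'[A-Za-z]', '', regex=True)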


Use the corr() function from the pandas library to calculate the correlation coefficients between columns in the dataset. The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where -1 means a perfect negative correlation, 1 means a perfect positive correlation, and 0 means no linear correlation.

The resulting correlation matrix contains the coefficient between each pair of columns. It can be used to analyze relationships between the variables in the dataset, for example to check whether certain variables are highly correlated with each other, or to assess how strongly a feature relates to the target variable.

# Compute the correlation matrix: all inputs must be numeric
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = num_features.corr()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True, ax=ax)
plt.title("Correlation Heatmap")
plt.show()

# Check how many machines there are of each type
# As the figure below shows, there are three types, with type L by far the most common
type_counts = categorical_features["Type"].value_counts()
sns.barplot(x=type_counts.index,y=type_counts.values)
plt.title("Type distribution")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()

test

from sklearn.preprocessing import OneHotEncoder
One_hot_encoder = OneHotEncoder()

# Having examined the Type distribution above, one-hot encode Type in train
encoded_type = One_hot_encoder.fit_transform(train[["Type"]]).toarray()
# Create a DataFrame with the encoded values
encoded_df = pd.DataFrame(encoded_type, columns=One_hot_encoder.get_feature_names_out(['Type']))
# Concatenate encoded_df with train
train_encoded = pd.concat([train, encoded_df], axis=1)
train_encoded.drop('Type', axis=1, inplace=True)
train_encoded

# Transform the 'Type' column in test using the encoder already fitted on train
encoded_type = One_hot_encoder.transform(test[['Type']]).toarray()

# Create a DataFrame with the encoded values
encoded_test = pd.DataFrame(encoded_type, columns=One_hot_encoder.get_feature_names_out(['Type']))
encoded_test = encoded_test.set_index(test.index) # align the new columns with the test dataframe's index

# Concatenate encoded_test with test
test_encoded = pd.concat([test, encoded_test], axis=1)

test_encoded.drop('Type', axis=1, inplace=True)
test_encoded

train_encoded = train_encoded.reindex(columns=['id', 'Product ID', 'Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'TWF',
       'HDF', 'PWF', 'OSF', 'RNF', 'Type_H', 'Type_L',
       'Type_M','Machine failure'])
train_encoded

test_encoded

from sklearn.preprocessing import StandardScaler
# Feature standardization
sc = StandardScaler()
# Select the numeric columns to standardize
numeric_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
train_encoded[numeric_columns] = sc.fit_transform(train_encoded[numeric_columns])
num_features = train_encoded[numeric_columns]

# Scale the test set with the same scaler: transform only (no refitting),
# so the test data is standardized using the training-set statistics
test_encoded[numeric_columns] = sc.transform(test_encoded[numeric_columns])

# View the data distribution in the training set
for col_name in num_features.columns:
    sns.histplot(num_features[col_name])
    plt.title(f'{col_name} histogram')
    plt.show()

# Compute the correlation matrix (numeric columns only; Product ID is still a string here)
correlation_matrix = train_encoded.corr(numeric_only=True)

# Set up the figure
plt.figure(figsize=(10, 8))

# Create the heatmap using seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Set the plot title
plt.title('Correlation Heatmap')
plt.show()

# Deleting Product ID has no impact on the performance of the model
df_train_encoded = train_encoded.drop('Product ID', axis=1)
df_test_encoded = test_encoded.drop('Product ID', axis=1)

# The id column is just a row index, not a feature, so drop it as well
# (note: drop from df_*_encoded, not from the originals, or the Product ID drop is lost)
df_train_encoded = df_train_encoded.drop("id", axis=1)
df_test_encoded = df_test_encoded.drop("id", axis=1)
df_train_encoded

5. Model training and prediction

# Training
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plot data
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from tensorflow import keras
from tensorflow.keras import layers


model_type = "fc"
if model_type == "fc":
    # Split the encoded train data into X & y
    X_train = df_train_encoded.drop('Machine failure', axis=1)
    X_train = X_train.drop("id",axis=1)
    y_train = df_train_encoded['Machine failure']
    # Split the train data in train and valid set
    X_train, X_valid, y_train, y_valid = train_test_split(X_train,y_train,test_size=0.2)

    input_shape = (X_train.shape[1],) # Creating a tuple with a single element

    # Build the deep learning model
    model = keras.Sequential([
        layers.Dense(units=256, activation='relu', input_shape= input_shape),
        layers.Dense(units=128, activation='relu'),
        layers.Dense(units=64, activation='relu'),
        layers.Dense(units=1, activation='sigmoid')
    ])

    # Compiling the model
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )

    # Early stopping: halt training once validation loss stops improving, and restore the best weights
    early_stopping = keras.callbacks.EarlyStopping(
        patience=10,
        min_delta=0.001,
        restore_best_weights=True,
    )

    history = model.fit(
        X_train.astype(np.float32), y_train.astype(np.float32),
        validation_data=(X_valid.astype(np.float32), y_valid.astype(np.float32)),
        batch_size=512, # 512 best
        epochs=200,
        callbacks=[early_stopping],
    )

    y_pred = model.predict(X_valid.astype(np.float32))

    # Calculate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_valid, y_pred)
    # Calculate the ROC AUC score
    roc_auc = roc_auc_score(y_valid, y_pred)

    print("ROC AUC Score:", roc_auc)

    # Plot the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.show()

Epoch 1/200
214/214 [==============================] - 1s 3ms/step - loss: 0.0739 - binary_accuracy: 0.9845 - val_loss: 0.0212 - val_binary_accuracy: 0.9966
Epoch 2/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0233 - binary_accuracy: 0.9960 - val_loss: 0.0207 - val_binary_accuracy: 0.9964
Epoch 3/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0226 - binary_accuracy: 0.9960 - val_loss: 0.0190 - val_binary_accuracy: 0.9968
Epoch 4/200
214/214 [==============================] - 1s 3ms/step - loss: 0.0223 - binary_accuracy: 0.9959 - val_loss: 0.0182 - val_binary_accuracy: 0.9968
Epoch 5/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0222 - binary_accuracy: 0.9960 - val_loss: 0.0184 - val_binary_accuracy: 0.9968
Epoch 6/200
214/214 [==============================] - 1s 2ms/step - loss: 0.0223 - binary_accuracy: 0.9960 - val_loss: 0.0190 - val_binary_accuracy: 0.9968
Epoch 7/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0219 - binary_accuracy: 0.9960 - val_loss: 0.0187 - val_binary_accuracy: 0.9968
Epoch 8/200
214/214 [==============================] - 1s 2ms/step - loss: 0.0220 - binary_accuracy: 0.9960 - val_loss: 0.0185 - val_binary_accuracy: 0.9968
Epoch 9/200
214/214 [==============================] - 1s 2ms/step - loss: 0.0218 - binary_accuracy: 0.9960 - val_loss: 0.0184 - val_binary_accuracy: 0.9968
Epoch 10/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0217 - binary_accuracy: 0.9960 - val_loss: 0.0183 - val_binary_accuracy: 0.9968
Epoch 11/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0216 - binary_accuracy: 0.9960 - val_loss: 0.0189 - val_binary_accuracy: 0.9968
Epoch 12/200
214/214 [==============================] - 0s 2ms/step - loss: 0.0216 - binary_accuracy: 0.9960 - val_loss: 0.0186 - val_binary_accuracy: 0.9968
Epoch 13/200
214/214 [==============================] - 1s 2ms/step - loss: 0.0215 - binary_accuracy: 0.9960 - val_loss: 0.0185 - val_binary_accuracy: 0.9968
853/853 [==============================] - 1s 616us/step
ROC AUC Score: 0.9690019245164365

Alternatively, use a random forest for training and prediction.
RandomForestClassifier is an ensemble model built on the bagging framework: it consists of multiple decision trees, each trained on a different bootstrap sample of the data. At prediction time, the forest averages (or takes a majority vote over) the predictions of the individual trees to obtain the final result. RandomForestClassifier takes the following parameters (a small tuning sketch follows the list):

  • n_estimators: the number of decision trees.
  • criterion: the measure of split quality.
  • max_depth: the maximum depth of each decision tree.
  • min_samples_split: the minimum number of samples required to split an internal node.
  • min_samples_leaf: the minimum number of samples required at a leaf node.
  • min_weight_fraction_leaf: the minimum weighted fraction of samples required at a leaf node.
  • max_features: the number of features to consider when searching for the best split.
  • max_leaf_nodes: the maximum number of leaf nodes.
  • min_impurity_decrease: a node is split only if the split decreases the impurity by at least this amount.
  • min_impurity_split: an impurity threshold for early stopping of tree growth (deprecated and removed in recent scikit-learn versions).
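
To illustrate how a few of these parameters are typically chosen, here is a minimal cross-validated grid search sketch. The candidate values in param_grid are illustrative assumptions, not tuned results, and the fit call is left commented out because it is compute-intensive on 136k rows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid over a few of the parameters listed above
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # matches the competition metric
    cv=3,
    n_jobs=-1,
)
# search.fit(X_train, y_train)  # requires the X_train / y_train defined in the training code below
# print(search.best_params_, search.best_score_)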

Random forests have many advantages:

  • Very high accuracy
  • Run efficiently on large datasets
  • The injected randomness makes them less prone to overfitting
  • Good noise tolerance, although they can still overfit when the data is very noisy
  • Can handle very high-dimensional data without dimensionality reduction
  • Handle both discrete and continuous data, with no need to normalize the dataset
  • Train quickly, and the importance ranking of variables can be obtained (see the sketch after this list)
  • Easy to parallelize
  • Achieve good results even when values are missing
  • Have relatively few hyperparameters, each with an intuitive meaning
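
The variable-importance ranking mentioned in the list is exposed by scikit-learn through the fitted model's feature_importances_ attribute. A minimal sketch, assuming random_classifier has been fitted on X_train as in the training code below:

# Rank features by the forest's impurity-based importance scores
importances = pd.Series(
    random_classifier.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances.head(10))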

Disadvantages of random forests include:
1. Long training time: a random forest needs substantial computing resources and time to train, so it is often slower to train than simpler machine learning algorithms.
2. Possible overfitting: on some datasets a random forest may still overfit, resulting in poor performance on new data.
3. Poor interpretability: the decision process of a random forest is relatively complex, and it is difficult to explain the impact of each feature on the final result.
4. Not ideal for every task: random forests work well on general tabular data, but for tasks such as image recognition or speech recognition, specially designed models usually perform better.

# Training
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plot data
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
# Rebuild the feature frame from scratch: drop Product ID and id, keeping Machine failure as target
df_train_encoded = train_encoded.drop(["Product ID", "id"], axis=1)
model_type = "randomforest"
if model_type == "randomforest":
    # Split the encoded train data into X & y
    X_train = df_train_encoded.drop('Machine failure', axis=1)
    y_train = df_train_encoded['Machine failure']
    # Split the train data in train and valid set
    X_train, X_valid, y_train, y_valid = train_test_split(X_train,y_train,test_size=0.2)

    print(f"train_dataset:{X_train.shape}")

    # Create a random forest classifier (max_depth=957 is effectively unlimited depth)
    random_classifier = RandomForestClassifier(n_estimators=1000, min_samples_split=10, max_depth=957, random_state=42)

    # train the classifier on the training data
    random_classifier.fit(X_train, y_train)
    # Evaluate on the validation set: use predicted probabilities for AUC, not hard class labels
    y_pred = random_classifier.predict_proba(X_valid)[:, 1]
    # Calculate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_valid, y_pred)
    # Calculate the ROC AUC score
    roc_auc = roc_auc_score(y_valid, y_pred)
    print("ROC AUC Score:", roc_auc)
    # Plot the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.show()

6. Submission of results

# Submit results
X_test = df_test_encoded  # Product ID and id were already dropped above
ids = test_encoded["id"]  # the submission needs the original test ids
# y_pred = random_classifier.predict_proba(X_test)[:, 1]  # random forest alternative
y_pred = model.predict(X_test.astype(np.float32))
# Flatten y_pred to make it 1-dimensional
y_pred = y_pred.flatten()
submission_df = pd.DataFrame({
    'id': ids,
    'Machine failure': y_pred
})
# Save the DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=False)
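
As a quick sanity check before uploading, it can help to confirm that the file has the expected columns and one row per test example:

# Sanity check: expect columns 'id' and 'Machine failure', one row per test id
submission_check = pd.read_csv('submission.csv')
print(submission_check.shape)
print(submission_check.head())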