0 Preface
This post is part of a series on high-quality competition projects. Today's project is:
Machine Learning Big Data Analysis Project
The topic is relatively new and well suited to competitions, and senior students highly recommend it.
More information, project sharing:
https://gitee.com/dancheng-senior/postgraduate
1 Introduction to data sets
import pandas as pd

df = pd.read_csv('/home/kesci/input/jena1246/jena_climate_2009_2016.csv')
df.head()
As the output above shows, an observation is recorded every 10 minutes, so there are 6 observations per hour and 144 (6×24) observations per day.
Suppose that, at a given time, you want to predict the temperature for the next 6 hours. To make this prediction, we use 5 days of past observations, so each training window contains the last 720 (5×144) observations.
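The window arithmetic above can be checked with a few lines of Python. This is only a sketch: the names history_steps and horizon_steps are illustrative and do not appear in the original code.

```python
# Sampling interval taken from the text: one observation every 10 minutes.
MINUTES_PER_OBSERVATION = 10

obs_per_hour = 60 // MINUTES_PER_OBSERVATION
obs_per_day = obs_per_hour * 24

# 5 days of history and a 6-hour horizon, as described above.
history_steps = 5 * obs_per_day    # illustrative name
horizon_steps = 6 * obs_per_hour   # illustrative name

print(obs_per_hour, obs_per_day, history_steps, horizon_steps)  # 6 144 720 36
```

So a 6-hour horizon corresponds to 36 time steps at this sampling rate.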
The function below builds these time windows for model training. The parameter history_size is the size of the sliding window of past information. target_size is how far into the future the model must learn to predict; the value at that step is used as the label.
The first 300,000 rows of the data are used as the training set and the rest as the validation set. This amounts to roughly 2,100 days of training data (300,000 / 144).
import numpy as np

def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []

    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size

    for i in range(start_index, end_index):
        indices = range(i - history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i + target_size])
    return np.array(data), np.array(labels)
2 Start analysis
2.1 Univariate analysis
First, a model is trained on a single feature (temperature) and then used to make predictions.
2.1.1 Temperature variable
Extract the temperature from the data set
uni_data = df['T (degC)']
uni_data.index = df['Date Time']
uni_data.head()
Observe changes in data over time
Standardize
TRAIN_SPLIT = 300000  # the first 300,000 rows are the training set, as stated above

# Standardize using statistics from the training portion only
uni_data = uni_data.values  # use a NumPy array so the window indices are purely positional
uni_train_mean = uni_data[:TRAIN_SPLIT].mean()
uni_train_std = uni_data[:TRAIN_SPLIT].std()
uni_data = (uni_data - uni_train_mean) / uni_train_std

# Split into features and labels
univariate_past_history = 20
univariate_future_target = 0

x_train_uni, y_train_uni = univariate_data(uni_data, 0, TRAIN_SPLIT,  # start and end of the interval
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(uni_data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
With these settings, the features of the first sample are the temperatures at the first 20 time points, and its label is the temperature at the 21st time point. By the same rule, the features of the second sample are the temperatures at time points 2 through 21, and its label is the temperature at the 22nd time point, and so on.
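The windowing rule just described can be verified on a toy series. The sketch below is illustrative only: make_windows is a hypothetical, simplified stand-in for univariate_data, and the series 1, 2, ..., 30 is synthetic so that each value equals its time point.

```python
import numpy as np

def make_windows(series, history_size, target_size):
    """Hypothetical, simplified version of the windowing rule described above."""
    data, labels = [], []
    for i in range(history_size, len(series) - target_size):
        # The previous `history_size` values form the features ...
        data.append(series[i - history_size:i].reshape(history_size, 1))
        # ... and the value `target_size` steps past position i is the label.
        labels.append(series[i + target_size])
    return np.array(data), np.array(labels)

# Toy series 1, 2, ..., 30: each value equals its (1-based) time point.
series = np.arange(1, 31, dtype=float)
x, y = make_windows(series, history_size=20, target_size=0)

print(x.shape, y.shape)               # (10, 20, 1) (10,)
print(x[0, 0, 0], x[0, -1, 0], y[0])  # 1.0 20.0 21.0
print(x[1, 0, 0], x[1, -1, 0], y[1])  # 2.0 21.0 22.0
```

Consecutive samples overlap in all but one time point, which is why the training pipeline shuffles them before batching.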
2.2 Slice features and labels
import tensorflow as tf

BATCH_SIZE = 256
BUFFER_SIZE = 10000

# Training data: cache, shuffle, batch, and repeat indefinitely
train_univariate = tf.data.Dataset.from_tensor_slices((x_train_uni, y_train_uni))
train_univariate = train_univariate.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()

# Validation data: batch and repeat, no shuffling
val_univariate = tf.data.Dataset.from_tensor_slices((x_val_uni, y_val_uni))
val_univariate = val_univariate.batch(BATCH_SIZE).repeat()
2.3 Modeling
simple_lstm_model = tf.keras.models.Sequential([
    # input_shape=(20, 1); the batch dimension is not included
    tf.keras.layers.LSTM(8, input_shape=x_train_uni.shape[-2:]),
    tf.keras.layers.Dense(1)
])

simple_lstm_model.compile(optimizer='adam', loss='mae')
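To get a feel for how small this model is, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and a Dense layer with a bias; the helper names are hypothetical and not part of Keras.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(units, input_dim):
    # One weight per input plus one bias, for each output unit.
    return units * (input_dim + 1)

# LSTM(8) on inputs of shape (20, 1): 1 feature per time step.
lstm_params = lstm_param_count(units=8, input_dim=1)
# Dense(1) on the 8-dimensional LSTM output.
dense_params = dense_param_count(units=1, input_dim=8)

print(lstm_params, dense_params, lstm_params + dense_params)  # 320 9 329
```

Note that the window length (20) does not affect the count, because the same weights are reused at every time step.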
2.4 Training model
EVALUATION_INTERVAL = 200
EPOCHS = 10

simple_lstm_model.fit(train_univariate, epochs=EPOCHS,
                      steps_per_epoch=EVALUATION_INTERVAL,
                      validation_data=val_univariate, validation_steps=50)
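Because both datasets end with .repeat(), they never run out, so steps_per_epoch and validation_steps decide where each "epoch" ends. The arithmetic below is a rough sketch of how much data one such epoch covers, using the constants from this post; the variable names in lowercase are illustrative.

```python
BATCH_SIZE = 256
EVALUATION_INTERVAL = 200   # steps_per_epoch
VALIDATION_STEPS = 50

TRAIN_SPLIT = 300_000       # training rows, from the text
HISTORY_SIZE = 20           # univariate_past_history

# univariate_data yields one window per index from HISTORY_SIZE to TRAIN_SPLIT - 1.
train_windows = TRAIN_SPLIT - HISTORY_SIZE
samples_per_epoch = BATCH_SIZE * EVALUATION_INTERVAL
val_samples_per_epoch = BATCH_SIZE * VALIDATION_STEPS
fraction_seen = samples_per_epoch / train_windows

print(train_windows)            # 299980
print(samples_per_epoch)        # 51200
print(val_samples_per_epoch)    # 12800
print(round(fraction_seen, 3))  # 0.171
```

In other words, each epoch here covers roughly a sixth of the training windows, which keeps training short at the cost of not seeing all the data in every epoch.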
training process
Training results – temperature prediction results
2.5 Multivariate analysis
Here, we use past pressure, temperature, and density information to predict the temperature at a future point in time. The data set therefore needs to include the pressure, temperature, and density columns.
2.5.1 Plot of pressure, temperature and density changing with time
2.5.2 Convert the data set to array type and normalize
# Assumption: `features` holds the pressure, temperature and density columns of df
features = df[['p (mbar)', 'T (degC)', 'rho (g/m**3)']]

# Standardize each column with training-set statistics
dataset = features.values
data_mean = dataset[:TRAIN_SPLIT].mean(axis=0)
data_std = dataset[:TRAIN_SPLIT].std(axis=0)
dataset = (dataset - data_mean) / data_std

def multivariate_data(dataset, target, start_index, end_index, history_size,
                      target_size, step, single_step=False):
    data = []
    labels = []

    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size

    for i in range(start_index, end_index):
        # step is the sampling stride inside each history window
        indices = range(i - history_size, i, step)
        data.append(dataset[indices])

        if single_step:
            labels.append(target[i + target_size])
        else:
            labels.append(target[i:i + target_size])

    return np.array(data), np.array(labels)
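The effect of per-column standardization and of the step argument is easiest to see on a small synthetic array. The sketch below is illustrative only: multivariate_windows is a hypothetical single-step variant of the function above, and the data, the split point, and the window sizes are made up for the example rather than taken from the original project.

```python
import numpy as np

def multivariate_windows(dataset, target, history_size, target_size, step):
    """Hypothetical single-step variant of the windowing function above."""
    data, labels = [], []
    for i in range(history_size, len(dataset) - target_size):
        indices = range(i - history_size, i, step)  # every `step`-th past row
        data.append(dataset[indices])
        labels.append(target[i + target_size])
    return np.array(data), np.array(labels)

# Synthetic data: 100 time steps, 3 features on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[1000.0, 10.0, 1200.0], scale=[10.0, 8.0, 40.0], size=(100, 3))

train_split = 80  # hypothetical split for this toy example
mean = raw[:train_split].mean(axis=0)  # one mean per column
std = raw[:train_split].std(axis=0)    # one standard deviation per column
scaled = (raw - mean) / std

# 12 past steps sampled every 3rd step gives 4 rows per window.
x, y = multivariate_windows(scaled, scaled[:, 1],
                            history_size=12, target_size=6, step=3)

print(mean.shape, std.shape)  # (3,) (3,)
print(x.shape, y.shape)       # (82, 4, 3) (82,)
```

Sampling the history with a stride shortens the sequence the LSTM has to process while still covering the same span of time.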
2.5.3 Multivariate model training
import matplotlib.pyplot as plt

# x_train_single, train_data_single and val_data_single are assumed to be built with
# multivariate_data(..., single_step=True) and batched in the same way as in section 2.2
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))

single_step_model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='mae')

single_step_history = single_step_model.fit(train_data_single, epochs=EPOCHS,
                                            steps_per_epoch=EVALUATION_INTERVAL,
                                            validation_data=val_data_single,
                                            validation_steps=50)

def plot_train_history(history, title):
    # Plot training and validation loss per epoch
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(len(loss))

    plt.figure()
    plt.plot(epochs, loss, 'b', label='Training loss')
    plt.plot(epochs, val_loss, 'r', label='Validation loss')
    plt.title(title)
    plt.legend()
    plt.show()

plot_train_history(single_step_history, 'Single Step Training and validation loss')
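Since the model is trained on standardized data with loss='mae', the plotted loss is in units of standard deviations rather than degrees. The sketch below shows one way to convert it back. The mean, standard deviation, and sample values are hypothetical numbers chosen for the example, not results from this project.

```python
def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences between targets and predictions.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical training statistics for the temperature column (degC).
temp_mean = 9.0
temp_std = 8.0

# Hypothetical standardized targets and predictions.
y_true_scaled = [0.50, -0.25, 1.00, 0.00]
y_pred_scaled = [0.25, -0.50, 0.75, 0.25]

mae_scaled = mean_absolute_error(y_true_scaled, y_pred_scaled)

# Undoing the standardization: the mean cancels in a difference,
# so the error only needs to be multiplied by the standard deviation.
mae_degc = mae_scaled * temp_std

# Cross-check by converting the values themselves back to degC first.
y_true_degc = [v * temp_std + temp_mean for v in y_true_scaled]
y_pred_degc = [v * temp_std + temp_mean for v in y_pred_scaled]

print(mae_scaled)                                     # 0.25
print(mae_degc)                                       # 2.0
print(mean_absolute_error(y_true_degc, y_pred_degc))  # 2.0
```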