A Survey on Semi-Supervised Learning

Author: Zen and the Art of Computer Programming

1. Introduction

Semi-supervised learning is a machine learning method that combines a small amount of labeled data with a large amount of unlabeled data. Its purpose is to alleviate problems such as insufficient annotated data and sample imbalance: a model is first trained on the labeled samples, and the unlabeled data is then exploited to improve it. Its algorithm process can therefore be divided into two steps: the first step is to train the model on the available labels; the second step is to use the model to predict, and learn from, the unlabeled data.

Semi-supervised learning has the following advantages:

1. Reduced labeling cost: when only a small amount of labeled data is available, semi-supervised learning can exploit the unlabeled remainder instead of requiring every sample to be annotated.

2. Improved prediction accuracy: the labeled data anchors the training, and the structure of the unlabeled data provides additional information, which can improve prediction accuracy.

3. Alleviation of sample imbalance: since some labeled data is available, a sampling strategy can be used to make the number of samples of each class more similar.

4. Knowledge fusion: by combining the prediction results of multiple learning models, prediction performance can be effectively improved.

Currently, semi-supervised learning has become a popular research direction in the field of machine learning, and with the advancement of technology its applications are becoming more and more widespread. In actual projects, it can be applied to many fields such as image classification, text classification, and sequence analysis. In addition, there are many related settings, such as weakly supervised learning and co-training. An in-depth understanding of the theory and practice of semi-supervised learning is therefore an essential skill for mastering the specific algorithm principles, operating steps, and mathematical formulas involved. Today, under the theme “A Survey on Semi-Supervised Learning”, we will introduce the relevant knowledge of semi-supervised learning in a simple and easy-to-understand manner, and share some of our own experiences and insights.

2. Explanation of basic concepts and terms

(1) Definition

Semi-supervised learning is a machine learning method that trains on a limited labeled data set and additionally uses the information contained in unlabeled data to obtain a better predictive model than the labeled data alone would allow.

(2) Samples

A typical semi-supervised learning problem involves a labeled data set D and an unlabeled data set U. The labeled data set D consists of a set of samples, such as images or text samples, each paired with a label. The unlabeled data set U is usually much larger and contains samples without label information. Given the unlabeled data set U and the labeled data set D, the goal is to learn a model f(x) that predicts whether a new sample x belongs to a certain class y, that is, to solve for a function y = f(x).

(3) Labeled and unlabeled samples

Some samples carry labels and are called labeled samples; the remaining samples are called unlabeled samples. Generally speaking, each sample in the labeled data set D corresponds to a label, while the samples in the unlabeled data set U carry no label information.

(4) Loss function

The loss function is used to measure the difference between the model’s predictions and the true labels; it represents the training error of the model. The most commonly used loss functions are the cross-entropy loss and the mean squared error.

The cross-entropy loss is used for binary classification tasks. Assuming the labels follow a Bernoulli distribution, the loss can be written as L = -∑i [ yi·log f(xi) + (1 - yi)·log(1 - f(xi)) ], where f(xi) is the model’s output for input sample xi and measures its confidence that the sample is positive. When the model predicts a sample correctly with high confidence, that sample’s contribution to the loss is close to 0; when it predicts incorrectly with high confidence, the contribution becomes very large.

The mean squared error is used for regression tasks. It defines the loss as the average squared Euclidean distance between the predicted values and the true values, L = (1/n) ∑i (f(xi) - yi)².
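As a quick check on both definitions, here is a minimal NumPy sketch that evaluates the two losses on a few made-up predictions:

import numpy as np

y_true = np.array([1, 0, 1, 1])           # toy binary labels
y_pred = np.array([0.9, 0.2, 0.7, 0.4])   # toy model outputs in (0, 1)

# Binary cross-entropy: -[y*log(f) + (1-y)*log(1-f)], averaged over samples
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Mean squared error: the average squared difference
mse = np.mean((y_true - y_pred) ** 2)

print('cross-entropy:', bce, 'mse:', mse)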

(5) Model evaluation indicators

Commonly used indicators to evaluate model performance include classification accuracy, precision, recall, and the F1 score. First, compute the confusion matrix CM of the classification result, in which each row corresponds to a true class and each column to a predicted class; the element Cij is the number of samples whose true class is i and whose predicted class is j. Classification accuracy is the sum of the diagonal elements divided by the sum of all elements. Accuracy measures the model’s predictive ability over all samples, but on imbalanced data it can be misleading. Precision and recall measure the model’s ability to identify positive examples from two angles: precision is the proportion of samples predicted positive that are truly positive, and recall is the proportion of truly positive samples that the model predicts as positive. The F1 score is the harmonic mean of precision and recall; the closer the value is to 1, the better the model.
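These indicators are all available in scikit-learn; a minimal sketch with made-up labels:

from sklearn import metrics

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy model predictions

# Rows of the confusion matrix are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_true, y_pred))
print('accuracy:', metrics.accuracy_score(y_true, y_pred))
print('precision:', metrics.precision_score(y_true, y_pred))
print('recall:', metrics.recall_score(y_true, y_pred))
print('F1:', metrics.f1_score(y_true, y_pred))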

(6) Sample weight

Given a sample set, how to assign weights to the samples is an important issue. During learning, sample weights mainly control how much each sample contributes to the gradient updates: samples with larger weights pull the model harder than samples with smaller weights. Different weighting schemes often affect the final learning result.
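Many scikit-learn estimators accept per-sample weights at fit time. The sketch below up-weights a rare class; the toy data and the factor of 4 are made up for illustration:

import numpy as np
from sklearn import svm

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 0, 1])                  # class 1 is rare

# Give the rare class four times the weight of the common class
weights = np.where(y == 1, 4.0, 1.0)
clf = svm.SVC(kernel='linear').fit(X, y, sample_weight=weights)
print(clf.predict([[2.5, 2.5]]))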

(7) Consistency Constraint

The consistency constraint is an important constraint in semi-supervised learning. It requires the model to produce consistent predictions for similar inputs: if a sample is slightly perturbed, or if two samples lie close together in input space, their predicted labels should agree. Consistency constraints can also promote agreement between models, so that even heterogeneous samples can be handled by a unified model.
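One common way to implement a consistency constraint is to penalize disagreement between a model’s predictions on a sample and on a slightly perturbed copy of it; because no labels appear in the penalty, it can be applied to unlabeled data. A minimal sketch, assuming `predict` is some model’s probability-output function:

import numpy as np

def consistency_loss(predict, X, noise_scale=0.1, seed=0):
    # Mean squared disagreement between predictions on clean and
    # perturbed inputs; usable on unlabeled samples
    rng = np.random.default_rng(seed)
    X_noisy = X + noise_scale * rng.standard_normal(X.shape)
    return np.mean((predict(X) - predict(X_noisy)) ** 2)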

(8) Clustering Constraint

The clustering constraint is an important constraint in semi-supervised learning. It requires that training samples cluster together as much as possible, forming several coherent subsets. This constraint leads to a better division of the samples and enhances the robustness of the model.

(9) Topological constraints

The topology constraint is an important constraint in semi-supervised learning. It requires the model to handle the dependencies between samples, that is, to exploit the topological structure (for example, a graph) that connects the training samples. In image classification tasks, for instance, training samples often have certain similarities. Topological constraints help extract better features, reduce sample redundancy, and improve the prediction accuracy of the model.

(10) Iteration Constraint

The iteration constraint is an important constraint in semi-supervised learning. It requires that certain conditions hold at every iteration; for example, the model’s predictions should gradually converge to a stable state. Otherwise, the model may get stuck in a poor local optimum and fail to reach the global optimum.

(11) Example Selection Constraint

The example selection constraint is an important constraint in semi-supervised learning. It requires the model to select only those samples that help strengthen its predictive ability, since too many irrelevant samples may hamper training.

3. Core algorithm principles, specific operating steps and mathematical formulas explained

(1) Naive Bayes classifier

The Naive Bayes classifier is a classification method based on statistics and probability theory. It predicts the category of a sample from the sample’s feature vector by combining two quantities: the prior probability of each class and the conditional probability of the features given the class.

The prediction rule of the Naive Bayes classifier is P(Y|X) = P(X|Y) · P(Y) / P(X), where Y is the category to be predicted and X is the feature vector of the sample. The “naive” assumption is that the features are conditionally independent given the class, so P(X|Y) factorizes into a product over the individual features.
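A minimal scikit-learn sketch of this rule in action, using the Gaussian variant of Naive Bayes on the iris data:

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Fit a Gaussian Naive Bayes classifier on the iris dataset
iris = datasets.load_iris()
clf = GaussianNB().fit(iris.data, iris.target)

# Posterior probabilities P(Y|X) for one sample, and the predicted class
print(clf.predict_proba(iris.data[:1]))
print(clf.predict(iris.data[:1]))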

(2) Maximum likelihood estimation

Maximum Likelihood Estimation (MLE) is a method of estimating model parameters. Assuming the samples are independently and identically distributed, the optimal parameters are found by maximizing the likelihood of the training data.

For a given training data set D = {x1, ..., xn}, the maximum likelihood estimation problem can be formalized as:

θ* = arg max P(D|θ) = arg max log P(D|θ) = arg max ∑i log P(xi|θ),

where D is the training data set and θ is the vector of model parameters. Taking the logarithm turns the product over i.i.d. samples into a sum without changing the maximizer.
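As a concrete case, for Bernoulli observations the maximizer of the log-likelihood has the closed form θ* = (1/n) ∑i xi, the sample mean. The sketch below confirms this numerically on made-up coin-flip data:

import numpy as np

data = np.array([1, 0, 1, 1, 1, 0, 1, 0])  # toy coin flips (made-up data)

def log_likelihood(theta):
    # Bernoulli log-likelihood of the whole sample for parameter theta
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Grid search over theta; the maximizer matches the sample mean
thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(theta_mle, data.mean())  # both close to 0.625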

The limitation of MLE is that when the number of samples of a certain class in the training data set is too small, the MLE estimate may deviate far from the true value; moreover, MLE yields only a point estimate of the parameters and carries no measure of their uncertainty.

(3) Generative and discriminative models

Generative models and discriminative models are the two main methods in semi-supervised learning.

Generative models assume that samples are generated from some underlying distribution, and such a model is able to generate samples similar to real ones. In a generative model, we posit a generative process G that produces samples x; from the training data we learn the class prior P(y) and the class-conditional distribution P(x|y), and the posterior P(y|x) needed for classification is then obtained through Bayes’ rule. The likelihood P(D|G,θ) of the samples and the parameters θ of the generative model are usually obtained through maximum likelihood estimation or, when latent variables are involved, the EM algorithm.

A discriminative model does not attempt to model how the data was generated; instead, it maps a sample x to a label y by directly learning the discriminant function y = f(x) (or the conditional distribution P(y|x)). The goal of the discriminative model is to minimize the expected risk of the discriminant function over the training data. When enough labeled data is available, the classification accuracy of discriminative models is often higher than that of generative models.
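To make the contrast concrete, the sketch below fits one model of each kind to the same synthetic data: GaussianNB as the generative model and logistic regression as the discriminative model (the setup is illustrative, not from the original text):

from sklearn import datasets, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = datasets.make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(X, y, random_state=0)

# Generative: models P(x|y) and P(y), then applies Bayes' rule
print('GaussianNB:', GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))
# Discriminative: models P(y|x) directly
print('LogisticRegression:',
      LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))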

(4) Hidden Markov Model (HMM)

The Hidden Markov Model (HMM) is a learning method for sequence data. It consists of a state sequence, which refers to the hidden internal states, and an observation sequence, which refers to the observed outputs.

Inference in an HMM uses the forward-backward algorithm, a dynamic programming algorithm with two phases. The forward phase recursively computes, for each position in the sequence, the probability of the observations so far together with each possible current state; the backward phase recursively computes the probability of the remaining observations given each possible current state.

The training goal of an HMM is to maximize the log-likelihood of the training data. The state sequence is governed by the initial state distribution π and the state transition probabilities A, and the observation sequence is governed by the emission probabilities B; these parameters are learned jointly, typically with the EM (Baum-Welch) algorithm or a variational Bayesian algorithm.
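A minimal NumPy sketch of the forward recursion for a two-state HMM; the initial distribution pi, transition matrix A, and emission matrix B are made-up toy parameters:

import numpy as np

pi = np.array([0.6, 0.4])                  # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # state transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities
obs = [0, 1, 0]                            # observed output sequence

# Forward recursion: alpha[i] = P(o_1..o_t, state_t = i)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

# Likelihood of the whole observation sequence
print('P(obs):', alpha.sum())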

(5) Graph structure learning

Graph Structured Learning (GSL) is a learning method used to represent dependencies between samples in semi-supervised learning. In GSL, samples are represented as nodes of a graph, edges represent dependencies between samples, and labels are encoded as node attributes. The main task of GSL is to discover the interdependencies present in the training samples and encode them into the structure of the model.

There are three main methods for graph structure learning:

1. Graph matching algorithms: a graph matching algorithm attempts to find a graph structure similar to the graph underlying the training data set. A commonly used algorithm is the Laplace correction algorithm.

2. Pairwise dependency learning algorithms: a pairwise dependency learning algorithm attempts to learn a set of pairwise dependencies. A commonly used algorithm is the Jaccard-coefficient dependency network (JCDEP).

3. Hierarchy learning algorithms: a hierarchy learning algorithm tries to find a tree-like structure similar to the tree underlying the training data set. A commonly used algorithm is hierarchical clustering (see the sketch after this list).
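As a small illustration of the third method, the sketch below runs scikit-learn’s agglomerative (hierarchical) clustering on made-up two-dimensional points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two loose groups of toy points
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Build the cluster hierarchy bottom-up and cut it into two clusters
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster indices may be swapped)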

(6) Reinforcement Learning

Reinforcement Learning is a learning paradigm whose goal is to maximize long-term reward: starting from an initial state, an agent continuously interacts with its environment, trying actions and observing the rewards they bring.

Reinforcement learning can be used to model sequential decision-making and combinatorial optimization problems. An agent can autonomously explore the environment in search of a globally good solution, and can also interact with human users to obtain feedback that improves its decisions.
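A minimal tabular Q-learning sketch on a made-up three-state chain environment, showing the interact-and-update loop (the environment and hyperparameters are illustrative):

import numpy as np

n_states, n_actions = 3, 2      # toy chain: move left (0) or right (1)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:    # the rightmost state is terminal
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update toward reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # the learned values should prefer moving right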

(7) Neural Network and Deep Learning

A neural network is a learning model built from layers of interconnected units (perceptrons) combined with nonlinear activation functions. It can capture complex nonlinear relationships and is well suited to classification tasks.

Deep learning refers to learning with deep neural networks, that is, networks with many layers, which can automatically learn complex feature representations.

4. Specific code examples and explanations

Below are several typical Python code examples of semi-supervised learning algorithms.

(1) Unsupervised feature learning

Unsupervised Feature Learning refers to learning feature representations from unlabeled samples. It can extract useful information, reduce dimensionality, and simplify model training. Its algorithm flow is as follows:

  1. Data preprocessing: Divide the data set into a training set and a test set.

  2. Feature extraction: Extracting useful features from the data set in some way.

  3. Clustering: Cluster the extracted features and divide similar samples into the same cluster.

  4. Dimensionality reduction: Reduce features to an appropriate dimension to simplify model training.

  5. Model training: train the model on the dimensionality-reduced features, using the training set.

  6. Test: Use the test set to test the performance of the model.

The following is an example of Python code:

from sklearn import cluster
from sklearn import datasets
from sklearn import decomposition
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import svm

# Load dataset and split it into train and test sets
iris = datasets.load_iris()
X_train, X_test, y_train, _ = \
    model_selection.train_test_split(iris.data, iris.target, random_state=42)

# Standardize with statistics fitted on the training set only, then
# extract two features with PCA
scaler = preprocessing.StandardScaler().fit(X_train)
pca = decomposition.PCA(n_components=2)
X_train = pca.fit_transform(scaler.transform(X_train))
X_test = pca.transform(scaler.transform(X_test))

# Cluster the training features; the cluster assignments serve as pseudo-labels
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

# Train an SVM classifier on the two-dimensional features and pseudo-labels
clf = svm.SVC(kernel='linear').fit(X_train, kmeans.labels_)

# Test how well the classifier reproduces the clustering on unseen data
print('Test accuracy:', clf.score(X_test, kmeans.predict(X_test)))

(2) Semi-supervised clustering

Semi-Supervised Clustering refers to using a small number of labeled samples to guide the clustering of unlabeled samples. Its algorithm flow is as follows:

  1. Data preprocessing: Divide the data set into a training set, a validation set, and a test set.

  2. Divide the training set into labeled samples (labeled training instances) and unlabeled samples (unlabeled training instances).

  3. Cluster unlabeled samples using an unsupervised learning algorithm such as K-Means.

  4. Label the resulting clusters using the labeled samples (for example, by majority vote; clusters matching no labeled class can be flagged as noise or anomalous), turning the clustered samples into pseudo-labeled samples.

  5. Train a supervised learning algorithm (such as a support vector machine) on the union of the labeled and pseudo-labeled samples.

  6. Test: Use the test set to test the performance of the model.

The following is an example of Python code:

import numpy as np
from sklearn import cluster
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import svm

# Generate a binary classification dataset and hide the labels of roughly
# half of the samples to simulate a semi-supervised setting
np.random.seed(0)
X, y = datasets.make_classification(n_samples=200, n_features=20, n_informative=5,
                                    n_redundant=5, n_clusters_per_class=2,
                                    class_sep=0.7, random_state=0)
mask = np.random.uniform(size=len(y)) > 0.5   # True marks unlabeled samples
X_lab, y_lab = X[~mask], y[~mask]             # labeled training instances
X_ulab = X[mask]                              # unlabeled training instances

# Split the labeled data into train and validation sets
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_lab, y_lab, test_size=0.5, random_state=0)

# Use the K-Means algorithm to cluster the unlabeled training instances
km = cluster.KMeans(n_clusters=2, init='random', max_iter=100, n_init=1,
                    random_state=0).fit(X_ulab)

# Label each cluster with the majority class of the labeled training samples
# assigned to it (a cluster containing no labeled samples keeps its own index)
train_clusters = km.predict(X_train)
cluster_label = {}
for c in range(2):
    members = y_train[train_clusters == c]
    cluster_label[c] = np.bincount(members).argmax() if len(members) else c
y_pseudo = np.array([cluster_label[c] for c in km.predict(X_ulab)])

# Combine the labeled samples with the pseudo-labeled samples and train an SVM
X_all = np.concatenate([X_train, X_ulab], axis=0)
y_all = np.concatenate([y_train, y_pseudo])
clf = svm.SVC(gamma='auto').fit(X_all, y_all)

# Evaluate the performance of the classifier on the validation set
y_pred = clf.predict(X_val)
print('Classification report:')
print(metrics.classification_report(y_val, y_pred))
print('Confusion matrix:\n', metrics.confusion_matrix(y_val, y_pred))

(3) Multi-task learning

Multi-Task Learning refers to the use of multiple related tasks to improve the prediction ability of the model. Its algorithm flow is as follows:

  1. Data preprocessing: Divide the data set into a training set, a validation set, and a test set.

  2. Task segmentation: derive multiple related subtasks from the training set, each with its own labels over the shared features.

  3. Train a model for each subtask.

  4. Predict labels for unlabeled samples using the model for each subtask.

  5. Splicing results: Splice the predicted labels together as the output of the entire model.

  6. Train the entire model: Use the training set to train the entire model.

  7. Test: Use the test set to test the performance of the model.

The following is an example of Python code (since MNIST provides only digit labels, the second task below is the parity of the digit):

import numpy as np
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshape and normalize input data
image_size = x_train.shape[1] * x_train.shape[2]
x_train = x_train.reshape(x_train.shape[0], image_size).astype('float32') / 255
x_test = x_test.reshape(x_test.shape[0], image_size).astype('float32') / 255

# Hold out a validation set
x_train, x_val = x_train[:-5000], x_train[-5000:]
y_train, y_val = y_train[:-5000], y_train[-5000:]

# Task 1 is digit identity (10 classes). MNIST carries no color labels, so as
# a second, related task we derive the parity of the digit (2 classes)
num_classes = 10
y_train_digit = keras.utils.to_categorical(y_train, num_classes)
y_val_digit = keras.utils.to_categorical(y_val, num_classes)
y_train_parity = keras.utils.to_categorical(y_train % 2, 2)
y_val_parity = keras.utils.to_categorical(y_val % 2, 2)

def make_model(n_outputs):
    # Shared architecture; only the output layer differs between tasks
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=image_size))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(n_outputs, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=RMSprop(lr=0.001),
                  metrics=['accuracy'])
    return model

# Fit each sub-task model separately
digit_model = make_model(num_classes)
digit_model.fit(x_train, y_train_digit, batch_size=128, epochs=20,
                verbose=1, validation_data=(x_val, y_val_digit))

parity_model = make_model(2)
parity_model.fit(x_train, y_train_parity, batch_size=128, epochs=20,
                 verbose=1, validation_data=(x_val, y_val_parity))

# Merge predictions of the individual models: weight each digit's probability
# by the predicted probability of that digit's parity, then take the argmax
digit_probs = digit_model.predict(x_test)    # shape (n_samples, 10)
parity_probs = parity_model.predict(x_test)  # shape (n_samples, 2)
digits = np.arange(num_classes)
merged = digit_probs * parity_probs[:, digits % 2]
predictions = merged.argmax(axis=1)

# Compute overall accuracy of the merged predictions
accuracy = np.mean(predictions == y_test)
print("Overall accuracy:", accuracy)

Semi-supervised learning is becoming an important research direction in machine learning. Currently, there are still many related algorithms and theories that require further research. For example, some algorithms can automatically find suitable unlabeled samples without human intervention; some algorithms can better handle the dependencies between samples and reduce redundant information; and some algorithms can handle more complex structured data.

In addition, the application scenarios of semi-supervised learning are also very rich. From image classification to text classification, sequence analysis, etc., semi-supervised learning has broad application prospects. In the future, semi-supervised learning will receive more and more attention because it helps solve various problems and improve the prediction performance of the model.