Introduction
Genetic algorithms as defined in Wikipedia are inspired by the process of natural selection proposed by Charles Darwin. More generally, we can use the following description to understand natural processes and how it relates to genetic algorithms.
For more content and Complete code follow the public account [Little Z’s Scientific Research Daily] to get it.
We start with an initial population with certain characteristics, as shown in Figure 1 . This initial population will be tested in a specific environment to see how the individuals (parents) in the population perform against predefined fitness criteria. Fitness in machine learning can be any performance metric – accuracy, precision, recall, F1 score, auc, etc. Based on the fitness values, we select the best-performing parents (“survival of the fittest”) as the survivor group (Fig. 2).
Figure 1 Initialization population
Figure 2 Survival of the fittest
The parents in the surviving population will now mate to produce offspring using a combination of two steps: crossover/recombination and mutation. In the case of crossover, the genes (parameters) from the mating parents are recombined to produce offspring, with each child inheriting some genes (parameters) from each parent (Figure 3).
Figure 3 Crossover operation
Finally, in the case of mutations, some values (parameters) of the gene will be changed to maintain genetic diversity (Fig. 4). This allows natural/genetic algorithms to generally come up with better solutions.
Figure 4 Mutation
Figure 5 shows the second generation population, which includes surviving parents and children. We retain the surviving parent in order to retain the best fitness parameter in case the offspring has worse fitness values than the parent.
Figure 5 Second generation population
XGBoost’s genetic algorithm module
We will create a genetic algorithm module customized for XGBoost. Here is the description of XGboost:
XGBoost is an optimized distributed gradient boosting library designed to be fun, flexible and portable. He implements machine learning algorithms under the Gradient Boosting framework.
This module will have functions that follow four steps:
(i)Initialization
(ii) Choice
(iii) Crossover
(iv)Mutation
Initialization:
The first step is initialization, where parameters are randomly initialized to create the population. It is similar to the first-generation population shown in Figure 1 . The code below shows the initialization process where we generate a vector containing parameters. For XGBoost, we selected 7 parameters for optimization: learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree and gamma. .
Limits on parameters are either based on the limits described in the XGBoost documentation, or on reasonable guesses (if the upper limit is set to infinity). We first create an empty array for each parameter and then fill it with random values.
Parent selection (survival of the fittest):
In the second step, we train the model using the initial population and calculate the fitness value. In this case, we will calculate the F1 score. We will define how many parents to select and create an array based on the fitness values of the selected parents.
Crossover:
There are many ways to define crossover in genetic algorithms, such as single-point crossover, two-point crossover, k-point crossover, uniform crossover, ordered list crossover, etc. We will use uniform crossover, where each parameter of the children will be independently selected from the parent according to a certain distribution. In our example we will use the “discrete uniform” distribution of the numpy random function.
Mutation:
The final step is to introduce diversity to the offspring by randomly selecting a parameter and randomly changing its value. We will also introduce some restrictions to limit the changed values to a certain range. Skipping these limits may cause errors.
code show as below:
Initialization:
def initialize_poplulation(numberOfParents): learningRate = np.empty([numberOfParents, 1]) nEstimators = np.empty([numberOfParents, 1], dtype = np.uint8) maxDepth = np.empty([numberOfParents, 1], dtype = np.uint8) minChildWeight = np.empty([numberOfParents, 1]) gammaValue = np.empty([numberOfParents, 1]) subSample = np.empty([numberOfParents, 1]) colSampleByTree = np.empty([numberOfParents, 1]) for i in range(numberOfParents): print(i) learningRate[i] = round(random.uniform(0.01, 1), 2) nEstimators[i] = random.randrange(10, 1500, step = 25) maxDepth[i] = int(random.randrange(1, 10, step= 1)) minChildWeight[i] = round(random.uniform(0.01, 10.0), 2) gammaValue[i] = round(random.uniform(0.01, 10.0), 2) subSample[i] = round(random.uniform(0.01, 1.0), 2) colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2) population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight, gammaValue, subSample, colSampleByTree), axis= 1) return population
Parent selection (survival of the fittest):
def fitness_f1score(y_true, y_pred): fitness = round((f1_score(y_true, y_pred, average='weighted')), 4) return fitness #Train the data and find the fitness score def train_population(population, dMatrixTrain, dMatrixtest, y_test): fScore = [] for i in range(population.shape[0]): param = { 'objective':'binary:logistic', 'learning_rate': population[i][0], 'n_estimators': population[i][1], 'max_depth': int(population[i][2]), 'min_child_weight': population[i][3], 'gamma': population[i][4], 'subsample': population[i][5], 'colsample_bytree': population[i][6], 'seed': 24} num_round = 100 xgbT = xgb.train(param, dMatrixTrain, num_round) preds = xgbT.predict(dMatrixtest) preds = preds>0.5 fScore.append(fitness_f1score(y_test, preds)) return fScore #select parents for mating def new_parents_selection(population, fitness, numParents): selectedParents = np.empty((numParents, population.shape[1])) #create an array to store fittest parents #Find the best performing parent for parentId in range(numParents): bestFitnessId = np.where(fitness == np.max(fitness)) bestFitnessId = bestFitnessId[0][0] selectedParents[parentId, :] = population[bestFitnessId, :] fitness[bestFitnessId] = -1 #set this value to negative, in case of F1-score, so this parent is not selected again return selectedParents
Crossover:
def crossover_uniform(parents, childrenSize): crossoverPointIndex = np.arange(0, np.uint8(childrenSize[1]), 1, dtype= np.uint8) #get all the index crossoverPointIndex1 = np.random.randint(0, np.uint8(childrenSize[1]), np.uint8(childrenSize[1]/2)) # select half of the indexes randomly crossoverPointIndex2 = np.array(list(set(crossoverPointIndex) - set(crossoverPointIndex1))) #select leftover indexes children = np.empty(childrenSize) for i in range(childrenSize[0]): #find parent 1 index parent1_index = i%parents.shape[0] #find parent 2 index parent2_index = (i + 1)%parents.shape[0] #insert parameters based on randomly selected indexes in parent 1 children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1] #insert parameters based on randomly selected indexes in parent 1 children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2] return children
Mutation:
def mutation(crossover, numberOfParameters): #Define minimum and maximum values allowed for each parameter minMaxValue = np.zeros((numberOfParameters, 2)) minMaxValue[0:] = [0.01, 1.0] #min/max learning rate minMaxValue[1, :] = [10, 2000] #min/max n_estimator minMaxValue[2, :] = [1, 15] #min/max depth minMaxValue[3, :] = [0, 10.0] #min/max child_weight minMaxValue[4, :] = [0.01, 10.0] #min/max gamma minMaxValue[5, :] = [0.01, 1.0] #min/maxsubsample minMaxValue[6, :] = [0.01, 1.0] #min/maxcolsample_bytree # Mutation changes a single gene in each offspring randomly. mutationValue = 0 parameterSelect = np.random.randint(0, 7, 1) print(parameterSelect) if parameterSelect == 0: #learning_rate mutationValue = round(np.random.uniform(-0.5, 0.5), 2) if parameterSelect == 1: #n_estimators mutationValue = np.random.randint(-200, 200, 1) if parameterSelect == 2: #max_depth mutationValue = np.random.randint(-5, 5, 1) if parameterSelect == 3: #min_child_weight mutationValue = round(np.random.uniform(5, 5), 2) if parameterSelect == 4: #gamma mutationValue = round(np.random.uniform(-2, 2), 2) if parameterSelect == 5: #subsample mutationValue = round(np.random.uniform(-0.5, 0.5), 2) if parameterSelect == 6: #colsample mutationValue = round(np.random.uniform(-0.5, 0.5), 2) #indtroduce mutation by changing one parameter, and set to max or min if it goes out of range for idx in range(crossover.shape[0]): crossover[idx, parameterSelect] = crossover[idx, parameterSelect] + mutationValue if(crossover[idx, parameterSelect] > minMaxValue[parameterSelect, 1]): crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 1] if(crossover[idx, parameterSelect] < minMaxValue[parameterSelect, 0]): crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 0] return crossover
Genetic Algorithm Optimization XGBoost
We will implement the modules discussed above to train the dataset. This dataset comes from the UCI Machine Learning Repository. It contains a set of 102 molecules, 39 of which have been identified by humans as having odors that could be used in fragrances, while 69 do not have the desired odor. The data set contains 6,590 low-energy conformations of these molecules, containing 166 features. As the goal of this tutorial, we are doing the least amount of preconceptions to understand genetic algorithms.
import numpy as np import pandas as pd import geneticXGboost import xgboost as xgb np.random.seed(723) dataset = pd.read_csv('clean2.data', header=None) X = dataset.iloc[:, 2:168].values #discard first two coloums as these are molecule's name and conformation's name y = dataset.iloc[:, 168].values #extrtact last coloum as class (1 => desired odor, 0 => undesired odor) from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 97) from sklearn.preprocessing import StandardScaler sc = StandardScaler() X_train = sc.fit_transform(X_train) X_test = sc.transform(X_test) xgDMatrix = xgb.DMatrix(X_train, y_train) #create Dmatrix xgbDMatrixTest = xgb.DMatrix(X_test, y_test)
We started with 8 parents and we selected the 4 most suitable ones to mate. We will create 4 generations and monitor fitness (F1 score). Half of the parents of the next generation will be the most suitable parents selected from the previous generation. In this way, we can maintain at least the same optimal fitness score as the previous generation when the fitness score of the offspring is lower.
numberOfParents = 8 #number of parents to start numberOfParentsMating = 4 #number of parents that will mate numberOfParameters = 7 #number of parameters that will be optimized numberOfGenerations = 4 #number of genration that will be created populationSize = (numberOfParents, numberOfParameters) population = geneticXGboost.initilialize_poplulation(numberOfParents) fitnessHistory = np.empty([numberOfGenerations + 1, numberOfParents]) populationHistory = np.empty([(numberOfGenerations + 1)*numberOfParents, numberOfParameters]) populationHistory[0:numberOfParents, :] = population for generation in range(numberOfGenerations): print("This is number %s generation" % (generation)) fitnessValue = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test) fitnessHistory[generation, :] = fitnessValue print('Best F1 score in the this iteration = {}'.format(np.max(fitnessHistory[generation, :]))) parents = geneticXGboost.new_parents_selection(population=population, fitness=fitnessValue, numParents=numberOfParentsMating) children = geneticXGboost.crossover_uniform(parents=parents, childrenSize=(populationSize[0] - parents.shape[0], numberOfParameters)) children_mutated = geneticXGboost.mutation(children, numberOfParameters) population[0:parents.shape[0], :] = parents #fittest parents population[parents.shape[0]:, :] = children_mutated #children populationHistory[(generation + 1)*numberOfParents : (generation + 1)*numberOfParents + numberOfParents , :] = population
Finally, we get the best score and related parameters:
fitness = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test) fitnessHistory[generation + 1, :] = fitness #index of the best solution bestFitnessIndex = np.where(fitness == np.max(fitness))[0][0] #Bestfitness print("Best fitness is =", fitness[bestFitnessIndex]) #Best parameters print("Best parameters are:") print('learning_rate', population[bestFitnessIndex][0]) print('n_estimators', population[bestFitnessIndex][1]) print('max_depth', int(population[bestFitnessIndex][2])) print('min_child_weight', population[bestFitnessIndex][3]) print('gamma', population[bestFitnessIndex][4]) print('subsample', population[bestFitnessIndex][5]) print('colsample_bytree', population[bestFitnessIndex][6])
Now let us visualize the fitness changes of each generation of the population (below). Although we already started with a high F1 score (~0.98), in a randomly generated initial population of two parents, we were able to improve it further in the final generation. The lowest F1 score for a parent in the initial population was 0.9143, and the best score for a parent in the final generation was 0.9947. This shows that we can improve the performance metrics of XGBoost by simply implementing a genetic algorithm.
The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 56,890 people are learning the system