Tuning hyperparameters in XGBoost using genetic algorithms

Introduction

Genetic algorithms, as Wikipedia defines them, are inspired by the process of natural selection proposed by Charles Darwin. More concretely, we can use the following description to understand natural selection and how it relates to genetic algorithms.


We start with an initial population that has certain characteristics, as shown in Figure 1. This initial population is tested in a specific environment to see how the individuals (parents) in the population perform against predefined fitness criteria. In machine learning, fitness can be any performance metric: accuracy, precision, recall, F1 score, AUC, and so on. Based on the fitness values, we select the best-performing parents ("survival of the fittest") as the survivor group (Figure 2).

Figure 1 Initial population

Figure 2 Survival of the fittest

The parents in the surviving population will now mate to produce offspring using a combination of two steps: crossover/recombination and mutation. In the case of crossover, the genes (parameters) from the mating parents are recombined to produce offspring, with each child inheriting some genes (parameters) from each parent (Figure 3).

Figure 3 Crossover operation

Finally, in the case of mutation, some gene values (parameters) are changed to maintain genetic diversity (Figure 4). This is what generally allows nature, and genetic algorithms, to arrive at better solutions.

Figure 4 Mutation

Figure 5 shows the second-generation population, which includes both the surviving parents and their children. We keep the surviving parents so that the best set of parameters is preserved in case the offspring have worse fitness values than their parents.

Figure 5 Second generation population
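
The whole process in Figures 1-5 can be summarized as a simple loop. The sketch below is schematic Python pseudocode; the function names are placeholders, not the module we build in the next section.

population = initialize_population()              # Figure 1: random initial parents
for generation in range(number_of_generations):
    fitness = evaluate_fitness(population)        # e.g. accuracy, F1 score, AUC
    parents = select_fittest(population, fitness) # Figure 2: survival of the fittest
    children = crossover(parents)                 # Figure 3: recombine parent genes (parameters)
    children = mutate(children)                   # Figure 4: random changes for diversity
    population = combine(parents, children)       # Figure 5: next generation keeps the best parents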

XGBoost’s genetic algorithm module

We will create a genetic algorithm module customized for XGBoost. Here is how the XGBoost documentation describes it:

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

This module will have functions that follow four steps:

(i) Initialization

(ii) Selection

(iii) Crossover

(iv) Mutation

Initialization:

The first step is initialization, in which the parameters are randomly initialized to create the population. It corresponds to the first-generation population shown in Figure 1. The code below shows the initialization process, where we generate a vector of parameters for each individual. For XGBoost, we selected 7 parameters to optimize: learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree and gamma.

The limits on the parameters are either based on the ranges described in the XGBoost documentation or on reasonable guesses (where the documented upper limit is infinity). We first create an empty array for each parameter and then fill it with random values.

Parent selection (survival of the fittest):

In the second step, we train a model for each individual in the population and calculate its fitness value; in this case we use the F1 score. We then define how many parents to select and build an array of the fittest parents based on their fitness values.

Crossover:

There are many ways to define crossover in genetic algorithms: single-point crossover, two-point crossover, k-point crossover, uniform crossover, ordered crossover, and so on. We will use uniform crossover, where each parameter of a child is independently taken from one of its parents. The parameter indexes are drawn with NumPy's discrete uniform random function (np.random.randint).

Mutation:

The final step is to introduce diversity into the offspring by randomly selecting one parameter and randomly changing its value. We also introduce limits that clip the mutated value to an allowed range; omitting these limits can cause XGBoost to raise errors.

The code is shown below:

Initialization:

import random

import numpy as np

def initialize_population(numberOfParents):
    """Randomly initialize the 7 XGBoost hyperparameters for each parent."""
    learningRate = np.empty([numberOfParents, 1])
    nEstimators = np.empty([numberOfParents, 1], dtype = np.uint16)
    maxDepth = np.empty([numberOfParents, 1], dtype = np.uint8)
    minChildWeight = np.empty([numberOfParents, 1])
    gammaValue = np.empty([numberOfParents, 1])
    subSample = np.empty([numberOfParents, 1])
    colSampleByTree = np.empty([numberOfParents, 1])

    for i in range(numberOfParents):
        learningRate[i] = round(random.uniform(0.01, 1), 2)
        nEstimators[i] = random.randrange(10, 1500, step = 25)
        maxDepth[i] = int(random.randrange(1, 10, step = 1))
        minChildWeight[i] = round(random.uniform(0.01, 10.0), 2)
        gammaValue[i] = round(random.uniform(0.01, 10.0), 2)
        subSample[i] = round(random.uniform(0.01, 1.0), 2)
        colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2)

    #Each row is one parent; each column is one hyperparameter
    population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight, gammaValue, subSample, colSampleByTree), axis = 1)
    return population
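
As a quick sanity check (not part of the original script), initializing a population of 4 parents gives a 4 x 7 array, one row per parent and one column per hyperparameter:

population = initialize_population(4)
print(population.shape)  # (4, 7): learning_rate, n_estimators, max_depth,
                         # min_child_weight, gamma, subsample, colsample_bytree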

Parent selection (survival of the fittest):

from sklearn.metrics import f1_score
import xgboost as xgb

def fitness_f1score(y_true, y_pred):
    """Fitness = weighted F1 score of the predictions."""
    fitness = round((f1_score(y_true, y_pred, average='weighted')), 4)
    return fitness

#Train one model per individual and collect the fitness scores
def train_population(population, dMatrixTrain, dMatrixtest, y_test):
    fScore = []
    for i in range(population.shape[0]):
        param = { 'objective':'binary:logistic',
              'learning_rate': population[i][0],
              'n_estimators': population[i][1],
              'max_depth': int(population[i][2]),
              'min_child_weight': population[i][3],
              'gamma': population[i][4],
              'subsample': population[i][5],
              'colsample_bytree': population[i][6],
              'seed': 24}
        #Note: xgb.train ignores 'n_estimators'; the number of boosting rounds
        #is controlled by num_round (num_boost_round) below
        num_round = 100
        xgbT = xgb.train(param, dMatrixTrain, num_round)
        preds = xgbT.predict(dMatrixtest)
        preds = preds > 0.5
        fScore.append(fitness_f1score(y_test, preds))
    return fScore

#Select the fittest parents for mating
def new_parents_selection(population, fitness, numParents):
    selectedParents = np.empty((numParents, population.shape[1])) #array to store the fittest parents

    #Find the best-performing parents, one at a time
    for parentId in range(numParents):
        bestFitnessId = np.where(fitness == np.max(fitness))
        bestFitnessId = bestFitnessId[0][0]
        selectedParents[parentId, :] = population[bestFitnessId, :]
        fitness[bestFitnessId] = -1 #F1 scores are non-negative, so this parent is never selected again
    return selectedParents
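
For illustration (assuming dMatrixTrain, dMatrixtest and y_test are the DMatrix objects and test labels created later in this article), the two functions are combined like this: score every individual, then keep the 4 fittest as parents.

fScore = train_population(population, dMatrixTrain, dMatrixtest, y_test)
parents = new_parents_selection(population, fScore, numParents=4)
print(parents.shape)  # (4, 7)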

Crossover:

def crossover_uniform(parents, childrenSize):
    
    crossoverPointIndex = np.arange(0, np.uint8(childrenSize[1]), 1, dtype = np.uint8) #all parameter indexes
    crossoverPointIndex1 = np.random.randint(0, np.uint8(childrenSize[1]), np.uint8(childrenSize[1]/2)) #randomly select half of the indexes
    crossoverPointIndex2 = np.array(list(set(crossoverPointIndex) - set(crossoverPointIndex1))) #the leftover indexes
    
    children = np.empty(childrenSize)
    
    #Create one child at a time, pairing consecutive parents
    for i in range(childrenSize[0]):
        
        #find parent 1 index
        parent1_index = i % parents.shape[0]
        #find parent 2 index
        parent2_index = (i + 1) % parents.shape[0]
        #insert parameters from parent 1 at the randomly selected indexes
        children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1]
        #insert parameters from parent 2 at the remaining indexes
        children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2]
    return children
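
An illustrative call (assuming parents comes from new_parents_selection above): producing 4 children, each with 7 parameters, so that parents plus children restore the original population size of 8.

children = crossover_uniform(parents, childrenSize=(4, 7))
print(children.shape)  # (4, 7): each child mixes parameters from two parents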

Mutation:

def mutation(crossover, numberOfParameters):
    #Define minimum and maximum values allowed for each parameter
    minMaxValue = np.zeros((numberOfParameters, 2))
    
    minMaxValue[0, :] = [0.01, 1.0]  #min/max learning_rate
    minMaxValue[1, :] = [10, 2000]   #min/max n_estimators
    minMaxValue[2, :] = [1, 15]      #min/max max_depth
    minMaxValue[3, :] = [0, 10.0]    #min/max min_child_weight
    minMaxValue[4, :] = [0.01, 10.0] #min/max gamma
    minMaxValue[5, :] = [0.01, 1.0]  #min/max subsample
    minMaxValue[6, :] = [0.01, 1.0]  #min/max colsample_bytree
 
    #Mutation changes a single, randomly chosen gene in each offspring
    mutationValue = 0
    parameterSelect = np.random.randint(0, 7)
    if parameterSelect == 0:   #learning_rate
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    elif parameterSelect == 1: #n_estimators
        mutationValue = np.random.randint(-200, 200)
    elif parameterSelect == 2: #max_depth
        mutationValue = np.random.randint(-5, 5)
    elif parameterSelect == 3: #min_child_weight (symmetric range, like the other parameters)
        mutationValue = round(np.random.uniform(-5, 5), 2)
    elif parameterSelect == 4: #gamma
        mutationValue = round(np.random.uniform(-2, 2), 2)
    elif parameterSelect == 5: #subsample
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    elif parameterSelect == 6: #colsample_bytree
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
  
    #Introduce the mutation by changing one parameter, clipping it to the min/max range
    for idx in range(crossover.shape[0]):
        crossover[idx, parameterSelect] = crossover[idx, parameterSelect] + mutationValue
        if crossover[idx, parameterSelect] > minMaxValue[parameterSelect, 1]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 1]
        if crossover[idx, parameterSelect] < minMaxValue[parameterSelect, 0]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 0]
    return crossover
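
A hypothetical usage example, continuing from the crossover output above: every child gets the same random perturbation applied to one randomly chosen hyperparameter, clipped to its allowed range.

children_mutated = mutation(children, numberOfParameters=7)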

Optimizing XGBoost with the genetic algorithm

We will now use the module defined above to tune XGBoost on a dataset from the UCI Machine Learning Repository. It contains a set of 102 molecules, 39 of which were identified by humans as having odors that could be used in fragrances, while 69 do not have the desired odor. The dataset contains 6,590 low-energy conformations of these molecules, each described by 166 features. For the purposes of this tutorial we do minimal preprocessing, so that we can focus on the genetic algorithm.

import numpy as np
import pandas as pd
import geneticXGboost
import xgboost as xgb
np.random.seed(723)
dataset = pd.read_csv('clean2.data', header=None)

X = dataset.iloc[:, 2:168].values #discard the first two columns (molecule name and conformation name)
y = dataset.iloc[:, 168].values #extract the last column as the class (1 => desired odor, 0 => undesired odor)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 97)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

xgDMatrix = xgb.DMatrix(X_train, y_train) #create Dmatrix
xgbDMatrixTest = xgb.DMatrix(X_test, y_test)

We start with 8 parents and select the 4 fittest to mate. We will run 4 generations and monitor the fitness (F1 score). Half of the next generation consists of the fittest parents selected from the previous generation; this way, even when all offspring have lower fitness scores, the best fitness score of the previous generation is carried over.

numberOfParents = 8 #number of parents to start with
numberOfParentsMating = 4 #number of parents that will mate
numberOfParameters = 7 #number of parameters that will be optimized
numberOfGenerations = 4 #number of generations that will be created
 
populationSize = (numberOfParents, numberOfParameters)
 
#initial population of randomly generated hyperparameters
population = geneticXGboost.initialize_population(numberOfParents)
 
#fitness of every individual in every generation (plus the final population)
fitnessHistory = np.empty([numberOfGenerations + 1, numberOfParents])
 
#parameter values of every generation
populationHistory = np.empty([(numberOfGenerations + 1)*numberOfParents, numberOfParameters])
 
populationHistory[0:numberOfParents, :] = population

for generation in range(numberOfGenerations):
    print("This is generation number %s" % (generation))

    #train the population and evaluate its fitness
    fitnessValue = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
    fitnessHistory[generation, :] = fitnessValue
    
    print('Best F1 score in this iteration = {}'.format(np.max(fitnessHistory[generation, :])))

    #survival of the fittest: select the best parents for mating
    parents = geneticXGboost.new_parents_selection(population=population, fitness=fitnessValue, numParents=numberOfParentsMating)
    
    #crossover: create children from the selected parents
    children = geneticXGboost.crossover_uniform(parents=parents, childrenSize=(populationSize[0] - parents.shape[0], numberOfParameters))
 
    #mutation: add random changes to the children
    children_mutated = geneticXGboost.mutation(children, numberOfParameters)
    
    population[0:parents.shape[0], :] = parents #fittest parents
    population[parents.shape[0]:, :] = children_mutated #children
    
    populationHistory[(generation + 1)*numberOfParents : (generation + 1)*numberOfParents + numberOfParents, :] = population

Finally, we get the best score and related parameters:

fitness = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
fitnessHistory[generation + 1, :] = fitness
#index of the best solution
bestFitnessIndex = np.where(fitness == np.max(fitness))[0][0]
#Best fitness
print("Best fitness is =", fitness[bestFitnessIndex])
#Best parameters
print("Best parameters are:")
print('learning_rate', population[bestFitnessIndex][0])
print('n_estimators', population[bestFitnessIndex][1])
print('max_depth', int(population[bestFitnessIndex][2]))
print('min_child_weight', population[bestFitnessIndex][3])
print('gamma', population[bestFitnessIndex][4])
print('subsample', population[bestFitnessIndex][5])
print('colsample_bytree', population[bestFitnessIndex][6])

Now let us visualize how the fitness of the population changes from generation to generation (see the plotting sketch below). Although we already started with a high F1 score (~0.98) from the randomly generated initial population of parents, we were able to improve it further in the final generation. The lowest F1 score for a parent in the initial population was 0.9143, and the best score for a parent in the final generation was 0.9947. This shows that we can improve XGBoost's performance metric simply by applying a genetic algorithm.
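
A minimal matplotlib sketch for producing such a plot, assuming the fitnessHistory array filled in by the script above, could look like this:

import matplotlib.pyplot as plt

# Best and mean F1 score in each generation (one row of fitnessHistory per generation)
best_per_generation = np.max(fitnessHistory, axis=1)
mean_per_generation = np.mean(fitnessHistory, axis=1)

plt.plot(best_per_generation, label='Best F1 score')
plt.plot(mean_per_generation, label='Mean F1 score')
plt.xlabel('Generation')
plt.ylabel('F1 score')
plt.legend()
plt.show()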
