LLM – GPU computing power evaluation during training and inference

Table of Contents

1. Introduction

2. FLOPs and TFLOPs

◆ FLOPs [Floating Point Operations Per Second]

◆ TFLOPs [Tera Floating Point Operations Per Second]

3. GPU consumption during training phase

◆ Factors affecting training

◆ GPT-3 training statistics

◆ Custom training GPU evaluation

4. GPU consumption during inference phase

◆ Factors affecting inference

◆ Custom inference GPU evaluation

◆ Calculate the difference

5. Calculation method based on Token

6. Summary


1. Introduction

In the era of large language models, GPU computing power is crucial: both training and inference depend on large-scale GPU support. For work I needed to estimate the GPU computing power required for LLM tasks, and after searching I found relatively few articles online on how to do this estimation, so I am recording my approach here. If anything is off, please point it out in the comment section.

2. FLOPs and TFLOPs

Before introducing the evaluation of GPU computing power, we first need to understand the common evaluation indicators of GPU computing power.

FLOPs [Floating Point Operations Per Second]

Indicates the number of floating point operations performed by the device per second.

TFLOPs [Tera Floating Point Operations Per Second]

Represents the number of floating-point operations performed by the device per second in units of one trillion, i.e., 10^12 operations per second.

Floating point operations refer to calculations involving decimals, such as addition, subtraction, multiplication and division. TFLOPs is a larger unit than FLOPs and is used to measure higher levels of computing performance. Taking the L40S as an example, at FP32 precision its computing power reaches 91.6 TFLOPS, while the A800 offers only 19.5 TFLOPS and the RTX 3060 about 12.5 TFLOPS. A supercomputer can reach 1,000 trillion floating-point operations per second (1,000 TFLOPS) or more.
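Since these units differ only by powers of ten, a minimal conversion sketch may help keep them straight (the device figures are just the example FP32 values quoted above):

# Minimal unit-conversion sketch: 1 TFLOPS = 10**12 FLOPS.
TFLOPS = 10 ** 12

devices = {
    "L40S": 91.6,      # TFLOPS at FP32
    "A800": 19.5,      # TFLOPS at FP32
    "RTX 3060": 12.5,  # TFLOPS at FP32
}

for name, tflops in devices.items():
    print(f"{name}: {tflops * TFLOPS:.3e} FLOPS")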

3. GPU consumption in the training phase

Factors affecting training

Before doing any calculation, let's look at the factors that affect the training phase.

- Training data size

- Model parameter scale

- Number of training epochs

- Graphics card computing power

GPT-3 training statistics

We introduced FLOPs and TFLOPs above; ZettaFLOPs, used here, denotes 10^21 floating point operations per second. The difference between these indicators is only the unit. Taking GPT-3 as a reference: it is a 175 billion parameter model trained on roughly 45 TB of data, and one training run requires approximately 175 ZettaFLOPs, i.e., 1.75 x 10^23 floating point operations. Since it is not clear whether this 175 ZettaFLOPs figure refers to one epoch or to a complete training run, the calculations below treat it as the cost of a complete training run.

Custom training GPU evaluation

Assume the model we are training is LLaMA-33B with 33 billion parameters (330 in units of 100 million, versus GPT-3's 1750), a training data size of 50 GB (50 / 1024 ≈ 0.0488 TB, to match the 45 TB reference), and an A800 GPU providing 19.5 TFLOPS at FP32. The total compute required to train this model, scaled linearly from the GPT-3 figures, is:

NeedFLOPs = (330 / 1750) * (0.048828125 / 45) * 1.75 * 10^23 FLOPs

The device we use is A800, and the computing power it can provide is:

CalcByA800 = 19.5 * 10^12 FLOPs

Finally, factor in the training time. For example, if training must finish within 5 days, the number of seconds in 5 days is:

TrainTime = 86400 * 5

The final number of GPUs required is:

GPUCount = NeedFLOPs / (CalcByA800 * TrainTime)

In order to facilitate calculation, we directly rewrite it as Python code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-


def calc_gpu_num(_args_num, _data_size, _train_days):
    """
    Returns the estimated number of GPUs (not rounded up).

    Parameters
    ----------
    _args_num: number of model parameters, in units of 100 million,
               e.g. 330 for LLaMA-33B versus 1750 for GPT-3
    _data_size: training data size in TB (e.g. 50 GB / 1024 ≈ 0.0488 TB)
    _train_days: number of days available for training
    """
    # Scale GPT-3's ~1.75e23 FLOPs linearly by parameter count and data size
    need_flops = (_args_num / 1750) * (_data_size / 45) * 1.75 * 10 ** 23
    # A800 FP32 throughput: 19.5 TFLOPS = 19.5e12 FLOPS
    calc_by_a800 = 19.5 * 10 ** 12
    # Total available training time in seconds
    train_time = 86400 * _train_days
    gpu_count = need_flops / (calc_by_a800 * train_time)
    return gpu_count


if __name__ == '__main__':
    args_num = 330              # LLaMA-33B
    data_size = 0.048828125     # 50 GB expressed in TB (50 / 1024)
    train_days = 5              # train for 5 days
    count = calc_gpu_num(args_num, data_size, train_days)
    print(count)

The calculated count is 4.250628165558721. Rounding up, training the LLaMA-33B model on 50 GB of data in 5 days requires 5 A800s. Note that GPU memory (VRAM), communication overhead and other performance losses are not considered here; this is only a rough estimate.
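Since a fractional card count is not directly usable, a short usage sketch (assuming the calc_gpu_num function from the script above is in scope) rounds the estimate up:

import math

# Round the fractional GPU estimate up to a whole number of cards.
raw_count = calc_gpu_num(330, 0.048828125, 5)  # ~4.25 for LLaMA-33B, 50 GB, 5 days
print(math.ceil(raw_count))                    # -> 5 A800s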

4. GPU consumption in the inference phase

Factors affecting reasoning

- Input and output data

- Model parameter scale

- Graphics card computing power

The inference stage needs to account for the sum of the input and output text. In actual computation, the text is tokenized into token_ids for the transformer to process. The ratio of Chinese characters to tokens is approximately 1:2, i.e., 1 Chinese character corresponds to about 2 token_ids. The compute required for GPU inference is proportional to the total input-plus-output length L, the model dimension D and the number of layers N.

Custom inference GPU evaluation

Assume the input query is 100 Chinese characters and the output text is 1,000 characters; then L = (100 + 1000) * 2 = 2200 tokens. With model dimension D = 1280 and number of layers N = 96, the required compute is:

NeedFLOPs ≈ L * D * N = 2200 * 1280 * 96 = 270336000 ≈ 2.7 * 10^8

Suppose we use an A800 and want to complete this inference request within 1 second:

CalcByA800 = 19.5 * 10^12 FLOPs

Then the number of GPUs required:

count = 270336000 / (19.5 * 10**12) = 1.3863384615384615e-05

Conversely, CalcByA800 / NeedFLOPs tells us that a single A800 could serve about 72,000 such users with a reply within 1 second, although this is only the ideal case:

people = (19.5 * 10**12) / 270336000 = 72132.45738636363
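To put the inference-side arithmetic in one place, here is a small sketch using the same example numbers (100 input and 1,000 output Chinese characters, D = 1280, N = 96, an A800 at 19.5 TFLOPS, and a 1-second latency target); the function name and parameters are only illustrative:

def inference_gpu_estimate(input_chars, output_chars, dim, layers, gpu_flops, time_s=1.0):
    # ~2 token_ids per Chinese character
    tokens = (input_chars + output_chars) * 2
    # Simplified per-request cost: L * D * N
    need_flops = tokens * dim * layers
    # GPUs needed to answer one request within time_s seconds
    gpu_count = need_flops / (gpu_flops * time_s)
    # Requests one GPU could serve within time_s in the ideal case
    users_per_gpu = (gpu_flops * time_s) / need_flops
    return need_flops, gpu_count, users_per_gpu


if __name__ == '__main__':
    flops, gpus, users = inference_gpu_estimate(100, 1000, 1280, 96, 19.5 * 10 ** 12)
    print(flops)  # 270336000
    print(gpus)   # ~1.39e-05
    print(users)  # ~72132 (ideal case)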

Calculate the difference

Inference covers both text-oriented and computation-oriented tasks, such as generating a coherent story or working through mathematical logic, yet these consume almost the same compute. This is because the essence of an LLM is a language model: it only processes token_ids, and what inference produces is a distribution over the next token (next_token). The concrete form of the task is irrelevant; different types of tasks with similar input and output lengths will have similar compute consumption.

5. Calculation method based on Token

The methods above are all based on FLOPs, but in real scenarios much of the required information is incomplete — for example, whether GPT-3's training FLOPs figure is accurate, or whether it covers one epoch or a whole training run. In practice another method can therefore be used: calculation based on token-processing throughput. This method has a prerequisite: you must measure in advance how many tokens per second the target GPU device can process, i.e., its Process Tokens/s.

Taking the blogger's own measurement as an example, the token throughput of a P40 is about 25 tokens/s. Assume we need to handle 1,000 requests per day, each with a combined input and output of 1,000 Chinese characters, and all processing must finish within 10 hours; then the number of P40s can be calculated with the following formula:

def calc_gpu_num_by_token(post_token_num, post_time_all, token_process):
    """Estimate GPU count from total tokens, total time budget (s) and tokens/s per GPU."""
    return post_token_num / (post_time_all * token_process)


if __name__ == '__main__':
    token_num = 1000 * 1000 * 2   # 1,000 requests * 1,000 Chinese characters * 2 tokens per character
    time_cost = 10 * 3600         # total processing time budget: 10 hours in seconds
    token_process_num = 25        # tokens processed per second by one P40
    print(calc_gpu_num_by_token(token_num, time_cost, token_process_num))

The result is about 2.22 P40s to handle this workload within 10 hours. Rounding up, we need 3 P40s.

6. Summary

Several methods for estimating GPU computing power have been presented here. The calculations above all assume ideal conditions; in practice you also need to consider multi-machine multi-GPU I/O latency, network latency between machines, and the relationship between model size and GPU memory, among other factors. Beyond that, real measurements from your own practice are what produce reliable numbers. If you have any questions, feel free to discuss them in the comment section; there is still relatively little information on this topic online, so let's share and make progress together!
