Building an affordable LLM with bitsandbytes, 4-bit quantization, and QLoRA

LLMs are known to be large, and running or training them on consumer hardware is a major step toward making them more accessible. Our previous LLM.int8 blog post showed how the techniques from the LLM.int8 paper were integrated into transformers via the bitsandbytes library. Building on that work, we continue to lower the entry barrier for large models: we have teamed up with bitsandbytes again to let users run the vast majority of HF models, in any modality (text, vision, multimodal, etc.), in 4-bit precision. Users can also leverage tools from the Hugging Face ecosystem to train adapters on top of 4-bit models. This work builds on a new approach introduced in the QLoRA paper by Dettmers et al., whose abstract reads as follows:

We present QLoRA, an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into low-rank adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previously openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of fine-tuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization, which reduces the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to fine-tune more than 1,000 models and provide a detailed analysis of their performance on instruction following and chatbot tasks across 8 instruction datasets, multiple model architectures (LLaMA, T5), and model scales that would be infeasible to fine-tune with conventional methods (e.g. the 33B and 65B parameter models). Our results show that QLoRA fine-tuning on a small, high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous state of the art. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluation is a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy for accurately evaluating the performance levels of chatbots. We also select some samples to analyze the cases where Guanaco falls short of ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Resources

Here are some introductory resources for 4-bit models and QLoRA:

  • Original paper

  • Google Colab notebook on basic bitsandbytes usage – This notebook shows how to run inference with a 4-bit model, and how to run the 20B parameter GPT-NeoX model on a free Google Colab instance.

  • Fine-tuning Google Colab notebook – This notebook shows how to fine-tune a 4-bit model on a downstream task using the Hugging Face ecosystem. We demonstrate that it is possible to fine-tune GPT-NeoX 20B on a Google Colab instance!

  • The original code base used to reproduce the paper’s results

  • Demo space for the Guanaco 33B – This demo space is also included below.

Introduction

If you are not yet familiar with model precision and the most common data types (float16, float32, bfloat16, int8), we recommend reading our first blog post, which explains these concepts in detail with figures and examples.

For more information, we recommend this wikibook document to learn the basics of floating point representation.

Two different data types are discussed in the QLoRA paper: 4-bit Float and 4-bit NormalFloat. Here we will discuss the 4-bit Float data type as it is easier to understand.

FP8 and FP4 stand for floating point 8-bit and 4-bit precision respectively. They belong to the minifloats family of floating-point values (the minifloats family also includes other precisions such as bfloat16 and float16).

Let’s first look at how floating point values are represented in FP8 format, and then see what the FP4 format looks like.

FP8 format

As discussed in a previous blog post, each bit in an n-bit floating point number belongs to a specific category and is responsible for representing the various components of the number (sign, mantissa, and exponent).

The FP8 (floating point 8) format was first introduced in the paper FP8 for Deep Learning. It comes with two different encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).

[Figure: Overview of the FP8 format. Image source: sgugger]

Although precision decreases substantially when going from 32 bits to 8, both 8-bit encodings have plenty of use cases. They can currently be used through the Transformer Engine library, which is also integrated with accelerate in the HF ecosystem.

The E4M3 format can represent floating point numbers in the range -448 to 448. The E5M2 format, by increasing the number of exponent bits, expands this range to -57344 to 57344 – but at a loss of precision compared to E4M3, since the number of representable values stays the same. Experience has shown that E4M3 is best suited for the forward pass and E5M2 for the backward pass.
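As a quick sanity check on these ranges, the following illustrative Python snippet recomputes the maximum finite values. It assumes the bias and special-value conventions of the FP8 paper (bias 7 for E4M3, bias 15 for E5M2, with E4M3 reserving only its largest mantissa pattern at the top exponent for NaN and E5M2 reserving the all-ones exponent for infinities and NaN):

# Illustrative sketch of the FP8 ranges quoted above (assumed conventions noted in the text).

# E4M3: largest exponent field is 1111 (15 - bias 7 = 8), largest usable mantissa is 110 (1.75)
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)
# E5M2: largest finite exponent field is 11110 (30 - bias 15 = 15), largest mantissa is 11 (1.75)
e5m2_max = (1 + 3 / 4) * 2 ** (30 - 15)

print(e4m3_max)  # 448.0
print(e5m2_max)  # 57344.0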

FP4 Precision Brief

The sign bit represents the sign (+/-). The exponent bits encode a power of two whose exponent is the integer represented by those bits (e.g. 2^{010} = 2^{2} = 4). The fraction, or mantissa, bits encode a sum of negative powers of two: for each bit that is 1 we add 2^{-i}, where i is the position of that bit in the bit sequence; bits that are 0 contribute nothing. For example, the mantissa bits 1010 give (2^{-1} + 0 + 2^{-3} + 0) = (0.5 + 0.125) = 0.625; we then add the implicit leading 1 to this fraction, obtaining 1.625. Finally, all the components are multiplied together. For example, with 2 exponent bits and 1 mantissa bit, the encoding 1101 corresponds to:

-1 * 2^{2} * (1 + 2^{-1}) = -1 * 4 * 1.5 = -6
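To make the decoding rule concrete, here is a small, illustrative Python helper. It is a toy sketch rather than the bitsandbytes kernel; decode_fp4 is a hypothetical name and, like the worked example above, it ignores any exponent bias and special values:

def decode_fp4(bits: str, n_exp: int = 2, n_man: int = 1) -> float:
    """Decode a 4-bit string laid out as sign | exponent | mantissa (toy example, no exponent bias)."""
    assert len(bits) == 1 + n_exp + n_man
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:1 + n_exp], 2)               # e.g. "10" -> 2
    mantissa = 1.0                                     # implicit leading 1
    for i, b in enumerate(bits[1 + n_exp:], start=1):
        if b == "1":
            mantissa += 2.0 ** -i                      # add 2^-i for each set fraction bit
    return sign * (2.0 ** exponent) * mantissa

print(decode_fp4("1101"))  # -1 * 2^2 * 1.5 = -6.0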

FP4 does not have a fixed format, so you can try different mantissa/exponent combinations. Generally speaking, 3 exponent bits will work better in most cases. But in some cases, 2 exponent bits plus 1 mantissa bit will perform better.

QLoRA, a new way to democratize large models through quantization

In short, QLoRA reduces the memory usage of LLM fine-tuning compared to standard 16-bit model fine-tuning without sacrificing performance. Using this approach, we can fine-tune a 33B model on a single 24GB GPU and a 65B model on a single 46GB GPU.

More specifically, QLoRA uses 4-bit quantization to compress the pretrained language model. The base model parameters are then frozen, and a relatively small number of trainable parameters are added to the model in the form of low-rank adapters. During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the low-rank adapters. The weights of the LoRA layers are the only parameters updated during training. You can read the original LoRA paper to learn more about LoRA.

QLoRA uses one data type to store the base model weights (usually 4-bit NormalFloat) and another to perform computations (16-bit BrainFloat). QLoRA dequantizes the weights from the storage data type to the compute data type to perform the forward and backward passes, but weight gradients are only computed for the LoRA parameters, which are in bfloat16. Weights are only decompressed when needed, so memory usage stays low during both training and inference.
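To make the distinction between the storage and compute data types concrete, here is a deliberately simplified, conceptual sketch of what a QLoRA linear layer does in the forward pass. It is not the actual bitsandbytes implementation, and dequantize_nf4 is a hypothetical placeholder for the real dequantization kernel:

import torch

def qlora_linear_forward(x, w_nf4, quant_state, lora_A, lora_B, scaling):
    # Dequantize the frozen base weights from the storage dtype (NF4)
    # to the compute dtype (bfloat16) only for this matmul.
    w_bf16 = dequantize_nf4(w_nf4, quant_state).to(torch.bfloat16)  # hypothetical helper
    base_out = x.to(torch.bfloat16) @ w_bf16.T                      # frozen path, no weight gradients
    # The LoRA path holds the only trainable parameters (matrices A and B), kept in bfloat16.
    lora_out = (x.to(torch.bfloat16) @ lora_A.T) @ lora_B.T * scaling
    return base_out + lora_out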

Extensive experiments show that QLoRA fine-tuning performs on par with 16-bit fine-tuning. Furthermore, the Guanaco model, obtained by fine-tuning LLaMA models with QLoRA on the OpenAssistant dataset (OASST1), is a state-of-the-art chatbot system and performs close to ChatGPT on the Vicuna benchmark. This is a further demonstration of the power of QLoRA fine-tuning.

How to use it in transformers?

In this section, we describe the integration of this method in transformers, how to use it, and the currently supported models.

Getting Started

To get started, install accelerate and transformers from source to be able to load models in 4 bits, and make sure you have installed the latest version of the bitsandbytes library (0.39.0):

pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git

Quick start

The basic way to load a model in 4-bit is by passing the parameter load_in_4bit=True when calling the from_pretrained method, and setting the device map to "auto" .

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True, device_map="auto")
...

And that’s it!

In general, we recommend that users do not manually set up devices after loading a model using device_map. Therefore, any device assignments to the model or any submodule of the model should be avoided after this line – unless you know what you are doing.

Remember that loading a quantized model automatically casts the other submodules of the model to the float16 data type. You can change this behavior by passing torch_dtype=dtype to the from_pretrained method (for example, if you want layer normalization to run in float32).
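For reference, a minimal end-to-end usage sketch (the prompt and generation settings below are arbitrary illustrations) looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

# Move the inputs to the same device the dispatched model lives on.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))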

Advanced usage

You can use different variants of 4-bit quantization, such as NF4 (NormalFloat4, the default) or pure FP4 quantization. Based on theoretical analysis and empirical results, we recommend using NF4 quantization for better performance.

Another option is bnb_4bit_use_double_quant, which performs a second round of quantization after the first one, saving an additional 0.4 bits per parameter. Finally, there is the compute data type: while 4-bit bitsandbytes stores the weights in 4 bits, the computation still happens in 16 or 32 bits, and any combination can be chosen here (float16, bfloat16, float32, etc.).

Matrix multiplication and training will be faster if a 16-bit compute data type is used (the default is torch.float32). Users should take advantage of the recent BitsAndBytesConfig in transformers to change these parameters. Here is an example of loading a model in 4 bits using NF4 quantization, with double quantization and the bfloat16 compute data type to speed up training:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

# model_id is any causal LM checkpoint on the Hub
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Changing the compute data type

As mentioned above, you can also change the compute data type of the quantized model by changing the bnb_4bit_compute_dtype parameter in BitsAndBytesConfig.

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

Nested quantization

To enable nested quantization, you can use the bnb_4bit_use_double_quant parameter in BitsAndBytesConfig. This will enable a second round of quantization after the first round, saving an additional 0.4 bits per parameter. We also used this feature in the fine-tuned Google Colab notebook mentioned above.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)

Of course, as mentioned at the beginning of this section, all of these components are composable. You can combine these parameters to find the configuration that works best for you. A rule of thumb is: use double quantization if you are memory constrained, NF4 for higher precision, and a 16-bit floating point compute data type for faster fine-tuning. For instance, in the inference demo above, we fit a gpt-neo-x-20b (40GB in half precision) model entirely in 4 bits on a single 16GB GPU by combining nested quantization, the bfloat16 compute data type, and NF4 quantization.

FAQ

In this section we answer some frequently asked questions.

Are there any hardware requirements for FP4 quantization?

Note that this method is only compatible with GPUs; it is not yet possible to quantize models in 4 bits on CPU. On GPUs, the method has no particular hardware requirements: any GPU can be used to run 4-bit quantization as long as CUDA >= 11.2 is installed.

Also keep in mind that the computation is not done in 4 bits: the weights and activations are compressed into that format, while the computation is still performed in the specified or native data type.

What models are supported?

Similar to the LLM.int8 integration described in this blog post, our integration relies heavily on the accelerate library. Therefore, any model that supports accelerate loading (i.e. supports the device_map parameter when calling from_pretrained) can be quantized in 4 bits. Also note that this is completely independent of modality: as long as a model can be loaded with the device_map parameter, it can be quantized.

For text models, as of this writing, the most commonly used architectures are supported, such as Llama, OPT, GPT-Neo, GPT-NeoX for plain text, Blip2 for multimodality, etc.

As of this writing, models that support accelerate are:

[
    'bigbird_pegasus', 'blip_2', 'bloom', 'bridgetower', 'codegen', 'deit', 'esm',
    'gpt2', 'gpt_bigcode', 'gpt_neo', 'gpt_neox', 'gpt_neox_japanese', 'gptj', 'gptsan_japanese',
    'lilt', 'llama', 'longformer', 'longt5', 'luke', 'm2m_100', 'mbart', 'mega', 'mt5', 'nllb_moe',
    'open_llama', 'opt', 'owlvit', 'plbart', 'roberta', 'roberta_prelayernorm', 'rwkv', 'switch_transformers',
    't5', 'vilt', 'vit', 'vit_hybrid', 'whisper', 'xglm', 'xlm_roberta'
]

Note that if your favorite model is not in the list, you can submit a PR or issue in transformers to add support for accelerate loading for that architecture.

Can we train 4-bit/8-bit models?

Full 4-bit training of these models is not possible. However, you can train them using parameter-efficient fine-tuning (PEFT), i.e. by training extra parameters such as adapters on top of the base model. That is what the QLoRA paper does, and the method is officially supported by Hugging Face's PEFT library. We provide the corresponding fine-tuning notebook. If you want to reproduce the results of the paper, you can also check out the QLoRA code base.
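As an illustration, adding trainable LoRA adapters on top of a 4-bit base model with PEFT looks roughly like the following. This is a minimal sketch rather than the exact notebook code, and the target modules and hyperparameters are assumptions that depend on the model architecture:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config, device_map="auto"
)

# Prepare the quantized model for k-bit training (casts norms, enables input grads, etc.)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections for OPT-like models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable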

[Figure: The output activations of the original (frozen) pretrained weights (left) are added to the output of the low-rank adapter, which consists of the weight matrices A and B (right).]

What else does this work enable?

This work can have a positive impact on the community and on AI research, as it affects many possible use cases and applications. In RLHF (Reinforcement Learning from Human Feedback), it becomes possible to load a single 4-bit base model and train multiple adapters on top of it: one for the reward model and another for the value policy. We will publish a more detailed blog post about this use case soon.

We have also included some benchmarks on the impact of this quantization method when training large models on consumer hardware. We ran several fine-tuning experiments with two different architectures, Llama 7B (15GB in fp16) and Llama 13B (27GB in fp16), on an NVIDIA T4 (16GB); the results are shown below:


| Model | Half-precision model size (GB) | Hardware / total VRAM | Quantization method (CD = compute data type / GC = gradient checkpointing / NQ = double quantization) | Batch size | Gradient accumulation steps | Optimizer | Sequence length | Result |
|---|---|---|---|---|---|---|---|---|
| **<10B models** | | | | | | | | |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | LLM.int8 (8-bit) + GC | 1 | 4 | AdamW | 512 | No OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | LLM.int8 (8-bit) + GC | 1 | 4 | AdamW | 1024 | OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + bf16 CD + no GC | 1 | 4 | AdamW | 512 | No OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | 4-bit + FP4 + bf16 CD + no GC | 1 | 4 | AdamW | 512 | No OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + bf16 CD + no GC | 1 | 4 | AdamW | 1024 | OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | 4-bit + FP4 + bf16 CD + no GC | 1 | 4 | AdamW | 1024 | OOM |
| decapoda-research/llama-7b-hf | 14GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + bf16 CD + GC | 1 | 4 | AdamW | 1024 | No OOM |
| **10B+ models** | | | | | | | | |
| decapoda-research/llama-13b-hf | 27GB | 2x NVIDIA T4 / 32GB | LLM.int8 (8-bit) + GC | 1 | 4 | AdamW | 512 | No OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | LLM.int8 (8-bit) + GC | 1 | 4 | AdamW | 512 | OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | 4-bit + FP4 + bf16 CD + no GC | 1 | 4 | AdamW | 512 | OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | 4-bit + FP4 + fp16 CD + no GC | 1 | 4 | AdamW | 512 | OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + fp16 CD + GC | 1 | 4 | AdamW | 512 | No OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + fp16 CD + GC | 1 | 4 | AdamW | 1024 | OOM |
| decapoda-research/llama-13b-hf | 27GB | 1x NVIDIA T4 / 16GB | 4-bit + NF4 + fp16 CD + GC + NQ | 1 | 4 | AdamW | 1024 | No OOM |

We used the recent SFTTrainer from the TRL library; you can find the benchmarking script here.
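For context, a minimal SFTTrainer setup looks roughly like the following. This is a sketch rather than the actual benchmarking script; the dataset, output directory, and hyperparameters are placeholders, and `model` and `tokenizer` are assumed to be the 4-bit + LoRA model and its tokenizer from the PEFT sketch above:

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset with a "text" column

training_args = TrainingArguments(
    output_dir="./qlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=100,
)

trainer = SFTTrainer(
    model=model,                   # the 4-bit + LoRA model prepared earlier
    tokenizer=tokenizer,           # the base model's tokenizer
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=training_args,
)
trainer.train()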

Demo space

If you want to try the Guanaco model from the paper, you can play with this demo space. We have also embedded it directly below so you can try it right away.

[Embedded demo space: Guanaco 33B]

Acknowledgments

The HF team would like to thank everyone involved in this project at the University of Washington for contributing their work to the community.

The authors would also like to thank Pedro Cuenca for his help in reviewing the blog post, and Olivier Dehaene and Omar Sanseviero for their quick and robust support in integrating the paper on HF Hub.


Original English text: https://hf.co/blog/4bit-transformers-bitsandbytes

Original author: Younes Belkada, Tim Dettmers, Artidoro Pagnoni, Sylvain Gugger, Sourab Mangrulkar

Translator: Matrix Yao (Yao Weifeng), Intel Deep Learning Engineer, working on the application of transformer-family models on various modal data and training and inference of large-scale models.

Reviewer/typesetting: zhongdongy (阿东)
