SentenceTransformer accelerates vectorization using multiple GPUs

Foreword

When a large dataset has to be vectorized before it is stored in a vector database, and the server has several GPUs available, we want to use all of them at once so that the encoding runs in parallel and finishes faster.

Code

It only takes a few lines of code, so let's get straight to it:

from sentence_transformers import SentenceTransformer

#Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == '__main__':

    #Create a large list of 100k sentences
    sentences = ["This is sentence {}".format(i) for i in range(100000)]

    #Define the model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    #Start the multi-process pool on all available CUDA devices
    pool = model.start_multi_process_pool()

    #Compute the embeddings using the multi-process pool
    emb = model.encode_multi_process(sentences, pool)
    print("Embeddings computed. Shape:", emb.shape)

    #Optional: Stop the processes in the pool
    model.stop_multi_process_pool(pool)

Note: Be sure to include the line if __name__ == '__main__':. Each worker process re-imports the main module when it is spawned, so without the guard the following error is raised:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

This code comes straight from the official sentence-transformers examples; I simply copied it. The source file is computing_embeddings_multi_gpu.py.
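To make the data flow more concrete, here is a simplified, single-process sketch of the general pattern a multi-process pool follows: split the input into chunks, hand the chunks to workers, and put the results back together in the original order. This is not the sentence-transformers implementation. The names split_into_chunks, fake_encode, and encode_in_chunks are hypothetical, the "embedding" is a toy two-number vector, and the workers are only simulated with round-robin assignment (a real pool usually lets each worker pull the next chunk from a queue).

```python
from typing import Dict, List, Tuple


def split_into_chunks(items: List[str], chunk_size: int) -> List[Tuple[int, List[str]]]:
    """Split items into (chunk_id, chunk) pairs of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (chunk_id, items[start:start + chunk_size])
        for chunk_id, start in enumerate(range(0, len(items), chunk_size))
    ]


def fake_encode(sentences: List[str]) -> List[List[float]]:
    """Stand-in for a model: one toy 2-dimensional vector per sentence."""
    return [[float(len(s)), float(s.count(" "))] for s in sentences]


def encode_in_chunks(sentences: List[str], chunk_size: int, num_workers: int):
    """Encode chunk by chunk and return (embeddings, chunks handled per worker)."""
    results: Dict[int, List[List[float]]] = {}
    chunks_per_worker = [0] * num_workers
    for chunk_id, chunk in split_into_chunks(sentences, chunk_size):
        # Simulated dispatch: chunk i is counted against worker i % num_workers.
        chunks_per_worker[chunk_id % num_workers] += 1
        results[chunk_id] = fake_encode(chunk)
    # Reassemble by chunk id so the output order matches the input order.
    embeddings = [vec for chunk_id in sorted(results) for vec in results[chunk_id]]
    return embeddings, chunks_per_worker


sentences = ["This is sentence {}".format(i) for i in range(10)]
embeddings, load = encode_in_chunks(sentences, chunk_size=3, num_workers=2)
print(len(embeddings), load)
```

The chunk id is what lets the results be reordered after the workers finish at different times, which is why the final embeddings line up with the input sentences.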

The official examples also include a streaming version of encode, which likewise runs in parallel across multiple GPUs:

from sentence_transformers import SentenceTransformer, LoggingHandler
import logging
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

#Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == '__main__':
    #Set params
    data_stream_size = 16384  #Size of the data that is loaded into memory at once
    chunk_size = 1024  #Size of the chunks that are sent to each process
    encode_batch_size = 128  #Batch size of the model

    #Load a large dataset in streaming mode. more info: https://huggingface.co/docs/datasets/stream
    dataset = load_dataset('yahoo_answers_topics', split='train', streaming=True)
    dataloader = DataLoader(dataset.with_format("torch"), batch_size=data_stream_size)

    #Define the model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    #Start the multi-process pool on all available CUDA devices
    pool = model.start_multi_process_pool()

    for i, batch in enumerate(tqdm(dataloader)):
        #Compute the embeddings using the multi-process pool
        sentences = batch['best_answer']
        batch_emb = model.encode_multi_process(sentences, pool, chunk_size=chunk_size, batch_size=encode_batch_size)
        print("Embeddings computed for 1 batch. Shape:", batch_emb.shape)

    #Optional: Stop the processes in the pool
    model.stop_multi_process_pool(pool)

Official example: computing_embeddings_streaming.py
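The three size parameters in the streaming example nest inside one another, which is easy to misread. The sketch below works through the arithmetic for the values used above, assuming 8 GPUs (the number visible in the nvidia-smi output in this article; adjust it for your machine). plan_stream_batch is a hypothetical helper written for illustration, not part of sentence-transformers.

```python
import math


def plan_stream_batch(data_stream_size: int, chunk_size: int,
                      encode_batch_size: int, num_gpus: int) -> dict:
    """Work out how one streamed batch is divided (illustrative only)."""
    chunks = math.ceil(data_stream_size / chunk_size)
    return {
        # Chunks handed to the pool for one streamed batch
        "chunks_per_stream_batch": chunks,
        # Model batches needed to finish one full chunk
        "model_batches_per_chunk": math.ceil(chunk_size / encode_batch_size),
        # Average number of chunks per GPU if work is spread evenly
        "avg_chunks_per_gpu": chunks / num_gpus,
    }


plan = plan_stream_batch(data_stream_size=16384, chunk_size=1024,
                         encode_batch_size=128, num_gpus=8)
for name, value in plan.items():
    print(name, value)
```

With these values each streamed batch is cut into 16 chunks, so 8 GPUs receive about two chunks each, and every chunk takes 8 model batches. If chunk_size were raised close to data_stream_size, there would be fewer chunks than GPUs and some devices would sit idle.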

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   58C    P0   297W / 400W |  75340MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:29:00.0 Off |                    0 |
| N/A   71C    P0   352W / 400W |  80672MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:52:00.0 Off |                    0 |
| N/A   68C    P0   398W / 400W |  75756MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   58C    P0   341W / 400W |  75994MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800-SXM...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   56C    P0   319W / 400W |  70084MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800-SXM...  On   | 00000000:92:00.0 Off |                    0 |
| N/A   70C    P0   354W / 400W |  76314MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800-SXM...  On   | 00000000:BF:00.0 Off |                    0 |
| N/A   73C    P0   360W / 400W |  75876MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800-SXM...  On   | 00000000:C5:00.0 Off |                    0 |
| N/A   57C    P0   364W / 400W |  80404MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

As the nvidia-smi output above shows, all eight GPUs are running at 100% utilization. It is seriously fast.