[NLP] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

1. Introduction

Large language models (LLMs) such as GPT-4 and LLaMA have become the dominant workload powering AI applications at every level. From general chat models to document summarization, and from autonomous driving to copilots at every layer of the software stack, the need to deploy and serve these models at scale has skyrocketed. While frameworks such as DeepSpeed and PyTorch can regularly achieve good hardware utilization during LLM training, the interactivity of these applications and the poor arithmetic intensity of tasks such as open-ended text generation have become bottlenecks for inference throughput in existing systems.

To this end, frameworks such as vLLM, powered by PagedAttention, and research systems such as Orca have significantly improved LLM inference performance. However, these systems still struggle to provide consistent quality of service, especially for workloads with long prompts. Long-prompt workloads are becoming increasingly important as more models (such as MPT-StoryWriter) and systems (such as DeepSpeed Ulysses) support context windows that scale to tens of thousands of tokens. To better understand the problem space, we describe in detail how LLM text generation works in two distinct phases called prompt processing and generation. When systems treat them as distinct phases, generation will be preempted by prompt processing, which risks breaking service level agreements (SLAs).

Today, we are pleased to introduce DeepSpeed-FastGen, a system that overcomes these limitations by leveraging the proposed Dynamic SplitFuse technique to deliver up to 2.3x higher effective throughput compared to state-of-the-art systems such as vLLM. DeepSpeed-FastGen leverages a combination of DeepSpeed-MII and DeepSpeed-Inference to provide an easy-to-use serving system.

Quick Start: Trying DeepSpeed-FastGen is as easy as installing the latest DeepSpeed-MII version:

pip install deepspeed-mii

To generate text with a simple non-persistent pipeline deployment, run the following code. See Section 5 for more details.

from mii import pipeline
pipe = pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Hello, my name is", "DeepSpeed is"], max_new_tokens=128)
print(output)

2. Existing LLM serving techniques

The text generation workload for a single sequence consists of two phases: 1) prompt processing, in which the user-provided text is efficiently processed as a batch of tokens to build a key-value (KV) cache for attention, and 2) token generation, which adds a single token to that cache and generates a new token. Over the course of producing a sequence of text, the model makes many forward calls to generate the full sequence. Two main techniques have been proposed in the literature and deployed in systems to address the various limitations and bottlenecks that may arise during these phases.
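
To make the two phases concrete, here is a minimal sketch using the HuggingFace transformers API (not DeepSpeed-FastGen's internals): the prompt is processed in one forward pass that builds the KV cache, after which each new token requires an additional forward pass that reuses that cache. The model name is taken from the quick start above and can be swapped for any smaller causal LM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM can be substituted
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tok("DeepSpeed is", return_tensors="pt").input_ids

with torch.no_grad():
    # Phase 1: prompt processing -- all prompt tokens in a single forward pass,
    # which builds the KV cache (past_key_values).
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Phase 2: token generation -- one token per forward pass, reusing the cache.
    generated = [next_id]
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))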

Blocked KV cache:

vLLM observed that memory fragmentation caused by a large, monolithic KV cache significantly reduced the concurrency of LLM serving systems, and proposed PagedAttention to enable non-contiguous caching and improve overall system throughput. Instead of allocating a single variable-sized contiguous chunk of memory, the underlying storage of the KV cache is allocated in fixed-size blocks (also known as pages). A blocked KV cache increases potential sequence concurrency by eliminating KV-cache-induced memory fragmentation, thereby improving system throughput. Non-contiguous KV cache implementations are also included in HuggingFace TGI and NVIDIA TensorRT-LLM.
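
The core idea can be illustrated with a toy allocator sketch (hypothetical code, not vLLM's or DeepSpeed-FastGen's implementation): each sequence maps to a list of fixed-size physical blocks, so its cache can grow without requiring one large contiguous allocation.

BLOCK_SIZE = 16  # tokens stored per KV block (page)

class BlockedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # sequence id -> list of block ids

    def reserve(self, seq_id: int, num_tokens: int) -> None:
        """Ensure a sequence holding `num_tokens` tokens has enough blocks."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)        # ceiling division
        while len(table) < needed:
            table.append(self.free_blocks.pop())     # blocks need not be contiguous

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))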

Continuous batching:

In the past, dynamic batching, in which the server waits to collect multiple requests and processes them together, was used to increase GPU utilization. However, this approach has drawbacks: it typically requires padding inputs to the same length or stalling the system while a larger batch is assembled.

Recent advances in large language model (LLM) inference and serving have focused on fine-grained scheduling and optimizing memory efficiency. For example, Orca proposes iteration-level scheduling (also known as continuous batching), which makes scheduling decisions on every forward pass of the model. This allows requests to join and leave the batch as needed, eliminating the need to pad requests and thus improving overall throughput. In addition to Orca, continuous batching has been implemented in NVIDIA TRT-LLM, HuggingFace TGI, and vLLM.

In current systems, there are two main approaches to implementing continuous batching. In TGI and vLLM, generation is preempted to perform prompt processing (referred to as infill in TGI) before continuing with generation. In Orca, these phases are not distinguished; instead, Orca adds a prompt to the running batch as long as the total number of sequences stays below a fixed bound. Both approaches, to varying degrees, need to stall generation in order to process long prompts.
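
As a rough illustration of iteration-level scheduling (a sketch only; `engine.step` is a hypothetical stand-in for one model forward pass over the batch), requests are admitted to and retired from the running batch between forward passes rather than per static batch:

from collections import deque

def serve_loop(engine, request_queue: deque, max_batch_size: int = 32):
    running = []
    while request_queue or running:
        # Admit new requests whenever the running batch has room.
        while request_queue and len(running) < max_batch_size:
            running.append(request_queue.popleft())

        # One forward pass over the current batch; returns finished requests.
        finished = engine.step(running)

        # Completed sequences leave immediately; the rest continue next iteration.
        running = [r for r in running if r not in finished]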

To address these shortcomings, we propose a novel prompt and generation composition strategy, Dynamic SplitFuse.

3. Dynamic SplitFuse: a novel prompt and generation composition strategy

Like existing frameworks such as TRT-LLM, TGI, and vLLM, DeepSpeed-FastGen aims to leverage continuous batching and non-contiguous KV caching to improve the occupancy and responsiveness of data center LLM serving. To reach a new level of performance, DeepSpeed-FastGen introduces Dynamic SplitFuse, which dynamically decomposes and unifies prompt processing and generation to further improve continuous batching and system throughput.

1. Three performance insights

Before describing Dynamic SplitFuse, we answer three key performance questions that collectively drove its design.

1. What factors affect the forward pass of a single LLM? In order to schedule effectively, it is necessary to understand which relevant independent variables the scheduling loop should control. We observe below that the composition of sequences in a forward pass (the batch size in sequences) has a negligible impact on performance compared to the raw number of tokens in the forward pass. This means an effective scheduler can be built around a single signal: the number of tokens in the forward pass.

2. How does the model’s throughput respond to changes in the number of tokens in the forward pass? An LLM has two key operating regions, and the transition between them is relatively steep. With a small number of tokens, the GPU is bottlenecked by reading the model from memory, so throughput scales with the number of tokens, whereas with many tokens the model is compute bound and throughput is nearly constant. The model should run efficiently if all forward passes stay in the throughput-saturating region.

3. How should a pool of tokens be scheduled across multiple forward passes? We observed above that for well-aligned inputs the token-throughput curve is concave, which means the second derivative must be less than or equal to zero. As an example, let f(x) be a concave function of latency to throughput for a given model. For a concave function f(x), the following holds:

    0 ≥ lim_{h→0} [ f(x + h) + f(x − h) − 2 f(x) ] / h²

    f(x + h) + f(x − h) ≤ 2 f(x)

This shows that for a given pool of 2x tokens to be processed, the way to maximize throughput is to distribute them evenly between the two batches. More generally, in a system that must consume and process P tokens via F forward passes, an ideal partitioning scheme would distribute them equally.
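
As a small illustration of this scheduling consequence, here is a sketch that splits P tokens as evenly as possible across F forward passes (the function name is ours, not part of DeepSpeed-FastGen's API):

def even_split(P: int, F: int):
    """Split P tokens into F forward passes whose sizes differ by at most one."""
    base, extra = divmod(P, F)
    return [base + (1 if i < extra else 0) for i in range(F)]

print(even_split(4096, 3))  # [1366, 1365, 1365]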

2. Dynamic SplitFuse

Dynamic SplitFuse is a novel token composition strategy for prompt processing and token generation. DeepSpeed-FastGen uses Dynamic SplitFuse to run at a consistent forward-pass size by taking partial tokens from prompts and composing them with generation. In particular, Dynamic SplitFuse performs two key behaviors:

  1. Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
  2. Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is met precisely and the forward sizes are well-aligned. A sketch of this composition is shown below.
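
The following is a hypothetical sketch (not DeepSpeed-FastGen's actual scheduler) of how a forward batch might be composed under a fixed token budget: sequences in the generation phase contribute one token each, and the remaining budget is filled exactly by splitting pending prompts into chunks.

def compose_forward_batch(generating, pending_prompts, token_budget=2048):
    """Return a list of (sequence, num_tokens) pairs that fills the token budget."""
    batch = [(seq, 1) for seq in generating]        # one token per generating sequence
    remaining = token_budget - len(generating)

    for seq, prompt_len in pending_prompts:         # split prompts to fill the budget
        if remaining <= 0:
            break
        chunk = min(prompt_len, remaining)
        batch.append((seq, chunk))
        remaining -= chunk
    return batch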

Together, these two techniques provide concrete benefits across all user metrics:

  1. Better responsiveness: Since long prompts no longer require extremely long forward passes to process, the model provides lower client latency. More forward passes are performed within the same window of time.
  2. Higher Efficiency: The fusion of short prompts with a larger token budget enables models to run continuously at high throughput.
  3. Lower variance and better consistency: Since forward passes are of consistent size and forward-pass size is the primary determinant of performance, the latency of each forward pass is much more consistent than in competing systems, as is the perceived generation frequency. Unlike other prior work, there are no preemptions or long-running prompts to increase latency.

Consequently, DeepSpeed-FastGen consumes tokens from incoming prompts at a rate that permits fast ongoing generation while adding tokens to the system to increase utilization, providing lower latency and higher-throughput streaming generation to all clients compared to other state-of-the-art serving systems.

Figure 1: Illustration of continuous batching strategies. Each block shows the execution of a forward pass. An arrow indicates a forward pass with one or more sequences generating a token. vLLM performs either token generation or prompt processing in a forward pass, with token generation preempting prompt processing. Orca processes prompts at full length alongside generation. Dynamic SplitFuse composes fixed-size batches from both generation tokens and prompt tokens.

4. Performance Evaluation

DeepSpeed-FastGen leverages its blocked KV cache and Dynamic SplitFuse continuous batching to deliver state-of-the-art LLM serving performance. We evaluate DeepSpeed-FastGen against vLLM on a range of models and hardware configurations, following the benchmarking methodology discussed below.

1. Benchmarking method

We use two main quantitative schemes to measure performance.

Throughput-Latency Curve: Two key metrics for production readiness are throughput (measured in requests per second) and latency (the responsiveness of each request). To measure these, we instantiate multiple clients simultaneously (ranging from 1 to 32) and send requests (512 in total) to the server. The latency of each request is measured at the endpoint, and throughput is measured from the end-to-end time to complete the experiment.
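
For illustration, a rough client-side harness might look like the sketch below. It assumes a persistent deployment is already running (via mii.serve), that a single client handle can be shared across worker threads, and that the prompt set is a stand-in; this is not the exact harness used to produce the numbers reported here.

import time
from concurrent.futures import ThreadPoolExecutor
import mii

client = mii.client("mistralai/Mistral-7B-v0.1")

def timed_request(prompt):
    start = time.time()
    client.generate(prompt, max_new_tokens=60)
    return time.time() - start            # per-request latency

prompts = ["DeepSpeed is"] * 512          # 512 requests in total
t0 = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:   # 16 concurrent clients
    latencies = list(pool.map(timed_request, prompts))
throughput = len(prompts) / (time.time() - t0)     # requests per second
print(throughput, sorted(latencies)[len(latencies) // 2])  # rps and median latency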

Effective throughput: Interactive applications, such as chat applications, may have more stringent and complex requirements than top-level metrics such as end-to-end latency can capture. We pay particular attention to increasingly popular chat user scenarios:

  1. Users initiate tasks by sending prompts.
  2. The system handles the prompt and returns the first token.
  3. Subsequent tokens are streamed to the user as they are generated.

At every point in the process, the system has the potential to provide a poor user experience; for example, if the first token arrives too slowly or generation seems to stall for a while. We propose an SLA framework that considers these two dimensions.

Since the lengths of prompts and generated text vary significantly, affecting computational cost, it is impractical to set strict SLA values for throughput and latency. Therefore, we define the SLA for prompt latency as |tokens in prompt| / 512 seconds (= 512 tokens/second). Additionally, taking into account human reading speed, we set the SLA for the exponential moving average (EMA) generation latency to 2, 4, or 6 tokens/second. Requests that comply with these SLAs are considered successful, and the throughput of these successful requests is called the effective throughput.
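
As an illustration of how effective throughput is computed from per-request measurements under this definition (the field names below are our own, hypothetical ones):

def effective_throughput(requests, total_time_s, gen_sla_tokens_per_s=4.0):
    """Count only requests meeting both SLAs, divided by total wall-clock time."""
    successful = 0
    for r in requests:
        # Prompt SLA: the prompt must be processed at >= 512 tokens/second.
        prompt_sla_s = r["prompt_tokens"] / 512.0
        meets_prompt = r["first_token_latency_s"] <= prompt_sla_s
        # Generation SLA: EMA generation rate must be >= the chosen tokens/second.
        meets_gen = r["ema_gen_tokens_per_s"] >= gen_sla_tokens_per_s
        successful += int(meets_prompt and meets_gen)
    return successful / total_time_s      # successful requests per second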

We evaluate vLLM and DeepSpeed-FastGen on Llama-2 7B, Llama-2 13B, and Llama-2 70B on NVIDIA A100, H100, and A6000.

2. Throughput-Latency Analysis

In this experiment, DeepSpeed-FastGen outperforms vLLM in both throughput and latency, providing greater throughput at equivalent latency or faster responses at equivalent throughput. On Llama-2 70B with 4x A100-80GB, DeepSpeed-FastGen shows up to 2x higher throughput (1.36 rps vs. 0.67 rps) at the same latency (9 seconds), or up to a 50% latency reduction (7 seconds vs. 14 seconds) at the same throughput (1.2 rps), as shown in Figure 2. These trends hold when evaluating Llama-2 13B, as shown in Figure 3.

Figure 2: Throughput and latency for text generation using Llama 2 70B (tensor parallelism on 4 A100-80GB GPUs). A normal distribution was applied to the prompt length and the generation length, with means of 1200/2600 and 128/60, respectively, and a variance of 30%

Figure 3: Throughput and latency of text generation using Llama 2 13B (A100-80GB GPU, no tensor parallelism). A normal distribution was applied to the prompt length and the generation length, with means of 1200/2600 and 60/128, respectively, and a variance of 30%

3. Effective throughput analysis

Under an effective throughput analysis that accounts for first-token latency and the rate at which generation occurs, DeepSpeed-FastGen delivers up to 2.3x higher effective throughput than vLLM. Figure 4 shows a comparative analysis of the effective throughput of DeepSpeed-FastGen and vLLM. Each plotted point represents the effective throughput achieved with a specific number of clients. As we scale up the number of clients, we initially observe an increase in effective throughput. However, as the number of clients approaches system capacity, latency increases significantly and many requests fail to meet the SLA, so effective throughput eventually saturates or decreases. From a usability perspective, the number of clients required to achieve maximum effective throughput is not particularly relevant; the maximum point of the curve is the optimal serving point.

Figure 4: Effective throughput of DeepSpeed-FastGen and vLLM (Llama 2 70B, tensor parallelism on 4 A100-80GB GPUs). A normal distribution was applied to the prompt length and the generation length, with means of 2600 and 60, respectively, and a variance of 30%

When vLLM preempts the ongoing generation of previous requests, generation latency increases noticeably. This causes vLLM’s effective throughput to appear lower than its directly measured throughput. At vLLM’s peak, the effective throughput was 0.63 queries/second, and roughly 28% of requests failed to meet the SLA of 4 tokens/second. Under the same SLA, DeepSpeed-FastGen achieved 1.42 queries/second (with fewer than 1% of requests failing to meet the SLA), 2.3x higher than vLLM.

4. Token level timing analysis

Figure 5 shows the P50, P90 and P95 latencies of the generation process. Both vLLM and DeepSpeed-FastGen exhibit similar P50 latencies, but vLLM exhibits significantly higher P90 and P95 latencies. For P95 latency, DeepSpeed-FastGen achieves a 3.7x reduction.

This difference arises because vLLM’s generation latency spikes noticeably when it preempts ongoing generation to process new prompts. In contrast, DeepSpeed-FastGen typically processes prompts and generation from previous requests concurrently, resulting in much more consistent generation latency.

Figure 5: Per-token generation latency of Llama 2 70B/A100-80GB using tensor parallelism on 4 A100-80GB GPUs, 16 clients. A normal distribution was applied to the prompt length and generation length, with means of 2600 and 128, respectively, and a variance of 30%.

5. Scalability using load balancing

DeepSpeed-FastGen provides replica-level load balancing to evenly distribute requests across multiple servers, allowing you to easily scale your application.
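
For example, a multi-replica persistent deployment might look like the sketch below; this assumes the replica_num and tensor_parallel keyword arguments exposed by DeepSpeed-MII, and the model name and sizes are illustrative.

import mii

# Illustrative configuration: each replica spans 4 GPUs; DeepSpeed-MII's built-in
# load balancer spreads incoming requests across the replicas.
mii.serve(
    "meta-llama/Llama-2-70b-hf",
    tensor_parallel=4,   # GPUs per replica
    replica_num=2,       # number of replicas behind the load balancer
)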

Figure 6 illustrates the scalability of DeepSpeed-FastGen when using the load balancer with up to 16 replicas. Note that each replica uses 4 A100 GPUs for the Llama 2 70B model; we used a total of 8 nodes to run the 16 replicas. The results show near-perfect scalability: a single replica achieves a throughput of 1.46 queries/sec, while 16 replicas reach 23.7 queries/sec, a near-linear 16x increase over a single replica.

Figure 6: Scalability using the load balancing feature. A normal distribution was applied to the prompt length and generation length, with means of 2600 and 60, respectively, and a variance of 30%

6. Other hardware platforms

In addition to our in-depth analysis of the A100, we also provide additional benchmark results for the H100 and A6000. The same performance trends as the A100 are observed on the A6000 and H100.

Figure 7: Throughput-latency curve and effective throughput for Llama 2 70B using 8 H100 GPUs. A normal distribution was applied to the prompt length and generation length, with means of 2600 and 60, respectively, and a variance of 30%

Figure 8: Throughput-latency curve and effective throughput for Llama 2 7B using an A6000. A normal distribution was applied to the prompt length and generation length, with means of 2600 and 60, respectively, and a variance of 30%

5. DeepSpeed-FastGen: Implementation and usage

DeepSpeed-FastGen is a collaborative combination of DeepSpeed-MII and DeepSpeed-Inference, as shown in the figure below. Together, the two packages provide various components of the system, including a front-end API, host and device infrastructure for scheduling batches using Dynamic SplitFuse, optimized kernel implementations, and tools for building new model implementations.

The fastest way to get started with our DeepSpeed-FastGen alpha release is:

pip install deepspeed-mii

Follow our getting started guide for more details. For usage questions and issue reporting, please use the DeepSpeed-MII GitHub repository.

1. Supported models

Currently, we support the following model architectures in this alpha version of DeepSpeed-FastGen:

  • LLaMA and LLaMA-2
  • Mistral
  • OPT

All current models use HuggingFace on the backend to provide both the model weights and the model's corresponding tokenizer.

We plan to add more models in the coming weeks and months after the initial release. If you would like support for a specific model architecture, please open an issue and let us know.

2. Deployment options

All examples below can be run in DeepSpeedExamples. After installation, you have two deployment options: interactive non-persistent pipeline or persistent service deployment:

Non-persistent pipeline

A non-persistent pipeline deployment is a great and quick way to get started and can be done with just a few lines of code. Non-persistent models exist only for the duration of the Python script you run, but they are useful for temporary interactive sessions.

from mii import pipeline
pipe = pipeline("mistralai/Mistral-7B-v0.1")
output = pipe(["Hello, my name is", "DeepSpeed is"], max_new_tokens=128)
print(output)

Persistent deployment

Persistent deployments are ideal for long-running production applications. A persistent deployment uses a lightweight gRPC server, which can be created with the following two lines:

import mii
mii.serve("mistralai/Mistral-7B-v0.1")

Thanks to DeepSpeed-MII’s built-in load balancer, the above server can be queried by multiple clients at once. Creating a client requires just two lines of code:

client = mii.client("mistralai/Mistral-7B-v0.1")
output = client.generate("Deepspeed is", max_new_tokens=128)
print(output)

When a persistent deployment is no longer needed, it can be terminated:

client.terminate_server()

3. Advanced installation information

For ease of use and to significantly reduce the lengthy compilation times required by many projects in this area, we distribute a precompiled Python wheel covering most of our custom kernels through a new library called DeepSpeed-Kernels. We have found this library to be very portable across environments with NVIDIA GPUs of compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+. In most cases, you don’t even need to know this library exists, as it is a dependency of DeepSpeed-MII and is installed along with it. However, if for some reason you need to manually compile our kernels, please see our advanced installation documentation.

6. Try DeepSpeed-FastGen

We are very excited to share this DeepSpeed-FastGen alpha release.

  • To get started, visit our DeepSpeed-MII GitHub page: GitHub landing page
