Faster LLMs with speculative decoding and AWS Inferentia2

In recent years, we have observed a significant increase in the size of the large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models, with parameter counts on the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B scores better than its smaller 8-billion-parameter version on metrics like reading comprehension (SQuAD: 85.6 versus 76.4). Therefore, customers often experiment with larger and newer models to build machine learning (ML)-based products that add value.

However, the larger the model, the more computationally demanding and costly it is to deploy. For instance, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers need to consider these performance differences to ensure they meet their users’ needs. In this post, we explore how speculative sampling can make LLM inference on AWS Inferentia and Trainium more efficient in terms of compute and cost. This technique improves LLM inference performance and reduces the time per output token (TPOT).

Modern language models are based on the transformer architecture. The input prompt is first processed using a technique called context encoding, which runs quickly because it is parallelizable. Output tokens are then generated autoregressively, one at a time: to produce N output tokens, the decoder must run N times in sequence. The larger the model, such as Llama-3-70B, the longer it takes to generate each next token.
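As a rough illustration of why decoding is serial, consider the following Python sketch. The decoder and sample callables are hypothetical placeholders for a single model forward pass and a token-sampling step; they are not part of any Neuron API.

```python
# Illustrative sketch of autoregressive decoding (not Neuron-specific).
# `decoder(tokens)` stands in for one forward pass of the model that returns
# a probability distribution over the next token; `sample` picks a token from it.

def generate(decoder, prompt_tokens, n_new_tokens, sample):
    tokens = list(prompt_tokens)       # context encoding handles the prompt in parallel
    for _ in range(n_new_tokens):      # N output tokens => N serial decoder calls
        next_token_probs = decoder(tokens)
        tokens.append(sample(next_token_probs))
    return tokens
```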

From a computational perspective, token generation in LLMs is a memory-bandwidth-bound process. The larger the model, the more time is spent waiting on memory transfers, leaving compute units underutilized and the hardware's available floating-point operations (FLOPS) untapped.

Speculative sampling is a technique that improves the computational efficiency of LLM inference while maintaining accuracy. It works by using a smaller, faster drafter model to generate multiple candidate tokens, which are then verified by the larger, slower target model. This verification step processes multiple tokens in a single pass instead of sequentially, which is more computationally efficient. Processing more tokens in parallel increases compute intensity, because more tokens are multiplied against the same weight tensors; this makes better use of the hardware and outperforms non-speculative execution, which is generally limited by memory bandwidth.

The speculative process involves an adjustable window k, where the target model provides one guaranteed correct token, and the drafter model speculates the next k-1 tokens. If the drafter model’s tokens are accepted, the process speeds up. If not, the target model takes control, ensuring accuracy.

For example, a scenario where all speculated tokens are accepted results in faster processing. The target model provides one guaranteed output token, and the drafter model runs multiple times to produce a sequence of possible output tokens. These are verified by the target model and then accepted through a probabilistic method.

On the other hand, a scenario where some tokens are rejected means we will get fewer output tokens and repeat this process more times to complete the response, resulting in slower processing overall. By adjusting the window size k and understanding when the drafter and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.
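To make the draft/verify/accept loop concrete, here is a minimal, framework-agnostic Python sketch of one speculative sampling step. The draft_model and target_model callables and the dict-based probability representation are simplifying assumptions for illustration; this is not the Neuron implementation.

```python
import random

def sample_from(p):
    """Sample a token from a dict mapping token -> probability."""
    tokens = list(p)
    return random.choices(tokens, weights=[p[t] for t in tokens])[0]

def speculative_step(target_model, draft_model, tokens, k):
    """One speculative step: draft k-1 tokens, then verify them with the target model.

    target_model(tokens) and draft_model(tokens) are assumed to return the
    next-token distribution as a dict {token: probability}. Returns the tokens
    accepted in this step (always at least one).
    """
    # 1. The small drafter proposes k-1 tokens autoregressively (cheap and fast).
    draft_tokens, draft_dists = [], []
    ctx = list(tokens)
    for _ in range(k - 1):
        q = draft_model(ctx)
        t = sample_from(q)
        draft_tokens.append(t)
        draft_dists.append(q)
        ctx.append(t)

    # 2. The large target model scores all k positions. Shown as k calls for
    #    clarity; in practice they are evaluated in a single parallel pass,
    #    which is where the efficiency gain comes from.
    target_dists = [target_model(tokens + draft_tokens[:i]) for i in range(k)]

    # 3. Accept each drafted token with probability min(1, p_target / p_draft).
    accepted = []
    for i, t in enumerate(draft_tokens):
        p, q = target_dists[i].get(t, 0.0), draft_dists[i][t]
        if random.random() < min(1.0, p / q):
            accepted.append(t)
        else:
            # Rejected: resample from the adjusted target distribution and stop early.
            residual = {x: max(target_dists[i].get(x, 0.0) - draft_dists[i].get(x, 0.0), 0.0)
                        for x in target_dists[i]}
            total = sum(residual.values()) or 1.0
            accepted.append(sample_from({x: v / total for x, v in residual.items()}))
            return accepted

    # 4. All drafts accepted: the target's last distribution yields one extra token,
    #    so a fully accepted window advances the sequence by k tokens.
    accepted.append(sample_from(target_dists[-1]))
    return accepted
```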

We demonstrate how speculative sampling works on Amazon EC2 Inf2 instances powered by Inferentia2 and EC2 Trn1 instances powered by Trainium. We use an example where we generate text faster with Llama-2-70B using Llama-2-7B as the drafter model. Although the example is based on Llama-2 models, a similar process can be followed for Llama-3 models.

We load both Llama-2 models with the bfloat16 data type and set the n_positions parameter, which represents the maximum sequence length allowed for generation. For speculative sampling, only a batch_size of 1 is supported. The two models combined require nearly 200 GB of device memory for the weights, plus additional memory on the order of gigabytes for the key-value (KV) caches.
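As a rough sketch of what this setup can look like with the transformers-neuronx library: the checkpoint directories, tp_degree, k value, sequence lengths, and prompt below are illustrative assumptions, and the SpeculativeGenerator wiring follows the library's speculation module rather than the exact code from the example.

```python
# Illustrative sketch only: paths, tp_degree, k, and sequence lengths are assumptions.
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.speculation import SpeculativeGenerator

# Target model: Llama-2-70B in bfloat16, sharded across NeuronCores with tensor parallelism.
target_model = LlamaForSampling.from_pretrained(
    "Llama-2-70b-split", batch_size=1, amp="bf16", n_positions=256, tp_degree=32)
target_model.to_neuron()  # compile and load onto the Inferentia2/Trainium devices

# Drafter model: the much smaller and faster Llama-2-7B, loaded the same way.
draft_model = LlamaForSampling.from_pretrained(
    "Llama-2-7b-split", batch_size=1, amp="bf16", n_positions=256, tp_degree=32)
draft_model.to_neuron()

# k is the speculation window: one guaranteed target token plus k-1 drafted tokens per step.
spec = SpeculativeGenerator(draft_model, target_model, k=4)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
input_ids = tokenizer("The future of generative AI on AWS is", return_tensors="pt").input_ids
output_ids = spec.sample(input_ids=input_ids, sequence_length=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Adjusting k trades off how much work the drafter does per step against how often its speculated tokens are rejected by the target model.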

As more developers seek to incorporate LLMs into their applications, they face the choice between using larger, more expensive, and slower models that provide higher quality results, or using smaller, less costly, and faster models that may decrease response quality. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers do not have to make that choice. They can leverage the high-quality outputs of large models and the speed of smaller models.

In this post, we have shown how we can accelerate inference with large models like Llama-2-70B using a new feature called speculative sampling. To try it yourself, you can review the speculative sampling example and adjust the input prompt and k parameter to see the results obtained. For more advanced use cases, you can develop your own token acceptor implementation. For more information on running your models on Inferentia and Trainium instances, you can read the AWS Neuron documentation and visit the AWS Neuron channel on repost.aws to discuss your experiments with the AWS Neuron community and share ideas.
