Improve Your LLM Performance with Amazon SageMaker’s Large Model Inference Container v15

Today, the long-awaited version 15 of the Amazon SageMaker Large Model Inference (LMI) container was announced. It incorporates vLLM 0.8.4 and adds support for the vLLM V1 engine. This update brings compatibility with the latest open-source models, including Meta's Llama 4 Scout and Maverick, Google's Gemma 3, Alibaba's Qwen models, Mistral AI's models, and DeepSeek-R1, among others. With this evolution, Amazon SageMaker AI aims to meet the growing demand for performance and large-scale inference capabilities in generative artificial intelligence.

Among the highlighted improvements is a significant increase in performance and greater compatibility with multimodal models, meaning the system can efficiently handle text-to-text, image-to-text, and text-to-image workloads. The built-in integration with vLLM simplifies not only deploying but also serving large language models (LLMs) at scale with optimal performance.

The release introduces a new asynchronous mode that connects directly to vLLM's AsyncLLMEngine, significantly enhancing request handling. This mode manages multiple concurrent requests and delivers outputs with better performance than the rolling-batch implementation offered in version 14.
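As a rough illustration, here is a minimal deployment sketch using the SageMaker Python SDK. The image URI, model ID, instance type, and the `OPTION_*` environment variables are assumptions drawn from typical LMI configuration patterns; confirm the exact values against the official LMI v15 documentation for your region before use.

```python
import sagemaker
from sagemaker import Model

role = sagemaker.get_execution_role()

# Assumed LMI v15 image URI and option names; verify against the LMI docs.
model = Model(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128",
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # any supported model ID
        "OPTION_ASYNC_MODE": "true",                        # enable the new async mode
        "OPTION_ROLLING_BATCH": "disable",                  # async mode replaces rolling batch
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="lmi-v15-async-demo",
)
```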

The vLLM V1 engine promises a performance increase of up to 111% compared to its predecessor V0, especially under high concurrency and with smaller models. This has been achieved by reducing CPU load, optimizing execution paths, and using system resources more efficiently. Although LMI version 15 uses the V1 engine by default, users can revert to V0 if needed, as the sketch below illustrates.
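A minimal sketch of the fallback, assuming the container passes the standard vLLM environment variable `VLLM_USE_V1` through to the engine:

```python
# Hypothetical fallback configuration: LMI v15 defaults to the vLLM V1
# engine; setting vLLM's VLLM_USE_V1=0 in the container environment
# reverts to the V0 engine.
env = {
    "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",
    "VLLM_USE_V1": "0",  # fall back to the vLLM V0 engine
}
```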

Furthermore, API schema support has been expanded, offering three flexible options to ease integration with applications using popular API patterns. Specific optimizations have also been implemented for models that combine vision and language, including more efficient caching for multimodal inputs.
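For example, an endpoint deployed as above can be invoked with boto3. The payload below assumes an OpenAI Chat Completions-style schema, one of the popular API patterns the release targets; the endpoint name carries over from the earlier sketch.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Assumed OpenAI Chat Completions-style request body.
payload = {
    "messages": [
        {"role": "user", "content": "Summarize vLLM V1 in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

response = runtime.invoke_endpoint(
    EndpointName="lmi-v15-async-demo",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```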

The list of models supported in LMI v15 includes, among others, Llama 4 and Gemma 3, which can be deployed by specifying the corresponding Hugging Face model ID. Benchmark tests of the V1 engine showed advantages ranging from 24% to 111% over V0, depending on the model used.
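Selecting a model then comes down to swapping the ID in the container environment. The IDs below are hypothetical examples; confirm the exact repository names on the Hugging Face Hub.

```python
# Hypothetical Hugging Face model IDs for two of the supported families.
llama4_env = {"HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E-Instruct"}
gemma3_env = {"HF_MODEL_ID": "google/gemma-3-27b-it"}
```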

This new Amazon SageMaker LMI container represents a significant advance in large-model inference capabilities. With the revamped vLLM V1 engine, the new asynchronous operation mode, and broader model support, users are invited to explore what this update offers for deploying their generative artificial intelligence models.

via: MiMub in Spanish
