AWS introduces NIXL support with EFA to enhance LLM inference scalability

AWS has integrated the NVIDIA Inference Xfer Library (NIXL) with the Elastic Fabric Adapter (EFA) to improve the performance of large language model (LLM) inference on Amazon EC2.

Amazon Web Services (AWS) has announced the integration of the NVIDIA Inference Xfer Library (NIXL) with its Elastic Fabric Adapter (EFA) to boost the performance of disaggregated large language model (LLM) inference on Amazon EC2. The integration targets three areas critical to disaggregated serving: raising KV-cache transfer throughput, reducing inter-token latency, and trimming KV-cache memory usage.

With NIXL support over EFA, AWS enables high-throughput, low-latency KV-cache transfers between prefill and decode nodes, as well as efficient movement of KV-cache data across storage tiers. NIXL runs on all EFA-enabled EC2 instances and integrates with several frameworks, including NVIDIA Dynamo, SGLang, and vLLM, so users can pair their preferred EC2 instance type with their preferred serving framework for scalable disaggregated inference.
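To make the prefill/decode split concrete, here is a deliberately toy Python sketch of what disaggregated serving does: a prefill node processes the whole prompt once to build the KV-cache, the cache is shipped to a decode node (the step NIXL accelerates over EFA in practice), and the decode node extends it token by token without re-running prefill. All names and structures here are illustrative stand-ins, not the NIXL API.

```python
from dataclasses import dataclass, field
import pickle

@dataclass
class KVCache:
    # Stand-ins for the per-token key/value tensors a real model would hold.
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

def prefill(prompt_tokens):
    """Prefill node: process the full prompt once, one KV entry per token."""
    cache = KVCache()
    for t in prompt_tokens:
        cache.keys.append(("K", t))
        cache.values.append(("V", t))
    return cache

def transfer(cache):
    """Stand-in for the NIXL/EFA hop: serialize on the prefill node,
    deserialize on the decode node."""
    return pickle.loads(pickle.dumps(cache))

def decode(cache, n_new_tokens):
    """Decode node: extend the received cache token by token,
    with no need to re-run prefill."""
    out = []
    for i in range(n_new_tokens):
        tok = f"tok{i}"
        cache.keys.append(("K", tok))
        cache.values.append(("V", tok))
        out.append(tok)
    return out

prompt = ["the", "quick", "fox"]
cache = transfer(prefill(prompt))  # KV-cache moves prefill -> decode
generated = decode(cache, 2)
print(len(cache.keys), generated)  # 5 ['tok0', 'tok1']
```

The point of the split is that prefill is compute-bound and decode is memory-bandwidth-bound, so running them on separate nodes lets each be scaled and scheduled independently; the cost is the cache handoff in the middle, which is exactly the transfer this integration speeds up.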

AWS has made NIXL version 1.0.0 and later available with EFA installer version 1.47.0 and later on all EFA-enabled EC2 instance types in every AWS Region, at no additional charge. For further details, see the EFA documentation.
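For reference, the general EFA installer workflow documented by AWS looks roughly like the following; this is an install sketch, not a tested script, and the exact version number and any NIXL-specific installer options should be confirmed against the current EFA documentation.

```shell
# Download and unpack the EFA installer
# (1.47.0 is the minimum release described above)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.47.0.tar.gz
tar -xf aws-efa-installer-1.47.0.tar.gz
cd aws-efa-installer

# Run the installer; -y accepts prompts non-interactively
sudo ./efa_installer.sh -y

# Verify that the EFA libfabric provider is visible on the instance
fi_info -p efa
```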