Intelligent Inference Scheduling with llm-d
The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework: Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: intelligent inference scheduling. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and more efficient use of resources.
Why Intelligent Scheduling Is Needed for LLM Inference
Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same amount of time. In contrast, LLM requests can vary wildly in token count and compute needs, so simple load-spread strategies are prone to bottlenecks and uneven traffic across replicas, as the short simulation below illustrates.
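To make the imbalance concrete, here is a minimal, hypothetical simulation (not part of llm-d) that compares round-robin placement with a simple least-loaded policy when per-request cost is heavy-tailed, as decode lengths tend to be. The replica count, cost distribution, and policy implementations are illustrative assumptions, not how llm-d schedules requests.

```python
import random

random.seed(0)

NUM_REPLICAS = 4
NUM_REQUESTS = 200

# Request "cost" stands in for decode length: most requests are short,
# but a heavy tail of long generations dominates total work.
costs = [random.paretovariate(1.5) for _ in range(NUM_REQUESTS)]

def round_robin(costs, n):
    """Assign request i to replica i % n, ignoring current load."""
    load = [0.0] * n
    for i, c in enumerate(costs):
        load[i % n] += c
    return load

def least_loaded(costs, n):
    """Assign each request to the replica with the least outstanding work."""
    load = [0.0] * n
    for c in costs:
        target = min(range(n), key=lambda r: load[r])
        load[target] += c
    return load

for name, policy in [("round-robin", round_robin), ("least-loaded", least_loaded)]:
    load = policy(costs, NUM_REPLICAS)
    print(f"{name:>12}: max/min replica load ratio = {max(load) / min(load):.2f}")
```

Under a uniform-cost assumption the two policies look identical, but with heavy-tailed costs the round-robin replicas end up with visibly skewed total work while the load-aware policy stays close to balanced; real inference schedulers go further by also accounting for queue depth, prefix-cache locality, and other replica state.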
