2 posts tagged with "Updates"

Project updates and progress reports

Intelligent Inference Scheduling with llm-d

9 min read
Nili Guy
R&D Manager, AI Infrastructure, IBM
Vita Bortnikov
IBM Fellow, IBM
Etai Lev Ran
Cloud Architect, IBM
Robert Shaw
Director of Engineering, Red Hat
Clayton Coleman
Distinguished Engineer, Google

The llm-d project lays out clear, “well-lit” paths for anyone to adopt the leading inference optimizations within their existing deployment framework: Kubernetes. These are tested approaches designed to make complex deployments easier and more efficient. In this post, we explore the first of these paths: intelligent inference scheduling. Unlike basic round-robin load balancing, this method takes the unique demands of LLMs into account, leading to better performance across the board: higher throughput, lower latency, and efficient use of resources.

Why Intelligent Scheduling Is Needed for LLM Inference

Deploying large language models (LLMs) on Kubernetes has become the norm, but LLM inference workloads behave very differently from standard microservices. Traditional patterns like uniform replicas paired with round-robin load balancing assume each request uses the same amount of resources and finishes in roughly the same time. In contrast, LLM requests can vary wildly in token count and compute needs, making simple load-spread strategies prone to bottlenecks and imbalanced traffic.
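To make the contrast concrete, here is a minimal Python sketch of the two strategies. The replica names, telemetry fields, and scoring weights are hypothetical, not llm-d's actual scheduler API; the point is only that a load-aware scorer avoids sending new work to an already saturated replica, while round-robin cannot.

```python
# Hypothetical replica telemetry; llm-d's inference scheduler works from
# comparable signals (queue depth, KV-cache pressure), but these field
# names and weights are illustrative only.
replicas = [
    {"name": "pod-a", "queue_depth": 2, "kv_cache_util": 0.35},
    {"name": "pod-b", "queue_depth": 9, "kv_cache_util": 0.90},
    {"name": "pod-c", "queue_depth": 4, "kv_cache_util": 0.55},
]

def round_robin(replicas, counter):
    # Load-blind: a replica stuck on one long generation still
    # receives its full share of new requests.
    return replicas[counter % len(replicas)]

def score(replica):
    # Load-aware: prefer short queues and free KV-cache.
    # The 10x weight is an arbitrary choice for this sketch.
    return -(replica["queue_depth"] + 10 * replica["kv_cache_util"])

def pick_least_loaded(replicas):
    return max(replicas, key=score)

print(round_robin(replicas, counter=1)["name"])  # -> pod-b, the most saturated replica
print(pick_least_loaded(replicas)["name"])       # -> pod-a, shortest queue and most free cache
```

In llm-d this kind of scoring happens inside the inference scheduler rather than in application code; the sketch only illustrates why load-aware placement beats load-blind rotation when request costs vary this much.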

[Figure: intelligent inference scheduling diagram]

llm-d Community Update - June 2025

3 min read
Pete Cheslock
AI Community Architect, Red Hat

Hey everyone! We've been making great progress with the llm-d project, and I wanted to share some important updates and opportunities to get involved.

Help Shape the Future of the llm-d Project

To guide the future development of the llm-d project, we need to understand the real-world challenges, configurations, and performance needs of our community. We've created a short survey to gather insight into how you serve Large Language Models, from the hardware you use to the features you need most.

This anonymous, vendor-agnostic survey will take approximately 5 minutes to complete. Your input will directly influence the project's roadmap and priorities. The aggregated results will be shared with the llm-d-contributors mailing list to benefit the entire community.

Your Input Will Define Our Roadmap

We've created an llm-d Community Roadmap Survey to gather information about your LLM workloads. We are looking to learn more about:

  • Your Serving Environment: The hardware you use now and anticipate using in a year (like NVIDIA GPUs, AMD GPUs, or CPUs), and whether you run on-premises, in the cloud, or on edge devices.
  • Your Model Strategy: Whether you serve a few large models or many smaller ones, which model families (like Llama or Mistral) are most common, and how you use techniques like LoRA adapters.
  • Your Performance Requirements: Your real-world SLOs for latency and throughput, and the biggest LLM serving challenges you face, from cost optimization to operational ease of use.
  • Your Future Needs: The single new feature you would prioritize for an LLM Model-as-a-Service, to help guide our innovation.

Take the 5-Minute Survey

Your participation is invaluable. Please take a few minutes to complete the survey. We encourage you to share it with other users, or to represent their needs in your own response, so that our direction reflects the community's diverse requirements.