Distributed Load Balancing at Scale for Generative AI Inference

9/11/25 | 4:15pm | E51-145

Santiago Balseiro

George E. Warren Professor of Business
Graduate School of Business, Columbia University

Abstract: Generative AI inference, the process of using a trained generative artificial intelligence model to produce outputs, requires substantial and expensive computational resources. The growing popularity of Generative AI models has placed unprecedented demands on backend infrastructure, making efficient load balancing critical for scalable systems. This talk gives an overview of our work on a novel distributed load balancing system designed to manage Generative AI requests at large scale, with a primary focus on minimizing user latency. On the algorithmic side, we introduce two dynamic algorithms for routing requests when inference servers have workload-dependent service rates. The first, Greatest Marginal Service Rate (GMSR), is a fast-reacting algorithm that routes each request to the backend where it will have the highest marginal impact on the current service rate. We prove that, in settings without network latency, GMSR converges in a distributed fashion to the optimal, centrally coordinated routing solution that minimizes overall system latency. To handle network latencies, which can be large in global systems, we introduce Distributed Gradient Descent Load Balancing (DGD-LB), a probabilistic routing algorithm that dynamically adjusts routing probabilities using gradient descent. We present sufficient conditions on the gradient descent step size that guarantee convergence to the optimal routing solution in the presence of network latencies. Experiments show that our algorithms can yield substantial gains relative to other load balancers studied in the literature.
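
To make the two routing ideas in the abstract concrete, the following is a minimal illustrative sketch and not the implementation from the papers. Everything in it is an assumption made for illustration: the saturating service-rate curves, the function names (gmsr_route, project_to_simplex, gradient_step), the gradient values, and the step size. In particular, gradient_step is a generic projected gradient step on the probability simplex, which may differ from the update rule actually used in DGD-LB.

```python
from typing import Callable, List, Sequence


def gmsr_route(rates: Sequence[Callable[[int], float]],
               loads: Sequence[int]) -> int:
    """Pick the backend whose service rate grows most from one more request.

    rates[i](n) is a hypothetical service rate of backend i when it holds
    n requests; loads[i] is its current number of requests.
    """
    gains = [rate(n + 1) - rate(n) for rate, n in zip(rates, loads)]
    return max(range(len(gains)), key=gains.__getitem__)


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    running_sum = 0.0
    theta = 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        running_sum += u_k
        candidate = (running_sum - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold of the last valid k
    return [max(x - theta, 0.0) for x in v]


def gradient_step(probs: Sequence[float],
                  gradient: Sequence[float],
                  step_size: float) -> List[float]:
    """One generic projected gradient step on routing probabilities.

    gradient[i] stands in for an estimate of how much latency increases
    when more traffic is sent to backend i (assumed to be given).
    """
    moved = [p - step_size * g for p, g in zip(probs, gradient)]
    return project_to_simplex(moved)


# Assumed saturating service-rate curves: rate(n) = cap * n / (n + half).
def make_rate(cap: float, half: float) -> Callable[[int], float]:
    return lambda n: cap * n / (n + half)


rates = [make_rate(10.0, 4.0), make_rate(6.0, 2.0)]

# Backend 0 holds 3 requests, backend 1 holds 1 request.
chosen = gmsr_route(rates, loads=[3, 1])

# Shift probability mass away from the backend with the larger gradient.
new_probs = gradient_step([0.5, 0.5], gradient=[1.0, 0.0], step_size=0.1)
```

With these assumed curves, one more request raises backend 0's rate by about 0.71 and backend 1's by 1.0, so the greedy rule picks backend 1. The gradient step moves the routing probabilities from (0.5, 0.5) to roughly (0.45, 0.55). The step size is chosen arbitrarily here; the abstract notes that convergence under network latency depends on conditions on the step size.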

Papers:

https://arxiv.org/abs/2411.17103

https://arxiv.org/abs/2504.10693

Bio: Santiago R. Balseiro is the George E. Warren Professor of Business at the Graduate School of Business, Columbia University, and a research scientist at Google Research. His research develops novel methodological approaches that combine dynamic optimization, stochastic modeling, and game theory to address fundamental problems in the digital economy. His work tackles central problems in internet advertising while making methodological contributions to large-scale sequential decision-making under uncertainty and to dynamic optimization with incentives. His research has been recognized by numerous awards, including an early career award, a best dissertation award, and several best paper awards.