Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

NVIDIA Technical Blog

Mar 25, 2026

Rigid model-to-GPU mapping in Kubernetes wastes GPU capacity and underutilizes small inference workloads

GPU waste in Kubernetes arises when the scheduler allocates a whole GPU to each model. A small inference workload, such as an ASR or TTS model that needs only about 10 GB of VRAM, then occupies an entire device and strands the remaining memory and compute, reducing infrastructure efficiency and throughput. Consolidating these underutilized workloads through GPU sharing policies recovers that stranded capacity.
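To illustrate the rigid mapping: the standard NVIDIA device plugin exposes GPUs as the extended resource `nvidia.com/gpu`, which pods must request in whole integers, so even a tiny model claims a full device. One common mitigation is the device plugin's time-slicing mode, which advertises each physical GPU as multiple schedulable replicas. The sketch below is illustrative, not taken from the article; the replica count and resource names are assumptions you would tune for your own cluster.

```yaml
# A pod requesting a GPU must ask for a whole device:
# fractional values like 0.25 are rejected by the scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: small-asr-model   # hypothetical workload name
spec:
  containers:
  - name: asr
    image: example.com/asr-server:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # claims the entire GPU, even if ~10 GB suffices
---
# Device-plugin time-slicing config (a sketch): each physical GPU is
# advertised as 4 replicas, letting four small pods share one device.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

With time-slicing enabled, the same `nvidia.com/gpu: 1` request now maps to one of four shared slots, so consolidation happens without changing pod specs. Note that time-slicing shares compute but does not partition or limit VRAM; hardware partitioning (MIG) is the alternative when memory isolation matters.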