
Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads

Engineering at Meta

Mar 31, 2026


Embedding Memory Management And Distributed Serving Enable Large Scale Recommender Systems

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads · Engineering at Meta

Science, Technology & Innovation · Mar 31, 2026

Meta argues that the real bottleneck for serving trillion-parameter recommenders is recommendation-scale embedding tables, not just large transformer layers. It addresses this with a distributed memory architecture plus embedding-specific controls (feature-adaptive hashing, pruning, unified embeddings, and multi-GPU sharding with hardware-aware communication) to reach terabyte-scale, O(1T)-parameter models with single-card performance parity, fast loading, autoscaling, and production reliability, shifting the competitive edge to memory topology and embedding economics.
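The embedding-specific controls above can be sketched in miniature. This is an illustrative toy, not Meta's implementation: the feature names, table sizes, shard count, and hashing scheme are all assumptions chosen to show the idea of feature-adaptive hash tables fanned out across device shards.

```python
# Sketch of feature-adaptive hashed embedding lookup with multi-GPU-style
# sharding. All names and sizes here are illustrative assumptions.
import hashlib

# Per-feature hash-table sizes: high-cardinality features get bigger tables,
# low-value features are pruned to small ones ("feature-adaptive hashing").
FEATURE_TABLE_SIZES = {
    "user_id": 1 << 20,      # large table for a high-cardinality feature
    "ad_category": 1 << 10,  # small table for a low-cardinality feature
}
NUM_SHARDS = 8  # stand-in for the number of GPUs holding embedding shards


def hash_slot(feature: str, raw_id: str) -> int:
    """Map a raw ID into the feature's hash table (collisions are accepted)."""
    digest = hashlib.md5(f"{feature}:{raw_id}".encode()).hexdigest()
    return int(digest, 16) % FEATURE_TABLE_SIZES[feature]


def shard_of(slot: int) -> int:
    """Assign a table row to a shard so lookups fan out across devices."""
    return slot % NUM_SHARDS


slot = hash_slot("user_id", "user_12345")
print(slot, shard_of(slot))  # deterministic slot within the table, shard in 0..7
```

The point of the sketch is the memory economics: table size is chosen per feature, and row-to-shard assignment spreads the total footprint across devices instead of requiring any single card to hold the full parameter set.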



Architecture And Hardware Co-Design With Layerwise Precision Policies And Kernel Fusion Enables Efficient Heterogeneous Inference

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads · Engineering at Meta

Science, Technology & Innovation · Mar 31, 2026

Meta accelerated recommendation-model inference by co-designing architecture and hardware: micro-benchmarked, layer-wise selective FP8 post-training quantization and operator/kernel fusion (grouped GEMM, horizontal fusion) cut small-operator and memory-movement overheads, raising FLOPs utilization to roughly 35% across diverse hardware and showing that heterogeneous fleets can be served with per-layer precision and execution policies.



ROI From Large Ranking Models Depends On Latency-Preserving Infrastructure And Intelligent Request Routing

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads · Engineering at Meta

Business, Finance & Industries · Mar 31, 2026

Meta launched an Adaptive Ranking Model on Instagram in Q4 2025 that routes each request to the most effective model while preserving strict sub-second latency. The system produced +3% ad conversions and +5% ad CTR for targeted users, underscoring that ROI from larger ranking models depends on low-latency, cost-efficient serving rather than model size alone.
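Latency-aware routing of the kind described can be sketched with a simple tier-selection rule. The model tiers, quality scores, latencies, and budget below are invented placeholders; the summary does not describe Meta's actual routing policy, only that requests go to the most effective model within a latency constraint.

```python
# Sketch of latency-aware request routing: pick the highest-quality model
# tier whose predicted latency fits the request's budget. All tiers and
# numbers are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ModelTier:
    name: str
    quality: float        # offline ranking-quality score (higher is better)
    p99_latency_ms: float  # predicted tail latency for this tier


TIERS = [
    ModelTier("baseline", quality=1.00, p99_latency_ms=40.0),
    ModelTier("large",    quality=1.03, p99_latency_ms=90.0),
    ModelTier("xl",       quality=1.05, p99_latency_ms=220.0),
]


def route(request_budget_ms: float) -> str:
    """Route to the best-quality tier that still meets the latency budget."""
    eligible = [t for t in TIERS if t.p99_latency_ms <= request_budget_ms]
    if not eligible:
        return "baseline"  # fall back rather than violate the budget
    return max(eligible, key=lambda t: t.quality).name


print(route(100.0))  # "large": the "xl" tier exceeds the 100 ms budget
```

The economic point the sketch makes is the same one the summary draws: the larger model only pays off on requests whose latency budget can absorb it, so routing (not raw model size) determines realized ROI.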



Per Request Inference With Computation Sharing Enables Sublinear Scaling For LLM-Scale Ranking

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads · Engineering at Meta

Science, Technology & Innovation · Mar 31, 2026

Meta's Adaptive Ranking Model shifts from per-candidate to per-request inference: dense user signals are computed once per request and shared across ads via request-oriented computation sharing, in-kernel broadcast, and a centralized key-value log. The result is claimed sub-linear cost scaling and LLM-level model complexity at roughly 100 ms latency.
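The per-request sharing pattern can be sketched in a few lines. The "user tower" and dot-product scorer below are toy placeholders standing in for the heavy dense computation and the cheap per-candidate interaction; nothing here reflects Meta's actual model code.

```python
# Sketch of per-request computation sharing: the expensive user
# representation is computed once per request and reused across every
# candidate ad, so cost grows with requests, not requests x candidates.
# Both functions are toy stand-ins, not Meta's implementation.


def expensive_user_tower(user_features: list) -> list:
    """Stand-in for the heavy dense computation done once per request."""
    return [f * 2.0 for f in user_features]


def score_ad(user_repr: list, ad_repr: list) -> float:
    """Cheap per-candidate interaction (here, a dot product)."""
    return sum(u * a for u, a in zip(user_repr, ad_repr))


def rank_request(user_features: list, candidate_ads: list) -> list:
    """Rank all candidates for one request, sharing the user computation."""
    user_repr = expensive_user_tower(user_features)  # computed once, shared
    return sorted(candidate_ads,
                  key=lambda ad: score_ad(user_repr, ad),
                  reverse=True)


# One expensive tower call serves all three candidates below.
print(rank_request([1.0, 2.0], [[1, 0], [0, 1], [1, 1]]))  # [[1, 1], [0, 1], [1, 0]]
```

This is the structural source of the sub-linear scaling claim: the per-candidate work is reduced to a lightweight interaction against a shared representation, which in production would be broadcast in-kernel and cached in the centralized key-value log the summary mentions.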