AINF — Part 1: The Inference Fabric

May 11, 2026 | Nalin Pai

We recently announced the Arrcus AI Networking Fabric (AINF), a multi-cloud Kubernetes-native fabric for cross-site inference routing. This is the first post in a series where we introduce AINF and the problems it sets out to solve.

As industry momentum shifts from AI training to inference, inference is becoming a distributed discipline, with workloads spreading toward the edge and growing more agentic. Several forces drive this: the desire for full ownership of and autonomy over infrastructure, the need to place compute close to users for latency-sensitive workloads, data residency mandates that dictate where data can be processed, and power arbitrage that sends GPUs wherever energy is cheapest. The result is GPU capacity fragmented across 5, 20, or even 100 sites, including data centers, micro-DCs, and edge locations.

Operating across that footprint calls for a policy-driven fabric for AI inference, spanning the deployed GPU fleets and directing each request to the best site based on several inputs: site capacity, current load, model and adapter availability, conversation affinity for multi-turn workloads, application service-level requirements, geo-fencing policies, and inter-site network conditions. That's AINF in a nutshell. The rest of this post goes over these concepts in more detail.

Challenges that define the inference fabric

Building such a fabric at scale requires addressing several challenges from first principles, some very specific to inference.

Provisioned capacity

Inference sites differ in what they can serve. An edge PoP might have a handful of GPUs sized for a specific application; a regional DC might hold racks of older-generation GPUs; a core DC might host hundreds of current-generation, high-performance accelerators. Each site has its own ceiling on how many requests it can serve concurrently for different models. A well-designed inference fabric needs to be aware of each site's deployed capacity — hardware type, count, and sustainable throughput — across an operator's footprint.
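
To make that concrete, here is a minimal sketch of what a per-site inventory record might capture. The field names and values are ours for illustration, not AINF's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SiteCapacity:
    """Illustrative per-site inventory record; all field names are hypothetical."""
    site_id: str            # e.g. "edge-pop-01"
    gpu_model: str          # hardware type, e.g. "H100-SXM", "L40S"
    gpu_count: int          # deployed accelerators at the site
    max_concurrency: dict   # model name -> sustainable concurrent requests

# A small edge PoP sized for one application vs. a core DC:
edge = SiteCapacity("edge-pop-01", "L40S", 8,
                    {"mistral-7b-intent-ft": 64})
core = SiteCapacity("core-dc-01", "H100-SXM", 512,
                    {"llama-70b": 2048})
```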

Available capacity

Available capacity changes second by second as traffic moves. Response times vary widely across requests: a short classification call can finish in milliseconds, while a 30K-token reasoning prompt on a 70B model can occupy a worker for many seconds. Prompt size, model size, and request type all shape completion time, so small shifts in request mix produce large swings in available capacity. The most predictive signals for inference admission — KV-cache pressure, queue depth, in-flight prefill — measure different dimensions of that load. A site can be at 70% GPU-core utilization and still refuse a 30K-token prompt because the KV cache is full. A request placed at an overloaded site pays the cost in wait time before any tokens get generated. The challenge is producing a comparable view of these signals across sites fast enough to influence the next routing decision. Each runtime (e.g., Dynamo, vLLM, llm-d) exposes its own metric definitions, and the fabric polls them on a multi-second scrape cycle — a burst can saturate a site within a single scrape interval.
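
As a rough illustration of that normalization problem, here is a sketch that collapses three runtime signals into a single comparable score. The weights, thresholds, and constants are invented for illustration; a real fabric would calibrate them per runtime and per hardware generation:

```python
def load_score(kv_cache_util: float, queue_depth: int,
               inflight_prefill_tokens: int,
               max_queue: int = 32, prefill_budget: int = 65536) -> float:
    """Collapse runtime-specific load signals into one comparable score in
    [0, 1]. Weights and normalization constants are illustrative only; a real
    fabric would calibrate them per runtime (Dynamo, vLLM, llm-d) and per
    hardware generation.
    """
    if kv_cache_util >= 0.95:
        return 1.0  # cache-bound: the site cannot admit long prompts at all
    q = min(queue_depth / max_queue, 1.0)
    p = min(inflight_prefill_tokens / prefill_budget, 1.0)
    return min(0.5 * kv_cache_util + 0.3 * q + 0.2 * p, 1.0)

# A site at 70% GPU-core utilization but with a full KV cache reports full:
print(load_score(kv_cache_util=0.97, queue_depth=2, inflight_prefill_tokens=8000))  # 1.0
```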

Model and adapter awareness

A telco edge PoP might host a 7B Mistral fine-tune for a specific intent-classification application. A core DC might serve a 70B Llama with several LoRA adapters loaded. A regional DC might serve a quantized open model behind an OpenAI-compatible endpoint. Each request needs to land at a site that can actually serve it. Model identifiers, versions, and adapters vary across sites and change as operators roll new variants out. A request that names a missing LoRA pays a cold-load penalty or gets refused outright.
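
A minimal sketch of the candidate-filtering step this implies, assuming a hypothetical catalog shape that maps each site to its deployed models and loaded adapters:

```python
def serving_sites(sites: dict, model: str, adapter: str | None = None) -> list:
    """Return site IDs that can serve (model, adapter) warm, i.e. without a
    cold-load penalty. `sites` maps site_id -> {"models": {model: [adapters]}}.
    Both the catalog shape and the names are illustrative.
    """
    hits = []
    for site_id, info in sites.items():
        adapters = info["models"].get(model)
        if adapters is None:
            continue                      # model not deployed at this site
        if adapter is not None and adapter not in adapters:
            continue                      # named LoRA not loaded at this site
        hits.append(site_id)
    return hits

fleet = {
    "edge-pop-01": {"models": {"mistral-7b-intent-ft": []}},
    "core-dc-01":  {"models": {"llama-70b": ["lora-billing", "lora-legal"]}},
}
print(serving_sites(fleet, "llama-70b", "lora-billing"))  # ['core-dc-01']
```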

Multi-turn KV reuse

Multi-turn conversations get a faster time to first token (TTFT) when follow-up requests land on a worker that still has the conversation's prefix in its KV cache. Inside a single cluster, KV-aware schedulers handle this by tracking per-worker KV state and steering requests to workers that already hold the prefix. Across sites, the same benefit applies when a conversation stays pinned to the site that originally served it. As load shifts, though, conversations may need to be rerouted, and a follow-up landing on a site without a pre-warmed cache pays the cost in TTFT. Nor is moving the KV cache along with the request an easy way out: the cache built up by a conversation is orders of magnitude larger than the prompt itself. A 10K-token conversation on a 70B model can hold around 3 GB of KV state, and moving that across an inter-site link costs seconds of network time on top of whatever latency the request was already trying to keep low.
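
That 3 GB figure falls out of standard KV-cache arithmetic. A back-of-envelope sketch, assuming a Llama-70B-like geometry with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16 values):

```python
# Back-of-envelope KV-cache size for a 70B-class model with grouped-query
# attention. The geometry below is Llama-70B-like; other models differ.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
tokens = 10_000

# Factor of 2 for the separate K and V tensors held per layer.
kv_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * tokens
print(f"{kv_bytes / 2**30:.2f} GiB")  # ~3.05 GiB for a 10K-token conversation
```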

SLO and tier diversity

Different applications optimize for different things. A conversational AI assistant cares about TTFT in tens to low-hundreds of milliseconds. A planning or design workflow cares about end-to-end latency across multiple inference calls. An offline data-enrichment job cares about cost per million tokens and tolerates higher latency in exchange. With service-level awareness, a priority request can be routed to a local or nearby site to hold its TTFT budget; a lower-tier request can be routed out to a remote site with idle capacity for cost. The challenge is recognizing what each request is asking for and routing it where the right tradeoff lives — matching tier with capacity and latency on every decision.
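
A toy version of that tier-matching decision, with hypothetical tier names and per-site estimates:

```python
def pick_site(tier: str, candidates: list[dict]) -> dict:
    """Choose a destination by tier. Each candidate carries an estimated
    TTFT (ms) and a relative cost per million tokens. The tier names and
    field names are hypothetical, chosen only to illustrate the tradeoff.
    """
    if tier == "interactive":
        # Hold the TTFT budget: prefer the lowest-latency site.
        return min(candidates, key=lambda s: s["est_ttft_ms"])
    # Batch/offline: trade latency for cost per million tokens.
    return min(candidates, key=lambda s: s["cost_per_mtok"])

sites = [
    {"id": "local",  "est_ttft_ms": 80,  "cost_per_mtok": 1.40},
    {"id": "remote", "est_ttft_ms": 420, "cost_per_mtok": 0.35},
]
print(pick_site("interactive", sites)["id"])  # local
print(pick_site("batch", sites)["id"])        # remote
```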

Sovereignty and residency

An inference request carries potentially sensitive data in both directions: the prompt going in, and the response coming back. Both fall under residency and sovereignty rules. A request handling EU personal data may need to terminate at a site inside the EU. A defense workload may need to stay inside a sovereign cloud, a jurisdiction separate from public commercial regions. The two constraints differ in origin, geography for residency and legal authority for sovereignty, but the requirement is the same: honor these rules consistently across a footprint where site geography and jurisdiction vary.
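
One simple way to model such rules is as labels a site must carry before it is admissible for a request. The label scheme below is our invention for illustration, not AINF's policy language:

```python
def admissible(request_labels: set[str], site_labels: set[str]) -> bool:
    """A site is admissible only if it carries every label the request
    demands. The "key:value" label scheme here is illustrative only.
    """
    return request_labels <= site_labels  # subset test: all demands satisfied

print(admissible({"jurisdiction:eu"}, {"jurisdiction:eu", "region:fr-par"}))  # True
print(admissible({"cloud:sovereign"}, {"jurisdiction:eu"}))                   # False
```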

Inter-site network awareness

Cross-site transfers carry real cost: a long-prompt request routed from Mumbai to Dubai pays the prompt's bytes in network cost and the round-trip in latency. But that cost is sometimes the right one to pay. A request can finish faster at a lightly loaded remote site than it would queuing at a saturated local one. The challenge is recognizing when overall TTFT comes out ahead by going cross-site, composing the network path with the destination's available capacity on every overflow decision. The same network that adds cost also presents an opportunity. Modern traffic-engineering technology (segment routing, Flex-Algo, and PCEP-based controllers) lets operators engineer traffic against constraints like delay and bandwidth. Routing decisions that align with those primitives can help meet the request's service-level intent end-to-end.
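
The overflow decision reduces to comparing the network cost of going remote against the queueing cost of staying local. A back-of-envelope sketch, with all numbers illustrative and prefill time omitted on the assumption that it is comparable at both sites:

```python
def cross_site_wins(queue_wait_ms: float, prompt_bytes: int,
                    rtt_ms: float, bw_mbps: float) -> bool:
    """Estimate whether overflowing to a lightly loaded remote site beats
    queuing locally, comparing the TTFT contributions of each choice.
    """
    transfer_ms = prompt_bytes * 8 / (bw_mbps * 1000)   # serialization delay
    remote_ms = rtt_ms + transfer_ms                    # network cost of going remote
    return remote_ms < queue_wait_ms

# A 120 KB prompt over a 40 ms RTT, 500 Mbps inter-site link (~42 ms total)
# comes out well ahead of a 900 ms local queue:
print(cross_site_wins(queue_wait_ms=900, prompt_bytes=120_000,
                      rtt_ms=40, bw_mbps=500))  # True
```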

Introducing AINF

AINF is the inference fabric we've built to take on each of these challenges. It sits above the operator's intra-site inference gateways and runtimes — llm-d, GAIE, NVIDIA Dynamo, the vLLM scheduler, and others — and decides which site should serve each request. AINF is Kubernetes-native, multi-tenant, and designed from first principles for AI inference. It has a fabric-wide view of GPU fleets, deployed models and adapters, provisioned capacity, and current load. Operators define policies that put their GPU fleets to efficient use while honoring the service levels each application is paying for.

AINF leverages well-engineered operator networks to help meet service-level requirements end-to-end. Routing decisions made at the fabric layer align with the network's traffic engineering, so each request's service tier stays consistent from the application through the underlay. The integration becomes even tighter when the underlying network runs ArcOS, Arrcus's network OS, with native support for modern traffic-engineering technology.

AINF has two main components: the Router and the Orchestrator.

The AINF Router sits inside each site's inference stack or stands alone at a non-inference site, like an edge PoP. It accepts OpenAI-compatible requests, routes them at L7 by evaluating model availability, current load, conversation affinity, geo-fencing, and tier policy, and forwards them across sites over mTLS. The Router also interfaces with the inference runtime to extract load and KV-cache utilization signals. Those signals feed the AINF control plane so that peer Routers stay aware of each other's load, scaling to hundreds of sites. The AINF control plane is scalable and distributed: each Router routes every request locally, using its own tables. The Orchestrator sits outside the request path, so the data plane keeps running even when it is unreachable.
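
Because the Router speaks the OpenAI API, pointing a standard client at it is enough to enter the fabric. In the hypothetical example below, the endpoint URL and the session header used as a conversation-affinity hint are our inventions; the request body follows the standard chat-completions schema:

```python
import requests  # third-party HTTP client: pip install requests

# An OpenAI-compatible chat completion sent to a local AINF Router.
# The URL and the X-Session-Id affinity header are hypothetical; only the
# request body follows a real (OpenAI) schema.
resp = requests.post(
    "https://ainf-router.local/v1/chat/completions",
    headers={"Authorization": "Bearer <token>",
             "X-Session-Id": "conv-8842"},   # hypothetical affinity hint
    json={
        "model": "llama-70b",
        "messages": [{"role": "user", "content": "Summarize my last ticket."}],
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])
```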

The AINF Orchestrator manages and provisions Routers from a central site or cloud VPC. It maintains a fabric-wide policy repository and distributes policies through the control plane to every router — geo-fencing rules, service routing policies, and tier entitlements, all managed centrally. The Orchestrator is built for multi-tenancy: telcos, neo-clouds, colocation operators, and network providers can offer AINF as an inference distribution fabric to customers running their own inference fleets. Enterprises running sovereign fleets can install the same software directly to manage inference distribution across their clouds and on-prem footprints. The Orchestrator also aggregates fabric-wide telemetry, giving operators observability and traceability for every request flowing through the fabric.
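
To give a feel for what centrally managed policy might look like, here is a hypothetical policy document covering the three rule families above, expressed as a Python dict. None of the field names come from AINF:

```python
# A hypothetical fabric-wide policy document; every key is illustrative.
policy = {
    "geo_fencing": [
        {"match": {"data_class": "eu-personal"}, "require": {"jurisdiction": "eu"}},
    ],
    "service_routing": [
        {"app": "support-chat",     "model": "llama-70b", "tier": "interactive"},
        {"app": "data-enrichment",  "model": "llama-70b", "tier": "batch"},
    ],
    "tier_entitlements": {
        "interactive": {"ttft_budget_ms": 200},
        "batch":       {"max_cost_per_mtok": 0.50},
    },
}
```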

For the operator, the result is a single service surface across a distributed footprint: higher GPU utilization as load smooths across the fabric, better TTFT for end users through warm-cache and tier-aware routing, and policy enforced from the routing layer — consistently across clouds, regions, and on-prem environments.

Stay tuned for more posts on AINF, including disaggregated prefill and decode, power-aware routing across the GPU fleet, and more.
