KV cache offload for LLM inference

L4 context memory for GPU clusters.

GPUCache moves long-context KV cache out of HBM and into a shared NVMe flash tier built on BlueField DPUs, RDMA, and SPDK.

The problem

Stop paying the recompute tax.

Long-context LLM inference puts pressure on GPU HBM. Once KV cache no longer fits in VRAM, serving stacks either evict it or recompute it later, which burns GPU cycles on work the model has already done.
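A rough way to see the pressure is to estimate KV cache size directly. The sketch below is illustrative only: the `KvShape` struct and its field names are hypothetical, and the formula assumes one key and one value vector per layer per KV head.

```rust
/// Shape of a transformer's KV cache. All fields are illustrative
/// assumptions, not values taken from any specific deployment.
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    /// Bytes per stored element, e.g. 2 for fp16 or bf16.
    pub bytes_per_elem: u64,
}

impl KvShape {
    /// Bytes of KV cache per token: a key and a value vector (the factor 2)
    /// for every layer and every KV head.
    pub fn bytes_per_token(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// Bytes of KV cache for one sequence of `tokens` tokens.
    pub fn bytes_for_tokens(&self, tokens: u64) -> u64 {
        self.bytes_per_token() * tokens
    }
}
```

With an assumed shape of 32 layers, 32 KV heads, head dimension 128, and 2-byte elements, this gives 512 KiB per token, so a single 32,768-token sequence needs 16 GiB of KV cache.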

GPUCache puts that cache in an EBOF NVMe tier controlled by BlueField DPUs. The goal is a storage path built on RDMA and SPDK, with no x86 storage host in the hot path.

The core move

Take x86 storage nodes out of the hot path.

GPUs read and write context over RDMA to NVMe flash behind a BlueField DPU. The fast path avoids host CPU scheduling, host DRAM copies, and the normal TCP/IP storage stack.

Architecture

A DPU-first storage path for KV cache.

01

Rust, no GC pauses

Rust keeps tail latency predictable and memory use tight on the ARM cores inside the DPU.

02

EBOF without a host server

GPUCache runs on BlueField-3/4 DPUs and uses SPDK to manage NVMe SSDs behind a PCIe switch.

03

RDMA end to end

Context moves from GPU VRAM over Spectrum-X to the DPU, then into flash, without host CPU intervention.

04

Erasure coding on the DPU

DOCA can push parity work onto DPU acceleration engines instead of spending CPU cycles on replication.
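To show what parity work means, here is a minimal software sketch using single XOR parity. It is a stand-in for illustration: a hardware-offloaded scheme would more likely use a Reed-Solomon style code that tolerates multiple failures, and nothing here calls DOCA. The function names are hypothetical.

```rust
/// Compute one XOR parity shard over equally sized data shards.
pub fn xor_parity(shards: &[Vec<u8>]) -> Vec<u8> {
    let len = shards.first().map_or(0, |s| s.len());
    let mut parity = vec![0u8; len];
    for shard in shards {
        assert_eq!(shard.len(), len, "shards must be the same length");
        for (p, b) in parity.iter_mut().zip(shard) {
            *p ^= *b;
        }
    }
    parity
}

/// Rebuild the single missing data shard from the surviving data shards
/// plus the parity shard. XOR-ing everything that survived cancels the
/// known shards and leaves the lost one.
pub fn rebuild_missing(survivors: &[Vec<u8>], parity: &[u8]) -> Vec<u8> {
    let mut out = parity.to_vec();
    for shard in survivors {
        assert_eq!(shard.len(), out.len(), "shards must be the same length");
        for (o, b) in out.iter_mut().zip(shard) {
            *o ^= *b;
        }
    }
    out
}
```

Single parity survives the loss of any one shard while storing only one extra shard, which is the basic trade that erasure coding makes against full replication.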

05

KV-aware prefetch and eviction

The cache engine can track access patterns from vLLM, TensorRT-LLM, and PagedAttention blocks.
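As one possible eviction policy, the sketch below tracks block accesses and evicts the least recently used block. LRU is an assumption chosen for simplicity, not a statement of the policy GPUCache uses, and `BlockId` and `LruTracker` are hypothetical names.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical identifier for one fixed-size KV block.
pub type BlockId = u64;

/// Tracks which KV blocks stay in the fast tier and evicts the least
/// recently used block once capacity is reached.
pub struct LruTracker {
    capacity: usize,
    /// Access order; the front is the least recently used block.
    order: VecDeque<BlockId>,
    resident: HashSet<BlockId>,
}

impl LruTracker {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), resident: HashSet::new() }
    }

    /// Record an access to `id`. Returns the block evicted to make room, if any.
    pub fn touch(&mut self, id: BlockId) -> Option<BlockId> {
        if self.resident.contains(&id) {
            // Move the block to the most recently used position.
            // The linear scan is O(n), which is acceptable for a sketch.
            if let Some(pos) = self.order.iter().position(|&b| b == id) {
                self.order.remove(pos);
            }
            self.order.push_back(id);
            return None;
        }
        let evicted = if self.resident.len() == self.capacity {
            let victim = self.order.pop_front();
            if let Some(v) = victim {
                self.resident.remove(&v);
            }
            victim
        } else {
            None
        };
        self.resident.insert(id);
        self.order.push_back(id);
        evicted
    }

    pub fn is_resident(&self, id: BlockId) -> bool {
        self.resident.contains(&id)
    }
}
```

A tracker with capacity 2 that sees blocks 1, 2, 1, 3 evicts block 2, because block 1 was touched more recently.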

Data path

GPU VRAM to DPU to NVMe flash.

GPU VRAM
RDMA / RoCEv2
BlueField DPU
SPDK
NVMe Flash
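From the serving stack's side, this path could be hidden behind a small block interface. The sketch below is hypothetical: the `KvBlockStore` trait and `InMemoryStore` type are invented for illustration, and the in-memory map only stands in for the RDMA, DPU, and NVMe stages.

```rust
use std::collections::HashMap;

/// Hypothetical interface a serving stack might use to offload KV blocks.
pub trait KvBlockStore {
    /// Store the bytes of one KV block under `block_id`.
    fn put(&mut self, block_id: u64, data: &[u8]);
    /// Fetch a previously stored block, if present.
    fn get(&self, block_id: u64) -> Option<&[u8]>;
}

/// In-memory stand-in for the remote flash tier, useful for local tests.
#[derive(Default)]
pub struct InMemoryStore {
    blocks: HashMap<u64, Vec<u8>>,
}

impl KvBlockStore for InMemoryStore {
    fn put(&mut self, block_id: u64, data: &[u8]) {
        self.blocks.insert(block_id, data.to_vec());
    }

    fn get(&self, block_id: u64) -> Option<&[u8]> {
        self.blocks.get(&block_id).map(|v| v.as_slice())
    }
}
```

Keeping the interface this narrow would let the same caller code run against a mock in tests and against a real RDMA-backed implementation in production.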

GPUCache vs. the status quo

A KV-cache tier built for the GPU hot path.

Feature | Traditional Storage | MinIO MemKV (Go) | RustFS GPUCache (Rust)
Language | C / C++ / Java | Go, subject to GC pauses | Rust, deterministic and zero-GC
Data Path | CPU → Memory → NIC | RDMA direct to DPU | RDMA direct to DPU
Storage Engine | File / Object (S3) | Proprietary NixL | KV-optimized NVMe-oF
Resilience | 3x replication | Unknown / WIP | Hardware-offloaded erasure coding
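The resilience row comes down to raw capacity per byte of user data. A small sketch of that arithmetic follows. The function names are hypothetical, and the 4 data + 2 parity layout in the note below is an assumed example, not a stated GPUCache configuration.

```rust
/// Raw bytes stored per byte of user data for an erasure-coded layout
/// with `data_shards` data shards and `parity_shards` parity shards.
pub fn ec_overhead(data_shards: u32, parity_shards: u32) -> f64 {
    assert!(data_shards > 0, "need at least one data shard");
    f64::from(data_shards + parity_shards) / f64::from(data_shards)
}

/// Raw bytes stored per byte of user data when keeping `copies` full replicas.
pub fn replication_overhead(copies: u32) -> f64 {
    f64::from(copies)
}
```

Under these assumptions, 3x replication stores 3.0 bytes per user byte, while a 4 + 2 erasure-coded layout stores 1.5 and still tolerates the loss of two shards.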

Roadmap

What is being built now.

  1. Core Rust KV engine.
  2. RDMA / RoCEv2 communication layer.
  3. SPDK integration for direct NVMe addressing.
  4. NVIDIA DOCA support, including ARM64 cross-compilation.
  5. DPU hardware erasure-coding offload.

Call for contributors

Help build the DPU storage layer for LLM context.

GPUCache needs contributors who know Rust systems programming, RDMA and RoCEv2 networking, NVIDIA DOCA, BlueField DPUs, ARM64 cross-compilation, vLLM, TensorRT-LLM, and PagedAttention.