KV cache offload for LLM inference

L4 context memory for GPU clusters.

GPUCache moves long-context KV cache out of HBM and into a shared NVMe flash tier built on BlueField DPUs, RDMA, and SPDK.

The problem

Stop paying the recompute tax.

Long-context LLM inference puts pressure on GPU HBM. Once the KV cache no longer fits in HBM, serving stacks evict it and recompute it when the context is needed again, which burns GPU cycles on work the model has already done.
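A rough size estimate shows how quickly HBM fills up. The sketch below assumes a standard transformer KV layout, with one key vector and one value vector per layer per KV head. The struct and function names are hypothetical and are not part of GPUCache.

```rust
/// Hypothetical description of a model's attention shape (illustrative only).
#[derive(Debug, Clone, Copy)]
pub struct ModelShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    /// Bytes per stored element, e.g. 2 for FP16 or BF16.
    pub bytes_per_elem: u64,
}

/// KV cache bytes for a single token: one key and one value vector
/// per layer per KV head, hence the factor of 2.
pub fn kv_bytes_per_token(m: &ModelShape) -> u64 {
    2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_elem
}

/// KV cache bytes for one sequence of `tokens` tokens.
pub fn kv_bytes_for_context(m: &ModelShape, tokens: u64) -> u64 {
    kv_bytes_per_token(m) * tokens
}
```

With an illustrative shape of 32 layers, 8 KV heads, a head dimension of 128, and 2-byte elements, this works out to 128 KiB per token, or 16 GiB for a single 131,072-token context.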

GPUCache puts that cache in an EBOF (Ethernet Bunch of Flash) NVMe tier controlled by BlueField DPUs. The goal is a storage path built on RDMA and SPDK, with no x86 storage host in the hot path.

The core move

Take x86 storage nodes out of the hot path.

GPUs read and write context over RDMA to NVMe flash behind a BlueField DPU. The fast path avoids host CPU scheduling, host DRAM copies, and the normal TCP/IP storage stack.

Architecture

A DPU-first storage path for KV cache.

Rust, no GC pauses

Rust has no garbage collector, which keeps tail latency predictable and memory use tight on the ARM cores inside the DPU.

EBOF without a host server

GPUCache runs on BlueField-3/4 DPUs and uses SPDK to manage NVMe SSDs behind a PCIe switch.

RDMA end to end

Context moves from GPU VRAM over Spectrum-X to the DPU, then into flash, without host CPU intervention.

Erasure coding on the DPU

DOCA can offload erasure-coding parity computation to the DPU's acceleration engines, instead of spending CPU cycles on parity or storing full replicas.
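To make the idea of parity concrete, here is a minimal software sketch of the simplest erasure code: a single XOR parity shard that can rebuild one lost data shard. It is an illustration only. It does not use DOCA, it runs on the CPU, and production erasure coding typically uses Reed-Solomon codes that tolerate more than one failure. The function names are hypothetical.

```rust
/// Compute one XOR parity shard over equally sized data shards.
pub fn xor_parity(shards: &[Vec<u8>]) -> Vec<u8> {
    let len = shards.first().map_or(0, |s| s.len());
    let mut parity = vec![0u8; len];
    for shard in shards {
        assert_eq!(shard.len(), len, "shards must be the same length");
        for (p, b) in parity.iter_mut().zip(shard) {
            *p ^= *b;
        }
    }
    parity
}

/// Rebuild a single missing data shard from the surviving data shards
/// and the parity shard. XOR-ing the survivors into the parity cancels
/// them out and leaves the missing shard.
pub fn rebuild_missing(survivors: &[Vec<u8>], parity: &[u8]) -> Vec<u8> {
    let mut out = parity.to_vec();
    for shard in survivors {
        assert_eq!(shard.len(), out.len(), "shards must be the same length");
        for (o, b) in out.iter_mut().zip(shard) {
            *o ^= *b;
        }
    }
    out
}
```

The inner loops are the byte-wise work that a hardware offload would take off the general-purpose cores.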

KV-aware prefetch and eviction

The cache engine can track access patterns from vLLM, TensorRT-LLM, and PagedAttention blocks.
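As a baseline for what access tracking enables, the sketch below implements plain least-recently-used eviction over fixed-size KV blocks identified by an integer ID. This is a generic LRU policy, not GPUCache's actual policy; the `BlockLru` type and its methods are hypothetical, and eviction scans all entries for simplicity rather than speed.

```rust
use std::collections::HashMap;

/// Tracks which KV blocks are resident and when each was last used.
pub struct BlockLru {
    capacity: usize,
    clock: u64,
    /// Block ID -> logical time of the most recent access.
    last_used: HashMap<u64, u64>,
}

impl BlockLru {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { capacity, clock: 0, last_used: HashMap::new() }
    }

    /// Record an access to `block`. If admitting a new block pushes the
    /// cache over capacity, evict the least recently used block and
    /// return its ID.
    pub fn touch(&mut self, block: u64) -> Option<u64> {
        self.clock += 1;
        let is_new = self.last_used.insert(block, self.clock).is_none();
        if is_new && self.last_used.len() > self.capacity {
            // Linear scan for the oldest access time.
            let victim = self
                .last_used
                .iter()
                .min_by_key(|&(_, tick)| *tick)
                .map(|(id, _)| *id);
            if let Some(id) = victim {
                self.last_used.remove(&id);
                return Some(id);
            }
        }
        None
    }

    pub fn contains(&self, block: u64) -> bool {
        self.last_used.contains_key(&block)
    }
}
```

A KV-aware policy could replace the access-time key with signals from the serving engine, such as whether a block belongs to a shared prompt prefix.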

Data path

GPU VRAM to DPU to NVMe flash.

GPU VRAM

RDMA / RoCEv2

BlueField DPU

SPDK

NVMe Flash

GPUCache vs. the status quo

A KV-cache tier built for the GPU hot path.

Feature	Traditional Storage	MinIO MemKV (Go)	RustFS GPUCache (Rust)
Language	C / C++ / Java	Go, subject to GC pauses	Rust, deterministic and zero-GC
Data Path	CPU → Memory → NIC	RDMA direct to DPU	RDMA direct to DPU
Storage Engine	File / Object (S3)	Proprietary NIXL	KV-optimized NVMe-oF
Resilience	3x replication	Unknown / WIP	Hardware-offloaded erasure coding
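The Resilience row comes down to raw capacity. The sketch below compares the flash consumed by N-way replication with that of a k+m erasure-coded layout. The 8+2 layout in the note that follows is an assumed example, not a GPUCache default, and the function names are hypothetical.

```rust
/// Raw bytes consumed when every object is stored `copies` times.
pub fn replication_raw_bytes(logical: u64, copies: u64) -> u64 {
    logical * copies
}

/// Raw bytes consumed when an object is split into `data_shards` shards
/// and protected by `parity_shards` extra shards of the same size.
pub fn erasure_raw_bytes(logical: u64, data_shards: u64, parity_shards: u64) -> u64 {
    assert!(data_shards > 0, "need at least one data shard");
    // Round the shard size up so that all data fits.
    let shard = (logical + data_shards - 1) / data_shards;
    shard * (data_shards + parity_shards)
}
```

For 1,000,000 logical bytes, 3x replication consumes 3,000,000 raw bytes, while an 8+2 layout consumes 1,250,000 and still tolerates the loss of any two shards when the code is maximum distance separable, as Reed-Solomon is.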

Roadmap

What is being built now.

Core Rust KV engine.
RDMA / RoCEv2 communication layer.
SPDK integration for direct NVMe addressing.
NVIDIA DOCA integration, with ARM64 cross-compilation support.
DPU hardware erasure-coding offload.

Call for contributors

Help build the DPU storage layer for LLM context.

GPUCache needs contributors who know Rust systems programming, RDMA and RoCEv2 networking, NVIDIA DOCA, BlueField DPUs, ARM64 cross-compilation, vLLM, TensorRT-LLM, and PagedAttention.
