KV cache offload for LLM inference
GPUCache moves long-context KV cache out of HBM and into a shared NVMe flash tier built on BlueField DPUs, RDMA, and SPDK.
The problem
Long-context LLM inference puts pressure on GPU HBM. Once KV cache no longer fits in VRAM, serving stacks either evict it or recompute it later, which burns GPU cycles on work the model has already done.
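To make the pressure concrete, here is a rough sizing sketch. It uses the standard estimate of KV cache size for a transformer: one key and one value vector per layer and per KV head for every token. The `KvShape` struct and the example shape used in the note below (80 layers, 8 KV heads, head dimension 128, 16-bit elements) are illustrative assumptions, not figures from GPUCache.

```rust
/// Model shape parameters that determine KV cache size.
/// Hypothetical helper type for illustration; not part of GPUCache.
#[derive(Debug, Clone, Copy)]
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    pub bytes_per_elem: u64,
}

impl KvShape {
    /// Bytes of KV cache per token: a key and a value vector
    /// (hence the factor 2) for every layer and every KV head.
    pub fn bytes_per_token(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// Bytes of KV cache for a single sequence of `tokens` tokens.
    pub fn bytes_for_context(&self, tokens: u64) -> u64 {
        self.bytes_per_token() * tokens
    }

    /// Largest number of tokens whose KV cache fits in `budget_bytes`.
    pub fn max_tokens_in(&self, budget_bytes: u64) -> u64 {
        budget_bytes / self.bytes_per_token()
    }
}
```

With the assumed shape, each token costs 320 KiB of KV cache, so a single 131,072-token context needs 40 GiB before weights and activations are counted.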
GPUCache puts that cache in an EBOF (Ethernet Bunch of Flash) NVMe tier controlled by BlueField DPUs. The goal is a storage path built on RDMA and SPDK, with no x86 storage host in the hot path.
The core move
GPUs read and write context over RDMA to NVMe flash behind a BlueField DPU. The fast path avoids host CPU scheduling, host DRAM copies, and the normal TCP/IP storage stack.
Architecture
Rust keeps tail latency predictable and memory use tight on the ARM cores inside the DPU.
GPUCache runs on BlueField-3/4 DPUs and uses SPDK to manage NVMe SSDs behind a PCIe switch.
Context moves from GPU VRAM over Spectrum-X to the DPU, then into flash, without host CPU intervention.
DOCA can offload erasure-coding parity computation to DPU acceleration engines, rather than spending CPU cycles on parity or storage capacity on full replication.
The cache engine can track access patterns from vLLM, TensorRT-LLM, and PagedAttention blocks.
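As a rough illustration of what tracking access patterns could look like, the sketch below records a logical timestamp for each KV block and reports the least recently used blocks as offload candidates. The `AccessTracker` type, the `BlockId` alias, and the LRU policy are all assumptions made for this example; the source does not describe the cache engine's actual policy or data structures.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for a fixed-size KV cache block.
pub type BlockId = u64;

/// Tracks the last access time of each block using a logical clock.
/// Illustrative only; not the GPUCache cache engine.
#[derive(Default)]
pub struct AccessTracker {
    clock: u64,
    last_access: HashMap<BlockId, u64>,
}

impl AccessTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record a read or write of `block`.
    pub fn touch(&mut self, block: BlockId) {
        self.clock += 1;
        self.last_access.insert(block, self.clock);
    }

    /// Return up to `n` least recently used blocks, coldest first.
    pub fn coldest(&self, n: usize) -> Vec<BlockId> {
        let mut entries: Vec<(BlockId, u64)> =
            self.last_access.iter().map(|(&b, &t)| (b, t)).collect();
        // Timestamps are unique, so this ordering is deterministic.
        entries.sort_by_key(|&(_, t)| t);
        entries.into_iter().take(n).map(|(b, _)| b).collect()
    }

    /// Stop tracking a block, for example after it has been moved to flash.
    /// Returns true if the block was being tracked.
    pub fn forget(&mut self, block: BlockId) -> bool {
        self.last_access.remove(&block).is_some()
    }
}
```

Sorting on every call keeps the sketch short. A real engine would more likely maintain an ordered structure so that picking a victim is cheap.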
Data path
GPUCache vs. the status quo
| Feature | Traditional Storage | MinIO MemKV (Go) | RustFS GPUCache (Rust) |
|---|---|---|---|
| Language | C / C++ / Java | Go, subject to GC pauses | Rust, deterministic and zero-GC |
| Data Path | CPU → Memory → NIC | RDMA direct to DPU | RDMA direct to DPU |
| Storage Engine | File / Object (S3) | Proprietary NixL | KV-optimized NVMe-oF |
| Resilience | 3x replication | Unknown / WIP | Hardware-offloaded erasure coding |
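To show the idea behind the erasure-coding entry in the Resilience row, here is a minimal sketch of the simplest possible scheme: a single XOR parity shard, as in RAID 5. It runs on the CPU and does not use DOCA or any DPU engine. The source does not say which code GPUCache uses, and the function names `xor_parity` and `rebuild_missing` are hypothetical.

```rust
/// Compute one XOR parity shard over equally sized data shards.
/// Returns None if there are no shards or their lengths differ.
pub fn xor_parity(shards: &[Vec<u8>]) -> Option<Vec<u8>> {
    let len = shards.first()?.len();
    if shards.iter().any(|s| s.len() != len) {
        return None;
    }
    let mut parity = vec![0u8; len];
    for shard in shards {
        for (p, b) in parity.iter_mut().zip(shard) {
            *p ^= *b;
        }
    }
    Some(parity)
}

/// Rebuild a single missing data shard from the surviving data shards
/// and the parity shard. Returns None if any length does not match.
pub fn rebuild_missing(survivors: &[Vec<u8>], parity: &[u8]) -> Option<Vec<u8>> {
    if survivors.iter().any(|s| s.len() != parity.len()) {
        return None;
    }
    // XOR-ing the parity with every survivor cancels them out,
    // leaving exactly the bytes of the missing shard.
    let mut missing = parity.to_vec();
    for shard in survivors {
        for (m, b) in missing.iter_mut().zip(shard) {
            *m ^= *b;
        }
    }
    Some(missing)
}
```

With three data shards and one parity shard, this tolerates the loss of any one shard at 1.33x storage overhead, compared with 3x for triple replication. Codes that survive multiple failures, such as Reed-Solomon, need more arithmetic per byte, which is the kind of work a hardware offload targets.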
Roadmap
Call for contributors
GPUCache needs contributors who know Rust systems programming, RDMA and RoCEv2 networking, NVIDIA DOCA, BlueField DPUs, ARM64 cross-compilation, vLLM, TensorRT-LLM, and PagedAttention.