turbopuffer is a serverless vector and full-text search database built on object storage. It separates compute from storage to address the trade-off between latency, cost, and throughput in production retrieval systems. The architecture uses tiered storage, an NVMe/SSD cache layered over object storage, to absorb variable query patterns and burst load while avoiding the fixed cost of keeping every index in memory, as traditional in-memory vector databases do.
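The tiered layout described above can be pictured as a read-through cache. The sketch below is a toy model only, not turbopuffer's implementation: the `TieredStore` class, the LRU eviction policy, and the use of a plain dict as a stand-in for object storage are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy read-through cache: a small fast tier in front of a slow backing store."""

    def __init__(self, backing, capacity):
        self.backing = backing      # stand-in for object storage (a plain dict here)
        self.capacity = capacity    # maximum number of entries kept in the fast tier
        self.cache = OrderedDict()  # stand-in for the NVMe/SSD tier, kept in LRU order
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            # Fast path: serve from the cache and mark the key as recently used.
            self.cache.move_to_end(key)
            self.hits += 1
            return self.cache[key]
        # Slow path: read from the backing store, then populate the cache.
        self.misses += 1
        value = self.backing[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry (assumed policy).
            self.cache.popitem(last=False)
        return value


store = TieredStore({"a": 1, "b": 2, "c": 3}, capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    store.get(key)
print(f"hits={store.hits} misses={store.misses}")
```

Even in this toy version, the hit rate depends on how the access pattern interacts with the cache size, which is the tuning concern that tiered storage raises.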
The system handles 3.5T+ documents, 10M+ writes/s, and 25k+ queries/s, with support for hybrid search (vector + full-text) and metadata filtering. Serverless scaling means you pay for what you use, and because compute and storage scale independently, you do not need to over-provision one to get more of the other. This matters for workloads with bursty traffic or datasets that grow unpredictably, both common in AI retrieval pipelines feeding assistants and agents.
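One common way to combine a vector ranking with a full-text ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not a description of how turbopuffer scores hybrid queries. The function name, the document IDs, and the constant `k=60` (a conventional default for RRF) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Higher total score ranks first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two independent searches over the same corpus.
vector_hits = ["d1", "d2", "d3"]  # nearest neighbours by embedding distance
text_hits = ["d3", "d1", "d4"]    # best matches by full-text relevance

fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents that only one retriever found. RRF uses ranks rather than raw scores, which avoids having to normalise vector distances against full-text relevance scores.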
The design makes explicit trade-offs around tail latency and operational complexity. Tiered storage makes access cost variable: a query that misses the cache has to read from object storage and pays a latency penalty, so cache behavior needs to be tuned to your query profile. Integrating full-text search alongside vectors reduces the need to run multiple systems, but hybrid scoring and ranking add computational overhead that increases per-query latency. Metadata filtering restricts a search to the matching subset of documents instead of scanning the full corpus, which is critical for keeping per-query work and cost down in gated retrieval scenarios.
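The effect of cache misses on latency can be estimated with a simple two-point model: every query either hits the cache or falls through to object storage. The sketch below assumes that model. The latency figures are hypothetical placeholders, not measurements of turbopuffer.

```python
def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Mean latency when a fraction `hit_rate` of queries is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def percentile_latency_ms(percentile, hit_rate, hit_ms, miss_ms):
    """Latency at a percentile (0..1) under the same two-point model.

    Any percentile above the hit rate lands on the cache-miss path.
    """
    return hit_ms if percentile <= hit_rate else miss_ms


# Hypothetical figures chosen only to illustrate the shape of the trade-off.
HIT_MS = 10.0    # assumed latency for a query served from the cache
MISS_MS = 500.0  # assumed latency for a query that reads object storage

for hit_rate in (0.50, 0.90, 0.999):
    mean = expected_latency_ms(hit_rate, HIT_MS, MISS_MS)
    p99 = percentile_latency_ms(0.99, hit_rate, HIT_MS, MISS_MS)
    print(f"hit_rate={hit_rate:.3f} mean={mean:.1f} ms p99={p99:.1f} ms")
```

Under this model the mean improves smoothly as the hit rate rises, but the 99th percentile stays on the miss path until more than 99% of queries hit the cache. This is why tail latency, rather than average latency, is the figure to watch.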