Baseten builds AI infrastructure for deploying and scaling models in production, with work spanning kernel-level inference optimization to developer tooling. The platform ships daily, measuring success by the real-world impact of the AI products running on it rather than vanity metrics. Engineers embed directly with customers to surface operational bottlenecks, then optimize obsessively: work ranges from TensorRT-LLM and CUDA kernel tuning to building developer tools that reduce deployment friction.
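As a simplified illustration of that measure-then-optimize loop, the sketch below times end-to-end requests against an inference endpoint and reports tail percentiles. The endpoint URL and payload are hypothetical placeholders, not Baseten's API:

```python
import time

import numpy as np
import requests

# Hypothetical inference endpoint and payload; substitute a real deployment.
ENDPOINT = "https://example.com/v1/models/my_model/predict"
PAYLOAD = {"prompt": "Hello, world", "max_tokens": 32}


def measure_latencies(n_requests: int = 100) -> np.ndarray:
    """Fire sequential requests and record wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=30)
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return np.array(latencies)


if __name__ == "__main__":
    lat = measure_latencies()
    # In production serving, tail behavior (p95/p99) often matters more
    # than the median, since a small fraction of slow requests dominates
    # perceived reliability.
    for pct in (50, 95, 99):
        print(f"p{pct}: {np.percentile(lat, pct):.1f} ms")
```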
The stack centers on inference at scale: TensorRT-LLM and PyTorch for model execution, NVIDIA Triton Inference Server for serving, Kubernetes (EKS) with Karpenter for autoscaling, and Knative for event-driven workloads on AWS EC2. Infrastructure decisions prioritize shipping velocity over process: small teams with real ownership iterate rapidly on production reliability, latency (including tail behavior), and cost efficiency. Docker containerization and PostgreSQL round out the core operational dependencies.
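To make the serving layer concrete, here is a minimal sketch of a synchronous client call against NVIDIA Triton Inference Server's HTTP API. The model name and tensor names (my_model, input_ids, logits) are hypothetical and would come from the deployed model's own configuration:

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a running Triton server; host and port are deployment-specific.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical input tensor; name, shape, and dtype must match the
# model's config.pbtxt.
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
infer_input.set_data_from_numpy(input_ids)

requested_output = httpclient.InferRequestedOutput("logits")

# One inference round trip through the serving layer.
result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[requested_output],
)
logits = result.as_numpy("logits")
print(logits.shape)
```

In a deployment like the one described above, this round trip would run behind the Kubernetes autoscaling layer, so the client code stays the same as capacity scales.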
The team is internationally distributed and composed of engineers and designers who take craft seriously without performative posturing. Customer-embedded engineering informs both platform architecture and developer experience tradeoffs, creating tight feedback loops between deployment reality and infrastructure evolution. Since the company's founding, the approach has centered on hands-on problem solving and rapid iteration rather than abstraction layers that delay production learning.