Sesame builds voice interfaces through tight integration of hardware, software, and machine learning, pursuing research in speech generation, personality modeling, and multimodal ML. The company operates large GPU clusters to support ambitious research programs aimed at making computers lifelike through natural voice interaction, with development cycles measured in days rather than quarters. Backed by a16z, Sequoia, Spark, and Matrix, the technical effort spans PyTorch-based model development alongside Android and iOS deployment, with infrastructure supporting rapid iteration from whiteboard concepts to production systems.
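To ground the PyTorch side of that stack, here is a minimal sketch of the kind of training step such rapid iteration cycles revolve around; the toy acoustic model, its dimensions, and the random stand-in data are all hypothetical illustrations, not drawn from Sesame's codebase.

```python
import torch
import torch.nn as nn

# Hypothetical toy acoustic model: maps phoneme IDs to mel-spectrogram
# frames. All names and dimensions are illustrative assumptions.
class TinyAcousticModel(nn.Module):
    def __init__(self, vocab_size=64, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):
        x = self.embed(phonemes)   # (batch, time, hidden)
        x, _ = self.encoder(x)     # contextualize over time
        return self.head(x)        # (batch, time, n_mels)

model = TinyAcousticModel()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One training step on random stand-in data.
phonemes = torch.randint(0, 64, (8, 120))   # batch of phoneme ID sequences
target_mels = torch.randn(8, 120, 80)       # matching mel targets
opt.zero_grad()
loss = nn.functional.l1_loss(model(phonemes), target_mels)
loss.backward()
opt.step()
```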
The engineering organization is an interdisciplinary team of long-tenured experts whose backgrounds span machine learning, hardware, software, and entertainment, operating from offices in San Francisco, Bellevue, and New York. Core technical domains include speech generation systems, personality modeling for voice companions, and multimodal ML architectures that coordinate audio and other sensory inputs. The product strategy emphasizes deliberate design choices that make voice interfaces nuanced and intimate rather than intrusive, with hardware engineering targeting lightweight eyewear form factors for all-day wear.
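One common way to realize architectures that coordinate audio with other sensory inputs is cross-modal attention, where the audio stream attends over features from a second modality. The sketch below illustrates this under assumed shapes; the AudioVisualFusion module, its dimensions, and the vision input are assumptions for illustration, not Sesame's actual design.

```python
import torch
import torch.nn as nn

# Illustrative multimodal fusion block: audio features attend over features
# from a second modality (e.g., vision from eyewear cameras). Shapes and
# names are assumptions for this sketch.
class AudioVisualFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, other):
        # audio: (batch, t_audio, dim); other: (batch, t_other, dim)
        fused, _ = self.cross_attn(query=audio, key=other, value=other)
        return self.norm(audio + fused)  # residual keeps audio stream primary

fusion = AudioVisualFusion()
audio = torch.randn(2, 100, 256)   # 100 audio frames
vision = torch.randn(2, 30, 256)   # 30 visual tokens
out = fusion(audio, vision)        # (2, 100, 256)
```

The residual connection keeps the audio pathway primary, so the second modality refines rather than replaces the voice signal, which is one plausible reading of "coordination" here.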
Infrastructure and operational requirements center on GPU cluster management for speech model training and inference, alongside mobile platform engineering for real-time voice processing. The core technical challenge is crossing the uncanny valley in voice interaction: achieving low latency, naturalness, and contextual appropriateness simultaneously across diverse usage scenarios. Team composition reflects this challenge: specialists in human-computer interaction work alongside ML researchers and hardware engineers to optimize the full stack, from acoustic modeling through industrial design.
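To make the latency requirement concrete, a streaming model is typically profiled chunk by chunk against a fixed budget. The sketch below does this for a stateful GRU; the chunk size, the 50 ms budget, and the model itself are placeholder assumptions, not Sesame's actual targets.

```python
import time
import torch

# Hedged sketch: per-chunk latency measurement for a streaming speech model,
# the kind of budget real-time voice processing must hit. Numbers are
# illustrative assumptions.
model = torch.nn.GRU(input_size=80, hidden_size=256, batch_first=True)
model.eval()

chunk = torch.randn(1, 2, 80)   # 2 feature frames per chunk, ~20 ms of audio
hidden = None
latencies = []

with torch.no_grad():
    for _ in range(50):
        start = time.perf_counter()
        _, hidden = model(chunk, hidden)   # carry state across chunks
        latencies.append((time.perf_counter() - start) * 1000)

p95 = sorted(latencies)[int(0.95 * len(latencies))]
print(f"p95 chunk latency: {p95:.2f} ms (assumed budget: 50 ms)")
```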