
Infrastructure Engineer, ML Systems

Applied Compute
Posted on: Feb 23, 2026
Location: San Francisco, California, United States (On-site)
Employment type: Full-time

Who We Are

We build Specific Intelligence for the enterprise: agents that continuously learn from a company's processes, data, expertise, and goals. Today, there's a massive gap between what AI models can do in isolation and what they reliably do inside real businesses; these systems fail because they don't adapt to feedback. We're building the continual learning layer: a platform that captures context, memory, and decision traces across the enterprise, providing an environment where specialized agents learn how to do real work.

Why we're excited: We get to work at a rare intersection. Our product team builds the platform powering a new generation of digital coworkers. Our research team pushes the frontier of post-training and reinforcement learning to create new product experiences. Our applied research engineers sit side-by-side with customers as they ship models into production. This combination of strong product, deep research, and boots on the ground is what we believe it takes to bring AI to the enterprise. We are product-led, research-enabled, and forward-deployed.

Our Team: We are a team of engineers, researchers, and operators. Many of us are former founders. We've built RL infrastructure at OpenAI, data foundations at Scale AI, and systems at Together, Two Sigma, and Watershed. We work with F50 customers in addition to DoorDash, Mercor, and Cognition. We’re fortunate to be backed by Benchmark, Sequoia, Lux, and others.

Who Thrives Here: We're looking for people who are excited about applying novel research and complex systems to real-world problems. You should be comfortable navigating unfamiliar environments quickly, whether that's a new codebase, a new customer's data architecture, or a problem domain you've never seen before. You should also genuinely enjoy working with customers: listening, empathizing, and understanding how work actually gets done in their organizations. Former founders, people who've built a lot of side projects, and anyone who has shown they can own something end-to-end tend to do well here.

The Role

You'll design, implement, and optimize the large-scale training infrastructure that powers our frontier reinforcement learning stack. This is systems work at the edge of what's possible: training state-of-the-art models for our enterprise partners. Frontier systems are exciting but brittle; they demand both performance and correctness to train models effectively. You'll work closely with researchers to make our RL stack reliable, fast, and capable of running for days without intervention.

What You'll Do

  • Design and optimize our RL training and inference pipelines across large GPU clusters

  • Build tooling and observability that lets researchers and customers inspect, profile, and debug training runs

  • Implement systems with an eye toward how they affect ML (low precision numerics, distributed training edge cases, etc.)

  • Partner with researchers to bring frontier post-training capabilities into production deployments

  • Contribute to Faraday, our secure deployment offering for security-conscious enterprise environments
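The low-precision numerics mentioned above can bite in subtle ways. As a purely illustrative sketch (not part of Applied Compute's actual stack), here is one classic failure mode: a naive fp16 accumulator silently stops growing once the running sum exceeds the format's integer-exact range.

```python
import numpy as np

# Naive fp16 accumulation: fp16 represents integers exactly only up
# to 2048. At 2048 the spacing between representable values is 2.0,
# so adding 1.0 rounds back to 2048 and the sum silently stalls.
acc16 = np.float16(0.0)
for _ in range(4000):
    acc16 = np.float16(acc16 + np.float16(1.0))

# The same loop with an fp32 accumulator stays exact.
acc32 = np.float32(0.0)
for _ in range(4000):
    acc32 += np.float32(1.0)

print(acc16)  # 2048.0, not 4000.0
print(acc32)  # 4000.0
```

This is why mixed-precision training typically keeps reductions (loss sums, gradient norms, optimizer state) in fp32 even when activations and weights are stored in half precision.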

What We're Looking For

  • Experience programming with and managing training jobs on large-scale GPU systems

  • Fearlessness and curiosity to understand all levels of the training stack

  • Bias toward fast implementation, paired with a high bar for reliability and efficiency

  • Familiarity with open-weights models (architecture and inference)

  • Background in reinforcement learning or integration of inference with RL training loops

Strong Candidates Also Have

  • Experience with distributed training frameworks (PyTorch, JAX, DeepSpeed)

  • Background in high-performance computing or working with large-scale clusters

  • Contributions to open-source ML infrastructure

  • Demonstrated technical creativity through published projects, OSS contributions, or side projects

Logistics

This role is based in San Francisco. We work in-person at our office in the Design District. We offer competitive compensation and equity, generous health benefits, unlimited PTO, paid parental leave, daily lunches and dinners, transportation, and relocation support. We sponsor visas; while we can't guarantee approval in every case, if you're the right fit, we're committed to working through the process with you.

We encourage you to apply even if you do not believe you meet every single qualification. As set forth in Applied Compute’s Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.

Applied Compute

Applied Compute builds Specific Intelligence for enterprises, training custom AI models and deploying in-house agent workforces using proprietary company data. Founded by former OpenAI researchers, the company is backed by $80M from Benchmark, Sequoia, and Lux Capital.
