We're looking for outstanding AI systems software engineers to develop groundbreaking technologies across the inference systems software stack. Our team builds core AI systems software that accelerates high-impact workloads on NVIDIA GPUs, from deep learning primitives and kernel libraries to LLM inference runtimes, serving abstractions, and code generation technologies. As a member of the team, you will help design, build, optimize, and ship production-quality software that powers NVIDIA's AI software stack.
This role spans both foundational library engineering and next-generation inference systems work, with opportunities to contribute across the stack from low-level kernels and performance primitives to serving runtimes and developer-facing abstractions. You may work on GPU-accelerated deep learning primitives, efficient attention kernel implementations, LLM serving components, just-in-time compilation systems, software abstractions, and performance-critical runtime infrastructure for large language models, agents, and other advanced AI workloads. You will collaborate with world-class engineers across deep learning software, compilers, GPU architecture, and open-source inference ecosystems, and your work will directly impact NVIDIA's AI platform and the performance of real-world workloads at scale.
What you'll be doing:
Develop production-quality software that ships as part of NVIDIA's AI software stack, including cuDNN, FlashInfer, and optimized support for large language model inference workloads.
Innovate and develop new AI systems technologies for efficient inference, with a focus on performance, scalability, maintainability, and usability.
Design, implement, and optimize kernels for high-impact AI workloads across LLM inference, generative AI, computer vision, autonomous driving, and recommender systems.
Design and implement extensible software abstractions for deep learning libraries, LLM serving engines, and runtime systems.
Build and improve just-in-time compilation, code generation, and runtime technologies for performance-critical GPU workloads.
Analyze workload performance, tune existing software, and propose improvements to future software and hardware-software interfaces.
Collaborate closely with engineers across deep learning frameworks, libraries, kernels, compilers, and GPU architecture teams at NVIDIA.
Contribute to open-source communities and ecosystem integrations where relevant, including projects such as FlashInfer, vLLM, and SGLang.
What we need to see:
Master's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
3+ years of relevant industry, research, or systems software development experience in machine learning, deep learning systems, compilers, or GPU software. More experience is expected for senior-level candidates.
Strong programming skills in C/C++ and Python, with hands-on experience developing high-performance software.
Solid experience with CUDA development and GPU programming fundamentals.
Strong experience developing or using deep learning frameworks and model formats such as PyTorch, JAX, TensorFlow, or ONNX.
Good understanding of linear algebra, performance analysis, profiling, and code optimization.
Experience designing software abstractions, APIs, or higher-level system architecture for performance-sensitive systems.
Familiarity with modern machine learning and inference system trends, especially around LLMs and generative AI.
Senior candidates are expected to have strong experience in GPU kernel development and performance optimization, especially with CUDA C/C++, cuTile, Triton, or similar technologies.
Ways to stand out from the crowd:
Hands-on experience with inference engines and runtimes such as vLLM, SGLang, MLC, TensorRT-LLM, or similar systems.
Background in domain-specific compilers, code generation, or library solutions for LLM inference and training.
Expertise in machine learning compilers or IR systems such as MLIR, Apache TVM, TensorIR, or related technologies.
Practical experience with GPU performance modeling, computer architecture, or accelerator-oriented software design.
Open-source project ownership or meaningful contributions in deep learning systems, compilers, kernels, or inference infrastructure.
