Mirelo AI is building the next generation of creative tools by generating realistic sound, speech and music from video.
We develop cutting-edge foundational generative AI models that "unmute" silent video content and create custom, hyper-realistic audio for gaming, video platforms, and creators. Our technology empowers global storytellers to transform their content.
We recently closed a $41 million Seed round co-led by Andreessen Horowitz and Index Ventures, with participation from Atlantic, and are rapidly expanding across Product, Engineering, Go-to-Market, and Growth.
About the Role
At Mirelo, you’ll work at the centre of how we build the next generation of multimodal video-to-audio models. This role is deeply hands-on and research-heavy: with a generous H100/H200-per-engineer ratio, you’ll explore and build new multimodal models and push the boundaries of what’s possible in music, sound, and speech generation. You’ll collaborate closely across research and engineering, run focused ablations, and translate experimental results into clear next steps for the team. From data curation to deployment, you’ll help shape the full lifecycle of the models that power our products and partnerships.
Key Responsibilities
Design, implement and train large-scale multimodal generative models for audio generation (diffusion and/or autoregressive models).
Explore new modeling ideas for audio generation (music, sound, speech) while taking inspiration from the language and image domains.
Develop and experiment with post-training methods for new capabilities (fine-grained control, in/out-painting, editing, …).
Conduct rigorous ablation studies, extract actionable insights, and communicate results to the team to inform new research directions.
Contribute hands-on to all stages of model development including data curation, experimentation, evaluation, and deployment.
Ideal Candidate Profile
Hands-on experience in training large-scale generative models in a fast-paced research environment.
Deep understanding of cutting-edge methods and ML research in at least one of these domains: image, language, video, or audio (audio experience is not required, but nice to have).
Strong proficiency in PyTorch, transformer architectures, and the full ecosystem of modern deep learning.
Solid understanding of distributed training techniques (FSDP, low-precision training, model parallelism).
Strong track record of work on generative models (publications in top-tier venues, open-source contributions, or applied ML projects).
Nice to Have
Proficiency with profiling, debugging, and optimizing single and multi-GPU operations using tools like Nsight or stack trace viewers.
Strong software engineering skills and experience collaborating on large codebases that go beyond PhD research code.
Experience with generative models for audio (sound, music or speech) and audio codec design.
Why Join?
Join at a pivotal moment. We've secured fresh funding and are gaining traction; now is when your contributions can make a real difference to our success.
True ownership from day one. You'll have genuine autonomy and responsibility. Your ideas and work will directly shape our product and company direction.
Competitive compensation and equity. We offer strong packages that ensure you share in the success you help create.
Build for the next generation of creators. Be part of the innovation that will transform how creators work and thrive.
We welcome applications from all individuals, regardless of ethnic origin, gender, disability, religion or belief, age, or sexual orientation and identity.