Engineering Manager, ML Platform
Engineering Manager needed to lead ML platform team, build infrastructure, and scale production inference. Requires 5+ years of experience.
About the role
We're looking for an Engineering Manager (Backend) to lead the team responsible for Runway's machine learning platform. You should have experience leading high-performing engineering teams and be deeply interested in the intersection of machine learning and distributed systems. You will manage our current team of 5 as it grows.
You’ll have a chance to work closely with our Research and Machine Learning teams to build out our data processing, training, and eval systems. This role is on a small team with a big impact.
What you’ll do
- Build the platform infrastructure for ML at scale. Lead the platform engineering team that powers Runway's machine learning pipeline—from data processing through model training to production inference. Work closely with research teams to build robust, scalable systems that let them move fast.
- Keep training jobs running smoothly. Build monitoring, alerting, and automation around critical multi-day training runs on hundreds of GPUs. Your systems catch problems before they derail expensive compute jobs.
- Enable model evaluation and exploration. Maintain the platform that lets researchers inspect training data, visualize outputs, and evaluate model checkpoints. Build tools that bridge raw infrastructure and research workflows.
- Scale production inference. Own the inference pipeline serving Runway's products. Implement monitoring and alerting for performance and reliability. Lead GPU capacity planning to balance cost and user experience as demand grows.
What you’ll need
- Platform engineering foundation. 5+ years building distributed systems, data pipelines, and infrastructure at scale. Experience managing engineering teams of 3-8 people.
- Production infrastructure expertise. Experience with cloud platforms (AWS/GCP), container orchestration (Kubernetes/ECS), and operating services at scale. You've built reliable systems that handle large data volumes and complex workloads.
- Proven experience with monitoring and reliability. Experience building comprehensive monitoring and alerting. You know what metrics matter and how to surface the right information to different teams.
- Collaborative mindset. Comfortable working directly with researchers and data scientists. You can translate their needs into engineering solutions and build tools people actually want to use.
- Humility and open-mindedness. At Runway we love to learn from one another.
Salary range: $280,000-$340,000