AI Infrastructure Engineer, Core Infrastructure
Design and build the foundational systems that manage compute allocation, scheduling, and autoscaling for ML infrastructure.
As an AI Infrastructure Engineer on the ML Infrastructure team, you will design and build the next generation of foundational systems that power all ML compute at Scale, from model training and evaluation to large-scale inference and experimentation.
Our platform orchestrates workloads across heterogeneous compute environments (GPU and CPU, on-prem and cloud), optimizing for reliability, cost efficiency, and developer velocity.
The ideal candidate has a strong background in distributed systems, scheduling, and platform architecture, and is excited by the challenge of building internal infrastructure used across all ML teams.
You will:
- Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds.
- Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads.
- Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets.
- Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing.
- Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards.
- Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment.
Ideally, you'd have:
- 4+ years of experience building large-scale backend or distributed systems.
- Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture.
- Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform).
- Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues).
- Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs).
- A track record of improving system efficiency, reliability, or developer velocity in production environments.
Nice-to-haves:
- Experience with multi-tenant compute platforms or internal PaaS.
- Knowledge of GPU scheduling, cost modeling, or hybrid cloud orchestration.
- Familiarity with LLM or ML training workloads, though deep ML expertise is not required.