AI Training Infrastructure Engineer
Design, deploy, and maintain training clusters, implement distributed training algorithms, and develop tools for AI researchers. Requires experience with Python, PyTorch, and managing HPC clusters.
Responsibilities
- Design, deploy, and maintain Figure's training clusters
- Architect and maintain scalable deep learning frameworks for training on massive robot datasets
- Collaborate with AI researchers to train new model architectures at large scale
- Implement distributed training and parallelization strategies to reduce model development cycles
- Implement tooling for data processing, model experimentation, and continuous integration
Requirements
- Strong software engineering fundamentals
- Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
- Experience with Python and PyTorch
- Experience managing HPC clusters for deep neural network training
- Minimum of 4 years of professional, full-time experience building reliable backend systems
Bonus Qualifications
- Experience managing cloud infrastructure (AWS, Azure, GCP)
- Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
- Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)
The US base salary range for this full-time position is $150,000 to $400,000 annually.
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.