Currently monitoring jobs and companies

AI Training Infrastructure Engineer

Design, deploy, and maintain training clusters, implement distributed training algorithms, and develop tools for AI researchers. Requires experience with Python, PyTorch, and managing HPC clusters.

Apply now

Responsibilities

Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work together with AI researchers to implement training of new model architectures at a large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements

Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications

Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $400,000 annually.

The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components/benefits depending on the specific role. This information will be shared if an employment offer is extended.

Details

Location

On-siteSunnyvale, CA, USA

Salary

$150,000/year

Experience

Min. 4 years

Technologies

AWS

Kubernetes

Ansible

PyTorch

Python

Azure

Terraform

Responsibilities

Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work together with AI researchers to implement training of new model architectures at a large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements

Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications

Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $400,000 annually.

Figure

AI Training Infrastructure Engineer

Design, deploy, and maintain training clusters, implement distributed training algorithms, and develop tools for AI researchers. Requires experience with Python, PyTorch, and managing HPC clusters.

Apply now

Responsibilities

Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work together with AI researchers to implement training of new model architectures at a large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements

Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications

Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $400,000 annually.

Details

Location

On-siteSunnyvale, CA, USA

Salary

$150,000/year

Experience

Min. 4 years

Technologies

AWS

Kubernetes

Ansible

PyTorch

Python

Azure

Terraform

Responsibilities

Design, deploy, and maintain Figure's training clusters
Architect and maintain scalable deep learning frameworks for training on massive robot datasets
Work together with AI researchers to implement training of new model architectures at a large scale
Implement distributed training and parallelization strategies to reduce model development cycles
Implement tooling for data processing, model experimentation, and continuous integration

Requirements

Strong software engineering fundamentals
Bachelor's or Master's degree in Computer Science, Robotics, Engineering, or a related field
Experience with Python and PyTorch
Experience managing HPC clusters for deep neural network training
Minimum of 4 years of professional, full-time experience building reliable backend systems

Bonus Qualifications

Experience managing cloud infrastructure (AWS, Azure, GCP)
Experience with job scheduling / orchestration tools (SLURM, Kubernetes, LSF, etc.)
Experience with configuration management tools (Ansible, Terraform, Puppet, Chef, etc.)

The US base salary range for this full-time position is between $150,000 - $400,000 annually.