Research Engineer - Data
Architect and manage petascale data pipelines, combining text, images, 3D models, and other data modalities to drive world-class AI models.
As a Research Engineer – Data at Leonardo, you will architect and manage petascale data pipelines, combining text, images, 3D models, and other data modalities to drive world-class AI models. You’ll work hand-in-hand with our Researchers to create and curate large, multi-modal datasets, including synthetic data, that supercharge SOTA generative AI solutions. Your expertise in distributed systems, data processing, and experimentation will shape the backbone of our research work.
Responsibilities:
**Data Acquisition & Curation:** Lead the ingestion, unification, and organization of large, unstructured data sources (e.g., text, images, 3D geometry, code snippets) into scalable, high-quality datasets suitable for machine learning research and production.
**High-Performance Data Pipelines:** Develop and optimize distributed systems for data processing, including filtering, indexing, and retrieval, leveraging frameworks such as Ray, Metaflow, Spark, or Hadoop (see the pipeline sketch after this list).
**Synthetic Data Generation:** Build and orchestrate pipelines to generate synthetic data at scale, advancing research on cost-efficient inference and training strategies.
**Experiments & Analysis:** Design and conduct experiments on dataset quality, scalability, and performance.
**Security & Compliance:** Collaborate with legal and safety teams to ensure all data usage respects privacy, security, and ethical standards.
**Open-Source Contributions:** Contribute to internal and external libraries and frameworks, sharing insights and breakthroughs with the wider AI community through publications or technical blogs.
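For a flavor of the pipeline work, here is a minimal sketch of a distributed filter-and-normalize step using Ray Data. The bucket paths, column name, and quality threshold are hypothetical, not our actual pipeline:

```python
import numpy as np
import ray

def keep_row(row: dict) -> bool:
    # Drop records whose caption is missing or too short to be useful.
    caption = row.get("caption") or ""
    return len(caption.split()) >= 5  # hypothetical quality threshold

def normalize_batch(batch: dict) -> dict:
    # Batches arrive as dicts of NumPy arrays; lowercase captions in bulk.
    batch["caption"] = np.array([c.lower() for c in batch["caption"]])
    return batch

ray.init()  # connects to a local or existing Ray cluster

ds = ray.data.read_parquet("s3://example-bucket/raw/")  # hypothetical source
ds = ds.filter(keep_row)                                # row-level filtering
ds = ds.map_batches(normalize_batch, batch_size=1024)   # vectorized transform
ds.write_parquet("s3://example-bucket/curated/")        # hypothetical sink
```

In a real curation pipeline the same pattern scales out across a cluster, with filtering, deduplication, and feature extraction composed as successive stages.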
Skills we'd like you to have:
**Multi-Modal Data Expertise:** Hands-on experience with images, videos, 3D geometry (mesh/solid modeling), and/or text data. Well-rounded expertise in Python and PyTorch.
**Synthetic Data & Inference:** Passion for generating synthetic data via inference with pretrained models, 3D rendering engines, and/or other software (see the generation sketch after this list).
**Distributed Computing & MLOps:** Demonstrated proficiency in setting up large-scale, robust data pipelines using frameworks such as Spark, Ray, or Metaflow. Comfortable with model versioning and experiment tracking.
**Performance Optimization:** Good understanding of parallel and distributed computing. Experienced in setting up evaluation methods.
**Cloud & Storage Systems:** Experience with AWS, Azure, or other cloud platforms. Proficient in both relational (MySQL, PostgreSQL) and NoSQL (MongoDB, Cassandra) databases, plus vector data stores.
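As a flavor of the synthetic-data work, here is a minimal sketch of batched image generation from a pretrained text-to-image model with Hugging Face diffusers; the model id, prompts, and output paths are illustrative assumptions, not our production setup:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image model; the model id is illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Hypothetical prompts; at scale these would be templated or sampled.
prompts = [
    "a low-poly 3D model of a wooden chair, studio lighting",
    "an isometric render of a small spaceship, white background",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_{i:04d}.png")  # would feed a curated training set
```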