Senior Site Reliability Engineer
As a Senior Site Reliability Engineer, you will work closely with both product and platform engineering teams to ensure Duolingo’s sophisticated distributed systems and products are built and maintained with extraordinary quality, and operated in measurable and scalable ways.
You Will...
- Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
- Own core infrastructure (i.e., understand, diagnose, and debug these systems in production)
- Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
- Maintain and document sustainable postmortem/incident response practices
- Advocate for and implement changes that improve reliability, scalability, and velocity
- Reduce the burden of toil with iterative development of tooling and automation
- Collaborate with engineering teams to release new features and become an authority on our services
You Have...
- 3+ years of experience in site reliability engineering or DevOps for a product with millions of users
- Experience identifying and solving issues in large-scale distributed systems
- Experience with Java, Kotlin, Python or Go
- Proficiency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
- An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
Exceptional Candidates Will Have...
- Experience improving automation and tooling to reduce service maintenance toil
- Proven experience driving improvements to incident response processes