Staff Reliability Engineer

Join a team to ensure the health, performance, and resilience of the platform, applying SRE principles to drive observability, automation, and incident prevention.

We're Celonis, the global leader in Process Mining technology and one of the world's fastest-growing SaaS firms.

The Team

As a member of our Reliability Engineering team, you will play a critical role in ensuring the health, performance, and resilience of our platform.

The Role

Join a highly technical, collaborative, and innovation-driven team that blends Site Reliability Engineering with modern Software Engineering practices to build resilient and scalable systems.
Lead reliability efforts for a fleet of 80+ FedRAMP-compliant microservices running on Kubernetes, applying SRE principles to drive observability, automation, and incident prevention.
Develop and enforce SLOs, SLAs, and error budgets to drive reliability-focused development.
Provide mentorship and technical leadership across the SRE and engineering teams.
Own high-priority application incident escalations, performing deep technical analysis and restoration within defined SLOs, while continuously improving detection and response mechanisms.
Engineer solutions to enhance the availability, latency, and performance of production services—automating manual processes to eliminate toil and scale operational efficiency.
Collaborate closely with platform and application engineering teams to conduct post-incident reviews, extract insights, and implement systemic changes that improve overall reliability.

The qualifications you need:

Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent hands-on experience).
Minimum of 8 years of experience in software engineering or SRE roles.
Deep experience with cloud platforms (AWS, GCP, or Azure).
Proficiency in Java, the Spring framework, and Python (or a similar scripting language) in a Linux environment.
Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles.
Demonstrated ability to lead projects and influence engineering culture.
Knowledge of SRE principles, including SLI/SLO design, error budgets, and toil reduction strategies.
Excellent written and verbal communication skills in English.
Please note: This position is not eligible for immigration visa sponsorship, now or in the future.

Nice to Have

Experience with observability and monitoring tools (e.g., Datadog).
Experience in developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures.
Experience with CI/CD pipelines and tools such as ArgoCD, GitHub Actions, or similar.
Experience with Infrastructure as Code (IaC) tools such as Terraform and Kustomize.
Exposure to incident management practices, on-call rotations, and postmortem culture.

The total compensation package includes base salary, bonus/commission, equity, and benefits (health, dental, life, 401k, and paid time off). Please note that the base salary range is a guideline; the actual total compensation offer will be determined by various factors, including, but not limited to, the applicant's qualifications, skills, experience, and location.

The base salary range below is for the role in California, based on a Full Time Schedule.

$195,000–$235,000 USD