Senior Reliability Engineer

Join a team that ensures the health, performance, and resilience of our platform by applying SRE principles to drive observability, automation, and incident prevention.

We're Celonis, the global leader in Process Mining technology and one of the world's fastest-growing SaaS firms.

The Team

As a member of our Reliability Engineering team, you will play a critical role in ensuring the health, performance, and resilience of our platform.

The Role

Join a highly technical, collaborative, and innovation-driven team that blends Site Reliability Engineering with modern Software Engineering practices to build resilient and scalable systems.
Lead reliability efforts for a fleet of 80+ FedRAMP-compliant microservices running on Kubernetes, applying SRE principles to drive observability, automation, and incident prevention.
Own high-priority application incident escalations, performing deep technical analysis and restoration within defined SLOs, while continuously improving detection and response mechanisms.
Engineer solutions to enhance the availability, latency, and performance of production services, automating manual processes to eliminate toil and scale operational efficiency.
Collaborate closely with platform and application engineering teams to conduct post-incident reviews, extract insights, and implement systemic changes that improve overall reliability.
Document operational knowledge and runbooks, embedding SRE best practices into onboarding, incident response, and platform architecture standards.

The qualifications you need:

Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent hands-on experience).
Minimum of 5 years of experience building and maintaining cloud-based software applications on at least one public cloud platform (AWS, Azure, or GCP).
Proficiency in Java, the Spring framework, and Python (or a similar scripting language) in a Linux environment.
Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles.
Knowledge of SRE principles, including SLI/SLO design, error budgets, and toil reduction strategies.
Proven expertise in developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures.
Strong problem-solving and troubleshooting abilities in complex, distributed systems.
Excellent written and verbal communication skills in English.

Nice to Have

Familiarity with observability and monitoring tools (e.g., Datadog).
Experience with CI/CD pipelines and tools such as ArgoCD, GitHub Actions, or similar.
Experience with Infrastructure as Code (IaC) tools such as Terraform and Kustomize.
Exposure to incident management practices, on-call rotations, and postmortem culture.

Visa sponsorship is not offered for this role.

The total compensation package will include base salary + bonus/commission + equity + benefits (health, dental, life, 401k, and paid time off). Please note that the base salary range is a guideline, and that the actual total compensation offer will be determined based on various factors, including, but not limited to, the applicant's qualifications, skills, experience, and location.

The base salary range below is for the role in California, based on a Full Time Schedule.

$160,000–$210,000 USD