Staff Site Reliability Engineer
Seeking a Staff Site Reliability Engineer with extensive experience in AWS, Azure, Kubernetes, and GitOps to lead the SRE team.
Your Impact:
We are seeking a Staff Site Reliability Engineer (Infrastructure & Site Reliability Engineering) with extensive experience in AWS, Azure, Kubernetes, and GitOps to lead our Site Reliability Engineering (SRE) team. The successful candidate will have a deep understanding of SRE practices and a track record of implementing them to a high standard (SLAs, SLOs, proactive alert management, incident response and review, postmortems, etc.).
In this role, you will work with our SRE and cross-functional engineering teams to develop and operate our development and production infrastructure.
Your Role
- Work collaboratively with software engineering on infrastructure and deployment requirements
- Contribute actively to our automation and observability initiatives
- Build and maintain operational tools for deployment, monitoring, and analysis of cloud (AWS & Azure) infrastructure and systems
- Collaborate with senior team members in responding to production incidents, contribute actively to postmortems, and engage in continuous improvement efforts as part of on-call rotations, gaining exposure to critical issue resolution
- Establish and drive operations performance through SLOs
- Provide project management, sprint planning, and road-mapping support to the SRE team
- Apply expert-level technical skills and mentor team members
- Follow the practices our team uses to maximize development velocity, including but not limited to continuous integration/deployment and code review via GitHub pull requests
Your Experience
- Strong customer orientation
- Excellent interpersonal and organizational skills
- Attention to detail and focus on quality
- Strong communication skills to effectively liaise with both technical and non-technical staff
- Ability to act decisively and work well under pressure
- Collaborative approach to problem solving
- Strong bias for ownership and action
Qualifications:
- 10+ years of experience designing, building, and maintaining SaaS environments
- 6+ years of experience designing, building, and maintaining AWS/Azure infrastructure with Terraform
- 3+ years of experience building and running Kubernetes, ClickHouse, MySQL, and Kafka clusters
- Experience with observability and monitoring (logging, tracing, metrics)
- Experience with GitOps CI/CD processes
- Experience scripting in Python, Go (Golang), Bash, or PowerShell, and with AWS CLI tools
- Experience with security operations: security policies, infrastructure, key management, and setup of encryption at rest and in transit