Senior Site Reliability Engineer
Seeking a Site Reliability Engineer to build and maintain scalable, reliable, and secure cloud infrastructure on AWS, Kubernetes, and related technologies.
Snyk, a leader in developer security, has acquired Probely, a modern Dynamic Application Security Testing (DAST) provider based in Portugal that covers both web application and API security testing.
We are seeking a skilled and proactive Site Reliability Engineer (SRE) to join our team and support our hypergrowth by building scalable, reliable, and secure cloud infrastructure. You will be responsible for ensuring the performance and uptime of our systems while adopting DevOps best practices and leveraging modern tools.
You’ll Spend Your Time:
- Designing, deploying, and maintaining infrastructure on AWS, including VPC, EC2, RDS, IAM and EKS clusters.
- Managing Kubernetes clusters across multiple environments with a focus on performance, security, and availability.
- Using ArgoCD, Kustomize and Helm for continuous deployment and GitOps workflows.
- Implementing and managing monitoring and alerting systems using Prometheus, Grafana, and custom exporters.
- Maintaining centralized logging and observability using Graylog and OpenSearch.
- Automating infrastructure provisioning with Terraform and custom scripting in Python or Bash.
- Implementing best practices around networking, including VPN, load balancing, routing, and firewalls.
- Troubleshooting complex system issues across network, infrastructure, and application layers.
- Ensuring high availability, scalability, and disaster recovery across all systems.
- Collaborating with development and operations teams to improve deployment processes and infrastructure resiliency.
What You’ll Need:
- Strong hands-on experience with AWS services (VPC, EC2, EKS, RDS, IAM).
- Deep understanding of Kubernetes architecture and day-to-day cluster management.
- Experience with Cloudflare products (DNS, Zero Trust, WAF, CDN).
- Proficiency in the Prometheus + Grafana monitoring stack.
- Strong experience with Calico for managing Kubernetes network policies.
- Solid experience with Graylog and OpenSearch for logging and search analytics.
- Proficient with Infrastructure as Code tools, especially Terraform, Kustomize and Helm.
- Experience with CI/CD pipelines and GitOps practices using ArgoCD.
- Strong scripting and automation skills in Bash and/or Python.
- Solid knowledge of networking principles (TCP/IP, DNS, HTTP/HTTPS, VPNs, security groups, etc.).
We’d be Lucky if You:
- Are familiar with incident management practices (on-call, runbooks, postmortems, disaster recovery).
- Understand Zero Trust security models and security best practices in cloud environments.
- Have exposure to service meshes (Istio, Linkerd) and container networking.
- Have experience with cost optimization and cloud spend monitoring.
- Are familiar with Linux system administration and shell scripting.
- Know RBAC and IAM in AWS and Kubernetes.