We’re currently looking for a Director of Site Reliability Engineering (SRE).
In this position, you will oversee our existing global organization (United States, Europe, India, China) and elevate our incident response and prevention, monitoring, capacity planning, and technology stack modernization with a focus on cost efficiency.
Responsibilities:
Lead incident management, monitoring, and capacity planning processes
Establish robust knowledge management practices
Manage team capacities and resources effectively
Oversee projects within your teams
Adopt a modern CI/CD technical stack (GitOps) and support the migration of applications to Kubernetes
Support and develop engineers, team leads, and managers, fostering their growth
Ensure the well-being of the team, the health of the service, and effective communication
Establish high-quality cross-team communications with service developers, infrastructure teams, SecOps, FinOps, and product teams, ensuring transparency for all stakeholders
Fine-tune budget accuracy (public clouds, on-premise, human resources)
Implement structured, documented, and efficient global in-team processes
Ensure compliance with certification requirements (PCI, SOC2, GDPR, etc.)
Requirements:
Practical experience in managing high-availability services
Experience managing multiple teams (manager of managers)
Proven track record in initiating and implementing improvements
Project management experience
Hands-on technical background
Understanding of SRE practices
Familiarity with the software development lifecycle
Ability to multitask and reprioritize in a fast-paced environment
Capacity to work with minimal supervision and meet strict deadlines
Strong sense of responsibility, proactivity, and collaboration
Advanced English proficiency
Teams' Tech Stack:
Public Clouds: Amazon EC2, Amazon EKS, Storage, Virtual Networking, Load Balancing, Databases
On-Premise: VMware vSphere, Kubernetes, Oracle Linux
Source Control: Git
Infrastructure as Code (IaC): Terraform
Configuration Management: Ansible, in-house
CI/CD: Jenkins, GitLab, Amazon ECR, Helm, Kustomize, Flux, in-house
Logs: ELK Stack
Monitoring: Prometheus, Grafana, VictoriaMetrics, Zabbix
Incident Management, Change Management: PagerDuty, in-house
Automation: Python, Go
What we offer:
Well-coordinated professional team;
Life assurance and private medical insurance;
Competitive salary;
Great opportunities for self-realization, professional and career growth;
Corporate training programs, free language courses;
Excellent work environment and good collaboration;
Opportunity to be a part of an international company.