As a Site Reliability Engineer, you will work closely with the engineering teams to improve the product's observability, reliability, and scalability.
Key Responsibilities:
Work closely with the engineering and architecture teams to identify resilience gaps, and build and execute a roadmap for resolving them
Onboard K8s microservices to the GitOps-based CI/CD process and drive the migration of microservices from semi-manual GitOps releases to a fully automated, zero-downtime CI/CD process
Implement SLIs/SLOs for K8s microservices and build the process for tracking them. Identify observability gaps and execute a roadmap to close them
Respond to production issues as an on-call engineer, participate in the root cause analysis (RCA) process, and write runbooks and automation to mitigate similar issues in the future
Develop, test, execute & support disaster recovery plans for mission-critical services and sub-systems
Plan capacity and optimize cloud infrastructure costs
Implement security & compliance requirements
Requirements:
3+ years of technical experience in the same or a similar role supporting large-scale, high-load production systems
Experience developing and supporting public cloud infrastructure
Hands-on experience running highly available (HA) applications and developing CI/CD processes in Kubernetes
Proven programming skills in Python, Go or similar
Good knowledge of the Linux environment, TCP/IP, network routing, and DNS
Familiarity with SRE principles, DevOps practices, and the modern cloud-native landscape
Accuracy, attention to detail, and the ability to follow processes
Good communication skills, with English at an intermediate level or above
Pluses:
Experience working with contact centers and VoIP solutions
Ability to read and troubleshoot Java code if needed
Experience with SQL/NoSQL databases, or a willingness to develop skills in this area
We offer:
Well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and great opportunities for self-realization and professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
ReBenefit Platform Account