We are looking for an experienced Senior Site Reliability Engineer to join the RingCentral Operations Intelligence team. As a Sr. Site Reliability Engineer, you will be responsible for maintaining and operating our monitoring systems and infrastructure. Our team focuses on providing accurate operational insights into the system, from the collection and storage of metrics and logs to the correlation of alerts and their presentation. You will play a crucial role in ensuring the reliability, performance, and availability of our monitoring platform by identifying potential issues and proactively resolving them. The ideal candidate has a background in cloud operations and monitoring technologies such as ELK, Zabbix, and VictoriaMetrics, as well as experience with containerization using Kubernetes, message brokers like Kafka, and SQL/NoSQL databases. Programming experience is desired.
Responsibilities:
Maintain the availability of the monitoring infrastructure (the primary responsibility).
Make changes to the monitoring system according to the company’s needs and processes.
Collaborate with development and operations teams to integrate monitoring solutions into the software development lifecycle and operational processes.
Track and plan for capacity requirements in a growing environment.
Actively work with the team’s codebase to extend system integrations and automate routine tasks.
Conduct regular audits and assessments of monitoring systems to ensure adherence to best practices and industry standards.
Represent the team in global incident resolution and participate in the on-call rotation.
Maintain the documentation.
Skills:
4+ years of proven experience as an SRE, Systems Engineer, or in a similar role.
Strong Linux administration skills.
Problem-solving and troubleshooting skills.
Knowledge of at least one of the listed programming languages (see Preferable technology stack).
Understanding of the monitoring domain and SaaS approaches.
Experience with cloud platforms.
Knowledge of one or more configuration management tools.
Familiarity with ITIL or other IT service management frameworks.
Experience in implementing and operating monitoring systems in large-scale, heterogeneous, and fast-growing environments would be a plus.
Ability to work in a diverse multicultural environment, communicating with globally distributed teams.
Customer-centric mindset.
Team player who is also a self-starter.
Fluent in spoken and written English.
Preferable technology stack:
OS: Linux (CentOS/RedHat/Oracle Linux).
Programming languages in order of preference: Go, Python, JavaScript/TypeScript (would be a plus).
Cloud: AWS.
Containerization: Kubernetes, Docker.
Distributed log: Kafka, ELK stack.
Monitoring: Zabbix, Prometheus, CloudWatch, Grafana.
DBs: VictoriaMetrics, MongoDB, PostgreSQL, ClickHouse, MySQL.
Configuration Mgmt: Ansible, Terraform, ArgoCD, Spinnaker.
CI/CD: GitLab CI.
HA: Keepalived, HAProxy.
Qualifications:
B.S. in Computer Engineering, Computer Science, or a related field with 4+ years of related experience.
We offer:
Well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and great opportunities for self-realization and for professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
This role requires on-site presence at our office 4 days a week to support effective collaboration and teamwork.