The Messaging team develops and maintains a collaboration tool: a messenger that provides a unified workspace for teams. It enables users to exchange messages, files, links, and notes, collaborate on documents, manage shared calendars, and use bots and integrations.
The service is part of a unified communication system that supports internal calls within the service (WebRTC), calls to traditional and mobile phones, video chats, webinars, and integrations with cloud PBX, contact centers, and omnichannel platforms.
This is a fault-tolerant, low-latency system with a distributed architecture, operating 24/7 with 99.999% availability.
We are currently looking for passionate professionals to strengthen our team!
Our technology stack includes Java, Node.js, Redis, Kafka, MongoDB, Elasticsearch, Docker, ELK, Prometheus Stack, and Kubernetes. The service is hosted on AWS.
We’re currently looking for an experienced Site Reliability Engineer (SRE) or DevOps engineer to join our team.
As an SRE, you will be responsible for maintaining and improving uptime and availability across several of our services. You will play a crucial role in ensuring the reliability, performance, and availability of our services by identifying potential issues and proactively resolving them. The ideal candidate should have a background in various service observability platforms as well as experience with containerization using Kubernetes, message queuing systems like Kafka, and SQL/NoSQL databases. Programming experience is desired for the role.
As a DevOps engineer, you will be responsible for maintaining and improving our Kubernetes-based engineering platform, our build and delivery processes and tools, and our CI/CD, and for collaborating with QA, developers, and other teams. The ideal candidate should have a background in Kubernetes, public clouds, and various modern tools, including monitoring, logging, CI, and automation. Programming experience is desired for the role.
Responsibilities:
Collaborate with development and operations teams to integrate monitoring solutions into the software development lifecycle and operational processes.
Define, propose, and drive efforts to continually improve monitoring, troubleshooting, and self-healing for our services.
Design and implement redundancy, failover mechanisms, and load-balancing strategies to ensure system reliability.
Conduct risk assessments, identify potential points of failure in the infrastructure, and propose solutions to fix them.
Respond to incidents and outages while on call and take action to mitigate them.
Stay on top of capacity requirements in a growing environment.
Actively work with various teams’ codebases to extend observability and improve uptime.
Represent the team in global incident resolution and participate in the on-call rotation.
Requirements:
Hard skills
4+ years of proven experience as an SRE or in a similar role.
Problem-solving and troubleshooting skills.
Hands-on experience with Linux systems (Red Hat-based) in large-scale production environments over the past several years.
Strong knowledge of computer networks and their principles, including the architecture and operation of web applications; key protocols such as DNS, HTTP, and HTTPS; the OSI and TCP/IP models; and traffic routing principles.
Knowledge of at least one of the programming languages (see Preferable technology stack).
Experience with cloud platforms: AWS, Azure, or GCP (AWS preferred).
Understanding of and experience with the IaC paradigm, and knowledge of one or more configuration management tools, particularly Terraform and Ansible.
Proficiency in working with containerized applications using Docker and modern orchestration tools (Kubernetes preferred).
Soft skills
Ability to work in a diverse multicultural environment, communicating with globally distributed teams.
A software engineer mindset with a strong focus on automating routine tasks.
Self-starting team player with a strong drive to dig deep and solve problems.
Fluent in spoken and written English (upper-intermediate level or higher).
Preferred Qualifications:
B.S. in Computer Engineering, Computer Science, or equivalent experience, with 4+ years of related experience.
Proven experience influencing the software engineering of cloud/SaaS services.
Familiarity with AI, LLMs, and related technologies.
Deep understanding of the DevOps lifecycle and its application within organizations.
Experience in software development and working as a software engineer.
We offer:
Well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and great opportunities for self-realization and professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
ReBenefit Platform Account