Senior Site Reliability Engineer

Mashreq
Full-time
On-site
United Arab Emirates

JobsCloseBy Editorial Insights

Mashreq is hiring a Senior Site Reliability Engineer, on-site in the United Arab Emirates, to own observability and reliability initiatives across squads, enabling proactive monitoring, incident detection, and performance optimization. The ideal candidate has 8+ years in IT infrastructure and applications, 5+ years in observability and continuous integration, and a programming background in Java or related technologies, with deep expertise in AppDynamics, Grafana, ELK/Splunk/Loki/OpenSearch, and cloud-native Kubernetes observability on Azure. You will collaborate with engineering and operations to define KPIs, implement comprehensive monitoring, and lead RCA during incidents while driving proactive improvements for scalability and customer experience. When applying, tailor your resume to quantify impact and to highlight cross-team collaboration, tool mastery, and clear communication.


Responsible for ensuring robust service observability across all squads, enabling proactive monitoring, incident detection, and performance optimization. This role leads reliability initiatives across teams, fostering a culture of resilience and operational excellence. By supporting development teams in gaining deep visibility into systems and user journeys, the SRE helps identify bottlenecks, improve system behavior, and enhance customer experience. The role emphasizes proactive optimization, problem solving, and continuous improvement of reliability, scalability, and availability across the technology landscape.

•    Collaborate with engineering, operations, and other stakeholders to understand monitoring requirements and performance goals.
•    Support teams in defining key performance indicators (KPIs) and metrics, diagnose issues, and proactively identify areas for optimization.
•    Develop and implement observability processes to enable comprehensive monitoring, logging, and tracing of systems and applications across all teams.
•    Provide proactive approaches to monitoring problems by utilizing existing observability tools and domain expertise.
•    Apply in-depth knowledge of application performance metrics, monitoring, and troubleshooting.
•    Provide expertise in problem detection, isolation, and root cause analysis (RCA) during incident management, supported by relevant data and artifacts from observability tools and the corresponding systems.
•    Provide timely and accurate reports on application performance, highlighting key insights and trends.
•    Collaborate with digital squads to implement performance improvements, including configuration optimizations and infrastructure adjustments.
•    Offer guidance and training to end-users and internal teams on best practices for APM and optimizing application performance.
 

•    8+ years of overall experience with IT infrastructure and applications
•    5+ years of hands-on experience in observability and continuous integration
•    2 years of programming background in Java or related technologies
•    Deep knowledge of AppDynamics, Grafana, and similar tools 
•    Expertise with ELK tools, Splunk, Loki, or OpenSearch
•    Skilled in services and trace correlation
•    Cloud Native and Kubernetes observability
•    Knowledge of cloud infrastructure (Azure) and cluster management tools like Kubernetes
•    Strong communication skills with ability to align the organization on complex technical decisions
•    Bachelor's or master's degree in Information Technology, Computer Science, or a related quantitative discipline