Mintoak

SRE (Site Reliability Engineer) Lead

Mumbai (WFO)

Full Time

Tech & Development

Job Description

  • Team Leadership:

  • Manage and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of collaboration and continuous learning.

  • Design and Implement Monitoring and Alerting:

  • Lead the implementation of reliable, scalable, and fault-tolerant systems, including infrastructure, monitoring, and alerting.

  • Incident Management:

  • Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies to minimize system downtime and impact.

  • Monitoring & Alerting:

  • Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimize system performance.

  • Automation & Tooling:

  • Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, utilizing relevant tools and technologies.

  • Capacity Planning:

  • Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.

  • Performance Optimization:

  • Analyze system metrics to identify bottlenecks, implement performance improvements, and optimize resource utilization.

  • Collaboration:

  • Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.

  • Technical Strategy:

  • Develop and implement the SRE roadmap, including technology adoption, standards, and best practices to maintain a high level of system reliability.

Technical Skills and Experience

  • Technical Expertise:

  • Strong proficiency in system administration, cloud computing (AWS, Azure), networking, distributed systems, and containerization technologies (Docker, Kubernetes).

  • Programming Skills:

  • Expertise in scripting languages (Python, Bash) and the ability to develop automation tools. A basic understanding of Java is good to have.

  • Monitoring & Alerting:

  • Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.

  • Incident Management:

  • Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.

  • Leadership & Communication:

  • Excellent communication skills to convey technical concepts to both technical and non-technical audiences, and the ability to lead and motivate a team.

  • Problem-Solving:

  • Strong analytical and troubleshooting skills to identify and resolve complex technical issues.

Qualifications & Experience

  • B.E. or B.Tech.

