Junior SRE (Site Reliability Engineer)
Mumbai (WFO)
Full Time
Tech & Development
Job Description
Incident Management:
Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies to minimize system downtime and impact.
Monitoring & Alerting:
Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimize system performance.
Automation & Tooling:
Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, utilizing relevant tools and technologies.
Capacity Planning:
Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.
Performance Optimization:
Analyze system metrics to identify bottlenecks, implement performance improvements, and optimize resource utilization.
Collaboration:
Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.
Technical Strategy:
Develop and implement the SRE roadmap, including technology adoption, standards, and best practices to maintain a high level of system reliability.
Required Skills
Programming Skills:
Expertise in scripting languages (Python, Bash) and the ability to develop automation tools. A basic understanding of Java is good to have.
Monitoring & Alerting:
Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.
Incident Management:
Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.
Problem-Solving:
Strong analytical and troubleshooting skills to identify and resolve complex technical issues.
Qualifications & Experience
B.E. or B.Tech.