Job Description
You will be a technical expert on our team, responsible for architecting, building, and scaling our infrastructure to meet the demands of a rapidly growing business. You will drive key initiatives, mentor other engineers, and set the standards for reliability and operational excellence. This role requires deep technical expertise and the ability to influence both the SRE team and the broader engineering organization.
What You'll Do:
- Technical Leadership: Lead the design and implementation of large-scale, complex infrastructure projects on AWS and GCP. Own critical reliability goals and drive their execution.
- Architecture & Strategy: Define and evolve the technical roadmap for cloud infrastructure, focusing on scalability, security, and cost optimization.
- Mentorship & Guidance: Act as a senior mentor to junior engineers, helping them grow their skills and navigate technical challenges while setting a high bar for best practices.
- Tooling & Automation: Champion advanced automation and internal tooling development to empower the engineering organization.
- Continuous Improvement: Identify systemic weaknesses and lead initiatives to address them through technology adoption, architectural changes, or process improvements.
- Incident Management: Lead major incident response efforts and shape the post-mortem culture so that the organization learns from every incident.
Requirements
- Experience: More than 6 years in senior Site Reliability, DevOps, or System Engineering roles, with demonstrated leadership and impact.
- Cloud Expertise: Deep hands-on experience with AWS and GCP at scale, including managing multi-region deployments.
- System Design: Strong background in designing highly available and resilient distributed systems.
- Scripting & Automation: Expert skills in Python or Go for complex automation and internal services.
- Infrastructure as Code: Extensive experience using Terraform to manage multiple environments.
- Leadership & Communication: Proven technical leadership and exceptional ability to communicate complex concepts to diverse audiences.