[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.
Responsibilities
- Define and implement SLIs/SLOs for critical services
- Lead incident response and coordinate cross-team mitigation efforts
- Conduct blameless postmortems and ensure corrective actions are completed
- Perform production readiness reviews for new services and features
- Identify systemic risks and drive preventative improvements
- Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
- Improve signal-to-noise ratio in alerts and reduce alert fatigue
- Build internal tooling for reliability tracking and reporting
- Improve visibility into GPU performance and distributed systems health
- Automate recurring operational workflows
- Build tools and scripts (Python, Go, Bash) to eliminate manual processes
- Improve deployment safety through automation and guardrails
- Strengthen CI/CD reliability and release processes
- Partner with engineering teams to improve system resilience
- Provide guidance on fault tolerance, scalability, and failure handling
- Contribute to architectural discussions with a reliability-first mindset
Skills
- 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
- Strong Linux systems and networking expertise
- Experience managing containerized production systems
- Strong understanding of distributed systems and failure modes
- Experience defining and managing SLIs/SLOs
- Proven incident response and postmortem leadership experience
- Strong scripting or programming skills
- Experience with monitoring and alerting systems
- Excellent written communication skills
- Successful completion of a background check
- Experience with GPU infrastructure or AI/ML platforms
- Experience improving reliability in high-growth or large-scale environments
- Familiarity with GPU observability tooling
- Experience with Infrastructure as Code
- Experience working in startup environments
- Experience building internal reliability platforms or frameworks
Benefits
- Meaningful equity in a fast-growing company: everyone on the team receives stock options. Your impact drives our growth, and you share in the upside.
- Generous medical, dental & vision plans
- Flexible PTO: take the time you need to recharge
- Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
- Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.