[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Lean Tech is a rapidly expanding organization in the technology services sector, seeking a highly experienced Senior Site Reliability Engineer. The role focuses on evolving the reliability, security, observability, and operational maturity of the company's cloud platform, using AI tools and practices to improve operational efficiency.
Responsibilities
- Own and evolve the reliability, security, observability, and operational maturity of our cloud platform
- Use AI tools and agentic workflows to automate infrastructure and SRE tasks
- Manage production infrastructure for SaaS platforms, including senior AWS ownership
- Lead production incidents and drive root-cause analysis, creating remediation plans
- Ensure compliance with security best practices and maintain compliance controls
Skills
- Expert use of AI tools and agentic workflows to automate infrastructure and SRE tasks
- Hands-on experience using AI for Terraform development, incident triage, log analysis, runbook creation, postmortems, operational automation, CI/CD pipeline generation, and reducing repetitive operational work
- Strong understanding of AI capabilities, limitations, and necessary validation processes
- Ability to clearly articulate AI workflows, tooling choices, operational safeguards, and production outcomes
- 10+ years managing production infrastructure for SaaS platforms, including 5+ years of senior AWS ownership
- Deep expertise with AWS services such as ECS, VPC, IAM, RDS, S3, CloudFront, Route53, Lambda, API Gateway, CloudWatch, Secrets Manager, and related security and governance services
- Advanced Terraform experience managing multi-account, multi-workspace environments, including infrastructure state, drift remediation, and dependency management
- Strong understanding of Terraform provider versioning, state management, drift detection and remediation, dependency management, and infrastructure blast radius analysis
- Proven experience resolving production infrastructure drift safely
- Significant experience leading production incidents as the accountable owner
- Ability to operate calmly and effectively during high-severity outages
- Proven experience authoring detailed postmortems and operational remediation plans
- Strong understanding of operational risk management and production recovery procedures
- Proven experience leading production incidents, driving root-cause analysis, and creating remediation plans
- Strong background in observability, monitoring, logging, distributed tracing, and alerting using tools such as Grafana
- Experience owning CI/CD pipelines, deployment strategies, infrastructure automation, and operational workflows
- Strong Linux administration, containerization (Docker), networking, and scripting skills
- Experience with security best practices, identity management (SAML, OIDC, SCIM), and compliance frameworks such as SOC 2, ISO 27001, HIPAA, or PCI
- Comfortable working directly with auditors and maintaining compliance controls
- Experience supporting Spring Boot or JVM-based systems in production
- Experience with runtime security or EDR tooling such as Falco
- Experience automating joiner/mover/leaver identity workflows using SCIM and IdP tooling
- AWS certifications such as AWS Certified Solutions Architect - Professional, AWS Certified DevOps Engineer - Professional, and AWS Certified Security - Specialty
- Ability to read and debug Kotlin or Java backend services from an SRE perspective
- React, Node.js, or Backstage development experience
- MuleSoft API Management experience
Benefits
- Professional development opportunities with international customers
- Collaborative work environment
- Career path and mentorship programs that support advancement to more senior levels