[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. The company is seeking a Senior Site Reliability Engineer to maintain and improve the reliability, scalability, and performance of Java-based applications while pioneering AI-driven operations. The role involves designing and building AI workflows, managing observability tools, and collaborating with cross-functional teams to enhance system reliability.
Responsibilities
- Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment
- Troubleshoot and resolve complex issues across production and non-production environments
- Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance
- Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling
- Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting
- Implement and refine observability strategies that enhance visibility into application and infrastructure health
- Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring
- Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction
- Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization
- Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval
- Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents
- Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems
- Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving
- Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization
- Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence
- Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents
- Document and share lessons learned, contributing to a culture of continuous improvement
- Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements
- Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language
- Measure and report on toil reduction metrics to quantify the impact of automation initiatives
- Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities
- Collaborate with DevOps and NOC teams to support the application platform
- Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders
- Provide feedback on application performance, potential improvements, and observability metrics
Skills
- Degree in Computer Science or a related field, or equivalent professional experience
- 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems
- 3+ years of hands-on experience managing production Kubernetes clusters, including a deep understanding of architecture, networking, storage, and security
- Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management
- Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting
- Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection
- Proficiency in PromQL and experience with Loki for log aggregation and analysis
- Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization
- Cloud platform expertise (AWS preferred; GCP or Azure also valued)
- Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible
- ArgoCD proficiency for GitOps workflows and continuous deployment
- Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation
- Proven track record with on-call rotations, incident response, and root cause analysis
- 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context
- Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks
- Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines
- Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent)
- Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples
- Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows
- Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents
- Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems
- Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval
- Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production
- Contributions to open-source AI/ML or SRE tooling projects
- Background in data engineering or ML pipelines that complements SRE responsibilities
Benefits
- Competitive pay and benefits
- Flexible vacation allowance
- A hybrid / remote working environment
- Startup culture backed by a secure, global brand