[Remote] Senior Site Reliability Engineer

Remote, USA Full-time Posted 2026-06-16

Note: This is a remote position open to candidates in the USA. Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. The company is seeking a Senior Site Reliability Engineer to maintain and improve the reliability, scalability, and performance of Java-based applications while pioneering AI-driven operations. The role involves designing and building AI workflows, managing observability tools, and collaborating with cross-functional teams to enhance system reliability.

Responsibilities

Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment
Troubleshoot and resolve complex issues across production and non-production environments
Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance
Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling
Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting
Implement and refine observability strategies that enhance visibility into application and infrastructure health
Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring
Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction
Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization
Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval
Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents
Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems
Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving
Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization
Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence
Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents
Document and share lessons learned, contributing to a culture of continuous improvement
Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements
Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language
Measure and report on toil reduction metrics to quantify the impact of automation initiatives
Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities
Collaborate with DevOps and NOC teams to support the application platform
Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders
Provide feedback on application performance, potential improvements, and observability metrics

Skills

Degree in Computer Science or a related field, or equivalent professional experience
5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems
3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security
Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management
Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting
Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection
Proficiency in PromQL and experience with Loki for log aggregation and analysis
Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization
Cloud platform expertise (AWS preferred; GCP or Azure also valued)
Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible
ArgoCD proficiency for GitOps workflows and continuous deployment
Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation
Proven track record with on-call rotations, incident response, and root cause analysis
1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context
Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks
Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines
Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent)
Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples
Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows
Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents
Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems
Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval
Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production
Contributions to open-source AI/ML or SRE tooling projects
Background in data engineering or ML pipelines that complements SRE responsibilities

Benefits

Competitive pay and benefits
Flexible vacation allowance
A hybrid / remote working environment
Startup culture backed by a secure, global brand

Company Overview

Hard Rock Digital is building the future of online sports betting and interactive gaming. It was founded in 2020 and is headquartered in Austin, Texas, USA, with a workforce of 501-1000 employees. Its website is https://www.hardrockdigital.com/.

Company H1B Sponsorship

Hard Rock Digital has a track record of offering H1B sponsorships: 3 in 2025, 4 in 2024, 5 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.
