[Remote] Software Engineer 5 – Model Runtime, AI Platform
Note: This is a remote position open to candidates in the USA. Netflix is the world's leading streaming entertainment service, with a mission to entertain the world. The company is seeking a Software Engineer to join the Model Runtime team, which builds and optimizes the ML infrastructure that supports Netflix's critical machine learning models.
Responsibilities
- Build alignment and post-training infrastructure — Design infrastructure for reinforcement learning (GRPO, DPO, PPO), reward modeling, and preference optimization so Netflix can train recommendation models directly against what members actually value
- Enable next-generation GenAI workloads — Create infrastructure for multimodal and diffusion models, including distributed training, disaggregated serving, real-time, near-real-time and batch inference, and asynchronous GPU pipelines
- Scale distributed training — Engineer fault-tolerant training systems using FSDP, tensor/pipeline/context parallelism, and mixed-precision strategies across clusters of hundreds of GPUs
- Optimize across the full stack — Profile and tune from PyTorch operators down to GPU kernels, driving utilization improvements and building cost models that inform infrastructure strategy
- Evaluate emerging hardware and frameworks — Be the team's eyes on specialized accelerators, next-gen NVIDIA silicon, and the open-source ecosystem to keep Netflix at the efficiency frontier
Skills
- Experience in ML systems engineering — building infrastructure for training, fine-tuning, or inference of pre-LLM and post-LLM era models at scale
- Strong systems programming skills with the ability to work across multiple layers of the stack, from high-level ML frameworks down to GPU kernels and memory management
- Hands-on experience with PyTorch internals, large-scale distributed training, and system-model co-design
- Comfortable with ambiguity and working across multiple business and technical domains to execute on both 0-to-1 and 1-to-100 projects
- Ability to adopt and promote best practices in operations, including observability, logging, reporting, and on-call processes, to ensure engineering excellence
- Experience with cloud computing providers, preferably AWS
- Excellent written and verbal communication skills; effective across distributed time zones and remote environments
- Deep experience with distributed training at scale (FSDP, parallelism strategies, checkpointing) or LLM post-training (SFT, RLHF, DPO/GRPO)
- Inference optimization — vLLM, TensorRT, quantization, continuous batching, KV-cache management
- GPU performance profiling and tuning (CUDA, NCCL, Nsight, PyTorch profiler)
- Experience with multimodal or diffusion model architectures and generation pipelines
- Track record building reusable ML libraries or contributing to open-source ML projects
Benefits
- Health Plans
- Mental Health support
- A 401(k) Retirement Plan with employer match
- Stock Option Program
- Disability Programs
- Health Savings and Flexible Spending Accounts
- Family-forming benefits
- Life and Serious Injury Benefits
- Paid leave of absence programs
- Full-time hourly employees accrue 35 days of paid time off annually, to be used for vacation, holidays, and sick time
- Full-time salaried employees are immediately entitled to flexible time off