[Remote] Senior AI Performance and Efficiency Engineer
Note: This is a remote position open to candidates in the USA. NVIDIA is a leading technology company focused on AI and GPU computing, and it is seeking a Senior AI/ML Performance and Efficiency Engineer. The role involves improving researcher efficiency by collaborating on infrastructure and application improvements that support groundbreaking AI and ML research on GPU clusters.
Responsibilities
- Collaborate closely with our AI/ML researchers to make their ML models more efficient, leading to significant productivity improvements and cost savings
- Build tools and frameworks, and apply ML techniques, to detect and analyze efficiency bottlenecks and deliver productivity improvements for our researchers
- Work with researchers on a variety of innovative ML workloads across robotics, autonomous vehicles, LLMs, video, and more
- Collaborate across the engineering organizations to deliver efficiency in our usage of hardware, software, and infrastructure
- Proactively monitor fleet-wide utilization patterns, analyze existing inefficiency patterns or discover new ones, and deliver scalable solutions to address them
- Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization
Skills
- BS or similar background in Computer Science or related area (or equivalent experience)
- 5+ years of experience designing and operating large-scale compute infrastructure
- Strong understanding of modern ML techniques and tools
- Experience investigating and resolving training and inference performance issues end to end
- Debugging and optimization experience with Nsight Systems and Nsight Compute
- Experience with debugging large-scale distributed training using NCCL
- Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms
- Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector
- Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with Machine Learning and Deep Learning concepts, algorithms and models
- Familiarity with InfiniBand with IBOP and RDMA
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow
Benefits
- Equity
- Comprehensive benefits package