[Remote] Senior Principal Back-End Network Engineer, AI Infrastructure
Note: This is a remote position open to candidates in the USA. Nscale is a GPU cloud company engineered for AI, providing high-performance infrastructure for AI startups and large enterprises. They are seeking a Principal Network Engineer for AI Infrastructure to lead the technical direction and operational strategy of their AI interconnect networks, focusing on the reliability and scalability of InfiniBand and RDMA-based network fabrics.
Responsibilities
- Owning the technical direction and operational strategy for Nscale’s AI interconnect networks
- Designing, reviewing, and evolving large-scale InfiniBand and RoCE fabric architectures to support future growth and workload demands
- Acting as the senior escalation point for the most complex network incidents, guiding deep technical investigations and systemic fixes
- Driving cross-team initiatives to improve fabric reliability, performance predictability, and operational maturity
- Defining standards for hardware configuration, congestion control, routing, firmware lifecycle management, and change safety
- Partnering with SRE, Compute Platform, and Network Architecture teams to influence end-to-end system design
- Mentoring senior and mid-level network engineers, raising the bar for operational rigor and technical excellence
- Driving measurable improvements in uptime, latency consistency, capacity efficiency, and incident reduction
Skills
- 12+ years of experience in network engineering, with deep focus on HPC, AI, or hyperscale data centre networking
- Expert-level operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics
- Deep understanding of RDMA internals, congestion management, and fabric-level failure modes
- Strong expertise in modern data centre routing and control planes (BGP, OSPF, ECMP)
- Proven ability to debug and resolve cross-layer issues spanning hardware, firmware, kernel, and application communication libraries
- Demonstrated ability to lead complex technical initiatives across teams without direct authority
- A systems-level mindset, balancing performance, reliability, scalability, and operational cost
- Extensive experience with NVIDIA/Mellanox networking platforms in production AI or HPC environments
- Deep familiarity with distributed training frameworks and GPU communication patterns
- Experience designing network observability systems for high-cardinality, high-throughput environments
- Prior experience influencing platform or infrastructure strategy at scale
Benefits
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup. This is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as a human first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.