Argentina

Lead HPC Network Engineer - AI Infrastructure (General …, General Rodríguez
Description
We are looking for a Lead HPC Network Engineer to drive the strategy, architecture, and engineering excellence behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client. The role focuses on defining the technical vision, leading architecture decisions, and setting engineering standards for high-performance network fabrics supporting large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a technical leader, you will mentor senior engineers, influence client roadmaps, and own end-to-end delivery of mission-critical network platforms. The ideal candidate combines deep expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading engineering teams and shaping large-scale HPC/AI network platforms.
Responsibilities
- Own the architectural vision and long-term roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics supporting large-scale GPU clusters and distributed AI/LLM workloads
- Lead the design, evaluation, and selection of cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, and define decision frameworks aligned with workload scale, performance, and cost constraints
- Establish engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Drive performance engineering initiatives for RDMA/RoCE, NCCL/MSCCL, and collective communication across multi-node GPU training workloads, and lead complex root-cause investigations
- Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
- Lead the adoption and integration strategy for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases
- Shape the network observability strategy, defining metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
- Mentor and technically lead engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, driving cross-functional alignment and resolution of complex bottlenecks
- Represent the engineering team in client and stakeholder forums, influencing technical direction, communicating trade-offs, and ensuring delivery of reliable, scalable network platforms

Requirements
- 6+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 3+ years focused on HPC, AI/ML, or GPU cluster networking, including 1+ years of demonstrated technical leadership on large-scale initiatives
- Proven experience leading the architecture and delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
- Deep expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with the ability to set standards and guide other engineers
- Strong understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, and ability to advise on workload-network co-design
- Expert-level knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
- Advanced proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at scale
- Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
- Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
- Excellent leadership, mentoring, stakeholder management, and communication skills, with experience guiding engineering teams, influencing client architecture decisions, and driving consensus across researchers and platform stakeholders
- Excellent written and verbal communication skills in English (B2+ level)

Nice to have
- Hands-on experience with Azure Networking, Ethernet, and GPGPU/GPU technologies at an architectural level
- Strong command of Grafana, Prometheus, and network administration, with experience defining observability standards
- Proven ability to design, develop, and govern Infrastructure as Code at scale
- Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling team productivity

We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Apply at Kit Empleo: kitempleo.com.ar/empleo/pobao

The ad Lead HPC Network Engineer - AI Infrastructure (General … was published in the General Rodríguez IT, telecommunications category on Locanto.