Argentina

Senior HPC Network Engineer - AI Infrastructure (General …, General Rodríguez
Description
Senior HPC Network Engineer to support advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on architecting, operating, and optimizing high-performance network fabrics for large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.

The ideal candidate has strong hands-on experience with InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters.

Responsibilities
- Architect, operate, and troubleshoot high-performance InfiniBand/RDMA and Ethernet fabrics for large-scale GPU clusters and distributed AI/LLM workloads
- Design and evaluate cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, based on workload scale and performance needs
- Optimize host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Tune and troubleshoot RDMA/RoCE, NCCL/MSCCL, and collective communication performance for multi-node GPU training workloads
- Design and maintain Kubernetes networking for GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
- Support SmartNIC/DPU technologies such as NVIDIA BlueField where applicable, including SR-IOV, offload, isolation, and security use cases
- Build and improve network observability, including metrics, dashboards, alerts, congestion detection, latency tracing, SLO reporting, and capacity/performance analysis
- Collaborate with Kubernetes, storage, GPU infrastructure, observability, and AI research teams to resolve network and I/O bottlenecks and improve workload reliability

Requirements
- 5+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 2+ years focused on HPC, AI/ML, or GPU cluster networking
- Proven hands-on experience with InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in performance-critical distributed compute environments
- Understanding of host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity
- Knowledge of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather
- Expertise in Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration
- Proficiency in RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning
- Skills in Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics
- Background in network observability and performance management, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting, plus troubleshooting across L1-L4, fabric, and RDMA layers
- Strong troubleshooting, root-cause analysis, documentation, and communication skills for working with client engineering teams, researchers, and platform stakeholders
- English level of at least B2 (Upper-Intermediate) for effective communication

Nice to have
- Familiarity with Azure Networking, Ethernet, and GPGPU/GPU technologies
- Competency in Grafana, Prometheus, and Network Administration
- Capability to develop and maintain Infrastructure as Code
- Flexibility to use Python and UNIX shell scripting for automation and tooling

We offer
- International projects with top brands
- Work with integral teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Apply at Kit Empleo: kitempleo.com.ar/empleo/po3l8

The listing Senior HPC Network Engineer - AI Infrastructure (General … was published in the General Rodríguez IT, telecommunications category on Locanto.