Senior HPC Network Engineer (Tortuguitas)
Tortuguitas, Argentina
Posted: less than a month ago
Description
Senior HPC Network Engineer to support advanced AI, research, and Kubernetes-based GPU infrastructure for a major general technology client. The role focuses on architecting, operating, and optimizing high-performance network fabrics for large-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.
The ideal candidate has strong hands-on experience with InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters.
Responsibilities
- Architect, operate, and troubleshoot high-performance InfiniBand/RDMA and Ethernet fabrics for large-scale GPU clusters and distributed AI/LLM workloads
- Design and evaluate cluster network topologies, including Fat-tree, Clos, Rail-optimized, and Dragonfly, based on workload scale and performance needs
- Optimize host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Tune and troubleshoot RDMA/RoCE, NCCL/MSCCL, and collective communication performance for multi-node GPU training workloads
- Design and maintain Kubernetes networking for GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration
- Support SmartNIC/DPU technologies such as NVIDIA BlueField where applicable, including SR-IOV, offload, isolation, and security use cases
- Build and improve network observability, including metrics, dashboards, alerts, congestion detection, latency tracing, SLO reporting, and capacity/performance analysis
- Collaborate with Kubernetes, storage, GPU infrastructure, observability, and AI research teams to resolve network and I/O bottlenecks and improve workload reliability
Requirements
- 5+ …
Apply at Kit Empleo: kitempleo.com.ar/empleo/pn2lk
Key information
- Company name: EPAM Systems
- Job title: Senior HPC Network Engineer (Tortuguitas)
More about this listing
The listing Senior HPC Network Engineer (Tortuguitas) was posted on Locanto in the category Pablo Nogués Informática, telecomunicación.