Created by Soulmind on 2/5/2025 in #⛅|pods
Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4
I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
Repro repo: runpod-vllm-nccl-diagnostic

Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: one pod completed with full NCCL P2P; the other hung indefinitely (log in the repo). A minimal repro sketch is included below.
- Workaround: setting NCCL_P2P_DISABLE=1 prevents the hang, but reduces performance since P2P is disabled.

Why It Matters
Efficient multi-GPU communication is critical for large-scale LLM workloads. If some hosts can't support P2P, it would help to know whether we can avoid them programmatically.
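For anyone who wants to reproduce without pulling the repo, here is a minimal sketch of the kind of two-GPU NCCL collective that hangs on the affected hosts. It is a simplification, not the exact script in the diagnostic repo; the tensor size and port are arbitrary.

```python
# Minimal 2-GPU NCCL check (sketch). On healthy hosts it prints a result for
# each rank; on the problematic hosts the all_reduce hangs indefinitely.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Uncomment to apply the workaround (avoids the hang, disables P2P):
    # os.environ["NCCL_P2P_DISABLE"] = "1"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    if rank == 0:
        # Quick capability check: does the driver report peer access between GPU 0 and 1?
        print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

    x = torch.ones(1 << 22, device=f"cuda:{rank}")
    dist.all_reduce(x)          # hangs on the affected hosts
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce finished, x[0] = {x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Running it with NCCL_DEBUG=INFO shows which transport NCCL selects, which is useful for comparing a healthy host against a hanging one.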
Questions for Runpod & the Community
1. Hardware Differences?
Are there known L40S configurations that impede NCCL P2P?
2. Mitigation
Any recommended approach beyond disabling P2P?
3. Filtering Hosts
Can we specify a P2P-support filter in the GraphQL API? (A hypothetical sketch of what I mean is in the P.S. below.)

If others are seeing similar NCCL P2P issues, feel free to check the repo and try to replicate. Any insights or guidance are much appreciated. Thank you!
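P.S. To make question 3 concrete, here is roughly what I have in mind. The p2pSupported field is hypothetical (it does not exist as far as I know; that is exactly what I'm asking about), and the endpoint/auth style is my assumption based on how I currently call the GraphQL API. Ideally something similar would also exist at the host level when deploying a pod.

```python
# Purely hypothetical sketch for question 3: "p2pSupported" is NOT a real field.
import os

import requests

API_URL = "https://api.runpod.io/graphql"   # assumed public GraphQL endpoint
API_KEY = os.environ["RUNPOD_API_KEY"]

QUERY = """
query L40SP2PCapability {
  gpuTypes {
    id
    displayName
    # p2pSupported  <- hypothetical field: "does this GPU type / host support NCCL P2P?"
  }
}
"""

resp = requests.post(API_URL, params={"api_key": API_KEY}, json={"query": QUERY})
resp.raise_for_status()
print(resp.json())
```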