Soulmind
Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4
I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
Repo: runpod-vllm-nccl-diagnostic

Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; the other hung indefinitely (log)
- Workaround: Setting NCCL_P2P_DISABLE=1 prevents the hang, but it disables P2P and reduces communication performance
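For anyone who wants to replicate without cloning the repo, below is a minimal standalone sketch of the same kind of test (my approximation, not the repo's exact script; the master address/port are arbitrary placeholders). On a healthy host both ranks print a result; on an affected host it hangs inside all_reduce unless NCCL_P2P_DISABLE=1 is exported first.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Arbitrary local rendezvous; change the port if it's taken.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1024, device=f"cuda:{rank}") * (rank + 1)
    # On the affected hosts this collective never returns unless
    # NCCL_P2P_DISABLE=1 is set in the environment.
    dist.all_reduce(x)
    print(f"rank {rank}: all_reduce ok, x[0] = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    # Assumes 2 GPUs, matching the pods above.
    mp.spawn(worker, args=(2,), nprocs=2)
```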
Why It Matters
Efficient multi-GPU comms are critical for large-scale LLM workloads. If some hosts can’t support P2P, it would help to know if we can avoid them programmatically.
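One partial answer to the "avoid them programmatically" part, at least as a pre-flight check inside a pod, might look like the sketch below (my own suggestion, not something from the repo). Caveat: CUDA can report peer access as available even when the underlying P2P path is broken, e.g. by IOMMU/ACS settings, so a True here is not a guarantee; a False, though, is a clear red flag.

```python
import torch

def print_p2p_matrix():
    # Ask CUDA whether each GPU pair reports peer (P2P) access.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'NO'}")

if __name__ == "__main__":
    print_p2p_matrix()
```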
Questions for Runpod & the Community
1. Hardware Differences?
Are there known L40S configurations that impede NCCL P2P?
2. Mitigation
Any recommended approach beyond disabling P2P?
3. Filtering Hosts
Can we filter for P2P-supported hosts in the GraphQL API? (A sketch of the kind of query I have in mind is at the end of this post.)

If others experience similar NCCL P2P issues, feel free to check the repo and try to replicate. Any insights or guidance are much appreciated. Thank you!
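PS: to make question 3 concrete, here is a sketch of the kind of query I have in mind. The endpoint and the gpuTypes query are real as far as I can tell from the RunPod docs, but the P2P filter is hypothetical: it is exactly the field I am asking for, not something I know to exist.

```python
import requests

# The p2pSupported field in the comment below is HYPOTHETICAL;
# the rest mirrors the documented gpuTypes query.
QUERY = """
query {
  gpuTypes {
    id
    displayName
    memoryInGb
    # wishlist: a p2pSupported flag / filter argument would answer question 3
  }
}
"""

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": "YOUR_API_KEY"},  # placeholder, use a real key
    json={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```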