Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4

I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
Repo: runpod-vllm-nccl-diagnostic

Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; another hung indefinitely (log)
- Workaround: Setting NCCL_P2P_DISABLE=1 prevents the hang, at the cost of P2P performance

Why It Matters
Efficient multi-GPU communication is critical for large-scale LLM workloads. If some hosts can't support P2P, it would help to know whether we can avoid them programmatically.

Questions for Runpod & the Community
1. Hardware Differences? Are there known L40S configurations that impede NCCL P2P?
2. Mitigation: Is there a recommended approach beyond disabling P2P?
3. Filtering Hosts: Can we specify a P2P-supported filter in the GraphQL API?

If others experience similar NCCL P2P issues, feel free to check the repo and replicate; a minimal sketch of the core check is included below. Any insights or guidance are much appreciated. Thank you!
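For anyone who wants to reproduce this without pulling the whole repo, here is a minimal sketch of the kind of check involved, assuming PyTorch with the NCCL backend and two visible GPUs. The file name and details are illustrative, not the repo's actual code; running the same command with NCCL_P2P_DISABLE=1 set is the workaround variant.

```python
# nccl_p2p_check.py -- illustrative sketch, not the actual diagnostic repo code.
# Launch on a 2-GPU pod with:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_p2p_check.py
# Workaround variant:
#   NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 nccl_p2p_check.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL process group; on an affected host the hang shows up around here
    # or at the first collective below.
    dist.init_process_group(backend="nccl")

    # Ask CUDA whether the two GPUs report peer access at all.
    if torch.cuda.device_count() >= 2:
        p2p = torch.cuda.can_device_access_peer(0, 1)
        print(f"[rank {rank}] can_device_access_peer(0, 1) = {p2p}", flush=True)

    # A single all_reduce exercises the inter-GPU transport NCCL selected.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"[rank {rank}] all_reduce ok, element = {x[0].item()}", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```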
6 Replies
Dj (3w ago)
I had this issue earlier (Feb 5, 2025) using 2x L40S pods. The problematic pod has already been destroyed, but I noticed full GPU memory utilization and P2P comms hanging. I had to switch to a smaller model.
Dj (3w ago)
[image attachment, no description]
Dj (3w ago)
Using vLLM with tensor_parallel_size set to 2 (the value of RUNPOD_GPU_COUNT)
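Roughly the setup in question, as a sketch (the model name here is just a placeholder):

```python
# Sketch of the vLLM setup described above; the model name is a placeholder.
import os

from vllm import LLM, SamplingParams

# RunPod exposes the pod's GPU count via RUNPOD_GPU_COUNT (2 on this pod).
tp_size = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    tensor_parallel_size=tp_size,              # tensor parallelism across both L40S GPUs
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```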
Dj (3w ago)
195.26.232.186, in Dallas, TX according to https://ipinfo.io
Dj (3w ago)
(I don't know how to check my specific node info lol)
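In case it helps, a small sketch of what can be inspected from inside the pod: the GPU list and the PCIe/NVLink topology matrix that NCCL works with (as far as I know, RunPod's internal node ID isn't exposed to the container).

```python
# Sketch: dump GPU and topology info visible from inside the pod.
import subprocess

# Lists the GPUs and their UUIDs.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

# PCIe/NVLink topology matrix (PIX/PHB/SYS etc.) that determines the P2P path.
print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)
```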
Poddy (3w ago)
@Soulmind
Escalated To Zendesk
The thread has been escalated to Zendesk!
