Potential L40S P2P Communication Issue via NCCL on Some Hosts in US-TX-4

I’m seeing a possible NCCL P2P issue on some L40S hosts in US-TX-4. Some pods hang indefinitely while others in the same region work fine. Here’s a reproducible example:
Repo: runpod-vllm-nccl-diagnostic

Observations
- Environment: 2 x L40S GPU pods in US-TX-4
- Behavior: One pod succeeded with full NCCL P2P; another hung indefinitely (log)
- Workaround: Setting NCCL_P2P_DISABLE=1 prevents the hang, at the cost of P2P performance

Why It Matters
Efficient multi-GPU communication is critical for large-scale LLM workloads. If some hosts can't support P2P, it would help to know whether we can avoid them programmatically.

Questions for Runpod & the Community
1. Hardware Differences? Are there known L40S configurations that impede NCCL P2P?
2. Mitigation: Is there a recommended approach beyond disabling P2P?
3. Filtering Hosts: Can we specify a P2P-supported filter in the GraphQL API?

If others experience similar NCCL P2P issues, feel free to check the repo and replicate; a minimal sketch of the core check is included below. Any insights or guidance are much appreciated. Thank you!
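For anyone who wants to reproduce this without pulling the whole repo, here is a minimal sketch of the kind of check involved, assuming PyTorch with the NCCL backend and two visible GPUs. The file name and details are illustrative, not the repo's actual code; running the same command with NCCL_P2P_DISABLE=1 set is the workaround variant.

```python
# nccl_p2p_check.py -- illustrative sketch, not the actual diagnostic repo code.
# Launch on a 2-GPU pod with:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_p2p_check.py
# Workaround variant:
#   NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 nccl_p2p_check.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL process group; on an affected host the hang shows up around here
    # or at the first collective below.
    dist.init_process_group(backend="nccl")

    # Ask CUDA whether the two GPUs report peer access at all.
    if torch.cuda.device_count() >= 2:
        p2p = torch.cuda.can_device_access_peer(0, 1)
        print(f"[rank {rank}] can_device_access_peer(0, 1) = {p2p}", flush=True)

    # A single all_reduce exercises the inter-GPU transport NCCL selected.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"[rank {rank}] all_reduce ok, element = {x[0].item()}", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```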
6 Replies
Dj (3w ago)
I had this issue earlier (Feb 5, 2025) using 2x L40S pods. The problematic pod has already been destroyed, but I noticed full GPU memory utilization and P2P comms hanging. I had to switch to a smaller model.
Dj (3w ago)
[image attachment, no description]
Dj (3w ago)
Using vLLM with tensor_parallel_size set to 2 (the value of RUNPOD_GPU_COUNT)
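Roughly the setup in question, as a sketch (the model name here is just a placeholder):

```python
# Sketch of the vLLM setup described above; the model name is a placeholder.
import os

from vllm import LLM, SamplingParams

# RunPod exposes the pod's GPU count via RUNPOD_GPU_COUNT (2 on this pod).
tp_size = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    tensor_parallel_size=tp_size,              # tensor parallelism across both L40S GPUs
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```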
Dj (3w ago)
195.26.232.186, in Dallas, TX according to https://ipinfo.io
Dj (3w ago)
(I don't know how to check my specific node info lol)
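In case it helps, a small sketch of what can be inspected from inside the pod: the GPU list and the PCIe/NVLink topology matrix that NCCL works with (as far as I know, RunPod's internal node ID isn't exposed to the container).

```python
# Sketch: dump GPU and topology info visible from inside the pod.
import subprocess

# Lists the GPUs and their UUIDs.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

# PCIe/NVLink topology matrix (PIX/PHB/SYS etc.) that determines the P2P path.
print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)
```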
Poddy (3w ago)
@Soulmind
Escalated To Zendesk
The thread has been escalated to Zendesk!
