Downloading models causes the pod to freeze

Hey, not sure if I'm missing something obvious here. I'm noticing two problems (they might have the same cause):

1. I'm trying to download phi-4 14b from HuggingFace. I'm not doing anything out of the ordinary, I just run
hf_model = transformers.AutoModelForCausalLM.from_pretrained(model_id, token=hf_token, torch_dtype=torch.bfloat16).to(device)
The download goes to the volume disk. Partway through downloading the 2nd of 6 .safetensors files, the pod freezes. I lose the SSH connection and the RunPod dashboard shows 100% memory usage. At that point I can only restart the pod.

2. When I try rsyncing 6 GB of files from my local machine to a pod (e.g. Llama3.2-3B-Instruct), it uses up the RAM and freezes, more often than not. Sometimes restarting the pod helps, but sometimes the only way is to download the weights instead of rsyncing them up.

I'm using:
- 1xA40, 50GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image
- 20GB container, 40GB volume (with enough free space before starting the download)

Thanks!
4 Replies
drazenz (OP), 2w ago
Another note: I run unminimize after setting up the pod to get rsync and other common command-line tools.
riverfog7, 2w ago
It's probably trying to load the model to the CPU first and then transfer it to the GPU.
drazenz (OP), 2w ago
@riverfog7 hey that's not what happens, it fails while downloading the files
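One way to confirm which step freezes is to run the download and the load as two separate steps. This is only a rough sketch, assuming `huggingface_hub` is installed alongside `transformers`. The model id `microsoft/phi-4`, the `/root/models` directory, and all three helper names are placeholders for illustration, not something from this thread.

```python
from pathlib import Path


def local_dir_for(model_id: str, root: str = "/root/models") -> Path:
    """Map 'org/name' to a folder under root (hypothetical layout)."""
    return Path(root) / model_id.replace("/", "--")


def download_only(model_id: str, token=None, root: str = "/root/models") -> Path:
    """Fetch the repo files to disk without loading any weights into memory."""
    # Imported lazily so this file can be read without the library installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(model_id, root)
    snapshot_download(repo_id=model_id, token=token, local_dir=str(target))
    return target


def load_from_disk(path, device: str = "cuda"):
    """Load already-downloaded weights; no network access is needed here."""
    import torch
    import transformers

    model = transformers.AutoModelForCausalLM.from_pretrained(
        str(path),
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,  # avoids a second full copy of the weights in RAM
    )
    return model.to(device)
```

Calling `download_only("microsoft/phi-4", token=hf_token)` on its own, and `load_from_disk(...)` only afterwards, should show whether the freeze happens while writing files or while loading them. Changing `root` between `/root/...` and `/workspace/...` would also compare the container disk with the volume.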
riverfog7, 2w ago
Does downloading with huggingface-cli work? I just tried it: downloading to /root/ (container disk) works, but downloading to /workspace (volume or network volume) freezes. Memory usage looks fine on my end, though, and it doesn't cause any memory problems. Also, the RAM utilization reported by RunPod is 0, which isn't reasonable, since free -h says 22 GiB used.
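Since the dashboard and free -h disagree, it might help to look at the raw numbers in /proc/meminfo while a download or rsync is running. The sketch below is a small stdlib-only helper (the function names are made up for this example). It splits out page cache and dirty pages, on the assumption, not confirmed anywhere in this thread, that file writes to the volume are what fills memory.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def summarize(fields: dict) -> dict:
    """Convert the fields of interest from kB to GiB."""
    kb_per_gib = 1024 * 1024
    cache_kb = fields.get("Cached", 0) + fields.get("Buffers", 0)
    return {
        "total_gib": fields["MemTotal"] / kb_per_gib,
        "available_gib": fields.get("MemAvailable", 0) / kb_per_gib,
        "cache_gib": cache_kb / kb_per_gib,   # page cache plus buffers
        "dirty_gib": fields.get("Dirty", 0) / kb_per_gib,  # not yet written to disk
    }


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and summarize the live values (Linux only)."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Printing `read_meminfo()` every few seconds during the transfer would show whether `cache_gib` or `dirty_gib` climbs toward the limit while process memory stays low. Note that inside a container /proc/meminfo can reflect the host rather than the pod's own limit, so treat the totals with some caution.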
