My pod has randomly crashed several times today, and I received emails about Runpod issues.
Today, my pod has crashed a few times, to the point where I'm receiving emails from Runpod about the issues. How can I fix this?
Could you provide some more information?
Maybe some logs or a screenshot of your pod tab would help.
Here's a snapshot of the audit logs, where I'm stopping and starting pods that have disconnected in the middle of processes.
@rethinkstudios#001 I deleted your post because it leaked your email. Are you running Comfy from the web terminal?
Thank you! Didn't know that would be an issue. Yes, I'm launching Comfy either through Jupyter's terminal or the native terminal
and in the middle of creating, I'll get a connection closed, and several hours of work will crash. SUPER frustrating.
What can we do to keep a solid connection?
use normal ssh and run process in tmux
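A minimal sketch of that workflow, assuming tmux is installed on the pod (`apt install tmux` if it is not). The helper name `run_in_tmux` and the session name are illustrative, not something from this thread. The point is that a process started inside tmux keeps running when the SSH connection drops, so you can reattach instead of losing the work.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command in a detached tmux session,
# unless a session with that name already exists.
run_in_tmux() {
  local session="$1"; shift
  if tmux has-session -t "$session" 2>/dev/null; then
    echo "session '$session' already running"
  else
    # "$*" joins the remaining arguments into one shell command for tmux.
    tmux new-session -d -s "$session" "$*" && echo "started session '$session'"
  fi
}
```

For example, `run_in_tmux comfy 'cd /workspace/ComfyUI && python main.py'` (the path is an assumption about where ComfyUI is installed), then `tmux attach -t comfy` to see the output. Press Ctrl-b then d to detach without stopping the process.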
Are there tutorials available on how to set that up?
And what's the difference in the user experience?
Usually you want to set up SSH keys on your machine and add the public key to the RunPod settings page.
Something like this??
RunPod Blog: How to Configure Basic Terminal Access on RunPod
Or more like this? https://www.youtube.com/watch?v=_qjd6UAHaRg
TreeCityWes (YouTube): How To Setup SSH Public/Private Key Pair for Vast.ai and Runpod. Se...
Using WSL or Windows Subsystem for Linux to Setup SSH Public/Private Key Pair for Vast.ai and Runpod. Secure your XenBlocks Cloud Miner!
SSH Key Pair Guide: https://github.com/TreeCityWes/VastSSHKeyPair/blob/main/VastSSHKeyPair.md
Yes, what he did is the same thing: generating a public key using WSL, adding it to the platform, then connecting with SSH.
But you don't need to use WSL, as Windows has a built-in SSH client.
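The key setup described above can be sketched roughly as follows. It assumes an OpenSSH client with `ssh-keygen` on your local machine (Windows' built-in OpenSSH client ships the same tools). The helper name `ensure_ssh_key`, the key path, and the `runpod` comment are illustrative choices, not anything RunPod requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create an ed25519 key pair if it does not exist yet,
# then print the public key (the text to paste into the RunPod settings page).
ensure_ssh_key() {
  local key_path="$1"
  mkdir -p "$(dirname "$key_path")"
  if [ ! -f "$key_path" ]; then
    # -q: quiet, -N "": empty passphrase (omit -N to be prompted for one).
    ssh-keygen -q -t ed25519 -f "$key_path" -N "" -C "runpod"
  fi
  cat "${key_path}.pub"
}

# Example call (run on your own machine, not on the pod):
#   ensure_ssh_key "$HOME/.ssh/id_ed25519"
# Then connect using the address and port shown for your pod (placeholders):
#   ssh root@<pod-ip> -p <port> -i "$HOME/.ssh/id_ed25519"
```

The helper never overwrites an existing key, so it is safe to run more than once.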
Hey guys... My pod just disconnected again, and I was using SSH via Terminus..
Your pod ran out of system memory (RAM, not VRAM) and the Linux kernel killed off the process. Your pod was not disconnected. Try using the filter at the top of the page to ensure that your pod gets more system memory assigned to it.
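One way to check this yourself: inside a container, `free` typically reports the host's memory rather than the pod's own limit, so the cgroup files are usually a better guide. A minimal sketch, assuming a Linux pod with cgroup v1 or v2 mounted at the usual `/sys/fs/cgroup` path; `mem_report` is an illustrative name.

```shell
#!/usr/bin/env bash
# Print the container's memory limit and current usage from the cgroup files.
# Values are in bytes; a cgroup v2 limit of "max" means no limit is set.
mem_report() {
  local limit="unknown" usage="unknown"
  if [ -r /sys/fs/cgroup/memory.max ]; then
    # cgroup v2 layout
    limit="$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo unknown)"
    usage="$(cat /sys/fs/cgroup/memory.current 2>/dev/null || echo unknown)"
  elif [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    # cgroup v1 layout
    limit="$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo unknown)"
    usage="$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes 2>/dev/null || echo unknown)"
  fi
  echo "limit=${limit} usage=${usage}"
}

mem_report
```

Running `mem_report` periodically while a workflow is generating shows whether usage is creeping toward the limit before the process gets killed.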
Which template is this, by the way? You can load tcmalloc to try to improve memory management. That's what A1111 and Forge do, because they ran out of memory when switching out models too frequently.
Hmm, are you sure? The template is PyTorch 2.0.1 on an A40 with 48G of RAM, 48G of VRAM
Yes, that is exactly what your error means, so obviously I am sure, Google it
Why do you come here asking for help if you know better than everyone here?
And the PyTorch template does not include libtcmalloc, so install it and load it as I suggested.
Without libtcmalloc, Stable Diffusion eventually runs out of memory.
I didn't mean any disrespect, and genuinely asked. 🙂
This is what you mean, yes? https://github.com/comfyanonymous/ComfyUI/issues/1462
GitHub: Possible memory leak with lora usage · Issue #1462 · comfyanonymous...
I use comfy as a backend in my app, and especially after using many loras, the CPU RAM usage gradually climbs. The weird part is that the RAM usage exceeds the total size of all models/loras/vaes etc.
And would the command to install be:
pip install libtcmalloc-minimal4
TCMALLOC="$(ldconfig -p | grep -Po "libtcmalloc\.so\.\d" | head -n 1)"
export LD_PRELOAD="${TCMALLOC}"
Yep hep
Try it
pip install libtcmalloc-minimal4
ERROR: Could not find a version that satisfies the requirement libtcmalloc-minimal4 (from versions: none)
ERROR: No matching distribution found for libtcmalloc-minimal4
hmm, would I need to specify a version?? from this list? https://launchpad.net/ubuntu/focal/+package/libtcmalloc-minimal4
Launchpad: libtcmalloc-minimal4 : Focal (20.04) : Ubuntu
The gperftools, previously called google-perftools, package contains some utilities to improve and analyze the performance of C++ programs. This is a part of that package, and includes an optimized thread-caching malloc.
Try looking at other scripts, like the setup scripts in the RunPod workers or RunPod templates.
There should be some examples of a working tcmalloc install.
I'm not on my pc right now so can't help much sorry
You install it with apt not with pip.
Trying this today and will let you know how it goes.. Fingers crossed!
Solution
@rethinkstudios#001
apt-get install google-perftools
This is the correct way to install tcmalloc.
oof
Hmm..
Got this error when trying to install:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package google-perftools
Run apt update first.
Don't install that, install libtcmalloc-minimal4
And the google one is called libgoogle-perftools4
So the final command would look like this??
apt update && apt -y install libtcmalloc-minimal4 libgoogle-perftools4
TCMALLOC="$(ldconfig -p | grep -Po "libtcmalloc\.so\.\d" | head -n 1)"
export LD_PRELOAD="${TCMALLOC}"
You don't need to mess with the environment variable. A1111 handles it for you as long as it's installed.
OK. Was using it for Comfy.
Oh sorry my bad, then yeah do that
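For ComfyUI, then, the preload has to be set by hand. A slightly more defensive sketch of the same idea, assuming the apt packages above are already installed: it accepts either `libtcmalloc.so.N` or `libtcmalloc_minimal.so.N` and only sets `LD_PRELOAD` when a library was actually found. The function name `pick_tcmalloc` is illustrative.

```shell
#!/usr/bin/env bash
# Pick the first tcmalloc library name from `ldconfig -p`-style text on stdin.
# Matches both libtcmalloc.so.N and libtcmalloc_minimal.so.N.
pick_tcmalloc() {
  grep -Eo 'libtcmalloc(_minimal)?\.so\.[0-9]+' | head -n 1
}

TCMALLOC="$(ldconfig -p 2>/dev/null | pick_tcmalloc)"
if [ -n "$TCMALLOC" ]; then
  # Exported so that processes started from this shell load tcmalloc.
  export LD_PRELOAD="$TCMALLOC"
  echo "LD_PRELOAD set to $TCMALLOC"
else
  echo "tcmalloc not found; install it with apt first" >&2
fi
```

Because `export` only affects the current shell and its children, these lines need to be sourced or typed in the same shell (or tmux session) that then launches ComfyUI, not run as a separate script.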