Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
I've been using RunPod endpoints for the past month or so with no issues; everything has been working wonderfully. Over the past week, a handful of my jobs have been failing and I'm not entirely sure why. I haven't made any code changes or changed the Docker image my template uses. I do notice that it seems to be waiting for GPUs to become available, but I'm not sure why; when it finds them, this error is thrown.
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
This is a code issue, what is the endpoint you are using?
Like my actual endpoint?
I've been calling /run to run my jobs and get back a job ID, and then webhook back to my server once done @ashleyk
Yes, the actual endpoint.
This makes no difference, there is something wrong with the code that your endpoint is running.
No good, I need to know the code you are running.
Oh, sorry, I'm not sure what you're looking for. The Docker image tag?
I think when I set up my endpoint and got everything working, it was around this tag of yours: ashleykza/runpod-worker-a1111:2.2.1
Yeah, did you build your own image or use someone else's image?
Then all I've done is add LoRAs and models to my network volume. Everything was working great.
I used yours but built mine on DigitalOcean, tagged it, and have just been using that one.
I suggest updating to
ashleykza/runpod-worker-a1111:2.3.4
Ok
so just change my template to that?
When did you build your own image?
Nov 21, 2023 at 10:32 pm
Sorry I am not reading the messages fast enough, I thought you were on
ashleykza/runpod-worker-a1111:2.2.1
, so if you built it around that time, it's probably a good idea to pull the changes and build it again to get the updates.
No need to apologize, you've been very helpful and I appreciate your guidance.
Do I just use your tag in my template, or do I go back through the process of building on DigitalOcean, tagging it to my Docker Hub, and using that?
There isn't really a need to build your own image unless you customized something such as adding an additional extension or something like that, otherwise you can just use mine.
Ok, no, I didn't make any changes. The only changes I've made have been to my network volume.
I will try the 2.3.4 tag of yours
Seems to be working much better! Do your tagged versions ever get removed? Also, do you have a release log or somewhere I can check to keep up with changes?
Only inactive versions that haven't been used in a month or more, not active ones that are being pulled by workers. The release log is here:
https://github.com/ashleykleynhans/runpod-worker-a1111/releases
Hello!
I'm making a Serverless NLLB Endpoint on RunPod.
I have built and pulled a Docker image that works perfectly well locally, but when I deploy it on RunPod, it fails with the same error (Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select))
Does anybody have some ideas about what's going wrong?
Usually I see this error when the checkpoint file is corrupted
Sorry, but I don't have any checkpoint file in my project... or I don't know about it 🙂 I'm a newbie in this field..
What is NLLB? Doesn't it use some kind of model?
Yes, I use facebook/nllb-200-distilled-600M from huggingface
This is what @Papa Madiator means by "checkpoint"
Ok, then what's the difference between running the model locally and serverless (on RunPod)? It works locally...
Yup, because on RunPod you use RunPod GPUs instead of your local one
I understand that, of course 🙂 But what does this change? I have a device choice in the code, so the GPU is used when possible.
What do you mean by "what does this change"?
I mean, what's the difference if I run exactly the same code on CPU or GPU? Why is it OK on CPU, but not OK on GPU?
Sorry, if I'm writing something strange 🙂
It's fine, I just needed to clear things up so maybe I can give you a better answer
Hmm, what's not working with the GPU?
Some code is specifically designed to run on the CPU, the GPU, or both, but it needs access to the hardware to do so
Maybe if it works on CPU and not GPU, it doesn't have GPU support yet
It's working in another environment (Beam, if you've ever heard of it), so the problem is that it doesn't work on RunPod... This conversation has given me some interesting debugging ideas, so I'll try them and report back :))
What's working in Beam, sorry?
I don't know what type of software or model you are running
I'm simply running the facebook/nllb-200-distilled-600M model from Hugging Face. That's all)
Hmm, okay, then what's the error on RunPod?
Running it how? oobabooga or something else?
It's in the name of the thread: "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!"
Serverless running
using hf transformers?
which library did you use to run it?
Might be a CUDA driver version or template problem. Which Docker image are you using?
This error is caused by the model and the inputs not being moved to the same device before calling the model. To resolve, move both the model and inputs to the same device before running the model:
Found this on Stack Overflow.
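In your case it would look roughly like this (a minimal sketch assuming a Hugging Face transformers seq2seq model; the variable names are placeholders, not the actual handler code):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Pick the device once: GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Move the model to the chosen device...
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)

# ...and move the tokenized inputs to the *same* device before generating
inputs = tokenizer("Hello world", return_tensors="pt").to(device)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))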
I'm building the Docker image myself, and I check the device (DEVICE: str = 'cuda' if torch.cuda.is_available() else 'cpu')
What's your code to run it? (like in the handler)
that makes that error
What makes the error is exactly my question 🙂 I have just found out that the problem is in a function that directly uses the transformers library. I can post its code, if that's ok here)
Well, maybe your code, or mostly the library you use
Yeah, that's what I asked, sure
Here is the function:
def translate(self, collection, tgt_lang):
    # Clean up the inputs and split them into roughly BATCH_SIZE-sized batches
    texts = [t.strip() for t in collection]
    batch_count = max(1, len(texts) // BATCH_SIZE)
    texts_split = np.array_split(np.array(texts), batch_count)
    result = []
    with torch.inference_mode():
        for batch in texts_split:
            # Tokenize the batch and move the input tensors to the target device
            encoded = self.tokenizer.batch_encode_plus(
                list(batch),
                max_length=self.max_length,
                truncation=True,
                return_tensors="pt",
                padding=True,
            )
            for key in encoded.keys():
                encoded[key] = encoded[key].to(self.device)
            # Generate the translation, forcing the target language as the first token
            translated = self.model.generate(
                **encoded,
                max_length=self.max_length,
                forced_bos_token_id=self.tokenizer.lang_code_to_id[tgt_lang],
            )
            decoded = self.tokenizer.batch_decode(translated, skip_special_tokens=True)
            result.extend(decoded)
    # Return the original (empty) string for empty inputs, the translation otherwise
    return [tr if len(tx) > 0 else tx for tr, tx in zip(result, texts)]
This function is a method of my model class
What is torch.inference_mode() for, btw?
try using .to(self.device) on your self.model
or the self.tokenizer
You can read about inference mode here: https://pytorch.org/cppdocs/notes/inference_mode.html 🙂 It's complicated to explain in two words)
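In short: it's a context manager, like torch.no_grad() but stricter, that turns off autograd tracking so pure inference runs a bit faster and uses less memory. A tiny example:

import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

with torch.inference_mode():
    y = model(x)           # no gradient tracking inside this block

print(y.requires_grad)     # False: tensors created here can't take part in backprop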
I'll try! Thanks!
right thanks
I have changed the line
...
translated = self.model.to(self.device).generate( ...
...
added '.to(self.device)', as you advised. All other lines are without .to(self.device)
And it's working now!!! ☺️ Thanks a lot!!
And if I write
encoded[key] = encoded[key]
without .to(self.device), it also works! (which is logical)
Yep
Hope this device points to the GPU
If not, then it will use the CPU 😂
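One small follow-up note: rather than calling .to(self.device) inside generate() on every request, the model can be moved to the device once, at load time. A minimal sketch of that idea (assuming the class loads the model in __init__; names like Translator and max_length here are just placeholders, not the actual project code):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class Translator:
    def __init__(self, model_name="facebook/nllb-200-distilled-600M", max_length=512):
        # Resolve the device once for both the model and all future inputs
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Move the model to the target device at load time, so every call to
        # generate() finds the weights already on the GPU (or CPU)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(self.device)
        self.model.eval()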