Serverless endpoint: jobs always show 1 in queue, even with 3 workers running.
After 600 s there is still 1 job queued, and the log shows nothing. How do I see what is running? This morning, when I was using a GPU pod, I got an error saying ip_adapter was not found, but now I can't see any output. My local project does have an ip_adapter.
"id": "e7ea07cf-7c78-4bab-bc59-0c22e3e26cc9-e1",
Click on one of your workers to see what they are doing.
Probably a problem with your code.
You shouldn't move to serverless if it's not even working in GPU cloud.
It is much easier to debug things in GPU cloud than in serverless.
2024-03-10T08:54:43.082245784Z Traceback (most recent call last):
2024-03-10T08:54:43.082314323Z File "/src/handler.py", line 16, in <module>
2024-03-10T08:54:43.082319933Z from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline, draw_kps
2024-03-10T08:54:43.082325027Z File "/src/pipeline_stable_diffusion_xl_instantid.py", line 42, in <module>
2024-03-10T08:54:43.082329870Z from ip_adapter.resampler import Resampler
2024-03-10T08:54:43.082333984Z ModuleNotFoundError: No module named 'ip_adapter'
Yeah, you have to fix that. I would scale the workers down to zero; your code is broken, and serverless can't fix it for you.
But in my local project, ip_adapter is in checkpoints. I tested it and it works; it generates an image.
I will check my code again.
It looks like the ip_adapter module is not installed.
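The traceback above fails on `from ip_adapter.resampler import Resampler`, which usually means the InstantID repo's `ip_adapter/` directory exists locally but was never copied into the Docker image, or isn't on `sys.path` inside the container. A minimal sketch of a path guard; `INSTANTID_DIR` and the helper name are illustrative assumptions, adjust to your own image layout:

```python
import os
import sys

# Hypothetical path: wherever the InstantID checkout (containing the
# ip_adapter/ directory) was COPY'd in your Dockerfile. Adjust to your layout.
INSTANTID_DIR = "/src/InstantID"

def ensure_ip_adapter_on_path(repo_dir):
    """Prepend repo_dir to sys.path if it holds an ip_adapter package
    and isn't already on the path. Returns True if it was added."""
    pkg = os.path.join(repo_dir, "ip_adapter")
    if os.path.isdir(pkg) and repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
        return True
    return False

ensure_ip_adapter_on_path(INSTANTID_DIR)
# After this, `from ip_adapter.resampler import Resampler` can resolve,
# provided the directory was actually copied into the image.
```

Alternatively, copy `ip_adapter/` next to `handler.py` in the Dockerfile so no path manipulation is needed at all.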
It looks like you're using InstantID, I have this code that works:
https://github.com/ashleykleynhans/runpod-worker-instantid
I have downloaded ip_adapter in my InstantID project and use Docker to build the image.
Thanks, I will look at the InstantID link.
Thank you very much, I used your worker-instantid to generate an image successfully. But I found a little problem: the model in the request in api/generate.md is wrong. "wangqixun/YamerMIX_v8" is correct, not "lwangqixun/YamerMIX_v8".
Thanks, the typo has been fixed.
I am learning from your runpod-worker-instantid, and now InstantID can use MultiControlNet models. Should I use download_checkpoints.py to download those models, just like get_instantid_pipeline('wangqixun/YamerMIX_v8'), or is it OK to do it in rp_handler.py? Which is better?
Better to download the models into the image; otherwise they will probably get downloaded by your worker at request time.
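The idea above, baking weights into the image at build time rather than pulling them in the worker, can be sketched like this. This is a hedged sketch in the spirit of the repo's download_checkpoints.py, not its actual API: `fetch_models` and the repo list are illustrative names; `huggingface_hub.snapshot_download` is a real function.

```python
# Sketch of a build-time model fetcher. fetch_models is an illustrative
# name, not the repo's actual API.
def fetch_models(repo_ids, cache_dir, downloader=None):
    """Download each Hugging Face repo into cache_dir (run during docker build)."""
    if downloader is None:
        # huggingface_hub.snapshot_download is the real API; imported lazily.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    return [downloader(repo_id=r, cache_dir=cache_dir) for r in repo_ids]

# Invoked from the Dockerfile, e.g.:  RUN python download_checkpoints.py
# so the weights land in an image layer (local NVMe at runtime) instead of
# being pulled on a cold worker's first request.
```

The `downloader` parameter is just dependency injection so the function can be exercised without hitting the network; in the image build you would call it with the defaults.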
Why can 'wangqixun/YamerMIX_v8' be downloaded to a network volume with download_checkpoints.py, but the MultiControlNet models, like the pose model and depth model, should go into the image? What is the difference? Or am I misunderstanding something?
I'm new to RunPod, please forgive my basic questions. When using serverless endpoints, comparing downloading the MultiControlNet models, which are about 25 GB in size, into the Docker image vs. onto a network volume, which option is more efficient?
@ashleykrunpod I can't run the docker command, and I can't install Docker.
The container image has better throughput than network storage, since it's on local NVMe disk.
"output":
Traceback (most recent call last):
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 313, in handler
    images = generate_image(
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 282, in generate_image
    PIPELINE = get_instantid_pipeline(model)
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 129, in get_instantid_pipeline
    pipe.load_ip_adapter_instantid(face_adapter)
  File "/workspace/runpod-worker-instantid/src/pipeline_stable_diffusion_xl_instantid.py", line 159, in load_ip_adapter_instantid
    self.set_image_proj_model(model_ckpt, image_emb_dim, num_tokens)
  File "/workspace/runpod-worker-instantid/src/pipeline_stable_diffusion_xl_instantid.py", line 181, in set_image_proj_model
    self.image_proj_model.load_state_dict(state_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Resampler:
    size mismatch for proj_out.weight: copying a param with shape torch.Size([2048, 1280]) from checkpoint, the shape in current model is torch.Size([768, 1280]).
    size mismatch for proj_out.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for norm_out.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for norm_out.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
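A size mismatch like this (2048 in the checkpoint vs. 768 in the constructed model) usually means the `Resampler` was built with different projection dimensions than the adapter checkpoint was trained for, for example the wrong ip-adapter file or mismatched `image_emb_dim`/`num_tokens` arguments. A quick way to list every offending parameter before calling `load_state_dict` is a pre-flight shape comparison; this helper is a sketch, not part of the repo:

```python
# Sketch of a pre-flight check before load_state_dict(); works with any
# mapping of name -> tensor-like objects exposing a .shape attribute.
def find_shape_mismatches(model_state, ckpt_state):
    """Return {param_name: (model_shape, checkpoint_shape)} where shapes differ."""
    return {
        name: (tuple(model_state[name].shape), tuple(tensor.shape))
        for name, tensor in ckpt_state.items()
        if name in model_state
        and tuple(model_state[name].shape) != tuple(tensor.shape)
    }

# Usage (hypothetical names from the traceback's context):
#   mismatches = find_shape_mismatches(
#       pipe.image_proj_model.state_dict(), torch.load(face_adapter))
#   if mismatches: raise ValueError(f"adapter/model dims disagree: {mismatches}")
```

If the mismatched dimensions match a different known adapter variant, that is a strong hint the wrong checkpoint file is being pointed at.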