Serverless endpoint: jobs always show 1 in queue, even with 3 workers running.
After 600 s there is still 1 job queued, and the log shows nothing. How do I see what is running? This morning, when I was using a GPU pod, I got an error saying ip_adapter was not found, but now I can't see any output. My local project does have an ip_adapter.
"id": "e7ea07cf-7c78-4bab-bc59-0c22e3e26cc9-e1",
Click on one of your workers to see what they are doing.
Probably a problem with your code.
You shouldn't move to serverless if it's not even working in GPU cloud.
It is much easier to debug things in GPU cloud than in serverless.
2024-03-10T08:54:43.082245784Z Traceback (most recent call last):
2024-03-10T08:54:43.082314323Z File "/src/handler.py", line 16, in <module>
2024-03-10T08:54:43.082319933Z from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline, draw_kps
2024-03-10T08:54:43.082325027Z File "/src/pipeline_stable_diffusion_xl_instantid.py", line 42, in <module>
2024-03-10T08:54:43.082329870Z from ip_adapter.resampler import Resampler
2024-03-10T08:54:43.082333984Z ModuleNotFoundError: No module named 'ip_adapter'
Yeah, you have to fix that. I would scale the workers down to zero; your code is broken, and serverless can't fix it for you.
But in my local project, ip_adapter is in checkpoints. I tested it and it works; it generates an image.
I will check my code again.
It looks like the ip_adapter module is not installed.
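The traceback above fails on `from ip_adapter.resampler import Resampler`, which usually means the InstantID repo's `ip_adapter/` directory exists locally but was never copied into the Docker image, or isn't on `sys.path` inside the container. A minimal sketch of a path guard; `INSTANTID_DIR` and the helper name are illustrative assumptions, adjust to your own image layout:

```python
import os
import sys

# Hypothetical path: wherever the InstantID checkout (containing the
# ip_adapter/ directory) was COPY'd in your Dockerfile. Adjust to your layout.
INSTANTID_DIR = "/src/InstantID"

def ensure_ip_adapter_on_path(repo_dir):
    """Prepend repo_dir to sys.path if it holds an ip_adapter package
    and isn't already on the path. Returns True if it was added."""
    pkg = os.path.join(repo_dir, "ip_adapter")
    if os.path.isdir(pkg) and repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
        return True
    return False

ensure_ip_adapter_on_path(INSTANTID_DIR)
# After this, `from ip_adapter.resampler import Resampler` can resolve,
# provided the directory was actually copied into the image.
```

Alternatively, copy `ip_adapter/` next to `handler.py` in the Dockerfile so no path manipulation is needed at all.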
It looks like you're using InstantID, I have this code that works:
https://github.com/ashleykleynhans/runpod-worker-instantid
I have downloaded ip_adapter in my InstantID project and use Docker to build the image.
Thanks, I will look at the InstantID link.
Thank you very much, I used your worker-instantid to generate an image successfully. But I found a little problem: the model in the request in api/generate.md is wrong. "wangqixun/YamerMIX_v8" is correct, not "lwangqixun/YamerMIX_v8".
Thanks, the typo has been fixed.
I am learning from your runpod-worker-instantid, and now InstantID can use MultiControlNet models. Should I use download_checkpoints.py to download those models, just like get_instantid_pipeline('wangqixun/YamerMIX_v8'), or is it OK to do it in rp_handler.py? Which is better?
Better to download the models into the image; otherwise they will probably get downloaded by your worker at request time.
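The idea above, baking weights into the image at build time rather than pulling them in the worker, can be sketched like this. This is a hedged sketch in the spirit of the repo's download_checkpoints.py, not its actual API: `fetch_models` and the repo list are illustrative names; `huggingface_hub.snapshot_download` is a real function.

```python
# Sketch of a build-time model fetcher. fetch_models is an illustrative
# name, not the repo's actual API.
def fetch_models(repo_ids, cache_dir, downloader=None):
    """Download each Hugging Face repo into cache_dir (run during docker build)."""
    if downloader is None:
        # huggingface_hub.snapshot_download is the real API; imported lazily.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    return [downloader(repo_id=r, cache_dir=cache_dir) for r in repo_ids]

# Invoked from the Dockerfile, e.g.:  RUN python download_checkpoints.py
# so the weights land in an image layer (local NVMe at runtime) instead of
# being pulled on a cold worker's first request.
```

The `downloader` parameter is just dependency injection so the function can be exercised without hitting the network; in the image build you would call it with the defaults.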
Why can 'wangqixun/YamerMIX_v8' be downloaded to a network volume with download_checkpoints.py, but the MultiControlNet models, like the pose model and depth model, should go into the image? What is the difference? Or am I misunderstanding something?
I'm new to RunPod, please forgive my basic questions. When using serverless endpoints, comparing downloading the MultiControlNet models, which are about 25 GB in size, into the Docker image vs. onto a network volume, which option is more efficient?
@ashleykrunpod I can't run the docker command, and I can't install Docker.
The container image has better throughput than network storage, since it's on local NVMe disk.
"output":
Traceback (most recent call last):
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 313, in handler
    images = generate_image(
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 282, in generate_image
    PIPELINE = get_instantid_pipeline(model)
  File "/workspace/runpod-worker-instantid/src/rp_handler.py", line 129, in get_instantid_pipeline
    pipe.load_ip_adapter_instantid(face_adapter)
  File "/workspace/runpod-worker-instantid/src/pipeline_stable_diffusion_xl_instantid.py", line 159, in load_ip_adapter_instantid
    self.set_image_proj_model(model_ckpt, image_emb_dim, num_tokens)
  File "/workspace/runpod-worker-instantid/src/pipeline_stable_diffusion_xl_instantid.py", line 181, in set_image_proj_model
    self.image_proj_model.load_state_dict(state_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Resampler:
    size mismatch for proj_out.weight: copying a param with shape torch.Size([2048, 1280]) from checkpoint, the shape in current model is torch.Size([768, 1280]).
    size mismatch for proj_out.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for norm_out.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
    size mismatch for norm_out.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([768]).
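A size mismatch like this (2048 in the checkpoint vs. 768 in the constructed model) usually means the `Resampler` was built with different projection dimensions than the adapter checkpoint was trained for, for example the wrong ip-adapter file or mismatched `image_emb_dim`/`num_tokens` arguments. A quick way to list every offending parameter before calling `load_state_dict` is a pre-flight shape comparison; this helper is a sketch, not part of the repo:

```python
# Sketch of a pre-flight check before load_state_dict(); works with any
# mapping of name -> tensor-like objects exposing a .shape attribute.
def find_shape_mismatches(model_state, ckpt_state):
    """Return {param_name: (model_shape, checkpoint_shape)} where shapes differ."""
    return {
        name: (tuple(model_state[name].shape), tuple(tensor.shape))
        for name, tensor in ckpt_state.items()
        if name in model_state
        and tuple(model_state[name].shape) != tuple(tensor.shape)
    }

# Usage (hypothetical names from the traceback's context):
#   mismatches = find_shape_mismatches(
#       pipe.image_proj_model.state_dict(), torch.load(face_adapter))
#   if mismatches: raise ValueError(f"adapter/model dims disagree: {mismatches}")
```

If the mismatched dimensions match a different known adapter variant, that is a strong hint the wrong checkpoint file is being pointed at.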