RunPod•3mo ago
bart

Asynchronous serverless endpoint failing with 400 Bad Request

I'm getting the following error when my serverless endpoint tries to return its output object: "Failed to return job results. | 400, message='Bad Request', url='https://api.runpod.ai/v2/ne9y7bgqrpzcu6/job-done/asvftiq7ad2xzj/30238db1-1d48-4a80-8c5e-86f69acf3642-e1?gpu=$RUNPOD_GPU_TYPE_ID&isStream=false'" The payload is small, only a KiB or so. What other causes could there be for this "Bad Request", presumably raised by RunPod's Python library?
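For reference, the output the worker sends back is simply the return value of the serverless handler, so besides size limits, a payload that isn't cleanly JSON-serializable can also make the job-done callback reject it. A minimal sketch of that handler pattern (the handler name and output fields here are illustrative, not bart's actual code):

import runpod

def handler(job):
    # job["input"] holds whatever was sent in the request's "input" field
    prompt = job["input"].get("prompt", "")

    # Whatever this returns is what gets posted back to the job-done URL,
    # so it must be JSON-serializable (no numpy arrays, raw bytes, etc.)
    return {"status": "ok", "echo": prompt}

runpod.serverless.start({"handler": handler})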
20 Replies
bart
bartOP•3mo ago
I'm also getting 404s for my sync endpoint: Failed to return job results. | 404, message='Not Found', url='https://api.runpod.ai/v2...' It is also getting "retried":
2024-10-18T10:41:19.802111336Z {"requestId": "e6f8c0d0-a884-4ced-8350-7d6929744ce9-e1", "message": "Finished.", "level": "INFO"}
2024-10-18T10:41:20.022622726Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-18T10:41:20.022797552Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-18T10:41:20.022804705Z {"requestId": "e6f8c0d0-a884-4ced-8350-7d6929744ce9-e1", "message": "Started.", "level": "INFO"}
Super weird
Poddy
Poddy•3mo ago
@bart
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive
nerdylive•3mo ago
I think there might be a problem with Runpod's connection
srimanthd
srimanthd•3mo ago
I was facing the same issue and solved it. It was the RunPod Whisper template endpoint's schema validation failing. I was passing an extra property in the input, and it failed because of that.
xnorcode
xnorcode•3mo ago
I have the same issue. @srimanthd what do you mean by an extra property? Can you share an example?
srimanthd
srimanthd•3mo ago
@xnorcode are you using the Whisper endpoint too?
xnorcode
xnorcode•3mo ago
I am running Flux, not Whisper, and I don't have any input schema validations. My endpoint runs perfectly, but twice: it loads the model, generates and uploads the image, and completes the job. And then it tries to execute it again.
srimanthd
srimanthd•3mo ago
Ah, might be unrelated then. https://github.com/runpod-workers/worker-faster_whisper/blob/main/src/rp_schema.py This file has the allowed inputs. I was passing an additional input prop called "type": "speech_to_text" and then I got the error.
GitHub
worker-faster_whisper/src/rp_schema.py at main · runpod-workers/wor...
🎧 | RunPod worker of the faster-whisper model for Serverless Endpoint. - runpod-workers/worker-faster_whisper
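For anyone hitting the same thing: the Whisper worker validates the incoming input against that schema, and a single unexpected key is enough to fail the job. A rough sketch of what that looks like with the SDK's validator (the schema below is only illustrative, not the real rp_schema.py):

from runpod.serverless.utils.rp_validator import validate

# Illustrative schema in the style of rp_schema.py, not the actual file
INPUT_VALIDATIONS = {
    "audio": {"type": str, "required": True},
    "model": {"type": str, "required": False, "default": "base"},
}

job_input = {
    "audio": "https://example.com/sample.mp3",
    "type": "speech_to_text",  # extra property the schema doesn't allow
}

result = validate(job_input, INPUT_VALIDATIONS)
if "errors" in result:
    # The unexpected "type" key is reported here and the job errors out
    print(result["errors"])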
xnorcode
xnorcode•3mo ago
and on the second attempt it shows the above error log. Ah no, I'm not using that, I'm using a custom Flux Docker image. I have been getting this issue for 2-3 weeks now and have rewritten my code 4 times! Still can't figure out what's wrong.. 😢 I raised a support ticket now, hopefully we'll figure out the issue soon
AttilaF
AttilaF•3mo ago
Has this issue been solved? Is anyone using a JS script to reach the serverless endpoint? Can you please share code? I keep getting 404.
nerdylive
nerdylive•3mo ago
Maybe your script is outdated if it doesn't work. In the meantime you can try a library like axios, or just fetch, to hit the endpoint. Or maybe share your code.
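A 404 from api.runpod.ai usually means the path or endpoint ID is off rather than anything inside the worker. A quick sanity-check sketch in Python (the endpoint ID and input are placeholders; the same URL, Authorization header, and JSON body work from JS with fetch or axios):

import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder, copy it from the endpoint page
API_KEY = os.environ["RUNPOD_API_KEY"]

# /runsync waits for the result; /run queues the job and returns an id to poll via /status/<id>
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "hello"}},
    timeout=120,
)
print(resp.status_code, resp.json())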
xnorcode
xnorcode•2mo ago
Still having issues... so a little update on my case. After a lot of tests (writing and rewriting my code), I couldn't figure out what the issue was from my side. I have been in continuous contact with the support team, who suggested this was a known issue with the RunPod SDK version I was using and that changing it should solve the problem:

"Hi Andreas, The issue you're experiencing is a known bug in SDK 1.7.1. To resolve this, please update to SDK 1.7.3 or downgrade to 1.6.2, which should fix the retry problem. The root cause is that our system runs a health check during long tasks, and if the check isn't reported in time, the job is put back in the queue, causing a retry. Let me know if you have any questions or need further assistance. Best Regards"

When upgrading to the latest version, SDK 1.7.3, the worker container seems to crash (gets removed) once the model is loaded and it starts its inference steps. So this is another issue we're experiencing, also forwarded to the team; hopefully they'll find a fix soon.

When downgrading to SDK version 1.6.2 I now get another error that causes the worker to stop/fail: ValueError: Host '127.0.0.1:8188' cannot contain ':' (at position 9)

I can't do anything about the 1.7.3 version, so I'm waiting for the RunPod team. I'm currently trying to see if there's anything I can do from my side to get the 1.6.2 version working (although it also seems there's not much I can do). This is the error I get with 1.6.2:

Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_protocol.py", line 452, in _handle_request
    resp = await request_handler(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_app.py", line 512, in _handle
    match_info = await self._router.resolve(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_urldispatcher.py", line 1022, in resolve
    match_dict, allowed = await resource.resolve(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_urldispatcher.py", line 767, in resolve
    not request.url.raw_path.startswith(self._prefix2)
  File "aiohttp/_helpers.pyx", line 26, in aiohttp._helpers.reify.__get__
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_request.py", line 451, in url
    url = URL.build(scheme=self.scheme, host=self.host)
  File "/usr/local/lib/python3.10/dist-packages/yarl/_url.py", line 355, in build
    _host = _encode_host(host, validate_host=True)
  File "/usr/local/lib/python3.10/dist-packages/yarl/_url.py", line 1386, in _encode_host
    raise ValueError(
ValueError: Host '127.0.0.1:8188' cannot contain ':' (at position 9)
nerdylive
nerdylive•2mo ago
@yhlong00000
xnorcode
xnorcode•2mo ago
and here's the code that the error comes from. Once the Comfy server is up and running, the GET request below raises the above error:

HOSTNAME = "127.0.0.1"
PORT = 8188

url = f"http://{HOSTNAME}:{PORT}"
response = requests.get(url)

# If the response status code is 200, the server is up and running
if response.status_code == 200:
    utils.log(f"API: reachable!")
    return True

I've tried all variations of URL formatting.
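Reading the traceback above, the ValueError isn't raised by requests.get itself but inside the aiohttp server while it parses the incoming request (web_request.py rebuilds a yarl URL from the Host header), which looks like a mismatch between an older aiohttp and a newer yarl that validates hosts. A small sketch that reproduces the yarl behaviour, assuming a recent yarl version:

from yarl import URL

# Newer yarl versions validate the host, so a "host:port" string is rejected...
try:
    URL.build(scheme="http", host="127.0.0.1:8188")
except ValueError as err:
    print(err)  # Host '127.0.0.1:8188' cannot contain ':' (at position 9)

# ...while passing the port separately builds the same URL without complaint
print(URL.build(scheme="http", host="127.0.0.1", port=8188))  # http://127.0.0.1:8188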
nerdylive
nerdylive•2mo ago
Hmm, I'm thinking maybe run the pip command to update the requests version after installing the runpod package (search on Google). I think your error trace points at the requests or aiohttp library. And since you just downgraded runpod and it didn't work, it may be because the older runpod version pulls in older library dependencies. Idk, hi yhlong
yhlong00000
yhlong00000•2mo ago
If you have long-running jobs, SDK 1.7.3 has a bug that causes them to retry unexpectedly. In my testing, versions 1.7.2 and 1.6.2 don’t have this issue. I’m not sure why you’re encountering the ValueError with 1.6.2, but could you try using 1.7.2 to see if it resolves the retry problem?
from yarl import URL
import requests

HOSTNAME = "127.0.0.1"
PORT = 8188

# Construct the URL properly with separate host and port
url = URL.build(scheme="http", host=HOSTNAME, port=PORT)

response = requests.get(str(url))

# If the response status code is 200, the server is up and running
if response.status_code == 200:
    utils.log("API: reachable!")
    return True
xnorcode
xnorcode•2mo ago
Good suggestion, will try this now. OK thanks, will try version 1.7.2 as well. @yhlong00000 I completed testing with 1.7.2 and it seems to be working perfectly, without any retries. I just sent you an email with logs and more information for you to review. I'm now testing with the 1.6.2 version and will update you on that soon. @yhlong00000 RunPod SDK 1.6.2 is not working, even when updating requests after installing the runpod package as suggested above. I've emailed some more details about this test for you to review. I am upgrading to SDK 1.7.2, which seems to be working fine so far. Hopefully we'll get a new stable version soon. Thanks again for your prompt support!
yhlong00000
yhlong00000•2mo ago
Cool, thanks for testing it. Will let you know once we have a new version.
bart
bartOP•2mo ago
I am indeed using runpod==1.7.1 and will update it to 1.7.4 according to the GitHub advisory https://github.com/runpod/runpod-python/releases/tag/1.7.3 . If I experience any problems I will report back to this thread. Thanks all for the input and the swift responses, and hopefully a resolution!
GitHub
Release 1.7.3 · runpod/runpod-python
SDK 1.7.3 Advisory: Known Issues with Long-Running Jobs – Please Upgrade to 1.7.4 1.7.3: Long-running jobs (>60 seconds) can cause the system to stop the worker, triggering retries and failures....
bart
bartOP•2mo ago
1.7.4 seems to work well! No more jobs getting retried and the like. I'm not seeing my container logs after TensorFlow starts up, but that might be an issue on my end (e.g. not disabling Python output buffering). Thanks!
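If buffering is indeed the culprit, the usual fix is setting PYTHONUNBUFFERED=1 in the image (or running python -u); a Python-side equivalent, as a rough sketch, is to switch stdout/stderr to line buffering at worker startup:

import sys

# Flush each log line immediately so container logs keep appearing
# (equivalent to PYTHONUNBUFFERED=1 or `python -u` in the Dockerfile)
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)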