RunPod•4mo ago
bart

Asynchronous serverless endpoint failing with 400 Bad Request

I'm getting the following error when my serverless endpoint tries to return its output object: "Failed to return job results. | 400, message='Bad Request', url='https://api.runpod.ai/v2/ne9y7bgqrpzcu6/job-done/asvftiq7ad2xzj/30238db1-1d48-4a80-8c5e-86f69acf3642-e1?gpu=$RUNPOD_GPU_TYPE_ID&isStream=false'" The payload is small, only a KiB or so. What else could cause this "Bad Request", which is presumably issued by RunPod's Python library?
20 Replies
bart
bartOP•4mo ago
I'm also getting 404s for my sync endpoint: Failed to return job results. | 404, message='Not Found', url='https://api.runpod.ai/v2...' The job is also getting "retried":
2024-10-18T10:41:19.802111336Z {"requestId": "e6f8c0d0-a884-4ced-8350-7d6929744ce9-e1", "message": "Finished.", "level": "INFO"}
2024-10-18T10:41:20.022622726Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-18T10:41:20.022797552Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-18T10:41:20.022804705Z {"requestId": "e6f8c0d0-a884-4ced-8350-7d6929744ce9-e1", "message": "Started.", "level": "INFO"}
Super weird
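A retry like the one in the log above can be spotted automatically. The following is a minimal sketch, assuming worker log lines have the shape shown in this thread (a timestamp, a space, then a JSON object with "requestId" and "message"). The function name find_restarted_jobs is hypothetical and not part of the runpod SDK.

```python
import json


def find_restarted_jobs(log_lines):
    """Return request IDs that log "Started." after they already logged "Finished."."""
    finished = set()
    restarted = []
    for line in log_lines:
        # Each line is "<timestamp> <json payload>"; drop the timestamp.
        _, _, payload = line.partition(" ")
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            continue  # Not a structured log line; skip it.
        if not isinstance(entry, dict):
            continue
        request_id = entry.get("requestId")
        message = entry.get("message")
        if request_id is None:
            continue  # Queue status lines carry no request ID.
        if message == "Finished.":
            finished.add(request_id)
        elif message == "Started." and request_id in finished:
            if request_id not in restarted:
                restarted.append(request_id)
    return restarted
```

Feeding it the four log lines above should flag the one request ID that starts again after finishing.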
Poddy
Poddy•4mo ago
@bart
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive
nerdylive•4mo ago
I think there might be a problem with Runpod's connection
srimanthd
srimanthd•4mo ago
I was facing the same issue and solved it. The RunPod whisper template endpoint's schema validation was failing: I was passing an extra property in the inputs, and the request failed because of that.
xnorcode
xnorcode•4mo ago
I have the same issue. @srimanthd, what do you mean by an extra property? Can you share an example?
srimanthd
srimanthd•4mo ago
@xnorcode are you using the whisper endpoint too?
xnorcode
xnorcode•4mo ago
I am running flux, not whisper, and I don't have any input schema validation. My endpoint runs perfectly, but twice: it loads the model, generates and uploads the image, and completes the job. Then it tries to execute the job again.
srimanthd
srimanthd•4mo ago
Ah, might be unrelated then. https://github.com/runpod-workers/worker-faster_whisper/blob/main/src/rp_schema.py This file lists the allowed inputs. I was passing an additional input prop, "type": "speech_to_text", and then I got the error.
GitHub
worker-faster_whisper/src/rp_schema.py at main · runpod-workers/wor...
🎧 | RunPod worker of the faster-whisper model for Serverless Endpoint. - runpod-workers/worker-faster_whisper
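To check a payload for stray properties before sending it, something like the following sketch may help. The allowed key set here is a hypothetical placeholder, not the actual contents of rp_schema.py, and unexpected_keys is an illustrative helper rather than an SDK function.

```python
# Hypothetical allow-list; the real one lives in the worker's rp_schema.py.
ALLOWED_INPUT_KEYS = {"audio", "model"}


def unexpected_keys(job_input, allowed=ALLOWED_INPUT_KEYS):
    """Return the sorted input keys that the schema does not allow."""
    return sorted(set(job_input) - set(allowed))


# Mirrors the case described above: an extra "type" property in the input.
payload = {"audio": "https://example.com/clip.mp3", "type": "speech_to_text"}
extra = unexpected_keys(payload)
if extra:
    print(f"Remove these properties before submitting: {extra}")
```

Running the check on the client side gives a readable message instead of a failed job when an input property is not in the worker's schema.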
xnorcode
xnorcode•4mo ago
And on the second attempt it shows the above error log. Ah no, I'm not using that. I'm using a custom flux Docker image. I have been getting this issue for 2-3 weeks now... rewrote my code 4 times! Still can't figure out what's wrong.. 😢 I've raised a support ticket now, so hopefully I'll figure out what the issue is soon.
AttilaF
AttilaF•4mo ago
Has this issue been solved? Does anyone use a JS script to reach the serverless endpoint? Can you please share code? I keep getting 404.
nerdylive
nerdylive•4mo ago
Maybe it is outdated if it doesn't work. Meanwhile, you can try using a JS library like axios, or just fetch, to hit the endpoint. Or maybe share your code.
xnorcode
xnorcode•4mo ago
Still there's issues... So a little update on my case.

After a lot of tests (writing and rewriting my code), I couldn't figure out what the issue was from my side. I am in continuous talks with the support team, who suggested this was a known issue with the Runpod SDK version I was using and that I should change it to solve the issue:

Hi Andreas, The issue you’re experiencing is a known bug in SDK 1.7.1. To resolve this, please update to SDK 1.7.3 or downgrade to 1.6.2, which should fix the retry problem. The root cause is that our system runs a health check during long tasks, and if the check isn’t reported in time, the job is put back in the queue, causing a retry. Let me know if you have any questions or need further assistance. Best Regards,

When upgrading to the latest version, SDK 1.7.3, the worker container seems to crash (gets removed) once the model is loaded and my inference steps start. So this is another issue we're experiencing, which I also forwarded to the team; hopefully they'll find a fix soon.

When downgrading to SDK version 1.6.2, I now get another error that causes the worker to stop/fail: ValueError: Host '127.0.0.1:8188' cannot contain ':' (at position 9)

I can't do anything about the 1.7.3 version, so I'm waiting for the Runpod team. I'm currently trying to see if there's anything I can do from my side to get the 1.6.2 version working (though it seems there's not much I can do from my side either).
This is the error I get with 1.6.2:

Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_protocol.py", line 452, in _handle_request
    resp = await request_handler(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_app.py", line 512, in _handle
    match_info = await self._router.resolve(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_urldispatcher.py", line 1022, in resolve
    match_dict, allowed = await resource.resolve(request)
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_urldispatcher.py", line 767, in resolve
    not request.url.raw_path.startswith(self._prefix2)
  File "aiohttp/_helpers.pyx", line 26, in aiohttp._helpers.reify.get
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/web_request.py", line 451, in url
    url = URL.build(scheme=self.scheme, host=self.host)
  File "/usr/local/lib/python3.10/dist-packages/yarl/_url.py", line 355, in build
    _host = _encode_host(host, validate_host=True)
  File "/usr/local/lib/python3.10/dist-packages/yarl/_url.py", line 1386, in _encode_host
    raise ValueError(
ValueError: Host '127.0.0.1:8188' cannot contain ':' (at position 9)
nerdylive
nerdylive•4mo ago
@yhlong00000
xnorcode
xnorcode•4mo ago
And here's the code that I get the error from. Once the comfy server is up and running, the GET request below raises the above error:

HOSTNAME = "127.0.0.1"
PORT = 8188
url = f"http://{HOSTNAME}:{PORT}"
response = requests.get(url)
# If the response status code is 200, the server is up and running
if response.status_code == 200:
    utils.log(f"API: reachable!")
    return True

I tried all variations of URL formatting.
nerdylive
nerdylive•4mo ago
Hmm, I'm thinking you could try updating the requests version via pip after installing runpodctl (search Google for the command). I think your error trace points to the requests or aiohttp library. Since you just downgraded runpodctl and it didn't work, it may be because the older runpodctl uses older library dependencies. Idk. Hi yhlong.
yhlong00000
yhlong00000•4mo ago
If you have long-running jobs, SDK 1.7.3 has a bug that causes them to retry unexpectedly. In my testing, versions 1.7.2 and 1.6.2 don’t have this issue. I’m not sure why you’re encountering the ValueError with 1.6.2, but could you try using 1.7.2 to see if it resolves the retry problem?
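import requests
from yarl import URL

HOSTNAME = "127.0.0.1"
PORT = 8188

# Wrapped in a function so the return statement is valid; `utils` is the
# worker's own logging helper from the snippet above.
def check_server():
    # Construct URL properly with separate host and port
    url = URL.build(scheme="http", host=HOSTNAME, port=PORT)

    response = requests.get(str(url))

    # If the response status code is 200, the server is up and running
    if response.status_code == 200:
        utils.log("API: reachable!")
        return True
    return False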
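A single GET can fail simply because the local server is still starting, so a readiness check is often written as a short retry loop. Below is a minimal sketch using only the standard library. The names http_ok and wait_for_server, and the retry and delay defaults, are assumptions for illustration and are not taken from the worker in this thread.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status raised as HTTPError.
        return False


def wait_for_server(url, retries=10, delay=0.5, probe=http_ok):
    """Call probe(url) up to `retries` times, sleeping `delay` seconds between tries."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

A worker could call wait_for_server("http://127.0.0.1:8188") once before accepting jobs. The probe parameter is there so the loop can be exercised without a running server.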
xnorcode
xnorcode•4mo ago
Good suggestion, will try this now.

OK, thanks, will try version 1.7.2 as well.

@yhlong00000 I completed testing with 1.7.2 and it seems to be working perfectly, without any retries. I just sent you an email with logs and more information for you to review. I'm now testing with version 1.6.2 and will update you on that soon.

@yhlong00000 Runpod SDK 1.6.2 is not working, even after updating requests once the runpod package is installed, as suggested above. I've emailed some more details about this test for you to review. I am upgrading to SDK 1.7.2, which seems to be working fine so far. Hopefully, we'll get a new stable version soon. Thanks again for your prompt support!
yhlong00000
yhlong00000•4mo ago
Cool, thanks for testing it. Will let you know once we have a new version.
bart
bartOP•4mo ago
I am indeed using runpod==1.7.1 and will update it to 1.7.4, as the GitHub advisory recommends: https://github.com/runpod/runpod-python/releases/tag/1.7.3 . If I experience any problems I will report back in this thread. Thanks all for the input and the swift responses, and hopefully the resolution!
GitHub
Release 1.7.3 · runpod/runpod-python
SDK 1.7.3 Advisory: Known Issues with Long-Running Jobs – Please Upgrade to 1.7.4 1.7.3: Long-running jobs (>60 seconds) can cause the system to stop the worker, triggering retries and failures....
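A worker can warn at startup if it was built with one of the SDK versions reported as problematic in this thread. This is a minimal sketch: the set below only reflects what participants and the advisory reported here (1.7.1 and 1.7.3), not an official compatibility list, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions reported in this thread as causing unexpected retries.
REPORTED_AFFECTED = {"1.7.1", "1.7.3"}


def is_reported_affected(version):
    """Return True if the version string matches one reported as affected."""
    return version.strip() in REPORTED_AFFECTED


def installed_runpod_version():
    """Return the installed runpod package version, or None if it is not installed."""
    try:
        return metadata.version("runpod")
    except metadata.PackageNotFoundError:
        return None


current = installed_runpod_version()
if current is not None and is_reported_affected(current):
    print(f"runpod=={current} was reported to retry long-running jobs; consider 1.7.4.")
```

Pinning the version in requirements.txt (for example runpod==1.7.4) keeps an image rebuild from silently picking up a different release.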
bart
bartOP•4mo ago
1.7.4 seems to work well! No more retries and stuff like that. I'm not seeing my container logs after TensorFlow starts up, but that might be an issue on my end (e.g. not disabling Python output buffering). Thanks!
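On the output buffering point: Python block-buffers stdout when it is not attached to a terminal, which can delay or hide container logs. Setting the environment variable PYTHONUNBUFFERED=1 or running python -u disables this. The sketch below shows an in-process alternative; enable_line_buffering is a hypothetical helper name, and whether buffering is the actual cause of the missing logs here is not confirmed.

```python
import sys


def enable_line_buffering(stream=None):
    """Switch a text stream to line buffering; return False if it cannot be reconfigured."""
    stream = sys.stdout if stream is None else stream
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Replaced or wrapped streams may not support reconfigure().
        return False
    reconfigure(line_buffering=True)
    return True


enable_line_buffering()
print("worker starting")  # Flushed at the newline once line buffering is on.
```

Calling print(..., flush=True) on individual messages is another option when only a few log lines matter.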
