RunPod•9mo ago
andyh3118

Hi, is there currently an outage to Serverless API?

The requests are "IN_QUEUE" forever...
27 Replies
haris
haris•9mo ago
I've had similar issues when I provided an incorrect input body. Are you able to share the body you're using for your serverless endpoint, as well as what template you're using and any other information you think could be useful?
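As a point of reference, a serverless endpoint expects the payload wrapped in an "input" field. A minimal sketch of submitting a job, assuming the standard /run route; the endpoint ID, API key, and prompt field are placeholders:

import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    # The handler receives whatever sits under "input".
    json={"input": {"prompt": "Hello"}},
)
print(resp.json())  # should return a job id and a status such as IN_QUEUE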
andyh3118
andyh3118OP•9mo ago
This is my endpoint: 1ifuoxegzxuhb4. We are using vLLM. I don't think the input body is wrong, though, because the same service has been running smoothly for 2-3 weeks already. Things started to become unstable over the weekend, and today it's a full outage for us...
haris
haris•9mo ago
Got it. If you look at your serverless endpoint after you send a request, are you able to see whether it has any workers running? We might not have any availability on the GPUs you've chosen.
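Queue depth and worker counts can also be checked programmatically. A sketch, assuming the endpoint's standard /health route (endpoint ID and API key are placeholders):

import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

health = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()
# Expect counts of queued/in-progress jobs and idle/running workers.
print(health)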
andyh3118
andyh3118OP•9mo ago
You can see all the requests are pending
(image attached)
haris
haris•9mo ago
(image attached)
andyh3118
andyh3118OP•9mo ago
(image attached)
andyh3118
andyh3118OP•9mo ago
Workers are running
haris
haris•9mo ago
Odd
andyh3118
andyh3118OP•9mo ago
And are boosting correctly.
(image attached)
haris
haris•9mo ago
I'll bring this up internally as I'm not too sure what the issue could be, give me a moment.
andyh3118
andyh3118OP•9mo ago
Thanks!
Alpay Ariyak
Alpay Ariyak•9mo ago
What is the Docker image you are using? Our vLLM worker?
andyh3118
andyh3118OP•9mo ago
Ah, sorry, I was wrong. It is not vLLM; we use our own ExLlama image.
River Snow
River Snow•9mo ago
I think you need to debug your Docker image here; it appears to be broken.
andyh3118
andyh3118OP•9mo ago
OK. Any logs on your end that you can share (to indicate that it is broken)?
justin
justin•9mo ago
Were you able to confirm exllama works on a GPU pod?
andyh3118
andyh3118OP•9mo ago
Ah, it has been working for 2-3 weeks (we used it very actively), same image/model.
justin
justin•9mo ago
no new docker builds? interesting
andyh3118
andyh3118OP•9mo ago
based on the logs, the requests are not getting to the handler. This is our handler code:
import logging

import runpod

from app.exllamav2_common import boot_engine, generate

logger = logging.getLogger(__name__)


async def handler(job: dict):
    # The serverless job envelope carries the user payload under "input".
    request_dict: dict = job.pop("input", {})

    configs_dict = request_dict.copy()

    # Stream partial generations; the final yield marks completion.
    full_response = ""
    for full_response in generate(configs_dict):
        yield {"text": full_response, "finished": False}

    yield {"text": full_response, "finished": True}


# Load the ExLlamaV2 engine once, at worker startup.
boot_engine()


def concurrency_modifier(current_concurrency):
    max_concurrency = 1
    return max(0, max_concurrency - current_concurrency)


runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,
})

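One note on the snippet above: concurrency_modifier is defined but never passed to runpod.serverless.start, so the SDK never consults it. A minimal sketch of registering it, assuming the runpod Python SDK's concurrency_modifier config key:

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,
    # Assumed config key; without it the modifier defined above is ignored.
    "concurrency_modifier": concurrency_modifier,
})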
justin
justin•9mo ago
It says it starts generating, so I feel it is reaching the handler.
andyh3118
andyh3118OP•9mo ago
hmm. you are correct.
justin
justin•9mo ago
I guess two things here: 1) Maybe try to create a test endpoint with 3 max workers and see if it works there. That way you'd at least isolate whether it's the original endpoint or your code (if both fail).
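To isolate the endpoint from the image, a throwaway echo handler like the sketch below could go on a fresh test endpoint; if it also sits in IN_QUEUE, the problem is on the endpoint side rather than in the ExLlama code. The --test_input flag is an assumption based on the runpod SDK's local-testing support:

import runpod

def handler(job):
    # Echo the payload back so the round trip is trivially verifiable.
    return {"echo": job.get("input", {})}

runpod.serverless.start({"handler": handler})

# Local smoke test (assumed SDK flag):
#   python handler.py --test_input '{"input": {"message": "ping"}}'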
andyh3118
andyh3118OP•9mo ago
So it gets to the handler, but gets stuck 😆
justin
justin•9mo ago
2) I just tried with my LLM, which uses an async generator too and works fine https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
andyh3118
andyh3118OP•9mo ago
got it.
justin
justin•9mo ago
So either you got a bad endpoint somehow, or something is off with your code or input.
andyh3118
andyh3118OP•9mo ago
Thanks, let me look into that. Could be an issue with ExLlamaV2.