RunPod
•Created by Thibaud on 8/20/2024 in #⚡|serverless
SGLang
SGLang works very well in a pod, but I can't get it to run in serverless.
The API route keeps returning error 404.
I use the exact same config (Docker image, command line, port) in the pod and in serverless.
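One thing worth checking: RunPod serverless workers don't expose an HTTP port the way pods do, so routes that work in a pod are never reachable directly, and a 404 is what you'd expect. Requests reach the worker through a handler registered with the runpod SDK. Below is a minimal sketch of a handler that proxies the job input to an SGLang server started inside the worker; the port 30000 and the /generate route are assumptions based on SGLang's defaults.
```python
import requests
import runpod

# Assumption: the SGLang server is launched inside the worker first, e.g.
#   python -m sglang.launch_server --model-path <model> --port 30000
SGLANG_URL = "http://127.0.0.1:30000/generate"

def handler(job):
    # Forward the serverless job's input payload to the local SGLang HTTP API.
    resp = requests.post(SGLANG_URL, json=job["input"], timeout=300)
    resp.raise_for_status()
    return resp.json()

# Blocks and pulls jobs from the RunPod queue; there is no public port
# on the worker, which is why hitting the SGLang route directly 404s.
runpod.serverless.start({"handler": handler})
```
The worker is then called through the endpoint's /run or /runsync URL rather than through the SGLang route itself.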
120 replies
RunPod
•Created by Thibaud on 8/8/2024 in #⚡|serverless
can't run 70b
Any tips for running a 70B model, for example mlabonne/Llama-3.1-70B-Instruct-lorablated?
I tried this:
config:
80 GB GPU
2 GPUs / worker
container disk: 500 GB
env vars:
MAX_MODEL_LEN=15000
MODEL_NAME=mlabonne/Llama-3.1-70B-Instruct-lorablated
but it doesn't work.
Without MAX_MODEL_LEN=15000, I got: "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
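That error is a memory-budget hint: vLLM sizes the KV cache from whatever VRAM is left after loading the weights, and a 70B model in fp16 is roughly 140 GB, so two 80 GB GPUs only have headroom if the weights are actually sharded across both. A hedged sketch of the env vars that would do that with the runpod/worker-v1-vllm image follows; TENSOR_PARALLEL_SIZE and GPU_MEMORY_UTILIZATION are assumed here to map onto vLLM's engine arguments as documented for that worker.
```
MODEL_NAME=mlabonne/Llama-3.1-70B-Instruct-lorablated
MAX_MODEL_LEN=15000
# Assumed mapping to vLLM's tensor_parallel_size:
# shard the weights across both GPUs in the 2-GPU worker.
TENSOR_PARALLEL_SIZE=2
# Assumed mapping to vLLM's gpu_memory_utilization:
# leave a larger share of VRAM for the KV cache.
GPU_MEMORY_UTILIZATION=0.95
```
Without tensor parallelism the second GPU sits idle, which would match the KV-cache error above.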
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container
and nothing after
75 replies