Empty Tokens Using Mixtral AWQ
2024-01-20T00:36:26.942667713Z
2024-01-20T00:36:26.943297221Z ==========
2024-01-20T00:36:26.943372701Z == CUDA ==
2024-01-20T00:36:26.943619654Z ==========
2024-01-20T00:36:26.952680191Z
2024-01-20T00:36:26.952702058Z CUDA Version 11.8.0
2024-01-20T00:36:26.952707724Z
2024-01-20T00:36:26.952711901Z Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-01-20T00:36:26.952716521Z
2024-01-20T00:36:26.952720974Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-01-20T00:36:26.952726474Z By pulling and using the container, you accept the terms and conditions of this license:
2024-01-20T00:36:26.952730628Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-01-20T00:36:26.952735187Z
2024-01-20T00:36:26.952739194Z A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-01-20T00:36:26.967114577Z
2024-01-20T00:36:31.398534811Z /usr/local/lib/python3.11/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
2024-01-20T00:36:31.398583407Z warnings.warn(
2024-01-20T00:36:33.126995125Z WARNING 01-20 00:36:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-20T00:36:33.127225878Z INFO 01-20 00:36:33 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-20T00:39:28.047075314Z INFO 01-20 00:39:28 llm_engine.py:223] # GPU blocks: 4982, # CPU blocks: 2048
2024-01-20T00:39:30.772439067Z INFO 01-20 00:39:30 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-01-20T00:39:30.772471720Z INFO 01-20 00:39:30 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
2024-01-20T00:39:50.500041757Z INFO 01-20 00:39:50 model_runner.py:449] Graph capturing finished in 20 secs.
2024-01-20T00:39:50.506268314Z --- Starting Serverless Worker | Version 1.5.2 ---
2024-01-20T00:39:57.685484188Z {"requestId": "7756be1f-38eb-4672-9097-7785b367df08-u1", "message": "Finished running generator.", "level": "INFO"}
2024-01-20T00:39:57.755166616Z {"requestId": "7756be1f-38eb-4672-9097-7785b367df08-u1", "message": "Finished.", "level": "INFO"}
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "Tell me why RunPod is the best GPU provider"
      },
      {
        "role": "assistant",
        "content": "RunPod is the best GPU provider for several reasons."
      },
      {
        "role": "user",
        "content": "Name 3 resons"
      }
    ],
    "sampling_params": {
      "max_tokens": 100
    },
    "apply_chat_template": true,
    "stream": true
  }
}
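For reference, a payload shaped like this would typically be POSTed to the endpoint's synchronous route, which returns all of the streamed chunks at once (as in the response further down). A minimal sketch; the endpoint ID, API key variable, and timeout are placeholders to adapt to your own deployment:

import os
import requests

ENDPOINT_ID = "your-endpoint-id"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # placeholder env var

payload = {
    "input": {
        "messages": [
            {"role": "user", "content": "Tell me why RunPod is the best GPU provider"},
            {"role": "assistant", "content": "RunPod is the best GPU provider for several reasons."},
            {"role": "user", "content": "Name 3 resons"},
        ],
        "sampling_params": {"max_tokens": 100},
        "apply_chat_template": True,
        "stream": True,
    }
}

# /runsync blocks until the job finishes and returns the aggregated output.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
print(resp.json())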
2 Replies
{
  "delayTime": 206939,
  "executionTime": 7064,
  "id": "7756be1f-38eb-4672-9097-7785b367df08-u1",
  "output": [
    {
      "finished": false,
      "tokens": [
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", ""
      ],
      "usage": { "input": 43, "output": 30 }
    },
    {
      "finished": false,
      "tokens": [
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", ""
      ],
      "usage": { "input": 43, "output": 60 }
    },
    {
      "finished": false,
      "tokens": [
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", "",
        "", "", "", "", "", "", "", "", "", ""
      ],
      "usage": { "input": 43, "output": 90 }
    },
    {
      "finished": true,
      "tokens": [
        "", "", "", "", "", "", "", "", "", ""
      ],
      "usage": { "input": 43, "output": 100 }
    }
  ],
  "status": "COMPLETED"
}
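Joining the tokens from every streamed chunk shows the problem concretely: usage counts 100 output tokens, yet the concatenated text is empty. A quick sanity check, assuming the response above has been saved to a file (hypothetical name response.json):

import json

# Hypothetical: the response shown above, saved to response.json
with open("response.json") as f:
    resp_json = json.load(f)

text = "".join(token for chunk in resp_json["output"] for token in chunk["tokens"])
print(repr(text))                        # '' -- every generated token is an empty string
print(resp_json["output"][-1]["usage"])  # {'input': 43, 'output': 100}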
It seems like an issue with the model, its setup, or the sampling parameters. Quants especially often have issues.
I'd check whether there are any Hugging Face or GitHub issues related to that specific quant. Since the handler is returning a coherent response structure, it's not a worker issue.
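One quick check along those lines is to pass explicit sampling parameters in the request instead of only max_tokens. The field names below follow vLLM's SamplingParams; the values are just illustrative:

# Illustrative sampling_params for the request "input"; names follow vLLM's SamplingParams.
sampling_params = {
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}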
I'd first try to get that model working with vLLM in a Pod.
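A minimal way to do that check, sketched against vLLM's offline API (the prompt string is an assumption, and exact keyword arguments may differ between vLLM versions):

from vllm import LLM, SamplingParams

# Same AWQ quant and dtype that the worker's engine log shows.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="float16",
    trust_remote_code=True,
)

params = SamplingParams(max_tokens=100)

# Mixtral-Instruct style prompt; if this prints real text, the quant loads and
# generates fine outside the serverless worker.
outputs = llm.generate(
    ["[INST] Name 3 reasons why RunPod is the best GPU provider [/INST]"],
    params,
)
print(outputs[0].outputs[0].text)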