A step by step guide to deploy HuggingFace models?
So I'm looking for serverless options to host public models on HuggingFace for my personal use, but it looks like simply dropping in HuggingFace URLs won't actually make it work. It showed the following errors when I tried to send an API request to my OpenAI base URL:
Is there a step-by-step guide for beginners to deploy HuggingFace models using RunPod's serverless option?...
ValueError: Unrecognized model in BeaverAI/Cydonia-22B-v2l-GGUF. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, pix2struct, pixtral, plbart, poolformer, 
pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zoedepth
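The error text itself names the check that fails: the loader wants a `model_type` key in the repository's config.json. Below is a minimal sketch of that check against a locally downloaded config file. The helper name `has_model_type` and any path passed to it are hypothetical, and this is not the transformers library's own code.

```python
import json
from pathlib import Path


def has_model_type(config_path):
    """Return True if the file exists, parses as JSON, and has a model_type key."""
    path = Path(config_path)
    if not path.is_file():
        # Possibly a repo that ships only weight files and no config.json.
        return False
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(config, dict) and "model_type" in config
```

Running this on the config.json of a repo before pointing an endpoint at it might show early whether the same ValueError is likely.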
Request queued forever
Hi, I am facing a problem while interacting with my RunPod serverless endpoint. When I send the first request, it gets queued and the server doesn't start. Ideally a cold start should take 5 minutes at most, but it is not initializing even after 15-20 minutes. I have already deleted the endpoint and created it again; that fixed the issue once, but I am getting the same problem now. I am using a Docker image with a custom tag. The logs say worker is ready, starting container, remove container, and it throws no error...
Multi-Region Support and Expansion Plans
Hello,
Currently, the serverless worker system distributes containers randomly between the US and EU. I’m wondering if there are any plans to allow assigning a specific number of workers to each region (e.g., x workers in the US and x workers in the EU) under a single endpoint in the future.
Additionally, would it be possible to implement automatic routing of requests to the nearest region if this feature becomes available? For instance, if an edge function is called from the EU, it would be ideal to route the request to an EU-deployed worker to reduce latency....
Multiple endpoints within one handler
I have had success creating serverless endpoints in runpod with handler.py files that look like this:
imports
...
def handler(job):...
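One possible way to serve several "endpoints" from a single handler is to dispatch on a field in the job input. The sketch below assumes a hypothetical `action` field and two made-up routes (`generate`, `embed`); only the `job["input"]` layout and the `runpod.serverless.start` call mentioned in the closing comment come from the RunPod SDK.

```python
def generate(params):
    """Hypothetical route: pretend to generate text from a prompt."""
    return {"text": "generated: " + params.get("prompt", "")}


def embed(params):
    """Hypothetical route: return a dummy one-element embedding."""
    return {"embedding": [float(len(params.get("text", "")))]}


# Map the value of the assumed "action" field to the function that serves it.
ROUTES = {"generate": generate, "embed": embed}


def handler(job):
    """Single serverless handler that dispatches on job["input"]["action"]."""
    job_input = job.get("input", {})
    action = job_input.get("action")
    route = ROUTES.get(action)
    if route is None:
        return {"error": f"unknown action: {action!r}"}
    return route(job_input.get("params", {}))


# In a real worker this would be registered with the RunPod SDK, e.g.
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

A request body such as `{"input": {"action": "embed", "params": {"text": "abc"}}}` would then reach the `embed` route.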
How to Minimize I/O Waiting Time?
Hello,
I’m using serverless Runpod for ComfyUI, where I send and return image URLs, leveraging the Google Cloud Bucket SDK. My current flow is:
Runpod handler downloads the image using the URL....
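If the waiting time is dominated by network transfers, one common approach is to overlap them with a thread pool instead of downloading one file at a time. This is only a sketch: `fetch` is a hypothetical caller-supplied function (for example a wrapper around whatever download call the handler already uses), and `fetch_all` is a made-up helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in parallel threads; results keep input order."""
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(urls))) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(fetch, urls))
```

The same helper could be reused for uploads by passing an upload function instead of a download function.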
Image caching
Hi, are there plans to add caching for user images? I have a pretty big image (18GB), and after some time it gets pulled again even though it was already initialized, which blocks my processing pipeline.
Flashboot principles
Hello. Can anyone explain the principles behind Flashboot?
When a worker is idle, does the model stay in GPU memory or in system memory?
How long does the model stay in memory? Is some LRU eviction policy used?...
Thinking of using RunPod
I apologize if I sound too ignorant.
I have a tool that converts text prompts to images, and currently I'm using DALL-E for the image generation. The DALL-E costs are getting too high for me. If I use RunPod, will the cost of image generation be lower?
Currently I'm spending about $0.8 per 10 images using DALL-E...
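The comparison comes down to simple arithmetic: $0.8 per 10 images is $0.08 per image, so a per-second GPU price gives a break-even generation time per image. In the sketch below the GPU price is a made-up placeholder, not a real RunPod rate, and it ignores cold starts and idle time.

```python
def break_even_seconds(cost_per_image, gpu_price_per_second):
    """GPU seconds one image may take before it costs more than cost_per_image."""
    return cost_per_image / gpu_price_per_second


dalle_cost_per_image = 0.8 / 10  # from the post: about $0.8 per 10 images
hypothetical_gpu_price = 0.0004  # ASSUMED $/second, not a real RunPod price

print(break_even_seconds(dalle_cost_per_image, hypothetical_gpu_price))
```

If an image takes fewer GPU seconds than the printed value at the real price, self-hosting would come out cheaper per image under these assumptions.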
Issues with network volume access
It seems like there is latency when accessing files from serverless workers.
Error on the worker:
2024-10-22T18:36:15.970116008Z CompletedProcess(args=['python3', '/runpod-volume/script3.py', 'xx', 'xx-xx'], returncode=1, stdout='File /runpod-volume/xx/xx/file.txt not found after 15 seconds. Exiting.\n', stderr='')...
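The log suggests the script waits a fixed 15 seconds for the file to appear. A minimal sketch of that kind of wait with a configurable timeout and polling interval follows; the function name `wait_for_file` and its defaults are assumptions, and a longer timeout only works around slow visibility rather than fixing it.

```python
import os
import time


def wait_for_file(path, timeout=15.0, poll_interval=0.5):
    """Poll until path exists or timeout seconds pass; return True if found."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Raising `timeout` for reads from the network volume would at least show whether the file eventually appears or is never written.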
Is there any way to set retries to 0?
I mean, I don't want a request to retry automatically if it fails.
How can we configure the scale type using the RunPod SDK?
Does the RunPod SDK support all the functionality of the platform?
Tensor
Currently in my Node.js backend I send requests to my serverless endpoint like this:
```js
console.log('Starting initial processing with RunPod API...');
const initialProcessingResponse = await retryWithBackoff(async () => {
  return await axios.post(
    `${process.env.RUNPOD_RUNSYNC_ENDPOINT}`,...
```
Migrated from RO to IS
Hello, I got a message from you that RO would be deprecated, so I decided to move everything to IS, since it has the highest availability of 4090s.
Everything is working, but sometimes my serverless tasks die without any output or logs. They just die all of a sudden. Is anyone else having the same issue?...
Deploying a model quantised with bitsandbytes (model config).
I have fine-tuned a 7B model on my custom dataset, quantising it on my local machine with 12 GB of VRAM. When I went to deploy the model on RunPod with vLLM for faster inference, I found only three types of quantised models supported there, namely GPTQ, AWQ and SqueezeLLM. Am I misinterpreting something, or does RunPod not support deploying a model quantised this way? Is there any workaround I can use to deploy my model for now?
Does anyone have a fork of ashleykza/runpod-worker-a1111:3.0.0?
The Docker image no longer exists. Does anyone have a backup?
API to remove worker from endpoint - please!
Sometimes one worker in an endpoint fails because of internal errors, misconfiguration, running out of space (because of memory purging errors), etc. (this happens in less than 1% of cases). Unfortunately, this worker will generate endless errors, and each task going to that worker will fail. So someone always has to log in to the account and manually kick that worker out of the endpoint to stop the errors. We definitely need an API to kick unhealthy workers from endpoints. 🙏
Batch processing of chats
Processing a batch of chat completions.
Hi,
I am new to RunPod and am trying to adapt my project so I can use it with the serverless interface. My project works fine on AWS using offline vLLM inference via the langchain library. My understanding is that to use RunPod Serverless, I will have to use the OpenAI interface. What I don't understand is how to implement vLLM batch processing, like I do now with offline inference, using this API. The client.chat.completions.create() method seems to take only one chat (with multiple messages) at a time, not multiple independent chats (each consisting of multiple messages). The RunPod documentation also covers only the single-chat case. Is there a way to send a batch of chats at once, and if so, how? This is important for my logic, as using the prefix-caching-enabled option makes a big difference....
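Since the chat endpoint takes one chat per request, one possible approach is to send the independent chats concurrently and leave the batching to the server. The sketch below assumes a hypothetical coroutine `complete_one(messages)` that performs a single chat completion (for example a thin wrapper around an async OpenAI-compatible client); `complete_many` and `max_concurrency` are made-up names, and whether prefix caching benefits in the same way as with offline inference is not confirmed here.

```python
import asyncio


async def complete_many(chats, complete_one, max_concurrency=16):
    """Send every chat concurrently and return the replies in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(messages):
        async with semaphore:  # cap the number of in-flight requests
            return await complete_one(messages)

    # gather preserves the order of the awaitables it was given.
    return await asyncio.gather(*(guarded(messages) for messages in chats))
```

Each element of `chats` is one independent list of messages, so a call such as `asyncio.run(complete_many(chats, complete_one))` would return one reply per chat.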
Pod crashing due to 100 percent cpu usage
Hey, I need help regarding RunPod serverless.
My serverless pod is using 100 percent CPU and then it crashes.
Is there a way to limit the pod's CPU usage so it does not exceed a certain point?...