Training jobs using script
Hey, can anyone tell me whether RunPod lets me write a training script that I can launch from anywhere, so that it creates a GPU instance and loads and saves my data to external cloud storage, like AWS SageMaker's training script mode? I need to train multiple models with different architectures this way to see which one performs best.
20 Replies
Yes, you can upload files to S3 storage.
You can do that from Python too.
https://docs.runpod.io/sdks/overview
https://docs.runpod.io/cli/overview
https://docs.runpod.io/pods/configuration/export-data
Here are some useful links.
Overview | RunPod Documentation
Unlock serverless functionality with RunPod SDKs, enabling developers to create custom logic, simplify deployments, and programmatically manage infrastructure, including Pods, Templates, and Endpoints.
Overview | RunPod Documentation
RunPod CLI (runpodctl) is a command-line interface tool designed to automate and manage GPU pods on RunPod.
Export data | RunPod Documentation
Export RunPod data to various cloud providers, including Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, Backblaze B2 Cloud Storage, and Dropbox, with secure key and access token management.
I'm fairly new to RunPod. Can you please point me to a tutorial where a remote training job is run on a pod, the model weights are stored on S3, and the pod automatically kills itself once the training is complete?
You'll probably have to write some code to pull the data from S3, and after training you can terminate the pod using our CLI. Btw, ChatGPT is really good at writing code 😀
https://docs.runpod.io/cli/overview
Yeah, upload the data to S3 after training.
Sorry, it is still unclear. Does RunPod have a tutorial on training a custom model on a GPU instance? I have tried searching for one, but I have not found any.
I think there is, but what kind of model are you trying to train?
https://blog.runpod.io/using-runpods-dreambooth-endpoint-to-make-custom-generated-images/
I found one using DreamBooth (not sure if it's outdated).
RunPod Blog
Using RunPod's DreamBooth Endpoint to Make Custom Generated Images
It's probably not working anymore, since the DreamBooth endpoint used TheLastBen's code.
I recommend using Kohya_ss, EveryDream2Trainer or OneTrainer
This guy has some videos for training image models:
https://www.youtube.com/@SECourses/videos
What kind of model are you training?
Well, I'm training different kinds of segmentation models for my tasks, varying from simple U-Net to Attention U-Net, and might also go for transformer-based segmentation models. I'd like to run an instance for each model, so I can compare their performance in as little time as possible.
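For the "one instance per model" part, here is a minimal sketch using the RunPod Python SDK linked earlier in the thread. It assumes `pip install runpod` and an API key; the image name, GPU type string, bucket name, and the `ARCH` / `S3_BUCKET` environment variables are hypothetical placeholders, and the image's own start command is assumed to read them and start training.

```python
def pod_requests(architectures, image_name, gpu_type_id, bucket):
    """Build one create_pod keyword-argument dict per architecture."""
    return [
        {
            "name": f"train-{arch}",
            "image_name": image_name,
            "gpu_type_id": gpu_type_id,
            # ARCH and S3_BUCKET are hypothetical variable names that the
            # training script inside the image is assumed to read.
            "env": {"ARCH": arch, "S3_BUCKET": bucket},
        }
        for arch in architectures
    ]


def launch_all(requests, api_key):
    """Create one pod per request and return the API responses."""
    import runpod  # imported here so the builder above works without the SDK

    runpod.api_key = api_key
    return [runpod.create_pod(**request) for request in requests]


# Usage (placeholders, not real values):
#   reqs = pod_requests(["unet", "attention-unet"], "my-registry/seg-train:latest",
#                       "NVIDIA GeForce RTX 3070", "my-bucket")
#   pods = launch_all(reqs, api_key="...")
```

Keeping the request building separate from the API call makes it easy to print and check what would be launched before any GPU time is billed.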
Maybe refer to your model library's docs?
Or the model repo.
Then use a custom script that runs Python and uses boto3 to upload to S3.
A big problem is saving the model weights and then auto-killing the pod once training is complete.
That's the high level.
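A rough sketch of the boto3 upload step, assuming `pip install boto3` and AWS credentials available in the pod's environment. The bucket name, the `segmentation-runs` prefix, and the helper names are made up for illustration; only `upload_file` is the actual boto3 S3 client call.

```python
from pathlib import Path


def s3_key_for(run_name, weights_path, prefix="segmentation-runs"):
    """Build an object key such as 'segmentation-runs/unet/model.pt'."""
    return f"{prefix}/{run_name}/{Path(weights_path).name}"


def upload_weights(s3_client, weights_path, bucket, run_name):
    """Upload one weights file and return the s3:// URI it was written to."""
    key = s3_key_for(run_name, weights_path)
    # upload_file(Filename, Bucket, Key) copies a local file to the bucket.
    s3_client.upload_file(weights_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage on the pod (placeholder names):
#   import boto3
#   uri = upload_weights(boto3.client("s3"), "checkpoints/model.pt",
#                        "my-bucket", "attention-unet")
```

Putting the run name in the key keeps the weights from each architecture in its own folder, so parallel pods don't overwrite each other.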
Can you please shed some light on how to auto-kill the instance?
Yeah, you can run the script to upload the models and then run runpodctl remove pod.
like that
Okay, thanks!
If I just stop my pod and don't remove it, will I still be billed? And once I'm inside the pod, can I stop it from there? Will the command
runpodctl remove pod $RUNPOD_POD_ID
work from inside the pod?
Yes.
You will still be billed for the storage.
You can probably run the remove pod command,
but you have to re-export RUNPOD_POD_ID
from your start script (cmd / entrypoint).
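Putting the last few messages together, a minimal sketch of the "save, then self-remove" step at the end of a training script. It assumes `runpodctl` is on the pod's PATH and that `RUNPOD_POD_ID` is visible to the Python process; `finish_and_remove` and `save_results` are hypothetical names, not part of any RunPod tooling.

```python
import os
import subprocess


def remove_command(pod_id):
    """Argument list for the CLI call quoted in this thread."""
    return ["runpodctl", "remove", "pod", pod_id]


def finish_and_remove(save_results, env=os.environ, run=subprocess.run):
    """Save results first, then remove the pod this script runs on."""
    # If saving raises, we never reach the removal, so the pod (and the
    # weights on its disk) stays around for a retry.
    save_results()
    pod_id = env.get("RUNPOD_POD_ID")
    if not pod_id:
        raise RuntimeError("RUNPOD_POD_ID is not set; export it from the start script")
    run(remove_command(pod_id), check=True)


# Usage at the very end of training (upload_my_weights is a placeholder
# for whatever uploads the weights to S3):
#   finish_and_remove(upload_my_weights)
```

Doing the upload before the removal is the important part: once the pod is removed its disk is gone, so the removal should only run after the upload has succeeded.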