Should I Adjust min_child_samples When Training LightGBM with 100% of the Data?

Hi, I'm training a LightGBM model for an embedded system application, specifically real-time anomaly detection on edge devices, and I'm facing a dilemma about parameter tuning when increasing the amount of training data.

Initially, I split my dataset into 90% for training and 10% for testing, and used grid search to find the optimal parameters. Now I want to train on 100% of the data to make the model as robust as possible for deployment on resource-constrained devices.

My question is about parameters like min_child_samples, which are related to the data volume. When I go from 90% to 100% of the data, should I keep min_child_samples at the value found during the 90% training, or should I adjust it because the data volume has increased, considering the constraints of embedded systems? Could someone provide guidance or share best practices for tuning parameters when increasing the data size to ensure optimal model performance in embedded system applications?
3 Replies
Solution
Marvee Amasi · 3mo ago
Hi @wafa_ath, yes, you should re-adjust min_child_samples when you switch from 90% to 100% of the data. As a rule of thumb: larger dataset = larger min_child_samples. Why? Since you now have more data, increasing min_child_samples helps keep the trees from becoming too complex and overfitting on noise, which matters for embedded systems. Start by incrementally increasing min_child_samples by a small percentage, maybe 10% to 20%, and monitor performance with cross-validation.
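Here's a rough sketch of what that sweep could look like. The numbers are just placeholders (base value, +10%, +20%), and the synthetic dataset stands in for your own anomaly-detection data:

```python
# Sketch: compare the old min_child_samples against scaled-up candidates via k-fold CV.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data; replace with your full dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

base_value = 20  # placeholder: the min_child_samples found by grid search on the 90% split
candidates = [base_value, int(base_value * 1.1), int(base_value * 1.2)]  # +0%, +10%, +20%

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for mcs in candidates:
    model = LGBMClassifier(min_child_samples=mcs, n_estimators=200, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"min_child_samples={mcs}: AUC {scores.mean():.4f} +/- {scores.std():.4f}")
```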
Marvee Amasi · 3mo ago
So since you're using 100% of the data for training, you should still validate the model with cross-validation, probably k-fold, to make sure the adjustments actually help and don't hurt generalization.
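If you'd rather use LightGBM's built-in CV instead of scikit-learn's, something like this works too (again, just a sketch with placeholder parameter values and stand-in data):

```python
# Sketch: k-fold CV on the full dataset using LightGBM's built-in lgb.cv.
import lightgbm as lgb
from sklearn.datasets import make_classification

# Stand-in data; replace with your full dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

params = {
    "objective": "binary",
    "metric": "auc",
    "min_child_samples": 24,  # placeholder: the scaled-up candidate you're checking
    "verbosity": -1,
}
cv_results = lgb.cv(params, lgb.Dataset(X, label=y),
                    num_boost_round=200, nfold=5, stratified=True, seed=42)

# The result key name differs across LightGBM versions ("auc-mean" vs "valid auc-mean").
auc_key = next(k for k in cv_results if k.endswith("auc-mean"))
print(f"mean CV AUC at the final boosting round: {cv_results[auc_key][-1]:.4f}")
```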
wafa_ath · 3mo ago
Thanks for the advice, I'll try incrementally increasing min_child_samples and use k-fold cross-validation to validate the changes.