Should I Adjust min_child_samples When Training LightGBM with 100% of the Data?
Hi, I'm training a LightGBM model to optimize performance for an embedded system application, specifically for real-time anomaly detection on edge devices. I'm currently facing a dilemma regarding parameter tuning when increasing the amount of training data.
Initially, I split my dataset into 90% for training and 10% for testing. Using grid search, I found the optimal parameters for the model. Now, I want to leverage 100% of the data to train the model to make it as robust as possible for deployment on resource-constrained devices.
My question is about parameters like `min_child_samples`, which are related to data volume. When I increase the training data from 90% to 100%, should I keep `min_child_samples` at the value found during the 90% training run, or should I adjust it because the data volume has increased, given the constraints of embedded systems?

Could someone provide guidance on how to handle this, or share best practices for tuning parameters when increasing the data size, to ensure optimal model performance in embedded system applications?
Hi @wafa_ath, you do need to readjust `min_child_samples` when you switch from 90% to 100% of the data. Keep in mind:

- Larger dataset = higher `min_child_samples` value. Why? Since you now have more data, increasing `min_child_samples` helps prevent the model from becoming overly complex and overfitting on noise, which is important for embedded systems.
- Start by incrementally increasing `min_child_samples` by a small percentage, maybe 10% to 20%, and monitor performance with cross-validation.
So since you're using 100% of the data for training, you should still validate the model, probably with k-fold cross-validation, to make sure the adjustments are beneficial and don't harm generalization.

Thanks for the advice! I'll try incrementally increasing `min_child_samples` and use k-fold cross-validation to validate the changes.