Hey guys, I am currently studying imbalanced-dataset challenges and doing a deep dive on oversampling and undersampling, using the SMOTE implementation in Python (imbalanced-learn). I have to give a big presentation and report on this to my peers. What should I talk about?
I seem to remember reading a paper titled "To SMOTE or Not to SMOTE." Probability calibration is essentially destroyed by oversampling or undersampling; if you need calibrated probabilities, calibrate after modeling. Conformal prediction could be more beneficial.
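To make the calibration point concrete, here is a small sketch (synthetic data, my own made-up setup, not from the paper): a logistic regression with an intercept reproduces the true base rate on the imbalanced data, but after naively oversampling the minority class to 1:1 its average predicted probability is wildly inflated relative to the actual positive rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 10% positives
n = 20000
X = rng.normal(size=(n, 2))
logits = X[:, 0] + X[:, 1] - 3.0
y = rng.random(n) < 1 / (1 + np.exp(-logits))

# Fit on the data as-is: average predicted probability tracks the base rate
clf = LogisticRegression().fit(X, y)
mean_pred = clf.predict_proba(X)[:, 1].mean()

# Naively random-oversample the minority class to a 1:1 ratio and refit
pos = np.where(y)[0]
neg = np.where(~y)[0]
pos_up = rng.choice(pos, size=len(neg), replace=True)
X_bal = np.vstack([X[neg], X[pos_up]])
y_bal = np.concatenate([np.zeros(len(neg), bool), np.ones(len(pos_up), bool)])
clf_bal = LogisticRegression().fit(X_bal, y_bal)
mean_pred_bal = clf_bal.predict_proba(X)[:, 1].mean()

print(f"true base rate:           {y.mean():.3f}")
print(f"mean pred (imbalanced):   {mean_pred:.3f}")
print(f"mean pred (oversampled):  {mean_pred_bal:.3f}")
```

The oversampled model's probabilities no longer mean anything as probabilities, which is exactly why calibration (or skipping resampling entirely) comes up.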
Thank you. I found the paper and asked Gemini to summarize it:
Balancing helps weak classifiers, but not strong ones:
For powerful algorithms like XGBoost and CatBoost, balancing didn't significantly improve performance. In fact, these strong classifiers performed better on the imbalanced data than weaker ones (like decision trees or SVM) even after balancing.
Optimizing the decision threshold is often better than balancing:
For label-based metrics (like F1-score), adjusting the threshold that determines a positive prediction can be as effective as balancing and is computationally cheaper.
Simple oversampling can be sufficient:
If you must balance, basic random oversampling of the minority class (duplicating failure examples) can be just as good as more complex methods like SMOTE.
There are exceptions:
Balancing can be useful when you have expert knowledge to set hyperparameters for the balancing method, are forced to use a weak classifier, or cannot optimize the decision threshold.
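On the "random oversampling vs SMOTE" point, the mechanical difference is small enough to sketch from scratch (simplified illustration in plain NumPy; in practice you would use imbalanced-learn's `RandomOverSampler` and `SMOTE`, and labels are omitted here since all rows are minority-class):

```python
import numpy as np

def random_oversample(X_min, n_new, rng):
    # Random oversampling: duplicate existing minority rows verbatim
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]

def smote_like(X_min, n_new, rng, k=5):
    # SMOTE-style synthesis: interpolate between a minority point and one
    # of its k nearest minority neighbours (simplified sketch)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-neighbours
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbours per point
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))          # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))          # pretend minority-class rows
dup = random_oversample(X_min, 100, rng)  # exact duplicates
syn = smote_like(X_min, 100, rng)         # interpolated synthetic points
print(dup.shape, syn.shape)
```

The only thing SMOTE adds over duplication is the interpolation step, and the paper's finding is that for strong classifiers that extra machinery rarely pays for itself.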
There is a paper on that? If the paper covers the above then it's a good paper, because everything mentioned above is what you figure out as an experienced practitioner over time.