Hey guys, I am currently studying imbalanced-dataset challenges and doing a deep dive on oversampling and undersampling, using the SMOTE implementation in Python (imbalanced-learn). I have to give a big presentation and report on this to my peers. What should I talk about?
I seem to remember reading a paper titled "To SMOTE or Not to SMOTE." Probability calibration is essentially destroyed by oversampling or undersampling; if you need calibrated probabilities, calibrate after modeling. Conformal prediction could be more beneficial.
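To make the calibration point concrete, here is a small sketch (synthetic data, my own made-up setup, not from the paper): a logistic regression with an intercept reproduces the true base rate on the imbalanced data, but after naively oversampling the minority class to 1:1 its average predicted probability is wildly inflated relative to the actual positive rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 10% positives
n = 20000
X = rng.normal(size=(n, 2))
logits = X[:, 0] + X[:, 1] - 3.0
y = rng.random(n) < 1 / (1 + np.exp(-logits))

# Fit on the data as-is: average predicted probability tracks the base rate
clf = LogisticRegression().fit(X, y)
mean_pred = clf.predict_proba(X)[:, 1].mean()

# Naively random-oversample the minority class to a 1:1 ratio and refit
pos = np.where(y)[0]
neg = np.where(~y)[0]
pos_up = rng.choice(pos, size=len(neg), replace=True)
X_bal = np.vstack([X[neg], X[pos_up]])
y_bal = np.concatenate([np.zeros(len(neg), bool), np.ones(len(pos_up), bool)])
clf_bal = LogisticRegression().fit(X_bal, y_bal)
mean_pred_bal = clf_bal.predict_proba(X)[:, 1].mean()

print(f"true base rate:           {y.mean():.3f}")
print(f"mean pred (imbalanced):   {mean_pred:.3f}")
print(f"mean pred (oversampled):  {mean_pred_bal:.3f}")
```

The oversampled model's probabilities no longer mean anything as probabilities, which is exactly why calibration (or skipping resampling entirely) comes up.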
Thank you. I found the paper and asked Gemini to summarize it:
Balancing helps weak classifiers, but not strong ones:
For powerful algorithms like XGBoost and CatBoost, balancing didn't significantly improve performance. In fact, these strong classifiers performed better on the imbalanced data than weaker ones (like decision trees or SVM) even after balancing.
Optimizing the decision threshold is often better than balancing:
For label-based metrics (like F1-score), adjusting the threshold that determines a positive prediction can be as effective as balancing and is computationally cheaper.
Simple oversampling can be sufficient:
If you must balance, basic random oversampling of the minority class (duplicating failure examples) can be just as good as more complex methods like SMOTE.
There are exceptions:
Balancing can be useful when you have expert knowledge to set hyperparameters for the balancing method, are forced to use a weak classifier, or cannot optimize the decision threshold.
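On the "random oversampling vs SMOTE" point, the mechanical difference is small enough to sketch from scratch (simplified illustration in plain NumPy; in practice you would use imbalanced-learn's `RandomOverSampler` and `SMOTE`, and labels are omitted here since all rows are minority-class):

```python
import numpy as np

def random_oversample(X_min, n_new, rng):
    # Random oversampling: duplicate existing minority rows verbatim
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]

def smote_like(X_min, n_new, rng, k=5):
    # SMOTE-style synthesis: interpolate between a minority point and one
    # of its k nearest minority neighbours (simplified sketch)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-neighbours
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbours per point
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))          # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))          # pretend minority-class rows
dup = random_oversample(X_min, 100, rng)  # exact duplicates
syn = smote_like(X_min, 100, rng)         # interpolated synthetic points
print(dup.shape, syn.shape)
```

The only thing SMOTE adds over duplication is the interpolation step, and the paper's finding is that for strong classifiers that extra machinery rarely pays for itself.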
There is a paper on that? If the paper covers the above then it's a good paper, because everything mentioned above is what you figure out as an experienced practitioner over time.