Oversampling/Undersampling

Hey guys, I'm currently doing a deep dive on the challenges of imbalanced datasets, specifically oversampling and undersampling, and I'm using SMOTE in Python. I have to give a big presentation and write a report on this for my peers. What should I talk about?

I was thinking:

  • Intro: Imbalanced datasets, challenges
  • Over/Under: Explaining what oversampling and undersampling are
  • Use Case 1: Undersampling
  • Use Case 2: Oversampling
  • Deep Dive on SMOTE
  • Best practices
  • Conclusions
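
For the oversampling and SMOTE slides, this is roughly the kind of toy demo I have in mind. It is only a minimal sketch in plain Python, not the imbalanced-learn implementation: the function names `random_oversample` and `smote_like`, the toy points, and `k=2` are all made up for illustration. It contrasts duplicating minority rows with SMOTE-style interpolation between a minority point and one of its nearest minority neighbours.

```python
import random


def random_oversample(minority, n_new, rng):
    """Random oversampling: copy existing minority rows at random."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style oversampling (illustrative, not the library code).

    Each synthetic row lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)  # fixed seed so the demo is repeatable
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]  # toy data

duplicates = random_oversample(minority, 6, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)

print("duplicated rows:", duplicates)
print("synthetic rows: ", synthetic)
```

The duplicated rows are exact copies, while the synthetic rows are new points that stay inside the region spanned by the minority class.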

Should I add something? Do you have any tips?

6 Likes

I seem to remember reading a paper titled “To SMOTE or Not to SMOTE.” Oversampling or undersampling essentially destroys the relative class frequencies, so the predicted probabilities no longer reflect the real base rates. Calibration should come after modeling. Conformal Prediction could be more beneficial.
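
To make the calibration point concrete, here is a small sketch of one well-known correction. It assumes plain random undersampling where every minority row is kept and each majority row is kept with probability `beta`; in that case the odds seen by the model are inflated by a factor of `1 / beta`, and the score can be mapped back. The function name and the `beta = 0.1` value are hypothetical, chosen only for illustration, and this is not a substitute for a proper calibration step on held-out data.

```python
def correct_for_undersampling(p_sampled, beta):
    """Map a probability learned on undersampled data back to the
    original class balance.

    Assumption: all minority (positive) rows were kept and each
    majority (negative) row was kept with probability beta, so
    odds_original = beta * odds_sampled.
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)


beta = 0.1  # hypothetical: only 10% of majority-class rows were kept
for p in (0.1, 0.5, 0.9):
    corrected = correct_for_undersampling(p, beta)
    print(f"model says {p:.2f} -> corrected {corrected:.4f}")
```

With these toy numbers, a score of 0.5 from the model trained on resampled data corresponds to roughly 0.09 on the original distribution, which is the distortion being described.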

5 Likes

Thank you. I found the paper and asked Gemini to summarize it.

Balancing helps weak classifiers, but not strong ones:

For powerful algorithms like XGBoost and CatBoost, balancing didn't significantly improve performance. In fact, these strong classifiers trained on the imbalanced data performed better than weaker ones (like decision trees or SVM) trained on balanced data.

Optimizing the decision threshold is often better than balancing:

For label-based metrics (like F1-score), adjusting the threshold that determines a positive prediction can be as effective as balancing and is computationally cheaper.

Simple oversampling can be sufficient:

If you must balance, basic random oversampling of the minority class (duplicating failure examples) can be just as good as more complex methods like SMOTE.

There are exceptions:

Balancing can be useful when you have expert knowledge to set hyperparameters for the balancing method, are forced to use a weak classifier, or cannot optimize the decision threshold.
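
The threshold point from the summary can be sketched in a few lines. This is a minimal illustration in plain Python with made-up validation labels and scores; the helper names `f1_at_threshold` and `best_threshold` are hypothetical. It leaves the training data untouched and simply searches for the cutoff that maximizes F1 on a validation set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Try every distinct score as a cutoff and keep the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Toy validation set: 2 positives out of 10, scores from some classifier.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.35, 0.45]

default_f1 = f1_at_threshold(y_val, scores, 0.5)
tuned = best_threshold(y_val, scores)
tuned_f1 = f1_at_threshold(y_val, scores, tuned)

print(f"F1 at 0.5: {default_f1:.2f}")
print(f"best threshold: {tuned:.2f}, F1: {tuned_f1:.2f}")
```

On this toy data the default 0.5 cutoff predicts no positives at all, while a lower cutoff recovers both. The threshold should be chosen on validation data and then checked on a separate test set.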

4 Likes

There is a paper on that? If the paper covers the above, then it's a good paper, because everything mentioned above is what you figure out over time as an experienced practitioner.

3 Likes

Undersample, oversample, and jail time.

2 Likes

Hired as DS manager

1 Like

Because of jail, we have the best data scientists.