Why is it so common to use the average/mean instead of the median when most datasets are skewed and not evenly distributed?

Sup guys, there’s this thing that has bothered me since I started my data science journey.

I almost always use the median over the average for comparisons because I’ve learned over the years that most data is skewed and not evenly distributed. However, I feel like I’m the only person who does this, even though basic statistics teach us to use the median in such cases.

Thoughts?

You only need a running sum and a count to get the average of a large set of items. Each value can be read once by the database, added to the running sum, and then released from memory. To find the median (the value in the middle), you have to sort the data, or at least partially order it, which takes more processing time and memory. Selection algorithms and approximate-quantile methods help, but an exact median still can't be computed in a single pass with constant memory the way a mean can.

If you are performing one-time computations or you only have 100 or 1,000 data points, then none of this matters. However, our data analysts request that we generate reports from tables containing millions of records, complete with real-time dashboard updates.
So unless there’s a really good reason to use a median instead of a mean, we’ll go with the mean.
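
To make the cost difference concrete, here is a minimal Python sketch of the two approaches described above. The `rows()` generator is a made-up stand-in for a database cursor, and the function names are just for illustration; a real database engine will do this differently internally.

```python
def running_mean(values):
    """Single pass: only a running sum and a count are kept in memory."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("mean of empty input")
    return total / count


def exact_median(values):
    """Needs all values at once: sort them, then take the middle."""
    ordered = sorted(values)  # materializes every value in memory
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty input")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def rows():
    # Hypothetical stand-in for rows streamed from a table.
    yield from [3, 1, 4, 1, 5, 9, 2, 6]


print(running_mean(rows()))  # 3.875
print(exact_median(rows()))  # 3.5
```

The mean function never holds more than two numbers, while the median function has to hold the full sorted list, which is where the extra time and memory go on large tables.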

The average, or mean, is commonly used because every value in the dataset contributes to it, so it reflects the whole dataset rather than just the middle of it.

It’s particularly useful in datasets where all values are relevant and contribute to the overall picture.

Despite its sensitivity to outliers in skewed distributions, the mean is still favored for its ease of calculation and its foundational role in further statistical analyses, such as variance and standard deviation.
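
As a small illustration of that foundational role, here is a sketch of how the (population) variance and standard deviation are built directly on top of the mean. The numbers are made up for the example.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: variance is the average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Step 3: standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0

# Cross-check against the standard library's population variance.
assert statistics.pvariance(data) == variance
```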

However, in cases of extreme skewness, the median can indeed be a more representative measure of central tendency.

Always consider the nature of your data when choosing between the mean and median.

When analyzing data, the choice between the mean and the median depends on the nature of the dataset. The mean is the arithmetic average and is suitable for roughly symmetric distributions without outliers. With skewed data or outliers, the median is usually a more representative measure of central tendency because extreme values barely move it. Income distributions, which are often right-skewed, are the classic example: a handful of very high earners pulls the mean up, while the median stays close to what a typical person earns. So the mean remains the common default mostly because of its simplicity and historical convention, even though the median is often more robust for skewed or unevenly distributed data.
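
A quick way to see the income effect is a toy example in Python. The figures below are invented purely for illustration, not real income data.

```python
import statistics

# Made-up annual incomes: nine ordinary earners and one very high earner.
incomes = [28_000, 31_000, 35_000, 38_000, 42_000,
           45_000, 52_000, 60_000, 75_000, 1_200_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # 160600
print(median_income)  # 43500.0

# The mean ends up higher than nine of the ten incomes,
# while the median sits in the middle of the ordinary earners.
below_mean = sum(1 for x in incomes if x < mean_income)
print(below_mean)     # 9
```

One outlier is enough to put the mean above 90% of the values here, which is why the median is the usual choice for reporting a "typical" income.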