Sup guys, there’s this thing that has bothered me since I started my data science journey.
I almost always use the median over the average for comparisons because I’ve learned over the years that most data is skewed and not evenly distributed. However, I feel like I’m the only person who does this, even though basic statistics teach us to use the median in such cases.
To get the average of a large array of items, all you need is a running sum and a count. The database can read each value once, add it to the total, and drop it from memory before moving on. For a median, you have to sort all the data to find the middle value, which takes far more processing time and memory. A few algorithms improve on this, but none of them scales as well as a simple running sum.
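To make that contrast concrete, here's a rough Python sketch (the skewed sample data is invented purely for illustration): the mean needs one pass and constant memory, while an exact median has to materialize and sort every value.

```python
import random

# Skewed (log-normal) sample data, standing in for a large table column.
values = [random.lognormvariate(0, 1) for _ in range(1_000_000)]

# Mean: one pass, O(1) extra memory.
total = 0.0
count = 0
for v in values:          # in practice this loop is the database scan
    total += v
    count += 1
mean = total / count

# Exact median: the whole column has to be held and sorted.
ordered = sorted(values)  # O(n log n) time, O(n) memory
mid = count // 2
median = ordered[mid] if count % 2 else (ordered[mid - 1] + ordered[mid]) / 2

print(f"mean={mean:.3f}  median={median:.3f}")
```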
If you are doing one-off computations, or you only have 100 or 1,000 data points, none of this matters. However, our data analysts ask us to generate reports from tables containing millions of records, complete with real-time dashboard updates.
So unless there’s a really good reason to use a median instead of a mean, we’ll go with the mean.
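For what it's worth, one of the "few algorithms" I was alluding to is a sampling-based approximation: keep a fixed-size random sample of the stream (reservoir sampling) and take the median of the sample. This is just a sketch of the trade-off, not something we actually run, and the sample size is an arbitrary assumption.

```python
import random

def approx_median(stream, sample_size=10_000, seed=42):
    """Approximate median via reservoir sampling (Algorithm R):
    hold a uniformly random, fixed-size sample of the stream and
    return its median, so memory stays bounded regardless of row count."""
    rng = random.Random(seed)
    reservoir = []
    for i, value in enumerate(stream):
        if i < sample_size:
            reservoir.append(value)
        else:
            j = rng.randint(0, i)      # replace an existing slot with prob k/(i+1)
            if j < sample_size:
                reservoir[j] = value
    reservoir.sort()
    n = len(reservoir)
    mid = n // 2
    return reservoir[mid] if n % 2 else (reservoir[mid - 1] + reservoir[mid]) / 2

# Example: a skewed stream of 5 million values, only ~10k of them ever in memory.
stream = (random.lognormvariate(0, 1) for _ in range(5_000_000))
print(approx_median(stream))
```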
The average, or mean, is commonly used because it takes into account all values in the dataset, providing a comprehensive measure.
It’s particularly useful in datasets where all values are relevant and contribute to the overall picture.
Despite its sensitivity to outliers and to skewed distributions, the mean is still favored for its ease of calculation and its foundational role in further statistical analyses, such as variance and standard deviation.
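On that foundational-role point, variance and standard deviation can also be computed alongside the mean in a single streaming pass (Welford's online algorithm). The sketch below is only an illustration of that idea, not anyone's production code.

```python
import math

def running_stats(stream):
    """Welford's online algorithm: one pass, constant memory,
    yielding mean, sample variance, and standard deviation together."""
    count = 0
    mean = 0.0
    m2 = 0.0          # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return mean, variance, math.sqrt(variance)

print(running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```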
However, in cases of extreme skewness, the median can indeed be a more representative measure of central tendency.
Always consider the nature of your data when choosing between the mean and median.