Average and median via optimization

This is certainly not a new observation, and I’m sure it is folklore and/or contained in various textbooks, but it was amusing to derive it independently.

Suppose we have a finite set of real numbers \{x_1, \dots, x_n\}, and we want to pick a value m which is somehow “in the middle” of the x_i. The punchline is that

  • if we want to minimize the sum of the squared distances from m to each x_i, we should pick m to be the average of the x_i;
  • if we want to minimize the sum of the absolute distances from m to each x_i, we should pick m to be the median of the x_i.

The first of these is tricky to understand intuitively but easy to derive; the second is intuitively straightforward but trying to derive it leads to an interesting twist.

Average = minimizing sum of squared distances

Let’s not worry about why we would want to minimize the sum of squared distances; there are good reasons and it’s not the point. I don’t know about you, but I find it difficult to reason intuitively about how and why to pick m to minimize this sum of squared differences. If you know of an intuitive way to explain this, I would love to hear about it! But in any case, it is easy to derive using some straightforward calculus.

Let \displaystyle S(m) = \sum_i(m - x_i)^2 denote the sum of squared distances from a given m to each of the x_i. Taking the derivative of S with respect to m, we find

\displaystyle \frac{d}{dm} S(m) = \sum_i 2(m - x_i).

Setting the derivative equal to zero, we can first divide through by the factor of 2, yielding

\displaystyle 0 = \sum_i (m - x_i)

Since m does not depend on i, this is just n copies of m less the sum of the x_i. Hence, solving for m yields

\displaystyle m = \frac{1}{n} \sum_i x_i

as expected: the value of m which minimizes the sum of squared distances to the x_i is their average, that is, the sum of the x_i divided by the size of the set.
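As a quick numerical sanity check of this derivation, here is a minimal Python sketch. The sample data and the helper names `sum_sq` and `mean` are made up for illustration. It confirms that the derivative vanishes at the average and that no nearby candidate does better. It also checks the standard identity S(m) - S(avg) = n (m - avg)^2, which is another way to see that the average is the unique minimizer.

```python
from fractions import Fraction

def sum_sq(m, xs):
    """Sum of squared distances S(m) from m to each x in xs."""
    return sum((m - x) ** 2 for x in xs)

def mean(xs):
    """Arithmetic mean: the sum of the xs divided by how many there are."""
    return sum(xs) / len(xs)

# Hypothetical sample data; Fractions keep all arithmetic exact.
xs = [Fraction(v) for v in (1, 2, 4, 9)]
avg = mean(xs)

# The derivative sum_i 2(m - x_i) evaluated at m = avg.
deriv_at_avg = sum(2 * (avg - x) for x in xs)

# Candidate values of m on a grid of width 1/10 around the average.
candidates = [avg + Fraction(d, 10) for d in range(-30, 31)]
best = min(candidates, key=lambda m: sum_sq(m, xs))

# S(m) exceeds S(avg) by exactly n * (m - avg)^2 for every candidate.
identity_holds = all(
    sum_sq(m, xs) - sum_sq(avg, xs) == len(xs) * (m - avg) ** 2
    for m in candidates
)

print(avg, deriv_at_avg, best, identity_holds)
```

Because the difference n (m - avg)^2 is never negative and is zero only at m = avg, the grid search has to land on the average.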

Median = minimizing sum of absolute distances

Now suppose we want to minimize the sum of absolute distances instead, that is,

\displaystyle S(m) = \sum_i |m - x_i|.

In this scenario, it is much easier to reason out the correct answer. Start with some arbitrary m, and imagine nudging it by some small amount \Delta x, say, to the right. m’s distances to any points on its left will each increase by \Delta x, and its distances to any points on its right will each decrease by the same amount. Therefore, if there are more x_i to the left of m, then the overall sum of distances will increase; if there are more x_i to the right, then the overall sum will decrease. So, to find m which minimizes the sum of absolute differences, we want the same number of x_i on the left and the right, that is, we want the median. Note that if n is odd, then we must pick m to be exactly equal to the x_i in the middle; if n is even, then we can pick m to be anywhere inside the interval between the middle two x_i.
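The nudging argument can be checked numerically. The following is a small Python sketch with made-up data; `sum_abs` is a hypothetical helper, while `median_low` and `median_high` come from the standard `statistics` module. It looks at one odd-sized and one even-sized set.

```python
from statistics import median_low, median_high

def sum_abs(m, xs):
    """Sum of absolute distances from m to each x in xs."""
    return sum(abs(m - x) for x in xs)

# Hypothetical sample data, one set of each parity.
odd = [1, 2, 4, 9, 20]
even = [1, 2, 4, 9]

# Odd n: the middle element beats its neighbours on either side.
m_odd = median_low(odd)
odd_sums = {m: sum_abs(m, odd) for m in (m_odd - 1, m_odd, m_odd + 1)}

# Even n: every m between the two middle elements gives the same sum.
lo, hi = median_low(even), median_high(even)
inside = {sum_abs(m, even) for m in (lo, 2.5, 3, 3.5, hi)}
outside = [sum_abs(lo - 1, even), sum_abs(hi + 1, even)]

print(odd_sums, inside, outside)
```

For the even-sized set, the sum is flat across the whole middle interval and strictly larger just outside it, which matches the observation that any point between the middle two x_i works.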

Just for fun, can we derive this answer using calculus, like we did for minimizing squared differences? There is a wrinkle, of course, which is that the absolute value function is not differentiable everywhere: it has a sharp corner at zero. But we won’t let that stop us! Clearly the derivative of |x| is -1 when x < 0 and 1 when x > 0. So it seems reasonable to just assign the derivative a value of 0 at x = 0. Algebraically, we can define

\displaystyle \frac{d}{dx} |x| = [x > 0] - [x < 0]

where [P] is equal to 1 when the proposition P is true, and 0 when it is false (this notation is called the Iverson bracket). So when x > 0 we get [x > 0] - [x < 0] = 1 - 0 = 1; when x < 0 we get 0 - 1 = -1; and when x = 0 both propositions are false so we get 0 - 0 = 0.

Armed with this definition, we can differentiate S with respect to m:

\displaystyle \frac{d}{dm} S(m) = \frac{d}{dm} \sum_i |m - x_i| = \sum_i [m > x_i] - \sum_i [m < x_i]

Clearly, this is zero when \displaystyle \sum_i [m > x_i] = \sum_i [m < x_i], that is, when there are the same number of x_i on either side of m.
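To make the Iverson-bracket derivative concrete, here is a minimal Python sketch. The function names `iverson`, `d_abs`, and `d_sum_abs` and the sample data are assumptions made for illustration. It evaluates the formula at a few values of m.

```python
def iverson(p):
    """Iverson bracket [P]: 1 if the proposition holds, 0 otherwise."""
    return 1 if p else 0

def d_abs(x):
    """The 'derivative' of |x| as defined above, with value 0 at x = 0."""
    return iverson(x > 0) - iverson(x < 0)

def d_sum_abs(m, xs):
    """Count of x_i to the left of m minus count of x_i to the right."""
    return sum(iverson(m > x) for x in xs) - sum(iverson(m < x) for x in xs)

# Hypothetical sample data with an odd number of points.
xs = [1, 2, 4, 9, 20]
derivs = {m: d_sum_abs(m, xs) for m in (0, 1, 3, 4, 5, 20, 25)}

print([d_abs(x) for x in (-3, 0, 3)])
print(derivs)
```

On this data the derivative climbs from -5 to 5 as m moves left to right, and the only probed value where it is zero is m = 4, the median.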

The curious thing to me is that even though the derivative of |x| is undefined when x = 0, it seems like it “wants” to be 0 here. In general, if we assign the value k to the derivative at x = 0, the derivative of S becomes

\displaystyle \frac{d}{dm} S(m) = \sum_i [m > x_i] + k \sum_i [m = x_i] - \sum_i [m < x_i]

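When k is nonzero with |k| \le 1 (that is, any other slope between the one-sided slopes -1 and 1), n is odd, and the x_i are distinct, there are no values of m for which this derivative is zero, making it more difficult to find the minimum.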
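A small experiment illustrates this. The Python sketch below uses made-up data and a hypothetical helper `d_sum_abs_k`. It evaluates the generalized derivative for k = 0, k = 1/2, and k = 1 on an odd-sized set of distinct points. Since the derivative is constant between consecutive x_i, probing each x_i plus one point in every gap (and one beyond each end) covers all possible m.

```python
from fractions import Fraction

def d_sum_abs_k(m, xs, k):
    """Derivative of the sum of |m - x_i| when d|x|/dx at 0 is taken to be k."""
    left = sum(1 for x in xs if x < m)    # terms with m > x_i
    at = sum(1 for x in xs if x == m)     # terms with m = x_i
    right = sum(1 for x in xs if x > m)   # terms with m < x_i
    return left + k * at - right

# Hypothetical sample data: five distinct points.
xs = [1, 2, 4, 9, 20]

# Every x_i, one point inside each gap, and one point beyond each end.
probes = sorted(set(xs) | {0, 1.5, 3, 6, 15, 25})

def zeros(k):
    """Probe points at which the derivative is exactly zero."""
    return [m for m in probes if d_sum_abs_k(m, xs, k) == 0]

print(zeros(0), zeros(Fraction(1, 2)), zeros(1))
```

With k = 0 the only zero is at the median, m = 4; with k = 1/2 or k = 1 the derivative jumps from negative to positive at the median without ever being zero.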


About Brent

Associate Professor of Computer Science at Hendrix College. Functional programmer, mathematician, teacher, pianist, follower of Jesus.
This entry was posted in math.

1 Response to Average and median via optimization

  1. Mark Dominus says:

    Your $[x>0] - [x<0]$ is known as the _sign function_ or _signum function_: https://en.wikipedia.org/wiki/Sign_function . I’d hoped Wikipedia would have some explanation of why $sgn(0)$, considered as the derivative of the absolute value function, “ought” to be $0$, but I didn’t see anything compelling.

    Still, I wonder if there’s some theorem about how, when $f$ is discontinuous at $x$, the function $ \hat f(x) = \frac12 \left(\lim_{t\to x^-} f(t)+ \lim_{t\to x^+} f(t) \right)$ is somehow well-behaved. For example, isn’t it the case that a Fourier series that converges to $f$ everywhere except at the discontinuities must actually converge to $\hat f$ everywhere?
