## Average and median via optimization

This is certainly not a new observation, and I’m sure it is folklore and/or contained in various textbooks, but it was amusing to derive it independently.

Suppose we have a finite set of real numbers $\{x_1, \dots, x_n\}$, and we want to pick a value $m$ which is somehow “in the middle” of the $x_i$. The punchline is that

- if we want to minimize the sum of the squared distances from $m$ to each $x_i$, we should pick $m$ to be the average of the $x_i$;
- if we want to minimize the sum of the absolute distances from $m$ to each $x_i$, we should pick $m$ to be the median of the $x_i$.

The first of these is tricky to understand intuitively but easy to derive; the second is intuitively straightforward but trying to derive it leads to an interesting twist.
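
Before deriving anything, here’s a quick numerical sanity check in Python (a sketch of my own; the data set and the grid resolution are arbitrary): brute-force the minimizer of each objective over a fine grid and compare with the mean and median.

```python
from statistics import mean, median

xs = [2.0, 3.0, 5.0, 11.0, 13.0]  # an arbitrary example data set

# Candidate values of m on a fine grid spanning the data.
grid = [i / 1000 for i in range(15001)]

best_sq  = min(grid, key=lambda m: sum((m - x) ** 2 for x in xs))
best_abs = min(grid, key=lambda m: sum(abs(m - x) for x in xs))

print(best_sq,  mean(xs))    # 6.8 6.8: the average
print(best_abs, median(xs))  # 5.0 5.0: the median
```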

## Average = minimizing sum of squared distances

Let’s not worry about why we would want to minimize the sum of squared distances; there are good reasons and it’s not the point. I don’t know about you, but I find it difficult to reason intuitively about how and why to pick $m$ to minimize this sum of squared distances. If you know of an intuitive way to explain this, I would love to hear about it! But in any case, it is easy to derive using some straightforward calculus.

Let $\displaystyle S(m) = \sum_i(m - x_i)^2$ denote the sum of squared distances from a given $m$ to each of the $x_i$. Taking the derivative of $S$ with respect to $m$, we find

$\displaystyle \frac{d}{dm} S(m) = \sum_i 2(m - x_i)$.

Setting the derivative equal to zero, we can first divide through by the factor of 2, yielding

$\displaystyle 0 = \sum_i (m - x_i)$

Since $m$ does not depend on $i$, this is just $n$ copies of $m$ less the sum of the $x_i$, that is, $0 = nm - \sum_i x_i$. Hence, solving for $m$ yields

$\displaystyle m = \frac{1}{n} \sum_i x_i$

as expected: the value of $m$ which minimizes the sum of squared distances to the $x_i$ is their average, that is, the sum of the $x_i$ divided by the size of the set.
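
If you’d rather let a computer do the calculus, the same derivation goes through symbolically; here’s a sketch using sympy (the five symbolic data points are just for illustration).

```python
import sympy as sp

m = sp.symbols('m')
xs = sp.symbols('x1:6')  # five symbolic data points x1, ..., x5

S = sum((m - x) ** 2 for x in xs)  # sum of squared distances
print(sp.solve(sp.diff(S, m), m))  # set dS/dm = 0 and solve for m
# [x1/5 + x2/5 + x3/5 + x4/5 + x5/5], i.e. the average
```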

## Median = minimizing sum of absolute distances

Now suppose we want to minimize the sum of absolute distances instead, that is,

$S(m) = \sum_i |m - x_i|$

In this scenario, it is much easier to reason out the correct answer. Start with some arbitrary $m$, and imagine nudging it by some small amount $\Delta x$, say, to the right. $m$’s distances to any points on its left will each increase by $\Delta x$, and its distances to any points on its right will each decrease by the same amount. Therefore, if there are more $x_i$ to the left of $m$, then the overall sum of distances will increase; if there are more $x_i$ to the right, then the overall sum will decrease. So, to find $m$ which minimizes the sum of absolute distances, we want the same number of $x_i$ on the left and the right, that is, we want the median. Note that if $n$ is odd, then we must pick $m$ to be exactly equal to the $x_i$ in the middle; if $n$ is even, then we can pick $m$ to be anywhere inside the interval between the middle two $x_i$.
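
Here’s a tiny sketch of that last point, with an arbitrary even-$n$ data set of my choosing: the sum of absolute distances is flat everywhere between the two middle points and rises as soon as we step outside that interval.

```python
xs = [1.0, 2.0, 6.0, 9.0]  # even n: the two middle points are 2 and 6

def S(m):
    return sum(abs(m - x) for x in xs)

print(S(2.0), S(3.5), S(6.0))  # 12.0 12.0 12.0: flat on [2, 6]
print(S(1.9), S(6.1))          # both approximately 12.2: larger outside
```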

Just for fun, can we derive this answer using calculus, like we did for minimizing squared differences? There is a wrinkle, of course, which is that the absolute value function is not differentiable everywhere: it has a sharp corner at zero. But we won’t let that stop us! Clearly the derivative of $|x|$ is $-1$ when $x < 0$ and $1$ when $x > 0$. So it seems reasonable to just assign the derivative a value of $0$ at $x = 0$. Algebraically, we can define

$\displaystyle \frac{d}{dx} |x| = [x > 0] - [x < 0]$

where $[P]$ is equal to $1$ when the proposition $P$ is true, and $0$ when it is false (this notation is called the Iverson bracket). So when $x > 0$ we get $[x > 0] - [x < 0] = 1 - 0 = 1$; when $x < 0$ we get $0 - 1 = -1$; and when $x = 0$ both propositions are false so we get $0 - 0 = 0$.

Armed with this definition, we can differentiate $S$ with respect to $m$:

$\displaystyle \frac{d}{dm} S(m) = \frac{d}{dm} \sum_i |m - x_i| = \sum_i \left( [m > x_i] - [m < x_i] \right) = \sum_i [m > x_i] - \sum_i [m < x_i]$

Clearly, this is zero when $\displaystyle \sum_i [m > x_i] = \sum_i [m < x_i]$, that is, when there are the same number of $x_i$ on either side of $m$.
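
Transcribing this derivative into code makes the counting explicit. Here’s a sketch; conveniently, Python treats booleans as $0$ and $1$ in arithmetic, which is exactly the Iverson bracket.

```python
xs = [2.0, 3.0, 5.0, 11.0, 13.0]  # odd n: the median is 5.0

def dS(m):
    # A sum of comparisons is a sum of Iverson brackets, since Python
    # booleans act as 0 and 1 under addition.
    return sum(m > x for x in xs) - sum(m < x for x in xs)

print([dS(m) for m in (1.0, 4.0, 5.0, 7.0, 14.0)])  # [-5, -1, 0, 1, 5]
```

The derivative is negative to the left of the median, positive to the right, and zero exactly at the median, just as the nudging argument predicts.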

The curious thing to me is that even though the derivative of $|x|$ is undefined when $x = 0$, it seems like it “wants” to be 0 here. In general, if we assign the value $k$ to the derivative at $x = 0$, the derivative of $S$ becomes

$\displaystyle \frac{d}{dm} S(m) = \sum_i [m > x_i] + k \sum_i [m = x_i] - \sum_i [m < x_i]$

When $k$ is nonzero (at least for $0 < |k| \le 1$) and $n$ is odd, there are no values of $m$ at all for which this derivative is zero: away from the $x_i$, the two counts sum to the odd number $n$, so their difference is odd and hence nonzero; at any particular $x_i$, the remaining counts sum to the even number $n - 1$, so the derivative is an even number plus $k$, again nonzero. This makes it more difficult to find the minimum.
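
Extending the previous sketch with a parameter $k$ shows this concretely (same arbitrary odd-$n$ data set as before):

```python
xs = [2.0, 3.0, 5.0, 11.0, 13.0]  # odd n: the median is 5.0

def dS(m, k):
    # k plays the role of the derivative of |x| assigned at x = 0.
    return (sum(m > x for x in xs)
            + k * sum(m == x for x in xs)
            - sum(m < x for x in xs))

print([dS(m, 0) for m in (4.0, 5.0, 7.0)])  # [-1, 0, 1]: zero at the median
print([dS(m, 1) for m in (4.0, 5.0, 7.0)])  # [-1, 1, 1]: the zero is gone
```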

Your $[x>0] - [x<0]$ is known as the _sign function_ or _signum function_: https://en.wikipedia.org/wiki/Sign_function . I’d hoped Wikipedia would have some explanation of why $\mathrm{sgn}(0)$, considered as the derivative of the absolute value function, “ought” to be $0$, but I didn’t see anything compelling.

Still, I wonder if there’s some theorem about how, when $f$ is discontinuous at $x$, the function $\hat f(x) = \frac12 \left( \lim_{t\to x^-} f(t) + \lim_{t\to x^+} f(t) \right)$ is somehow well-behaved. For example, isn’t it the case that a Fourier series that converges to $f$ everywhere except at the discontinuities must actually converge to $\hat f$ everywhere?