ext-inter: Vectorize new masked SAD/SSE functions

One would expect these new functions to be slower than the old
masked SAD/SSE functions, as they do additional work (blending
two inputs and comparing the result to a third, rather than
just comparing two inputs).
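
For illustration, the extra work can be sketched as a scalar
reference. This is a minimal sketch, not the code in this change:
the function name, the argument order and the assumption that the
mask holds 6-bit weights in the range [0, 64] are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar sketch of a masked SAD: blend the two
 * predictors a and b with a per-pixel mask, then accumulate the
 * absolute difference against src.
 * Assumption: mask values lie in [0, 64], where 64 selects a only. */
static unsigned int masked_sad_sketch(const uint8_t *src, int src_stride,
                                      const uint8_t *a, int a_stride,
                                      const uint8_t *b, int b_stride,
                                      const uint8_t *mask, int mask_stride,
                                      int width, int height) {
  unsigned int sad = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      const int m = mask[x];
      /* Rounded weighted average of the two predictors. */
      const int pred = (m * a[x] + (64 - m) * b[x] + 32) >> 6;
      sad += (unsigned int)abs(pred - src[x]);
    }
    src += src_stride;
    a += a_stride;
    b += b_stride;
    mask += mask_stride;
  }
  return sad;
}
```

An unmasked SAD would skip the blend and compare src against a
single predictor, which is why the masked form costs more per pixel.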

This is true for the SAD functions, which are about 50% slower
(depending on block size and bit depth). However, the sub-pixel
SSE functions match the old speed in the accelerated special
cases (xoffset or yoffset = 0 or 4), and are 40-90% faster in
the generic case.

Change-Id: I1a296ed8fc9e3edc313a6add516ff76b17cd3e9f
6 files changed