Vectorize high-precision convolve filter

Add SSE2 lowbd and SSSE3 highbd versions of the filters
introduced in https://aomedia-review.googlesource.com/c/11962/ .

These filters run at about the same speed as the SSE2
implementations of the regular convolve filter, roughly 9x faster
than the C versions. The average time to filter a 64x64 block is:

lowbd C: 52us
lowbd SSE2: 5.6us
highbd C: 53us
highbd SSSE3: 5.8us

Also add a correctness test based on the warp filter tests.
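
For illustration only, a minimal sketch of the shape such a correctness
test usually takes: run a scalar reference and the optimized function on
the same random input and require bit-exact output. Every name here
(convolve_horiz_ref, convolve_horiz_opt, count_mismatches), the filter
taps, and the 7-bit rounding are assumptions made for the example, not
the code in this change. The "optimized" function is a plain-C stand-in
for a SIMD version.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAPS 8
#define FILTER_BITS 7 /* assumed: taps sum to 1 << FILTER_BITS */

static uint8_t clip_pixel(int v) {
  return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Round to nearest, then divide by 2^FILTER_BITS without shifting a
 * negative value. */
static int round_shift(int sum) {
  const int half = 1 << (FILTER_BITS - 1);
  const int v = sum + half;
  return v >= 0 ? (v >> FILTER_BITS) : -((-v + (1 << FILTER_BITS) - 1) >> FILTER_BITS);
}

/* Hypothetical scalar reference: 8-tap horizontal filter, 8-bit pixels.
 * Each source row must hold w + TAPS - 1 pixels. */
static void convolve_horiz_ref(const uint8_t *src, int src_stride,
                               uint8_t *dst, int dst_stride, int w, int h,
                               const int16_t *filter) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      int sum = 0;
      for (int k = 0; k < TAPS; ++k)
        sum += src[y * src_stride + x + k] * filter[k];
      dst[y * dst_stride + x] = clip_pixel(round_shift(sum));
    }
  }
}

/* Stand-in for the vectorized function under test. It accumulates taps
 * in pairs, loosely mirroring a multiply-add over 16-bit lanes. */
static void convolve_horiz_opt(const uint8_t *src, int src_stride,
                               uint8_t *dst, int dst_stride, int w, int h,
                               const int16_t *filter) {
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const uint8_t *s = src + y * src_stride + x;
      int sum = 0;
      for (int k = 0; k < TAPS; k += 2)
        sum += s[k] * filter[k] + s[k + 1] * filter[k + 1];
      dst[y * dst_stride + x] = clip_pixel(round_shift(sum));
    }
  }
}

/* Filters one random w x h block with both functions and returns the
 * number of output pixels that differ. */
static int count_mismatches(unsigned seed, int w, int h) {
  /* Illustrative taps only; they sum to 128. */
  static const int16_t filter[TAPS] = { -1, 3, -10, 72, 72, -10, 3, -1 };
  const int src_stride = w + TAPS - 1;
  uint8_t *src = malloc((size_t)src_stride * h);
  uint8_t *dst_ref = malloc((size_t)w * h);
  uint8_t *dst_opt = malloc((size_t)w * h);
  int mismatches = 0;
  assert(src && dst_ref && dst_opt);

  srand(seed);
  for (int i = 0; i < src_stride * h; ++i) src[i] = (uint8_t)(rand() & 0xff);

  convolve_horiz_ref(src, src_stride, dst_ref, w, w, h, filter);
  convolve_horiz_opt(src, src_stride, dst_opt, w, w, h, filter);
  for (int i = 0; i < w * h; ++i) mismatches += (dst_ref[i] != dst_opt[i]);

  free(src);
  free(dst_ref);
  free(dst_opt);
  return mismatches;
}
```

A test would call count_mismatches() over several seeds and block sizes
and fail on any nonzero result.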

Change-Id: Ia0d81100e8a414bbfc2b5f664d751cf24765299e