Vectorize corner matching function

Add an SSE4 version of compute_cross_correlation() from
corner_match.c. The SSE4 version runs about 3.4x as fast as
the scalar code; determine_correspondence() as a whole runs
about 2.5-3x as fast as before.

BUG=aomedia:487

Change-Id: I707b7cfd5c513c025d3ee7fb6a5f1fa335ecd495
diff --git a/av1/encoder/corner_match.h b/av1/encoder/corner_match.h
index c045864..3b16f9e 100644
--- a/av1/encoder/corner_match.h
+++ b/av1/encoder/corner_match.h
@@ -15,6 +15,10 @@
 #include <stdlib.h>
 #include <memory.h>
 
+#define MATCH_SZ 13
+#define MATCH_SZ_BY2 ((MATCH_SZ - 1) / 2)
+#define MATCH_SZ_SQ (MATCH_SZ * MATCH_SZ)
+
 typedef struct {
   int x, y;
   int rx, ry;