Add SSE2 vectorized warp filter for lowbd

End-to-end speed improvements: (measured on tempete_cif.y4m,
20 frames for encoder and all 260 frames for decoder)

* GLOBAL_MOTION encoder: ~10% faster
* GLOBAL_MOTION decoder: 100-200% faster depending on bitrate
* WARPED_MOTION encoder: ~2.5% faster
* WARPED_MOTION decoder: ~20-40% faster depending on bitrate

The improvement in the GLOBAL_MOTION decoder is particularly
large because its runtime is dominated by calls to warp_plane().

This introduces minor changes to the output of the warp filter,
but these should be rare.

Change-Id: I5813ab9e90311e27587045153c32d400b6b9eb92
diff --git a/av1/av1_common.mk b/av1/av1_common.mk
index 04dd817..abc812a 100644
--- a/av1/av1_common.mk
+++ b/av1/av1_common.mk
@@ -157,4 +157,8 @@
 AV1_COMMON_SRCS-$(HAVE_SSE4_1) += common/x86/filterintra_sse4.c
 endif
 
+ifneq ($(findstring yes,$(CONFIG_GLOBAL_MOTION) $(CONFIG_WARPED_MOTION)),)
+AV1_COMMON_SRCS-$(HAVE_SSE2) += common/x86/warp_plane_sse2.c
+endif
+
 $(eval $(call rtcd_h_template,av1_rtcd,av1/common/av1_rtcd_defs.pl))