Upgrade fwht4x4_mmx() to fwht4x4_sse2() (from libvpx)

Cherry-pick af7fb17 ("Upgrade fwht4x4_mmx() to fwht4x4_sse2() for vp9
and vp10").

A function-level timing test shows a time saving of about 27% on a
Xeon E5-2680 v2 desktop.

Rename dct_sse2.c to dct_intrin_sse2.c to avoid duplicate basenames.

Change-Id: I2c504130099af8f0ccc07da0dacef2464197b0ac
6 files changed