Implemented DST 8x8 with SSE2 intrinsics.

Implemented fdst8_sse2() function against C version: fdst8().
Added seven DST related hybrid transform types in vp10_fht8x8_sse2().
Replaced vp10_fht8x8_c() with vp10_fht8x8_sse2() in fwd_txfm_8x8().
Speedup: 18.1%, 11.5%, 22.0% based on speed test from
city_cif.y4m, garden_sif.y4m, mobile_cif.y4m.

Change-Id: Ia4aa1ea44c7a33e494f64ce843037f8703f975e3
2 files changed