Improve hybrid transform 4x4 DCT_DCT SSE4.1 optimization

- Implemented Angie's new forward transform (fwd txfm) algorithm.
- About 100% faster than the previous 64-bit version and 3 times faster
  than the original C code.
- Passed bit-exact unit test.

Change-Id: Ica30b9768706604a6d69fe42da778441f0f5f02e
1 file changed