HBD hybrid transform 8x8 SSE4.1 optimization

- Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Update bit-exact unit test against current C version.
- HBD encoder speed improves ~3.8%.

Change-Id: Ie13925ba11214eef2b5326814940638507bf68ec
6 files changed