HBD hybrid transform 4x4 SSE4.1 optimization

- Optimize tx_type: DCT_DCT, DCT_ADST, ADST_DCT, and ADST_ADST.
- Overall encoder speed improves by ~4.5%-6%.
- Update the bit-exact unit test to compare against the current C version.

Change-Id: If751c030612245b1c2470200c9570cf40d655504
4 files changed