HBD hybrid transform 4x4 SSE4.1 optimization

- Optimization on tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Overall encoder speed improves ~4.5%-6%.
- Update bit-exact unit test against current C version.

Change-Id: If751c030612245b1c2470200c9570cf40d655504