Add SSE4.1 highbitdepth self-guided filter

Performance is very similar to the lowbd path (only 4-5% slower).

Change-Id: Ifdb272c3f6c0e6f41e7046cc49497c72b5a796d9