Add SSE4.1 highbitdepth self-guided filter

Performance is very similar to the lowbd path: the highbd path is
only 4-5% slower.

Change-Id: Ifdb272c3f6c0e6f41e7046cc49497c72b5a796d9
5 files changed