Optimize EXT_INTRA's filtered intra predictor (SSE4.1)

- Add unit tests to verify the bit-exact result.
- In the speed test, per-function speed (for each mode/tx_size)
  improves by about 23%-35%.
- On an E5-2680 with park_joy_1080p (10 frames, --kf-max-dist=1),
  encoding time drops by about 1%-2%.

Change-Id: Id89f313d44eea562c02e775a6253dc4df7e046a9
7 files changed