Add subtract_block SSE2 version and unit test.

3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
9 files changed