intrapred/d135: flatten border results before storing

The results along the top and left borders are first computed into a
single flat vector; each output row is then stored from a moving window
into that vector.
~40-67% faster on ARM and ~40-77+% faster on x86, depending on the
block size.
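
For illustration, the idea can be sketched in plain C as below. This is a
minimal sketch, not the code in this change: the names d135_flat and
d135_naive, the AVG3 macro, and the 4..32 block-size range are assumptions
made for the example. It relies on the fact that a D135 prediction is
constant along each down-right diagonal, so the whole block is determined
by its left column and top row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 3-tap smoothing filter with rounding: (a + 2b + c + 2) / 4. */
#define AVG3(a, b, c) (((a) + 2 * (b) + (c) + 2) >> 2)

/* Reference (hypothetical name): filter the top row and left column in
 * place, then copy every other pixel from its up-left neighbour. */
static void d135_naive(uint8_t *dst, ptrdiff_t stride, int bs,
                       const uint8_t *above, const uint8_t *left) {
  int r, c;
  dst[0] = AVG3(left[0], above[-1], above[0]);
  for (c = 1; c < bs; ++c) dst[c] = AVG3(above[c - 2], above[c - 1], above[c]);
  dst[stride] = AVG3(above[-1], left[0], left[1]);
  for (r = 2; r < bs; ++r)
    dst[r * stride] = AVG3(left[r - 2], left[r - 1], left[r]);
  for (r = 1; r < bs; ++r)
    for (c = 1; c < bs; ++c)
      dst[r * stride + c] = dst[(r - 1) * stride + c - 1];
}

/* Flattened variant (hypothetical name): compute the border once, from
 * bottom-left to top-right, then store each row as a window into it.
 * above[-1] (the top-left corner pixel) must be readable. */
static void d135_flat(uint8_t *dst, ptrdiff_t stride, int bs,
                      const uint8_t *above, const uint8_t *left) {
  uint8_t border[32 + 32 - 1];
  int i;
  assert(bs >= 4 && bs <= 32); /* assumed block-size range */

  /* Left column, stored bottom to top. */
  for (i = 0; i < bs - 2; ++i)
    border[i] = AVG3(left[bs - 3 - i], left[bs - 2 - i], left[bs - 1 - i]);
  border[bs - 2] = AVG3(above[-1], left[0], left[1]);
  /* Top-left corner, then the top row left to right. */
  border[bs - 1] = AVG3(left[0], above[-1], above[0]);
  border[bs] = AVG3(above[-1], above[0], above[1]);
  for (i = 0; i < bs - 2; ++i)
    border[bs + 1 + i] = AVG3(above[i], above[i + 1], above[i + 2]);

  /* Row i starts one element earlier in the border than row i - 1. */
  for (i = 0; i < bs; ++i)
    memcpy(dst + i * stride, border + bs - 1 - i, (size_t)bs);
}
```

Each row becomes a single contiguous copy of bs bytes, which is the part
that maps naturally onto vector loads, shifts, and stores.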

Change-Id: Iab369aa2946a3ae4eb7290d512868fe5db92dbc8
1 file changed