Implement shorter-tap first in convolve_round

The performance change is 0.004% on lowres.

Change-Id: If3702ba6377ac42997e7d49b8959ff16fb182daa
1 file changed