Improved encoder threading

Reduce the number of sync points by letting each thread
continue immediately with a new MB row.
This scales better across cores and improves performance by 5-20% on
multicore ARM.
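
A minimal sketch of the row-claiming idea, not the code in this change.
All names (row_worker, encode_frame_rows, next_row, cols_done) and the
sizes are hypothetical, and the sketch assumes each MB depends on the
MB above and to the right of it. Each thread claims the next free MB
row from a shared counter as soon as it finishes its current one, so
the only waiting left is the per-MB check against the row above.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical sizes, chosen only for illustration. */
#define MB_ROWS 8
#define MB_COLS 16
#define NUM_THREADS 4

static atomic_int next_row;            /* next unclaimed MB row */
static atomic_int cols_done[MB_ROWS];  /* MBs finished so far in each row */

static void *row_worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Claim a new row right away instead of waiting for the
         * other threads to finish theirs. */
        int row = atomic_fetch_add(&next_row, 1);
        if (row >= MB_ROWS) break;

        for (int col = 0; col < MB_COLS; ++col) {
            if (row > 0) {
                /* Assumed dependency: the MB above and to the right
                 * must be finished before this MB can start. */
                int need = (col + 2 < MB_COLS) ? col + 2 : MB_COLS;
                while (atomic_load(&cols_done[row - 1]) < need)
                    sched_yield();
            }
            /* A real encoder would encode MB (row, col) here. */
            atomic_store(&cols_done[row], col + 1);
        }
    }
    return NULL;
}

/* Runs the workers over one frame; returns the number of rows that
 * were fully processed. */
int encode_frame_rows(void) {
    pthread_t threads[NUM_THREADS];

    atomic_store(&next_row, 0);
    for (int r = 0; r < MB_ROWS; ++r) atomic_store(&cols_done[r], 0);

    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, row_worker, NULL);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);

    int complete = 0;
    for (int r = 0; r < MB_ROWS; ++r)
        if (atomic_load(&cols_done[r]) == MB_COLS) ++complete;
    return complete;
}
```

Rows are claimed in increasing order, so the row a thread waits on has
always been claimed already and keeps making progress. To try the
sketch, call encode_frame_rows() from a small main() and build with
cc -std=c11 -pthread.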

Change-Id: Ic97e4d1c4886a842c85dd3539a93cb217188ed1b
4 files changed