Write shorter matrix_coef_delta if possible

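I believe this optimization is the reason why the decoder reconstructs
the matrix coefficient as (prev + delta + 256) % 256 rather than the
more obvious prev + delta. Because the reconstruction is modulo 256,
delta, delta + 256, and delta - 256 all decode to the same coefficient.
Take advantage of that by writing the equivalent value in [-128, 127],
which is likely to have the shortest svlc() code, even though delta is
unlikely to be < -128 or >= 128.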
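
The equivalence can be checked with a minimal standalone sketch. It is
not part of the patch. The names QM_VALS, wrap_delta, reconstruct, and
check_round_trip are hypothetical, QM_VALS is assumed to equal
NUM_QM_VALS (256), and coefficients are assumed to lie in [0, 255].

```c
#include <stdint.h>

#define QM_VALS 256 /* assumed to equal NUM_QM_VALS */

/* Hypothetical helper: map delta to the equivalent value (modulo
 * QM_VALS) in [-QM_VALS / 2, QM_VALS / 2 - 1], as the encoder change does. */
static int16_t wrap_delta(int16_t delta) {
  if (delta < -(QM_VALS / 2)) {
    delta += QM_VALS;
  } else if (delta >= QM_VALS / 2) {
    delta -= QM_VALS;
  }
  return delta;
}

/* Decoder-side reconstruction as described in the commit message. */
static int16_t reconstruct(int16_t prev, int16_t delta) {
  return (int16_t)((prev + delta + QM_VALS) % QM_VALS);
}

/* Returns 1 if every (prev, coeff) pair in [0, QM_VALS - 1] yields a
 * wrapped delta in range that the decoder maps back to coeff. */
static int check_round_trip(void) {
  for (int prev = 0; prev < QM_VALS; prev++) {
    for (int coeff = 0; coeff < QM_VALS; coeff++) {
      int16_t delta = wrap_delta((int16_t)(coeff - prev));
      if (delta < -(QM_VALS / 2) || delta >= QM_VALS / 2) return 0;
      if (reconstruct((int16_t)prev, delta) != coeff) return 0;
    }
  }
  return 1;
}
```

For example, prev = 250 and coeff = 10 give delta = -240, which is
written as 16; the decoder then computes (250 + 16 + 256) % 256 = 10.
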
diff --git a/av1/encoder/bitstream.c b/av1/encoder/bitstream.c
index 3749eae..748e091 100644
--- a/av1/encoder/bitstream.c
+++ b/av1/encoder/bitstream.c
@@ -6051,7 +6051,21 @@
       int16_t prev = 32;
       for (int i = 0; i < tx_size_2d[tsize]; i++) {
         int16_t coeff = mat[s->scan[i]];
-        aom_wb_write_svlc(wb, coeff - prev);
+        int16_t delta = coeff - prev;
+        // The decoder reconstructs the matrix coefficient by calculating
+        // (prev + delta + NUM_QM_VALS) % NUM_QM_VALS. Therefore delta,
+        // delta + NUM_QM_VALS, and delta - NUM_QM_VALS are all equivalent
+        // because they are equal modulo NUM_QM_VALS. If delta + NUM_QM_VALS or
+        // delta - NUM_QM_VALS has a smaller absolute value than delta, it is
+        // likely to have a shorter svlc() code, so we will write it instead.
+        // In other words, for each delta value, we aim to find an equivalent
+        // value (modulo NUM_QM_VALS) that has the shortest svlc() code.
+        if (delta < -(NUM_QM_VALS / 2)) {
+          delta += NUM_QM_VALS;
+        } else if (delta >= NUM_QM_VALS / 2) {
+          delta -= NUM_QM_VALS;
+        }
+        aom_wb_write_svlc(wb, delta);
         prev = coeff;
       }
     }