Fix buffer overrun in dist_wtd_convolve_2d_horiz_neon

When introducing the 6-tap specialization for
dist_wtd_convolve_2d_vert_neon[1], we attempted to reduce the number
of rows processed in the horizontal convolution (by 2) if the
subsequent vertical convolution would be using a 6-tap filter instead
of an 8-tap filter. This logic was faulty: the horizontal convolution
processes 4 rows of data per iteration, but the number of rows to be
processed is not necessarily a multiple of 4, so we ended up
accessing memory outside of the (padded) source buffer.

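This patch restores the previous row count and src pointer starting
position for dist_wtd_convolve_2d_horiz_neon, both derived from
filter_params_y->taps again. The 6-tap vertical convolution now skips
the first row of the intermediate block instead.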

[1] Commit hash: be1d80024928684b1c9eebc648ed92d8ea70d166

Bug: aomedia:3367
Change-Id: Id11d4ae7c123957d3336b61f4d4cc8da09131b68
diff --git a/av1/common/arm/jnt_convolve_neon.c b/av1/common/arm/jnt_convolve_neon.c
index 700dc54..36c8f9c 100644
--- a/av1/common/arm/jnt_convolve_neon.c
+++ b/av1/common/arm/jnt_convolve_neon.c
@@ -1163,9 +1163,9 @@
   const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
   const int clamped_y_taps = y_filter_taps < 6 ? 6 : y_filter_taps;
 
-  const int im_h = h + clamped_y_taps - 1;
+  const int im_h = h + filter_params_y->taps - 1;
   const int im_stride = MAX_SB_SIZE;
-  const int vert_offset = clamped_y_taps / 2 - 1;
+  const int vert_offset = filter_params_y->taps / 2 - 1;
   const int horiz_offset = filter_params_x->taps / 2 - 1;
   const int round_0 = conv_params->round_0 - 1;
   const uint8_t *src_ptr = src - vert_offset * src_stride - horiz_offset;
@@ -1182,9 +1182,10 @@
   dist_wtd_convolve_2d_horiz_neon(src_ptr, src_stride, im_block, im_stride,
                                   x_filter, im_h, w, round_0);
 
-  if (clamped_y_taps <= 6) {
-    dist_wtd_convolve_2d_vert_6tap_neon(im_block, im_stride, dst8, dst8_stride,
-                                        conv_params, y_filter, h, w);
+  if (clamped_y_taps == 6) {
+    dist_wtd_convolve_2d_vert_6tap_neon(im_block + im_stride, im_stride, dst8,
+                                        dst8_stride, conv_params, y_filter, h,
+                                        w);
   } else {
     dist_wtd_convolve_2d_vert_8tap_neon(im_block, im_stride, dst8, dst8_stride,
                                         conv_params, y_filter, h, w);