Merge tag 'v3.13.0' into main

Release v3.13.0 Opaline

2025-09-02 v3.13.0
  This release is ABI compatible with the last release.

  The aom_roi_map_t struct, used only by the codec control
  AOME_SET_ROI_MAP, was modified in this release. Since AOME_SET_ROI_MAP
  was unimplemented, aom_roi_map_t was effectively an unused struct.
  Therefore aom_roi_map_t should be considered a new struct added in this
  release, and this change does not break ABI compatibility.

  - New Features
    * New tuning mode AOM_TUNE_SSIMULACRA2 for the AOME_SET_TUNING codec
      control (--tune=ssimulacra2) in all-intra mode. The feature
      detection macro AOM_HAVE_TUNE_SSIMULACRA2, if defined, indicates
      that AOM_TUNE_SSIMULACRA2 is available. AOM_TUNE_SSIMULACRA2 was
      developed to maximize SSIMULACRA 2 scores. (A usage sketch for the
      new tuning mode and codec controls follows this list.)
    * New codec control AV1E_SET_SCREEN_CONTENT_DETECTION_MODE
      (--screen-detection-mode).
      This codec control selects between two screen content detection modes:
       * Mode 1: standard (default)
       * Mode 2: anti-aliased text and graphics aware
    * New codec control AV1E_SET_ENABLE_ADAPTIVE_SHARPNESS
      (--enable-adaptive-sharpness). When enabled, it modulates sharpness
      based on frame QP, which helps mitigate blocking artifacts in the
      low to medium quality range.
    * Added low complexity decode mode for 720p vertical videos.
    * ROI feature implemented for RTC, supporting delta QP, skip encoding,
      and reference selection.
    * External scaling feature for SVC: allow downscaled images to be
      passed into the encoder for spatial layers without reconfiguring it.
    * Allow per-frame calculation of PSNR (contribution from Meta).
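
    A minimal usage sketch for the new tuning mode and codec controls above
    (not part of the release notes; the mode/on-off values shown are
    illustrative assumptions, and error handling is omitted):

      #include "aom/aomcx.h"

      static void apply_v3_13_controls(aom_codec_ctx_t *codec) {
      #if defined(AOM_HAVE_TUNE_SSIMULACRA2)
        // Equivalent of --tune=ssimulacra2 (all-intra mode).
        aom_codec_control(codec, AOME_SET_TUNING, AOM_TUNE_SSIMULACRA2);
      #endif
        // Equivalent of --screen-detection-mode=2 (anti-aliased text and
        // graphics aware; mode 1 is the standard default).
        aom_codec_control(codec, AV1E_SET_SCREEN_CONTENT_DETECTION_MODE, 2);
        // Equivalent of --enable-adaptive-sharpness=1 (assumed on/off value).
        aom_codec_control(codec, AV1E_SET_ENABLE_ADAPTIVE_SHARPNESS, 1);
      }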

  - Compression Efficiency Improvements
    * Variance Boost is now enabled for tuning modes AOM_TUNE_IQ and
      AOM_TUNE_SSIMULACRA2 at speeds 8 and 9 (2-5% SSIMULACRA 2 BD-Rate
      gains)
    * Several quality/time tradeoff improvements and bug fixes for all
      intra mode speeds 8 and 9.
      * Up to 6.9% SSIMULACRA 2 BD-Rate gains for speed 8
      * Up to 2.2% SSIMULACRA 2 BD-Rate gains for speed 9

  - Perceptual Quality Improvements
    * RTC: Visual quality improvements for screen content mode.
    * RTC: Visual quality improvements for video mode for resolutions >=
      720p.

  - Speedups
    * Optimize intraBC search for better speed/efficiency tradeoffs for
      all intra mode speeds >= 1
    * Optimize intraBC block hashing process
    * RTC Screen: speed feature added to speed 12 for ~2x speedup on
      slide/scene changes, for resolutions >= 720p.
    * ML-based speedup of partition pruning for speeds <= 2

  - Other Improvements
    * Fixes for RPS (reference picture selection) for RTC, based on the
      av1-discuss issue:
      https://groups.google.com/a/aomedia.org/g/av1-discuss/c/sqFad980SsA

  - Bug Fixes
    * b:421196988: all intra speed 8: overuse of palette mode
      unnecessarily inflating file sizes
    * b:423804955: Improve quality for 4K Screencast
    * webrtc:388070060: Allow per-frame calculation of PSNR
    * b:433046392, b:432035817: Fix to SVC crash triggered with Jitsi
      video conference app.
    * b:419622699: Fix integer overflow in update_buffer_level
    * b:407813259: Fix to update seq_params for number of layers change
    * b:400885218: External scaling for AV1
    * b:391849810: High AV1 frame encode time on slide changes
    * b:399575647: Too aggressive QP backoff at scene changes
    * b:383306740: Quality degradation at horizontal scrolling

Bug: 441135035
Change-Id: I205cbdbe1bdcfae8e1dbb431c72ce9ae82cb7c61
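
The aom/aom_image.h change below adds layer-specific metadata insert flags.
As a minimal sketch (not part of this change; the ITU-T T.35 payload is an
illustrative assumption), layer-specific metadata could be attached on the
encode side like this:

  #include "aom/aom_image.h"

  static int add_layer_specific_t35(aom_image_t *img, const uint8_t *payload,
                                    size_t payload_size) {
    // With multiple spatial/temporal layers, this metadata OBU applies only
    // to the layer of the frame it is attached to. Returns 0 on success and
    // -1 on failure, e.g. for metadata types that may not be layer-specific
    // (OBU_METADATA_TYPE_SCALABILITY, OBU_METADATA_TYPE_TIMECODE).
    return aom_img_add_metadata(img, OBU_METADATA_TYPE_ITUT_T35, payload,
                                payload_size,
                                AOM_MIF_ANY_FRAME_LAYER_SPECIFIC);
  }
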
diff --git a/aom/aom_image.h b/aom/aom_image.h
index 248b5b6..c4ad19a 100644
--- a/aom/aom_image.h
+++ b/aom/aom_image.h
@@ -154,17 +154,33 @@
  *
  * While encoding, when metadata is added to an aom_image via
  * aom_img_add_metadata(), the flag passed along with the metadata will
- * determine where the metadata OBU will be placed in the encoded OBU stream.
- * Metadata will be emitted into the output stream within the next temporal unit
- * if it satisfies the specified insertion flag.
+ * determine where the metadata OBU will be placed in the encoded OBU stream,
+ * and whether it's layer-specific. Metadata will be emitted into the output
+ * stream within the next temporal unit if it satisfies the specified insertion
+ * flag.
  *
- * During decoding, when the library encounters a metadata OBU, it is always
- * flagged as AOM_MIF_ANY_FRAME and emitted with the next output aom_image.
+ * If the video contains multiple spatial and/or temporal layers,
+ * a layer-specific metadata OBU only applies to the current frame's layer, as
+ * determined by the frame's temporal_id and spatial_id. Some metadata types
+ * cannot be layer-specific, as listed in Section 6.7.1 of the draft AV1
+ * specification as of 2025-03-06.
+ *
+ * During decoding, when the library encounters a metadata OBU, it is emitted
+ * with the next output aom_image. Its insert_flag is set to either
+ * AOM_MIF_ANY_FRAME, or AOM_MIF_ANY_FRAME_LAYER_SPECIFIC if the OBU contains an
+ * OBU header extension (i.e. the video contains multiple layers AND the
+ * metadata was added using a *_LAYER_SPECIFIC insert flag if using libaom).
  */
 typedef enum aom_metadata_insert_flags {
-  AOM_MIF_NON_KEY_FRAME = 0, /**< Adds metadata if it's not keyframe */
+  AOM_MIF_NON_KEY_FRAME = 0, /**< Adds metadata if it's not a keyframe */
   AOM_MIF_KEY_FRAME = 1,     /**< Adds metadata only if it's a keyframe */
-  AOM_MIF_ANY_FRAME = 2      /**< Adds metadata to any type of frame */
+  AOM_MIF_ANY_FRAME = 2,     /**< Adds metadata to any type of frame */
+  /** Adds layer-specific metadata if it's not a keyframe */
+  AOM_MIF_NON_KEY_FRAME_LAYER_SPECIFIC = 16,
+  /** Adds layer-specific metadata only if it's a keyframe */
+  AOM_MIF_KEY_FRAME_LAYER_SPECIFIC = 17,
+  /** Adds layer-specific metadata to any type of frame */
+  AOM_MIF_ANY_FRAME_LAYER_SPECIFIC = 18,
 } aom_metadata_insert_flags_t;
 
 /*!\brief Array of aom_metadata structs for an image. */
diff --git a/aom/internal/aom_image_internal.h b/aom/internal/aom_image_internal.h
index ef0f166..e11223a 100644
--- a/aom/internal/aom_image_internal.h
+++ b/aom/internal/aom_image_internal.h
@@ -29,6 +29,14 @@
   aom_metadata_t **metadata_array; /* Array of metadata structs */
 };
 
+/*! \brief Bit in aom_metadata_insert_flags marking metadata as layer-specific.
+ */
+#define AOM_MIF_LAYER_SPECIFIC 0x10
+/*! \brief Bits in aom_metadata_insert_flags used to signal which frames to
+ * add the metadata to (keyframes, non keyframes...).
+ */
+#define AOM_MIF_INSERT_LOCATION_MASK 0x0f
+
 /*!\brief Alloc memory for aom_metadata_array struct.
  *
  * Allocate memory for aom_metadata_array struct.
diff --git a/aom/src/aom_image.c b/aom/src/aom_image.c
index 497dff3..3522cda 100644
--- a/aom/src/aom_image.c
+++ b/aom/src/aom_image.c
@@ -18,6 +18,7 @@
 #include "aom/aom_integer.h"
 #include "aom/internal/aom_image_internal.h"
 #include "aom_mem/aom_mem.h"
+#include "aom/aom_codec.h"
 
 static inline unsigned int align_image_dimension(unsigned int d,
                                                  unsigned int subsampling,
@@ -383,6 +384,15 @@
     img->metadata = aom_img_metadata_array_alloc(0);
     if (!img->metadata) return -1;
   }
+  // Some metadata types are not allowed to be layer specific, according to
+  // the Table in Section 6.7.1 of the AV1 specification.
+  // Do not check for OBU_METADATA_TYPE_HDR_CLL or OBU_METADATA_TYPE_HDR_MDCV
+  // because there are plans to allow them to be layer specific.
+  if ((insert_flag & AOM_MIF_LAYER_SPECIFIC) &&
+      (type == OBU_METADATA_TYPE_SCALABILITY ||
+       type == OBU_METADATA_TYPE_TIMECODE)) {
+    return -1;
+  }
   aom_metadata_t *metadata =
       aom_img_metadata_alloc(type, data, sz, insert_flag);
   if (!metadata) return -1;
diff --git a/aom_dsp/arm/aom_convolve8_neon_dotprod.c b/aom_dsp/arm/aom_convolve8_neon_dotprod.c
index 7fc9cb1..6013a33 100644
--- a/aom_dsp/arm/aom_convolve8_neon_dotprod.c
+++ b/aom_dsp/arm/aom_convolve8_neon_dotprod.c
@@ -58,12 +58,13 @@
                                 vqtbl1q_s8(samples_128, permute_tbl.val[1]) };
 
   // Accumulate into 128 * FILTER_WEIGHT to account for range transform.
-  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT);
+  // (Divide by 2 since we halved the filter values.)
+  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT / 2);
   int32x4_t sum = vdotq_lane_s32(acc, perm_samples[0], filters, 0);
   sum = vdotq_lane_s32(sum, perm_samples[1], filters, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_h(const uint8x16_t samples,
@@ -82,7 +83,8 @@
                                 vqtbl1q_s8(samples_128, permute_tbl.val[2]) };
 
   // Accumulate into 128 * FILTER_WEIGHT to account for range transform.
-  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT);
+  // (Divide by 2 since we halved the filter values.)
+  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT / 2);
   // First 4 output values.
   int32x4_t sum0 = vdotq_lane_s32(acc, perm_samples[0], filters, 0);
   sum0 = vdotq_lane_s32(sum0, perm_samples[1], filters, 1);
@@ -91,14 +93,16 @@
   sum1 = vdotq_lane_s32(sum1, perm_samples[2], filters, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve8_horiz_8tap_neon_dotprod(
     const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
     ptrdiff_t dst_stride, const int16_t *filter_x, int w, int h) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(filter_x));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int8x8_t filter = vshrn_n_s16(vld1q_s16(filter_x), 1);
 
   if (w == 4) {
     const uint8x16x2_t perm_tbl = vld1q_u8_x2(kDotProdPermuteTbl);
@@ -110,8 +114,9 @@
       int16x4_t d1 = convolve8_4_h(s1, filter, perm_tbl);
       int16x4_t d2 = convolve8_4_h(s2, filter, perm_tbl);
       int16x4_t d3 = convolve8_4_h(s3, filter, perm_tbl);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -284,69 +289,19 @@
   }
 }
 
-static inline void transpose_concat_4x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
-                                        int8x8_t a3, int8x16_t *b) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, XX, XX, XX, XX
-  // a1: 10, 11, 12, 13, XX, XX, XX, XX
-  // a2: 20, 21, 22, 23, XX, XX, XX, XX
-  // a3: 30, 31, 32, 33, XX, XX, XX, XX
-  //
-  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-
-  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
-  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
-  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
-  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
-
-  int8x16_t a01 = vzipq_s8(a0q, a1q).val[0];
-  int8x16_t a23 = vzipq_s8(a2q, a3q).val[0];
-
-  int16x8_t a0123 =
-      vzipq_s16(vreinterpretq_s16_s8(a01), vreinterpretq_s16_s8(a23)).val[0];
-
-  *b = vreinterpretq_s8_s16(a0123);
-}
-
-static inline void transpose_concat_8x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
-                                        int8x8_t a3, int8x16_t *b0,
-                                        int8x16_t *b1) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, 04, 05, 06, 07
-  // a1: 10, 11, 12, 13, 14, 15, 16, 17
-  // a2: 20, 21, 22, 23, 24, 25, 26, 27
-  // a3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
-
-  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
-  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
-  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
-  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
-
-  int8x16_t a01 = vzipq_s8(a0q, a1q).val[0];
-  int8x16_t a23 = vzipq_s8(a2q, a3q).val[0];
-
-  int16x8x2_t a0123 =
-      vzipq_s16(vreinterpretq_s16_s8(a01), vreinterpretq_s16_s8(a23));
-
-  *b0 = vreinterpretq_s8_s16(a0123.val[0]);
-  *b1 = vreinterpretq_s8_s16(a0123.val[1]);
-}
-
 static inline int16x4_t convolve8_4_v(const int8x16_t samples_lo,
                                       const int8x16_t samples_hi,
                                       const int8x8_t filters) {
   // The sample range transform and permutation are performed by the caller.
 
   // Accumulate into 128 * FILTER_WEIGHT to account for range transform.
-  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT);
+  // (Divide by 2 since we halved the filter values.)
+  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT / 2);
   int32x4_t sum = vdotq_lane_s32(acc, samples_lo, filters, 0);
   sum = vdotq_lane_s32(sum, samples_hi, filters, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_v(const int8x16_t samples0_lo,
@@ -357,7 +312,8 @@
   // The sample range transform and permutation are performed by the caller.
 
   // Accumulate into 128 * FILTER_WEIGHT to account for range transform.
-  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT);
+  // (Divide by 2 since we halved the filter values.)
+  int32x4_t acc = vdupq_n_s32(128 * FILTER_WEIGHT / 2);
   // First 4 output values.
   int32x4_t sum0 = vdotq_lane_s32(acc, samples0_lo, filters, 0);
   sum0 = vdotq_lane_s32(sum0, samples0_hi, filters, 1);
@@ -366,14 +322,16 @@
   sum1 = vdotq_lane_s32(sum1, samples1_hi, filters, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve8_vert_8tap_neon_dotprod(
     const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
     ptrdiff_t dst_stride, const int16_t *filter_y, int w, int h) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(filter_y));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int8x8_t filter = vshrn_n_s16(vld1q_s16(filter_y), 1);
   const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
   int8x16x2_t samples_LUT;
 
@@ -394,10 +352,10 @@
     // This operation combines a conventional transpose and the sample permute
     // (see horizontal case) required before computing the dot product.
     int8x16_t s0123, s1234, s2345, s3456;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_s8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_s8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_s8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_s8_4x4(s3, s4, s5, s6, &s3456);
 
     do {
       uint8x8_t t7, t8, t9, t10;
@@ -409,7 +367,7 @@
       int8x8_t s10 = vreinterpret_s8_u8(vsub_u8(t10, vdup_n_u8(128)));
 
       int8x16_t s4567, s5678, s6789, s78910;
-      transpose_concat_4x4(s7, s8, s9, s10, &s78910);
+      transpose_concat_elems_s8_4x4(s7, s8, s9, s10, &s78910);
 
       // Merge new data into block from previous iteration.
       samples_LUT.val[0] = s3456;
@@ -422,8 +380,9 @@
       int16x4_t d1 = convolve8_4_v(s1234, s5678, filter);
       int16x4_t d2 = convolve8_4_v(s2345, s6789, filter);
       int16x4_t d3 = convolve8_4_v(s3456, s78910, filter);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -462,10 +421,10 @@
       // (see horizontal case) required before computing the dot product.
       int8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_s8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_s8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_s8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_s8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
 
       do {
         uint8x8_t t7, t8, t9, t10;
@@ -478,7 +437,7 @@
 
         int8x16_t s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo, s6789_hi,
             s78910_lo, s78910_hi;
-        transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
+        transpose_concat_elems_s8_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
 
         // Merge new data into block from previous iteration.
         samples_LUT.val[0] = s3456_lo;
diff --git a/aom_dsp/arm/aom_convolve8_neon_i8mm.c b/aom_dsp/arm/aom_convolve8_neon_i8mm.c
index 121e892..b0bb2fc 100644
--- a/aom_dsp/arm/aom_convolve8_neon_i8mm.c
+++ b/aom_dsp/arm/aom_convolve8_neon_i8mm.c
@@ -25,17 +25,24 @@
 #include "aom_dsp/arm/transpose_neon.h"
 #include "aom_ports/mem.h"
 
-DECLARE_ALIGNED(16, static const uint8_t, kMatMulPermuteTbl[32]) = {
+DECLARE_ALIGNED(16, static const uint8_t, kMatMul6PermuteTbl[32]) = {
   // clang-format off
   0,  1,  2,  3,  4,  5,  6,  7,  2,  3,  4,  5,  6,  7,  8,  9,
   4,  5,  6,  7,  8,  9, 10, 11,  6,  7,  8,  9, 10, 11, 12, 13
   // clang-format on
 };
 
-DECLARE_ALIGNED(16, static const uint8_t, kDotProdPermuteTbl[48]) = {
-  0, 1, 2,  3,  1, 2,  3,  4,  2,  3,  4,  5,  3,  4,  5,  6,
-  4, 5, 6,  7,  5, 6,  7,  8,  6,  7,  8,  9,  7,  8,  9,  10,
-  8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
+DECLARE_ALIGNED(16, static const uint8_t, kMatMul8PermuteTbl[32]) = {
+  // clang-format off
+  1,  2,  3,  4,  5,  6,  7,  8,  3,  4,  5,  6,  7,  8,  9, 10,
+  5,  6,  7,  8,  9, 10, 11, 12,  7,  8,  9, 10, 11, 12, 13, 14
+  // clang-format on
+};
+
+DECLARE_ALIGNED(16, static const uint8_t, kMatMul8FilterPermuteTbl[16]) = {
+  // clang-format off
+  1,  2,  3,  4,  5,  6,  7, 16, 16,  1,  2,  3,  4,  5,  6,  7
+  // clang-format on
 };
 
 DECLARE_ALIGNED(16, static const uint8_t, kDotProdMergeBlockTbl[48]) = {
@@ -48,64 +55,83 @@
 };
 
 static inline int16x4_t convolve8_4_h(const uint8x16_t samples,
-                                      const int8x8_t filters,
-                                      const uint8x16x2_t permute_tbl) {
-  // Permute samples ready for dot product.
-  // { 0,  1,  2,  3,  1,  2,  3,  4,  2,  3,  4,  5,  3,  4,  5,  6 }
-  // { 4,  5,  6,  7,  5,  6,  7,  8,  6,  7,  8,  9,  7,  8,  9, 10 }
-  uint8x16_t permuted_samples[2] = { vqtbl1q_u8(samples, permute_tbl.val[0]),
-                                     vqtbl1q_u8(samples, permute_tbl.val[1]) };
+                                      const int8x16_t filters,
+                                      const uint8x16_t permute_tbl) {
+  // Permute samples ready for matrix multiply.
+  // { 1,  2,  3,  4,  5,  6,  7,  8,  3,  4,  5,  6,  7,  8,  9, 10 }
+  uint8x16_t perm_samples = vqtbl1q_u8(samples, permute_tbl);
 
-  int32x4_t sum =
-      vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[0], filters, 0);
-  sum = vusdotq_lane_s32(sum, permuted_samples[1], filters, 1);
+  // These instructions multiply a 2x8 matrix (samples) by an 8x2 matrix
+  // (filter), destructively accumulating into the destination register.
+  int32x4_t sum = vusmmlaq_s32(vdupq_n_s32(0), perm_samples, filters);
 
-  // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  // Tap 0, as well as further narrowing and packing, is applied by the caller.
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_h(const uint8x16_t samples,
-                                      const int8x8_t filters,
-                                      const uint8x16x3_t permute_tbl) {
-  // Permute samples ready for dot product.
-  // { 0,  1,  2,  3,  1,  2,  3,  4,  2,  3,  4,  5,  3,  4,  5,  6 }
-  // { 4,  5,  6,  7,  5,  6,  7,  8,  6,  7,  8,  9,  7,  8,  9, 10 }
-  // { 8,  9, 10, 11,  9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 }
-  uint8x16_t permuted_samples[3] = { vqtbl1q_u8(samples, permute_tbl.val[0]),
-                                     vqtbl1q_u8(samples, permute_tbl.val[1]),
-                                     vqtbl1q_u8(samples, permute_tbl.val[2]) };
+                                      const int8x16_t filters,
+                                      const uint8x8_t f0,
+                                      const uint8x16x2_t permute_tbl) {
+  // Permute samples ready for matrix multiply.
+  // { 1,  2,  3,  4,  5,  6,  7,  8,  3,  4,  5,  6,  7,  8,  9, 10 }
+  // { 5,  6,  7,  8,  9, 10, 11, 12,  7,  8,  9, 10, 11, 12, 13, 14 }
+  uint8x16_t perm_samples[2] = { vqtbl1q_u8(samples, permute_tbl.val[0]),
+                                 vqtbl1q_u8(samples, permute_tbl.val[1]) };
 
-  // First 4 output values.
-  int32x4_t sum0 =
-      vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[0], filters, 0);
-  sum0 = vusdotq_lane_s32(sum0, permuted_samples[1], filters, 1);
-  // Second 4 output values.
-  int32x4_t sum1 =
-      vusdotq_lane_s32(vdupq_n_s32(0), permuted_samples[1], filters, 0);
-  sum1 = vusdotq_lane_s32(sum1, permuted_samples[2], filters, 1);
+  // These instructions multiply a 2x8 matrix (samples) by an 8x2 matrix
+  // (filter), destructively accumulating into the destination register.
+  int32x4_t sum0123 = vusmmlaq_s32(vdupq_n_s32(0), perm_samples[0], filters);
+  int32x4_t sum4567 = vusmmlaq_s32(vdupq_n_s32(0), perm_samples[1], filters);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0123), vmovn_s32(sum4567));
+  // Apply tap 0 and accumulate.
+  sum = vreinterpretq_s16_u16(
+      vmlsl_u8(vreinterpretq_u16_s16(sum), vget_low_u8(samples), f0));
+
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve8_horiz_8tap_neon_i8mm(
     const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
     ptrdiff_t dst_stride, const int16_t *filter_x, int w, int h) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(filter_x));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int8x8_t filter_s8 = vshrn_n_s16(vld1q_s16(filter_x), 1);
+  // Stagger the filter for use with the matrix multiply instructions.
+  // { f1, f2, f3, f4, f5, f6, f7, 0, 0, f1, f2, f3, f4, f5, f6, f7 }
+  const uint8x16_t filter_idx = vld1q_u8(kMatMul8FilterPermuteTbl);
+  const int8x16_t filter =
+      vqtbl1q_s8(vcombine_s8(filter_s8, vdup_n_s8(0)), filter_idx);
+
+  // Since f0 is always negative and samples are unsigned, subtract (unsigned)
+  // s0 * -f0 to avoid signed overflow.
+  const uint8x8_t f0 = vdup_n_u8(-filter_x[0] >> 1);
 
   if (w == 4) {
-    const uint8x16x2_t perm_tbl = vld1q_u8_x2(kDotProdPermuteTbl);
+    const uint8x16_t perm_tbl = vld1q_u8(kMatMul8PermuteTbl);
+
     do {
       uint8x16_t s0, s1, s2, s3;
       load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
+      uint8x8_t s01 = load_u8_4x2(src + 0 * src_stride, src_stride);
+      uint8x8_t s23 = load_u8_4x2(src + 2 * src_stride, src_stride);
 
-      int16x4_t d0 = convolve8_4_h(s0, filter, perm_tbl);
-      int16x4_t d1 = convolve8_4_h(s1, filter, perm_tbl);
-      int16x4_t d2 = convolve8_4_h(s2, filter, perm_tbl);
-      int16x4_t d3 = convolve8_4_h(s3, filter, perm_tbl);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      int16x4_t t0 = convolve8_4_h(s0, filter, perm_tbl);
+      int16x4_t t1 = convolve8_4_h(s1, filter, perm_tbl);
+      int16x4_t t2 = convolve8_4_h(s2, filter, perm_tbl);
+      int16x4_t t3 = convolve8_4_h(s3, filter, perm_tbl);
+      // Apply tap 0 and accumulate.
+      int16x8_t t01 = vcombine_s16(t0, t1);
+      int16x8_t t23 = vcombine_s16(t2, t3);
+      t01 =
+          vreinterpretq_s16_u16(vmlsl_u8(vreinterpretq_u16_s16(t01), s01, f0));
+      t23 =
+          vreinterpretq_s16_u16(vmlsl_u8(vreinterpretq_u16_s16(t23), s23, f0));
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(t01, FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(t23, FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -115,7 +141,7 @@
       h -= 4;
     } while (h > 0);
   } else {
-    const uint8x16x3_t perm_tbl = vld1q_u8_x3(kDotProdPermuteTbl);
+    const uint8x16x2_t perm_tbl = vld1q_u8_x2(kMatMul8PermuteTbl);
 
     do {
       int width = w;
@@ -125,10 +151,10 @@
         uint8x16_t s0, s1, s2, s3;
         load_u8_16x4(s, src_stride, &s0, &s1, &s2, &s3);
 
-        uint8x8_t d0 = convolve8_8_h(s0, filter, perm_tbl);
-        uint8x8_t d1 = convolve8_8_h(s1, filter, perm_tbl);
-        uint8x8_t d2 = convolve8_8_h(s2, filter, perm_tbl);
-        uint8x8_t d3 = convolve8_8_h(s3, filter, perm_tbl);
+        uint8x8_t d0 = convolve8_8_h(s0, filter, f0, perm_tbl);
+        uint8x8_t d1 = convolve8_8_h(s1, filter, f0, perm_tbl);
+        uint8x8_t d2 = convolve8_8_h(s2, filter, f0, perm_tbl);
+        uint8x8_t d3 = convolve8_8_h(s3, filter, f0, perm_tbl);
 
         store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
 
@@ -189,7 +215,7 @@
       vcombine_s8(vext_s8(x_filter, x_filter, 1), x_filter);
 
   if (width == 4) {
-    const uint8x16_t perm_tbl = vld1q_u8(kMatMulPermuteTbl);
+    const uint8x16_t perm_tbl = vld1q_u8(kMatMul6PermuteTbl);
     do {
       uint8x16_t s0, s1, s2, s3;
       load_u8_16x4(src, src_stride, &s0, &s1, &s2, &s3);
@@ -210,7 +236,7 @@
       height -= 4;
     } while (height > 0);
   } else {
-    const uint8x16x2_t perm_tbl = vld1q_u8_x2(kMatMulPermuteTbl);
+    const uint8x16x2_t perm_tbl = vld1q_u8_x2(kMatMul6PermuteTbl);
 
     do {
       int w = width;
@@ -266,58 +292,6 @@
   }
 }
 
-static inline void transpose_concat_4x4(uint8x8_t a0, uint8x8_t a1,
-                                        uint8x8_t a2, uint8x8_t a3,
-                                        uint8x16_t *b) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, XX, XX, XX, XX
-  // a1: 10, 11, 12, 13, XX, XX, XX, XX
-  // a2: 20, 21, 22, 23, XX, XX, XX, XX
-  // a3: 30, 31, 32, 33, XX, XX, XX, XX
-  //
-  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-
-  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
-  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
-  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
-  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
-
-  uint8x16_t a01 = vzipq_u8(a0q, a1q).val[0];
-  uint8x16_t a23 = vzipq_u8(a2q, a3q).val[0];
-
-  uint16x8_t a0123 =
-      vzipq_u16(vreinterpretq_u16_u8(a01), vreinterpretq_u16_u8(a23)).val[0];
-
-  *b = vreinterpretq_u8_u16(a0123);
-}
-
-static inline void transpose_concat_8x4(uint8x8_t a0, uint8x8_t a1,
-                                        uint8x8_t a2, uint8x8_t a3,
-                                        uint8x16_t *b0, uint8x16_t *b1) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, 04, 05, 06, 07
-  // a1: 10, 11, 12, 13, 14, 15, 16, 17
-  // a2: 20, 21, 22, 23, 24, 25, 26, 27
-  // a3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
-
-  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
-  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
-  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
-  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
-
-  uint8x16_t a01 = vzipq_u8(a0q, a1q).val[0];
-  uint8x16_t a23 = vzipq_u8(a2q, a3q).val[0];
-
-  uint16x8x2_t a0123 =
-      vzipq_u16(vreinterpretq_u16_u8(a01), vreinterpretq_u16_u8(a23));
-
-  *b0 = vreinterpretq_u8_u16(a0123.val[0]);
-  *b1 = vreinterpretq_u8_u16(a0123.val[1]);
-}
-
 static inline int16x4_t convolve8_4_v(const uint8x16_t samples_lo,
                                       const uint8x16_t samples_hi,
                                       const int8x8_t filters) {
@@ -326,7 +300,7 @@
   sum = vusdotq_lane_s32(sum, samples_hi, filters, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_v(const uint8x16_t samples0_lo,
@@ -344,14 +318,16 @@
   sum1 = vusdotq_lane_s32(sum1, samples1_hi, filters, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0), vqmovn_s32(sum1));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve8_vert_8tap_neon_i8mm(
     const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
     ptrdiff_t dst_stride, const int16_t *filter_y, int w, int h) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(filter_y));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int8x8_t filter = vshrn_n_s16(vld1q_s16(filter_y), 1);
   const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
   uint8x16x2_t samples_LUT;
 
@@ -361,19 +337,19 @@
     src += 7 * src_stride;
 
     // This operation combines a conventional transpose and the sample permute
-    // (see horizontal case) required before computing the dot product.
+    // required before computing the dot product.
     uint8x16_t s0123, s1234, s2345, s3456;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_u8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_u8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_u8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_u8_4x4(s3, s4, s5, s6, &s3456);
 
     do {
       uint8x8_t s7, s8, s9, s10;
       load_u8_8x4(src, src_stride, &s7, &s8, &s9, &s10);
 
       uint8x16_t s4567, s5678, s6789, s78910;
-      transpose_concat_4x4(s7, s8, s9, s10, &s78910);
+      transpose_concat_elems_u8_4x4(s7, s8, s9, s10, &s78910);
 
       // Merge new data into block from previous iteration.
       samples_LUT.val[0] = s3456;
@@ -386,8 +362,9 @@
       int16x4_t d1 = convolve8_4_v(s1234, s5678, filter);
       int16x4_t d2 = convolve8_4_v(s2345, s6789, filter);
       int16x4_t d3 = convolve8_4_v(s3456, s78910, filter);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -414,13 +391,13 @@
       s += 7 * src_stride;
 
       // This operation combines a conventional transpose and the sample permute
-      // (see horizontal case) required before computing the dot product.
+      // required before computing the dot product.
       uint8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_u8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_u8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_u8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_u8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
 
       do {
         uint8x8_t s7, s8, s9, s10;
@@ -428,7 +405,7 @@
 
         uint8x16_t s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo, s6789_hi,
             s78910_lo, s78910_hi;
-        transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
+        transpose_concat_elems_u8_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
 
         // Merge new data into block from previous iteration.
         samples_LUT.val[0] = s3456_lo;
@@ -476,6 +453,142 @@
   }
 }
 
+static inline int16x4_t convolve4_4_v(const uint8x16_t samples,
+                                      const int8x8_t filters) {
+  // Sample permutation is performed by the caller.
+  int32x4_t sum = vusdotq_lane_s32(vdupq_n_s32(0), samples, filters, 0);
+
+  // Further narrowing and packing is performed by the caller.
+  return vmovn_s32(sum);
+}
+
+static inline uint8x8_t convolve4_8_v(const uint8x16_t samples0,
+                                      const uint8x16_t samples1,
+                                      const int8x8_t filters) {
+  // Sample permutation is performed by the caller.
+
+  // First 4 output values.
+  int32x4_t sum0 = vusdotq_lane_s32(vdupq_n_s32(0), samples0, filters, 0);
+  // Second 4 output values.
+  int32x4_t sum1 = vusdotq_lane_s32(vdupq_n_s32(0), samples1, filters, 0);
+
+  // Narrow and re-pack.
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0), vmovn_s32(sum1));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
+}
+
+static inline void convolve8_vert_4tap_neon_i8mm(
+    const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
+    ptrdiff_t dst_stride, const int16_t *filter_y, int w, int h) {
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int16x8_t filter_s16 =
+      vcombine_s16(vld1_s16(filter_y + 2), vdup_n_s16(0));
+  const int8x8_t filter = vshrn_n_s16(filter_s16, 1);
+  const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
+  uint8x16x2_t samples_LUT;
+
+  if (w == 4) {
+    uint8x8_t s0, s1, s2, s3;
+    load_u8_8x4(src, src_stride, &s0, &s1, &s2, &s3);
+    src += 4 * src_stride;
+
+    // This operation combines a conventional transpose and the sample permute
+    // required before computing the dot product.
+    uint8x16_t s0123;
+    transpose_concat_elems_u8_4x4(s0, s1, s2, s3, &s0123);
+
+    do {
+      uint8x8_t s4, s5, s6, s7;
+      load_u8_8x4(src, src_stride, &s4, &s5, &s6, &s7);
+
+      uint8x16_t s4567;
+      transpose_concat_elems_u8_4x4(s4, s5, s6, s7, &s4567);
+
+      // Merge new data into block from previous iteration.
+      samples_LUT.val[0] = s0123;
+      samples_LUT.val[1] = s4567;
+      uint8x16_t s1234 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+      uint8x16_t s2345 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+      uint8x16_t s3456 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+      int16x4_t d0 = convolve4_4_v(s0123, filter);
+      int16x4_t d1 = convolve4_4_v(s1234, filter);
+      int16x4_t d2 = convolve4_4_v(s2345, filter);
+      int16x4_t d3 = convolve4_4_v(s3456, filter);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
+
+      store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
+      store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
+
+      // Prepare block for next iteration - re-using as much as possible.
+      // Shuffle everything up four rows.
+      s0123 = s4567;
+
+      src += 4 * src_stride;
+      dst += 4 * dst_stride;
+      h -= 4;
+    } while (h != 0);
+  } else {
+    do {
+      int height = h;
+      const uint8_t *s = src;
+      uint8_t *d = dst;
+
+      uint8x8_t s0, s1, s2, s3;
+      load_u8_8x4(s, src_stride, &s0, &s1, &s2, &s3);
+      s += 4 * src_stride;
+
+      // This operation combines a conventional transpose and the sample permute
+      // required before computing the dot product.
+      uint8x16_t s0123_lo, s0123_hi;
+      transpose_concat_elems_u8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+
+      do {
+        uint8x8_t s4, s5, s6, s7;
+        load_u8_8x4(s, src_stride, &s4, &s5, &s6, &s7);
+
+        uint8x16_t s4567_lo, s4567_hi;
+        transpose_concat_elems_u8_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
+
+        // Merge new data into block from previous iteration.
+        samples_LUT.val[0] = s0123_lo;
+        samples_LUT.val[1] = s4567_lo;
+        uint8x16_t s1234_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+        uint8x16_t s2345_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+        uint8x16_t s3456_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+        samples_LUT.val[0] = s0123_hi;
+        samples_LUT.val[1] = s4567_hi;
+        uint8x16_t s1234_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+        uint8x16_t s2345_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+        uint8x16_t s3456_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+        uint8x8_t d0 = convolve4_8_v(s0123_lo, s0123_hi, filter);
+        uint8x8_t d1 = convolve4_8_v(s1234_lo, s1234_hi, filter);
+        uint8x8_t d2 = convolve4_8_v(s2345_lo, s2345_hi, filter);
+        uint8x8_t d3 = convolve4_8_v(s3456_lo, s3456_hi, filter);
+
+        store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+        // Prepare block for next iteration - re-using as much as possible.
+        // Shuffle everything up four rows.
+        s0123_lo = s4567_lo;
+        s0123_hi = s4567_hi;
+
+        s += 4 * src_stride;
+        d += 4 * dst_stride;
+        height -= 4;
+      } while (height != 0);
+      src += 8;
+      dst += 8;
+      w -= 8;
+    } while (w != 0);
+  }
+}
+
 void aom_convolve8_vert_neon_i8mm(const uint8_t *src, ptrdiff_t src_stride,
                                   uint8_t *dst, ptrdiff_t dst_stride,
                                   const int16_t *filter_x, int x_step_q4,
@@ -496,8 +609,8 @@
     convolve8_vert_2tap_neon(src + 3 * src_stride, src_stride, dst, dst_stride,
                              filter_y, w, h);
   } else if (filter_taps == 4) {
-    convolve8_vert_4tap_neon(src + 2 * src_stride, src_stride, dst, dst_stride,
-                             filter_y, w, h);
+    convolve8_vert_4tap_neon_i8mm(src + 2 * src_stride, src_stride, dst,
+                                  dst_stride, filter_y, w, h);
   } else {
     convolve8_vert_8tap_neon_i8mm(src, src_stride, dst, dst_stride, filter_y, w,
                                   h);
diff --git a/aom_dsp/arm/highbd_convolve8_sve.c b/aom_dsp/arm/highbd_convolve8_sve.c
index b5db14b..882e360 100644
--- a/aom_dsp/arm/highbd_convolve8_sve.c
+++ b/aom_dsp/arm/highbd_convolve8_sve.c
@@ -20,6 +20,7 @@
 #include "aom_dsp/arm/aom_filter.h"
 #include "aom_dsp/arm/highbd_convolve8_neon.h"
 #include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
 
 static inline uint16x4_t highbd_convolve8_4_h(int16x8_t s[4], int16x8_t filter,
                                               uint16x4_t max) {
@@ -276,60 +277,6 @@
   6, 7, 16, 17, 18, 19, 20, 21, 14, 15, 24, 25, 26, 27, 28, 29
 };
 
-static inline void transpose_concat_4x4(int16x4_t s0, int16x4_t s1,
-                                        int16x4_t s2, int16x4_t s3,
-                                        int16x8_t res[2]) {
-  // Transpose 16-bit elements and concatenate result rows as follows:
-  // s0: 00, 01, 02, 03
-  // s1: 10, 11, 12, 13
-  // s2: 20, 21, 22, 23
-  // s3: 30, 31, 32, 33
-  //
-  // res[0]: 00 10 20 30 01 11 21 31
-  // res[1]: 02 12 22 32 03 13 23 33
-
-  int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
-  int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
-  int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
-  int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
-
-  int32x4_t s01 = vreinterpretq_s32_s16(vzip1q_s16(s0q, s1q));
-  int32x4_t s23 = vreinterpretq_s32_s16(vzip1q_s16(s2q, s3q));
-
-  int32x4x2_t s0123 = vzipq_s32(s01, s23);
-
-  res[0] = vreinterpretq_s16_s32(s0123.val[0]);
-  res[1] = vreinterpretq_s16_s32(s0123.val[1]);
-}
-
-static inline void transpose_concat_8x4(int16x8_t s0, int16x8_t s1,
-                                        int16x8_t s2, int16x8_t s3,
-                                        int16x8_t res[4]) {
-  // Transpose 16-bit elements and concatenate result rows as follows:
-  // s0: 00, 01, 02, 03, 04, 05, 06, 07
-  // s1: 10, 11, 12, 13, 14, 15, 16, 17
-  // s2: 20, 21, 22, 23, 24, 25, 26, 27
-  // s3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // res_lo[0]: 00 10 20 30 01 11 21 31
-  // res_lo[1]: 02 12 22 32 03 13 23 33
-  // res_hi[0]: 04 14 24 34 05 15 25 35
-  // res_hi[1]: 06 16 26 36 07 17 27 37
-
-  int16x8x2_t tr01_16 = vzipq_s16(s0, s1);
-  int16x8x2_t tr23_16 = vzipq_s16(s2, s3);
-
-  int32x4x2_t tr01_32 = vzipq_s32(vreinterpretq_s32_s16(tr01_16.val[0]),
-                                  vreinterpretq_s32_s16(tr23_16.val[0]));
-  int32x4x2_t tr23_32 = vzipq_s32(vreinterpretq_s32_s16(tr01_16.val[1]),
-                                  vreinterpretq_s32_s16(tr23_16.val[1]));
-
-  res[0] = vreinterpretq_s16_s32(tr01_32.val[0]);
-  res[1] = vreinterpretq_s16_s32(tr01_32.val[1]);
-  res[2] = vreinterpretq_s16_s32(tr23_32.val[0]);
-  res[3] = vreinterpretq_s16_s32(tr23_32.val[1]);
-}
-
 static inline void aom_tbl2x4_s16(int16x8_t t0[4], int16x8_t t1[4],
                                   uint8x16_t tbl, int16x8_t res[4]) {
   int8x16x2_t samples0 = { vreinterpretq_s8_s16(t0[0]),
@@ -426,10 +373,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -438,7 +385,7 @@
       int16x8_t s4567[2], s5678[2], s6789[2], s78910[2];
 
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s78910);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s78910);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s78910, merge_block_tbl[0], s4567);
@@ -481,10 +428,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -493,7 +440,7 @@
         int16x8_t s4567[4], s5678[4], s6789[4], s78910[4];
 
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s78910);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s78910);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s78910, merge_block_tbl[0], s4567);
diff --git a/aom_dsp/arm/mem_neon.h b/aom_dsp/arm/mem_neon.h
index ad761de..6590a7f 100644
--- a/aom_dsp/arm/mem_neon.h
+++ b/aom_dsp/arm/mem_neon.h
@@ -176,7 +176,7 @@
   return ret;
 }
 
-static inline uint8x8_t load_u8_4x2(const uint8_t *p, int stride) {
+static inline uint8x8_t load_u8_4x2(const uint8_t *p, ptrdiff_t stride) {
   uint8x8_t ret = vdup_n_u8(0);
   ret = vreinterpret_u8_u32(
       vld1_lane_u32((const uint32_t *)p, vreinterpret_u32_u8(ret), 0));
@@ -186,7 +186,7 @@
   return ret;
 }
 
-static inline uint16x4_t load_u16_2x2(const uint16_t *p, int stride) {
+static inline uint16x4_t load_u16_2x2(const uint16_t *p, ptrdiff_t stride) {
   uint16x4_t ret = vdup_n_u16(0);
   ret = vreinterpret_u16_u32(
       vld1_lane_u32((const uint32_t *)p, vreinterpret_u32_u16(ret), 0));
@@ -1194,7 +1194,8 @@
 #endif
 
 // Load 2 sets of 4 bytes when alignment is not guaranteed.
-static inline uint8x8_t load_unaligned_u8(const uint8_t *buf, int stride) {
+static inline uint8x8_t load_unaligned_u8(const uint8_t *buf,
+                                          ptrdiff_t stride) {
   uint32_t a;
   memcpy(&a, buf, 4);
   buf += stride;
@@ -1205,7 +1206,8 @@
 }
 
 // Load 4 sets of 4 bytes when alignment is not guaranteed.
-static inline uint8x16_t load_unaligned_u8q(const uint8_t *buf, int stride) {
+static inline uint8x16_t load_unaligned_u8q(const uint8_t *buf,
+                                            ptrdiff_t stride) {
   uint32_t a;
   uint32x4_t a_u32;
   if (stride == 4) return vld1q_u8(buf);
@@ -1223,7 +1225,8 @@
   return vreinterpretq_u8_u32(a_u32);
 }
 
-static inline uint8x8_t load_unaligned_u8_2x2(const uint8_t *buf, int stride) {
+static inline uint8x8_t load_unaligned_u8_2x2(const uint8_t *buf,
+                                              ptrdiff_t stride) {
   uint16_t a;
   uint16x4_t a_u16;
 
@@ -1263,7 +1266,8 @@
   return vreinterpret_u8_u16(a_u32);
 }
 
-static inline uint8x8_t load_unaligned_u8_4x2(const uint8_t *buf, int stride) {
+static inline uint8x8_t load_unaligned_u8_4x2(const uint8_t *buf,
+                                              ptrdiff_t stride) {
   uint32_t a;
   uint32x2_t a_u32;
 
@@ -1275,14 +1279,14 @@
   return vreinterpret_u8_u32(a_u32);
 }
 
-static inline void load_unaligned_u8_4x4(const uint8_t *buf, int stride,
+static inline void load_unaligned_u8_4x4(const uint8_t *buf, ptrdiff_t stride,
                                          uint8x8_t *tu0, uint8x8_t *tu1) {
   *tu0 = load_unaligned_u8_4x2(buf, stride);
   buf += 2 * stride;
   *tu1 = load_unaligned_u8_4x2(buf, stride);
 }
 
-static inline void load_unaligned_u8_3x8(const uint8_t *buf, int stride,
+static inline void load_unaligned_u8_3x8(const uint8_t *buf, ptrdiff_t stride,
                                          uint8x8_t *tu0, uint8x8_t *tu1,
                                          uint8x8_t *tu2) {
   load_unaligned_u8_4x4(buf, stride, tu0, tu1);
@@ -1290,7 +1294,7 @@
   *tu2 = load_unaligned_u8_4x2(buf, stride);
 }
 
-static inline void load_unaligned_u8_4x8(const uint8_t *buf, int stride,
+static inline void load_unaligned_u8_4x8(const uint8_t *buf, ptrdiff_t stride,
                                          uint8x8_t *tu0, uint8x8_t *tu1,
                                          uint8x8_t *tu2, uint8x8_t *tu3) {
   load_unaligned_u8_4x4(buf, stride, tu0, tu1);
@@ -1397,7 +1401,7 @@
 }
 
 static inline uint16x4_t load_unaligned_u16_2x2(const uint16_t *buf,
-                                                int stride) {
+                                                ptrdiff_t stride) {
   uint32_t a;
   uint32x2_t a_u32;
 
@@ -1418,7 +1422,7 @@
 }
 
 static inline uint16x8_t load_unaligned_u16_4x2(const uint16_t *buf,
-                                                uint32_t stride) {
+                                                ptrdiff_t stride) {
   uint64_t a;
   uint64x2_t a_u64;
 
@@ -1433,7 +1437,7 @@
 }
 
 static inline int16x8_t load_unaligned_s16_4x2(const int16_t *buf,
-                                               uint32_t stride) {
+                                               ptrdiff_t stride) {
   int64_t a;
   int64x2_t a_s64;
   memcpy(&a, buf, 8);
@@ -1446,14 +1450,14 @@
   return vreinterpretq_s16_s64(a_s64);
 }
 
-static inline void load_unaligned_u16_4x4(const uint16_t *buf, uint32_t stride,
+static inline void load_unaligned_u16_4x4(const uint16_t *buf, ptrdiff_t stride,
                                           uint16x8_t *tu0, uint16x8_t *tu1) {
   *tu0 = load_unaligned_u16_4x2(buf, stride);
   buf += 2 * stride;
   *tu1 = load_unaligned_u16_4x2(buf, stride);
 }
 
-static inline void load_s32_4x4(int32_t *s, int32_t p, int32x4_t *s1,
+static inline void load_s32_4x4(int32_t *s, ptrdiff_t p, int32x4_t *s1,
                                 int32x4_t *s2, int32x4_t *s3, int32x4_t *s4) {
   *s1 = vld1q_s32(s);
   s += p;
@@ -1464,7 +1468,7 @@
   *s4 = vld1q_s32(s);
 }
 
-static inline void store_s32_4x4(int32_t *s, int32_t p, int32x4_t s1,
+static inline void store_s32_4x4(int32_t *s, ptrdiff_t p, int32x4_t s1,
                                  int32x4_t s2, int32x4_t s3, int32x4_t s4) {
   vst1q_s32(s, s1);
   s += p;
@@ -1475,7 +1479,7 @@
   vst1q_s32(s, s4);
 }
 
-static inline void load_u32_4x4(uint32_t *s, int32_t p, uint32x4_t *s1,
+static inline void load_u32_4x4(uint32_t *s, ptrdiff_t p, uint32x4_t *s1,
                                 uint32x4_t *s2, uint32x4_t *s3,
                                 uint32x4_t *s4) {
   *s1 = vld1q_u32(s);
@@ -1487,7 +1491,7 @@
   *s4 = vld1q_u32(s);
 }
 
-static inline void store_u32_4x4(uint32_t *s, int32_t p, uint32x4_t s1,
+static inline void store_u32_4x4(uint32_t *s, ptrdiff_t p, uint32x4_t s1,
                                  uint32x4_t s2, uint32x4_t s3, uint32x4_t s4) {
   vst1q_u32(s, s1);
   s += p;
@@ -1578,14 +1582,14 @@
 }
 
 // Store two blocks of 16-bits from a single vector.
-static inline void store_u8x2_strided_x2(uint8_t *dst, uint32_t dst_stride,
+static inline void store_u8x2_strided_x2(uint8_t *dst, ptrdiff_t dst_stride,
                                          uint8x8_t src) {
   store_u8_2x1_lane(dst, src, 0);
   dst += dst_stride;
   store_u8_2x1_lane(dst, src, 1);
 }
 
-static inline void store_u8x2_strided_x4(uint8_t *dst, uint32_t dst_stride,
+static inline void store_u8x2_strided_x4(uint8_t *dst, ptrdiff_t dst_stride,
                                          uint8x8_t src) {
   store_u8_2x1_lane(dst, src, 0);
   dst += dst_stride;
@@ -1622,7 +1626,7 @@
 }
 
 // Store two blocks of 32-bits from a single vector.
-static inline void store_u16x2_strided_x2(uint16_t *dst, uint32_t dst_stride,
+static inline void store_u16x2_strided_x2(uint16_t *dst, ptrdiff_t dst_stride,
                                           uint16x4_t src) {
   store_u16_2x1_lane(dst, src, 0);
   dst += dst_stride;
@@ -1630,7 +1634,7 @@
 }
 
 // Store two blocks of 64-bits from a single vector.
-static inline void store_u16x4_strided_x2(uint16_t *dst, uint32_t dst_stride,
+static inline void store_u16x4_strided_x2(uint16_t *dst, ptrdiff_t dst_stride,
                                           uint16x8_t src) {
   store_u16_4x1_lane(dst, src, 0);
   dst += dst_stride;
@@ -1638,7 +1642,7 @@
 }
 
 // Store two blocks of 64-bits from a single vector.
-static inline void store_s16x4_strided_x2(int16_t *dst, int32_t dst_stride,
+static inline void store_s16x4_strided_x2(int16_t *dst, ptrdiff_t dst_stride,
                                           int16x8_t src) {
   store_s16_4x1_lane(dst, src, 0);
   dst += dst_stride;
diff --git a/aom_dsp/arm/transpose_neon.h b/aom_dsp/arm/transpose_neon.h
index 88df0d6..2a548bb 100644
--- a/aom_dsp/arm/transpose_neon.h
+++ b/aom_dsp/arm/transpose_neon.h
@@ -17,6 +17,155 @@
 #include "aom_dsp/aom_dsp_common.h"  // For AOM_FORCE_INLINE.
 #include "config/aom_config.h"
 
+static inline void transpose_concat_elems_u8_4x4(uint8x8_t a0, uint8x8_t a1,
+                                                 uint8x8_t a2, uint8x8_t a3,
+                                                 uint8x16_t *b) {
+  // Transpose 8-bit elements and concatenate result rows as follows:
+  // a0: 00, 01, 02, 03, XX, XX, XX, XX
+  // a1: 10, 11, 12, 13, XX, XX, XX, XX
+  // a2: 20, 21, 22, 23, XX, XX, XX, XX
+  // a3: 30, 31, 32, 33, XX, XX, XX, XX
+  //
+  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+
+  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
+  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
+  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
+  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
+
+  uint8x16_t a02 = vzipq_u8(a0q, a2q).val[0];
+  uint8x16_t a13 = vzipq_u8(a1q, a3q).val[0];
+
+  *b = vzipq_u8(a02, a13).val[0];
+}
+
+static inline void transpose_concat_elems_u8_8x4(uint8x8_t a0, uint8x8_t a1,
+                                                 uint8x8_t a2, uint8x8_t a3,
+                                                 uint8x16_t *b0,
+                                                 uint8x16_t *b1) {
+  // Transpose 8-bit elements and concatenate result rows as follows:
+  // a0: 00, 01, 02, 03, 04, 05, 06, 07
+  // a1: 10, 11, 12, 13, 14, 15, 16, 17
+  // a2: 20, 21, 22, 23, 24, 25, 26, 27
+  // a3: 30, 31, 32, 33, 34, 35, 36, 37
+  //
+  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
+
+  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
+  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
+  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
+  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
+
+  uint8x16_t a02 = vzipq_u8(a0q, a2q).val[0];
+  uint8x16_t a13 = vzipq_u8(a1q, a3q).val[0];
+
+  uint8x16x2_t a0123 = vzipq_u8(a02, a13);
+
+  *b0 = a0123.val[0];
+  *b1 = a0123.val[1];
+}
+
+static inline void transpose_concat_elems_s8_4x4(int8x8_t a0, int8x8_t a1,
+                                                 int8x8_t a2, int8x8_t a3,
+                                                 int8x16_t *b) {
+  // Transpose 8-bit elements and concatenate result rows as follows:
+  // a0: 00, 01, 02, 03, XX, XX, XX, XX
+  // a1: 10, 11, 12, 13, XX, XX, XX, XX
+  // a2: 20, 21, 22, 23, XX, XX, XX, XX
+  // a3: 30, 31, 32, 33, XX, XX, XX, XX
+  //
+  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+
+  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
+  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
+  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
+  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
+
+  int8x16_t a02 = vzipq_s8(a0q, a2q).val[0];
+  int8x16_t a13 = vzipq_s8(a1q, a3q).val[0];
+
+  *b = vzipq_s8(a02, a13).val[0];
+}
+
+static inline void transpose_concat_elems_s8_8x4(int8x8_t a0, int8x8_t a1,
+                                                 int8x8_t a2, int8x8_t a3,
+                                                 int8x16_t *b0, int8x16_t *b1) {
+  // Transpose 8-bit elements and concatenate result rows as follows:
+  // a0: 00, 01, 02, 03, 04, 05, 06, 07
+  // a1: 10, 11, 12, 13, 14, 15, 16, 17
+  // a2: 20, 21, 22, 23, 24, 25, 26, 27
+  // a3: 30, 31, 32, 33, 34, 35, 36, 37
+  //
+  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
+  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
+
+  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
+  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
+  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
+  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
+
+  int8x16_t a02 = vzipq_s8(a0q, a2q).val[0];
+  int8x16_t a13 = vzipq_s8(a1q, a3q).val[0];
+
+  int8x16x2_t a0123 = vzipq_s8(a02, a13);
+
+  *b0 = a0123.val[0];
+  *b1 = a0123.val[1];
+}
+
+static inline void transpose_concat_elems_s16_4x4(int16x4_t s0, int16x4_t s1,
+                                                  int16x4_t s2, int16x4_t s3,
+                                                  int16x8_t res[2]) {
+  // Transpose 16-bit elements and concatenate result rows as follows:
+  // s0: 00, 01, 02, 03
+  // s1: 10, 11, 12, 13
+  // s2: 20, 21, 22, 23
+  // s3: 30, 31, 32, 33
+  //
+  // res[0]: 00 10 20 30 01 11 21 31
+  // res[1]: 02 12 22 32 03 13 23 33
+
+  int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
+  int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
+  int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
+  int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
+
+  int16x8_t s02 = vzipq_s16(s0q, s2q).val[0];
+  int16x8_t s13 = vzipq_s16(s1q, s3q).val[0];
+
+  int16x8x2_t s0123 = vzipq_s16(s02, s13);
+
+  res[0] = s0123.val[0];
+  res[1] = s0123.val[1];
+}
+
+static inline void transpose_concat_elems_s16_8x4(int16x8_t s0, int16x8_t s1,
+                                                  int16x8_t s2, int16x8_t s3,
+                                                  int16x8_t res[4]) {
+  // Transpose 16-bit elements and concatenate result rows as follows:
+  // s0: 00, 01, 02, 03, 04, 05, 06, 07
+  // s1: 10, 11, 12, 13, 14, 15, 16, 17
+  // s2: 20, 21, 22, 23, 24, 25, 26, 27
+  // s3: 30, 31, 32, 33, 34, 35, 36, 37
+  //
+  // res[0]: 00 10 20 30 01 11 21 31
+  // res[1]: 02 12 22 32 03 13 23 33
+  // res[2]: 04 14 24 34 05 15 25 35
+  // res[3]: 06 16 26 36 07 17 27 37
+
+  int16x8x2_t s02 = vzipq_s16(s0, s2);
+  int16x8x2_t s13 = vzipq_s16(s1, s3);
+
+  int16x8x2_t s0123_lo = vzipq_s16(s02.val[0], s13.val[0]);
+  int16x8x2_t s0123_hi = vzipq_s16(s02.val[1], s13.val[1]);
+
+  res[0] = s0123_lo.val[0];
+  res[1] = s0123_lo.val[1];
+  res[2] = s0123_hi.val[0];
+  res[3] = s0123_hi.val[1];
+}
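+
+// In scalar terms the transpose_concat_elems_*_4x4 helpers above perform a
+// 4x4 transpose and lay the resulting columns out back-to-back in a single
+// vector. Illustrative reference only (the arrays here are hypothetical):
+//
+//   // in[row][col] holds the four 4-element input rows.
+//   for (int col = 0; col < 4; ++col) {
+//     for (int row = 0; row < 4; ++row) {
+//       out[4 * col + row] = in[row][col];
+//     }
+//   }
+//
+// The wider variants follow the same column-major layout, spread across as
+// many output vectors as the element count requires.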
+
 static inline void transpose_elems_u8_8x8(
     uint8x8_t a0, uint8x8_t a1, uint8x8_t a2, uint8x8_t a3, uint8x8_t a4,
     uint8x8_t a5, uint8x8_t a6, uint8x8_t a7, uint8x8_t *o0, uint8x8_t *o1,
diff --git a/aom_dsp/prob.h b/aom_dsp/prob.h
index bb2c4a9..8bec067 100644
--- a/aom_dsp/prob.h
+++ b/aom_dsp/prob.h
@@ -18,7 +18,7 @@
 #include "config/aom_config.h"
 
 #include "aom_dsp/aom_dsp_common.h"
-#include "aom_dsp/entcode.h"
+
 #include "aom_ports/bitops.h"
 #include "aom_ports/mem.h"
 
diff --git a/av1/av1.cmake b/av1/av1.cmake
index 61533df..f5d7057 100644
--- a/av1/av1.cmake
+++ b/av1/av1.cmake
@@ -94,6 +94,7 @@
 
 if(CONFIG_HIGHWAY)
   list(APPEND AOM_AV1_COMMON_SOURCES "${AOM_ROOT}/av1/common/selfguided_hwy.h")
+  list(APPEND AOM_AV1_COMMON_SOURCES "${AOM_ROOT}/av1/common/warp_plane_hwy.h")
 endif()
 
 list(APPEND AOM_AV1_DECODER_SOURCES
@@ -321,8 +322,13 @@
             "${AOM_ROOT}/av1/common/x86/wiener_convolve_avx2.c")
 
 if(CONFIG_HIGHWAY)
+  list(APPEND AOM_AV1_COMMON_INTRIN_SSE4_1
+              "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_sse4.cc")
+  list(APPEND AOM_AV1_COMMON_INTRIN_AVX2
+              "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_avx2.cc")
   list(APPEND AOM_AV1_COMMON_INTRIN_AVX512
-              "${AOM_ROOT}/av1/common/x86/selfguided_hwy_avx512.cc")
+              "${AOM_ROOT}/av1/common/x86/selfguided_hwy_avx512.cc"
+              "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_avx512.cc")
 endif()
 
 list(APPEND AOM_AV1_ENCODER_ASM_SSE2 "${AOM_ROOT}/av1/encoder/x86/dct_sse2.asm"
@@ -591,7 +597,8 @@
                      "${AOM_ROOT}/av1/common/restoration.h"
                      "${AOM_ROOT}/av1/common/selfguided_hwy.h"
                      "${AOM_ROOT}/av1/common/warped_motion.c"
-                     "${AOM_ROOT}/av1/common/warped_motion.h")
+                     "${AOM_ROOT}/av1/common/warped_motion.h"
+                     "${AOM_ROOT}/av1/common/warp_plane_hwy.h")
 
     list(REMOVE_ITEM AOM_AV1_COMMON_INTRIN_SSE2
                      "${AOM_ROOT}/av1/common/x86/cfl_sse2.c"
@@ -601,7 +608,8 @@
     list(REMOVE_ITEM AOM_AV1_COMMON_INTRIN_SSE4_1
                      "${AOM_ROOT}/av1/common/x86/highbd_warp_plane_sse4.c"
                      "${AOM_ROOT}/av1/common/x86/selfguided_sse4.c"
-                     "${AOM_ROOT}/av1/common/x86/warp_plane_sse4.c")
+                     "${AOM_ROOT}/av1/common/x86/warp_plane_sse4.c"
+                     "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_sse4.cc")
 
     list(
       REMOVE_ITEM AOM_AV1_COMMON_INTRIN_SSSE3
@@ -614,10 +622,12 @@
                      "${AOM_ROOT}/av1/common/x86/highbd_wiener_convolve_avx2.c"
                      "${AOM_ROOT}/av1/common/x86/selfguided_avx2.c"
                      "${AOM_ROOT}/av1/common/x86/warp_plane_avx2.c"
+                     "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_avx2.cc"
                      "${AOM_ROOT}/av1/common/x86/wiener_convolve_avx2.c")
 
     list(REMOVE_ITEM AOM_AV1_COMMON_INTRIN_AVX512
-                     "${AOM_ROOT}/av1/common/x86/selfguided_hwy_avx512.cc")
+                     "${AOM_ROOT}/av1/common/x86/selfguided_hwy_avx512.cc"
+                     "${AOM_ROOT}/av1/common/x86/warp_plane_hwy_avx512.cc")
 
     list(REMOVE_ITEM AOM_AV1_COMMON_INTRIN_NEON
                      "${AOM_ROOT}/av1/common/arm/cfl_neon.c"
diff --git a/av1/common/arm/convolve_neon_dotprod.c b/av1/common/arm/convolve_neon_dotprod.c
index 8d0d929..446f661 100644
--- a/av1/common/arm/convolve_neon_dotprod.c
+++ b/av1/common/arm/convolve_neon_dotprod.c
@@ -16,6 +16,7 @@
 
 #include "aom_dsp/aom_dsp_common.h"
 #include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
 #include "aom_ports/mem.h"
 #include "av1/common/arm/convolve_neon.h"
 #include "av1/common/convolve.h"
@@ -64,7 +65,7 @@
   sum = vdotq_laneq_s32(sum, perm_samples[1], filter, 1);
   sum = vdotq_laneq_s32(sum, perm_samples[2], filter, 2);
 
-  return vqrshrn_n_s32(sum, FILTER_BITS);
+  return vshrn_n_s32(sum, 1);
 }
 
 static inline uint8x8_t convolve12_8_x(uint8x16_t samples[2],
@@ -105,9 +106,9 @@
   sum4567 = vdotq_laneq_s32(sum4567, perm_samples[3], filter, 2);
 
   // Narrow and re-pack.
-  int16x8_t sum_s16 = vcombine_s16(vqrshrn_n_s32(sum0123, FILTER_BITS),
-                                   vqrshrn_n_s32(sum4567, FILTER_BITS));
-  return vqmovun_s16(sum_s16);
+  int16x8_t sum_s16 =
+      vcombine_s16(vshrn_n_s32(sum0123, 1), vshrn_n_s32(sum4567, 1));
+  return vqrshrun_n_s16(sum_s16, FILTER_BITS - 1);
 }
 
 static inline void convolve_x_sr_12tap_neon_dotprod(
@@ -134,8 +135,8 @@
       int16x4_t d2 = convolve12_4_x(s2, filter, permute_tbl);
       int16x4_t d3 = convolve12_4_x(s3, filter, permute_tbl);
 
-      uint8x8_t d01 = vqmovun_s16(vcombine_s16(d0, d1));
-      uint8x8_t d23 = vqmovun_s16(vcombine_s16(d2, d3));
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -387,57 +388,6 @@
   } while (h != 0);
 }
 
-static inline void transpose_concat_4x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
-                                        int8x8_t a3, int8x16_t *b) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, XX, XX, XX, XX
-  // a1: 10, 11, 12, 13, XX, XX, XX, XX
-  // a2: 20, 21, 22, 23, XX, XX, XX, XX
-  // a3: 30, 31, 32, 33, XX, XX, XX, XX
-  //
-  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-
-  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
-  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
-  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
-  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
-
-  int8x16_t a01 = vzipq_s8(a0q, a1q).val[0];
-  int8x16_t a23 = vzipq_s8(a2q, a3q).val[0];
-
-  int16x8_t a0123 =
-      vzipq_s16(vreinterpretq_s16_s8(a01), vreinterpretq_s16_s8(a23)).val[0];
-
-  *b = vreinterpretq_s8_s16(a0123);
-}
-
-static inline void transpose_concat_8x4(int8x8_t a0, int8x8_t a1, int8x8_t a2,
-                                        int8x8_t a3, int8x16_t *b0,
-                                        int8x16_t *b1) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, 04, 05, 06, 07
-  // a1: 10, 11, 12, 13, 14, 15, 16, 17
-  // a2: 20, 21, 22, 23, 24, 25, 26, 27
-  // a3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
-
-  int8x16_t a0q = vcombine_s8(a0, vdup_n_s8(0));
-  int8x16_t a1q = vcombine_s8(a1, vdup_n_s8(0));
-  int8x16_t a2q = vcombine_s8(a2, vdup_n_s8(0));
-  int8x16_t a3q = vcombine_s8(a3, vdup_n_s8(0));
-
-  int8x16_t a01 = vzipq_s8(a0q, a1q).val[0];
-  int8x16_t a23 = vzipq_s8(a2q, a3q).val[0];
-
-  int16x8x2_t a0123 =
-      vzipq_s16(vreinterpretq_s16_s8(a01), vreinterpretq_s16_s8(a23));
-
-  *b0 = vreinterpretq_s8_s16(a0123.val[0]);
-  *b1 = vreinterpretq_s8_s16(a0123.val[1]);
-}
-
 static inline int16x4_t convolve12_4_y(const int8x16_t s0, const int8x16_t s1,
                                        const int8x16_t s2,
                                        const int8x8_t filters_0_7,
@@ -450,7 +400,7 @@
   sum = vdotq_lane_s32(sum, s2, filters_4_11, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vshrn_n_s32(sum, 1);
 }
 
 static inline uint8x8_t convolve12_8_y(
@@ -470,8 +420,9 @@
   sum4567 = vdotq_lane_s32(sum4567, s2_hi, filters_4_11, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0123), vqmovn_s32(sum4567));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum =
+      vcombine_s16(vshrn_n_s32(sum0123, 1), vshrn_n_s32(sum4567, 1));
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve_y_sr_12tap_neon_dotprod(
@@ -505,14 +456,14 @@
     int8x8_t sA = vreinterpret_s8_u8(vsub_u8(tA, vdup_n_u8(128)));
 
     int8x16_t s0123, s1234, s2345, s3456, s4567, s5678, s6789, s789A;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
-    transpose_concat_4x4(s4, s5, s6, s7, &s4567);
-    transpose_concat_4x4(s5, s6, s7, s8, &s5678);
-    transpose_concat_4x4(s6, s7, s8, s9, &s6789);
-    transpose_concat_4x4(s7, s8, s9, sA, &s789A);
+    transpose_concat_elems_s8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_s8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_s8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_s8_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_s8_4x4(s4, s5, s6, s7, &s4567);
+    transpose_concat_elems_s8_4x4(s5, s6, s7, s8, &s5678);
+    transpose_concat_elems_s8_4x4(s6, s7, s8, s9, &s6789);
+    transpose_concat_elems_s8_4x4(s7, s8, s9, sA, &s789A);
 
     do {
       uint8x8_t tB, tC, tD, tE;
@@ -524,7 +475,7 @@
       int8x8_t sE = vreinterpret_s8_u8(vsub_u8(tE, vdup_n_u8(128)));
 
       int8x16_t s89AB, s9ABC, sABCD, sBCDE;
-      transpose_concat_4x4(sB, sC, sD, sE, &sBCDE);
+      transpose_concat_elems_s8_4x4(sB, sC, sD, sE, &sBCDE);
 
       // Merge new data into block from previous iteration.
       int8x16x2_t samples_LUT = { { s789A, sBCDE } };
@@ -540,8 +491,8 @@
           convolve12_4_y(s2345, s6789, sABCD, filter_0_7, filter_4_11);
       int16x4_t d3 =
           convolve12_4_y(s3456, s789A, sBCDE, filter_0_7, filter_4_11);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
@@ -591,14 +542,14 @@
       int8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi, s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo,
           s6789_hi, s789A_lo, s789A_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
-      transpose_concat_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
-      transpose_concat_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi);
-      transpose_concat_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi);
-      transpose_concat_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
+      transpose_concat_elems_s8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_s8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_s8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_s8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_s8_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
+      transpose_concat_elems_s8_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi);
+      transpose_concat_elems_s8_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi);
+      transpose_concat_elems_s8_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
 
       do {
         uint8x8_t tB, tC, tD, tE;
@@ -611,7 +562,7 @@
 
         int8x16_t s89AB_lo, s89AB_hi, s9ABC_lo, s9ABC_hi, sABCD_lo, sABCD_hi,
             sBCDE_lo, sBCDE_hi;
-        transpose_concat_8x4(sB, sC, sD, sE, &sBCDE_lo, &sBCDE_hi);
+        transpose_concat_elems_s8_8x4(sB, sC, sD, sE, &sBCDE_lo, &sBCDE_hi);
 
         // Merge new data into block from previous iteration.
         int8x16x2_t samples_LUT_lo = { { s789A_lo, sBCDE_lo } };
@@ -673,12 +624,13 @@
                                       const int8x8_t filters) {
   // The sample range transform and permutation are performed by the caller.
   // Accumulate into 128 << FILTER_BITS to account for range transform.
-  const int32x4_t acc = vdupq_n_s32(128 << FILTER_BITS);
+  // (Divided by 2 since we halved the filter values.)
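+  // A quick check of the constant (FILTER_BITS == 7, full taps sum to 128):
+  //   sum_k f_half[k] * (x[k] - 128)
+  //     = (1/2) * sum_k f[k] * x[k] - 128 * sum_k f_half[k]
+  //     = (1/2) * sum_k f[k] * x[k] - (128 << (FILTER_BITS - 1))
+  // so adding 128 << (FILTER_BITS - 1) recovers half the true convolution,
+  // which the caller then rounding-shifts right by FILTER_BITS - 1.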
+  const int32x4_t acc = vdupq_n_s32(128 << (FILTER_BITS - 1));
   int32x4_t sum = vdotq_lane_s32(acc, s0, filters, 0);
   sum = vdotq_lane_s32(sum, s1, filters, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_y(const int8x16_t s0_lo,
@@ -688,7 +640,8 @@
                                       const int8x8_t filters) {
   // The sample range transform and permutation are performed by the caller.
   // Accumulate into 128 << FILTER_BITS to account for range transform.
-  const int32x4_t acc = vdupq_n_s32(128 << FILTER_BITS);
+  // (Divided by 2 since we halved the filter values.)
+  const int32x4_t acc = vdupq_n_s32(128 << (FILTER_BITS - 1));
 
   int32x4_t sum0123 = vdotq_lane_s32(acc, s0_lo, filters, 0);
   sum0123 = vdotq_lane_s32(sum0123, s1_lo, filters, 1);
@@ -697,14 +650,16 @@
   sum4567 = vdotq_lane_s32(sum4567, s1_hi, filters, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0123), vqmovn_s32(sum4567));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0123), vmovn_s32(sum4567));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve_y_sr_8tap_neon_dotprod(
     const uint8_t *src_ptr, int src_stride, uint8_t *dst_ptr, int dst_stride,
     int w, int h, const int16_t *y_filter_ptr) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(y_filter_ptr));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
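+  // Halving keeps each 32-bit dot-product sum within int16_t range, so the
+  // per-lane results can be narrowed with a plain (non-saturating) vmovn_s32
+  // and only the final vqrshrun_n_s16 has to saturate.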
+  const int8x8_t filter = vshrn_n_s16(vld1q_s16(y_filter_ptr), 1);
 
   const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
 
@@ -723,25 +678,25 @@
     int8x8_t s6 = vreinterpret_s8_u8(vsub_u8(t6, vdup_n_u8(128)));
 
     int8x16_t s0123, s1234, s2345, s3456;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_s8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_s8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_s8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_s8_4x4(s3, s4, s5, s6, &s3456);
 
     do {
-      uint8x8_t t7, t8, t9, t10;
-      load_u8_8x4(src_ptr, src_stride, &t7, &t8, &t9, &t10);
+      uint8x8_t t7, t8, t9, tA;
+      load_u8_8x4(src_ptr, src_stride, &t7, &t8, &t9, &tA);
 
       int8x8_t s7 = vreinterpret_s8_u8(vsub_u8(t7, vdup_n_u8(128)));
       int8x8_t s8 = vreinterpret_s8_u8(vsub_u8(t8, vdup_n_u8(128)));
       int8x8_t s9 = vreinterpret_s8_u8(vsub_u8(t9, vdup_n_u8(128)));
-      int8x8_t s10 = vreinterpret_s8_u8(vsub_u8(t10, vdup_n_u8(128)));
+      int8x8_t sA = vreinterpret_s8_u8(vsub_u8(tA, vdup_n_u8(128)));
 
-      int8x16_t s4567, s5678, s6789, s78910;
-      transpose_concat_4x4(s7, s8, s9, s10, &s78910);
+      int8x16_t s4567, s5678, s6789, s789A;
+      transpose_concat_elems_s8_4x4(s7, s8, s9, sA, &s789A);
 
       // Merge new data into block from previous iteration.
-      int8x16x2_t samples_LUT = { { s3456, s78910 } };
+      int8x16x2_t samples_LUT = { { s3456, s789A } };
       s4567 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
       s5678 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
       s6789 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
@@ -749,9 +704,10 @@
       int16x4_t d0 = convolve8_4_y(s0123, s4567, filter);
       int16x4_t d1 = convolve8_4_y(s1234, s5678, filter);
       int16x4_t d2 = convolve8_4_y(s2345, s6789, filter);
-      int16x4_t d3 = convolve8_4_y(s3456, s78910, filter);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      int16x4_t d3 = convolve8_4_y(s3456, s789A, filter);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
@@ -761,7 +717,7 @@
       s0123 = s4567;
       s1234 = s5678;
       s2345 = s6789;
-      s3456 = s78910;
+      s3456 = s789A;
 
       src_ptr += 4 * src_stride;
       dst_ptr += 4 * dst_stride;
@@ -791,31 +747,31 @@
       // product.
       int8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_s8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_s8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_s8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_s8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
 
       do {
-        uint8x8_t t7, t8, t9, t10;
-        load_u8_8x4(s, src_stride, &t7, &t8, &t9, &t10);
+        uint8x8_t t7, t8, t9, tA;
+        load_u8_8x4(s, src_stride, &t7, &t8, &t9, &tA);
 
         int8x8_t s7 = vreinterpret_s8_u8(vsub_u8(t7, vdup_n_u8(128)));
         int8x8_t s8 = vreinterpret_s8_u8(vsub_u8(t8, vdup_n_u8(128)));
         int8x8_t s9 = vreinterpret_s8_u8(vsub_u8(t9, vdup_n_u8(128)));
-        int8x8_t s10 = vreinterpret_s8_u8(vsub_u8(t10, vdup_n_u8(128)));
+        int8x8_t sA = vreinterpret_s8_u8(vsub_u8(tA, vdup_n_u8(128)));
 
         int8x16_t s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo, s6789_hi,
-            s78910_lo, s78910_hi;
-        transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
+            s789A_lo, s789A_hi;
+        transpose_concat_elems_s8_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
 
         // Merge new data into block from previous iteration.
-        int8x16x2_t samples_LUT_lo = { { s3456_lo, s78910_lo } };
+        int8x16x2_t samples_LUT_lo = { { s3456_lo, s789A_lo } };
         s4567_lo = vqtbl2q_s8(samples_LUT_lo, merge_block_tbl.val[0]);
         s5678_lo = vqtbl2q_s8(samples_LUT_lo, merge_block_tbl.val[1]);
         s6789_lo = vqtbl2q_s8(samples_LUT_lo, merge_block_tbl.val[2]);
 
-        int8x16x2_t samples_LUT_hi = { { s3456_hi, s78910_hi } };
+        int8x16x2_t samples_LUT_hi = { { s3456_hi, s789A_hi } };
         s4567_hi = vqtbl2q_s8(samples_LUT_hi, merge_block_tbl.val[0]);
         s5678_hi = vqtbl2q_s8(samples_LUT_hi, merge_block_tbl.val[1]);
         s6789_hi = vqtbl2q_s8(samples_LUT_hi, merge_block_tbl.val[2]);
@@ -827,7 +783,7 @@
         uint8x8_t d2 =
             convolve8_8_y(s2345_lo, s2345_hi, s6789_lo, s6789_hi, filter);
         uint8x8_t d3 =
-            convolve8_8_y(s3456_lo, s3456_hi, s78910_lo, s78910_hi, filter);
+            convolve8_8_y(s3456_lo, s3456_hi, s789A_lo, s789A_hi, filter);
 
         store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
 
@@ -839,8 +795,170 @@
         s1234_hi = s5678_hi;
         s2345_lo = s6789_lo;
         s2345_hi = s6789_hi;
-        s3456_lo = s78910_lo;
-        s3456_hi = s78910_hi;
+        s3456_lo = s789A_lo;
+        s3456_hi = s789A_hi;
+
+        s += 4 * src_stride;
+        d += 4 * dst_stride;
+        height -= 4;
+      } while (height != 0);
+      src_ptr += 8;
+      dst_ptr += 8;
+      w -= 8;
+    } while (w != 0);
+  }
+}
+
+static inline int16x4_t convolve4_4_y(const int8x16_t s0,
+                                      const int8x8_t filters) {
+  // The sample range transform and permutation are performed by the caller.
+  // Accumulate into 128 << FILTER_BITS to account for range transform.
+  // (Divided by 2 since we halved the filter values.)
+  const int32x4_t acc = vdupq_n_s32(128 << (FILTER_BITS - 1));
+
+  int32x4_t sum = vdotq_lane_s32(acc, s0, filters, 0);
+
+  // Further narrowing and packing is performed by the caller.
+  return vmovn_s32(sum);
+}
+
+static inline uint8x8_t convolve4_8_y(const int8x16_t s0, const int8x16_t s1,
+                                      const int8x8_t filters) {
+  // The sample range transform and permutation are performed by the caller.
+  // Accumulate into 128 << FILTER_BITS to account for range transform.
+  // (Divided by 2 since we halved the filter values.)
+  const int32x4_t acc = vdupq_n_s32(128 << (FILTER_BITS - 1));
+
+  int32x4_t sum0123 = vdotq_lane_s32(acc, s0, filters, 0);
+  int32x4_t sum4567 = vdotq_lane_s32(acc, s1, filters, 0);
+
+  // Narrow and re-pack.
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0123), vmovn_s32(sum4567));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
+}
+
+static inline void convolve_y_sr_4tap_neon_dotprod(
+    const uint8_t *src_ptr, int src_stride, uint8_t *dst_ptr, int dst_stride,
+    int w, int h, const int16_t *y_filter_ptr) {
+  // Filter values are even, so halve to reduce intermediate precision reqs.
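+  // The non-zero taps of a 4-tap kernel occupy y_filter_ptr[2..5] of the
+  // zero-padded 8-tap array; packing them into lanes 0-3 of `filter` lets a
+  // single vdotq_lane_s32(..., filter, 0) cover the whole kernel.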
+  const int16x8_t filter_s16 =
+      vcombine_s16(vld1_s16(y_filter_ptr + 2), vdup_n_s16(0));
+  const int8x8_t filter = vshrn_n_s16(filter_s16, 1);
+  const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
+  int8x16x2_t samples_LUT;
+
+  if (w == 4) {
+    uint8x8_t t0, t1, t2, t3;
+    load_u8_8x4(src_ptr, src_stride, &t0, &t1, &t2, &t3);
+    src_ptr += 4 * src_stride;
+
+    // Transform sample range to [-128, 127] for 8-bit signed dot product.
+    int8x8_t s0 = vreinterpret_s8_u8(vsub_u8(t0, vdup_n_u8(128)));
+    int8x8_t s1 = vreinterpret_s8_u8(vsub_u8(t1, vdup_n_u8(128)));
+    int8x8_t s2 = vreinterpret_s8_u8(vsub_u8(t2, vdup_n_u8(128)));
+    int8x8_t s3 = vreinterpret_s8_u8(vsub_u8(t3, vdup_n_u8(128)));
+
+    // This operation combines a conventional transpose and the sample permute
+    // required before computing the dot product.
+    int8x16_t s0123;
+    transpose_concat_elems_s8_4x4(s0, s1, s2, s3, &s0123);
+
+    do {
+      uint8x8_t t4, t5, t6, t7;
+      load_u8_8x4(src_ptr, src_stride, &t4, &t5, &t6, &t7);
+
+      int8x8_t s4 = vreinterpret_s8_u8(vsub_u8(t4, vdup_n_u8(128)));
+      int8x8_t s5 = vreinterpret_s8_u8(vsub_u8(t5, vdup_n_u8(128)));
+      int8x8_t s6 = vreinterpret_s8_u8(vsub_u8(t6, vdup_n_u8(128)));
+      int8x8_t s7 = vreinterpret_s8_u8(vsub_u8(t7, vdup_n_u8(128)));
+
+      int8x16_t s4567;
+      transpose_concat_elems_s8_4x4(s4, s5, s6, s7, &s4567);
+
+      // Merge new data into block from previous iteration.
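+      // kDotProdMergeBlockTbl.val[0..2] shift each transposed 4-byte column
+      // group up by one, two or three rows and pull the replacement bytes
+      // from the newly loaded block, so e.g. group 0 of s1234 becomes
+      // { s1[0], s2[0], s3[0], s4[0] } without redoing the transpose.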
+      samples_LUT.val[0] = s0123;
+      samples_LUT.val[1] = s4567;
+      int8x16_t s1234 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+      int8x16_t s2345 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+      int8x16_t s3456 = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+      int16x4_t d0 = convolve4_4_y(s0123, filter);
+      int16x4_t d1 = convolve4_4_y(s1234, filter);
+      int16x4_t d2 = convolve4_4_y(s2345, filter);
+      int16x4_t d3 = convolve4_4_y(s3456, filter);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
+
+      store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
+      store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
+
+      // Prepare block for next iteration - re-using as much as possible.
+      // Shuffle everything up four rows.
+      s0123 = s4567;
+
+      src_ptr += 4 * src_stride;
+      dst_ptr += 4 * dst_stride;
+      h -= 4;
+    } while (h != 0);
+  } else {
+    do {
+      int height = h;
+      const uint8_t *s = src_ptr;
+      uint8_t *d = dst_ptr;
+
+      uint8x8_t t0, t1, t2, t3;
+      load_u8_8x4(s, src_stride, &t0, &t1, &t2, &t3);
+      s += 4 * src_stride;
+
+      // Transform sample range to [-128, 127] for 8-bit signed dot product.
+      int8x8_t s0 = vreinterpret_s8_u8(vsub_u8(t0, vdup_n_u8(128)));
+      int8x8_t s1 = vreinterpret_s8_u8(vsub_u8(t1, vdup_n_u8(128)));
+      int8x8_t s2 = vreinterpret_s8_u8(vsub_u8(t2, vdup_n_u8(128)));
+      int8x8_t s3 = vreinterpret_s8_u8(vsub_u8(t3, vdup_n_u8(128)));
+
+      // This operation combines a conventional transpose and the sample permute
+      // required before computing the dot product.
+      int8x16_t s0123_lo, s0123_hi;
+      transpose_concat_elems_s8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+
+      do {
+        uint8x8_t t4, t5, t6, t7;
+        load_u8_8x4(s, src_stride, &t4, &t5, &t6, &t7);
+
+        int8x8_t s4 = vreinterpret_s8_u8(vsub_u8(t4, vdup_n_u8(128)));
+        int8x8_t s5 = vreinterpret_s8_u8(vsub_u8(t5, vdup_n_u8(128)));
+        int8x8_t s6 = vreinterpret_s8_u8(vsub_u8(t6, vdup_n_u8(128)));
+        int8x8_t s7 = vreinterpret_s8_u8(vsub_u8(t7, vdup_n_u8(128)));
+
+        int8x16_t s4567_lo, s4567_hi;
+        transpose_concat_elems_s8_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
+
+        // Merge new data into block from previous iteration.
+        samples_LUT.val[0] = s0123_lo;
+        samples_LUT.val[1] = s4567_lo;
+        int8x16_t s1234_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+        int8x16_t s2345_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+        int8x16_t s3456_lo = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+        samples_LUT.val[0] = s0123_hi;
+        samples_LUT.val[1] = s4567_hi;
+        int8x16_t s1234_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[0]);
+        int8x16_t s2345_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[1]);
+        int8x16_t s3456_hi = vqtbl2q_s8(samples_LUT, merge_block_tbl.val[2]);
+
+        uint8x8_t d0 = convolve4_8_y(s0123_lo, s0123_hi, filter);
+        uint8x8_t d1 = convolve4_8_y(s1234_lo, s1234_hi, filter);
+        uint8x8_t d2 = convolve4_8_y(s2345_lo, s2345_hi, filter);
+        uint8x8_t d3 = convolve4_8_y(s3456_lo, s3456_hi, filter);
+
+        store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+        // Prepare block for next iteration - re-using as much as possible.
+        // Shuffle everything up four rows.
+        s0123_lo = s4567_lo;
+        s0123_hi = s4567_hi;
 
         s += 4 * src_stride;
         d += 4 * dst_stride;
@@ -864,27 +982,20 @@
   }
 
   const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
-
-  if (y_filter_taps <= 6) {
-    av1_convolve_y_sr_neon(src, src_stride, dst, dst_stride, w, h,
-                           filter_params_y, subpel_y_qn);
-    return;
-  }
-
-  const int vert_offset = y_filter_taps / 2 - 1;
-  src -= vert_offset * src_stride;
-
   const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
       filter_params_y, subpel_y_qn & SUBPEL_MASK);
 
-  if (y_filter_taps > 8) {
-    convolve_y_sr_12tap_neon_dotprod(src, src_stride, dst, dst_stride, w, h,
-                                     y_filter_ptr);
-    return;
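+  // The source pointer is offset up by (kernel_taps / 2 - 1) rows for the
+  // kernel actually used: 1 row for the 4-tap path, 3 for the 8-tap path
+  // (which also covers zero-padded 6-tap kernels) and 5 for the 12-tap path.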
+  if (y_filter_taps <= 4) {
+    convolve_y_sr_4tap_neon_dotprod(src - src_stride, src_stride, dst,
+                                    dst_stride, w, h, y_filter_ptr);
+  } else if (y_filter_taps == 12) {
+    convolve_y_sr_12tap_neon_dotprod(src - 5 * src_stride, src_stride, dst,
+                                     dst_stride, w, h, y_filter_ptr);
+  } else {
+    // 6-tap or 8-tap.
+    convolve_y_sr_8tap_neon_dotprod(src - 3 * src_stride, src_stride, dst,
+                                    dst_stride, w, h, y_filter_ptr);
   }
-
-  convolve_y_sr_8tap_neon_dotprod(src, src_stride, dst, dst_stride, w, h,
-                                  y_filter_ptr);
 }
 
 static inline int16x4_t convolve12_4_2d_h(uint8x16_t samples,
diff --git a/av1/common/arm/convolve_neon_i8mm.c b/av1/common/arm/convolve_neon_i8mm.c
index acd912e..0ecf6b2 100644
--- a/av1/common/arm/convolve_neon_i8mm.c
+++ b/av1/common/arm/convolve_neon_i8mm.c
@@ -16,6 +16,7 @@
 
 #include "aom_dsp/aom_dsp_common.h"
 #include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
 #include "aom_ports/mem.h"
 #include "av1/common/arm/convolve_neon.h"
 #include "av1/common/arm/convolve_neon_i8mm.h"
@@ -46,7 +47,7 @@
   int32x4_t sum = vusmmlaq_s32(horiz_const, perm_samples[0], filter[0]);
   sum = vusmmlaq_s32(sum, perm_samples[1], filter[1]);
 
-  return vqrshrn_n_s32(sum, FILTER_BITS);
+  return vshrn_n_s32(sum, 1);
 }
 
 static inline uint8x8_t convolve12_8_x(uint8x16_t samples[2],
@@ -71,9 +72,9 @@
   sum4567 = vusmmlaq_s32(sum4567, perm_samples[3], filter[1]);
 
   // Narrow and re-pack.
-  int16x8_t sum_s16 = vcombine_s16(vqrshrn_n_s32(sum0123, FILTER_BITS),
-                                   vqrshrn_n_s32(sum4567, FILTER_BITS));
-  return vqmovun_s16(sum_s16);
+  int16x8_t sum_s16 =
+      vcombine_s16(vshrn_n_s32(sum0123, 1), vshrn_n_s32(sum4567, 1));
+  return vqrshrun_n_s16(sum_s16, FILTER_BITS - 1);
 }
 
 static inline void convolve_x_sr_12tap_neon_i8mm(const uint8_t *src,
@@ -116,8 +117,8 @@
       int16x4_t d2 = convolve12_4_x(s2, filter, permute_tbl, horiz_const);
       int16x4_t d3 = convolve12_4_x(s3, filter, permute_tbl, horiz_const);
 
-      uint8x8_t d01 = vqmovun_s16(vcombine_s16(d0, d1));
-      uint8x8_t d23 = vqmovun_s16(vcombine_s16(d2, d3));
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst + 2 * dst_stride, dst_stride, d23);
@@ -352,58 +353,6 @@
                                x_filter_ptr, horiz_const);
 }
 
-static inline void transpose_concat_4x4(uint8x8_t a0, uint8x8_t a1,
-                                        uint8x8_t a2, uint8x8_t a3,
-                                        uint8x16_t *b) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, XX, XX, XX, XX
-  // a1: 10, 11, 12, 13, XX, XX, XX, XX
-  // a2: 20, 21, 22, 23, XX, XX, XX, XX
-  // a3: 30, 31, 32, 33, XX, XX, XX, XX
-  //
-  // b: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-
-  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
-  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
-  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
-  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
-
-  uint8x16_t a01 = vzipq_u8(a0q, a1q).val[0];
-  uint8x16_t a23 = vzipq_u8(a2q, a3q).val[0];
-
-  uint16x8_t a0123 =
-      vzipq_u16(vreinterpretq_u16_u8(a01), vreinterpretq_u16_u8(a23)).val[0];
-
-  *b = vreinterpretq_u8_u16(a0123);
-}
-
-static inline void transpose_concat_8x4(uint8x8_t a0, uint8x8_t a1,
-                                        uint8x8_t a2, uint8x8_t a3,
-                                        uint8x16_t *b0, uint8x16_t *b1) {
-  // Transpose 8-bit elements and concatenate result rows as follows:
-  // a0: 00, 01, 02, 03, 04, 05, 06, 07
-  // a1: 10, 11, 12, 13, 14, 15, 16, 17
-  // a2: 20, 21, 22, 23, 24, 25, 26, 27
-  // a3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // b0: 00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32, 03, 13, 23, 33
-  // b1: 04, 14, 24, 34, 05, 15, 25, 35, 06, 16, 26, 36, 07, 17, 27, 37
-
-  uint8x16_t a0q = vcombine_u8(a0, vdup_n_u8(0));
-  uint8x16_t a1q = vcombine_u8(a1, vdup_n_u8(0));
-  uint8x16_t a2q = vcombine_u8(a2, vdup_n_u8(0));
-  uint8x16_t a3q = vcombine_u8(a3, vdup_n_u8(0));
-
-  uint8x16_t a01 = vzipq_u8(a0q, a1q).val[0];
-  uint8x16_t a23 = vzipq_u8(a2q, a3q).val[0];
-
-  uint16x8x2_t a0123 =
-      vzipq_u16(vreinterpretq_u16_u8(a01), vreinterpretq_u16_u8(a23));
-
-  *b0 = vreinterpretq_u8_u16(a0123.val[0]);
-  *b1 = vreinterpretq_u8_u16(a0123.val[1]);
-}
-
 static inline int16x4_t convolve12_4_y(const uint8x16_t s0, const uint8x16_t s1,
                                        const uint8x16_t s2,
                                        const int8x8_t filters_0_7,
@@ -413,7 +362,7 @@
   sum = vusdotq_lane_s32(sum, s2, filters_4_11, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vshrn_n_s32(sum, 1);
 }
 
 static inline uint8x8_t convolve12_8_y(
@@ -429,8 +378,9 @@
   sum4567 = vusdotq_lane_s32(sum4567, s2_hi, filters_4_11, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0123), vqmovn_s32(sum4567));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum =
+      vcombine_s16(vshrn_n_s32(sum0123, 1), vshrn_n_s32(sum4567, 1));
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve_y_sr_12tap_neon_i8mm(const uint8_t *src_ptr,
@@ -455,21 +405,21 @@
     // This operation combines a conventional transpose and the sample permute
     // (see horizontal case) required before computing the dot product.
     uint8x16_t s0123, s1234, s2345, s3456, s4567, s5678, s6789, s789A;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
-    transpose_concat_4x4(s4, s5, s6, s7, &s4567);
-    transpose_concat_4x4(s5, s6, s7, s8, &s5678);
-    transpose_concat_4x4(s6, s7, s8, s9, &s6789);
-    transpose_concat_4x4(s7, s8, s9, sA, &s789A);
+    transpose_concat_elems_u8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_u8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_u8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_u8_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_u8_4x4(s4, s5, s6, s7, &s4567);
+    transpose_concat_elems_u8_4x4(s5, s6, s7, s8, &s5678);
+    transpose_concat_elems_u8_4x4(s6, s7, s8, s9, &s6789);
+    transpose_concat_elems_u8_4x4(s7, s8, s9, sA, &s789A);
 
     do {
       uint8x8_t sB, sC, sD, sE;
       load_u8_8x4(src_ptr, src_stride, &sB, &sC, &sD, &sE);
 
       uint8x16_t s89AB, s9ABC, sABCD, sBCDE;
-      transpose_concat_4x4(sB, sC, sD, sE, &sBCDE);
+      transpose_concat_elems_u8_4x4(sB, sC, sD, sE, &sBCDE);
 
       // Merge new data into block from previous iteration.
       uint8x16x2_t samples_LUT = { { s789A, sBCDE } };
@@ -485,8 +435,8 @@
           convolve12_4_y(s2345, s6789, sABCD, filter_0_7, filter_4_11);
       int16x4_t d3 =
           convolve12_4_y(s3456, s789A, sBCDE, filter_0_7, filter_4_11);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
@@ -523,14 +473,14 @@
       uint8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi, s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo,
           s6789_hi, s789A_lo, s789A_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
-      transpose_concat_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
-      transpose_concat_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi);
-      transpose_concat_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi);
-      transpose_concat_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
+      transpose_concat_elems_u8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_u8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_u8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_u8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_u8_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
+      transpose_concat_elems_u8_8x4(s5, s6, s7, s8, &s5678_lo, &s5678_hi);
+      transpose_concat_elems_u8_8x4(s6, s7, s8, s9, &s6789_lo, &s6789_hi);
+      transpose_concat_elems_u8_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
 
       do {
         uint8x8_t sB, sC, sD, sE;
@@ -538,7 +488,7 @@
 
         uint8x16_t s89AB_lo, s89AB_hi, s9ABC_lo, s9ABC_hi, sABCD_lo, sABCD_hi,
             sBCDE_lo, sBCDE_hi;
-        transpose_concat_8x4(sB, sC, sD, sE, &sBCDE_lo, &sBCDE_hi);
+        transpose_concat_elems_u8_8x4(sB, sC, sD, sE, &sBCDE_lo, &sBCDE_hi);
 
         // Merge new data into block from previous iteration.
         uint8x16x2_t samples_LUT_lo = { { s789A_lo, sBCDE_lo } };
@@ -602,7 +552,7 @@
   sum = vusdotq_lane_s32(sum, s1, filters, 1);
 
   // Further narrowing and packing is performed by the caller.
-  return vqmovn_s32(sum);
+  return vmovn_s32(sum);
 }
 
 static inline uint8x8_t convolve8_8_y(const uint8x16_t s0_lo,
@@ -617,8 +567,9 @@
   sum4567 = vusdotq_lane_s32(sum4567, s1_hi, filters, 1);
 
   // Narrow and re-pack.
-  int16x8_t sum = vcombine_s16(vqmovn_s32(sum0123), vqmovn_s32(sum4567));
-  return vqrshrun_n_s16(sum, FILTER_BITS);
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0123), vmovn_s32(sum4567));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
 }
 
 static inline void convolve_y_sr_8tap_neon_i8mm(const uint8_t *src_ptr,
@@ -626,7 +577,8 @@
                                                 uint8_t *dst_ptr,
                                                 int dst_stride, int w, int h,
                                                 const int16_t *y_filter_ptr) {
-  const int8x8_t filter = vmovn_s16(vld1q_s16(y_filter_ptr));
+  // Filter values are even, so halve to reduce intermediate precision reqs.
+  const int8x8_t filter = vshrn_n_s16(vld1q_s16(y_filter_ptr), 1);
 
   const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
 
@@ -638,20 +590,20 @@
     // This operation combines a conventional transpose and the sample permute
     // (see horizontal case) required before computing the dot product.
     uint8x16_t s0123, s1234, s2345, s3456;
-    transpose_concat_4x4(s0, s1, s2, s3, &s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, &s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, &s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, &s3456);
+    transpose_concat_elems_u8_4x4(s0, s1, s2, s3, &s0123);
+    transpose_concat_elems_u8_4x4(s1, s2, s3, s4, &s1234);
+    transpose_concat_elems_u8_4x4(s2, s3, s4, s5, &s2345);
+    transpose_concat_elems_u8_4x4(s3, s4, s5, s6, &s3456);
 
     do {
-      uint8x8_t s7, s8, s9, s10;
-      load_u8_8x4(src_ptr, src_stride, &s7, &s8, &s9, &s10);
+      uint8x8_t s7, s8, s9, sA;
+      load_u8_8x4(src_ptr, src_stride, &s7, &s8, &s9, &sA);
 
-      uint8x16_t s4567, s5678, s6789, s78910;
-      transpose_concat_4x4(s7, s8, s9, s10, &s78910);
+      uint8x16_t s4567, s5678, s6789, s789A;
+      transpose_concat_elems_u8_4x4(s7, s8, s9, sA, &s789A);
 
       // Merge new data into block from previous iteration.
-      uint8x16x2_t samples_LUT = { { s3456, s78910 } };
+      uint8x16x2_t samples_LUT = { { s3456, s789A } };
       s4567 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
       s5678 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
       s6789 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
@@ -659,9 +611,10 @@
       int16x4_t d0 = convolve8_4_y(s0123, s4567, filter);
       int16x4_t d1 = convolve8_4_y(s1234, s5678, filter);
       int16x4_t d2 = convolve8_4_y(s2345, s6789, filter);
-      int16x4_t d3 = convolve8_4_y(s3456, s78910, filter);
-      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS);
-      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS);
+      int16x4_t d3 = convolve8_4_y(s3456, s789A, filter);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
 
       store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
       store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
@@ -671,7 +624,7 @@
       s0123 = s4567;
       s1234 = s5678;
       s2345 = s6789;
-      s3456 = s78910;
+      s3456 = s789A;
 
       src_ptr += 4 * src_stride;
       dst_ptr += 4 * dst_stride;
@@ -692,26 +645,26 @@
       // product.
       uint8x16_t s0123_lo, s0123_hi, s1234_lo, s1234_hi, s2345_lo, s2345_hi,
           s3456_lo, s3456_hi;
-      transpose_concat_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
-      transpose_concat_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
-      transpose_concat_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
-      transpose_concat_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
+      transpose_concat_elems_u8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+      transpose_concat_elems_u8_8x4(s1, s2, s3, s4, &s1234_lo, &s1234_hi);
+      transpose_concat_elems_u8_8x4(s2, s3, s4, s5, &s2345_lo, &s2345_hi);
+      transpose_concat_elems_u8_8x4(s3, s4, s5, s6, &s3456_lo, &s3456_hi);
 
       do {
-        uint8x8_t s7, s8, s9, s10;
-        load_u8_8x4(s, src_stride, &s7, &s8, &s9, &s10);
+        uint8x8_t s7, s8, s9, sA;
+        load_u8_8x4(s, src_stride, &s7, &s8, &s9, &sA);
 
         uint8x16_t s4567_lo, s4567_hi, s5678_lo, s5678_hi, s6789_lo, s6789_hi,
-            s78910_lo, s78910_hi;
-        transpose_concat_8x4(s7, s8, s9, s10, &s78910_lo, &s78910_hi);
+            s789A_lo, s789A_hi;
+        transpose_concat_elems_u8_8x4(s7, s8, s9, sA, &s789A_lo, &s789A_hi);
 
         // Merge new data into block from previous iteration.
-        uint8x16x2_t samples_LUT_lo = { { s3456_lo, s78910_lo } };
+        uint8x16x2_t samples_LUT_lo = { { s3456_lo, s789A_lo } };
         s4567_lo = vqtbl2q_u8(samples_LUT_lo, merge_block_tbl.val[0]);
         s5678_lo = vqtbl2q_u8(samples_LUT_lo, merge_block_tbl.val[1]);
         s6789_lo = vqtbl2q_u8(samples_LUT_lo, merge_block_tbl.val[2]);
 
-        uint8x16x2_t samples_LUT_hi = { { s3456_hi, s78910_hi } };
+        uint8x16x2_t samples_LUT_hi = { { s3456_hi, s789A_hi } };
         s4567_hi = vqtbl2q_u8(samples_LUT_hi, merge_block_tbl.val[0]);
         s5678_hi = vqtbl2q_u8(samples_LUT_hi, merge_block_tbl.val[1]);
         s6789_hi = vqtbl2q_u8(samples_LUT_hi, merge_block_tbl.val[2]);
@@ -723,7 +676,7 @@
         uint8x8_t d2 =
             convolve8_8_y(s2345_lo, s2345_hi, s6789_lo, s6789_hi, filter);
         uint8x8_t d3 =
-            convolve8_8_y(s3456_lo, s3456_hi, s78910_lo, s78910_hi, filter);
+            convolve8_8_y(s3456_lo, s3456_hi, s789A_lo, s789A_hi, filter);
 
         store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
 
@@ -735,8 +688,140 @@
         s1234_hi = s5678_hi;
         s2345_lo = s6789_lo;
         s2345_hi = s6789_hi;
-        s3456_lo = s78910_lo;
-        s3456_hi = s78910_hi;
+        s3456_lo = s789A_lo;
+        s3456_hi = s789A_hi;
+
+        s += 4 * src_stride;
+        d += 4 * dst_stride;
+        height -= 4;
+      } while (height != 0);
+      src_ptr += 8;
+      dst_ptr += 8;
+      w -= 8;
+    } while (w != 0);
+  }
+}
+
+static inline int16x4_t convolve4_4_y(const uint8x16_t s0,
+                                      const int8x8_t filters) {
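+  // vusdotq_lane_s32 multiplies unsigned samples by signed filter taps
+  // directly, so no sample range transform (and hence no compensating
+  // accumulator offset) is needed, unlike the Neon dotprod variant.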
+  int32x4_t sum = vusdotq_lane_s32(vdupq_n_s32(0), s0, filters, 0);
+
+  // Further narrowing and packing is performed by the caller.
+  return vmovn_s32(sum);
+}
+
+static inline uint8x8_t convolve4_8_y(const uint8x16_t s0, const uint8x16_t s1,
+                                      const int8x8_t filters) {
+  int32x4_t sum0123 = vusdotq_lane_s32(vdupq_n_s32(0), s0, filters, 0);
+  int32x4_t sum4567 = vusdotq_lane_s32(vdupq_n_s32(0), s1, filters, 0);
+
+  // Narrow and re-pack.
+  int16x8_t sum = vcombine_s16(vmovn_s32(sum0123), vmovn_s32(sum4567));
+  // We halved the filter values so -1 from right shift.
+  return vqrshrun_n_s16(sum, FILTER_BITS - 1);
+}
+
+static inline void convolve_y_sr_4tap_neon_i8mm(const uint8_t *src_ptr,
+                                                int src_stride,
+                                                uint8_t *dst_ptr,
+                                                int dst_stride, int w, int h,
+                                                const int16_t *y_filter_ptr) {
+  // Filter values are even, so halve to reduce intermediate precision reqs.
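+  // As in the Neon dotprod variant, the non-zero taps of a 4-tap kernel come
+  // from y_filter_ptr[2..5] and land in lanes 0-3 of `filter`.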
+  const int16x8_t filter_s16 =
+      vcombine_s16(vld1_s16(y_filter_ptr + 2), vdup_n_s16(0));
+  const int8x8_t filter = vshrn_n_s16(filter_s16, 1);
+  const uint8x16x3_t merge_block_tbl = vld1q_u8_x3(kDotProdMergeBlockTbl);
+  uint8x16x2_t samples_LUT;
+
+  if (w == 4) {
+    uint8x8_t s0, s1, s2, s3;
+    load_u8_8x4(src_ptr, src_stride, &s0, &s1, &s2, &s3);
+    src_ptr += 4 * src_stride;
+
+    // This operation combines a conventional transpose and the sample permute
+    // required before computing the dot product.
+    uint8x16_t s0123;
+    transpose_concat_elems_u8_4x4(s0, s1, s2, s3, &s0123);
+
+    do {
+      uint8x8_t s4, s5, s6, s7;
+      load_u8_8x4(src_ptr, src_stride, &s4, &s5, &s6, &s7);
+
+      uint8x16_t s4567;
+      transpose_concat_elems_u8_4x4(s4, s5, s6, s7, &s4567);
+
+      // Merge new data into block from previous iteration.
+      samples_LUT.val[0] = s0123;
+      samples_LUT.val[1] = s4567;
+      uint8x16_t s1234 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+      uint8x16_t s2345 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+      uint8x16_t s3456 = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+      int16x4_t d0 = convolve4_4_y(s0123, filter);
+      int16x4_t d1 = convolve4_4_y(s1234, filter);
+      int16x4_t d2 = convolve4_4_y(s2345, filter);
+      int16x4_t d3 = convolve4_4_y(s3456, filter);
+      // We halved the filter values so -1 from right shift.
+      uint8x8_t d01 = vqrshrun_n_s16(vcombine_s16(d0, d1), FILTER_BITS - 1);
+      uint8x8_t d23 = vqrshrun_n_s16(vcombine_s16(d2, d3), FILTER_BITS - 1);
+
+      store_u8x4_strided_x2(dst_ptr + 0 * dst_stride, dst_stride, d01);
+      store_u8x4_strided_x2(dst_ptr + 2 * dst_stride, dst_stride, d23);
+
+      // Prepare block for next iteration - re-using as much as possible.
+      // Shuffle everything up four rows.
+      s0123 = s4567;
+
+      src_ptr += 4 * src_stride;
+      dst_ptr += 4 * dst_stride;
+      h -= 4;
+    } while (h != 0);
+  } else {
+    do {
+      int height = h;
+      const uint8_t *s = src_ptr;
+      uint8_t *d = dst_ptr;
+
+      uint8x8_t s0, s1, s2, s3;
+      load_u8_8x4(s, src_stride, &s0, &s1, &s2, &s3);
+      s += 4 * src_stride;
+
+      // This operation combines a conventional transpose and the sample permute
+      // required before computing the dot product.
+      uint8x16_t s0123_lo, s0123_hi;
+      transpose_concat_elems_u8_8x4(s0, s1, s2, s3, &s0123_lo, &s0123_hi);
+
+      do {
+        uint8x8_t s4, s5, s6, s7;
+        load_u8_8x4(s, src_stride, &s4, &s5, &s6, &s7);
+
+        uint8x16_t s4567_lo, s4567_hi;
+        transpose_concat_elems_u8_8x4(s4, s5, s6, s7, &s4567_lo, &s4567_hi);
+
+        // Merge new data into block from previous iteration.
+        samples_LUT.val[0] = s0123_lo;
+        samples_LUT.val[1] = s4567_lo;
+        uint8x16_t s1234_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+        uint8x16_t s2345_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+        uint8x16_t s3456_lo = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+        samples_LUT.val[0] = s0123_hi;
+        samples_LUT.val[1] = s4567_hi;
+        uint8x16_t s1234_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[0]);
+        uint8x16_t s2345_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[1]);
+        uint8x16_t s3456_hi = vqtbl2q_u8(samples_LUT, merge_block_tbl.val[2]);
+
+        uint8x8_t d0 = convolve4_8_y(s0123_lo, s0123_hi, filter);
+        uint8x8_t d1 = convolve4_8_y(s1234_lo, s1234_hi, filter);
+        uint8x8_t d2 = convolve4_8_y(s2345_lo, s2345_hi, filter);
+        uint8x8_t d3 = convolve4_8_y(s3456_lo, s3456_hi, filter);
+
+        store_u8_8x4(d, dst_stride, d0, d1, d2, d3);
+
+        // Prepare block for next iteration - re-using as much as possible.
+        // Shuffle everything up four rows.
+        s0123_lo = s4567_lo;
+        s0123_hi = s4567_hi;
 
         s += 4 * src_stride;
         d += 4 * dst_stride;
@@ -760,26 +845,20 @@
   }
 
   const int y_filter_taps = get_filter_tap(filter_params_y, subpel_y_qn);
-
-  if (y_filter_taps <= 6) {
-    av1_convolve_y_sr_neon(src, src_stride, dst, dst_stride, w, h,
-                           filter_params_y, subpel_y_qn);
-    return;
-  }
-
-  const int vert_offset = y_filter_taps / 2 - 1;
-  src -= vert_offset * src_stride;
-
   const int16_t *y_filter_ptr = av1_get_interp_filter_subpel_kernel(
       filter_params_y, subpel_y_qn & SUBPEL_MASK);
 
-  if (y_filter_taps > 8) {
-    convolve_y_sr_12tap_neon_i8mm(src, src_stride, dst, dst_stride, w, h,
-                                  y_filter_ptr);
-    return;
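+  // Source offsets follow (kernel_taps / 2 - 1) rows, as in the Neon dotprod
+  // dispatch above: 1 row for 4-tap, 3 for 6/8-tap and 5 for 12-tap kernels.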
+  if (y_filter_taps <= 4) {
+    convolve_y_sr_4tap_neon_i8mm(src - src_stride, src_stride, dst, dst_stride,
+                                 w, h, y_filter_ptr);
+  } else if (y_filter_taps == 12) {
+    convolve_y_sr_12tap_neon_i8mm(src - 5 * src_stride, src_stride, dst,
+                                  dst_stride, w, h, y_filter_ptr);
+  } else {
+    // 6-tap or 8-tap.
+    convolve_y_sr_8tap_neon_i8mm(src - 3 * src_stride, src_stride, dst,
+                                 dst_stride, w, h, y_filter_ptr);
   }
-  convolve_y_sr_8tap_neon_i8mm(src, src_stride, dst, dst_stride, w, h,
-                               y_filter_ptr);
 }
 
 static inline int16x8_t convolve8_8_2d_h(uint8x16_t samples,
diff --git a/av1/common/arm/convolve_sve2.c b/av1/common/arm/convolve_sve2.c
index 536f441..3cda7d7 100644
--- a/av1/common/arm/convolve_sve2.c
+++ b/av1/common/arm/convolve_sve2.c
@@ -81,21 +81,21 @@
         s6789[2], s789A[2];
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
-    transpose_concat_4x4(s4, s5, s6, s7, s4567);
-    transpose_concat_4x4(s5, s6, s7, s8, s5678);
-    transpose_concat_4x4(s6, s7, s8, s9, s6789);
-    transpose_concat_4x4(s7, s8, s9, sA, s789A);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s4, s5, s6, s7, s4567);
+    transpose_concat_elems_s16_4x4(s5, s6, s7, s8, s5678);
+    transpose_concat_elems_s16_4x4(s6, s7, s8, s9, s6789);
+    transpose_concat_elems_s16_4x4(s7, s8, s9, sA, s789A);
 
     do {
       int16x4_t sB, sC, sD, sE;
       load_s16_4x4(s, src_stride, &sB, &sC, &sD, &sE);
 
       int16x8_t s89AB[2], s9ABC[2], sABCD[2], sBCDE[2];
-      transpose_concat_4x4(sB, sC, sD, sE, sBCDE);
+      transpose_concat_elems_s16_4x4(sB, sC, sD, sE, sBCDE);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s789A, sBCDE, merge_block_tbl.val[0], s89AB);
diff --git a/av1/common/arm/highbd_compound_convolve_sve2.c b/av1/common/arm/highbd_compound_convolve_sve2.c
index 668dfbf..493f218 100644
--- a/av1/common/arm/highbd_compound_convolve_sve2.c
+++ b/av1/common/arm/highbd_compound_convolve_sve2.c
@@ -506,10 +506,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -517,7 +517,7 @@
 
       int16x8_t s4567[2], s5678[2], s6789[2], s789A[2];
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s789A);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s789A);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -559,10 +559,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -570,7 +570,7 @@
         int16x8_t s4567[4], s5678[4], s6789[4], s789A[4];
 
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s789A);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s789A);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -682,10 +682,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -693,7 +693,7 @@
 
       int16x8_t s4567[2], s5678[2], s6789[2], s789A[2];
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s789A);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s789A);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -735,10 +735,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -746,7 +746,7 @@
         int16x8_t s4567[4], s5678[4], s6789[4], s789A[4];
 
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s789A);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s789A);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -1234,10 +1234,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -1245,7 +1245,7 @@
 
       int16x8_t s4567[2], s5678[2], s6789[2], s789A[2];
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s789A);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s789A);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -1291,10 +1291,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -1302,7 +1302,7 @@
         int16x8_t s4567[4], s5678[4], s6789[4], s789A[4];
 
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s789A);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s789A);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
diff --git a/av1/common/arm/highbd_convolve_sve2.c b/av1/common/arm/highbd_convolve_sve2.c
index fcf9d7b..8ce6021 100644
--- a/av1/common/arm/highbd_convolve_sve2.c
+++ b/av1/common/arm/highbd_convolve_sve2.c
@@ -19,6 +19,7 @@
 #include "aom_dsp/arm/aom_neon_sve_bridge.h"
 #include "aom_dsp/arm/aom_neon_sve2_bridge.h"
 #include "aom_dsp/arm/mem_neon.h"
+#include "aom_dsp/arm/transpose_neon.h"
 #include "aom_ports/mem.h"
 #include "av1/common/convolve.h"
 #include "av1/common/filter.h"
@@ -456,21 +457,21 @@
 
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2], s4567[2], s5678[2],
         s6789[2], s789A[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
-    transpose_concat_4x4(s4, s5, s6, s7, s4567);
-    transpose_concat_4x4(s5, s6, s7, s8, s5678);
-    transpose_concat_4x4(s6, s7, s8, s9, s6789);
-    transpose_concat_4x4(s7, s8, s9, sA, s789A);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s4, s5, s6, s7, s4567);
+    transpose_concat_elems_s16_4x4(s5, s6, s7, s8, s5678);
+    transpose_concat_elems_s16_4x4(s6, s7, s8, s9, s6789);
+    transpose_concat_elems_s16_4x4(s7, s8, s9, sA, s789A);
 
     do {
       int16x4_t sB, sC, sD, sE;
       load_s16_4x4(s, src_stride, &sB, &sC, &sD, &sE);
 
       int16x8_t s89AB[2], s9ABC[2], sABCD[2], sBCDE[2];
-      transpose_concat_4x4(sB, sC, sD, sE, sBCDE);
+      transpose_concat_elems_s16_4x4(sB, sC, sD, sE, sBCDE);
 
       // Use the above transpose and reuse data from the previous loop to get
       // the rest.
@@ -597,10 +598,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -608,7 +609,7 @@
 
       int16x8_t s4567[2], s5678[2], s6789[2], s789A[2];
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s789A);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s789A);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -651,10 +652,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -662,7 +663,7 @@
 
         int16x8_t s4567[4], s5678[4], s6789[4], s789A[4];
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s789A);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s789A);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -757,10 +758,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-      transpose_concat_4x4(s0, s1, s2, s3, s0123);
-      transpose_concat_4x4(s1, s2, s3, s4, s1234);
-      transpose_concat_4x4(s2, s3, s4, s5, s2345);
-      transpose_concat_4x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
       uint16x4_t d0 = highbd_convolve4_4_y(s0123, y_filter, max);
       uint16x4_t d1 = highbd_convolve4_4_y(s1234, y_filter, max);
@@ -797,10 +798,10 @@
         // This operation combines a conventional transpose and the sample
         // permute required before computing the dot product.
         int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-        transpose_concat_8x4(s0, s1, s2, s3, s0123);
-        transpose_concat_8x4(s1, s2, s3, s4, s1234);
-        transpose_concat_8x4(s2, s3, s4, s5, s2345);
-        transpose_concat_8x4(s3, s4, s5, s6, s3456);
+        transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+        transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+        transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+        transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
         uint16x8_t d0 = highbd_convolve4_8_y(s0123, y_filter, max);
         uint16x8_t d1 = highbd_convolve4_8_y(s1234, y_filter, max);
@@ -1245,21 +1246,21 @@
         s6789[2], s789A[2];
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
-    transpose_concat_4x4(s4, s5, s6, s7, s4567);
-    transpose_concat_4x4(s5, s6, s7, s8, s5678);
-    transpose_concat_4x4(s6, s7, s8, s9, s6789);
-    transpose_concat_4x4(s7, s8, s9, sA, s789A);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s4, s5, s6, s7, s4567);
+    transpose_concat_elems_s16_4x4(s5, s6, s7, s8, s5678);
+    transpose_concat_elems_s16_4x4(s6, s7, s8, s9, s6789);
+    transpose_concat_elems_s16_4x4(s7, s8, s9, sA, s789A);
 
     do {
       int16x4_t sB, sC, sD, sE;
       load_s16_4x4(s, src_stride, &sB, &sC, &sD, &sE);
 
       int16x8_t s89AB[2], s9ABC[2], sABCD[2], sBCDE[2];
-      transpose_concat_4x4(sB, sC, sD, sE, sBCDE);
+      transpose_concat_elems_s16_4x4(sB, sC, sD, sE, sBCDE);
 
       // Use the above transpose and reuse data from the previous loop to get
       // the rest.
@@ -1383,10 +1384,10 @@
     // This operation combines a conventional transpose and the sample permute
     // required before computing the dot product.
     int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-    transpose_concat_4x4(s0, s1, s2, s3, s0123);
-    transpose_concat_4x4(s1, s2, s3, s4, s1234);
-    transpose_concat_4x4(s2, s3, s4, s5, s2345);
-    transpose_concat_4x4(s3, s4, s5, s6, s3456);
+    transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+    transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+    transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+    transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
     do {
       int16x4_t s7, s8, s9, s10;
@@ -1394,7 +1395,7 @@
 
       int16x8_t s4567[2], s5678[2], s6789[2], s789A[2];
       // Transpose and shuffle the 4 lines that were loaded.
-      transpose_concat_4x4(s7, s8, s9, s10, s789A);
+      transpose_concat_elems_s16_4x4(s7, s8, s9, s10, s789A);
 
       // Merge new data into block from previous iteration.
       aom_tbl2x2_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -1442,10 +1443,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-      transpose_concat_8x4(s0, s1, s2, s3, s0123);
-      transpose_concat_8x4(s1, s2, s3, s4, s1234);
-      transpose_concat_8x4(s2, s3, s4, s5, s2345);
-      transpose_concat_8x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
       do {
         int16x8_t s7, s8, s9, s10;
@@ -1453,7 +1454,7 @@
 
         int16x8_t s4567[4], s5678[4], s6789[4], s789A[4];
         // Transpose and shuffle the 4 lines that were loaded.
-        transpose_concat_8x4(s7, s8, s9, s10, s789A);
+        transpose_concat_elems_s16_8x4(s7, s8, s9, s10, s789A);
 
         // Merge new data into block from previous iteration.
         aom_tbl2x4_s16(s3456, s789A, merge_block_tbl.val[0], s4567);
@@ -1562,10 +1563,10 @@
       // This operation combines a conventional transpose and the sample permute
       // required before computing the dot product.
       int16x8_t s0123[2], s1234[2], s2345[2], s3456[2];
-      transpose_concat_4x4(s0, s1, s2, s3, s0123);
-      transpose_concat_4x4(s1, s2, s3, s4, s1234);
-      transpose_concat_4x4(s2, s3, s4, s5, s2345);
-      transpose_concat_4x4(s3, s4, s5, s6, s3456);
+      transpose_concat_elems_s16_4x4(s0, s1, s2, s3, s0123);
+      transpose_concat_elems_s16_4x4(s1, s2, s3, s4, s1234);
+      transpose_concat_elems_s16_4x4(s2, s3, s4, s5, s2345);
+      transpose_concat_elems_s16_4x4(s3, s4, s5, s6, s3456);
 
       uint16x4_t d0 =
           highbd_convolve4_4_2d_v(s0123, y_filter, shift, offset, max);
@@ -1606,10 +1607,10 @@
         // This operation combines a conventional transpose and the sample
         // permute required before computing the dot product.
         int16x8_t s0123[4], s1234[4], s2345[4], s3456[4];
-        transpose_concat_8x4(s0, s1, s2, s3, s0123);
-        transpose_concat_8x4(s1, s2, s3, s4, s1234);
-        transpose_concat_8x4(s2, s3, s4, s5, s2345);
-        transpose_concat_8x4(s3, s4, s5, s6, s3456);
+        transpose_concat_elems_s16_8x4(s0, s1, s2, s3, s0123);
+        transpose_concat_elems_s16_8x4(s1, s2, s3, s4, s1234);
+        transpose_concat_elems_s16_8x4(s2, s3, s4, s5, s2345);
+        transpose_concat_elems_s16_8x4(s3, s4, s5, s6, s3456);
 
         uint16x8_t d0 =
             highbd_convolve4_8_2d_v(s0123, y_filter, shift, offset, max);
diff --git a/av1/common/arm/highbd_convolve_sve2.h b/av1/common/arm/highbd_convolve_sve2.h
index abbad14..40ba2cd 100644
--- a/av1/common/arm/highbd_convolve_sve2.h
+++ b/av1/common/arm/highbd_convolve_sve2.h
@@ -27,59 +27,6 @@
 };
 // clang-format on
 
-static inline void transpose_concat_4x4(int16x4_t s0, int16x4_t s1,
-                                        int16x4_t s2, int16x4_t s3,
-                                        int16x8_t res[2]) {
-  // Transpose 16-bit elements and concatenate result rows as follows:
-  // s0: 00, 01, 02, 03
-  // s1: 10, 11, 12, 13
-  // s2: 20, 21, 22, 23
-  // s3: 30, 31, 32, 33
-  //
-  // res[0]: 00 10 20 30 01 11 21 31
-  // res[1]: 02 12 22 32 03 13 23 33
-
-  int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
-  int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
-  int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
-  int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
-
-  int32x4_t s01 = vreinterpretq_s32_s16(vzip1q_s16(s0q, s1q));
-  int32x4_t s23 = vreinterpretq_s32_s16(vzip1q_s16(s2q, s3q));
-
-  int32x4x2_t s0123 = vzipq_s32(s01, s23);
-
-  res[0] = vreinterpretq_s16_s32(s0123.val[0]);
-  res[1] = vreinterpretq_s16_s32(s0123.val[1]);
-}
-
-static inline void transpose_concat_8x4(int16x8_t s0, int16x8_t s1,
-                                        int16x8_t s2, int16x8_t s3,
-                                        int16x8_t res[4]) {
-  // Transpose 16-bit elements and concatenate result rows as follows:
-  // s0: 00, 01, 02, 03, 04, 05, 06, 07
-  // s1: 10, 11, 12, 13, 14, 15, 16, 17
-  // s2: 20, 21, 22, 23, 24, 25, 26, 27
-  // s3: 30, 31, 32, 33, 34, 35, 36, 37
-  //
-  // res[0]: 00 10 20 30 01 11 21 31
-  // res[1]: 02 12 22 32 03 13 23 33
-  // res[2]: 04 14 24 34 05 15 25 35
-  // res[3]: 06 16 26 36 07 17 27 37
-
-  int16x8x2_t tr01_16 = vzipq_s16(s0, s1);
-  int16x8x2_t tr23_16 = vzipq_s16(s2, s3);
-  int32x4x2_t tr01_32 = vzipq_s32(vreinterpretq_s32_s16(tr01_16.val[0]),
-                                  vreinterpretq_s32_s16(tr23_16.val[0]));
-  int32x4x2_t tr23_32 = vzipq_s32(vreinterpretq_s32_s16(tr01_16.val[1]),
-                                  vreinterpretq_s32_s16(tr23_16.val[1]));
-
-  res[0] = vreinterpretq_s16_s32(tr01_32.val[0]);
-  res[1] = vreinterpretq_s16_s32(tr01_32.val[1]);
-  res[2] = vreinterpretq_s16_s32(tr23_32.val[0]);
-  res[3] = vreinterpretq_s16_s32(tr23_32.val[1]);
-}
-
 static inline void aom_tbl2x4_s16(int16x8_t t0[4], int16x8_t t1[4],
                                   uint16x8_t tbl, int16x8_t res[4]) {
   res[0] = aom_tbl2_s16(t0[0], t1[0], tbl);
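
The two static helpers deleted here are the ones being renamed at every call site in this patch; their Arm implementations now come from the shared transpose helpers (note the new aom_dsp/arm/transpose_neon.h include in highbd_convolve_sve2.c) under the names transpose_concat_elems_s16_4x4 and transpose_concat_elems_s16_8x4. A rough scalar sketch of the element layout described in the deleted comments (illustrative only, not the NEON code):

#include <cstdint>

// s holds 4 rows of 4 int16 samples; res receives them column-major,
// two columns concatenated per 8-lane output:
//   res[0]: 00 10 20 30 01 11 21 31
//   res[1]: 02 12 22 32 03 13 23 33
static void transpose_concat_4x4_model(const int16_t s[4][4],
                                       int16_t res[2][8]) {
  for (int col = 0; col < 4; ++col) {
    for (int row = 0; row < 4; ++row) {
      res[col / 2][(col % 2) * 4 + row] = s[row][col];
    }
  }
}

The 8x4 variant follows the same pattern with eight columns spread across four 8-lane outputs.
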
diff --git a/av1/common/arm/reconintra_neon.c b/av1/common/arm/reconintra_neon.c
index 81eb224..dc0cb17 100644
--- a/av1/common/arm/reconintra_neon.c
+++ b/av1/common/arm/reconintra_neon.c
@@ -21,9 +21,6 @@
 
 #define MAX_UPSAMPLE_SZ 16
 
-// TODO(aomedia:349436249): enable for armv7 after SIGBUS is fixed.
-#if AOM_ARCH_AARCH64
-
 // These kernels are a transposed version of those defined in reconintra.c,
 // with the absolute value of the negatives taken in the top row.
 DECLARE_ALIGNED(16, const uint8_t,
@@ -113,7 +110,7 @@
       uint8x8_t s6 = vld1_dup_u8(&buffer[r + 1][c - 1]);
 
       do {
-        uint8x8_t s1234 = load_u8_4x1(&buffer[r - 1][c - 1] + 1);
+        uint8x8_t s1234 = load_unaligned_u8_4x1(&buffer[r - 1][c - 1] + 1);
         uint8x8_t s1 = vdup_lane_u8(s1234, 0);
         uint8x8_t s2 = vdup_lane_u8(s1234, 1);
         uint8x8_t s3 = vdup_lane_u8(s1234, 2);
@@ -212,7 +209,6 @@
     } while (r < height + 1);
   }
 }
-#endif  // AOM_ARCH_AARCH64
 
 void av1_filter_intra_edge_neon(uint8_t *p, int sz, int strength) {
   if (!strength) return;
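
The reconintra change above re-enables the NEON filter-intra path on 32-bit Arm: the 4-byte load at &buffer[r - 1][c - 1] + 1 sits at an odd offset, and switching it to the unaligned helper avoids the alignment-faulting access (SIGBUS) that kept this path disabled on armv7. A sketch of the alignment-safe load pattern, with a hypothetical name (the real helper is load_unaligned_u8_4x1 in aom_dsp/arm/mem_neon.h):

#include <arm_neon.h>
#include <cstring>

// Read 4 bytes from a pointer with no alignment guarantee and place them
// in the low half of a NEON register; memcpy keeps the access byte-wise.
static inline uint8x8_t load_4_bytes_unaligned_sketch(const uint8_t *p) {
  uint32_t value;
  memcpy(&value, p, sizeof(value));
  return vreinterpret_u8_u32(vdup_n_u32(value));
}
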
diff --git a/av1/common/av1_rtcd_defs.pl b/av1/common/av1_rtcd_defs.pl
index 4c90393..1d14bd1 100644
--- a/av1/common/av1_rtcd_defs.pl
+++ b/av1/common/av1_rtcd_defs.pl
@@ -126,12 +126,7 @@
 
 # FILTER_INTRA predictor functions
 add_proto qw/void av1_filter_intra_predictor/, "uint8_t *dst, ptrdiff_t stride, TX_SIZE tx_size, const uint8_t *above, const uint8_t *left, int mode";
-# TODO(aomedia:349436249): enable NEON for armv7 after SIGBUS is fixed.
-if (aom_config("AOM_ARCH_ARM") eq "yes" && aom_config("AOM_ARCH_AARCH64") eq "") {
-  specialize qw/av1_filter_intra_predictor sse4_1/;
-} else {
-  specialize qw/av1_filter_intra_predictor sse4_1 neon/;
-}
+specialize qw/av1_filter_intra_predictor sse4_1 neon/;
 
 # High bitdepth functions
 
@@ -557,6 +552,9 @@
 if ((aom_config("CONFIG_REALTIME_ONLY") ne "yes") || (aom_config("CONFIG_AV1_DECODER") eq "yes")) {
   add_proto qw/void av1_warp_affine/, "const int32_t *mat, const uint8_t *ref, int width, int height, int stride, uint8_t *pred, int p_col, int p_row, int p_width, int p_height, int p_stride, int subsampling_x, int subsampling_y, ConvolveParams *conv_params, int16_t alpha, int16_t beta, int16_t gamma, int16_t delta";
   specialize qw/av1_warp_affine sse4_1 avx2 neon neon_i8mm sve/;
+  if (aom_config("CONFIG_HIGHWAY") eq "yes") {
+    specialize qw/av1_warp_affine avx512/;
+  }
 }
 
 # LOOP_RESTORATION functions
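
The av1_rtcd_defs.pl hunks do two things: drop the armv7 carve-out for av1_filter_intra_predictor (the NEON version is safe again after the reconintra fix above) and register an AVX-512 implementation of av1_warp_affine, but only when the build is configured with CONFIG_HIGHWAY. The generated dispatcher then selects among the registered kernels at runtime; a much-simplified model of that selection, with placeholder names and flags (the real table is generated into av1_rtcd.h):

#include <cstdio>

static void warp_affine_c()      { std::puts("C fallback"); }
static void warp_affine_avx2()   { std::puts("AVX2 kernel"); }
#if defined(CONFIG_HIGHWAY)
static void warp_affine_avx512() { std::puts("AVX-512 (Highway) kernel"); }
#endif

static void (*warp_affine_ptr)() = warp_affine_c;

static void setup_warp_dispatch(bool has_avx2, bool has_avx512) {
  if (has_avx2) warp_affine_ptr = warp_affine_avx2;
#if defined(CONFIG_HIGHWAY)
  if (has_avx512) warp_affine_ptr = warp_affine_avx512;
#else
  (void)has_avx512;
#endif
}
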
diff --git a/av1/common/seg_common.h b/av1/common/seg_common.h
index 3ba5052..3b88893 100644
--- a/av1/common/seg_common.h
+++ b/av1/common/seg_common.h
@@ -12,6 +12,8 @@
 #ifndef AOM_AV1_COMMON_SEG_COMMON_H_
 #define AOM_AV1_COMMON_SEG_COMMON_H_
 
+#include <string.h>
+
 #include "aom_dsp/prob.h"
 
 #ifdef __cplusplus
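
The new header below is written against the Highway library (third_party/highway): the vector width is not fixed but comes from a ScalableTag, loads and stores go through that tag, and the code sits inside the HWY_BEFORE_NAMESPACE()/HWY_NAMESPACE wrappers with HWY_ATTR on each function. A minimal, self-contained sketch of that pattern, unrelated to warp filtering and using only basic ops (it assumes the element count is a multiple of the lane count):

#include <cstddef>
#include <cstdint>
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace example {
namespace HWY_NAMESPACE {

namespace hn = hwy::HWY_NAMESPACE;

// Adds `b` into `a`, a vector's worth of lanes at a time; the lane count is
// chosen by the compiled target rather than hard-coded.
HWY_ATTR void AddArrays(const int16_t* HWY_RESTRICT a,
                        const int16_t* HWY_RESTRICT b,
                        int16_t* HWY_RESTRICT out, size_t size) {
  const hn::ScalableTag<int16_t> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto va = hn::LoadU(d, a + i);
    const auto vb = hn::LoadU(d, b + i);
    hn::StoreU(hn::Add(va, vb), d, out + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace example
HWY_AFTER_NAMESPACE();

On top of this, the warp kernels lean on 128-bit block helpers (InsertBlock, ExtractBlock, BroadcastBlock, Dup128VecFromValues) so that a single wide vector can carry several rows' worth of 128-bit lanes at once.
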
diff --git a/av1/common/warp_plane_hwy.h b/av1/common/warp_plane_hwy.h
new file mode 100644
index 0000000..4d11be6
--- /dev/null
+++ b/av1/common/warp_plane_hwy.h
@@ -0,0 +1,1408 @@
+/*
+ * Copyright (c) 2025, Alliance for Open Media. All rights reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#ifndef AV1_COMMON_WARP_PLANE_HWY_H_
+#define AV1_COMMON_WARP_PLANE_HWY_H_
+
+#include "av1/common/warped_motion.h"
+#include "config/av1_rtcd.h"
+#include "third_party/highway/hwy/highway.h"
+
+HWY_BEFORE_NAMESPACE();
+
+namespace {
+namespace HWY_NAMESPACE {
+
+namespace hn = hwy::HWY_NAMESPACE;
+
+constexpr hn::ScalableTag<uint8_t> uint8_tag;
+constexpr hn::ScalableTag<uint16_t> uint16_tag;
+
+constexpr hn::ScalableTag<int8_t> int8_tag;
+constexpr hn::ScalableTag<int16_t> int16_tag;
+constexpr hn::ScalableTag<int32_t> int32_tag;
+constexpr hn::ScalableTag<int64_t> int64_tag;
+
+constexpr hn::CappedTag<uint8_t, 32> uint8x32_tag;
+constexpr hn::CappedTag<int16_t, 16> int16x16_tag;
+
+constexpr hn::FixedTag<uint8_t, 4> uint8x4_tag;
+constexpr hn::FixedTag<uint8_t, 8> uint8x8_tag;
+constexpr hn::FixedTag<uint8_t, 16> uint8x16_tag;
+constexpr hn::FixedTag<uint16_t, 4> uint16x4_tag;
+constexpr hn::FixedTag<uint16_t, 8> uint16x8_tag;
+
+constexpr hn::FixedTag<int8_t, 8> int8x8_tag;
+constexpr hn::FixedTag<int8_t, 16> int8x16_tag;
+constexpr hn::FixedTag<int16_t, 8> int16x8_tag;
+constexpr hn::FixedTag<int32_t, 4> int32x4_tag;
+constexpr hn::FixedTag<int64_t, 2> int64x2_tag;
+
+using IVec8 = hn::Vec<decltype(int8_tag)>;
+using IVec16 = hn::Vec<decltype(int16_tag)>;
+using IVec32 = hn::Vec<decltype(int32_tag)>;
+using IVec8x16 = hn::Vec<decltype(int8x16_tag)>;
+
+template <typename D>
+HWY_ATTR inline void FilterPixelsHorizontal(D tag, const hn::VFromD<D> src,
+                                            int16_t *HWY_RESTRICT horz_out,
+                                            int8_t *HWY_RESTRICT coeff,
+                                            const IVec16 round_const,
+                                            const int shift, int row) {
+  constexpr hn::Repartition<int8_t, D> coeff_tag;
+  constexpr hn::Repartition<int16_t, D> result_tag;
+  constexpr hn::Repartition<uint16_t, D> unsigned_result_tag;
+  // N.B. coeffs are stored to support the maximum vector width, which may not
+  // be the vector width being filtered on now.
+  const auto coeff0 = hn::Load(coeff_tag, coeff + hn::MaxLanes(int8_tag) * 0);
+  const auto coeff1 = hn::Load(coeff_tag, coeff + hn::MaxLanes(int8_tag) * 1);
+  const auto coeff2 = hn::Load(coeff_tag, coeff + hn::MaxLanes(int8_tag) * 2);
+  const auto coeff3 = hn::Load(coeff_tag, coeff + hn::MaxLanes(int8_tag) * 3);
+
+  const auto shuffle0 = hn::Dup128VecFromValues(
+      uint8_tag, 0, 2, 2, 4, 4, 6, 6, 8, 1, 3, 3, 5, 5, 7, 7, 9  //
+  );
+  const auto shuffle1 = hn::Dup128VecFromValues(
+      uint8_tag, 4, 6, 6, 8, 8, 10, 10, 12, 5, 7, 7, 9, 9, 11, 11, 13  //
+  );
+  const auto shuffle2 = hn::Dup128VecFromValues(
+      uint8_tag, 1, 3, 3, 5, 5, 7, 7, 9, 2, 4, 4, 6, 6, 8, 8, 10  //
+  );
+  const auto shuffle3 = hn::Dup128VecFromValues(
+      uint8_tag, 5, 7, 7, 9, 9, 11, 11, 13, 6, 8, 8, 10, 10, 12, 12, 14  //
+  );
+
+  const auto src_0 =
+      hn::TableLookupBytes(src, hn::ResizeBitCast(tag, shuffle0));
+  const auto src_1 =
+      hn::TableLookupBytes(src, hn::ResizeBitCast(tag, shuffle1));
+  const auto src_2 =
+      hn::TableLookupBytes(src, hn::ResizeBitCast(tag, shuffle2));
+  const auto src_3 =
+      hn::TableLookupBytes(src, hn::ResizeBitCast(tag, shuffle3));
+
+  const auto res_02 = hn::SatWidenMulPairwiseAdd(result_tag, src_0, coeff0);
+  const auto res_46 = hn::SatWidenMulPairwiseAdd(result_tag, src_1, coeff1);
+  const auto res_13 = hn::SatWidenMulPairwiseAdd(result_tag, src_2, coeff2);
+  const auto res_57 = hn::SatWidenMulPairwiseAdd(result_tag, src_3, coeff3);
+
+  const auto res_even = hn::Add(res_02, res_46);
+  const auto res_odd = hn::Add(res_13, res_57);
+
+  const auto res = hn::Add(hn::Add(res_even, res_odd),
+                           hn::ResizeBitCast(result_tag, round_const));
+
+  hn::Store(hn::BitCast(result_tag,
+                        hn::ShiftRightSame(
+                            hn::BitCast(unsigned_result_tag, res), shift)),
+            result_tag, horz_out + row * hn::MaxLanes(int16x8_tag));
+}
+
+HWY_ATTR HWY_INLINE IVec8x16 LoadAV1Filter8Bit(unsigned int offset) {
+  return hn::LoadN(int8x16_tag, av1_filter_8bit[offset >> WARPEDDIFF_PREC_BITS],
+                   8);
+}
+
+HWY_ATTR HWY_INLINE IVec8 LoadAV1Filter8BitLower(unsigned int offset) {
+  return hn::LoadN(int8_tag, av1_filter_8bit[offset >> WARPEDDIFF_PREC_BITS],
+                   8);
+}
+
+template <int Block>
+HWY_ATTR HWY_INLINE IVec8 LoadAV1Filter8BitUpper(unsigned int offset,
+                                                 IVec8 src) {
+  return hn::InsertBlock<Block>(
+      src, hn::LoadN(int8x16_tag,
+                     av1_filter_8bit[offset >> WARPEDDIFF_PREC_BITS], 8));
+}
+
+HWY_ATTR inline void PrepareHorizontalFilterCoefficients(
+    int alpha, int beta, int sx, int8_t *HWY_RESTRICT coeff) {
+  auto tmp_0 = LoadAV1Filter8BitLower(sx + 0 * alpha);
+  auto tmp_1 = LoadAV1Filter8BitLower(sx + 1 * alpha);
+  auto tmp_2 = LoadAV1Filter8BitLower(sx + 2 * alpha);
+  auto tmp_3 = LoadAV1Filter8BitLower(sx + 3 * alpha);
+  auto tmp_4 = LoadAV1Filter8BitLower(sx + 4 * alpha);
+  auto tmp_5 = LoadAV1Filter8BitLower(sx + 5 * alpha);
+  auto tmp_6 = LoadAV1Filter8BitLower(sx + 6 * alpha);
+  auto tmp_7 = LoadAV1Filter8BitLower(sx + 7 * alpha);
+
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    tmp_0 = LoadAV1Filter8BitUpper<1>(sx + beta + 0 * alpha, tmp_0);
+    tmp_1 = LoadAV1Filter8BitUpper<1>(sx + beta + 1 * alpha, tmp_1);
+    tmp_2 = LoadAV1Filter8BitUpper<1>(sx + beta + 2 * alpha, tmp_2);
+    tmp_3 = LoadAV1Filter8BitUpper<1>(sx + beta + 3 * alpha, tmp_3);
+    tmp_4 = LoadAV1Filter8BitUpper<1>(sx + beta + 4 * alpha, tmp_4);
+    tmp_5 = LoadAV1Filter8BitUpper<1>(sx + beta + 5 * alpha, tmp_5);
+    tmp_6 = LoadAV1Filter8BitUpper<1>(sx + beta + 6 * alpha, tmp_6);
+    tmp_7 = LoadAV1Filter8BitUpper<1>(sx + beta + 7 * alpha, tmp_7);
+  }
+
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    tmp_0 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 0 * alpha, tmp_0);
+    tmp_1 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 1 * alpha, tmp_1);
+    tmp_2 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 2 * alpha, tmp_2);
+    tmp_3 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 3 * alpha, tmp_3);
+    tmp_4 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 4 * alpha, tmp_4);
+    tmp_5 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 5 * alpha, tmp_5);
+    tmp_6 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 6 * alpha, tmp_6);
+    tmp_7 = LoadAV1Filter8BitUpper<2>(sx + beta * 2 + 7 * alpha, tmp_7);
+
+    tmp_0 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 0 * alpha, tmp_0);
+    tmp_1 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 1 * alpha, tmp_1);
+    tmp_2 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 2 * alpha, tmp_2);
+    tmp_3 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 3 * alpha, tmp_3);
+    tmp_4 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 4 * alpha, tmp_4);
+    tmp_5 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 5 * alpha, tmp_5);
+    tmp_6 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 6 * alpha, tmp_6);
+    tmp_7 = LoadAV1Filter8BitUpper<3>(sx + beta * 3 + 7 * alpha, tmp_7);
+  }
+
+  const auto tmp_0_16 = hn::BitCast(int16_tag, tmp_0);
+  const auto tmp_1_16 = hn::BitCast(int16_tag, tmp_1);
+  const auto tmp_2_16 = hn::BitCast(int16_tag, tmp_2);
+  const auto tmp_3_16 = hn::BitCast(int16_tag, tmp_3);
+  const auto tmp_4_16 = hn::BitCast(int16_tag, tmp_4);
+  const auto tmp_5_16 = hn::BitCast(int16_tag, tmp_5);
+  const auto tmp_6_16 = hn::BitCast(int16_tag, tmp_6);
+  const auto tmp_7_16 = hn::BitCast(int16_tag, tmp_7);
+
+  const auto tmp_12 = hn::ZipLower(int32_tag, tmp_0_16, tmp_2_16);
+  const auto tmp_13 = hn::ZipLower(int32_tag, tmp_1_16, tmp_3_16);
+  const auto tmp_14 = hn::ZipLower(int32_tag, tmp_4_16, tmp_6_16);
+  const auto tmp_15 = hn::ZipLower(int32_tag, tmp_5_16, tmp_7_16);
+
+  const auto res_0 = hn::ZipLower(int64_tag, tmp_12, tmp_14);
+  const auto res_1 = hn::ZipUpper(int64_tag, tmp_12, tmp_14);
+  const auto res_2 = hn::ZipLower(int64_tag, tmp_13, tmp_15);
+  const auto res_3 = hn::ZipUpper(int64_tag, tmp_13, tmp_15);
+
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveLower(int64_tag, res_0, res_2)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 0);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveUpper(int64_tag, res_0, res_2)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 1);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveLower(int64_tag, res_1, res_3)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 2);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveUpper(int64_tag, res_1, res_3)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 3);
+}
+
+HWY_ATTR inline void PrepareHorizontalFilterCoefficientsBeta0(
+    int alpha, int beta, int sx, int8_t *HWY_RESTRICT coeff) {
+  (void)beta;
+  const auto tmp_0 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 0 * alpha));
+  const auto tmp_1 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 1 * alpha));
+  const auto tmp_2 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 2 * alpha));
+  const auto tmp_3 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 3 * alpha));
+  const auto tmp_4 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 4 * alpha));
+  const auto tmp_5 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 5 * alpha));
+  const auto tmp_6 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 6 * alpha));
+  const auto tmp_7 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 7 * alpha));
+
+  const auto tmp_02 = hn::ZipLower(int32x4_tag, tmp_0, tmp_2);
+  const auto tmp_13 = hn::ZipLower(int32x4_tag, tmp_1, tmp_3);
+  const auto tmp_46 = hn::ZipLower(int32x4_tag, tmp_4, tmp_6);
+  const auto tmp_57 = hn::ZipLower(int32x4_tag, tmp_5, tmp_7);
+
+  const auto broadcast_12 =
+      hn::BroadcastBlock<0>(hn::ResizeBitCast(int32_tag, tmp_02));
+  const auto broadcast_13 =
+      hn::BroadcastBlock<0>(hn::ResizeBitCast(int32_tag, tmp_13));
+  const auto broadcast_14 =
+      hn::BroadcastBlock<0>(hn::ResizeBitCast(int32_tag, tmp_46));
+  const auto broadcast_15 =
+      hn::BroadcastBlock<0>(hn::ResizeBitCast(int32_tag, tmp_57));
+
+  const auto res_0 = hn::ZipLower(int64_tag, broadcast_12, broadcast_14);
+  const auto res_1 = hn::ZipUpper(int64_tag, broadcast_12, broadcast_14);
+  const auto res_2 = hn::ZipLower(int64_tag, broadcast_13, broadcast_15);
+  const auto res_3 = hn::ZipUpper(int64_tag, broadcast_13, broadcast_15);
+
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveLower(int64_tag, res_0, res_2)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 0);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveUpper(int64_tag, res_0, res_2)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 1);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveLower(int64_tag, res_1, res_3)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 2);
+  hn::Store(hn::BitCast(int8_tag, hn::InterleaveUpper(int64_tag, res_1, res_3)),
+            int8_tag, coeff + hn::MaxLanes(int8_tag) * 3);
+}
+
+HWY_ATTR inline void PrepareHorizontalFilterCoefficientsAlpha0(
+    int alpha, int beta, int sx, int8_t *HWY_RESTRICT coeff) {
+  (void)alpha;
+  auto tmp_0 = LoadAV1Filter8BitLower(sx);
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    tmp_0 = LoadAV1Filter8BitUpper<1>(sx + beta, tmp_0);
+  }
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    tmp_0 = LoadAV1Filter8BitUpper<2>(sx + beta * 2, tmp_0);
+    tmp_0 = LoadAV1Filter8BitUpper<3>(sx + beta * 3, tmp_0);
+  }
+  const auto res_0 = hn::BitCast(int16_tag, tmp_0);
+
+  hn::Store(hn::BitCast(int8_tag, hn::Broadcast<0>(res_0)), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 0);
+  hn::Store(hn::BitCast(int8_tag, hn::Broadcast<1>(res_0)), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 1);
+  hn::Store(hn::BitCast(int8_tag, hn::Broadcast<2>(res_0)), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 2);
+  hn::Store(hn::BitCast(int8_tag, hn::Broadcast<3>(res_0)), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 3);
+}
+
+template <typename D>
+HWY_ATTR inline void HorizontalFilter(D tag, const hn::VFromD<D> src,
+                                      int16_t *HWY_RESTRICT horz_out, int sx,
+                                      int alpha, int beta, int row,
+                                      const IVec16 round_const,
+                                      const int reduce_bits_horiz) {
+  HWY_ALIGN int8_t coeff[4 * hn::MaxLanes(int8_tag)];
+  PrepareHorizontalFilterCoefficients(alpha, beta, sx, coeff);
+  FilterPixelsHorizontal(tag, src, horz_out, coeff, round_const,
+                         reduce_bits_horiz, row);
+}
+
+HWY_ATTR inline void PrepareLastHorizontalFilterCoefficients(
+    int alpha, int beta, int sx, int8_t *HWY_RESTRICT coeff) {
+  (void)beta;
+  const auto tmp_0 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 0 * alpha));
+  const auto tmp_1 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 1 * alpha));
+  const auto tmp_2 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 2 * alpha));
+  const auto tmp_3 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 3 * alpha));
+  const auto tmp_4 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 4 * alpha));
+  const auto tmp_5 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 5 * alpha));
+  const auto tmp_6 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 6 * alpha));
+  const auto tmp_7 =
+      hn::BitCast(int16x8_tag, LoadAV1Filter8Bit(sx + 7 * alpha));
+
+  const auto tmp_8 = hn::ZipLower(int32x4_tag, tmp_0, tmp_2);
+  const auto tmp_9 = hn::ZipLower(int32x4_tag, tmp_1, tmp_3);
+  const auto tmp_10 = hn::ZipLower(int32x4_tag, tmp_4, tmp_6);
+  const auto tmp_11 = hn::ZipLower(int32x4_tag, tmp_5, tmp_7);
+
+  const auto tmp_12 = hn::ZipLower(int64x2_tag, tmp_8, tmp_10);
+  const auto tmp_13 = hn::ZipUpper(int64x2_tag, tmp_8, tmp_10);
+  const auto tmp_14 = hn::ZipLower(int64x2_tag, tmp_9, tmp_11);
+  const auto tmp_15 = hn::ZipUpper(int64x2_tag, tmp_9, tmp_11);
+
+  const auto tmp_16 = hn::InterleaveLower(int64x2_tag, tmp_12, tmp_14);
+  const auto tmp_17 = hn::InterleaveUpper(int64x2_tag, tmp_12, tmp_14);
+  const auto tmp_18 = hn::InterleaveLower(int64x2_tag, tmp_13, tmp_15);
+  const auto tmp_19 = hn::InterleaveUpper(int64x2_tag, tmp_13, tmp_15);
+
+  const auto tmp_20 = hn::ResizeBitCast(int8_tag, tmp_16);
+  const auto tmp_21 = hn::ResizeBitCast(int8_tag, tmp_17);
+  const auto tmp_22 = hn::ResizeBitCast(int8_tag, tmp_18);
+  const auto tmp_23 = hn::ResizeBitCast(int8_tag, tmp_19);
+
+  hn::Store(hn::BroadcastBlock<0>(tmp_20), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 0);
+  hn::Store(hn::BroadcastBlock<0>(tmp_21), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 1);
+  hn::Store(hn::BroadcastBlock<0>(tmp_22), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 2);
+  hn::Store(hn::BroadcastBlock<0>(tmp_23), int8_tag,
+            coeff + hn::MaxLanes(int8_tag) * 3);
+}
+
+template <typename D>
+HWY_ATTR HWY_INLINE hn::VFromD<D> LoadRowsClamped(
+    D tag, const uint8_t *HWY_RESTRICT ref, const int stride, const int iy,
+    const int height) {
+  constexpr hn::BlockDFromD<D> block_tag;
+  const int iy0 = clamp(iy + 0, 0, height - 1);
+  auto src = hn::ResizeBitCast(tag, hn::LoadU(block_tag, ref + iy0 * stride));
+  if constexpr (tag.MaxBlocks() >= 2) {
+    const int iy1 = clamp(iy + 1, 0, height - 1);
+    const auto src_1 = hn::LoadU(block_tag, ref + iy1 * stride);
+    src = hn::InsertBlock<1>(src, src_1);
+  }
+  if constexpr (tag.MaxBlocks() >= 3) {
+    const int iy2 = clamp(iy + 2, 0, height - 1);
+    const auto src_2 = hn::LoadU(block_tag, ref + iy2 * stride);
+    const int iy3 = clamp(iy + 3, 0, height - 1);
+    const auto src_3 = hn::LoadU(block_tag, ref + iy3 * stride);
+    src = hn::InsertBlock<2>(src, src_2);
+    src = hn::InsertBlock<3>(src, src_3);
+  }
+  return src;
+}
+
+template <void (*PrepareCoeffs)(int alpha, int beta, int sx,
+                                int8_t *HWY_RESTRICT coeffs),
+          typename D>
+HWY_ATTR int WarpHorizontalFilterLoop(
+    D tag, const uint8_t *HWY_RESTRICT ref, int16_t *HWY_RESTRICT horz_out,
+    int stride, int32_t ix4, int32_t iy4, int32_t sx4, int alpha, int beta,
+    int p_height, int height, int i, const IVec16 round_const,
+    const int reduce_bits_horiz, int k, int8_t *HWY_RESTRICT coeff) {
+  constexpr int kNumRows = tag.MaxBlocks();
+  for (; k < AOMMIN(8, p_height - i) - kNumRows; k += kNumRows) {
+    const auto src =
+        LoadRowsClamped(tag, ref + ix4 - 7, stride, iy4 + k, height);
+    if constexpr (PrepareCoeffs != nullptr) {
+      int sx = sx4 + beta * (k + 4);
+      PrepareCoeffs(alpha, beta, sx, coeff);
+    }
+    FilterPixelsHorizontal(tag, src, horz_out, coeff, round_const,
+                           reduce_bits_horiz, k + 7);
+  }
+  return k;
+}
+
+template <
+    bool InnerCoeffUpdate,
+    void (*PrepareCoeffs)(int alpha, int beta, int sx,
+                          int8_t *HWY_RESTRICT coeffs),
+    void (*LastPrepareCoeffs)(int alpha, int beta, int sx,
+                              int8_t *HWY_RESTRICT coeffs) = PrepareCoeffs>
+HWY_ATTR inline void WarpHorizontalFilterTemplate(
+    const uint8_t *HWY_RESTRICT ref, int16_t *HWY_RESTRICT horz_out, int stride,
+    int32_t ix4, int32_t iy4, int32_t sx4, int alpha, int beta, int p_height,
+    int height, int i, const IVec16 round_const, const int reduce_bits_horiz) {
+  int k = -7, iy;
+  HWY_ALIGN int8_t coeff[4 * hn::MaxLanes(int8_tag)];
+  if constexpr (!InnerCoeffUpdate) {
+    PrepareCoeffs(alpha, beta, sx4, coeff);
+  }
+  if constexpr (uint8_tag.MaxBlocks() >= 3) {
+    k = WarpHorizontalFilterLoop<(InnerCoeffUpdate ? PrepareCoeffs : nullptr)>(
+        uint8_tag, ref, horz_out, stride, ix4, iy4, sx4, alpha, beta, p_height,
+        height, i, round_const, reduce_bits_horiz, k, coeff);
+  }
+  if constexpr (uint8_tag.MaxBlocks() >= 2) {
+    k = WarpHorizontalFilterLoop<(InnerCoeffUpdate ? PrepareCoeffs : nullptr)>(
+        uint8x32_tag, ref, horz_out, stride, ix4, iy4, sx4, alpha, beta,
+        p_height, height, i, round_const, reduce_bits_horiz, k, coeff);
+  }
+  if constexpr (uint8_tag.MaxBlocks() == 1) {
+    k = WarpHorizontalFilterLoop<(InnerCoeffUpdate ? LastPrepareCoeffs
+                                                   : nullptr)>(
+        uint8x16_tag, ref, horz_out, stride, ix4, iy4, sx4, alpha, beta,
+        p_height, height, i, round_const, reduce_bits_horiz, k, coeff);
+  }
+  iy = iy4 + k;
+  iy = clamp(iy, 0, height - 1);
+  const auto src = hn::LoadU(uint8x16_tag, ref + iy * stride + ix4 - 7);
+  if constexpr (InnerCoeffUpdate) {
+    int sx = sx4 + beta * (k + 4);
+    LastPrepareCoeffs(alpha, beta, sx, coeff);
+  }
+  FilterPixelsHorizontal(uint8x16_tag, src, horz_out, coeff, round_const,
+                         reduce_bits_horiz, k + 7);
+}
+
+HWY_ATTR inline void UnpackWeightsAndSetRoundConst(
+    ConvolveParams *HWY_RESTRICT conv_params, const int round_bits,
+    const int offset_bits, IVec16 &HWY_RESTRICT res_sub_const,
+    IVec16 &HWY_RESTRICT round_bits_const, IVec16 &HWY_RESTRICT wt) {
+  res_sub_const =
+      hn::Set(int16_tag, -(1 << (offset_bits - conv_params->round_1)) -
+                             (1 << (offset_bits - conv_params->round_1 - 1)));
+  round_bits_const = hn::Set(int16_tag, ((1 << round_bits) >> 1));
+
+  const auto w0 = static_cast<int16_t>(conv_params->fwd_offset);
+  const auto w1 = static_cast<int16_t>(conv_params->bck_offset);
+  const auto wt0 = hn::Set(int16_tag, w0);
+  const auto wt1 = hn::Set(int16_tag, w1);
+  wt = hn::InterleaveLower(wt0, wt1);
+}
+
+HWY_ATTR HWY_INLINE IVec16 LoadAV1WarpedFilter(size_t offset) {
+  return hn::LoadN(int16_tag, av1_warped_filter[offset >> WARPEDDIFF_PREC_BITS],
+                   8);
+}
+
+HWY_ATTR HWY_INLINE IVec16 LoadAV1WarpedFilterLower(size_t offset) {
+  return hn::ResizeBitCast(
+      int16_tag,
+      hn::Load(int16x8_tag, av1_warped_filter[offset >> WARPEDDIFF_PREC_BITS]));
+}
+
+template <int Block>
+HWY_ATTR HWY_INLINE IVec16 LoadAV1WarpedFilterUpper(size_t offset, IVec16 src) {
+  return hn::InsertBlock<Block>(
+      src,
+      hn::Load(int16x8_tag, av1_warped_filter[offset >> WARPEDDIFF_PREC_BITS]));
+}
+
+HWY_ATTR inline void PrepareVerticalFilterCoeffs(int gamma, int delta, int sy,
+                                                 int16_t *HWY_RESTRICT coeffs) {
+  auto filt_00 = LoadAV1WarpedFilterLower(sy + 0 * gamma);
+  auto filt_01 = LoadAV1WarpedFilterLower(sy + 2 * gamma);
+  auto filt_02 = LoadAV1WarpedFilterLower(sy + 4 * gamma);
+  auto filt_03 = LoadAV1WarpedFilterLower(sy + 6 * gamma);
+
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    filt_00 = LoadAV1WarpedFilterUpper<1>(sy + delta + 0 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<1>(sy + delta + 2 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<1>(sy + delta + 4 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<1>(sy + delta + 6 * gamma, filt_03);
+  }
+
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    filt_00 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 0 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 2 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 4 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 6 * gamma, filt_03);
+
+    filt_00 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 0 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 2 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 4 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 6 * gamma, filt_03);
+  }
+
+  auto filt_0 = hn::BitCast(int32_tag, filt_00);
+  auto filt_1 = hn::BitCast(int32_tag, filt_01);
+  auto filt_2 = hn::BitCast(int32_tag, filt_02);
+  auto filt_3 = hn::BitCast(int32_tag, filt_03);
+
+  auto res_0 = hn::ZipLower(int64_tag, filt_0, filt_1);
+  auto res_1 = hn::ZipLower(int64_tag, filt_2, filt_3);
+  auto res_2 = hn::ZipUpper(int64_tag, filt_0, filt_1);
+  auto res_3 = hn::ZipUpper(int64_tag, filt_2, filt_3);
+
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 0 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 1 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 2 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 3 * hn::MaxLanes(int16_tag));
+
+  filt_00 = LoadAV1WarpedFilterLower(sy + 1 * gamma);
+  filt_01 = LoadAV1WarpedFilterLower(sy + 3 * gamma);
+  filt_02 = LoadAV1WarpedFilterLower(sy + 5 * gamma);
+  filt_03 = LoadAV1WarpedFilterLower(sy + 7 * gamma);
+
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    filt_00 = LoadAV1WarpedFilterUpper<1>(sy + delta + 1 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<1>(sy + delta + 3 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<1>(sy + delta + 5 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<1>(sy + delta + 7 * gamma, filt_03);
+  }
+
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    filt_00 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 1 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 3 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 5 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta + 7 * gamma, filt_03);
+
+    filt_00 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 1 * gamma, filt_00);
+    filt_01 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 3 * gamma, filt_01);
+    filt_02 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 5 * gamma, filt_02);
+    filt_03 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta + 7 * gamma, filt_03);
+  }
+
+  filt_0 = hn::BitCast(int32_tag, filt_00);
+  filt_1 = hn::BitCast(int32_tag, filt_01);
+  filt_2 = hn::BitCast(int32_tag, filt_02);
+  filt_3 = hn::BitCast(int32_tag, filt_03);
+
+  res_0 = hn::ZipLower(int64_tag, filt_0, filt_1);
+  res_1 = hn::ZipLower(int64_tag, filt_2, filt_3);
+  res_2 = hn::ZipUpper(int64_tag, filt_0, filt_1);
+  res_3 = hn::ZipUpper(int64_tag, filt_2, filt_3);
+
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 4 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 5 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 6 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 7 * hn::MaxLanes(int16_tag));
+}
+
+HWY_ATTR inline void PrepareVerticalFilterCoeffsDelta0(
+    int gamma, int delta, int sy, int16_t *HWY_RESTRICT coeffs) {
+  (void)delta;
+  auto filt_00 = LoadAV1WarpedFilter(sy + 0 * gamma);
+  auto filt_01 = LoadAV1WarpedFilter(sy + 2 * gamma);
+  auto filt_02 = LoadAV1WarpedFilter(sy + 4 * gamma);
+  auto filt_03 = LoadAV1WarpedFilter(sy + 6 * gamma);
+
+  auto filt_10 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_00));
+  auto filt_11 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_01));
+  auto filt_12 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_02));
+  auto filt_13 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_03));
+
+  auto res_0 = hn::ZipLower(int64_tag, filt_10, filt_11);
+  auto res_1 = hn::ZipLower(int64_tag, filt_12, filt_13);
+  auto res_2 = hn::ZipUpper(int64_tag, filt_10, filt_11);
+  auto res_3 = hn::ZipUpper(int64_tag, filt_12, filt_13);
+
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 0 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 1 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 2 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 3 * hn::MaxLanes(int16_tag));
+
+  filt_00 = LoadAV1WarpedFilter(sy + 1 * gamma);
+  filt_01 = LoadAV1WarpedFilter(sy + 3 * gamma);
+  filt_02 = LoadAV1WarpedFilter(sy + 5 * gamma);
+  filt_03 = LoadAV1WarpedFilter(sy + 7 * gamma);
+
+  filt_10 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_00));
+  filt_11 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_01));
+  filt_12 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_02));
+  filt_13 = hn::BitCast(int32_tag, hn::BroadcastBlock<0>(filt_03));
+
+  res_0 = hn::ZipLower(int64_tag, filt_10, filt_11);
+  res_1 = hn::ZipLower(int64_tag, filt_12, filt_13);
+  res_2 = hn::ZipUpper(int64_tag, filt_10, filt_11);
+  res_3 = hn::ZipUpper(int64_tag, filt_12, filt_13);
+
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 4 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_0, res_1)),
+      int16_tag, coeffs + 5 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveLower(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 6 * hn::MaxLanes(int16_tag));
+  hn::Store(
+      hn::BitCast(int16_tag, hn::InterleaveUpper(int64_tag, res_2, res_3)),
+      int16_tag, coeffs + 7 * hn::MaxLanes(int16_tag));
+}
+
+HWY_ATTR inline void PrepareVerticalFilterCoeffsGamma0(
+    int gamma, int delta, int sy, int16_t *HWY_RESTRICT coeffs) {
+  (void)gamma;
+  auto filt_0 = LoadAV1WarpedFilterLower(sy);
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    filt_0 = LoadAV1WarpedFilterUpper<1>(sy + delta, filt_0);
+  }
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    filt_0 = LoadAV1WarpedFilterUpper<2>(sy + 2 * delta, filt_0);
+    filt_0 = LoadAV1WarpedFilterUpper<3>(sy + 3 * delta, filt_0);
+  }
+  auto res_0 = hn::BitCast(int32_tag, filt_0);
+
+  auto broadcast_0 = hn::BitCast(int16_tag, hn::Broadcast<0>(res_0));
+  auto broadcast_1 = hn::BitCast(int16_tag, hn::Broadcast<1>(res_0));
+  auto broadcast_2 = hn::BitCast(int16_tag, hn::Broadcast<2>(res_0));
+  auto broadcast_3 = hn::BitCast(int16_tag, hn::Broadcast<3>(res_0));
+
+  hn::Store(broadcast_0, int16_tag, coeffs + 0 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_1, int16_tag, coeffs + 1 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_2, int16_tag, coeffs + 2 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_3, int16_tag, coeffs + 3 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_0, int16_tag, coeffs + 4 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_1, int16_tag, coeffs + 5 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_2, int16_tag, coeffs + 6 * hn::MaxLanes(int16_tag));
+  hn::Store(broadcast_3, int16_tag, coeffs + 7 * hn::MaxLanes(int16_tag));
+}
+
+HWY_ATTR inline void FilterPixelsVertical(
+    int16_t *HWY_RESTRICT horz_out, int16_t *HWY_RESTRICT src_lo,
+    int16_t *HWY_RESTRICT src_hi, int16_t *HWY_RESTRICT coeffs,
+    IVec32 &HWY_RESTRICT res_lo, IVec32 &HWY_RESTRICT res_hi, int row) {
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    const auto horz_out_4 =
+        hn::Load(int16_tag, horz_out + (row + 4) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_5 =
+        hn::LoadU(int16_tag, horz_out + (row + 5) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_6 =
+        hn::LoadU(int16_tag, horz_out + (row + 6) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_7 =
+        hn::LoadU(int16_tag, horz_out + (row + 7) * hn::MaxLanes(int16x8_tag));
+    const auto src_lo_2 =
+        hn::InterleaveLower(int16_tag, horz_out_4, horz_out_5);
+    const auto src_hi_2 =
+        hn::InterleaveUpper(int16_tag, horz_out_4, horz_out_5);
+    const auto src_lo_3 =
+        hn::InterleaveLower(int16_tag, horz_out_6, horz_out_7);
+    const auto src_hi_3 =
+        hn::InterleaveUpper(int16_tag, horz_out_6, horz_out_7);
+    hn::Store(src_lo_2, int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag));
+    hn::Store(src_hi_2, int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag));
+    hn::Store(src_lo_3, int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag));
+    hn::Store(src_hi_3, int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag));
+  } else if constexpr (int16_tag.MaxBlocks() == 2) {
+    const auto horz_out_6 =
+        hn::Load(int16_tag, horz_out + (row + 6) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_8 =
+        hn::Load(int16_tag, horz_out + (row + 8) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_7 =
+        hn::ConcatLowerUpper(int16_tag, horz_out_8, horz_out_6);
+    const auto src_lo_3 =
+        hn::InterleaveLower(int16_tag, horz_out_6, horz_out_7);
+    const auto src_hi_3 =
+        hn::InterleaveUpper(int16_tag, horz_out_6, horz_out_7);
+    hn::Store(src_lo_3, int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag));
+    hn::Store(src_hi_3, int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag));
+  } else if constexpr (int16_tag.MaxBlocks() == 1) {
+    const auto horz_out_6 =
+        hn::Load(int16x8_tag, horz_out + (row + 6) * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_7 =
+        hn::Load(int16x8_tag, horz_out + (row + 7) * hn::MaxLanes(int16x8_tag));
+    const auto src_lo_3 =
+        hn::InterleaveLower(int16x8_tag, horz_out_6, horz_out_7);
+    const auto src_hi_3 =
+        hn::InterleaveUpper(int16x8_tag, horz_out_6, horz_out_7);
+    hn::Store(src_lo_3, int16x8_tag, src_lo + 3 * hn::MaxLanes(int16x8_tag));
+    hn::Store(src_hi_3, int16x8_tag, src_hi + 3 * hn::MaxLanes(int16x8_tag));
+  }
+
+  const auto coeff_0 =
+      hn::Load(int16_tag, coeffs + 0 * hn::MaxLanes(int16_tag));
+  const auto coeff_1 =
+      hn::Load(int16_tag, coeffs + 1 * hn::MaxLanes(int16_tag));
+  const auto coeff_2 =
+      hn::Load(int16_tag, coeffs + 2 * hn::MaxLanes(int16_tag));
+  const auto coeff_3 =
+      hn::Load(int16_tag, coeffs + 3 * hn::MaxLanes(int16_tag));
+  const auto coeff_4 =
+      hn::Load(int16_tag, coeffs + 4 * hn::MaxLanes(int16_tag));
+  const auto coeff_5 =
+      hn::Load(int16_tag, coeffs + 5 * hn::MaxLanes(int16_tag));
+  const auto coeff_6 =
+      hn::Load(int16_tag, coeffs + 6 * hn::MaxLanes(int16_tag));
+  const auto coeff_7 =
+      hn::Load(int16_tag, coeffs + 7 * hn::MaxLanes(int16_tag));
+
+  const auto src_lo_0 =
+      hn::Load(int16_tag, src_lo + 0 * hn::MaxLanes(int16_tag));
+  const auto src_lo_1 =
+      hn::Load(int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag));
+  const auto src_lo_2 =
+      hn::Load(int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag));
+  const auto src_lo_3 =
+      hn::Load(int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag));
+  const auto src_hi_0 =
+      hn::Load(int16_tag, src_hi + 0 * hn::MaxLanes(int16_tag));
+  const auto src_hi_1 =
+      hn::Load(int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag));
+  const auto src_hi_2 =
+      hn::Load(int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag));
+  const auto src_hi_3 =
+      hn::Load(int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag));
+
+  auto even_sum0 = hn::Zero(int32_tag);
+  auto even_sum1 = hn::Zero(int32_tag);
+  even_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_lo_0, coeff_0,
+                                            even_sum0, even_sum1);
+  even_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_lo_1, coeff_1,
+                                            even_sum0, even_sum1);
+  even_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_lo_2, coeff_2,
+                                            even_sum0, even_sum1);
+  even_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_lo_3, coeff_3,
+                                            even_sum0, even_sum1);
+  auto res_even = hn::RearrangeToOddPlusEven(even_sum0, even_sum1);
+
+  auto odd_sum0 = hn::Zero(int32_tag);
+  auto odd_sum1 = hn::Zero(int32_tag);
+  odd_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_hi_0, coeff_4,
+                                           odd_sum0, odd_sum1);
+  odd_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_hi_1, coeff_5,
+                                           odd_sum0, odd_sum1);
+  odd_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_hi_2, coeff_6,
+                                           odd_sum0, odd_sum1);
+  odd_sum0 = hn::ReorderWidenMulAccumulate(int32_tag, src_hi_3, coeff_7,
+                                           odd_sum0, odd_sum1);
+  auto res_odd = hn::RearrangeToOddPlusEven(odd_sum0, odd_sum1);
+
+  // Rearrange pixels back into the order 0 ... 7
+  res_lo = hn::InterleaveLower(int32_tag, res_even, res_odd);
+  res_hi = hn::InterleaveUpper(int32_tag, res_even, res_odd);
+}
+
+template <typename DS, typename DR, typename A, typename B, typename C>
+HWY_ATTR HWY_INLINE void StoreRows(DS store_tag, DR row_tag, hn::VFromD<DR> vec,
+                                   A stride, B y, C x,
+                                   hn::TFromD<DS> *HWY_RESTRICT out) {
+  hn::TFromD<DS> *HWY_RESTRICT pointers[row_tag.MaxBlocks()];
+  for (int i = 0; i < static_cast<int>(row_tag.MaxBlocks()); ++i) {
+    pointers[i] = &out[(y + i) * stride + x];
+  }
+  hn::Store(hn::ResizeBitCast(store_tag, hn::ExtractBlock<0>(vec)), store_tag,
+            pointers[0]);
+  if constexpr (row_tag.MaxBlocks() >= 2) {
+    hn::Store(hn::ResizeBitCast(store_tag, hn::ExtractBlock<1>(vec)), store_tag,
+              pointers[1]);
+  }
+  if constexpr (row_tag.MaxBlocks() >= 3) {
+    hn::Store(hn::ResizeBitCast(store_tag, hn::ExtractBlock<2>(vec)), store_tag,
+              pointers[2]);
+    hn::Store(hn::ResizeBitCast(store_tag, hn::ExtractBlock<3>(vec)), store_tag,
+              pointers[3]);
+  }
+}
+
+HWY_ATTR HWY_INLINE void StoreVerticalFilterOutput(
+    IVec32 res_lo, IVec32 res_hi, const IVec32 res_add_const, const IVec16 wt,
+    const IVec16 res_sub_const, const IVec16 round_bits_const,
+    uint8_t *HWY_RESTRICT pred, ConvolveParams *HWY_RESTRICT conv_params, int i,
+    int j, int k, const int reduce_bits_vert, int p_stride, int p_width,
+    const int round_bits) {
+  constexpr int kNumRows = uint16_tag.MaxBlocks();
+  if (conv_params->is_compound) {
+    uint16_t *HWY_RESTRICT pointers[kNumRows];
+    for (int row = 0; row < kNumRows; ++row) {
+      pointers[row] =
+          &conv_params->dst[(i + k + row) * conv_params->dst_stride + j];
+    }
+
+    res_lo =
+        hn::ShiftRightSame(hn::Add(res_lo, res_add_const), reduce_bits_vert);
+
+    const auto temp_lo_16 = hn::ReorderDemote2To(uint16_tag, res_lo, res_lo);
+    if (conv_params->do_average) {
+      auto p_16 =
+          hn::ResizeBitCast(uint16_tag, hn::Load(uint16x4_tag, pointers[0]));
+      if constexpr (kNumRows >= 2) {
+        p_16 = hn::InsertBlock<1>(
+            p_16, hn::ResizeBitCast(uint16x8_tag,
+                                    hn::Load(uint16x4_tag, pointers[1])));
+      }
+      if constexpr (kNumRows >= 3) {
+        p_16 = hn::InsertBlock<2>(
+            p_16, hn::ResizeBitCast(uint16x8_tag,
+                                    hn::Load(uint16x4_tag, pointers[2])));
+        p_16 = hn::InsertBlock<3>(
+            p_16, hn::ResizeBitCast(uint16x8_tag,
+                                    hn::Load(uint16x4_tag, pointers[3])));
+      }
+      auto res_lo_16 = hn::Undefined(int16_tag);
+      if (conv_params->use_dist_wtd_comp_avg) {
+        const auto p_16_lo =
+            hn::BitCast(int16_tag, hn::InterleaveLower(p_16, temp_lo_16));
+        const auto wt_res_lo = hn::WidenMulPairwiseAdd(int32_tag, p_16_lo, wt);
+        const auto shifted_32 = hn::ShiftRight<DIST_PRECISION_BITS>(wt_res_lo);
+        res_lo_16 = hn::BitCast(
+            int16_tag,
+            hn::ReorderDemote2To(uint16_tag, shifted_32, shifted_32));
+      } else {
+        res_lo_16 = hn::ShiftRight<1>(
+            hn::BitCast(int16_tag, hn::Add(p_16, temp_lo_16)));
+      }
+      res_lo_16 = hn::Add(res_lo_16, res_sub_const);
+      res_lo_16 =
+          hn::ShiftRightSame(hn::Add(res_lo_16, round_bits_const), round_bits);
+      const auto res_8_lo =
+          hn::ReorderDemote2To(uint8_tag, res_lo_16, res_lo_16);
+      StoreRows(uint8x4_tag, uint8_tag, res_8_lo, p_stride, i + k, j, pred);
+    } else {
+      hn::Store(
+          hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<0>(temp_lo_16)),
+          uint16x4_tag, pointers[0]);
+      if constexpr (kNumRows >= 2) {
+        hn::Store(
+            hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<1>(temp_lo_16)),
+            uint16x4_tag, pointers[1]);
+      }
+      if constexpr (kNumRows >= 3) {
+        hn::Store(
+            hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<2>(temp_lo_16)),
+            uint16x4_tag, pointers[2]);
+        hn::Store(
+            hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<3>(temp_lo_16)),
+            uint16x4_tag, pointers[3]);
+      }
+    }
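+    // The upper four columns (j + 4 .. j + 7) follow the same path using
+    // res_hi.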
+    if (p_width > 4) {
+      uint16_t *HWY_RESTRICT pointers4[kNumRows];
+      for (int row = 0; row < kNumRows; ++row) {
+        pointers4[row] =
+            &conv_params->dst[(i + k + row) * conv_params->dst_stride + j + 4];
+      }
+      res_hi =
+          hn::ShiftRightSame(hn::Add(res_hi, res_add_const), reduce_bits_vert);
+      const auto temp_hi_16 = hn::ReorderDemote2To(uint16_tag, res_hi, res_hi);
+      if (conv_params->do_average) {
+        auto p4_16 =
+            hn::ResizeBitCast(uint16_tag, hn::Load(uint16x4_tag, pointers4[0]));
+        if constexpr (kNumRows >= 2) {
+          p4_16 = hn::InsertBlock<1>(
+              p4_16, hn::ResizeBitCast(uint16x8_tag,
+                                       hn::Load(uint16x4_tag, pointers4[1])));
+        }
+        if constexpr (kNumRows >= 3) {
+          p4_16 = hn::InsertBlock<2>(
+              p4_16, hn::ResizeBitCast(uint16x8_tag,
+                                       hn::Load(uint16x4_tag, pointers4[2])));
+          p4_16 = hn::InsertBlock<3>(
+              p4_16, hn::ResizeBitCast(uint16x8_tag,
+                                       hn::Load(uint16x4_tag, pointers4[3])));
+        }
+
+        auto res_hi_16 = hn::Undefined(int16_tag);
+        if (conv_params->use_dist_wtd_comp_avg) {
+          const auto p_16_hi =
+              hn::BitCast(int16_tag, hn::InterleaveLower(p4_16, temp_hi_16));
+          const auto wt_res_hi =
+              hn::WidenMulPairwiseAdd(int32_tag, p_16_hi, wt);
+          const auto shifted_32 =
+              hn::ShiftRight<DIST_PRECISION_BITS>(wt_res_hi);
+          res_hi_16 = hn::BitCast(
+              int16_tag,
+              hn::ReorderDemote2To(uint16_tag, shifted_32, shifted_32));
+        } else {
+          res_hi_16 = hn::ShiftRight<1>(
+              hn::BitCast(int16_tag, hn::Add(p4_16, temp_hi_16)));
+        }
+        res_hi_16 = hn::Add(res_hi_16, res_sub_const);
+        res_hi_16 = hn::ShiftRightSame(hn::Add(res_hi_16, round_bits_const),
+                                       round_bits);
+        const auto res_8_hi =
+            hn::ReorderDemote2To(uint8_tag, res_hi_16, res_hi_16);
+        StoreRows(uint8x4_tag, uint8_tag, res_8_hi, p_stride, i + k, j + 4,
+                  pred);
+      } else {
+        hn::Store(hn::ResizeBitCast(uint16x4_tag, temp_hi_16), uint16x4_tag,
+                  pointers4[0]);
+        if constexpr (kNumRows >= 2) {
+          hn::Store(
+              hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<1>(temp_hi_16)),
+              uint16x4_tag, pointers4[1]);
+        }
+        if constexpr (kNumRows >= 3) {
+          hn::Store(
+              hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<2>(temp_hi_16)),
+              uint16x4_tag, pointers4[2]);
+          hn::Store(
+              hn::ResizeBitCast(uint16x4_tag, hn::ExtractBlock<3>(temp_hi_16)),
+              uint16x4_tag, pointers4[3]);
+        }
+      }
+    }
+  } else {
+    const auto res_lo_round =
+        hn::ShiftRightSame(hn::Add(res_lo, res_add_const), reduce_bits_vert);
+    const auto res_hi_round =
+        hn::ShiftRightSame(hn::Add(res_hi, res_add_const), reduce_bits_vert);
+
+    const auto res_16bit =
+        hn::ReorderDemote2To(int16_tag, res_lo_round, res_hi_round);
+    const auto res_8bit = hn::ReorderDemote2To(uint8_tag, res_16bit, res_16bit);
+    // Store the rounded 8-bit result rows into 'pred'.
+    if (p_width == 4) {
+      StoreRows(uint8x4_tag, uint8_tag, res_8bit, p_stride, i + k, j, pred);
+    } else {
+      StoreRows(uint8x8_tag, uint8_tag, res_8bit, p_stride, i + k, j, pred);
+    }
+  }
+}
+
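+// Vertical pass for one 8-column strip. The horizontal filter output is
+// repacked into src_lo/src_hi with pixels of adjacent rows interleaved so that
+// ReorderWidenMulAccumulate can multiply row pairs by coefficient pairs. Each
+// loop iteration produces int16_tag.MaxBlocks() output rows and then advances
+// the buffered rows; InnerCoeffUpdate selects whether the coefficients are
+// recomputed for every group of rows.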
+template <bool InnerCoeffUpdate,
+          void (*PrepareCoeffs)(int gamma, int delta, int sy,
+                                int16_t *HWY_RESTRICT coeffs)>
+HWY_ATTR inline void WarpVerticalFilterTemplate(
+    uint8_t *HWY_RESTRICT pred, int16_t *HWY_RESTRICT horz_out,
+    ConvolveParams *HWY_RESTRICT conv_params, int16_t gamma, int16_t delta,
+    int p_height, int p_stride, int p_width, int i, int j, int sy4,
+    const int reduce_bits_vert, const IVec32 res_add_const,
+    const int round_bits, const IVec16 res_sub_const,
+    const IVec16 round_bits_const, const IVec16 wt) {
+  HWY_ALIGN int16_t src_lo[4 * hn::MaxLanes(int16_tag)];
+  HWY_ALIGN int16_t src_hi[4 * hn::MaxLanes(int16_tag)];
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    const auto horz_out_0 =
+        hn::Load(int16_tag, horz_out + 0 * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_1 =
+        hn::LoadU(int16_tag, horz_out + 1 * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_2 =
+        hn::LoadU(int16_tag, horz_out + 2 * hn::MaxLanes(int16x8_tag));
+    const auto horz_out_3 =
+        hn::LoadU(int16_tag, horz_out + 3 * hn::MaxLanes(int16x8_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_lo + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_hi + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_lo + 1 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_hi + 1 * hn::MaxLanes(int16_tag));
+  } else if constexpr (int16_tag.MaxBlocks() == 2) {
+    const auto horz_out_0 =
+        hn::Load(int16_tag, horz_out + 0 * hn::MaxLanes(int16_tag));
+    const auto horz_out_2 =
+        hn::Load(int16_tag, horz_out + 1 * hn::MaxLanes(int16_tag));
+    const auto horz_out_4 =
+        hn::Load(int16_tag, horz_out + 2 * hn::MaxLanes(int16_tag));
+    const auto horz_out_6 =
+        hn::Load(int16_tag, horz_out + 3 * hn::MaxLanes(int16_tag));
+    const auto horz_out_1 =
+        hn::ConcatLowerUpper(int16_tag, horz_out_2, horz_out_0);
+    const auto horz_out_3 =
+        hn::ConcatLowerUpper(int16_tag, horz_out_4, horz_out_2);
+    const auto horz_out_5 =
+        hn::ConcatLowerUpper(int16_tag, horz_out_6, horz_out_4);
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_lo + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_hi + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_lo + 1 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_hi + 1 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_4, horz_out_5), int16_tag,
+              src_lo + 2 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_4, horz_out_5), int16_tag,
+              src_hi + 2 * hn::MaxLanes(int16_tag));
+  } else {
+    const auto horz_out_0 =
+        hn::Load(int16_tag, horz_out + 0 * hn::MaxLanes(int16_tag));
+    const auto horz_out_1 =
+        hn::Load(int16_tag, horz_out + 1 * hn::MaxLanes(int16_tag));
+    const auto horz_out_2 =
+        hn::Load(int16_tag, horz_out + 2 * hn::MaxLanes(int16_tag));
+    const auto horz_out_3 =
+        hn::Load(int16_tag, horz_out + 3 * hn::MaxLanes(int16_tag));
+    const auto horz_out_4 =
+        hn::Load(int16_tag, horz_out + 4 * hn::MaxLanes(int16_tag));
+    const auto horz_out_5 =
+        hn::Load(int16_tag, horz_out + 5 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_lo + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_0, horz_out_1), int16_tag,
+              src_hi + 0 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_lo + 1 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_2, horz_out_3), int16_tag,
+              src_hi + 1 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveLower(int16_tag, horz_out_4, horz_out_5), int16_tag,
+              src_lo + 2 * hn::MaxLanes(int16_tag));
+    hn::Store(hn::InterleaveUpper(int16_tag, horz_out_4, horz_out_5), int16_tag,
+              src_hi + 2 * hn::MaxLanes(int16_tag));
+  }
+
+  HWY_ALIGN int16_t coeffs[8 * hn::MaxLanes(int16_tag)];
+  if constexpr (!InnerCoeffUpdate) {
+    PrepareCoeffs(gamma, delta, sy4, coeffs);
+  }
+
+  for (int k = -4; k < AOMMIN(4, p_height - i - 4);
+       k += static_cast<int>(int16_tag.MaxBlocks())) {
+    if constexpr (InnerCoeffUpdate) {
+      int sy = sy4 + delta * (k + 4);
+      PrepareCoeffs(gamma, delta, sy, coeffs);
+    }
+
+    IVec32 res_lo, res_hi;
+    FilterPixelsVertical(horz_out, src_lo, src_hi, coeffs, res_lo, res_hi,
+                         k + 4);
+    StoreVerticalFilterOutput(res_lo, res_hi, res_add_const, wt, res_sub_const,
+                              round_bits_const, pred, conv_params, i, j, k + 4,
+                              reduce_bits_vert, p_stride, p_width, round_bits);
+
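+    // Advance the buffered interleaved rows by the number of rows just
+    // produced before the next iteration.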
+    if constexpr (int16_tag.MaxBlocks() >= 3) {
+      hn::Store(hn::Load(int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_lo + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_hi + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag));
+    } else if constexpr (int16_tag.MaxBlocks() == 2) {
+      hn::Store(hn::Load(int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_lo + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_hi + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag));
+      hn::Store(hn::Load(int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag)),
+                int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag));
+    } else if constexpr (int16_tag.MaxBlocks() == 1) {
+      const auto src_lo_0 =
+          hn::Load(int16_tag, src_lo + 0 * hn::MaxLanes(int16_tag));
+      const auto src_lo_1 =
+          hn::Load(int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag));
+      const auto src_lo_2 =
+          hn::Load(int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag));
+      const auto src_lo_3 =
+          hn::Load(int16_tag, src_lo + 3 * hn::MaxLanes(int16_tag));
+      const auto src_lo_0_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_lo_0), src_lo_1);
+      const auto src_lo_1_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_lo_1), src_lo_2);
+      const auto src_lo_2_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_lo_2), src_lo_3);
+      hn::Store(src_lo_0_new, int16_tag, src_lo + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(src_lo_1_new, int16_tag, src_lo + 1 * hn::MaxLanes(int16_tag));
+      hn::Store(src_lo_2_new, int16_tag, src_lo + 2 * hn::MaxLanes(int16_tag));
+      const auto src_hi_0 =
+          hn::Load(int16_tag, src_hi + 0 * hn::MaxLanes(int16_tag));
+      const auto src_hi_1 =
+          hn::Load(int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag));
+      const auto src_hi_2 =
+          hn::Load(int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag));
+      const auto src_hi_3 =
+          hn::Load(int16_tag, src_hi + 3 * hn::MaxLanes(int16_tag));
+      const auto src_hi_0_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_hi_0), src_hi_1);
+      const auto src_hi_1_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_hi_1), src_hi_2);
+      const auto src_hi_2_new = hn::InterleaveEven(
+          hn::ShiftRightLanes<1>(int16_tag, src_hi_2), src_hi_3);
+      hn::Store(src_hi_0_new, int16_tag, src_hi + 0 * hn::MaxLanes(int16_tag));
+      hn::Store(src_hi_1_new, int16_tag, src_hi + 1 * hn::MaxLanes(int16_tag));
+      hn::Store(src_hi_2_new, int16_tag, src_hi + 2 * hn::MaxLanes(int16_tag));
+    }
+  }
+}
+
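+// Picks the WarpVerticalFilterTemplate instantiation for the given gamma and
+// delta: when delta == 0 the coefficients are prepared once outside the row
+// loop, and gamma == 0 selects the specialized Gamma0 coefficient setup.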
+HWY_ATTR inline void PrepareWarpVerticalFilter(
+    uint8_t *HWY_RESTRICT pred, int16_t *HWY_RESTRICT horz_out,
+    ConvolveParams *HWY_RESTRICT conv_params, int16_t gamma, int16_t delta,
+    int p_height, int p_stride, int p_width, int i, int j, int sy4,
+    const int reduce_bits_vert, const IVec32 res_add_const,
+    const int round_bits, const IVec16 res_sub_const,
+    const IVec16 round_bits_const, const IVec16 wt) {
+  if (gamma == 0 && delta == 0)
+    WarpVerticalFilterTemplate<false, PrepareVerticalFilterCoeffsGamma0>(
+        pred, horz_out, conv_params, gamma, delta, p_height, p_stride, p_width,
+        i, j, sy4, reduce_bits_vert, res_add_const, round_bits, res_sub_const,
+        round_bits_const, wt);
+  else if (gamma == 0 && delta != 0)
+    WarpVerticalFilterTemplate<true, PrepareVerticalFilterCoeffsGamma0>(
+        pred, horz_out, conv_params, gamma, delta, p_height, p_stride, p_width,
+        i, j, sy4, reduce_bits_vert, res_add_const, round_bits, res_sub_const,
+        round_bits_const, wt);
+  else if (gamma != 0 && delta == 0)
+    WarpVerticalFilterTemplate<false, PrepareVerticalFilterCoeffsDelta0>(
+        pred, horz_out, conv_params, gamma, delta, p_height, p_stride, p_width,
+        i, j, sy4, reduce_bits_vert, res_add_const, round_bits, res_sub_const,
+        round_bits_const, wt);
+  else
+    WarpVerticalFilterTemplate<true, PrepareVerticalFilterCoeffs>(
+        pred, horz_out, conv_params, gamma, delta, p_height, p_stride, p_width,
+        i, j, sy4, reduce_bits_vert, res_add_const, round_bits, res_sub_const,
+        round_bits_const, wt);
+}
+
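+// Horizontal-pass counterpart of PrepareWarpVerticalFilter: when beta == 0 the
+// coefficients are prepared once outside the row loop, and alpha == 0 selects
+// the specialized Alpha0 coefficient setup.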
+HWY_ATTR inline void PrepareWarpHorizontalFilter(
+    const uint8_t *HWY_RESTRICT ref, int16_t *HWY_RESTRICT horz_out, int stride,
+    int32_t ix4, int32_t iy4, int32_t sx4, int alpha, int beta, int p_height,
+    int height, int i, const IVec16 round_const, const int reduce_bits_horiz) {
+  if (alpha == 0 && beta == 0)
+    WarpHorizontalFilterTemplate<false,
+                                 PrepareHorizontalFilterCoefficientsAlpha0>(
+        ref, horz_out, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz);
+  else if (alpha == 0 && beta != 0)
+    WarpHorizontalFilterTemplate<true,
+                                 PrepareHorizontalFilterCoefficientsAlpha0>(
+        ref, horz_out, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz);
+  else if (alpha != 0 && beta == 0)
+    WarpHorizontalFilterTemplate<false,
+                                 PrepareHorizontalFilterCoefficientsBeta0>(
+        ref, horz_out, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz);
+  else
+    WarpHorizontalFilterTemplate<true, PrepareHorizontalFilterCoefficients,
+                                 PrepareLastHorizontalFilterCoefficients>(
+        ref, horz_out, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz);
+}
+
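+// Used when, after clamping, every sample of the block comes from the leftmost
+// or rightmost column of 'ref': each row's horizontal filter output is then
+// the constant const4 + ref[iy * stride + offset] * const5, broadcast across
+// the row.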
+template <typename D>
+HWY_ATTR HWY_INLINE int WarpHorizontalFilterOutOfBoundsSetLoop(
+    D tag, const uint8_t *HWY_RESTRICT ref, int height, int stride,
+    int p_height, int i, int iy4, int16_t const4, int16_t const5, int offset,
+    int k, int16_t *HWY_RESTRICT horz_out) {
+  constexpr int kNumRows = tag.MaxBlocks();
+  for (; k < AOMMIN(8, p_height - i) - kNumRows; k += kNumRows) {
+    int iy = clamp(iy4 + k + 0, 0, height - 1);
+    auto src = hn::ResizeBitCast(
+        tag, hn::Set(int16x8_tag, const4 + ref[iy * stride + offset] * const5));
+    if constexpr (kNumRows >= 2) {
+      iy = clamp(iy4 + k + 1, 0, height - 1);
+      src = hn::InsertBlock<1>(
+          src,
+          hn::Set(int16x8_tag, const4 + ref[iy * stride + offset] * const5));
+    }
+    if constexpr (kNumRows >= 3) {
+      iy = clamp(iy4 + k + 2, 0, height - 1);
+      src = hn::InsertBlock<2>(
+          src,
+          hn::Set(int16x8_tag, const4 + ref[iy * stride + offset] * const5));
+      iy = clamp(iy4 + k + 3, 0, height - 1);
+      src = hn::InsertBlock<3>(
+          src,
+          hn::Set(int16x8_tag, const4 + ref[iy * stride + offset] * const5));
+    }
+    hn::Store(src, tag, horz_out + (k + 7) * hn::MaxLanes(int16x8_tag));
+  }
+  return k;
+}
+
+HWY_ATTR void WarpHorizontalFilterOutOfBoundsSet(
+    const uint8_t *HWY_RESTRICT ref, int height, int stride, int p_height,
+    int i, int iy4, int16_t const4, int16_t const5, int offset,
+    int16_t *HWY_RESTRICT horz_out) {
+  int k = -7, iy;
+  if constexpr (int16_tag.MaxBlocks() >= 3) {
+    k = WarpHorizontalFilterOutOfBoundsSetLoop(int16_tag, ref, height, stride,
+                                               p_height, i, iy4, const4, const5,
+                                               offset, k, horz_out);
+  }
+  if constexpr (int16_tag.MaxBlocks() >= 2) {
+    k = WarpHorizontalFilterOutOfBoundsSetLoop(int16x16_tag, ref, height,
+                                               stride, p_height, i, iy4, const4,
+                                               const5, offset, k, horz_out);
+  }
+  if constexpr (int16_tag.MaxBlocks() == 1) {
+    k = WarpHorizontalFilterOutOfBoundsSetLoop(int16x8_tag, ref, height, stride,
+                                               p_height, i, iy4, const4, const5,
+                                               offset, k, horz_out);
+  }
+  iy = clamp(iy4 + k, 0, height - 1);
+  hn::Store(hn::Set(int16x8_tag, const4 + ref[iy * stride + offset] * const5),
+            int16x8_tag, horz_out + (k + 7) * hn::MaxLanes(int16x8_tag));
+}
+
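+// Used when the block straddles the left or right picture edge: each row is
+// loaded with its row index clamped, then the warp_pad_left / warp_pad_right
+// shuffle tables replicate the edge pixel before the regular horizontal filter
+// runs.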
+template <typename D>
+HWY_ATTR int WarpHorizontalFilterOutOfBoundsPadLoop(
+    D tag, const uint8_t *HWY_RESTRICT ref, int stride, int32_t ix4,
+    int32_t iy4, int32_t sx4, int alpha, int beta, int p_height, int height,
+    int i, const IVec16 round_const, const int reduce_bits_horiz,
+    int out_of_boundary_left, int out_of_boundary_right, int k,
+    int16_t *HWY_RESTRICT horz_out) {
+  constexpr int kNumRows = tag.MaxBlocks();
+  for (; k < (AOMMIN(8, p_height - i) - kNumRows); k += kNumRows) {
+    auto src = LoadRowsClamped(tag, ref + ix4 - 7, stride, iy4 + k, height);
+    if (out_of_boundary_left >= 0) {
+      const auto shuffle_reg_left =
+          hn::LoadDup128(tag, warp_pad_left[out_of_boundary_left]);
+      src = hn::TableLookupBytes(src, shuffle_reg_left);
+    }
+    if (out_of_boundary_right >= 0) {
+      const auto shuffle_reg_right =
+          hn::LoadDup128(tag, warp_pad_right[out_of_boundary_right]);
+      src = hn::TableLookupBytes(src, shuffle_reg_right);
+    }
+    int sx = sx4 + beta * (k + 4);
+    HorizontalFilter(tag, src, horz_out, sx, alpha, beta, k + 7, round_const,
+                     reduce_bits_horiz);
+  }
+  return k;
+}
+
+HWY_ATTR void WarpHorizontalFilterOutOfBoundsPad(
+    const uint8_t *HWY_RESTRICT ref, int stride, int32_t ix4, int32_t iy4,
+    int32_t sx4, int alpha, int beta, int p_height, int width, int height,
+    int i, const IVec16 round_const, const int reduce_bits_horiz,
+    int16_t *HWY_RESTRICT horz_out) {
+  const int out_of_boundary_left = -(ix4 - 6);
+  const int out_of_boundary_right = (ix4 + 8) - width;
+  int k = -7, iy, sx;
+  if constexpr (uint8_tag.MaxBlocks() >= 3) {
+    k = WarpHorizontalFilterOutOfBoundsPadLoop(
+        uint8_tag, ref, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz, out_of_boundary_left,
+        out_of_boundary_right, k, horz_out);
+  }
+  if constexpr (uint8_tag.MaxBlocks() >= 2) {
+    k = WarpHorizontalFilterOutOfBoundsPadLoop(
+        uint8x32_tag, ref, stride, ix4, iy4, sx4, alpha, beta, p_height, height,
+        i, round_const, reduce_bits_horiz, out_of_boundary_left,
+        out_of_boundary_right, k, horz_out);
+  }
+  if constexpr (uint8_tag.MaxBlocks() == 1) {
+    k = WarpHorizontalFilterOutOfBoundsPadLoop(
+        uint8_tag, ref, stride, ix4, iy4, sx4, alpha, beta, p_height, height, i,
+        round_const, reduce_bits_horiz, out_of_boundary_left,
+        out_of_boundary_right, k, horz_out);
+  }
+  iy = iy4 + k;
+  iy = clamp(iy, 0, height - 1);
+  auto src = hn::LoadU(uint8x16_tag, ref + iy * stride + ix4 - 7);
+  if (out_of_boundary_left >= 0) {
+    const auto shuffle_reg_left =
+        hn::LoadU(uint8x16_tag, warp_pad_left[out_of_boundary_left]);
+    src = hn::TableLookupBytes(src, shuffle_reg_left);
+  }
+  if (out_of_boundary_right >= 0) {
+    const auto shuffle_reg_right =
+        hn::LoadU(uint8x16_tag, warp_pad_right[out_of_boundary_right]);
+    src = hn::TableLookupBytes(src, shuffle_reg_right);
+  }
+  sx = sx4 + beta * (k + 4);
+  HWY_ALIGN int8_t coeff[4 * hn::MaxLanes(int8_tag)];
+  PrepareLastHorizontalFilterCoefficients(alpha, beta, sx, coeff);
+  FilterPixelsHorizontal(uint8x16_tag, src, horz_out, coeff, round_const,
+                         reduce_bits_horiz, k + 7);
+}
+
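+// Top-level 8-bit warp. For each 8x8 block of the prediction, derives the
+// source position and filter phases (ix4/iy4, sx4/sy4) from the affine model
+// 'mat', runs the horizontal pass into 'horz_out' (with dedicated paths for
+// blocks at or beyond the frame edges), then the vertical pass into 'pred' or
+// the compound buffer.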
+HWY_ATTR void WarpAffine(const int32_t *HWY_RESTRICT mat,
+                         const uint8_t *HWY_RESTRICT ref, int width, int height,
+                         int stride, uint8_t *HWY_RESTRICT pred, int p_col,
+                         int p_row, int p_width, int p_height, int p_stride,
+                         int subsampling_x, int subsampling_y,
+                         ConvolveParams *HWY_RESTRICT conv_params,
+                         int16_t alpha, int16_t beta, int16_t gamma,
+                         int16_t delta) {
+  int i, j;
+  const int bd = 8;
+  const int reduce_bits_horiz = conv_params->round_0;
+  const int reduce_bits_vert = conv_params->is_compound
+                                   ? conv_params->round_1
+                                   : 2 * FILTER_BITS - reduce_bits_horiz;
+  const int offset_bits_horiz = bd + FILTER_BITS - 1;
+  assert(IMPLIES(conv_params->is_compound, conv_params->dst != NULL));
+
+  const int offset_bits_vert = bd + 2 * FILTER_BITS - reduce_bits_horiz;
+  const auto reduce_bits_vert_const =
+      hn::Set(int32_tag, ((1 << reduce_bits_vert) >> 1));
+  const auto res_add_const = hn::Set(int32_tag, 1 << offset_bits_vert);
+  const int round_bits =
+      2 * FILTER_BITS - conv_params->round_0 - conv_params->round_1;
+  const int offset_bits = bd + 2 * FILTER_BITS - conv_params->round_0;
+  assert(IMPLIES(conv_params->do_average, conv_params->is_compound));
+
+  const auto round_const = hn::Set(
+      int16_tag, (1 << offset_bits_horiz) + ((1 << reduce_bits_horiz) >> 1));
+
+  IVec16 res_sub_const, round_bits_const, wt;
+  UnpackWeightsAndSetRoundConst(conv_params, round_bits, offset_bits,
+                                res_sub_const, round_bits_const, wt);
+
+  IVec32 res_add_const_1;
+  if (conv_params->is_compound == 1) {
+    res_add_const_1 = hn::Add(reduce_bits_vert_const, res_add_const);
+  } else {
+    res_add_const_1 = hn::Set(int32_tag, -(1 << (bd + reduce_bits_vert - 1)) +
+                                             ((1 << reduce_bits_vert) >> 1));
+  }
+  const int32_t const1 = alpha * (-4) + beta * (-4) +
+                         (1 << (WARPEDDIFF_PREC_BITS - 1)) +
+                         (WARPEDPIXEL_PREC_SHIFTS << WARPEDDIFF_PREC_BITS);
+  const int32_t const2 = gamma * (-4) + delta * (-4) +
+                         (1 << (WARPEDDIFF_PREC_BITS - 1)) +
+                         (WARPEDPIXEL_PREC_SHIFTS << WARPEDDIFF_PREC_BITS);
+  const int32_t const3 = ((1 << WARP_PARAM_REDUCE_BITS) - 1);
+  const int16_t const4 = (1 << (bd + FILTER_BITS - reduce_bits_horiz - 1));
+  const int16_t const5 = (1 << (FILTER_BITS - reduce_bits_horiz));
+
+  for (i = 0; i < p_height; i += 8) {
+    for (j = 0; j < p_width; j += 8) {
+      HWY_ALIGN int16_t horz_out[8 * 16 + hn::MaxLanes(int16_tag)];
+      const int32_t src_x = (p_col + j + 4) << subsampling_x;
+      const int32_t src_y = (p_row + i + 4) << subsampling_y;
+      const int64_t dst_x =
+          (int64_t)mat[2] * src_x + (int64_t)mat[3] * src_y + (int64_t)mat[0];
+      const int64_t dst_y =
+          (int64_t)mat[4] * src_x + (int64_t)mat[5] * src_y + (int64_t)mat[1];
+      const int64_t x4 = dst_x >> subsampling_x;
+      const int64_t y4 = dst_y >> subsampling_y;
+
+      int32_t ix4 = (int32_t)(x4 >> WARPEDMODEL_PREC_BITS);
+      int32_t sx4 = x4 & ((1 << WARPEDMODEL_PREC_BITS) - 1);
+      int32_t iy4 = (int32_t)(y4 >> WARPEDMODEL_PREC_BITS);
+      int32_t sy4 = y4 & ((1 << WARPEDMODEL_PREC_BITS) - 1);
+
+      // Add in all the constant terms, including rounding and offset
+      sx4 += const1;
+      sy4 += const2;
+
+      sx4 &= ~const3;
+      sy4 &= ~const3;
+
+      // Horizontal filter
+      // If the block is aligned such that, after clamping, every sample
+      // would be taken from the leftmost/rightmost column, then we can
+      // skip the expensive horizontal filter.
+
+      if (ix4 <= -7) {
+        WarpHorizontalFilterOutOfBoundsSet(ref, height, stride, p_height, i,
+                                           iy4, const4, const5, 0, horz_out);
+      } else if (ix4 >= width + 6) {
+        WarpHorizontalFilterOutOfBoundsSet(ref, height, stride, p_height, i,
+                                           iy4, const4, const5, width - 1,
+                                           horz_out);
+      } else if (((ix4 - 7) < 0) || ((ix4 + 9) > width)) {
+        WarpHorizontalFilterOutOfBoundsPad(
+            ref, stride, ix4, iy4, sx4, alpha, beta, p_height, width, height, i,
+            round_const, reduce_bits_horiz, horz_out);
+      } else {
+        PrepareWarpHorizontalFilter(ref, horz_out, stride, ix4, iy4, sx4, alpha,
+                                    beta, p_height, height, i, round_const,
+                                    reduce_bits_horiz);
+      }
+
+      // Vertical filter
+      PrepareWarpVerticalFilter(pred, horz_out, conv_params, gamma, delta,
+                                p_height, p_stride, p_width, i, j, sy4,
+                                reduce_bits_vert, res_add_const_1, round_bits,
+                                res_sub_const, round_bits_const, wt);
+    }
+  }
+}
+
+}  // namespace HWY_NAMESPACE
+}  // namespace
+
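+// Defines the extern "C" entry point av1_warp_affine_<suffix> for one Highway
+// target, forwarding to the WarpAffine implementation above.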
+#define MAKE_WARP_AFFINE(suffix)                                             \
+  extern "C" void av1_warp_affine_##suffix(                                  \
+      const int32_t *HWY_RESTRICT mat, const uint8_t *HWY_RESTRICT ref,      \
+      int width, int height, int stride, uint8_t *HWY_RESTRICT pred,         \
+      int p_col, int p_row, int p_width, int p_height, int p_stride,         \
+      int subsampling_x, int subsampling_y,                                  \
+      ConvolveParams *HWY_RESTRICT conv_params, int16_t alpha, int16_t beta, \
+      int16_t gamma, int16_t delta);                                         \
+  HWY_ATTR void av1_warp_affine_##suffix(                                    \
+      const int32_t *HWY_RESTRICT mat, const uint8_t *HWY_RESTRICT ref,      \
+      int width, int height, int stride, uint8_t *HWY_RESTRICT pred,         \
+      int p_col, int p_row, int p_width, int p_height, int p_stride,         \
+      int subsampling_x, int subsampling_y,                                  \
+      ConvolveParams *HWY_RESTRICT conv_params, int16_t alpha, int16_t beta, \
+      int16_t gamma, int16_t delta) {                                        \
+    HWY_NAMESPACE::WarpAffine(mat, ref, width, height, stride, pred, p_col,  \
+                              p_row, p_width, p_height, p_stride,            \
+                              subsampling_x, subsampling_y, conv_params,     \
+                              alpha, beta, gamma, delta);                    \
+  }
+
+HWY_AFTER_NAMESPACE();
+
+#endif  // AV1_COMMON_WARP_PLANE_HWY_H_
diff --git a/av1/common/x86/warp_plane_avx2.c b/av1/common/x86/warp_plane_avx2.c
index a780939..4f6f910 100644
--- a/av1/common/x86/warp_plane_avx2.c
+++ b/av1/common/x86/warp_plane_avx2.c
@@ -14,6 +14,8 @@
 #include "av1/common/warped_motion.h"
 #include "aom_dsp/x86/synonyms.h"
 
+#if !CONFIG_HIGHWAY
+
 DECLARE_ALIGNED(32, static const uint8_t, shuffle_alpha0_mask01_avx2[32]) = {
   0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
   0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
@@ -1208,3 +1210,5 @@
     }
   }
 }
+
+#endif  // !CONFIG_HIGHWAY
diff --git a/av1/common/x86/warp_plane_hwy_avx2.cc b/av1/common/x86/warp_plane_hwy_avx2.cc
new file mode 100644
index 0000000..a721383
--- /dev/null
+++ b/av1/common/x86/warp_plane_hwy_avx2.cc
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2025, Alliance for Open Media. All rights reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#define HWY_BASELINE_TARGETS HWY_AVX2
+#define HWY_BROKEN_32BIT 0
+
+#include "av1/common/warp_plane_hwy.h"
+
+MAKE_WARP_AFFINE(avx2)
diff --git a/av1/common/x86/warp_plane_hwy_avx512.cc b/av1/common/x86/warp_plane_hwy_avx512.cc
new file mode 100644
index 0000000..a0e2a87
--- /dev/null
+++ b/av1/common/x86/warp_plane_hwy_avx512.cc
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2025, Alliance for Open Media. All rights reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#define HWY_BASELINE_TARGETS HWY_AVX3_DL
+#define HWY_BROKEN_32BIT 0
+
+#include "av1/common/warp_plane_hwy.h"
+
+MAKE_WARP_AFFINE(avx512)
diff --git a/av1/common/x86/warp_plane_hwy_sse4.cc b/av1/common/x86/warp_plane_hwy_sse4.cc
new file mode 100644
index 0000000..f7c74fb7
--- /dev/null
+++ b/av1/common/x86/warp_plane_hwy_sse4.cc
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2025, Alliance for Open Media. All rights reserved.
+ *
+ * This source code is subject to the terms of the BSD 2 Clause License and
+ * the Alliance for Open Media Patent License 1.0. If the BSD 2 Clause License
+ * was not distributed with this source code in the LICENSE file, you can
+ * obtain it at www.aomedia.org/license/software. If the Alliance for Open
+ * Media Patent License 1.0 was not distributed with this source code in the
+ * PATENTS file, you can obtain it at www.aomedia.org/license/patent.
+ */
+
+#define HWY_BASELINE_TARGETS HWY_SSE4
+#define HWY_BROKEN_32BIT 0
+
+#include "av1/common/warp_plane_hwy.h"
+
+MAKE_WARP_AFFINE(sse4_1)
diff --git a/av1/common/x86/warp_plane_sse4.c b/av1/common/x86/warp_plane_sse4.c
index eb02683..11de2b0 100644
--- a/av1/common/x86/warp_plane_sse4.c
+++ b/av1/common/x86/warp_plane_sse4.c
@@ -137,6 +137,8 @@
 };
 /* clang-format on */
 
+#if !CONFIG_HIGHWAY
+
 // Shuffle masks: we want to convert a sequence of bytes 0, 1, 2, ..., 15
 // in an SSE register into two sequences:
 // 0, 2, 2, 4, ..., 12, 12, 14, <don't care>
@@ -906,3 +908,5 @@
     }
   }
 }
+
+#endif  // !CONFIG_HIGHWAY
diff --git a/av1/decoder/obu.c b/av1/decoder/obu.c
index ea4b1b6..7e89332 100644
--- a/av1/decoder/obu.c
+++ b/av1/decoder/obu.c
@@ -628,7 +628,7 @@
 
 // On failure, calls aom_internal_error() and does not return.
 static void read_metadata_itut_t35(AV1Decoder *const pbi, const uint8_t *data,
-                                   size_t sz) {
+                                   size_t sz, bool has_obu_extension_header) {
   if (sz == 0) {
     aom_internal_error(&pbi->error, AOM_CODEC_CORRUPT_FRAME,
                        "itu_t_t35_country_code is missing");
@@ -657,7 +657,9 @@
                        data[end_index]);
   }
   alloc_read_metadata(pbi, OBU_METADATA_TYPE_ITUT_T35, data, end_index,
-                      AOM_MIF_ANY_FRAME);
+                      has_obu_extension_header
+                          ? AOM_MIF_ANY_FRAME_LAYER_SPECIFIC
+                          : AOM_MIF_ANY_FRAME);
 }
 
 // On success, returns the number of bytes read from 'data'. On failure, calls
@@ -784,7 +786,8 @@
 // On success, returns the number of bytes read from 'data'. On failure, sets
 // pbi->common.error.error_code and returns 0, or calls aom_internal_error()
 // and does not return.
-static size_t read_metadata(AV1Decoder *pbi, const uint8_t *data, size_t sz) {
+static size_t read_metadata(AV1Decoder *pbi, const uint8_t *data, size_t sz,
+                            bool has_obu_extension_header) {
   size_t type_length;
   uint64_t type_value;
   if (aom_uleb_decode(data, sz, &type_value, &type_length) < 0) {
@@ -803,7 +806,8 @@
   }
   if (metadata_type == OBU_METADATA_TYPE_ITUT_T35) {
     // read_metadata_itut_t35() checks trailing bits.
-    read_metadata_itut_t35(pbi, data + type_length, sz - type_length);
+    read_metadata_itut_t35(pbi, data + type_length, sz - type_length,
+                           has_obu_extension_header);
     return sz;
   } else if (metadata_type == OBU_METADATA_TYPE_HDR_CLL) {
     size_t bytes_read =
@@ -1057,10 +1061,12 @@
         }
         pbi->num_tile_groups++;
         break;
-      case OBU_METADATA:
-        decoded_payload_size = read_metadata(pbi, data, payload_size);
+      case OBU_METADATA: {
+        decoded_payload_size =
+            read_metadata(pbi, data, payload_size, obu_header.has_extension);
         if (pbi->error.error_code != AOM_CODEC_OK) return -1;
         break;
+      }
       case OBU_TILE_LIST:
         if (CONFIG_NORMAL_TILE_MODE) {
           pbi->error.error_code = AOM_CODEC_UNSUP_BITSTREAM;
diff --git a/av1/encoder/bitstream.c b/av1/encoder/bitstream.c
index a6a92bf..178a84a 100644
--- a/av1/encoder/bitstream.c
+++ b/av1/encoder/bitstream.c
@@ -4222,23 +4222,24 @@
   for (size_t i = 0; i < arr->sz; i++) {
     aom_metadata_t *current_metadata = arr->metadata_array[i];
     if (current_metadata && current_metadata->payload) {
+      const int metadata_insert_location =
+          current_metadata->insert_flag & AOM_MIF_INSERT_LOCATION_MASK;
       if ((cm->current_frame.frame_type == KEY_FRAME &&
-           current_metadata->insert_flag == AOM_MIF_KEY_FRAME) ||
+           metadata_insert_location == AOM_MIF_KEY_FRAME) ||
           (cm->current_frame.frame_type != KEY_FRAME &&
-           current_metadata->insert_flag == AOM_MIF_NON_KEY_FRAME) ||
-          current_metadata->insert_flag == AOM_MIF_ANY_FRAME) {
+           metadata_insert_location == AOM_MIF_NON_KEY_FRAME) ||
+          metadata_insert_location == AOM_MIF_ANY_FRAME) {
         // OBU header is either one or two bytes.
         if (dst_size < 2) {
           aom_internal_error(cm->error, AOM_CODEC_ERROR,
                              "av1_write_metadata_array: output buffer full");
         }
-        // According to the AV1 spec draft version (as of git commit 5e04f)
-        // Section 6.7.1, some metadata types can be layer specific, but we
-        // currently only support non-layer specific metadata.
+        const bool is_layer_specific_obu =
+            (current_metadata->insert_flag & AOM_MIF_LAYER_SPECIFIC) != 0;
         obu_header_size = av1_write_obu_header(
             &cpi->ppi->level_params, &cpi->frame_header_count, OBU_METADATA,
             cm->seq_params->has_nonzero_operating_point_idc,
-            /*is_layer_specific_obu=*/false, 0, dst);
+            is_layer_specific_obu, 0, dst);
         assert(obu_header_size <= 2);
         obu_payload_size =
             av1_write_metadata_obu(current_metadata, dst + obu_header_size,
diff --git a/av1/encoder/cost.c b/av1/encoder/cost.c
index e4d15e2..4bf8fce 100644
--- a/av1/encoder/cost.c
+++ b/av1/encoder/cost.c
@@ -10,6 +10,7 @@
  */
 #include <assert.h>
 
+#include "aom_dsp/entcode.h"
 #include "av1/encoder/cost.h"
 #include "av1/common/entropy.h"
 
diff --git a/av1/encoder/partition_model_weights.h b/av1/encoder/partition_model_weights.h
index 28d960d..493d893 100644
--- a/av1/encoder/partition_model_weights.h
+++ b/av1/encoder/partition_model_weights.h
@@ -1317,6 +1317,842 @@
 
 #define FEATURE_SIZE 18
 #define LABEL_SIZE 4
+#define NEW_LABEL_SIZE 3
+
+static const float av1_partition4_search_thresh[6][3][5] = {
+  // Aggressiveness = 0
+  {
+      // lowres
+      { 1.0f, 0.7451170578367811f, 0.7001978100387174f, 0.7410514819747771f,
+        1.0f },
+      // midres
+      { 1.0f, 0.7494756052316782f, 0.6892602739627772f, 0.8573332311578094f,
+        1.0f },
+      // hdres
+      { 1.0f, 0.7494756052316782f, 0.6892602739627772f, 0.8573332311578094f,
+        1.0f },
+  },
+  // Aggressiveness = 1
+  {
+      // lowres
+      { 1.0f, 0.8885108052540076f, 0.6485475489414385f, 0.8310364661853259f,
+        1.0f },
+      // midres
+      { 1.0f, 0.6669407157796841f, 0.7813399048525052f, 0.8876257905832826f,
+        1.0f },
+      // hdres
+      { 1.0f, 0.6669407157796841f, 0.7813399048525052f, 0.8876257905832826f,
+        1.0f },
+  },
+  // Aggressiveness = 2
+  {
+      // lowres
+      { 1.0f, 0.8885108052540076f, 0.6485475489414385f, 0.8310364661853259f,
+        1.0f },
+      // midres
+      { 1.0f, 0.6669407157796841f, 0.7813399048525052f, 0.8876257905832826f,
+        1.0f },
+      // hdres
+      { 1.0f, 0.6669407157796841f, 0.7813399048525052f, 0.8876257905832826f,
+        1.0f },
+  },
+  // Aggressiveness = 3
+  {
+      // lowres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // midres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // hdres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+  },
+  // Aggressiveness = 4
+  {
+      // lowres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // midres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // hdres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+  },
+  // Aggressiveness = 5
+  {
+      // lowres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // midres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+      // hdres
+      { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },
+  },
+};
+
+static const float av1_partition4_not_search_thresh[6][3][5] = {
+  // Aggressiveness = 0
+  {
+      // lowres
+      { 0.0f, 0.001052842338862181f, 0.03450079872362423f,
+        0.002759167859733889f, 0.0f },
+      // midres
+      { 0.0f, 0.07053592384393781f, 0.03838345436035215f, 0.005675936031290841f,
+        0.0f },
+      // hdres
+      { 0.0f, 0.07053592384393781f, 0.03838345436035215f, 0.005675936031290841f,
+        0.0f },
+  },
+  // Aggressiveness = 1
+  {
+      // lowres
+      { 0.0f, 0.003841755695390817f, 0.04822550751756426f,
+        0.008487399657253631f, 0.0f },
+      // midres
+      { 0.0f, 0.0011047168521408242f, 0.04736033094945354f,
+        0.01638628311248205f, 0.0f },
+      // hdres
+      { 0.0f, 0.0011047168521408242f, 0.04736033094945354f,
+        0.01638628311248205f, 0.0f },
+  },
+  // Aggressiveness = 2
+  {
+      // lowres
+      { 0.0f, 0.003841755695390817f, 0.04822550751756426f,
+        0.008487399657253631f, 0.0f },
+      // midres
+      { 0.0f, 0.0011047168521408242f, 0.04736033094945354f,
+        0.01638628311248205f, 0.0f },
+      // hdres
+      { 0.0f, 0.0011047168521408242f, 0.04736033094945354f,
+        0.01638628311248205f, 0.0f },
+
+  },
+  // Aggressiveness = 3
+  {
+      // lowres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // midres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // hdres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+  },
+  // Aggressiveness = 4
+  {
+      // lowres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // midres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // hdres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+  },
+  // Aggressiveness = 5
+  {
+      // lowres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // midres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+      // hdres
+      { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f },
+  },
+};
+
+static const float av1_partition4_nn_mean_64[FEATURE_SIZE] = {
+  1.3752231712399923f,  7.8818296620238755f,  0.70302558915341073f,
+  0.89354391968945868f, 0.69912435960273867f, 0.894260991860082f,
+  0.3926311487269134f,  0.396667092758365f,   0.39248475739497596f,
+  0.60199542223376412f, 0.79385711361043609f, 0.71296985320460649f,
+  0.70907736145028422f, 0.79726457629807679f, 0.78323554106033988f,
+  0.70020724711817484f, 0.703606514382454f,   0.812416459726476f,
+};
+
+static const float av1_partition4_nn_std_64[FEATURE_SIZE] = {
+  1.5210330851172169f,  3.2502345451890569f,  0.31526449400836393f,
+  0.28189232462592312f, 0.31774840567547408f, 0.28162114702432761f,
+  0.32736466480428811f, 0.32670076454358127f, 0.327642806340881f,
+  0.39886024415245425f, 0.58334698402992669f, 0.43936371898261278f,
+  0.43195994078504846f, 0.6085133217391806f,  0.58153035818647314f,
+  0.43649370363303375f, 0.43352660981855345f, 0.6044169876719907f
+};
+
+static const float av1_partition4_nn_mean_32[FEATURE_SIZE] = {
+  1.2246859239905685f,  8.2993701373465054f,  0.766659262212363f,
+  0.94652387284747963f, 0.7600040783826757f,  0.95272747538676006f,
+  0.65159866132436739f, 0.65257588048947879f, 0.66423008846285458f,
+  0.76469913589630689f, 0.82278748860287532f, 0.74684876377199483f,
+  0.747579068836797f,   0.81934887236245424f, 0.83835878037641287f,
+  0.75032880096189358f, 0.75488102829056769f, 0.84759980602684326f,
+};
+
+static const float av1_partition4_nn_std_32[FEATURE_SIZE] = {
+  1.6075309663410215f,  2.5730348757929398f,  0.26321073080207291f,
+  0.16943329319298692f, 0.2651485345364597f,  0.16189549806595446f,
+  0.37708176589960463f, 0.37678428010229964f, 0.376236054947869f,
+  0.35251481923629219f, 0.54812998882840647f, 0.40597153804110392f,
+  0.4062410958984316f,  0.55966403917740415f, 0.551261263412241f,
+  0.39617447478383311f, 0.3975763070782759f,  0.56159171003404f
+};
+
+static const float av1_partition4_nn_mean_16[FEATURE_SIZE] = {
+  1.0045835791325202f,  8.4894105104733484f,  0.662368560626388f,
+  0.907764748376923f,   0.6627159222349438f,  0.936766105471581f,
+  0.45489736309556683f, 0.46132801974042803f, 0.46931301089726096f,
+  0.85059414467649674f, 0.779699826074789f,   0.70609877051124081f,
+  0.711834073256916f,   0.79712849358962179f, 0.8351636078095156f,
+  0.75367264126747469f, 0.75873775601873783f, 0.8466653756621495f
+};
+
+static const float av1_partition4_nn_std_16[FEATURE_SIZE] = {
+  1.6427502751165233f,  2.3099057132670047f,  0.24533039885709371f,
+  0.19978306019327963f, 0.24478191346772621f, 0.17204070228378546f,
+  0.33179173716308064f, 0.33283691607603066f, 0.33758170469103288f,
+  0.3009885171419176f,  0.550933084988767f,   0.41342659059456571f,
+  0.41384314531287553f, 0.56715063304567692f, 0.54874118341614453f,
+  0.39768773547851133f, 0.397536461220648f,   0.55655009687449353f
+};
+
+static const float *const av1_partition4_nn_mean[5] = {
+  NULL,
+  av1_partition4_nn_mean_64,
+  av1_partition4_nn_mean_32,
+  av1_partition4_nn_mean_16,
+  NULL,
+};
+
+static const float *const av1_partition4_nn_std[5] = {
+  NULL,
+  av1_partition4_nn_std_64,
+  av1_partition4_nn_std_32,
+  av1_partition4_nn_std_16,
+  NULL,
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_64_layer0[FEATURE_SIZE * 24] = {
+      0.187540665268898f,     0.008867358788847923f,  1.3245491981506348f,
+      0.18199755251407623f,   -1.320246934890747f,    -0.11215319484472275f,
+      -0.2073437124490738f,   -1.6804399490356445f,   1.0172151327133179f,
+      -0.04451597109436989f,  0.3141838610172272f,    -0.10051298141479492f,
+      0.19940334558486938f,   0.3332000970840454f,    -0.19904619455337524f,
+      0.575351357460022f,     -0.3605368435382843f,   -0.2180677056312561f,
+      -0.08732779324054718f,  -0.5763136148452759f,   -0.4628996253013611f,
+      -0.19735047221183777f,  -0.2669188976287842f,   -0.18305927515029907f,
+      1.9265873432159424f,    0.44848161935806274f,   0.48354148864746094f,
+      0.883162796497345f,     -0.2056751400232315f,   0.26105034351348877f,
+      0.22955726087093353f,   -0.07927738130092621f,  0.22645841538906097f,
+      -0.07682249695062637f,  0.46167850494384766f,   0.03855176270008087f,
+      0.27271342277526855f,   -0.032083675265312195f, 1.2821667194366455f,
+      0.2666876018047333f,    -0.09309065341949463f,  -0.15320968627929688f,
+      -1.6731971502304077f,   -1.3742161989212036f,   0.9385388493537903f,
+      -0.008384025655686855f, 0.24508674442768097f,   0.1386878341436386f,
+      0.02423120103776455f,   0.19301238656044006f,   0.13091489672660828f,
+      -0.6223750710487366f,   -0.051762085407972336f, 0.3243873119354248f,
+      -0.17509253323078156f,  -0.30384597182273865f,  2.6790037155151367f,
+      0.20672129094600677f,   -0.2797843813896179f,   0.4765399992465973f,
+      -0.45524242520332336f,  0.39230310916900635f,   -0.23308643698692322f,
+      -0.3281370997428894f,   -0.33606868982315063f,  -0.24563899636268616f,
+      -0.11075028777122498f,  0.26341673731803894f,   -0.1780794858932495f,
+      -0.3037680685520172f,   0.6263296604156494f,    -0.7237133383750916f,
+      -0.28761041164398193f,  0.09403169900178909f,   0.5554772019386292f,
+      0.08419027924537659f,   -0.20748449862003326f,  0.8960804343223572f,
+      1.731931447982788f,     0.9879814386367798f,    -0.7795932292938232f,
+      -0.08355121314525604f,  -1.2876038551330566f,   0.2683749794960022f,
+      0.046085603535175323f,  0.6003941297531128f,    0.39504218101501465f,
+      0.42627424001693726f,   0.2481756955385208f,    0.2539105713367462f,
+      0.3717970848083496f,    -1.372495174407959f,    1.779883623123169f,
+      0.0023461435921490192f, -0.4034990966320038f,   -0.005759629420936108f,
+      -0.3803994655609131f,   -0.3320770561695099f,   2.070441961288452f,
+      0.3284848630428314f,    0.35943546891212463f,   -0.7431299090385437f,
+      0.327751100063324f,     0.42656365036964417f,   0.0988604873418808f,
+      0.17223680019378662f,   -0.14063750207424164f,  0.01995408907532692f,
+      -1.0309178829193115f,   -0.18265926837921143f,  -0.9445133805274963f,
+      0.1541268527507782f,    0.7334791421890259f,    1.5515707731246948f,
+      -0.14698545634746552f,  0.49188539385795593f,   -0.6401307582855225f,
+      -0.24233347177505493f,  -0.24929864704608917f,  0.7331327795982361f,
+      -0.24525558948516846f,  -0.6303561925888062f,   0.2914137542247772f,
+      -0.1501816362142563f,   0.33382299542427063f,   0.20683197677135468f,
+      0.6331838965415955f,    0.21398495137691498f,   -0.7407512068748474f,
+      -0.38856998085975647f,  1.4056344032287598f,    0.2766624987125397f,
+      -0.47957277297973633f,  0.4720071256160736f,    -2.237626791000366f,
+      0.17220094799995422f,   0.1279810518026352f,    -0.170374795794487f,
+      0.15627238154411316f,   0.06910572201013565f,   0.1091245785355568f,
+      0.13457776606082916f,   0.0723930299282074f,    0.18878978490829468f,
+      -0.9244682788848877f,   0.49643731117248535f,   0.00938953272998333f,
+      0.817326009273529f,     -0.872970700263977f,    1.2162615060806274f,
+      1.6745234727859497f,    0.7649728655815125f,    0.6812664270401001f,
+      0.7555683851242065f,    -0.3641526997089386f,   -0.5201636552810669f,
+      0.33287081122398376f,   -0.015199027955532074f, 0.11200721561908722f,
+      0.1262284219264984f,    -0.30170801281929016f,  0.29431891441345215f,
+      -0.8517263531684875f,   -0.011521201580762863f, -0.2496422827243805f,
+      0.4014142155647278f,    -1.337280035018921f,    -0.4126336872577667f,
+      1.4324754476547241f,    0.5764614343643188f,    0.9383725523948669f,
+      0.15321384370326996f,   0.16989222168922424f,   0.3319786489009857f,
+      -0.07834333181381226f,  0.13870780169963837f,   0.06260085105895996f,
+      0.1040545254945755f,    -0.04136618599295616f,  -0.1875564157962799f,
+      0.12765266001224518f,   0.119786836206913f,     2.6260948181152344f,
+      -0.718943178653717f,    2.216468572616577f,     0.2265452891588211f,
+      0.3913697898387909f,    1.3077619075775146f,    0.8483156561851501f,
+      0.17309871315956116f,   -0.032115768641233444f, 0.06523145735263824f,
+      -0.039812054485082626f, 0.20178377628326416f,   0.09728340059518814f,
+      0.06543758511543274f,   -0.018573174253106117f, 0.22792796790599823f,
+      -0.25049784779548645f,  -0.26964524388313293f,  0.3470650613307953f,
+      0.07785002142190933f,   0.9330359697341919f,    0.13356055319309235f,
+      0.5518191456794739f,    -0.03740643337368965f,  -1.0142114162445068f,
+      -0.3127419948577881f,   0.6786038279533386f,    -0.746333658695221f,
+      0.7910341024398804f,    -0.5631489753723145f,   0.13812024891376495f,
+      0.5231190919876099f,    -0.4686489403247833f,   0.09117632359266281f,
+      0.7155937552452087f,    -0.7463059425354004f,   -0.4379642903804779f,
+      0.48007652163505554f,   0.0414154939353466f,    0.08547665923833847f,
+      -0.08740316331386566f,  -0.3762202262878418f,   1.650416374206543f,
+      0.6305139660835266f,    0.2807236909866333f,    0.3252447545528412f,
+      -1.2813769578933716f,   0.1806870996952057f,    -0.07484257966279984f,
+      -0.4510236084461212f,   -0.32734254002571106f,  -0.2525199055671692f,
+      -0.6823204755783081f,   0.4419010579586029f,    -0.13107456266880035f,
+      1.5115501880645752f,    -1.0415934324264526f,   0.6362841725349426f,
+      1.1276127099990845f,    1.6090517044067383f,    1.4753515720367432f,
+      -0.5725038647651672f,   0.4504983127117157f,    0.0542747899889946f,
+      0.23436619341373444f,   -0.09478037059307098f,  0.17619547247886658f,
+      0.5189757943153381f,    0.3518279492855072f,    -0.20417530834674835f,
+      -0.3984152376651764f,   1.4910639524459839f,    0.896291196346283f,
+      -0.07476390153169632f,  -0.38988080620765686f,  -0.36142808198928833f,
+      -0.5949490070343018f,   -0.34817272424697876f,  2.390124559402466f,
+      0.036864519119262695f,  -0.29277393221855164f,  0.2582416832447052f,
+      0.12344545125961304f,   -0.19278477132320404f,  -0.3593710958957672f,
+      0.052842043340206146f,  -0.4351940155029297f,   0.39930951595306396f,
+      -0.05878692492842674f,  -0.13607178628444672f,  -0.3738308250904083f,
+      0.07306762039661407f,   0.14370614290237427f,   0.7443045377731323f,
+      -0.6571366786956787f,   1.4732838869094849f,    0.07397766411304474f,
+      -0.026601698249578476f, 0.16516001522541046f,   -0.09079989790916443f,
+      0.08782806247472763f,   -0.02378779835999012f,  1.2652226686477661f,
+      1.1744853258132935f,    1.0768793821334839f,    1.2959463596343994f,
+      0.18875858187675476f,   -0.16353435814380646f,  0.12929506599903107f,
+      -0.028879646211862564f, 1.7920417785644531f,    0.2096109390258789f,
+      0.9538826942443848f,    -1.3157790899276733f,   1.025888204574585f,
+      0.14866796135902405f,   0.11838590353727341f,   -0.1063050851225853f,
+      0.10628897696733475f,   -0.07982262223958969f,  -0.26907843351364136f,
+      0.010558247566223145f,  0.13393597304821014f,   0.22190389037132263f,
+      -0.19494540989398956f,  0.3906410038471222f,    -0.9874037504196167f,
+      0.651418149471283f,     -0.8602163791656494f,   0.32449445128440857f,
+      1.2791974544525146f,    1.0395246744155884f,    1.5486201047897339f,
+      -0.7350376844406128f,   -0.6816073060035706f,   0.016160577535629272f,
+      -0.18916243314743042f,  -0.26918724179267883f,  0.4815976321697235f,
+      0.2128375917673111f,    0.32033786177635193f,   0.5065921545028687f,
+      -0.9307171106338501f,   0.007387777790427208f,  -0.45773541927337646f,
+      -0.19515591859817505f,  -0.28307247161865234f,  1.3757314682006836f,
+      1.5909974575042725f,    1.4594842195510864f,    1.3418850898742676f,
+      1.449257493019104f,     0.1290784776210785f,    -0.02872903272509575f,
+      0.0876379907131195f,    0.0945722684264183f,    -0.11620496213436127f,
+      0.03221723809838295f,   0.01832464151084423f,   -0.0350685641169548f,
+      0.2947345972061157f,    -0.6470182538032532f,   0.7246863842010498f,
+      -0.04240961745381355f,  -0.8078872561454773f,   0.2905123829841614f,
+      0.9782440066337585f,    -0.7729941010475159f,   0.9055317044258118f,
+      0.037821024656295776f,  0.123793825507164f,     0.17636357247829437f,
+      0.022636424750089645f,  0.31781089305877686f,   0.4737413227558136f,
+      -0.6823905110359192f,   1.179976463317871f,     -0.6974776983261108f,
+      -0.7578544616699219f,   0.4514416754245758f,    0.8697171211242676f,
+      -0.45408639311790466f,  0.06114231050014496f,   0.4886225461959839f,
+      -0.902211606502533f,    0.2965964078903198f,    0.0068778120912611485f,
+      0.7137743830680847f,    0.3116300702095032f,    -0.2853015661239624f,
+      0.9911389350891113f,    -0.21889007091522217f,  0.2786492705345154f,
+      -0.1595863401889801f,   -0.2824990749359131f,   0.6888416409492493f,
+      -0.8445804715156555f,   -0.07846039533615112f,  -0.9218422174453735f,
+      -0.36275193095207214f,  -0.24915170669555664f,  -0.022376133129000664f,
+      0.9710543751716614f,    1.1314961910247803f,    0.8179596662521362f,
+      -0.12418682873249054f,  0.1507333666086197f,    -0.22915267944335938f,
+      0.0621408149600029f,    -0.18360091745853424f,  0.045976653695106506f,
+      -0.039411406964063644f, 0.09902478009462357f,   0.02785186842083931f,
+      -0.11535168439149857f,  -0.021012280136346817f, -0.8498482704162598f,
+      -0.08284465223550797f,  1.7017556428909302f,    0.09805655479431152f,
+      -0.819100558757782f,    1.2941229343414307f,    -1.4690512418746948f,
+      -0.14031103253364563f,  -0.11753756552934647f,  -0.09136147797107697f,
+      -0.22433844208717346f,  0.553429126739502f,     0.03767363727092743f,
+      0.013898416422307491f,  0.25408342480659485f,   0.047054700553417206f,
+      0.38260170817375183f,   -0.10799504816532135f,  0.7206213474273682f,
+      0.40167760848999023f,   0.33209502696990967f,   -0.05047299340367317f,
+      -1.4984959363937378f,   -0.7021075487136841f,   0.33913472294807434f,
+      0.1524118036031723f,    -0.15081769227981567f,  -0.27317649126052856f,
+      -0.15303443372249603f,  -0.6619091033935547f,   -0.1907443255186081f,
+      0.0709405466914177f,    0.2743484377861023f,    -0.12694808840751648f
+    };
+
+static const float av1_hd_4_partition_nn_bias_64_layer0[24] = {
+  -0.017236731946468353f, 1.2984086275100708f,  -0.5433139204978943f,
+  -0.1250726878643036f,   0.1385560780763626f,  -0.715609073638916f,
+  0.23101328313350677f,   -0.5218021273612976f, 0.4044009745121002f,
+  1.2596054077148438f,    -1.4503605365753174f, -0.09927619248628616f,
+  -0.692095160484314f,    0.339007705450058f,   -0.9172008633613586f,
+  -0.6032222509384155f,   1.5522794723510742f,  -0.2724784016609192f,
+  1.1838854551315308f,    0.1453675925731659f,  -0.8429049253463745f,
+  1.4830207824707031f,    0.4452385902404785f,  -0.19120316207408905f
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_64_layer1[24 * NEW_LABEL_SIZE] = {
+      -0.13316302001476288f, 0.3058302104473114f,    0.1070905551314354f,
+      0.14361895620822906f,  -0.4736453890800476f,   -0.33969804644584656f,
+      0.030571846291422844f, 0.33224594593048096f,   -0.06829124689102173f,
+      0.21595115959644318f,  0.8105601668357849f,    -0.13124717772006989f,
+      -0.2749757170677185f,  -0.7534167170524597f,   0.006968092638999224f,
+      0.04543568938970566f,  -0.22467342019081116f,  -0.47277456521987915f,
+      0.7455835938453674f,   -0.17081353068351746f,  -0.33045464754104614f,
+      0.2652832865715027f,   -0.12469493597745895f,  0.010430725291371346f,
+      -0.6428546905517578f,  -0.06037088856101036f,  -0.6064530611038208f,
+      0.06202490255236626f,  0.0295811016112566f,    0.2661149799823761f,
+      0.09020768105983734f,  0.2919042408466339f,    -0.5376702547073364f,
+      -0.6015531420707703f,  -1.8029735088348389f,   0.07257948815822601f,
+      0.11942317336797714f,  -0.28495341539382935f,  -0.5178318619728088f,
+      0.416162371635437f,    -0.34815943241119385f,  -0.2532649636268616f,
+      -0.8336657285690308f,  -0.28184619545936584f,  -0.07333157956600189f,
+      -0.0849723070859909f,  -0.012794683687388897f, -0.300246000289917f,
+      0.30970948934555054f,  0.06398778408765793f,   0.3188731074333191f,
+      -0.32415783405303955f, -0.23729059100151062f,  -1.2901034355163574f,
+      -0.3617444634437561f,  -0.3765851557254791f,   -0.6880390644073486f,
+      -0.08437152206897736f, -3.1945059299468994f,   -0.41093024611473083f,
+      -0.5344216227531433f,  0.11081822216510773f,   -0.4143705666065216f,
+      -0.28478118777275085f, 0.37445685267448425f,   0.3433232307434082f,
+      -0.5481890439987183f,  0.31730446219444275f,   -0.04274527728557587f,
+      -0.17332714796066284f, -0.6555780172348022f,   -0.1539755016565323f
+    };
+
+static const float av1_hd_4_partition_nn_bias_64_layer1[NEW_LABEL_SIZE] = {
+  0.9356993436813354f, -0.9818968772888184f, -0.5307729244232178f
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_32_layer0[FEATURE_SIZE * 32] = {
+      -1.0885568857192993f,   0.7276804447174072f,     0.3899334967136383f,
+      1.786031723022461f,     0.047051746398210526f,   0.7338393330574036f,
+      -1.0688824653625488f,   0.7300519347190857f,     0.5556117296218872f,
+      0.15455445647239685f,   0.23771341145038605f,    0.17510148882865906f,
+      0.07676580548286438f,   0.33183372020721436f,    0.029779063537716866f,
+      -0.038887616246938705f, 0.12371493875980377f,    -0.1936332881450653f,
+      0.4791683554649353f,    0.01103244535624981f,    0.22002530097961426f,
+      0.1311550885438919f,    -0.25968194007873535f,   0.6977970004081726f,
+      0.4742209017276764f,    -1.7808669805526733f,    0.5283331871032715f,
+      0.12193027883768082f,   0.24376845359802246f,    -0.05244125798344612f,
+      -0.012799731455743313f, -0.11021115630865097f,   0.2635785937309265f,
+      0.5267615914344788f,    -0.6972594261169434f,    0.5042043924331665f,
+      0.5771448612213135f,    0.16394701600074768f,    0.6534922122955322f,
+      0.44265198707580566f,   0.001729517593048513f,   -0.24405747652053833f,
+      -0.46474769711494446f,  -1.2396752834320068f,    0.4316665828227997f,
+      -0.40889325737953186f,  0.21149589121341705f,    0.059366464614868164f,
+      -0.23183700442314148f,  0.26841700077056885f,    -0.18424294888973236f,
+      0.018943576142191887f,  0.1582796424627304f,     -0.04466909542679787f,
+      0.1756608933210373f,    -0.14650224149227142f,   0.6032110452651978f,
+      0.23037950694561005f,   0.05099782347679138f,    -0.11585348099470139f,
+      -0.6360398530960083f,   -0.504222571849823f,     1.2582234144210815f,
+      0.639336109161377f,     -0.08372239023447037f,   0.23684656620025635f,
+      -0.8035249710083008f,   0.6960340738296509f,     -0.1946241408586502f,
+      0.26888877153396606f,   0.0608806349337101f,     -0.0793396532535553f,
+      -0.21731014549732208f,  -0.44480085372924805f,   0.9669836759567261f,
+      -0.03727146238088608f,  -1.048531413078308f,     0.09407489746809006f,
+      0.7357499599456787f,    -0.1052381843328476f,    0.7807036638259888f,
+      -1.2890628576278687f,   -0.27905717492103577f,   -0.2336570918560028f,
+      0.24939477443695068f,   -0.1286485493183136f,    -0.017393384128808975f,
+      0.2383008748292923f,    -0.2378801703453064f,    0.046414244920015335f,
+      0.04329893738031387f,   0.09692811220884323f,    -0.23152312636375427f,
+      0.26998770236968994f,   0.11675097048282623f,    0.048871126025915146f,
+      -0.5766774415969849f,   0.5075773000717163f,     -0.16599853336811066f,
+      0.07355321198701859f,   1.2679802179336548f,     -0.470909059047699f,
+      -0.7353945970535278f,   0.2871803045272827f,     0.04421471804380417f,
+      0.028569340705871582f,  0.07299203425645828f,    0.17615129053592682f,
+      -0.6759434938430786f,   0.3759593665599823f,     -0.03345653414726257f,
+      0.15250401198863983f,   1.6251002550125122f,     -0.32556575536727905f,
+      -0.4355400800704956f,   0.18039146065711975f,    -0.6343501210212708f,
+      -0.26397016644477844f,  0.1334858238697052f,     0.17045122385025024f,
+      0.06101013347506523f,   0.15618449449539185f,    -0.05479817092418671f,
+      0.19636914134025574f,   -0.171217679977417f,     -0.2721777856349945f,
+      -0.03099956177175045f,  -0.09452107548713684f,   -0.2100360095500946f,
+      -0.08023128658533096f,  0.5380405783653259f,     0.04074520245194435f,
+      -0.45956486463546753f,  -0.26170605421066284f,   -0.036938007920980453f,
+      0.3081861436367035f,    0.7264460921287537f,     0.4579879939556122f,
+      0.5227369070053101f,    0.7139924764633179f,     -1.3371893167495728f,
+      0.8577211499214172f,    -0.05075681209564209f,   0.2690950036048889f,
+      -0.3681940734386444f,   0.08500909060239792f,    0.8850646018981934f,
+      0.4467816948890686f,    0.28028661012649536f,    0.27217888832092285f,
+      0.7745168209075928f,    0.7731392979621887f,     1.2513463497161865f,
+      -0.35894691944122314f,  0.060745496302843094f,   -0.04395193234086037f,
+      0.11425885558128357f,   -0.0328013151884079f,    -0.1833721548318863f,
+      -0.07130905985832214f,  0.05300029367208481f,    -0.10985758155584335f,
+      0.006968754343688488f,  0.9198938012123108f,     -0.17839254438877106f,
+      0.26150068640708923f,   0.3576587736606598f,     -0.23587310314178467f,
+      -0.0849660113453865f,   0.6160711050033569f,     -0.18021133542060852f,
+      -0.30802857875823975f,  0.43131136894226074f,    0.1159980297088623f,
+      -0.25173503160476685f,  0.1161213293671608f,     -0.1237974539399147f,
+      -0.7531319260597229f,   -0.00878852792084217f,   0.2545885741710663f,
+      0.007603700738400221f,  -0.654176652431488f,     0.44965240359306335f,
+      0.7931277751922607f,    0.18094922602176666f,    -0.17669828236103058f,
+      0.41031819581985474f,   0.12209723144769669f,    -0.2814178168773651f,
+      -0.5220250487327576f,   0.270843505859375f,      -1.7025750875473022f,
+      0.019062979146838188f,  0.03137283772230148f,    -0.17925892770290375f,
+      -0.42655882239341736f,  -0.6994401216506958f,    -0.07981792837381363f,
+      -0.1174769252538681f,   -0.17482306063175201f,   0.07917492091655731f,
+      0.5423614382743835f,    -0.19157607853412628f,   -0.05202263593673706f,
+      0.04898262768983841f,   0.9668978452682495f,     -0.27084100246429443f,
+      0.1934838443994522f,    0.20127049088478088f,    0.07396796345710754f,
+      -0.06063545122742653f,  -0.04891745001077652f,   -0.5657607316970825f,
+      0.5004168152809143f,    -0.8361988663673401f,    -0.1273525059223175f,
+      0.8776155710220337f,    -0.10246502608060837f,   0.46023881435394287f,
+      0.07886356115341187f,   0.9126405119895935f,     0.17006607353687286f,
+      -0.513926088809967f,    0.1184217631816864f,     0.4032308757305145f,
+      -0.6753421425819397f,   -0.043491754680871964f,  0.028111552819609642f,
+      -0.03926699236035347f,  -0.1501464545726776f,    -0.0014241111930459738f,
+      -0.004508028272539377f, 0.1367940455675125f,     -0.1648765653371811f,
+      -0.3467996418476105f,   0.8369541168212891f,     0.8667364120483398f,
+      0.6326888799667358f,    0.0986432433128357f,     0.20588631927967072f,
+      0.03708481788635254f,   0.7360354065895081f,     -0.43546396493911743f,
+      0.2251778244972229f,    0.5427256226539612f,     -0.08095406740903854f,
+      0.9812734723091125f,    0.6085113286972046f,     0.853574812412262f,
+      0.6173090934753418f,    0.6338220238685608f,     0.5749073624610901f,
+      0.16422225534915924f,   -0.024263106286525726f,  -0.26709797978401184f,
+      -0.27645328640937805f,  -0.18558774888515472f,   0.19723981618881226f,
+      0.3469878137111664f,    -0.853264331817627f,     0.1911182701587677f,
+      -0.15484920144081116f,  0.1289137303829193f,     0.16144156455993652f,
+      0.28407949209213257f,   0.06683658063411713f,    0.10071670264005661f,
+      -0.13541074097156525f,  0.8519083857536316f,     -1.3628060817718506f,
+      0.044920407235622406f,  -0.9304912090301514f,    -0.0543082021176815f,
+      -0.5795891880989075f,   0.7444899678230286f,     0.34974098205566406f,
+      -0.5224471688270569f,   0.7189369797706604f,     0.002385274972766638f,
+      0.28222599625587463f,   -0.48682257533073425f,   -0.24234719574451447f,
+      -0.04255838692188263f,  0.12397146970033646f,    0.618983805179596f,
+      0.20875586569309235f,   0.1820639818906784f,     0.6538670659065247f,
+      -0.42951807379722595f,  -0.9420881271362305f,    0.7871586680412292f,
+      0.3056884706020355f,    -0.22634023427963257f,   -0.4248776435852051f,
+      0.7732703685760498f,    -0.559977650642395f,     0.08991166949272156f,
+      -0.5344432592391968f,   0.4384763538837433f,     0.0689520537853241f,
+      0.054295241832733154f,  0.4064214527606964f,     0.33730319142341614f,
+      -0.5907483696937561f,   -0.25099942088127136f,   0.4158354103565216f,
+      -0.49657127261161804f,  -0.05017169192433357f,   -0.19485296308994293f,
+      0.6315739750862122f,    0.026322035118937492f,   -0.022450678050518036f,
+      0.7323209643363953f,    0.04370767995715141f,    -0.24496810138225555f,
+      0.11060530692338943f,   -0.17143982648849487f,   0.9523182511329651f,
+      -0.6612598896026611f,   -0.37088659405708313f,   -0.06733931601047516f,
+      0.10264944285154343f,   -0.5029839277267456f,    0.05873759835958481f,
+      -0.9268806576728821f,   0.08673790097236633f,    -0.05127667635679245f,
+      -0.12422613799571991f,  1.112686276435852f,      0.8936125636100769f,
+      -0.13199946284294128f,  -0.7763554453849792f,    -0.43546125292778015f,
+      0.2081739604473114f,    -0.02353234775364399f,   0.12321862578392029f,
+      0.21226866543293f,      -0.13369852304458618f,   -0.07170510292053223f,
+      0.022296279668807983f,  -0.056023065000772476f,  0.07766319811344147f,
+      0.538470983505249f,     -0.09681082516908646f,   0.04744873568415642f,
+      0.15997207164764404f,   0.1942814141511917f,     -0.606421709060669f,
+      -1.6964457035064697f,   1.8364115953445435f,     -1.2303802967071533f,
+      0.3570907413959503f,    0.010215475223958492f,   -0.08084885030984879f,
+      0.07036393880844116f,   0.08225959539413452f,    0.1582154929637909f,
+      0.051107257604599f,     0.007791138254106045f,   -0.07944256067276001f,
+      0.1391010731458664f,    -0.13120266795158386f,   -0.22520364820957184f,
+      0.5478659868240356f,    0.2474595457315445f,     0.5424359440803528f,
+      0.2528058588504791f,    -0.3962598741054535f,    0.3255935311317444f,
+      0.03695685788989067f,   -0.07569821178913116f,   0.11970824003219604f,
+      -0.09008245170116425f,  0.23039402067661285f,    0.22283776104450226f,
+      -1.016753911972046f,    -0.3937026262283325f,    0.5536431670188904f,
+      -0.0240758266299963f,   0.4818081855773926f,     0.05481531843543053f,
+      0.2614268660545349f,    -0.0978444516658783f,    0.5413931608200073f,
+      -0.12983573973178864f,  0.11426673084497452f,    -0.8978381156921387f,
+      2.279249429702759f,     0.2615402340888977f,     0.21105653047561646f,
+      0.5502492785453796f,    0.21621979773044586f,    0.084201879799366f,
+      0.3490462899208069f,    -0.026039229705929756f,  -0.10916426032781601f,
+      -0.07801277190446854f,  0.6263523101806641f,     0.6516022086143494f,
+      0.4567950665950775f,    -0.1453220695257187f,    0.2806280255317688f,
+      0.1846635937690735f,    -0.39070093631744385f,   0.2953556478023529f,
+      -0.6223417520523071f,   0.7452172040939331f,     -0.43376561999320984f,
+      0.08578093349933624f,   0.047524344176054f,      -0.1837213784456253f,
+      -0.1024935394525528f,   0.11444173008203506f,    -0.6773277521133423f,
+      0.27492958307266235f,   -0.09823667258024216f,   -0.22346174716949463f,
+      0.0019273801008239388f, -2.3131091594696045f,    0.21303993463516235f,
+      0.24907292425632477f,   -0.2713790833950043f,    -0.010520088486373425f,
+      0.008633764460682869f,  0.09717638790607452f,    -0.06675714254379272f,
+      -0.15131525695323944f,  -0.06854774802923203f,   0.18723946809768677f,
+      0.18391983211040497f,   -0.03460600599646568f,   0.04807600751519203f,
+      0.3075238764286041f,    -0.5352725386619568f,    -0.49505266547203064f,
+      0.13530148565769196f,   0.6102568507194519f,     0.07696415483951569f,
+      0.7194110155105591f,    0.06245575100183487f,    -0.5269941091537476f,
+      0.6319996118545532f,    0.19854962825775146f,    0.16299931704998016f,
+      -0.36398473381996155f,  -0.29766127467155457f,   -0.8556507229804993f,
+      0.19275438785552979f,   0.10368930548429489f,    -0.18639041483402252f,
+      -0.13361744582653046f,  0.20305071771144867f,    -0.2384209781885147f,
+      -0.05955479294061661f,  -0.1649128794670105f,    0.406151682138443f,
+      0.6018882393836975f,    0.7667891383171082f,     -1.7823314666748047f,
+      -0.6050052642822266f,   -0.33709582686424255f,   -0.07998203486204147f,
+      -0.48412758111953735f,  -0.05572658032178879f,   0.1572490930557251f,
+      0.2134942263364792f,    0.14582860469818115f,    0.18256354331970215f,
+      -0.377498060464859f,    -0.9428679347038269f,    0.7819222211837769f,
+      0.40199485421180725f,   0.37610843777656555f,    0.09845725446939468f,
+      -0.2473520040512085f,   1.216661810874939f,      0.30756333470344543f,
+      -0.6023632287979126f,   -0.43360501527786255f,   -0.13598665595054626f,
+      -0.2213943898677826f,   -0.05226043984293938f,   -0.4922758936882019f,
+      -0.0975913479924202f,   -0.21766597032546997f,   -0.3971554636955261f,
+      0.7108594179153442f,    -0.7833206057548523f,    -0.1663222461938858f,
+      -0.10129307955503464f,  0.694123387336731f,      -0.18871963024139404f,
+      -0.20957666635513306f,  -0.6052191853523254f,    -0.4168408513069153f,
+      0.5734631419181824f,    -0.16406677663326263f,   -0.025442425161600113f,
+      0.2279360443353653f,    -0.39298707246780396f,   0.6339749097824097f,
+      0.811789870262146f,     0.5733523964881897f,     0.7220039367675781f,
+      -0.2420847862958908f,   -0.036595702171325684f,  0.16434960067272186f,
+      0.1654539257287979f,    -0.05197849124670029f,   1.6182829141616821f,
+      0.10106345266103745f,   -0.4488222897052765f,    1.9005721807479858f,
+      1.1414402723312378f,    -0.0009835668606683612f, -0.053397852927446365f,
+      0.053350090980529785f,  -0.01116836816072464f,   -0.20531748235225677f,
+      0.031078532338142395f,  -0.12687280774116516f,   0.04337241128087044f,
+      -0.19423863291740417f,  -0.1944684386253357f,    -1.0170292854309082f,
+      0.253818154335022f,     0.2104361206293106f,     0.303714781999588f,
+      0.24978181719779968f,   -0.13854654133319855f,   0.43723681569099426f,
+      -0.35904571413993835f,  0.3720517158508301f,     0.37751492857933044f,
+      -0.22321447730064392f,  0.2643660604953766f,     -1.2104402780532837f,
+      -0.1291334182024002f,   -0.799103856086731f,     -0.6578419804573059f,
+      0.1304577738046646f,    -0.310467928647995f,     -0.5715723633766174f,
+      -0.2751500606536865f,   -1.4782752990722656f,    0.056794919073581696f,
+      -0.42365702986717224f,  0.29207560420036316f,    -0.7994689345359802f,
+      0.6121898293495178f,    -0.13781405985355377f,   -0.17137859761714935f,
+      -0.06067662686109543f,  -0.2110602855682373f,    -0.14195752143859863f,
+      -0.033988095819950104f, -0.290365606546402f,     -0.11846823245286942f,
+      -0.5680235624313354f,   -0.11251579970121384f,   0.11803086847066879f,
+      0.17645932734012604f,   0.173707515001297f,      -0.08128058165311813f,
+      1.9146397113800049f,    -2.025925636291504f,     -0.2421492487192154f,
+      0.1311802864074707f,    0.007723645307123661f,   0.03510530665516853f,
+      0.01055131945759058f,   -0.001576860318891704f,  0.12198811024427414f,
+      0.051576755940914154f,  0.29626473784446716f,    -0.21319039165973663f
+    };
+
+static const float av1_hd_4_partition_nn_bias_32_layer0[32] = {
+  0.2225671112537384f,   -1.488280177116394f,    -1.25272536277771f,
+  0.30195170640945435f,  -1.5750389099121094f,   -0.15505032241344452f,
+  0.4310544729232788f,   -0.7423095703125f,      0.3700641989707947f,
+  -0.6942943930625916f,  -1.8595184087753296f,   0.9343048930168152f,
+  -0.2134159654378891f,  -1.0248521566390991f,   -0.3916787803173065f,
+  -0.13198702037334442f, -0.004904405679553747f, -0.02703840285539627f,
+  0.6067211031913757f,   -0.783298909664154f,    -0.43120503425598145f,
+  -0.5023682117462158f,  -0.6118841171264648f,   -1.6906251907348633f,
+  -0.22733944654464722f, -0.1254553198814392f,   -1.4340238571166992f,
+  -0.22106066346168518f, -1.1736090183258057f,   -0.19184525310993195f,
+  1.6934525966644287f,   0.06706780195236206f
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_32_layer1[32 * NEW_LABEL_SIZE] = {
+      0.16060389578342438f,    -0.21775491535663605f,  0.19317272305488586f,
+      0.16914533078670502f,    -0.45330461859703064f,  0.2787623107433319f,
+      0.31299149990081787f,    0.05981941148638725f,   0.26470449566841125f,
+      0.09401649236679077f,    0.009454675950109959f,  0.07412216812372208f,
+      0.1894051879644394f,     -0.005571697372943163f, -0.29581448435783386f,
+      0.1277410089969635f,     -0.2395242303609848f,   -0.07200955599546432f,
+      -0.06016874685883522f,   -0.052163951098918915f, -0.17527195811271667f,
+      -0.35842975974082947f,   -0.2974209785461426f,   -0.28416672348976135f,
+      -0.13073766231536865f,   0.06440825760364532f,   0.11089038848876953f,
+      0.04264971986413002f,    -0.047677211463451385f, -0.15897323191165924f,
+      0.20627695322036743f,    0.14130337536334991f,   -0.34490400552749634f,
+      -0.5835226774215698f,    -0.2576460540294647f,   0.45405450463294983f,
+      -0.06914465129375458f,   0.47381001710891724f,   0.2275693118572235f,
+      -0.18174265325069427f,   -0.32693126797676086f,  -0.18394632637500763f,
+      0.32419729232788086f,    -0.312508225440979f,    -0.03289115056395531f,
+      -0.0015484002651646733f, -0.411852091550827f,    -0.025346053764224052f,
+      -0.5360790491104126f,    0.0021535682026296854f, -0.1873963326215744f,
+      -0.6050364375114441f,    -0.32903650403022766f,  0.054302141070365906f,
+      -0.2966037094593048f,    0.09255900233983994f,   -0.3126637041568756f,
+      0.3107044994831085f,     -0.41937246918678284f,  0.42857909202575684f,
+      0.413314551115036f,      0.08698277920484543f,   -0.2702505588531494f,
+      -0.3499393165111542f,    0.1393149048089981f,    -0.07625073194503784f,
+      0.22380144894123077f,    0.07417313754558563f,   -0.030424892902374268f,
+      0.010089819319546223f,   -0.18347591161727905f,  0.2737013101577759f,
+      -0.002240510890260339f,  0.08045932650566101f,   -0.3990366756916046f,
+      0.28912147879600525f,    -0.25353744626045227f,  -0.398074209690094f,
+      -0.04493967071175575f,   -0.26016902923583984f,  -0.0123167484998703f,
+      -0.37009286880493164f,   -0.5511913895606995f,   0.4690968990325928f,
+      0.04034169763326645f,    -0.04233121499419212f,  -0.006188324186950922f,
+      0.662132740020752f,      -0.19957122206687927f,  -0.2447078973054886f,
+      -0.37380027770996094f,   -0.019658207893371582f, -0.8110345005989075f,
+      0.04405290633440018f,    -0.6720242500305176f,   0.28009140491485596f
+    };
+
+static const float av1_hd_4_partition_nn_bias_32_layer1[NEW_LABEL_SIZE] = {
+  0.2621873617172241f, -0.3843003511428833f, 0.061671461910009384f
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_16_layer0[FEATURE_SIZE * 24] = {
+      0.016253290697932243f,   -1.297595500946045f,    -0.27500706911087036f,
+      -0.015488669276237488f,  0.13751500844955444f,   0.35479483008384705f,
+      0.14572829008102417f,    0.9681192636489868f,    0.1723741590976715f,
+      0.6096186637878418f,     -0.27040380239486694f,  0.10235186666250229f,
+      -1.328527569770813f,     -0.21975912153720856f,  -0.646583616733551f,
+      -0.055015575140714645f,  -0.2533200681209564f,   -0.3775864541530609f,
+      0.14515726268291473f,    0.7094276547431946f,    0.1637670248746872f,
+      -0.44492098689079285f,   0.45630398392677307f,   -0.09169628471136093f,
+      -1.2921737432479858f,    -0.7724092602729797f,   -0.3599521517753601f,
+      -0.5058088302612305f,    0.05034554377198219f,   -0.13767042756080627f,
+      -0.063873291015625f,     -0.22768817842006683f,  0.4867677688598633f,
+      0.30664312839508057f,    0.197816863656044f,     0.7504525184631348f,
+      -1.1621628999710083f,    1.2124617099761963f,    0.25230467319488525f,
+      -0.05503295734524727f,   -1.4942923784255981f,   -0.32912740111351013f,
+      0.3235040307044983f,     -0.2510537803173065f,   0.771889328956604f,
+      -0.3342115581035614f,    0.5448428988456726f,    0.33309075236320496f,
+      0.323270320892334f,      0.7348728775978088f,    0.08185584098100662f,
+      0.2510865032672882f,     0.3047282099723816f,    -0.1991347074508667f,
+      -0.12788036465644836f,   0.2775464951992035f,    0.8931537866592407f,
+      0.056438326835632324f,   -0.318469375371933f,    0.09217683225870132f,
+      0.5232695937156677f,     -1.103989601135254f,    1.3018217086791992f,
+      -0.009919252246618271f,  -0.2791503667831421f,   0.061792079359292984f,
+      0.07982557266950607f,    -0.0922408252954483f,   0.6757814884185791f,
+      -0.7555142641067505f,    -0.5889946222305298f,   0.6173131465911865f,
+      -0.8960505723953247f,    -0.5027951002120972f,   0.13630244135856628f,
+      0.6446100473403931f,     -0.4130464196205139f,   -0.4017779231071472f,
+      0.20977409183979034f,    -0.4052879810333252f,   1.3482177257537842f,
+      0.14319704473018646f,    0.10465224087238312f,   0.7072924971580505f,
+      -0.10870729386806488f,   -0.519813597202301f,    -0.044540468603372574f,
+      0.033170267939567566f,   0.09132528305053711f,   -0.06453194469213486f,
+      -0.21186839044094086f,   -0.05110237002372742f,  -0.20925769209861755f,
+      -0.21288996934890747f,   1.6771467924118042f,    0.5502645373344421f,
+      -1.7048338651657104f,    0.16985715925693512f,   -1.4219897985458374f,
+      -0.15837456285953522f,   -0.2376108467578888f,   -0.0772041529417038f,
+      -0.10152464359998703f,   -0.12048980593681335f,  -0.04174394533038139f,
+      0.03342839702963829f,    -0.11870015412569046f,  0.045642685145139694f,
+      -0.17824901640415192f,   -0.39495736360549927f,  0.023219818249344826f,
+      0.06599095463752747f,    -1.3451234102249146f,   -0.36058682203292847f,
+      -0.6323454976081848f,    0.6011573672294617f,    0.5714403986930847f,
+      -0.12171386182308197f,   -0.2662772238254547f,   0.11267754435539246f,
+      0.05486325919628143f,    -0.005583520047366619f, -0.4404391944408417f,
+      0.9215461611747742f,     -0.7242451906204224f,   0.35334518551826477f,
+      0.10570390522480011f,    -0.25813040137290955f,  -0.14261184632778168f,
+      0.23588822782039642f,    0.8678246140480042f,    0.18859107792377472f,
+      -1.3121229410171509f,    -0.3789387047290802f,   -0.12254016101360321f,
+      0.32620975375175476f,    -0.020516367629170418f, -0.02394155040383339f,
+      -0.04741834104061127f,   0.6299850940704346f,    0.4153035879135132f,
+      0.24157185852527618f,    0.1356470286846161f,    0.23200926184654236f,
+      0.1951109915971756f,     0.18018396198749542f,   1.8084784746170044f,
+      0.2517470717430115f,     -0.73732590675354f,     -0.3096546232700348f,
+      -1.2763310670852661f,    -1.914092779159546f,    1.0422425270080566f,
+      -0.10959721356630325f,   0.004097330383956432f,  0.029040994122624397f,
+      0.08122090995311737f,    0.12759143114089966f,   -0.10597894340753555f,
+      -0.08978335559368134f,   0.06638552993535995f,   -0.1336660236120224f,
+      -0.006535878870636225f,  -0.41592198610305786f,  0.5584124326705933f,
+      0.2367665022611618f,     0.196196049451828f,     -0.07954944670200348f,
+      0.5095464587211609f,     -0.09126847237348557f,  -0.21107706427574158f,
+      0.029520299285650253f,   -1.2078922986984253f,   0.3940129578113556f,
+      0.7266209721565247f,     -1.5306856632232666f,   -0.1339416354894638f,
+      0.1284932792186737f,     -0.21777178347110748f,  0.07416583597660065f,
+      -0.09128864854574203f,   0.054931897670030594f,  2.331453323364258f,
+      0.3878869414329529f,     3.0721747875213623f,    -0.20134149491786957f,
+      0.4023342430591583f,     1.0996533632278442f,    0.804786741733551f,
+      0.05523429065942764f,    0.3615642488002777f,    0.28358790278434753f,
+      1.0130475759506226f,     0.6530840992927551f,    0.5036265850067139f,
+      0.07072196900844574f,    1.1965914964675903f,    0.15188759565353394f,
+      -0.4931938946247101f,    -0.5595028400421143f,   0.4211946129798889f,
+      0.4326711595058441f,     1.68171226978302f,      -0.5289874076843262f,
+      0.5924357771873474f,     -1.0259782075881958f,   0.8006095886230469f,
+      -0.011948885396122932f,  -0.06665080040693283f,  0.017298908904194832f,
+      0.19788117706775665f,    0.030706975609064102f,  0.4442530870437622f,
+      -0.045532263815402985f,  0.9565532207489014f,    -0.13111355900764465f,
+      -0.17530879378318787f,   -1.7634141445159912f,   0.4697858691215515f,
+      0.1104392409324646f,     -0.2023022472858429f,   -0.22376781702041626f,
+      0.8732469081878662f,     0.8716534376144409f,    0.5823152661323547f,
+      -0.18083855509757996f,   -1.308184266090393f,    -0.654875636100769f,
+      0.02435542456805706f,    -0.0694618821144104f,   -0.8239381909370422f,
+      -0.3262384235858917f,    -0.4603644609451294f,   -0.34901463985443115f,
+      -0.45511960983276367f,   0.09622194617986679f,   -0.9455815553665161f,
+      -0.1763199418783188f,    -0.9315241575241089f,   -0.15244387090206146f,
+      1.9309401512145996f,     1.658923625946045f,     0.6284119486808777f,
+      1.4863715171813965f,     -0.10375337302684784f,  -0.10615994036197662f,
+      0.06735323369503021f,    -0.21195504069328308f,  0.09150348603725433f,
+      0.03171848878264427f,    0.08611147850751877f,   -0.186605304479599f,
+      0.1885748654603958f,     -0.03468483313918114f,  0.45047804713249207f,
+      0.28302377462387085f,    -0.06926856935024261f,  -0.3081843852996826f,
+      1.321082353591919f,      1.173419713973999f,     -0.6945704221725464f,
+      -0.0016985656693577766f, 0.491913765668869f,     0.44896093010902405f,
+      0.22076819837093353f,    -0.8053346276283264f,   0.469636470079422f,
+      0.3976723849773407f,     0.3846956193447113f,    0.3820555508136749f,
+      -0.5353395342826843f,    -1.3723238706588745f,   -0.6204023957252502f,
+      0.013284329324960709f,   1.4644020795822144f,    1.1471366882324219f,
+      0.5871153473854065f,     1.382630705833435f,     -0.19867826998233795f,
+      -0.36002323031425476f,   -0.5623172521591187f,   -0.6787152886390686f,
+      -0.46966180205345154f,   -0.06944172829389572f,  -0.37814465165138245f,
+      -0.2501605749130249f,    0.23046647012233734f,   -0.7194182872772217f,
+      0.03376816213130951f,    -0.5843089818954468f,   -0.9881840944290161f,
+      0.3359870910644531f,     0.4855765104293823f,    -0.014989004470407963f,
+      -0.506977915763855f,     -0.6273276805877686f,   0.6636883020401001f,
+      0.3604613244533539f,     -0.5910298824310303f,   0.7819033861160278f,
+      -0.279474139213562f,     0.42845603823661804f,   0.04737255349755287f,
+      0.4609934985637665f,     0.295590877532959f,     0.1710357666015625f,
+      -0.5633265376091003f,    -0.3976214528083801f,   -0.3271988034248352f,
+      1.0001078844070435f,     -0.8906325101852417f,   1.7183932065963745f,
+      0.657141923904419f,      -0.45635029673576355f,  1.7355040311813354f,
+      -1.08721923828125f,      -0.055488500744104385f, 0.060083694756031036f,
+      0.04308626428246498f,    0.16541598737239838f,   -0.004631219431757927f,
+      -0.05671098455786705f,   -0.04844706505537033f,  -0.04004818573594093f,
+      0.5423532724380493f,     -0.22633561491966248f,  -0.27722683548927307f,
+      -0.14303728938102722f,   0.09665144979953766f,   0.10049419850111008f,
+      0.0016488885739818215f,  -0.9644765853881836f,   -0.37397530674934387f,
+      -0.16206716001033783f,   0.46272340416908264f,   0.5197669267654419f,
+      0.22682012617588043f,    -0.14623111486434937f,  0.07362930476665497f,
+      -0.38951820135116577f,   -0.31893840432167053f,  -0.06790059059858322f,
+      -0.9926403760910034f,    -0.02509547397494316f,  -0.2454613894224167f,
+      1.5675048828125f,        -0.9967404007911682f,   -0.4881826937198639f,
+      2.075866937637329f,      0.9609254002571106f,    0.9962529540061951f,
+      0.28117066621780396f,    -0.049589112401008606f, -0.22676287591457367f,
+      0.02461501583456993f,    -0.015650441870093346f, 0.062187422066926956f,
+      0.03559637814760208f,    -0.022159814834594727f, 0.12390958517789841f,
+      -0.3556448221206665f,    -0.509732186794281f,    0.23648309707641602f,
+      -0.02601865492761135f,   -1.9125454425811768f,   0.2455364614725113f,
+      -0.2373759150505066f,    0.5479041337966919f,    -0.09244018793106079f,
+      0.4884181618690491f,     -0.24114874005317688f,  -0.14978568255901337f,
+      -0.17928700149059296f,   -0.16638562083244324f,  0.20396184921264648f,
+      0.35705462098121643f,    0.027246948331594467f,  -0.03450827673077583f,
+      -1.0188283920288086f,    -0.10113411396741867f,  -0.9864289164543152f,
+      -0.10024290531873703f,   -0.1372683346271515f,   1.2718243598937988f,
+      1.9245606660842896f,     1.5185524225234985f,    0.8519737720489502f,
+      1.2267814874649048f,     -0.04248328134417534f,  0.03472750261425972f,
+      -0.010050246492028236f,  0.011352861300110817f,  0.0025208336301147938f,
+      -0.08251981437206268f,   0.013549050316214561f,  0.03250717744231224f,
+      -0.08791140466928482f,   -0.6399267315864563f,   0.6154654026031494f,
+      0.17678213119506836f,    -0.25917088985443115f,  0.4593951106071472f,
+      0.9585320353507996f,     0.6765605211257935f,    0.18305635452270508f,
+      -0.31955450773239136f,   0.18701691925525665f,   -1.0847679376602173f,
+      -1.4102541208267212f,    0.5045204162597656f,    0.15169933438301086f,
+      0.24514462053775787f,    0.3000636696815491f,    0.032767392694950104f,
+      0.28788086771965027f,    0.17492298781871796f,   1.1169567108154297f,
+      0.1887919008731842f,     1.3159228563308716f,    0.5523889064788818f,
+      -0.3944258689880371f,    -1.4844435453414917f,   1.5090211629867554f,
+      0.3398161232471466f,     -0.006244915537536144f, 0.2332850992679596f,
+      -0.04527553915977478f,   -0.004237487446516752f, -0.3812176585197449f,
+      0.3342042565345764f,     0.09058798104524612f,   0.06223088502883911f
+    };
+
+static const float av1_hd_4_partition_nn_bias_16_layer0[24] = {
+  0.17428483068943024f,  -0.18228434026241302f, -0.22544708847999573f,
+  0.9856225252151489f,   0.7987083196640015f,   -0.2589545249938965f,
+  -0.34056714177131653f, 0.518294632434845f,    -0.8025332093238831f,
+  -0.04026109725236893f, -2.161693811416626f,   0.9118471145629883f,
+  0.928778350353241f,    1.0625823736190796f,   0.11546365916728973f,
+  0.006691100541502237f, -1.3805687427520752f,  -0.11042501777410507f,
+  -0.2545872926712036f,  1.2440357208251953f,   -1.6608234643936157f,
+  0.9239633083343506f,   -0.7040714621543884f,  1.1720298528671265f
+};
+
+static const float
+    av1_hd_4_partition_nn_weights_16_layer1[24 * NEW_LABEL_SIZE] = {
+      -0.0328470803797245f,   0.23792479932308197f,  0.25561416149139404f,
+      -0.020317574962973595f, 0.11895804107189178f,  0.11310578882694244f,
+      -0.11336013674736023f,  0.16002647578716278f,  0.18786849081516266f,
+      0.10171400010585785f,   2.3350460529327393f,   -0.2828865051269531f,
+      -0.1283670961856842f,   0.3708970248699188f,   -0.21812289953231812f,
+      -0.0782235786318779f,   -0.2884467840194702f,  -0.36653709411621094f,
+      0.28875768184661865f,   0.3709397315979004f,   -0.1924125701189041f,
+      0.8257001042366028f,    -0.3084794580936432f,  0.10378926247358322f,
+      0.10832912474870682f,   0.23994174599647522f,  -0.09522973001003265f,
+      -0.20610085129737854f,  -0.2905101180076599f,  0.18631568551063538f,
+      -0.3959210515022278f,   0.056451812386512756f, -1.2908092737197876f,
+      0.3916683793067932f,    -8.122172355651855f,   -0.12142644822597504f,
+      -0.4428127706050873f,   -0.24993056058883667f, 0.2432871013879776f,
+      -0.4911264181137085f,   0.23390810191631317f,  -0.08834259957075119f,
+      -0.0922209843993187f,   -0.791261613368988f,   -0.2527330815792084f,
+      -0.8133859038352966f,   0.14507368206977844f,  -0.45458677411079407f,
+      -0.5710628628730774f,   -0.2644200026988983f,  0.2170594483613968f,
+      0.40066513419151306f,   -0.18364030122756958f, -0.8001738786697388f,
+      0.14335019886493683f,   -0.2577047646045685f,  0.11850440502166748f,
+      -0.14166106283664703f,  -2.869997024536133f,   0.11104889959096909f,
+      -0.5155230760574341f,   -0.8204565048217773f,  -0.2842898666858673f,
+      -0.4841044247150421f,   -0.23635634779930115f, 0.3735388517379761f,
+      0.19305679202079773f,   -0.5582130551338196f,  0.5497944951057434f,
+      -0.4491751790046692f,   -0.2788706123828888f,  0.5158282518386841f
+    };
+
+static const float av1_hd_4_partition_nn_bias_16_layer1[NEW_LABEL_SIZE] = {
+  0.8354771137237549f, -0.7445926666259766f, -0.5160333514213562f
+};
 
 static const float av1_4_partition_nn_weights_16_layer0[FEATURE_SIZE * 24] = {
   -2.032866f, 0.056691f,  0.495960f,  0.778785f,  0.548153f,  -0.806942f,
@@ -1426,21 +2262,39 @@
   0.401786f,
 };
 
-static const NN_CONFIG av1_4_partition_nnconfig_16 = {
-  FEATURE_SIZE,  // num_inputs
-  LABEL_SIZE,    // num_outputs
-  1,             // num_hidden_layers
+static const NN_CONFIG av1_4_partition_nnconfig_16[2] = {
   {
-      24,  // num_hidden_nodes
+      FEATURE_SIZE,  // num_inputs
+      LABEL_SIZE,    // num_outputs
+      1,             // num_hidden_layers
+      {
+          24,  // num_hidden_nodes
+      },
+      {
+          av1_4_partition_nn_weights_16_layer0,
+          av1_4_partition_nn_weights_16_layer1,
+      },
+      {
+          av1_4_partition_nn_bias_16_layer0,
+          av1_4_partition_nn_bias_16_layer1,
+      },
   },
   {
-      av1_4_partition_nn_weights_16_layer0,
-      av1_4_partition_nn_weights_16_layer1,
-  },
-  {
-      av1_4_partition_nn_bias_16_layer0,
-      av1_4_partition_nn_bias_16_layer1,
-  },
+      FEATURE_SIZE,    // num_inputs
+      NEW_LABEL_SIZE,  // num_outputs
+      1,               // num_hidden_layers
+      {
+          24,  // num_hidden_nodes
+      },
+      {
+          av1_hd_4_partition_nn_weights_16_layer0,
+          av1_hd_4_partition_nn_weights_16_layer1,
+      },
+      {
+          av1_hd_4_partition_nn_bias_16_layer0,
+          av1_hd_4_partition_nn_bias_16_layer1,
+      },
+  }
 };
 
 static const float av1_4_partition_nn_weights_32_layer0[FEATURE_SIZE * 32] = {
@@ -1583,21 +2437,39 @@
   -0.261961f,
 };
 
-static const NN_CONFIG av1_4_partition_nnconfig_32 = {
-  FEATURE_SIZE,  // num_inputs
-  LABEL_SIZE,    // num_outputs
-  1,             // num_hidden_layers
+static const NN_CONFIG av1_4_partition_nnconfig_32[2] = {
   {
-      32,  // num_hidden_nodes
+      FEATURE_SIZE,  // num_inputs
+      LABEL_SIZE,    // num_outputs
+      1,             // num_hidden_layers
+      {
+          32,  // num_hidden_nodes
+      },
+      {
+          av1_4_partition_nn_weights_32_layer0,
+          av1_4_partition_nn_weights_32_layer1,
+      },
+      {
+          av1_4_partition_nn_bias_32_layer0,
+          av1_4_partition_nn_bias_32_layer1,
+      },
   },
   {
-      av1_4_partition_nn_weights_32_layer0,
-      av1_4_partition_nn_weights_32_layer1,
-  },
-  {
-      av1_4_partition_nn_bias_32_layer0,
-      av1_4_partition_nn_bias_32_layer1,
-  },
+      FEATURE_SIZE,    // num_inputs
+      NEW_LABEL_SIZE,  // num_outputs
+      1,               // num_hidden_layers
+      {
+          32,  // num_hidden_nodes
+      },
+      {
+          av1_hd_4_partition_nn_weights_32_layer0,
+          av1_hd_4_partition_nn_weights_32_layer1,
+      },
+      {
+          av1_hd_4_partition_nn_bias_32_layer0,
+          av1_hd_4_partition_nn_bias_32_layer1,
+      },
+  }
 };
 
 static const float av1_4_partition_nn_weights_64_layer0[FEATURE_SIZE * 24] = {
@@ -1708,25 +2580,44 @@
   0.040013f,
 };
 
-static const NN_CONFIG av1_4_partition_nnconfig_64 = {
-  FEATURE_SIZE,  // num_inputs
-  LABEL_SIZE,    // num_outputs
-  1,             // num_hidden_layers
+static const NN_CONFIG av1_4_partition_nnconfig_64[2] = {
   {
-      24,  // num_hidden_nodes
+      FEATURE_SIZE,  // num_inputs
+      LABEL_SIZE,    // num_outputs
+      1,             // num_hidden_layers
+      {
+          24,  // num_hidden_nodes
+      },
+      {
+          av1_4_partition_nn_weights_64_layer0,
+          av1_4_partition_nn_weights_64_layer1,
+      },
+      {
+          av1_4_partition_nn_bias_64_layer0,
+          av1_4_partition_nn_bias_64_layer1,
+      },
   },
   {
-      av1_4_partition_nn_weights_64_layer0,
-      av1_4_partition_nn_weights_64_layer1,
-  },
-  {
-      av1_4_partition_nn_bias_64_layer0,
-      av1_4_partition_nn_bias_64_layer1,
-  },
+      FEATURE_SIZE,    // num_inputs
+      NEW_LABEL_SIZE,  // num_outputs
+      1,               // num_hidden_layers
+      {
+          24,  // num_hidden_nodes
+      },
+      {
+          av1_hd_4_partition_nn_weights_64_layer0,
+          av1_hd_4_partition_nn_weights_64_layer1,
+      },
+      {
+          av1_hd_4_partition_nn_bias_64_layer0,
+          av1_hd_4_partition_nn_bias_64_layer1,
+      },
+  }
 };
 
 #undef FEATURE_SIZE
 #undef LABEL_SIZE
+#undef NEW_LABEL_SIZE
 
 #define FEATURE_SIZE 4
 // Mean and std
diff --git a/av1/encoder/partition_strategy.c b/av1/encoder/partition_strategy.c
index 526ccb1..4708101 100644
--- a/av1/encoder/partition_strategy.c
+++ b/av1/encoder/partition_strategy.c
@@ -1323,12 +1323,14 @@
 
 #define FEATURES 18
 #define LABELS 4
+#define NEW_LABELS 3
 // Use a ML model to predict if horz4 and vert4 should be considered.
 void av1_ml_prune_4_partition(AV1_COMP *const cpi, MACROBLOCK *const x,
                               int part_ctx, int64_t best_rd,
                               PartitionSearchState *part_state,
                               int *part4_allowed,
                               unsigned int pb_source_variance) {
+  const AV1_COMMON *const cm = &cpi->common;
   const PartitionBlkParams blk_params = part_state->part_blk_params;
   const int mi_row = blk_params.mi_row;
   const int mi_col = blk_params.mi_col;
@@ -1345,15 +1347,34 @@
   if (best_rd >= 1000000000) return;
   int64_t *horz_rd = rect_part_rd[HORZ4];
   int64_t *vert_rd = rect_part_rd[VERT4];
+
+  const int is_720p_or_larger = AOMMIN(cm->width, cm->height) >= 720;
+  const int is_480p_or_larger = AOMMIN(cm->width, cm->height) >= 480;
+  // res_idx is 0 for resolutions below 480p, 1 for 480p up to (but not
+  // including) 720p, and 2 for 720p and above.
+  const int res_idx = is_480p_or_larger + is_720p_or_larger;
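+  // For example, 1920x1080 gives res_idx = 1 + 1 = 2, while 640x360 gives 0.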
+
+  const int bsize_idx = convert_bsize_to_idx(bsize);
+  if (bsize_idx < 0) return;
+  const float *ml_mean = av1_partition4_nn_mean[bsize_idx];
+  const float *ml_std = av1_partition4_nn_std[bsize_idx];
+
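+  // A model index of 1 selects the newer 3-output (NEW_LABELS) model, which
+  // takes mean/std normalized features and uses softmax probabilities with
+  // per-resolution thresholds; index 0 keeps the legacy 4-output model and
+  // its integer-score decision rule.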
+  int ml_model_index = (cpi->sf.part_sf.ml_4_partition_search_level_index < 3);
+
   const NN_CONFIG *nn_config = NULL;
   // 4-way partitions are only allowed for these three square block sizes.
   switch (bsize) {
-    case BLOCK_16X16: nn_config = &av1_4_partition_nnconfig_16; break;
-    case BLOCK_32X32: nn_config = &av1_4_partition_nnconfig_32; break;
-    case BLOCK_64X64: nn_config = &av1_4_partition_nnconfig_64; break;
+    case BLOCK_16X16:
+      nn_config = &av1_4_partition_nnconfig_16[ml_model_index];
+      break;
+    case BLOCK_32X32:
+      nn_config = &av1_4_partition_nnconfig_32[ml_model_index];
+      break;
+    case BLOCK_64X64:
+      nn_config = &av1_4_partition_nnconfig_64[ml_model_index];
+      break;
     default: assert(0 && "Unexpected bsize.");
   }
-  if (!nn_config) return;
+  if (!nn_config || !ml_mean || !ml_std) return;
 
   // Generate features.
   float features[FEATURES];
@@ -1437,6 +1458,12 @@
   }
   assert(feature_index == FEATURES);
 
+  if (ml_model_index) {
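+    // The newer model expects standardized inputs: z-score normalize each
+    // feature with the per-block-size mean and std tables.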
+    for (int idx = 0; idx < FEATURES; idx++) {
+      features[idx] = (features[idx] - ml_mean[idx]) / ml_std[idx];
+    }
+  }
+
   // Write features to file
   if (!frame_is_intra_only(&cpi->common)) {
     write_features_to_file(cpi->oxcf.partition_info_path,
@@ -1444,34 +1471,61 @@
                            FEATURES, 7, bsize, mi_row, mi_col);
   }
 
-  // Calculate scores using the NN model.
-  float score[LABELS] = { 0.0f };
-  av1_nn_predict(features, nn_config, 1, score);
-  int int_score[LABELS];
-  int max_score = -1000;
-  for (int i = 0; i < LABELS; ++i) {
-    int_score[i] = (int)(100 * score[i]);
-    max_score = AOMMAX(int_score[i], max_score);
-  }
+  if (ml_model_index == 0) {
+    // Calculate scores using the NN model.
+    float score[LABELS] = { 0.0f };
+    av1_nn_predict(features, nn_config, 1, score);
+    int int_score[LABELS];
+    int max_score = -1000;
+    for (int i = 0; i < LABELS; ++i) {
+      int_score[i] = (int)(100 * score[i]);
+      max_score = AOMMAX(int_score[i], max_score);
+    }
 
-  // Make decisions based on the model scores.
-  int thresh = max_score;
-  switch (bsize) {
-    case BLOCK_16X16: thresh -= 500; break;
-    case BLOCK_32X32: thresh -= 500; break;
-    case BLOCK_64X64: thresh -= 200; break;
-    default: break;
-  }
-  av1_zero_array(part4_allowed, NUM_PART4_TYPES);
-  for (int i = 0; i < LABELS; ++i) {
-    if (int_score[i] >= thresh) {
-      if ((i >> 0) & 1) part4_allowed[HORZ4] = 1;
-      if ((i >> 1) & 1) part4_allowed[VERT4] = 1;
+    // Make decisions based on the model scores.
+    int thresh = max_score;
+    switch (bsize) {
+      case BLOCK_16X16: thresh -= 500; break;
+      case BLOCK_32X32: thresh -= 500; break;
+      case BLOCK_64X64: thresh -= 200; break;
+      default: break;
+    }
+    av1_zero_array(part4_allowed, NUM_PART4_TYPES);
+    for (int i = 0; i < LABELS; ++i) {
+      if (int_score[i] >= thresh) {
+        if ((i >> 0) & 1) part4_allowed[HORZ4] = 1;
+        if ((i >> 1) & 1) part4_allowed[VERT4] = 1;
+      }
+    }
+  } else {
+    // Calculate scores using the NN model.
+    float score[NEW_LABELS] = { 0.0f };
+    float probs[NEW_LABELS] = { 0.0f };
+    av1_nn_predict(features, nn_config, 1, score);
+
+    av1_nn_softmax(score, probs, NEW_LABELS);
+
+    // Make decisions based on the model scores.
+    const float search_thresh = av1_partition4_search_thresh
+        [cpi->sf.part_sf.ml_4_partition_search_level_index][res_idx][bsize_idx];
+    const float not_search_thresh = av1_partition4_not_search_thresh
+        [cpi->sf.part_sf.ml_4_partition_search_level_index][res_idx][bsize_idx];
+
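+    // probs[0] is not used directly; probs[1] and probs[2] gate HORZ4 and
+    // VERT4 respectively: the search is enabled when the probability is at
+    // least search_thresh and pruned when it falls below not_search_thresh.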
+    for (int i = 1; i < NEW_LABELS; ++i) {
+      if (probs[i] >= search_thresh) {
+        if (i == 1) part4_allowed[HORZ4] = 1;
+        if (i == 2) part4_allowed[VERT4] = 1;
+      }
+      if (probs[i] < not_search_thresh) {
+        if (i == 1) part4_allowed[HORZ4] = 0;
+        if (i == 2) part4_allowed[VERT4] = 0;
+      }
     }
   }
 }
 #undef FEATURES
 #undef LABELS
+#undef NEW_LABELS
 
 #define FEATURES 4
 void av1_ml_predict_breakout(AV1_COMP *const cpi, const MACROBLOCK *const x,
diff --git a/av1/encoder/ratectrl.c b/av1/encoder/ratectrl.c
index 1b55fe2..085cc9f 100644
--- a/av1/encoder/ratectrl.c
+++ b/av1/encoder/ratectrl.c
@@ -3176,8 +3176,14 @@
     width = cpi->oxcf.frm_dim_cfg.width;
     height = cpi->oxcf.frm_dim_cfg.height;
   }
+  // Also set src_sad_blk_64x64 to NULL for number_spatial_layers > 1, as it
+  // is never allocated when number_spatial_layers > 1 (see the condition
+  // under which cpi->src_sad_blk_64x64 is allocated later in this function).
+  // This guards against the case where number_spatial_layers is changed
+  // dynamically without re-allocating the encoder.
   if (width != cm->render_width || height != cm->render_height ||
-      unscaled_src == NULL || unscaled_last_src == NULL) {
+      cpi->svc.number_spatial_layers > 1 || unscaled_src == NULL ||
+      unscaled_last_src == NULL) {
     aom_free(cpi->src_sad_blk_64x64);
     cpi->src_sad_blk_64x64 = NULL;
   }
diff --git a/av1/encoder/speed_features.c b/av1/encoder/speed_features.c
index a981f08..fe1e0af 100644
--- a/av1/encoder/speed_features.c
+++ b/av1/encoder/speed_features.c
@@ -223,6 +223,7 @@
   }
 
   if (speed >= 1) {
+    sf->part_sf.ml_4_partition_search_level_index = 1;
     if (is_720p_or_larger) {
       sf->part_sf.use_square_partition_only_threshold = BLOCK_128X128;
     } else if (is_480p_or_larger) {
@@ -250,10 +251,9 @@
   }
 
   if (speed >= 2) {
+    sf->part_sf.ml_4_partition_search_level_index = 2;
     if (is_720p_or_larger) {
       sf->part_sf.use_square_partition_only_threshold = BLOCK_64X64;
-    } else if (is_480p_or_larger) {
-      sf->part_sf.use_square_partition_only_threshold = BLOCK_32X32;
     } else {
       sf->part_sf.use_square_partition_only_threshold = BLOCK_32X32;
     }
@@ -285,6 +285,7 @@
 
   if (speed >= 3) {
     sf->part_sf.ml_early_term_after_part_split_level = 0;
+    sf->part_sf.ml_4_partition_search_level_index = 3;
 
     if (is_720p_or_larger) {
       for (int i = 0; i < PARTITION_BLOCK_SIZES; ++i) {
@@ -772,6 +773,7 @@
   }
 
   if (speed >= 1) {
+    sf->part_sf.ml_4_partition_search_level_index = 1;
     if (is_480p_or_lesser) sf->inter_sf.skip_newmv_in_drl = 1;
 
     if (is_720p_or_larger) {
@@ -802,10 +804,9 @@
   }
 
   if (speed >= 2) {
+    sf->part_sf.ml_4_partition_search_level_index = 2;
     if (is_720p_or_larger) {
       sf->part_sf.use_square_partition_only_threshold = BLOCK_64X64;
-    } else if (is_480p_or_larger) {
-      sf->part_sf.use_square_partition_only_threshold = BLOCK_32X32;
     } else {
       sf->part_sf.use_square_partition_only_threshold = BLOCK_32X32;
     }
@@ -892,6 +893,8 @@
       sf->part_sf.ml_partition_search_breakout_model_index = 0;
     }
 
+    sf->part_sf.ml_4_partition_search_level_index = 3;
+
     if (is_720p_or_larger) {
       sf->part_sf.partition_search_breakout_dist_thr = (1 << 25);
       sf->part_sf.partition_search_breakout_rate_thr = 200;
@@ -2200,6 +2203,7 @@
         -1;  // -1 means not enabled.
   }
   part_sf->ml_partition_search_breakout_model_index = 0;
+  part_sf->ml_4_partition_search_level_index = 0;
   part_sf->simple_motion_search_prune_agg = SIMPLE_AGG_LVL0;
   part_sf->simple_motion_search_split = 0;
   part_sf->simple_motion_search_prune_rect = 0;
diff --git a/av1/encoder/speed_features.h b/av1/encoder/speed_features.h
index 01f55cd..8e358b6 100644
--- a/av1/encoder/speed_features.h
+++ b/av1/encoder/speed_features.h
@@ -673,6 +673,9 @@
   // ML based partition search breakout model index
   int ml_partition_search_breakout_model_index;
 
+  // Level index for ML based pruning of the 4-way (HORZ4/VERT4) partition
+  // search; also selects which 4-way partition NN model is used.
+  int ml_4_partition_search_level_index;
+
   // Aggressiveness levels for pruning split and rectangular partitions based on
   // simple_motion_search. SIMPLE_AGG_LVL0 to SIMPLE_AGG_LVL5 correspond to
   // simple motion search based pruning. QIDX_BASED_AGG_LVL1 corresponds to
diff --git a/examples/multilayer_metadata.cc b/examples/multilayer_metadata.cc
index 9427029..4ab1349 100644
--- a/examples/multilayer_metadata.cc
+++ b/examples/multilayer_metadata.cc
@@ -430,6 +430,49 @@
   return true;
 }
 
+bool parse_multilayer_layer_local_metadata(
+    std::ifstream &file, int min_indent, int *line_idx,
+    std::vector<FrameLocalMetadata> &frames) {
+  bool has_list_prefix;
+  int indent = -1;
+  std::string field_name;
+  ParsedValue value;
+  bool syntax_error;
+  while (parse_line(file, min_indent, /*is_list=*/true, &indent,
+                    &has_list_prefix, line_idx, &field_name, &value,
+                    &syntax_error)) {
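+    // Each '-' list prefix starts a new frame-local metadata entry.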
+    if (has_list_prefix) {
+      frames.push_back({});
+    }
+    if (frames.empty()) {
+      fprintf(stderr, "Error: Missing list prefix '-' at line %d\n", *line_idx);
+      return false;
+    }
+
+    if (field_name == "frame_idx") {
+      RETURN_IF_FALSE(
+          value.IntegerValueInRange(0, std::numeric_limits<long>::max(),
+                                    *line_idx, &frames.back().frame_idx));
+    } else if (field_name == "alpha") {
+      RETURN_IF_FALSE(parse_multilayer_layer_alpha(
+          file,
+          /*min_indent=*/indent + 1, line_idx, &frames.back().alpha));
+
+    } else if (field_name == "depth") {
+      RETURN_IF_FALSE(parse_multilayer_layer_depth(
+          file,
+          /*min_indent=*/indent + 1, line_idx, &frames.back().depth));
+
+    } else {
+      fprintf(stderr, "Error: Unknown field %s at line %d\n",
+              field_name.c_str(), *line_idx);
+      return false;
+    }
+  }
+  if (syntax_error) return false;
+  return true;
+}
+
 bool validate_layer(const LayerMetadata &layer, bool layer_has_alpha,
                     bool layer_has_depth) {
   if (layer_has_alpha != (layer.layer_type == MULTILAYER_LAYER_TYPE_ALPHA &&
@@ -475,7 +518,8 @@
 
       // Validate the previous layer.
       if (!layers.empty()) {
-        validate_layer(layers.back(), layer_has_alpha, layer_has_depth);
+        RETURN_IF_FALSE(
+            validate_layer(layers.back(), layer_has_alpha, layer_has_depth));
       }
       if (layers.size() == 1 && layers.back().layer_color_description.second) {
         fprintf(stderr,
@@ -520,24 +564,25 @@
       layer->layer_color_description = value_present(color_properties);
     } else if ((field_name == "alpha")) {
       layer_has_alpha = true;
-      RETURN_IF_FALSE(parse_multilayer_layer_alpha(
-          file,
-          /*min_indent=*/indent + 1, line_idx, &layer->global_alpha_info));
+      RETURN_IF_FALSE(parse_multilayer_layer_alpha(file,
+                                                   /*min_indent=*/indent + 1,
+                                                   line_idx, &layer->alpha));
     } else if (field_name == "depth") {
       layer_has_depth = true;
-      RETURN_IF_FALSE(parse_multilayer_layer_depth(
-          file,
-          /*min_indent=*/indent + 1, line_idx, &layer->global_depth_info));
-      if ((layer->global_depth_info.d_min.second ||
-           layer->global_depth_info.d_max.second) &&
-          layer->global_depth_info.disparity_ref_view_id ==
-              (layers.size() - 1)) {
+      RETURN_IF_FALSE(parse_multilayer_layer_depth(file,
+                                                   /*min_indent=*/indent + 1,
+                                                   line_idx, &layer->depth));
+      if ((layer->depth.d_min.second || layer->depth.d_max.second) &&
+          layer->depth.disparity_ref_view_id == (layers.size() - 1)) {
         fprintf(stderr,
                 "disparity_ref_view_id must be different from the layer's id "
                 "for layer %d (zero-based index)\n",
                 static_cast<int>(layers.size()) - 1);
         return false;
       }
+    } else if (field_name == "local_metadata") {
+      RETURN_IF_FALSE(parse_multilayer_layer_local_metadata(
+          file, /*min_indent=*/indent + 1, line_idx, layer->local_metadata));
     } else {
       fprintf(stderr, "Error: Unknown field %s at line %d\n",
               field_name.c_str(), *line_idx);
@@ -545,7 +590,8 @@
     }
   }
   if (syntax_error) return false;
-  validate_layer(layers.back(), layer_has_alpha, layer_has_depth);
+  RETURN_IF_FALSE(
+      validate_layer(layers.back(), layer_has_alpha, layer_has_depth));
   return true;
 }
 
@@ -786,6 +832,33 @@
   return true;
 }
 
+void print_alpha_information(const AlphaInformation &alpha) {
+  printf("    alpha_simple_flag: %d\n", alpha.alpha_simple_flag);
+  if (!alpha.alpha_simple_flag) {
+    printf("    alpha_bit_depth: %d\n", alpha.alpha_bit_depth);
+    printf("    alpha_clip_idc: %d\n", alpha.alpha_clip_idc);
+    printf("    alpha_incr_flag: %d\n", alpha.alpha_incr_flag);
+    printf("    alpha_transparent_value: %hu\n", alpha.alpha_transparent_value);
+    printf("    alpha_opaque_value: %hu\n", alpha.alpha_opaque_value);
+    printf("    alpha_color_description: %s\n",
+           format_color_properties(alpha.alpha_color_description).c_str());
+  }
+}
+
+void print_depth_information(const DepthInformation &depth) {
+  printf("    z_near: %s\n",
+         format_depth_representation_element(depth.z_near).c_str());
+  printf("    z_far: %s\n",
+         format_depth_representation_element(depth.z_far).c_str());
+  printf("    d_min: %s\n",
+         format_depth_representation_element(depth.d_min).c_str());
+  printf("    d_max: %s\n",
+         format_depth_representation_element(depth.d_max).c_str());
+  printf("    depth_representation_type: %d\n",
+         depth.depth_representation_type);
+  printf("    disparity_ref_view_id: %d\n", depth.disparity_ref_view_id);
+}
+
 }  // namespace
 
 double depth_representation_element_to_double(
@@ -881,45 +954,21 @@
     printf("  layer_color_description: %s\n",
            format_color_properties(layer.layer_color_description).c_str());
     if (layer.layer_type == MULTILAYER_LAYER_TYPE_ALPHA) {
-      printf("  alpha:\n");
-      printf("    alpha_use_idc: %d\n", layer.global_alpha_info.alpha_use_idc);
-      printf("    alpha_simple_flag: %d\n",
-             layer.global_alpha_info.alpha_simple_flag);
-      if (!layer.global_alpha_info.alpha_simple_flag) {
-        printf("    alpha_bit_depth: %d\n",
-               layer.global_alpha_info.alpha_bit_depth);
-        printf("    alpha_clip_idc: %d\n",
-               layer.global_alpha_info.alpha_clip_idc);
-        printf("    alpha_incr_flag: %d\n",
-               layer.global_alpha_info.alpha_incr_flag);
-        printf("    alpha_transparent_value: %hu\n",
-               layer.global_alpha_info.alpha_transparent_value);
-        printf("    alpha_opaque_value: %hu\n",
-               layer.global_alpha_info.alpha_opaque_value);
-        printf("    alpha_color_description: %s\n",
-               format_color_properties(
-                   layer.global_alpha_info.alpha_color_description)
-                   .c_str());
+      if (layer.layer_metadata_scope >= SCOPE_GLOBAL) {
+        printf("  global alpha:\n");
+        print_alpha_information(layer.alpha);
+      }
+      for (const FrameLocalMetadata &local_metadata : layer.local_metadata) {
+        printf("  local alpha for frame %ld:\n", local_metadata.frame_idx);
+        print_alpha_information(local_metadata.alpha);
       }
     } else if (layer.layer_type == MULTILAYER_LAYER_TYPE_DEPTH) {
-      printf("  depth:\n");
-      printf("    z_near: %s\n",
-             format_depth_representation_element(layer.global_depth_info.z_near)
-                 .c_str());
-      printf("    z_far: %s\n",
-             format_depth_representation_element(layer.global_depth_info.z_far)
-                 .c_str());
-      printf("    d_min: %s\n",
-             format_depth_representation_element(layer.global_depth_info.d_min)
-                 .c_str());
-      printf("    d_max: %s\n",
-             format_depth_representation_element(layer.global_depth_info.d_max)
-                 .c_str());
-      printf("    depth_representation_type: %d\n",
-             layer.global_depth_info.depth_representation_type);
-      printf("    disparity_ref_view_id: %d\n",
-             layer.global_depth_info.disparity_ref_view_id);
-      printf("\n");
+      printf("  global depth:\n");
+      print_depth_information(layer.depth);
+      for (const FrameLocalMetadata &local_metadata : layer.local_metadata) {
+        printf("  local depth for frame %ld:\n", local_metadata.frame_idx);
+        print_depth_information(local_metadata.depth);
+      }
     }
   }
   printf("\n");
diff --git a/examples/multilayer_metadata.h b/examples/multilayer_metadata.h
index 392f1c5..1ac1c3c 100644
--- a/examples/multilayer_metadata.h
+++ b/examples/multilayer_metadata.h
@@ -61,7 +61,7 @@
   std::pair<DepthRepresentationElement, bool> z_far;
   std::pair<DepthRepresentationElement, bool> d_min;
   std::pair<DepthRepresentationElement, bool> d_max;
-  uint8_t depth_representation_type;  // [0, 15]
+  uint8_t depth_representation_type;  // [0, 2]. Values 3 to 15 are reserved.
   // Only relevant if d_min or d_max are present.
   uint8_t disparity_ref_view_id;  // [0, 3]
 };
@@ -110,6 +110,14 @@
   // 4 to 7 are reserved.
 };
 
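+// Alpha or depth metadata that applies to a single frame of a layer.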
+struct FrameLocalMetadata {
+  long frame_idx;
+  // Relevant for MULTILAYER_LAYER_TYPE_ALPHA with scope != SCOPE_GLOBAL.
+  AlphaInformation alpha;
+  // Relevant for MULTILAYER_LAYER_TYPE_DEPTH with scope != SCOPE_GLOBAL.
+  DepthInformation depth;
+};
+
 struct LayerMetadata {
   LayerType layer_type;  // [0, 31]
   bool luma_plane_only_flag;
@@ -121,9 +129,12 @@
   std::pair<ColorProperties, bool> layer_color_description;
 
   // Relevant for MULTILAYER_LAYER_TYPE_ALPHA with scope >= SCOPE_GLOBAL.
-  AlphaInformation global_alpha_info;
+  AlphaInformation alpha;
   // Relevant for MULTILAYER_LAYER_TYPE_DEPTH with scope >= SCOPE_GLOBAL.
-  DepthInformation global_depth_info;
+  DepthInformation depth;
+
+  // Relevant when scope != SCOPE_GLOBAL.
+  std::vector<FrameLocalMetadata> local_metadata;
 };
 
 struct MultilayerMetadata {
diff --git a/examples/svc_encoder_rtc.cc b/examples/svc_encoder_rtc.cc
index 10c2102..1152bb0 100644
--- a/examples/svc_encoder_rtc.cc
+++ b/examples/svc_encoder_rtc.cc
@@ -1463,8 +1463,55 @@
   }
 }
 
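+// Serializes the fields of an AlphaInformation struct into the bit buffer,
+// zero-padding to the next byte boundary where required.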
+static void write_alpha_information(
+    struct aom_write_bit_buffer *buffer,
+    const libaom_examples::AlphaInformation &alpha_info) {
+  write_literal(buffer, alpha_info.alpha_use_idc, 2);
+  write_literal(buffer, alpha_info.alpha_simple_flag, 1);
+  if (!alpha_info.alpha_simple_flag) {
+    write_literal(buffer, alpha_info.alpha_bit_depth, 3, /*offset=*/8);
+    write_literal(buffer, alpha_info.alpha_clip_idc, 2);
+    write_literal(buffer, alpha_info.alpha_incr_flag, 1);
+    write_literal(buffer, alpha_info.alpha_transparent_value,
+                  alpha_info.alpha_bit_depth + 1);
+    write_literal(buffer, alpha_info.alpha_opaque_value,
+                  alpha_info.alpha_bit_depth + 1);
+    if (buffer->bit_offset % 8 != 0) {
+      // ai_byte_alignment_bits
+      write_literal(buffer, 0, 8 - (buffer->bit_offset % 8));
+    }
+    assert(buffer->bit_offset % 8 == 0);
+
+    write_literal(buffer, 0, 6);  // ai_reserved_6bits
+    write_color_properties(buffer, alpha_info.alpha_color_description);
+  } else {
+    write_literal(buffer, 0, 5);  // ai_reserved_5bits
+  }
+}
+
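+// Serializes the fields of a DepthInformation struct into the bit buffer,
+// zero-padding to the next byte boundary at the end.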
+static void write_depth_information(
+    struct aom_write_bit_buffer *buffer,
+    const libaom_examples::DepthInformation &depth_info) {
+  write_literal(buffer, depth_info.z_near.second, 1);
+  write_literal(buffer, depth_info.z_far.second, 1);
+  write_literal(buffer, depth_info.d_min.second, 1);
+  write_literal(buffer, depth_info.d_max.second, 1);
+  write_literal(buffer, depth_info.depth_representation_type, 4);
+  if (depth_info.d_min.second || depth_info.d_max.second) {
+    write_literal(buffer, depth_info.disparity_ref_view_id, 2);
+  }
+  write_depth_representation_element(buffer, depth_info.z_near);
+  write_depth_representation_element(buffer, depth_info.z_far);
+  write_depth_representation_element(buffer, depth_info.d_min);
+  write_depth_representation_element(buffer, depth_info.d_max);
+  if (buffer->bit_offset % 8 != 0) {
+    write_literal(buffer, 0, 8 - (buffer->bit_offset % 8));
+  }
+}
+
 static void add_multilayer_metadata(
-    aom_image_t *frame, const libaom_examples::MultilayerMetadata &multilayer) {
+    aom_image_t *frame, const libaom_examples::MultilayerMetadata &multilayer,
+    int frame_idx, int spatial_id) {
   // Large enough buffer for the multilayer metadata.
   // Each layer's metadata is less than 100 bytes and there are at most 4
   // layers.
@@ -1506,51 +1553,12 @@
 
     if (layer.layer_type == libaom_examples::MULTILAYER_LAYER_TYPE_ALPHA &&
         layer.layer_metadata_scope >= libaom_examples::SCOPE_GLOBAL) {
-      const libaom_examples::AlphaInformation &alpha_info =
-          layer.global_alpha_info;
-      write_literal(&buffer, alpha_info.alpha_use_idc, 2);
-      write_literal(&buffer, alpha_info.alpha_simple_flag, 1);
-      if (!alpha_info.alpha_simple_flag) {
-        write_literal(&buffer, alpha_info.alpha_bit_depth, 3, /*offset=*/8);
-        write_literal(&buffer, alpha_info.alpha_clip_idc, 2);
-        write_literal(&buffer, alpha_info.alpha_incr_flag, 1);
-        write_literal(&buffer, alpha_info.alpha_transparent_value,
-                      alpha_info.alpha_bit_depth + 1);
-        write_literal(&buffer, alpha_info.alpha_opaque_value,
-                      alpha_info.alpha_bit_depth + 1);
-        if (buffer.bit_offset % 8 != 0) {
-          // ai_byte_alignment_bits
-          write_literal(&buffer, 0, 8 - (buffer.bit_offset % 8));
-        }
-        assert(buffer.bit_offset % 8 == 0);
-
-        write_literal(&buffer, 0, 6);  // ai_reserved_6bits
-        write_color_properties(&buffer, alpha_info.alpha_color_description);
-      } else {
-        write_literal(&buffer, 0, 5);  // ai_reserved_5bits
-      }
-
+      write_alpha_information(&buffer, layer.alpha);
       assert(buffer.bit_offset % 8 == 0);
     } else if (layer.layer_type ==
                    libaom_examples::MULTILAYER_LAYER_TYPE_DEPTH &&
                layer.layer_metadata_scope >= libaom_examples::SCOPE_GLOBAL) {
-      const libaom_examples::DepthInformation &depth_info =
-          layer.global_depth_info;
-      write_literal(&buffer, depth_info.z_near.second, 1);
-      write_literal(&buffer, depth_info.z_far.second, 1);
-      write_literal(&buffer, depth_info.d_min.second, 1);
-      write_literal(&buffer, depth_info.d_max.second, 1);
-      write_literal(&buffer, depth_info.depth_representation_type, 4);
-      if (depth_info.d_min.second || depth_info.d_max.second) {
-        write_literal(&buffer, depth_info.disparity_ref_view_id, 2);
-      }
-      write_depth_representation_element(&buffer, depth_info.z_near);
-      write_depth_representation_element(&buffer, depth_info.z_far);
-      write_depth_representation_element(&buffer, depth_info.d_min);
-      write_depth_representation_element(&buffer, depth_info.d_max);
-      if (buffer.bit_offset % 8 != 0) {
-        write_literal(&buffer, 0, 8 - (buffer.bit_offset % 8));
-      }
+      write_depth_information(&buffer, layer.depth);
       assert(buffer.bit_offset % 8 == 0);
     }
 
@@ -1572,6 +1580,36 @@
                            AOM_MIF_KEY_FRAME)) {
     die("Error: Failed to add metadata\n");
   }
+
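+  // Attach frame-local alpha or depth metadata for the current spatial layer
+  // if an entry matching this frame index is present.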
+  if ((int)multilayer.layers.size() > spatial_id) {
+    const libaom_examples::LayerMetadata &layer = multilayer.layers[spatial_id];
+    for (const libaom_examples::FrameLocalMetadata &local_metadata :
+         layer.local_metadata) {
+      if (local_metadata.frame_idx == frame_idx) {
+        if (layer.layer_type == libaom_examples::MULTILAYER_LAYER_TYPE_ALPHA) {
+          buffer = { data.data(), 0 };
+          write_alpha_information(&buffer, local_metadata.alpha);
+          if (aom_img_add_metadata(frame,
+                                   34 /*METADATA_TYPE_ALPHA_INFORMATION*/,
+                                   buffer.bit_buffer, buffer.bit_offset / 8,
+                                   AOM_MIF_ANY_FRAME_LAYER_SPECIFIC)) {
+            die("Error: Failed to add metadata\n");
+          }
+        } else if (layer.layer_type ==
+                   libaom_examples::MULTILAYER_LAYER_TYPE_DEPTH) {
+          buffer = { data.data(), 0 };
+          write_depth_information(&buffer, local_metadata.depth);
+          if (aom_img_add_metadata(frame,
+                                   35 /*METADATA_TYPE_DEPTH_INFORMATION*/,
+                                   buffer.bit_buffer, buffer.bit_offset / 8,
+                                   AOM_MIF_ANY_FRAME_LAYER_SPECIFIC)) {
+            die("Error: Failed to add metadata\n");
+          }
+        }
+        break;
+      }
+    }
+  }
 }
 
 #if CONFIG_AV1_DECODER
@@ -1637,7 +1675,7 @@
 }
 #endif  // CONFIG_AV1_DECODER
 
-struct psnr_stats {
+struct PsnrStats {
   // The second element of these arrays is reserved for high bitdepth.
   uint64_t psnr_sse_total[2];
   uint64_t psnr_samples_total[2];
@@ -1645,19 +1683,22 @@
   int psnr_count[2];
 };
 
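+// Prints the overall and per-plane average PSNR for each spatial layer that
+// produced PSNR packets.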
-static void show_psnr(struct psnr_stats *psnr_stream, double peak) {
-  double ovpsnr;
+static void show_psnr(struct PsnrStats *psnr_stream, double peak,
+                      int num_layers) {
+  for (int sl = 0; sl < num_layers; ++sl) {
+    if (!psnr_stream[sl].psnr_count[0]) continue;
 
-  if (!psnr_stream->psnr_count[0]) return;
+    fprintf(stderr, "\nPSNR (Layer %d, Overall/Avg/Y/U/V)", sl);
+    const double ovpsnr =
+        sse_to_psnr((double)psnr_stream[sl].psnr_samples_total[0], peak,
+                    (double)psnr_stream[sl].psnr_sse_total[0]);
+    fprintf(stderr, " %.3f", ovpsnr);
 
-  fprintf(stderr, "\nPSNR (Overall/Avg/Y/U/V)");
-  ovpsnr = sse_to_psnr((double)psnr_stream->psnr_samples_total[0], peak,
-                       (double)psnr_stream->psnr_sse_total[0]);
-  fprintf(stderr, " %.3f", ovpsnr);
-
-  for (int i = 0; i < 4; i++) {
-    fprintf(stderr, " %.3f",
-            psnr_stream->psnr_totals[0][i] / psnr_stream->psnr_count[0]);
+    for (int i = 0; i < 4; i++) {
+      fprintf(
+          stderr, " %.3f",
+          psnr_stream[sl].psnr_totals[0][i] / psnr_stream[sl].psnr_count[0]);
+    }
   }
   fprintf(stderr, "\n");
 }
@@ -2095,7 +2136,7 @@
   }
 
   frame_avail = 1;
-  struct psnr_stats psnr_stream;
+  struct PsnrStats psnr_stream[MAX_NUM_SPATIAL_LAYERS];
   memset(&psnr_stream, 0, sizeof(psnr_stream));
   while (frame_avail || got_data) {
     struct aom_usec_timer timer;
@@ -2133,7 +2174,7 @@
                             &ref_frame_comp_pred);
         }
         if (app_input.multilayer_metadata_file != NULL) {
-          add_multilayer_metadata(&raw, multilayer_metadata);
+          add_multilayer_metadata(&raw, multilayer_metadata, frame_cnt, slx);
         }
         // Set the speed per layer.
         if (test_speed_per_layer) {
@@ -2376,12 +2417,17 @@
             break;
           case AOM_CODEC_PSNR_PKT:
             if (app_input.show_psnr) {
-              psnr_stream.psnr_sse_total[0] += pkt->data.psnr.sse[0];
-              psnr_stream.psnr_samples_total[0] += pkt->data.psnr.samples[0];
-              for (int plane = 0; plane < 4; plane++) {
-                psnr_stream.psnr_totals[0][plane] += pkt->data.psnr.psnr[plane];
+              const int sl = layer_id.spatial_layer_id;
+              const int hbd =
+                  (cfg.g_input_bit_depth > 8 || cfg.g_bit_depth > AOM_BITS_8);
+              psnr_stream[sl].psnr_sse_total[hbd] += pkt->data.psnr.sse[0];
+              psnr_stream[sl].psnr_samples_total[hbd] +=
+                  pkt->data.psnr.samples[0];
+              for (i = 0; i < 4; i++) {
+                psnr_stream[sl].psnr_totals[hbd][i] += pkt->data.psnr.psnr[i];
               }
-              psnr_stream.psnr_count[0]++;
+              psnr_stream[sl].psnr_count[hbd]++;
             }
             break;
           default: break;
@@ -2432,7 +2478,10 @@
          1000000 * (double)frame_cnt / (double)cx_time);
 
   if (app_input.show_psnr) {
-    show_psnr(&psnr_stream, 255.0);
+    const int show_psnr_hbd =
+        (cfg.g_input_bit_depth > 8 || cfg.g_bit_depth > AOM_BITS_8);
+    show_psnr(psnr_stream, (double)((1 << (show_psnr_hbd ? 12 : 8)) - 1),
+              ss_number_layers);
   }
 
   if (aom_codec_destroy(&codec)) die_codec(&codec, "Failed to destroy encoder");
diff --git a/test/allintra_end_to_end_test.cc b/test/allintra_end_to_end_test.cc
index ba0f0a9..a33aaa2 100644
--- a/test/allintra_end_to_end_test.cc
+++ b/test/allintra_end_to_end_test.cc
@@ -24,13 +24,13 @@
 
 const unsigned int kFrames = 20;
 const int kBitrate = 500;
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/altref_test.cc b/test/altref_test.cc
index 354a5a8..ee1a347 100644
--- a/test/altref_test.cc
+++ b/test/altref_test.cc
@@ -15,14 +15,14 @@
 #include "test/i420_video_source.h"
 #include "test/util.h"
 namespace {
-typedef struct {
+struct AltRefTestParams {
   const unsigned int min_kf_dist;
   const unsigned int max_kf_dist;
   const unsigned int min_gf_interval;
   const unsigned int max_gf_interval;
   const unsigned int lag_in_frames;
   libaom_test::TestMode encoding_mode;
-} AltRefTestParams;
+};
 
 static const AltRefTestParams TestParams[] = {
   { 0, 10, 4, 8, 10, ::libaom_test::kOnePassGood },
@@ -113,11 +113,11 @@
                            ::testing::ValuesIn(TestParams),
                            ::testing::Values(AOM_Q, AOM_VBR, AOM_CBR, AOM_CQ));
 
-typedef struct {
+struct gfIntervalParam {
   const ::libaom_test::TestMode encoding_mode;
   const unsigned int min_gf_interval;
   const unsigned int max_gf_interval;
-} gfIntervalParam;
+};
 
 const gfIntervalParam gfTestParams[] = {
   // single pass
diff --git a/test/arf_freq_test.cc b/test/arf_freq_test.cc
index 1236ce2..d01eaed 100644
--- a/test/arf_freq_test.cc
+++ b/test/arf_freq_test.cc
@@ -28,7 +28,7 @@
 #define ARF_NOT_SEEN 1000001
 #define ARF_SEEN_ONCE 1000000
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int width;
   unsigned int height;
@@ -38,12 +38,12 @@
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
-typedef struct {
+struct TestEncodeParam {
   libaom_test::TestMode mode;
   int cpu_used;
-} TestEncodeParam;
+};
 
 const TestVideoParam kTestVectors[] = {
   // artificially increase framerate to trigger default check
diff --git a/test/av1_convolve_scale_test.cc b/test/av1_convolve_scale_test.cc
index a8344fe..8be4b87 100644
--- a/test/av1_convolve_scale_test.cc
+++ b/test/av1_convolve_scale_test.cc
@@ -175,7 +175,7 @@
   }
 }
 
-typedef tuple<int, int> BlockDimension;
+using BlockDimension = tuple<int, int>;
 
 struct BaseParams {
   BaseParams(BlockDimension dimensions) : dims(dimensions) {}
@@ -321,19 +321,19 @@
   ConvolveParams convolve_params_;
 };
 
-typedef tuple<int, int> BlockDimension;
+using BlockDimension = tuple<int, int>;
 
-typedef void (*LowbdConvolveFunc)(const uint8_t *src, int src_stride,
-                                  uint8_t *dst, int dst_stride, int w, int h,
-                                  const InterpFilterParams *filter_params_x,
-                                  const InterpFilterParams *filter_params_y,
-                                  const int subpel_x_qn, const int x_step_qn,
-                                  const int subpel_y_qn, const int y_step_qn,
-                                  ConvolveParams *conv_params);
+using LowbdConvolveFunc = void (*)(const uint8_t *src, int src_stride,
+                                   uint8_t *dst, int dst_stride, int w, int h,
+                                   const InterpFilterParams *filter_params_x,
+                                   const InterpFilterParams *filter_params_y,
+                                   const int subpel_x_qn, const int x_step_qn,
+                                   const int subpel_y_qn, const int y_step_qn,
+                                   ConvolveParams *conv_params);
 
 // Test parameter list:
 //  <tst_fun, dims, avg>
-typedef tuple<LowbdConvolveFunc, BlockDimension> LowBDParams;
+using LowBDParams = tuple<LowbdConvolveFunc, BlockDimension>;
 
 class LowBDConvolveScaleTest
     : public ConvolveScaleTestBase<uint8_t>,
@@ -417,17 +417,17 @@
 #endif  // HAVE_SSE4_1
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef void (*HighbdConvolveFunc)(const uint16_t *src, int src_stride,
-                                   uint16_t *dst, int dst_stride, int w, int h,
-                                   const InterpFilterParams *filter_params_x,
-                                   const InterpFilterParams *filter_params_y,
-                                   const int subpel_x_qn, const int x_step_qn,
-                                   const int subpel_y_qn, const int y_step_qn,
-                                   ConvolveParams *conv_params, int bd);
+using HighbdConvolveFunc = void (*)(const uint16_t *src, int src_stride,
+                                    uint16_t *dst, int dst_stride, int w, int h,
+                                    const InterpFilterParams *filter_params_x,
+                                    const InterpFilterParams *filter_params_y,
+                                    const int subpel_x_qn, const int x_step_qn,
+                                    const int subpel_y_qn, const int y_step_qn,
+                                    ConvolveParams *conv_params, int bd);
 
 // Test parameter list:
 //  <tst_fun, dims, avg, bd>
-typedef tuple<HighbdConvolveFunc, BlockDimension, int> HighBDParams;
+using HighBDParams = tuple<HighbdConvolveFunc, BlockDimension, int>;
 
 class HighBDConvolveScaleTest
     : public ConvolveScaleTestBase<uint16_t>,
diff --git a/test/av1_convolve_test.cc b/test/av1_convolve_test.cc
index b14c6b7..5392e5b 100644
--- a/test/av1_convolve_test.cc
+++ b/test/av1_convolve_test.cc
@@ -318,11 +318,11 @@
 ////////////////////////////////////////////////////////
 // Single reference convolve-x functions (low bit-depth)
 ////////////////////////////////////////////////////////
-typedef void (*convolve_x_func)(const uint8_t *src, int src_stride,
-                                uint8_t *dst, int dst_stride, int w, int h,
-                                const InterpFilterParams *filter_params_x,
-                                const int subpel_x_qn,
-                                ConvolveParams *conv_params);
+using convolve_x_func = void (*)(const uint8_t *src, int src_stride,
+                                 uint8_t *dst, int dst_stride, int w, int h,
+                                 const InterpFilterParams *filter_params_x,
+                                 const int subpel_x_qn,
+                                 ConvolveParams *conv_params);
 
 class AV1ConvolveXTest : public AV1ConvolveTest<convolve_x_func> {
  public:
@@ -535,10 +535,10 @@
 /////////////////////////////////////////////////////////
 // Single reference convolve-x functions (high bit-depth)
 /////////////////////////////////////////////////////////
-typedef void (*highbd_convolve_x_func)(
-    const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
-    int h, const InterpFilterParams *filter_params_x, const int subpel_x_qn,
-    ConvolveParams *conv_params, int bd);
+using highbd_convolve_x_func =
+    void (*)(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride,
+             int w, int h, const InterpFilterParams *filter_params_x,
+             const int subpel_x_qn, ConvolveParams *conv_params, int bd);
 
 class AV1ConvolveXHighbdTest : public AV1ConvolveTest<highbd_convolve_x_func> {
  public:
@@ -754,10 +754,10 @@
 ////////////////////////////////////////////////////////
 // Single reference convolve-y functions (low bit-depth)
 ////////////////////////////////////////////////////////
-typedef void (*convolve_y_func)(const uint8_t *src, int src_stride,
-                                uint8_t *dst, int dst_stride, int w, int h,
-                                const InterpFilterParams *filter_params_y,
-                                const int subpel_y_qn);
+using convolve_y_func = void (*)(const uint8_t *src, int src_stride,
+                                 uint8_t *dst, int dst_stride, int w, int h,
+                                 const InterpFilterParams *filter_params_y,
+                                 const int subpel_y_qn);
 
 class AV1ConvolveYTest : public AV1ConvolveTest<convolve_y_func> {
  public:
@@ -951,10 +951,10 @@
 /////////////////////////////////////////////////////////
 // Single reference convolve-y functions (high bit-depth)
 /////////////////////////////////////////////////////////
-typedef void (*highbd_convolve_y_func)(
-    const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
-    int h, const InterpFilterParams *filter_params_y, const int subpel_y_qn,
-    int bd);
+using highbd_convolve_y_func =
+    void (*)(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride,
+             int w, int h, const InterpFilterParams *filter_params_y,
+             const int subpel_y_qn, int bd);
 
 class AV1ConvolveYHighbdTest : public AV1ConvolveTest<highbd_convolve_y_func> {
  public:
@@ -1151,9 +1151,9 @@
 //////////////////////////////////////////////////////////////
 // Single reference convolve-copy functions (low bit-depth)
 //////////////////////////////////////////////////////////////
-typedef void (*convolve_copy_func)(const uint8_t *src, ptrdiff_t src_stride,
-                                   uint8_t *dst, ptrdiff_t dst_stride, int w,
-                                   int h);
+using convolve_copy_func = void (*)(const uint8_t *src, ptrdiff_t src_stride,
+                                    uint8_t *dst, ptrdiff_t dst_stride, int w,
+                                    int h);
 
 class AV1ConvolveCopyTest : public AV1ConvolveTest<convolve_copy_func> {
  public:
@@ -1195,9 +1195,9 @@
 ///////////////////////////////////////////////////////////////
 // Single reference convolve-copy functions (high bit-depth)
 ///////////////////////////////////////////////////////////////
-typedef void (*highbd_convolve_copy_func)(const uint16_t *src,
-                                          ptrdiff_t src_stride, uint16_t *dst,
-                                          ptrdiff_t dst_stride, int w, int h);
+using highbd_convolve_copy_func = void (*)(const uint16_t *src,
+                                           ptrdiff_t src_stride, uint16_t *dst,
+                                           ptrdiff_t dst_stride, int w, int h);
 
 class AV1ConvolveCopyHighbdTest
     : public AV1ConvolveTest<highbd_convolve_copy_func> {
@@ -1241,12 +1241,12 @@
 /////////////////////////////////////////////////////////
 // Single reference convolve-2D functions (low bit-depth)
 /////////////////////////////////////////////////////////
-typedef void (*convolve_2d_func)(const uint8_t *src, int src_stride,
-                                 uint8_t *dst, int dst_stride, int w, int h,
-                                 const InterpFilterParams *filter_params_x,
-                                 const InterpFilterParams *filter_params_y,
-                                 const int subpel_x_qn, const int subpel_y_qn,
-                                 ConvolveParams *conv_params);
+using convolve_2d_func = void (*)(const uint8_t *src, int src_stride,
+                                  uint8_t *dst, int dst_stride, int w, int h,
+                                  const InterpFilterParams *filter_params_x,
+                                  const InterpFilterParams *filter_params_y,
+                                  const int subpel_x_qn, const int subpel_y_qn,
+                                  ConvolveParams *conv_params);
 
 class AV1Convolve2DTest : public AV1ConvolveTest<convolve_2d_func> {
  public:
@@ -1483,11 +1483,11 @@
 // Single reference convolve-2d functions (high bit-depth)
 //////////////////////////////////////////////////////////
 
-typedef void (*highbd_convolve_2d_func)(
-    const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride, int w,
-    int h, const InterpFilterParams *filter_params_x,
-    const InterpFilterParams *filter_params_y, const int subpel_x_qn,
-    const int subpel_y_qn, ConvolveParams *conv_params, int bd);
+using highbd_convolve_2d_func =
+    void (*)(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride,
+             int w, int h, const InterpFilterParams *filter_params_x,
+             const InterpFilterParams *filter_params_y, const int subpel_x_qn,
+             const int subpel_y_qn, ConvolveParams *conv_params, int bd);
 
 class AV1Convolve2DHighbdTest
     : public AV1ConvolveTest<highbd_convolve_2d_func> {
@@ -2143,9 +2143,9 @@
 //////////////////////////////////////////////////////
 // Compound convolve-2d-copy functions (low bit-depth)
 //////////////////////////////////////////////////////
-typedef void (*compound_conv_2d_copy_func)(const uint8_t *src, int src_stride,
-                                           uint8_t *dst, int dst_stride, int w,
-                                           int h, ConvolveParams *conv_params);
+using compound_conv_2d_copy_func = void (*)(const uint8_t *src, int src_stride,
+                                            uint8_t *dst, int dst_stride, int w,
+                                            int h, ConvolveParams *conv_params);
 
 class AV1Convolve2DCopyCompoundTest
     : public AV1ConvolveTest<compound_conv_2d_copy_func> {
@@ -2266,11 +2266,9 @@
 ///////////////////////////////////////////////////////
 // Compound convolve-2d-copy functions (high bit-depth)
 ///////////////////////////////////////////////////////
-typedef void (*highbd_compound_conv_2d_copy_func)(const uint16_t *src,
-                                                  int src_stride, uint16_t *dst,
-                                                  int dst_stride, int w, int h,
-                                                  ConvolveParams *conv_params,
-                                                  int bd);
+using highbd_compound_conv_2d_copy_func =
+    void (*)(const uint16_t *src, int src_stride, uint16_t *dst, int dst_stride,
+             int w, int h, ConvolveParams *conv_params, int bd);
 
 class AV1Convolve2DCopyHighbdCompoundTest
     : public AV1ConvolveTest<highbd_compound_conv_2d_copy_func> {
diff --git a/test/av1_external_partition_test.cc b/test/av1_external_partition_test.cc
index 65fa001..1aecb34 100644
--- a/test/av1_external_partition_test.cc
+++ b/test/av1_external_partition_test.cc
@@ -31,11 +31,11 @@
 constexpr int kFrameNum = 8;
 constexpr int kVersion = 1;
 
-typedef struct TestData {
+struct TestData {
   int version = kVersion;
-} TestData;
+};
 
-typedef struct ToyModel {
+struct ToyModel {
   TestData *data;
   aom_ext_part_config_t config;
   aom_ext_part_funcs_t funcs;
@@ -44,7 +44,7 @@
   int frame_width;
   int frame_height;
   BLOCK_SIZE block_size;
-} ToyModel;
+};
 
 // Note:
 // if CONFIG_PARTITION_SEARCH_ORDER = 0, we test APIs designed for the baseline
diff --git a/test/av1_fwd_txfm2d_test.cc b/test/av1_fwd_txfm2d_test.cc
index d694a32..483cf81 100644
--- a/test/av1_fwd_txfm2d_test.cc
+++ b/test/av1_fwd_txfm2d_test.cc
@@ -35,7 +35,7 @@
 
 namespace {
 // tx_type_, tx_size_, max_error_, max_avg_error_
-typedef std::tuple<TX_TYPE, TX_SIZE, double, double> AV1FwdTxfm2dParam;
+using AV1FwdTxfm2dParam = std::tuple<TX_TYPE, TX_SIZE, double, double>;
 
 class AV1FwdTxfm2d : public ::testing::TestWithParam<AV1FwdTxfm2dParam> {
  public:
@@ -238,8 +238,8 @@
   }
 }
 
-typedef void (*lowbd_fwd_txfm_func)(const int16_t *src_diff, tran_low_t *coeff,
-                                    int diff_stride, TxfmParam *txfm_param);
+using lowbd_fwd_txfm_func = void (*)(const int16_t *src_diff, tran_low_t *coeff,
+                                     int diff_stride, TxfmParam *txfm_param);
 
 void AV1FwdTxfm2dMatchTest(TX_SIZE tx_size, lowbd_fwd_txfm_func target_func) {
   const int bd = 8;
@@ -356,7 +356,7 @@
   }
 }
 
-typedef std::tuple<TX_SIZE, lowbd_fwd_txfm_func> LbdFwdTxfm2dParam;
+using LbdFwdTxfm2dParam = std::tuple<TX_SIZE, lowbd_fwd_txfm_func>;
 
 class AV1FwdTxfm2dTest : public ::testing::TestWithParam<LbdFwdTxfm2dParam> {};
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1FwdTxfm2dTest);
@@ -521,8 +521,9 @@
 
 #endif  // HAVE_NEON
 
-typedef void (*Highbd_fwd_txfm_func)(const int16_t *src_diff, tran_low_t *coeff,
-                                     int diff_stride, TxfmParam *txfm_param);
+using Highbd_fwd_txfm_func = void (*)(const int16_t *src_diff,
+                                      tran_low_t *coeff, int diff_stride,
+                                      TxfmParam *txfm_param);
 
 void AV1HighbdFwdTxfm2dMatchTest(TX_SIZE tx_size,
                                  Highbd_fwd_txfm_func target_func) {
@@ -648,7 +649,7 @@
   }
 }
 
-typedef std::tuple<TX_SIZE, Highbd_fwd_txfm_func> HighbdFwdTxfm2dParam;
+using HighbdFwdTxfm2dParam = std::tuple<TX_SIZE, Highbd_fwd_txfm_func>;
 
 class AV1HighbdFwdTxfm2dTest
     : public ::testing::TestWithParam<HighbdFwdTxfm2dParam> {};
diff --git a/test/av1_highbd_iht_test.cc b/test/av1_highbd_iht_test.cc
index 24cf9b0..5da77fc 100644
--- a/test/av1_highbd_iht_test.cc
+++ b/test/av1_highbd_iht_test.cc
@@ -29,11 +29,11 @@
 using libaom_test::ACMRandom;
 using std::tuple;
 
-typedef void (*HbdHtFunc)(const int16_t *input, int32_t *output, int stride,
-                          TX_TYPE tx_type, int bd);
-
-typedef void (*IHbdHtFunc)(const int32_t *coeff, uint16_t *output, int stride,
+using HbdHtFunc = void (*)(const int16_t *input, int32_t *output, int stride,
                            TX_TYPE tx_type, int bd);
+
+using IHbdHtFunc = void (*)(const int32_t *coeff, uint16_t *output, int stride,
+                            TX_TYPE tx_type, int bd);
 static const char *tx_type_name[] = {
   "DCT_DCT",
   "ADST_DCT",
@@ -59,7 +59,7 @@
 //    num_coeffs,
 //    tx_type,
 //    bit_depth>
-typedef tuple<HbdHtFunc, IHbdHtFunc, IHbdHtFunc, int, TX_TYPE, int> IHbdHtParam;
+using IHbdHtParam = tuple<HbdHtFunc, IHbdHtFunc, IHbdHtFunc, int, TX_TYPE, int>;
 
 class AV1HighbdInvHTNxN : public ::testing::TestWithParam<IHbdHtParam> {
  public:
@@ -193,10 +193,10 @@
                          ::testing::ValuesIn(kArrayIhtParam));
 #endif  // HAVE_SSE4_1
 
-typedef void (*HighbdInvTxfm2dFunc)(const int32_t *input, uint8_t *output,
-                                    int stride, const TxfmParam *txfm_param);
+using HighbdInvTxfm2dFunc = void (*)(const int32_t *input, uint8_t *output,
+                                     int stride, const TxfmParam *txfm_param);
 
-typedef std::tuple<const HighbdInvTxfm2dFunc> AV1HighbdInvTxfm2dParam;
+using AV1HighbdInvTxfm2dParam = std::tuple<const HighbdInvTxfm2dFunc>;
 class AV1HighbdInvTxfm2d
     : public ::testing::TestWithParam<AV1HighbdInvTxfm2dParam> {
  public:
diff --git a/test/av1_horz_only_frame_superres_test.cc b/test/av1_horz_only_frame_superres_test.cc
index 20b1a97..d32ebd5 100644
--- a/test/av1_horz_only_frame_superres_test.cc
+++ b/test/av1_horz_only_frame_superres_test.cc
@@ -257,14 +257,14 @@
   TestImage<Pixel> *image_;
 };
 
-typedef void (*LowBDConvolveHorizRsFunc)(const uint8_t *src, int src_stride,
-                                         uint8_t *dst, int dst_stride, int w,
-                                         int h, const int16_t *x_filters,
-                                         const int x0_qn, const int x_step_qn);
+using LowBDConvolveHorizRsFunc = void (*)(const uint8_t *src, int src_stride,
+                                          uint8_t *dst, int dst_stride, int w,
+                                          int h, const int16_t *x_filters,
+                                          const int x0_qn, const int x_step_qn);
 
 // Test parameter list:
 //  <tst_fun_>
-typedef tuple<LowBDConvolveHorizRsFunc> LowBDParams;
+using LowBDParams = tuple<LowBDConvolveHorizRsFunc>;
 
 class LowBDConvolveHorizRSTest
     : public ConvolveHorizRSTestBase<uint8_t>,
@@ -322,15 +322,15 @@
 #endif
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef void (*HighBDConvolveHorizRsFunc)(const uint16_t *src, int src_stride,
-                                          uint16_t *dst, int dst_stride, int w,
-                                          int h, const int16_t *x_filters,
-                                          const int x0_qn, const int x_step_qn,
-                                          int bd);
+using HighBDConvolveHorizRsFunc = void (*)(const uint16_t *src, int src_stride,
+                                           uint16_t *dst, int dst_stride, int w,
+                                           int h, const int16_t *x_filters,
+                                           const int x0_qn, const int x_step_qn,
+                                           int bd);
 
 // Test parameter list:
 //  <tst_fun_, bd_>
-typedef tuple<HighBDConvolveHorizRsFunc, int> HighBDParams;
+using HighBDParams = tuple<HighBDConvolveHorizRsFunc, int>;
 
 class HighBDConvolveHorizRSTest
     : public ConvolveHorizRSTestBase<uint16_t>,
diff --git a/test/av1_inv_txfm1d_test.cc b/test/av1_inv_txfm1d_test.cc
index 156fb40..4220ad8 100644
--- a/test/av1_inv_txfm1d_test.cc
+++ b/test/av1_inv_txfm1d_test.cc
@@ -16,8 +16,6 @@
 #include "av1/common/av1_inv_txfm1d.h"
 #include "av1/encoder/av1_fwd_txfm1d.h"
 
-typedef TX_SIZE TxSize;
-
 using libaom_test::ACMRandom;
 using libaom_test::input_base;
 
@@ -75,7 +73,7 @@
   ASSERT_EQ(NELEMENTS(inv_txfm_func_ls), TX_SIZES);
   for (int i = 0; i < count_test_block; ++i) {
     // choose a random transform to test
-    const TxSize tx_size = static_cast<TxSize>(rnd.Rand8() % TX_SIZES);
+    const TX_SIZE tx_size = static_cast<TX_SIZE>(rnd.Rand8() % TX_SIZES);
     const int txfm_size = txfm_size_ls[tx_size];
     const TxfmFunc inv_txfm_func = inv_txfm_func_ls[tx_size][0];
 
diff --git a/test/av1_inv_txfm2d_test.cc b/test/av1_inv_txfm2d_test.cc
index 1f62daf..c26fe70 100644
--- a/test/av1_inv_txfm2d_test.cc
+++ b/test/av1_inv_txfm2d_test.cc
@@ -38,14 +38,11 @@
 
 using std::vector;
 
-typedef TX_TYPE TxType;
-typedef TX_SIZE TxSize;
-
 namespace {
 
 // AV1InvTxfm2dParam argument list:
 // tx_type_, tx_size_, max_error_, max_avg_error_
-typedef std::tuple<TxType, TxSize, int, double> AV1InvTxfm2dParam;
+using AV1InvTxfm2dParam = std::tuple<TX_TYPE, TX_SIZE, int, double>;
 
 class AV1InvTxfm2d : public ::testing::TestWithParam<AV1InvTxfm2dParam> {
  public:
@@ -91,7 +88,7 @@
         }
         double ref_coeffs[64 * 64] = { 0 };
         ASSERT_LE(txfm2d_size, NELEMENTS(ref_coeffs));
-        ASSERT_EQ(tx_type_, static_cast<TxType>(DCT_DCT));
+        ASSERT_EQ(tx_type_, static_cast<TX_TYPE>(DCT_DCT));
         libaom_test::reference_hybrid_2d(ref_input, ref_coeffs, tx_type_,
                                          tx_size_);
         DECLARE_ALIGNED(16, int32_t, ref_coeffs_int[64 * 64]) = { 0 };
@@ -146,8 +143,8 @@
 
   int max_error_;
   double max_avg_error_;
-  TxType tx_type_;
-  TxSize tx_size_;
+  TX_TYPE tx_type_;
+  TX_SIZE tx_size_;
 };
 
 static const int max_error_ls[TX_SIZES_ALL] = {
@@ -200,8 +197,8 @@
     const int max_error = max_error_ls[s];
     const double avg_error = avg_error_ls[s];
     for (int t = 0; t < TX_TYPES; ++t) {
-      const TxType tx_type = static_cast<TxType>(t);
-      const TxSize tx_size = static_cast<TxSize>(s);
+      const TX_TYPE tx_type = static_cast<TX_TYPE>(t);
+      const TX_SIZE tx_size = static_cast<TX_SIZE>(s);
       if (libaom_test::IsTxSizeTypeValid(tx_size, tx_type)) {
         param_list.push_back(
             AV1InvTxfm2dParam(tx_type, tx_size, max_error, avg_error));
@@ -223,18 +220,18 @@
     int8_t high_range = libaom_test::high_range_arr[bd_idx];
     for (int tx_size = 0; tx_size < TX_SIZES_ALL; ++tx_size) {
       for (int tx_type = 0; tx_type < TX_TYPES; ++tx_type) {
-        if (libaom_test::IsTxSizeTypeValid(static_cast<TxSize>(tx_size),
-                                           static_cast<TxType>(tx_type)) ==
+        if (libaom_test::IsTxSizeTypeValid(static_cast<TX_SIZE>(tx_size),
+                                           static_cast<TX_TYPE>(tx_type)) ==
             false) {
           continue;
         }
         TXFM_2D_FLIP_CFG cfg;
-        av1_get_inv_txfm_cfg(static_cast<TxType>(tx_type),
-                             static_cast<TxSize>(tx_size), &cfg);
+        av1_get_inv_txfm_cfg(static_cast<TX_TYPE>(tx_type),
+                             static_cast<TX_SIZE>(tx_size), &cfg);
         int8_t stage_range_col[MAX_TXFM_STAGE_NUM];
         int8_t stage_range_row[MAX_TXFM_STAGE_NUM];
         av1_gen_inv_stage_range(stage_range_col, stage_range_row, &cfg,
-                                static_cast<TxSize>(tx_size), bd);
+                                static_cast<TX_SIZE>(tx_size), bd);
         libaom_test::txfm_stage_range_check(stage_range_col, cfg.stage_num_col,
                                             cfg.cos_bit_col, low_range,
                                             high_range);
@@ -246,11 +243,11 @@
   }
 }
 
-typedef std::tuple<const LbdInvTxfm2dFunc> AV1LbdInvTxfm2dParam;
+using AV1LbdInvTxfm2dParam = std::tuple<const LbdInvTxfm2dFunc>;
 class AV1LbdInvTxfm2d : public ::testing::TestWithParam<AV1LbdInvTxfm2dParam> {
  public:
   void SetUp() override { target_func_ = GET_PARAM(0); }
-  void RunAV1InvTxfm2dTest(TxType tx_type, TxSize tx_size, int run_times,
+  void RunAV1InvTxfm2dTest(TX_TYPE tx_type, TX_SIZE tx_size, int run_times,
                            int gt_int16 = 0);
 
  private:
@@ -258,7 +255,7 @@
 };
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1LbdInvTxfm2d);
 
-void AV1LbdInvTxfm2d::RunAV1InvTxfm2dTest(TxType tx_type, TxSize tx_size,
+void AV1LbdInvTxfm2d::RunAV1InvTxfm2dTest(TX_TYPE tx_type, TX_SIZE tx_size,
                                           int run_times, int gt_int16) {
   FwdTxfm2dFunc fwd_func_ = libaom_test::fwd_txfm_func_ls[tx_size];
   InvTxfm2dFunc ref_func_ = libaom_test::inv_txfm_func_ls[tx_size];
@@ -340,21 +337,23 @@
 TEST_P(AV1LbdInvTxfm2d, match) {
   for (int j = 0; j < (int)(TX_SIZES_ALL); ++j) {
     for (int i = 0; i < (int)TX_TYPES; ++i) {
-      if (libaom_test::IsTxSizeTypeValid(static_cast<TxSize>(j),
-                                         static_cast<TxType>(i))) {
-        RunAV1InvTxfm2dTest(static_cast<TxType>(i), static_cast<TxSize>(j), 1);
+      if (libaom_test::IsTxSizeTypeValid(static_cast<TX_SIZE>(j),
+                                         static_cast<TX_TYPE>(i))) {
+        RunAV1InvTxfm2dTest(static_cast<TX_TYPE>(i), static_cast<TX_SIZE>(j),
+                            1);
       }
     }
   }
 }
 
 TEST_P(AV1LbdInvTxfm2d, gt_int16) {
-  static const TxType types[] = { DCT_DCT, ADST_DCT, FLIPADST_DCT, IDTX,
-                                  V_DCT,   H_DCT,    H_ADST,       H_FLIPADST };
+  static const TX_TYPE types[] = {
+    DCT_DCT, ADST_DCT, FLIPADST_DCT, IDTX, V_DCT, H_DCT, H_ADST, H_FLIPADST
+  };
   for (int j = 0; j < (int)(TX_SIZES_ALL); ++j) {
-    const TxSize sz = static_cast<TxSize>(j);
+    const TX_SIZE sz = static_cast<TX_SIZE>(j);
     for (uint8_t i = 0; i < sizeof(types) / sizeof(types[0]); ++i) {
-      const TxType tp = types[i];
+      const TX_TYPE tp = types[i];
       if (libaom_test::IsTxSizeTypeValid(sz, tp)) {
         RunAV1InvTxfm2dTest(tp, sz, 1, 1);
       }
@@ -365,9 +364,9 @@
 TEST_P(AV1LbdInvTxfm2d, DISABLED_Speed) {
   for (int j = 1; j < (int)(TX_SIZES_ALL); ++j) {
     for (int i = 0; i < (int)TX_TYPES; ++i) {
-      if (libaom_test::IsTxSizeTypeValid(static_cast<TxSize>(j),
-                                         static_cast<TxType>(i))) {
-        RunAV1InvTxfm2dTest(static_cast<TxType>(i), static_cast<TxSize>(j),
+      if (libaom_test::IsTxSizeTypeValid(static_cast<TX_SIZE>(j),
+                                         static_cast<TX_TYPE>(i))) {
+        RunAV1InvTxfm2dTest(static_cast<TX_TYPE>(i), static_cast<TX_SIZE>(j),
                             10000000);
       }
     }
@@ -377,7 +376,7 @@
 #if HAVE_SSSE3
 extern "C" void av1_lowbd_inv_txfm2d_add_ssse3(const int32_t *input,
                                                uint8_t *output, int stride,
-                                               TxType tx_type, TxSize tx_size,
+                                               TX_TYPE tx_type, TX_SIZE tx_size,
                                                int eob);
 INSTANTIATE_TEST_SUITE_P(SSSE3, AV1LbdInvTxfm2d,
                          ::testing::Values(av1_lowbd_inv_txfm2d_add_ssse3));
@@ -386,7 +385,7 @@
 #if HAVE_AVX2
 extern "C" void av1_lowbd_inv_txfm2d_add_avx2(const int32_t *input,
                                               uint8_t *output, int stride,
-                                              TxType tx_type, TxSize tx_size,
+                                              TX_TYPE tx_type, TX_SIZE tx_size,
                                               int eob);
 
 INSTANTIATE_TEST_SUITE_P(AVX2, AV1LbdInvTxfm2d,
diff --git a/test/av1_k_means_test.cc b/test/av1_k_means_test.cc
index db738472..58dcb01 100644
--- a/test/av1_k_means_test.cc
+++ b/test/av1_k_means_test.cc
@@ -28,20 +28,20 @@
 #include "test/util.h"
 
 namespace AV1Kmeans {
-typedef void (*av1_calc_indices_dim1_func)(const int16_t *data,
-                                           const int16_t *centroids,
-                                           uint8_t *indices,
-                                           int64_t *total_dist, int n, int k);
-typedef void (*av1_calc_indices_dim2_func)(const int16_t *data,
-                                           const int16_t *centroids,
-                                           uint8_t *indices,
-                                           int64_t *total_dist, int n, int k);
+using av1_calc_indices_dim1_func = void (*)(const int16_t *data,
+                                            const int16_t *centroids,
+                                            uint8_t *indices,
+                                            int64_t *total_dist, int n, int k);
+using av1_calc_indices_dim2_func = void (*)(const int16_t *data,
+                                            const int16_t *centroids,
+                                            uint8_t *indices,
+                                            int64_t *total_dist, int n, int k);
 
-typedef std::tuple<av1_calc_indices_dim1_func, BLOCK_SIZE>
-    av1_calc_indices_dim1Param;
+using av1_calc_indices_dim1Param =
+    std::tuple<av1_calc_indices_dim1_func, BLOCK_SIZE>;
 
-typedef std::tuple<av1_calc_indices_dim2_func, BLOCK_SIZE>
-    av1_calc_indices_dim2Param;
+using av1_calc_indices_dim2Param =
+    std::tuple<av1_calc_indices_dim2_func, BLOCK_SIZE>;
 
 class AV1KmeansTest1
     : public ::testing::TestWithParam<av1_calc_indices_dim1Param> {
diff --git a/test/av1_key_value_api_test.cc b/test/av1_key_value_api_test.cc
index 03cdeee..d3e9dc1 100644
--- a/test/av1_key_value_api_test.cc
+++ b/test/av1_key_value_api_test.cc
@@ -21,7 +21,7 @@
 #include "gtest/gtest.h"
 
 namespace {
-typedef std::tuple<const char *, const char *> KeyValParam;
+using KeyValParam = std::tuple<const char *, const char *>;
 
 class BaseKeyValAPI : public testing::Test {
  public:
diff --git a/test/av1_nn_predict_test.cc b/test/av1_nn_predict_test.cc
index 38981be..b05b83c 100644
--- a/test/av1_nn_predict_test.cc
+++ b/test/av1_nn_predict_test.cc
@@ -24,11 +24,11 @@
 #include "test/acm_random.h"
 
 namespace {
-typedef void (*NnPredict_Func)(const float *const input_nodes,
-                               const NN_CONFIG *const nn_config,
-                               int reduce_prec, float *const output);
+using NnPredict_Func = void (*)(const float *const input_nodes,
+                                const NN_CONFIG *const nn_config,
+                                int reduce_prec, float *const output);
 
-typedef std::tuple<const NnPredict_Func> NnPredictTestParam;
+using NnPredictTestParam = std::tuple<const NnPredict_Func>;
 
 const float epsilon = 1e-3f;  // Error threshold for functional equivalence
 
diff --git a/test/av1_quantize_test.cc b/test/av1_quantize_test.cc
index 101186b..4f0c178 100644
--- a/test/av1_quantize_test.cc
+++ b/test/av1_quantize_test.cc
@@ -22,7 +22,7 @@
 
 namespace {
 
-typedef void (*QuantizeFpFunc)(
+using QuantizeFpFunc = void (*)(
     const tran_low_t *coeff_ptr, intptr_t count, const int16_t *zbin_ptr,
     const int16_t *round_ptr, const int16_t *quant_ptr,
     const int16_t *quant_shift_ptr, tran_low_t *qcoeff_ptr,
diff --git a/test/av1_round_shift_array_test.cc b/test/av1_round_shift_array_test.cc
index e9731b1..39e99f8 100644
--- a/test/av1_round_shift_array_test.cc
+++ b/test/av1_round_shift_array_test.cc
@@ -25,7 +25,7 @@
 
 namespace AV1CompRoundShift {
 
-typedef void (*comp_round_shift_array_func)(int32_t *arr, int size, int bit);
+using comp_round_shift_array_func = void (*)(int32_t *arr, int size, int bit);
 
 #if HAVE_SSE4_1 || HAVE_NEON
 const int kValidBitCheck[] = {
@@ -33,8 +33,8 @@
 };
 #endif  // HAVE_SSE4_1 || HAVE_NEON
 
-typedef std::tuple<comp_round_shift_array_func, BLOCK_SIZE, int>
-    CompRoundShiftParam;
+using CompRoundShiftParam =
+    std::tuple<comp_round_shift_array_func, BLOCK_SIZE, int>;
 
 class AV1CompRoundShiftTest
     : public ::testing::TestWithParam<CompRoundShiftParam> {
diff --git a/test/av1_temporal_denoiser_test.cc b/test/av1_temporal_denoiser_test.cc
index 35a682a..0a80381 100644
--- a/test/av1_temporal_denoiser_test.cc
+++ b/test/av1_temporal_denoiser_test.cc
@@ -32,12 +32,12 @@
 
 const int kNumPixels = 128 * 128;
 
-typedef int (*Av1DenoiserFilterFunc)(const uint8_t *sig, int sig_stride,
-                                     const uint8_t *mc_avg, int mc_avg_stride,
-                                     uint8_t *avg, int avg_stride,
-                                     int increase_denoising, BLOCK_SIZE bs,
-                                     int motion_magnitude);
-typedef std::tuple<Av1DenoiserFilterFunc, BLOCK_SIZE> AV1DenoiserTestParam;
+using Av1DenoiserFilterFunc = int (*)(const uint8_t *sig, int sig_stride,
+                                      const uint8_t *mc_avg, int mc_avg_stride,
+                                      uint8_t *avg, int avg_stride,
+                                      int increase_denoising, BLOCK_SIZE bs,
+                                      int motion_magnitude);
+using AV1DenoiserTestParam = std::tuple<Av1DenoiserFilterFunc, BLOCK_SIZE>;
 
 class AV1DenoiserTest
     : public ::testing::Test,
diff --git a/test/av1_txfm_test.h b/test/av1_txfm_test.h
index a7f4a2b..e50dace 100644
--- a/test/av1_txfm_test.h
+++ b/test/av1_txfm_test.h
@@ -76,12 +76,12 @@
 template <typename Type>
 void fliplrud(Type *dest, int width, int height, int stride);
 
-typedef void (*TxfmFunc)(const int32_t *in, int32_t *out, const int8_t cos_bit,
-                         const int8_t *range_bit);
+using TxfmFunc = void (*)(const int32_t *in, int32_t *out, const int8_t cos_bit,
+                          const int8_t *range_bit);
 
-typedef void (*InvTxfm2dFunc)(const int32_t *, uint16_t *, int, TX_TYPE, int);
-typedef void (*LbdInvTxfm2dFunc)(const int32_t *, uint8_t *, int, TX_TYPE,
-                                 TX_SIZE, int);
+using InvTxfm2dFunc = void (*)(const int32_t *, uint16_t *, int, TX_TYPE, int);
+using LbdInvTxfm2dFunc = void (*)(const int32_t *, uint8_t *, int, TX_TYPE,
+                                  TX_SIZE, int);
 
 static const int bd = 10;
 static const int input_base = (1 << bd);
diff --git a/test/av1_wedge_utils_test.cc b/test/av1_wedge_utils_test.cc
index af11494..9719d2b 100644
--- a/test/av1_wedge_utils_test.cc
+++ b/test/av1_wedge_utils_test.cc
@@ -156,9 +156,9 @@
 // av1_wedge_sse_from_residuals - optimizations
 //////////////////////////////////////////////////////////////////////////////
 
-typedef uint64_t (*FSSE)(const int16_t *r1, const int16_t *d, const uint8_t *m,
-                         int N);
-typedef libaom_test::FuncParam<FSSE> TestFuncsFSSE;
+using FSSE = uint64_t (*)(const int16_t *r1, const int16_t *d, const uint8_t *m,
+                          int N);
+using TestFuncsFSSE = libaom_test::FuncParam<FSSE>;
 
 class WedgeUtilsSSEOptTest : public FunctionEquivalenceTest<FSSE> {
  protected:
@@ -222,9 +222,9 @@
 // av1_wedge_sign_from_residuals
 //////////////////////////////////////////////////////////////////////////////
 
-typedef int8_t (*FSign)(const int16_t *ds, const uint8_t *m, int N,
-                        int64_t limit);
-typedef libaom_test::FuncParam<FSign> TestFuncsFSign;
+using FSign = int8_t (*)(const int16_t *ds, const uint8_t *m, int N,
+                         int64_t limit);
+using TestFuncsFSign = libaom_test::FuncParam<FSign>;
 
 class WedgeUtilsSignOptTest : public FunctionEquivalenceTest<FSign> {
  protected:
@@ -324,8 +324,8 @@
 // av1_wedge_compute_delta_squares
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*FDS)(int16_t *d, const int16_t *a, const int16_t *b, int N);
-typedef libaom_test::FuncParam<FDS> TestFuncsFDS;
+using FDS = void (*)(int16_t *d, const int16_t *a, const int16_t *b, int N);
+using TestFuncsFDS = libaom_test::FuncParam<FDS>;
 
 class WedgeUtilsDeltaSquaresOptTest : public FunctionEquivalenceTest<FDS> {
  protected:
diff --git a/test/avg_test.cc b/test/avg_test.cc
index d1698fc..d2b4dbc 100644
--- a/test/avg_test.cc
+++ b/test/avg_test.cc
@@ -117,11 +117,11 @@
 
   ACMRandom rnd_;
 };
-typedef unsigned int (*AverageFunction)(const uint8_t *s, int pitch);
+using AverageFunction = unsigned int (*)(const uint8_t *s, int pitch);
 
 // Arguments: width, height, bit_depth, buffer start offset, block size, avg
 // function.
-typedef std::tuple<int, int, int, int, int, AverageFunction> AvgFunc;
+using AvgFunc = std::tuple<int, int, int, int, int, AverageFunction>;
 
 template <typename Pixel>
 class AverageTest : public AverageTestBase<Pixel>,
@@ -216,13 +216,13 @@
   int64_t opt_elapsed_time_ = 0;
 };
 
-typedef void (*AverageFunction_8x8_quad)(const uint8_t *s, int pitch, int x_idx,
-                                         int y_idx, int *avg);
+using AverageFunction_8x8_quad = void (*)(const uint8_t *s, int pitch,
+                                          int x_idx, int y_idx, int *avg);
 
 // Arguments: width, height, bit_depth, buffer start offset, block size, avg
 // function.
-typedef std::tuple<int, int, int, int, int, AverageFunction_8x8_quad>
-    AvgFunc_8x8_quad;
+using AvgFunc_8x8_quad =
+    std::tuple<int, int, int, int, int, AverageFunction_8x8_quad>;
 
 template <typename Pixel>
 class AverageTest_8x8_quad
@@ -350,12 +350,12 @@
 }
 #endif  // CONFIG_AV1_HIGHBITDEPTH
 
-typedef void (*IntProRowFunc)(int16_t *hbuf, uint8_t const *ref,
-                              const int ref_stride, const int width,
-                              const int height, int norm_factor);
+using IntProRowFunc = void (*)(int16_t *hbuf, uint8_t const *ref,
+                               const int ref_stride, const int width,
+                               const int height, int norm_factor);
 
 // Params: width, height, asm function, c function.
-typedef std::tuple<int, int, IntProRowFunc, IntProRowFunc> IntProRowParam;
+using IntProRowParam = std::tuple<int, int, IntProRowFunc, IntProRowFunc>;
 
 class IntProRowTest : public AverageTestBase<uint8_t>,
                       public ::testing::WithParamInterface<IntProRowParam> {
@@ -452,12 +452,12 @@
 };
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(IntProRowTest);
 
-typedef void (*IntProColFunc)(int16_t *vbuf, uint8_t const *ref,
-                              const int ref_stride, const int width,
-                              const int height, int norm_factor);
+using IntProColFunc = void (*)(int16_t *vbuf, uint8_t const *ref,
+                               const int ref_stride, const int width,
+                               const int height, int norm_factor);
 
 // Params: width, height, asm function, c function.
-typedef std::tuple<int, int, IntProColFunc, IntProColFunc> IntProColParam;
+using IntProColParam = std::tuple<int, int, IntProColFunc, IntProColFunc>;
 
 class IntProColTest : public AverageTestBase<uint8_t>,
                       public ::testing::WithParamInterface<IntProColParam> {
@@ -633,10 +633,10 @@
   static const int num_random_cmp = 50;
 };
 
-typedef int (*VectorVarFunc)(const int16_t *ref, const int16_t *src,
-                             const int bwl);
+using VectorVarFunc = int (*)(const int16_t *ref, const int16_t *src,
+                              const int bwl);
 
-typedef std::tuple<int, VectorVarFunc, VectorVarFunc> VecVarFunc;
+using VecVarFunc = std::tuple<int, VectorVarFunc, VectorVarFunc>;
 
 class VectorVarTest : public VectorVarTestBase,
                       public ::testing::WithParamInterface<VecVarFunc> {
@@ -858,8 +858,8 @@
 #endif  // HAVE_NEON
 #endif  // CONFIG_AV1_HIGHBITDEPTH
 
-typedef int (*SatdFunc)(const tran_low_t *coeffs, int length);
-typedef int (*SatdLpFunc)(const int16_t *coeffs, int length);
+using SatdFunc = int (*)(const tran_low_t *coeffs, int length);
+using SatdLpFunc = int (*)(const int16_t *coeffs, int length);
 
 template <typename SatdFuncType>
 struct SatdTestParam {
diff --git a/test/blend_a64_mask_1d_test.cc b/test/blend_a64_mask_1d_test.cc
index feee2d4..d20baa5 100644
--- a/test/blend_a64_mask_1d_test.cc
+++ b/test/blend_a64_mask_1d_test.cc
@@ -114,10 +114,10 @@
 // 8 bit version
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*F8B)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
-                    uint32_t src0_stride, const uint8_t *src1,
-                    uint32_t src1_stride, const uint8_t *mask, int w, int h);
-typedef libaom_test::FuncParam<F8B> TestFuncs;
+using F8B = void (*)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
+                     uint32_t src0_stride, const uint8_t *src1,
+                     uint32_t src1_stride, const uint8_t *mask, int w, int h);
+using TestFuncs = libaom_test::FuncParam<F8B>;
 
 class BlendA64Mask1DTest8B : public BlendA64Mask1DTest<F8B, uint8_t> {
  protected:
@@ -219,11 +219,11 @@
 // High bit-depth version
 //////////////////////////////////////////////////////////////////////////////
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef void (*FHBD)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
-                     uint32_t src0_stride, const uint8_t *src1,
-                     uint32_t src1_stride, const uint8_t *mask, int w, int h,
-                     int bd);
-typedef libaom_test::FuncParam<FHBD> TestFuncsHBD;
+using FHBD = void (*)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
+                      uint32_t src0_stride, const uint8_t *src1,
+                      uint32_t src1_stride, const uint8_t *mask, int w, int h,
+                      int bd);
+using TestFuncsHBD = libaom_test::FuncParam<FHBD>;
 
 class BlendA64Mask1DTestHBD : public BlendA64Mask1DTest<FHBD, uint16_t> {
  protected:
diff --git a/test/blend_a64_mask_test.cc b/test/blend_a64_mask_test.cc
index 43d0162..c9dad97 100644
--- a/test/blend_a64_mask_test.cc
+++ b/test/blend_a64_mask_test.cc
@@ -157,11 +157,11 @@
 // 8 bit version
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*F8B)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
-                    uint32_t src0_stride, const uint8_t *src1,
-                    uint32_t src1_stride, const uint8_t *mask,
-                    uint32_t mask_stride, int w, int h, int subx, int suby);
-typedef libaom_test::FuncParam<F8B> TestFuncs;
+using F8B = void (*)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
+                     uint32_t src0_stride, const uint8_t *src1,
+                     uint32_t src1_stride, const uint8_t *mask,
+                     uint32_t mask_stride, int w, int h, int subx, int suby);
+using TestFuncs = libaom_test::FuncParam<F8B>;
 
 class BlendA64MaskTest8B : public BlendA64MaskTest<F8B, uint8_t, uint8_t> {
  protected:
@@ -266,12 +266,13 @@
 // 8 bit _d16 version
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*F8B_D16)(uint8_t *dst, uint32_t dst_stride, const uint16_t *src0,
-                        uint32_t src0_stride, const uint16_t *src1,
-                        uint32_t src1_stride, const uint8_t *mask,
-                        uint32_t mask_stride, int w, int h, int subx, int suby,
-                        ConvolveParams *conv_params);
-typedef libaom_test::FuncParam<F8B_D16> TestFuncs_d16;
+using F8B_D16 = void (*)(uint8_t *dst, uint32_t dst_stride,
+                         const uint16_t *src0, uint32_t src0_stride,
+                         const uint16_t *src1, uint32_t src1_stride,
+                         const uint8_t *mask, uint32_t mask_stride, int w,
+                         int h, int subx, int suby,
+                         ConvolveParams *conv_params);
+using TestFuncs_d16 = libaom_test::FuncParam<F8B_D16>;
 
 class BlendA64MaskTest8B_d16
     : public BlendA64MaskTest<F8B_D16, uint16_t, uint8_t> {
@@ -387,12 +388,12 @@
 // High bit-depth version
 //////////////////////////////////////////////////////////////////////////////
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef void (*FHBD)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
-                     uint32_t src0_stride, const uint8_t *src1,
-                     uint32_t src1_stride, const uint8_t *mask,
-                     uint32_t mask_stride, int w, int h, int subx, int suby,
-                     int bd);
-typedef libaom_test::FuncParam<FHBD> TestFuncsHBD;
+using FHBD = void (*)(uint8_t *dst, uint32_t dst_stride, const uint8_t *src0,
+                      uint32_t src0_stride, const uint8_t *src1,
+                      uint32_t src1_stride, const uint8_t *mask,
+                      uint32_t mask_stride, int w, int h, int subx, int suby,
+                      int bd);
+using TestFuncsHBD = libaom_test::FuncParam<FHBD>;
 
 class BlendA64MaskTestHBD : public BlendA64MaskTest<FHBD, uint16_t, uint16_t> {
  protected:
@@ -490,13 +491,13 @@
 // HBD _d16 version
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*FHBD_D16)(uint8_t *dst, uint32_t dst_stride,
-                         const CONV_BUF_TYPE *src0, uint32_t src0_stride,
-                         const CONV_BUF_TYPE *src1, uint32_t src1_stride,
-                         const uint8_t *mask, uint32_t mask_stride, int w,
-                         int h, int subx, int suby, ConvolveParams *conv_params,
-                         const int bd);
-typedef libaom_test::FuncParam<FHBD_D16> TestFuncsHBD_d16;
+using FHBD_D16 = void (*)(uint8_t *dst, uint32_t dst_stride,
+                          const CONV_BUF_TYPE *src0, uint32_t src0_stride,
+                          const CONV_BUF_TYPE *src1, uint32_t src1_stride,
+                          const uint8_t *mask, uint32_t mask_stride, int w,
+                          int h, int subx, int suby,
+                          ConvolveParams *conv_params, const int bd);
+using TestFuncsHBD_d16 = libaom_test::FuncParam<FHBD_D16>;
 
 class BlendA64MaskTestHBD_d16
     : public BlendA64MaskTest<FHBD_D16, uint16_t, uint16_t> {
diff --git a/test/cdef_test.cc b/test/cdef_test.cc
index 958e6e5..6f985f0 100644
--- a/test/cdef_test.cc
+++ b/test/cdef_test.cc
@@ -32,9 +32,9 @@
 
 using CdefFilterBlockFunctions = std::array<cdef_filter_block_func, 4>;
 
-typedef std::tuple<CdefFilterBlockFunctions, CdefFilterBlockFunctions,
-                   BLOCK_SIZE, int, int>
-    cdef_dir_param_t;
+using cdef_dir_param_t =
+    std::tuple<CdefFilterBlockFunctions, CdefFilterBlockFunctions, BLOCK_SIZE,
+               int, int>;
 
 class CDEFBlockTest : public ::testing::TestWithParam<cdef_dir_param_t> {
  public:
@@ -56,13 +56,13 @@
 };
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFBlockTest);
 
-typedef CDEFBlockTest CDEFBlockHighbdTest;
+using CDEFBlockHighbdTest = CDEFBlockTest;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFBlockHighbdTest);
 
-typedef CDEFBlockTest CDEFSpeedTest;
+using CDEFSpeedTest = CDEFBlockTest;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFSpeedTest);
 
-typedef CDEFBlockTest CDEFSpeedHighbdTest;
+using CDEFSpeedHighbdTest = CDEFBlockTest;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFSpeedHighbdTest);
 
 int64_t test_cdef(BLOCK_SIZE bsize, int iterations,
@@ -206,10 +206,10 @@
       << "SIMD time: " << elapsed_time << " us" << std::endl;
 }
 
-typedef int (*find_dir_t)(const uint16_t *img, int stride, int32_t *var,
-                          int coeff_shift);
+using find_dir_t = int (*)(const uint16_t *img, int stride, int32_t *var,
+                           int coeff_shift);
 
-typedef std::tuple<find_dir_t, find_dir_t> find_dir_param_t;
+using find_dir_param_t = std::tuple<find_dir_t, find_dir_t>;
 
 class CDEFFindDirTest : public ::testing::TestWithParam<find_dir_param_t> {
  public:
@@ -225,7 +225,7 @@
 };
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFFindDirTest);
 
-typedef CDEFFindDirTest CDEFFindDirSpeedTest;
+using CDEFFindDirSpeedTest = CDEFFindDirTest;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFFindDirSpeedTest);
 
 void test_finddir(int (*finddir)(const uint16_t *img, int stride, int32_t *var,
@@ -293,11 +293,11 @@
       << "SIMD time: " << elapsed_time << " us" << std::endl;
 }
 
-typedef void (*find_dir_dual_t)(const uint16_t *img1, const uint16_t *img2,
-                                int stride, int32_t *var1, int32_t *var2,
-                                int coeff_shift, int *out1, int *out2);
+using find_dir_dual_t = void (*)(const uint16_t *img1, const uint16_t *img2,
+                                 int stride, int32_t *var1, int32_t *var2,
+                                 int coeff_shift, int *out1, int *out2);
 
-typedef std::tuple<find_dir_dual_t, find_dir_dual_t> find_dir_dual_param_t;
+using find_dir_dual_param_t = std::tuple<find_dir_dual_t, find_dir_dual_t>;
 
 class CDEFFindDirDualTest
     : public ::testing::TestWithParam<find_dir_dual_param_t> {
@@ -314,7 +314,7 @@
 };
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFFindDirDualTest);
 
-typedef CDEFFindDirDualTest CDEFFindDirDualSpeedTest;
+using CDEFFindDirDualSpeedTest = CDEFFindDirDualTest;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(CDEFFindDirDualSpeedTest);
 
 void test_finddir_dual(
diff --git a/test/cfl_test.cc b/test/cfl_test.cc
index 3f93305..6002fd8 100644
--- a/test/cfl_test.cc
+++ b/test/cfl_test.cc
@@ -172,8 +172,8 @@
   }
 };
 
-typedef cfl_subtract_average_fn (*sub_avg_fn)(TX_SIZE tx_size);
-typedef std::tuple<TX_SIZE, sub_avg_fn> sub_avg_param;
+using sub_avg_fn = cfl_subtract_average_fn (*)(TX_SIZE tx_size);
+using sub_avg_param = std::tuple<TX_SIZE, sub_avg_fn>;
 class CFLSubAvgTest : public ::testing::TestWithParam<sub_avg_param>,
                       public CFLTestWithData<uint16_t> {
  public:
@@ -280,10 +280,10 @@
   }
 };
 
-typedef cfl_subsample_lbd_fn (*get_subsample_lbd_fn)(TX_SIZE tx_size);
-typedef std::tuple<TX_SIZE, get_subsample_lbd_fn, get_subsample_lbd_fn,
-                   get_subsample_lbd_fn>
-    subsample_lbd_param;
+using get_subsample_lbd_fn = cfl_subsample_lbd_fn (*)(TX_SIZE tx_size);
+using subsample_lbd_param =
+    std::tuple<TX_SIZE, get_subsample_lbd_fn, get_subsample_lbd_fn,
+               get_subsample_lbd_fn>;
 class CFLSubsampleLBDTest
     : public CFLSubsampleTest<subsample_lbd_param, cfl_subsample_lbd_fn,
                               uint8_t> {
@@ -324,10 +324,10 @@
 }
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef cfl_subsample_hbd_fn (*get_subsample_hbd_fn)(TX_SIZE tx_size);
-typedef std::tuple<TX_SIZE, get_subsample_hbd_fn, get_subsample_hbd_fn,
-                   get_subsample_hbd_fn>
-    subsample_hbd_param;
+using get_subsample_hbd_fn = cfl_subsample_hbd_fn (*)(TX_SIZE tx_size);
+using subsample_hbd_param =
+    std::tuple<TX_SIZE, get_subsample_hbd_fn, get_subsample_hbd_fn,
+               get_subsample_hbd_fn>;
 class CFLSubsampleHBDTest
     : public CFLSubsampleTest<subsample_hbd_param, cfl_subsample_hbd_fn,
                               uint16_t> {
@@ -368,8 +368,8 @@
 }
 #endif  // CONFIG_AV1_HIGHBITDEPTH
 
-typedef cfl_predict_lbd_fn (*get_predict_fn)(TX_SIZE tx_size);
-typedef std::tuple<TX_SIZE, get_predict_fn> predict_param;
+using get_predict_fn = cfl_predict_lbd_fn (*)(TX_SIZE tx_size);
+using predict_param = std::tuple<TX_SIZE, get_predict_fn>;
 class CFLPredictTest : public ::testing::TestWithParam<predict_param>,
                        public CFLTestWithAlignedData<uint8_t> {
  public:
@@ -417,8 +417,8 @@
 }
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef cfl_predict_hbd_fn (*get_predict_fn_hbd)(TX_SIZE tx_size);
-typedef std::tuple<TX_SIZE, get_predict_fn_hbd> predict_param_hbd;
+using get_predict_fn_hbd = cfl_predict_hbd_fn (*)(TX_SIZE tx_size);
+using predict_param_hbd = std::tuple<TX_SIZE, get_predict_fn_hbd>;
 class CFLPredictHBDTest : public ::testing::TestWithParam<predict_param_hbd>,
                           public CFLTestWithAlignedData<uint16_t> {
  public:
diff --git a/test/cnn_test.cc b/test/cnn_test.cc
index 3012451..fdf71d0 100644
--- a/test/cnn_test.cc
+++ b/test/cnn_test.cc
@@ -2510,13 +2510,13 @@
 
 namespace {
 
-typedef void (*CNNConvolveNoMaxpoolPaddingValidFunc)(
-    const float **input, int in_width, int in_height, int in_stride,
-    const CNN_LAYER_CONFIG *layer_config, float **output, int out_stride,
-    int start_idx, int cstep, int channel_step);
+using CNNConvolveNoMaxpoolPaddingValidFunc =
+    void (*)(const float **input, int in_width, int in_height, int in_stride,
+             const CNN_LAYER_CONFIG *layer_config, float **output,
+             int out_stride, int start_idx, int cstep, int channel_step);
 
-typedef libaom_test::FuncParam<CNNConvolveNoMaxpoolPaddingValidFunc>
-    CNNConvolveTestFuncs;
+using CNNConvolveTestFuncs =
+    libaom_test::FuncParam<CNNConvolveNoMaxpoolPaddingValidFunc>;
 
 class CNNConvolveTest : public ::testing::TestWithParam<CNNConvolveTestFuncs> {
  protected:
diff --git a/test/comp_mask_pred_test.cc b/test/comp_mask_pred_test.cc
index 953e481..7a23398 100644
--- a/test/comp_mask_pred_test.cc
+++ b/test/comp_mask_pred_test.cc
@@ -30,14 +30,14 @@
 #include "test/util.h"
 
 namespace {
-typedef void (*comp_mask_pred_func)(uint8_t *comp_pred, const uint8_t *pred,
-                                    int width, int height, const uint8_t *ref,
-                                    int ref_stride, const uint8_t *mask,
-                                    int mask_stride, int invert_mask);
+using comp_mask_pred_func = void (*)(uint8_t *comp_pred, const uint8_t *pred,
+                                     int width, int height, const uint8_t *ref,
+                                     int ref_stride, const uint8_t *mask,
+                                     int mask_stride, int invert_mask);
 
-typedef void (*comp_avg_pred_func)(uint8_t *comp_pred, const uint8_t *pred,
-                                   int width, int height, const uint8_t *ref,
-                                   int ref_stride);
+using comp_avg_pred_func = void (*)(uint8_t *comp_pred, const uint8_t *pred,
+                                    int width, int height, const uint8_t *ref,
+                                    int ref_stride);
 
 #if HAVE_SSSE3 || HAVE_SSE2 || HAVE_AVX2 || HAVE_NEON
 const BLOCK_SIZE kCompMaskPredParams[] = {
@@ -111,7 +111,7 @@
   aom_free(ref_buffer_);
 }
 
-typedef std::tuple<comp_mask_pred_func, BLOCK_SIZE> CompMaskPredParam;
+using CompMaskPredParam = std::tuple<comp_mask_pred_func, BLOCK_SIZE>;
 
 class AV1CompMaskPredTest
     : public AV1CompMaskPredBase,
@@ -207,14 +207,15 @@
 };
 #endif
 
-typedef void (*upsampled_pred_func)(MACROBLOCKD *xd, const AV1_COMMON *const cm,
-                                    int mi_row, int mi_col, const MV *const mv,
-                                    uint8_t *comp_pred, int width, int height,
-                                    int subpel_x_q3, int subpel_y_q3,
-                                    const uint8_t *ref, int ref_stride,
-                                    int subpel_search);
+using upsampled_pred_func = void (*)(MACROBLOCKD *xd,
+                                     const AV1_COMMON *const cm, int mi_row,
+                                     int mi_col, const MV *const mv,
+                                     uint8_t *comp_pred, int width, int height,
+                                     int subpel_x_q3, int subpel_y_q3,
+                                     const uint8_t *ref, int ref_stride,
+                                     int subpel_search);
 
-typedef std::tuple<upsampled_pred_func, BLOCK_SIZE> UpsampledPredParam;
+using UpsampledPredParam = std::tuple<upsampled_pred_func, BLOCK_SIZE>;
 
 class AV1UpsampledPredTest
     : public AV1CompMaskPredBase,
@@ -299,7 +300,7 @@
                        ::testing::ValuesIn(kValidBlockSize)));
 #endif
 
-typedef std::tuple<comp_avg_pred_func, BLOCK_SIZE> CompAvgPredParam;
+using CompAvgPredParam = std::tuple<comp_avg_pred_func, BLOCK_SIZE>;
 
 class AV1CompAvgPredTest : public ::testing::TestWithParam<CompAvgPredParam> {
  public:
@@ -481,14 +482,14 @@
   aom_free(ref_buffer_);
 }
 
-typedef void (*highbd_comp_mask_pred_func)(uint8_t *comp_pred8,
-                                           const uint8_t *pred8, int width,
-                                           int height, const uint8_t *ref8,
-                                           int ref_stride, const uint8_t *mask,
-                                           int mask_stride, int invert_mask);
+using highbd_comp_mask_pred_func = void (*)(uint8_t *comp_pred8,
+                                            const uint8_t *pred8, int width,
+                                            int height, const uint8_t *ref8,
+                                            int ref_stride, const uint8_t *mask,
+                                            int mask_stride, int invert_mask);
 
-typedef std::tuple<highbd_comp_mask_pred_func, BLOCK_SIZE, int>
-    HighbdCompMaskPredParam;
+using HighbdCompMaskPredParam =
+    std::tuple<highbd_comp_mask_pred_func, BLOCK_SIZE, int>;
 
 class AV1HighbdCompMaskPredTest
     : public AV1HighbdCompMaskPredTestBase,
@@ -607,14 +608,14 @@
                        ::testing::Range(8, 13, 2)));
 #endif
 
-typedef void (*highbd_upsampled_pred_func)(
-    MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row, int mi_col,
-    const MV *const mv, uint8_t *comp_pred8, int width, int height,
-    int subpel_x_q3, int subpel_y_q3, const uint8_t *ref8, int ref_stride,
-    int bd, int subpel_search);
+using highbd_upsampled_pred_func =
+    void (*)(MACROBLOCKD *xd, const struct AV1Common *const cm, int mi_row,
+             int mi_col, const MV *const mv, uint8_t *comp_pred8, int width,
+             int height, int subpel_x_q3, int subpel_y_q3, const uint8_t *ref8,
+             int ref_stride, int bd, int subpel_search);
 
-typedef std::tuple<highbd_upsampled_pred_func, BLOCK_SIZE, int>
-    HighbdUpsampledPredParam;
+using HighbdUpsampledPredParam =
+    std::tuple<highbd_upsampled_pred_func, BLOCK_SIZE, int>;
 
 class AV1HighbdUpsampledPredTest
     : public AV1HighbdCompMaskPredTestBase,
@@ -728,13 +729,13 @@
                        ::testing::Range(8, 13, 2)));
 #endif
 
-typedef void (*highbd_comp_avg_pred_func)(uint8_t *comp_pred,
-                                          const uint8_t *pred, int width,
-                                          int height, const uint8_t *ref,
-                                          int ref_stride);
+using highbd_comp_avg_pred_func = void (*)(uint8_t *comp_pred,
+                                           const uint8_t *pred, int width,
+                                           int height, const uint8_t *ref,
+                                           int ref_stride);
 
-typedef std::tuple<highbd_comp_avg_pred_func, BLOCK_SIZE, int>
-    HighbdCompAvgPredParam;
+using HighbdCompAvgPredParam =
+    std::tuple<highbd_comp_avg_pred_func, BLOCK_SIZE, int>;
 
 class AV1HighbdCompAvgPredTest
     : public ::testing::TestWithParam<HighbdCompAvgPredParam> {
diff --git a/test/convolve_test.cc b/test/convolve_test.cc
index 09e5f64..578d9a2 100644
--- a/test/convolve_test.cc
+++ b/test/convolve_test.cc
@@ -40,11 +40,11 @@
 static const int kNumFilterBanks = SWITCHABLE_FILTERS;
 static const int kNumFilters = 16;
 
-typedef void (*ConvolveFunc)(const uint8_t *src, ptrdiff_t src_stride,
-                             uint8_t *dst, ptrdiff_t dst_stride,
-                             const int16_t *filter_x, int filter_x_stride,
-                             const int16_t *filter_y, int filter_y_stride,
-                             int w, int h);
+using ConvolveFunc = void (*)(const uint8_t *src, ptrdiff_t src_stride,
+                              uint8_t *dst, ptrdiff_t dst_stride,
+                              const int16_t *filter_x, int filter_x_stride,
+                              const int16_t *filter_y, int filter_y_stride,
+                              int w, int h);
 
 struct ConvolveFunctions {
   ConvolveFunctions(ConvolveFunc h8, ConvolveFunc v8, int bd)
@@ -55,7 +55,7 @@
   int use_highbd_;  // 0 if high bitdepth not used, else the actual bit depth.
 };
 
-typedef std::tuple<int, int, const ConvolveFunctions *> ConvolveParam;
+using ConvolveParam = std::tuple<int, int, const ConvolveFunctions *>;
 
 #define ALL_SIZES_64(convolve_fn)                                         \
   make_tuple(4, 4, &convolve_fn), make_tuple(8, 4, &convolve_fn),         \
@@ -940,13 +940,13 @@
 #endif
 #endif  // HAVE_SVE
 
-typedef void (*ConvolveScale2DFunc)(const uint8_t *src, ptrdiff_t src_stride,
-                                    uint8_t *dst, ptrdiff_t dst_stride,
-                                    const InterpKernel *filter, int x0_q4,
-                                    int x_step_q4, int y0_q4, int y_step_q4,
-                                    int w, int h);
+using ConvolveScale2DFunc = void (*)(const uint8_t *src, ptrdiff_t src_stride,
+                                     uint8_t *dst, ptrdiff_t dst_stride,
+                                     const InterpKernel *filter, int x0_q4,
+                                     int x_step_q4, int y0_q4, int y_step_q4,
+                                     int w, int h);
 
-typedef std::tuple<int, int, ConvolveScale2DFunc> ConvolveScale2DParam;
+using ConvolveScale2DParam = std::tuple<int, int, ConvolveScale2DFunc>;
 
 class ConvolveScale2DTest
     : public ::testing::TestWithParam<ConvolveScale2DParam> {
diff --git a/test/corner_match_test.cc b/test/corner_match_test.cc
index a805329..e1353b8 100644
--- a/test/corner_match_test.cc
+++ b/test/corner_match_test.cc
@@ -27,19 +27,19 @@
 
 using libaom_test::ACMRandom;
 
-typedef bool (*ComputeMeanStddevFunc)(const unsigned char *frame, int stride,
-                                      int x, int y, double *mean,
-                                      double *one_over_stddev);
-typedef double (*ComputeCorrFunc)(const unsigned char *frame1, int stride1,
-                                  int x1, int y1, double mean1,
-                                  double one_over_stddev1,
-                                  const unsigned char *frame2, int stride2,
-                                  int x2, int y2, double mean2,
-                                  double one_over_stddev2);
+using ComputeMeanStddevFunc = bool (*)(const unsigned char *frame, int stride,
+                                       int x, int y, double *mean,
+                                       double *one_over_stddev);
+using ComputeCorrFunc = double (*)(const unsigned char *frame1, int stride1,
+                                   int x1, int y1, double mean1,
+                                   double one_over_stddev1,
+                                   const unsigned char *frame2, int stride2,
+                                   int x2, int y2, double mean2,
+                                   double one_over_stddev2);
 
 using std::make_tuple;
 using std::tuple;
-typedef tuple<int, ComputeMeanStddevFunc, ComputeCorrFunc> CornerMatchParam;
+using CornerMatchParam = tuple<int, ComputeMeanStddevFunc, ComputeCorrFunc>;
 
 class AV1CornerMatchTest : public ::testing::TestWithParam<CornerMatchParam> {
  public:
diff --git a/test/datarate_test.cc b/test/datarate_test.cc
index 3d86128..1533da28 100644
--- a/test/datarate_test.cc
+++ b/test/datarate_test.cc
@@ -103,37 +103,15 @@
     ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
                                          288, 30, 1, 0, 140);
     const int bitrate_array[2] = { 400, 800 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.7)
-        << " The datarate for the file is lower than target by too much!";
-    // FIXME(jingning): Lower this test threshold after vbr mode can render
-    // sufficiently accurate bit rate.
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.45)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, bitrate_array[GET_PARAM(4)], 0.7, 1.45);
   }
 
   virtual void BasicRateTargetingCBRTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
-
+    SetUpCBR();
     ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
                                          288, 30, 1, 0, 140);
     const int bitrate_array[2] = { 150, 550 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
-        << " The datarate for the file is lower than target by too much!";
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.19)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, bitrate_array[GET_PARAM(4)], 0.85, 1.19);
   }
 
 #if CONFIG_LIBYUV
@@ -143,10 +121,8 @@
   // with the flag avif_mode_. This test upsamples a QVGA clip to the target
   // resolution, using libyuv for the scaling.
   virtual void BasicRateTargetingCBRAssertAvifModeTest() {
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
+    SetUpCBR();
+    cfg_.rc_dropframe_thresh = 0;
     ResizingVideoSource video(2456, 2054, 320, 240,
                               "pixel_capture_w320h240.yuv", 100);
     const int bitrate_array[2] = { 1000, 2000 };
@@ -164,14 +140,10 @@
 #endif  // CONFIG_LIBYUV
 
   virtual void BasicRateTargetingCBRSpikeTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_dropframe_thresh = 0;
     cfg_.rc_min_quantizer = 2;
     cfg_.rc_max_quantizer = 56;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
     cfg_.kf_max_dist = 3000;
     cfg_.kf_min_dist = 3000;
 
@@ -192,14 +164,10 @@
   }
 
   virtual void BasicRateTargetingCBRDynamicBitrateTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_dropframe_thresh = 0;
     cfg_.rc_min_quantizer = 2;
     cfg_.rc_max_quantizer = 56;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
     cfg_.kf_max_dist = 3000;
     cfg_.kf_min_dist = 3000;
 
@@ -226,112 +194,56 @@
   virtual void BasicRateTargetingMultiThreadCBRTest() {
     ::libaom_test::I420VideoSource video("niklas_640_480_30.yuv", 640, 480, 30,
                                          1, 0, 400);
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
+    SetUpCBR();
     cfg_.g_threads = 4;
 
     const int bitrate_array[2] = { 250, 650 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
     ResetModel();
     tile_columns_ = 2;
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 0.85)
-        << " The datarate for the file exceeds the target by too much!";
-    ASSERT_LE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 1.15)
-        << " The datarate for the file missed the target!"
-        << cfg_.rc_target_bitrate << " " << effective_datarate_;
+    RunBasicRateTargetingTestReversed(&video, bitrate_array[GET_PARAM(4)], 0.85,
+                                      1.15);
   }
 
   virtual void ErrorResilienceOnSceneCuts() {
     if (GET_PARAM(4) > 0) return;
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_dropframe_thresh = 0;
     cfg_.g_error_resilient = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
 
     ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
                                          288, 30, 1, 0, 300);
-    cfg_.rc_target_bitrate = 500;
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
-        << " The datarate for the file is lower than target by too much!";
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.15)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, 500, 0.85, 1.15);
   }
 
   virtual void BasicRateTargetingCBRPeriodicKeyFrameTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
+    SetUpCBR();
     // Periodic keyframe
     cfg_.kf_max_dist = 50;
 
     ::libaom_test::I420VideoSource video("pixel_capture_w320h240.yuv", 320, 240,
                                          30, 1, 0, 310);
     const int bitrate_array[2] = { 150, 550 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
-        << " The datarate for the file is lower than target by too much!";
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.15)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, bitrate_array[GET_PARAM(4)], 0.85, 1.15);
   }
 
   virtual void CBRPeriodicKeyFrameOnSceneCuts() {
     if (GET_PARAM(4) > 0) return;
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_dropframe_thresh = 0;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
     // Periodic keyframe
     cfg_.kf_max_dist = 30;
     cfg_.kf_min_dist = 30;
 
     ::libaom_test::I420VideoSource video("hantro_collage_w352h288.yuv", 352,
                                          288, 30, 1, 0, 300);
-    cfg_.rc_target_bitrate = 500;
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
-        << " The datarate for the file is lower than target by too much!";
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.3)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, 500, 0.85, 1.3);
   }
 
   virtual void BasicRateTargetingAQModeOnOffCBRTest() {
     if (GET_PARAM(4) > 0) return;
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_dropframe_thresh = 0;
     cfg_.rc_min_quantizer = 2;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
-    cfg_.g_lag_in_frames = 0;
     cfg_.g_error_resilient = 0;
     cfg_.g_pass = AOM_RC_ONE_PASS;
     cfg_.g_usage = AOM_USAGE_REALTIME;
@@ -339,13 +251,7 @@
 
     ::libaom_test::I420VideoSource video("pixel_capture_w320h240.yuv", 320, 240,
                                          30, 1, 0, 310);
-    cfg_.rc_target_bitrate = 60;
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * 0.85)
-        << " The datarate for the file is lower than target by too much!";
-    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * 1.15)
-        << " The datarate for the file is greater than target by too much!";
+    RunBasicRateTargetingTest(&video, 60, 0.85, 1.15);
   }
 
   virtual void BasicRateTargeting444CBRScreenTest() {
@@ -354,26 +260,13 @@
     cfg_.g_profile = 1;
     cfg_.g_timebase = video.timebase();
 
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
+    SetUpCBR();
 
     const int bitrate_array[2] = { 250, 650 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
     ResetModel();
     screen_mode_ = true;
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 0.85)
-        << " The datarate for the file exceeds the target by too much!";
-    ASSERT_LE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 1.15)
-        << " The datarate for the file missed the target!"
-        << cfg_.rc_target_bitrate << " " << effective_datarate_;
+    RunBasicRateTargetingTestReversed(&video, bitrate_array[GET_PARAM(4)], 0.85,
+                                      1.15);
   }
 
   virtual void BasicRateTargetingSuperresCBR() {
@@ -383,29 +276,15 @@
     cfg_.g_profile = 0;
     cfg_.g_timebase = video.timebase();
 
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
+    SetUpCBR();
 
     cfg_.rc_superres_mode = AOM_SUPERRES_FIXED;
     cfg_.rc_superres_denominator = 16;
     cfg_.rc_superres_kf_denominator = 16;
 
     const int bitrate_array[2] = { 250, 650 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
-    ResetModel();
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 0.85)
-        << " The datarate for the file exceeds the target by too much!";
-    ASSERT_LE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 1.15)
-        << " The datarate for the file missed the target!"
-        << cfg_.rc_target_bitrate << " " << effective_datarate_;
+    RunBasicRateTargetingTestReversed(&video, bitrate_array[GET_PARAM(4)], 0.85,
+                                      1.15);
   }
 
   virtual void BasicRateTargetingSuperresCBRMultiThreads() {
@@ -415,13 +294,7 @@
     cfg_.g_profile = 0;
     cfg_.g_timebase = video.timebase();
 
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
-    cfg_.rc_dropframe_thresh = 1;
-    cfg_.rc_min_quantizer = 0;
-    cfg_.rc_max_quantizer = 63;
-    cfg_.rc_end_usage = AOM_CBR;
+    SetUpCBR();
     cfg_.g_threads = 2;
 
     cfg_.rc_superres_mode = AOM_SUPERRES_FIXED;
@@ -429,17 +302,10 @@
     cfg_.rc_superres_kf_denominator = 16;
 
     const int bitrate_array[2] = { 250, 650 };
-    cfg_.rc_target_bitrate = bitrate_array[GET_PARAM(4)];
     ResetModel();
     tile_columns_ = 1;
-    ASSERT_NO_FATAL_FAILURE(RunLoop(&video));
-    ASSERT_GE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 0.85)
-        << " The datarate for the file exceeds the target by too much!";
-    ASSERT_LE(static_cast<double>(cfg_.rc_target_bitrate),
-              effective_datarate_ * 1.15)
-        << " The datarate for the file missed the target!"
-        << cfg_.rc_target_bitrate << " " << effective_datarate_;
+    RunBasicRateTargetingTestReversed(&video, bitrate_array[GET_PARAM(4)], 0.85,
+                                      1.15);
   }
 };
 
@@ -463,17 +329,12 @@
   }
 
   virtual void ChangingDropFrameThreshTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_undershoot_pct = 20;
     cfg_.rc_undershoot_pct = 20;
     cfg_.rc_dropframe_thresh = 10;
-    cfg_.rc_min_quantizer = 0;
     cfg_.rc_max_quantizer = 50;
-    cfg_.rc_end_usage = AOM_CBR;
     cfg_.rc_target_bitrate = 200;
-    cfg_.g_lag_in_frames = 0;
     cfg_.g_error_resilient = 1;
     // TODO(marpan): Investigate datarate target failures with a smaller
     // keyframe interval (128).
@@ -594,17 +455,12 @@
   }
 
   virtual void ChangingSpeedTest() {
-    cfg_.rc_buf_initial_sz = 500;
-    cfg_.rc_buf_optimal_sz = 500;
-    cfg_.rc_buf_sz = 1000;
+    SetUpCBR();
     cfg_.rc_undershoot_pct = 20;
     cfg_.rc_undershoot_pct = 20;
     cfg_.rc_dropframe_thresh = 10;
-    cfg_.rc_min_quantizer = 0;
     cfg_.rc_max_quantizer = 50;
-    cfg_.rc_end_usage = AOM_CBR;
     cfg_.rc_target_bitrate = 200;
-    cfg_.g_lag_in_frames = 0;
     cfg_.g_error_resilient = 1;
     // TODO(marpan): Investigate datarate target failures with a smaller
     // keyframe interval (128).
@@ -743,16 +599,11 @@
 };
 
 TEST_P(DatarateTestSetFrameQpRealtime, SetFrameQpOnePass) {
-  cfg_.rc_buf_initial_sz = 500;
-  cfg_.rc_buf_optimal_sz = 500;
-  cfg_.rc_buf_sz = 1000;
+  SetUpCBR();
   cfg_.rc_undershoot_pct = 20;
   cfg_.rc_undershoot_pct = 20;
-  cfg_.rc_min_quantizer = 0;
   cfg_.rc_max_quantizer = 50;
-  cfg_.rc_end_usage = AOM_CBR;
   cfg_.rc_target_bitrate = 200;
-  cfg_.g_lag_in_frames = 0;
   cfg_.g_error_resilient = 1;
   cfg_.kf_max_dist = 9999;
   cfg_.rc_dropframe_thresh = 0;
diff --git a/test/datarate_test.h b/test/datarate_test.h
index 9064d1d..5dd2778 100644
--- a/test/datarate_test.h
+++ b/test/datarate_test.h
@@ -30,6 +30,17 @@
  protected:
   ~DatarateTest() override = default;
 
+  virtual void SetUpCBR() {
+    cfg_.rc_buf_initial_sz = 500;
+    cfg_.rc_buf_optimal_sz = 500;
+    cfg_.rc_buf_sz = 1000;
+    cfg_.rc_dropframe_thresh = 1;
+    cfg_.rc_min_quantizer = 0;
+    cfg_.rc_max_quantizer = 63;
+    cfg_.rc_end_usage = AOM_CBR;
+    cfg_.g_lag_in_frames = 0;
+  }
+
   virtual void ResetModel() {
     last_pts_ = 0;
     bits_in_buffer_model_ = cfg_.rc_target_bitrate * cfg_.rc_buf_initial_sz;
@@ -211,6 +222,34 @@
     }
   }
 
+  void RunBasicRateTargetingTest(::libaom_test::VideoSource *video,
+                                 const int bitrate, double low_rate_err_limit,
+                                 double high_rate_err_limit) {
+    cfg_.rc_target_bitrate = bitrate;
+    ResetModel();
+    ASSERT_NO_FATAL_FAILURE(RunLoop(video));
+    ASSERT_GE(effective_datarate_, cfg_.rc_target_bitrate * low_rate_err_limit)
+        << " The datarate for the file is lower than target by too much!";
+    ASSERT_LE(effective_datarate_, cfg_.rc_target_bitrate * high_rate_err_limit)
+        << " The datarate for the file is greater than target by too much!";
+  }
+
+  void RunBasicRateTargetingTestReversed(::libaom_test::VideoSource *video,
+                                         const int bitrate,
+                                         double low_rate_err_limit,
+                                         double high_rate_err_limit) {
+    cfg_.rc_target_bitrate = bitrate;
+    ResetModel();
+    ASSERT_NO_FATAL_FAILURE(RunLoop(video));
+    ASSERT_GE(static_cast<double>(cfg_.rc_target_bitrate),
+              effective_datarate_ * low_rate_err_limit)
+        << " The datarate for the file exceeds the target by too much!";
+    ASSERT_LE(static_cast<double>(cfg_.rc_target_bitrate),
+              effective_datarate_ * high_rate_err_limit)
+        << " The datarate for the file missed the target!"
+        << cfg_.rc_target_bitrate << " " << effective_datarate_;
+  }
+
   aom_codec_pts_t last_pts_;
   double timebase_;
   int frame_number_;      // Counter for number of non-dropped/encoded frames.
diff --git a/test/decode_perf_test.cc b/test/decode_perf_test.cc
index 14b0f9e..c751678 100644
--- a/test/decode_perf_test.cc
+++ b/test/decode_perf_test.cc
@@ -37,7 +37,7 @@
 /*
  DecodePerfTest takes a tuple of filename + number of threads to decode with
  */
-typedef std::tuple<const char *, unsigned> DecodePerfParam;
+using DecodePerfParam = std::tuple<const char *, unsigned int>;
 
 // TODO(jimbankoski): Add actual test vectors here when available.
 // const DecodePerfParam kAV1DecodePerfVectors[] = {};
diff --git a/test/dr_prediction_test.cc b/test/dr_prediction_test.cc
index de90ec7..4e4a06b 100644
--- a/test/dr_prediction_test.cc
+++ b/test/dr_prediction_test.cc
@@ -50,19 +50,19 @@
 
 using libaom_test::ACMRandom;
 
-typedef void (*DrPred_Hbd)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
-                           const uint16_t *above, const uint16_t *left,
-                           int upsample_above, int upsample_left, int dx,
-                           int dy, int bd);
+using DrPred_Hbd = void (*)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
+                            const uint16_t *above, const uint16_t *left,
+                            int upsample_above, int upsample_left, int dx,
+                            int dy, int bd);
 
-typedef void (*DrPred)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint8_t *above, const uint8_t *left,
-                       int upsample_above, int upsample_left, int dx, int dy,
-                       int bd);
+using DrPred = void (*)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint8_t *above, const uint8_t *left,
+                        int upsample_above, int upsample_left, int dx, int dy,
+                        int bd);
 
-typedef void (*Z1_Lbd)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint8_t *above, const uint8_t *left,
-                       int upsample_above, int dx, int dy);
+using Z1_Lbd = void (*)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint8_t *above, const uint8_t *left,
+                        int upsample_above, int dx, int dy);
 template <Z1_Lbd fn>
 void z1_wrapper(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                 const uint8_t *above, const uint8_t *left, int upsample_above,
@@ -72,9 +72,9 @@
   fn(dst, stride, bw, bh, above, left, upsample_above, dx, dy);
 }
 
-typedef void (*Z2_Lbd)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint8_t *above, const uint8_t *left,
-                       int upsample_above, int upsample_left, int dx, int dy);
+using Z2_Lbd = void (*)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint8_t *above, const uint8_t *left,
+                        int upsample_above, int upsample_left, int dx, int dy);
 template <Z2_Lbd fn>
 void z2_wrapper(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                 const uint8_t *above, const uint8_t *left, int upsample_above,
@@ -84,9 +84,9 @@
   fn(dst, stride, bw, bh, above, left, upsample_above, upsample_left, dx, dy);
 }
 
-typedef void (*Z3_Lbd)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint8_t *above, const uint8_t *left,
-                       int upsample_left, int dx, int dy);
+using Z3_Lbd = void (*)(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint8_t *above, const uint8_t *left,
+                        int upsample_left, int dx, int dy);
 template <Z3_Lbd fn>
 void z3_wrapper(uint8_t *dst, ptrdiff_t stride, int bw, int bh,
                 const uint8_t *above, const uint8_t *left, int upsample_above,
@@ -96,9 +96,9 @@
   fn(dst, stride, bw, bh, above, left, upsample_left, dx, dy);
 }
 
-typedef void (*Z1_Hbd)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint16_t *above, const uint16_t *left,
-                       int upsample_above, int dx, int dy, int bd);
+using Z1_Hbd = void (*)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint16_t *above, const uint16_t *left,
+                        int upsample_above, int dx, int dy, int bd);
 template <Z1_Hbd fn>
 void z1_wrapper_hbd(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
                     const uint16_t *above, const uint16_t *left,
@@ -109,10 +109,10 @@
   fn(dst, stride, bw, bh, above, left, upsample_above, dx, dy, bd);
 }
 
-typedef void (*Z2_Hbd)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint16_t *above, const uint16_t *left,
-                       int upsample_above, int upsample_left, int dx, int dy,
-                       int bd);
+using Z2_Hbd = void (*)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint16_t *above, const uint16_t *left,
+                        int upsample_above, int upsample_left, int dx, int dy,
+                        int bd);
 template <Z2_Hbd fn>
 void z2_wrapper_hbd(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
                     const uint16_t *above, const uint16_t *left,
@@ -123,9 +123,9 @@
      bd);
 }
 
-typedef void (*Z3_Hbd)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
-                       const uint16_t *above, const uint16_t *left,
-                       int upsample_left, int dx, int dy, int bd);
+using Z3_Hbd = void (*)(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
+                        const uint16_t *above, const uint16_t *left,
+                        int upsample_left, int dx, int dy, int bd);
 template <Z3_Hbd fn>
 void z3_wrapper_hbd(uint16_t *dst, ptrdiff_t stride, int bw, int bh,
                     const uint16_t *above, const uint16_t *left,
diff --git a/test/encodetxb_test.cc b/test/encodetxb_test.cc
index 55047e7..1881191 100644
--- a/test/encodetxb_test.cc
+++ b/test/encodetxb_test.cc
@@ -32,11 +32,11 @@
 namespace {
 using libaom_test::ACMRandom;
 
-typedef void (*GetNzMapContextsFunc)(const uint8_t *const levels,
-                                     const int16_t *const scan,
-                                     const uint16_t eob, const TX_SIZE tx_size,
-                                     const TX_CLASS tx_class,
-                                     int8_t *const coeff_contexts);
+using GetNzMapContextsFunc = void (*)(const uint8_t *const levels,
+                                      const int16_t *const scan,
+                                      const uint16_t eob, const TX_SIZE tx_size,
+                                      const TX_CLASS tx_class,
+                                      int8_t *const coeff_contexts);
 
 class EncodeTxbTest : public ::testing::TestWithParam<GetNzMapContextsFunc> {
  public:
@@ -202,11 +202,11 @@
                          ::testing::Values(av1_get_nz_map_contexts_neon));
 #endif
 
-typedef void (*av1_txb_init_levels_func)(const tran_low_t *const coeff,
-                                         const int width, const int height,
-                                         uint8_t *const levels);
+using av1_txb_init_levels_func = void (*)(const tran_low_t *const coeff,
+                                          const int width, const int height,
+                                          uint8_t *const levels);
 
-typedef std::tuple<av1_txb_init_levels_func, int> TxbInitLevelParam;
+using TxbInitLevelParam = std::tuple<av1_txb_init_levels_func, int>;
 
 class EncodeTxbInitLevelTest
     : public ::testing::TestWithParam<TxbInitLevelParam> {
diff --git a/test/end_to_end_psnr_test.cc b/test/end_to_end_psnr_test.cc
index e0e744b..9b27901 100644
--- a/test/end_to_end_psnr_test.cc
+++ b/test/end_to_end_psnr_test.cc
@@ -38,13 +38,13 @@
   { 34.9, 44.3, 38.5, 40.8 }
 };
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/end_to_end_qmpsnr_test.cc b/test/end_to_end_qmpsnr_test.cc
index c5911ff..64aee79 100644
--- a/test/end_to_end_qmpsnr_test.cc
+++ b/test/end_to_end_qmpsnr_test.cc
@@ -28,13 +28,13 @@
 const double kSsimThreshold[] = { 83.4, 83.4, 83.4, 83.3, 83.3,
                                   83.0, 82.3, 81.1, 81.1 };
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/end_to_end_ssim_test.cc b/test/end_to_end_ssim_test.cc
index 2b3fb87..730788b4 100644
--- a/test/end_to_end_ssim_test.cc
+++ b/test/end_to_end_ssim_test.cc
@@ -26,13 +26,13 @@
 const double kSsimThreshold[] = { 83.4, 83.4, 83.4, 83.3, 83.3,
                                   83.0, 82.3, 81.1, 81.1 };
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/fft_test.cc b/test/fft_test.cc
index 1d3d59a..a17249d 100644
--- a/test/fft_test.cc
+++ b/test/fft_test.cc
@@ -25,7 +25,7 @@
 
 namespace {
 
-typedef void (*tform_fun_t)(const float *input, float *temp, float *output);
+using tform_fun_t = void (*)(const float *input, float *temp, float *output);
 
 // Simple 1D FFT implementation
 template <typename InputType>
diff --git a/test/filterintra_test.cc b/test/filterintra_test.cc
index d425287..314be03 100644
--- a/test/filterintra_test.cc
+++ b/test/filterintra_test.cc
@@ -25,15 +25,15 @@
 using libaom_test::ACMRandom;
 using std::tuple;
 
-typedef void (*Predictor)(uint8_t *dst, ptrdiff_t stride, TX_SIZE tx_size,
-                          const uint8_t *above, const uint8_t *left, int mode);
+using Predictor = void (*)(uint8_t *dst, ptrdiff_t stride, TX_SIZE tx_size,
+                           const uint8_t *above, const uint8_t *left, int mode);
 
 // Note:
 //  Test parameter list:
 //  Reference predictor, optimized predictor, prediction mode, tx size
 //
-typedef tuple<Predictor, Predictor, int> PredFuncMode;
-typedef tuple<PredFuncMode, TX_SIZE> PredParams;
+using PredFuncMode = tuple<Predictor, Predictor, int>;
+using PredParams = tuple<PredFuncMode, TX_SIZE>;
 
 const int MaxTxSize = 32;
 
@@ -171,8 +171,6 @@
 #endif  // HAVE_SSE4_1
 
 #if HAVE_NEON
-// TODO(aomedia:349436249): enable for armv7 after SIGBUS is fixed.
-#if AOM_ARCH_AARCH64
 const PredFuncMode kPredFuncMdArrayNEON[] = {
   make_tuple(&av1_filter_intra_predictor_c, &av1_filter_intra_predictor_neon,
              FILTER_DC_PRED),
@@ -194,9 +192,6 @@
     NEON, AV1FilterIntraPredTest,
     ::testing::Combine(::testing::ValuesIn(kPredFuncMdArrayNEON),
                        ::testing::ValuesIn(kTxSizeNEON)));
-#else   // !AOM_ARCH_AARCH64
-GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AV1FilterIntraPredTest);
-#endif  // AOM_ARCH_AARCH64
 #endif  // HAVE_NEON
 
 }  // namespace
diff --git a/test/frame_resize_test.cc b/test/frame_resize_test.cc
index a9acb24..f9557e0 100644
--- a/test/frame_resize_test.cc
+++ b/test/frame_resize_test.cc
@@ -30,7 +30,7 @@
 
 const int kIters = 1000;
 
-typedef tuple<int, int> FrameDimension;
+using FrameDimension = tuple<int, int>;
 
 // Check that two 8-bit output buffers are identical.
 void AssertOutputBufferEq(const uint8_t *p1, const uint8_t *p2, int width,
@@ -50,12 +50,12 @@
   }
 }
 
-typedef bool (*LowBDResizeFunc)(uint8_t *intbuf, uint8_t *output,
-                                int out_stride, int height, int height2,
-                                int stride, int start_wd);
+using LowBDResizeFunc = bool (*)(uint8_t *intbuf, uint8_t *output,
+                                 int out_stride, int height, int height2,
+                                 int stride, int start_wd);
 // Test parameter list:
 //  <tst_fun, dims>
-typedef tuple<LowBDResizeFunc, FrameDimension> ResizeTestParams;
+using ResizeTestParams = tuple<LowBDResizeFunc, FrameDimension>;
 
 class AV1ResizeYTest : public ::testing::TestWithParam<ResizeTestParams> {
  public:
@@ -166,11 +166,11 @@
                        ::testing::ValuesIn(kFrameDim)));
 #endif
 
-typedef void (*LowBDResize_x_Func)(const uint8_t *const input, int in_stride,
-                                   uint8_t *intbuf, int height,
-                                   int filtered_length, int width2);
+using LowBDResize_x_Func = void (*)(const uint8_t *const input, int in_stride,
+                                    uint8_t *intbuf, int height,
+                                    int filtered_length, int width2);
 
-typedef tuple<LowBDResize_x_Func, FrameDimension> Resize_x_TestParams;
+using Resize_x_TestParams = tuple<LowBDResize_x_Func, FrameDimension>;
 
 class AV1ResizeXTest : public ::testing::TestWithParam<Resize_x_TestParams> {
  public:
diff --git a/test/frame_size_tests.cc b/test/frame_size_tests.cc
index 65269aa..4ed8354 100644
--- a/test/frame_size_tests.cc
+++ b/test/frame_size_tests.cc
@@ -327,10 +327,10 @@
     ::testing::Combine(::testing::Values(AOM_USAGE_ALL_INTRA),
                        ::testing::Values(AOM_Q), ::testing::Range(6, 10)));
 
-typedef struct {
+struct FrameSizeParam {
   unsigned int width;
   unsigned int height;
-} FrameSizeParam;
+};
 
 const FrameSizeParam FrameSizeTestParams[] = { { 96, 96 }, { 176, 144 } };
 
diff --git a/test/fwht4x4_test.cc b/test/fwht4x4_test.cc
index 8e9600a..c68879e 100644
--- a/test/fwht4x4_test.cc
+++ b/test/fwht4x4_test.cc
@@ -31,13 +31,13 @@
 using libaom_test::ACMRandom;
 
 namespace {
-typedef void (*FdctFunc)(const int16_t *in, tran_low_t *out, int stride);
-typedef void (*IdctFunc)(const tran_low_t *in, uint8_t *out, int stride);
+using FdctFunc = void (*)(const int16_t *in, tran_low_t *out, int stride);
+using IdctFunc = void (*)(const tran_low_t *in, uint8_t *out, int stride);
 
 using libaom_test::FhtFunc;
 
-typedef std::tuple<FdctFunc, IdctFunc, TX_TYPE, aom_bit_depth_t, int, FdctFunc>
-    Dct4x4Param;
+using Dct4x4Param =
+    std::tuple<FdctFunc, IdctFunc, TX_TYPE, aom_bit_depth_t, int, FdctFunc>;
 
 void fwht4x4_ref(const int16_t *in, tran_low_t *out, int stride,
                  TxfmParam * /*txfm_param*/) {
diff --git a/test/hash_test.cc b/test/hash_test.cc
index f824184..adc072e 100644
--- a/test/hash_test.cc
+++ b/test/hash_test.cc
@@ -24,10 +24,10 @@
 
 namespace {
 
-typedef uint32_t (*get_crc32c_value_func)(void *calculator, const uint8_t *p,
-                                          size_t length);
+using get_crc32c_value_func = uint32_t (*)(void *calculator, const uint8_t *p,
+                                           size_t length);
 
-typedef std::tuple<get_crc32c_value_func, int> HashParam;
+using HashParam = std::tuple<get_crc32c_value_func, int>;
 
 class AV1Crc32cHashTest : public ::testing::TestWithParam<HashParam> {
  public:
diff --git a/test/hbd_metrics_test.cc b/test/hbd_metrics_test.cc
index af050a0..6dfe4cc 100644
--- a/test/hbd_metrics_test.cc
+++ b/test/hbd_metrics_test.cc
@@ -29,11 +29,11 @@
 
 namespace {
 
-typedef double (*LBDMetricFunc)(const YV12_BUFFER_CONFIG *source,
-                                const YV12_BUFFER_CONFIG *dest);
-typedef double (*HBDMetricFunc)(const YV12_BUFFER_CONFIG *source,
-                                const YV12_BUFFER_CONFIG *dest, uint32_t in_bd,
-                                uint32_t bd);
+using LBDMetricFunc = double (*)(const YV12_BUFFER_CONFIG *source,
+                                 const YV12_BUFFER_CONFIG *dest);
+using HBDMetricFunc = double (*)(const YV12_BUFFER_CONFIG *source,
+                                 const YV12_BUFFER_CONFIG *dest, uint32_t in_bd,
+                                 uint32_t bd);
 
 double compute_hbd_psnr(const YV12_BUFFER_CONFIG *source,
                         const YV12_BUFFER_CONFIG *dest, uint32_t in_bd,
@@ -173,8 +173,8 @@
   HBDMetricFunc hbd_metric_;
 };
 
-typedef std::tuple<LBDMetricFunc, HBDMetricFunc, int, int, double>
-    MetricTestTParam;
+using MetricTestTParam =
+    std::tuple<LBDMetricFunc, HBDMetricFunc, int, int, double>;
 class HBDMetricsTest : public HBDMetricsTestBase,
                        public ::testing::TestWithParam<MetricTestTParam> {
  public:
diff --git a/test/hiprec_convolve_test_util.h b/test/hiprec_convolve_test_util.h
index 52a43e9..d33baa5 100644
--- a/test/hiprec_convolve_test_util.h
+++ b/test/hiprec_convolve_test_util.h
@@ -29,14 +29,14 @@
 
 namespace AV1HiprecConvolve {
 
-typedef void (*hiprec_convolve_func)(const uint8_t *src, ptrdiff_t src_stride,
-                                     uint8_t *dst, ptrdiff_t dst_stride,
-                                     const int16_t *filter_x, int x_step_q4,
-                                     const int16_t *filter_y, int y_step_q4,
-                                     int w, int h,
-                                     const WienerConvolveParams *conv_params);
+using hiprec_convolve_func = void (*)(const uint8_t *src, ptrdiff_t src_stride,
+                                      uint8_t *dst, ptrdiff_t dst_stride,
+                                      const int16_t *filter_x, int x_step_q4,
+                                      const int16_t *filter_y, int y_step_q4,
+                                      int w, int h,
+                                      const WienerConvolveParams *conv_params);
 
-typedef std::tuple<int, int, int, hiprec_convolve_func> HiprecConvolveParam;
+using HiprecConvolveParam = std::tuple<int, int, int, hiprec_convolve_func>;
 
 ::testing::internal::ParamGenerator<HiprecConvolveParam> BuildParams(
     hiprec_convolve_func filter);
@@ -58,14 +58,14 @@
 
 #if CONFIG_AV1_HIGHBITDEPTH
 namespace AV1HighbdHiprecConvolve {
-typedef void (*highbd_hiprec_convolve_func)(
-    const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
-    ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4,
-    const int16_t *filter_y, int y_step_q4, int w, int h,
-    const WienerConvolveParams *conv_params, int bps);
+using highbd_hiprec_convolve_func =
+    void (*)(const uint8_t *src, ptrdiff_t src_stride, uint8_t *dst,
+             ptrdiff_t dst_stride, const int16_t *filter_x, int x_step_q4,
+             const int16_t *filter_y, int y_step_q4, int w, int h,
+             const WienerConvolveParams *conv_params, int bps);
 
-typedef std::tuple<int, int, int, int, highbd_hiprec_convolve_func>
-    HighbdHiprecConvolveParam;
+using HighbdHiprecConvolveParam =
+    std::tuple<int, int, int, int, highbd_hiprec_convolve_func>;
 
 ::testing::internal::ParamGenerator<HighbdHiprecConvolveParam> BuildParams(
     highbd_hiprec_convolve_func filter);
diff --git a/test/horver_correlation_test.cc b/test/horver_correlation_test.cc
index bab3912..c9212de 100644
--- a/test/horver_correlation_test.cc
+++ b/test/horver_correlation_test.cc
@@ -26,10 +26,10 @@
 using libaom_test::ACMRandom;
 
 namespace {
-typedef void (*HorverFunc)(const int16_t *diff, int stride, int w, int h,
-                           float *hcorr, float *vcorr);
+using HorverFunc = void (*)(const int16_t *diff, int stride, int w, int h,
+                            float *hcorr, float *vcorr);
 
-typedef std::tuple<const HorverFunc> HorverTestParam;
+using HorverTestParam = std::tuple<const HorverFunc>;
 
 class HorverTest : public ::testing::TestWithParam<HorverTestParam> {
  public:
diff --git a/test/horz_superres_test.cc b/test/horz_superres_test.cc
index fee0cca..6d4ca8a 100644
--- a/test/horz_superres_test.cc
+++ b/test/horz_superres_test.cc
@@ -32,7 +32,7 @@
 
 const int kBitrate = 40;
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
@@ -41,7 +41,7 @@
   unsigned int screen_content;
   double psnr_threshold;   // used by modes other than AOM_SUPERRES_AUTO
   double psnr_threshold2;  // used by AOM_SUPERRES_AUTO
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
@@ -69,7 +69,7 @@
                                                           AOM_SUPERRES_AUTO };
 
 // Superres denominators and superres kf denominators to be tested
-typedef tuple<int, int> SuperresDenominatorPair;
+using SuperresDenominatorPair = tuple<int, int>;
 const SuperresDenominatorPair kSuperresDenominators[] = {
   make_tuple(16, 9),  make_tuple(13, 11), make_tuple(9, 9),
   make_tuple(13, 13), make_tuple(11, 16), make_tuple(8, 16),
@@ -77,7 +77,7 @@
 };
 
 // Superres q thresholds and superres kf q thresholds to be tested
-typedef tuple<int, int> SuperresQThresholdPair;
+using SuperresQThresholdPair = tuple<int, int>;
 const SuperresQThresholdPair kSuperresQThresholds[] = {
   make_tuple(63, 63), make_tuple(63, 41), make_tuple(17, 63),
   make_tuple(41, 11), make_tuple(1, 37),  make_tuple(11, 11),
@@ -88,9 +88,8 @@
 
 // Test parameter list:
 //  <[needed for EncoderTest], test_video_param_, superres_mode_>
-typedef tuple<const libaom_test::CodecFactory *, TestVideoParam,
-              aom_superres_mode>
-    HorzSuperresTestParam;
+using HorzSuperresTestParam =
+    tuple<const libaom_test::CodecFactory *, TestVideoParam, aom_superres_mode>;
 
 class HorzSuperresEndToEndTest
     : public ::testing::TestWithParam<HorzSuperresTestParam>,
@@ -187,9 +186,9 @@
 // Test parameter list:
 //  <[needed for EncoderTest], test_video_param_, tuple(superres_denom_,
 //  superres_kf_denom_)>
-typedef tuple<const libaom_test::CodecFactory *, TestVideoParam,
-              SuperresDenominatorPair>
-    HorzSuperresFixedTestParam;
+using HorzSuperresFixedTestParam =
+    tuple<const libaom_test::CodecFactory *, TestVideoParam,
+          SuperresDenominatorPair>;
 
 class HorzSuperresFixedEndToEndTest
     : public ::testing::TestWithParam<HorzSuperresFixedTestParam>,
@@ -297,9 +296,9 @@
 // Test parameter list:
 //  <[needed for EncoderTest], test_video_param_,
 //  tuple(superres_qthresh_,superres_kf_qthresh_)>
-typedef tuple<const libaom_test::CodecFactory *, TestVideoParam,
-              SuperresQThresholdPair>
-    HorzSuperresQThreshTestParam;
+using HorzSuperresQThreshTestParam =
+    tuple<const libaom_test::CodecFactory *, TestVideoParam,
+          SuperresQThresholdPair>;
 
 class HorzSuperresQThreshEndToEndTest
     : public ::testing::TestWithParam<HorzSuperresQThreshTestParam>,
diff --git a/test/intra_edge_test.cc b/test/intra_edge_test.cc
index d5cc3d5..9b992d2 100644
--- a/test/intra_edge_test.cc
+++ b/test/intra_edge_test.cc
@@ -62,8 +62,8 @@
   int size_;
 };
 
-typedef void (*UP8B)(uint8_t *p, int size);
-typedef libaom_test::FuncParam<UP8B> TestFuncs;
+using UP8B = void (*)(uint8_t *p, int size);
+using TestFuncs = libaom_test::FuncParam<UP8B>;
 
 class UpsampleTest8B : public UpsampleTest<UP8B, uint8_t> {
  protected:
@@ -154,8 +154,8 @@
   int strength_;
 };
 
-typedef void (*FE8B)(uint8_t *p, int size, int strength);
-typedef libaom_test::FuncParam<FE8B> FilterEdgeTestFuncs;
+using FE8B = void (*)(uint8_t *p, int size, int strength);
+using FilterEdgeTestFuncs = libaom_test::FuncParam<FE8B>;
 
 class FilterEdgeTest8B : public FilterEdgeTest<FE8B, uint8_t> {
  protected:
@@ -212,8 +212,8 @@
 
 #if CONFIG_AV1_HIGHBITDEPTH
 
-typedef void (*UPHB)(uint16_t *p, int size, int bd);
-typedef libaom_test::FuncParam<UPHB> TestFuncsHBD;
+using UPHB = void (*)(uint16_t *p, int size, int bd);
+using TestFuncsHBD = libaom_test::FuncParam<UPHB>;
 
 class UpsampleTestHB : public UpsampleTest<UPHB, uint16_t> {
  protected:
@@ -281,8 +281,8 @@
                                    av1_highbd_upsample_intra_edge_neon)));
 #endif  // HAVE_NEON
 
-typedef void (*FEHB)(uint16_t *p, int size, int strength);
-typedef libaom_test::FuncParam<FEHB> FilterEdgeTestFuncsHBD;
+using FEHB = void (*)(uint16_t *p, int size, int strength);
+using FilterEdgeTestFuncsHBD = libaom_test::FuncParam<FEHB>;
 
 class FilterEdgeTestHB : public FilterEdgeTest<FEHB, uint16_t> {
  protected:
diff --git a/test/intrapred_test.cc b/test/intrapred_test.cc
index bf9ab75..f3a246b 100644
--- a/test/intrapred_test.cc
+++ b/test/intrapred_test.cc
@@ -30,11 +30,11 @@
 
 const int count_test_block = 100000;
 
-typedef void (*HighbdIntraPred)(uint16_t *dst, ptrdiff_t stride,
-                                const uint16_t *above, const uint16_t *left,
-                                int bps);
-typedef void (*IntraPred)(uint8_t *dst, ptrdiff_t stride, const uint8_t *above,
-                          const uint8_t *left);
+using HighbdIntraPred = void (*)(uint16_t *dst, ptrdiff_t stride,
+                                 const uint16_t *above, const uint16_t *left,
+                                 int bps);
+using IntraPred = void (*)(uint8_t *dst, ptrdiff_t stride, const uint8_t *above,
+                           const uint8_t *left);
 
 }  // namespace
 
diff --git a/test/kf_test.cc b/test/kf_test.cc
index 3d76873..5d7cfdb 100644
--- a/test/kf_test.cc
+++ b/test/kf_test.cc
@@ -107,10 +107,10 @@
   }
 }
 
-typedef struct {
+struct kfIntervalParam {
   const unsigned int min_kf_dist;
   const unsigned int max_kf_dist;
-} kfIntervalParam;
+};
 
 const kfIntervalParam kfTestParams[] = {
   { 1, 1 }, { 0, 10 }, { 10, 10 }, { 0, 30 }, { 30, 30 }
diff --git a/test/loopfilter_control_test.cc b/test/loopfilter_control_test.cc
index 30f04b2..f498559 100644
--- a/test/loopfilter_control_test.cc
+++ b/test/loopfilter_control_test.cc
@@ -46,13 +46,13 @@
                            { 2, { { 0, 31.0 }, { 3, 31.0 } } },
                            { 3, { { 0, 31.0 }, { 3, 31.0 } } } } } };
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/lpf_test.cc b/test/lpf_test.cc
index edb107b..da9d162 100644
--- a/test/lpf_test.cc
+++ b/test/lpf_test.cc
@@ -45,16 +45,16 @@
       const uint8_t *thresh0, const uint8_t *blimit1, const uint8_t *limit1, \
       const uint8_t *thresh1
 
-typedef void (*loop_op_t)(uint8_t *s, LOOP_PARAM);
-typedef void (*dual_loop_op_t)(uint8_t *s, DUAL_LOOP_PARAM);
-typedef void (*hbdloop_op_t)(uint16_t *s, LOOP_PARAM, int bd);
-typedef void (*hbddual_loop_op_t)(uint16_t *s, DUAL_LOOP_PARAM, int bd);
+using loop_op_t = void (*)(uint8_t *s, LOOP_PARAM);
+using dual_loop_op_t = void (*)(uint8_t *s, DUAL_LOOP_PARAM);
+using hbdloop_op_t = void (*)(uint16_t *s, LOOP_PARAM, int bd);
+using hbddual_loop_op_t = void (*)(uint16_t *s, DUAL_LOOP_PARAM, int bd);
 
-typedef std::tuple<hbdloop_op_t, hbdloop_op_t, int> hbdloop_param_t;
-typedef std::tuple<hbddual_loop_op_t, hbddual_loop_op_t, int>
-    hbddual_loop_param_t;
-typedef std::tuple<loop_op_t, loop_op_t, int> loop_param_t;
-typedef std::tuple<dual_loop_op_t, dual_loop_op_t, int> dual_loop_param_t;
+using hbdloop_param_t = std::tuple<hbdloop_op_t, hbdloop_op_t, int>;
+using hbddual_loop_param_t =
+    std::tuple<hbddual_loop_op_t, hbddual_loop_op_t, int>;
+using loop_param_t = std::tuple<loop_op_t, loop_op_t, int>;
+using dual_loop_param_t = std::tuple<dual_loop_op_t, dual_loop_op_t, int>;
 
 template <typename Pixel_t, int PIXEL_WIDTH_t>
 void InitInput(Pixel_t *s, Pixel_t *ref_s, ACMRandom *rnd, const uint8_t limit,
@@ -161,15 +161,15 @@
 }
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef LoopTestParam<hbdloop_op_t, hbdloop_param_t> Loop8Test6Param_hbd;
+using Loop8Test6Param_hbd = LoopTestParam<hbdloop_op_t, hbdloop_param_t>;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(Loop8Test6Param_hbd);
-typedef LoopTestParam<hbddual_loop_op_t, hbddual_loop_param_t>
-    Loop8Test9Param_hbd;
+using Loop8Test9Param_hbd =
+    LoopTestParam<hbddual_loop_op_t, hbddual_loop_param_t>;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(Loop8Test9Param_hbd);
 #endif
-typedef LoopTestParam<loop_op_t, loop_param_t> Loop8Test6Param_lbd;
+using Loop8Test6Param_lbd = LoopTestParam<loop_op_t, loop_param_t>;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(Loop8Test6Param_lbd);
-typedef LoopTestParam<dual_loop_op_t, dual_loop_param_t> Loop8Test9Param_lbd;
+using Loop8Test9Param_lbd = LoopTestParam<dual_loop_op_t, dual_loop_param_t>;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(Loop8Test9Param_lbd);
 
 #define OPCHECK(a, b)                                                          \
diff --git a/test/masked_sad_test.cc b/test/masked_sad_test.cc
index 1b9e953..4de17ae 100644
--- a/test/masked_sad_test.cc
+++ b/test/masked_sad_test.cc
@@ -28,20 +28,20 @@
 namespace {
 const int number_of_iterations = 200;
 
-typedef unsigned int (*MaskedSADFunc)(const uint8_t *src, int src_stride,
-                                      const uint8_t *ref, int ref_stride,
-                                      const uint8_t *second_pred,
-                                      const uint8_t *msk, int msk_stride,
-                                      int invert_mask);
-typedef std::tuple<MaskedSADFunc, MaskedSADFunc> MaskedSADParam;
+using MaskedSADFunc = unsigned int (*)(const uint8_t *src, int src_stride,
+                                       const uint8_t *ref, int ref_stride,
+                                       const uint8_t *second_pred,
+                                       const uint8_t *msk, int msk_stride,
+                                       int invert_mask);
+using MaskedSADParam = std::tuple<MaskedSADFunc, MaskedSADFunc>;
 
-typedef void (*MaskedSADx4Func)(const uint8_t *src, int src_stride,
-                                const uint8_t *ref[], int ref_stride,
-                                const uint8_t *second_pred, const uint8_t *msk,
-                                int msk_stride, int invert_mask,
-                                unsigned sads[]);
+using MaskedSADx4Func = void (*)(const uint8_t *src, int src_stride,
+                                 const uint8_t *ref[], int ref_stride,
+                                 const uint8_t *second_pred, const uint8_t *msk,
+                                 int msk_stride, int invert_mask,
+                                 unsigned sads[]);
 
-typedef std::tuple<MaskedSADx4Func, MaskedSADx4Func> MaskedSADx4Param;
+using MaskedSADx4Param = std::tuple<MaskedSADx4Func, MaskedSADx4Func>;
 
 class MaskedSADTestBase : public ::testing::Test {
  public:
@@ -195,13 +195,13 @@
 TEST_P(MaskedSADTest, DISABLED_Speed) { runMaskedSADTest(2000000); }
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef unsigned int (*HighbdMaskedSADFunc)(const uint8_t *src, int src_stride,
-                                            const uint8_t *ref, int ref_stride,
-                                            const uint8_t *second_pred,
-                                            const uint8_t *msk, int msk_stride,
-                                            int invert_mask);
-typedef std::tuple<HighbdMaskedSADFunc, HighbdMaskedSADFunc>
-    HighbdMaskedSADParam;
+using HighbdMaskedSADFunc = unsigned int (*)(const uint8_t *src, int src_stride,
+                                             const uint8_t *ref, int ref_stride,
+                                             const uint8_t *second_pred,
+                                             const uint8_t *msk, int msk_stride,
+                                             int invert_mask);
+using HighbdMaskedSADParam =
+    std::tuple<HighbdMaskedSADFunc, HighbdMaskedSADFunc>;
 
 class HighbdMaskedSADTest
     : public ::testing::TestWithParam<HighbdMaskedSADParam> {
diff --git a/test/masked_variance_test.cc b/test/masked_variance_test.cc
index 4d9ebe9..e527efa 100644
--- a/test/masked_variance_test.cc
+++ b/test/masked_variance_test.cc
@@ -32,13 +32,13 @@
 namespace {
 const int number_of_iterations = 200;
 
-typedef unsigned int (*MaskedSubPixelVarianceFunc)(
+using MaskedSubPixelVarianceFunc = unsigned int (*)(
     const uint8_t *src, int src_stride, int xoffset, int yoffset,
     const uint8_t *ref, int ref_stride, const uint8_t *second_pred,
     const uint8_t *msk, int msk_stride, int invert_mask, unsigned int *sse);
 
-typedef std::tuple<MaskedSubPixelVarianceFunc, MaskedSubPixelVarianceFunc>
-    MaskedSubPixelVarianceParam;
+using MaskedSubPixelVarianceParam =
+    std::tuple<MaskedSubPixelVarianceFunc, MaskedSubPixelVarianceFunc>;
 
 class MaskedSubPixelVarianceTest
     : public ::testing::TestWithParam<MaskedSubPixelVarianceParam> {
@@ -170,9 +170,9 @@
 }
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef std::tuple<MaskedSubPixelVarianceFunc, MaskedSubPixelVarianceFunc,
-                   aom_bit_depth_t>
-    HighbdMaskedSubPixelVarianceParam;
+using HighbdMaskedSubPixelVarianceParam =
+    std::tuple<MaskedSubPixelVarianceFunc, MaskedSubPixelVarianceFunc,
+               aom_bit_depth_t>;
 
 class HighbdMaskedSubPixelVarianceTest
     : public ::testing::TestWithParam<HighbdMaskedSubPixelVarianceParam> {
diff --git a/test/metadata_test.cc b/test/metadata_test.cc
index da0fb13..c34f4da 100644
--- a/test/metadata_test.cc
+++ b/test/metadata_test.cc
@@ -35,6 +35,12 @@
   0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
 };
 
+const size_t kMetadataPayloadSizeT35Two = 10;
+// 0xB5 stands for the ITU-T T.35 metadata country code for the United States
+const uint8_t kMetadataPayloadT35Two[kMetadataPayloadSizeT35Two] = {
+  0xB5, 0x01, 0x02, 0x42, 0xff, 0xff, 0x00, 0x07, 0x08, 0x09
+};
+
 const size_t kMetadataPayloadSizeMdcv = 24;
 // Arbitrary content.
 const uint8_t kMetadataPayloadMdcv[kMetadataPayloadSizeMdcv] = {
@@ -96,11 +102,18 @@
     ASSERT_EQ(aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_ITUT_T35,
                                    nullptr, 0, AOM_MIF_ANY_FRAME),
               -1);
+
     ASSERT_EQ(aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_ITUT_T35,
                                    kMetadataPayloadT35, kMetadataPayloadSizeT35,
                                    AOM_MIF_ANY_FRAME),
               0);
 
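+    // Attach a second T.35 payload, this one flagged as layer specific.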
+    ASSERT_EQ(
+        aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_ITUT_T35,
+                             kMetadataPayloadT35Two, kMetadataPayloadSizeT35Two,
+                             AOM_MIF_ANY_FRAME_LAYER_SPECIFIC),
+        0);
+
     ASSERT_EQ(aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_HDR_MDCV,
                                    kMetadataPayloadMdcv,
                                    kMetadataPayloadSizeMdcv, AOM_MIF_KEY_FRAME),
@@ -143,32 +156,42 @@
 
     ASSERT_NE(img.metadata, nullptr);
 
-    ASSERT_EQ(img.metadata->sz, is_key_frame ? 3 : 1);
+    ASSERT_EQ(img.metadata->sz, is_key_frame ? 4 : 2);
 
-    ASSERT_EQ(OBU_METADATA_TYPE_ITUT_T35,
-              img.metadata->metadata_array[0]->type);
-    ASSERT_EQ(kMetadataPayloadSizeT35, img.metadata->metadata_array[0]->sz);
+    aom_metadata_t *metadata = img.metadata->metadata_array[0];
+    ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_ITUT_T35);
+    ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+    ASSERT_EQ(metadata->sz, kMetadataPayloadSizeT35);
     EXPECT_EQ(
-        memcmp(kMetadataPayloadT35, img.metadata->metadata_array[0]->payload,
-               kMetadataPayloadSizeT35),
+        memcmp(kMetadataPayloadT35, metadata->payload, kMetadataPayloadSizeT35),
         0);
 
-    if (is_key_frame) {
-      ASSERT_EQ(OBU_METADATA_TYPE_HDR_MDCV,
-                img.metadata->metadata_array[1]->type);
-      ASSERT_EQ(kMetadataPayloadSizeMdcv, img.metadata->metadata_array[1]->sz);
-      EXPECT_EQ(
-          memcmp(kMetadataPayloadMdcv, img.metadata->metadata_array[1]->payload,
-                 kMetadataPayloadSizeMdcv),
-          0);
+    metadata = img.metadata->metadata_array[1];
+    ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_ITUT_T35);
+    // AOM_MIF_ANY_FRAME and not AOM_MIF_ANY_FRAME_LAYER_SPECIFIC because the
+    // stream does not contain layers.
+    ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+    ASSERT_EQ(metadata->sz, kMetadataPayloadSizeT35Two);
+    EXPECT_EQ(memcmp(kMetadataPayloadT35Two, metadata->payload,
+                     kMetadataPayloadSizeT35Two),
+              0);
 
-      ASSERT_EQ(OBU_METADATA_TYPE_HDR_CLL,
-                img.metadata->metadata_array[2]->type);
-      ASSERT_EQ(kMetadataPayloadSizeCll, img.metadata->metadata_array[2]->sz);
-      EXPECT_EQ(
-          memcmp(kMetadataPayloadCll, img.metadata->metadata_array[2]->payload,
-                 kMetadataPayloadSizeCll),
-          0);
+    if (is_key_frame) {
+      metadata = img.metadata->metadata_array[2];
+      ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_HDR_MDCV);
+      ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+      ASSERT_EQ(metadata->sz, kMetadataPayloadSizeMdcv);
+      EXPECT_EQ(memcmp(kMetadataPayloadMdcv, metadata->payload,
+                       kMetadataPayloadSizeMdcv),
+                0);
+
+      metadata = img.metadata->metadata_array[3];
+      ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_HDR_CLL);
+      ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+      ASSERT_EQ(metadata->sz, kMetadataPayloadSizeCll);
+      EXPECT_EQ(memcmp(kMetadataPayloadCll, metadata->payload,
+                       kMetadataPayloadSizeCll),
+                0);
     }
   }
 
@@ -243,6 +266,12 @@
                                    AOM_MIF_ANY_FRAME),
               0);
 
+    ASSERT_EQ(
+        aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_ITUT_T35,
+                             kMetadataPayloadT35Two, kMetadataPayloadSizeT35Two,
+                             AOM_MIF_ANY_FRAME_LAYER_SPECIFIC),
+        0);
+
     ASSERT_EQ(aom_img_add_metadata(current_frame, OBU_METADATA_TYPE_HDR_MDCV,
                                    kMetadataPayloadMdcv,
                                    kMetadataPayloadSizeMdcv, AOM_MIF_KEY_FRAME),
@@ -288,32 +317,40 @@
 
     ASSERT_NE(img.metadata, nullptr);
 
-    ASSERT_EQ(img.metadata->sz, is_key_frame ? 3 : 1);
+    ASSERT_EQ(img.metadata->sz, is_key_frame ? 4 : 2);
 
-    ASSERT_EQ(OBU_METADATA_TYPE_ITUT_T35,
-              img.metadata->metadata_array[0]->type);
-    ASSERT_EQ(kMetadataPayloadSizeT35, img.metadata->metadata_array[0]->sz);
+    aom_metadata_t *metadata = img.metadata->metadata_array[0];
+    ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_ITUT_T35);
+    ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+    ASSERT_EQ(metadata->sz, kMetadataPayloadSizeT35);
     EXPECT_EQ(
-        memcmp(kMetadataPayloadT35, img.metadata->metadata_array[0]->payload,
-               kMetadataPayloadSizeT35),
+        memcmp(kMetadataPayloadT35, metadata->payload, kMetadataPayloadSizeT35),
         0);
 
-    if (is_key_frame) {
-      ASSERT_EQ(OBU_METADATA_TYPE_HDR_MDCV,
-                img.metadata->metadata_array[1]->type);
-      ASSERT_EQ(kMetadataPayloadSizeMdcv, img.metadata->metadata_array[1]->sz);
-      EXPECT_EQ(
-          memcmp(kMetadataPayloadMdcv, img.metadata->metadata_array[1]->payload,
-                 kMetadataPayloadSizeMdcv),
-          0);
+    metadata = img.metadata->metadata_array[1];
+    ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_ITUT_T35);
+    ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME_LAYER_SPECIFIC);
+    ASSERT_EQ(metadata->sz, kMetadataPayloadSizeT35Two);
+    EXPECT_EQ(memcmp(kMetadataPayloadT35Two, metadata->payload,
+                     kMetadataPayloadSizeT35Two),
+              0);
 
-      ASSERT_EQ(OBU_METADATA_TYPE_HDR_CLL,
-                img.metadata->metadata_array[2]->type);
-      ASSERT_EQ(kMetadataPayloadSizeCll, img.metadata->metadata_array[2]->sz);
-      EXPECT_EQ(
-          memcmp(kMetadataPayloadCll, img.metadata->metadata_array[2]->payload,
-                 kMetadataPayloadSizeCll),
-          0);
+    if (is_key_frame) {
+      metadata = img.metadata->metadata_array[2];
+      ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_HDR_MDCV);
+      ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+      ASSERT_EQ(metadata->sz, kMetadataPayloadSizeMdcv);
+      EXPECT_EQ(memcmp(kMetadataPayloadMdcv, metadata->payload,
+                       kMetadataPayloadSizeMdcv),
+                0);
+
+      metadata = img.metadata->metadata_array[3];
+      ASSERT_EQ(metadata->type, OBU_METADATA_TYPE_HDR_CLL);
+      ASSERT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+      ASSERT_EQ(metadata->sz, kMetadataPayloadSizeCll);
+      EXPECT_EQ(memcmp(kMetadataPayloadCll, metadata->payload,
+                       kMetadataPayloadSizeCll),
+                0);
     }
   }
 
@@ -402,6 +439,33 @@
             -1);
 }
 
+TEST(MetadataTest, AddLayerSpecificMetadataToImage) {
+  aom_image_t image;
+  image.metadata = nullptr;
+
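+  // ITU-T T.35 metadata may be marked layer specific, so this should succeed.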
+  ASSERT_EQ(
+      aom_img_add_metadata(
+          &image, OBU_METADATA_TYPE_ITUT_T35, kMetadataPayloadT35,
+          kMetadataPayloadSizeT35,
+          (aom_metadata_insert_flags_t)(AOM_MIF_ANY_FRAME_LAYER_SPECIFIC)),
+      0);
+  aom_img_metadata_array_free(image.metadata);
+}
+
+TEST(MetadataTest, AddLayerSpecificMetadataToImageNotAllowed) {
+  aom_image_t image;
+  image.metadata = nullptr;
+
+  // OBU_METADATA_TYPE_SCALABILITY cannot be layer specific.
+  ASSERT_EQ(
+      aom_img_add_metadata(
+          &image, OBU_METADATA_TYPE_SCALABILITY, kMetadataPayloadT35,
+          kMetadataPayloadSizeT35,
+          (aom_metadata_insert_flags_t)(AOM_MIF_ANY_FRAME_LAYER_SPECIFIC)),
+      -1);
+  aom_img_metadata_array_free(image.metadata);
+}
+
 TEST(MetadataTest, RemoveMetadataFromImage) {
   aom_image_t image;
   image.metadata = nullptr;
@@ -454,9 +518,13 @@
                                  kMetadataPayloadT35, kMetadataPayloadSizeT35,
                                  AOM_MIF_ANY_FRAME),
             0);
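+  // Add the same payload again, this time flagged as layer specific.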
+  ASSERT_EQ(aom_img_add_metadata(&image, OBU_METADATA_TYPE_ITUT_T35,
+                                 kMetadataPayloadT35, kMetadataPayloadSizeT35,
+                                 AOM_MIF_ANY_FRAME_LAYER_SPECIFIC),
+            0);
 
   EXPECT_EQ(aom_img_get_metadata(nullptr, 0), nullptr);
-  EXPECT_EQ(aom_img_get_metadata(&image, 1u), nullptr);
+  EXPECT_EQ(aom_img_get_metadata(&image, 2u), nullptr);
   EXPECT_EQ(aom_img_get_metadata(&image, 10u), nullptr);
 
   const aom_metadata_t *metadata = aom_img_get_metadata(&image, 0);
@@ -465,6 +533,15 @@
   EXPECT_EQ(
       memcmp(kMetadataPayloadT35, metadata->payload, kMetadataPayloadSizeT35),
       0);
+  EXPECT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME);
+
+  metadata = aom_img_get_metadata(&image, 1);
+  ASSERT_NE(metadata, nullptr);
+  ASSERT_EQ(metadata->sz, kMetadataPayloadSizeT35);
+  EXPECT_EQ(
+      memcmp(kMetadataPayloadT35, metadata->payload, kMetadataPayloadSizeT35),
+      0);
+  EXPECT_EQ(metadata->insert_flag, AOM_MIF_ANY_FRAME_LAYER_SPECIFIC);
 
   aom_img_metadata_array_free(image.metadata);
 }
diff --git a/test/minmax_test.cc b/test/minmax_test.cc
index b87ee97..bd1a965 100644
--- a/test/minmax_test.cc
+++ b/test/minmax_test.cc
@@ -26,8 +26,8 @@
 
 using ::libaom_test::ACMRandom;
 
-typedef void (*MinMaxFunc)(const uint8_t *a, int a_stride, const uint8_t *b,
-                           int b_stride, int *min, int *max);
+using MinMaxFunc = void (*)(const uint8_t *a, int a_stride, const uint8_t *b,
+                            int b_stride, int *min, int *max);
 
 class MinMaxTest : public ::testing::TestWithParam<MinMaxFunc> {
  public:
diff --git a/test/multilayer_metadata_test.cc b/test/multilayer_metadata_test.cc
index 33ce45f..faaf75f 100644
--- a/test/multilayer_metadata_test.cc
+++ b/test/multilayer_metadata_test.cc
@@ -22,27 +22,27 @@
 TEST(MultilayerMetadataTest, ParseAlpha) {
   const std::string metadata = R"(
 
-use_case: 1 # global alpha
-layers:
-  - layer_type: 5 # alpha
-    luma_plane_only_flag: 1
-    layer_metadata_scope: 2 # global
-    alpha:
-      alpha_use_idc: 1 # premultiplied
-      alpha_bit_depth: 8
-      alpha_transparent_value: 0
-      alpha_opaque_value: 4
+ use_case: 1 # global alpha
+ layers:
+   - layer_type: 5 # alpha
+     luma_plane_only_flag: 1
+     layer_metadata_scope: 2 # global
+     alpha:
+       alpha_use_idc: 1 # premultiplied
+       alpha_bit_depth: 8
+       alpha_transparent_value: 0
+       alpha_opaque_value: 4
 
-  - layer_type: 1 # texture
-    luma_plane_only_flag: 0
-    layer_metadata_scope: 2 # global
-    layer_color_description:
-      color_range: 1
-      color_primaries: 1
-      transfer_characteristics: 13
-      matrix_coefficients: 6
+   - layer_type: 1 # texture
+     luma_plane_only_flag: 0
+     layer_metadata_scope: 2 # global
+     layer_color_description:
+       color_range: 1
+       color_primaries: 1
+       transfer_characteristics: 13
+       matrix_coefficients: 6
 
-    )";
+     )";
   libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
   fprintf(tmp_file.file(), "%s", metadata.c_str());
   fflush(tmp_file.file());
@@ -55,10 +55,10 @@
   EXPECT_EQ(multilayer.layers[0].layer_type, 5);
   EXPECT_EQ(multilayer.layers[0].luma_plane_only_flag, 1);
   EXPECT_EQ(multilayer.layers[0].layer_metadata_scope, 2);
-  EXPECT_EQ(multilayer.layers[0].global_alpha_info.alpha_use_idc, 1);
-  EXPECT_EQ(multilayer.layers[0].global_alpha_info.alpha_bit_depth, 8);
-  EXPECT_EQ(multilayer.layers[0].global_alpha_info.alpha_transparent_value, 0);
-  EXPECT_EQ(multilayer.layers[0].global_alpha_info.alpha_opaque_value, 4);
+  EXPECT_EQ(multilayer.layers[0].alpha.alpha_use_idc, 1);
+  EXPECT_EQ(multilayer.layers[0].alpha.alpha_bit_depth, 8);
+  EXPECT_EQ(multilayer.layers[0].alpha.alpha_transparent_value, 0);
+  EXPECT_EQ(multilayer.layers[0].alpha.alpha_opaque_value, 4);
   EXPECT_EQ(multilayer.layers[1].layer_type, 1);
   EXPECT_EQ(multilayer.layers[1].luma_plane_only_flag, 0);
   EXPECT_EQ(multilayer.layers[1].layer_metadata_scope, 2);
@@ -76,26 +76,26 @@
 
 TEST(MultilayerMetadataTest, ParseDepth) {
   const std::string metadata = R"(
-use_case: 2 # global depth
-layers:
-  - layer_type: 6 # depth
-    luma_plane_only_flag: 1
-    layer_metadata_scope: 2 # global
-    depth:
-      z_near: 1.456
-      z_far: 9.786
-      depth_representation_type: 2
+ use_case: 2 # global depth
+ layers:
+   - layer_type: 6 # depth
+     luma_plane_only_flag: 1
+     layer_metadata_scope: 2 # global
+     depth:
+       z_near: 1.456
+       z_far: 9.786
+       depth_representation_type: 2
 
-  - layer_type: 1 # texture
-    luma_plane_only_flag: 0
-    layer_metadata_scope: 2 # global
-    layer_color_description:
-      color_range: 1
-      color_primaries: 1
-      transfer_characteristics: 13
-      matrix_coefficients: 6
+   - layer_type: 1 # texture
+     luma_plane_only_flag: 0
+     layer_metadata_scope: 2 # global
+     layer_color_description:
+       color_range: 1
+       color_primaries: 1
+       transfer_characteristics: 13
+       matrix_coefficients: 6
 
-    )";
+     )";
   libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
   fprintf(tmp_file.file(), "%s", metadata.c_str());
   fflush(tmp_file.file());
@@ -107,16 +107,15 @@
   EXPECT_EQ(multilayer.layers[0].layer_type, 6);
   EXPECT_EQ(multilayer.layers[0].luma_plane_only_flag, 1);
   EXPECT_EQ(multilayer.layers[0].layer_metadata_scope, 2);
-  EXPECT_TRUE(multilayer.layers[0].global_depth_info.z_near.second);
+  EXPECT_TRUE(multilayer.layers[0].depth.z_near.second);
   EXPECT_NEAR(depth_representation_element_to_double(
-                  multilayer.layers[0].global_depth_info.z_near.first),
+                  multilayer.layers[0].depth.z_near.first),
               1.456, 0.00001);
-  EXPECT_TRUE(multilayer.layers[0].global_depth_info.z_far.second);
+  EXPECT_TRUE(multilayer.layers[0].depth.z_far.second);
   EXPECT_NEAR(depth_representation_element_to_double(
-                  multilayer.layers[0].global_depth_info.z_far.first),
+                  multilayer.layers[0].depth.z_far.first),
               9.786, 0.00001);
-  EXPECT_EQ(multilayer.layers[0].global_depth_info.depth_representation_type,
-            2);
+  EXPECT_EQ(multilayer.layers[0].depth.depth_representation_type, 2);
   EXPECT_EQ(multilayer.layers[1].layer_type, 1);
   EXPECT_EQ(multilayer.layers[1].luma_plane_only_flag, 0);
   EXPECT_EQ(multilayer.layers[1].layer_metadata_scope, 2);
@@ -132,22 +131,115 @@
       6);
 }
 
-TEST(MultilayerMetadataTest, ParseInvalid) {
+TEST(MultilayerMetadataTest, ParseLocalDepth) {
   const std::string metadata = R"(
-
-use_case: 1 # global alpha
+use_case: 4 # depth
 layers:
-  - layer_type: 5 # alpha
+  - layer_type: 6 # depth
     luma_plane_only_flag: 1
-    layer_metadata_scope: 2 # global
+    layer_metadata_scope: 3 # mixed
+    depth:
+      z_near: 1.456
+      z_far: 9.786
+      depth_representation_type: 2
+    local_metadata:
+      - frame_idx: 4
+        depth:
+          z_near: 2.78933
+          z_far: 20.663
+          depth_representation_type: 0
+      - frame_idx: 100
+        depth:
+          z_near: 0
+          z_far: 24
+          depth_representation_type: 0
 
   - layer_type: 1 # texture
     luma_plane_only_flag: 0
-    layer_metadata_scope: 2 # global
+    layer_metadata_scope: 3 # mixed
+    layer_color_description:
+      color_range: 1
+      color_primaries: 1
+      transfer_characteristics: 13
+      matrix_coefficients: 6
+    )";
+  libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
+  fprintf(tmp_file.file(), "%s", metadata.c_str());
+  fflush(tmp_file.file());
+
+  MultilayerMetadata multilayer;
+  EXPECT_TRUE(parse_multilayer_file(tmp_file.file_name().c_str(), &multilayer));
+  EXPECT_EQ(multilayer.use_case, 4);
+  ASSERT_EQ(multilayer.layers.size(), 2);
+  EXPECT_EQ(multilayer.layers[0].layer_type, 6);
+  EXPECT_EQ(multilayer.layers[0].luma_plane_only_flag, 1);
+  EXPECT_EQ(multilayer.layers[0].layer_metadata_scope, 3);
+  EXPECT_TRUE(multilayer.layers[0].depth.z_near.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].depth.z_near.first),
+              1.456, 0.00001);
+  EXPECT_TRUE(multilayer.layers[0].depth.z_far.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].depth.z_far.first),
+              9.786, 0.00001);
+  EXPECT_EQ(multilayer.layers[0].depth.depth_representation_type, 2);
+  ASSERT_EQ(multilayer.layers[0].local_metadata.size(), 2);
+  EXPECT_EQ(multilayer.layers[0].local_metadata[0].frame_idx, 4);
+  EXPECT_TRUE(multilayer.layers[0].local_metadata[0].depth.z_near.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].local_metadata[0].depth.z_near.first),
+              2.78933, 0.00001);
+  EXPECT_TRUE(multilayer.layers[0].local_metadata[0].depth.z_far.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].local_metadata[0].depth.z_far.first),
+              20.663, 0.00001);
+  EXPECT_EQ(
+      multilayer.layers[0].local_metadata[0].depth.depth_representation_type,
+      0);
+  EXPECT_EQ(multilayer.layers[0].local_metadata[1].frame_idx, 100);
+  EXPECT_TRUE(multilayer.layers[0].local_metadata[1].depth.z_near.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].local_metadata[1].depth.z_near.first),
+              0, 0.00001);
+  EXPECT_TRUE(multilayer.layers[0].local_metadata[1].depth.z_far.second);
+  EXPECT_NEAR(depth_representation_element_to_double(
+                  multilayer.layers[0].local_metadata[1].depth.z_far.first),
+              24, 0.00001);
+  EXPECT_EQ(
+      multilayer.layers[0].local_metadata[1].depth.depth_representation_type,
+      0);
+  EXPECT_EQ(multilayer.layers[1].layer_type, 1);
+  EXPECT_EQ(multilayer.layers[1].luma_plane_only_flag, 0);
+  EXPECT_EQ(multilayer.layers[1].layer_metadata_scope, 3);
+  EXPECT_TRUE(multilayer.layers[1].layer_color_description.second);
+  EXPECT_EQ(multilayer.layers[1].layer_color_description.first.color_range, 1);
+  EXPECT_EQ(multilayer.layers[1].layer_color_description.first.color_primaries,
+            1);
+  EXPECT_EQ(multilayer.layers[1]
+                .layer_color_description.first.transfer_characteristics,
+            13);
+  EXPECT_EQ(
+      multilayer.layers[1].layer_color_description.first.matrix_coefficients,
+      6);
+  EXPECT_EQ(multilayer.layers[1].local_metadata.size(), 0);
+}
+
+TEST(MultilayerMetadataTest, ParseInvalid) {
+  const std::string metadata = R"(
+
+use_case: 3 # alpha
+layers:
+  - layer_type: 5 # alpha
+    luma_plane_only_flag: 1
+    layer_metadata_scope: 3 # mixed
+
+  - layer_type: 1 # texture
+    luma_plane_only_flag: 0
+    layer_metadata_scope: 3 # mixed
 
   - layer_type: 6 # depth => bad layer type
     luma_plane_only_flag: 1
-    layer_metadata_scope: 2 # global
+    layer_metadata_scope: 3 # mixed
     )";
   libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
   fprintf(tmp_file.file(), "%s", metadata.c_str());
@@ -162,16 +254,16 @@
 TEST(MultilayerMetadataTest, ParseBadIndent) {
   const std::string metadata = R"(
 
-use_case: 1 # global alpha
-layers:
-  - layer_type: 5 # alpha
-    luma_plane_only_flag: 1
-      layer_metadata_scope: 2 # global
+ use_case: 1 # global alpha
+ layers:
+   - layer_type: 5 # alpha
+     luma_plane_only_flag: 1
+       layer_metadata_scope: 2 # global
 
-  - layer_type: 1 # texture
-    luma_plane_only_flag: 0
-    layer_metadata_scope: 2 # global
-    )";
+   - layer_type: 1 # texture
+     luma_plane_only_flag: 0
+     layer_metadata_scope: 2 # global
+     )";
   libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
   fprintf(tmp_file.file(), "%s", metadata.c_str());
   fflush(tmp_file.file());
@@ -185,17 +277,17 @@
 TEST(MultilayerMetadataTest, ParseUnknownField) {
   const std::string metadata = R"(
 
-use_case: 1 # global alpha
-layers:
-  - layer_type: 5 # alpha
-    luma_plane_only_flag: 1
-    layer_metadata_scope: 2 # global
-    foobar: 42
+ use_case: 1 # global alpha
+ layers:
+   - layer_type: 5 # alpha
+     luma_plane_only_flag: 1
+     layer_metadata_scope: 2 # global
+     foobar: 42
 
-  - layer_type: 1 # texture
-    luma_plane_only_flag: 0
-    layer_metadata_scope: 2 # global
-    )";
+   - layer_type: 1 # texture
+     luma_plane_only_flag: 0
+     layer_metadata_scope: 2 # global
+     )";
   libaom_test::TempOutFile tmp_file(/*text_mode=*/true);
   fprintf(tmp_file.file(), "%s", metadata.c_str());
   fflush(tmp_file.file());
diff --git a/test/noise_model_test.cc b/test/noise_model_test.cc
index 9f05a44..23f300c 100644
--- a/test/noise_model_test.cc
+++ b/test/noise_model_test.cc
@@ -359,7 +359,7 @@
 // (uint8_t and uint16_t) and bit depths (8, 10, 12).
 template <typename T, int bit_depth, bool use_highbd>
 struct BitDepthParams {
-  typedef T data_type_t;
+  using data_type_t = T;
   static const int kBitDepth = bit_depth;
   static const bool kUseHighBD = use_highbd;
 };
@@ -368,7 +368,7 @@
 class FlatBlockEstimatorTest : public ::testing::Test, public T {
  public:
   void SetUp() override { random_.Reset(171); }
-  typedef std::vector<typename T::data_type_t> VecType;
+  using VecType = std::vector<typename T::data_type_t>;
   VecType data_;
   libaom_test::ACMRandom random_;
 };
@@ -527,11 +527,9 @@
 REGISTER_TYPED_TEST_SUITE_P(FlatBlockEstimatorTest, ExtractBlock,
                             FindFlatBlocks);
 
-typedef ::testing::Types<BitDepthParams<uint8_t, 8, false>,   // lowbd
-                         BitDepthParams<uint16_t, 8, true>,   // lowbd in 16-bit
-                         BitDepthParams<uint16_t, 10, true>,  // highbd data
-                         BitDepthParams<uint16_t, 12, true> >
-    AllBitDepthParams;
+using AllBitDepthParams = ::testing::Types<
+    BitDepthParams<uint8_t, 8, false>, BitDepthParams<uint16_t, 8, true>,
+    BitDepthParams<uint16_t, 10, true>, BitDepthParams<uint16_t, 12, true>>;
 // Note the empty final argument can be removed if C++20 is made the minimum
 // requirement.
 INSTANTIATE_TYPED_TEST_SUITE_P(FlatBlockInstatiation, FlatBlockEstimatorTest,
diff --git a/test/obmc_sad_test.cc b/test/obmc_sad_test.cc
index dd6484f..17f3251 100644
--- a/test/obmc_sad_test.cc
+++ b/test/obmc_sad_test.cc
@@ -28,9 +28,9 @@
 static const int kIterations = 1000;
 static const int kMaskMax = 64;
 
-typedef unsigned int (*ObmcSadF)(const uint8_t *pre, int pre_stride,
-                                 const int32_t *wsrc, const int32_t *mask);
-typedef libaom_test::FuncParam<ObmcSadF> TestFuncs;
+using ObmcSadF = unsigned int (*)(const uint8_t *pre, int pre_stride,
+                                  const int32_t *wsrc, const int32_t *mask);
+using TestFuncs = libaom_test::FuncParam<ObmcSadF>;
 
 ////////////////////////////////////////////////////////////////////////////////
 // 8 bit
diff --git a/test/obmc_variance_test.cc b/test/obmc_variance_test.cc
index 39b816e..16482f7 100644
--- a/test/obmc_variance_test.cc
+++ b/test/obmc_variance_test.cc
@@ -30,10 +30,10 @@
 static const int kIterations = 1000;
 static const int kMaskMax = 64;
 
-typedef unsigned int (*ObmcVarF)(const uint8_t *pre, int pre_stride,
-                                 const int32_t *wsrc, const int32_t *mask,
-                                 unsigned int *sse);
-typedef libaom_test::FuncParam<ObmcVarF> TestFuncs;
+using ObmcVarF = unsigned int (*)(const uint8_t *pre, int pre_stride,
+                                  const int32_t *wsrc, const int32_t *mask,
+                                  unsigned int *sse);
+using TestFuncs = libaom_test::FuncParam<ObmcVarF>;
 
 ////////////////////////////////////////////////////////////////////////////////
 // 8 bit
diff --git a/test/pickrst_test.cc b/test/pickrst_test.cc
index fd25b0c..98b686d 100644
--- a/test/pickrst_test.cc
+++ b/test/pickrst_test.cc
@@ -29,7 +29,7 @@
 namespace pickrst_test_lowbd {
 static const int kIterations = 100;
 
-typedef int64_t (*lowbd_pixel_proj_error_func)(
+using lowbd_pixel_proj_error_func = int64_t (*)(
     const uint8_t *src8, int width, int height, int src_stride,
     const uint8_t *dat8, int dat_stride, int32_t *flt0, int flt0_stride,
     int32_t *flt1, int flt1_stride, int xq[2], const sgr_params_type *params);
@@ -38,7 +38,7 @@
 // 8 bit
 ////////////////////////////////////////////////////////////////////////////////
 
-typedef std::tuple<const lowbd_pixel_proj_error_func> PixelProjErrorTestParam;
+using PixelProjErrorTestParam = std::tuple<const lowbd_pixel_proj_error_func>;
 
 class PixelProjErrorTest
     : public ::testing::TestWithParam<PixelProjErrorTestParam> {
@@ -201,7 +201,7 @@
 namespace pickrst_test_highbd {
 static const int kIterations = 100;
 
-typedef int64_t (*highbd_pixel_proj_error_func)(
+using highbd_pixel_proj_error_func = int64_t (*)(
     const uint8_t *src8, int width, int height, int src_stride,
     const uint8_t *dat8, int dat_stride, int32_t *flt0, int flt0_stride,
     int32_t *flt1, int flt1_stride, int xq[2], const sgr_params_type *params);
@@ -210,7 +210,7 @@
 // High bit-depth
 ////////////////////////////////////////////////////////////////////////////////
 
-typedef std::tuple<const highbd_pixel_proj_error_func> PixelProjErrorTestParam;
+using PixelProjErrorTestParam = std::tuple<const highbd_pixel_proj_error_func>;
 
 class PixelProjHighbdErrorTest
     : public ::testing::TestWithParam<PixelProjErrorTestParam> {
@@ -379,15 +379,15 @@
 namespace get_proj_subspace_test_lowbd {
 static const int kIterations = 100;
 
-typedef void (*set_get_proj_subspace)(const uint8_t *src8, int width,
-                                      int height, int src_stride,
-                                      const uint8_t *dat8, int dat_stride,
-                                      int32_t *flt0, int flt0_stride,
-                                      int32_t *flt1, int flt1_stride,
-                                      int64_t H[2][2], int64_t C[2],
-                                      const sgr_params_type *params);
+using set_get_proj_subspace = void (*)(const uint8_t *src8, int width,
+                                       int height, int src_stride,
+                                       const uint8_t *dat8, int dat_stride,
+                                       int32_t *flt0, int flt0_stride,
+                                       int32_t *flt1, int flt1_stride,
+                                       int64_t H[2][2], int64_t C[2],
+                                       const sgr_params_type *params);
 
-typedef std::tuple<const set_get_proj_subspace> GetProjSubspaceTestParam;
+using GetProjSubspaceTestParam = std::tuple<const set_get_proj_subspace>;
 
 class GetProjSubspaceTest
     : public ::testing::TestWithParam<GetProjSubspaceTestParam> {
@@ -564,15 +564,15 @@
 namespace get_proj_subspace_test_hbd {
 static const int kIterations = 100;
 
-typedef void (*set_get_proj_subspace_hbd)(const uint8_t *src8, int width,
-                                          int height, int src_stride,
-                                          const uint8_t *dat8, int dat_stride,
-                                          int32_t *flt0, int flt0_stride,
-                                          int32_t *flt1, int flt1_stride,
-                                          int64_t H[2][2], int64_t C[2],
-                                          const sgr_params_type *params);
+using set_get_proj_subspace_hbd = void (*)(const uint8_t *src8, int width,
+                                           int height, int src_stride,
+                                           const uint8_t *dat8, int dat_stride,
+                                           int32_t *flt0, int flt0_stride,
+                                           int32_t *flt1, int flt1_stride,
+                                           int64_t H[2][2], int64_t C[2],
+                                           const sgr_params_type *params);
 
-typedef std::tuple<const set_get_proj_subspace_hbd> GetProjSubspaceHBDTestParam;
+using GetProjSubspaceHBDTestParam = std::tuple<const set_get_proj_subspace_hbd>;
 
 class GetProjSubspaceTestHBD
     : public ::testing::TestWithParam<GetProjSubspaceHBDTestParam> {
diff --git a/test/quant_test.cc b/test/quant_test.cc
index 120eae7..505d19d 100644
--- a/test/quant_test.cc
+++ b/test/quant_test.cc
@@ -91,10 +91,10 @@
                            ::testing::Range(5, 9));
 
 #if !CONFIG_REALTIME_ONLY
-typedef struct {
+struct QuantParam {
   const unsigned int min_q;
   const unsigned int max_q;
-} QuantParam;
+};
 
 const QuantParam QuantTestParams[] = {
   { 0, 10 }, { 0, 60 }, { 20, 35 }, { 35, 50 }, { 50, 63 }
diff --git a/test/quantize_func_test.cc b/test/quantize_func_test.cc
index 782f93c..64d9ce4 100644
--- a/test/quantize_func_test.cc
+++ b/test/quantize_func_test.cc
@@ -43,9 +43,9 @@
       const int16_t *dequant_ptr, uint16_t *eob_ptr, const int16_t *scan,  \
       const int16_t *iscan
 
-typedef void (*LPQuantizeFunc)(LP_QUANTIZE_PARAM_LIST);
-typedef void (*QuantizeFunc)(QUAN_PARAM_LIST);
-typedef void (*QuantizeFuncHbd)(QUAN_PARAM_LIST, int log_scale);
+using LPQuantizeFunc = void (*)(LP_QUANTIZE_PARAM_LIST);
+using QuantizeFunc = void (*)(QUAN_PARAM_LIST);
+using QuantizeFuncHbd = void (*)(QUAN_PARAM_LIST, int log_scale);
 
 #undef LP_QUANTIZE_PARAM_LIST
 
@@ -83,10 +83,10 @@
 using QuantizeParam =
     tuple<FuncType, FuncType, TX_SIZE, QuantType, aom_bit_depth_t>;
 
-typedef struct {
+struct QuanTable {
   QUANTS quant;
   Dequants dequant;
-} QuanTable;
+};
 
 const int kTestNum = 1000;
 
diff --git a/test/rt_end_to_end_test.cc b/test/rt_end_to_end_test.cc
index dced915..244f465 100644
--- a/test/rt_end_to_end_test.cc
+++ b/test/rt_end_to_end_test.cc
@@ -60,13 +60,13 @@
                            { 9, { { 0, 31.0 }, { 3, 31.0 } } },
                            { 10, { { 0, 31.0 }, { 3, 31.0 } } } } } };
 
-typedef struct {
+struct TestVideoParam {
   const char *filename;
   unsigned int input_bit_depth;
   aom_img_fmt fmt;
   aom_bit_depth_t bit_depth;
   unsigned int profile;
-} TestVideoParam;
+};
 
 std::ostream &operator<<(std::ostream &os, const TestVideoParam &test_arg) {
   return os << "TestVideoParam { filename:" << test_arg.filename
diff --git a/test/sad_test.cc b/test/sad_test.cc
index af73b68..416f0ee 100644
--- a/test/sad_test.cc
+++ b/test/sad_test.cc
@@ -26,41 +26,41 @@
 #include "aom_mem/aom_mem.h"
 #include "aom_ports/mem.h"
 
-typedef unsigned int (*SadMxNFunc)(const uint8_t *src_ptr, int src_stride,
-                                   const uint8_t *ref_ptr, int ref_stride);
-typedef std::tuple<int, int, SadMxNFunc, int> SadMxNParam;
+using SadMxNFunc = unsigned int (*)(const uint8_t *src_ptr, int src_stride,
+                                    const uint8_t *ref_ptr, int ref_stride);
+using SadMxNParam = std::tuple<int, int, SadMxNFunc, int>;
 
-typedef unsigned int (*SadSkipMxNFunc)(const uint8_t *src_ptr, int src_stride,
-                                       const uint8_t *ref_ptr, int ref_stride);
-typedef std::tuple<int, int, SadSkipMxNFunc, int> SadSkipMxNParam;
+using SadSkipMxNFunc = unsigned int (*)(const uint8_t *src_ptr, int src_stride,
+                                        const uint8_t *ref_ptr, int ref_stride);
+using SadSkipMxNParam = std::tuple<int, int, SadSkipMxNFunc, int>;
 
-typedef uint32_t (*SadMxNAvgFunc)(const uint8_t *src_ptr, int src_stride,
-                                  const uint8_t *ref_ptr, int ref_stride,
-                                  const uint8_t *second_pred);
-typedef std::tuple<int, int, SadMxNAvgFunc, int> SadMxNAvgParam;
+using SadMxNAvgFunc = uint32_t (*)(const uint8_t *src_ptr, int src_stride,
+                                   const uint8_t *ref_ptr, int ref_stride,
+                                   const uint8_t *second_pred);
+using SadMxNAvgParam = std::tuple<int, int, SadMxNAvgFunc, int>;
 
-typedef unsigned int (*DistWtdSadMxhFunc)(const uint8_t *src_ptr,
-                                          int src_stride,
-                                          const uint8_t *ref_ptr,
-                                          int ref_stride, int width,
-                                          int height);
-typedef std::tuple<int, int, DistWtdSadMxhFunc, int> DistWtdSadMxhParam;
+using DistWtdSadMxhFunc = unsigned int (*)(const uint8_t *src_ptr,
+                                           int src_stride,
+                                           const uint8_t *ref_ptr,
+                                           int ref_stride, int width,
+                                           int height);
+using DistWtdSadMxhParam = std::tuple<int, int, DistWtdSadMxhFunc, int>;
 
-typedef void (*SadMxNx4Func)(const uint8_t *src_ptr, int src_stride,
-                             const uint8_t *const ref_ptr[], int ref_stride,
-                             uint32_t *sad_array);
-typedef std::tuple<int, int, SadMxNx4Func, int> SadMxNx4Param;
+using SadMxNx4Func = void (*)(const uint8_t *src_ptr, int src_stride,
+                              const uint8_t *const ref_ptr[], int ref_stride,
+                              uint32_t *sad_array);
+using SadMxNx4Param = std::tuple<int, int, SadMxNx4Func, int>;
 
-typedef void (*SadSkipMxNx4Func)(const uint8_t *src_ptr, int src_stride,
+using SadSkipMxNx4Func = void (*)(const uint8_t *src_ptr, int src_stride,
+                                  const uint8_t *const ref_ptr[],
+                                  int ref_stride, uint32_t *sad_array);
+using SadSkipMxNx4Param = std::tuple<int, int, SadSkipMxNx4Func, int>;
+
+using SadMxNx4AvgFunc = void (*)(const uint8_t *src_ptr, int src_stride,
                                  const uint8_t *const ref_ptr[], int ref_stride,
+                                 const uint8_t *second_pred,
                                  uint32_t *sad_array);
-typedef std::tuple<int, int, SadSkipMxNx4Func, int> SadSkipMxNx4Param;
-
-typedef void (*SadMxNx4AvgFunc)(const uint8_t *src_ptr, int src_stride,
-                                const uint8_t *const ref_ptr[], int ref_stride,
-                                const uint8_t *second_pred,
-                                uint32_t *sad_array);
-typedef std::tuple<int, int, SadMxNx4AvgFunc, int> SadMxNx4AvgParam;
+using SadMxNx4AvgParam = std::tuple<int, int, SadMxNx4AvgFunc, int>;
 
 using libaom_test::ACMRandom;
 
diff --git a/test/selfguided_filter_test.cc b/test/selfguided_filter_test.cc
index 7540036..4a2d814 100644
--- a/test/selfguided_filter_test.cc
+++ b/test/selfguided_filter_test.cc
@@ -31,13 +31,13 @@
 using std::make_tuple;
 using std::tuple;
 
-typedef int (*SgrFunc)(const uint8_t *dat8, int width, int height, int stride,
-                       int eps, const int *xqd, uint8_t *dst8, int dst_stride,
-                       int32_t *tmpbuf, int bit_depth, int highbd);
+using SgrFunc = int (*)(const uint8_t *dat8, int width, int height, int stride,
+                        int eps, const int *xqd, uint8_t *dst8, int dst_stride,
+                        int32_t *tmpbuf, int bit_depth, int highbd);
 
 // Test parameter list:
 //  <tst_fun_>
-typedef tuple<SgrFunc> FilterTestParam;
+using FilterTestParam = tuple<SgrFunc>;
 
 class AV1SelfguidedFilterTest
     : public ::testing::TestWithParam<FilterTestParam> {
@@ -232,7 +232,7 @@
 #if CONFIG_AV1_HIGHBITDEPTH
 // Test parameter list:
 //  <tst_fun_, bit_depth>
-typedef tuple<SgrFunc, int> HighbdFilterTestParam;
+using HighbdFilterTestParam = tuple<SgrFunc, int>;
 
 class AV1HighbdSelfguidedFilterTest
     : public ::testing::TestWithParam<HighbdFilterTestParam> {
diff --git a/test/simd_cmp_impl.inc b/test/simd_cmp_impl.inc
index 0a9a195..a2373d7 100644
--- a/test/simd_cmp_impl.inc
+++ b/test/simd_cmp_impl.inc
@@ -463,13 +463,13 @@
   return c_v256_ssd_s16_sum(::c_v256_ssd_s16(c_v256_ssd_s16_init(), a, b));
 }
 
-typedef void (*fptr)();
+using fptr = void (*)();
 
-typedef struct {
+struct mapping {
   const char *name;
   fptr ref;
   fptr simd;
-} mapping;
+};
 
 #define MAP(name) \
   { #name, reinterpret_cast < fptr>(c_##name), reinterpret_cast < fptr>(name) }
diff --git a/test/simd_impl.h b/test/simd_impl.h
index 20737a0..cb5b2d6 100644
--- a/test/simd_impl.h
+++ b/test/simd_impl.h
@@ -35,9 +35,9 @@
 };
 
 // Create one typedef for each function signature
-#define TYPEDEF_SIMD(name)                                             \
-  typedef TestIntrinsic<std::tuple<uint32_t, uint32_t, const char *> > \
-  ARCH_POSTFIX(name)
+#define TYPEDEF_SIMD(name)   \
+  using ARCH_POSTFIX(name) = \
+      TestIntrinsic<std::tuple<uint32_t, uint32_t, const char *> >
 
 TYPEDEF_SIMD(V64_U8);
 TYPEDEF_SIMD(V64_U16);
@@ -87,17 +87,17 @@
 TYPEDEF_SIMD(V64_V256);
 
 // Google Test allows up to 50 tests per case, so split the largest
-typedef ARCH_POSTFIX(V64_V64) ARCH_POSTFIX(V64_V64_Part2);
-typedef ARCH_POSTFIX(V64_V64V64) ARCH_POSTFIX(V64_V64V64_Part2);
-typedef ARCH_POSTFIX(V128_V128) ARCH_POSTFIX(V128_V128_Part2);
-typedef ARCH_POSTFIX(V128_V128) ARCH_POSTFIX(V128_V128_Part3);
-typedef ARCH_POSTFIX(V128_V128) ARCH_POSTFIX(V128_V128_Part4);
-typedef ARCH_POSTFIX(V128_V128V128) ARCH_POSTFIX(V128_V128V128_Part2);
-typedef ARCH_POSTFIX(V256_V256) ARCH_POSTFIX(V256_V256_Part2);
-typedef ARCH_POSTFIX(V256_V256) ARCH_POSTFIX(V256_V256_Part3);
-typedef ARCH_POSTFIX(V256_V256) ARCH_POSTFIX(V256_V256_Part4);
-typedef ARCH_POSTFIX(V256_V256) ARCH_POSTFIX(V256_V256_Part5);
-typedef ARCH_POSTFIX(V256_V256V256) ARCH_POSTFIX(V256_V256V256_Part2);
+using ARCH_POSTFIX(V64_V64_Part2) = ARCH_POSTFIX(V64_V64);
+using ARCH_POSTFIX(V64_V64V64_Part2) = ARCH_POSTFIX(V64_V64V64);
+using ARCH_POSTFIX(V128_V128_Part2) = ARCH_POSTFIX(V128_V128);
+using ARCH_POSTFIX(V128_V128_Part3) = ARCH_POSTFIX(V128_V128);
+using ARCH_POSTFIX(V128_V128_Part4) = ARCH_POSTFIX(V128_V128);
+using ARCH_POSTFIX(V128_V128V128_Part2) = ARCH_POSTFIX(V128_V128V128);
+using ARCH_POSTFIX(V256_V256_Part2) = ARCH_POSTFIX(V256_V256);
+using ARCH_POSTFIX(V256_V256_Part3) = ARCH_POSTFIX(V256_V256);
+using ARCH_POSTFIX(V256_V256_Part4) = ARCH_POSTFIX(V256_V256);
+using ARCH_POSTFIX(V256_V256_Part5) = ARCH_POSTFIX(V256_V256);
+using ARCH_POSTFIX(V256_V256V256_Part2) = ARCH_POSTFIX(V256_V256V256);
 
 // These functions are machine tuned located elsewhere
 template <typename c_ret, typename c_arg>
diff --git a/test/sse_sum_test.cc b/test/sse_sum_test.cc
index 54fbeae..8b33cf1 100644
--- a/test/sse_sum_test.cc
+++ b/test/sse_sum_test.cc
@@ -35,9 +35,9 @@
 namespace {
 const int kNumIterations = 10000;
 
-typedef uint64_t (*SSI16Func)(const int16_t *src, int src_stride, int width,
-                              int height, int *sum);
-typedef libaom_test::FuncParam<SSI16Func> TestFuncs;
+using SSI16Func = uint64_t (*)(const int16_t *src, int src_stride, int width,
+                               int height, int *sum);
+using TestFuncs = libaom_test::FuncParam<SSI16Func>;
 
 class SumSSETest : public ::testing::TestWithParam<TestFuncs> {
  public:
diff --git a/test/subtract_test.cc b/test/subtract_test.cc
index ca2b2a9..c478210 100644
--- a/test/subtract_test.cc
+++ b/test/subtract_test.cc
@@ -24,10 +24,10 @@
 #include "aom_mem/aom_mem.h"
 #include "aom_ports/mem.h"
 
-typedef void (*SubtractFunc)(int rows, int cols, int16_t *diff_ptr,
-                             ptrdiff_t diff_stride, const uint8_t *src_ptr,
-                             ptrdiff_t src_stride, const uint8_t *pred_ptr,
-                             ptrdiff_t pred_stride);
+using SubtractFunc = void (*)(int rows, int cols, int16_t *diff_ptr,
+                              ptrdiff_t diff_stride, const uint8_t *src_ptr,
+                              ptrdiff_t src_stride, const uint8_t *pred_ptr,
+                              ptrdiff_t pred_stride);
 
 namespace {
 
diff --git a/test/sum_squares_test.cc b/test/sum_squares_test.cc
index 26d0361..09b3c05 100644
--- a/test/sum_squares_test.cc
+++ b/test/sum_squares_test.cc
@@ -38,9 +38,9 @@
 
 static const int16_t kInt13Max = (1 << 12) - 1;
 
-typedef uint64_t (*SSI16Func)(const int16_t *src, int stride, int width,
-                              int height);
-typedef libaom_test::FuncParam<SSI16Func> TestFuncs;
+using SSI16Func = uint64_t (*)(const int16_t *src, int stride, int width,
+                               int height);
+using TestFuncs = libaom_test::FuncParam<SSI16Func>;
 
 class SumSquaresTest : public ::testing::TestWithParam<TestFuncs> {
  public:
@@ -191,8 +191,8 @@
 // 1D version
 //////////////////////////////////////////////////////////////////////////////
 
-typedef uint64_t (*F1D)(const int16_t *src, uint32_t n);
-typedef libaom_test::FuncParam<F1D> TestFuncs1D;
+using F1D = uint64_t (*)(const int16_t *src, uint32_t n);
+using TestFuncs1D = libaom_test::FuncParam<F1D>;
 
 class SumSquares1DTest : public FunctionEquivalenceTest<F1D> {
  protected:
@@ -261,11 +261,11 @@
 
 #endif  // HAVE_SVE
 
-typedef int64_t (*SSEFunc)(const uint8_t *a, int a_stride, const uint8_t *b,
-                           int b_stride, int width, int height);
-typedef libaom_test::FuncParam<SSEFunc> TestSSEFuncs;
+using SSEFunc = int64_t (*)(const uint8_t *a, int a_stride, const uint8_t *b,
+                            int b_stride, int width, int height);
+using TestSSEFuncs = libaom_test::FuncParam<SSEFunc>;
 
-typedef std::tuple<TestSSEFuncs, int> SSETestParam;
+using SSETestParam = std::tuple<TestSSEFuncs, int>;
 
 class SSETest : public ::testing::TestWithParam<SSETestParam> {
  public:
@@ -471,11 +471,11 @@
 // get_blk sum squares test functions
 //////////////////////////////////////////////////////////////////////////////
 
-typedef void (*sse_sum_func)(const int16_t *data, int stride, int bw, int bh,
-                             int *x_sum, int64_t *x2_sum);
-typedef libaom_test::FuncParam<sse_sum_func> TestSSE_SumFuncs;
+using sse_sum_func = void (*)(const int16_t *data, int stride, int bw, int bh,
+                              int *x_sum, int64_t *x2_sum);
+using TestSSE_SumFuncs = libaom_test::FuncParam<sse_sum_func>;
 
-typedef std::tuple<TestSSE_SumFuncs, TX_SIZE> SSE_SumTestParam;
+using SSE_SumTestParam = std::tuple<TestSSE_SumFuncs, TX_SIZE>;
 
 class SSE_Sum_Test : public ::testing::TestWithParam<SSE_SumTestParam> {
  public:
@@ -631,8 +631,8 @@
 // 2D Variance test functions
 //////////////////////////////////////////////////////////////////////////////
 
-typedef uint64_t (*Var2DFunc)(uint8_t *src, int stride, int width, int height);
-typedef libaom_test::FuncParam<Var2DFunc> TestFuncVar2D;
+using Var2DFunc = uint64_t (*)(uint8_t *src, int stride, int width, int height);
+using TestFuncVar2D = libaom_test::FuncParam<Var2DFunc>;
 
 const uint16_t test_block_size[2] = { 128, 256 };
 
diff --git a/test/temporal_filter_test.cc b/test/temporal_filter_test.cc
index cda06c5..2e0ae0f 100644
--- a/test/temporal_filter_test.cc
+++ b/test/temporal_filter_test.cc
@@ -37,22 +37,22 @@
 
 #if !CONFIG_REALTIME_ONLY
 namespace {
-typedef enum {
+enum ColorFormat {
   I400,  // Monochrome
   I420,  // 4:2:0
   I422,  // 4:2:2
   I444,  // 4:4:4
-} ColorFormat;
+};
 static const char *color_fmt_str[] = { "I400", "I420", "I422", "I444" };
-typedef void (*TemporalFilterFunc)(
+using TemporalFilterFunc = void (*)(
     const YV12_BUFFER_CONFIG *frame_to_filter, const MACROBLOCKD *mbd,
     const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
     const int num_planes, const double *noise_level, const MV *subblock_mvs,
     const int *subblock_mses, const int q_factor, const int filter_strength,
     int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-typedef libaom_test::FuncParam<TemporalFilterFunc> TemporalFilterFuncParam;
+using TemporalFilterFuncParam = libaom_test::FuncParam<TemporalFilterFunc>;
 
-typedef std::tuple<TemporalFilterFuncParam, int> TemporalFilterWithParam;
+using TemporalFilterWithParam = std::tuple<TemporalFilterFuncParam, int>;
 
 class TemporalFilterTest
     : public ::testing::TestWithParam<TemporalFilterWithParam> {
@@ -325,11 +325,11 @@
 const int kHeights[] = { 2160, 1080, 720, 600, 480, 240, 237 };
 #endif  // HAVE_AVX2 || HAVE_NEON
 
-typedef double (*EstimateNoiseFunc)(const uint8_t *src, int height, int width,
-                                    int stride, int edge_thresh);
+using EstimateNoiseFunc = double (*)(const uint8_t *src, int height, int width,
+                                     int stride, int edge_thresh);
 
-typedef std::tuple<EstimateNoiseFunc, EstimateNoiseFunc, int, int>
-    EstimateNoiseWithParam;
+using EstimateNoiseWithParam =
+    std::tuple<EstimateNoiseFunc, EstimateNoiseFunc, int, int>;
 
 class EstimateNoiseTest
     : public ::testing::TestWithParam<EstimateNoiseWithParam> {
@@ -423,16 +423,16 @@
 
 #if CONFIG_AV1_HIGHBITDEPTH
 
-typedef void (*HBDTemporalFilterFunc)(
+using HBDTemporalFilterFunc = void (*)(
     const YV12_BUFFER_CONFIG *frame_to_filter, const MACROBLOCKD *mbd,
     const BLOCK_SIZE block_size, const int mb_row, const int mb_col,
     const int num_planes, const double *noise_level, const MV *subblock_mvs,
     const int *subblock_mses, const int q_factor, const int filter_strength,
     int tf_wgt_calc_lvl, const uint8_t *pred, uint32_t *accum, uint16_t *count);
-typedef libaom_test::FuncParam<HBDTemporalFilterFunc>
-    HBDTemporalFilterFuncParam;
+using HBDTemporalFilterFuncParam =
+    libaom_test::FuncParam<HBDTemporalFilterFunc>;
 
-typedef std::tuple<HBDTemporalFilterFuncParam, int> HBDTemporalFilterWithParam;
+using HBDTemporalFilterWithParam = std::tuple<HBDTemporalFilterFuncParam, int>;
 
 class HBDTemporalFilterTest
     : public ::testing::TestWithParam<HBDTemporalFilterWithParam> {
diff --git a/test/test_intra_pred_speed.cc b/test/test_intra_pred_speed.cc
index 65468d9..80e1719 100644
--- a/test/test_intra_pred_speed.cc
+++ b/test/test_intra_pred_speed.cc
@@ -35,8 +35,8 @@
 // 0: Generate MD5 array as required
 #define APPLY_UNIT_TESTS 1
 
-typedef void (*AvxPredFunc)(uint8_t *dst, ptrdiff_t y_stride,
-                            const uint8_t *above, const uint8_t *left);
+using AvxPredFunc = void (*)(uint8_t *dst, ptrdiff_t y_stride,
+                             const uint8_t *above, const uint8_t *left);
 
 const int kBPS = 64;
 const int kTotalPixels = kBPS * kBPS;
@@ -87,7 +87,7 @@
 // -----------------------------------------------------------------------------
 // Low Bitdepth
 
-typedef IntraPredTestMem<uint8_t> Av1IntraPredTestMem;
+using Av1IntraPredTestMem = IntraPredTestMem<uint8_t>;
 
 static const char *const kTxSizeStrings[TX_SIZES_ALL] = {
   "4X4",  "8X8",  "16X16", "32X32", "64X64", "4X8",   "8X4",
@@ -1007,11 +1007,11 @@
 // High Bitdepth
 namespace {
 
-typedef void (*AvxHighbdPredFunc)(uint16_t *dst, ptrdiff_t y_stride,
-                                  const uint16_t *above, const uint16_t *left,
-                                  int bd);
+using AvxHighbdPredFunc = void (*)(uint16_t *dst, ptrdiff_t y_stride,
+                                   const uint16_t *above, const uint16_t *left,
+                                   int bd);
 
-typedef IntraPredTestMem<uint16_t> Av1HighbdIntraPredTestMem;
+using Av1HighbdIntraPredTestMem = IntraPredTestMem<uint16_t>;
 
 void TestHighbdIntraPred(TX_SIZE tx_size, AvxHighbdPredFunc const *pred_funcs,
                          const char *const signatures[]) {
diff --git a/test/test_vector_test.cc b/test/test_vector_test.cc
index b521e93..6c132b6 100644
--- a/test/test_vector_test.cc
+++ b/test/test_vector_test.cc
@@ -34,7 +34,7 @@
 const int kFileName = 1;
 const int kRowMT = 2;
 
-typedef std::tuple<int, const char *, int> DecodeParam;
+using DecodeParam = std::tuple<int, const char *, int>;
 
 class TestVectorTest : public ::libaom_test::DecoderTest,
                        public ::libaom_test::CodecTestWithParam<DecodeParam> {
diff --git a/test/tile_config_test.cc b/test/tile_config_test.cc
index ed3216f..ff7a460 100644
--- a/test/tile_config_test.cc
+++ b/test/tile_config_test.cc
@@ -19,14 +19,14 @@
 #include "test/util.h"
 
 namespace {
-typedef struct {
+struct uniformTileConfigParam {
   // Superblock size
   const unsigned int sb_size;
   // log2(number of tile rows)
   const unsigned int tile_rows;
   // log2(number of tile columns)
   const unsigned int tile_cols;
-} uniformTileConfigParam;
+};
 
 const libaom_test::TestMode kTestModeParams[] =
 #if CONFIG_REALTIME_ONLY
@@ -42,7 +42,7 @@
   { 64, 2, 2 },  { 64, 3, 3 },  { 64, 4, 4 }
 };
 
-typedef struct {
+struct nonUniformTileConfigParam {
   // Superblock size
   const unsigned int sb_size;
   // number of tile widths
@@ -53,7 +53,7 @@
   const unsigned int tile_height_count;
   // list of tile heights
   int tile_heights[AOM_MAX_TILE_ROWS];
-} nonUniformTileConfigParam;
+};
 
 const nonUniformTileConfigParam nonUniformTileConfigParams[] = {
   { 64, 1, { 3 }, 1, { 3 } },          { 64, 2, { 1, 2 }, 2, { 1, 2 } },
@@ -271,14 +271,14 @@
                            ::testing::ValuesIn(nonUniformTileConfigParams),
                            ::testing::Values(AOM_Q, AOM_VBR, AOM_CBR, AOM_CQ));
 
-typedef struct {
+struct TileGroupConfigParams {
   // Number of tile groups to set.
   const int num_tg;
   // Number of tile rows to set
   const int num_tile_rows;
   // Number of tile columns to set
   const int num_tile_cols;
-} TileGroupConfigParams;
+};
 
 static const TileGroupConfigParams tileGroupTestParams[] = {
   { 5, 4, 4 }, { 3, 3, 3 }, { 5, 3, 3 }, { 7, 5, 5 }, { 7, 3, 3 }, { 7, 4, 4 }
diff --git a/test/variance_test.cc b/test/variance_test.cc
index 1999cc9..c95183f 100644
--- a/test/variance_test.cc
+++ b/test/variance_test.cc
@@ -30,39 +30,40 @@
 
 namespace {
 
-typedef uint64_t (*MseWxH16bitFunc)(uint8_t *dst, int dstride, uint16_t *src,
-                                    int sstride, int w, int h);
-typedef uint64_t (*Mse16xH16bitFunc)(uint8_t *dst, int dstride, uint16_t *src,
-                                     int w, int h);
-typedef unsigned int (*VarianceMxNFunc)(const uint8_t *a, int a_stride,
-                                        const uint8_t *b, int b_stride,
-                                        unsigned int *sse);
-typedef void (*GetSseSum8x8QuadFunc)(const uint8_t *a, int a_stride,
-                                     const uint8_t *b, int b_stride,
-                                     uint32_t *sse8x8, int *sum8x8,
-                                     unsigned int *tot_sse, int *tot_sum,
-                                     uint32_t *var8x8);
-typedef void (*GetSseSum16x16DualFunc)(const uint8_t *a, int a_stride,
-                                       const uint8_t *b, int b_stride,
-                                       uint32_t *sse16x16,
-                                       unsigned int *tot_sse, int *tot_sum,
-                                       uint32_t *var16x16);
-typedef unsigned int (*SubpixVarMxNFunc)(const uint8_t *a, int a_stride,
-                                         int xoffset, int yoffset,
+using MseWxH16bitFunc = uint64_t (*)(uint8_t *dst, int dstride, uint16_t *src,
+                                     int sstride, int w, int h);
+using Mse16xH16bitFunc = uint64_t (*)(uint8_t *dst, int dstride, uint16_t *src,
+                                      int w, int h);
+using VarianceMxNFunc = unsigned int (*)(const uint8_t *a, int a_stride,
                                          const uint8_t *b, int b_stride,
                                          unsigned int *sse);
-typedef unsigned int (*SubpixAvgVarMxNFunc)(const uint8_t *a, int a_stride,
-                                            int xoffset, int yoffset,
-                                            const uint8_t *b, int b_stride,
-                                            uint32_t *sse,
-                                            const uint8_t *second_pred);
-typedef unsigned int (*SumOfSquaresFunction)(const int16_t *src);
+using GetSseSum8x8QuadFunc = void (*)(const uint8_t *a, int a_stride,
+                                      const uint8_t *b, int b_stride,
+                                      uint32_t *sse8x8, int *sum8x8,
+                                      unsigned int *tot_sse, int *tot_sum,
+                                      uint32_t *var8x8);
+using GetSseSum16x16DualFunc = void (*)(const uint8_t *a, int a_stride,
+                                        const uint8_t *b, int b_stride,
+                                        uint32_t *sse16x16,
+                                        unsigned int *tot_sse, int *tot_sum,
+                                        uint32_t *var16x16);
+using SubpixVarMxNFunc = unsigned int (*)(const uint8_t *a, int a_stride,
+                                          int xoffset, int yoffset,
+                                          const uint8_t *b, int b_stride,
+                                          unsigned int *sse);
+using SubpixAvgVarMxNFunc = unsigned int (*)(const uint8_t *a, int a_stride,
+                                             int xoffset, int yoffset,
+                                             const uint8_t *b, int b_stride,
+                                             uint32_t *sse,
+                                             const uint8_t *second_pred);
+using SumOfSquaresFunction = unsigned int (*)(const int16_t *src);
 
 #if !CONFIG_REALTIME_ONLY
-typedef uint32_t (*ObmcSubpelVarFunc)(const uint8_t *pre, int pre_stride,
-                                      int xoffset, int yoffset,
-                                      const int32_t *wsrc, const int32_t *mask,
-                                      unsigned int *sse);
+using ObmcSubpelVarFunc = uint32_t (*)(const uint8_t *pre, int pre_stride,
+                                       int xoffset, int yoffset,
+                                       const int32_t *wsrc, const int32_t *mask,
+                                       unsigned int *sse);
+
 #endif
 
 using libaom_test::ACMRandom;
@@ -1437,7 +1438,7 @@
 
 static const int kMaskMax = 64;
 
-typedef TestParams<ObmcSubpelVarFunc> ObmcSubpelVarianceParams;
+using ObmcSubpelVarianceParams = TestParams<ObmcSubpelVarFunc>;
 
 template <typename FunctionType>
 class ObmcVarianceTest
@@ -1592,19 +1593,19 @@
 
 #endif  // !CONFIG_REALTIME_ONLY
 
-typedef MseWxHTestClass<MseWxH16bitFunc> MseWxHTest;
-typedef Mse16xHTestClass<Mse16xH16bitFunc> Mse16xHTest;
-typedef MainTestClass<VarianceMxNFunc> AvxMseTest;
-typedef MainTestClass<VarianceMxNFunc> AvxVarianceTest;
-typedef MainTestClass<GetSseSum8x8QuadFunc> GetSseSum8x8QuadTest;
-typedef MainTestClass<GetSseSum16x16DualFunc> GetSseSum16x16DualTest;
-typedef SubpelVarianceTest<SubpixVarMxNFunc> AvxSubpelVarianceTest;
-typedef SubpelVarianceTest<SubpixAvgVarMxNFunc> AvxSubpelAvgVarianceTest;
+using MseWxHTest = MseWxHTestClass<MseWxH16bitFunc>;
+using Mse16xHTest = Mse16xHTestClass<Mse16xH16bitFunc>;
+using AvxMseTest = MainTestClass<VarianceMxNFunc>;
+using AvxVarianceTest = MainTestClass<VarianceMxNFunc>;
+using GetSseSum8x8QuadTest = MainTestClass<GetSseSum8x8QuadFunc>;
+using GetSseSum16x16DualTest = MainTestClass<GetSseSum16x16DualFunc>;
+using AvxSubpelVarianceTest = SubpelVarianceTest<SubpixVarMxNFunc>;
+using AvxSubpelAvgVarianceTest = SubpelVarianceTest<SubpixAvgVarMxNFunc>;
 #if !CONFIG_REALTIME_ONLY
-typedef ObmcVarianceTest<ObmcSubpelVarFunc> AvxObmcSubpelVarianceTest;
+using AvxObmcSubpelVarianceTest = ObmcVarianceTest<ObmcSubpelVarFunc>;
 #endif
-typedef TestParams<MseWxH16bitFunc> MseWxHParams;
-typedef TestParams<Mse16xH16bitFunc> Mse16xHParams;
+using MseWxHParams = TestParams<MseWxH16bitFunc>;
+using Mse16xHParams = TestParams<Mse16xH16bitFunc>;
 
 TEST_P(MseWxHTest, RefMse) { RefMatchTestMse(); }
 TEST_P(MseWxHTest, DISABLED_SpeedMse) { SpeedTest(); }
@@ -1659,14 +1660,14 @@
                          ::testing::Values(aom_get_mb_ss_c));
 #endif  // !CONFIG_REALTIME_ONLY
 
-typedef TestParams<VarianceMxNFunc> MseParams;
+using MseParams = TestParams<VarianceMxNFunc>;
 INSTANTIATE_TEST_SUITE_P(C, AvxMseTest,
                          ::testing::Values(MseParams(4, 4, &aom_mse16x16_c),
                                            MseParams(4, 3, &aom_mse16x8_c),
                                            MseParams(3, 4, &aom_mse8x16_c),
                                            MseParams(3, 3, &aom_mse8x8_c)));
 
-typedef TestParams<VarianceMxNFunc> VarianceParams;
+using VarianceParams = TestParams<VarianceMxNFunc>;
 const VarianceParams kArrayVariance_c[] = {
   VarianceParams(7, 7, &aom_variance128x128_c),
   VarianceParams(7, 6, &aom_variance128x64_c),
@@ -1696,7 +1697,7 @@
 INSTANTIATE_TEST_SUITE_P(C, AvxVarianceTest,
                          ::testing::ValuesIn(kArrayVariance_c));
 
-typedef TestParams<GetSseSum8x8QuadFunc> GetSseSumParams;
+using GetSseSumParams = TestParams<GetSseSum8x8QuadFunc>;
 const GetSseSumParams kArrayGetSseSum8x8Quad_c[] = {
   GetSseSumParams(7, 7, &aom_get_var_sse_sum_8x8_quad_c, 0),
   GetSseSumParams(6, 6, &aom_get_var_sse_sum_8x8_quad_c, 0),
@@ -1706,7 +1707,7 @@
 INSTANTIATE_TEST_SUITE_P(C, GetSseSum8x8QuadTest,
                          ::testing::ValuesIn(kArrayGetSseSum8x8Quad_c));
 
-typedef TestParams<GetSseSum16x16DualFunc> GetSseSumParamsDual;
+using GetSseSumParamsDual = TestParams<GetSseSum16x16DualFunc>;
 const GetSseSumParamsDual kArrayGetSseSum16x16Dual_c[] = {
   GetSseSumParamsDual(7, 7, &aom_get_var_sse_sum_16x16_dual_c, 0),
   GetSseSumParamsDual(6, 6, &aom_get_var_sse_sum_16x16_dual_c, 0),
@@ -1717,7 +1718,7 @@
 INSTANTIATE_TEST_SUITE_P(C, GetSseSum16x16DualTest,
                          ::testing::ValuesIn(kArrayGetSseSum16x16Dual_c));
 
-typedef TestParams<SubpixVarMxNFunc> SubpelVarianceParams;
+using SubpelVarianceParams = TestParams<SubpixVarMxNFunc>;
 const SubpelVarianceParams kArraySubpelVariance_c[] = {
   SubpelVarianceParams(7, 7, &aom_sub_pixel_variance128x128_c, 0),
   SubpelVarianceParams(7, 6, &aom_sub_pixel_variance128x64_c, 0),
@@ -1747,7 +1748,7 @@
 INSTANTIATE_TEST_SUITE_P(C, AvxSubpelVarianceTest,
                          ::testing::ValuesIn(kArraySubpelVariance_c));
 
-typedef TestParams<SubpixAvgVarMxNFunc> SubpelAvgVarianceParams;
+using SubpelAvgVarianceParams = TestParams<SubpixAvgVarMxNFunc>;
 const SubpelAvgVarianceParams kArraySubpelAvgVariance_c[] = {
   SubpelAvgVarianceParams(7, 7, &aom_sub_pixel_avg_variance128x128_c, 0),
   SubpelAvgVarianceParams(7, 6, &aom_sub_pixel_avg_variance128x64_c, 0),
@@ -1808,9 +1809,8 @@
 #endif
 
 #if CONFIG_AV1_HIGHBITDEPTH
-typedef uint64_t (*MseHBDWxH16bitFunc)(uint16_t *dst, int dstride,
-                                       uint16_t *src, int sstride, int w,
-                                       int h);
+using MseHBDWxH16bitFunc = uint64_t (*)(uint16_t *, int, uint16_t *, int, int,
+                                        int);
 
 template <typename FunctionType>
 class MseHBDWxHTestClass
@@ -1909,15 +1909,15 @@
   }
 }
 
-typedef TestParams<MseHBDWxH16bitFunc> MseHBDWxHParams;
-typedef MseHBDWxHTestClass<MseHBDWxH16bitFunc> MseHBDWxHTest;
-typedef MainTestClass<VarianceMxNFunc> AvxHBDMseTest;
+using MseHBDWxHParams = TestParams<MseHBDWxH16bitFunc>;
+using MseHBDWxHTest = MseHBDWxHTestClass<MseHBDWxH16bitFunc>;
+using AvxHBDMseTest = MainTestClass<VarianceMxNFunc>;
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AvxHBDMseTest);
-typedef MainTestClass<VarianceMxNFunc> AvxHBDVarianceTest;
-typedef SubpelVarianceTest<SubpixVarMxNFunc> AvxHBDSubpelVarianceTest;
-typedef SubpelVarianceTest<SubpixAvgVarMxNFunc> AvxHBDSubpelAvgVarianceTest;
+using AvxHBDVarianceTest = MainTestClass<VarianceMxNFunc>;
+using AvxHBDSubpelVarianceTest = SubpelVarianceTest<SubpixVarMxNFunc>;
+using AvxHBDSubpelAvgVarianceTest = SubpelVarianceTest<SubpixAvgVarMxNFunc>;
 #if !CONFIG_REALTIME_ONLY
-typedef ObmcVarianceTest<ObmcSubpelVarFunc> AvxHBDObmcSubpelVarianceTest;
+using AvxHBDObmcSubpelVarianceTest = ObmcVarianceTest<ObmcSubpelVarFunc>;
 #endif
 GTEST_ALLOW_UNINSTANTIATED_PARAMETERIZED_TEST(AvxHBDObmcSubpelVarianceTest);
 
diff --git a/test/warp_filter_test.cc b/test/warp_filter_test.cc
index 3a35075..530f2aa 100644
--- a/test/warp_filter_test.cc
+++ b/test/warp_filter_test.cc
@@ -66,6 +66,12 @@
 #endif  // CONFIG_AV1_HIGHBITDEPTH
 #endif  // HAVE_AVX2
 
+#if CONFIG_HIGHWAY && HAVE_AVX512
+INSTANTIATE_TEST_SUITE_P(
+    AVX512, AV1WarpFilterTest,
+    libaom_test::AV1WarpFilter::BuildParams(av1_warp_affine_avx512));
+#endif  // CONFIG_HIGHWAY && HAVE_AVX512
+
 #if HAVE_NEON
 INSTANTIATE_TEST_SUITE_P(
     NEON, AV1WarpFilterTest,
diff --git a/test/warp_filter_test_util.cc b/test/warp_filter_test_util.cc
index 9dadbaf..19683f0 100644
--- a/test/warp_filter_test_util.cc
+++ b/test/warp_filter_test_util.cc
@@ -258,9 +258,13 @@
                 conv_params.fwd_offset = quant_dist_lookup_table[jj][ii];
                 conv_params.bck_offset = quant_dist_lookup_table[jj][1 - ii];
               }
-              av1_warp_affine_c(mat, input, w, h, stride, output.get(), 32, 32,
-                                out_w, out_h, out_w, sub_x, sub_y, &conv_params,
-                                alpha, beta, gamma, delta);
+              // Around 6% of the time (Rand8() < 16, i.e. 16/256), use a
+              // random column/row to test out-of-bounds filtering cases.
+              const int col = rnd_.Rand8() < 16 ? rnd_.Rand8() : 32;
+              const int row = rnd_.Rand8() < 16 ? rnd_.Rand8() : 32;
+              av1_warp_affine_c(mat, input, w, h, stride, output.get(), col,
+                                row, out_w, out_h, out_w, sub_x, sub_y,
+                                &conv_params, alpha, beta, gamma, delta);
               if (use_no_round) {
                 conv_params = get_conv_params_no_round(
                     do_average, 0, dstb.get(), out_w, 1, bd);
@@ -272,9 +276,9 @@
                 conv_params.fwd_offset = quant_dist_lookup_table[jj][ii];
                 conv_params.bck_offset = quant_dist_lookup_table[jj][1 - ii];
               }
-              test_impl(mat, input, w, h, stride, output2.get(), 32, 32, out_w,
-                        out_h, out_w, sub_x, sub_y, &conv_params, alpha, beta,
-                        gamma, delta);
+              test_impl(mat, input, w, h, stride, output2.get(), col, row,
+                        out_w, out_h, out_w, sub_x, sub_y, &conv_params, alpha,
+                        beta, gamma, delta);
               if (use_no_round) {
                 for (int j = 0; j < out_w * out_h; ++j)
                   ASSERT_EQ(dsta[j], dstb[j])
@@ -458,8 +462,12 @@
                 conv_params.bck_offset = quant_dist_lookup_table[jj][1 - ii];
               }
 
+              // Around 6% of the time (Rand8() < 16, i.e. 16/256), use a
+              // random column/row to test out-of-bounds filtering cases.
+              const int col = rnd_.Rand8() < 16 ? rnd_.Rand8() : 32;
+              const int row = rnd_.Rand8() < 16 ? rnd_.Rand8() : 32;
               av1_highbd_warp_affine_c(mat, input, w, h, stride, output.get(),
-                                       32, 32, out_w, out_h, out_w, sub_x,
+                                       col, row, out_w, out_h, out_w, sub_x,
                                        sub_y, bd, &conv_params, alpha, beta,
                                        gamma, delta);
               if (use_no_round) {
@@ -475,9 +483,9 @@
                 conv_params.fwd_offset = quant_dist_lookup_table[jj][ii];
                 conv_params.bck_offset = quant_dist_lookup_table[jj][1 - ii];
               }
-              test_impl(mat, input, w, h, stride, output2.get(), 32, 32, out_w,
-                        out_h, out_w, sub_x, sub_y, bd, &conv_params, alpha,
-                        beta, gamma, delta);
+              test_impl(mat, input, w, h, stride, output2.get(), col, row,
+                        out_w, out_h, out_w, sub_x, sub_y, bd, &conv_params,
+                        alpha, beta, gamma, delta);
 
               if (use_no_round) {
                 for (int j = 0; j < out_w * out_h; ++j)
diff --git a/test/warp_filter_test_util.h b/test/warp_filter_test_util.h
index 7e401d8..25fa110 100644
--- a/test/warp_filter_test_util.h
+++ b/test/warp_filter_test_util.h
@@ -34,16 +34,16 @@
 
 namespace AV1WarpFilter {
 
-typedef void (*warp_affine_func)(const int32_t *mat, const uint8_t *ref,
-                                 int width, int height, int stride,
-                                 uint8_t *pred, int p_col, int p_row,
-                                 int p_width, int p_height, int p_stride,
-                                 int subsampling_x, int subsampling_y,
-                                 ConvolveParams *conv_params, int16_t alpha,
-                                 int16_t beta, int16_t gamma, int16_t delta);
+using warp_affine_func = void (*)(const int32_t *mat, const uint8_t *ref,
+                                  int width, int height, int stride,
+                                  uint8_t *pred, int p_col, int p_row,
+                                  int p_width, int p_height, int p_stride,
+                                  int subsampling_x, int subsampling_y,
+                                  ConvolveParams *conv_params, int16_t alpha,
+                                  int16_t beta, int16_t gamma, int16_t delta);
 
-typedef std::tuple<int, int, int, warp_affine_func> WarpTestParam;
-typedef std::tuple<WarpTestParam, int, int, int, int> WarpTestParams;
+using WarpTestParam = std::tuple<int, int, int, warp_affine_func>;
+using WarpTestParams = std::tuple<WarpTestParam, int, int, int, int>;
 
 ::testing::internal::ParamGenerator<WarpTestParams> BuildParams(
     warp_affine_func filter);
@@ -64,19 +64,17 @@
 
 #if CONFIG_AV1_HIGHBITDEPTH
 namespace AV1HighbdWarpFilter {
-typedef void (*highbd_warp_affine_func)(const int32_t *mat, const uint16_t *ref,
-                                        int width, int height, int stride,
-                                        uint16_t *pred, int p_col, int p_row,
-                                        int p_width, int p_height, int p_stride,
-                                        int subsampling_x, int subsampling_y,
-                                        int bd, ConvolveParams *conv_params,
-                                        int16_t alpha, int16_t beta,
-                                        int16_t gamma, int16_t delta);
+using highbd_warp_affine_func =
+    void (*)(const int32_t *mat, const uint16_t *ref, int width, int height,
+             int stride, uint16_t *pred, int p_col, int p_row, int p_width,
+             int p_height, int p_stride, int subsampling_x, int subsampling_y,
+             int bd, ConvolveParams *conv_params, int16_t alpha, int16_t beta,
+             int16_t gamma, int16_t delta);
 
-typedef std::tuple<int, int, int, int, highbd_warp_affine_func>
-    HighbdWarpTestParam;
-typedef std::tuple<HighbdWarpTestParam, int, int, int, int>
-    HighbdWarpTestParams;
+using HighbdWarpTestParam =
+    std::tuple<int, int, int, int, highbd_warp_affine_func>;
+using HighbdWarpTestParams =
+    std::tuple<HighbdWarpTestParam, int, int, int, int>;
 
 ::testing::internal::ParamGenerator<HighbdWarpTestParams> BuildParams(
     highbd_warp_affine_func filter);
diff --git a/test/wiener_test.cc b/test/wiener_test.cc
index fe6e40e..afb636e 100644
--- a/test/wiener_test.cc
+++ b/test/wiener_test.cc
@@ -175,18 +175,18 @@
 }
 
 static const int kIterations = 100;
-typedef void (*compute_stats_Func)(int wiener_win, const uint8_t *dgd,
-                                   const uint8_t *src, int16_t *dgd_avg,
-                                   int16_t *src_avg, int h_start, int h_end,
-                                   int v_start, int v_end, int dgd_stride,
-                                   int src_stride, int64_t *M, int64_t *H,
-                                   int use_downsampled_wiener_stats);
+using compute_stats_Func = void (*)(int wiener_win, const uint8_t *dgd,
+                                    const uint8_t *src, int16_t *dgd_avg,
+                                    int16_t *src_avg, int h_start, int h_end,
+                                    int v_start, int v_end, int dgd_stride,
+                                    int src_stride, int64_t *M, int64_t *H,
+                                    int use_downsampled_wiener_stats);
 
 ////////////////////////////////////////////////////////////////////////////////
 // 8 bit
 ////////////////////////////////////////////////////////////////////////////////
 
-typedef std::tuple<const compute_stats_Func> WienerTestParam;
+using WienerTestParam = std::tuple<const compute_stats_Func>;
 
 class WienerTest : public ::testing::TestWithParam<WienerTestParam> {
  public:
@@ -537,14 +537,14 @@
 }
 
 static const int kIterations = 100;
-typedef void (*compute_stats_Func)(int wiener_win, const uint8_t *dgd,
-                                   const uint8_t *src, int16_t *d, int16_t *s,
-                                   int h_start, int h_end, int v_start,
-                                   int v_end, int dgd_stride, int src_stride,
-                                   int64_t *M, int64_t *H,
-                                   aom_bit_depth_t bit_depth);
+using compute_stats_Func = void (*)(int wiener_win, const uint8_t *dgd,
+                                    const uint8_t *src, int16_t *d, int16_t *s,
+                                    int h_start, int h_end, int v_start,
+                                    int v_end, int dgd_stride, int src_stride,
+                                    int64_t *M, int64_t *H,
+                                    aom_bit_depth_t bit_depth);
 
-typedef std::tuple<const compute_stats_Func> WienerTestParam;
+using WienerTestParam = std::tuple<const compute_stats_Func>;
 
 class WienerTestHighbd : public ::testing::TestWithParam<WienerTestParam> {
  public:
diff --git a/third_party/fastfeat/LICENSE b/third_party/fastfeat/LICENSE
index f347008..ee48fe6 100644
--- a/third_party/fastfeat/LICENSE
+++ b/third_party/fastfeat/LICENSE
@@ -13,8 +13,8 @@
 	 notice, this list of conditions and the following disclaimer in the
 	 documentation and/or other materials provided with the distribution.
 
-	*Neither the name of the University of Cambridge nor the names of 
-	 its contributors may be used to endorse or promote products derived 
+	*Neither the name of the University of Cambridge nor the names of
+	 its contributors may be used to endorse or promote products derived
 	 from this software without specific prior written permission.
 
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS