Fix range check in half_btf() function

Per the linked bug report, the condition on 'result_64' in half_btf()
is not satisfied by all conformant bitstreams. But it can easily be
modified to give a condition which *is* true for all conformant bitstreams.

Add a comment explaining the updated condition, as well as its implications
for alternative implementations (ie. hardware and/or vectorized code)

Similarly, there is an unnecessary range check on the output of
av1_iadst4_new(). This is followed immediately by a clamp to the same
range, so we shouldn't check the range here (and indeed the spec doesn't
do so).

BUG=aomedia:2151

Change-Id: Iab30344e2e7b1ce00245541b1a32a9495d5ed717
diff --git a/av1/common/av1_inv_txfm1d.c b/av1/common/av1_inv_txfm1d.c
index ab144f2..7ef2d6d 100644
--- a/av1/common/av1_inv_txfm1d.c
+++ b/av1/common/av1_inv_txfm1d.c
@@ -711,7 +711,6 @@
   output[1] = round_shift(x1, bit);
   output[2] = round_shift(x2, bit);
   output[3] = round_shift(x3, bit);
-  av1_range_check_buf(6, input, output, 4, stage_range[6]);
 }
 
 void av1_iadst8_new(const int32_t *input, int32_t *output, int8_t cos_bit,
diff --git a/av1/common/av1_txfm.h b/av1/common/av1_txfm.h
index c92f91d..59d64ca 100644
--- a/av1/common/av1_txfm.h
+++ b/av1/common/av1_txfm.h
@@ -78,10 +78,25 @@
 static INLINE int32_t half_btf(int32_t w0, int32_t in0, int32_t w1, int32_t in1,
                                int bit) {
   int64_t result_64 = (int64_t)(w0 * in0) + (int64_t)(w1 * in1);
+  int64_t intermediate = result_64 + (1LL << (bit - 1));
+  // NOTE(david.barker): The value 'result_64' may not necessarily fit
+  // into 32 bits. However, the result of this function is nominally
+  // ROUND_POWER_OF_TWO_64(result_64, bit)
+  // and that is required to fit into stage_range[stage] many bits
+  // (checked by range_check_buf()).
+  //
+  // Here we've unpacked that rounding operation, and it can be shown
+  // that the value of 'intermediate' here *does* fit into 32 bits
+  // for any conformant bitstream.
+  // The upshot is that, if you do all this calculation using
+  // wrapping 32-bit arithmetic instead of (non-wrapping) 64-bit arithmetic,
+  // then you'll still get the correct result.
+  // To provide a check on this logic, we assert that 'intermediate'
+  // would fit into an int32 if range checking is enabled.
 #if CONFIG_COEFFICIENT_RANGE_CHECKING
-  assert(result_64 >= INT32_MIN && result_64 <= INT32_MAX);
+  assert(intermediate >= INT32_MIN && intermediate <= INT32_MAX);
 #endif
-  return round_shift(result_64, bit);
+  return (int32_t)(intermediate >> bit);
 }
 
 static INLINE uint16_t highbd_clip_pixel_add(uint16_t dest, tran_high_t trans,