| /*!\page encoder_guide AV1 ENCODER GUIDE |
| |
| \tableofcontents |
| |
| \section architecture_introduction Introduction |
| |
| This document provides an architectural overview of the libaom AV1 encoder. |
| |
| It is intended as a high level starting point for anyone wishing to contribute |
| to the project, and should help them to understand the structure of the |
| encoder more quickly and find their way around the codebase. |
| |
| It sits above, and where necessary links to, more detailed function level |
| documents. |
| |
| \subsection architecture_gencodecs Generic Block Transform Based Codecs |
| |
| Most modern video encoders including VP8, H.264, VP9, HEVC and AV1 |
| (in increasing order of complexity) share a common basic paradigm. This |
| comprises separating a stream of raw video frames into a series of discrete |
| blocks (of one or more sizes), then computing a prediction signal and a |
| quantized, transform coded, residual error signal. The prediction and residual |
| error signal, along with any side information needed by the decoder, are then |
| entropy coded and packed to form the encoded bitstream. See Figure 1 below, |
| where the blue blocks are, to all intents and purposes, the lossless parts of |
| the encoder and the red block is the lossy part. |
| |
| This is of course a gross oversimplification, even in regard to the simplest |
| of the above codecs. For example, all of them allow for block based |
| prediction at multiple different scales (i.e. different block sizes) and may |
| use previously coded pixels in the current frame for prediction or pixels from |
| one or more previously encoded frames. Further, they may support multiple |
| different transforms and transform sizes and quality optimization tools like |
| loop filtering. |
| |
| \image html genericcodecflow.png "" width=70% |
| |
| \subsection architecture_av1_structure AV1 Structure and Complexity |
| |
| As previously stated, AV1 adopts the same underlying paradigm as other block |
| transform based codecs. However, it is much more complicated than previous |
| generation codecs and supports many more block partitioning, prediction and |
| transform options. |
| |
| AV1 supports block partitions of various sizes from 128x128 pixels down to 4x4 |
| pixels using a multi-layer recursive tree structure as illustrated in Figure 2 |
| below. |
| |
| \image html av1partitions.png "" width=70% |
| |
| AV1 also provides 71 basic intra prediction modes, 56 single frame inter prediction |
| modes (7 reference frames x 4 modes x 2 for OBMC (overlapped block motion |
| compensation)), 12768 compound inter prediction modes (that combine inter |
| predictors from two reference frames) and 36708 compound inter / intra |
| prediction modes. Furthermore, in addition to simple inter motion estimation, |
| AV1 also supports warped motion prediction using affine transforms. |
| |
| In terms of transform coding, it has 16 separable 2-D transform kernels |
| \f$(DCT, ADST, fADST, IDTX)^2\f$ that can be applied at up to 19 different |
| scales from 64x64 down to 4x4 pixels. |
| |
| When combined together, this means that for any one 8x8 pixel block in a |
| source frame, there are approximately 45,000,000 different ways that it can |
| be encoded. |
| |
| Consequently, AV1 requires complex control processes. While not necessarily |
| a normative part of the bitstream, these are the algorithms that turn a set |
| of compression tools and a bitstream format specification, into a coherent |
| and useful codec implementation. These include, but are not limited to: |
| |
| - Rate distortion optimization (The process of trying to choose the most |
| efficient combination of block size, prediction mode, transform type |
| etc.) |
| - Rate control (regulation of the output bitrate) |
| - Encoder speed vs quality trade offs. |
| - Features such as two pass encoding or optimization for low delay |
| encoding. |
| |
| For a more detailed overview of AV1's encoding tools and a discussion of some |
| of the design considerations and hardware constraints that had to be |
| accommodated, please refer to (TODO REF) <b>A Technical Overview of the AV1 |
| Standard</b> (TODO add link to Jingning's AV1 overview paper). |
| |
| Figure 3 provides a slightly expanded but still simplistic view of the |
| AV1 encoder architecture with blocks that relate to some of the subsequent |
| sections of this document. In this diagram, the raw uncompressed frame buffers |
| are shown in dark green and the reconstructed frame buffers used for |
| prediction in light green. Red indicates those parts of the codec that are |
| (or may be) lossy, where fidelity can be traded off against compression |
| efficiency, whilst light blue shows algorithms or coding tools that are |
| lossless. The yellow blocks represent non-bitstream normative configuration |
| and control algorithms. |
| |
| \image html av1encoderflow.png "" width=70% |
| |
| \section architecture_command_line The Libaom Command Line Interface |
| |
| Add details or links here: TODO ? elliotk@ |
| |
| \section architecture_enc_data_structures Main Encoder Data Structures |
| |
| The following are the main high level data structures used by the libaom AV1 |
| encoder and referenced elsewhere in this overview document: |
| |
| - \ref AV1_COMP |
| - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) |
| - \ref AV1_COMP.alt_ref_buffer (\ref yv12_buffer_config) |
| - \ref AV1_COMP.rc (\ref RATE_CONTROL) |
| - \ref AV1_COMP.twopass (\ref TWO_PASS) |
| - \ref AV1_COMP.gf_group (\ref GF_GROUP) |
| - \ref AV1_COMP.speed |
| - \ref AV1_COMP.sf (\ref SPEED_FEATURES) |
| - \ref AV1_COMP.lap_enabled |
| |
| - \ref AV1EncoderConfig (Encoder configuration parameters) |
| - \ref AV1EncoderConfig.pass |
| - \ref AV1EncoderConfig.algo_cfg (\ref AlgoCfg) |
| - \ref AV1EncoderConfig.kf_cfg (\ref KeyFrameCfg) |
| - \ref AV1EncoderConfig.rc_cfg (\ref RateControlCfg) |
| |
| - \ref AlgoCfg (Algorithm related configuration parameters) |
| - \ref AlgoCfg.arnr_max_frames |
| - \ref AlgoCfg.arnr_strength |
| |
| - \ref KeyFrameCfg (Keyframe coding configuration parameters) |
| - \ref KeyFrameCfg.enable_keyframe_filtering |
| |
| - \ref RateControlCfg (Rate control configuration) |
| - \ref RateControlCfg.mode |
| - \ref RateControlCfg.target_bandwidth |
| - \ref RateControlCfg.best_allowed_q |
| - \ref RateControlCfg.worst_allowed_q |
| - \ref RateControlCfg.qp |
| - \ref RateControlCfg.under_shoot_pct |
| - \ref RateControlCfg.over_shoot_pct |
| - \ref RateControlCfg.maximum_buffer_size_ms |
| - \ref RateControlCfg.starting_buffer_level_ms |
| - \ref RateControlCfg.optimal_buffer_level_ms |
| - \ref RateControlCfg.vbrmin_section |
| - \ref RateControlCfg.vbrmax_section |
| |
| - \ref RATE_CONTROL (Rate control status) |
| - \ref RATE_CONTROL.intervals_till_gf_calculate_due |
| - \ref RATE_CONTROL.gf_intervals[] |
| - \ref RATE_CONTROL.cur_gf_index |
| - \ref RATE_CONTROL.frames_till_gf_update_due |
| - \ref RATE_CONTROL.frames_to_key |
| |
| - \ref TWO_PASS (Two pass status and control data) |
| |
| - \ref GF_GROUP (Data relating to the current GF/ARF group) |
| |
| - \ref FIRSTPASS_STATS (Defines entries in the first pass stats buffer) |
| - \ref FIRSTPASS_STATS.coded_error |
| |
| - \ref SPEED_FEATURES (Encode speed vs quality tradeoff parameters) |
| - \ref SPEED_FEATURES.hl_sf (\ref HIGH_LEVEL_SPEED_FEATURES) |
| |
| - \ref HIGH_LEVEL_SPEED_FEATURES |
| - \ref HIGH_LEVEL_SPEED_FEATURES.recode_loop |
| - \ref HIGH_LEVEL_SPEED_FEATURES.recode_tolerance |
| |
| - \ref TplParams |
| |
| \section architecture_enc_use_cases Encoder Use Cases |
| |
| The libaom AV1 encoder is configurable to support a number of different use |
| cases and rate control strategies. |
| |
| The principal use cases for which it is optimised are as follows: |
| |
| - <b>Video on Demand / Streaming</b> |
| - <b>Low Delay or Live Streaming</b> |
| - <b>Video Conferencing / Real Time Coding (RTC)</b> |
| - <b>Fixed Quality / Testing</b> |
| |
| Other examples of use cases for which the encoder could be configured but for |
| which there is less by way of specific optimizations include: |
| |
| - <b>Download and Play</b> |
| - <b>Disk Playback</b> |
| - <b>Storage</b> |
| - <b>Editing</b> |
| - <b>Broadcast video</b> |
| |
| Specific use cases may have particular requirements or constraints. For |
| example: |
| |
| <b>Video Conferencing:</b> In a video conference we need to encode the video |
| in real time and to avoid any coding tools that could increase latency, such |
| as frame look ahead. |
| |
| <b>Live Streams:</b> In cases such as live streaming of games or events, it |
| may be possible to allow some limited buffering of the video and use of |
| lookahead coding tools to improve encoding quality. However, whilst a lag of |
| a second or two may be fine given the one way nature of this type of video, |
| it is clearly not possible to use tools such as two pass coding. |
| |
| <b>Broadcast:</b> Broadcast video (e.g. digital TV over satellite) may have |
| specific requirements such as frequent and regular key frames (e.g. once per |
| second or more) as these are important as entry points to users when switching |
| channels. There may also be strict upper limits on bandwidth over a short |
| window of time. |
| |
| <b>Download and Play:</b> Download and play applications may have less strict |
| requirements in terms of local frame by frame rate control but there may be a |
| requirement to accurately hit a file size target for the video clip as a |
| whole. Similar considerations may apply to playback from mass storage devices |
| such as DVD or disk drives. |
| |
| <b>Editing:</b> In certain special use cases such as offline editing, it may |
| be desirable to have very high quality and data rate but also very frequent |
| key frames or indeed to encode the video exclusively as key frames. Lossless |
| video encoding may also be required in this use case. |
| |
| <b>VOD / Streaming:</b> One of the most important and common use cases for AV1 |
| is video on demand or streaming, for services such as YouTube and Netflix. In |
| this use case it is possible to do two or even multi-pass encoding to improve |
| compression efficiency. Streaming services will often store many encoded |
| copies of a video at different resolutions and data rates to support users |
| with different types of playback device and bandwidth limitations. |
| Furthermore, these services support dynamic switching between multiple |
| streams, so that they can respond to changing network conditions. |
| |
| Exact rate control when encoding for a specific format (e.g. 360P or 1080P on |
| YouTube) may not be critical, provided that the video bandwidth remains within |
| allowed limits. Whilst a format may have a nominal target data rate, this can |
| be considered more as the desired average egress rate over the video corpus |
| rather than a strict requirement for any individual clip. Indeed, in order |
| to maintain optimal quality of experience for the end user, it may be |
| desirable to encode some easier videos or sections of video at a lower data |
| rate and harder videos or sections at a higher rate. |
| |
| VOD / streaming does not usually require very frequent key frames (as in the |
| broadcast case) but key frames are important in trick play (scanning back and |
| forth to different points in a video) and for adaptive stream switching. As |
| such, in a use case like YouTube, there is normally an upper limit on the |
| maximum time between key frames of a few seconds, but within certain limits |
| the encoder can try to align key frames with real scene cuts. |
| |
| Whilst encoder speed may not seem to be as critical in this use case, for |
| services such as YouTube, where millions of new videos have to be encoded |
| every day, encoder speed is still important, so libaom allows command line |
| control of the encode speed vs quality trade off. |
| |
| <b>Fixed Quality / Testing Mode:</b> Libaom also has a fixed quality encoder |
| pathway designed for testing under highly constrained conditions. |
| |
| \section architecture_enc_speed_quality Speed vs Quality Trade Off |
| |
| In any modern video encoder there are trade offs that can be made in regard to |
| the amount of time spent encoding a video or video frame vs the quality of the |
| final encode. |
| |
| These trade offs typically limit the scope of the search for an optimal |
| prediction / transform combination with faster encode modes doing fewer |
| partition, reference frame, prediction mode and transform searches at the cost |
| of some reduction in coding efficiency. |
| |
| The pruning of the size of the search tree is typically based on assumptions |
| about the likelihood of different search modes being selected based on what |
| has gone before and features such as the dimensions of the video frames and |
| the Q value selected for encoding the frame. For example, certain intra modes |
| are less likely to be chosen at high Q but may be more likely if similar |
| modes were used for the previously coded blocks above and to the left of the |
| current block. |
| |
| The speed settings depend both on the use case (e.g. Real Time encoding) and |
| an explicit speed control passed in on the command line as <b>--cpu-used</b> |
| and stored in the \ref AV1_COMP.speed field of the main compressor instance |
| data structure (<b>cpi</b>). |
| |
| The control flags for the speed trade off are stored in the \ref AV1_COMP.sf |
| field of the compressor instance and are set in the following functions:- |
| |
| - \ref av1_set_speed_features_framesize_independent() |
| - \ref av1_set_speed_features_framesize_dependent() |
| - \ref av1_set_speed_features_qindex_dependent() |
| |
| A second factor impacting the speed of encode is rate distortion optimisation |
| (<b>rd vs non-rd</b> encoding). |
| |
| When rate distortion optimization is enabled each candidate combination of |
| a prediction mode and transform coding strategy is fully encoded and the |
| resulting error (or distortion) as compared to the original source and the |
| number of bits used, are passed to a rate distortion function. This function |
| converts the distortion and cost in bits to a single <b>RD</b> value (where |
| lower is better). This <b>RD</b> value is used to decide between different |
| encoding strategies for the current block where, for example, one may |
| result in a lower distortion but a larger number of bits. |
| |
| The calculation of this <b>RD</b> value is broadly speaking as follows: |
| |
| \f[ |
| RD = (\lambda \times Rate) + Distortion |
| \f] |
| |
| This assumes a linear relationship between the number of bits used and |
| distortion (represented by the rate multiplier value <b>λ</b>) which is |
| not actually valid across a broad range of rate and distortion values. |
| Typically, where distortion is high, expending a small number of extra bits |
| will result in a large change in distortion. However, at lower values of |
| distortion the cost in bits of each incremental improvement is large. |
| |
| To deal with this we scale the value of <b>λ</b> based on the quantizer |
| value chosen for the frame. This is assumed to be a proxy for our approximate |
| position on the true rate distortion curve and it is further assumed that over |
| a limited range of distortion values, a linear relationship between distortion |
| and rate is a valid approximation. |
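| |
| The way such a comparison plays out can be sketched in C as follows. This is a |
| minimal illustration of the formula above, not libaom's implementation: the |
| <b>RdCandidate</b> type and the functions <b>rd_cost()</b> and |
| <b>pick_best_candidate()</b> are hypothetical names introduced for the example. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical candidate description; not a libaom structure. */
typedef struct {
  int64_t rate_bits;  /* estimated cost of the candidate in bits */
  int64_t distortion; /* e.g. sum of squared error against the source */
} RdCandidate;

/* RD = (lambda * rate) + distortion; lower is better. */
static double rd_cost(double lambda, const RdCandidate *c) {
  return lambda * (double)c->rate_bits + (double)c->distortion;
}

/* Returns the index of the candidate with the lowest RD value. */
static int pick_best_candidate(double lambda, const RdCandidate *cands,
                               int n) {
  assert(n > 0);
  int best = 0;
  double best_rd = rd_cost(lambda, &cands[0]);
  for (int i = 1; i < n; ++i) {
    const double rd = rd_cost(lambda, &cands[i]);
    if (rd < best_rd) {
      best_rd = rd;
      best = i;
    }
  }
  return best;
}
```
| |
| Given one candidate costing 100 bits with a distortion of 5000 and another |
| costing 400 bits with a distortion of 2000, a multiplier of 5 favours the |
| second candidate (RD 4000 vs 5500), whereas a multiplier of 20 favours the |
| first (RD 7000 vs 10000). |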
| |
| Doing a rate distortion test on each candidate prediction / transform |
| combination is expensive in terms of cpu cycles. Hence, for cases where encode |
| speed is critical, libaom implements a non-rd pathway where the <b>RD</b> |
| value is estimated based on the prediction error and quantizer setting. |
| |
| \section architecture_enc_src_proc Source Frame Processing |
| |
| \subsection architecture_enc_frame_proc_data Main Data Structures |
| |
| The following are the main data structures referenced in this section |
| (see also \ref architecture_enc_data_structures): |
| |
| - \ref AV1_COMP cpi (the main compressor instance data structure) |
| - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) |
| - \ref AV1_COMP.alt_ref_buffer (\ref yv12_buffer_config) |
| |
| - \ref AV1EncoderConfig (Encoder configuration parameters) |
| - \ref AV1EncoderConfig.algo_cfg (\ref AlgoCfg) |
| - \ref AV1EncoderConfig.kf_cfg (\ref KeyFrameCfg) |
| |
| - \ref AlgoCfg (Algorithm related configuration parameters) |
| - \ref AlgoCfg.arnr_max_frames |
| - \ref AlgoCfg.arnr_strength |
| |
| - \ref KeyFrameCfg (Keyframe coding configuration parameters) |
| - \ref KeyFrameCfg.enable_keyframe_filtering |
| |
| \subsection architecture_enc_frame_proc_ingest Frame Ingest / Coding Pipeline |
| |
| To encode a frame, first call \ref av1_receive_raw_frame() to obtain the raw |
| frame data. Then call \ref av1_get_compressed_data() to encode raw frame data |
| into compressed frame data. The main body of \ref av1_get_compressed_data() |
| is \ref av1_encode_strategy(), which determines high-level encode strategy |
| (frame type, frame placement, etc.) and then encodes the frame by calling |
| \ref av1_encode(). In \ref av1_encode(), \ref av1_first_pass() will execute |
| the first pass of two-pass encoding, while \ref encode_frame_to_data_rate() |
| will perform the final pass for either one-pass or two-pass encoding. |
| |
| The main body of \ref encode_frame_to_data_rate() is |
| \ref encode_with_recode_loop_and_filter(), which handles encoding before |
| in-loop filters (with recode loops \ref encode_with_recode_loop(), or |
| without any recode loop \ref encode_without_recode()), followed by in-loop |
| filters (deblocking filters \ref loopfilter_frame(), CDEF filters and |
| restoration filters \ref cdef_restoration_frame()). |
| |
| Apart from rate/quality control, both \ref encode_with_recode_loop() and |
| \ref encode_without_recode() call \ref av1_encode_frame() to manage the |
| reference frame buffers and \ref encode_frame_internal() to perform the |
| rest of encoding that does not require access to external frames. |
| \ref encode_frame_internal() is the starting point for the partition search |
| (see \ref architecture_enc_partitions). |
| |
| \subsection architecture_enc_frame_proc_tf Temporal Filtering |
| |
| \subsubsection architecture_enc_frame_proc_tf_overview Overview |
| |
| Video codecs exploit the spatial and temporal correlations in video signals to |
| achieve compression efficiency. The noise factor in the source signal |
| attenuates such correlation and impedes the codec performance. Denoising the |
| video signal is potentially a promising solution. |
| |
| One strategy for denoising a source is motion compensated temporal filtering. |
| Unlike image denoising, where only the spatial information is available, |
| video denoising can leverage a combination of the spatial and temporal |
| information. Specifically, in the temporal domain, similar pixels can often be |
| tracked along the motion trajectory of moving objects. Motion estimation is |
| applied to neighboring frames to find similar patches or blocks of pixels that |
| can be combined to create a temporally filtered output. |
| |
| AV1, in common with VP8 and VP9, uses an in-loop motion compensated temporal |
| filter to generate what are referred to as alternate reference frames (or ARF |
| frames). These can be encoded in the bitstream and stored as frame buffers for |
| use in the prediction of subsequent frames, but are not usually directly |
| displayed (hence they are sometimes referred to as non-display frames). |
| |
| The following command line parameters set the strength of the filter, the |
| number of frames used and determine whether filtering is allowed for key |
| frames. |
| |
| - <b>--arnr-strength</b> (\ref AlgoCfg.arnr_strength) |
| - <b>--arnr-maxframes</b> (\ref AlgoCfg.arnr_max_frames) |
| - <b>--enable-keyframe-filtering</b> |
| (\ref KeyFrameCfg.enable_keyframe_filtering) |
| |
| Note that in AV1, the temporal filtering scheme is designed around the |
| hierarchical ARF based pyramid coding structure. We typically apply denoising |
| only on key frames and ARF frames at the highest (and sometimes the second |
| highest) layer in the hierarchical coding structure. |
| |
| \subsubsection architecture_enc_frame_proc_tf_algo Temporal Filtering Algorithm |
| |
| Our method divides the current frame into "MxM" blocks. For each block, a |
| motion search is applied on frames before and after the current frame. Only |
| the best matching patch with the smallest mean square error (MSE) is kept as a |
| candidate patch for a neighbour frame. The current block is also a candidate |
| patch. A total of N candidate patches are combined to generate the filtered |
| output. |
| |
| Let f(i) represent the filtered sample value and \f$p_{j}(i)\f$ the sample |
| value of the j-th patch. The filtering process is: |
| |
| \f[ |
| f(i) = \frac{p_{0}(i) + \sum_{j=1}^{N} \omega_{j}(i) \cdot p_{j}(i)} |
| {1 + \sum_{j=1}^{N} \omega_{j}(i)} |
| \f] |
| |
| where \f$ \omega_{j}(i) \f$ is the weight of the j-th patch from a total of |
| N patches. The weight is determined by the patch difference as: |
| |
| \f[ |
| \omega_{j}(i) = \exp(-\frac{D_{j}(i)}{h^2}) |
| \f] |
| |
| where \f$ D_{j}(i) \f$ is the sum of squared differences between the current |
| block and the j-th candidate patch: |
| |
| \f[ |
| D_{j}(i) = \sum_{k\in\Omega_{i}}\|p_{0}(k) - p_{j}(k)\|_{2}^{2} |
| \f] |
| |
| where: |
| - \f$p_{0}\f$ refers to the current frame. |
| - \f$\Omega_{i}\f$ is the patch window, an "LxL" pixel square. |
| - h is a critical parameter that controls the decay of the weights measured by |
| the Euclidean distance. It is derived from an estimate of noise amplitude in |
| the source. This allows the filter coefficients to adapt for videos with |
| different noise characteristics. |
| - Usually, M = 32, N = 7, and L = 5, but they can be adjusted. |
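| |
| As a worked illustration of the weighting, consider a single sample with two |
| candidate patches from neighbouring frames. All of the numbers here are |
| invented for the example rather than taken from the encoder: |
| \f$h^2 = 100\f$, \f$p_{0}(i) = 100\f$, \f$p_{1}(i) = 102\f$ with |
| \f$D_{1}(i) = 50\f$, and \f$p_{2}(i) = 110\f$ with \f$D_{2}(i) = 300\f$. |
| |
| \f[ |
| \omega_{1}(i) = \exp(-\frac{50}{100}) \approx 0.607, \qquad |
| \omega_{2}(i) = \exp(-\frac{300}{100}) \approx 0.050 |
| \f] |
| |
| \f[ |
| f(i) = \frac{100 + (0.607 \times 102) + (0.050 \times 110)} |
| {1 + 0.607 + 0.050} \approx 101.0 |
| \f] |
| |
| The well matched patch contributes substantially, whereas the poorly matched |
| patch is largely suppressed, so the output stays close to the current sample. |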
| |
| Readers are encouraged to refer to the code for more details. |
| |
| \subsubsection architecture_enc_frame_proc_tf_funcs Temporal Filter Functions |
| |
| The main entry point for temporal filtering is \ref av1_temporal_filter(). |
| This function returns 1 if temporal filtering is successful, otherwise 0. |
| When temporal filtering is applied, the filtered frame will be held in |
| the frame buffer \ref AV1_COMP.alt_ref_buffer, which is the frame to be |
| encoded in the following encoding process. |
| |
| Almost all temporal filter related code is in av1/encoder/temporal_filter.c |
| and av1/encoder/temporal_filter.h. |
| |
| Inside \ref av1_temporal_filter(), the reader's attention is directed to |
| \ref tf_setup_filtering_buffer() and \ref tf_do_filtering(). |
| |
| - \ref tf_setup_filtering_buffer(): sets up the frame buffer for |
| temporal filtering, determines the number of frames to be used, and |
| calculates the noise level of each frame. |
| |
| - \ref tf_do_filtering(): the main function for the temporal |
| filtering algorithm. It breaks each frame into "MxM" blocks. For each |
| block a motion search \ref tf_motion_search() is applied to find |
| the motion vector from one neighboring frame. tf_build_predictor() is then |
| called to build the matching patch and \ref av1_apply_temporal_filter_c() (see |
| also optimised SIMD versions) to apply temporal filtering. The weighted |
| average over each pixel is accumulated and finally normalized in |
| \ref tf_normalize_filtered_frame() to generate the final filtered frame. |
| |
| - \ref av1_apply_temporal_filter_c(): the core function of our temporal |
| filtering algorithm (see also optimised SIMD versions). |
| |
| \subsection architecture_enc_frame_proc_film Film Grain Modelling |
| |
| Add details here. |
| |
| \section architecture_enc_rate_ctrl Rate Control |
| |
| \subsection architecture_enc_rate_ctrl_data Main Data Structures |
| |
| The following are the main data structures referenced in this section |
| (see also \ref architecture_enc_data_structures): |
| |
| - \ref AV1_COMP cpi (the main compressor instance data structure) |
| - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) |
| - \ref AV1_COMP.rc (\ref RATE_CONTROL) |
| - \ref AV1_COMP.twopass (\ref TWO_PASS) |
| - \ref AV1_COMP.sf (\ref SPEED_FEATURES) |
| |
| - \ref AV1EncoderConfig (Encoder configuration parameters) |
| - \ref AV1EncoderConfig.rc_cfg (\ref RateControlCfg) |
| |
| - \ref FIRSTPASS_STATS *frame_stats_buf (used to store per frame first |
| pass stats) |
| |
| - \ref SPEED_FEATURES (Encode speed vs quality tradeoff parameters) |
| - \ref SPEED_FEATURES.hl_sf (\ref HIGH_LEVEL_SPEED_FEATURES) |
| |
| \subsection architecture_enc_rate_ctrl_options Supported Rate Control Options |
| |
| Different use cases (\ref architecture_enc_use_cases) may have different |
| requirements in terms of data rate control. |
| |
| The broad rate control strategy is selected using the <b>--end-usage</b> |
| parameter on the command line, which maps onto the field |
| \ref aom_codec_enc_cfg_t.rc_end_usage in \ref aom_encoder.h. |
| |
| The four supported options are:- |
| |
| - <b>VBR</b> (Variable Bitrate) |
| - <b>CBR</b> (Constant Bitrate) |
| - <b>CQ</b> (Constrained Quality mode ; A constrained variant of VBR) |
| - <b>Fixed Q</b> (Constant quality or Q mode) |
| |
| The value of \ref aom_codec_enc_cfg_t.rc_end_usage is in turn copied over |
| into the encoder rate control configuration data structure as |
| \ref RateControlCfg.mode. |
| |
| With regard to the most important use cases above, video on demand uses either |
| VBR or CQ mode. CBR is the preferred rate control model for RTC and live |
| streaming, and Fixed Q is only used in testing. |
| |
| The behaviour of each of these modes is regulated by a series of secondary |
| command line rate control options but also depends somewhat on the selected |
| use case, whether 2-pass coding is enabled and the selected encode speed vs |
| quality trade offs (\ref AV1_COMP.speed and \ref AV1_COMP.sf). |
| |
| The list below gives the names of the main rate control command line |
| options together with the names of the corresponding fields in the rate |
| control configuration data structures. |
| |
| - <b>--target-bitrate</b> (\ref RateControlCfg.target_bandwidth) |
| - <b>--min-qp</b> (\ref RateControlCfg.best_allowed_q) |
| - <b>--max-qp</b> (\ref RateControlCfg.worst_allowed_q) |
| - <b>--qp</b> (\ref RateControlCfg.qp) |
| - <b>--undershoot-pct</b> (\ref RateControlCfg.under_shoot_pct) |
| - <b>--overshoot-pct</b> (\ref RateControlCfg.over_shoot_pct) |
| |
| The following options control aspects of VBR encoding: |
| |
| - <b>--minsection-pct</b> (\ref RateControlCfg.vbrmin_section) |
| - <b>--maxsection-pct</b> (\ref RateControlCfg.vbrmax_section) |
| |
| The following relate to buffer and delay management in one pass low delay and |
| real time coding: |
| |
| - <b>--buf-sz</b> (\ref RateControlCfg.maximum_buffer_size_ms) |
| - <b>--buf-initial-sz</b> (\ref RateControlCfg.starting_buffer_level_ms) |
| - <b>--buf-optimal-sz</b> (\ref RateControlCfg.optimal_buffer_level_ms) |
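| |
| To make the role of these three values concrete, the following C sketch models |
| a simple leaky bucket buffer. It is an assumption-laden illustration and not |
| libaom's rate control code: the <b>BufferModel</b> type and the functions |
| <b>ms_to_bits()</b>, <b>buffer_init()</b> and <b>buffer_update()</b> are |
| hypothetical names, and it assumes that each millisecond value is converted |
| to bits using the target bitrate. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical buffer model; names are illustrative only. */
typedef struct {
  int64_t max_bits;     /* capacity, from the maximum buffer size in ms */
  int64_t optimal_bits; /* level the rate control would try to hold */
  int64_t level_bits;   /* current fullness */
} BufferModel;

/* Converts a duration in milliseconds to bits at the given bitrate. */
static int64_t ms_to_bits(int64_t ms, int64_t bits_per_second) {
  return ms * bits_per_second / 1000;
}

static BufferModel buffer_init(int64_t bits_per_second, int64_t max_ms,
                               int64_t start_ms, int64_t optimal_ms) {
  assert(bits_per_second > 0);
  assert(max_ms >= start_ms && max_ms >= optimal_ms);
  BufferModel b;
  b.max_bits = ms_to_bits(max_ms, bits_per_second);
  b.optimal_bits = ms_to_bits(optimal_ms, bits_per_second);
  b.level_bits = ms_to_bits(start_ms, bits_per_second);
  return b;
}

/* In each frame interval the channel delivers bits_per_frame into the buffer
 * and the coded frame removes frame_bits. The level is capped at capacity; a
 * negative level indicates underflow. */
static void buffer_update(BufferModel *b, int64_t bits_per_frame,
                          int64_t frame_bits) {
  b->level_bits += bits_per_frame - frame_bits;
  if (b->level_bits > b->max_bits) b->level_bits = b->max_bits;
}
```
| |
| In a model of this kind, a frame that is larger than the per frame budget |
| lowers the buffer level, and a rate controller would respond by raising Q for |
| later frames as the level drops below the optimal value. |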
| |
| \subsection architecture_enc_vbr Variable Bitrate (VBR) Encoding |
| |
| For streamed VOD content the most common rate control strategy is Variable |
| Bitrate (VBR) encoding. The CQ mode mentioned above is a variant of this |
| where additional quantizer and quality constraints are applied. VBR |
| encoding may in theory be used in conjunction with either 1-pass or 2-pass |
| encoding. |
| |
| VBR encoding varies the number of bits given to each frame or group of frames |
| according to the difficulty of that frame or group of frames, such that easier |
| frames are allocated fewer bits and harder frames are allocated more bits. The |
| intent here is to even out the quality between frames. This contrasts with |
| Constant Bitrate (CBR) encoding where each frame is allocated the same number |
| of bits. |
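| |
| The contrast between the two strategies can be sketched in C as follows. This |
| is a simplified illustration only: the functions <b>allocate_equal()</b> and |
| <b>allocate_by_complexity()</b> and the per frame complexity scores are |
| hypothetical, and the real encoder also takes frame type and prediction |
| quality into account. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* CBR-style split: every frame receives the same share of the budget. */
static void allocate_equal(int64_t total_bits, int n, int64_t *out) {
  assert(n > 0);
  for (int i = 0; i < n; ++i) out[i] = total_bits / n;
}

/* VBR-style split: each frame's share is proportional to a complexity score
 * (hypothetical; for example derived from a prediction error measure). */
static void allocate_by_complexity(int64_t total_bits,
                                   const double *complexity, int n,
                                   int64_t *out) {
  assert(n > 0);
  double sum = 0.0;
  for (int i = 0; i < n; ++i) sum += complexity[i];
  assert(sum > 0.0);
  for (int i = 0; i < n; ++i)
    out[i] = (int64_t)((double)total_bits * complexity[i] / sum);
}
```
| |
| For a budget of 1000 bits over four frames with complexity scores of 1, 3, 4 |
| and 2, the equal split gives each frame 250 bits, whereas the complexity |
| weighted split gives 100, 300, 400 and 200 bits respectively. |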
| |
| Whilst for any given frame or group of frames the data rate may vary, the VBR |
| algorithm attempts to deliver a given average bitrate over a wider time |
| interval. In standard VBR encoding, the time interval over which the data rate |
| is averaged is usually the duration of the video clip. An alternative |
| approach is to target an average VBR bitrate over the entire video corpus for |
| a particular video format (corpus VBR). |
| |
| \subsubsection architecture_enc_1pass_vbr 1 Pass VBR Encoding |
| |
| The command line for libaom does allow 1 Pass VBR, but this has not been |
| properly optimised and behaves much like 1 pass CBR in most regards, with bits |
| allocated to frames by the following functions: |
| |
| - \ref av1_calc_iframe_target_size_one_pass_vbr() |
| - \ref av1_calc_pframe_target_size_one_pass_vbr() |
| |
| \subsubsection architecture_enc_2pass_vbr 2 Pass VBR Encoding |
| |
| The main focus here will be on 2-pass VBR encoding (and the related CQ mode) |
| as these are the modes most commonly used for VOD content. |
| |
| 2-pass encoding is selected on the command line by setting --passes=2 |
| (or -p 2). |
| |
| Generally speaking, in 2-pass encoding, an encoder will first encode a video |
| using a default set of parameters and assumptions. Depending on the outcome |
| of that first encode, the baseline assumptions and parameters will be adjusted |
| to optimize the output during the second pass. In essence the first pass is a |
| fact finding mission to establish the complexity and variability of the video, |
| in order to allow a better allocation of bits in the second pass. |
| |
| The libaom 2-pass algorithm is unusual in that the first pass is not a full |
| encode of the video. Rather it uses a limited set of prediction and transform |
| options and a fixed quantizer, to generate statistics about each frame. No |
| output bitstream is created and the per frame first pass statistics are stored |
| entirely in volatile memory. This has some disadvantages when compared to a |
| full first pass encode, but avoids the need for file I/O and improves speed. |
| |
| For two pass encoding, the function \ref av1_encode() will first be called |
| for each frame in the video with the value \ref AV1EncoderConfig.pass = 1. |
| This will result in calls to \ref av1_first_pass(). |
| |
| Statistics for each frame are stored in \ref FIRSTPASS_STATS frame_stats_buf. |
| |
| After completion of the first pass, \ref av1_encode() will be called again for |
| each frame with \ref AV1EncoderConfig.pass = 2. The frames are then encoded in |
| accordance with the statistics gathered during the first pass by calls to |
| \ref encode_frame_to_data_rate() which in turn calls |
| \ref av1_get_second_pass_params(). |
| |
| In summary the second pass code:- |
| |
| - Searches for scene cuts (if auto key frame detection is enabled). |
| - Defines the length of and hierarchical structure to be used in each |
| ARF/GF group. |
| - Allocates bits based on the relative complexity of each frame, the quality |
| of frame to frame prediction and the type of frame (e.g. key frame, ARF |
| frame, golden frame or normal leaf frame). |
| - Suggests a maximum Q (quantizer value) for each ARF/GF group, based on |
| estimated complexity and recent rate control compliance |
| (\ref RATE_CONTROL.active_worst_quality) |
| - Tracks adherence to the overall rate control objectives and adjusts |
| heuristics. |
| |
| The main two pass functions in regard to the above include:- |
| |
| - \ref find_next_key_frame() |
| - \ref define_gf_group() |
| - \ref calculate_total_gf_group_bits() |
| - \ref get_twopass_worst_quality() |
| - \ref av1_gop_setup_structure() |
| - \ref av1_gop_bit_allocation() |
| - \ref av1_twopass_postencode_update() |
| |
| For each frame, the two pass algorithm defines a target number of bits |
| \ref RATE_CONTROL.base_frame_target, which is then adjusted if necessary to |
| reflect any undershoot or overshoot on previous frames to give |
| \ref RATE_CONTROL.this_frame_target. |
| |
| As well as \ref RATE_CONTROL.active_worst_quality, the two pass code also |
| maintains a record of the actual Q value used to encode previous frames |
| at each level in the current pyramid hierarchy |
| (\ref RATE_CONTROL.active_best_quality). The function |
| \ref rc_pick_q_and_bounds() uses these values to set a permitted Q range |
| for each frame. |
| |
| \subsubsection architecture_enc_1pass_lagged 1 Pass Lagged VBR Encoding |
| |
| 1 pass lagged encoding falls between simple 1 pass encoding and full two pass |
| encoding. It is used where a full first pass through the entire video clip |
| is not possible but some delay is permissible, for example near live |
| streaming with a delay of up to a few seconds. In |
| this case the first pass and second pass are in effect combined such that the |
| first pass starts encoding the clip and the second pass lags behind it by a |
| few frames. When using this method, full sequence level statistics are not |
| available, but it is possible to collect and use frame or group of frame level |
| data to help in the allocation of bits and in defining ARF/GF coding |
| hierarchies. The reader is referred to the \ref AV1_COMP.lap_enabled field |
| in the main compressor instance (where <b>lap</b> stands for |
| <b>look ahead processing</b>). This encoding mode for the most part uses the |
| same rate control pathways as two pass VBR encoding. |
| |
| \subsection architecture_enc_rc_loop The Main Rate Control Loop |
| |
| Having established a target rate for a given frame and an allowed range of Q |
| values, the encoder then tries to encode the frame at a rate that is as close |
| as possible to the target value, given the Q range constraints. |
| |
| There are two main mechanisms by which this is achieved. |
| |
| The first selects a frame level Q, using an adaptive estimate of the number of |
| bits that will be generated when the frame is encoded at any given Q. |
| Fundamentally this mechanism is common to VBR, CBR and to use cases such as |
| RTC with small adjustments. |
| |
| As the Q value mainly adjusts the precision of the residual signal, it is not |
| actually a reliable basis for accurately predicting the number of bits that |
| will be generated across all clips. A well predicted clip, for example, may |
| have a much smaller error residual after prediction. The algorithm copes with |
| this by adapting its predictions on the fly using a feedback loop based on how |
| well it did the previous time around. |
| |
| The main functions responsible for the prediction of Q and the adaptation over |
| time, for the two pass encoding pipeline are: |
| |
| - \ref rc_pick_q_and_bounds() |
| - \ref get_q() |
| - \ref av1_rc_regulate_q() |
| - \ref get_rate_correction_factor() |
| - \ref set_rate_correction_factor() |
| - \ref find_closest_qindex_by_rate() |
| - \ref av1_twopass_postencode_update() |
| - \ref av1_rc_update_rate_correction_factors() |
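
The feedback loop described above can be sketched as follows. This is a
minimal illustration only and is not the libaom implementation: the bits
versus Q model, the damping value, the clamp range and all function names
(model_bits, estimate_bits, pick_q, update_correction_factor) are assumptions
made for the example.

```c
// Hypothetical bits-vs-Q model, used only so the sketch has something to
// correct. It is NOT the model used by libaom.
static double model_bits(int q_index) { return 4000000.0 / (q_index + 16); }

// Predicted frame size at q_index, scaled by the adaptive correction factor.
double estimate_bits(double correction_factor, int q_index) {
  return correction_factor * model_bits(q_index);
}

// Pick the lowest q index in [best_q, worst_q] whose predicted size does not
// exceed the target. Falls back to worst_q if no value qualifies.
int pick_q(double correction_factor, double target_bits, int best_q,
           int worst_q) {
  for (int q = best_q; q <= worst_q; ++q) {
    if (estimate_bits(correction_factor, q) <= target_bits) return q;
  }
  return worst_q;
}

// After a frame is encoded, move the factor part of the way towards the
// value that would have made the prediction exact.
double update_correction_factor(double correction_factor,
                                double projected_bits, double actual_bits) {
  const double damping = 0.5;      // assumed adaptation speed
  const double min_factor = 0.01;  // assumed clamp range
  const double max_factor = 100.0;
  if (projected_bits <= 0.0) return correction_factor;
  const double ideal = correction_factor * actual_bits / projected_bits;
  double updated = correction_factor + damping * (ideal - correction_factor);
  if (updated < min_factor) updated = min_factor;
  if (updated > max_factor) updated = max_factor;
  return updated;
}
```

In this sketch a clip that produces twice as many bits as predicted raises the
factor, which in turn pushes pick_q() towards a higher Q for the next frame.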
| |
| A second mechanism for control comes into play if there is a large rate miss |
| for the current frame (much too big or too small). This is a recode mechanism |
| which allows the current frame to be re-encoded one or more times with a |
| revised Q value. This obviously has significant implications for encode speed |
| and, in the case of RTC, for latency (hence it is not used for the RTC pathway). |
| |
| Whether or not a recode is allowed for a given frame depends on the selected |
| encode speed vs quality trade off. This is set on the command line using the |
| --cpu-used parameter which maps onto the \ref AV1_COMP.speed field in the main |
| compressor instance data structure. |
| |
| The value of \ref AV1_COMP.speed, combined with the use case, is used to |
| populate the speed features data structure AV1_COMP.sf. In particular |
| \ref HIGH_LEVEL_SPEED_FEATURES.recode_loop determines the types of frames that |
| may be recoded and \ref HIGH_LEVEL_SPEED_FEATURES.recode_tolerance is a rate |
| error trigger threshold. |
| |
| For more information the reader is directed to the following functions: |
| |
| - \ref encode_with_recode_loop() |
| - \ref encode_without_recode() |
| - \ref recode_loop_update_q() |
| - \ref recode_loop_test() |
| - \ref av1_set_speed_features_framesize_independent() |
| - \ref av1_set_speed_features_framesize_dependent() |
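
As a rough illustration of the recode mechanism, the sketch below re-encodes a
frame with a revised Q while the frame size falls outside a tolerance band
around the target. It is a simplified sketch under stated assumptions, not the
logic of the functions listed above: the bisection strategy, the percentage
tolerance, and the names toy_encode and encode_with_recode are all
hypothetical.

```c
// Hypothetical stand-in for a real frame encode: returns the frame size in
// bits at quantizer index q (size falls as q rises).
int toy_encode(int q) { return 4000000 / (q + 16); }

// Encode at q, then recode up to max_recodes times while the size misses the
// target by more than tolerance_pct percent. Returns the final q and writes
// the final frame size to *bits_out.
int encode_with_recode(int target_bits, int tolerance_pct, int q, int q_low,
                       int q_high, int max_recodes, int *bits_out) {
  const int margin = target_bits * tolerance_pct / 100;
  const int low_limit = target_bits - margin;
  const int high_limit = target_bits + margin;
  int bits = toy_encode(q);
  for (int i = 0; i < max_recodes; ++i) {
    if (bits > high_limit && q < q_high) {
      q_low = q + 1;  // frame too big: only consider higher q values
    } else if (bits < low_limit && q > q_low) {
      q_high = q - 1;  // frame too small: only consider lower q values
    } else {
      break;  // within tolerance, or no room left to move q
    }
    q = (q_low + q_high) / 2;
    bits = toy_encode(q);
  }
  *bits_out = bits;
  return q;
}
```

Setting max_recodes to zero corresponds to the case where recoding is not
allowed, so the first encode is always accepted.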
| |
| \subsection architecture_enc_fixed_q Fixed Q Mode |
| |
| There are two main fixed Q cases: |
| -# Fixed Q with adaptive qp offsets: same qp offset for each pyramid level |
| in a given video, but these offsets are adaptive based on video content. |
| -# Fixed Q with fixed qp offsets: content-independent fixed qp offsets for |
| each pyramid level. (see \ref get_q_using_fixed_offsets()). |
| |
| The reader is also referred to the following functions: |
| - \ref av1_rc_pick_q_and_bounds() |
| - \ref rc_pick_q_and_bounds_no_stats_cbr() |
| - \ref rc_pick_q_and_bounds_no_stats() |
| - \ref rc_pick_q_and_bounds() |
| |
| \section architecture_enc_frame_groups GF / ARF Frame Groups & Hierarchical Coding |
| |
| \subsection architecture_enc_frame_groups_data Main Data Structures |
| |
| The following are the main data structures referenced in this section |
| (see also \ref architecture_enc_data_structures): |
| |
| - \ref AV1_COMP cpi (the main compressor instance data structure) |
| - \ref AV1_COMP.rc (\ref RATE_CONTROL) |
| |
| - \ref FIRSTPASS_STATS *frame_stats_buf (used to store per frame first pass |
| stats) |
| |
| \subsection architecture_enc_frame_groups_groups Frame Groups |
| |
| To process a sequence/stream of video frames, the encoder divides the frames |
| into groups and encodes them sequentially (possibly dependent on previous |
| groups). In AV1 such a group is usually referred to as a golden frame group |
| (GF group) or sometimes an Alt-Ref (ARF) group or a group of pictures (GOP). |
| A GF group determines and stores the coding structure of the frames (for |
| example, frame type, usage of the hierarchical structure, usage of overlay |
| frames, etc.) and can be considered as the base unit to process the frames, |
| therefore playing an important role in the encoder. |
| |
| The length of a specific GF group is arguably the most important aspect when |
| determining a GF group. This is because most GF group level decisions are |
| based on the frame characteristics, if not on the length itself directly. |
| Note that the GF group is always a group of consecutive frames, which means |
| the start and end of the group (so again, the length of it) determines which |
| frames are included in it and hence determines the characteristics of the GF |
| group. Therefore, in this document we will first discuss the GF group length |
| decision in libaom, followed by frame structure decisions when defining a GF |
| group with a certain length. |
| |
| \subsection architecture_enc_gf_length GF / ARF Group Length Determination |
| |
| The basic intuition of determining the GF group length is that it is usually |
| desirable to group together frames that are similar. Hence, we may choose |
| longer groups when consecutive frames are very alike and shorter ones when |
| they are very different. |
| |
| The determination of the GF group length is done in function \ref |
| calculate_gf_length(). The following encoder use cases are supported: |
| |
| <ul> |
| <li><b>Single pass with look-ahead disabled (\ref has_no_stats_stage()): |
| </b> in this case there is no information available on the following stream |
| of frames, therefore the function will set the GF group length for the |
| current and the following GF groups (a total number of MAX_NUM_GF_INTERVALS |
| groups) to be the maximum value allowed.</li> |
| |
| <li><b>Single pass with look-ahead enabled (\ref AV1_COMP.lap_enabled):</b> |
| look-ahead processing is enabled for single pass, therefore there is a |
| limited amount of information available regarding future frames. In this |
| case the function will determine the length based on \ref FIRSTPASS_STATS |
| (which is generated when processing the look-ahead buffer) for only the |
| current GF group.</li> |
| |
| <li><b>Two pass:</b> the first pass in two-pass encoding collects the stats |
| and will not call the function. In the second pass, the function tries to |
| determine the GF group length of the current and the following GF groups (a |
| total number of MAX_NUM_GF_INTERVALS groups) based on the first-pass |
| statistics. Note that as we will be discussing later, such decisions may not |
| be accurate and can be changed later.</li> |
| </ul> |
| |
| Except for the first trivial case where there is no prior knowledge of the |
| following frames, the function \ref calculate_gf_length() tries to |
| determine the GF group length based on the first pass statistics. As shown |
| in figure [TODO BohalLi@], the determination is divided into two parts: |
| |
| <ol> |
| <li>Baseline decision based on accumulated statistics: this part of the function |
| iterates through the first-pass statistics of the following frames and |
| accumulates the statistics with function accumulate_next_frame_stats. |
| The accumulated statistics are then used to determine whether the |
| correlation in the GF group has dropped too much in function detect_gf_cut. |
| If detect_gf_cut returns non-zero, or if we've reached the end of |
| first-pass statistics, the baseline decision is set at the current point.</li> |
| |
| <li>If we are not at the end of the first-pass statistics, the next part will |
| try to refine the baseline decision. The algorithm is based on |
| \ref FIRSTPASS_STATS.coded_error. It tries to label the frames in the |
| baseline group into two classes: high-error and low-error, and cuts the GF |
| group at the furthest location that is also of the low-error class. |
| Detailed algorithm description is introduced here [TODO].</li> |
| </ol> |
| |
| As mentioned, for two-pass encoding, the function \ref |
| calculate_gf_length() tries to determine the length of as many as |
| MAX_NUM_GF_INTERVALS groups. The decisions are stored in |
| \ref RATE_CONTROL.gf_intervals[]. The variables |
| \ref RATE_CONTROL.intervals_till_gf_calculate_due and |
| \ref RATE_CONTROL.cur_gf_index help with managing and updating the stored |
| decisions. In the function \ref define_gf_group(), the corresponding |
| stored length decision will be used to define the current GF group. |
| |
| When the maximum GF group length is greater than or equal to 32, the encoder will |
| enforce an extra layer to determine whether to use maximum GF length of 32 |
| or 16 for every GF group. In such a case, \ref calculate_gf_length() is |
| first called with the original maximum length (>=32). Afterwards, |
| \ref av1_tpl_setup_stats() is called to analyze the determined GF group |
| and compare the reference to the last frame and the middle frame. If it is |
| decided that we should use a maximum GF length of 16, the function |
| \ref calculate_gf_length() is called again with the updated maximum |
| length, and it only sets the length for a single GF group |
| (\ref RATE_CONTROL.intervals_till_gf_calculate_due is set to 1). This process |
| is shown in [TODO BohalLi@]. |
| |
| Before encoding each frame, the encoder checks |
| \ref RATE_CONTROL.frames_till_gf_update_due. If it is zero, indicating |
| processing of the current GF group is done, the encoder will check whether |
| \ref RATE_CONTROL.intervals_till_gf_calculate_due is zero. If it is, as |
| discussed above, \ref calculate_gf_length() is called with original |
| maximum length. If it is not zero, then the GF group length value stored |
| in \ref RATE_CONTROL.gf_intervals[\ref RATE_CONTROL.cur_gf_index] is used |
| (subject to change as discussed above). |
| |
| \subsection architecture_enc_gf_structure Defining a GF Group's Structure |
| |
| The function \ref define_gf_group() defines the frame structure as well |
| as other GF group level parameters (e.g. bit allocation) once the length of |
| the current GF group is determined. |
| |
| The function first iterates through the first pass statistics in the GF group |
| to accumulate various stats, using (TODO REF) accumulate_this_frame_stats() |
| and (TODO REF) accumulate_next_frame_stats(). The accumulated statistics are |
| then used to determine the use of an ALTREF frame along with other |
| properties of the GF group. The values of \ref RATE_CONTROL.cur_gf_index, |
| \ref RATE_CONTROL.intervals_till_gf_calculate_due and |
| \ref RATE_CONTROL.frames_till_gf_update_due are also updated accordingly. |
| |
| The function \ref av1_gop_setup_structure() is called at the end to |
| determine the frame layers and reference maps in the GF group, as shown in |
| [TODO BohalLi@]. The (TODO REF) construct_multi_layer_gf_structure() |
| function sets the frame update types for each frame and the group structure. |
| |
| - If ALTREF frames are allowed for the GF group: the first frame is set to |
| KF_UPDATE, OVERLAY_UPDATE or GF_UPDATE based on the previous GF group |
| (if it exists). The last frame of the GF group is set to ARF_UPDATE. |
| Then in (TODO REF) set_multi_layer_params(), frame update types are |
| determined recursively in a binary tree fashion, and assigned to give |
| the final IBBB structure for the group. |
| - If the current branch has more than 2 frames and we have not reached |
| maximum layer depth, then the middle frame is set as INTNL_ARF_UPDATE, |
| and the left and right branches are processed recursively. |
| - If the current branch has fewer than 3 frames, or we have reached maximum |
| layer depth, then every frame in the branch is set to LF_UPDATE. |
| - If ALTREF frame is not allowed for the GF group: the first frame is set to |
| KF_UPDATE, OVERLAY_UPDATE or GF_UPDATE, and the rest of them are set as |
| LF_UPDATE. This basically forms an IPPP GF group structure. |
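
The recursive assignment described in the list above can be sketched as
follows. This is an illustrative sketch only: it assigns update types in
display order and ignores coding order, reference maps and overlay handling.
The enum SketchUpdateType and the functions set_branch_types and
build_gf_types are hypothetical names for this example, although the
enumerator names mirror the update types mentioned in the text.

```c
// Local enum for the sketch; values are assumptions, not libaom's.
typedef enum {
  KF_UPDATE,
  LF_UPDATE,
  GF_UPDATE,
  ARF_UPDATE,
  OVERLAY_UPDATE,
  INTNL_ARF_UPDATE
} SketchUpdateType;

// Assign types to the frames in [start, end) in a binary tree fashion.
static void set_branch_types(SketchUpdateType *types, int start, int end,
                             int depth, int max_depth) {
  const int count = end - start;
  if (count <= 0) return;
  if (count < 3 || depth >= max_depth) {
    // Fewer than 3 frames, or maximum depth reached: all leaf frames.
    for (int i = start; i < end; ++i) types[i] = LF_UPDATE;
    return;
  }
  // Otherwise the middle frame becomes an internal ARF and both halves are
  // processed recursively.
  const int mid = start + count / 2;
  types[mid] = INTNL_ARF_UPDATE;
  set_branch_types(types, start, mid, depth + 1, max_depth);
  set_branch_types(types, mid + 1, end, depth + 1, max_depth);
}

// first_type is KF_UPDATE, OVERLAY_UPDATE or GF_UPDATE, chosen by the caller
// based on the previous GF group.
void build_gf_types(SketchUpdateType *types, int length,
                    SketchUpdateType first_type, int use_altref,
                    int max_depth) {
  if (length <= 0) return;
  types[0] = first_type;
  if (!use_altref || length < 2) {
    for (int i = 1; i < length; ++i) types[i] = LF_UPDATE;
    return;
  }
  types[length - 1] = ARF_UPDATE;
  set_branch_types(types, 1, length - 1, 1, max_depth);
}
```

For a group of 9 frames with ALTREF allowed and a maximum depth of 3, this
sketch marks frames 2, 4 and 6 as internal ARFs, frame 8 as the ARF and the
remaining frames after the first as leaf frames.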
| |
| The encoder may use temporal dependency modelling (TPL - see |
| \ref architecture_enc_tpl) to determine whether we should use a maximum length |
| of 32 or 16 for the current GF group. This requires calls to |
| \ref define_gf_group() but should not change other settings (since it is in |
| essence a trial). This special case is indicated by setting the parameter |
| <b>is_final_pass</b> to zero. |
| |
| For single pass encodes where look-ahead processing is disabled |
| (\ref AV1_COMP.lap_enabled = 0), \ref define_gf_group_pass0() is used |
| instead of \ref define_gf_group(). |
| |
| \subsection architecture_enc_kf_groups Key Frame Groups |
| |
| A special constraint for GF group length is the location of the next keyframe |
| (KF). The frames between two KFs are referred to as a KF group. Each KF group |
| can be encoded and decoded independently. Because of this, a GF group cannot |
| span beyond a KF and the location of the next KF is set as a hard boundary |
| for GF group length. |
| |
| <ul> |
| <li>For two-pass encoding \ref RATE_CONTROL.frames_to_key controls when to |
| encode a key frame. When it is zero, the current frame is a keyframe and |
| the function \ref find_next_key_frame() is called. This in turn calls |
| \ref define_kf_interval() to work out where the next key frame should |
| be placed.</li> |
| |
| <li>For single-pass with look-ahead enabled, \ref define_kf_interval() |
| is called whenever a GF group update is needed (when |
| \ref RATE_CONTROL.frames_till_gf_update_due is zero). This is because |
| generally KFs are more widely spaced and the look-ahead buffer is usually |
| not long enough.</li> |
| |
| <li>For single-pass with look-ahead disabled, the KFs are placed according |
| to the command line parameter <b>--kf-max-dist</b> (The above two cases are |
| also subject to this constraint).</li> |
| </ul> |
| |
| The function \ref define_kf_interval() tries to detect a scenecut. |
| If a scenecut within kf-max-dist is detected, then it is set as the next |
| keyframe. Otherwise the given maximum value is used. |
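
The placement rule in the paragraph above amounts to the small sketch below.
It assumes that scene cut detection has already produced one flag per
look-ahead frame. The is_scenecut array and the function name
next_key_frame_distance are hypothetical and are not part of libaom.

```c
// is_scenecut[d] is non-zero if the frame d frames ahead of the current key
// frame starts a new scene. Returns the distance to the next key frame.
int next_key_frame_distance(const unsigned char *is_scenecut, int num_frames,
                            int kf_max_dist) {
  for (int d = 1; d < num_frames && d <= kf_max_dist; ++d) {
    if (is_scenecut[d]) return d;  // scene cut found within the limit
  }
  return kf_max_dist;  // no scene cut found: use the maximum distance
}
```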
| |
| \section architecture_enc_tpl Temporal Dependency Modelling |
| |
| The temporal dependency model runs at the beginning of each GOP. It builds the |
| motion trajectory within the GOP in units of 16x16 blocks. The temporal |
| dependency of a 16x16 block is evaluated as the predictive coding gains it |
| contributes to its trailing motion trajectory. This temporal dependency model |
| reflects how important a coding block is for the coding efficiency of the |
| overall GOP. It is hence used to scale the Lagrangian multiplier used in the |
| rate-distortion optimization framework. |
| |
| \subsection architecture_enc_tpl_config Configurations |
| |
| The temporal dependency model and its applications are by default turned on in |
| the libaom encoder for the VoD use case. To disable it, use --tpl-model=0 in the |
| aomenc configuration. |
| |
| \subsection architecture_enc_tpl_algoritms Algorithms |
| |
| The scheme works in the reverse frame processing order over the source frames, |
| propagating information from future frames back to the current frame. For each |
| frame, a propagation step is run for each MB. It operates as follows: |
| |
| <ul> |
| <li> Estimate the intra prediction cost in terms of sum of absolute Hadamard |
| transform difference (SATD) noted as intra_cost. It also loads the motion |
| information available from the first-pass encode and estimates the inter |
| prediction cost as inter_cost. Due to the use of hybrid inter/intra |
| prediction mode, the inter_cost value is further upper bounded by |
| intra_cost. A propagation cost variable is used to collect all the |
| information flowed back from future processing frames. It is initialized as |
| 0 for all the blocks in the last processing frame in a group of pictures |
| (GOP).</li> |
| |
| <li> The fraction of information from a current block to be propagated towards |
| its reference block is estimated as: |
| \f[ |
| propagation\_fraction = (1 - inter\_cost/intra\_cost) |
| \f] |
| It reflects the fraction by which the motion compensated reference reduces |
| the prediction error.</li> |
| |
| <li> The total amount of information the current block contributes to the GOP |
| is estimated as intra_cost + propagation_cost. The information that it |
| propagates towards its reference block is captured by: |
| |
| \f[ |
| propagation\_amount = |
| (intra\_cost + propagation\_cost) * propagation\_fraction |
| \f]</li> |
| |
| <li> Note that the reference block may not necessarily sit on the grid of |
| 16x16 blocks. The propagation amount is hence dispensed to all the blocks |
| that overlap with the reference block. The corresponding block in the |
| reference frame accumulates its own propagation cost as it receives back |
| propagation. |
| |
| \f[ |
| propagation\_cost = propagation\_cost + |
| (\frac{overlap\_area}{(16*16)} * propagation\_amount) |
| \f]</li> |
| |
| <li> In the final encoding stage, the distortion propagation factor of a block |
| is evaluated as \f$(1 + \frac{propagation\_cost}{intra\_cost})\f$, where the second term |
| captures its impact on later frames in a GOP.</li> |
| |
| <li> The Lagrangian multiplier is adapted at the 64x64 block level. For every |
| 64x64 block in a frame, we have a distortion propagation factor: |
| |
| \f[ |
| dist\_prop[i] = 1 + \frac{propagation\_cost[i]}{intra\_cost[i]} |
| \f] |
| |
| where i denotes the block index in the frame. We also have the frame level |
| distortion propagation factor: |
| |
| \f[ |
| dist\_prop = 1 + |
| \frac{\sum_{i}propagation\_cost[i]}{\sum_{i}intra\_cost[i]} |
| \f] |
| |
| which is used to normalize the propagation factor at the 64x64 block level. The |
| Lagrangian multiplier is hence adapted as: |
| |
| \f[ |
| \lambda[i] = \lambda_{0} * \frac{dist\_prop}{dist\_prop[i]} |
| \f] |
| |
| where \f$\lambda_{0}\f$ is the multiplier associated with the frame level QP. The |
| 64x64 block level QP is scaled according to the Lagrangian multiplier.</li> |
| </ul> |
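
The per block arithmetic in the steps above can be written out as a short
sketch. It follows the formulas in this section directly, using floating point
for clarity. The struct TplBlockSketch and all function names are hypothetical
and do not correspond to the libaom data structures.

```c
typedef struct {
  double intra_cost;        // SATD based intra prediction cost
  double inter_cost;        // inter prediction cost
  double propagation_cost;  // information flowed back from future frames
} TplBlockSketch;

// propagation_fraction = 1 - inter_cost / intra_cost, with inter_cost upper
// bounded by intra_cost.
double propagation_fraction(const TplBlockSketch *blk) {
  if (blk->intra_cost <= 0.0) return 0.0;
  const double inter =
      blk->inter_cost < blk->intra_cost ? blk->inter_cost : blk->intra_cost;
  return 1.0 - inter / blk->intra_cost;
}

// propagation_amount = (intra_cost + propagation_cost) * propagation_fraction
double propagation_amount(const TplBlockSketch *blk) {
  return (blk->intra_cost + blk->propagation_cost) *
         propagation_fraction(blk);
}

// Dispense part of the amount to one 16x16 block in the reference frame,
// weighted by the area (in pixels) that it shares with the reference block.
void propagate_to_reference(TplBlockSketch *ref_blk, double amount,
                            int overlap_area) {
  ref_blk->propagation_cost += (overlap_area / (16.0 * 16.0)) * amount;
}

// dist_prop = 1 + propagation_cost / intra_cost
double dist_prop_factor(double propagation_cost, double intra_cost) {
  return 1.0 + propagation_cost / intra_cost;
}

// Block level multiplier: the frame level multiplier scaled by the ratio of
// the frame level factor to the block level factor.
double block_lambda(double frame_lambda, double frame_dist_prop,
                    double block_dist_prop) {
  return frame_lambda * frame_dist_prop / block_dist_prop;
}
```

A block with a distortion propagation factor above the frame level value gets
a smaller multiplier in this sketch, so rate-distortion decisions for that
block favour lower distortion.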
| |
| \subsection architecture_enc_tpl_keyfun Key Functions and data structures |
| |
| The reader is also referred to the following functions and data structures: |
| |
| - \ref TplParams |
| - \ref av1_tpl_setup_stats() builds the TPL model. |
| - \ref setup_delta_q() assigns different quantization parameters to each |
| superblock based on its TPL weight. |
| |
| \section architecture_enc_partitions Block Partition Search |
| |
| A frame is first split into tiles in \ref encode_tiles(), with each tile |
| compressed by av1_encode_tile(). Then a tile is processed in superblock rows |
| via \ref av1_encode_sb_row() and then \ref encode_sb_row(). |
| |
| The partition search processes superblocks sequentially in \ref |
| encode_sb_row(). |
| |
| Partition search over the recursive quad-tree space is implemented by |
| recursive calls to \ref av1_rd_use_partition() or av1_rd_pick_partition(), |
| which return the best options for sub-trees to their parent partitions. |
| |
| In libaom, the partition search sits on top of the mode search (predictor, |
| transform, etc.), instead of being a separate module. The interface of mode |
| search is \ref pick_sb_modes(), which connects the partition search with |
| \ref architecture_enc_inter_modes and \ref architecture_enc_intra_modes. To |
| make good decisions, reconstruction is also required in order to build |
| references and contexts. This is implemented by \ref encode_sb() at the |
| sub-tree level and \ref encode_b() at coding block level. |
| |
| See also \ref partition_search |
| |
| \section architecture_enc_intra_modes Intra Mode Search |
| |
| AV1 provides 71 different intra prediction modes, i.e. modes that predict |
| only based upon information in the current frame with no dependency on |
| previous or future frames. For key frames, where this independence from any |
| other frame is a defining requirement, and for other cases where intra only |
| frames are required, the encoder need only consider these modes in the rate |
| distortion loop. |
| |
| Even so, in most use cases, searching all possible intra prediction modes for |
| every block and partition size is not practical and some pruning of the search |
| tree is necessary. |
| |
| For the rate distortion optimized case, the main top level function |
| responsible for selecting the intra prediction mode for a given block is |
| \ref av1_rd_pick_intra_mode_sb(). Further fine control of the speed vs quality |
| trade off is provided by means of fields in \ref AV1_COMP.sf (which has type |
| \ref SPEED_FEATURES). |
| |
| Note that some intra modes are only considered for specific use cases or |
| types of video. For example the palette based prediction modes are often |
| valuable for graphics or screen share content but not for natural video. |
| (See \ref av1_search_palette_mode().) |
| |
| See also \ref intra_mode_search for more details. |
| |
| \section architecture_enc_inter_modes Inter Prediction Mode Search |
| |
| For inter frames, where we also allow prediction using one or more previously |
| coded frames (which may chronologically speaking be past or future frames or |
| non-display reference buffers such as ARF frames), the size of the search tree |
| that needs to be traversed to select a prediction mode is considerably |
| larger. |
| |
| In addition to the 71 possible intra modes we also need to consider 56 single |
| frame inter prediction modes (7 reference frames x 4 modes x 2 for OBMC |
| (overlapped block motion compensation)), 12768 compound inter prediction modes |
| (these are modes that combine inter predictors from two reference frames) and |
| 36708 compound inter / intra prediction modes. |
| |
| Various heuristics and predictive strategies are used to prune the search tree |
| with fine control provided through the speed features parameter in the main |
| compressor instance data structure \ref AV1_COMP.sf. |
| |
| It is worth noting that some prediction modes incur a much larger rate cost |
| than others (ignoring for now the cost of coding the error residual). For |
| example, a compound mode that requires the encoder to specify two reference |
| frames and two new motion vectors will almost inevitably have a higher rate |
| cost than a simple inter prediction mode that uses a predicted or 0,0 motion |
| vector. As such, if we have already found a mode for the current block that |
| has a low RD cost, we can skip a large number of the possible modes on the |
| basis that even if the error residual is 0 the inherent rate cost of the |
| mode itself will guarantee that it is not chosen. |
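
The pruning argument above can be sketched as a simple bound check. This is a
simplified sketch: it uses a plain integer cost of the form
lambda * rate + distortion and omits the fixed point scaling a production
encoder would apply. The function names rd_cost_sketch and can_skip_mode are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified rate-distortion cost.
int64_t rd_cost_sketch(int64_t lambda, int64_t rate, int64_t distortion) {
  return lambda * rate + distortion;
}

// mode_rate is the cost of signalling the mode itself (reference frames,
// motion vectors and so on), before any residual is coded. If that cost
// alone, with zero distortion, already matches or exceeds the best RD cost
// found so far, the mode cannot win and need not be evaluated.
bool can_skip_mode(int64_t lambda, int64_t mode_rate, int64_t best_rd) {
  return rd_cost_sketch(lambda, mode_rate, 0) >= best_rd;
}
```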
| |
| See also \ref inter_mode_search for more details. |
| |
| \section architecture_enc_tx_search Transform Search |
| |
| AV1 implements the transform stage using 4 separable 1-d transforms (DCT, |
| ADST, FLIPADST and IDTX, where FLIPADST is the reversed version of ADST |
| and IDTX is the identity transform) which can be combined to give 16 2-d |
| combinations. |
| |
| These combinations can be applied at 19 different scales from 64x64 pixels |
| down to 4x4 pixels. |
| |
| This gives rise to a large number of possible candidate transform options |
| for coding the residual error after prediction. An exhaustive rate-distortion |
| based evaluation of all candidates would not be practical from a speed |
| perspective in a production encoder implementation. Hence libaom adopts a |
| number of strategies to prune the selection of both the transform size and |
| transform type. |
| |
| There are a number of strategies that have been tested and implemented in |
| libaom including: |
| |
| - A statistics based approach that looks at the frequency with which certain |
| combinations are used in a given context and prunes out very unlikely |
| candidates. It is worth noting here that some size candidates can be pruned |
| out immediately based on the size of the prediction partition. For example it |
| does not make sense to use a transform size that is larger than the |
| prediction partition size but also a very large prediction partition size is |
| unlikely to be optimally paired with small transforms. |
| |
| - A machine learning based model. |
| |
| - A method that initially tests candidates using a fast algorithm that skips |
| entropy encoding and uses an estimated cost model to choose a reduced subset |
| for full RD analysis. This subject is covered more fully in a paper authored |
| by Bohan Li, Jingning Han, and Yaowu Xu titled: <b>Fast Transform Type |
| Selection Using Conditional Laplace Distribution Based Rate Estimation</b> |
| |
| <b>TODO Add link to paper when available</b> |
| |
| See also \ref transform_search for more details. |
| |
| \section architecture_post_enc_filt Post Encode Loop Filtering |
| |
| AV1 supports three types of post encode <b>in loop</b> filtering to improve |
| the quality of the reconstructed video. |
| |
| - <b>Deblocking Filter</b> The first of these is a fairly traditional boundary |
| deblocking filter that attempts to smooth discontinuities that may occur at |
| the boundaries between blocks. See also \ref in_loop_filter. |
| |
| - <b>CDEF Filter</b> The constrained directional enhancement filter (CDEF) |
| allows the codec to apply a non-linear deringing filter along certain |
| (potentially oblique) directions. A primary filter is applied along the |
| selected direction, whilst a secondary filter is applied at 45 degrees to |
| the primary direction. (See also \ref in_loop_cdef and (TODO REF) |
| <b>A Technical Overview of the AV1 Standard</b> (TODO add link to |
| Jingning's AV1 overview paper)). |
| |
| - <b>Loop Restoration Filter</b> The loop restoration filter is applied after |
| any prior post filtering stages. It acts on units of either 64 x 64, |
| 128 x 128, or 256 x 256 pixel blocks, referred to as loop restoration units. |
| Each unit can independently select either to bypass filtering, use a Wiener |
| filter, or use a self-guided filter. (See also \ref in_loop_restoration and |
| (TODO REF) <b>A Technical Overview of the AV1 Standard</b> (TODO add link |
| to Jingning's AV1 overview paper)). |
| |
| \section architecture_entropy Entropy Coding |
| |
| \subsection architecture_entropy_aritmetic Arithmetic Coder |
| |
| VP9 used a binary arithmetic coder to encode symbols, where the probability |
| of a 1 or 0 at each decision node was based on a context model that took |
| into account recently coded values (for example previously coded coefficients |
| in the current block). A mechanism existed to update the context model each |
| frame, either explicitly in the bitstream, or implicitly at both the encoder |
| and decoder based on the observed frequency of different outcomes in the |
| previous frame. VP9 also supported separate context models for different types |
| of frame (e.g. inter coded frames and key frames). |
| |
| In contrast, AV1 uses an M-ary symbol arithmetic coder to compress the syntax |
| elements, where integer \f$M\in[2, 14]\f$. This approach is based upon the entropy |
| coding strategy used in the Daala video codec and allows for some bit-level |
| parallelism in its implementation. AV1 also has an extended context model and |
| allows for updates to the probabilities on a per symbol basis as opposed to |
| the per frame strategy in VP9. |
| |
| To improve the performance / throughput of the arithmetic encoder, especially |
| in hardware implementations, the probability model is updated and maintained |
| at 15-bit precision, but the arithmetic encoder only uses the most significant |
| 9 bits when encoding a symbol. A more detailed discussion of the algorithm |
| and design constraints can be found in (TODO REF) <b>A Technical Overview of |
| the AV1 Standard</b> (TODO add link to Jingning's AV1 overview paper). |
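
To make the per symbol update and the two precisions concrete, the sketch
below adapts a cumulative distribution after each coded symbol and then
reduces a 15-bit value to its most significant 9 bits. It is a simplified
illustration and not the libaom routine: the CDF layout, the fixed adaptation
rate, the lack of a minimum probability floor, and the names adapt_cdf_sketch
and coding_precision are all assumptions.

```c
#include <stdint.h>

#define SKETCH_PROB_TOP 32768  // 1.0 at 15-bit precision

// cdf[i] holds P(symbol <= i) scaled to 15 bits, for i in [0, nsymbs - 2].
// The final entry is implicitly SKETCH_PROB_TOP and is not stored.
// After coding `symbol`, shift probability mass towards it: entries at or
// above the symbol move towards the top, entries below it move towards zero.
void adapt_cdf_sketch(uint16_t *cdf, int nsymbs, int symbol, int rate) {
  for (int i = 0; i < nsymbs - 1; ++i) {
    if (i >= symbol) {
      cdf[i] = (uint16_t)(cdf[i] + ((SKETCH_PROB_TOP - cdf[i]) >> rate));
    } else {
      cdf[i] = (uint16_t)(cdf[i] - (cdf[i] >> rate));
    }
  }
}

// Keep only the most significant 9 bits of a 15-bit probability value.
uint16_t coding_precision(uint16_t prob15) { return (uint16_t)(prob15 >> 6); }
```

Because the update runs after every symbol, the model tracks local statistics
within a frame rather than waiting for a frame level update.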
| |
| TODO add references to key functions / files. |
| |
| As with VP9, a mechanism exists in AV1 to encode some elements into the |
| bitstream as uncompressed bits or literal values, without using the arithmetic |
| coder. This is used, for example, for some frame and sequence header values, |
| where it is beneficial to be able to read the values directly. |
| |
| TODO add references to key functions / files. |
| |
| \subsection architecture_entropy_coef Coefficient Coding and Optimization |
| |
| See also \ref coefficient_coding for more details. |
| |
| */ |
| |
| /*!\defgroup encoder_algo Encoder Algorithm |
| * |
| * The encoder algorithm describes how a sequence is encoded, including high |
| * level decisions as well as the algorithms used at every encoding stage. |
| */ |
| |
| /*!\defgroup high_level_algo High-level Algorithm |
| * \ingroup encoder_algo |
| * This module describes sequence level and frame level algorithms in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| |
| /*!\defgroup speed_features Speed vs Quality Trade Off |
| * \ingroup high_level_algo |
| * This module describes the encode speed vs quality trade off. |
| * @{ |
| */ |
| /*! @} - end defgroup speed_features */ |
| |
| /*!\defgroup src_frame_proc Source Frame Processing |
| * \ingroup high_level_algo |
| * This module describes algorithms in AV1 associated with the |
| * pre-processing of source frames. See also \ref architecture_enc_src_proc |
| * |
| * @{ |
| */ |
| /*! @} - end defgroup src_frame_proc */ |
| |
| /*!\defgroup rate_control Rate Control |
| * \ingroup high_level_algo |
| * This module describes the rate control algorithms in AV1. |
| * See also \ref architecture_enc_rate_ctrl |
| * @{ |
| */ |
| /*! @} - end defgroup rate_control */ |
| |
| /*!\defgroup tpl_modelling Temporal Dependency Modelling |
| * \ingroup high_level_algo |
| * This module includes algorithms to implement temporal dependency modelling. |
| * See also \ref architecture_enc_tpl |
| * @{ |
| */ |
| /*! @} - end defgroup tpl_modelling */ |
| |
| /*!\defgroup two_pass_algo Two Pass Mode |
| \ingroup high_level_algo |
| |
| In two pass mode, the input file is passed into the encoder for a quick |
| first pass, where statistics are gathered. These statistics and the input |
| file are then passed back into the encoder for a second pass. The statistics |
| help the encoder reach the desired bitrate without as much overshooting or |
| undershooting. |
| |
| During the first pass, the codec will return "stats" packets that contain |
| information useful for the second pass. The caller should concatenate these |
| packets as they are received. In the second pass, the concatenated packets |
| are passed in, along with the frames to encode. During the second pass, |
| "frame" packets are returned that represent the compressed video. |
| |
| A complete example can be found in `examples/twopass_encoder.c`. Pseudocode |
| is provided below to illustrate the core parts. |
| |
| During the first pass, the uncompressed frames are passed in and stats |
| information is appended to a byte array. |
| |
| ~~~~~~~~~~~~~~~{.c} |
| // For simplicity, assume that there is enough memory in the stats buffer. |
| // Actual code will want to use a resizable array. stats_len represents |
| // the length of data already present in the buffer. got_data is set to true |
| // whenever the encoder returns a packet. |
| void get_stats_data(aom_codec_ctx_t *encoder, char *stats, |
|                     size_t *stats_len, bool *got_data) { |
|   const aom_codec_cx_pkt_t *pkt; |
|   aom_codec_iter_t iter = NULL; |
|   while ((pkt = aom_codec_get_cx_data(encoder, &iter))) { |
|     *got_data = true; |
|     if (pkt->kind != AOM_CODEC_STATS_PKT) continue; |
|     memcpy(stats + *stats_len, pkt->data.twopass_stats.buf, |
|            pkt->data.twopass_stats.sz); |
|     *stats_len += pkt->data.twopass_stats.sz; |
|   } |
| } |
| |
| void first_pass(char *stats, size_t *stats_len) { |
|   struct aom_codec_enc_cfg first_pass_cfg; |
|   ... // Initialize the config as needed. |
|   first_pass_cfg.g_pass = AOM_RC_FIRST_PASS; |
|   aom_codec_ctx_t first_pass_encoder; |
|   ... // Initialize the encoder. |
| |
|   bool got_data = false; |
|   while (frame_available) { |
|     // Read in the uncompressed frame, update frame_available |
|     aom_image_t *frame_to_encode = ...; |
|     aom_codec_encode(&first_pass_encoder, frame_to_encode, pts, duration, |
|                      flags); |
|     get_stats_data(&first_pass_encoder, stats, stats_len, &got_data); |
|   } |
|   // After all frames have been processed, call aom_codec_encode with |
|   // a NULL ptr repeatedly, until no more data is returned. The NULL |
|   // ptr tells the encoder that no more frames are available. |
|   do { |
|     got_data = false; |
|     aom_codec_encode(&first_pass_encoder, NULL, pts, duration, flags); |
|     get_stats_data(&first_pass_encoder, stats, stats_len, &got_data); |
|   } while (got_data); |
| |
|   aom_codec_destroy(&first_pass_encoder); |
| } |
| ~~~~~~~~~~~~~~~ |
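| |
| The pseudocode above assumes that the stats buffer is already large enough. |
| As a minimal sketch of the resizable array it mentions, the stats packets |
| could be appended to a growable byte buffer such as the one below. The names |
| stats_buf_t, stats_buf_append and stats_buf_free are hypothetical helpers |
| introduced for illustration only; they are not part of the libaom API, and |
| the initial capacity of 4096 bytes is an arbitrary assumption. |
| |
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper type for illustration; not part of the libaom API.
typedef struct {
  char *data;       // Heap storage, NULL while nothing has been allocated.
  size_t len;       // Number of bytes currently in use.
  size_t capacity;  // Number of bytes allocated.
} stats_buf_t;

// Appends sz bytes from src to the buffer, doubling the capacity as needed.
// Returns 0 on success. Returns -1 if the size would overflow or the
// allocation fails, in which case the buffer is left unchanged.
int stats_buf_append(stats_buf_t *buf, const void *src, size_t sz) {
  assert(buf != NULL);
  assert(src != NULL || sz == 0);
  if (sz > SIZE_MAX - buf->len) return -1;
  const size_t needed = buf->len + sz;
  if (needed > buf->capacity) {
    // Arbitrary starting capacity; any positive value works.
    size_t new_cap = buf->capacity ? buf->capacity : 4096;
    while (new_cap < needed) {
      if (new_cap > SIZE_MAX / 2) {
        new_cap = needed;  // Doubling would overflow, so use the exact size.
        break;
      }
      new_cap *= 2;
    }
    char *grown = realloc(buf->data, new_cap);
    if (grown == NULL) return -1;
    buf->data = grown;
    buf->capacity = new_cap;
  }
  if (sz > 0) memcpy(buf->data + buf->len, src, sz);
  buf->len = needed;
  return 0;
}

// Releases the storage and resets the buffer to its empty state.
void stats_buf_free(stats_buf_t *buf) {
  assert(buf != NULL);
  free(buf->data);
  buf->data = NULL;
  buf->len = 0;
  buf->capacity = 0;
}
```
| |
| With such a helper, the memcpy in get_stats_data would become a call to |
| stats_buf_append with pkt->data.twopass_stats.buf and |
| pkt->data.twopass_stats.sz, and the resulting data and len fields would be |
| handed to the second pass. |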
| |
| During the second pass, the uncompressed frames and the stats are |
| passed into the encoder. |
| |
| ~~~~~~~~~~~~~~~{.c} |
| // Write out each encoded frame to the file. got_data is set to true |
| // whenever the encoder returns a packet. |
| void get_cx_data(aom_codec_ctx_t *encoder, FILE *file, |
|                  bool *got_data) { |
|   const aom_codec_cx_pkt_t *pkt; |
|   aom_codec_iter_t iter = NULL; |
|   while ((pkt = aom_codec_get_cx_data(encoder, &iter))) { |
|     *got_data = true; |
|     if (pkt->kind != AOM_CODEC_CX_FRAME_PKT) continue; |
|     fwrite(pkt->data.frame.buf, 1, pkt->data.frame.sz, file); |
|   } |
| } |
| |
| void second_pass(char *stats, size_t stats_len) { |
|   struct aom_codec_enc_cfg second_pass_cfg; |
|   ... // Initialize the config as needed. |
|   second_pass_cfg.g_pass = AOM_RC_LAST_PASS; |
|   second_pass_cfg.rc_twopass_stats_in.buf = stats; |
|   second_pass_cfg.rc_twopass_stats_in.sz = stats_len; |
|   aom_codec_ctx_t second_pass_encoder; |
|   ... // Initialize the encoder from the config. |
| |
|   FILE *output = fopen("output.obu", "wb"); |
|   bool got_data = false; |
|   while (frame_available) { |
|     // Read in the uncompressed frame, update frame_available |
|     aom_image_t *frame_to_encode = ...; |
|     aom_codec_encode(&second_pass_encoder, frame_to_encode, pts, duration, |
|                      flags); |
|     get_cx_data(&second_pass_encoder, output, &got_data); |
|   } |
|   // Pass in NULL to flush the encoder. |
|   do { |
|     got_data = false; |
|     aom_codec_encode(&second_pass_encoder, NULL, pts, duration, flags); |
|     get_cx_data(&second_pass_encoder, output, &got_data); |
|   } while (got_data); |
| |
|   fclose(output); |
|   aom_codec_destroy(&second_pass_encoder); |
| } |
| ~~~~~~~~~~~~~~~ |
| */ |
| |
| /*!\defgroup look_ahead_buffer The Look-Ahead Buffer |
| \ingroup high_level_algo |
| |
| A program should call \ref aom_codec_encode() for each frame that needs |
| processing. These frames are internally copied and stored in a fixed-size |
| circular buffer, known as the look-ahead buffer. Other parts of the code |
| will use future frame information to inform current frame decisions; |
| examples include the first-pass algorithm, TPL model, and temporal filter. |
| Note that this buffer also keeps a reference to the last source frame. |
| |
| The look-ahead buffer is defined in \ref av1/encoder/lookahead.h. It acts as an |
| opaque structure, with an interface to create and free memory associated with |
| it. It supports pushing and popping frames onto the structure in a FIFO |
| fashion. It also allows look-ahead when using the \ref av1_lookahead_peek() |
| function with a non-negative number, and look-behind when -1 is passed in (for |
| the last source frame; e.g., firstpass will use this for motion estimation). |
| The \ref av1_lookahead_depth() function returns the current number of frames |
| stored in it. Note that \ref av1_lookahead_pop() is a bit of a misnomer - it |
| only pops if either the "flush" variable is set, or the buffer is at maximum |
| capacity. |
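| |
| The push, pop and peek behaviour described above can be illustrated with a |
| small toy model. This is only a sketch under simplifying assumptions, not |
| the code in av1/encoder/lookahead.h: frames are represented by integer ids, |
| and all toy_lookahead_* names are hypothetical. It shows the conditional pop |
| (only when flushing or when the buffer is full) and the use of index -1 to |
| look behind at the last source frame. |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Upper bound for the toy buffer; mirrors the maximum size stated in this
// section.
#define TOY_MAX_DEPTH 35

// Toy model only: each "frame" is just an integer id.
typedef struct {
  int frames[TOY_MAX_DEPTH];
  int capacity;     // Fixed size chosen at creation time.
  int read_idx;     // Position of the oldest queued frame.
  int depth;        // Number of frames currently queued.
  int last_source;  // Most recently popped frame.
  bool has_last;    // False until the first successful pop.
} toy_lookahead_t;

void toy_lookahead_init(toy_lookahead_t *la, int capacity) {
  assert(la != NULL);
  assert(capacity > 0 && capacity <= TOY_MAX_DEPTH);
  la->capacity = capacity;
  la->read_idx = 0;
  la->depth = 0;
  la->last_source = 0;
  la->has_last = false;
}

// Queues a frame in FIFO order. Returns false if the buffer is full.
bool toy_lookahead_push(toy_lookahead_t *la, int frame_id) {
  if (la->depth == la->capacity) return false;
  la->frames[(la->read_idx + la->depth) % la->capacity] = frame_id;
  la->depth++;
  return true;
}

// Removes the oldest frame only when flushing or when the buffer is full.
// Returns NULL otherwise, or when the buffer is empty.
const int *toy_lookahead_pop(toy_lookahead_t *la, bool flush) {
  if (la->depth == 0) return NULL;
  if (!flush && la->depth < la->capacity) return NULL;
  la->last_source = la->frames[la->read_idx];
  la->has_last = true;
  la->read_idx = (la->read_idx + 1) % la->capacity;
  la->depth--;
  return &la->last_source;
}

// index >= 0 looks ahead, counting from the oldest queued frame.
// index == -1 looks behind at the last popped (last source) frame.
const int *toy_lookahead_peek(const toy_lookahead_t *la, int index) {
  if (index == -1) return la->has_last ? &la->last_source : NULL;
  if (index < 0 || index >= la->depth) return NULL;
  return &la->frames[(la->read_idx + index) % la->capacity];
}
```
| |
| In this toy model, pushing three frames into a buffer of capacity 3 and then |
| calling toy_lookahead_pop without flush returns the first frame. Further |
| calls return NULL until either another frame is pushed or flush is set. |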
| |
| The buffer is stored in the \ref AV1_COMP::lookahead field. |
| It is initialized in the first call to \ref aom_codec_encode(), in the |
| \ref av1_receive_raw_frame() sub-routine. The buffer size is defined by |
| the \ref aom_codec_enc_cfg_t::g_lag_in_frames field of the encoder |
| configuration. |
| This can be modified manually but should only be set once. On the command |
| line, the flag "--lag-in-frames" controls it. The default size is 19. |
| Note that a maximum value of 35 is enforced. |
| |
| A frame will stay in the buffer as long as possible. As mentioned above, |
| \ref av1_lookahead_pop() only removes a frame when either flush is set, |
| or the buffer is full. Note that each call to \ref aom_codec_encode() inserts |
| another frame into the buffer, and pop is called by the sub-function |
| \ref av1_encode_strategy(). The buffer is told to flush when |
| \ref aom_codec_encode() is passed a NULL image pointer. Note that the caller |
| must repeatedly call \ref aom_codec_encode() with a NULL image pointer, until |
| no more packets are available, in order to fully flush the buffer. |
| |
| */ |
| |
| /*! @} - end defgroup high_level_algo */ |
| |
| /*!\defgroup partition_search Partition Search |
| * \ingroup encoder_algo |
| * For an overview of the partition search see \ref architecture_enc_partitions |
| * @{ |
| */ |
| |
| /*! @} - end defgroup partition_search */ |
| |
| /*!\defgroup intra_mode_search Intra Mode Search |
| * \ingroup encoder_algo |
| * This module describes the intra mode search algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup intra_mode_search */ |
| |
| /*!\defgroup inter_mode_search Inter Mode Search |
| * \ingroup encoder_algo |
| * This module describes the inter mode search algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup inter_mode_search */ |
| |
| /*!\defgroup palette_mode_search Palette Mode Search |
| * \ingroup intra_mode_search |
| * This module describes the palette mode search algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup palette_mode_search */ |
| |
| /*!\defgroup transform_search Transform Search |
| * \ingroup encoder_algo |
| * This module describes the transform search algorithm in AV1. |
| * @{ |
| */ |
| /*! @} - end defgroup transform_search */ |
| |
| /*!\defgroup coefficient_coding Transform Coefficient Coding and Optimization |
| * \ingroup encoder_algo |
| * This module describes the algorithms of transform coefficient coding and |
| * optimization in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup coefficient_coding */ |
| |
| /*!\defgroup in_loop_filter In-loop Filter |
| * \ingroup encoder_algo |
| * This module describes the in-loop filter algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup in_loop_filter */ |
| |
| /*!\defgroup in_loop_cdef CDEF |
| * \ingroup encoder_algo |
| * This module describes the CDEF parameter search algorithm |
| * in AV1. More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup in_loop_cdef */ |
| |
| /*!\defgroup in_loop_restoration Loop Restoration |
| * \ingroup encoder_algo |
| * This module describes the loop restoration search |
| * and estimation algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup in_loop_restoration */ |
| |
| /*!\defgroup cyclic_refresh Cyclic Refresh |
| * \ingroup encoder_algo |
| * This module describes the cyclic refresh (aq-mode=3) in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup cyclic_refresh */ |
| |
| /*!\defgroup SVC Scalable Video Coding |
| * \ingroup encoder_algo |
| * This module describes the scalable video coding algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup SVC */ |
| |
| /*!\defgroup variance_partition Variance Partition |
| * \ingroup encoder_algo |
| * This module describes the variance partition algorithm in AV1. |
| * More details will be added. |
| * @{ |
| */ |
| /*! @} - end defgroup variance_partition */ |