Lossy audio compression: principles, methods, misconceptions


Finding a source that explains the logic behind perceptual encoding (e.g. MP3, AAC, Opus) in an easy-to-read style is an impossible mission. One can read math-heavy, jargon-filled technical articles all day long without understanding anything about this technology. The following article covers the ground rules and provides an insight into how perceptual encoders work - one of the most impressive technologies of the 20th century, used by millions every day.


January 28, 2025

Audio compression comes in two flavors: lossy and lossless. With lossless compression (like FLAC) the decoded audio data is identical to the original. There is no data loss; the encoder only removes redundancy from the stream. The downside of lossless compression is its comparatively low compression ratio.

Lossy compression (MP3, AAC, Opus) provides much higher compression, but the decoded data is not identical to the original. However, this doesn't mean that pristine quality is out of reach for lossy encoders, even at fairly low bitrates (~130 kbps). The quality of lossy audio compression depends greatly on the bitrate, and some encoders do a better job than others. (Note: bitrate is measured in kilobits per second, abbreviated as kbps.)

The most common perceptual audio encoding formats and their typical application:

Encoding format    Typical application
AAC                HDTV standard (MPEG-2, MPEG-4); Apple (iTunes), video streaming services
Dolby AC-3         DVD audio, video
DTS                DVD audio, video
MP3                First MPEG standard, widespread support
Opus               YouTube
Vorbis             Spotify

Let's discover some of the tricks and methods used in perceptual encoding formats.


1. Auditory masking - psychoacoustics of lossy audio compression

Perceptual encoding utilizes the phenomenon called auditory masking. Masking means that in the neighborhood of a (loud) tone the hearing threshold is raised. Signals below the threshold are inaudible.


The previous graph shows the masking threshold (masking function) of a pure tone. Encoder hearing models use slightly different, somewhat "stricter" versions, mainly because the playback level is not known at encoding time (masking is stronger at high SPL).
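To make the idea of a raised threshold concrete, here is a toy sketch of a masking function for a single pure-tone masker. Only the Bark-scale conversion is a standard published approximation (Zwicker and Terhardt); the slopes and the offset are illustrative assumptions chosen for this example, not values from any real encoder, and the threshold in quiet is ignored.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker-Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

# Illustrative constants only (assumptions, not taken from a real encoder).
LOWER_SLOPE_DB_PER_BARK = 27.0  # threshold falls steeply below the masker
UPPER_SLOPE_DB_PER_BARK = 10.0  # and more gently above it
MASKER_OFFSET_DB = 15.0         # peak of the threshold sits this far below the masker

def masking_threshold_db(freq_hz, masker_hz, masker_level_db):
    """Raised hearing threshold (dB) at freq_hz caused by one pure-tone masker."""
    distance = hz_to_bark(freq_hz) - hz_to_bark(masker_hz)
    slope = UPPER_SLOPE_DB_PER_BARK if distance >= 0 else LOWER_SLOPE_DB_PER_BARK
    return masker_level_db - MASKER_OFFSET_DB - slope * abs(distance)

def is_masked(signal_level_db, freq_hz, masker_hz, masker_level_db):
    """True if a signal at freq_hz lies below the raised threshold."""
    return signal_level_db < masking_threshold_db(freq_hz, masker_hz, masker_level_db)

# Threshold around an 80 dB masker at 1 kHz.
for f in (500, 800, 1000, 1200, 2000):
    print(f, "Hz:", round(masking_threshold_db(f, 1000, 80), 1), "dB")
```

In this toy model the threshold is asymmetric: it extends further towards higher frequencies than towards lower ones, which is the general shape hearing models work with.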


2. Adaptive quantization in the frequency domain

A WAV file is a stream of sampled values. In transform codecs the input PCM audio is divided into blocks and the PCM data is transformed into the frequency domain. For example, an MP3 file contains amplitude-frequency data of 26 msec long blocks. Like quantization in the time domain, quantization in the frequency domain results in wideband noise. However, if we split the frequency range into narrow bands and use a different quantizer in each band, the quantization noise can be shaped to follow the masking threshold.

Not all encoders follow this logic (e.g. Vorbis), but the concept represented here is the key to all encoding formats.
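The band-wise idea can be sketched in a few lines. The spectrum values, the band split and the step sizes below are made up for illustration; in a real encoder the step sizes would come from the hearing model. The point is only that the quantization error in each band is bounded by half of that band's step, so the noise level can be set band by band.

```python
def quantize(coeffs, step):
    """Uniform quantizer: round each coefficient to the nearest multiple of step."""
    return [round(c / step) * step for c in coeffs]

# Hypothetical spectrum split into three bands (values are made up).
bands = {
    "low": [1611.0, -1320.5, 904.2, -377.9],   # loud components, strong masking
    "mid": [96.3, -41.7, 18.2, -7.6],
    "high": [5.4, -3.1, 2.2, -0.9],            # quiet components, little masking
}

# Assumed step sizes, as if a hearing model allowed more noise near the loud tone.
steps = {"low": 64, "mid": 8, "high": 1}

max_error = {}
for name, coeffs in bands.items():
    decoded = quantize(coeffs, steps[name])
    max_error[name] = max(abs(d - c) for d, c in zip(decoded, coeffs))
    print(f"{name:>4}: step {steps[name]:>2}, largest error {max_error[name]:.2f}")
```

With a single coarse step for the whole spectrum, the quiet high band in this example would be rounded to zero entirely; with per-band steps it survives while the loud band still gets the coarse quantizer.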

The following image shows the amplitude spectrum of a triangle waveform, before and after compression (MP3, 80 kbps). We can see that close to the signal components the level of noise is increased. With a uniform quantizer the quantization noise spectrum is flat. (By the way, the noise is inaudible in both cases.)

It's worth distinguishing ADC-type quantization (quantization during analog-to-digital conversion) from quantization in audio codecs. Digital representation of analog values is similar to truncating or rounding real numbers, where the real numbers represent the analog values. In audio codecs, quantization is a division of integer values by a power of two (2, 4, 8, 16...), followed by rounding. For example, instead of transmitting the data series 496, 13, 1611, a codec may transmit 62, 2, 201 with the quantizer set to eight. Larger quantizer values result in smaller data values, which need fewer bits and give a lower bitrate, but also increase the quantization noise. (Note: instead of powers of two, powers of 3√2 are used, which provide a finer selection of quantizers.)


Original:      496 (9 bit)    13 (4 bit)    1611 (11 bit)
Transmitted:   62 (6 bit)     2 (2 bit)     201 (8 bit)
Decoded:       496            16            1608

(For the table: transmitted = round(original / quantizer); decoded = transmitted × quantizer; quantizer = 8.) Note: the relative error is smaller for larger values; in addition, large values tend to mask small values and low-level quantization errors.
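The numbers in the table can be reproduced with a short sketch. It follows the two formulas given above; the helper name `bits` and the way bits are counted (magnitude only, no sign bit) are assumptions made for this example.

```python
QUANTIZER = 8
original = [496, 13, 1611]

# transmitted = round(original / quantizer); decoded = transmitted * quantizer
# (Python rounds exact halves to the nearest even integer; none of these values is a half.)
transmitted = [round(x / QUANTIZER) for x in original]
decoded = [t * QUANTIZER for t in transmitted]

def bits(values):
    """Bits needed to store each magnitude (sign bit not counted)."""
    return [v.bit_length() for v in values]

# Relative error of each decoded value, in percent.
relative_error = [abs(d - o) / o * 100 for d, o in zip(decoded, original)]

print("original:   ", original, "bits:", bits(original))
print("transmitted:", transmitted, "bits:", bits(transmitted))
print("decoded:    ", decoded)
print("relative error (%):", [round(e, 2) for e in relative_error])
```

The absolute error is the same size for 13 and for 1611 (three units each), but relative to the value it is far smaller for the large one.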

Lossy audio compression doesn't discard "unnecessary" data; it only increases the quantization noise within "quantization bands" (called scalefactor bands). At low bitrates encoders may apply a low-pass filter (cutting off frequency data above 14 kHz or 16 kHz) in order to increase fidelity in the remaining range, but this is not the main principle of operation. At high bitrates the high-frequency loss is subtle and inaudible. The claim that "lossy audio compression removes redundant audio data" is utter nonsense.


3. Definition of transparency and fidelity for lossy compression

Depending on bitrate and encoder settings, the quality of lossy audio compression ranges from perceptually lossless to terrible. The lossy encoding process generates psychoacoustically shaped quantization noise whose spectrum follows the spectrum of the audio signal. If the bitrate exceeds a certain limit, the quantization noise generated during compression falls below the masking threshold and is not audible (we can't hear noise below the masking threshold). The required momentary bitrate changes from block to block and depends on the temporal and spectral properties of the audio signal in the current block (one block is 20-40 milliseconds long).

There is another type of coding artifact: when the encoder "runs out of bits" and can't allocate more code words for spectral lines in the audible range. This usually only occurs at very low bitrates, with certain signals or instrument sounds (e.g. hi-hat). The audible effect is not noise, but a strange spectral distortion in the high frequency range. The best example is the annoying sound of hi-hats in 128 kbps MP3 files.


4. Pre-echo & transients: a side effect of transform and subband coding

When a block of audio samples is converted to the frequency domain, quantized and converted back to the time domain, the quantization noise spreads over the entire transform block. This means that quantization noise also appears before a transient. Since auditory masking is only effective up to ~2 msec before a transient (pre-masking), the ear is very sensitive to pre-echo type artifacts. As a consequence, high-fidelity encoding of transients requires a higher momentary bitrate than other signals.

The loudness of pre-echo depends on the length of the block, the position of the transient within the block, and the bitrate. To lower the bitrate demand of transients, encoders switch to short blocks when encoding them.
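The spreading of noise over the block, and the benefit of short blocks, can be demonstrated with a small experiment. This is a sketch under simplifying assumptions: it uses a plain orthonormal DCT on non-overlapping blocks as a stand-in for the overlapping MDCT of real codecs, and the test signal, block lengths and quantizer step are made up.

```python
import math

def dct(block):
    """Orthonormal DCT-II (stand-in for the MDCT used by real codecs)."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, x in enumerate(block)))
    return out

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                       * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for k, c in enumerate(coeffs)))
    return out

def codec(signal, block_len, step):
    """Transform each block, quantize the coefficients, transform back."""
    decoded = []
    for start in range(0, len(signal), block_len):
        coeffs = dct(signal[start:start + block_len])
        quantized = [round(c / step) * step for c in coeffs]
        decoded.extend(idct(quantized))
    return decoded

# 32 samples of silence with a single click at sample 28 (made-up test signal).
CLICK_AT = 28
signal = [0.0] * 32
signal[CLICK_AT] = 1.0
STEP = 0.05  # assumed quantizer step

long_block = codec(signal, block_len=32, step=STEP)    # one long block
short_blocks = codec(signal, block_len=8, step=STEP)   # four short blocks

def first_noisy_sample(decoded):
    """Index of the first decoded sample that is not silent."""
    return next(i for i, x in enumerate(decoded) if abs(x) > 1e-6)

print("long block:   noise starts at sample", first_noisy_sample(long_block))
print("short blocks: noise starts at sample", first_noisy_sample(short_blocks))
```

With one long block the decoded signal contains noise well before the click. With short blocks the three silent blocks decode to exact silence, so any noise is confined to the short block that contains the click.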

It's worth mentioning that some continuous signals are perceived as a series of pulses. For example, a 400 Hz sawtooth signal is heard as a smooth continuous signal, whereas a 100 Hz sawtooth sounds like a low pass filtered pulse series.


5. Signal complexity and bitrate

The required momentary bitrate depends on the harmonic and temporal "texture" of the audio signal. Harmonically rich sounds require more bits than sounds with low harmonic content. Percussive sounds (hi-hat, castanet, sidestick) require higher bitrate than any other sounds.

Transform codecs (MP3, AAC, Opus...) are able to compress slowly changing signals at very low bitrates. For example, MP3 is able to compress slowly changing complex mono signals with perfect fidelity at ~80 kbps, whereas encoding mono transients needs ~130 kbps (per channel). In the case of pulses, the instantaneous mono bitrate requirement can reach 230 kbps.

Among steady-state artificial signals, a third-octave multitone has the highest bitrate requirement and a pure tone has the lowest. The highest bitrate overall is required for pulses and pulse series, as mentioned previously.


(Some notes on this bar chart with nice colors: "pulse best case" means that the pulse is located in the middle of a short block. "Pulse worst case" means that the pulse is located at the meeting point of two blocks and granules.)

Mono recordings are rare. A mono test is only useful for determining the bitrate required to encode an instrument without audible artifacts. With stereo signals, the encoder can perform some optimization based on the correlation (similarity) between the two channels. If the channels are similar, they are encoded as mid and side channels (the side channel containing the difference) and the bitrate will be lower than if they were encoded as two separate channels. Channels that differ greatly, however, are encoded as separate left and right channels.
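A minimal sketch of the mid/side idea, using made-up sample values. The scaling by one half is one common convention chosen for this example (codecs may scale differently), and the function names are hypothetical.

```python
def to_mid_side(left, right):
    """Mid = average of the channels, side = half of their difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid, side):
    """Exact inverse of to_mid_side()."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Made-up samples: a nearly centred source with a small stereo difference.
left = [1000, -820, 640, -455, 300]
right = [990, -830, 650, -450, 290]

mid, side = to_mid_side(left, right)
peak_lr = max(abs(x) for x in left + right)
peak_side = max(abs(x) for x in side)

print("mid: ", mid)
print("side:", side)
print("peak of left/right:", peak_lr, " peak of side:", peak_side)
```

When the channels are similar, the side channel carries only small values and is cheap to encode. When the channels are very different, the side channel is as large as the originals and the conversion brings no saving.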


6. Transform gain and compression ratio

In lossy audio formats the total bitrate reduction is the "sum" of lossless and lossy compression (quantization). Some of the bitrate reduction comes from lossless compression! Moreover, encoding efficiency is largely related to the efficiency of lossless coding stages.

It may sound strange, but deep inside MP3, AAC or Opus there is a lossless compressor combined with quantization and a hearing model. The transformation into the frequency domain alone - which is a lossless process - already results in a great reduction of bitrate. The efficiency of lossless compression depends on the spectral complexity of the signal: signals with a simple harmonic structure can be compressed more efficiently. What's really interesting is that when the bitrate requirement for lossless compression is high, the bitrate requirement for lossy compression is also high.
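Why a transform helps can be seen in an idealised case. The sketch below assumes a steady tone that coincides exactly with one basis function of an orthonormal DCT (a stand-in for the MDCT of real codecs); the block length, tone index and amplitude are arbitrary choices for this example. Real music spreads over many more coefficients, so the gain is smaller in practice.

```python
import math

N = 32            # block length (assumed)
K0 = 5            # index of the DCT basis function the tone matches
AMPLITUDE = 1000

# A steady tone that coincides with one DCT basis function (idealised case).
samples = [AMPLITUDE * math.cos(math.pi * (2 * n + 1) * K0 / (2 * N)) for n in range(N)]

def dct(block):
    """Orthonormal DCT-II."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, x in enumerate(block)))
    return out

coeffs = dct(samples)

def significant(values):
    """Count values that would not round to zero."""
    return sum(1 for v in values if abs(v) >= 0.5)

print("time samples that are not zero:         ", significant(samples))
print("spectral coefficients that are not zero:", significant(coeffs))
```

In the time domain every sample has to be stored; in the frequency domain the same block is described by a single large coefficient and a run of zeros, which a lossless coder can pack very compactly.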

How large is the lossless share in lossy formats? Consider a 16 bit / 44.1 kHz WAV source encoded as MP3. When complex signals with many scattered components (e.g. a third-octave multitone) are converted, the lossless compression results in a ~50 % bitrate reduction. When periodic signals with a simple harmonic structure (e.g. a triangle wave) are converted, the lossless compression results in a ~75 % bitrate reduction. The bottom line is that at a bitrate of 320 kbps the lossy bit reduction is only 10 % - 50 %, much less than the overall bitrate reduction (77 %).
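The arithmetic behind these percentages can be checked as follows. The sketch assumes a stereo source and reads the "lossy bit reduction" as the reduction still needed after the lossless stage, relative to the bitrate that stage leaves; both are interpretations made for this example.

```python
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2      # assumption: stereo source
MP3_KBPS = 320

# Bitrate of the uncompressed PCM stream in kbps.
pcm_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000

# Overall reduction from PCM to the 320 kbps MP3.
overall_reduction = 1 - MP3_KBPS / pcm_kbps

def lossy_share(lossless_reduction):
    """Reduction left for quantization after the lossless stage has done its part."""
    after_lossless_kbps = pcm_kbps * (1 - lossless_reduction)
    return 1 - MP3_KBPS / after_lossless_kbps

print(f"PCM bitrate:        {pcm_kbps:.1f} kbps")
print(f"overall reduction:  {overall_reduction:.1%}")
print(f"lossy share (complex signal, 50 % lossless): {lossy_share(0.50):.1%}")
print(f"lossy share (simple signal, 75 % lossless):  {lossy_share(0.75):.1%}")
```

Under these assumptions the two cases land close to the ends of the 10 % - 50 % range quoted above.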


7. Some lossy audio formats provide better compression than others

Opus, Vorbis and AAC provide better fidelity at low bitrates than MP3. With MP3, isolated percussive sounds may require a bitrate as high as 256 kbps, whereas the bitrate demand for a mix of bass guitar, guitar, vocals and drums (with hi-hat and snare) is about 192 kbps. With Opus, 128 kbps is enough to encode the wildest transients without audible distortion (Opus @ 130 kbps VBR is a standard for stereo tracks on YouTube).


8. Encoder settings are important at low bitrates

Quality is primarily determined by bitrate and encoding format. However, encoder settings have a huge impact on quality below a certain bitrate. An online AAC-LC encoder optimized for speed may produce low-quality files at 128 kbps, whereas an encoder optimized for quality gives near-CD quality at this bitrate.


9. Audio samples

One of the difficulties in designing a listening test is how to select the best, most "representative" audio samples and how to relate the listening experience to encoding artifacts. Hi-hats, snare and kick drums, brass instruments and steel-string acoustic guitar are the key instruments. A large difference between the two channels also makes the encoder work harder.

Headphones and a quiet room are recommended for the following demo. It's worth reducing the volume to zero and increasing it to the desired level.

The first sample is a famous drum intro from a Led Zeppelin song. Hi-hat "smearing" is evident in the 128 kbps MP3 version (caused by missing spectral values, not pre-echo, though the 128 kbps version may also contain some audible pre-echo).

Usually guitar sounds can be encoded in high quality at 128 kbps with MP3. The following guitar riff (from Aerosmith) is an exception, because the two channels are very different and the encoder can't perform mid-side optimization.

These examples are "extremes" or worst-cases. The opposite scenario is when all instruments are "panned" to the center with little stereo reverb. In this case, even 128 kbps can be sufficient with MP3.

Songs: Led Zeppelin - Rock and Roll, Aerosmith - Livin' on the Edge.

Csaba Horváth