High-resolution audio vs. 16 bit / 44.1kHz
Investigation of the optimal sampling frequency and bit depth of distribution audio formats.
Last edited: April 8, 2023
Uploaded: Nov. 2, 2021
There is a huge amount of misinformation on the net about the optimal sampling frequency and resolution (bit depth) of audio files for distribution. The Compact Disc (CD) was released forty years ago, but digital audio is still a chaotic subject, mainly dominated by misconceptions and marketing lies.
The behavior of digital signals is counterintuitive, which explains the multitude of misconceptions and frauds. False representations of pulse-code modulation, old digital stereotypes, and fundamental misconceptions repeated for advertising purposes are very common. The marketing term "CD quality" is also misleading, because it suggests that quality is proportional to file size or bitrate. Since 16bit/44.1kHz provides the same playback fidelity as 24bit/96kHz, studio quality recordings can be distributed in 16/44.1 formats (16/44.1 is studio quality).
Sampling frequency, bit depth ("resolution")
The sampling rate shows how many samples are taken per second during digital conversion. The sampling frequency determines the frequency range that can be recorded. The highest frequency is half the sampling frequency.
If the sampling frequency is more than twice the frequency of the highest component of the original signal, then the original signal can be perfectly reconstructed from the sampled values. This means that a sampling frequency of just over 40 kHz is required to 'sample' the 0-20 kHz range. Prior to sampling and during sample rate conversions, we have to get rid of components above half the sampling frequency. Low-pass filters (anti-aliasing and resampling filters) are used for this task. The DA conversion also includes smoothing low-pass filters. Since these filters need some bandwidth for their transition region, the actual sampling frequency is a bit higher than the theoretical value.
Bottom line: at a sampling frequency of 44.1 kHz, the original signal can be reconstructed from the sample points up to 20 kHz with correct amplitude and phase. This applies to all kinds of signals.
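The reconstruction claim can be checked numerically. The following is a minimal sketch, not a model of any real converter: it samples a 19 kHz tone at 44.1 kHz and rebuilds values between the sample points with a truncated Whittaker-Shannon (sinc) sum. The tone frequency, the phase offset, the window length `HALF_WIDTH` and all function names are assumptions chosen for illustration.

```python
import math

FS = 44_100.0        # sampling frequency in Hz
F_TONE = 19_000.0    # test tone below FS / 2 (arbitrary choice)
HALF_WIDTH = 1000    # samples used on each side of the evaluation point


def tone(t):
    """The 'analog' signal: a sine with an arbitrary phase offset."""
    return math.sin(2.0 * math.pi * F_TONE * t + 0.3)


def sinc(x):
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def reconstruct(t):
    """Truncated Whittaker-Shannon sum evaluated at time t (seconds)."""
    centre = round(t * FS)
    total = 0.0
    for n in range(centre - HALF_WIDTH, centre + HALF_WIDTH + 1):
        total += tone(n / FS) * sinc(t * FS - n)
    return total


def max_reconstruction_error(offsets=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Largest error at several positions between two sample points."""
    errors = []
    for frac in offsets:
        t = (5000 + frac) / FS
        errors.append(abs(reconstruct(t) - tone(t)))
    return max(errors)


if __name__ == "__main__":
    print(f"max error between samples: {max_reconstruction_error():.5f}")
```

The remaining error comes from truncating the infinite sum to a finite window and shrinks as `HALF_WIDTH` grows. Practical converters use shorter, windowed filters rather than a raw sinc sum.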
Bit-depth determines the noise floor and the dynamic range. Thanks to the plethora of conversion methods and how the human ear perceives low level noise, traditional formulas (DR = n * 6.02 decibels; n = number of bits) are incorrect and unusable. The real dynamic range of a digital system or an audio file is usually higher than n * 6.02 decibels and higher than the measured SNR (Signal-To-Noise Ratio).
Monty Montgomery (Xiph.org) wrote a detailed article in 2012 on why 24bit/96kHz and 24bit/192kHz file downloads don't make sense (24/192 Music Downloads ... and why they make no sense). The article was removed from Xiph.org a few years ago, but it is still available in the web archive (link➚). Monty also created a short video about PCM encoding. In the video different analog signals are converted to digital and then converted back to analog. The reconstructed signal is completely analog, there are no stair steps in the waveform. In other words, the output of a properly dithered and filtered digital system is indistinguishable from a band-limited pure analog system with the same noise floor (video on YouTube➚, text of the video➚).
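The role of dither mentioned above can be illustrated with a small experiment. This is a sketch under stated assumptions: a 1 kHz tone whose peak is only 0.4 of one quantization step is rounded to integers, once without dither and once with triangular (TPDF) dither of +/- 1 LSB. The amplitude, the seed and the function names are illustrative choices, not taken from the article.

```python
import math
import random

FS = 44_100
F_TONE = 1_000.0
N = FS                   # one second, a whole number of tone periods
AMPLITUDE_LSB = 0.4      # tone peak is below one quantization step


def tone(n):
    return AMPLITUDE_LSB * math.sin(2.0 * math.pi * F_TONE * n / FS)


def quantize_plain():
    """Round to the nearest integer step without dither."""
    return [round(tone(n)) for n in range(N)]


def quantize_dithered(seed=0):
    """Add TPDF dither (difference of two uniform values) before rounding."""
    rng = random.Random(seed)
    return [round(tone(n) + rng.random() - rng.random()) for n in range(N)]


def tone_amplitude(samples):
    """Amplitude of the F_TONE component, by correlation with the reference."""
    acc = sum(x * math.sin(2.0 * math.pi * F_TONE * n / FS)
              for n, x in enumerate(samples))
    return 2.0 * acc / len(samples)


if __name__ == "__main__":
    print("without dither:", tone_amplitude(quantize_plain()))
    print("with TPDF dither:", round(tone_amplitude(quantize_dithered()), 3))
```

Without dither every sample rounds to zero and the tone is lost. With dither the tone survives at close to its original amplitude, buried in noise rather than truncated.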
About the testing methods and arguments
Debunking the claims that support high-resolution audio is not rocket science. First, we need to collect all the supporting arguments and select those that come from a misunderstanding of how PCM encoding works. Most of the arguments are nothing but misconceptions (formally these are straw man fallacies). In the next step we examine the limits of 16bit/44.1kHz: frequency response, noise floor (for dynamic range) and impulse response (or phase shift, they are two sides of the same coin). Since we are focusing on an ideal digital channel, we can skip the investigation of nonlinear distortion. As a last step, the limits should be compared with the limits of human hearing. That's all.
Misconceptions (PCM encoding)
Usually the following misconceptions are used as arguments in favour of high-resolution audio:
- Waveform fallacy: "The output waveform of DA converters is not continuous, therefore by increasing the bit-depth and sampling frequency the output wave becomes more and more 'analog', accurate etc.";
- Sampling fallacy: "The time resolution of the sampled signal is the sampling period." This leads to the false conclusion that "higher sampling rate offers higher resolution in time.";
- Quantization fallacy: "Digital can't represent analog values below the least significant bit." This leads to the false conclusion that "dynamic range of 16-bit is 96 decibels.";
- Time smearing fallacy: "At a sampling frequency of 44.1 kHz, digital filters (anti-aliasing, resampling filters) cause audible distortion (pre-ringing in the impulse response is audible)."
All of these claims are false, since the signal at the output of a DA converter is continuous, the time resolution of a digital system is independent of sampling frequency and the dynamic range of 16 bits with noise shaping can reach 120 decibels (even 96 decibels is huge). Linear phase filters add some delay to the signal, but this delay is uniform in the passband region and varies only in the transition region. This means that the passband region is free from phase and group delay distortions. At a sampling frequency of 44.1 kHz the top of the passband is between 20 kHz and 21 kHz.
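The statement about time resolution can be tested directly. In the sketch below a 1 kHz tone is delayed by one microsecond, which is far less than the 22.7 microsecond sampling period at 44.1 kHz, then sampled and quantized to 16-bit integers with TPDF dither. The delay is recovered from the phase of the sampled tone. The tone frequency, amplitude, seed and function names are assumptions made for this example.

```python
import math
import random

FS = 44_100             # sampling frequency in Hz (period is about 22.7 us)
F_TONE = 1_000.0        # test tone in Hz (arbitrary)
N = FS                  # one second, a whole number of tone periods
AMPLITUDE = 16_384.0    # half of 16-bit full scale, in LSB units


def sample_delayed_tone(delay_s, rng):
    """Sample a delayed tone and quantize it to integers with TPDF dither."""
    samples = []
    for n in range(N):
        value = AMPLITUDE * math.sin(2.0 * math.pi * F_TONE * (n / FS - delay_s))
        dither = rng.random() - rng.random()    # triangular PDF, +/- 1 LSB
        samples.append(round(value + dither))
    return samples


def estimate_delay(samples):
    """Estimate the delay from the tone's phase (single-frequency correlation)."""
    i_sum = q_sum = 0.0
    for n, x in enumerate(samples):
        angle = 2.0 * math.pi * F_TONE * n / FS
        i_sum += x * math.cos(angle)
        q_sum += x * math.sin(angle)
    phase_lag = math.atan2(-i_sum, q_sum)
    return phase_lag / (2.0 * math.pi * F_TONE)


def measured_delay(delay_s=1e-6, seed=1):
    return estimate_delay(sample_delayed_tone(delay_s, random.Random(seed)))


if __name__ == "__main__":
    print(f"true delay 1.000 us, measured {measured_delay() * 1e6:.3f} us")
```

The samples carry the sub-sample delay in their values. How precisely it can be read back depends on the noise floor and the length of the observation, not on the spacing of the sample points.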
The upper limit of human hearing
Usually people can't hear tones above 20 kHz. This is true for almost everyone - and for everyone over the age of 25. An extremely small group of people under the age of 25 is able to hear tones above 20 kHz under experimental conditions. But as far as audio reproduction and sampling frequency are concerned, hearing tones above 20 kHz doesn't matter.
The upper limit of hearing varies not only from individual to individual and by age, but also depends on the amplitude (sound pressure) of the test signal: the highest audible frequency for a test tone at 100 decibels is higher than for a test tone at 80 decibels. In high frequency hearing tests the sound pressure of the test signal around the hearing threshold is well above the sound pressure values found in live music and movies. The test signal can reach 110 decibels, while the maximum sound pressure level in music at 20 kHz is approx. 85 decibels (crash cymbal). The normal hi-hat level in records is about 60 dBSPL and instruments (brass instruments, violins) hardly produce more than 60 dBSPL in this range. In summary, even if someone can hear test tones up to 26 kHz, he or she will not hear harmonics above 20-22 kHz.
Properties of 16bit/44.1kHz
The dynamic range of 16bit/44.1kHz is huge and completely covers the range required for any type of audio reproduction. In a hi-fi system, the peak SPL and thus the maximum dynamics is approx. 110 decibels. The quantization noise of dithered 16-bit audio becomes audible only when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 105 dBSPL. With noise shaping the gain (volume) can be set about 18 decibels higher.
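For orientation, the unweighted noise level behind these figures can be worked out with a few lines of arithmetic. This sketch uses the textbook white-noise model: rounding error with a power of 1/12 LSB², plus TPDF dither adding twice that. It gives only the measured signal-to-noise ratio. It says nothing about audibility, which depends on how the noise is distributed over frequency and on the hearing threshold. The function names are illustrative.

```python
import math


def snr_db(bits, tpdf_dither=True):
    """SNR of a full-scale sine against white quantization noise, in dB."""
    signal_power = (2.0 ** (bits - 1)) ** 2 / 2.0   # sine with peak 2^(bits-1) LSB
    noise_power = 1.0 / 12.0                        # rounding error, in LSB^2
    if tpdf_dither:
        noise_power += 2.0 / 12.0                   # TPDF dither of +/- 1 LSB
    return 10.0 * math.log10(signal_power / noise_power)


def noise_level_dbspl(full_scale_dbspl, bits=16, tpdf_dither=True):
    """Broadband noise level when a full-scale sine plays at full_scale_dbspl."""
    return full_scale_dbspl - snr_db(bits, tpdf_dither)


if __name__ == "__main__":
    print(f"16-bit, TPDF dither: {snr_db(16):.1f} dB")          # about 93.3 dB
    print(f"16-bit, no dither:   {snr_db(16, False):.1f} dB")   # about 98.1 dB
    print(f"noise at 105 dBSPL full scale: {noise_level_dbspl(105.0):.1f} dBSPL")
```

With a full-scale sine at 105 dBSPL, the total dithered noise sits near 12 dBSPL, spread over the whole 0-22 kHz band.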
44.1 kHz sampling frequency provides flat frequency response up to at least 20 kHz. If we don't mind a few dB attenuation, the upper limit is 21 kHz.
Transient response (impulse response, phase response)
Linear phase shift, zero group delay distortion up to at least 21 kHz. This means that the impulse response is perfect in the audible range. There is a small inaudible ringing (pre-ringing) near 22 kHz.
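The frequency and phase claims can be illustrated with a hypothetical linear-phase low-pass filter. The sketch below designs a Blackman-windowed sinc filter with a 21 kHz cutoff at 44.1 kHz and evaluates its magnitude response at a few frequencies. The cutoff, the tap count and the window are assumptions for this example and do not describe the filter of any particular converter or resampler.

```python
import cmath
import math

FS = 44_100.0
CUTOFF_HZ = 21_000.0    # hypothetical cutoff, chosen for illustration
NUM_TAPS = 511          # odd length, so the taps have a centre sample


def design_lowpass():
    """Blackman-windowed sinc low-pass with symmetric (linear-phase) taps."""
    mid = (NUM_TAPS - 1) // 2
    fc = CUTOFF_HZ / FS                     # cutoff in cycles per sample
    taps = []
    for n in range(NUM_TAPS):
        k = n - mid
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = (0.42 + 0.5 * math.cos(math.pi * k / mid)
                  + 0.08 * math.cos(2.0 * math.pi * k / mid))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]        # normalize to unity gain at DC


def magnitude(taps, f_hz):
    """Magnitude of the filter's frequency response at f_hz."""
    response = sum(h * cmath.exp(-2j * math.pi * f_hz * n / FS)
                   for n, h in enumerate(taps))
    return abs(response)


if __name__ == "__main__":
    taps = design_lowpass()
    for f in (1_000.0, 20_000.0, 22_050.0):
        print(f"{f:8.0f} Hz: gain {magnitude(taps, f):.5f}")
```

Because the taps are symmetric around the centre sample, every frequency is delayed by the same (NUM_TAPS - 1) / 2 = 255 samples, so the filter adds a constant delay and no group delay distortion.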
In the human hearing range, perfect frequency response and transient response can be achieved with a sampling frequency of 44.1 kHz. With noise shaping, the quantization noise of 16b/44k becomes audible only when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 120 dBSPL. All in all, 16b/44k is perfect for sound reproduction and distributing studio quality recordings.
In fact, 16bit/44.1kHz is a bit overkill
If we look more closely, 16bit/44.1kHz not only provides transparency, it is overkill for most people, and overkill for everyone most of the time, considering typical musical events and common playback levels. High quality music could be distributed even in 14bit/32kHz and the difference would be minor at most.
The dynamic range of 16-bit is huge
Compressed pop and rock music could be distributed in 8-10 bits without compromising playback fidelity, because the threshold of hearing is not constant, but changes with the loudness of the music. In terms of dynamic range, analog recording technology is equivalent to a 12 or 13-bit digital system, depending on quality. 13-bit provides high fidelity with any kind of music: for example, the 10-bit version of an arpeggio is a bit noisy, but the 13-bit version sounds the same as the 16-bit or the 24-bit version. The only issue with 13-bit is that the noise is audible during fade-outs and extremely quiet parts. There is a significant difference between 10 and 13 bit with non-compressed music, but between 13 and 16 bit? Hard to hear.
A wide variety of music could be released at 32 kHz without compromising playback fidelity
Only a few instruments can produce relatively high SPLs above 16 kHz. Cymbals, hi-hats, castanets, brass instruments, steel-string acoustic guitar and, in rare cases, violin benefit from the 44.1 kHz sample rate. Piano and cello works, chamber music and many classical tunes could be released at 32 kHz without compromising playback fidelity.
Hearing and age
As we get older, our ability to hear high pitched sounds decreases. People over 40 can't hear tones above 16 kHz, which means that 32 kHz sampling rate could deliver the same fidelity as 44.1 kHz for them.
Returning to dynamic range and bits, when we are listening to a digital transfer of an analog recording (CD, FLAC, MP3 version... or a YouTube video), then in terms of dynamic range, this is equivalent to listening to a 12-bit or a 13-bit digital system.
About the recording formats
16bit/44.1kHz is perfect for distribution, but what about recording formats? It seems that 24b/96k became the standard in music production, but does it mean any improvement over 16/44.1?
Effects and editing operations produce two kinds of distortion: arithmetic distortion (quantization distortion) and aliasing distortion. Arithmetic distortion can be reduced by increasing the bit-depth, aliasing distortion can be reduced by increasing the sample rate. All effects generate arithmetic distortion, but very few effects generate aliasing distortion. So sample rates above 48 kHz usually are not useful in recording formats.
The largest group of effects only uses linear operations (linear effects). These are: equalizers (filters), reverb, volume change, fade-in, fade-out. Linear effects and linear editing operations only generate arithmetic distortion. This means that the quality of linear effects and linear editing operations is not affected by the sampling frequency. For an EQ or a reverb, 44.1 kHz is as good as 96 kHz. However, linear effects require a bit-depth higher than 16 bits to keep the distortion products below 120 decibels. Keeping the files in 24-bit format can prevent the accumulation of noise due to continuous requantization in the DAW (audio editor).
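The accumulation of requantization error can be shown with the simplest linear operation, a volume change. In this sketch a signal is attenuated by 12 dB and later boosted by 12 dB, once with rounding to 16-bit integer steps after every operation and once with a single rounding at the end. The gain values, the random test signal and the function names are assumptions chosen to keep the arithmetic easy to follow.

```python
import math
import random


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def process(signal, gains, round_each_step):
    """Apply a chain of gains; optionally round to integer steps after each one."""
    out = []
    for x in signal:
        y = x
        for g in gains:
            y *= g
            if round_each_step:
                y = round(y)
        out.append(round(y))      # the final file is stored in integer steps
    return out


def error_rms(signal, gains, round_each_step):
    """RMS error, in LSB, against the ideal unquantized result."""
    total_gain = math.prod(gains)
    processed = process(signal, gains, round_each_step)
    return rms([p - x * total_gain for p, x in zip(processed, signal)])


def compare(seed=0, count=20_000):
    rng = random.Random(seed)
    signal = [rng.uniform(-20_000.0, 20_000.0) for _ in range(count)]
    gains = (0.25, 4.0)           # -12 dB, then +12 dB
    return error_rms(signal, gains, True), error_rms(signal, gains, False)


if __name__ == "__main__":
    stepwise, single = compare()
    print(f"rounded after every step: {stepwise:.3f} LSB rms")
    print(f"rounded once at the end:  {single:.3f} LSB rms")
```

The error made at the attenuated level is multiplied by the later boost, so the stepwise result ends up about four times (12 dB) noisier than the version kept at high precision until the end.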
Nonlinear effects (e.g. guitar distortion) are sensitive to the sampling frequency, however, they all use internal oversampling, so the use of a higher sampling frequency has no benefit. Modulation effects (chorus, flanger) may sound a bit cleaner at higher sample rates, because smooth time stretching requires correct intersample calculation.
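The aliasing mechanism behind this can be sketched with a very crude nonlinearity. Cubing a 15 kHz sine creates a third harmonic at 45 kHz with one quarter of the original amplitude. At 44.1 kHz that harmonic cannot be represented and folds down to 900 Hz, while at a four times higher internal rate it stays at 45 kHz, where a filter can remove it before downsampling. The cubic "effect" and all names below are illustrative assumptions, not a model of a real distortion plug-in.

```python
import math


def alias_frequency(f_hz, fs_hz):
    """Frequency at which a component at f_hz appears after sampling at fs_hz."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))


def cubed_tone(f_hz, fs_hz, count):
    """A sine passed through the nonlinearity y = x^3."""
    return [math.sin(2.0 * math.pi * f_hz * n / fs_hz) ** 3 for n in range(count)]


def component_amplitude(samples, f_hz, fs_hz):
    """Amplitude of the component at f_hz (single-frequency correlation)."""
    i_sum = sum(x * math.cos(2.0 * math.pi * f_hz * n / fs_hz)
                for n, x in enumerate(samples))
    q_sum = sum(x * math.sin(2.0 * math.pi * f_hz * n / fs_hz)
                for n, x in enumerate(samples))
    return 2.0 * math.hypot(i_sum, q_sum) / len(samples)


def level_at_900_hz(fs_hz, f_hz=15_000):
    """Level found at 900 Hz after cubing a 15 kHz tone sampled at fs_hz."""
    samples = cubed_tone(f_hz, fs_hz, fs_hz)      # one second of signal
    return component_amplitude(samples, 900, fs_hz)


if __name__ == "__main__":
    print("45 kHz at fs = 44.1 kHz appears at", alias_frequency(45_000, 44_100), "Hz")
    print("level at 900 Hz, fs = 44.1 kHz: ", round(level_at_900_hz(44_100), 6))
    print("level at 900 Hz, fs = 176.4 kHz:", round(level_at_900_hz(176_400), 6))
```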
In summary, there is no point in claiming that 24b/96k is a better recording format than 16b/44k or 24b/44k. For live recordings 16b/44k is sufficient, while 24b/44k may provide lower final noise for an effects-packed multi-track recording.
Although recordings in the studio are stored in 24-bit files, the real bit-depth of recordings often does not reach 16 bits and the remaining bits only contain noise. There are no 24-bit recordings, nor are there 20-bit or 19-bit ones. Even 18 and 17-bit recordings are very rare and with noise shaping they can be transferred to 16 bits (19-bit dynamics can be transferred to 16 bits with noise shaping). So, modern 24-bit recordings are still 16-bit recordings. The dynamic range of analog recordings is even lower, as analog tape recorders can only capture 13-bit dynamic range. It means that analog recordings could be distributed in 13 bits without compromising playback fidelity. The main role of 24-bit in recording formats is to ensure that the accumulation of conversion errors is small enough to be completely irrelevant.
Format discrimination studies
And what about listening tests? We cannot understand how things work by listening to music and changing components, so if we use listening tests to understand digital signal behaviour or hearing then we just shoot ourselves in the foot. Similarly, we cannot understand how a display works by watching movies and pushing the buttons on the remote controller. In all areas of life we try to solve problems in the fastest and most reliable way, not the other way around. (As a general guideline we should listen to music for our enjoyment and not for fixing problems in the audio chain.)
But there is another reason why traditional listening tests and format discrimination studies lead nowhere. We can cover all possibilities by testing the "worst case". Worst-case analysis helps us get out of the hell of discrimination studies filled with statistical mumbo-jumbo and of the creationism kind of science that is meta-analysis. (The infamous meta-analysis from 2016 is bad not only because the selection of the papers is slightly biased - it is bad because the idea is bad. Who cares about small biases and measurement errors if you have a good idea? The core principle of this kind of study is bad, because it diverts attention from digital signal behaviour and human hearing.)
So, instead of performing a zillion song-based listening tests, we have to find the "worst case" and use it as an "experimentum crucis". The other benefit of worst-case analysis is that it has much higher sensitivity than traditional song-based tests.
| Type of test | Worst case |
|---|---|
| Sample rate | 80 dBSPL pure tone, 21 kHz-25 kHz |
| Filter ringing | pulse (bipolar and unipolar) |
| Quantization (optimal) | residual noise (no signal) |
| Quantization (not optimal) | pure tone with fade out |
It's worth mentioning that format discrimination studies neglect null tests for some reason. In a null test we don't compare the audio samples with each other, so the test procedure doesn't rely on auditory memory and a blind test (ABX) is not required (in a null test we're listening to the difference between two versions of a file; we are not comparing A to B, we are simply listening to [A - B]). The test procedure is a simple sound detection in silence, which is a huge benefit. A null test is similar to examining things under a magnifying glass: it magnifies small differences a thousandfold. This is also a disadvantage, as a null test is much more susceptible to errors, especially errors that may lead to false positive results (for example, low level intermodulation distortion that would be masked in normal tests can falsify the results).
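The [A - B] operation itself is a simple subtraction. The sketch below computes the difference between a high-precision test tone and its 16-bit version and reports the level of that difference signal in dB relative to digital full scale (amplitude 1.0). The test tone, the undithered `quantize` helper and the function names are assumptions for illustration. A real null test also requires the two files to be sample-aligned and level-matched.

```python
import math


def null_test_level_db(a, b):
    """RMS level of the difference signal a - b, in dB re. full scale (1.0)."""
    diff = [x - y for x, y in zip(a, b)]
    rms = math.sqrt(sum(d * d for d in diff) / len(diff))
    return float("-inf") if rms == 0.0 else 20.0 * math.log10(rms)


def quantize(samples, bits):
    """Round samples in the range -1.0..1.0 to a bits-wide grid (no dither)."""
    scale = 2 ** (bits - 1)
    return [round(x * scale) / scale for x in samples]


def example(fs=44_100, f_hz=997.0, count=44_100):
    """Null a half-scale 997 Hz tone against its 16-bit version."""
    original = [0.5 * math.sin(2.0 * math.pi * f_hz * n / fs) for n in range(count)]
    return null_test_level_db(original, quantize(original, 16))


if __name__ == "__main__":
    print(f"difference signal level: {example():.1f} dB re. full scale")
```

Listening to this residual at a realistic playback gain is the detection-in-silence task described above.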
Another concept which is often misunderstood is the effect of training. Training is a learning process, similar to learning the letters of an alphabet or learning the road signs. Learning improves 'pattern recognition', but the absolute limits of our perception don't change. The difference between a trained listener and a non-trained listener is that a trained listener knows what to look for, which part of the music to pay attention to.
Music with extra amount of noise
High-resolution downloads don't make sense, but there is a different reason why distributing audio in high-res formats is really stupid. A recording in 24-bit / 88.2 kHz WAV is three times as large as the same recording in 16-bit / 44.1 kHz WAV. The ratio is slightly worse with FLAC files. And what is gained in a high resolution recording? Noise. A huge amount of inaudible, 'functionless' noise.
In 16-bit files the amount of recording noise is negligible compared to the file size. In a 24/96 (or 24/88.2) recording, the size of the inaudible and completely unnecessary recording noise is roughly equal to the size of a 16 bit / 44.1 kHz WAV file of the same length. That's a pretty big waste. Another problem is that noise cannot be compressed with lossless tools, which is also evident in the huge size of high-resolution FLAC files. The average bitrate of a 24/96 recording in FLAC is approx. 2.5 Mbps (Megabits per second), of which 1.4 Mbps is unnecessary recording noise. That is, more than half of a 24/96 FLAC file consists of inaudible noise...
High-resolution music streaming is noise pollution. It takes away bandwidth and energy and in return gives nothing, especially when we can stream in studio quality at 130 kbps with AAC and Opus.
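The uncompressed figures behind this comparison follow from simple arithmetic: bit depth times sample rate times channel count. The helper below is a small sketch with illustrative names. It covers raw PCM only and does not attempt to model FLAC compression. The last line assumes, as argued above, that the lowest 8 bits of a 24-bit file carry only noise.

```python
def pcm_bitrate_kbps(bits, sample_rate_hz, channels=2):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return bits * sample_rate_hz * channels / 1000.0


if __name__ == "__main__":
    cd = pcm_bitrate_kbps(16, 44_100)
    print(f"16/44.1: {cd:.1f} kbps")
    print(f"24/88.2: {pcm_bitrate_kbps(24, 88_200):.1f} kbps "
          f"({pcm_bitrate_kbps(24, 88_200) / cd:.2f}x)")
    print(f"24/96:   {pcm_bitrate_kbps(24, 96_000):.1f} kbps "
          f"({pcm_bitrate_kbps(24, 96_000) / cd:.2f}x)")
    # The 8 bits added on top of 16, at 96 kHz stereo:
    print(f"extra 8 bits at 96 kHz: {pcm_bitrate_kbps(8, 96_000):.1f} kbps")
```

The 8 extra bits of a 24/96 stereo stream alone amount to 1536 kbps, slightly more than the 1411.2 kbps of a complete 16/44.1 stream.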
Footnote #1 - high-frequency auditory thresholds:
"Ultrahigh-frequency auditory thresholds in young adults: Reliable responses up to 24 kHz with a quasi-free-field technique", K. R. Henry and G. A. Fast, 1984
"Extended high-frequency (9-20 kHz) audiometry reference thresholds in 645 healthy subjects", A. Rodríguez Valiente et al., April 2014, Int J Audiol.
"Threshold of hearing in free field for high-frequency tones from 1 to 20 kHz", Kaoru Ashihara et al., 2003
"Hearing threshold for pure tones above 20 kHz", Kaoru Ashihara et al., 2005
"Hearing threshold for pure tones above 16 kHz", Kaoru Ashihara, 2007
Footnote #2 - most important studies on dither, noise shaping & noise audibility:
"Dither in Digital Audio", John Vanderkooy, Stanley Lipshitz, 1987
"Optimal Noise Shaping and Dither of Digital Signals", Michael Gerzon, Peter G. Craven, 1989
"Minimally Audible Noise Shaping", S. P. Lipshitz, J. Vanderkooy, and R. A. Wannamaker, 1991
"Noise: Methods for Estimating Detectability and Threshold", R. Stuart, 1994