High-resolution audio vs. 16 bit / 44.1kHz
Investigation of the optimal sampling frequency and bit depth of distribution audio formats.
Last edited: Feb. 3, 2024
Since the introduction of the "Compact Disc", the quality of the format (16 bit / 44.1 kHz) has been a subject of constant debate. Beliefs and misconceptions spread along with digital technology itself: "CD is not 'analog'", "filter resonance degrades the sound quality". What was initially a misunderstood technology (that some people considered best to avoid) later became a misunderstood format (that some people consider best to avoid).
The behavior of digital signals is counterintuitive, which explains the multitude of misconceptions and frauds. Misleading representations of pulse-code modulation, the use of old digital stereotypes, and fundamental misconceptions repeated for advertising purposes are very common. The marketing term "CD quality" is also misleading, because it suggests that quality is proportional to file size or bitrate. Since 16 bit / 44.1 kHz provides the same playback fidelity as 24 bit / 96 kHz, studio-quality recordings can be distributed in 16/44.1 formats (16/44.1 is studio quality).
Scientific-looking statistical mumbo-jumbo, accompanied by poor reasoning and speculative, ambiguous assertions, also helps spread lies (e.g. the "high-resolution audio perception could neither be confirmed nor denied" nonsense).
Sampling & quantization
Sampling rate tells how many samples are taken per second during digital conversion. The sampling frequency determines the bandwidth or frequency range: the highest representable frequency is half the sampling frequency.
As a result of sampling, new frequency components (called images or image frequencies) are created. If the sampled signal doesn't contain harmonics above the Nyquist frequency (half the sampling frequency), the generated components (images) appear above the Nyquist frequency and they can be removed with a low-pass filter. If the sampled signal contains harmonics above the Nyquist frequency, certain image components appear below the Nyquist frequency ("aliasing"). The purpose of the anti-alias filter is to prevent "aliasing" by attenuating frequencies above the Nyquist frequency. (Online simulation of sampling, demonstration of aliasing, oversampling)
Bottom line: at a sampling frequency of 44.1 kHz, the original signal can be reconstructed from the sample points up to 20 kHz with correct amplitude and phase. This applies to all kinds of signals.
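As a minimal illustration of the aliasing described above (assuming an ideal sampler; plain Python, no audio libraries), a tone above the Nyquist frequency produces exactly the same sample values as its mirror image below Nyquist, which is why it must be filtered out before sampling:

```python
import math

def sampled_values(freq_hz, fs_hz, n):
    """Return n samples of a sine of freq_hz taken at rate fs_hz."""
    return [math.sin(2 * math.pi * freq_hz * k / fs_hz) for k in range(n)]

fs = 44100
# A 25 kHz tone sampled at 44.1 kHz yields the same sample values as a
# 19.1 kHz tone with inverted polarity: 25 kHz aliases to 44.1 - 25 = 19.1 kHz.
a = sampled_values(25000, fs, 64)
b = sampled_values(fs - 25000, fs, 64)   # 19100 Hz
assert all(abs(x + y) < 1e-9 for x, y in zip(a, b))   # a[k] == -b[k]
```

Once the samples are taken, no reconstruction filter can tell the two tones apart; the anti-alias filter exists precisely to prevent this ambiguity.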
Bit depth determines the noise floor and dynamic range. Because of the plethora of conversion methods and the way the human ear perceives low-level noise, the traditional formula (DR = n * 6.02 decibels; n = number of bits) is incorrect and unusable. The real dynamic range of a digital system or an audio file is usually higher than n * 6.02 decibels and higher than the measured SNR (Signal-to-Noise Ratio).
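The textbook formula mentioned above can be sketched in a few lines (a simplified calculation only; as noted, real dithered systems exceed it):

```python
import math

def textbook_dr_db(bits):
    """'Textbook' dynamic range: ratio of full scale to one quantization
    step, 20*log10(2^n) ~= n * 6.02 dB. A lower bound, not the real DR."""
    return 20 * math.log10(2 ** bits)

print(round(textbook_dr_db(16), 1))   # 96.3
print(round(textbook_dr_db(24), 1))   # 144.5
```

With proper dither and noise shaping, the perceived dynamic range of a 16-bit channel is substantially larger than the 96.3 dB this formula suggests.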
Monty Montgomery (Xiph.org) wrote a detailed article in 2012 on why 24bit/96kHz and 24bit/192kHz file downloads don't make sense ("24/192 Music Downloads ... and why they make no sense"). The article was removed from Xiph.org a few years ago, but it is still available in the web archive (link➚). Monty also created a short video about PCM encoding. In the video, different analog signals are converted to digital and then converted back to analog. The reconstructed signal is completely analog; there are no stair steps in the waveform. In other words, the output of a properly dithered and filtered digital system is indistinguishable from a band-limited pure analog system with the same noise floor (video on YouTube➚, text of the video➚).
The main question
The primary question is not "is there an audible difference between 44.1 kHz and 96 kHz sampling rates?". That is a secondary question.
The real question is about the limits: "do the limits of human hearing exceed the limits of 16-bit and 44.1 kHz?". This question can be refined by excluding unrealistically high sound pressure levels and including age groups.
How do we know what parameters are important? Simple. There are only three fundamental questions to be answered regarding PCM encoding:
- What does quantization do with the signal? (What is the effect of quantization?)
- What does sampling do with the signal? (What is the effect of sampling?)
- What do filters do with the signal? (What are the effects of filters?)
Answers (assuming ideal or "text-book" quantization and sampling):
- Quantization: noise.
- Sampling: new image frequencies (but they are removed by the resampling filter).
- Resampling filter: amplitude vs. frequency response and group delay vs. frequency response.
So frequency response (frequency range), noise floor (for dynamic range) and group delay have to be analyzed. (Analysis is much easier with group delay than with phase shift or impulse response. Pulse is a good test and measurement signal, but waveform analysis can be misleading.) Since we are focusing on an ideal digital channel, we can skip the investigation of nonlinear distortion. As a last step, the limits should be compared with the limits of human hearing. That's all.
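The first answer above ("quantization: noise") can be demonstrated in a short simulation (a minimal sketch with hypothetical parameters, using the standard ±1 LSB triangular-PDF dither applied before rounding):

```python
import math, random

def quantize_tpdf(x, bits):
    """Quantize x (in the ±1.0 range) to `bits` bits with TPDF dither."""
    step = 2.0 / (2 ** bits)                              # one LSB
    dither = (random.random() - random.random()) * step   # triangular PDF, ±1 LSB
    return round((x + dither) / step) * step

random.seed(0)                                            # reproducible sketch
fs, bits = 44100, 16
sig = [0.5 * math.sin(2 * math.pi * 1000 * k / fs) for k in range(fs)]
err = [quantize_tpdf(s, bits) - s for s in sig]           # what quantization added
rms = math.sqrt(sum(e * e for e in err) / len(err))
err_dbfs = 20 * math.log10(rms)                           # about -96 dBFS for 16 bits
```

The error signal is benign broadband noise, uncorrelated with the program material; that noise floor is the entire audible "cost" of quantization.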
Misconceptions (PCM encoding)
Usually the following misconceptions are used as arguments in favour of high-resolution audio:
- Waveform fallacy: "The output waveform of DA converters is not continuous; therefore, by increasing the bit depth and sampling frequency, the output waveform becomes more and more 'analog', accurate, etc.";
- Sampling fallacy: "The time resolution of the sampled signal is the sampling period." This leads to the false conclusion that "a higher sampling rate offers higher resolution in time.";
- Quantization fallacy: "Digital can't represent analog values below the least significant bit." This leads to the false conclusion that "the dynamic range of 16-bit is 96 decibels";
- Time smearing fallacy: "At a sampling frequency of 44.1 kHz, digital filters (anti-aliasing, resampling filters) cause audible distortion (pre-ringing in the impulse response is audible)."
All of these claims are false, since the signal at the output of a DA converter is continuous, sampling doesn't affect time resolution and quantization (with dither) just adds noise to the signal. Linear phase filters add some delay to the signal, but this delay is uniform in the passband region and varies only in the transition region. This means that the passband region is free from phase and group delay distortions. At a sampling frequency of 44.1 kHz the top of the passband is between 20 kHz and 21 kHz.
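The linear-phase property mentioned above follows from filter symmetry, which can be shown with a textbook windowed-sinc low-pass filter (a generic illustration, not any particular converter's filter): symmetric coefficients give linear phase, i.e. a constant group delay of (N-1)/2 samples across the whole passband.

```python
import math

def windowed_sinc(num_taps, fc):
    """Linear-phase low-pass FIR: windowed sinc.
    num_taps: odd number of coefficients; fc: cutoff as a fraction of fs."""
    m = num_taps - 1
    taps = []
    for n in range(num_taps):
        x = n - m / 2
        h = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)   # Hamming window
        taps.append(h * w)
    return taps

taps = windowed_sinc(63, 0.45)   # steep cutoff near Nyquist
# Symmetric impulse response => linear phase => constant group delay
# of (N-1)/2 = 31 samples for every passband frequency.
assert all(abs(taps[i] - taps[-1 - i]) < 1e-12 for i in range(len(taps)))
```

The "ringing" of such a filter sits on both sides of the main peak, but since the delay is uniform in the passband, no frequency component in the audible range is smeared relative to any other.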
The upper limit of human hearing
Usually people can't hear tones above 20 kHz. This is true for almost everyone - and for everyone over the age of 25. An extremely small group of people under the age of 25 is able to hear tones above 20 kHz under experimental conditions. But as far as audio reproduction and sampling frequency are concerned, hearing tones above 20 kHz doesn't matter.
The upper limit of hearing varies not only from individual to individual and by age, but also depends on the level of the test signal: the highest audible frequency for a test tone at 100 decibels is higher than for a test tone at 80 decibels. In high-frequency hearing tests, the sound pressure of the test signal around the hearing threshold is well above the sound pressure values found in live music and movies. The test signal can reach 110 decibels, while the maximum sound pressure level in music at 20 kHz is approx. 85 decibels (crash cymbal). The normal hi-hat level in records is about 60 dBSPL, and instruments (brass instruments, violins) hardly produce more than 60 dBSPL in this range. In summary, even if someone can hear test tones up to 26 kHz, he or she will not hear harmonics above 20-22 kHz in music.
Properties of 16bit/44.1kHz
Dynamic range of 16 bit / 44.1 kHz is huge and completely covers the range required for any type of audio reproduction. In a hi-fi system, the peak SPL and thus the maximum dynamics is approx. 110 decibels. The quantization noise of 16-bit dither becomes audible only when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 105 dBSPL. With noise shaping, the gain (volume) can be set about 18 decibels higher.
44.1 kHz sampling frequency provides flat frequency response up to at least 20 kHz. If we don't mind a few dB of attenuation, the upper limit is 21 kHz.
Transient response (impulse response, phase response)
Linear phase shift, zero group delay distortion up to at least 21 kHz. This means that the impulse response is perfect in the audible range. There is a small inaudible ringing (pre-ringing) near 22 kHz.
In the human hearing range, perfect frequency response and transient response can be achieved with a sampling frequency of 44.1 kHz. Quantization noise of 16b/44k becomes audible when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 120 dBSPL. All in all, 16b/44k is perfect for sound reproduction and for distributing studio-quality recordings.
In fact, 16bit/44.1kHz is a bit overkill
If we look more closely, 16 bit / 44.1 kHz not only provides transparency: it is overkill for most people, and overkill for everyone most of the time, considering typical musical events and common playback levels. High-quality music could be distributed even in 14 bit / 32 kHz and the difference would be minor at most.
The dynamic range of 16-bit is huge
Compressed pop and rock music could be distributed in 8-10 bits without compromising playback fidelity, because the threshold of hearing is not constant but changes with the loudness of the music. In terms of dynamic range, analog recording technology is equivalent to a 12- or 13-bit digital system, depending on quality. 13-bit provides high fidelity with any kind of music: for example, a 10-bit version of an arpeggio is a bit noisy, but the 13-bit version sounds the same as the 16-bit or the 24-bit version. The only issue with 13-bit is that the noise is audible during fade-outs and extremely quiet parts. There is a significant difference between 10 and 13 bits with non-compressed music, but between 13 and 16 bits? Hard to hear.
A wide variety of music could be released at 32 kHz without compromising playback fidelity
Only a few instruments can produce relatively high SPLs above 16 kHz. Cymbals, hi-hats, castanets, brass instruments, steel-string acoustic guitar and, in rare cases, violin benefit from the 44.1 kHz sample rate. Piano and cello works, chamber music and many classical tunes could be released at 32 kHz without compromising playback fidelity.
Hearing and age
As we get older, our ability to hear high-pitched sounds decreases. People over 40 typically can't hear tones above 16 kHz, which means that a 32 kHz sampling rate could deliver the same fidelity as 44.1 kHz for them.
Returning to dynamic range and bits: when we are listening to a digital transfer of an analog recording (CD, FLAC, MP3... or YouTube), in terms of dynamic range this is equivalent to listening to a 12-bit or 13-bit digital system. (Analog tape recorders can only capture about 13 bits of dynamic range. It also means that analog recordings could be distributed in 13 bits without compromising playback fidelity.)
Noise floor in 24-bit recordings
Noise in an audio file is the 'sum' of the quantization noise and the noise already present in the recording (or captured during recording). The "other noise" can be room noise, another system's quantization noise (e.g. the AD converter's internal noise) or, in the case of digital versions of analog recordings, the noise of the tape recorder. If their levels are different, the noise floor is determined by the higher one.
The effective or equivalent bit depth of a recording can be determined by comparing the recording's noise spectrum with the noise floor of a 16 bit / 44.1 kHz digital system with standard ('TPDF') dither. The selected section must be at least 100 ms long and contain nothing but recording noise.
The effective bit depth of recordings - even 24-bit recordings - often does not reach 16 bits, and the remaining bits contain only noise. We won't find true 24-bit recordings, nor 20-bit or 19-bit ones. Even 18- and 17-bit recordings are very rare, and with noise shaping they can be transferred to 16 bits (19-bit dynamics can be transferred to 16 bits with noise shaping). Also, reducing the noise of the recording system below the "18-bit line" does not result in lower noise; it just reveals more and more room noise.
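Under the simplifying assumption that the measured noise is spectrally flat (the comparison described above uses full noise spectra, not a single number), the mapping from a measured noise floor to an equivalent bit depth can be sketched as:

```python
import math

def equivalent_bits(noise_rms_dbfs):
    """Map a noise floor (RMS in dBFS, full scale = 1.0) to the bit depth
    whose TPDF-dithered quantization noise has the same RMS.
    For n bits with TPDF dither, the noise RMS is 2**-n of full scale,
    i.e. about n * 6.02 dB below full scale."""
    return -noise_rms_dbfs / (20 * math.log10(2))

print(round(equivalent_bits(-96.3), 1))   # ~16.0 bits
print(round(equivalent_bits(-78.0), 1))   # ~13.0 bits (analog-tape territory)
```

A "24-bit" file whose noise floor sits at, say, -96 dBFS therefore carries only about 16 bits of information; the rest of the word length encodes noise.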
Listening tests: from fallacies to alternative test methods
Unfortunately, this field has become too "test oriented", often ignoring digital signal behavior and basic models of human hearing. We cannot understand how things work by listening to music and changing components, just as we cannot understand how a display works by watching movies and pushing buttons on the remote control. The goal is to understand technology and hearing, not just to "test" and be happy (or unhappy) with the test results. Real knowledge is when we can answer the questions "Why?", "How does it work?" or "Why did this happen?" - not just "it happened this way, just accept it". Furthermore, a blind test is only a "safety" tool, not a problem-solving method, nor a method for discovering cause and effect.
Applying a blind test is not the goal - it is a tool. This principle is often overlooked. The goal is to find a valid observation from which we can infer cause and effect. Another misconception is that the only source of self-deception is ignoring blind test methods, so if you follow blind test methods you're in safe waters. Self-deception may have several causes, not just one: lack of understanding of technology and hearing, substituting belief for knowledge, and a lack of understanding of what makes a test valid, which leads to an inability to recognize false positive results.
Rejecting blind tests for imaginary reasons ("blind testing is not natural") is a mistake. On the other hand, forcing every problem into the framework of double-blind testing is also a mistake. A solution to the "test madness" is to replace discrimination-type tests with sound detection-type tests. If the test procedure is only sound detection in silence, the use of a blind test (ABX or other) is not necessary, because the method doesn't rely on auditory memory (the possibility of self-deception is extremely low). The null test and listening to high-pass filtered audio samples (for sample rate tests) are detection-type tests.
The null test is very simple: instead of comparing A to B, we listen to the difference between A and B. Creating high-pass filtered audio samples for testing the sample rate is even simpler. (Note: the null test is unusable with lossy compression.)
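The core of a null test can be sketched in a few lines (assuming the two versions are sample-aligned and gain-matched; the function name is illustrative):

```python
import math

def null_residual_dbfs(a, b):
    """Subtract two time-aligned, gain-matched signals and report the
    residual level in dBFS (full scale = 1.0). Instead of A/B-ing the
    two versions, we listen to (or measure) only their difference."""
    diff = [x - y for x, y in zip(a, b)]
    rms = math.sqrt(sum(d * d for d in diff) / len(diff))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

# Identical signals null to silence:
a = [math.sin(0.01 * k) for k in range(1000)]
print(null_residual_dbfs(a, a))   # -inf (perfect null)
```

If the residual is silence (or only inaudible noise), the two versions are by construction indistinguishable; no auditory memory is involved, which is what makes this a detection-type test.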
Another concept which is often misunderstood is the effect of training. Training is a learning process, similar to learning the letters of an alphabet or learning the road signs. Learning improves 'pattern recognition', but the absolute limits of our perception don't change. The difference between a trained listener and a non-trained listener is that a trained listener knows what to look for, which part of the music to pay attention to.
We definitely need a bigger goal.
The null test method and the high-pass filter test method still can't be considered "surefire tests". The goal is to create a definitive test, which is not hard if we recall what the real question was: "do the limits of human hearing exceed the limits of 16-bit and 44.1 kHz?". This question can be refined by excluding unrealistically high sound pressure levels at high frequencies. So, instead of performing countless song-based listening tests, we are looking for test methods that can be accepted as an "experimentum crucis". Basically, a worst-case analysis, a limit test - testing the limits of 16/44.1 with special ("worst-case") signals. The other benefit of worst-case analysis is the increased sensitivity compared to traditional song-based tests.
| Type of test | Test signal |
| --- | --- |
| Sample rate / bandwidth | 80 dBSPL pure tone, 22 kHz - 24 kHz |
| Sample rate / filter 'ringing' | pulse (bipolar, unipolar); residual noise (no signal) |
| Quantization (not optimal) | pure tone with fade-out; music with an extra amount of noise |
High-resolution downloads don't make sense, but there is another reason why distributing audio in high-res formats is really stupid. A recording in 24-bit / 88.2 kHz WAV is three times larger than in 16-bit / 44.1 kHz WAV. The ratio is slightly worse with FLAC files. And what's the bonus in a high-resolution recording? Noise. A huge amount of inaudible, 'functionless' noise.
In 16-bit files the amount of recording noise is negligible compared to the file size. In a 24/96 (or 24/88.2) recording, the size of the inaudible and completely unnecessary recording noise is roughly equal to the size of a 16 bit / 44.1 kHz WAV file of the same length. That's a pretty big waste. Another problem is that noise cannot be compressed with lossless tools, which is also evident in the huge size of high-resolution FLAC files. The average bitrate of a 24/96 recording in FLAC is approx. 2.5 Mbps (megabits per second), of which 1.4 Mbps is unnecessary recording noise. That is, more than half of a 24/96 FLAC file consists of inaudible noise...
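The size arithmetic is easy to verify (uncompressed stereo PCM, ignoring container overhead):

```python
def pcm_bitrate_bps(bits, fs_hz, channels=2):
    """Uncompressed PCM bitrate in bits per second."""
    return bits * fs_hz * channels

cd    = pcm_bitrate_bps(16, 44100)   # 1,411,200 bps (~1411 kbps)
hires = pcm_bitrate_bps(24, 88200)   # 4,233,600 bps
print(hires / cd)                    # 3.0 -- a 24/88.2 WAV is exactly 3x a 16/44.1 WAV
```

For 24/96 the ratio is about 3.27, so the "three times larger" figure holds for both common high-resolution rates.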
High-resolution music streaming is noise pollution. It takes away bandwidth and energy and gives nothing in return, especially when we can stream in studio quality at 130 kbps with AAC and Opus.
Footnote #1 - high-frequency auditory thresholds:
"Ultrahigh-frequency auditory thresholds in young adults: Reliable responses up to 24 kHz with a quasi-free-field technique", K. R. Henry and G. A. Fast, 1984
"Extended high-frequency (9-20 kHz) audiometry reference thresholds in 645 healthy subjects", A. Rodríguez Valiente et al., April 2014, Int J Audiol.
"Threshold of hearing in free field for high-frequency tones from 1 to 20 kHz", Kaoru Ashihara et al., 2003
"Hearing threshold for pure tones above 20 kHz", Kaoru Ashihara et al., 2005
"Hearing threshold for pure tones above 16 kHz", Kaoru Ashihara, 2007
Footnote #2 - most important studies on dither, noise shaping & noise audibility:
"Dither in Digital Audio", John Vanderkooy, Stanley Lipshitz, 1987
"Optimal Noise Shaping and Dither of Digital Signals", Michael Gerzon, Peter G. Craven, 1989
"Minimally Audible Noise Shaping", S. P. Lipshitz, J. Vanderkooy, and R. A. Wannamaker, 1991
"Noise: Methods for Estimating Detectability and Threshold", R. Stuart, 1994