High-resolution audio vs. 16 bit / 44.1kHz
On the optimal sampling frequency and bit-depth of audio files for distribution.
Nov. 2, 2021
There is a huge amount of misinformation on the net about the optimal sampling frequency and resolution (bit depth) of audio files for distribution. The CD (Compact Disc) was released forty years ago, but digital audio is still a chaotic subject, mainly dominated by misconceptions.
More and more digital music stores offer recordings in 24 bit / 96 kHz or other ‘hi-res’ formats (MQA, DSD). The hype around high-resolution formats in the media also keeps growing. However, promises remain promises, as bit-depths higher than 16 bits or sampling frequencies higher than 44.1 kHz will not improve the playback fidelity.
High-resolution music and hi-res players are advertised with misleading descriptions and ads almost everywhere. False representations of pulse-code modulation, the use of old digital stereotypes and fundamental misconceptions for advertising purposes are very common. The marketing term "CD quality" is also misleading, mainly because it suggests that quality is proportional to file size or bitrate. Since 16 bit / 44.1 kHz provides the same playback fidelity as 24 bit / 96 kHz, studio quality recordings can be distributed in 16/44.1 formats. 16/44.1 is studio quality.
Perhaps the biggest problem with digital audio is that it’s easy to create theories that seem true and are extremely convincing, which makes their promotion simple, yet they have nothing to do with reality. There is no error in the logic of the reasoning; however, the basic assumptions (premises) are false, and if the initial statements are false, the whole theory collapses (these false theories are very similar to the straw man fallacy➚). This is true for vinyl hype, DSD, high-resolution audio and MQA encoding. The starting point of each is wrong (digital signals are square waves, the effect of filters is audible, etc.).
False analogies are another large group of fallacies (e.g., misleading illustrations of PCM encoding in high-resolution ads). The real appeal of false theories and analogies is their simplicity, as they provide simpler explanations than real answers. However, the real problem is long-term confirmation bias and self-deception, which can only be avoided if we recognize mistakes in time.
It is a common mistake in hi-fi to shift the boundary between subjective and objective, that is, to claim that an objective phenomenon or characteristic is subjective. It is also a common trick to present a simple acoustic or electrical phenomenon as if there were no rational explanation for it. In amplifiers and in digital audio, the problems are quite straightforward and we can’t find subjective elements here. All we need to do is to look at how and how much the signal changes after passing through a component and compare the change to known thresholds. Audio measurements provide a method that is straightforward and perfect.
Sampling frequency, resolution (bit depth)
The sampling rate shows how many samples are taken per second during digital conversion. The sampling frequency determines the frequency range that can be recorded. The highest frequency is half the sampling frequency.
If the sampling frequency is at least twice the frequency of the highest component of the original signal, then the original signal can be perfectly reconstructed from the sampled values. This means that a sampling frequency of 40 kHz is required to 'sample' the 0-20 kHz range. Prior to sampling and during sample rate conversions, we have to get rid of components above half the sampling frequency. Low-pass filters (resampling, anti-aliasing) are used for this task. The DA conversion also includes smoothing low-pass filters. Since these filters have some bandwidth requirement, the actual sampling frequency is a bit higher than the theoretical value.
Bottom line: at a sampling frequency of 44.1 kHz, the original signal can be reconstructed from the sample points up to 20 kHz with correct amplitude and phase. This applies to all kinds of signals.
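The reconstruction claim above can be checked numerically. The following is a minimal sketch, not a model of any real DA converter: it samples a 19 kHz tone at 44.1 kHz and then estimates the signal between the sample instants with a truncated, Hann-tapered sinc kernel. The test frequency, the phase, the kernel half-width and the function names are all illustrative assumptions.

```python
import math

FS = 44_100.0     # sampling frequency in Hz
F0 = 19_000.0     # test tone in Hz, chosen below FS / 2
PHASE = 0.3       # arbitrary starting phase in radians
HALF_WIDTH = 200  # kernel half-width in samples (illustrative choice)


def tone(t):
    """Value of the continuous test tone at time t, measured in sample periods."""
    return math.sin(2 * math.pi * F0 * t / FS + PHASE)


def reconstruct(samples, t):
    """Estimate the signal at fractional position t from integer-time samples."""
    center = math.floor(t)
    total = 0.0
    for n in range(center - HALF_WIDTH, center + HALF_WIDTH + 1):
        d = t - n
        sinc = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
        # Hann taper so the truncated sinc kernel fades out smoothly.
        taper = 0.5 * (1.0 + math.cos(math.pi * d / (HALF_WIDTH + 1)))
        total += samples[n] * sinc * taper
    return total


def max_reconstruction_error():
    """Largest deviation from the true tone at 50 positions between samples."""
    samples = [tone(n) for n in range(1000)]
    points = [300.0 + 0.137 * k for k in range(50)]
    return max(abs(reconstruct(samples, t) - tone(t)) for t in points)


print(f"largest error between sample points: {max_reconstruction_error():.2e}")
```

The remaining error comes from truncating the kernel, not from the sampling itself; a longer kernel should reduce it further.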
Bit-depth determines the noise floor and the dynamic range. Thanks to the plethora of conversion methods and how the human ear perceives low level noise, traditional formulas (DR = n * 6.02 decibels; n = number of bits) are incorrect and unusable. The real dynamic range of a digital system or an audio file is usually higher than n * 6.02 decibels and higher than the measured SNR (Signal-To-Noise Ratio).
Monty Montgomery (Xiph.org) wrote a detailed article in 2012 on why 24bit / 96kHz and 24bit / 192kHz file downloads don't make sense (24/192 Music Downloads ... and why they make no sense). The article was removed from Xiph.org a few years ago, but it is still available in the web archive (link➚). Monty also created a short video about PCM encoding. In the video different analog signals are converted to digital and then converted back to analog. The reconstructed signal is completely analog, there are no stair steps in the waveform. In other words, the output of a properly dithered and filtered digital system is indistinguishable from a band-limited pure analog system with the same noise floor (video on YouTube➚, text of the video➚).
Origin of high-resolution audio
The use of bit-depths higher than 16 bits and sampling frequencies higher than 44.1 kHz is not new. The oldest audio file formats (WAV, AIFF) have supported non-standard bit-depths and sample rates since their release. However, the spread of high-resolution recordings has been held back by a number of factors: the small size of hard drives, the lack of drivers and hardware and the hype surrounding DSD around 2000.
Bit-depths greater than 16 bits and sampling frequencies higher than 44.1 kHz were originally used for recording, editing and mixing. The most common recording formats: 24b/44.1k, 24b/48k and 24b/96k. PCM conversions of DSD recordings typically use a sample rate of 88.2 kHz.
About the testing methods and arguments
Debunking the claims that support high-resolution audio is incredibly simple. Even high-resolution recordings are not required.
We must first gather all the arguments that support high-resolution audio and select those that come from a misunderstanding of how PCM encoding works. Most of the arguments are nothing but misconceptions (formally these are straw man fallacies). The next step is to examine the limits of 16bit/44.1kHz: frequency response, noise floor (dynamic range) and impulse response (or phase shift; they are two sides of the same coin). We don't have to bother with nonlinear distortion, as well-designed resampling filters and proper quantization don't cause audible distortion. As a last step, the limits should be compared with the limits of human hearing. That's all.
So we have a real proof, as opposed to the ‘black box’ approach of traditional comparative listening tests. Maybe it’s no surprise that 'listening tests' are called tests and not proof. A real proof can only be based on arguments, and not mere correlations. Conventional listening tests carry the age-old problems of statistics (false cause ➚ (confusion of correlation and causation, correlation does not imply causation); the problem of hidden variables; faulty generalization from flawed tests or small number of tests).
Misconceptions (PCM encoding)
Usually the following misconceptions are used as arguments in favour of high-resolution audio:
- Waveform fallacy: "The output waveform of DA converters is not continuous, therefore by increasing the bit-depth and sampling frequency the output wave becomes more and more 'analog', accurate etc.";
- Sampling fallacy: "The time resolution of the sampled signal is the sampling period." This leads to the false conclusion that "higher sampling rate offers higher resolution in time.";
- Quantization fallacy: "Digital can't represent analog values below the least significant bit." This leads to the false conclusion that "dynamic range of 16-bit is 96 decibels";
- Time smearing fallacy: "At a sampling frequency of 44.1 kHz, digital filters (anti-aliasing, resampling filters) cause audible distortion (pre-ringing in the impulse response is audible)."
All of these claims are false: the signal at the output of DA converters is completely analog, the time resolution of a digital system is independent of the sampling frequency, and the dynamic range of 16 bits with noise shaping can reach 120 decibels (and 96 decibels is also huge). Linear phase filters have linear phase in their passband, so at a sampling frequency of 44.1 kHz the filter adds only a constant delay between 20 Hz and 20 kHz, with no phase distortion. The frequency of the pre-ringing in the impulse response is 22 kHz (always higher than 21 kHz), so the time smearing is a completely made-up story.
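The quantization fallacy in particular is easy to test. The sketch below is only an illustration under stated assumptions: a 1 kHz tone with an amplitude of a quarter of one quantization step is rounded to whole steps, once without dither and once with triangular (TPDF) dither. It demonstrates dither only, not noise shaping; the tone, the amplitude, the seed and the function names are hypothetical choices.

```python
import math
import random

FS = 44_100           # sampling frequency in Hz
F0 = 1_000            # test tone in Hz
AMPLITUDE_LSB = 0.25  # tone amplitude, a quarter of one quantization step


def quantize(num_samples, dither, seed=0):
    """Round a sub-LSB sine to whole quantization steps, optionally with TPDF dither."""
    rng = random.Random(seed)
    out = []
    for n in range(num_samples):
        x = AMPLITUDE_LSB * math.sin(2 * math.pi * F0 * n / FS)
        if dither:
            # The sum of two uniform values gives triangular dither spanning +/-1 LSB.
            x += rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
        out.append(round(x))
    return out


def tone_amplitude(samples):
    """Estimate the amplitude of the F0 component by projecting onto the reference sine."""
    acc = sum(s * math.sin(2 * math.pi * F0 * n / FS) for n, s in enumerate(samples))
    return 2.0 * acc / len(samples)


plain = quantize(FS, dither=False)
dithered = quantize(FS, dither=True)
print("without dither:", tone_amplitude(plain))
print("with TPDF dither:", round(tone_amplitude(dithered), 3))
```

Without dither every sample rounds to zero and the tone is lost. With dither the measured amplitude should land close to the original 0.25 LSB, buried in noise but still present.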
The upper limit of human hearing
Usually people can't hear tones above 20 kHz. This is true for almost everyone - and for everyone over the age of 25. An extremely small group of people is able to hear tones above 20 kHz under experimental conditions. These people are under 25 years old. But as far as audio reproduction and sampling frequency are concerned, hearing tones above 20 kHz doesn't matter.
The upper limit of hearing varies not only from individual to individual and by age, but also depends on the amplitude (sound pressure) of the test signal: the highest audible frequency for a test tone at 100 decibels is higher than for a test tone at 80 decibels. In high frequency hearing tests the sound pressure of the test signal around the hearing threshold is well above the sound pressure values found in live music and movies. The test signal can reach 110 decibels, while the maximum sound pressure level in music at 20 kHz is approx. 85 decibels (crash cymbal). The normal hi-hat level in records is about 60 dBSPL and instruments (brass instruments, violins) hardly produce more than 60 dBSPL in this range. In summary, even if someone can hear test tones up to 26 kHz, he or she will not hear harmonics above 20-22 kHz.
Properties of 16bit/44.1 kHz
Without noise shaping the dynamic range is about 103 decibels, with noise shaping it is about 120 decibels. Dynamic range of 16 bits is huge and completely covers the range required for any type of audio reproduction. In a hi-fi system, the peak SPL and thus the maximum dynamics is approx. 110 decibels.
Completely flat frequency response up to at least 20 kHz. If we don't mind a few dB attenuation, the upper limit is 21 kHz.
Transient response (impulse response, phase response)
Linear phase shift, zero group delay distortion up to at least 21 kHz. This means that the impulse response is perfect in the audible range. There is a small inaudible ringing (pre-ringing) near 22 kHz.
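A short numerical check of the linear-phase property, offered as a sketch and not as the filter of any particular converter: a symmetric, Hann-windowed sinc low-pass is built, the constant delay of its centre tap is removed, and the remaining phase is printed at a few audio frequencies. The 21 kHz cutoff, the tap count and the function names are assumptions made for illustration.

```python
import cmath
import math

FS = 44_100.0      # sampling frequency in Hz
CUTOFF = 21_000.0  # illustrative low-pass cutoff in Hz
HALF_LEN = 100     # taps on each side of the centre tap (illustrative)


def lowpass_taps():
    """Symmetric Hann-windowed sinc low-pass, normalised to unity gain at DC."""
    fc = CUTOFF / FS  # cutoff in cycles per sample
    taps = []
    for n in range(-HALF_LEN, HALF_LEN + 1):
        sinc = 2 * fc if n == 0 else math.sin(2 * math.pi * fc * n) / (math.pi * n)
        hann = 0.5 * (1.0 + math.cos(math.pi * n / (HALF_LEN + 1)))
        taps.append(sinc * hann)
    gain = sum(taps)
    return [t / gain for t in taps]


def delay_compensated_response(taps, freq_hz):
    """Frequency response with the constant HALF_LEN-sample delay removed."""
    w = 2 * math.pi * freq_hz / FS
    return sum(h * cmath.exp(-1j * w * (k - HALF_LEN)) for k, h in enumerate(taps))


TAPS = lowpass_taps()
for f in (20.0, 1_000.0, 10_000.0, 20_000.0):
    r = delay_compensated_response(TAPS, f)
    print(f"{f:8.0f} Hz  gain {abs(r):.4f}  residual phase {cmath.phase(r):+.1e} rad")
```

Because the taps are symmetric, the only phase the filter contributes is the fixed delay of HALF_LEN samples, which is the same for every frequency.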
In the human hearing range, perfect frequency response and transient response can be achieved with a sampling frequency of 44.1 kHz. Quantization noise of 16b/44k becomes audible when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 120 dBSPL. All in all, 16b/44k is perfect for sound reproduction and distributing studio quality recordings.
About the recording formats
16b/44k is perfect for distribution, but what about recording formats? It seems that 24b/96k became the standard in music production, but does it mean any improvement over 16b/44k?
Effects and editing operations produce two kinds of distortion: arithmetic distortion (quantization distortion) and aliasing distortion. Arithmetic distortion can be reduced by increasing the bit-depth, aliasing distortion can be reduced by increasing the sample rate. All effects generate arithmetic distortion, but very few effects generate aliasing distortion. So sample rates above 48 kHz usually are not useful in recording formats.
The largest group of effects only uses linear operations (linear effects). These are: equalizers (filters), reverb, volume change, fade-in, fade-out. Linear effects and linear editing operations only generate arithmetic distortion. This means that the quality of linear effects and linear editing operations is not affected by the sampling frequency. For an EQ or a reverb, 44.1 kHz is as good as 96 kHz. However, linear effects require a bit-depth higher than 16 bits to keep the distortion products below 120 decibels. Keeping the files in 24-bit format can prevent the accumulation of noise due to continuous requantization in the DAW (audio editor).
Nonlinear effects (e.g. guitar distortion) are sensitive to the sampling frequency, however, they all use internal oversampling, so the use of a higher sampling frequency has no benefit. Modulation effects (chorus, flanger) may sound a bit cleaner at higher sample rates, because smooth time stretching requires correct intersample calculation.
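The difference between linear and nonlinear operations can be shown with a few lines of arithmetic. This is a simplified sketch: a plain gain change stands in for a linear effect and cubing the signal stands in for a distortion effect, which is a crude assumption and not a model of any real plug-in. The 15 kHz test tone and the helper name are also illustrative.

```python
import cmath
import math

FS = 44_100  # sampling frequency in Hz
F0 = 15_000  # input tone in Hz
N = FS       # one second of audio, so every whole-Hz frequency fits the window exactly


def amplitude_at(samples, freq_hz):
    """Amplitude of one frequency component, measured with a single-bin DFT."""
    acc = sum(s * cmath.exp(-2j * math.pi * freq_hz * n / FS) for n, s in enumerate(samples))
    return 2.0 * abs(acc) / len(samples)


tone = [math.sin(2 * math.pi * F0 * n / FS) for n in range(N)]
scaled = [0.5 * s for s in tone]  # linear operation: plain gain change
cubed = [s ** 3 for s in tone]    # nonlinear operation: crude stand-in for distortion

# Cubing creates a third harmonic at 45 kHz, which folds back to 45 000 - 44 100 = 900 Hz.
ALIAS_HZ = 3 * F0 - FS
print("900 Hz after gain change:", amplitude_at(scaled, ALIAS_HZ))
print("900 Hz after cubing:     ", amplitude_at(cubed, ALIAS_HZ))
```

The gain change leaves 900 Hz empty, while cubing puts a component there with a quarter of the input amplitude, since sin³(x) = (3 sin x - sin 3x) / 4. Running the nonlinear step at a higher internal rate and filtering before returning to 44.1 kHz is what internal oversampling does.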
In summary, there is no point in claiming that 24b/96k is a better recording format than 16b/44k or 24b/44k. For live recordings 16b/44k is sufficient, while 24b/44k may provide lower final noise for an effects-packed multi-track recording.
Although recordings in the studio are stored in 24-bit files, the real bit-depth of recordings often does not reach 16 bits and the remaining bits only contain noise. There are no 24-bit recordings, nor 20-bit or 19-bit ones. Even 18 and 17-bit recordings are very rare and with noise shaping they can be transferred to 16 bits (19-bit dynamics can be transferred to 16 bits with noise shaping). So, modern 24-bit recordings are still 16-bit recordings. The dynamic range of analog recordings is even lower, as analog tape recorders can only capture 13-bit dynamic range. It means that analog recordings could be distributed in 13 bits without compromising playback fidelity. The main role of 24-bit in recording formats is to ensure the accumulation of conversion errors is small enough to be completely irrelevant.
Format discrimination studies
There are several perceptual studies (listening tests) about format discrimination. Some studies show that there is no perceptual difference between high resolution and standard formats, while others show that there is a small, but (in)significant difference (investigation of bit-depth and sample rate have to be separated). The problem lies in the statistical nature of these tests, which makes it difficult to interpret small differences. I don't think anyone could be 'convinced' by a test in which the participants can discriminate between different sample rates, but the answers are only slightly better than chance.
I don't know if the statistical approach is the source of all problems, but when format discrimination studies focus on statistical calculations rather than analog and digital signal behaviour, then we can expect more false positive results in the future. Sometimes the authors have only rudimentary knowledge of PCM encoding and they justify their results with a typical misconception (e.g. "higher sample rates have better time resolution"). In such situations the authors don't question their data and methods, because they think they found something, and the mistake remains unrevealed.
Side effects of ultrasounds can easily falsify the results. Distortion in the audio chain (amplifiers, speakers) and clipping (ADC, DAC) may lead to false detection of higher sample rates. Thermal compression in tweeters can be higher with ultrasonic content. A 2 % rise in DC resistance is 0.17 dB loss in SPL, which can be detected with sustained or slowly changing tones (e.g. violin). In this respect, tweeters with a small voice coil and neodymium magnet are worse than dome tweeters with a 1" voice coil and a large ferrite magnet.
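The 0.17 dB figure can be reproduced with one line of arithmetic. The sketch below assumes a constant drive voltage and a purely resistive voice coil, so that the sound pressure falls in proportion to the rise in resistance; that simplification and the function name are assumptions made here, not something stated in the text above.

```python
import math


def spl_loss_db(resistance_rise):
    """Level drop for a fractional rise in voice-coil resistance (0.02 means 2 %)."""
    # Assumption: constant drive voltage and a purely resistive coil, so the
    # current, and with it the sound pressure, scales with 1 / (1 + rise).
    return 20.0 * math.log10(1.0 + resistance_rise)


print(f"{spl_loss_db(0.02):.2f} dB")  # prints 0.17 dB
```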
It's worth mentioning that format discrimination studies never use null tests. A null test is more accurate and more convenient than traditional listening tests. In a null test we listen to the difference between two versions of a file (so we are not comparing A to B, we are simply listening to [A - B]). As a null test doesn't use comparison, it doesn't rely on auditory memory, and a blind test (ABX) is not required. The test procedure is a simple sound detection in silence, and this is a huge benefit. The null test is like examining things under a magnifying glass: it magnifies small differences a thousandfold. However, this is also a disadvantage, as a null test is much more susceptible to errors, especially errors that may lead to false positive results (for example, low-level intermodulation distortion that would be masked in normal tests can falsify the results).
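In the digital domain, the [A - B] signal of a null test is a sample-by-sample subtraction. A minimal sketch follows, assuming both versions are already time-aligned, have the same length and are scaled to the range -1..1. The 440 Hz test signal, the dithered 16-bit conversion used as version B and the function names are illustrative choices only.

```python
import math
import random

FULL_SCALE_16 = 32_767  # largest positive 16-bit sample value


def to_16_bit(samples, seed=0):
    """Requantize float samples (range -1..1) to 16-bit steps with TPDF dither."""
    rng = random.Random(seed)
    out = []
    for x in samples:
        dither = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
        out.append(round(x * FULL_SCALE_16 + dither) / FULL_SCALE_16)
    return out


def null_residual_db(a, b):
    """RMS level of the sample-by-sample difference A - B, in dB re full scale (1.0)."""
    diff_power = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return 10.0 * math.log10(diff_power) if diff_power > 0 else float("-inf")


original = [0.5 * math.sin(2 * math.pi * 440 * n / 44_100) for n in range(44_100)]
converted = to_16_bit(original)
print(f"null-test residual: {null_residual_db(original, converted):.1f} dB")
```

The residual here is only the dither and quantization noise of the 16-bit conversion. Listening to that difference signal on its own, at a realistic playback gain, is the actual test.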
Music with extra amount of noise
High-resolution downloads don't make sense, but there is a different reason why distributing audio in high-res formats is really stupid. A recording in 24-bit / 88.2 kHz WAV is three times larger than in 16-bit / 44.1 kHz WAV. The ratio is slightly worse with FLAC files. And what’s the plus in a high resolution recording? Noise. A huge amount of inaudible, 'functionless' noise.
In 16-bit files the amount of recording noise is negligible compared to the file size. In a 24/96 (or 24/88.2) recording, the size of the inaudible and completely unnecessary recording noise is roughly equal to the size of a 16 bit / 44.1 kHz WAV file of the same length. That's a pretty big waste. Another problem is that noise cannot be compressed with lossless tools, which is also evident in the huge size of high-resolution FLAC files. The average bitrate of a 24/96 recording in FLAC is approx. 2.5 Mbps (Megabits per second), of which 1.4 Mbps is unnecessary recording noise. That is, more than half of a 24/96 FLAC file consists of inaudible noise...
High-resolution music streaming is noise pollution. It takes away bandwidth and energy and in return gives nothing, especially when we can stream in studio quality at 130 kbps with AAC and Opus.
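The size ratios quoted above follow directly from the uncompressed PCM bitrate, which is bit-depth times sample rate times channel count. A small sketch, assuming two-channel (stereo) audio and ignoring file headers; the function name is illustrative.

```python
def pcm_bitrate(bit_depth, sample_rate_hz, channels=2):
    """Uncompressed PCM bitrate in bits per second (channels=2 assumes stereo)."""
    return bit_depth * sample_rate_hz * channels


cd = pcm_bitrate(16, 44_100)
hires_88 = pcm_bitrate(24, 88_200)
hires_96 = pcm_bitrate(24, 96_000)

print(cd)                       # 1411200 bits per second, about 1.4 Mbps
print(hires_88 / cd)            # 3.0
print(round(hires_96 / cd, 2))  # 3.27
```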
References (hearing, high-frequency auditory thresholds):
K. R. Henry and G. A. Fast, "Ultrahigh-frequency auditory thresholds in young adults: Reliable responses up to 24 kHz with a quasi-free-field technique", 1984
A. Rodríguez Valiente et al., "Extended high-frequency (9 – 20 kHz) audiometry reference thresholds in 645 healthy subjects", Int J Audiol., April 2014
Kaoru Ashihara et al., "Threshold of hearing in free field for high-frequency tones from 1 to 20 kHz", 2003
Kaoru Ashihara et al., "Hearing threshold for pure tones above 20 kHz", 2005
Kaoru Ashihara, "Hearing threshold for pure tones above 16 kHz", 2007