High-resolution audio vs. 16 bit / 44.1kHz

Investigation of the optimal sampling frequency and bit depth of distribution audio formats.

Nov. 2, 2021

There is a huge amount of misinformation on the net about the optimal sampling frequency and resolution (bit depth) of audio files for distribution. The CD (Compact Disc) was released forty years ago, but digital audio is still a chaotic subject, mainly dominated by misconceptions.

More and more digital music stores offer recordings in 24bit/96kHz or other ‘hi-res’ formats (MQA, DSD). The hype around high-resolution formats in the media also keeps growing. However, promises remain promises, as bit-depths higher than 16 bits or sampling frequencies higher than 44.1 kHz will not improve the playback fidelity.

High-resolution music and hi-res players are often advertised with misleading descriptions. False representations of pulse-code modulation, the use of old digital stereotypes, and fundamental misconceptions exploited for advertising purposes are very common. The marketing term "CD quality" is also misleading, mainly because it suggests that quality is proportional to file size or bitrate. Since 16bit/44.1kHz provides the same playback fidelity as 24bit/96kHz, studio quality recordings can be distributed in 16/44.1 formats. 16/44.1 is studio quality.

General misconceptions

Perhaps the biggest problem with digital audio is that it’s easy to create theories that seem true and are extremely convincing, which makes their promotion simple, yet they have nothing to do with reality. There is no error in the logic of the reasoning; however, the basic assumptions (premises) are false, and if the initial statements are false, the whole theory collapses (these false theories are very similar to the straw man fallacy). This is true for the vinyl hype, DSD, high-resolution audio and MQA encoding. Each of them starts from a wrong premise (digital signals are square waves, the effect of filters is audible, etc.).

False analogies are another large group of fallacies (e.g., misleading illustrations of PCM encoding in high-resolution ads). The real appeal of false theories and analogies is their simplicity, as they provide simpler explanations than the real answers. However, the real problem is long-term confirmation bias and self-deception, which can only be avoided if we recognize mistakes in time.

It is a common mistake in hi-fi to shift the boundary between subjective and objective, that is, to claim that an objective phenomenon or characteristic is subjective. It is also a common trick to present a simple acoustic or electrical phenomenon as if there were no rational explanation for it. In amplifiers and in digital audio, the problems are quite straightforward and there are no subjective elements. All we need to do is look at how, and by how much, the signal changes after passing through a component, and compare the change to known thresholds. Audio measurements provide a straightforward and reliable method for this.

Sampling frequency, resolution (bit depth)

The sampling rate shows how many samples are taken per second during digital conversion. The sampling frequency determines the frequency range that can be recorded. The highest frequency is half the sampling frequency.

If the sampling frequency is at least twice the frequency of the highest component of the original signal, then the original signal can be perfectly reconstructed from the sampled values. This means that a sampling frequency of 40 kHz is required to 'sample' the 0-20 kHz range. Prior to sampling and during sample rate conversions, we have to get rid of components above half the sampling frequency. Low-pass filters (resampling, anti-aliasing) are used for this task. The DA conversion also includes smoothing low-pass filters. Since these filters have some bandwidth requirement, the actual sampling frequency is a bit higher than the theoretical value.

Bottom line: at a sampling frequency of 44.1 kHz, the original signal can be reconstructed from the sample points up to 20 kHz with correct amplitude and phase. This applies to all kinds of signals.
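The reconstruction claim can be checked numerically. The sketch below is only an illustration and does not model any real DA converter: it samples a 19 kHz sine at 44.1 kHz and rebuilds the value halfway between two samples with a Hann-windowed sinc interpolator. The function names, the choice of window and the 500-sample half-width are assumptions made for this example.

```python
import math

FS = 44100.0      # sampling frequency in Hz
TONE = 19000.0    # test tone in Hz, below FS / 2

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sample(n):
    """Value of the test tone at sample index n."""
    return math.sin(2.0 * math.pi * TONE * n / FS)

def reconstruct(t, half_width=500):
    """Estimate the signal at time t (seconds) using only nearby sample values."""
    u = t * FS                      # position measured in sample periods
    centre = round(u)
    total = 0.0
    for n in range(centre - half_width, centre + half_width + 1):
        d = u - n
        if abs(d) >= half_width:
            continue                # outside the Hann window
        window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
        total += sample(n) * sinc(d) * window
    return total

t_mid = 100.5 / FS                  # halfway between samples 100 and 101
true_value = math.sin(2.0 * math.pi * TONE * t_mid)
rebuilt = reconstruct(t_mid)
error = abs(rebuilt - true_value)
print(f"true {true_value:+.6f}  reconstructed {rebuilt:+.6f}  error {error:.2e}")
```

The point between the samples is recovered from the sample values alone, even though the tone has only about 2.3 samples per period.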

Bit-depth determines the noise floor and the dynamic range. Thanks to the plethora of conversion methods and how the human ear perceives low level noise, traditional formulas (DR = n * 6.02 decibels; n = number of bits) are incorrect and unusable. The real dynamic range of a digital system or an audio file is usually higher than n * 6.02 decibels and higher than the measured SNR (Signal-To-Noise Ratio).
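For reference, this is what the traditional rule of thumb quoted above yields for a few bit-depths that come up in this article. The snippet only evaluates n * 6.02; as the paragraph notes, the real dynamic range with dither and noise shaping is usually higher, so these numbers are a baseline, not a limit. The function name is made up for the example.

```python
def traditional_dynamic_range_db(bits):
    """Rule-of-thumb dynamic range: 6.02 dB per bit."""
    return 6.02 * bits

for bits in (10, 13, 16, 24):
    print(f"{bits:2d} bits -> {traditional_dynamic_range_db(bits):6.2f} dB")
```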

Monty Montgomery (Xiph.org) wrote a detailed article in 2012 on why 24bit/96kHz and 24bit/192kHz file downloads don't make sense (24/192 Music Downloads ... and why they make no sense). The article was removed from Xiph.org a few years ago, but it is still available in the web archive (link). Monty also created a short video about PCM encoding. In the video, different analog signals are converted to digital and then converted back to analog. The reconstructed signal is completely analog; there are no stair steps in the waveform. In other words, the output of a properly dithered and filtered digital system is indistinguishable from a band-limited pure analog system with the same noise floor (video on YouTube, text of the video).

Resolution is a misnomer
Resolution is a misnomer in digital audio. The number of bits defines a well-defined noise floor and nothing else. If 'resolution' means 'bit-depth', then the use of the word is acceptable; however, audio signals don't have 'resolution' or 'accuracy'. Does it follow that 'high-resolution audio' is also a misnomer? (Well-known misnomers: jellyfish, cryptocurrency...)

Origin of high-resolution audio

The use of bit-depths higher than 16 bits and sampling frequencies higher than 44.1 kHz is not new. The oldest audio file formats (WAV, AIFF) have supported non-standard bit-depths and sample rates since their release. However, the spread of high-resolution recordings was held back by a number of factors: the small size of hard drives, the lack of drivers and hardware, and the hype surrounding DSD around 2000.

Bit-depths greater than 16 bits and sampling frequencies higher than 44.1 kHz were originally used for recording, editing and mixing. The most common recording formats: 24bit/44.1kHz, 24bit/48kHz and 24bit/96kHz. PCM conversions of DSD recordings typically use a sample rate of 88.2 kHz.

About the testing methods and arguments

Debunking the claims that support high-resolution audio is not rocket science. We must first gather all the arguments that support high-resolution audio and select those that come from a misunderstanding of how PCM encoding works. Most of the arguments are nothing but misconceptions (formally these are straw man fallacies). The next step is to examine the limits of 16bit/44.1kHz: frequency response, noise floor (dynamic range) and impulse response (or phase shift; they are two sides of the same coin). We don't have to bother with nonlinear distortion, as well-designed resampling filters and proper quantization don't cause audible distortion. As a last step, these limits should be compared with the limits of human hearing. That's all.

So we have a real proof, as opposed to the ‘black box’ approach of traditional comparative listening tests. Maybe it’s no surprise that 'listening tests' are called tests and not proofs. A real proof can only be based on arguments, not on mere correlations. Conventional listening tests carry the age-old problems of statistics: false cause (confusing correlation with causation), hidden variables, and faulty generalization from flawed tests or from a small number of tests.

Misconceptions (PCM encoding)

Usually the following misconceptions are used as arguments in favour of high-resolution audio:

All of these claims are false: the signal at the output of a DA converter is an analog replica of the original, the time resolution of a digital system is independent of the sampling frequency, and the dynamic range of 16 bits with noise shaping can reach 120 decibels (and 96 decibels is also huge). Linear phase filters have linear phase in their passband, so at a sampling frequency of 44.1 kHz there is no phase distortion between 20 Hz and 20 kHz. The frequency of the pre-ringing in the impulse response is 22 kHz (always higher than 21 kHz), so the time smearing is a completely made-up story.
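The point about time resolution lends itself to a small numerical check. The sketch below is an illustration under simplifying assumptions (a single 1 kHz tone, plain rounding to 16 bits without dither, and made-up function names): it delays the tone by 1 microsecond, far less than the 22.7 microsecond sample period, and then recovers that delay from the 16-bit samples alone.

```python
import math

FS = 44100.0          # sampling frequency in Hz
TONE = 1000.0         # test tone in Hz
N = 441               # exactly 10 periods of a 1 kHz tone at 44.1 kHz
DELAY = 1e-6          # 1 microsecond, a small fraction of one sample period

def quantize16(x):
    """Round to the nearest 16-bit step (no dither, for simplicity)."""
    return round(x * 32767) / 32767

def delayed_tone(delay):
    """16-bit samples of a sine that starts `delay` seconds late."""
    return [quantize16(0.9 * math.sin(2 * math.pi * TONE * (n / FS - delay)))
            for n in range(N)]

def estimate_delay(samples):
    """Recover the delay from the phase of the tone (correlation with sin and cos)."""
    s = sum(x * math.sin(2 * math.pi * TONE * n / FS) for n, x in enumerate(samples))
    c = sum(x * math.cos(2 * math.pi * TONE * n / FS) for n, x in enumerate(samples))
    phase_lag = math.atan2(-c, s)
    return phase_lag / (2 * math.pi * TONE)

estimated = estimate_delay(delayed_tone(DELAY))
print(f"sample period   : {1e6 / FS:.2f} us")
print(f"estimated delay : {estimated * 1e6:.4f} us")
```

The timing information sits in the sample values, not in the spacing of the sampling instants, which is why a shift much smaller than one sample period is still recoverable.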

Sample rate and filter ringing
Is there any situation where we can hear the pre-ringing in the impulse response or in transient signals (e.g. with castanets)? Yes, if the sample rate is lower than 32 kHz AND the impulse response of the resampling filter is 'too' long. However, if the impulse response is short (shorter than approx. 6 ms), then the pre-ringing is masked by the impulse itself (this is called temporal pre-masking, which is on the order of 2-3 ms for pulse-like sounds). This means that filter pre-ringing is not an issue even at sample rates lower than 32 kHz. Apart from noise floor and noise shaping possibilities, the only difference between 22.05 kHz and 44.1 kHz sample rates is that 44.1 kHz has twice the bandwidth.

The upper limit of human hearing

Usually people can't hear tones above 20 kHz. This is true for almost everyone - and for everyone over the age of 25. An extremely small group of people under the age of 25 is able to hear tones above 20 kHz under experimental conditions. But as far as audio reproduction and sampling frequency are concerned, hearing tones above 20 kHz doesn't matter.

The upper limit of hearing varies not only from individual to individual and by age, but also depends on the amplitude (sound pressure) of the test signal: the highest audible frequency for a test tone at 100 decibels is higher than for a test tone at 80 decibels. In high frequency hearing tests the sound pressure of the test signal around the hearing threshold is well above the sound pressure values found in live music and movies. The test signal can reach 110 decibels, while the maximum sound pressure level in music at 20 kHz is approx. 85 decibels (crash cymbal). The normal hi-hat level in records is about 60 dBSPL and instruments (brass instruments, violins) hardly produce more than 60 dBSPL in this range. In summary, even if someone can hear test tones up to 26 kHz, he or she will not hear harmonics above 20-22 kHz.

Properties of 16bit/44.1kHz

Dynamic range
The dynamic range of 16bit/44.1kHz is huge and completely covers the range required for any type of audio reproduction. In a hi-fi system, the peak SPL and thus the maximum dynamics is approx. 110 decibels. The quantization noise of 16-bit audio with standard dither becomes audible when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 105 dBSPL. With noise shaping the gain (volume) can be set about 18 decibels higher.

Frequency response
A 44.1 kHz sampling frequency provides a flat frequency response up to at least 20 kHz. If we don't mind a few dB of attenuation, the upper limit is 21 kHz.

Transient response (impulse response, phase response)
Linear phase shift, zero group delay distortion up to at least 21 kHz. This means that the impulse response is perfect in the audible range. There is a small inaudible ringing (pre-ringing) near 22 kHz.

In the human hearing range, perfect frequency response and transient response can be achieved with a sampling frequency of 44.1 kHz. With noise shaping, the quantization noise of 16b/44k becomes audible only when the gain is set so that the sound pressure level of a full-scale sinusoid exceeds 120 dBSPL. All in all, 16b/44k is perfect for sound reproduction and for distributing studio quality recordings.

In fact, 16bit/44.1kHz is a bit overkill

If we look more closely, 16bit/44.1kHz not only provides transparency, it is overkill for most people, and overkill for everyone most of the time, considering typical musical events and common playback levels. High quality music could be distributed even in 14bit/32kHz and the difference would be minor at most.

The dynamic range of 16-bit is huge
Compressed pop and rock music could be distributed in 8-10 bits without compromising playback fidelity, because the threshold of hearing is not constant, but changes with the loudness of the music. In terms of dynamic range, analog recording technology is equivalent to a 12 or 13-bit digital system, depending on quality. 13-bit provides high fidelity with any kind of music: for example, a 10-bit version of an arpeggio is a bit noisy, but the 13-bit version sounds the same as the 16-bit or the 24-bit version. The only issue with 13-bit is that the noise is audible during fade-outs and extremely quiet parts. There is a significant difference between 10 and 13 bits with non-compressed music, but between 13 and 16 bits? Hard to hear.

A wide variety of music could be released at 32 kHz without compromising playback fidelity
Only a few instruments can produce relatively high SPLs above 16 kHz. Cymbals, hi-hats, castanets, brass instruments, steel-string acoustic guitar and, in rare cases, violin benefit from a 44.1 kHz sample rate. Piano and cello works, chamber music and many classical tunes could be released at 32 kHz without compromising playback fidelity.

Hearing and age
As we get older, our ability to hear high pitched sounds decreases. People over 40 can't hear tones above 16 kHz, which means that 32 kHz sampling rate could deliver the same fidelity as 44.1 kHz for them.

Returning to dynamic range and bits, when we are listening to a digital transfer of an analog recording (CD, FLAC, MP3 version... or a YouTube video), then in terms of dynamic range, this is equivalent to listening to a 12-bit or a 13-bit digital system.

About the recording formats

16bit/44.1kHz is perfect for distribution, but what about recording formats? It seems that 24b/96k became the standard in music production, but does it mean any improvement over 16/44.1?

Effects and editing operations produce two kinds of distortion: arithmetic distortion (quantization distortion) and aliasing distortion. Arithmetic distortion can be reduced by increasing the bit-depth, aliasing distortion can be reduced by increasing the sample rate. All effects generate arithmetic distortion, but very few effects generate aliasing distortion. So sample rates above 48 kHz usually are not useful in recording formats.
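Aliasing distortion comes from components that a nonlinear operation creates above half the sampling frequency, which then fold back into the band. A minimal sketch of the folding arithmetic follows; the function name and the 15 kHz example tone are assumptions chosen for illustration.

```python
def alias_frequency(f, fs):
    """Frequency (Hz) at which a component at f appears after sampling at fs."""
    f = f % fs
    return fs - f if f > fs / 2 else f

fundamental = 15000.0
for fs in (44100.0, 96000.0):
    for k in (2, 3):
        harmonic = k * fundamental
        landed = alias_frequency(harmonic, fs)
        print(f"fs = {fs:.0f} Hz: harmonic {k} at {harmonic:.0f} Hz lands at {landed:.0f} Hz")
```

At 44.1 kHz the harmonics of a 15 kHz tone fold down into the audible band, while at 96 kHz they stay where they are, above 20 kHz. A linear operation creates no new frequencies, so it has nothing to fold.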

The largest group of effects only uses linear operations (linear effects). These are: equalizers (filters), reverb, volume change, fade-in, fade-out. Linear effects and linear editing operations only generate arithmetic distortion. This means that the quality of linear effects and linear editing operations is not affected by the sampling frequency. For an EQ or a reverb, 44.1 kHz is as good as 96 kHz. However, linear effects require a bit-depth higher than 16 bits to keep the distortion products below 120 decibels. Keeping the files in 24-bit format can prevent the accumulation of noise due to continuous requantization in the DAW (audio editor).
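The effect of the intermediate bit-depth can be sketched with a toy editing chain. The example below is an assumption-laden illustration, not a model of any particular DAW: it applies ten volume-down and volume-up pairs to a 997 Hz tone, rounds the result to the storage grid after every step (without dither), and measures how far the final signal has drifted from the original. All names and gain values are made up for the example.

```python
import math

def quantize(x, bits):
    """Round x (full scale = +/-1.0) to the nearest step of a `bits`-bit grid."""
    steps = 2 ** (bits - 1)
    return round(x * steps) / steps

def rms_error_after_edits(bits, rounds=10, n=2000):
    """Run gain-down/gain-up pairs, storing every intermediate result at `bits`."""
    original = [0.5 * math.sin(2 * math.pi * 997.0 * i / 44100.0) for i in range(n)]
    signal = [quantize(x, bits) for x in original]
    for k in range(rounds):
        gain = 0.5 + 0.04 * k                                    # a different gain each round
        signal = [quantize(x * gain, bits) for x in signal]      # volume down
        signal = [quantize(x / gain, bits) for x in signal]      # volume back up
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(signal, original)) / n)

for bits in (16, 24):
    err = rms_error_after_edits(bits)
    print(f"{bits}-bit intermediate storage: error {20 * math.log10(err):.1f} dB re full scale")
```

The net gain of the chain is 1, so any remaining error is purely arithmetic. With 24-bit intermediates it stays far below the noise floor of a 16-bit delivery file.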

Nonlinear effects (e.g. guitar distortion) are sensitive to the sampling frequency, however, they all use internal oversampling, so the use of a higher sampling frequency has no benefit. Modulation effects (chorus, flanger) may sound a bit cleaner at higher sample rates, because smooth time stretching requires correct intersample calculation.

In summary, there is no point in claiming that 24b/96k is a better recording format than 16b/44k or 24b/44k. For live recordings 16b/44k is sufficient, while 24b/44k may provide lower final noise for an effects-packed multi-track recording.

Although recordings in the studio are stored in 24-bit files, the real bit-depth of recordings often does not reach 16 bits and the remaining bits only contain noise. There are no 24-bit recordings, nor 20-bit or 19-bit ones. Even 18 and 17-bit recordings are very rare, and with noise shaping they can be transferred to 16 bits (19-bit dynamics can be transferred to 16 bits with noise shaping). So, modern 24-bit recordings are still 16-bit recordings. The dynamic range of analog recordings is even lower, as analog tape recorders can only capture a 13-bit dynamic range. This means that analog recordings could be distributed in 13 bits without compromising playback fidelity. The main role of 24-bit in recording formats is to ensure that the accumulation of conversion errors is small enough to be completely irrelevant.

Noise floor of 24-bit audiophile recordings (from 2L, BIS and AIX Records)
The blue line is the noise floor of a 16bit/44.1kHz system (or file) with standard dither.
The ear is most sensitive to noise at 4 kHz, and its sensitivity drops rapidly above 13 kHz.

Format discrimination studies

There are several perceptual studies (listening tests) about format discrimination. Some studies show that there is no perceptual difference between high resolution and standard formats, while others show that there is a small, but (in)significant difference (the investigation of bit-depth and sample rate has to be separated). The problem lies in the statistical nature of these tests, which makes it difficult to interpret small differences. I don't think anyone could be 'convinced' by a test in which the participants can discriminate between different sample rates, but the answers are only slightly better than chance.

I don't know if the statistical approach is the source of all problems, but when format discrimination studies focus on statistical calculations rather than analog and digital signal behaviour, then we can expect more false positive results in the future. Sometimes the authors have only rudimentary knowledge of PCM encoding and they justify their results with a typical misconception (e.g. "higher sample rates have better time resolution"). In such situations the authors don't question their data and methods, because they think they found something, and the mistake remains unrevealed.

Side effects of ultrasounds can easily falsify the results. Distortion in the audio chain (amplifiers, speakers) and clipping (ADC, DAC) may lead to false detection of higher sample rates. Thermal compression in tweeters can be higher with ultrasonic content. A 2 % rise in DC resistance is 0.17 dB loss in SPL, which can be detected with sustained or slowly changing tones (e.g. violin). In this respect, tweeters with a small voice coil and neodymium magnet are worse than dome tweeters with a 1" voice coil and a large ferrite magnet.
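The 0.17 dB figure follows from simple arithmetic if we assume a constant drive voltage and a purely resistive voice coil, so that the current, and with it the sound pressure, scales with 1/R. Both the assumption and the function name below are illustrative.

```python
import math

def level_change_db(resistance_ratio):
    """SPL change for a voltage-driven coil whose DC resistance is scaled by the ratio."""
    return -20.0 * math.log10(resistance_ratio)

# A 2 % rise in DC resistance
print(f"{level_change_db(1.02):.3f} dB")
```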

It's worth mentioning that format discrimination studies never use null tests. A null test is more accurate and more convenient than traditional listening tests. In a null test we listen to the difference between two versions of a file (so we are not comparing A to B, we are simply listening to [A - B]). As a null test doesn't use comparison, it doesn't rely on auditory memory, and a blind test (ABX) is not required. The test procedure is a simple sound detection in silence, and this is a huge benefit. The null test is like examining things under a magnifying glass: it magnifies small differences a thousandfold. However, this is also a disadvantage, as a null test is much more susceptible to errors, especially errors that may lead to false positive results (for example, low level intermodulation distortion that would be masked in normal tests can falsify the results).
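The [A - B] idea can be sketched in a few lines. In the example below, version A is a 997 Hz tone kept in floating point and version B is the same tone rounded to a 16-bit grid without dither. Both signals and the function name are assumptions made for illustration; a real null test would load two time-aligned, level-matched files instead.

```python
import math

def null_residual_dbfs(a, b):
    """RMS level of the sample-by-sample difference A - B, in dB re full scale (1.0)."""
    diff = [x - y for x, y in zip(a, b)]
    rms = math.sqrt(sum(d * d for d in diff) / len(diff))
    return float("-inf") if rms == 0 else 20.0 * math.log10(rms)

# Version A: one second of a 997 Hz tone at 44.1 kHz, floating point
version_a = [0.5 * math.sin(2 * math.pi * 997.0 * n / 44100.0) for n in range(44100)]
# Version B: the same tone rounded to a 16-bit grid (no dither, for simplicity)
version_b = [round(x * 32768) / 32768 for x in version_a]

residual = null_residual_dbfs(version_a, version_b)
print(f"null residual: {residual:.1f} dBFS")
```

The residual is the only thing the listener is asked to detect, which is why the procedure needs neither auditory memory nor an ABX protocol.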

Another concept which is often misunderstood is the effect of training. Training is a learning process, similar to learning the letters of an alphabet or learning the road signs. Learning improves 'pattern recognition', but the absolute limits of our perception don't change. The difference between a trained listener and a non-trained listener is that a trained listener knows what to look for, which part of the music to pay attention to.

Music with extra amount of noise

High-resolution downloads don't make sense, but there is a different reason why distributing audio in high-res formats is really stupid. A recording in 24-bit / 88.2 kHz WAV is three times the size of the same recording in 16-bit / 44.1 kHz WAV. The ratio is slightly worse with FLAC files. And what’s the plus in a high resolution recording? Noise. A huge amount of inaudible, 'functionless' noise.

Recording noise in FLAC files - 24-bit is a waste of space for nothing
(orange is noise, blue is the 'signal' above the noise floor!)

In 16-bit files the amount of recording noise is negligible compared to the file size. In a 24/96 (or 24/88.2) recording, the size of the inaudible and completely unnecessary recording noise is roughly equal to the size of a 16 bit / 44.1 kHz WAV file of the same length. That's a pretty big waste. Another problem is that noise cannot be compressed with lossless tools, which is also evident in the huge size of high-resolution FLAC files. The average bitrate of a 24/96 recording in FLAC is approx. 2.5 Mbps (Megabits per second), of which 1.4 Mbps is unnecessary recording noise. That is, more than half of a 24/96 FLAC file consists of inaudible noise...
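The size figures in this section are easy to reproduce. The snippet below assumes two-channel (stereo) uncompressed PCM for the WAV numbers and takes the 2.5 Mbps and 1.4 Mbps FLAC figures directly from the text; the function name is made up for the example.

```python
def pcm_bitrate_bps(bits, sample_rate_hz, channels=2):
    """Raw (uncompressed) PCM bitrate in bits per second; stereo is assumed by default."""
    return bits * sample_rate_hz * channels

cd = pcm_bitrate_bps(16, 44100)
hires = pcm_bitrate_bps(24, 88200)
print(f"16/44.1 stereo PCM : {cd / 1e6:.3f} Mbps")
print(f"24/88.2 vs 16/44.1 : {hires / cd:.1f}x")

# Figures quoted in the article for a 24/96 FLAC file
flac_total_mbps = 2.5
flac_noise_mbps = 1.4
noise_share = flac_noise_mbps / flac_total_mbps
print(f"noise share of the FLAC stream: {noise_share:.0%}")
```

The 1.4 Mbps attributed to noise is about the same as the raw bitrate of a 16/44.1 stereo stream, which matches the comparison made above.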

High-resolution music streaming is noise pollution. It takes away bandwidth and energy and gives nothing in return, especially when we can stream in studio quality at 130 kbps with AAC and Opus.

Csaba Horváth

Revision history:
  2021-01-28 - the 1 kHz slowly fading pure tone was changed to a 500 Hz slowly fading pure tone. 500Hz pure tone is more pleasant and natural.

References (hearing, high-frequency auditory thresholds):

K. R. Henry and G. A. Fast, "Ultrahigh-frequency auditory thresholds in young adults: Reliable responses up to 24 kHz with a quasi-free-field technique", 1984
A. Rodríguez Valiente et al., "Extended high-frequency (9 – 20 kHz) audiometry reference thresholds in 645 healthy subjects", Int J Audiol., April 2014
Kaoru Ashihara et al., "Threshold of hearing in free field for high-frequency tones from 1 to 20 kHz", 2003
Kaoru Ashihara et al., "Hearing threshold for pure tones above 20 kHz", 2005
Kaoru Ashihara, "Hearing threshold for pure tones above 16 kHz", 2007
