Perceptual Coding

A way of shrinking audio files by throwing away the sounds your ears would never have noticed anyway.

Figure: sounds that fall under the masking curve are inaudible, so the coder discards them; at 128 kbps the file ends up roughly 11x smaller than CD audio.

What it is

Lossy audio compression that uses a model of human hearing (a psychoacoustic model) to find and discard the sounds you cannot hear.

Key facts

Human hearing range: roughly 20 Hz to 20,000 Hz (20 kHz); upper limit drops with age, often ~15-16 kHz by 40.
CD audio (PCM) = 1411 kbps stereo (44,100 Hz x 16-bit x 2 channels). MP3 at 320 kbps is ~4.4x smaller; 128 kbps is ~11x smaller.
MP3 (MPEG-1 Layer III) standardised 1993; AAC standardised 1997; both use perceptual coding. At the same bitrate (e.g. 128 kbps) AAC generally sounds better than MP3.
Lossy compression is irreversible: discarded data is gone forever. Lossless (FLAC, ALAC) only saves ~40-60% and keeps every bit.
Absolute Threshold of Hearing: ~0 dB SPL at 1-4 kHz (most sensitive), rising steeply toward both extremes: roughly +70 to +80 dB SPL at 20 Hz, and tens of dB at 16-18 kHz.
Frequency masking: a loud tone hides quieter tones near it in pitch; the masking curve spreads wider upward in frequency than downward.
Temporal masking: a loud sound hides quieter sounds ~5-20 ms BEFORE it (pre-masking) and ~50-200 ms AFTER it (post-masking).
Hearing is non-linear: the ear groups frequencies into ~24 Critical Bands (Bark scale); masking is calculated band by band.
The Signal-to-Mask Ratio (SMR) of each band tells the coder how many bits that band needs: it keeps the Noise-to-Mask Ratio below 0 dB, i.e. quantisation noise sits UNDER the masking threshold so you never hear it.
0 dB SPL reference = 20 micropascals; threshold of pain ~120-130 dB SPL. Every +10 dB = ~2x perceived loudness.
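
Two of the quantities above can be estimated with short formulas. The sketch below assumes Terhardt's widely quoted approximation for the threshold in quiet and the Zwicker/Terhardt approximation for the Bark scale; the function names are illustrative, and real codecs use their own tables rather than these exact curves.

```python
import math


def threshold_in_quiet_db(freq_hz: float) -> float:
    """Approximate absolute threshold of hearing in dB SPL.

    Assumption: Terhardt's curve fit, which is only a rough model and
    gets very steep above ~15 kHz.
    """
    khz = freq_hz / 1000.0
    return (
        3.64 * khz ** -0.8
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )


def hz_to_bark(freq_hz: float) -> float:
    """Map frequency in Hz to critical-band rate in Bark.

    Assumption: the Zwicker/Terhardt arctangent approximation.
    """
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan(
        (freq_hz / 7500.0) ** 2
    )


if __name__ == "__main__":
    for f in (20, 100, 1000, 3300, 10000):
        print(f"{f:>6} Hz: threshold ~{threshold_in_quiet_db(f):6.1f} dB SPL, "
              f"{hz_to_bark(f):5.2f} Bark")
```

With these approximations the threshold dips slightly below 0 dB SPL near 3.3 kHz and climbs past +70 dB SPL at 20 Hz, while 20 kHz lands just above 24 Bark, which matches the "~24 critical bands" figure.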

How it works

Split the audio into short time frames (MP3 frame = 1152 samples, ~26 ms at 44.1 kHz).
Transform each frame into frequency bands (filterbank + MDCT) so you see what pitches are present.
Run the psychoacoustic model: calculate the masking threshold for every band right now.
Assign more bits to audible parts, fewer or zero bits to masked/inaudible parts.
Quantise so the error noise stays buried under the masking threshold, then Huffman-code the result.
Pack into frames with a header; decoder reverses it back to playable audio.
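
The bit-assignment step above can be sketched as a greedy loop. This is a toy model, not how any particular encoder does it: it assumes a uniform quantiser that gains about 6.02 dB of signal-to-noise ratio per bit, and the function name, the SMR values, and the bit budget are all made up for illustration. Real MP3 encoders use nested rate and distortion loops with non-uniform quantisers.

```python
DB_PER_BIT = 6.02  # assumed SNR gain per extra bit of a uniform quantiser


def allocate_bits(smr_db, bit_budget, max_bits_per_band=16):
    """Greedily hand out bits to the band with the most audible noise.

    smr_db: Signal-to-Mask Ratio per band in dB (negative = fully masked).
    Returns the number of bits given to each band.
    """
    bits = [0] * len(smr_db)
    for _ in range(bit_budget):
        # Noise-to-Mask Ratio: how far quantisation noise sits above the mask.
        nmr = [smr - DB_PER_BIT * b for smr, b in zip(smr_db, bits)]
        audible = [
            i for i, b in enumerate(bits)
            if nmr[i] > 0 and b < max_bits_per_band
        ]
        if not audible:
            break  # every band's noise is already under its threshold
        worst = max(audible, key=lambda i: nmr[i])
        bits[worst] += 1
    return bits


if __name__ == "__main__":
    smr = [18.0, -5.0, 7.0, 30.0]  # hypothetical per-band SMR values in dB
    print(allocate_bits(smr, bit_budget=20))  # [3, 0, 2, 5]
    print(allocate_bits(smr, bit_budget=6))   # [2, 0, 0, 4]
```

The masked band (SMR below 0 dB) never receives a bit, and with a generous budget the loop stops early once all noise is hidden. With a tight budget the loudest unmasked bands are served first and some noise stays audible, which is where low-bitrate artefacts come from.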

Real examples

A 3-minute song: ~32 MB as WAV, ~3 MB as a 128 kbps MP3, ~7 MB at 320 kbps.
Spotify streams in Ogg Vorbis (~96-320 kbps); YouTube uses AAC/Opus, not raw audio.
A loud kick drum masks the hiss of a quiet shaker on the same beat, so the coder bins the shaker detail.
Bluetooth audio uses lossy codecs (SBC, AAC, aptX) because the radio link can't reliably carry full-rate PCM.
Phone calls use heavy perceptual/parametric coding (AMR, Opus) to fit voice into tiny bandwidth.
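
The file sizes in the first example follow directly from bitrate times duration. A minimal sketch of that arithmetic, assuming constant bitrate, decimal megabytes (1 MB = 1,000,000 bytes), and no container or metadata overhead; the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz=44_100, bit_depth=16, channels=2):
    """Bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def file_size_mb(bitrate_kbps, seconds):
    """Approximate size in decimal MB: kilobits -> kilobytes -> megabytes."""
    return bitrate_kbps * seconds / 8 / 1000


if __name__ == "__main__":
    song_seconds = 3 * 60
    cd = pcm_bitrate_kbps()  # 1411.2 kbps for CD audio
    for label, kbps in [("WAV (CD PCM)", cd), ("MP3 320", 320), ("MP3 128", 128)]:
        size = file_size_mb(kbps, song_seconds)
        print(f"{label:<13} {kbps:7.1f} kbps  {size:5.1f} MB  {cd / kbps:4.1f}x")
```

For a 3-minute song this gives about 31.8 MB for WAV, 7.2 MB at 320 kbps and 2.9 MB at 128 kbps, in line with the rounded figures above.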

How it helps in live sound

Use WAV, or at least 320 kbps MP3 or high-bitrate AAC, for backing tracks; avoid 128 kbps on a big PA, where artefacts get exposed.
Never re-encode an MP3 from an MP3; each lossy pass stacks more permanent damage. Start from WAV.
Watch for 'swirly' high-frequency artefacts and pre-echo on transients (cymbals, claps) on cheap low-bitrate files.
For show playback (QLab, Ableton) import WAV/AIFF; reserve MP3/AAC only for size-limited delivery.
Phone/laptop Bluetooth into the desk adds a second lossy codec pass; use a cabled or USB feed instead.

Everyday analogy

Like packing a suitcase and leaving out clothes you know you'll never wear, so it's far lighter but you still have everything you'll actually reach for.

Watch out

Myth: 'higher bitrate always sounds better.' Truth: above ~256 kbps AAC most people can't pick it from the original, and re-encoding a lossy file at a higher bitrate cannot restore data already thrown away.

Fun fact

The MP3's tuning was perfected using Suzanne Vega's a cappella 'Tom's Diner', earning her the nickname 'the mother of the MP3'.

Key takeaways

It deletes sound your ear can't detect, not random data.
Masking is the core trick: loud sounds hide nearby quiet ones, in pitch and in time.
Lossy = permanent loss; lossless only halves the size but keeps everything.
Bitrate sets quality vs size: 128 kbps small/rough, 320 kbps near-transparent.
For live PA, feed full-quality files; save perceptual codecs for delivery only.