Understand Audio Data with Computer Vision Background

A side-by-side comparison of audio and visual data for quick understanding

Chiawei Lim
Towards Data Science


Image by the author

Vision is one of the most powerful human senses. From an image, we can grasp content and sentiment instantly, with no additional interpretation needed.

In contrast, one-dimensional sequence data (including audio) cannot be read without its x- and y-axis labels. Without them, the peaks and troughs of the waveform have no context to map to.

Image by the author

Drawing on my background in computer vision and signal processing, this article introduces audio data by relating it to visual data. Mapping each audio concept to its visual counterpart turned out to be an effective way to put the concepts into perspective.

Digitization

An image is formed by capturing light at an instant, while a video is formed by stitching images together along the time axis.

Similarly, audio signals are created by capturing air pressure over a period of time.

Image by the author
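To make the comparison concrete, here is a minimal sketch of how the three kinds of data could be laid out in memory. The sizes and the `shape` helper are hypothetical, chosen only for illustration, and plain nested lists stand in for whatever array type a real pipeline would use.

```python
# Hypothetical sizes, chosen only for illustration.
HEIGHT, WIDTH = 4, 6   # image resolution in pixels
NUM_FRAMES = 3         # number of images in the video
NUM_SAMPLES = 8        # number of readings in the audio clip

# An image: a 2D grid of intensity values (rows x columns).
image = [[0 for _ in range(WIDTH)] for _ in range(HEIGHT)]

# A video: images stacked along the time axis.
video = [
    [[0 for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for _ in range(NUM_FRAMES)
]

# An audio clip: a 1D sequence of air-pressure readings over time.
audio = [0.0 for _ in range(NUM_SAMPLES)]


def shape(data):
    """Return the dimensions of a nested list, outermost first."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)


image_shape = shape(image)  # (height, width)
video_shape = shape(video)  # (frames, height, width)
audio_shape = shape(audio)  # (samples,)
```

Seen this way, audio is the simplest of the three: a single axis (time) where an image has two (height and width) and a video has three.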

Computers cannot store continuous signals; at the lowest level, data is stored as bits of 0s and 1s. Hence, the data capturing process includes the conversion of data from continuous to discrete form. This process is referred to as digitization.

In general, digitization is divided into two parts: one retrieves the sampling points, while the other retrieves the intensity values. The former is referred to as sampling and the latter as quantization.

Sampling

Vision

Sampling on an image occurs spatially at regular intervals. This corresponds to the 2-dimensional grid of the digitized image in width and height. The width and height together are referred to as the resolution, and the smallest element of the grid is a pixel. The higher the resolution, the more detail the image can capture.

Image by the author
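Spatial sampling can be sketched in a few lines. The example below assumes the continuous scene is given as a function of position; `sample_grid` and `gradient_scene` are hypothetical names introduced here, not part of any library.

```python
def sample_grid(scene, height, width):
    """Sample a continuous scene on a regular height x width grid.

    `scene` is assumed to map a position (x, y), each in [0, 1), to a
    brightness value. Pixel centers are used as the sampling points.
    """
    return [
        [scene((col + 0.5) / width, (row + 0.5) / height) for col in range(width)]
        for row in range(height)
    ]


def gradient_scene(x, y):
    """A made-up scene whose brightness increases from left to right."""
    return x


# The same scene sampled at two resolutions.
low_res = sample_grid(gradient_scene, height=2, width=4)
high_res = sample_grid(gradient_scene, height=4, width=8)

low_res_pixels = len(low_res) * len(low_res[0])
high_res_pixels = len(high_res) * len(high_res[0])
```

Doubling the resolution in each direction quadruples the number of pixels, which is why the higher-resolution grid follows the underlying scene more closely.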

Audio

Sampling in a one-dimensional sequence retrieves data points at regular intervals along the time axis.

The sampling rate determines how many data points are captured per second. A high sampling rate collects more samples at shorter intervals, while a low sampling rate collects fewer samples at longer intervals.

Photo by the author
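The effect of the sampling rate can be sketched as follows. The 5 Hz tone and the two rates are illustrative assumptions, far lower than real audio rates so the numbers stay small; `sample_signal` and `tone` are hypothetical helpers.

```python
import math


def sample_signal(signal, sample_rate_hz, duration_s):
    """Read `signal(t)` at regular intervals of 1 / sample_rate_hz seconds."""
    num_samples = int(round(sample_rate_hz * duration_s))
    return [signal(n / sample_rate_hz) for n in range(num_samples)]


def tone(t, frequency_hz=5.0):
    """A 5 Hz sine wave standing in for air pressure over time."""
    return math.sin(2 * math.pi * frequency_hz * t)


# One second of the same tone at two different sampling rates.
dense = sample_signal(tone, sample_rate_hz=100, duration_s=1.0)
sparse = sample_signal(tone, sample_rate_hz=20, duration_s=1.0)
```

Both lists describe the same one-second tone, but `dense` holds 100 points spaced 10 ms apart while `sparse` holds only 20 points spaced 50 ms apart.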

Audio sampling rates are measured in hertz (Hz). A common standard for audio files today is 44.1 kHz, which means the audio is sampled 44,100 times per second. Human hearing spans roughly 20 Hz to 20 kHz, and a sampling rate must be at least twice the highest frequency it needs to capture, so 44.1 kHz is enough to cover the audible range. Lower sampling rates impair audio quality, while higher rates bring no appreciable improvement.

Image by the author
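The link between the sampling rate and the audible range comes from the Nyquist criterion, which the article does not name explicitly: a sampling rate can only represent frequencies up to half its value. The sketch below computes that limit for 44.1 kHz and then shows, with toy frequencies chosen purely for illustration, what happens when a signal exceeds it. The function names are hypothetical.

```python
import math


def nyquist_frequency(sample_rate_hz):
    """Highest frequency a sampling rate can represent without aliasing."""
    return sample_rate_hz / 2


def sample_cosine(frequency_hz, sample_rate_hz, num_samples):
    """Sample a cosine of the given frequency at the given rate."""
    return [
        math.cos(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]


# 44.1 kHz sampling covers frequencies up to 22.05 kHz,
# just above the ~20 kHz upper limit of human hearing.
cd_limit_hz = nyquist_frequency(44_100)

# Toy aliasing example: at a 4 Hz sampling rate the limit is 2 Hz,
# so a 3 Hz cosine produces the same samples as a 1 Hz cosine.
too_fast = sample_cosine(frequency_hz=3, sample_rate_hz=4, num_samples=8)
slow = sample_cosine(frequency_hz=1, sample_rate_hz=4, num_samples=8)
max_difference = max(abs(a - b) for a, b in zip(too_fast, slow))
```

Here `max_difference` is zero up to floating-point error: once sampled, the two tones cannot be told apart. This is the kind of quality loss a too-low sampling rate causes.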

Quantization

For both visual and audio data, quantization refers to mapping amplitude values to a finite set of discrete levels. The intensity values are retrieved to represent the strength of the input signal at each sampling point.

Vision

The intensity value at every spatial coordinate comes in the form of grey (1-channel) or RGB (3-channel) levels. The intensity values are also referred to as pixel values, with a standard range of 0–255 for 8-bit images.

Image by the author
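As a rough sketch of pixel quantization, the function below maps a continuous brightness to one of 256 integer levels. It assumes brightness has already been normalized to the range 0.0 to 1.0; `quantize_intensity` is a hypothetical helper, not a library call.

```python
def quantize_intensity(value, levels=256):
    """Map a brightness in [0.0, 1.0] to an integer level in [0, levels - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    # Split [0, 1] into `levels` equal bins; keep 1.0 inside the top bin.
    return min(int(clipped * levels), levels - 1)


# A grey pixel stores one quantized value.
grey_pixel = quantize_intensity(0.5)

# An RGB pixel stores one quantized value per channel.
rgb_pixel = tuple(quantize_intensity(c) for c in (1.0, 0.5, 0.0))

# Fewer levels give a coarser picture; with 2 levels every pixel
# becomes either 0 (dark) or 1 (bright).
binary_pixel = quantize_intensity(0.7, levels=2)
```

With the default of 256 levels, each value fits into a single byte, which is where the 0–255 range comes from.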

Audio

For audio data, quantization captures the air pressure value at every sampling point. The signal values vary over time, forming a waveform that shows the variation in the strength of air pressure.

Image by the author
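Audio quantization can be sketched in the same way. The article does not discuss bit depth, so the 16-bit default below is an assumption (it is a common choice for audio files), as is the normalization of amplitudes to the range -1.0 to 1.0. `quantize_sample` is a hypothetical helper.

```python
import math


def quantize_sample(amplitude, bit_depth=16):
    """Round an amplitude in [-1.0, 1.0] to a signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    clipped = min(max(amplitude, -1.0), 1.0)
    return int(round(clipped * max_level))


# One cycle of a sine wave sampled at 8 points (a toy rate for illustration).
POINTS_PER_CYCLE = 8
waveform = [
    math.sin(2 * math.pi * n / POINTS_PER_CYCLE) for n in range(POINTS_PER_CYCLE)
]

# Each continuous amplitude becomes one integer per sampling point.
pcm = [quantize_sample(a) for a in waveform]
```

The resulting list of integers is the digitized waveform: the index gives the position in time (sampling) and the value gives the strength of the signal (quantization).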

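In summary, both images and audio are digitized through sampling and quantization. Images are sampled spatially on a pixel grid, while audio is sampled along the time axis. In both cases, quantization records the intensity at each sampling point: pixel values for images and air pressure for audio.

A fundamental understanding of audio data makes it possible to process it further for various operations. I hope this article offers some insight as you kick-start your journey with audio processing.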

Thanks for reading!

