Understand Audio Data with Computer Vision Background

A side-by-side comparison of audio and visual data for quick understanding

Published in Towards Data Science, Feb 14, 2022

Vision is one of the most powerful human senses. From an image, we can grasp content and sentiment instantly, without any additional interpretation.

In contrast, one-dimensional sequence data (including audio data) requires an understanding of the x- and y-axis labels to interpret the waveform. Without the labels, there is no context against which to read the peaks and troughs of the sequence.

Drawing on my background in computer vision and signal processing, this article introduces audio data by relating it to vision data. Mapping one onto the other turned out to be an effective way to put the concepts into perspective.

Digitization

An image is formed by capturing light at an instant, while a video is formed by stitching images together along the time axis.

In a similar manner, an audio signal is created by capturing air pressure over a period of time.

Computers cannot store continuous signals; at the lowest level, all data is stored as bits of 0s and 1s. Hence, the data capturing process includes converting the data from continuous to discrete form. This process is referred to as digitization.

In general, digitization is divided into two parts: one determines the sampling points, while the other retrieves the intensity value at each of those points. The former is referred to as sampling and the latter as quantization.

Sampling

Vision

Sampling of an image occurs spatially at regular intervals. This produces a 2-dimensional grid, the digitized image, with a width and a height. The width and height together are referred to as the resolution, while the smallest element of the grid is a pixel. The higher the resolution, the more detail the image retains.
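To make spatial sampling concrete, here is a minimal sketch in plain Python. The `scene` function is a made-up stand-in for the continuous light field a camera would see, and `sample_image` is a hypothetical helper name; neither comes from a real imaging library. The same scene is sampled on a coarse grid and on a fine grid.

```python
import math

def scene(x, y):
    """Hypothetical continuous scene: brightness in [0, 1] at any point of the unit square."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)

def sample_image(width, height):
    """Sample the scene at the centre of every cell of a width x height grid."""
    return [
        [scene((col + 0.5) / width, (row + 0.5) / height) for col in range(width)]
        for row in range(height)
    ]

low_res = sample_image(4, 3)     # 4 x 3 grid: 12 pixels
high_res = sample_image(64, 48)  # 64 x 48 grid: 3072 pixels

for name, image in (("low_res", low_res), ("high_res", high_res)):
    height, width = len(image), len(image[0])
    print(name, width, "x", height, "=", width * height, "pixels")
```

Both grids cover the same scene; the finer grid simply measures it at more locations, which is why a higher resolution keeps more detail.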

Audio

Sampling in a one-dimensional sequence retrieves data points at regular intervals in the time axis.

The sampling rate determines how many data points are captured per second. A high sampling rate collects more samples at shorter intervals, while a low sampling rate collects fewer samples at longer intervals.
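The relationship between sampling rate and number of samples can be sketched as follows. This is an illustration only: the 440 Hz tone, the 10 ms duration, and the helper name `sample_tone` are assumptions chosen for the example.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate_hz):
    """Sample a sine tone at regular time intervals of 1 / sample_rate_hz seconds."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# The same 10 ms of a 440 Hz tone, captured at two different sampling rates.
sparse = sample_tone(440.0, 0.01, 8000)   # interval between samples: 1/8000 s
dense = sample_tone(440.0, 0.01, 44100)   # interval between samples: 1/44100 s

print(len(sparse), "samples at 8 kHz")
print(len(dense), "samples at 44.1 kHz")
```

The duration is identical in both cases; only the spacing between samples, and therefore the sample count, changes.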

Audio sampling rates are measured in hertz (Hz). The standard rate for modern audio files is 44.1 kHz, which means the audio is sampled 44,100 times per second. Human hearing spans roughly 20 Hz to 20 kHz, and a sampling rate can only represent frequencies up to half its value, so 44.1 kHz covers the audible range with a small margin. Lower sampling rates discard audible high frequencies and impair audio quality, while higher rates bring no appreciable improvement.
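Why does a sampling rate that is too low hurt quality? A sampled signal can only represent frequencies up to half the sampling rate (the Nyquist limit); anything above that folds back and becomes indistinguishable from a lower frequency. The sketch below illustrates this with assumed values: an 8 kHz sampling rate and two cosine tones at 3 kHz and 5 kHz. The helper name `sample_cosine` is hypothetical.

```python
import math

def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Return n_samples of a cosine tone sampled at sample_rate_hz."""
    return [
        math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

sample_rate = 8000
nyquist = sample_rate / 2  # highest frequency this rate can represent

tone_3k = sample_cosine(3000, sample_rate, 16)  # below the Nyquist limit
tone_5k = sample_cosine(5000, sample_rate, 16)  # above the Nyquist limit

# Largest difference between the two sampled sequences.
max_diff = max(abs(a - b) for a, b in zip(tone_3k, tone_5k))

print("Nyquist limit at 8 kHz:", nyquist, "Hz")
print("Nyquist limit at 44.1 kHz:", 44100 / 2, "Hz")
print("max difference between sampled 3 kHz and 5 kHz tones:", max_diff)
```

At 8 kHz the two tones produce the same samples up to floating-point error, so the 5 kHz content is lost. At 44.1 kHz the limit sits just above the top of the audible range.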

Quantization

For both visual and audio data, quantization refers to mapping each sampled amplitude to one of a finite set of discrete levels. The resulting intensity values represent the strength of the input signal.

Vision

The intensity at every spatial coordinate is stored as a grey level (1 channel) or as RGB levels (3 channels). These intensity values are also referred to as pixel values, with a standard range of 0 to 255 (8 bits per channel).
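As a rough sketch, quantizing a pixel means turning a continuous brightness into one of 256 integer levels. The function name `quantize_to_8bit` and the convention that the input brightness lies in [0.0, 1.0] are assumptions made for this example.

```python
def quantize_to_8bit(intensity):
    """Map a continuous intensity in [0.0, 1.0] to an integer level in 0..255."""
    clipped = min(max(intensity, 0.0), 1.0)  # keep out-of-range values inside [0, 1]
    return int(round(clipped * 255))

# One grey pixel: a single channel.
grey_pixel = quantize_to_8bit(0.5)

# One RGB pixel: three channels, each quantized independently.
rgb_pixel = tuple(quantize_to_8bit(c) for c in (1.0, 0.25, 0.0))

print("grey:", grey_pixel)
print("rgb:", rgb_pixel)
```

Every brightness between two neighbouring levels collapses to the same integer, which is the information lost in quantization.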

Audio

For audio data, quantization captures the air pressure level at every sampling point. The signal values vary over time, forming a waveform that shows how the strength of the air pressure changes.
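The audio counterpart of the 0 to 255 pixel range is the bit depth, which sets how many discrete levels an amplitude can take. Bit depth is not discussed above, so treat the following as an illustrative sketch under assumptions: amplitudes are normalized to [-1.0, 1.0], values are stored as signed integers, and `quantize_sample` is a hypothetical helper.

```python
import math

def quantize_sample(value, bit_depth=16):
    """Map an amplitude in [-1.0, 1.0] to a signed integer level for the given bit depth."""
    max_level = 2 ** (bit_depth - 1) - 1     # 32767 for 16 bits, 7 for 4 bits
    clipped = min(max(value, -1.0), 1.0)     # keep out-of-range values inside [-1, 1]
    return int(round(clipped * max_level))

# Eight samples of a 440 Hz tone at an assumed 8 kHz sampling rate.
sample_rate = 8000
waveform = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(8)]

pcm_16 = [quantize_sample(v) for v in waveform]               # fine amplitude steps
pcm_4 = [quantize_sample(v, bit_depth=4) for v in waveform]   # coarse amplitude steps

print("16-bit:", pcm_16)
print(" 4-bit:", pcm_4)
```

With 4 bits the waveform is forced onto a handful of levels, much like an image reduced to a few shades of grey.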

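To summarize: sampling determines where (for an image) or when (for audio) the signal is measured, while quantization records how strong the signal is at each of those points.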

A fundamental understanding of audio data makes it possible to process that data for various operations. I hope this article gives you some insight for kick-starting your journey in audio processing.

Thanks for reading!

Reading Materials

https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html

Written by Chiawei Lim