Towards smart acoustic cameras for simultaneous sound localization and recognition

Research output: Thesis › PhD Thesis


Abstract

Acoustic cameras are devices that visualize sound using an array of microphones. The signals from the microphones are combined by a beamforming algorithm to generate an acoustic heatmap, or acoustic image. These beamforming algorithms tend to have a high computational cost, which increases with the number of microphones. The high number of Input/Output (I/O) connections required for the microphones, combined with the large amount of parallel computation, makes Field Programmable Gate Arrays (FPGAs) very suitable for processing the signals from these microphone arrays. FPGAs have low power consumption, which makes them especially viable when targeting battery-powered devices such as handheld acoustic cameras or nodes in a sensor network.
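
To make the computational cost concrete, the sketch below shows a minimal time-domain delay-and-sum beamformer in NumPy. The far-field assumption, the integer-sample delays, and the grid of steering angles are illustrative simplifications, not the configuration used in this work.

```python
import numpy as np

def delay_and_sum_heatmap(signals, mic_pos, fs, azimuths, elevations, c=343.0):
    """Far-field time-domain delay-and-sum beamformer (illustrative sketch).

    signals:  (num_mics, num_samples) microphone samples
    mic_pos:  (num_mics, 3) microphone coordinates in metres
    fs:       sampling rate in Hz
    azimuths, elevations: 1-D arrays of steering angles in radians
    Returns a (len(elevations), len(azimuths)) output-power map.
    """
    heatmap = np.zeros((len(elevations), len(azimuths)))
    for i, el in enumerate(elevations):
        for j, az in enumerate(azimuths):
            # Unit vector pointing towards the steering direction.
            d = np.array([np.cos(el) * np.cos(az),
                          np.cos(el) * np.sin(az),
                          np.sin(el)])
            # Per-microphone plane-wave delays, rounded to whole samples.
            delays = np.round((mic_pos @ d) / c * fs).astype(int)
            delays -= delays.min()          # keep all shifts non-negative
            n = signals.shape[1] - delays.max()
            # Align the channels by their delays and sum them.
            summed = sum(signals[m, delays[m]:delays[m] + n]
                         for m in range(signals.shape[0]))
            heatmap[i, j] = np.mean(summed ** 2)  # beam output power
    return heatmap
```

Every heatmap pixel requires delaying and summing all channels, which is why the cost scales with both the microphone count and the image resolution, and why the work maps so naturally onto the parallel fabric of an FPGA.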
Despite the high computational power per watt of FPGAs, satisfying real-time constraints still presents a challenge, especially when targeting acoustic images with a higher resolution. To overcome this challenge, a multi-mode acoustic camera has been developed. The camera supports multiple modes depending on the task at hand, and the resolution of the acoustic heatmap can be adapted to satisfy the real-time requirement of each mode.
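
A back-of-envelope estimate shows why the resolution is the natural knob to turn. With purely illustrative numbers (64 microphones, 48 kHz sampling, two operations per microphone per pixel, none of which are figures from this work), the sustained compute grows linearly with the pixel count:

```python
# Rough cost model for continuous delay-and-sum imaging
# (all numbers are illustrative assumptions, not measured values).
num_mics = 64
fs = 48_000          # samples per second per microphone
ops_per_sample = 2   # one delayed read + one accumulate per mic (assumed)

for width, height in [(32, 24), (64, 48), (160, 120)]:
    pixels = width * height
    gops = pixels * num_mics * fs * ops_per_sample / 1e9
    print(f"{width}x{height}: ~{gops:.1f} GOPS sustained")
```

Quadrupling the resolution quadruples the sustained load, so lowering the heatmap resolution is a direct way to keep each mode within its real-time budget.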
A second limitation of existing acoustic cameras is the identification of the type of sound, which commonly requires human expertise to recognize and profile the sound. In recent years, deep learning, a form of Artificial Intelligence (AI), has shown promising results on the task of sound recognition using Convolutional Neural Networks (CNNs). However, most research focuses on improving the accuracy of such models without considering the limitations encountered when deploying them on resource-constrained devices such as FPGAs.
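
As a point of reference, here is a minimal sketch of the kind of CNN meant here, written in PyTorch and classifying log-mel spectrogram patches. The layer widths, input shape, and class count are illustrative assumptions, kept deliberately small because on an FPGA every extra channel costs DSP slices and on-chip memory.

```python
import torch
import torch.nn as nn

class SmallSoundCNN(nn.Module):
    """Tiny CNN over (1, mels, frames) log-mel spectrogram patches.

    Layer sizes are illustrative; resource-constrained targets favour
    few, narrow layers over the deep models common in the literature.
    """
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # pool to one value per channel
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of 8 patches with 64 mel bands and 96 time frames.
logits = SmallSoundCNN()(torch.randn(8, 1, 64, 96))
print(logits.shape)  # torch.Size([8, 10])
```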
FPGAs are nowadays used for embedded deep learning inference, mainly with two architectures. The first type of architecture uses a general-purpose soft core inside the Programmable Logic (PL) of the FPGA; the second is a dataflow-based architecture that translates each layer of a CNN into a functional block in the PL. Embedding these CNNs for inference on FPGAs is not a trivial task and comes with trade-offs in terms of resource consumption, accuracy, and supported layers. These two architectures are compared against other embedded solutions, such as Google's Edge Tensor Processing Unit (TPU) and a Raspberry Pi (RPi), to find the best fit for acoustic cameras. Acoustic cameras are targeted here because, unlike a single microphone, they can identify the location of a sound source. Furthermore, existing beamforming techniques such as delay-and-sum reconstruct audio signals that can be used for audio classification tasks.
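
That last point can also be sketched: steering the same delay-and-sum operation at a single direction yields a mono time signal that can be converted to a spectrogram and passed to a classifier such as the CNN sketched above. The normalization below is an illustrative simplification.

```python
import numpy as np

def reconstruct_source(signals, mic_pos, fs, az, el, c=343.0):
    """Steer a delay-and-sum beam at (az, el) and return the aligned,
    summed time signal for use in a downstream classification task."""
    d = np.array([np.cos(el) * np.cos(az),
                  np.cos(el) * np.sin(az),
                  np.sin(el)])
    delays = np.round((mic_pos @ d) / c * fs).astype(int)
    delays -= delays.min()
    n = signals.shape[1] - delays.max()
    beam = sum(signals[m, delays[m]:delays[m] + n]
               for m in range(signals.shape[0]))
    return beam / signals.shape[0]  # average so amplitude stays comparable
```

In practice the steering direction would come from the brightest pixel of the heatmap, which is what links localization and recognition into a single pipeline.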
Original language: English
Awarding Institution
  • Vrije Universiteit Brussel
Supervisors/Advisors
  • Touhafi, Abdellah, Supervisor
  • da Silva Gomes, Bruno Tiago, Co-Supervisor
Award date: 29 Mar 2024
Publication status: Published - 2024

