Lingyu Zhu: Audio-visual learning in visually guided sound source separation and 3D environment perception

Photo: Nannan Zou

Perceiving events and scenes intrinsically involves several senses at once, particularly sight and hearing. Researchers have made significant advances in perception learning, mainly by modelling the individual senses separately. In her doctoral dissertation, Lingyu Zhu explored how a scene’s visual or auditory signal can serve as a useful cue for enhancing and complementing the other modality through learned audio-visual models.

In her doctoral research, Lingyu Zhu drew on a large set of multi-modal observations to propose computational models that leverage appearance and motion from video, together with temporal information within audio, to identify audio components for visually guided sound source separation.

Moreover, she has extracted and applied structural information from vision (e.g., RGB or depth images) and audio for 3D environment perception.

According to Zhu, despite the impressive progress of audio-visual learning in recent years, there is still much to be addressed: i) Semantics and motion are two key visual cues in realistic videos and can be used to help solve audio-visual problems; ii) Natural sounds are rich in frequencies, yet the impact of different audio frequencies on audio-visual learning remains unclear; iii) Binaural echoes recorded in the physical world preserve holistic 3D information that complements the RGB image, especially beyond the limited visual field of view (FoV), yet their potential for 3D perception is underexplored.

“To address the aforementioned challenges, I have built audio-visual perception algorithms to leverage semantic, motion, temporal, and geometric cues from videos and audio streams. These algorithms are particularly useful for applications of visually guided sound source separation, depth estimation, and embodied agent navigation”, says Zhu.

The research for the doctoral dissertation was conducted in the Computer Vision Group at Tampere University and supported by the Research Council of Finland (projects 327910 & 324346).

Public defence on Friday 15 November

M.Sc. Lingyu Zhu's dissertation in the field of multi-modal learning, titled Scene Perception through Audio-Visual Learning, will be publicly examined at the Faculty of Information Technology and Communication Sciences at Tampere University at 13:00 on Friday 15 November 2024. The venue is Auditorium D11 in the main building at City Centre Campus (Kalevantie 4, Tampere). The Opponent will be Senior Lecturer Jorma Laaksonen from Aalto University, Finland. The Custos will be Professor Esa Rahtu from Tampere University.

The doctoral dissertation is available online.

The public defence can be followed via remote connection.