Vladimir Iashin: Novel Multimodal Transformer Architectures Enhance Video Content Understanding

Tampere University
Location: Korkeakoulunkatu 1, 33720 Tampere
Hervanta campus, Tietotalo building, lecture hall TB109
Date and time: 12 May 2023, 1:00 pm–4:00 pm
Admission: Free event
Portrait of Vladimir Iashin
In his doctoral dissertation, Vladimir Iashin explores innovative solutions to improve multimodal video understanding. The study introduces novel techniques for dense video captioning, visually-guided audio generation, and synchronization of audio and video streams within a video. The proposed multimodal transformer architecture offers a flexible and efficient model for integrating multiple modalities and capturing complex temporal dynamics in video data, with potential applications in various fields such as entertainment, education, and healthcare.

A recent doctoral dissertation by Vladimir Iashin presents significant progress in the field of multimodal video understanding. The research aims to improve video content understanding, particularly in areas such as dense video captioning, visually-guided audio generation, and synchronization of audio and video streams within a video.

The study provides an overview of the latest research in the field and introduces novel methods for addressing key challenges in multimodal video understanding. The proposed techniques are based on a multimodal transformer architecture that models the temporal dimension of the data and fuses multiple modalities within a single model.

Broad Implications of Multimodal Transformer Architectures for Multimedia Technologies

The potential applications of this research are numerous and relevant to society. Foley designers, who create and edit sound effects for movies, will benefit from the visually-guided audio generation techniques proposed in the study. By synchronizing the audio and video streams, the new generation of content creators can produce more compelling and professional-looking videos.

Moreover, the study's dense video captioning method could be particularly beneficial to visually impaired individuals, who often struggle to follow video content without audio descriptions. This technology will allow them to access video content more easily and independently.

The study has far-reaching implications for the field of video content understanding. The proposed multimodal transformer architectures represent novel methods for addressing the challenges of modeling long sequences in the context of audio-visual data. They offer a variety of flexible and efficient models for integrating multiple modalities and capturing complex temporal dynamics in video data.

By advancing the state of the art in multimodal processing, this research contributes to the development of more sophisticated and effective multimedia technologies with broad applications in various fields, such as entertainment, education, and healthcare.

In summary, this research offers innovative solutions to key challenges in multimodal video understanding, with potential applications that can benefit society and industry. It contributes to the ongoing discussion and development of multimedia technologies and demonstrates the potential of multimodal transformer architectures for addressing challenges in various fields of research.

Public defence on Friday 12 May

The doctoral dissertation of MSc Vladimir Iashin in the field of deep learning, titled Multi-modal Video Content Understanding, will be publicly examined at the Faculty of Information Technology and Communication Sciences at Tampere University at 1:00 pm on Friday 12 May 2023 in the Tietotalo building, lecture hall TB109 (Korkeakoulunkatu 1, Tampere). The Opponent will be Assistant Professor Yuki Asano from the University of Amsterdam, the Netherlands. The Custos will be Professor Esa Rahtu from the Faculty of Information Technology and Communication Sciences at Tampere University.

The doctoral dissertation is available online.

The public defence can be followed via remote connection.

Photograph: Hanna Ranta (Photo Stella Oy)