Multimodal Speech-to-Text Transcription
This project extends speech transcription with additional sources of information, primarily video. Non-verbal cues such as gestures, lip movements and visible speaker activity are to be used to improve the transcript. Non-verbal expressions such as nodding or shaking the head should be recognised and added to the transcript. Students can actively contribute to defining the scope of the work.
Further information
- Semester or Master’s thesis for 1-2 people
- 40% theory, 60% implementation
- Prerequisites: Signal processing, Python
German is commonly spoken within the company. Basic proficiency is helpful and appreciated.
