Multimodal Speech-to-Text Transcription

This project extends speech transcription with additional sources of information, primarily video. Non-verbal cues such as gestures, lip movements and visible speaker activity are to be used to improve the transcript. Non-verbal expressions such as nodding or shaking the head should also be recognised and recorded in the transcript. Students can actively contribute to the definition of the work.

  • Further information

    • Semester or Master’s thesis for 1-2 people
    • 40% theory, 60% implementation
    • Prerequisites: Signal processing, Python

    German is commonly spoken within the company. Basic proficiency is helpful and appreciated.

Have we sparked your interest?
