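The Transcript in the Age of Artificial Intelligence: Automatic Speech Recognition in Oral History (Das Transkript im Zeitalter der Künstlichen Intelligenz: Automatische Spracherkennung in der Oral History)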

Tobias Kilgus; Peter Kompiel

doi:10.3224/bios.v38i1-2.08

Abstract

This article examines the potential and limitations of automatic speech recognition (ASR) for oral history interviews. Manual transcription of audiovisual recordings is time-consuming and costly, and this “transcription bottleneck” hinders the indexing of extensive collections. The “ASR4Memory” project developed an open-source transcription pipeline based on WhisperX that complies with data protection requirements and runs on local servers. A key challenge is that ASR models smooth spoken language: dialects, filler words, and nonverbal expressions are lost, although these elements are often essential for scholarly analysis. The use of ASR also raises ethical and legal questions, particularly with regard to data protection and potential biases introduced by training data. The ASR application generates transcripts in various time-aligned formats with speaker recognition and supports multimodal work and full-text search. Despite considerable time savings, manual post-processing is often still necessary. Domain-specific fine-tuning with 300 hours of anonymized interviews significantly improved both the word error rate and the recognition of historical terms. The study advocates transparent, critically reflective use of ASR that takes subject-specific requirements and ethical standards into account, so that audiovisual research data can be made accessible sustainably and in accordance with the FAIR principles.
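
The word error rate (WER) mentioned above is commonly defined as the word-level edit distance between a reference transcript and the ASR output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation code used in the project. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are compared after lowercasing and whitespace
    splitting; real evaluations usually apply more careful normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the ASR output drops the filler word "äh" and
# writes the year as digits, which counts as one deletion and one
# substitution against a verbatim reference of five words.
reference = "äh ich bin neunzehnhundertdreiundzwanzig geboren"
hypothesis = "ich bin 1923 geboren"
print(word_error_rate(reference, hypothesis))
```

The example also shows why smoothing matters for the metric: whether a dropped filler word counts as an error depends on whether the reference transcript is verbatim or already smoothed.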

Bibliography: Kilgus, Tobias/Kompiel, Peter: Das Transkript im Zeitalter der Künstlichen Intelligenz. Automatische Spracherkennung in der Oral History, BIOS – Zeitschrift für Biographieforschung, Oral History und Lebensverlaufsanalysen, 1+2-2025, pp. 68-82.

Published: March 2026

Issue: Vol. 38, No. 1+2 (2025): Erinnerungen und Algorithmen. Oral History im digitalen Wandel

DOI: https://doi.org/10.3224/bios.v38i1-2.08

Open Access from: 2028-03-02

Open Access License: CC BY 4.0
