CLARIN/CLARIAH Collaboration on Automatic Transcription Chain for Digital Humanities

In the CLARIAH project, we are developing the Media Suite, an application that supports scholarly research using audiovisual media collections. In 2017 we will also be integrating tools that support Oral History research in the Media Suite. From 10 to 12 May 2017, scholars and technology experts discussed the development of an automatic transcription chain for spoken word collections in the context of CLARIN, the European counterpart of CLARIAH, at a CLARIN-PLUS workshop in Arezzo. We observed that CLARIAH and CLARIN take different but interestingly complementary approaches to the development of such a transcription chain, which encourages further collaboration.

The automatic transcription chain that CLARIN is focussing on should eventually be integrated into the CLARIN research infrastructure for language resources and technology. The chain consists of a cascade of components, including automatic speech recognition, that help scholars generate speech transcripts for spoken word collections such as Oral History interviews. CLARIAH was invited to the workshop to share technical and infrastructural experience with the development of such a transcription chain in the CLARIAH Media Suite, which is currently in progress and builds on experience from the projects Verteld Verleden and Oral History Today. CLARIAH also took part in the discussion on scholarly requirements for transcription, given the available techniques that were presented at the workshop by invited experts in the field.

In CLARIAH, the approach is to embed transcription technology in the Media Suite and to develop advanced features gradually, bottom-up, with a strong focus on generic workflows and sustainability. In CLARIN, the focus is on a stand-alone transcription service containing various advanced features such as alignment (which labels already existing manual transcripts with time-codes, see e.g. WebMAUS), manual transcription and correction tools, and transcription viewers. Moreover, in CLARIAH the assumption is that collections are already available within the infrastructure. In the CLARIN approach, collections can also exist locally with individual scholars who want to use transcription tools. It is useful to have these different approaches, as each one will provide complementary insights towards the development of transcription tools that fit both the practice and the infrastructure context of scholars.

The envisaged CLARIN transcription service is a central ‘portal’ that should connect scholars all over Europe to different transcription technologies, ranging from speech recognition for different languages (e.g., webASR for English) to manual correction tools. The speech recognition engine used in CLARIAH (operated by Beeld en Geluid and University of Twente) will also be used in the CLARIN service (operated by Radboud University).
