Unlocking Archives for Scholarly Research

The CLARIAH Media Suite was at the CLARIN Annual Conference in Pisa, Italy, from 8 to 10 October. Our main message was that we have made good progress in establishing a scholarly research infrastructure for mixed-media analysis of the multimedia data that are abundantly available at national archives, libraries and knowledge institutions. These multimedia collections have long been ‘locked’ for lack of a proper interface to the data, from both a technical and a legal (IPR/privacy) perspective.

CLARIN Conference: proceedings with the paper “Media Suite: Unlocking Archives for Mixed Media Scholarly Research”, slides.

Our approach is based on (i) scholarly requirements with respect to access and analysis of data, (ii) requirements with respect to the sustainability of the technical infrastructure we are developing, and (iii) the principle that everything that we build for (i) should be usable for scholars in (ii): not in X years when the infrastructure is “ready”, but immediately (or at least as soon as possible). This allows us to build the infrastructure in co-development, keep track of its benefits (research output) and shortcomings, and continuously update its roadmap.

In this context, we defined some architecture principles for the infrastructure. The institutions that own or hold the data, such as archives, are responsible for the quality of the data (metadata) and for facilitating access to the data. To authorize access to the data, we use a federated authentication mechanism. Data from the various institutions, and tools for searching, analysis, annotation and visualisation, are available through a “workspace” or Virtual Research Environment (VRE). Data created by scholars in the workspace can, if IPR permits, be exported in various formats for analysis in external tools that are already available. And finally, an application such as the Media Suite provides an interface to the underlying infrastructure, geared towards the specific requirements of a scholarly user group, in our case media scholars.

Speech Recognition

An example of the data analysis tools available in the CLARIAH workspace is automatic speech recognition (ASR). At the CLARIN conference, we presented an overview paper (see proceedings) on ASR for scholarly research (see below). It explains that ASR is helpful for (i) supporting the transcription of the spoken word (e.g., in interview collections), turning it from a fully manual and time-consuming process into a (semi-)automatic one, and (ii) increasing the efficiency of discovery in large audiovisual collections. The question is how we make ASR available to scholars for these two purposes.


Activating ASR

First of all, we have to distinguish between the processing of individual files or small personal collections on the one hand, and large institutional collections on the other. For the first scenario, the CLARIAH infrastructure incorporates a speech recognition service that individual scholars can easily deploy. Within the closed environment of the workspace, scholars can upload files and select enrichment services such as speech recognition from a drop-down menu. Speech recognition jobs are scheduled in the background and, once they have finished, the transcripts become available in the (personal) workspace for viewing and searching as part of a personal collection index.

For the second scenario, the processing of large collections is scheduled manually by technology specialists at the request of scholars. For example, “news and actualities” programs are a rich source for scholarly research. To process all available content of this type of programming in an archive, all program identifiers for this type need to be collected first, using a combination of metadata fields (e.g., genre, title). With the identifiers, the digital source files can be extracted from the archive and sent to a computer cluster for recognition. This computer cluster can be a local cluster or a high performance computing cluster, depending on the quantity of material to be processed.
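To make the selection step for large collections more concrete, here is a minimal Python sketch of how program identifiers might be collected from metadata and how a target cluster might be chosen. It is an illustration only, not the actual CLARIAH implementation: the record layout, the function names, the rule that a match on either genre or title is enough, and the capacity threshold of 1000 files are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgramRecord:
    """Hypothetical catalogue record; real archive metadata will differ."""
    identifier: str
    genre: str
    title: str


def collect_identifiers(records, genres, title_keywords):
    """Return identifiers of records whose genre OR title matches.

    Assumption: combining metadata fields with a logical OR; an archive
    may need a stricter or more elaborate rule.
    """
    genres_lc = {g.lower() for g in genres}
    keywords_lc = [k.lower() for k in title_keywords]
    selected = []
    for record in records:
        genre_match = record.genre.lower() in genres_lc
        title_match = any(k in record.title.lower() for k in keywords_lc)
        if genre_match or title_match:
            selected.append(record.identifier)
    return selected


def choose_cluster(n_files, local_capacity=1000):
    """Pick a processing target based on quantity.

    The threshold is an arbitrary placeholder, not a documented value.
    """
    return "local" if n_files <= local_capacity else "hpc"


# Tiny made-up catalogue, for illustration only.
CATALOGUE = [
    ProgramRecord("prog-001", "news", "Evening News"),
    ProgramRecord("prog-002", "drama", "A Family Story"),
    ProgramRecord("prog-003", "documentary", "Actualities of the Week"),
]

ids = collect_identifiers(CATALOGUE, genres=["news"], title_keywords=["actualities"])
target = choose_cluster(len(ids))
print(ids, target)
```

In practice the identifiers would come from a query against the archive's catalogue rather than an in-memory list, and the resulting list would then drive the extraction of the digital source files.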

Browsing transcripts