To foster scholarly research using large data collections in the arts and humanities, the CLARIAH project is developing a research infrastructure that aims to streamline access to large audiovisual collections and related context collections, available at different locations in the Netherlands. It also provides scholars with robust and sustainable tools to work with these collections. The gateway to the data and tools in the infrastructure is the Media Suite, a portal that helps scholars to explore, select, analyze and annotate data collections. Many practical issues arise in the process of making data collections from various institutions available within the infrastructure in a way that effectively supports scholarly use. Identifying such issues and developing strategies to address them is pivotal to the success of a research infrastructure.
To test the emerging infrastructure, CLARIAH awarded a number of ‘Research Pilots’, six of them focussing on the audiovisual domain. Scholars defined a research question and suggested the data collections and tools they would need to address that question in the Media Suite. Recently, we organized a workshop with scholars, content owners, and CLARIAH developers to discuss the scholars' data requirements in detail and to investigate how well these align with the current status of the CLARIAH infrastructure. The workshop improved our mutual understanding of large, institutional data collections in a research infrastructure, but also made clear that there are barriers to overcome in order to serve the needs of scholars with respect to collection access. We identified three caveats with respect to effectively using these collections in practice.
The first one is that scholars make assumptions about the data collections that may not always be valid. As explained by NISV’s expert in media history Bas Agterberg, the process of audiovisual archiving through the years has been influenced by many practical issues, ranging from the take-up of collections assembled for purposes other than archiving and mergers with other institutes to institutional data selection policies that changed over time for various reasons. So, when a scholar is interested in a specific type of programming in a specific time period, it is important to understand that there may be gaps in the archive that could, for instance, influence the representativeness of the data for research. From a research infrastructure perspective, the lesson learned is that we should invest effort in documenting data collections, for example by providing pointers to the existing documentation available with collection owners.
The second issue with collections is that it is often far from obvious how to trace specific programs or genres in the metadata. For scholars, a question like “give me all autobiographical documentaries between 1965 and 1975” makes perfect sense. However, it may require some ‘metadata archaeology’ to discover which metadata fields to query and how to query them in order to select the desired items from a collection. As is the case with the collections themselves, the metadata also have a history with respect to their origin, metadata models and protocols for filling the fields. The Media Suite provides a “Collection Inspector” that could be helpful by providing statistics on the completeness of individual metadata fields in a collection and their distribution over the years. However, the ‘raw’ field names may not always make sense to scholars without background knowledge of the metadata model of a specific collection. To improve its usefulness for scholars, the metadata fields in the Collection Inspector may need to be mapped to a comprehensible format. A minimum requirement is that for each of the collections in the infrastructure we can provide documentation on its metadata model, so that the rationale behind the naming of fields can be tracked down.
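To make the idea of field-completeness statistics concrete, the following is a minimal sketch of how the share of records with a filled-in metadata field could be computed per year. It is not the Collection Inspector's actual implementation: the record layout, the field names ("year", "genre") and the helper functions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: the field names do not reflect the metadata
# model of any actual collection in the infrastructure.
RECORDS = [
    {"id": "a1", "year": 1966, "genre": "documentary"},
    {"id": "a2", "year": 1966, "genre": ""},
    {"id": "a3", "year": 1970, "genre": "documentary"},
    {"id": "a4", "year": 1970},
    {"id": "a5", "year": 1974, "genre": None},
]


def is_filled(value):
    """Treat None and empty or whitespace-only values as not filled in."""
    return value is not None and str(value).strip() != ""


def completeness_by_year(records, field, year_field="year"):
    """Return {year: fraction of records in that year with `field` filled}."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        year = record.get(year_field)
        if year is None:
            continue  # records without a year cannot be placed on the timeline
        total[year] += 1
        if is_filled(record.get(field)):
            filled[year] += 1
    return {year: filled[year] / total[year] for year in sorted(total)}
```

Calling `completeness_by_year(RECORDS, "genre")` on this toy data shows that the genre field is only partly filled in 1966 and 1970 and empty in 1974, which is the kind of signal that would warn a scholar that a genre-based query over that period may miss items.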
The third issue with respect to the usability of data collections in the infrastructure is the availability of transcripts, such as subtitles or manually or automatically generated speech transcripts, which can be used to search for relevant clips in large amounts of data. Such transcripts are typically sparse. For instance, for the broadcast data in the NISV collections, synchronized subtitles are only available from 2006 onwards. To improve search granularity for collections without subtitles, CLARIAH is setting up an automatic speech recognition service that is embedded in the infrastructure and capable of processing very large data collections. One of the envisioned usage models is that scholars who require speech transcripts for specific collections or date ranges can call upon this service on request.
The Media Suite development team is working on (strategies for) the integration of multimedia data collections from DANS (oral history), EYE (film), KB (newspapers for comparative search) and Beeld en Geluid (program guides), in close collaboration with the content owners. The goal is to enable scholars to analyze these data collections in the Media Suite, to access the source data (e.g., view content) via the platforms made available by content owners (e.g., Delpher), and, when necessary, to address the issues of data archaeology and granularity discussed above.