On developing benchmark evaluations

The Multimedia COMMONS 2016 workshop (October 16, 2016), which will run as part of the ACM Multimedia conference in Amsterdam, will provide a forum for the community of current and potential users of the Multimedia Commons. The Multimedia Commons is a multi-institution collaborative initiative that was launched last year to compute features, generate annotations, and develop analysis tools, principally focusing on the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M), which contains around 99.2 million images and nearly 800,000 videos from Flickr. The workshop aims to share novel research using the YFCC100M dataset, emphasizing approaches that were not possible with smaller or more restricted multimedia collections; ask new questions about the scalability, generalizability, and reproducibility of algorithms and methods; re-examine how we use data challenges and benchmarking tasks to catalyze research advances; and discuss priorities, methods, and plans for continuously expanding annotation efforts.

At the MMCommons workshop I will discuss the development of benchmark evaluations in the context of a series of tasks focusing on audiovisual search and emphasizing its ‘multimodal’ aspects. The series started in 2006 with the workshop on ‘Searching Spontaneous Conversational Speech’, which led to tasks in CLEF and MediaEval (“Search and Hyperlinking”) and recently also TRECVid (“Video Hyperlinking”). The value and importance of benchmark evaluations are widely acknowledged, and benchmarks play a key role in many research projects. Establishing a sound evaluation framework takes time, a well-balanced team of domain specialists (preferably with links to the user community and industry), and strong involvement of the research community itself. Such a framework includes (annotated) data sets, well-defined tasks that reflect ‘real world’ needs, a proper evaluation methodology, ground truth (including a strategy for repeated assessments), and, last but not least, funding. Although the benefits of an evaluation framework are typically reviewed from the perspective of ‘research output’ (e.g., a scientific publication demonstrating an advance of a certain methodology), it is important to be aware of the value of the process of creating a benchmark itself: it significantly increases the understanding of the problem we want to address and, as a consequence, also the impact of the evaluation outcomes.

My talk will focus on the process rather than on the results of these evaluations. It will address cross-benchmark connections and new benchmark paradigms, specifically the integration of benchmarking in industrial ‘living labs’ and Evaluation-as-a-Service (EaaS) initiatives, which are becoming popular in some domains.

