Choosing the right computational tool to perform data enrichment for a specific use case has become a task in itself. This is made considerably harder by the tendency of natural language processing (NLP) engineers to communicate the performance of their tools only as raw numbers on specific benchmark sets. These performance statistics are not only difficult for researchers from outside the field to interpret, but they also become less relevant when a researcher wants to apply the tools to a data set that differs substantially from the data found in classical NLP benchmarks, as is often the case in computational social science and humanities (SSH) projects.
At the 2024 CLARIN International Conference, Angel Daza and Antske Fokkens presented their work titled Choosing the Right Tool for You: Informed Evaluation of Text Analysis Tools (paper forthcoming), an evaluation methodology for comparing model output that takes into account both instance-level and aggregate comparisons, providing a more well-rounded and understandable way of comparing the performance of multiple annotation tools on span annotation tasks. The method allows researchers to evaluate performance on a subset of their own data for their specific use case, and it can be applied at the corpus, document, and sentence levels, with or without gold data.
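To make the general idea concrete, the sketch below shows one way a pairwise, gold-free comparison of span annotations could look in practice: two tools annotate the same sentences, a per-sentence agreement score is computed (instance level), averaged over the corpus (aggregate level), and low-agreement sentences are flagged as outliers worth inspecting. This is a minimal illustration under our own assumptions; the class and function names (Span, span_f1, flag_outliers) and the threshold are hypothetical and do not come from the forthcoming paper or the GaLAHaD platform.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A labelled span: (start, end) token offsets plus a label, e.g. an entity type."""
    start: int
    end: int
    label: str

def span_f1(spans_a: set[Span], spans_b: set[Span]) -> float:
    """Exact-match F1 between two sets of spans (symmetric, no gold data needed)."""
    if not spans_a and not spans_b:
        return 1.0  # both tools agree there is nothing to annotate
    overlap = len(spans_a & spans_b)
    precision = overlap / len(spans_a) if spans_a else 0.0
    recall = overlap / len(spans_b) if spans_b else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def sentence_level_agreement(tool_a: list[set[Span]], tool_b: list[set[Span]]) -> list[float]:
    """Per-sentence agreement between two tools over a parallel list of sentences."""
    return [span_f1(a, b) for a, b in zip(tool_a, tool_b)]

def flag_outliers(scores: list[float], threshold: float = 0.5) -> list[int]:
    """Indices of sentences where the two tools diverge most (instance-level view)."""
    return [i for i, s in enumerate(scores) if s < threshold]

# Example: two hypothetical named-entity tools run over the same three sentences.
tool_a = [{Span(0, 2, "PER")}, {Span(1, 3, "LOC")}, set()]
tool_b = [{Span(0, 2, "PER")}, {Span(1, 3, "ORG")}, set()]

scores = sentence_level_agreement(tool_a, tool_b)
print("per-sentence agreement:", scores)
print("corpus mean:", sum(scores) / len(scores))      # aggregate view
print("sentences to inspect:", flag_outliers(scores)) # instance-level outliers
```

The same per-sentence scores can be aggregated at the document level instead of the corpus level, and the pairwise comparison can be replaced by a comparison against gold annotations where such data exists.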
As part of our task in SSHOC-NL, we are now working towards integrating this approach into the GaLAHaD platform, enabling users to easily compare model outputs across multiple levels and find outliers in their data through visualisations in an accessible environment.