GaLAHaD: Generating Linguistic Annotations for Historical Dutch

Historical texts are invaluable for linguistic and digital humanities research. Enriching these texts with linguistic annotations, such as part-of-speech tags and modern Dutch lemmata, enhances their accessibility by simplifying analysis and minimizing the impact of historical spelling variation.

GaLAHaD streamlines data enrichment and tool evaluation by providing an open, user-friendly platform that requires no technical expertise. Researchers can upload, annotate, and assess their data, using features such as annotation comparison and error analysis. Developers can contribute tools through its Docker-based architecture, and can either use the hosted service or self-host the platform from the open-source repositories. The hosted GaLAHaD instance runs on INT’s infrastructure and is accessible with CLARIN credentials.
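
To give a sense of what annotation comparison and error analysis involve, here is a minimal Python sketch. It is not GaLAHaD's own code: the function name compare_annotations, the tag labels, and the toy data are assumptions made for illustration, and it presumes the two annotation layers are already aligned token by token.

```python
from collections import Counter


def compare_annotations(gold, predicted):
    """Compare two token-aligned annotation layers (hypothetical helper).

    Returns the share of tokens on which both layers agree, plus a count
    of every (gold, predicted) disagreement for error analysis.
    """
    if len(gold) != len(predicted):
        raise ValueError("layers must be token-aligned and of equal length")
    errors = Counter()
    correct = 0
    for gold_tag, predicted_tag in zip(gold, predicted):
        if gold_tag == predicted_tag:
            correct += 1
        else:
            errors[(gold_tag, predicted_tag)] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, errors


# Toy part-of-speech layers, invented for this example.
gold = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
predicted = ["DET", "NOUN", "NOUN", "ADP", "NOUN"]

accuracy, errors = compare_annotations(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
for (gold_tag, predicted_tag), count in errors.most_common():
    print(f"gold={gold_tag} predicted={predicted_tag}: {count}")
```

On this toy input the two layers agree on four of five tokens, and the single disagreement is a gold VERB tagged as NOUN. The same comparison works for lemmata if the lists hold lemma strings instead of tags.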

GaLAHaD is developed with open science in mind. The platform code, the code for training taggers, the taggers themselves, and the corpus data are all openly available in the following repositories:

https://github.com/INL/galahad: the GaLAHaD platform itself
https://github.com/INL/galahad-taggers-dockerized: a Docker interface for the taggers that provide linguistic annotation in GaLAHaD
https://github.com/INL/galahad-corpus-data: gold standard data for tagging and lemmatization that can be used to train and evaluate taggers
https://github.com/INL/galahad-train-battery: Python code to train additional taggers on the corpus data