Evaluation of part-of-speech tagging

In this post we provide an extensive overview of how part-of-speech tagging (assigning word classes to tokens) is typically evaluated. We will focus on common metrics such as precision and recall, and describe step-by-step what the numbers mean by means of intuitive examples. This post was initially written as documentation for the GaLAHaD platform, which allows users to upload historical Dutch text, automatically tag it and evaluate it. So while you can read most of this post as an extensive introduction to part-of-speech-tagging evaluation, some parts may be more specific to the GaLAHaD platform.

Metrics

In annotation, False Negatives (FN), False Positives (FP), True Negatives (TN), and True Positives (TP) are concepts that indicate whether a label of interest was present (positive) or absent (negative), and whether the prediction was correct (true) or incorrect (false) about that presence or absence. These concepts are discussed in detail below, but in short they can be summarised as follows, where a prediction (e.g. the predicted part-of-speech tag) may or may not match the true part-of-speech tag.

These metrics are always considered “from the viewpoint” of a specific label, as the examples below will make clear. You’ll notice that False Negatives, False Positives and True Positives can all be explained from the perspective of either the predicted or the correct tag. True Negatives are a bit harder to think of (example 3) because the viewpoint of a different tag is taken (neither the correct nor the predicted label, but any other tag). This makes True Negatives a bit more abstract, and they are often less emphasised in tagging tasks (they are not included in GaLAHaD), but they are included here for the sake of completeness.

Example 1: word liep (English: ran), correct label “verb”, predicted label “noun”

  • From the viewpoint of “verb”: False Negative: the prediction is “not a verb” (negative) and that prediction is incorrect (false)
  • From the viewpoint of “noun”: False Positive: the prediction is “noun” (positive) but that prediction is incorrect (false)

Example 2: word kasteel (English: castle), correct label “noun” = predicted label “noun”

  • From the viewpoint of “noun”: True Positive: the prediction is “noun” (positive) and that prediction is correct (true)

Example 3: word snel (English: quickly), correct label “adverb” = predicted label “adverb”

  • From the viewpoint of “noun”: True Negative: the prediction is “not a noun” (negative) and that prediction is correct (true)
  • From the viewpoint of “verb”: True Negative: the prediction is “not a verb” (negative) and that prediction is correct (true)

So in your analyses you should always consider which perspective (or which label) is important to you.
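
To make the “viewpoint” idea concrete, the following minimal Python sketch (the outcome function is our own illustration, not part of GaLAHaD) classifies a single prediction as TP, FP, FN or TN from the perspective of a chosen label:

    def outcome(reference: str, prediction: str, viewpoint: str) -> str:
        """Classify one token's prediction from the viewpoint of a single label."""
        if prediction == viewpoint:
            # The tagger said "viewpoint": correct (TP) or incorrect (FP)?
            return "TP" if reference == viewpoint else "FP"
        # The tagger said "not viewpoint": incorrect (FN) or correct (TN)?
        return "FN" if reference == viewpoint else "TN"

    # Example 1: liep, correct "verb", predicted "noun"
    print(outcome("verb", "noun", viewpoint="verb"))      # FN
    print(outcome("verb", "noun", viewpoint="noun"))      # FP
    # Example 3: snel, correct "adverb", predicted "adverb"
    print(outcome("adverb", "adverb", viewpoint="noun"))  # TN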

As will be discussed, these counts can then be used to calculate metrics such as Precision, Recall, F1 score and accuracy.

The metrics are explained in detail below but as a summary (where X can be e.g. “verb” or “noun”):

  • Precision: out of all predictions for X, how many are indeed actual X
  • Recall: out of all reference labels X, how many were correctly predicted
  • F1: a balance between precision and recall
  • Accuracy: out of all tokens, how many were correctly predicted as not-X or X

False Negative (FN)

A False Negative (FN) is an annotation that should have been annotated with a label (e.g. X) but is not. The predicted label is not (“negative”) X, and that prediction is incorrect (“false”).

In part-of-speech tagging, this means that a word should have received a certain grammatical label (let’s call it label X, such as “verb”), but it wasn’t marked as such by the tagger. In the reference layer (the correct answer key), the word has label X, but in the hypothesis layer (a tagger’s prediction), it is wrongly labelled as something else.

Example: The Dutch word liep (past tense of lopen, meaning ’to walk’) is not tagged as a verb but, incorrectly, as a noun. So the “verb” annotation for liep is a false negative: the prediction was not “verb” (negative) and that prediction is incorrect (false).

False Positive (FP)

A False Positive (FP) occurs when a label X is incorrectly assigned to a word. The predicted label is X (“positive”), and that prediction is incorrect (“false”).

So a word was tagged with a particular grammatical label (e.g. “verb”), but it shouldn’t have been. In the hypothesis layer, the word is incorrectly given label X, while the reference layer shows it should have a different label.

Example: The tagger tags the word het (meaning ’the’ in English) as a verb, which is incorrect because het is actually a determiner. So the “verb” annotation for het is a false positive: the prediction was “verb” (positive) and that prediction is incorrect (false).

True Positive (TP)

A True Positive (TP) signifies a correct annotation, where a word is accurately labelled with a category X. The predicted label is X (“positive”), and that prediction is correct (“true”).

True Positives occur when the tagger assigns the correct grammatical label (such as “verb”) to a word. Both the hypothesis layer (a tagger’s prediction) and the reference layer (the correct answer key) agree that the word is a verb.

Example: The Dutch word liep was correctly identified as a verb, matching the reference layer’s labelling.

True Negative (TN)

A True Negative (TN) refers to the correct identification that a word should not be labelled X. The predicted label is not X (“negative”), and that prediction is correct (“true”).

A True Negative indicates that a word was correctly tagged with a label different from the label X under consideration: the word should not have label X and the tagger indeed did not assign it. Both the hypothesis layer (a tagger’s prediction) and the reference layer (the correct answer key) show that the word does not have label X.

Example: The tool correctly tags koek (meaning ‘cookie’) as a noun, so from the perspective of the label “verb” this prediction is a true negative: the word was correctly not classified as a verb.

Precision

To stay in the historical atmosphere of our GaLAHaD tool for historical Dutch: imagine you are an archer shooting arrows at a target. Precision (P) indicates how many of your arrows, out of all the ones you shot, actually hit the bullseye. In terms of tagging, precision tells us how many of the annotations in the hypothesis layer are actually correct. The focus lies on the prediction (precision – prediction).

Mathematically, precision is calculated using the formula:

P = TP / (TP+FP)

So precision is the ratio of the correct “verb” predictions (TP) to all “verb” predictions, including those that are incorrect (TP+FP). In other words, Precision wants the false positives (incorrect predictions of “verb”) to be low.

Example: Suppose our tagger identified 100 tokens with the “verb” label (TP+FP). However, only 87 of these predictions are correct (TP). So the precision is 87% (87/100). This suggests that 87 out of 100 times, our tagger’s “arrows” hit the target, accurately labelling verbs.
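
As a quick sanity check, the example translates directly into a couple of lines of Python (the counts are the hypothetical ones from the example above):

    tp = 87                     # correct "verb" predictions
    fp = 13                     # incorrect "verb" predictions (100 predicted in total)
    precision = tp / (tp + fp)
    print(precision)            # 0.87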

On the corpus level (Precision across all instances), Precision can be calculated with micro-averaging or macro-averaging, where different approaches are taken on combining the precision of all possible labels into one final score. This is described in the section on “Macro-averaging versus micro-averaging”.

Recall

Rather than an archer, now imagine you’re an adept spy tasked with spotting enemy soldiers in the night. The goal is to spot all of the soldiers. Recall (R) is analogous to how many soldiers you successfully identify compared to all the ones that are actually hiding in the woods. Just like the spy aiming to spot every enemy, recall measures how well our tagger identifies all the words that truly belong to a particular grammatical category. It’s a measure of thoroughness in detecting all the right answers. So the focus lies on the references (recall – reference).

Recall can be formulated as follows:

R = TP / (TP+FN)

So recall is the ratio of the correct “verb” predictions (TP) to all the occurrences of “verb” in the reference, including the False Negatives that the tagger failed to predict (TP+FN). Put differently, Recall emphasises that false negatives should be low (“verb” tokens that were incorrectly labelled as something else).

Example: in our reference layer, we see that the label “verb” occurs 100 times (TP+FN). However, the tagger only tagged 64 as “verb” (TP) and did not spot the other 36 (FN). Recall is therefore 64%.
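
The same sanity check works for recall, again with the hypothetical counts from the example above:

    tp = 64                  # "verb" tokens the tagger found
    fn = 36                  # "verb" tokens in the reference that it missed
    recall = tp / (tp + fn)
    print(recall)            # 0.64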

To aggregate Recall on the corpus level to achieve one total Recall score for the whole corpus, the Recall of all possible labels can be averaged. Typically two approaches are distinguished, macro-averaging and micro-averaging, described in the section on “Macro-averaging versus micro-averaging”.

Difference between precision and recall

Let’s consider two example scenarios that exemplify the difference between precision and recall. In both scenarios our dataset consists of 100 tokens and our label of interest is “verb”.

High precision, low recall

  • TP (correctly labelled as verb): 10
  • FP (incorrectly labelled as verb): 1
  • FN (incorrectly labelled as not a verb): 15
  • TN (correctly labelled as not a verb): 74

Notice the difference between the number of FPs and FNs: the tagger made only one mistake by predicting “verb” where it should not have, but it made 15 mistakes by tagging a token as a non-verb where it should have been a verb. We calculate precision and recall as illustrated in the previous sections:

  • P = TP / (TP+FP) = 10 / (10+1) = 90.9%
  • R = TP / (TP+FN) = 10 / (10+15) = 40%

Despite a high precision, the tagger’s recall for the tag “verb” is low. When it predicted the label “verb” it was most often correct (high precision), but out of all cases where “verb” was the correct reference, it missed a lot of verbs (low recall). This occurs when a tagger does not make a lot of “verb” predictions (so it misses quite a few of the verbs), but the “verb” predictions that it does make are often correct.

Low precision, high recall

  • TP (correctly labelled as verb): 20
  • FP (incorrectly labelled as verb): 30
  • FN (incorrectly labelled as not a verb): 5
  • TN (correctly labelled as not a verb): 45

In contrast to the previous example, we now see that the model predicted the label “verb” 30 times when the token should not have been a verb (FP), and only 5 times predicted something other than “verb” when the token should have been a verb (FN). This will lead to a difference in precision and recall:

  • P = TP / (TP+FP) = 20 / (20+30) = 40%
  • R = TP / (TP+FN) = 20 / (20+5) = 80%

In contrast to the previous example, the tagger now identifies most verbs (high recall) but it “overshoots”; it incorrectly labelled many tokens as “verb” that should have been something else (low precision). This happens when the tagger assigns a lot of “verb” tags and by doing so indeed finds the verbs in the text but also misclassifies many other words as verb.

F1

While the archer and the spy mentioned above both have their skills and uses, a master strategist is able to combine both of their strengths. Similarly, the F1 score is a combination of Precision and Recall that provides a balanced view of a tagger’s performance, accounting both for how accurate its predictions are and for how many of the references it actually got right.

Mathematically, the F1 score is calculated as the harmonic mean of Precision and Recall, formulated as:

F1 = 2 * (P * R) / (P + R)

Or, using true/false and positive/negative terminology:

F1 = 2TP / (2TP + FP + FN)

Combining Precision and Recall in this way strikes a balance between predicting accurately (Precision) and predicting thoroughly (Recall). A tagger must score well on both facets to achieve a high F1 score. To illustrate this, we can calculate the F1 score for the two scenarios above (high precision and low recall vs. low precision and high recall).

High precision, low recall:

  • P = TP / (TP+FP) = 10 / (10+1) = 90.9%
  • R = TP / (TP+FN) = 10 / (10+15) = 40%
  • F1 = 2 * (P * R) / (P + R) = 2 * (0.909 * 0.40) / (0.909 + 0.40) = 0.555…

Low precision, high recall:

  • P = TP / (TP+FP) = 20 / (20+30) = 40%
  • R = TP / (TP+FN) = 20 / (20+5) = 80%
  • F1 = 2 * (P * R) / (P + R) = 2 * (0.40 * 0.80) / (0.40 + 0.80) = 0.533…

From these examples it should be clear that very different precision and recall values can lead to similar F1 scores. Depending on your interest, it can therefore still be useful to look at precision and/or recall separately!
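
Both scenarios can be recomputed with a minimal sketch based on the count formulation of F1 (the helper function is our own):

    def f1(tp: int, fp: int, fn: int) -> float:
        """Harmonic mean of precision and recall, via 2TP / (2TP + FP + FN)."""
        return 2 * tp / (2 * tp + fp + fn)

    print(f1(tp=10, fp=1, fn=15))  # high precision, low recall -> 0.555...
    print(f1(tp=20, fp=30, fn=5))  # low precision, high recall -> 0.533...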

Macro-averaging versus micro-averaging

When we aggregate precision, recall and F1 to retrieve a single corpus-level score rather than individual scores for each class (“verb”, “noun”, etc.), we can average all the label scores. There are different ways to average these label scores, each with their own uses. Macro-averaging looks at the highest level and assigns equal weight to all classes. Micro-averaging instead puts the samples under a microscope and assigns equal weight to each individual instance.

Macro-averaging

Once more relying on an example in a historical setting, we can consider macro-averaging as the case of captains in the army. Within the army, the opinion of each captain is equally important regardless of how many soldiers they lead. Similarly, in macro-averaging a high frequency class like “verb” that has many occurrences is weighed the same as a lower frequency class like an interjection – the tagger’s performance on verbs is considered equally important to its performance on interjections.

Mathematically, the macro-averages for Precision, Recall and F1 are – intuitively – the averages of the P, R and F1 for each class (e.g. precision for “verb”, precision for “noun”, precision for “interjection”, etc.):

  • P_macro = sum of all class Ps / number of classes
  • R_macro = sum of all class Rs / number of classes
  • F1_macro = sum of all class F1s / number of classes

By giving equal weight to all classes, regardless of their frequency, the aggregated scores give an intuition of how a tagger performs “on average” across all classes, although this average may be highly skewed when the model performs very differently on some of the classes.
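
As a small illustration, macro-averaged precision is simply the unweighted mean of the per-class precisions; the values below are taken from the worked example at the end of this post:

    # Per-class precision scores (from the worked example at the end of this post)
    class_precision = {"verb": 1.0, "noun": 0.6, "adjective": 1.0}
    p_macro = sum(class_precision.values()) / len(class_precision)
    print(p_macro)  # 0.8666...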

Micro-averaging

Unlike macro-averaging, where a captain’s voice (class) is given equal importance regardless of the number of soldiers, micro-averaging does not consider the captains themselves but instead focuses on the individual soldiers. Each individual soldier gets their own voice, and each voice counts equally. In tagging, that means that every instance is counted, so classes that occur more frequently, like “verb”, will have a stronger impact on the final score.

The micro-average can be calculated from the FN, FP, TP and TN counts discussed above:

  • P_micro = sum of all TP / sum of all (TP+FP)
  • R_micro = sum of all TP / sum of all (TP+FN)
  • F1_micro = 2 * (P_micro * R_micro) / (P_micro + R_micro)

Emphasising the individual importance of each sample implies that majority classes will have more impact on the final score. This approach can be useful when you are more interested in the real distribution of classes, where you indeed want high-frequency classes to weigh more heavily and you are less interested in minority classes.
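
A matching sketch for micro-averaging pools the raw counts over all classes before dividing; the counts are again taken from the worked example at the end of this post:

    # Pooled counts over all classes (verb, noun, adjective)
    tp_total = 2 + 3 + 3
    fp_total = 0 + 2 + 0
    fn_total = 1 + 0 + 1
    p_micro = tp_total / (tp_total + fp_total)              # 8 / 10 = 0.8
    r_micro = tp_total / (tp_total + fn_total)              # 8 / 10 = 0.8
    f1_micro = 2 * p_micro * r_micro / (p_micro + r_micro)  # 0.8
    print(p_micro, r_micro, f1_micro)

Note that when every token receives exactly one reference label and one predicted label, every False Positive for one class is a False Negative for another, so the pooled FP and FN counts are equal and micro-precision, micro-recall and micro-F1 all coincide with corpus-level accuracy (0.8 here).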

Choosing micro-averaging or macro-averaging

Macro-averaging is useful when you care about the model’s ability to perform well across each class (e.g. PoS tag) independently. It is helpful if you’re interested in an even performance in every class, regardless of the class frequency.

Micro-averaging is better suited when you want to evaluate the overall performance across all instances (tokens), especially if there is class imbalance where some tags are more frequent than others. For example, if verbs and nouns are overwhelmingly more frequent than other tags, micro-averaging will provide a performance measure that is highly impacted by these high-frequency classes.
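
GaLAHaD computes these aggregated scores for you, but if you evaluate tagger output yourself, libraries such as scikit-learn expose the same choice through an average parameter. A rough sketch (the token lists are the ten-token example used later in this post; the use of scikit-learn is our own illustration, not something GaLAHaD relies on):

    from sklearn.metrics import precision_recall_fscore_support

    reference = ["verb", "adjective", "noun", "adjective", "verb",
                 "noun", "adjective", "noun", "verb", "adjective"]
    predicted = ["noun", "adjective", "noun", "adjective", "verb",
                 "noun", "noun", "noun", "verb", "adjective"]

    # Equal weight per class vs. equal weight per token
    print(precision_recall_fscore_support(reference, predicted, average="macro")[:3])
    print(precision_recall_fscore_support(reference, predicted, average="micro")[:3])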

Accuracy

Accuracy provides a simple, intuitive way of interpreting a tagger’s performance. It comes down to “how many of the predictions were correct”, or in GaLAHaD terminology: how many of the predictions in the prediction layer are identical to the ones in the reference layer.

Label-specific accuracy (binary accuracy)

Accuracy can be calculated for an individual label like “verb”, in which case we consider a binary case: out of all tokens, how many were correctly identified as a “verb” or correctly identified as “not a verb”. Mathematically, that is the ratio of the true positives plus the true negatives to the total number of tokens.

Acc_binary = (TP+TN) / Total, where Total = TP+FP+TN+FN

Example: Precision, Recall and Accuracy for “verb”

Token                Actual POS    Predicted POS
lopen (to run)       verb          noun
snel (fast)          adjective     adjective
hond (dog)           noun          noun
mooie (beautiful)    adjective     adjective
spring (jump)        verb          verb
katten (cats)        noun          noun
lui (lazy)           adjective     noun
vos (fox)            noun          noun
speelde (played)     verb          verb
blij (happy)         adjective     adjective

Consider the table above, where the tagger has made some predictions. To calculate binary accuracy on “verb”, we first calculate the positives and negatives:

  • TP (correctly predicted verbs): spring, speelde = 2
  • FP (incorrectly predicted as verbs): 0
  • FN (incorrectly not predicted as verbs): lopen = 1
  • TN (correctly predicted as not a verb): remaining words = 7

These numbers allow us to calculate Precision and Recall as before, and now also accuracy:

  • P = TP / (TP + FP) = 2 / (2 + 0) = 1.0
  • R = TP / (TP + FN) = 2 / (2 + 1) = 0.666…
  • Acc = (TP + TN) / Total = (2 + 7) / 10 = 0.9

So out of the two “verb” predictions that were made, both were indeed correct (Precision). Out of the three reference tokens that should have been “verb”, only two were found (Recall). And for every token the tagger correctly decided whether it is a verb or not, except for lopen, which it incorrectly classified as a noun (binary accuracy).
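
The counts and scores above can be reproduced with a short pure-Python sketch (variable names are our own; this is not GaLAHaD code):

    reference = ["verb", "adjective", "noun", "adjective", "verb",
                 "noun", "adjective", "noun", "verb", "adjective"]
    predicted = ["noun", "adjective", "noun", "adjective", "verb",
                 "noun", "noun", "noun", "verb", "adjective"]

    label = "verb"
    tp = sum(r == label and p == label for r, p in zip(reference, predicted))
    fp = sum(r != label and p == label for r, p in zip(reference, predicted))
    fn = sum(r == label and p != label for r, p in zip(reference, predicted))
    tn = sum(r != label and p != label for r, p in zip(reference, predicted))

    print(tp, fp, fn, tn)              # 2 0 1 7
    print(tp / (tp + fp))              # precision: 1.0
    print(tp / (tp + fn))              # recall: 0.666...
    print((tp + tn) / len(reference))  # binary accuracy: 0.9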

Corpus-level accuracy

Using binary accuracy to see whether a tag was correctly predicted is useful, but accuracy is used more often on the corpus level to get a more intuitive idea of the tagger’s performance. Accuracy is considered less nuanced but more intuitive than Precision, Recall and F1 score because it only considers whether a given tag matches the reference or not. It is easy to calculate and easy to understand: how many of the predictions were actually correct?

Acc = correct predictions / number of tokens

This formulation of accuracy is similar to micro-averaging because the classes are not given equal weight but instead all predictions are counted as equal, like each individual soldier in the army in the examples above.

Example: Precision, Recall and Accuracy for the whole corpus

Token                Actual POS    Predicted POS
lopen (to run)       verb          noun
snel (fast)          adjective     adjective
hond (dog)           noun          noun
mooie (beautiful)    adjective     adjective
spring (jump)        verb          verb
katten (cats)        noun          noun
lui (lazy)           adjective     noun
vos (fox)            noun          noun
speelde (played)     verb          verb
blij (happy)         adjective     adjective

While calculating corpus-level accuracy is easy (simply the proportion where the labels in the “predicted” column are the same as the “actual” column), Precision and Recall (and therefore F1) do require us to first calculate the positives and negatives for each label. In this dummy corpus we only have three labels: verb, noun and adjective.

Verb:

  • TP (correctly predicted verbs): spring, speelde = 2
  • FP (incorrectly predicted as verbs): 0
  • FN (incorrectly not predicted as verbs): lopen = 1
  • TN (correctly predicted as not a verb): remaining words = 7
  • P = TP / (TP + FP) = 2 / (2 + 0) = 1.0
  • R = TP / (TP + FN) = 2 / (2 + 1) = 0.666…
  • Acc = (TP + TN) / Total = (2 + 7) / 10 = 0.9

Noun:

  • TP (correctly predicted nouns): hond, katten, vos = 3
  • FP (incorrectly predicted as nouns): lopen, lui = 2
  • FN (incorrectly not predicted as nouns): 0
  • TN (correctly predicted as not a noun): remaining words = 5
  • P = TP / (TP + FP) = 3 / (3 + 2) = 0.6
  • R = TP / (TP + FN) = 3 / (3 + 0) = 1.0
  • Acc = (TP + TN) / Total = (3 + 5) / 10 = 0.8

Adjective:

  • TP (correctly predicted adjectives): snel, mooie, blij = 3
  • FP (incorrectly predicted as adjectives): 0
  • FN (incorrectly not predicted as adjectives): lui = 1
  • TN (correctly predicted as not an adjective): remaining words = 6
  • P = TP / (TP + FP) = 3 / (3 + 0) = 1.0
  • R = TP / (TP + FN) = 3 / (3 + 1) = 0.75
  • Acc = (TP + TN) / Total = (3 + 6) / 10 = 0.9

All in all our tagger does its job quite well! We can look at different metrics depending on what our focus is:

  • Precision: out of all predictions for X, how many were indeed actual X
  • Recall: out of all reference labels X, how many were correctly predicted
  • Accuracy: out of all tokens, how many were correctly predicted as not-X or X

Now that we calculated the class-specific scores, we can aggregate them into single scores for the whole corpus. For the sake of ease we will use macro-averaging for Precision and Recall here.

  • P = sum of all class Ps / number of classes = (1.0 + 0.6 + 1.0) / 3 = 0.866…
  • R = sum of all class Rs / number of classes = (0.66… + 1.0 + 0.75) / 3 = 0.8055…
  • Acc = number of correct predictions / number of tokens = 8 / 10 = 0.8

Note here that macro-averaging has smoothed out the lower P=0.6 for nouns! The high precision of verbs and adjectives is now masking the lower precision of nouns.
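
To tie the corpus-level calculation together, the following compact pure-Python sketch (helper names are our own) recomputes the per-class scores, the macro-averages and the corpus accuracy from the table above:

    reference = ["verb", "adjective", "noun", "adjective", "verb",
                 "noun", "adjective", "noun", "verb", "adjective"]
    predicted = ["noun", "adjective", "noun", "adjective", "verb",
                 "noun", "noun", "noun", "verb", "adjective"]

    labels = sorted(set(reference))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(r == label and p == label for r, p in zip(reference, predicted))
        fp = sum(r != label and p == label for r, p in zip(reference, predicted))
        fn = sum(r == label and p != label for r, p in zip(reference, predicted))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    p_macro = sum(precisions) / len(labels)
    r_macro = sum(recalls) / len(labels)
    accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
    print(round(p_macro, 3), round(r_macro, 3), accuracy)  # 0.867 0.806 0.8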

GaLAHaD-specific

No match

No match counts the number of instances for which there was an issue with the alignment of the hypothesis and reference layers, so that a predicted label could not be successfully mapped to a reference label and no score could be calculated.

Multiple part-of-speech

For tokens that in fact consist of more than one word, a multiple analysis is given. This means that one token is not only assigned more than one lemma but also more than one part of speech.

Example: token “int” analysed as IN (ADP) + HET (PD) (English: in it)

The evaluation results for part-of-speech tagging take the assignment of multiple part-of-speech tags into account.
