3.1. Evaluation for named entity recognition (NER) and entity linking (EL)#
## Summary
This recipe describes how to evaluate named entity recognition (NER) and entity linking (EL) output in two scenarios:

- The user has ground truth data, i.e. manually verified entity tags for the entities found in a given text. In this case, quantitative evaluation is possible.
- The user does not have ground truth data but is willing to manually inspect the NER output in order to spot and flag errors, inconsistencies, hallucinations, etc. In this case, qualitative evaluation is necessary. As this process is time-consuming, it can be supported by in-notebook visualizations for quick data inspection.
Note that the recipe only showcases a subset of the possible approaches, cf. Variations and alternatives.
## Rationale
These methods help the user assess the quality of named entity recognition (NER) and entity linking (EL) outputs. This is essential for any application, but especially when communicating with lay people, who often have reservations about new technologies.
The cookbook supports both situations: data that comes with ground truth labels (quantitative evaluation) and data that is not labeled (qualitative evaluation, a.k.a. eye-balling).
To run the quantitative evaluation, we use the HIPE-scorer, a set of Python scripts developed as part of the HIPE shared task, which focused on named entity processing of historical documents. As such, these scripts impose certain requirements, for example on file naming and data format.
The output data can be fed to application recipes for visualizing and analyzing errors, which makes estimating performance easier, also for lay people.
3.1.1. Process overview#
The evaluation module takes as input a TSV file in which the first column is the token and the remaining columns classify that token.
If the file includes gold labels, the user can perform a quantitative evaluation of the annotated test data. The process consists of the following steps:

- Installing the HIPE scorer
- Downloading the evaluation data and ground truths
- Reshaping the data into the format required by the scorer
- Running the scorer and saving the results
If the file does not include gold labels, the cookbook returns a visualization of the annotation and allows the user to give free-text feedback on it.
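To make the expected input concrete, here is a tiny, made-up example of a HIPE-style TSV file, parsed with pandas. The tokens, tags and Wikidata ID are illustrative, and only a subset of the full HIPE column set is shown:

```python
from io import StringIO

import pandas as pd

# A minimal, illustrative HIPE-style TSV: the first column is the token,
# the remaining columns classify it (coarse NER tag, entity link, misc flags).
sample_tsv = (
    "TOKEN\tNE-COARSE-LIT\tNEL-LIT\tMISC\n"
    "Rome\tB-LOC\tQ220\t_\n"
    "is\tO\t_\t_\n"
    "beautiful\tO\t_\tNoSpaceAfter\n"
    ".\tO\t_\t_\n"
)

df = pd.read_csv(StringIO(sample_tsv), sep="\t")
print(df)
```

The `NoSpaceAfter` flag in the `MISC` column signals that no whitespace follows the token, which matters when reconstructing the running text later in this notebook.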
3.1.2. Preparation#
The notebook cells in this section contain the definitions of functions that are used further down in the notebook. These cells must be run, but you don’t need to inspect them closely unless you want to modify the behaviour of this notebook.
3.1.2.1. Preparation for HIPE scorer#
! git clone https://github.com/enriching-digital-heritage/HIPE-scorer.git
Cloning into 'HIPE-scorer'...
remote: Enumerating objects: 1004, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 1004 (delta 47), reused 49 (delta 20), pack-reused 897 (from 1)
Receiving objects: 100% (1004/1004), 311.16 KiB | 2.93 MiB/s, done.
Resolving deltas: 100% (638/638), done.
cd HIPE-scorer/
/content/HIPE-scorer
pip install -r requirements.txt
Collecting docopt (from -r requirements.txt (line 1))
Downloading docopt-0.6.2.tar.gz (25 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 2)) (2.0.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 3)) (2.2.2)
Requirement already satisfied: smart_open in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 4)) (7.3.0.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2025.2)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open->-r requirements.txt (line 4)) (1.17.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas->-r requirements.txt (line 3)) (1.17.0)
Building wheels for collected packages: docopt
Building wheel for docopt (setup.py) ... done
Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=b365b69191028b0ef62d19ff3fed819dfc0464286576e8ec8243595cbffcd328
Stored in directory: /root/.cache/pip/wheels/1a/bf/a1/4cee4f7678c68c5875ca89eaccf460593539805c3906722228
Successfully built docopt
Installing collected packages: docopt
Successfully installed docopt-0.6.2
! pip install .
Processing /content/HIPE-scorer
Preparing metadata (setup.py) ... done
Building wheels for collected packages: HIPE-scorer
Building wheel for HIPE-scorer (setup.py) ... done
Created wheel for HIPE-scorer: filename=HIPE_scorer-2.0-py3-none-any.whl size=15478 sha256=bd836e2c8559efdf027f53830cd3a4a61581474b9eee31aa70539aad693c848c
Stored in directory: /root/.cache/pip/wheels/6c/70/25/36232846b9cd45c513678a5037cd77f079a0d86c8f80b7a6e7
Successfully built HIPE-scorer
Installing collected packages: HIPE-scorer
Successfully installed HIPE-scorer-2.0
3.1.2.2. Preparation for manual assessment#
import pandas as pd
from IPython.display import HTML, display
# Function for visualising the entities with link
def highlight_entities(
    data,
    iob_column="NE-COARSE-LIT",
    base_url="https://www.wikidata.org/wiki/"
):
    # 1) Rebuild the text with spacing rules
    text_parts = []
    for idx, row in data.iterrows():
        tok = row["TOKEN"]
        if "NoSpaceAfter" in row["MISC"]:
            text_parts.append(tok)
        else:
            text_parts.append(tok + " ")
    text = "".join(text_parts)

    # 2) Merge contiguous IOB entities of the same type
    entities = []
    current = None  # {"start_char": int, "end_char": int, "label": str}
    for idx, row in data.iterrows():
        tag = row[iob_column]
        if tag == "_" or tag == "O":
            # close any open entity
            if current is not None:
                entities.append(current)
            current = None
            continue
        # Extract type and prefix
        if tag.startswith("B-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            # close previous if open
            if current is not None:
                entities.append(current)
            # start new
            current = {
                "start_char": int(row["start_char"]),
                "end_char": int(row["end_char"]),
                "label": etype,
                "id": eid
            }
        elif tag.startswith("I-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            if current is not None and current["label"] == etype:
                # extend current run
                current["end_char"] = int(row["end_char"])
            else:
                # stray I- (no open run or different type) → treat as B-
                if current is not None:
                    entities.append(current)
                current = {
                    "start_char": int(row["start_char"]),
                    "end_char": int(row["end_char"]),
                    "label": etype,
                    "id": eid
                }
        else:
            # Unknown tag → close any open entity
            if current is not None:
                entities.append(current)
            current = None
    # flush any remaining entity
    if current is not None:
        entities.append(current)

    # 3) Render with spans (note: end_char is inclusive → slice to en+1)
    entities.sort(key=lambda e: e["start_char"])
    result = ""
    last_idx = 0
    for e in entities:
        s, en = int(e["start_char"]), int(e["end_char"])
        etext = text[s:en + 1]  # inclusive end
        etype = e.get("label", "Other")
        eid = e.get("id", "")
        color = label_to_color.get(etype, "#dddddd")
        # decide whether to show eid as a link or not
        if eid != "_" and eid != "NIL":
            eid_html = f'<a href="{base_url}{eid}">{eid}</a>'
        else:
            eid_html = ""  # if entity linking was not successful, no link is shown
        result += text[last_idx:s]
        result += (
            f'<span style="background-color:{color}; color:black; padding:3px 6px; '
            f'border-radius:16px; margin:0 2px; display:inline-block; '
            f'box-shadow: 1px 1px 3px rgba(0,0,0,0.1);">'
            f'{etext}'
            f'<span style="font-size:0.75em; font-weight:normal; margin-left:6px; '
            f'background-color:rgba(0,0,0,0.05); padding:1px 6px; border-radius:12px;">'
            f'{etype} {eid_html}</span></span>'
        )
        last_idx = en + 1  # continue after inclusive end
    result += text[last_idx:]
    return result, text, entities
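To make the merging step easier to verify in isolation, here is a minimal, self-contained sketch of the same IOB2 span-merging logic applied to a toy tag sequence (the helper name `merge_iob_spans` is ours, not part of the notebook):

```python
def merge_iob_spans(tags):
    """Merge a flat IOB2 tag sequence into (start_index, end_index, label) spans.

    end_index is inclusive, mirroring the end_char convention used above.
    """
    spans = []
    current = None  # [start, end, label]
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current is not None:
                spans.append(current)
            current = [i, i, tag[2:]]
        elif tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1] = i  # extend the open span
        elif tag.startswith("I-"):
            # stray I- without a matching open span: treat it as B-
            if current is not None:
                spans.append(current)
            current = [i, i, tag[2:]]
        else:  # "O" or "_" closes any open span
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [tuple(s) for s in spans]

print(merge_iob_spans(["B-PERSON", "I-PERSON", "O", "B-LOC", "I-ORG"]))
# → [(0, 1, 'PERSON'), (3, 3, 'LOC'), (4, 4, 'ORG')]
```

Note how the stray `I-ORG` after `B-LOC` is treated as the start of a new entity rather than being dropped, just as in `highlight_entities` above.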
3.1.2.3. Data preparation#
# we need a function to convert the British Museum groundtruth data into the format expected by the HIPE scorer.
# The data is found here: http://145.38.185.232/enriching/bm.txt
def get_demo_data():
    import pandas as pd

    # Loading the data
    data = pd.read_csv(
        'https://raw.githubusercontent.com/wjbmattingly/llm-lod-recipes/refs/heads/main/output/sample.tsv',
        sep='\t'
    )

    # Adding the start and end characters per token to the dataframe
    data['start_char'] = 0
    data['end_char'] = 0
    current_char = 0
    for index, row in data.iterrows():
        data.loc[index, 'start_char'] = current_char
        token = row['TOKEN']
        # Check if the next token should not have a space before it
        if index + 1 < len(data) and 'NoSpaceAfter' in data.loc[index, 'MISC']:
            current_char += len(token)
        else:
            current_char += len(token) + 1  # Add 1 for the space after the token
        data.loc[index, 'end_char'] = current_char - 1  # Subtract 1 because end_char is inclusive

    # Just for testing purposes: adds a Wikidata ID to one entity
    data.loc[0, 'NEL-LIT'] = 'Q1744'
    return data
df = get_demo_data()
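The offset computation above can be checked against a toy example. The standalone sketch below (the helper name `char_offsets` is ours) mirrors the convention used in `get_demo_data`: `end_char` points at the position just before the next token starts, so it covers the trailing space when one follows, and `NoSpaceAfter` suppresses that space:

```python
def char_offsets(tokens, no_space_after):
    """Compute (start_char, end_char) per token, as in get_demo_data.

    no_space_after[i] is True when no space follows tokens[i];
    end_char is the position just before the next token starts.
    """
    offsets = []
    pos = 0
    for tok, no_space in zip(tokens, no_space_after):
        start = pos
        pos += len(tok) if no_space else len(tok) + 1
        offsets.append((start, pos - 1))
    return offsets

print(char_offsets(["Madonna", "and", "child", ";"], [False, False, True, False]))
# → [(0, 7), (8, 11), (12, 16), (17, 18)]
```

These values match the `start_char`/`end_char` columns shown by `data.head()` further down in the notebook.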
# preparing data for groundtruth evaluation
import regex as re

def clean_format(input_file, output_file):
    with open(input_file) as file, open(output_file, mode='w') as output:
        output.write('\t'.join([
            "TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT",
            "NE-FINE-METO", "NE-FINE-COMP", "NE-NESTED", "NEL-LIT",
            "NEL-METO", "MISC\n"
        ]))
        i = 1
        for line in file:
            line = line.strip()
            mod_line = re.sub('_', '-', line)
            if re.search('-DOCSTART- -DOCSTART- -DOCSTART-', mod_line):
                mod_line = re.sub('-DOCSTART- -DOCSTART- -DOCSTART-', f'# document_{i}', mod_line)
                i += 1
            mod_line = mod_line.split(' ')
            try:
                if len(mod_line) == 2:
                    output.write("\n" + " ".join(mod_line) + "\n")
                else:
                    output.write('\t'.join([
                        mod_line[0], mod_line[1], '-', '-', '-', '-', '-',
                        mod_line[2], '-', '-\n',
                    ]))
            except IndexError:
                # skip malformed lines with fewer columns than expected
                continue

clean_format('./data/bm_labels.txt', './data/gold/sample.tsv')
clean_format('./data/bm-2-ner-format.txt', './data/predictions/sample.tsv')
3.1.3. NER evaluation with groundtruth data by using the HIPE scorer#
! python clef_evaluation.py --help
Evaluate the systems for the HIPE Shared Task
Usage:
clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_coarse [options]
clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_fine [options]
clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nel [--n_best=<n>] [options]
clef_evaluation.py -h | --help
Options:
-h --help Show this screen.
-t --task=<type> Type of evaluation task (nerc_coarse, nerc_fine, nel).
-e --hipe_edition=<str> Specify the HIPE edition (triggers different set of columns to be considered during eval). Possible values: hipe-2020, hipe-2022 [default: hipe-2020]
-r --ref=<fpath> Path to gold standard file in CONLL-U-style format.
-p --pred=<fpath> Path to system prediction file in CONLL-U-style format.
-o --outdir=<dir> Path to output directory [default: .].
-l --log=<fpath> Path to log file.
-g --original_nel It splits the NEL boundaries using original CLEF algorithm.
-n, --n_best=<n> Evaluate NEL at particular cutoff value(s) when provided with a ranked list of entity links. Example: 1,3,5 [default: 1].
--noise-level=<str> Evaluate NEL or NERC also on particular noise levels (normalized Levenshtein distance of their manual OCR transcript). Example: 0.0-0.1,0.1-1.0,
--time-period=<str> Evaluate NEL or NERC also on particular time periods. Example: 1900-1950,1950-2000.
--glue=<str> Provide two columns separated by a plus (+) whose label are glued together for the evaluation (e.g. COL1_LABEL.COL2_LABEL). When glueing more than one pair, separate by comma.
--skip-check Skip check that ensures that the files name is in line with submission requirements.
--tagset=<fpath> Path to file containing the valid tagset of CLEF-HIPE.
--suffix=<str> Suffix that is appended to output file names and evaluation keys.
3.1.3.1. Running the scorer#
‼️ In the cell below it is important to note the parameter --task. This parameter value needs to be adjusted depending on the task one wants to evaluate (i.e. NER or EL). When evaluating NER we use --task nerc_coarse, while for evaluating EL we use --task nel.
import glob
import regex as re
for doc in glob.glob('./data/predictions/*'):
    gold = re.sub('predictions', 'gold', doc)
    ! python clef_evaluation.py --ref "{gold}" --pred "{doc}" --task nerc_coarse --outdir ./data/evaluations/ --hipe_edition hipe-2020 --log=./data/evaluations/scorer.log
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'I-LOC -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'I-LOC -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
true: 100 pred: 100
data_format_true [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [31], [90], [41], [28], [8], [41], [23], [30], [22], [13], [53], [66], [9], [10], [69], [8], [39], [48], [81], [52], [38], [19], [59], [30], [34], [27], [44], [32], [47], [45], [4], [58], [43], [17], [23], [56], [9]]
data_format_pred [[35], [50], [38], [76], [56], [39], [73], [32], [20], [115], [49], [41], [17], [85], [8], [47], [31], [13], [36], [21], [12], [35], [47], [32], [24], [23], [58], [48], [56], [70], [45], [47], [40], [92], [16], [8], [68], [24], [53], [8], [35], [72], [28], [31], [15], [36], [46], [29], [90], [46], [49], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [32], [92], [42], [28], [8], [41], [23], [30], [22], [14], [54], [66], [10], [10], [69], [8], [39], [48], [82], [52], [38], [19], [59], [30], [34], [28], [43], [32], [48], [44], [4], [58], [43], [18], [23], [55], [9]]
ERROR:root:Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.
2025-09-11 14:28:44,413 - ERROR - ./data/predictions/sample.tsv - Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.
Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.
Let’s now look at the various bits of data produced by the scorer. They are in the folder specified in the --outdir parameter of the scorer.
ls -la ./evaluation_data/evaluations/
total 32
drwxr-xr-x 2 root root 4096 Sep 11 12:34 ./
drwxr-xr-x 5 root root 4096 Sep 11 12:34 ../
-rw-r--r-- 1 root root 14963 Sep 11 12:34 01_sample_nerc_coarse.json
-rw-r--r-- 1 root root 1240 Sep 11 12:34 01_sample_nerc_coarse.tsv
-rw-r--r-- 1 root root 200 Sep 11 12:34 scorer.log
Here is an overview of the files created by the scorer:
- `scorer.log` – the log produced by the scorer
- `01_sample_nerc_coarse.tsv` – a TSV file containing the evaluation results for document `01_sample` and task `nerc_coarse`, at different levels of aggregation etc.
- `01_sample_nerc_coarse.json` – a JSON file with a more granular breakdown of the evaluation, which can be useful for error analysis and for better understanding a system’s performance.
! cat ./evaluation_data/evaluations/scorer.log
2025-09-11 12:34:38,546 - WARNING - ./evaluation_data/predictions/01_sample.tsv - No tags in the column '['NE-COARSE-METO']' of the system response file: './evaluation_data/predictions/01_sample.tsv'
import pandas as pd
eval_df = pd.read_csv('./evaluation_data/evaluations/01_sample_nerc_coarse.tsv', sep='\t')
eval_df.drop(columns=['System'], inplace=True)
eval_df.set_index('Evaluation', inplace=True)
Let’s print the micro-averaged precision, recall and F-score in a strict evaluation regime:
eval_df.loc['NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]
|  | P | R | F1 |
|---|---|---|---|
| Evaluation |  |  |  |
| NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
| NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
| NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
Let’s print the micro-averaged precision, recall and F-score in a fuzzy evaluation regime:
# Let's print the micro-averaged precision, recall and F-score in a *fuzzy* evaluation regime
eval_df.loc['NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]
|  | P | R | F1 |
|---|---|---|---|
| Evaluation |  |  |  |
| NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
| NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
| NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL | 1.0 | 1.0 | 1.0 |
There is more data in the TSV file, as can be seen when printing its whole content:
! cat ./evaluation_data/evaluations/01_sample_nerc_coarse.tsv
System Evaluation Label P R F1 F1_std P_std R_std TP FP FN
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL ALL 1.0 1.0 1.0 4 0 0
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL DATE 1.0 1.0 1.0 1 0 0
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL PERSON 1.0 1.0 1.0 3 0 0
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL ALL 1.0 1.0 1.0 4 0 0
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL DATE 1.0 1.0 1.0 1 0 0
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL PERSON 1.0 1.0 1.0 3 0 0
NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL ALL 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL DATE 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL PERSON 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL ALL 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL DATE 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL PERSON 1.0 1.0 1.0 0.0 0.0 0.0
NE-COARSE-METO-micro-fuzzy-TIME-ALL-LED-ALL ALL 0 0 0 0 0 0
NE-COARSE-METO-micro-strict-TIME-ALL-LED-ALL ALL 0 0 0 0 0 0
NE-COARSE-METO-macro_doc-fuzzy-TIME-ALL-LED-ALL ALL
NE-COARSE-METO-macro_doc-strict-TIME-ALL-LED-ALL ALL
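The P, R and F1 columns follow the standard definitions and can be recomputed from the TP/FP/FN counts in the same file. As a quick sanity check, a small sketch (the function name `prf` is ours):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# The ALL row of the micro evaluation above: TP=4, FP=0, FN=0 → perfect scores
print(prf(4, 0, 0))
# A made-up imperfect system for comparison
print(prf(3, 1, 2))
```

The strict regime requires exact boundary and type matches when counting a true positive, while the fuzzy regime relaxes the boundary requirement; with a perfect prediction, as here, both regimes yield identical scores.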
3.1.4. Manual assessment of the output#
3.1.4.1. Displaying the entities#
To allow for manual assessment of output quality, the following cells display the identified entities as color-coded annotations via HTML. These insights into the results can both complement quantitative statistics and serve as another way to estimate output quality. The latter is especially important in the frequent cases where no gold (or silver or bronze) standard is available against which the NER output can be evaluated. (Note that other tools and modules provide similar visualisations, e.g. displaCy when using spaCy for NER.)
data = get_demo_data()
data.head()
| TOKEN | NE-COARSE-LIT | NE-COARSE-METO | NE-FINE-LIT | NE-FINE-METO | NE-FINE-COMP | NE-NESTED | NEL-LIT | NEL-METO | MISC | start_char | end_char | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Madonna | B-PERSON | _ | _ | _ | _ | _ | Q1744 | _ | _ | 0 | 7 |
| 1 | and | _ | _ | _ | _ | _ | _ | _ | _ | _ | 8 | 11 |
| 2 | child | _ | _ | _ | _ | _ | _ | _ | _ | NoSpaceAfter | 12 | 16 |
| 3 | ; | _ | _ | _ | _ | _ | _ | _ | _ | _ | 17 | 18 |
| 4 | the | _ | _ | _ | _ | _ | _ | _ | _ | _ | 19 | 22 |
# Highlighting the identified entities in form of color-coded annotations with links to authority files where available
# Color palette - add more colors if more labels are used or change them here
colors = ['#F0D9EF', '#FCDCE1', '#FFE6BB', '#E9ECCE', '#CDE9DC', '#C4DFE5', '#D9E5F0', '#F0E6D9', '#E0D9F0', '#E6FFF0', '#9CC5DF']
# Labels that should be shown in color; labels not listed here will be shown in grey (this makes it easier to focus on certain categories if needed)
labels = ["PERSON", "DATE"]
# Mapping each label from the label set to a color from the palette
label_to_color = {label: colors[i % len(colors)] for i, label in enumerate(labels)}
# Generating the HTML - two changes can be made here:
# 1) by default, the column "NE-COARSE-LIT" is used for the entities, this can be changed via the argument "iob_column"
# 2) the entity identifiers are taken from the column "NEL-LIT"; by default, these are expected to be Wikidata identifiers (e.g. Q1744) and are combined with the Wikidata base URL; for another authority file, the base URL can be changed via the argument "base_url"
res,text,entities = highlight_entities(data)
# displaying the annotations
display(HTML(res))
3.1.4.2. Giving feedback on the entities#
The following cell gives a very simple example of how manual assessment of entities could be integrated into the data. Here, the user is asked for feedback on each identified entity, which then shows up in a designated column.
# Create a new column for manual assessment
data['manual_assessment'] = ''
# Display results again for better overview (no scrolling back and front)
display(HTML(res))
# Iterate through the identified entities
for e in entities:
    s, en = int(e["start_char"]), int(e["end_char"])
    etext = text[s:en + 1]
    etype = e.get("label", "Other")
    # Ask for feedback on the entity
    feedback = input(f'Is the entity "{etext.strip()}" with label "{etype}" correct? (y/n/feedback): ')
    # Store the feedback for all tokens within the entity span
    for index, row in data.iterrows():
        token_start = int(row["start_char"])
        token_end = int(row["end_char"])
        if max(s, token_start) <= min(en, token_end):
            data.loc[index, 'manual_assessment'] = feedback
# Show the altered data (now with user assessments)
data
3.1.5. EL evaluation with groundtruth data#
TODO.
3.1.6. Variations and alternatives#
If the user does not have ground truth data, another approach is an LLM-as-judge setup to evaluate the NER output in the absence of labelled gold data. The task of the LLM is then to “review” the NER output and assess its quality.
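One minimal way to set this up is to serialise each predicted entity into a review prompt for the judge model. The sketch below only builds the prompt text; the model choice and API call are deliberately left out, and the function name and all prompt wording are illustrative assumptions, not part of this recipe:

```python
def build_judge_prompt(sentence, entities):
    """Build a free-text review prompt for an LLM judge.

    entities: list of (surface_form, label, linked_id) triples from the NER/EL output.
    The wording here is an illustrative sketch, not a fixed template.
    """
    lines = [
        "You are reviewing the output of a named entity recognition system.",
        f"Sentence: {sentence}",
        "Predicted entities:",
    ]
    for surface, label, eid in entities:
        lines.append(f"- '{surface}' tagged as {label}, linked to {eid or 'NIL'}")
    lines.append(
        "For each entity, state whether the span, label and link are correct, "
        "and flag any entities that seem to be missing."
    )
    return "\n".join(lines)

prompt = build_judge_prompt(
    "Madonna and child; the virgin seated",
    [("Madonna", "PERSON", "Q1744")],
)
print(prompt)
```

The resulting free-text judgements could then be stored in a column alongside the tokens, analogous to the `manual_assessment` column used in the manual feedback cell above.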