3.1. Evaluation for named entity recognition (NER) and entity linking (EL)#


## Summary

This recipe describes how to perform the evaluation of named entity recognition (NER) and entity linking (EL) output in different scenarios:

  1. The user has groundtruth data, i.e. manually verified entity tags for the entities found in a given text. In this case, quantitative evaluation is possible.

  2. The user does not have groundtruth data, but is willing to manually inspect the NER output in order to spot and flag errors, inconsistencies, hallucinations, etc. In this case, qualitative evaluation is necessary. As this process is time-consuming, it can be supported by in-notebook visualizations for quick data inspection.

Note that the recipe only showcases a subset of the possible approaches, cf. Variations and alternatives.

## Rationale

These methods help the user assess the quality of named entity recognition (NER) and entity linking (EL) outputs. This is essential for any application, but especially when communicating with lay people, who often have reservations about new technologies.

The cookbook supports evaluation both when the data comes with ground truth labels (quantitative evaluation) and when the data is not labeled (qualitative evaluation, a.k.a. eye-balling).

To run the quantitative evaluation we use the HIPE-scorer, a set of Python scripts developed as part of the HIPE shared task, which focused on named entity processing of historical documents. As such, these scripts have certain requirements, for example regarding file naming and data format.

The output data can be fed to application recipes for visualizing and analyzing errors, making it easier to estimate performance, also for lay people.

3.1.1. Process overview#

The evaluation module takes as input a TSV file where the first column contains the token and the remaining columns are used to classify the token.
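For illustration, here is a minimal sketch of what such a file might look like (the tokens, labels and Wikidata ID below are made up for this example, and only a subset of the columns used later in this recipe is shown):

```python
import csv
import io

# A hypothetical three-column fragment: TOKEN plus two annotation columns
# (real HIPE-style files carry more columns, as shown further down).
sample = (
    "TOKEN\tNE-COARSE-LIT\tNEL-LIT\n"
    "Rembrandt\tB-PERSON\tQ5598\n"
    "van\tI-PERSON\tQ5598\n"
    "Rijn\tI-PERSON\tQ5598\n"
    "painted\tO\t_\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print([r["TOKEN"] for r in rows])          # ['Rembrandt', 'van', 'Rijn', 'painted']
print([r["NE-COARSE-LIT"] for r in rows])  # ['B-PERSON', 'I-PERSON', 'I-PERSON', 'O']
```

The `B-`/`I-`/`O` prefixes follow the usual IOB convention: `B-` opens an entity, `I-` continues it, `O` marks tokens outside any entity.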

If the file includes gold labels, the user can perform the quantitative evaluation of the annotated test data. The process uses the following steps:

  1. Installing the HIPE scorer

  2. Downloading the evaluation data and ground truths

  3. Reshaping the data to the format required by the scorer

  4. Running the scorer and saving the results

If the file does not include gold labels, the cookbook returns a visualization of the annotations and allows the user to give free-text feedback on them.

3.1.2. Preparation#

The notebook cells in this section contain the definition of functions that are used further down in the notebook. These cells must be run, but you don’t need to inspect them closely unless you want to modify the behaviour of this notebook.

3.1.2.1. Preparation for HIPE scorer#

! git clone https://github.com/enriching-digital-heritage/HIPE-scorer.git
Cloning into 'HIPE-scorer'...
remote: Enumerating objects: 1004, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 1004 (delta 47), reused 49 (delta 20), pack-reused 897 (from 1)
Receiving objects: 100% (1004/1004), 311.16 KiB | 2.93 MiB/s, done.
Resolving deltas: 100% (638/638), done.
cd HIPE-scorer/
/content/HIPE-scorer
pip install -r requirements.txt
Collecting docopt (from -r requirements.txt (line 1))
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 2)) (2.0.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 3)) (2.2.2)
Requirement already satisfied: smart_open in /usr/local/lib/python3.12/dist-packages (from -r requirements.txt (line 4)) (7.3.0.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas->-r requirements.txt (line 3)) (2025.2)
Requirement already satisfied: wrapt in /usr/local/lib/python3.12/dist-packages (from smart_open->-r requirements.txt (line 4)) (1.17.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas->-r requirements.txt (line 3)) (1.17.0)
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=b365b69191028b0ef62d19ff3fed819dfc0464286576e8ec8243595cbffcd328
  Stored in directory: /root/.cache/pip/wheels/1a/bf/a1/4cee4f7678c68c5875ca89eaccf460593539805c3906722228
Successfully built docopt
Installing collected packages: docopt
Successfully installed docopt-0.6.2
! pip install .
Processing /content/HIPE-scorer
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: HIPE-scorer
  Building wheel for HIPE-scorer (setup.py) ... done
  Created wheel for HIPE-scorer: filename=HIPE_scorer-2.0-py3-none-any.whl size=15478 sha256=bd836e2c8559efdf027f53830cd3a4a61581474b9eee31aa70539aad693c848c
  Stored in directory: /root/.cache/pip/wheels/6c/70/25/36232846b9cd45c513678a5037cd77f079a0d86c8f80b7a6e7
Successfully built HIPE-scorer
Installing collected packages: HIPE-scorer
Successfully installed HIPE-scorer-2.0

3.1.2.2. Preparation for manual assessment#

import pandas as pd
from IPython.display import HTML, display

# Function for visualising the entities with link

def highlight_entities(
    data,
    iob_column = "NE-COARSE-LIT",
    base_url="https://www.wikidata.org/wiki/"
    ):
    # 1) Rebuild the text with spacing rules
    text_parts = []
    for idx, row in data.iterrows():
        tok = row["TOKEN"]
        if "NoSpaceAfter" in row["MISC"]:
            text_parts.append(tok)
        else:
            text_parts.append(tok + " ")
    text = "".join(text_parts)

    # 2) Merge contiguous IOB entities of the same type
    entities = []
    current = None  # {"start_char": int, "end_char": int, "label": str}

    for idx, row in data.iterrows():
        tag = row[iob_column]
        if tag == "_" or tag == "O":
            # close any open entity
            if current is not None:
                entities.append(current)
                current = None
            continue

        # Extract type and prefix
        if tag.startswith("B-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            # close previous if open
            if current is not None:
                entities.append(current)
            # start new
            current = {
                "start_char": int(row["start_char"]),
                "end_char": int(row["end_char"]),
                "label": etype,
                "id": eid
            }

        elif tag.startswith("I-"):
            etype = tag[2:]
            eid = row["NEL-LIT"]
            if current is not None and current["label"] == etype:
                # extend current run
                current["end_char"] = int(row["end_char"])
            else:
                # stray I- (no open run or different type) → treat as B-
                if current is not None:
                    entities.append(current)
                current = {
                    "start_char": int(row["start_char"]),
                    "end_char": int(row["end_char"]),
                    "label": etype,
                    "id": eid
                }

        else:
            # Unknown tag → close any open entity
            if current is not None:
                entities.append(current)
                current = None

    # flush any remaining entity
    if current is not None:
        entities.append(current)

    # 3) Render with spans (note: end_char is inclusive → slice to en+1)
    entities.sort(key=lambda e: e["start_char"])
    result = ""
    last_idx = 0

    for e in entities:
        s, en = int(e["start_char"]), int(e["end_char"])
        etext = text[s:en + 1]  # inclusive end
        etype = e.get("label", "Other")
        eid = e.get("id", "")
        # label_to_color is a global mapping defined in a later cell of this notebook
        color = label_to_color.get(etype, "#dddddd")

        # decide whether to show eid as link or not
        if eid != "_" and eid != "NIL":
            eid_html = f'<a href="{base_url}{eid}">{eid}</a>'
        else:
            eid_html = ""  # if entity linking was not successful, no link is shown

        result += text[last_idx:s]
        result += (
            f'<span style="background-color:{color}; color:black; padding:3px 6px; '
            f'border-radius:16px; margin:0 2px; display:inline-block; '
            f'box-shadow: 1px 1px 3px rgba(0,0,0,0.1);">'
            f'{etext}'
            f'<span style="font-size:0.75em; font-weight:normal; margin-left:6px; '
            f'background-color:rgba(0,0,0,0.05); padding:1px 6px; border-radius:12px;">'
            f'{etype} {eid_html}</span></span>'
        )
        last_idx = en + 1  # continue after inclusive end

    result += text[last_idx:]
    return result, text, entities

3.1.2.3. Data preparation#

# Function to load the demo data and add character offsets per token.
# (The British Museum groundtruth data used further below is found here:
# http://145.38.185.232/enriching/bm.txt; it is converted by clean_format.)
def get_demo_data():
  import pandas as pd
  # Loading the data
  data = pd.read_csv(
      'https://raw.githubusercontent.com/wjbmattingly/llm-lod-recipes/refs/heads/main/output/sample.tsv',
      sep='\t'
  )

  # Adding the start and end characters per token to the dataframe
  data['start_char'] = 0
  data['end_char'] = 0
  current_char = 0

  for index, row in data.iterrows():
      data.loc[index, 'start_char'] = current_char
      token = row['TOKEN']
      # end_char is inclusive: the index of the token's last character
      data.loc[index, 'end_char'] = current_char + len(token) - 1
      # Advance past the token, adding 1 for the following space
      # unless MISC suppresses it
      if 'NoSpaceAfter' in row['MISC']:
          current_char += len(token)
      else:
          current_char += len(token) + 1

  # Just for testing purposes: adds a Wikidata ID to one entity
  data.loc[0, 'NEL-LIT'] = 'Q1744'
  return data
df = get_demo_data()
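The offset bookkeeping above can be spot-checked with a plain-Python sketch of the same rule (inclusive offsets; the tokens and MISC values below are made up, and this is an illustration, not the notebook's exact code):

```python
def char_offsets(tokens):
    """tokens: list of (TOKEN, MISC) pairs.

    Returns inclusive (start_char, end_char) per token, assuming a single
    space after each token unless its MISC contains 'NoSpaceAfter'."""
    offsets, pos = [], 0
    for tok, misc in tokens:
        start, end = pos, pos + len(tok) - 1  # end points at the last character
        offsets.append((start, end))
        pos = end + 1 if "NoSpaceAfter" in misc else end + 2
    return offsets

demo = [("Madonna", "_"), ("and", "_"), ("child", "NoSpaceAfter"), (";", "_")]
print(char_offsets(demo))  # [(0, 6), (8, 10), (12, 16), (17, 17)]
```

Note how `child` (with `NoSpaceAfter`) ends at position 16 and the semicolon starts immediately at 17, with no space in between.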
# preparing data for groundtruth evaluation
import regex as re

def clean_format(input_file, output_file):
  # Converts the raw space-separated groundtruth/prediction files
  # into the HIPE-style tab-separated format expected by the scorer.
  with open(input_file) as infile, open(output_file, mode='w') as output:
    output.write('\t'.join(["TOKEN", "NE-COARSE-LIT", "NE-COARSE-METO", "NE-FINE-LIT", "NE-FINE-METO", "NE-FINE-COMP", "NE-NESTED", "NEL-LIT", "NEL-METO", "MISC"]) + '\n')
    i = 1
    for line in infile:
      line = line.strip()
      mod_line = re.sub('_', '-', line)
      # Replace document separators with numbered comment lines
      if re.search('-DOCSTART- -DOCSTART- -DOCSTART-', mod_line):
        mod_line = re.sub('-DOCSTART- -DOCSTART- -DOCSTART-', f'# document_{i}', mod_line)
        i += 1
      mod_line = mod_line.split(' ')
      if len(mod_line) == 2:
        output.write("\n" + " ".join(mod_line) + "\n")
      elif len(mod_line) >= 3:
        # token, coarse NER label, entity link; the remaining columns are filled with '-'
        output.write('\t'.join([mod_line[0], mod_line[1], '-', '-', '-', '-', '-', mod_line[2], '-', '-']) + '\n')

clean_format('./data/bm_labels.txt', './data/gold/sample.tsv')
clean_format('./data/bm-2-ner-format.txt', './data/predictions/sample.tsv')

3.1.3. NER evaluation with groundtruth data by using the HIPE scorer#

! python clef_evaluation.py --help
Evaluate the systems for the HIPE Shared Task

Usage:
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_coarse [options]
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nerc_fine [options]
  clef_evaluation.py --pred=<fpath> --ref=<fpath> --task=nel [--n_best=<n>] [options]
  clef_evaluation.py -h | --help


Options:
    -h --help               Show this screen.
    -t --task=<type>        Type of evaluation task (nerc_coarse, nerc_fine, nel).
    -e --hipe_edition=<str> Specify the HIPE edition (triggers different set of columns to be considered during eval). Possible values: hipe-2020, hipe-2022 [default: hipe-2020]
    -r --ref=<fpath>        Path to gold standard file in CONLL-U-style format.
    -p --pred=<fpath>       Path to system prediction file in CONLL-U-style format.
    -o --outdir=<dir>       Path to output directory [default: .].
    -l --log=<fpath>        Path to log file.
    -g --original_nel       It splits the NEL boundaries using original CLEF algorithm.
    -n, --n_best=<n>        Evaluate NEL at particular cutoff value(s) when provided with a ranked list of entity links. Example: 1,3,5 [default: 1].
    --noise-level=<str>     Evaluate NEL or NERC also on particular noise levels (normalized Levenshtein distance of their manual OCR transcript). Example: 0.0-0.1,0.1-1.0,
    --time-period=<str>     Evaluate NEL or NERC also on particular time periods. Example: 1900-1950,1950-2000.
    --glue=<str>            Provide two columns separated by a plus (+) whose label are glued together for the evaluation (e.g. COL1_LABEL.COL2_LABEL). When glueing more than one pair, separate by comma.
    --skip-check            Skip check that ensures that the files name is in line with submission requirements.
    --tagset=<fpath>        Path to file containing the valid tagset of CLEF-HIPE.
    --suffix=<str>          Suffix that is appended to output file names and evaluation keys.

3.1.3.1. Running the scorer#

‼️ In the cell below, note the parameter --task. Its value needs to be adjusted depending on the task one wants to evaluate (i.e. NER or EL): when evaluating NER we use --task nerc_coarse, while for evaluating EL we use --task nel.

import glob
import regex as re

for doc in glob.glob('./data/predictions/*'):
  gold = re.sub('predictions','gold',doc)
  ! python clef_evaluation.py --ref "{gold}" --pred "{doc}" --task nerc_coarse --outdir ./data/evaluations/ --hipe_edition hipe-2020 --log=./data/evaluations/scorer.log
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'I-LOC -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'I-LOC -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
{'TOKEN': 'O -', 'NE-COARSE-LIT': None, 'NE-COARSE-METO': None, 'NE-FINE-LIT': None, 'NE-FINE-METO': None, 'NE-FINE-COMP': None, 'NE-NESTED': None, 'NEL-LIT': None, 'NEL-METO': None, 'MISC': None}
true: 100 pred: 100
data_format_true [[36], [49], [37], [76], [56], [38], [73], [33], [20], [113], [49], [41], [17], [85], [8], [47], [30], [14], [36], [21], [10], [35], [47], [32], [24], [22], [58], [48], [56], [69], [45], [47], [40], [92], [16], [8], [67], [24], [53], [8], [35], [71], [28], [31], [15], [36], [46], [29], [89], [47], [48], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [31], [90], [41], [28], [8], [41], [23], [30], [22], [13], [53], [66], [9], [10], [69], [8], [39], [48], [81], [52], [38], [19], [59], [30], [34], [27], [44], [32], [47], [45], [4], [58], [43], [17], [23], [56], [9]]
data_format_pred [[35], [50], [38], [76], [56], [39], [73], [32], [20], [115], [49], [41], [17], [85], [8], [47], [31], [13], [36], [21], [12], [35], [47], [32], [24], [23], [58], [48], [56], [70], [45], [47], [40], [92], [16], [8], [68], [24], [53], [8], [35], [72], [28], [31], [15], [36], [46], [29], [90], [46], [49], [65], [8], [7], [38], [25], [40], [39], [30], [51], [26], [39], [68], [32], [92], [42], [28], [8], [41], [23], [30], [22], [14], [54], [66], [10], [10], [69], [8], [39], [48], [82], [52], [38], [19], [59], [30], [34], [28], [43], [32], [48], [44], [4], [58], [43], [18], [23], [55], [9]]
ERROR:root:Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.
2025-09-11 14:28:44,413 - ERROR - ./data/predictions/sample.tsv - Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.
Data mismatch between system response './data/predictions/sample.tsv' and gold standard due to wrong segmentation or missing lines.

Let’s now look at the various bits of data produced by the scorer. They are in the folder specified in the --outdir parameter of the scorer.

ls -la ./evaluation_data/evaluations/
total 32
drwxr-xr-x 2 root root  4096 Sep 11 12:34 ./
drwxr-xr-x 5 root root  4096 Sep 11 12:34 ../
-rw-r--r-- 1 root root 14963 Sep 11 12:34 01_sample_nerc_coarse.json
-rw-r--r-- 1 root root  1240 Sep 11 12:34 01_sample_nerc_coarse.tsv
-rw-r--r-- 1 root root   200 Sep 11 12:34 scorer.log

Here is an overview of the files created by the scorer:

  • scorer.log – the log produced by the scorer

  • 01_sample_nerc_coarse.tsv – a TSV file containing the evaluation results for document 01_sample and task nerc_coarse, at different levels of aggregation

  • 01_sample_nerc_coarse.json – a JSON file with a more granular breakdown of the evaluation, which can be useful for error analysis and to better understand systems’ performance.

! cat ./evaluation_data/evaluations/scorer.log
2025-09-11 12:34:38,546 - WARNING - ./evaluation_data/predictions/01_sample.tsv - No tags in the column '['NE-COARSE-METO']' of the system response file: './evaluation_data/predictions/01_sample.tsv'
import pandas as pd

eval_df = pd.read_csv('./evaluation_data/evaluations/01_sample_nerc_coarse.tsv', sep='\t')
eval_df.drop(columns=['System'], inplace=True)
eval_df.set_index('Evaluation', inplace=True)

Let’s print the micro-averaged precision, recall and F-score in a strict evaluation regime:

eval_df.loc['NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]
P R F1
Evaluation
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL 1.0 1.0 1.0
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL 1.0 1.0 1.0
NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL 1.0 1.0 1.0

Let’s print the micro-averaged precision, recall and F-score in a fuzzy evaluation regime:

# Let's print the micro-averaged precision, recall and F-score in a *fuzzy* evaluation regime
eval_df.loc['NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL'][['P', 'R', 'F1']]
P R F1
Evaluation
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL 1.0 1.0 1.0
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL 1.0 1.0 1.0
NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL 1.0 1.0 1.0
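
The two regimes differ in how predicted spans are matched against gold spans: strict requires exact boundaries (and entity type), while fuzzy also credits predictions that merely overlap a gold span of the right type. A toy illustration of the idea (a sketch only, not the scorer's actual matching code):

```python
def strict_match(gold, pred):
    # exact boundary + type match; spans are (start, end, type) tuples
    return gold == pred

def fuzzy_match(gold, pred):
    # same type and overlapping character ranges (inclusive ends)
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    return gt == pt and max(gs, ps) <= min(ge, pe)

gold = (0, 7, "PERSON")          # e.g. a name spanning characters 0-7
pred_exact = (0, 7, "PERSON")
pred_short = (0, 3, "PERSON")    # boundary error: only part of the name

print(strict_match(gold, pred_exact), fuzzy_match(gold, pred_exact))  # True True
print(strict_match(gold, pred_short), fuzzy_match(gold, pred_short))  # False True
```

This is why fuzzy scores are always at least as high as strict scores for the same system output.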

There is more data in the TSV file, as can be seen when printing the whole content:

! cat ./evaluation_data/evaluations/01_sample_nerc_coarse.tsv
System	Evaluation	Label	P	R	F1	F1_std	P_std	R_std	TP	FP	FN
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	ALL	1.0	1.0	1.0				4	0	0
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	DATE	1.0	1.0	1.0				1	0	0
	NE-COARSE-LIT-micro-fuzzy-TIME-ALL-LED-ALL	PERSON	1.0	1.0	1.0				3	0	0
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	ALL	1.0	1.0	1.0				4	0	0
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	DATE	1.0	1.0	1.0				1	0	0
	NE-COARSE-LIT-micro-strict-TIME-ALL-LED-ALL	PERSON	1.0	1.0	1.0				3	0	0
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	ALL	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	DATE	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-LIT-macro_doc-fuzzy-TIME-ALL-LED-ALL	PERSON	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL	ALL	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL	DATE	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-LIT-macro_doc-strict-TIME-ALL-LED-ALL	PERSON	1.0	1.0	1.0	0.0	0.0	0.0			
	NE-COARSE-METO-micro-fuzzy-TIME-ALL-LED-ALL	ALL	0	0	0				0	0	0
	NE-COARSE-METO-micro-strict-TIME-ALL-LED-ALL	ALL	0	0	0				0	0	0
	NE-COARSE-METO-macro_doc-fuzzy-TIME-ALL-LED-ALL	ALL									
	NE-COARSE-METO-macro_doc-strict-TIME-ALL-LED-ALL	ALL									
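
The P, R and F1 values in the micro rows can be recomputed from the TP, FP and FN counts in the same file; a minimal sketch:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf(4, 0, 0))  # the ALL row above: (1.0, 1.0, 1.0)
print(prf(3, 1, 2))  # an imperfect system: P=0.75, R=0.6, F1≈0.667
```

The macro_doc rows instead average these scores over documents, which is why they carry standard deviations (F1_std, P_std, R_std) rather than counts.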

3.1.4. Manual assessment of the output#

3.1.4.1. Displaying the entities#

To allow for manual assessment of output quality, the following cells display the identified entities as color-coded annotations via HTML. Such insights into the results can both complement quantitative statistics and serve as another way to estimate output quality. The latter is especially important in the frequent cases where no gold (or silver or bronze) standard is available against which the NER output can be evaluated. (Note that other tools and modules provide similar visualisations, e.g. displaCy when using spaCy for NER.)

data = get_demo_data()
data.head()
TOKEN NE-COARSE-LIT NE-COARSE-METO NE-FINE-LIT NE-FINE-METO NE-FINE-COMP NE-NESTED NEL-LIT NEL-METO MISC start_char end_char
0 Madonna B-PERSON _ _ _ _ _ Q1744 _ _ 0 7
1 and _ _ _ _ _ _ _ _ _ 8 11
2 child _ _ _ _ _ _ _ _ NoSpaceAfter 12 16
3 ; _ _ _ _ _ _ _ _ _ 17 18
4 the _ _ _ _ _ _ _ _ _ 19 22
# Highlighting the identified entities in form of color-coded annotations with links to authority files where available

# Color palette - add more colors if more labels are used or change them here
colors = ['#F0D9EF', '#FCDCE1', '#FFE6BB', '#E9ECCE', '#CDE9DC', '#C4DFE5', '#D9E5F0', '#F0E6D9', '#E0D9F0', '#E6FFF0', '#9CC5DF']

# Labels that should be shown in color; labels not mentioned here will be shown in grey (this makes it easier to focus on certain categories if needed)
labels = ["PERSON", "DATE"]

# Mapping each label from the label set to a color from the palette
label_to_color = {label: colors[i % len(colors)] for i, label in enumerate(labels)}

# Generating the HTML - two changes can be made here:
# 1) by default, the column "NE-COARSE-LIT" is used for the entities, this can be changed via the argument "iob_column"
# 2) the entity identifiers are taken from the column "NEL-LIT"; by default, these are expected to be Wikidata identifiers (e.g. Q1744) and are combined with the Wikidata base URL; for another authority file, the base URL can be changed via the argument "base_url"
res,text,entities = highlight_entities(data)

# displaying the annotations
display(HTML(res))
Madonna PERSON Q1744and child; the Virgin PERSON seated turned to left and seen three-quarter DATE length, holding the infant Jesus PERSON seated on her knee and suckling him, a round composition. c.1641 Etching

3.1.4.2. Giving feedback on the entities#

The following cell gives a very simple example of how manual assessment of entities could be integrated into the data. The user is asked for feedback on each identified entity, which is then stored in a designated column.

# Create a new column for manual assessment
data['manual_assessment'] = ''

# Display results again for better overview (no scrolling back and forth)
display(HTML(res))

# Iterate through the identified entities
for e in entities:
    s, en = int(e["start_char"]), int(e["end_char"])
    etext = text[s:en + 1]
    etype = e.get("label", "Other")

    # Ask for feedback on the entity
    feedback = input(f'Is the entity "{etext.strip()}" with label "{etype}" correct? (y/n/feedback): ')

    # Store the feedback for all tokens within the entity span
    for index, row in data.iterrows():
        token_start = int(row["start_char"])
        token_end = int(row["end_char"])
        if max(s, token_start) <= min(en, token_end):
            data.loc[index, 'manual_assessment'] = feedback

# Show the altered data (now with user assessments)
data
Madonna PERSON Q1744and child; the Virgin PERSON seated turned to left and seen three-quarter DATE length, holding the infant Jesus PERSON seated on her knee and suckling him, a round composition. c.1641 Etching
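
Once collected, the feedback can be summarised, for example by counting how many entities were confirmed. A minimal stdlib sketch on made-up feedback values (in the notebook, these would come from the manual_assessment column filled in above):

```python
from collections import Counter

# Hypothetical per-entity feedback, as collected in the loop above:
# "y" = correct, "n" = wrong, anything else = a free-text comment.
feedback = ["y", "y", "n", "wrong type: should be DATE"]

counts = Counter(
    "y" if f == "y" else "n" if f == "n" else "comment" for f in feedback
)
print(counts["y"], counts["n"], counts["comment"])      # 2 1 1
print(f"confirmed: {counts['y'] / len(feedback):.0%}")  # confirmed: 50%
```

Such a summary gives a rough, informal precision estimate even without groundtruth data.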

3.1.5. EL evaluation with groundtruth data#

TODO.

3.1.6. Variations and alternatives#

If the user does not have groundtruth data, another approach could be to use an LLM-as-judge to evaluate the NER output in the absence of labelled gold data. The task of the LLM is then to “review” the NER output and assess its quality.
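
As a sketch of what that could look like, the cell below only assembles a review prompt from a passage and its NER output; the prompt wording is illustrative, and the choice of model and API call is left entirely open:

```python
def build_judge_prompt(text, entities):
    """Assemble a review prompt from a passage and its NER output.

    `entities` is a list of (surface, label) pairs; the wording below is
    an illustrative example, not a tested or recommended formulation.
    """
    listing = "\n".join(f'- "{surface}" -> {label}' for surface, label in entities)
    return (
        "Review the following named-entity annotations for the passage below.\n"
        "For each entity, say whether the span and label are correct, and flag "
        "any hallucinated or missed entities.\n\n"
        f"Passage:\n{text}\n\nAnnotations:\n{listing}\n"
    )

prompt = build_judge_prompt(
    "Madonna and child; the Virgin seated ... c.1641 Etching",
    [("Madonna", "PERSON"), ("c.1641", "DATE")],
)
print(prompt)
```

The resulting free-text "review" would still need to be parsed or read by the user, so this complements rather than replaces the manual assessment above.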