Named Entity Recognition

2.2. Named Entity Recognition

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that identifies and classifies named entities (such as people, places, and organizations) in text. For example, in the sentence “Shakespeare wrote Romeo and Juliet in London”, an NER system would identify “Shakespeare” as a person, “Romeo and Juliet” as a work of art, and “London” as a location. NER is crucial for extracting structured information from unstructured text, which makes it valuable for tasks like information retrieval, question answering, and metadata enrichment. In this notebook, we’ll perform NER with a Large Language Model and use spaCy, a traditional NLP library, only to hold and visualize the results.
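To make "structured information" concrete, here is a minimal sketch of how the entities in the example sentence could be represented as data. The `Entity` class, the `locate` helper, and the label names are illustrative assumptions, not something this notebook defines.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    label: str
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity


sentence = "Shakespeare wrote Romeo and Juliet in London"


def locate(surface: str, label: str) -> Entity:
    """Build an Entity from the first occurrence of `surface` in the sentence."""
    start = sentence.index(surface)
    return Entity(surface, label, start, start + len(surface))


entities = [
    locate("Shakespeare", "PERSON"),
    locate("Romeo and Juliet", "WORK_OF_ART"),
    locate("London", "LOCATION"),
]
for e in entities:
    print(e)
```

Storing character offsets alongside the label lets any downstream tool slice the entity back out of the original sentence.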

2.3. Rationale

This notebook demonstrates how to use OpenAI’s GPT models to perform Named Entity Recognition (NER) by converting input text into annotated markdown. Rather than running a traditional NLP pipeline, we rely on a Large Language Model’s natural language understanding to identify and classify named entities. The notebook takes plain text as input and outputs markdown in which each entity is annotated in the format [Entity](LABEL), such as [London](LOCATION). This approach showcases how LLMs can be used for structured information extraction in cultural heritage metadata enrichment.
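As a quick illustration of why this annotation format is machine-readable, the sketch below pulls the entity/label pairs out of an annotated string with Python's `re` module. The sample sentence and the name `ANNOTATION` are assumptions made for the example.

```python
import re

# Matches [entity text](LABEL): group 1 is the entity text, group 2 the label.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

sample = "[Shakespeare](PERSON) wrote Romeo and Juliet in [London](LOCATION)."
pairs = ANNOTATION.findall(sample)
print(pairs)  # [('Shakespeare', 'PERSON'), ('London', 'LOCATION')]
```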

2.4. Process Overview

The process consists of the following steps:

1. Text Input: We start with plain text that needs entity recognition.
2. LLM Processing: The text is sent to GPT with a prompt that instructs it to identify entities.
3. Entity Annotation: The LLM marks entities in markdown format: [Entity](LABEL).
4. Visualization: The annotated text is displayed with color-coded entity highlighting.

This approach leverages the LLM’s natural language understanding while producing structured, machine-readable output.
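The steps above can be sketched end to end without any API call. In this minimal, offline sketch, `fake_annotate` is a hypothetical stand-in for the LLM, and `to_html` is a deliberately simple substitute for the displaCy visualization used later in the notebook.

```python
import html
import re
from typing import Callable

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def to_html(annotated: str) -> str:
    """Step 4: render [text](LABEL) spans as <mark> elements, escaping everything else."""
    parts, last = [], 0
    for m in ANNOTATION.finditer(annotated):
        parts.append(html.escape(annotated[last:m.start()]))
        label, text = html.escape(m.group(2)), html.escape(m.group(1))
        parts.append(f'<mark title="{label}">{text}</mark>')
        last = m.end()
    parts.append(html.escape(annotated[last:]))
    return "".join(parts)


def run_pipeline(text: str, annotate: Callable[[str], str]) -> str:
    """Steps 1-4: take plain text, annotate it, and return highlighted HTML."""
    return to_html(annotate(text))


def fake_annotate(text: str) -> str:
    # Hypothetical stand-in for the LLM call (steps 2 and 3).
    return text.replace("London", "[London](LOCATION)")


result = run_pipeline("Shakespeare lived in London.", fake_annotate)
print(result)
```

Passing the annotator in as a function keeps the parsing and rendering steps testable on their own, independent of which model produces the annotations.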

2.5. Install Packages

!pip install spacy pandas openai python-dotenv

2.6. Necessary Functions for Visualizations

import re
import spacy
from spacy.tokens import Doc, Span


def annotated_text_to_spacy_doc(text, nlp=None):
    """
    Converts annotated text in format [Entity](LABEL) to a spaCy Doc with entity spans.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
    
    Returns:
        spacy.tokens.Doc: spaCy document with entity spans set
        
    Example:
        >>> text = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in 2020 before he lived in [Rome](LOCATION)."
        >>> doc = annotated_text_to_spacy_doc(text)
        >>> spacy.displacy.render(doc, style="ent")
    """
    if nlp is None:
        nlp = spacy.blank("en")
    
    # Pattern to match [text](LABEL) format
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    
    # Parse the text to extract tokens and entity information
    tokens = []
    entity_spans = []  # List of (start_token_idx, end_token_idx, label)
    custom_labels = set()
    
    # Split text by the pattern and process each part
    last_end = 0
    token_idx = 0
    
    for match in re.finditer(pattern, text):
        # Add tokens before the entity
        before_entity = text[last_end:match.start()]
        if before_entity.strip():
            # Tokenize the text before the entity
            before_tokens = before_entity.split()
            tokens.extend(before_tokens)
            token_idx += len(before_tokens)
        
        # Add the entity tokens
        entity_text = match.group(1)
        entity_label = match.group(2)
        custom_labels.add(entity_label)
        
        # Tokenize the entity text
        entity_tokens = entity_text.split()
        start_token_idx = token_idx
        tokens.extend(entity_tokens)
        token_idx += len(entity_tokens)
        end_token_idx = token_idx
        
        # Store entity span information
        entity_spans.append((start_token_idx, end_token_idx, entity_label))
        
        last_end = match.end()
    
    # Add any remaining tokens after the last entity
    remaining = text[last_end:]
    if remaining.strip():
        remaining_tokens = remaining.split()
        tokens.extend(remaining_tokens)
    
    # Add custom labels to the NLP model if they don't exist
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    else:
        ner = nlp.get_pipe("ner")
    
    for label in custom_labels:
        ner.add_label(label)
    
    # Create spaces array (True for tokens that should have a space after them)
    # Simple heuristic: all tokens except the last one get a space
    spaces = [True] * len(tokens)
    if tokens:
        spaces[-1] = False
    
    # Create the Doc from tokens
    doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
    
    # Create entity spans
    entities = []
    for start_idx, end_idx, label in entity_spans:
        if start_idx < len(doc) and end_idx <= len(doc):
            span = Span(doc, start_idx, end_idx, label=label)
            entities.append(span)
    
    # Set entities on the document
    doc.ents = entities
    
    return doc


def visualize_annotated_text(text, nlp=None, style="ent", jupyter=True):
    """
    Convenience function to convert annotated text and visualize it with displaCy.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
        style (str): displaCy style ("ent" or "dep")
        jupyter (bool): Whether to render for Jupyter notebook
    
    Returns:
        Rendered visualization (HTML string if not in Jupyter)
    """
    doc = annotated_text_to_spacy_doc(text, nlp)
    
    # spaCy is already imported at module level, so no ImportError guard is needed here.
    # Import displacy explicitly rather than relying on the spacy.displacy attribute.
    from spacy import displacy
    return displacy.render(doc, style=style, jupyter=jupyter)

2.7. Importing the Required Libraries

from dotenv import load_dotenv
import os
from openai import OpenAI

2.8. Loading our Environment Variables

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
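`os.getenv` returns `None` when the variable is missing, which may lead to a less obvious error further on. A minimal sketch of failing early with a clear message follows; `require_env` is a hypothetical helper, not part of this notebook or of python-dotenv.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the variable's value, or raise a clear error if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment.")
    return value


# Demonstrated with a plain dict so the example does not depend on a real key.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself this would be called as `require_env("OPENAI_API_KEY")` after `load_dotenv()`.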

2.9. Connecting to OpenAI Client

client = OpenAI(api_key=OPENAI_API_KEY)

2.10. Main Variables for the Notebook

INPUT_DATA = [{'text_original': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'text_clean': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
  'language': {'language': 'en', 'score': -868.9007034301758},
  'sentences': [{'id': 0,
    'start': 0,
    'end': 130,
    'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."},
   {'id': 1,
    'start': 131,
    'end': 324,
    'text': 'Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.'}],
  'tokens': [{'id': 0,
    'text': 'This',
    'start': 0,
    'end': 4,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 1,
    'text': 'painting',
    'start': 5,
    'end': 13,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 2,
    'text': 'depicts',
    'start': 14,
    'end': 21,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 3,
    'text': 'Monet',
    'start': 22,
    'end': 27,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 4,
    'text': "'s",
    'start': 27,
    'end': 29,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 5,
    'text': 'first',
    'start': 30,
    'end': 35,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 6,
    'text': 'wife',
    'start': 36,
    'end': 40,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 7,
    'text': ',',
    'start': 40,
    'end': 41,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 8,
    'text': 'Camille',
    'start': 42,
    'end': 49,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 9,
    'text': ',',
    'start': 49,
    'end': 50,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 10,
    'text': 'outside',
    'start': 51,
    'end': 58,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 11,
    'text': 'on',
    'start': 59,
    'end': 61,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 12,
    'text': 'a',
    'start': 62,
    'end': 63,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 13,
    'text': 'snowy',
    'start': 64,
    'end': 69,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 14,
    'text': 'day',
    'start': 70,
    'end': 73,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 15,
    'text': 'passing',
    'start': 74,
    'end': 81,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 16,
    'text': 'by',
    'start': 82,
    'end': 84,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 17,
    'text': 'the',
    'start': 85,
    'end': 88,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 18,
    'text': 'French',
    'start': 89,
    'end': 95,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 19,
    'text': 'doors',
    'start': 96,
    'end': 101,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 20,
    'text': 'of',
    'start': 102,
    'end': 104,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 21,
    'text': 'their',
    'start': 105,
    'end': 110,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 22,
    'text': 'home',
    'start': 111,
    'end': 115,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 23,
    'text': 'at',
    'start': 116,
    'end': 118,
    'ws': True,
    'is_punct': False,
    'sent_id': 0},
   {'id': 24,
    'text': 'Argenteuil',
    'start': 119,
    'end': 129,
    'ws': False,
    'is_punct': False,
    'sent_id': 0},
   {'id': 25,
    'text': '.',
    'start': 129,
    'end': 130,
    'ws': True,
    'is_punct': True,
    'sent_id': 0},
   {'id': 26,
    'text': 'Her',
    'start': 131,
    'end': 134,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 27,
    'text': 'face',
    'start': 135,
    'end': 139,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 28,
    'text': 'is',
    'start': 140,
    'end': 142,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 29,
    'text': 'rendered',
    'start': 143,
    'end': 151,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 30,
    'text': 'in',
    'start': 152,
    'end': 154,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 31,
    'text': 'a',
    'start': 155,
    'end': 156,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 32,
    'text': 'radically',
    'start': 157,
    'end': 166,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 33,
    'text': 'bold',
    'start': 167,
    'end': 171,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 34,
    'text': 'Impressionist',
    'start': 172,
    'end': 185,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 35,
    'text': 'technique',
    'start': 186,
    'end': 195,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 36,
    'text': 'of',
    'start': 196,
    'end': 198,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 37,
    'text': 'mere',
    'start': 199,
    'end': 203,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 38,
    'text': 'daubs',
    'start': 204,
    'end': 209,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 39,
    'text': 'of',
    'start': 210,
    'end': 212,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 40,
    'text': 'paint',
    'start': 213,
    'end': 218,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 41,
    'text': 'quickly',
    'start': 219,
    'end': 226,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 42,
    'text': 'applied',
    'start': 227,
    'end': 234,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 43,
    'text': ',',
    'start': 234,
    'end': 235,
    'ws': True,
    'is_punct': True,
    'sent_id': 1},
   {'id': 44,
    'text': 'just',
    'start': 236,
    'end': 240,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 45,
    'text': 'as',
    'start': 241,
    'end': 243,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 46,
    'text': 'the',
    'start': 244,
    'end': 247,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 47,
    'text': 'snow',
    'start': 248,
    'end': 252,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 48,
    'text': 'and',
    'start': 253,
    'end': 256,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 49,
    'text': 'trees',
    'start': 257,
    'end': 262,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 50,
    'text': 'are',
    'start': 263,
    'end': 266,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 51,
    'text': 'defined',
    'start': 267,
    'end': 274,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 52,
    'text': 'by',
    'start': 275,
    'end': 277,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 53,
    'text': 'broad',
    'start': 278,
    'end': 283,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 54,
    'text': ',',
    'start': 283,
    'end': 284,
    'ws': True,
    'is_punct': True,
    'sent_id': 1},
   {'id': 55,
    'text': 'broken',
    'start': 285,
    'end': 291,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 56,
    'text': 'strokes',
    'start': 292,
    'end': 299,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 57,
    'text': 'of',
    'start': 300,
    'end': 302,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 58,
    'text': 'pure',
    'start': 303,
    'end': 307,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 59,
    'text': 'white',
    'start': 308,
    'end': 313,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 60,
    'text': 'and',
    'start': 314,
    'end': 317,
    'ws': True,
    'is_punct': False,
    'sent_id': 1},
   {'id': 61,
    'text': 'green',
    'start': 318,
    'end': 323,
    'ws': False,
    'is_punct': False,
    'sent_id': 1},
   {'id': 62,
    'text': '.',
    'start': 323,
    'end': 324,
    'ws': False,
    'is_punct': True,
    'sent_id': 1}],
  'meta': {'source': 'CMA',
   'id': 135382,
   'char_count': 324,
   'token_count': 63,
   'sentence_count': 2}}]

MODEL = "gpt-4o-mini"
LABELS = ["PERSON", "LOCATION", "ORGANIZATION"]
TEXT = INPUT_DATA[0]["text_clean"]
print(TEXT)

This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.
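Each token in the input record carries `start` and `end` character offsets into the text. A small sanity check, sketched here on a shortened, made-up record with the same shape, confirms that those offsets slice out the token text. `misaligned_tokens` is a hypothetical helper.

```python
record = {
    "text_clean": "Monet's wife, Camille.",
    "tokens": [
        {"text": "Monet", "start": 0, "end": 5},
        {"text": "'s", "start": 5, "end": 7},
        {"text": "wife", "start": 8, "end": 12},
        {"text": ",", "start": 12, "end": 13},
        {"text": "Camille", "start": 14, "end": 21},
        {"text": ".", "start": 21, "end": 22},
    ],
}


def misaligned_tokens(record: dict) -> list:
    """Return the tokens whose offsets do not slice out their own text."""
    text = record["text_clean"]
    return [t for t in record["tokens"] if text[t["start"]:t["end"]] != t["text"]]


print(misaligned_tokens(record))  # an empty list means every offset matches
```

The same function could be applied to `INPUT_DATA[0]` before sending its text to the model.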

2.11. Creating the Prompt

prompt = f"""
Convert the following text into a structured markdown format, where you annotate the entities in the text in the following format: [Tom](PERSON) went to [New York](LOCATION).

Look for the following entity types:
{LABELS}

Do this for the following text:
{TEXT}

Only return the markdown output, nothing else.
"""

print(prompt)

Convert the following text into a structured markdown format, where you annotate the entities in the text in the following format: [Tom](PERSON) went to [New York](LOCATION).

Look for the following entity types:
['PERSON', 'LOCATION', 'ORGANIZATION']

Do this for the following text:
This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.

Only return the markdown output, nothing else.
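One possible refinement, not used in the notebook itself, is to build the prompt in a function and join the labels into a plain comma-separated list, so the model does not see Python list syntax. `build_prompt` and its wording are assumptions for illustration only.

```python
def build_prompt(text: str, labels: list[str]) -> str:
    """Assemble an annotation prompt with the allowed labels listed in plain text."""
    label_list = ", ".join(labels)
    return (
        "Annotate the entities in the text below using markdown, "
        "wrapping each entity as [entity text](LABEL).\n\n"
        f"Use only these entity types: {label_list}\n\n"
        f"Text:\n{text}\n\n"
        "Only return the annotated text, nothing else."
    )


example_prompt = build_prompt(
    "Camille lived at Argenteuil.",
    ["PERSON", "LOCATION", "ORGANIZATION"],
)
print(example_prompt)
```

Keeping the label list as a function argument means the prompt and any later validation can share one source of truth for the allowed labels.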

2.12. Calling OpenAI

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}]
)

markdown_output = response.choices[0].message.content

print(markdown_output)

This painting depicts [Monet](PERSON)'s first wife, [Camille](PERSON), outside on a snowy day passing by the French doors of their home at [Argenteuil](LOCATION). Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.
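Model output is not guaranteed to follow the instructions, so it can be worth checking before use. The sketch below assumes two reasonable checks: stripping the markup should reproduce the input exactly, and every label should be one of the requested ones. `check_annotation` is a hypothetical helper, and the sample strings are made up.

```python
import re

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def check_annotation(original: str, annotated: str, labels: list[str]) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Removing the markup should give back the input text exactly.
    if ANNOTATION.sub(r"\1", annotated) != original:
        problems.append("text was altered")
    # Every label should come from the allowed set.
    for _, label in ANNOTATION.findall(annotated):
        if label not in labels:
            problems.append(f"unexpected label: {label}")
    return problems


allowed = ["PERSON", "LOCATION", "ORGANIZATION"]
original = "Camille lived at Argenteuil."
good = "[Camille](PERSON) lived at [Argenteuil](LOCATION)."
bad = "[Camille](PERSON) lived in [Argenteuil](PLACE)."

print(check_annotation(original, good, allowed))  # []
print(check_annotation(original, bad, allowed))   # ['text was altered', 'unexpected label: PLACE']
```

In the notebook this would be called as `check_annotation(TEXT, markdown_output, LABELS)`.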

2.13. Visualizing the Results

visualize_annotated_text(markdown_output)

This painting depicts Monet PERSON 's first wife, Camille PERSON , outside on a snowy day passing by the French doors of their home at Argenteuil LOCATION . Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.

doc = annotated_text_to_spacy_doc(markdown_output)
print(doc.ents)

(Monet, Camille, Argenteuil)

entities = []
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
    entities.append({
        "text": ent.text,
        "label": ent.label_,
        "start_char": ent.start_char,
        "end_char": ent.end_char
    })
    

Monet PERSON 22 27
Camille PERSON 43 50
Argenteuil LOCATION 121 131

INPUT_DATA[0]["entities"] = entities

print(INPUT_DATA[0]["entities"])

[{'text': 'Monet', 'label': 'PERSON', 'start_char': 22, 'end_char': 27}, {'text': 'Camille', 'label': 'PERSON', 'start_char': 43, 'end_char': 50}, {'text': 'Argenteuil', 'label': 'LOCATION', 'start_char': 121, 'end_char': 131}]
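Note that the offsets stored above are relative to the text of the rebuilt spaCy Doc, not to the original input. Because the helper splits on whitespace and rejoins tokens with single spaces, "Camille," becomes "Camille ," and later offsets drift: the token table in INPUT_DATA places Camille at 42-49 and Argenteuil at 119-129. A minimal sketch of computing offsets against the clean text instead, by stripping the markup while tracking positions, is shown below. `extract_entities` is a hypothetical helper and the sample string is a shortened version of the model output.

```python
import re

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_entities(annotated: str):
    """Return (clean_text, entities) with offsets relative to clean_text."""
    pieces, entities = [], []
    last_end = 0   # position reached in the annotated string
    clean_len = 0  # length of the clean text built so far
    for m in ANNOTATION.finditer(annotated):
        before = annotated[last_end:m.start()]
        pieces.append(before)
        clean_len += len(before)
        text, label = m.group(1), m.group(2)
        entities.append({
            "text": text,
            "label": label,
            "start_char": clean_len,
            "end_char": clean_len + len(text),
        })
        pieces.append(text)
        clean_len += len(text)
        last_end = m.end()
    pieces.append(annotated[last_end:])
    return "".join(pieces), entities


annotated = "This painting depicts [Monet](PERSON)'s first wife, [Camille](PERSON), outside."
clean, ents = extract_entities(annotated)
for e in ents:
    print(e["text"], e["label"], e["start_char"], e["end_char"])
```

With this approach, `clean[e["start_char"]:e["end_char"]]` always equals the entity text, and when the model has not altered the wording, `clean` matches the original input so the offsets line up with the token table.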