2.16. Disambiguation with Tool Calling#

In this notebook, we explore entity disambiguation using GPT-4o’s tool-calling capabilities. Rather than supplying a list of candidates, we let the model search the web and find the Wikidata ID for an entity directly. Combining a large language model with live web access lets us link an entity mention to its Wikidata entry without building a candidate list first. We demonstrate the method on a geographic location mentioned in a short text; the same steps apply to other named entities.

2.16.1. Installing Packages#

!pip install pydantic openai spacy pandas python-dotenv

2.16.2. Getting our ENV Variables#

First, we need to set up our environment variables to access the OpenAI API. We’ll use the python-dotenv package to load environment variables from a .env file, which should contain our OPENAI_API_KEY. This keeps our API key secure by not hardcoding it directly in our code.

import sys
sys.path.append("..")
from dotenv import load_dotenv
import os
from openai import OpenAI
from pydantic import BaseModel
import json
import pandas as pd
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
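
If the .env file is missing or the variable name is misspelled, `os.getenv` returns `None` without complaint, and the problem only surfaces later when the client is used. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Usage, after load_dotenv() has run:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Raising early keeps a configuration mistake from being confused with an API error further down the notebook.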

2.16.3. Visualization Functions#

These helper functions convert text annotated in the [Entity](LABEL) format into a spaCy Doc, which makes it easier to visualize our data with displaCy at the end of the notebook.


import re
import spacy
from spacy.tokens import Doc, Span


def annotated_text_to_spacy_doc(text, nlp=None):
    """
    Converts annotated text in format [Entity](LABEL) to a spaCy Doc with entity spans.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
    
    Returns:
        spacy.tokens.Doc: spaCy document with entity spans set
        
    Example:
        >>> text = "[Tom](PERSON) worked for [Microsoft](ORGANIZATION) in 2020 before he lived in [Rome](LOCATION)."
        >>> doc = annotated_text_to_spacy_doc(text)
        >>> spacy.displacy.render(doc, style="ent")
    """
    if nlp is None:
        nlp = spacy.blank("en")
    
    # Pattern to match [text](LABEL) format
    pattern = r'\[([^\]]+)\]\(([^)]+)\)'
    
    # Parse the text to extract tokens and entity information
    tokens = []
    entity_spans = []  # List of (start_token_idx, end_token_idx, label)
    custom_labels = set()
    
    # Split text by the pattern and process each part
    last_end = 0
    token_idx = 0
    
    for match in re.finditer(pattern, text):
        # Add tokens before the entity
        before_entity = text[last_end:match.start()]
        if before_entity.strip():
            # Tokenize the text before the entity
            before_tokens = before_entity.split()
            tokens.extend(before_tokens)
            token_idx += len(before_tokens)
        
        # Add the entity tokens
        entity_text = match.group(1)
        entity_label = match.group(2)
        custom_labels.add(entity_label)
        
        # Tokenize the entity text
        entity_tokens = entity_text.split()
        start_token_idx = token_idx
        tokens.extend(entity_tokens)
        token_idx += len(entity_tokens)
        end_token_idx = token_idx
        
        # Store entity span information
        entity_spans.append((start_token_idx, end_token_idx, entity_label))
        
        last_end = match.end()
    
    # Add any remaining tokens after the last entity
    remaining = text[last_end:]
    if remaining.strip():
        remaining_tokens = remaining.split()
        tokens.extend(remaining_tokens)
    
    # Add custom labels to the NLP model if they don't exist
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner")
    else:
        ner = nlp.get_pipe("ner")
    
    for label in custom_labels:
        ner.add_label(label)
    
    # Create spaces array (True for tokens that should have a space after them)
    # Simple heuristic: all tokens except the last one get a space
    spaces = [True] * len(tokens)
    if tokens:
        spaces[-1] = False
    
    # Create the Doc from tokens
    doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
    
    # Create entity spans
    entities = []
    for start_idx, end_idx, label in entity_spans:
        if start_idx < len(doc) and end_idx <= len(doc):
            span = Span(doc, start_idx, end_idx, label=label)
            entities.append(span)
    
    # Set entities on the document
    doc.ents = entities
    
    return doc


def visualize_annotated_text(text, nlp=None, style="ent", jupyter=True):
    """
    Convenience function to convert annotated text and visualize it with displaCy.
    
    Args:
        text (str): Text with annotations like "[Tom](PERSON) worked for [Microsoft](ORGANIZATION)"
        nlp (spacy.Language, optional): spaCy language model. If None, uses blank English model.
        style (str): displaCy style ("ent" or "dep")
        jupyter (bool): Whether to render for Jupyter notebook
    
    Returns:
        Rendered visualization (HTML string if not in Jupyter)
    """
    doc = annotated_text_to_spacy_doc(text, nlp)
    
    # spaCy is already imported at the top of this cell, so no ImportError guard is needed
    from spacy import displacy
    return displacy.render(doc, style=style, jupyter=jupyter)

2.16.4. Connect to OpenAI Client#

client = OpenAI(api_key=OPENAI_API_KEY)

2.16.5. Define Notebook Variables#

TEXT = "They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION)."
ENTITY_TO_IDENTIFY = "Memphis"
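
Here the entity to identify is typed by hand. As a minimal sketch, the mentions could instead be read out of the annotated text with the same [text](LABEL) pattern the visualization helpers use. The names `ANNOTATION_PATTERN`, `extract_mentions`, and `sample_text` are illustrative and not part of the notebook.

```python
import re

# Same [text](LABEL) pattern used by annotated_text_to_spacy_doc
ANNOTATION_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_mentions(annotated_text):
    """Return (mention, label) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION_PATTERN.finditer(annotated_text)]


sample_text = (
    "They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) "
    "via the [Nile](LOCATION) to [Thebes](LOCATION)."
)
mentions = extract_mentions(sample_text)
print(mentions)
```

Any of the returned mention strings could then be used as the value of ENTITY_TO_IDENTIFY.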

2.16.6. Crafting the Prompt#

prompt = """
Query the web to identify this entity in Wikidata.

{entity}

It is within the context of the following text:

{text}

Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    entity_text: str
    label: str
    wikidata_id: str
    sources: list[str]
"""
formatted_prompt = prompt.format(entity=ENTITY_TO_IDENTIFY, text=TEXT)
print(formatted_prompt)
Query the web to identify this entity in Wikidata.

Memphis

It is within the context of the following text:

They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION).

Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    entity_text: str
    label: str
    wikidata_id: str
    sources: list[str]

2.16.7. Calling OpenAI#

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search"}],
    input=formatted_prompt,
)

output_text = response.output_text
print(output_text)
```json
{
  "entity_text": "Memphis",
  "label": "Memphis",
  "wikidata_id": "Q5715",
  "sources": [
    "turn0search0",
    "turn0search2"
  ]
}
```
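
A language model can return a well-formed ID that points to the wrong item, so it may be worth confirming the result against Wikidata itself before trusting it. The sketch below assumes Wikidata's `Special:EntityData` JSON endpoint and network access; the function names and the User-Agent string are placeholders, not part of the notebook.

```python
import json
import re
import urllib.request

QID_PATTERN = re.compile(r"Q[1-9]\d*")


def entity_data_url(qid):
    """Build the Special:EntityData URL for a Wikidata item ID such as 'Q5715'."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"Not a well-formed Wikidata item ID: {qid!r}")
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_label_and_description(qid, lang="en"):
    """Fetch an item's label and description from Wikidata (requires network access)."""
    request = urllib.request.Request(
        entity_data_url(qid),
        headers={"User-Agent": "entity-disambiguation-notebook/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        data = json.load(resp)
    # Take the single returned entity; its key can differ from `qid` if the ID redirects
    entity = next(iter(data["entities"].values()))
    label = entity.get("labels", {}).get(lang, {}).get("value")
    description = entity.get("descriptions", {}).get(lang, {}).get("value")
    return label, description


# Usage: label, description = fetch_label_and_description("Q5715")
```

Comparing the returned label and description with the surrounding text (here, a march along the Nile) gives a quick sanity check on the model's answer.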

2.16.8. Parsing the Output#

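def parse_json_with_sources(text):
    # Separate the fenced JSON block from any text the model adds after it.
    # Assumes the reply contains a ```json fence, as in the output above.
    _, _, fenced = text.partition("```json")
    json_text, _, trailing = fenced.partition("```")
    json_data = json.loads(json_text)
    # `trailing` holds whatever follows the closing fence (often an empty string)
    return json_data, trailing.strip()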

json_output, sources = parse_json_with_sources(output_text)
print(json_output)
{'entity_text': 'Memphis', 'label': 'Memphis', 'wikidata_id': 'Q5715', 'sources': ['turn0search0', 'turn0search2']}
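
The prompt asks for a specific schema, but nothing so far checks that the reply follows it. Below is a minimal, standard-library sketch of such a check; the Pydantic `BaseModel` imported earlier could serve the same purpose. The names `REQUIRED_FIELDS` and `validate_entity` are hypothetical.

```python
import re

# Field names and types taken from the schema in the prompt
REQUIRED_FIELDS = {"entity_text": str, "label": str, "wikidata_id": str, "sources": list}


def validate_entity(record):
    """Raise ValueError if the record does not match the schema requested in the prompt."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"Field {field!r} should be of type {expected_type.__name__}")
    # Wikidata item IDs are the letter Q followed by a positive integer
    if not re.fullmatch(r"Q[1-9]\d*", record["wikidata_id"]):
        raise ValueError(f"Not a Wikidata item ID: {record['wikidata_id']!r}")
    return record


# Usage: json_output = validate_entity(json_output)
```

This only checks the shape of the reply. It does not confirm that the ID refers to the intended entity.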

2.16.9. Visualizing the Output#

from spacy import displacy
import spacy
TEXT
'They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION).'
# Build the Doc from the bracket annotations; no pretrained pipeline is needed for this
doc = annotated_text_to_spacy_doc(TEXT)
displacy.render(doc, style="ent")
They marched from Alexandria LOCATION through Memphis LOCATION via the Nile LOCATION to Thebes LOCATION .
output_ents = []    # entity dicts for displaCy's manual mode
pandas_output = []  # one row per entity for the DataFrame
wikidata_id = json_output["wikidata_id"]
for ent in doc.ents:
    is_target = ent.text == ENTITY_TO_IDENTIFY
    label = ent.label_
    if is_target:
        # Append a link to the Wikidata entry to the displayed label
        label = f'{ent.label_} <a href="https://www.wikidata.org/wiki/{wikidata_id}">{wikidata_id}</a>'
    output_ents.append({"start": ent.start_char, "end": ent.end_char, "label": label})
    pandas_output.append({
        "start": ent.start_char,
        "end": ent.end_char,
        "label": ent.label_,
        "wikidata_id": wikidata_id if is_target else None,
    })
dic_ents = {
    "text": doc.text,
    "ents": output_ents,
    "title": None
}

displacy.render(dic_ents, manual=True, style="ent")
They marched from Alexandria LOCATION through Memphis LOCATION Q5715 via the Nile LOCATION to Thebes LOCATION .
df = pd.DataFrame(pandas_output)
df
   start  end     label wikidata_id
0     18   28  LOCATION        None
1     37   44  LOCATION       Q5715
2     53   57  LOCATION        None
3     61   67  LOCATION        None
df.to_csv("../../output/entities-disambiguation-web.csv", index=False)
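
The notebook resolves a single mention. To apply the same steps to every annotated entity, one option is to loop over the mentions and cache results so a repeated mention triggers only one model call. The sketch below is illustrative: `disambiguate_mentions` and `fake_resolve` are hypothetical names, and the stand-in resolver takes the place of the prompt, API call, and parsing steps shown earlier.

```python
def disambiguate_mentions(mentions, resolve):
    """Resolve each distinct mention once and return {mention: wikidata_id}."""
    cache = {}
    for mention in mentions:
        if mention not in cache:
            cache[mention] = resolve(mention)
    return cache


# Stand-in resolver for illustration only. A real one would format the prompt,
# call the model, parse the reply, and return its "wikidata_id".
calls = []


def fake_resolve(mention):
    calls.append(mention)
    return {"Memphis": "Q5715"}.get(mention)


result = disambiguate_mentions(["Memphis", "Nile", "Memphis"], fake_resolve)
print(result)
```

Caching on the mention string alone assumes that identical strings in one text refer to the same entity, which may not hold for longer documents.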