2.16. Disambiguation with Tool Calling
In this notebook, we explore entity disambiguation using GPT-4o’s tool calling capabilities. Rather than supplying a list of candidates, we let the model search the web through the web search tool and return the Wikidata ID for an entity directly. This pairs a large language model with live web access to link entity mentions to their Wikidata entries. We’ll demonstrate the method on a geographic location mentioned in a short text; the same approach applies to other named entities, without having to build candidate lists by hand.
2.16.1. Installing Packages
!pip install pydantic openai spacy pandas python-dotenv
2.16.2. Getting our ENV Variables
First, we need to set up our environment variables to access the OpenAI API. We’ll use the python-dotenv package to load environment variables from a .env file, which should contain our OPENAI_API_KEY. This keeps our API key secure by not hardcoding it directly in our code.
import sys
sys.path.append("..")
from dotenv import load_dotenv
import os
from openai import OpenAI
from pydantic import BaseModel
import json
import pandas as pd
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
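If the .env file is missing or the variable name is misspelled, `os.getenv` returns `None` without complaint and the problem only surfaces later, when the client is first used. A small guard can fail early instead. This is a minimal sketch; `require_env` is a hypothetical helper name, not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; the notebook itself calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

Called after `load_dotenv()`, this could replace the direct lookup: `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.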
2.16.3. Visualization Functions
This bit of code is just for making it easier to visualize our data at the end of the notebook.
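The visualization step later calls `annotated_text_to_spacy_doc`, whose definition is not shown in this section. As a rough sketch of the core work such a helper has to do, the function below strips the `[text](LABEL)` markup used in this notebook's example sentence and records character offsets for each entity. The name `parse_annotated_text` and its return shape are assumptions made for illustration, not the notebook's actual implementation.

```python
import re

# Matches spans written as [surface text](LABEL).
ANNOTATION_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated_text(annotated: str):
    """Strip [text](LABEL) markup and return (plain_text, spans).

    Each span is a dict with character offsets into the plain text.
    Hypothetical helper; the notebook's own version is not shown.
    """
    plain_parts = []
    spans = []
    cursor = 0  # position in the annotated string
    offset = 0  # length of the plain text built so far
    for match in ANNOTATION_PATTERN.finditer(annotated):
        # Copy the unannotated text that precedes this match.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        surface, label = match.group(1), match.group(2)
        spans.append({"start": offset, "end": offset + len(surface), "label": label})
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    # Keep whatever follows the last annotation.
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), spans
```

To turn the result into a spaCy document, one option is to tokenize the plain text with `spacy.blank("en")`, build each entity with `doc.char_span(start, end, label=label)`, and assign the resulting spans to `doc.ents`. Note that `char_span` returns `None` when the offsets do not line up with token boundaries, so those cases need handling.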
2.16.4. Connect to OpenAI Client
client = OpenAI(api_key=OPENAI_API_KEY)
2.16.5. Define Notebook Variables
TEXT = "They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION)."
ENTITY_TO_IDENTIFY = "Memphis"
2.16.6. Crafting the Prompt
prompt = """
Query the web to identify this entity in Wikidata.
{entity}
It is within the context of the following text:
{text}
Only return the JSON output, nothing else. Do so with the following schema:
class Entity(BaseModel):
    entity_text: str
    label: str
    wikidata_id: str
    sources: list[str]
"""
formatted_prompt = prompt.format(entity=ENTITY_TO_IDENTIFY, text=TEXT)
print(formatted_prompt)
Query the web to identify this entity in Wikidata.
Memphis
It is within the context of the following text:
They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION).
Only return the JSON output, nothing else. Do so with the following schema:
class Entity(BaseModel):
    entity_text: str
    label: str
    wikidata_id: str
    sources: list[str]
2.16.7. Calling OpenAI
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search"}],
    input=formatted_prompt,
)
output_text = response.output_text
print(output_text)
```json
{
"entity_text": "Memphis",
"label": "Memphis",
"wikidata_id": "Q5715",
"sources": [
"turn0search0",
"turn0search2"
]
}
```
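The `wikidata_id` above is produced by the model, so it is worth checking against Wikidata itself before trusting it. The sketch below looks up an item's English label through Wikidata's `Special:EntityData` JSON endpoint using only the standard library. The function names and the User-Agent string are placeholders chosen for illustration, and the network call is shown without error handling.

```python
import json
import urllib.request


def entity_data_url(qid: str) -> str:
    """Build the Special:EntityData URL that serves an item as JSON."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def english_label(payload: dict, qid: str):
    """Pull the English label out of an EntityData payload, or None if absent."""
    entity = payload.get("entities", {}).get(qid, {})
    return entity.get("labels", {}).get("en", {}).get("value")


def fetch_english_label(qid: str, timeout: float = 10.0):
    """Fetch an item from Wikidata and return its English label (network call)."""
    request = urllib.request.Request(
        entity_data_url(qid),
        # Placeholder User-Agent; replace it with one that identifies your project.
        headers={"User-Agent": "entity-disambiguation-notebook/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return english_label(payload, qid)
```

Comparing the fetched label (and, if needed, the item's description) with the mention in the text gives a quick check that the model did not return an ID for a different place with the same name.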
2.16.8. Parsing the Output
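def parse_json_with_sources(text):
    # Keep everything after the opening ```json fence.
    json_data = text.split("```json", 1)[1]
    # Split at the first closing fence; any trailing text is returned as sources.
    json_data, _, sources = json_data.partition("```")
    json_data = json.loads(json_data)
    return json_data, sources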
json_output, sources = parse_json_with_sources(output_text)
print(json_output)
{'entity_text': 'Memphis', 'label': 'Memphis', 'wikidata_id': 'Q5715', 'sources': ['turn0search0', 'turn0search2']}
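The parser above assumes the reply is always wrapped in a json code fence, which the model does not guarantee. A more defensive variant, sketched below under the assumption that the reply is either a fenced block or bare JSON, also checks that the returned ID has the shape of a Wikidata item ID (the letter Q followed by a positive integer). `extract_entity` is a hypothetical name, not a function from the notebook.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically for readability
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

# Wikidata item IDs look like Q42: the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def extract_entity(text: str) -> dict:
    """Parse the model's reply into a dict, with or without a code fence."""
    fenced = FENCED_BLOCK.search(text)
    # Fall back to the raw reply when no fenced block is present.
    payload = fenced.group(1) if fenced else text
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object describing one entity.")
    qid = str(data.get("wikidata_id", ""))
    if not QID_PATTERN.match(qid):
        raise ValueError(f"Unexpected Wikidata ID format: {qid!r}")
    return data
```

The format check only catches malformed IDs; it cannot tell whether a well-formed ID points at the intended entity.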
2.16.9. Visualizing the Output
from spacy import displacy
import spacy
TEXT
'They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION).'
nlp = spacy.load("en_core_web_sm")
doc = nlp(TEXT)
doc = annotated_text_to_spacy_doc(TEXT)
displacy.render(doc, style="ent")
output_ents = []
pandas_output = []
for ent in doc.ents:
    if ent.text == ENTITY_TO_IDENTIFY:
        # Attach a Wikidata link to the label of the entity we disambiguated.
        output_ents.append({"start": ent.start_char, "end": ent.end_char, "label": f'{ent.label_} <a href="https://www.wikidata.org/wiki/{json_output["wikidata_id"]}">{json_output["wikidata_id"]}</a>'})
        pandas_output.append({"start": ent.start_char, "end": ent.end_char, "label": f'{ent.label_}', 'wikidata_id': json_output["wikidata_id"]})
    else:
        # Other entities keep their original label and have no Wikidata ID yet.
        output_ents.append({"start": ent.start_char, "end": ent.end_char, "label": ent.label_, 'wikidata_id': None})
        pandas_output.append({"start": ent.start_char, "end": ent.end_char, "label": ent.label_, 'wikidata_id': None})
dic_ents = {
    "text": doc.text,
    "ents": output_ents,
    "title": None
}
displacy.render(dic_ents, manual=True, style="ent")
df = pd.DataFrame(pandas_output)
df
| | start | end | label | wikidata_id |
|---|---|---|---|---|
| 0 | 18 | 28 | LOCATION | None |
| 1 | 37 | 44 | LOCATION | Q5715 |
| 2 | 53 | 57 | LOCATION | None |
| 3 | 61 | 67 | LOCATION | None |
df.to_csv("../../output/entities-disambiguation-web.csv", index=False)