1.1. Preparing a Cultural Heritage Dataset for Named Entity Recognition

1.1.1. Rationale

This recipe takes some source content and preprocesses it for use across the rest of the cookbook.

You can either:

- run queries to get sample content from the Cleveland Museum of Art API, searching by keyword (e.g. “Manet”) or by record ID, or
- upload your own sample text (e.g. transcribed text from a digitised item).

The recipe then shows the steps needed to process the results and format them as a JSON file that can be fed into a NER process.

1.2. Overview of the process

Most parts of the recipe simply need to be run in a notebook environment such as Colab, Binder, or Jupyter Notebook.

Initialise the notebook

Run the first ‘cells’ to import the required libraries

Fetch input
This is the part where you can provide some specific input.

You can:

- Get artworks from the Cleveland Museum of Art API (by ID or keyword). You can run the code as-is, or change the options to search for specific keywords or record IDs.
- Upload your own text.

Process text
- Clean text - normalize Unicode, remove control characters, collapse spaces, mask URLs/emails.
- Detect language - we use langid.py to assign a language code and a score (a log-likelihood, not a probability or percentage).
- Tokenize & split sentences - use spaCy’s lightweight tokenizer and sentencizer to create tokens and sentence spans.
Assemble JSON
- Combine results into a structured record:
  - text_original and text_clean
  - language (code + score)
  - sentences (spans + text)
  - tokens (spans + features)
  - meta (counts + IDs)
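
As a rough illustration of the record described above, the sketch below builds one by hand for a single short sentence and serialises it with json.dumps. The top-level field names follow the list above; the sample sentence, the empty score, the meta values, and the reduced set of keys inside each token (id, text, start, end) are illustrative assumptions, not output from the recipe.

```python
import json

text = "Monet painted in Giverny."

# Hand-built example record; all values are illustrative only.
record = {
    "text_original": text,
    "text_clean": text,
    "language": {"language": "en", "score": None},  # score left empty in this sketch
    "sentences": [{"id": 0, "start": 0, "end": len(text), "text": text}],
    "tokens": [
        {"id": 0, "text": "Monet", "start": 0, "end": 5},
        {"id": 1, "text": "painted", "start": 6, "end": 13},
        {"id": 2, "text": "in", "start": 14, "end": 16},
        {"id": 3, "text": "Giverny", "start": 17, "end": 24},
        {"id": 4, "text": ".", "start": 24, "end": 25},
    ],
    "meta": {"source": "example", "id": "demo-1"},
}

# Counts are derived from the record itself rather than typed in by hand.
record["meta"]["char_count"] = len(text)
record["meta"]["token_count"] = len(record["tokens"])
record["meta"]["sentence_count"] = len(record["sentences"])

# ensure_ascii=False keeps accented characters readable in the JSON output.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Storing character offsets alongside the token text means a downstream NER step can map entity predictions back to positions in the source string.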

1.3. Step 1: Install the necessary packages and libraries

%pip install spacy langid


# pip install requests spacy langid

import re, unicodedata, json
from typing import List, Dict, Any, Tuple, Optional
import hashlib
import io

import requests
import spacy
import langid

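# Detect whether the notebook is running in Google Colab (needed for file upload later)
try:
  from google.colab import files
  IN_COLAB = True
except ImportError:
  IN_COLAB = False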

1.4. Step 2: Get content - run API queries or upload your own text

1.4.1. Step 2a: Fetch Input from a Museum API

Get artworks from the Cleveland Museum of Art API (by ID or keyword).

# ---------------------------
# 1) CMA API (single function)
# ---------------------------
def fetch_cma(source: str, *, mode: str = "search", limit: int = 100) -> List[Dict[str, Any]]:
    """
    mode='id'     -> source is an artwork id
    mode='search' -> source is a keyword query; at most `limit` results are requested (default: 100)
    Returns: list of {'id': <id>, 'text': <description or ''>}
    """
    base = "https://openaccess-api.clevelandart.org/api/artworks"

    if mode == "id":
        url = f"{base}/{source}"
        r = requests.get(url, timeout=20)
        if r.status_code == 404:
            # Graceful return instead of raising
            return []
        r.raise_for_status()
        data = r.json().get("data", {})
        return [{"id": data.get("id"), "text": (data.get("description") or "").strip()}]

    # search mode
    params = {"q": source, "skip": 0, "limit": limit}
    r = requests.get(base, params=params, timeout=30)
    r.raise_for_status()
    out = []
    for a in r.json().get("data", []):
        out.append({"id": a.get("id"), "text": (a.get("description") or "").strip()})
    return out
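
fetch_cma sends a single request with skip set to 0, so a search never returns more than limit records. The sketch below shows one way to collect more by advancing skip page by page. The names fetch_all and fetch_page are hypothetical, and the idea that the API pages through results this way is an assumption based on the skip and limit parameters used above. A stand-in page function keeps the example runnable offline.

```python
from typing import Any, Callable, Dict, List

def fetch_all(
    fetch_page: Callable[[int, int], List[Dict[str, Any]]],
    page_size: int = 100,
    max_records: int = 500,
) -> List[Dict[str, Any]]:
    """Call fetch_page(skip, limit) repeatedly until a short page or max_records is reached."""
    records: List[Dict[str, Any]] = []
    skip = 0
    while len(records) < max_records:
        limit = min(page_size, max_records - len(records))
        page = fetch_page(skip, limit)
        records.extend(page)
        if len(page) < limit:  # short (or empty) page: nothing more to fetch
            break
        skip += len(page)
    return records

# Stand-in for the real API so the sketch runs offline: 230 fake records.
FAKE = [{"id": i, "text": f"record {i}"} for i in range(230)]

def fake_page(skip: int, limit: int) -> List[Dict[str, Any]]:
    return FAKE[skip:skip + limit]

print(len(fetch_all(fake_page)))                   # 230
print(len(fetch_all(fake_page, max_records=150)))  # 150
```

In the notebook, fetch_page would wrap a request that passes skip and limit in params, as the search branch of fetch_cma does. The max_records cap is there to avoid accidentally downloading a very large result set.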

The cells below call fetch_cma in each mode and print a preview of the results.

# test the fetch_cma function in "search" mode with a keyword, e.g. "monet", and print a preview of the results
records = fetch_cma("monet", mode="search")
if not records:
    print("CMA search: no records found.")

print(records[:10])

[{'id': 135382, 'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."}, {'id': 136510, 'text': 'A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France. The resultant canvases are notable for their varied motifs, formats, and sizes. Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface. By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format that fills the viewer’s field of vision.'}, {'id': 130391, 'text': "This early work reveal's Monet's fascination with capturing the transitory effects that became the primary focus of his later innovations. Painted with almost scientific accuracy, this still life has a freshness and immediacy derived partly from its composition. Isolated against a dark background, the fully mature peonies, potted hydrangeas, and basketed lilacs spill downward and outward from the geraniums at the rear. At the same time, Monet's energetic brushwork conveys the sparkling play of light on leaves and petals."}, {'id': 95272, 'text': 'In 1888, Claude Monet spent four months in Antibes, a city in southeastern France on the Mediterranean coast, to derive inspiration for painting. Although his visit was occasionally challenged by strong winds that threatened to knock over his easel, the artist was able to complete nearly 40 works. 
This especially vibrant canvas is a depiction of a gardener’s house set against the sea and the distant Alps. Monet portrays intense midday light through thickly applied paint in bright colors that evoke the region’s sun-drenched climate. Small daubs of green on the slender trees framing the house suggest the onset of spring.'}, {'id': 125234, 'text': 'This is one of several views Claude Monet painted of Pourville, a fishing village on the north coast of France in Normandy. Pourville featured a boardwalk, a broad beach, and stunning cliffs. According to the artist\'s correspondence, the village was nothing special, but he was inspired by the landscape: "One could not be any closer to the sea than I am, on the pebbled beach itself and the waves beat at the foot of the house." During the spring of 1882 in Pourville, Monet stayed in a small hotel similar to the dwellings seen at the foot of the cliff to the right in this painting.'}, {'id': 150069, 'text': "Paul Paulin was a renowned Parisian dentist who had a passion for sculpting and was encouraged by Edgar Degas to pursue an artistic career. Paulin is best known for his portrait busts of notable Impressionists. There are related versions of this portrait bust of the celebrated painter Claude Monet at three museums in Paris: the Petit Palais (plaster), the Musée d'Orsay (bronze), and the Musée Rodin (bronze)."}, {'id': 518788, 'text': 'When photographing that garden in 1989, Cleveland artist Herbert Ascherman chose an elongated, vertical panoramic format. It evokes the Japanese scroll paintings that influenced the Impressionist painters, including Monet. To view other depictions of the garden at Giverny in the museum’s collection see Monet’s painting <em>Water Lilies (Agapanthus)</em>, c. 
1915–26 (1960.81) and photographs by Sally Gall (1993.216) and Lynn Geesaman (2000.90).'}, {'id': 444535, 'text': 'Frédéric Bazille painted this charming, poignant portrait of his close friend and fellow Impressionist Pierre-Auguste Renoir at a time when he was sharing his studio with Renoir and Claude Monet. Although Bazille played an important role in the early development of Impressionism, he is not as well-known as his colleagues due to his early death at age 28 while serving in the French army during the Franco-Prussian War of 1870.'}, {'id': 157104, 'text': 'Although born and trained in Holland, Jongkind spent much of his life painting outdoors in France. In this depiction of Bas-Meudon near Paris, the artist applied paint in small patches of bright color to suggest the intensity of outdoor light. Although typcially finished in the studio from open-air sketches, Jongkind\'s oil paintings achieve a convincing immediacy that greatly impressed the young Claude Monet. The two met in the early 1860s and spent part of a summer painting together along the coast of Normandy. "From that time he was my real master," Monet later acknowledged, "it was to him that I owe the final education of my eye."'}, {'id': 128363, 'text': 'The town of Villerville on the Normandy coast appears just to the right of center in this expansive landscape by Daubigny, a pioneer of outdoor painting and a major influence on Claude Monet and the Impressionists. Daubigny introduced a new kind of natural landscape based on outdoor studies of light, water, and atmospheric conditions. Here, streaks of bright light along the horizon set off the dark masses of the rocky shore in the foreground.'}]
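
One of the descriptions returned above contains inline HTML (an <em> tag around an artwork title). Markup like this can end up inside tokens, so it may be worth stripping before NER. The sketch below is one possible approach using only the standard library; strip_html is a hypothetical helper and is not part of the recipe's own cleaning step.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # also decodes entities such as &amp;
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def strip_html(text: str) -> str:
    """Return text with tags removed and character references decoded."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

sample = "see Monet’s painting <em>Water Lilies (Agapanthus)</em>, c. 1915–26"
print(strip_html(sample))
```

Because removing tags shortens the string, this kind of step belongs before tokenization, so that stored offsets refer to the text that was actually tokenized.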

# test the fetch_cma function and print an overview of the results, using the "id" mode and a sample record id.
records = fetch_cma("135382", mode="id")
if not records:
    print("CMA by id: no record (likely 404).")

print(records[:10])

[{'id': 135382, 'text': "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."}]


1.5. Step 2b: Using your own data

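To run the rest of the recipe on your own material instead of the API results, load one or more plain-text files (e.g. transcribed text from a digitised item). In Colab, the cell below opens an upload dialog; on a local machine, pass the path of a text file to fetch_local. Each file becomes one record with the same id and text fields as the API results.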

def fetch_local(source: Optional[str] = None):
  if IN_COLAB:
    # Upload one or more text files to use your own data
    uploaded = files.upload()
    records = [{"id": filename, "text": contents.decode("utf-8")} for filename, contents in uploaded.items()]
  else:
    # If running on your local machine, pass the location of the text file
    if source is None:
      raise ValueError("Outside Colab, pass the path of a text file as `source`.")
    with open(source, "r", encoding="utf-8") as f:
      records = [{"id": source, "text": f.read().strip()}]

  return records
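
Outside Colab, fetch_local reads a single file. For a folder of transcriptions, something like the sketch below could build the same kind of records in one call. The function name fetch_local_dir, the *.txt pattern, and the choice of the utf-8-sig codec are assumptions for illustration.

```python
from pathlib import Path
from typing import Dict, List

def fetch_local_dir(folder: str, pattern: str = "*.txt") -> List[Dict[str, str]]:
    """Read every file matching pattern in folder into {'id', 'text'} records."""
    records = []
    for path in sorted(Path(folder).glob(pattern)):
        # utf-8-sig decodes plain UTF-8 and also drops a leading byte-order mark if present
        text = path.read_text(encoding="utf-8-sig").strip()
        records.append({"id": path.name, "text": text})
    return records

# Hypothetical usage: "my_texts" is a placeholder folder name.
# records = fetch_local_dir("my_texts")
```

Sorting the paths keeps the record order stable between runs, which makes the resulting JSON easier to compare.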

records = fetch_local()
records

Saving 7E59v2Ax.txt to 7E59v2Ax.txt
Saving 7E51v1x.txt to 7E51v1x (9).txt

[{'id': '7E59v2Ax.txt',
  'text': '\ufeff-----------------------------------------------------------------------------\r\nMAG X870E TOMAHAWK WIFI (MS-7E59) V2.A83 BIOS Release\r\n-----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n-  Improve memory compatibility.\r\n-  Optimized Secure Erase mechanism.\r\n\r\n3. 2025/08/01\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  改善記憶體相容性問題。\r\n-  優化安全性清除機制。\r\n\r\nC. 更新日期: 公元2025年8月1號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  改善内存兼容性问题。\r\n-  优化安全性清除机制。\r\n\r\nC. 更新日期: 公元2025年8月1号'},
 {'id': '7E51v1x (9).txt',
  'text': '\ufeff----------------------------------------------------------------------------\r\nMAG X870 TOMAHAWK WIFI (MS-7E51) V1.A65 BIOS Release\r\n----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n-  Support AGESA PI 1.2.0.3f.\r\n-  Security issue mitigation.\r\n\r\n3. 2025/07/18\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  支援 AGESA PI 1.2.0.3f。\r\n-  修補安全性問題。\r\n\r\nC. 更新日期: 公元2025年7月18號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  支援 AGESA PI 1.2.0.3f。\r\n-  修补安全性问题。\r\n\r\nC. 更新日期: 公元2025年7月18号'}]

1.6. Process text for NER

1.6.1. Step 3a: Clean text

Normalize Unicode, remove control characters, collapse spaces, mask URLs/emails.

# ---------------------------
# 2) Pre-processing helpers
# ---------------------------
def clean_text(text: str) -> str:
    t = unicodedata.normalize("NFC", text)
    t = re.sub(r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]", "", t)  # strip control chars
    t = re.sub(r"[ \t\u00A0]+", " ", t)                                   # collapse spaces
    t = re.sub(r"https?://\S+", "<URL>", t)                               # mask URLs
    t = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<EMAIL>", t)               # mask emails
    return t

# call clean_text on the records returned by the previous step and print the first results
clean_records = [clean_text(r["text"]) for r in records]
print(clean_records[:10])

['\ufeff-----------------------------------------------------------------------------\r\nMAG X870E TOMAHAWK WIFI (MS-7E59) V2.A83 BIOS Release\r\n-----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n- Improve memory compatibility.\r\n- Optimized Secure Erase mechanism.\r\n\r\n3. 2025/08/01\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n- 改善記憶體相容性問題。\r\n- 優化安全性清除機制。\r\n\r\nC. 更新日期: 公元2025年8月1號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n- 改善内存兼容性问题。\r\n- 优化安全性清除机制。\r\n\r\nC. 更新日期: 公元2025年8月1号', '\ufeff----------------------------------------------------------------------------\r\nMAG X870 TOMAHAWK WIFI (MS-7E51) V1.A65 BIOS Release\r\n----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n- Support AGESA PI 1.2.0.3f.\r\n- Security issue mitigation.\r\n\r\n3. 2025/07/18\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n- 支援 AGESA PI 1.2.0.3f。\r\n- 修補安全性問題。\r\n\r\nC. 更新日期: 公元2025年7月18號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n- 支援 AGESA PI 1.2.0.3f。\r\n- 修补安全性问题。\r\n\r\nC. 更新日期: 公元2025年7月18号']
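
The output above still starts with a byte-order mark (\ufeff) and keeps Windows line endings (\r\n), because clean_text leaves both untouched. If that is not wanted, an extra step along the following lines could run before tokenization. The helper normalize_layout is hypothetical and not part of the recipe.

```python
def normalize_layout(text: str) -> str:
    """Drop any leading byte-order mark and convert CRLF/CR line endings to LF."""
    text = text.lstrip("\ufeff")
    return text.replace("\r\n", "\n").replace("\r", "\n")

sample = "\ufeffMAG X870E TOMAHAWK WIFI\r\n\r\n1. This is AMI BIOS release"
print(repr(normalize_layout(sample)))
# → 'MAG X870E TOMAHAWK WIFI\n\n1. This is AMI BIOS release'
```

Like any step that changes string length, this shifts character offsets, so it should be applied to the same text that is later tokenized.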

Then, detect the language of each record using langid.py (a standalone library, not a spaCy module). The detected language code is stored with each record and can guide the choice of tokenizer or sentencizer in later steps.

# detect language using langid

def detect_language(text: str) -> Dict[str, Any]:
    if not text:
        return {"language": "xx", "score": None}
    code, score = langid.classify(text)  # score = log-likelihood
    return {"language": code, "score": float(score)}
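
The score returned here is a log-likelihood, so it is negative, grows more negative with longer texts, and cannot be read as a percentage. One way to get something closer to a confidence is to compare the scores of several candidate languages with a softmax, as sketched below. The helper to_probabilities is hypothetical and the scores are made up for illustration; in the notebook, candidate scores might come from langid.rank(text), and langid.py also documents a norm_probs option on its LanguageIdentifier class for the same purpose.

```python
import math
from typing import List, Tuple

def to_probabilities(ranked: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Softmax over (language, log-score) pairs, highest probability first."""
    top = max(score for _, score in ranked)
    # Subtracting the maximum keeps exp() numerically stable for very negative scores.
    weights = [(code, math.exp(score - top)) for code, score in ranked]
    total = sum(w for _, w in weights)
    probs = [(code, w / total) for code, w in weights]
    return sorted(probs, key=lambda p: p[1], reverse=True)

# Made-up scores for illustration only, not output from langid.
ranked = [("zh", -1687.9), ("ja", -1690.2), ("en", -1750.0)]
for code, p in to_probabilities(ranked):
    print(code, round(p, 3))
```

A mixed-language text, such as the sample files used in this recipe, may still receive a single confident label, so the result is best treated as a hint rather than a guarantee.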

# call the function detect_language and print the language id for the records output
lang_id = [detect_language(r) for r in clean_records]
print(lang_id[:10])

[{'language': 'zh', 'score': -1687.8875234127045}, {'language': 'zh', 'score': -1590.8130090236664}]

1.6.2. Step 3b: Preprocess the texts

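A blank multi-language spaCy pipeline (spacy.blank("xx")) with the rule-based sentencizer splits each text into tokens and sentences. In the code below, the same pipeline is used whatever language langid reports: the detected language is stored in the record but does not select a language-specific tokenizer. Tokens and sentences are computed on the original text, so the start and end offsets index into text_original rather than text_clean.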

Then, assemble the structured record as a JSON output.

Output fields:
- text_original and text_clean
- language (code + score)
- sentences (spans + text)
- tokens (spans + features)
- meta (counts + IDs)
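
Since each sentence and token carries both its text and its character span, the two can be checked against each other. The sketch below assumes that start and end are character offsets into text_original, which is what the sample output in this recipe suggests. The helper find_span_mismatches and the demo record are hypothetical.

```python
from typing import Any, Dict, List

def find_span_mismatches(record: Dict[str, Any], text_key: str = "text_original") -> List[str]:
    """Return a label for every sentence or token whose span does not match its text."""
    source = record[text_key]
    problems = []
    for kind in ("sentences", "tokens"):
        for item in record.get(kind, []):
            if source[item["start"]:item["end"]] != item["text"]:
                problems.append(f"{kind}[{item['id']}]")
    return problems

demo = {
    "text_original": "Water Lilies. Giverny.",
    "sentences": [
        {"id": 0, "start": 0, "end": 13, "text": "Water Lilies."},
        {"id": 1, "start": 14, "end": 22, "text": "Giverny."},
    ],
    "tokens": [
        {"id": 0, "text": "Water", "start": 0, "end": 5},
        {"id": 1, "text": "Lilies", "start": 6, "end": 13},  # deliberately wrong: should end at 12
    ],
}
print(find_span_mismatches(demo))
# → ['tokens[1]']
```

An empty list means every span can be used to slice the source text directly, which is what a NER step relying on offsets needs.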

# ---------------------------
# 3) Preprocess
# ---------------------------
def preprocess_texts(
    texts_with_meta: List[Tuple[str, Dict[str, Any]]],
    do_clean: bool = True,
    do_langid: bool = True,
    do_sentences: bool = True,
    do_tokens: bool = True,
) -> List[Dict[str, Any]]:
    nlp = spacy.blank("xx")
    if do_sentences:
        nlp.add_pipe("sentencizer")

    results: List[Dict[str, Any]] = []

    if do_tokens or do_sentences:
        # Iterate with zip to consume the generator safely
        for (text, meta), doc in zip(texts_with_meta, nlp.pipe([t for t, _ in texts_with_meta])):
            text_clean = clean_text(text) if do_clean else text
            lang = detect_language(text) if do_langid else {"language": "xx", "score": None}

            # Sentences
            sents = []
            if do_sentences:
                for sid, s in enumerate(doc.sents):
                    sents.append({"id": sid, "start": s.start_char, "end": s.end_char, "text": s.text})

            # Tokens
            tokens = []
            if do_tokens:
                tok2sent = {}
                if do_sentences:
                    for sid, s in enumerate(doc.sents):
                        for tok in s:
                            tok2sent[tok.i] = sid
                for tok in doc:
                    tokens.append({
                        "id": tok.i,
                        "text": tok.text,
                        "start": tok.idx,
                        "end": tok.idx + len(tok.text),
                        "ws": tok.whitespace_ != "",
                        "is_punct": tok.is_punct,
                        "sent_id": tok2sent.get(tok.i) if do_sentences else None
                    })

            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": sents,
                "tokens": tokens,
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": len(tokens),
                    "sentence_count": len(sents),
                }
            })
    else:
        # No tokenization/sentences requested
        for text, meta in texts_with_meta:
            text_clean = clean_text(text) if do_clean else text
            lang = detect_language(text) if do_langid else {"language": "xx", "score": None}
            results.append({
                "text_original": text,
                "text_clean": text_clean,
                "language": lang,
                "sentences": [],
                "tokens": [],
                "meta": {
                    **meta,
                    "char_count": len(text),
                    "token_count": 0,
                    "sentence_count": 0,
                }
            })

    return results
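
The list returned by preprocess_texts contains only strings, numbers, booleans, None, lists and dicts, so it can be written to disk with the standard json module. A minimal sketch follows; the helper names and the file name ner_input.json are placeholders.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

def save_json(records: List[Dict[str, Any]], path: str) -> None:
    """Write the preprocessed records to a UTF-8 JSON file."""
    with Path(path).open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. accented or CJK characters) readable
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_json(path: str) -> List[Dict[str, Any]]:
    """Read the records back, e.g. at the start of a later recipe."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical usage with the output of preprocess_texts:
# save_json(out_api_id, "ner_input.json")
```

Line endings and other special characters inside the text fields are escaped by the JSON encoder, so they come back unchanged when the file is loaded.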

# call the preprocessing function and print the first results
# ("source" is a free-text provenance label: change "CMA" if the records came from your own files)
pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records]
out_api_id = preprocess_texts(pairs)
print(out_api_id[:10])

[{'text_original': '\ufeff-----------------------------------------------------------------------------\r\nMAG X870E TOMAHAWK WIFI (MS-7E59) V2.A83 BIOS Release\r\n-----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n-  Improve memory compatibility.\r\n-  Optimized Secure Erase mechanism.\r\n\r\n3. 2025/08/01\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  改善記憶體相容性問題。\r\n-  優化安全性清除機制。\r\n\r\nC. 更新日期: 公元2025年8月1號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  改善内存兼容性问题。\r\n-  优化安全性清除机制。\r\n\r\nC. 更新日期: 公元2025年8月1号', 'text_clean': '\ufeff-----------------------------------------------------------------------------\r\nMAG X870E TOMAHAWK WIFI (MS-7E59) V2.A83 BIOS Release\r\n-----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n- Improve memory compatibility.\r\n- Optimized Secure Erase mechanism.\r\n\r\n3. 2025/08/01\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n- 改善記憶體相容性問題。\r\n- 優化安全性清除機制。\r\n\r\nC. 更新日期: 公元2025年8月1號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n- 改善内存兼容性问题。\r\n- 优化安全性清除机制。\r\n\r\nC. 
更新日期: 公元2025年8月1号', 'language': {'language': 'zh', 'score': -1687.8875234127045}, 'sentences': [{'id': 0, 'start': 0, 'end': 218, 'text': '\ufeff-----------------------------------------------------------------------------\r\nMAG X870E TOMAHAWK WIFI (MS-7E59) V2.A83 BIOS Release\r\n-----------------------------------------------------------------------------\r\n\r\n1.'}, {'id': 1, 'start': 219, 'end': 249, 'text': 'This is AMI BIOS release\r\n\r\n2.'}, {'id': 2, 'start': 250, 'end': 346, 'text': 'This BIOS fixes the following problem of the previous version:\r\n-  Improve memory compatibility.'}, {'id': 3, 'start': 346, 'end': 384, 'text': '\r\n-  Optimized Secure Erase mechanism.'}, {'id': 4, 'start': 384, 'end': 390, 'text': '\r\n\r\n3.'}, {'id': 5, 'start': 391, 'end': 511, 'text': '2025/08/01\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  改善記憶體相容性問題。'}, {'id': 6, 'start': 511, 'end': 526, 'text': '\r\n-  優化安全性清除機制。'}, {'id': 7, 'start': 526, 'end': 658, 'text': '\r\n\r\nC. 更新日期: 公元2025年8月1號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  改善内存兼容性问题。'}, {'id': 8, 'start': 658, 'end': 673, 'text': '\r\n-  优化安全性清除机制。'}, {'id': 9, 'start': 673, 'end': 697, 'text': '\r\n\r\nC. 
更新日期: 公元2025年8月1号'}], 'tokens': [{'id': 0, 'text': '\ufeff-----------------------------------------------------------------------------', 'start': 0, 'end': 78, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 1, 'text': '\r\n', 'start': 78, 'end': 80, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 2, 'text': 'MAG', 'start': 80, 'end': 83, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 3, 'text': 'X870E', 'start': 84, 'end': 89, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 4, 'text': 'TOMAHAWK', 'start': 90, 'end': 98, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 5, 'text': 'WIFI', 'start': 99, 'end': 103, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 6, 'text': '(', 'start': 104, 'end': 105, 'ws': False, 'is_punct': True, 'sent_id': 0}, {'id': 7, 'text': 'MS-7E59', 'start': 105, 'end': 112, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 8, 'text': ')', 'start': 112, 'end': 113, 'ws': True, 'is_punct': True, 'sent_id': 0}, {'id': 9, 'text': 'V2.A83', 'start': 114, 'end': 120, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 10, 'text': 'BIOS', 'start': 121, 'end': 125, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 11, 'text': 'Release', 'start': 126, 'end': 133, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 12, 'text': '\r\n', 'start': 133, 'end': 135, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 13, 'text': '-----------------------------------------------------------------------------', 'start': 135, 'end': 212, 'ws': False, 'is_punct': True, 'sent_id': 0}, {'id': 14, 'text': '\r\n\r\n', 'start': 212, 'end': 216, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 15, 'text': '1', 'start': 216, 'end': 217, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 16, 'text': '.', 'start': 217, 'end': 218, 'ws': True, 'is_punct': True, 'sent_id': 0}, {'id': 17, 'text': 'This', 'start': 219, 'end': 223, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 18, 'text': 'is', 'start': 224, 'end': 
226, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 19, 'text': 'AMI', 'start': 227, 'end': 230, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 20, 'text': 'BIOS', 'start': 231, 'end': 235, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 21, 'text': 'release', 'start': 236, 'end': 243, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 22, 'text': '\r\n\r\n', 'start': 243, 'end': 247, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 23, 'text': '2', 'start': 247, 'end': 248, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 24, 'text': '.', 'start': 248, 'end': 249, 'ws': True, 'is_punct': True, 'sent_id': 1}, {'id': 25, 'text': 'This', 'start': 250, 'end': 254, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 26, 'text': 'BIOS', 'start': 255, 'end': 259, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 27, 'text': 'fixes', 'start': 260, 'end': 265, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 28, 'text': 'the', 'start': 266, 'end': 269, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 29, 'text': 'following', 'start': 270, 'end': 279, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 30, 'text': 'problem', 'start': 280, 'end': 287, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 31, 'text': 'of', 'start': 288, 'end': 290, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 32, 'text': 'the', 'start': 291, 'end': 294, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 33, 'text': 'previous', 'start': 295, 'end': 303, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 34, 'text': 'version', 'start': 304, 'end': 311, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 35, 'text': ':', 'start': 311, 'end': 312, 'ws': False, 'is_punct': True, 'sent_id': 2}, {'id': 36, 'text': '\r\n', 'start': 312, 'end': 314, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 37, 'text': '-', 'start': 314, 'end': 315, 'ws': True, 'is_punct': True, 'sent_id': 2}, {'id': 38, 'text': ' ', 'start': 316, 'end': 317, 'ws': 
False, 'is_punct': False, 'sent_id': 2}, {'id': 39, 'text': 'Improve', 'start': 317, 'end': 324, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 40, 'text': 'memory', 'start': 325, 'end': 331, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 41, 'text': 'compatibility', 'start': 332, 'end': 345, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 42, 'text': '.', 'start': 345, 'end': 346, 'ws': False, 'is_punct': True, 'sent_id': 2}, {'id': 43, 'text': '\r\n', 'start': 346, 'end': 348, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 44, 'text': '-', 'start': 348, 'end': 349, 'ws': True, 'is_punct': True, 'sent_id': 3}, {'id': 45, 'text': ' ', 'start': 350, 'end': 351, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 46, 'text': 'Optimized', 'start': 351, 'end': 360, 'ws': True, 'is_punct': False, 'sent_id': 3}, {'id': 47, 'text': 'Secure', 'start': 361, 'end': 367, 'ws': True, 'is_punct': False, 'sent_id': 3}, {'id': 48, 'text': 'Erase', 'start': 368, 'end': 373, 'ws': True, 'is_punct': False, 'sent_id': 3}, {'id': 49, 'text': 'mechanism', 'start': 374, 'end': 383, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 50, 'text': '.', 'start': 383, 'end': 384, 'ws': False, 'is_punct': True, 'sent_id': 3}, {'id': 51, 'text': '\r\n\r\n', 'start': 384, 'end': 388, 'ws': False, 'is_punct': False, 'sent_id': 4}, {'id': 52, 'text': '3', 'start': 388, 'end': 389, 'ws': False, 'is_punct': False, 'sent_id': 4}, {'id': 53, 'text': '.', 'start': 389, 'end': 390, 'ws': True, 'is_punct': True, 'sent_id': 4}, {'id': 54, 'text': '2025/08/01', 'start': 391, 'end': 401, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 55, 'text': '\r\n\r\n\r\n', 'start': 401, 'end': 407, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 56, 'text': '[', 'start': 407, 'end': 408, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 57, 'text': 'Below', 'start': 408, 'end': 413, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 58, 'text': 'information', 'start': 
414, 'end': 425, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 59, 'text': 'is', 'start': 426, 'end': 428, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 60, 'text': 'Traditional', 'start': 429, 'end': 440, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 61, 'text': 'Chinese', 'start': 441, 'end': 448, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 62, 'text': 'language', 'start': 449, 'end': 457, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 63, 'text': ']', 'start': 457, 'end': 458, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 64, 'text': '\r\n\r\n', 'start': 458, 'end': 462, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 65, 'text': 'A.', 'start': 462, 'end': 464, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 66, 'text': 'AMI', 'start': 465, 'end': 468, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 67, 'text': 'BIOS', 'start': 469, 'end': 473, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 68, 'text': '正式發行', 'start': 474, 'end': 478, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 69, 'text': '\r\n\r\n', 'start': 478, 'end': 482, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 70, 'text': 'B.', 'start': 482, 'end': 484, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 71, 'text': '此版本修正下列問題', 'start': 485, 'end': 494, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 72, 'text': ':', 'start': 494, 'end': 495, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 73, 'text': '\r\n', 'start': 495, 'end': 497, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 74, 'text': '-', 'start': 497, 'end': 498, 'ws': True, 'is_punct': True, 'sent_id': 5}, {'id': 75, 'text': ' ', 'start': 499, 'end': 500, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 76, 'text': '改善記憶體相容性問題', 'start': 500, 'end': 510, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 77, 'text': '。', 'start': 510, 'end': 511, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 78, 'text': '\r\n', 'start': 
511, 'end': 513, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 79, 'text': '-', 'start': 513, 'end': 514, 'ws': True, 'is_punct': True, 'sent_id': 6}, {'id': 80, 'text': ' ', 'start': 515, 'end': 516, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 81, 'text': '優化安全性清除機制', 'start': 516, 'end': 525, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 82, 'text': '。', 'start': 525, 'end': 526, 'ws': False, 'is_punct': True, 'sent_id': 6}, {'id': 83, 'text': '\r\n\r\n', 'start': 526, 'end': 530, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 84, 'text': 'C.', 'start': 530, 'end': 532, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 85, 'text': '更新日期', 'start': 533, 'end': 537, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 86, 'text': ':', 'start': 537, 'end': 538, 'ws': True, 'is_punct': True, 'sent_id': 7}, {'id': 87, 'text': '公元2025年8月1號', 'start': 539, 'end': 550, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 88, 'text': '\r\n\r\n\r\n', 'start': 550, 'end': 556, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 89, 'text': '[', 'start': 556, 'end': 557, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 90, 'text': 'Below', 'start': 557, 'end': 562, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 91, 'text': 'information', 'start': 563, 'end': 574, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 92, 'text': 'is', 'start': 575, 'end': 577, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 93, 'text': 'Simplified', 'start': 578, 'end': 588, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 94, 'text': 'Chinese', 'start': 589, 'end': 596, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 95, 'text': 'language', 'start': 597, 'end': 605, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 96, 'text': ']', 'start': 605, 'end': 606, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 97, 'text': '\r\n\r\n', 'start': 606, 'end': 610, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 98, 'text': 
'A.', 'start': 610, 'end': 612, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 99, 'text': 'AMI', 'start': 613, 'end': 616, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 100, 'text': 'BIOS', 'start': 617, 'end': 621, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 101, 'text': '正式发行', 'start': 622, 'end': 626, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 102, 'text': '\r\n\r\n', 'start': 626, 'end': 630, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 103, 'text': 'B.', 'start': 630, 'end': 632, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 104, 'text': '此版本修正下列问题', 'start': 633, 'end': 642, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 105, 'text': ':', 'start': 642, 'end': 643, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 106, 'text': '\r\n', 'start': 643, 'end': 645, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 107, 'text': '-', 'start': 645, 'end': 646, 'ws': True, 'is_punct': True, 'sent_id': 7}, {'id': 108, 'text': ' ', 'start': 647, 'end': 648, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 109, 'text': '改善内存兼容性问题', 'start': 648, 'end': 657, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 110, 'text': '。', 'start': 657, 'end': 658, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 111, 'text': '\r\n', 'start': 658, 'end': 660, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 112, 'text': '-', 'start': 660, 'end': 661, 'ws': True, 'is_punct': True, 'sent_id': 8}, {'id': 113, 'text': ' ', 'start': 662, 'end': 663, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 114, 'text': '优化安全性清除机制', 'start': 663, 'end': 672, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 115, 'text': '。', 'start': 672, 'end': 673, 'ws': False, 'is_punct': True, 'sent_id': 8}, {'id': 116, 'text': '\r\n\r\n', 'start': 673, 'end': 677, 'ws': False, 'is_punct': False, 'sent_id': 9}, {'id': 117, 'text': 'C.', 'start': 677, 'end': 679, 'ws': True, 'is_punct': False, 'sent_id': 9}, {'id': 118, 
'text': '更新日期', 'start': 680, 'end': 684, 'ws': False, 'is_punct': False, 'sent_id': 9}, {'id': 119, 'text': ':', 'start': 684, 'end': 685, 'ws': True, 'is_punct': True, 'sent_id': 9}, {'id': 120, 'text': '公元2025年8月1号', 'start': 686, 'end': 697, 'ws': False, 'is_punct': False, 'sent_id': 9}], 'meta': {'source': 'CMA', 'id': '7E59v2Ax.txt', 'char_count': 697, 'token_count': 121, 'sentence_count': 10}}, {'text_original': '\ufeff----------------------------------------------------------------------------\r\nMAG X870 TOMAHAWK WIFI (MS-7E51) V1.A65 BIOS Release\r\n----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n-  Support AGESA PI 1.2.0.3f.\r\n-  Security issue mitigation.\r\n\r\n3. 2025/07/18\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  支援 AGESA PI 1.2.0.3f。\r\n-  修補安全性問題。\r\n\r\nC. 更新日期: 公元2025年7月18號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  支援 AGESA PI 1.2.0.3f。\r\n-  修补安全性问题。\r\n\r\nC. 更新日期: 公元2025年7月18号', 'text_clean': '\ufeff----------------------------------------------------------------------------\r\nMAG X870 TOMAHAWK WIFI (MS-7E51) V1.A65 BIOS Release\r\n----------------------------------------------------------------------------\r\n\r\n1. This is AMI BIOS release\r\n\r\n2. This BIOS fixes the following problem of the previous version:\r\n- Support AGESA PI 1.2.0.3f.\r\n- Security issue mitigation.\r\n\r\n3. 2025/07/18\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n- 支援 AGESA PI 1.2.0.3f。\r\n- 修補安全性問題。\r\n\r\nC. 更新日期: 公元2025年7月18號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n- 支援 AGESA PI 1.2.0.3f。\r\n- 修补安全性问题。\r\n\r\nC. 
更新日期: 公元2025年7月18号', 'language': {'language': 'zh', 'score': -1590.8130090236664}, 'sentences': [{'id': 0, 'start': 0, 'end': 215, 'text': '\ufeff----------------------------------------------------------------------------\r\nMAG X870 TOMAHAWK WIFI (MS-7E51) V1.A65 BIOS Release\r\n----------------------------------------------------------------------------\r\n\r\n1.'}, {'id': 1, 'start': 216, 'end': 246, 'text': 'This is AMI BIOS release\r\n\r\n2.'}, {'id': 2, 'start': 247, 'end': 340, 'text': 'This BIOS fixes the following problem of the previous version:\r\n-  Support AGESA PI 1.2.0.3f.'}, {'id': 3, 'start': 340, 'end': 371, 'text': '\r\n-  Security issue mitigation.'}, {'id': 4, 'start': 371, 'end': 377, 'text': '\r\n\r\n3.'}, {'id': 5, 'start': 378, 'end': 508, 'text': '2025/07/18\r\n\r\n\r\n[Below information is Traditional Chinese language]\r\n\r\nA. AMI BIOS 正式發行\r\n\r\nB. 此版本修正下列問題:\r\n-  支援 AGESA PI 1.2.0.3f。'}, {'id': 6, 'start': 508, 'end': 521, 'text': '\r\n-  修補安全性問題。'}, {'id': 7, 'start': 521, 'end': 665, 'text': '\r\n\r\nC. 更新日期: 公元2025年7月18號\r\n\r\n\r\n[Below information is Simplified Chinese language]\r\n\r\nA. AMI BIOS 正式发行\r\n\r\nB. 此版本修正下列问题:\r\n-  支援 AGESA PI 1.2.0.3f。'}, {'id': 8, 'start': 665, 'end': 678, 'text': '\r\n-  修补安全性问题。'}, {'id': 9, 'start': 678, 'end': 703, 'text': '\r\n\r\nC. 
更新日期: 公元2025年7月18号'}], 'tokens': [{'id': 0, 'text': '\ufeff----------------------------------------------------------------------------', 'start': 0, 'end': 77, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 1, 'text': '\r\n', 'start': 77, 'end': 79, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 2, 'text': 'MAG', 'start': 79, 'end': 82, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 3, 'text': 'X870', 'start': 83, 'end': 87, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 4, 'text': 'TOMAHAWK', 'start': 88, 'end': 96, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 5, 'text': 'WIFI', 'start': 97, 'end': 101, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 6, 'text': '(', 'start': 102, 'end': 103, 'ws': False, 'is_punct': True, 'sent_id': 0}, {'id': 7, 'text': 'MS-7E51', 'start': 103, 'end': 110, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 8, 'text': ')', 'start': 110, 'end': 111, 'ws': True, 'is_punct': True, 'sent_id': 0}, {'id': 9, 'text': 'V1.A65', 'start': 112, 'end': 118, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 10, 'text': 'BIOS', 'start': 119, 'end': 123, 'ws': True, 'is_punct': False, 'sent_id': 0}, {'id': 11, 'text': 'Release', 'start': 124, 'end': 131, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 12, 'text': '\r\n', 'start': 131, 'end': 133, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 13, 'text': '----------------------------------------------------------------------------', 'start': 133, 'end': 209, 'ws': False, 'is_punct': True, 'sent_id': 0}, {'id': 14, 'text': '\r\n\r\n', 'start': 209, 'end': 213, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 15, 'text': '1', 'start': 213, 'end': 214, 'ws': False, 'is_punct': False, 'sent_id': 0}, {'id': 16, 'text': '.', 'start': 214, 'end': 215, 'ws': True, 'is_punct': True, 'sent_id': 0}, {'id': 17, 'text': 'This', 'start': 216, 'end': 220, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 18, 'text': 'is', 'start': 221, 'end': 
223, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 19, 'text': 'AMI', 'start': 224, 'end': 227, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 20, 'text': 'BIOS', 'start': 228, 'end': 232, 'ws': True, 'is_punct': False, 'sent_id': 1}, {'id': 21, 'text': 'release', 'start': 233, 'end': 240, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 22, 'text': '\r\n\r\n', 'start': 240, 'end': 244, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 23, 'text': '2', 'start': 244, 'end': 245, 'ws': False, 'is_punct': False, 'sent_id': 1}, {'id': 24, 'text': '.', 'start': 245, 'end': 246, 'ws': True, 'is_punct': True, 'sent_id': 1}, {'id': 25, 'text': 'This', 'start': 247, 'end': 251, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 26, 'text': 'BIOS', 'start': 252, 'end': 256, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 27, 'text': 'fixes', 'start': 257, 'end': 262, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 28, 'text': 'the', 'start': 263, 'end': 266, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 29, 'text': 'following', 'start': 267, 'end': 276, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 30, 'text': 'problem', 'start': 277, 'end': 284, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 31, 'text': 'of', 'start': 285, 'end': 287, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 32, 'text': 'the', 'start': 288, 'end': 291, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 33, 'text': 'previous', 'start': 292, 'end': 300, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 34, 'text': 'version', 'start': 301, 'end': 308, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 35, 'text': ':', 'start': 308, 'end': 309, 'ws': False, 'is_punct': True, 'sent_id': 2}, {'id': 36, 'text': '\r\n', 'start': 309, 'end': 311, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 37, 'text': '-', 'start': 311, 'end': 312, 'ws': True, 'is_punct': True, 'sent_id': 2}, {'id': 38, 'text': ' ', 'start': 313, 'end': 314, 'ws': 
False, 'is_punct': False, 'sent_id': 2}, {'id': 39, 'text': 'Support', 'start': 314, 'end': 321, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 40, 'text': 'AGESA', 'start': 322, 'end': 327, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 41, 'text': 'PI', 'start': 328, 'end': 330, 'ws': True, 'is_punct': False, 'sent_id': 2}, {'id': 42, 'text': '1.2.0.3f', 'start': 331, 'end': 339, 'ws': False, 'is_punct': False, 'sent_id': 2}, {'id': 43, 'text': '.', 'start': 339, 'end': 340, 'ws': False, 'is_punct': True, 'sent_id': 2}, {'id': 44, 'text': '\r\n', 'start': 340, 'end': 342, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 45, 'text': '-', 'start': 342, 'end': 343, 'ws': True, 'is_punct': True, 'sent_id': 3}, {'id': 46, 'text': ' ', 'start': 344, 'end': 345, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 47, 'text': 'Security', 'start': 345, 'end': 353, 'ws': True, 'is_punct': False, 'sent_id': 3}, {'id': 48, 'text': 'issue', 'start': 354, 'end': 359, 'ws': True, 'is_punct': False, 'sent_id': 3}, {'id': 49, 'text': 'mitigation', 'start': 360, 'end': 370, 'ws': False, 'is_punct': False, 'sent_id': 3}, {'id': 50, 'text': '.', 'start': 370, 'end': 371, 'ws': False, 'is_punct': True, 'sent_id': 3}, {'id': 51, 'text': '\r\n\r\n', 'start': 371, 'end': 375, 'ws': False, 'is_punct': False, 'sent_id': 4}, {'id': 52, 'text': '3', 'start': 375, 'end': 376, 'ws': False, 'is_punct': False, 'sent_id': 4}, {'id': 53, 'text': '.', 'start': 376, 'end': 377, 'ws': True, 'is_punct': True, 'sent_id': 4}, {'id': 54, 'text': '2025/07/18', 'start': 378, 'end': 388, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 55, 'text': '\r\n\r\n\r\n', 'start': 388, 'end': 394, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 56, 'text': '[', 'start': 394, 'end': 395, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 57, 'text': 'Below', 'start': 395, 'end': 400, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 58, 'text': 'information', 'start': 401, 'end': 
412, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 59, 'text': 'is', 'start': 413, 'end': 415, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 60, 'text': 'Traditional', 'start': 416, 'end': 427, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 61, 'text': 'Chinese', 'start': 428, 'end': 435, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 62, 'text': 'language', 'start': 436, 'end': 444, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 63, 'text': ']', 'start': 444, 'end': 445, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 64, 'text': '\r\n\r\n', 'start': 445, 'end': 449, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 65, 'text': 'A.', 'start': 449, 'end': 451, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 66, 'text': 'AMI', 'start': 452, 'end': 455, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 67, 'text': 'BIOS', 'start': 456, 'end': 460, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 68, 'text': '正式發行', 'start': 461, 'end': 465, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 69, 'text': '\r\n\r\n', 'start': 465, 'end': 469, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 70, 'text': 'B.', 'start': 469, 'end': 471, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 71, 'text': '此版本修正下列問題', 'start': 472, 'end': 481, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 72, 'text': ':', 'start': 481, 'end': 482, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 73, 'text': '\r\n', 'start': 482, 'end': 484, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 74, 'text': '-', 'start': 484, 'end': 485, 'ws': True, 'is_punct': True, 'sent_id': 5}, {'id': 75, 'text': ' ', 'start': 486, 'end': 487, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 76, 'text': '支援', 'start': 487, 'end': 489, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 77, 'text': 'AGESA', 'start': 490, 'end': 495, 'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 78, 'text': 'PI', 'start': 496, 'end': 498, 
'ws': True, 'is_punct': False, 'sent_id': 5}, {'id': 79, 'text': '1.2.0.3f', 'start': 499, 'end': 507, 'ws': False, 'is_punct': False, 'sent_id': 5}, {'id': 80, 'text': '。', 'start': 507, 'end': 508, 'ws': False, 'is_punct': True, 'sent_id': 5}, {'id': 81, 'text': '\r\n', 'start': 508, 'end': 510, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 82, 'text': '-', 'start': 510, 'end': 511, 'ws': True, 'is_punct': True, 'sent_id': 6}, {'id': 83, 'text': ' ', 'start': 512, 'end': 513, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 84, 'text': '修補安全性問題', 'start': 513, 'end': 520, 'ws': False, 'is_punct': False, 'sent_id': 6}, {'id': 85, 'text': '。', 'start': 520, 'end': 521, 'ws': False, 'is_punct': True, 'sent_id': 6}, {'id': 86, 'text': '\r\n\r\n', 'start': 521, 'end': 525, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 87, 'text': 'C.', 'start': 525, 'end': 527, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 88, 'text': '更新日期', 'start': 528, 'end': 532, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 89, 'text': ':', 'start': 532, 'end': 533, 'ws': True, 'is_punct': True, 'sent_id': 7}, {'id': 90, 'text': '公元2025年7月18號', 'start': 534, 'end': 546, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 91, 'text': '\r\n\r\n\r\n', 'start': 546, 'end': 552, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 92, 'text': '[', 'start': 552, 'end': 553, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 93, 'text': 'Below', 'start': 553, 'end': 558, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 94, 'text': 'information', 'start': 559, 'end': 570, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 95, 'text': 'is', 'start': 571, 'end': 573, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 96, 'text': 'Simplified', 'start': 574, 'end': 584, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 97, 'text': 'Chinese', 'start': 585, 'end': 592, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 98, 'text': 'language', 'start': 593, 
'end': 601, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 99, 'text': ']', 'start': 601, 'end': 602, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 100, 'text': '\r\n\r\n', 'start': 602, 'end': 606, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 101, 'text': 'A.', 'start': 606, 'end': 608, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 102, 'text': 'AMI', 'start': 609, 'end': 612, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 103, 'text': 'BIOS', 'start': 613, 'end': 617, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 104, 'text': '正式发行', 'start': 618, 'end': 622, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 105, 'text': '\r\n\r\n', 'start': 622, 'end': 626, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 106, 'text': 'B.', 'start': 626, 'end': 628, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 107, 'text': '此版本修正下列问题', 'start': 629, 'end': 638, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 108, 'text': ':', 'start': 638, 'end': 639, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 109, 'text': '\r\n', 'start': 639, 'end': 641, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 110, 'text': '-', 'start': 641, 'end': 642, 'ws': True, 'is_punct': True, 'sent_id': 7}, {'id': 111, 'text': ' ', 'start': 643, 'end': 644, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 112, 'text': '支援', 'start': 644, 'end': 646, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 113, 'text': 'AGESA', 'start': 647, 'end': 652, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 114, 'text': 'PI', 'start': 653, 'end': 655, 'ws': True, 'is_punct': False, 'sent_id': 7}, {'id': 115, 'text': '1.2.0.3f', 'start': 656, 'end': 664, 'ws': False, 'is_punct': False, 'sent_id': 7}, {'id': 116, 'text': '。', 'start': 664, 'end': 665, 'ws': False, 'is_punct': True, 'sent_id': 7}, {'id': 117, 'text': '\r\n', 'start': 665, 'end': 667, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 118, 'text': '-', 'start': 667, 
'end': 668, 'ws': True, 'is_punct': True, 'sent_id': 8}, {'id': 119, 'text': ' ', 'start': 669, 'end': 670, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 120, 'text': '修补安全性问题', 'start': 670, 'end': 677, 'ws': False, 'is_punct': False, 'sent_id': 8}, {'id': 121, 'text': '。', 'start': 677, 'end': 678, 'ws': False, 'is_punct': True, 'sent_id': 8}, {'id': 122, 'text': '\r\n\r\n', 'start': 678, 'end': 682, 'ws': False, 'is_punct': False, 'sent_id': 9}, {'id': 123, 'text': 'C.', 'start': 682, 'end': 684, 'ws': True, 'is_punct': False, 'sent_id': 9}, {'id': 124, 'text': '更新日期', 'start': 685, 'end': 689, 'ws': False, 'is_punct': False, 'sent_id': 9}, {'id': 125, 'text': ':', 'start': 689, 'end': 690, 'ws': True, 'is_punct': True, 'sent_id': 9}, {'id': 126, 'text': '公元2025年7月18号', 'start': 691, 'end': 703, 'ws': False, 'is_punct': False, 'sent_id': 9}], 'meta': {'source': 'CMA', 'id': '7E51v1x (9).txt', 'char_count': 703, 'token_count': 127, 'sentence_count': 10}}]
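
Every sentence and token in the records printed above carries `start` and `end` character offsets. Before passing these records to a NER step, it can help to confirm which text field the offsets index into. In the output above they appear to line up with `text_original` rather than `text_clean` (the bullet lines keep their double space in the sentence text). The following is a minimal sketch, not part of the recipe itself: the helper name `find_span_mismatches` and the miniature `demo` record are hypothetical, and the only assumption is that records use the same keys as the output above.

```python
from typing import Any, Dict, List


def find_span_mismatches(record: Dict[str, Any], field: str = "text_original") -> List[int]:
    """Return the ids of tokens whose [start, end) slice of record[field]
    does not equal the token's own text."""
    text = record[field]
    return [
        tok["id"]
        for tok in record["tokens"]
        if text[tok["start"]:tok["end"]] != tok["text"]
    ]


# Hypothetical miniature record (not real pipeline output), shaped like the
# records above: the original text has two spaces after the bullet, the
# cleaned text has one.
demo = {
    "text_original": "-  Support AGESA.",
    "text_clean": "- Support AGESA.",
    "tokens": [
        {"id": 0, "text": "-", "start": 0, "end": 1},
        {"id": 1, "text": " ", "start": 2, "end": 3},
        {"id": 2, "text": "Support", "start": 3, "end": 10},
        {"id": 3, "text": "AGESA", "start": 11, "end": 16},
        {"id": 4, "text": ".", "start": 16, "end": 17},
    ],
}

print(find_span_mismatches(demo, "text_original"))  # offsets checked against the original text
print(find_span_mismatches(demo, "text_clean"))     # offsets checked against the cleaned text
```

For this demo record the first call reports no mismatches, while the second flags every token after the collapsed space. If a real record behaves the same way, downstream tools should slice `text_original` when resolving spans.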

1.7. Step 5: Put it all together#

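Call the functions defined in the previous steps. The helper below takes the records fetched earlier, whether they came from the Cleveland Museum of Art API (by keyword or record ID) or from your own uploaded text, runs them through the preprocessing pipeline, and serialises the result as JSON.

Each input option can be run independently. The output is saved to a file in the Colab notebook's storage, named after a hash of its content so that repeated runs do not overwrite earlier results.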

# ---------------------------
# 4) MAIN (robust demos)
# ---------------------------
import hashlib
import json
from typing import Dict, List

def save_records(records: List[Dict[str, str]], source: str = "CMA") -> None:
  if not records:
    raise ValueError("No records found (likely 404 if using the CMA API).")

  # Pair each text with its metadata, then run the preprocessing pipeline
  pairs = [(r["text"], {"source": source, "id": r["id"]}) for r in records]
  processed_texts = preprocess_texts(pairs)
  json_string = json.dumps(processed_texts, ensure_ascii=False, indent=2)
  # Name the file after a hash of its content to prevent overwriting previous work
  digest = hashlib.sha1(json_string.encode("utf-8")).hexdigest()
  print(f"CMA by id: {json_string}")

  output_file = f"/content/output_{digest}.json"
  with open(output_file, "w", encoding="utf-8") as f:
    # Write the string that was hashed, so file name and content stay in sync
    f.write(json_string)
  print(f"Saved preprocessed files to {output_file}")

  # IN_COLAB and files are expected to be defined by the notebook setup cells
  if IN_COLAB:
    files.download(output_file)

# Preprocess the fetched records and save them to a JSON file
save_records(records)

CMA by id: [
  {
    "text_original": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "text_clean": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "language": {
      "language": "en",
      "score": -868.9007034301758
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 130,
        "text": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."
      },
      {
        "id": 1,
        "start": 131,
        "end": 324,
        "text": "Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "This",
        "start": 0,
        "end": 4,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "painting",
        "start": 5,
        "end": 13,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "depicts",
        "start": 14,
        "end": 21,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "Monet",
        "start": 22,
        "end": 27,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "'s",
        "start": 27,
        "end": 29,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": "first",
        "start": 30,
        "end": 35,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 6,
        "text": "wife",
        "start": 36,
        "end": 40,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 7,
        "text": ",",
        "start": 40,
        "end": 41,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 8,
        "text": "Camille",
        "start": 42,
        "end": 49,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 9,
        "text": ",",
        "start": 49,
        "end": 50,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 10,
        "text": "outside",
        "start": 51,
        "end": 58,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 11,
        "text": "on",
        "start": 59,
        "end": 61,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 12,
        "text": "a",
        "start": 62,
        "end": 63,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 13,
        "text": "snowy",
        "start": 64,
        "end": 69,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 14,
        "text": "day",
        "start": 70,
        "end": 73,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 15,
        "text": "passing",
        "start": 74,
        "end": 81,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 16,
        "text": "by",
        "start": 82,
        "end": 84,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 17,
        "text": "the",
        "start": 85,
        "end": 88,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 18,
        "text": "French",
        "start": 89,
        "end": 95,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 19,
        "text": "doors",
        "start": 96,
        "end": 101,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 20,
        "text": "of",
        "start": 102,
        "end": 104,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 21,
        "text": "their",
        "start": 105,
        "end": 110,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 22,
        "text": "home",
        "start": 111,
        "end": 115,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 23,
        "text": "at",
        "start": 116,
        "end": 118,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 24,
        "text": "Argenteuil",
        "start": 119,
        "end": 129,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 25,
        "text": ".",
        "start": 129,
        "end": 130,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 26,
        "text": "Her",
        "start": 131,
        "end": 134,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 27,
        "text": "face",
        "start": 135,
        "end": 139,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 28,
        "text": "is",
        "start": 140,
        "end": 142,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 29,
        "text": "rendered",
        "start": 143,
        "end": 151,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 30,
        "text": "in",
        "start": 152,
        "end": 154,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 31,
        "text": "a",
        "start": 155,
        "end": 156,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 32,
        "text": "radically",
        "start": 157,
        "end": 166,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 33,
        "text": "bold",
        "start": 167,
        "end": 171,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 34,
        "text": "Impressionist",
        "start": 172,
        "end": 185,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 35,
        "text": "technique",
        "start": 186,
        "end": 195,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 36,
        "text": "of",
        "start": 196,
        "end": 198,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 37,
        "text": "mere",
        "start": 199,
        "end": 203,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 38,
        "text": "daubs",
        "start": 204,
        "end": 209,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 39,
        "text": "of",
        "start": 210,
        "end": 212,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 40,
        "text": "paint",
        "start": 213,
        "end": 218,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 41,
        "text": "quickly",
        "start": 219,
        "end": 226,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 42,
        "text": "applied",
        "start": 227,
        "end": 234,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 43,
        "text": ",",
        "start": 234,
        "end": 235,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 44,
        "text": "just",
        "start": 236,
        "end": 240,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 45,
        "text": "as",
        "start": 241,
        "end": 243,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 46,
        "text": "the",
        "start": 244,
        "end": 247,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 47,
        "text": "snow",
        "start": 248,
        "end": 252,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 48,
        "text": "and",
        "start": 253,
        "end": 256,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 49,
        "text": "trees",
        "start": 257,
        "end": 262,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 50,
        "text": "are",
        "start": 263,
        "end": 266,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 51,
        "text": "defined",
        "start": 267,
        "end": 274,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 52,
        "text": "by",
        "start": 275,
        "end": 277,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 53,
        "text": "broad",
        "start": 278,
        "end": 283,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 54,
        "text": ",",
        "start": 283,
        "end": 284,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 55,
        "text": "broken",
        "start": 285,
        "end": 291,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 56,
        "text": "strokes",
        "start": 292,
        "end": 299,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 57,
        "text": "of",
        "start": 300,
        "end": 302,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 58,
        "text": "pure",
        "start": 303,
        "end": 307,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 59,
        "text": "white",
        "start": 308,
        "end": 313,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 60,
        "text": "and",
        "start": 314,
        "end": 317,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 61,
        "text": "green",
        "start": 318,
        "end": 323,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 62,
        "text": ".",
        "start": 323,
        "end": 324,
        "ws": false,
        "is_punct": true,
        "sent_id": 1
      }
    ],
    "meta": {
      "source": "CMA",
      "id": 135382,
      "char_count": 324,
      "token_count": 63,
      "sentence_count": 2
    }
  }
]
Saved preprocessed files to /content/output_fa8d9c82c0b1dd7e071c3fff472cf4c2a42ec53f.json
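
Before a saved file is passed to a NER step, it can be worth checking that the character offsets still line up with text_clean. The sketch below is one possible check, not part of the recipe: the helper names check_record and rebuild_text are hypothetical, and it assumes that "ws" means "a space follows this token". It runs on a short hand-copied excerpt of the record above (the first eight tokens, with "ws" set to false on the last one because the excerpt stops there) rather than on the full file.

```python
# Hypothetical helpers (not part of the recipe) for sanity-checking one record.

def check_record(record):
    """Return a list of problems found in one preprocessed record (empty if none)."""
    problems = []
    text = record["text_clean"]
    # Each token span should slice back to exactly the token's own text.
    for tok in record["tokens"]:
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append(f"token {tok['id']} span does not match its text")
    # Each sentence span should slice back to exactly the sentence text.
    for sent in record["sentences"]:
        if text[sent["start"]:sent["end"]] != sent["text"]:
            problems.append(f"sentence {sent['id']} span does not match its text")
    return problems


def rebuild_text(tokens):
    """Rebuild the text from tokens, assuming `ws` means a space follows the token."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


def _tok(i, text, start, end, ws, is_punct=False):
    # Compact constructor for the excerpt below; same keys as the recipe's output.
    return {"id": i, "text": text, "start": start, "end": end,
            "ws": ws, "is_punct": is_punct, "sent_id": 0}


# Truncated excerpt of the record above; offsets are copied from the output.
excerpt = "This painting depicts Monet's first wife,"
sample = {
    "text_clean": excerpt,
    "sentences": [{"id": 0, "start": 0, "end": 41, "text": excerpt}],
    "tokens": [
        _tok(0, "This", 0, 4, True),
        _tok(1, "painting", 5, 13, True),
        _tok(2, "depicts", 14, 21, True),
        _tok(3, "Monet", 22, 27, False),
        _tok(4, "'s", 27, 29, True),
        _tok(5, "first", 30, 35, True),
        _tok(6, "wife", 36, 40, False),
        _tok(7, ",", 40, 41, False, is_punct=True),  # excerpt ends here, so no space
    ],
}

print(check_record(sample))  # → []
print(rebuild_text(sample["tokens"]) == sample["text_clean"])  # → True
```

To run the same check on a saved file, load it with json.load and call check_record on each record in the list; a non-empty result points at spans that no longer match the cleaned text.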

# OPTION B: CMA search (take a couple of hits, skip empty descriptions)
if __name__ == "__main__":
    try:
        records = fetch_cma("monet", mode="search", limit=5)
        # keep only records that come with a non-empty description
        pairs = [(r["text"], {"source": "CMA", "id": r["id"]}) for r in records if r["text"]]
        pairs = pairs[:2]  # first two with non-empty text
        if pairs:
            out_api_search = preprocess_texts(pairs)
            print("\nCMA search:")
            print(json.dumps(out_api_search, ensure_ascii=False, indent=2))
            # save the results in a local file
            output_file = "/content/output_api_search.json"
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(out_api_search, f)
            print(f"Saved preprocessed files to {output_file}")
        else:
            print("\nCMA search: no descriptions returned for this query.")
    except Exception as e:
        print("API search failed:", e)

CMA search:
[
  {
    "text_original": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "text_clean": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil. Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green.",
    "language": {
      "language": "en",
      "score": -868.9007034301758
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 130,
        "text": "This painting depicts Monet's first wife, Camille, outside on a snowy day passing by the French doors of their home at Argenteuil."
      },
      {
        "id": 1,
        "start": 131,
        "end": 324,
        "text": "Her face is rendered in a radically bold Impressionist technique of mere daubs of paint quickly applied, just as the snow and trees are defined by broad, broken strokes of pure white and green."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "This",
        "start": 0,
        "end": 4,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "painting",
        "start": 5,
        "end": 13,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "depicts",
        "start": 14,
        "end": 21,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "Monet",
        "start": 22,
        "end": 27,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "'s",
        "start": 27,
        "end": 29,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": "first",
        "start": 30,
        "end": 35,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 6,
        "text": "wife",
        "start": 36,
        "end": 40,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 7,
        "text": ",",
        "start": 40,
        "end": 41,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 8,
        "text": "Camille",
        "start": 42,
        "end": 49,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 9,
        "text": ",",
        "start": 49,
        "end": 50,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 10,
        "text": "outside",
        "start": 51,
        "end": 58,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 11,
        "text": "on",
        "start": 59,
        "end": 61,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 12,
        "text": "a",
        "start": 62,
        "end": 63,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 13,
        "text": "snowy",
        "start": 64,
        "end": 69,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 14,
        "text": "day",
        "start": 70,
        "end": 73,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 15,
        "text": "passing",
        "start": 74,
        "end": 81,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 16,
        "text": "by",
        "start": 82,
        "end": 84,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 17,
        "text": "the",
        "start": 85,
        "end": 88,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 18,
        "text": "French",
        "start": 89,
        "end": 95,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 19,
        "text": "doors",
        "start": 96,
        "end": 101,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 20,
        "text": "of",
        "start": 102,
        "end": 104,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 21,
        "text": "their",
        "start": 105,
        "end": 110,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 22,
        "text": "home",
        "start": 111,
        "end": 115,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 23,
        "text": "at",
        "start": 116,
        "end": 118,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 24,
        "text": "Argenteuil",
        "start": 119,
        "end": 129,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 25,
        "text": ".",
        "start": 129,
        "end": 130,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 26,
        "text": "Her",
        "start": 131,
        "end": 134,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 27,
        "text": "face",
        "start": 135,
        "end": 139,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 28,
        "text": "is",
        "start": 140,
        "end": 142,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 29,
        "text": "rendered",
        "start": 143,
        "end": 151,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 30,
        "text": "in",
        "start": 152,
        "end": 154,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 31,
        "text": "a",
        "start": 155,
        "end": 156,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 32,
        "text": "radically",
        "start": 157,
        "end": 166,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 33,
        "text": "bold",
        "start": 167,
        "end": 171,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 34,
        "text": "Impressionist",
        "start": 172,
        "end": 185,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 35,
        "text": "technique",
        "start": 186,
        "end": 195,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 36,
        "text": "of",
        "start": 196,
        "end": 198,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 37,
        "text": "mere",
        "start": 199,
        "end": 203,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 38,
        "text": "daubs",
        "start": 204,
        "end": 209,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 39,
        "text": "of",
        "start": 210,
        "end": 212,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 40,
        "text": "paint",
        "start": 213,
        "end": 218,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 41,
        "text": "quickly",
        "start": 219,
        "end": 226,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 42,
        "text": "applied",
        "start": 227,
        "end": 234,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 43,
        "text": ",",
        "start": 234,
        "end": 235,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 44,
        "text": "just",
        "start": 236,
        "end": 240,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 45,
        "text": "as",
        "start": 241,
        "end": 243,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 46,
        "text": "the",
        "start": 244,
        "end": 247,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 47,
        "text": "snow",
        "start": 248,
        "end": 252,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 48,
        "text": "and",
        "start": 253,
        "end": 256,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 49,
        "text": "trees",
        "start": 257,
        "end": 262,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 50,
        "text": "are",
        "start": 263,
        "end": 266,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 51,
        "text": "defined",
        "start": 267,
        "end": 274,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 52,
        "text": "by",
        "start": 275,
        "end": 277,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 53,
        "text": "broad",
        "start": 278,
        "end": 283,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 54,
        "text": ",",
        "start": 283,
        "end": 284,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 55,
        "text": "broken",
        "start": 285,
        "end": 291,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 56,
        "text": "strokes",
        "start": 292,
        "end": 299,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 57,
        "text": "of",
        "start": 300,
        "end": 302,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 58,
        "text": "pure",
        "start": 303,
        "end": 307,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 59,
        "text": "white",
        "start": 308,
        "end": 313,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 60,
        "text": "and",
        "start": 314,
        "end": 317,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 61,
        "text": "green",
        "start": 318,
        "end": 323,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 62,
        "text": ".",
        "start": 323,
        "end": 324,
        "ws": false,
        "is_punct": true,
        "sent_id": 1
      }
    ],
    "meta": {
      "source": "CMA",
      "id": 135382,
      "char_count": 324,
      "token_count": 63,
      "sentence_count": 2
    }
  },
  {
    "text_original": "A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France. The resultant canvases are notable for their varied motifs, formats, and sizes. Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface. By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format that fills the viewer’s field of vision.",
    "text_clean": "A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France. The resultant canvases are notable for their varied motifs, formats, and sizes. Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface. By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format that fills the viewer’s field of vision.",
    "language": {
      "language": "en",
      "score": -1428.5701570510864
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 201,
        "text": "A skilled horticulturalist as well as an artist, Claude Monet spent the last 30 years of his life painting the private garden he designed and helped cultivate at his home in Giverny in northern France."
      },
      {
        "id": 1,
        "start": 202,
        "end": 281,
        "text": "The resultant canvases are notable for their varied motifs, formats, and sizes."
      },
      {
        "id": 2,
        "start": 282,
        "end": 448,
        "text": "Monumental in scale, this rendering of his water lily pond focuses on the momentary effects of sunlight as it both penetrates and reflects off its shimmering surface."
      },
      {
        "id": 3,
        "start": 449,
        "end": 661,
        "text": "By zeroing in on the water and omitting its horizon and surrounding banks, Monet infers a limitless expanse—a perception amplified by the painting’s vast horizontal format that fills the viewer’s field of vision."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "A",
        "start": 0,
        "end": 1,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "skilled",
        "start": 2,
        "end": 9,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "horticulturalist",
        "start": 10,
        "end": 26,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "as",
        "start": 27,
        "end": 29,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "well",
        "start": 30,
        "end": 34,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": "as",
        "start": 35,
        "end": 37,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 6,
        "text": "an",
        "start": 38,
        "end": 40,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 7,
        "text": "artist",
        "start": 41,
        "end": 47,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 8,
        "text": ",",
        "start": 47,
        "end": 48,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 9,
        "text": "Claude",
        "start": 49,
        "end": 55,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 10,
        "text": "Monet",
        "start": 56,
        "end": 61,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 11,
        "text": "spent",
        "start": 62,
        "end": 67,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 12,
        "text": "the",
        "start": 68,
        "end": 71,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 13,
        "text": "last",
        "start": 72,
        "end": 76,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 14,
        "text": "30",
        "start": 77,
        "end": 79,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 15,
        "text": "years",
        "start": 80,
        "end": 85,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 16,
        "text": "of",
        "start": 86,
        "end": 88,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 17,
        "text": "his",
        "start": 89,
        "end": 92,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 18,
        "text": "life",
        "start": 93,
        "end": 97,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 19,
        "text": "painting",
        "start": 98,
        "end": 106,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 20,
        "text": "the",
        "start": 107,
        "end": 110,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 21,
        "text": "private",
        "start": 111,
        "end": 118,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 22,
        "text": "garden",
        "start": 119,
        "end": 125,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 23,
        "text": "he",
        "start": 126,
        "end": 128,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 24,
        "text": "designed",
        "start": 129,
        "end": 137,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 25,
        "text": "and",
        "start": 138,
        "end": 141,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 26,
        "text": "helped",
        "start": 142,
        "end": 148,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 27,
        "text": "cultivate",
        "start": 149,
        "end": 158,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 28,
        "text": "at",
        "start": 159,
        "end": 161,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 29,
        "text": "his",
        "start": 162,
        "end": 165,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 30,
        "text": "home",
        "start": 166,
        "end": 170,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 31,
        "text": "in",
        "start": 171,
        "end": 173,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 32,
        "text": "Giverny",
        "start": 174,
        "end": 181,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 33,
        "text": "in",
        "start": 182,
        "end": 184,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 34,
        "text": "northern",
        "start": 185,
        "end": 193,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 35,
        "text": "France",
        "start": 194,
        "end": 200,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 36,
        "text": ".",
        "start": 200,
        "end": 201,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 37,
        "text": "The",
        "start": 202,
        "end": 205,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 38,
        "text": "resultant",
        "start": 206,
        "end": 215,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 39,
        "text": "canvases",
        "start": 216,
        "end": 224,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 40,
        "text": "are",
        "start": 225,
        "end": 228,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 41,
        "text": "notable",
        "start": 229,
        "end": 236,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 42,
        "text": "for",
        "start": 237,
        "end": 240,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 43,
        "text": "their",
        "start": 241,
        "end": 246,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 44,
        "text": "varied",
        "start": 247,
        "end": 253,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 45,
        "text": "motifs",
        "start": 254,
        "end": 260,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 46,
        "text": ",",
        "start": 260,
        "end": 261,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 47,
        "text": "formats",
        "start": 262,
        "end": 269,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 48,
        "text": ",",
        "start": 269,
        "end": 270,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 49,
        "text": "and",
        "start": 271,
        "end": 274,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 50,
        "text": "sizes",
        "start": 275,
        "end": 280,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 51,
        "text": ".",
        "start": 280,
        "end": 281,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 52,
        "text": "Monumental",
        "start": 282,
        "end": 292,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 53,
        "text": "in",
        "start": 293,
        "end": 295,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 54,
        "text": "scale",
        "start": 296,
        "end": 301,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 55,
        "text": ",",
        "start": 301,
        "end": 302,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 56,
        "text": "this",
        "start": 303,
        "end": 307,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 57,
        "text": "rendering",
        "start": 308,
        "end": 317,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 58,
        "text": "of",
        "start": 318,
        "end": 320,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 59,
        "text": "his",
        "start": 321,
        "end": 324,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 60,
        "text": "water",
        "start": 325,
        "end": 330,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 61,
        "text": "lily",
        "start": 331,
        "end": 335,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 62,
        "text": "pond",
        "start": 336,
        "end": 340,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 63,
        "text": "focuses",
        "start": 341,
        "end": 348,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 64,
        "text": "on",
        "start": 349,
        "end": 351,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 65,
        "text": "the",
        "start": 352,
        "end": 355,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 66,
        "text": "momentary",
        "start": 356,
        "end": 365,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 67,
        "text": "effects",
        "start": 366,
        "end": 373,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 68,
        "text": "of",
        "start": 374,
        "end": 376,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 69,
        "text": "sunlight",
        "start": 377,
        "end": 385,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 70,
        "text": "as",
        "start": 386,
        "end": 388,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 71,
        "text": "it",
        "start": 389,
        "end": 391,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 72,
        "text": "both",
        "start": 392,
        "end": 396,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 73,
        "text": "penetrates",
        "start": 397,
        "end": 407,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 74,
        "text": "and",
        "start": 408,
        "end": 411,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 75,
        "text": "reflects",
        "start": 412,
        "end": 420,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 76,
        "text": "off",
        "start": 421,
        "end": 424,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 77,
        "text": "its",
        "start": 425,
        "end": 428,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 78,
        "text": "shimmering",
        "start": 429,
        "end": 439,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 79,
        "text": "surface",
        "start": 440,
        "end": 447,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 80,
        "text": ".",
        "start": 447,
        "end": 448,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 81,
        "text": "By",
        "start": 449,
        "end": 451,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 82,
        "text": "zeroing",
        "start": 452,
        "end": 459,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 83,
        "text": "in",
        "start": 460,
        "end": 462,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 84,
        "text": "on",
        "start": 463,
        "end": 465,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 85,
        "text": "the",
        "start": 466,
        "end": 469,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 86,
        "text": "water",
        "start": 470,
        "end": 475,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 87,
        "text": "and",
        "start": 476,
        "end": 479,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 88,
        "text": "omitting",
        "start": 480,
        "end": 488,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 89,
        "text": "its",
        "start": 489,
        "end": 492,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 90,
        "text": "horizon",
        "start": 493,
        "end": 500,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 91,
        "text": "and",
        "start": 501,
        "end": 504,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 92,
        "text": "surrounding",
        "start": 505,
        "end": 516,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 93,
        "text": "banks",
        "start": 517,
        "end": 522,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 94,
        "text": ",",
        "start": 522,
        "end": 523,
        "ws": true,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 95,
        "text": "Monet",
        "start": 524,
        "end": 529,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 96,
        "text": "infers",
        "start": 530,
        "end": 536,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 97,
        "text": "a",
        "start": 537,
        "end": 538,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 98,
        "text": "limitless",
        "start": 539,
        "end": 548,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 99,
        "text": "expanse",
        "start": 549,
        "end": 556,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 100,
        "text": "—",
        "start": 556,
        "end": 557,
        "ws": false,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 101,
        "text": "a",
        "start": 557,
        "end": 558,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 102,
        "text": "perception",
        "start": 559,
        "end": 569,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 103,
        "text": "amplified",
        "start": 570,
        "end": 579,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 104,
        "text": "by",
        "start": 580,
        "end": 582,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 105,
        "text": "the",
        "start": 583,
        "end": 586,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 106,
        "text": "painting",
        "start": 587,
        "end": 595,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 107,
        "text": "’s",
        "start": 595,
        "end": 597,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 108,
        "text": "vast",
        "start": 598,
        "end": 602,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 109,
        "text": "horizontal",
        "start": 603,
        "end": 613,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 110,
        "text": "format",
        "start": 614,
        "end": 620,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 111,
        "text": "that",
        "start": 621,
        "end": 625,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 112,
        "text": "fills",
        "start": 626,
        "end": 631,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 113,
        "text": "the",
        "start": 632,
        "end": 635,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 114,
        "text": "viewer",
        "start": 636,
        "end": 642,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 115,
        "text": "’s",
        "start": 642,
        "end": 644,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 116,
        "text": "field",
        "start": 645,
        "end": 650,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 117,
        "text": "of",
        "start": 651,
        "end": 653,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 118,
        "text": "vision",
        "start": 654,
        "end": 660,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 119,
        "text": ".",
        "start": 660,
        "end": 661,
        "ws": false,
        "is_punct": true,
        "sent_id": 3
      }
    ],
    "meta": {
      "source": "CMA",
      "id": 136510,
      "char_count": 661,
      "token_count": 120,
      "sentence_count": 4
    }
  }
]
Saved preprocessed files to /content/output_api_search.json
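
Each token in the output above carries a sent_id, so tokens can be regrouped into per-sentence lists, a shape that many NER tools accept. The following is a minimal sketch, assuming records with the same fields as the output above. The `sample_record` literal (three tokens copied from that output) and the helper name `tokens_by_sentence` are illustrative only and are not part of the recipe.

```python
from collections import defaultdict

# Hypothetical mini-record: three tokens copied from the output above.
sample_record = {
    "tokens": [
        {"id": 79, "text": "surface", "start": 440, "end": 447,
         "ws": False, "is_punct": False, "sent_id": 2},
        {"id": 80, "text": ".", "start": 447, "end": 448,
         "ws": True, "is_punct": True, "sent_id": 2},
        {"id": 81, "text": "By", "start": 449, "end": 451,
         "ws": True, "is_punct": False, "sent_id": 3},
    ]
}


def tokens_by_sentence(record):
    """Group token texts by their sent_id, preserving token order."""
    grouped = defaultdict(list)
    for token in record["tokens"]:
        grouped[token["sent_id"]].append(token["text"])
    return dict(grouped)


sentences = tokens_by_sentence(sample_record)
print(sentences)
```

In a real run you would pass each record loaded from the saved JSON file instead of the hand-written sample.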

# OPTION C: Local examples
if __name__ == "__main__":
    # Each example is a (text, metadata) pair.
    examples = [
        ("Rome is the capital of Italy.", {"source": "local"}),
        ("Mark Rutte bezocht gisteren Groningen.", {"source": "local"}),
    ]
    out_local = preprocess_texts(examples)
    print("\nLocal examples:")
    print(json.dumps(out_local, ensure_ascii=False, indent=2))

# Save the results to a local file
output_file = "/content/output_local.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(out_local, f)
print(f"Results saved in {output_file}")

Local examples:
[
  {
    "text_original": "Rome is the capital of Italy.",
    "text_clean": "Rome is the capital of Italy.",
    "language": {
      "language": "en",
      "score": -73.82194948196411
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 29,
        "text": "Rome is the capital of Italy."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "Rome",
        "start": 0,
        "end": 4,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "is",
        "start": 5,
        "end": 7,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "the",
        "start": 8,
        "end": 11,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "capital",
        "start": 12,
        "end": 19,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "of",
        "start": 20,
        "end": 22,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": "Italy",
        "start": 23,
        "end": 28,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 6,
        "text": ".",
        "start": 28,
        "end": 29,
        "ws": false,
        "is_punct": true,
        "sent_id": 0
      }
    ],
    "meta": {
      "source": "local",
      "char_count": 29,
      "token_count": 7,
      "sentence_count": 1
    }
  },
  {
    "text_original": "Mark Rutte bezocht gisteren Groningen.",
    "text_clean": "Mark Rutte bezocht gisteren Groningen.",
    "language": {
      "language": "de",
      "score": -57.47963762283325
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 38,
        "text": "Mark Rutte bezocht gisteren Groningen."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "Mark",
        "start": 0,
        "end": 4,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": "Rutte",
        "start": 5,
        "end": 10,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "bezocht",
        "start": 11,
        "end": 18,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "gisteren",
        "start": 19,
        "end": 27,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "Groningen",
        "start": 28,
        "end": 37,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": ".",
        "start": 37,
        "end": 38,
        "ws": false,
        "is_punct": true,
        "sent_id": 0
      }
    ],
    "meta": {
      "source": "local",
      "char_count": 38,
      "token_count": 6,
      "sentence_count": 1
    }
  }
]
Results saved in /content/output_local.json
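
The start and end values are character offsets into text_clean, and ws records whether a token is followed by whitespace. The sketch below checks both properties on the first record printed above. The `record` literal is copied from that output (trimmed to the fields used), and the helper names `check_offsets` and `rebuild_text` are hypothetical, not part of the recipe. It assumes exactly one space wherever ws is true, which should hold after the cleaning step collapses repeated spaces.

```python
# Trimmed copy of the first record from the output above.
record = {
    "text_clean": "Rome is the capital of Italy.",
    "tokens": [
        {"text": "Rome", "start": 0, "end": 4, "ws": True},
        {"text": "is", "start": 5, "end": 7, "ws": True},
        {"text": "the", "start": 8, "end": 11, "ws": True},
        {"text": "capital", "start": 12, "end": 19, "ws": True},
        {"text": "of", "start": 20, "end": 22, "ws": True},
        {"text": "Italy", "start": 23, "end": 28, "ws": False},
        {"text": ".", "start": 28, "end": 29, "ws": False},
    ],
}


def check_offsets(rec):
    """Return the tokens whose [start:end] slice of text_clean differs from their text."""
    text = rec["text_clean"]
    return [t for t in rec["tokens"] if text[t["start"]:t["end"]] != t["text"]]


def rebuild_text(rec):
    """Join token texts, adding one space after every token with ws=True."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in rec["tokens"])


mismatches = check_offsets(record)
rebuilt = rebuild_text(record)
print(len(mismatches), rebuilt == record["text_clean"])
```

A check like this is useful before handing the file to a NER step, because entity spans are usually reported as character offsets and only line up if the token offsets do.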

# OPTION D: Local examples from a .txt file (upload the file to the /content folder of the Colab notebook first)
if __name__ == "__main__":
    # Open a local .txt file as input. Remember to change the path and filename to match your file.
    # Reading explicitly as UTF-8 keeps accented characters intact regardless of the platform default.
    with open("/content/input.txt", "r", encoding="utf-8") as f:
        text = f.read()

    # Then process the text of the input file
    examples = [(text, {"source": "local"})]
    out_local = preprocess_texts(examples)
    print("\nLocal examples:")
    print(json.dumps(out_local, ensure_ascii=False, indent=2))

# Save the results to a local file
output_file = "/content/output_file.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(out_local, f)
print(f"Results saved in {output_file}")

Local examples:
[
  {
    "text_original": "STRABO, the author of this work, was born at Amasia, or Amasijas, a town situated in the gorge of the mountains through which passes the river Iris, now the Ieschil Irmak, in Pontus, which he has described in the 12th book.[*] He lived during the reign of Augustus, and the earlier part of the reign of Tiberius; for in the 13th book[*] he relates how Sardes and other cities, which had suffered severely from earthquakes, had been repaired by the provident care of Tiberius the present Emperor; but the exact date of his birth, as also of his death, are subjects of conjecture only. Coraÿ and Groskurd conclude, though by a somewhat different argument, that he was born in the year B. C. 66, and the latter that he died A. D. 24. The date of his birth as argued by Groskurd, proceeds on the assumption that Strabo was in his thirty-eighth year when he went from Gyaros to Corinth, at which latter place Octavianus Caesar was then staying on his return to Rome after the battle of Actium, B. C. 31. We may, perhaps, be satisfied with following Clinton, and place it not later than B. C. 54.",
    "text_clean": "STRABO, the author of this work, was born at Amasia, or Amasijas, a town situated in the gorge of the mountains through which passes the river Iris, now the Ieschil Irmak, in Pontus, which he has described in the 12th book.[*] He lived during the reign of Augustus, and the earlier part of the reign of Tiberius; for in the 13th book[*] he relates how Sardes and other cities, which had suffered severely from earthquakes, had been repaired by the provident care of Tiberius the present Emperor; but the exact date of his birth, as also of his death, are subjects of conjecture only. Coraÿ and Groskurd conclude, though by a somewhat different argument, that he was born in the year B. C. 66, and the latter that he died A. D. 24. The date of his birth as argued by Groskurd, proceeds on the assumption that Strabo was in his thirty-eighth year when he went from Gyaros to Corinth, at which latter place Octavianus Caesar was then staying on his return to Rome after the battle of Actium, B. C. 31. We may, perhaps, be satisfied with following Clinton, and place it not later than B. C. 54.",
    "language": {
      "language": "en",
      "score": -3211.323507785797
    },
    "sentences": [
      {
        "id": 0,
        "start": 0,
        "end": 226,
        "text": "STRABO, the author of this work, was born at Amasia, or Amasijas, a town situated in the gorge of the mountains through which passes the river Iris, now the Ieschil Irmak, in Pontus, which he has described in the 12th book.[*]"
      },
      {
        "id": 1,
        "start": 227,
        "end": 583,
        "text": "He lived during the reign of Augustus, and the earlier part of the reign of Tiberius; for in the 13th book[*] he relates how Sardes and other cities, which had suffered severely from earthquakes, had been repaired by the provident care of Tiberius the present Emperor; but the exact date of his birth, as also of his death, are subjects of conjecture only."
      },
      {
        "id": 2,
        "start": 584,
        "end": 730,
        "text": "Coraÿ and Groskurd conclude, though by a somewhat different argument, that he was born in the year B. C. 66, and the latter that he died A. D. 24."
      },
      {
        "id": 3,
        "start": 731,
        "end": 998,
        "text": "The date of his birth as argued by Groskurd, proceeds on the assumption that Strabo was in his thirty-eighth year when he went from Gyaros to Corinth, at which latter place Octavianus Caesar was then staying on his return to Rome after the battle of Actium, B. C. 31."
      },
      {
        "id": 4,
        "start": 999,
        "end": 1090,
        "text": "We may, perhaps, be satisfied with following Clinton, and place it not later than B. C. 54."
      }
    ],
    "tokens": [
      {
        "id": 0,
        "text": "STRABO",
        "start": 0,
        "end": 6,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 1,
        "text": ",",
        "start": 6,
        "end": 7,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 2,
        "text": "the",
        "start": 8,
        "end": 11,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 3,
        "text": "author",
        "start": 12,
        "end": 18,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 4,
        "text": "of",
        "start": 19,
        "end": 21,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 5,
        "text": "this",
        "start": 22,
        "end": 26,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 6,
        "text": "work",
        "start": 27,
        "end": 31,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 7,
        "text": ",",
        "start": 31,
        "end": 32,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 8,
        "text": "was",
        "start": 33,
        "end": 36,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 9,
        "text": "born",
        "start": 37,
        "end": 41,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 10,
        "text": "at",
        "start": 42,
        "end": 44,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 11,
        "text": "Amasia",
        "start": 45,
        "end": 51,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 12,
        "text": ",",
        "start": 51,
        "end": 52,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 13,
        "text": "or",
        "start": 53,
        "end": 55,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 14,
        "text": "Amasijas",
        "start": 56,
        "end": 64,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 15,
        "text": ",",
        "start": 64,
        "end": 65,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 16,
        "text": "a",
        "start": 66,
        "end": 67,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 17,
        "text": "town",
        "start": 68,
        "end": 72,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 18,
        "text": "situated",
        "start": 73,
        "end": 81,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 19,
        "text": "in",
        "start": 82,
        "end": 84,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 20,
        "text": "the",
        "start": 85,
        "end": 88,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 21,
        "text": "gorge",
        "start": 89,
        "end": 94,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 22,
        "text": "of",
        "start": 95,
        "end": 97,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 23,
        "text": "the",
        "start": 98,
        "end": 101,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 24,
        "text": "mountains",
        "start": 102,
        "end": 111,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 25,
        "text": "through",
        "start": 112,
        "end": 119,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 26,
        "text": "which",
        "start": 120,
        "end": 125,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 27,
        "text": "passes",
        "start": 126,
        "end": 132,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 28,
        "text": "the",
        "start": 133,
        "end": 136,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 29,
        "text": "river",
        "start": 137,
        "end": 142,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 30,
        "text": "Iris",
        "start": 143,
        "end": 147,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 31,
        "text": ",",
        "start": 147,
        "end": 148,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 32,
        "text": "now",
        "start": 149,
        "end": 152,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 33,
        "text": "the",
        "start": 153,
        "end": 156,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 34,
        "text": "Ieschil",
        "start": 157,
        "end": 164,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 35,
        "text": "Irmak",
        "start": 165,
        "end": 170,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 36,
        "text": ",",
        "start": 170,
        "end": 171,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 37,
        "text": "in",
        "start": 172,
        "end": 174,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 38,
        "text": "Pontus",
        "start": 175,
        "end": 181,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 39,
        "text": ",",
        "start": 181,
        "end": 182,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 40,
        "text": "which",
        "start": 183,
        "end": 188,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 41,
        "text": "he",
        "start": 189,
        "end": 191,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 42,
        "text": "has",
        "start": 192,
        "end": 195,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 43,
        "text": "described",
        "start": 196,
        "end": 205,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 44,
        "text": "in",
        "start": 206,
        "end": 208,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 45,
        "text": "the",
        "start": 209,
        "end": 212,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 46,
        "text": "12th",
        "start": 213,
        "end": 217,
        "ws": true,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 47,
        "text": "book",
        "start": 218,
        "end": 222,
        "ws": false,
        "is_punct": false,
        "sent_id": 0
      },
      {
        "id": 48,
        "text": ".",
        "start": 222,
        "end": 223,
        "ws": false,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 49,
        "text": "[",
        "start": 223,
        "end": 224,
        "ws": false,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 50,
        "text": "*",
        "start": 224,
        "end": 225,
        "ws": false,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 51,
        "text": "]",
        "start": 225,
        "end": 226,
        "ws": true,
        "is_punct": true,
        "sent_id": 0
      },
      {
        "id": 52,
        "text": "He",
        "start": 227,
        "end": 229,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 53,
        "text": "lived",
        "start": 230,
        "end": 235,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 54,
        "text": "during",
        "start": 236,
        "end": 242,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 55,
        "text": "the",
        "start": 243,
        "end": 246,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 56,
        "text": "reign",
        "start": 247,
        "end": 252,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 57,
        "text": "of",
        "start": 253,
        "end": 255,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 58,
        "text": "Augustus",
        "start": 256,
        "end": 264,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 59,
        "text": ",",
        "start": 264,
        "end": 265,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 60,
        "text": "and",
        "start": 266,
        "end": 269,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 61,
        "text": "the",
        "start": 270,
        "end": 273,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 62,
        "text": "earlier",
        "start": 274,
        "end": 281,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 63,
        "text": "part",
        "start": 282,
        "end": 286,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 64,
        "text": "of",
        "start": 287,
        "end": 289,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 65,
        "text": "the",
        "start": 290,
        "end": 293,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 66,
        "text": "reign",
        "start": 294,
        "end": 299,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 67,
        "text": "of",
        "start": 300,
        "end": 302,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 68,
        "text": "Tiberius",
        "start": 303,
        "end": 311,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 69,
        "text": ";",
        "start": 311,
        "end": 312,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 70,
        "text": "for",
        "start": 313,
        "end": 316,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 71,
        "text": "in",
        "start": 317,
        "end": 319,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 72,
        "text": "the",
        "start": 320,
        "end": 323,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 73,
        "text": "13th",
        "start": 324,
        "end": 328,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 74,
        "text": "book",
        "start": 329,
        "end": 333,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 75,
        "text": "[",
        "start": 333,
        "end": 334,
        "ws": false,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 76,
        "text": "*",
        "start": 334,
        "end": 335,
        "ws": false,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 77,
        "text": "]",
        "start": 335,
        "end": 336,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 78,
        "text": "he",
        "start": 337,
        "end": 339,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 79,
        "text": "relates",
        "start": 340,
        "end": 347,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 80,
        "text": "how",
        "start": 348,
        "end": 351,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 81,
        "text": "Sardes",
        "start": 352,
        "end": 358,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 82,
        "text": "and",
        "start": 359,
        "end": 362,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 83,
        "text": "other",
        "start": 363,
        "end": 368,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 84,
        "text": "cities",
        "start": 369,
        "end": 375,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 85,
        "text": ",",
        "start": 375,
        "end": 376,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 86,
        "text": "which",
        "start": 377,
        "end": 382,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 87,
        "text": "had",
        "start": 383,
        "end": 386,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 88,
        "text": "suffered",
        "start": 387,
        "end": 395,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 89,
        "text": "severely",
        "start": 396,
        "end": 404,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 90,
        "text": "from",
        "start": 405,
        "end": 409,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 91,
        "text": "earthquakes",
        "start": 410,
        "end": 421,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 92,
        "text": ",",
        "start": 421,
        "end": 422,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 93,
        "text": "had",
        "start": 423,
        "end": 426,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 94,
        "text": "been",
        "start": 427,
        "end": 431,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 95,
        "text": "repaired",
        "start": 432,
        "end": 440,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 96,
        "text": "by",
        "start": 441,
        "end": 443,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 97,
        "text": "the",
        "start": 444,
        "end": 447,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 98,
        "text": "provident",
        "start": 448,
        "end": 457,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 99,
        "text": "care",
        "start": 458,
        "end": 462,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 100,
        "text": "of",
        "start": 463,
        "end": 465,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 101,
        "text": "Tiberius",
        "start": 466,
        "end": 474,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 102,
        "text": "the",
        "start": 475,
        "end": 478,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 103,
        "text": "present",
        "start": 479,
        "end": 486,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 104,
        "text": "Emperor",
        "start": 487,
        "end": 494,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 105,
        "text": ";",
        "start": 494,
        "end": 495,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 106,
        "text": "but",
        "start": 496,
        "end": 499,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 107,
        "text": "the",
        "start": 500,
        "end": 503,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 108,
        "text": "exact",
        "start": 504,
        "end": 509,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 109,
        "text": "date",
        "start": 510,
        "end": 514,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 110,
        "text": "of",
        "start": 515,
        "end": 517,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 111,
        "text": "his",
        "start": 518,
        "end": 521,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 112,
        "text": "birth",
        "start": 522,
        "end": 527,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 113,
        "text": ",",
        "start": 527,
        "end": 528,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 114,
        "text": "as",
        "start": 529,
        "end": 531,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 115,
        "text": "also",
        "start": 532,
        "end": 536,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 116,
        "text": "of",
        "start": 537,
        "end": 539,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 117,
        "text": "his",
        "start": 540,
        "end": 543,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 118,
        "text": "death",
        "start": 544,
        "end": 549,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 119,
        "text": ",",
        "start": 549,
        "end": 550,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 120,
        "text": "are",
        "start": 551,
        "end": 554,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 121,
        "text": "subjects",
        "start": 555,
        "end": 563,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 122,
        "text": "of",
        "start": 564,
        "end": 566,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 123,
        "text": "conjecture",
        "start": 567,
        "end": 577,
        "ws": true,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 124,
        "text": "only",
        "start": 578,
        "end": 582,
        "ws": false,
        "is_punct": false,
        "sent_id": 1
      },
      {
        "id": 125,
        "text": ".",
        "start": 582,
        "end": 583,
        "ws": true,
        "is_punct": true,
        "sent_id": 1
      },
      {
        "id": 126,
        "text": "Coraÿ",
        "start": 584,
        "end": 589,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 127,
        "text": "and",
        "start": 590,
        "end": 593,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 128,
        "text": "Groskurd",
        "start": 594,
        "end": 602,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 129,
        "text": "conclude",
        "start": 603,
        "end": 611,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 130,
        "text": ",",
        "start": 611,
        "end": 612,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 131,
        "text": "though",
        "start": 613,
        "end": 619,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 132,
        "text": "by",
        "start": 620,
        "end": 622,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 133,
        "text": "a",
        "start": 623,
        "end": 624,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 134,
        "text": "somewhat",
        "start": 625,
        "end": 633,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 135,
        "text": "different",
        "start": 634,
        "end": 643,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 136,
        "text": "argument",
        "start": 644,
        "end": 652,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 137,
        "text": ",",
        "start": 652,
        "end": 653,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 138,
        "text": "that",
        "start": 654,
        "end": 658,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 139,
        "text": "he",
        "start": 659,
        "end": 661,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 140,
        "text": "was",
        "start": 662,
        "end": 665,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 141,
        "text": "born",
        "start": 666,
        "end": 670,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 142,
        "text": "in",
        "start": 671,
        "end": 673,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 143,
        "text": "the",
        "start": 674,
        "end": 677,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 144,
        "text": "year",
        "start": 678,
        "end": 682,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 145,
        "text": "B.",
        "start": 683,
        "end": 685,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 146,
        "text": "C.",
        "start": 686,
        "end": 688,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 147,
        "text": "66",
        "start": 689,
        "end": 691,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 148,
        "text": ",",
        "start": 691,
        "end": 692,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 149,
        "text": "and",
        "start": 693,
        "end": 696,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 150,
        "text": "the",
        "start": 697,
        "end": 700,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 151,
        "text": "latter",
        "start": 701,
        "end": 707,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 152,
        "text": "that",
        "start": 708,
        "end": 712,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 153,
        "text": "he",
        "start": 713,
        "end": 715,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 154,
        "text": "died",
        "start": 716,
        "end": 720,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 155,
        "text": "A.",
        "start": 721,
        "end": 723,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 156,
        "text": "D.",
        "start": 724,
        "end": 726,
        "ws": true,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 157,
        "text": "24",
        "start": 727,
        "end": 729,
        "ws": false,
        "is_punct": false,
        "sent_id": 2
      },
      {
        "id": 158,
        "text": ".",
        "start": 729,
        "end": 730,
        "ws": true,
        "is_punct": true,
        "sent_id": 2
      },
      {
        "id": 159,
        "text": "The",
        "start": 731,
        "end": 734,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 160,
        "text": "date",
        "start": 735,
        "end": 739,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 161,
        "text": "of",
        "start": 740,
        "end": 742,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 162,
        "text": "his",
        "start": 743,
        "end": 746,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 163,
        "text": "birth",
        "start": 747,
        "end": 752,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 164,
        "text": "as",
        "start": 753,
        "end": 755,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 165,
        "text": "argued",
        "start": 756,
        "end": 762,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 166,
        "text": "by",
        "start": 763,
        "end": 765,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 167,
        "text": "Groskurd",
        "start": 766,
        "end": 774,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 168,
        "text": ",",
        "start": 774,
        "end": 775,
        "ws": true,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 169,
        "text": "proceeds",
        "start": 776,
        "end": 784,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 170,
        "text": "on",
        "start": 785,
        "end": 787,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 171,
        "text": "the",
        "start": 788,
        "end": 791,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 172,
        "text": "assumption",
        "start": 792,
        "end": 802,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 173,
        "text": "that",
        "start": 803,
        "end": 807,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 174,
        "text": "Strabo",
        "start": 808,
        "end": 814,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 175,
        "text": "was",
        "start": 815,
        "end": 818,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 176,
        "text": "in",
        "start": 819,
        "end": 821,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 177,
        "text": "his",
        "start": 822,
        "end": 825,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 178,
        "text": "thirty",
        "start": 826,
        "end": 832,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 179,
        "text": "-",
        "start": 832,
        "end": 833,
        "ws": false,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 180,
        "text": "eighth",
        "start": 833,
        "end": 839,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 181,
        "text": "year",
        "start": 840,
        "end": 844,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 182,
        "text": "when",
        "start": 845,
        "end": 849,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 183,
        "text": "he",
        "start": 850,
        "end": 852,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 184,
        "text": "went",
        "start": 853,
        "end": 857,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 185,
        "text": "from",
        "start": 858,
        "end": 862,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 186,
        "text": "Gyaros",
        "start": 863,
        "end": 869,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 187,
        "text": "to",
        "start": 870,
        "end": 872,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 188,
        "text": "Corinth",
        "start": 873,
        "end": 880,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 189,
        "text": ",",
        "start": 880,
        "end": 881,
        "ws": true,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 190,
        "text": "at",
        "start": 882,
        "end": 884,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 191,
        "text": "which",
        "start": 885,
        "end": 890,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 192,
        "text": "latter",
        "start": 891,
        "end": 897,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 193,
        "text": "place",
        "start": 898,
        "end": 903,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 194,
        "text": "Octavianus",
        "start": 904,
        "end": 914,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 195,
        "text": "Caesar",
        "start": 915,
        "end": 921,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 196,
        "text": "was",
        "start": 922,
        "end": 925,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 197,
        "text": "then",
        "start": 926,
        "end": 930,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 198,
        "text": "staying",
        "start": 931,
        "end": 938,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 199,
        "text": "on",
        "start": 939,
        "end": 941,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 200,
        "text": "his",
        "start": 942,
        "end": 945,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 201,
        "text": "return",
        "start": 946,
        "end": 952,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 202,
        "text": "to",
        "start": 953,
        "end": 955,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 203,
        "text": "Rome",
        "start": 956,
        "end": 960,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 204,
        "text": "after",
        "start": 961,
        "end": 966,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 205,
        "text": "the",
        "start": 967,
        "end": 970,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 206,
        "text": "battle",
        "start": 971,
        "end": 977,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 207,
        "text": "of",
        "start": 978,
        "end": 980,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 208,
        "text": "Actium",
        "start": 981,
        "end": 987,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 209,
        "text": ",",
        "start": 987,
        "end": 988,
        "ws": true,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 210,
        "text": "B.",
        "start": 989,
        "end": 991,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 211,
        "text": "C.",
        "start": 992,
        "end": 994,
        "ws": true,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 212,
        "text": "31",
        "start": 995,
        "end": 997,
        "ws": false,
        "is_punct": false,
        "sent_id": 3
      },
      {
        "id": 213,
        "text": ".",
        "start": 997,
        "end": 998,
        "ws": true,
        "is_punct": true,
        "sent_id": 3
      },
      {
        "id": 214,
        "text": "We",
        "start": 999,
        "end": 1001,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 215,
        "text": "may",
        "start": 1002,
        "end": 1005,
        "ws": false,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 216,
        "text": ",",
        "start": 1005,
        "end": 1006,
        "ws": true,
        "is_punct": true,
        "sent_id": 4
      },
      {
        "id": 217,
        "text": "perhaps",
        "start": 1007,
        "end": 1014,
        "ws": false,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 218,
        "text": ",",
        "start": 1014,
        "end": 1015,
        "ws": true,
        "is_punct": true,
        "sent_id": 4
      },
      {
        "id": 219,
        "text": "be",
        "start": 1016,
        "end": 1018,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 220,
        "text": "satisfied",
        "start": 1019,
        "end": 1028,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 221,
        "text": "with",
        "start": 1029,
        "end": 1033,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 222,
        "text": "following",
        "start": 1034,
        "end": 1043,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 223,
        "text": "Clinton",
        "start": 1044,
        "end": 1051,
        "ws": false,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 224,
        "text": ",",
        "start": 1051,
        "end": 1052,
        "ws": true,
        "is_punct": true,
        "sent_id": 4
      },
      {
        "id": 225,
        "text": "and",
        "start": 1053,
        "end": 1056,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 226,
        "text": "place",
        "start": 1057,
        "end": 1062,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 227,
        "text": "it",
        "start": 1063,
        "end": 1065,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 228,
        "text": "not",
        "start": 1066,
        "end": 1069,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 229,
        "text": "later",
        "start": 1070,
        "end": 1075,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 230,
        "text": "than",
        "start": 1076,
        "end": 1080,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 231,
        "text": "B.",
        "start": 1081,
        "end": 1083,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 232,
        "text": "C.",
        "start": 1084,
        "end": 1086,
        "ws": true,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 233,
        "text": "54",
        "start": 1087,
        "end": 1089,
        "ws": false,
        "is_punct": false,
        "sent_id": 4
      },
      {
        "id": 234,
        "text": ".",
        "start": 1089,
        "end": 1090,
        "ws": false,
        "is_punct": true,
        "sent_id": 4
      }
    ],
    "meta": {
      "source": "local",
      "char_count": 1090,
      "token_count": 235,
      "sentence_count": 5
    }
  }
]
Results saved in /content/output_file.json
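
Each token in the record above carries character offsets (start, end), a trailing-whitespace flag (ws) and a sentence index (sent_id), while meta holds the counts. Before passing the file to a NER step, it can be worth confirming that these fields agree with each other. The following is a minimal sketch of such a check, not part of the original recipe. It assumes the key names listed in the overview (text_clean, tokens, meta) and that char_count refers to the length of text_clean. The helper names check_record and make_token and the toy record are hypothetical.

```python
def check_record(record):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    text = record["text_clean"]      # assumed key name, taken from the overview
    tokens = record["tokens"]
    meta = record["meta"]

    for tok in tokens:
        # Each token's offsets should slice its own text out of the cleaned string.
        if text[tok["start"]:tok["end"]] != tok["text"]:
            problems.append(f"token {tok['id']}: offsets do not match text")

    # The counts in meta should agree with what the record actually contains.
    if meta["char_count"] != len(text):
        problems.append("char_count differs from len(text_clean)")
    if meta["token_count"] != len(tokens):
        problems.append("token_count differs from number of tokens")
    if meta["sentence_count"] != len({tok["sent_id"] for tok in tokens}):
        problems.append("sentence_count differs from distinct sent_id values")
    return problems


def make_token(i, text, start, ws, sent_id, is_punct=False):
    """Build a token dict shaped like the ones in the output (hypothetical helper)."""
    return {"id": i, "text": text, "start": start, "end": start + len(text),
            "ws": ws, "is_punct": is_punct, "sent_id": sent_id}


# Toy record for illustration only; real records come from the saved JSON file.
sample = {
    "text_clean": "Strabo was born. He died.",
    "tokens": [
        make_token(0, "Strabo", 0, True, 0),
        make_token(1, "was", 7, True, 0),
        make_token(2, "born", 11, False, 0),
        make_token(3, ".", 15, True, 0, is_punct=True),
        make_token(4, "He", 17, True, 1),
        make_token(5, "died", 20, False, 1),
        make_token(6, ".", 24, False, 1, is_punct=True),
    ],
    "meta": {"source": "local", "char_count": 25,
             "token_count": 7, "sentence_count": 2},
}

problems = check_record(sample)
print(problems)  # → []
```

To apply the same check to the saved output, load /content/output_file.json with json.load and call check_record on each element of the resulting list.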

1.7.1. Save the results in a local .zip file#

The output files are saved only in the Google Colab notebook's temporary storage, and are deleted once the notebook session ends.

Two options are available:

If you are only interested in one of the files you generated, you can download that individual output file.
If you ran several options and want to keep every output file, you can download them together as a single .zip archive.

# Download the output file generated by your chosen option to your machine.
from google.colab import files
files.download("/content/output_file.json")

# Bundle the output files saved under /content into a .zip archive and download it.
# Files that were not generated are skipped by zip with a warning.
from google.colab import files
!zip -r /content/output_files.zip output_api_id.json output_api_search.json output_local.json output_file.json
files.download("/content/output_files.zip")

  adding: output_local.json (deflated 86%)
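
The output above shows that only the files that actually exist end up in the archive. As a hedged alternative to the shell command, the same bundling can be sketched with Python's standard zipfile module, which also works outside Colab. The function name zip_existing is hypothetical, and the demonstration writes to a temporary directory instead of /content.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_existing(paths, archive_path):
    """Add every file in `paths` that exists to `archive_path`; return the names added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in map(Path, paths):
            if p.is_file():
                # Store only the file name so the archive has no directory prefix.
                zf.write(p, arcname=p.name)
                added.append(p.name)
    return added


# Demonstration in a temporary directory: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "output_local.json"
    present.write_text("[]", encoding="utf-8")
    missing = Path(tmp) / "output_api_id.json"   # never created on purpose
    archive = Path(tmp) / "output_files.zip"

    added = zip_existing([present, missing], archive)
    with zipfile.ZipFile(archive) as zf:
        names_in_archive = zf.namelist()

print(added)  # → ['output_local.json']
```

In Colab, the archive path could be /content/output_files.zip, and the resulting file can then be passed to files.download as in the cell above.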