Awesome Lorentz list of resources
A curated list of resources mentioned during the Lorentz workshop. Resources are organised by type (datasets, models, tools) and, where applicable, by the processing task they support. (If you are not familiar with the format, check out the existing awesome lists on GitHub.)
Resources for evaluation
This includes datasets (e.g. for benchmarking), annotation guidelines, shared tasks, etc.
…
Models
Document processing (OCR, page layout analysis, etc.)
dots.ocr – Multilingual Document Layout Parsing in a Single Vision-Language Model
Online demo (for quick testing): rednote-hilab/dots.ocr
Model overviews
European Open Source AI Index - an index of the openness of AI models, rated on various criteria
Tools
Annotation
INCEpTION – A semantic annotation platform suitable for various types of textual annotations (NER, EL, etc.).
Recogito Studio - An Extensible Platform for Collaborative, Standards-Based Annotation of TEI Text, IIIF Images, and PDFs, including geotagging and reconciliation with different gazetteers (WHG, Pleiades, Wikidata, etc.).
Immarkus - open-source tool for semantic image annotation
Image Positions – Image Annotation platform inside the Wikidata environment
FairCopy - tool for reading, transcribing, and encoding text with custom annotations
CATMA - text markup and analysis tool
Prodi.gy - annotation tool for spaCy (not open-source)
Liiive - Real-time collaborative viewing & annotation for IIIF image collections
Named Entity Recognition
GATE geotagger – This service identifies geographical named entities and disambiguates them against GeoNames. It currently uses the Mordecai3 geoparser; more details on Mordecai3 can be found in this paper.
GATE Pleiades NER – This service identifies geographical named entities and disambiguates them against the Pleiades dataset. It builds a simple gazetteer from all the names in each Pleiades entry that contains a representative point. Ambiguous locations (i.e. those where multiple lookups overlap) are disambiguated using a geometrical approach. We assume that, in a similar way to word sense disambiguation, a document is likely to be discussing a single area, and so we choose the set of locations that minimises the area covered by the selected points; for efficiency, this is currently done by calculating axis-aligned bounding boxes.
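The geometrical disambiguation described above can be illustrated with a small Python sketch. This is not the GATE service's code: the function names, the place identifiers, the approximate coordinates, and the exhaustive search over all candidate combinations are assumptions made for illustration. It picks one candidate point per ambiguous name so that the axis-aligned bounding box around the chosen points has the smallest area.

```python
from itertools import product


def bbox_area(points):
    """Area (in squared degrees) of the axis-aligned box around (lon, lat) points."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (max(lons) - min(lons)) * (max(lats) - min(lats))


def disambiguate(candidates):
    """Choose one candidate per mention so the overall bounding box is smallest.

    `candidates` maps each mention to a list of (place_id, (lon, lat)) pairs.
    Exhaustive search: workable for a few mentions, exponential in general.
    """
    mentions = list(candidates)
    best = min(
        product(*(candidates[m] for m in mentions)),
        key=lambda combo: bbox_area([point for _, point in combo]),
    )
    return {m: place_id for m, (place_id, _) in zip(mentions, best)}


# Hypothetical identifiers with rough, illustrative coordinates (lon, lat).
candidates = {
    "Alexandria": [("alexandria-egypt", (29.9, 31.2)),
                   ("alexandria-troas", (26.15, 39.75))],
    "Memphis": [("memphis-egypt", (31.25, 29.85))],
    "Thebes": [("thebes-egypt", (32.6, 25.7)),
               ("thebes-greece", (23.3, 38.3))],
}

print(disambiguate(candidates))
```

With these illustrative inputs the sketch selects the three Egyptian candidates, since they fit in the smallest box. It ignores longitude wrap-around at the antimeridian and does not break ties between degenerate (zero-area) boxes, both of which a production implementation would need to handle.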
Entity Linking & Reconciliation
Spacyfishing – A spaCy Python wrapper for the entity-fishing tool for entity linking against Wikidata.
OpenRefine – an open-source tool for manipulating datasets, including semi-automatic entity linking and variant clustering.
TagMe – a tool to identify short phrases or entities and match them against Wikipedia pages.
Ariadne Services for Entity Linking and Disambiguation – …
Applications
britishlibrary/peripleo - a browser-based tool for the mapping of things related to place.
Vistorian - online environment to visualize spatial and networked data.
Formats
LinkedPasts/linked-places-format - Linked Places format is used to describe attestations of places in a standard way, primarily for linking gazetteer datasets.
LinkedPasts/linked-traces-format - Patterns based on the W3C Web Annotation Model, primarily for use in linking resources describing historical phenomena with the places relevant to them.