
Data Access

The EO Glossary is designed from the ground up to be machine-readable and AI-friendly. Every term carries structured metadata, and the full dataset is available in multiple formats — ready for direct LLM ingestion, RAG pipelines, data analysis, or integration into AI assistants.

This page describes all access patterns in detail.

At a Glance

| Resource | URL | Format | Best for |
| --- | --- | --- | --- |
| AI guidance | `/llms.txt` | Markdown | LLM crawlers and agents (llmstxt.org standard) |
| All definitions (plain text) | `/llms-full.txt` | Markdown | Direct LLM ingestion, RAG chunking |
| All definitions (structured) | `exports/json/terms.json` | JSON | RAG pipelines, programmatic access |
| All definitions (columnar) | `exports/parquet/` | Parquet | DuckDB, data science, analytics |
| All definitions (spreadsheet) | `exports/xlsx/` | XLSX | Excel, Google Sheets, manual review |
| Per-term JSON | `/terms/{slug}.json` | JSON | Single-term lookups |
| Per-term Markdown | `/terms/{slug}.md` | Markdown | Raw source files |
| Sitemap | `/sitemap.xml` | XML | Web crawlers |
| RSS feed | `/rss.xml` | RSS 2.0 | Feed readers, change tracking |

All exports are regenerated automatically on every push to `main`.


llms.txt — AI Crawler Guidance

Following the llmstxt.org standard, the glossary serves a /llms.txt file at its root. This lightweight Markdown file tells LLM crawlers and agents what the glossary contains, how it is structured, and where to find the data they need.

If you are building an AI agent or crawler that discovers resources automatically, point it at /llms.txt first.


llms-full.txt — Plain-Text Definitions for LLMs

The /llms-full.txt file contains all term definitions concatenated as plain Markdown — one block per term, separated by horizontal rules. Each block includes:

  • Term name and URL
  • Tags (ontology class + approval status)
  • One-sentence summary
  • Full definition, notes, examples, and sources

This file is designed for direct LLM ingestion: paste it into a system prompt, use it as a context document, or chunk it for a RAG pipeline. It is generated at build time and always reflects the latest state of the glossary.

Example block:

```markdown
# Accuracy
URL: https://ceos-org.github.io/eo-glossary/terms/accuracy
Tags: core, approved
Summary: Proximity of Measurement results to the accepted Value...

## 1 Definition

Proximity of Measurement results to the accepted Value...

### Notes
- ...

### Sources
- VIM 2.13, modified
```
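Because every term is a self-contained block separated by horizontal rules, splitting on the rule yields one chunk per term, which maps directly onto RAG chunking. A minimal sketch (assuming the separator is exactly `---` on its own line, and the site-root URL shown earlier):

```python
import urllib.request

LLMS_FULL_URL = "https://ceos-org.github.io/eo-glossary/llms-full.txt"


def fetch_llms_full(url: str = LLMS_FULL_URL) -> str:
    """Download the concatenated plain-text glossary."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def chunk_terms(text: str) -> list[str]:
    """Split the file into one chunk per term. Blocks are assumed to be
    separated by Markdown horizontal rules ('---' on its own line)."""
    return [block.strip() for block in text.split("\n---\n") if block.strip()]


# chunks = chunk_terms(fetch_llms_full())  # one chunk per term, ready to embed
```

Each resulting chunk already carries the term name, tags, and sources, so no extra metadata lookup is needed before embedding.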

JSON Export — Structured Data

The full glossary is exported as a single JSON file:

`exports/json/terms.json`

Each entry contains:

```json
{
  "term": "Accuracy",
  "tags": "core, approved",
  "synonyms": "",
  "definitions": [
    {
      "definition_no": 1,
      "definition": "Proximity of Measurement results to the accepted Value...",
      "notes": "...",
      "examples": "...",
      "sources": "- VIM 2.13, modified"
    }
  ]
}
```

Terms with multiple definitions (e.g. controversial terms) have multiple entries in the definitions array.
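A common first step when consuming this file is flattening the nested `definitions` array into one row per definition. A sketch based on the schema shown above:

```python
import json


def flatten_definitions(entries: list[dict]):
    """Yield one (term, definition_no, definition) row per definition,
    expanding terms that carry multiple entries in `definitions`."""
    for entry in entries:
        for d in entry["definitions"]:
            yield entry["term"], d["definition_no"], d["definition"]


# with open("exports/json/terms.json", encoding="utf-8") as fh:
#     rows = list(flatten_definitions(json.load(fh)))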

Flattened JSON exports

For simpler consumption, flattened exports are also available, in which each file contains only a single definition per term.


Parquet Export — Analytics & DuckDB

The same data is available as Apache Parquet files — a columnar format ideal for data science tooling, DuckDB, pandas, Polars, and other analytical frameworks.

Query directly with DuckDB (no download needed)

```sql
INSTALL httpfs;
LOAD httpfs;

-- Browse all terms
SELECT * FROM read_parquet(
  'https://github.com/ceos-org/eo-glossary/raw/refs/heads/main/exports/parquet/terms_definition_1.parquet'
);

-- Search for a specific term
SELECT term, definition, sources
FROM read_parquet(
  'https://github.com/ceos-org/eo-glossary/raw/refs/heads/main/exports/parquet/terms_definition_1.parquet'
)
WHERE term ILIKE '%calibration%';
```

Run with `duckdb -ui` for an interactive browser-based UI, or use any DuckDB client.

Query with Python (pandas / Polars)

```python
import pandas as pd

url = "https://github.com/ceos-org/eo-glossary/raw/refs/heads/main/exports/parquet/terms_definition_1.parquet"
df = pd.read_parquet(url)
print(df[df["term"].str.contains("Calibration", case=False)])
```

Available Parquet files

| File | Contents |
| --- | --- |
| `terms.parquet` | All terms with nested definitions |
| `terms_definition_1.parquet` | First definition of each term (flat) |
| `terms_definition_2.parquet` | Second definition, where available |
| ... | ...up to `terms_definition_5.parquet` |

XLSX Export — Spreadsheet Access

For users who prefer working in Excel or Google Sheets, the glossary is also exported as XLSX files under `exports/xlsx/`.


Per-Term JSON & Markdown

Every term is available individually as both JSON and raw Markdown, served directly from the site:

  • JSON: https://ceos-org.github.io/eo-glossary/terms/{slug}.json
  • Markdown: https://ceos-org.github.io/eo-glossary/terms/{slug}.md

For example: `https://ceos-org.github.io/eo-glossary/terms/accuracy.json`

These are useful for single-term lookups in scripts or AI tool calls.
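A single-term lookup needs nothing beyond the standard library. A sketch, assuming slugs are the lowercased, hyphenated term name (consistent with `Accuracy` → `accuracy` above):

```python
import json
import urllib.request

BASE = "https://ceos-org.github.io/eo-glossary/terms"


def slugify(term: str) -> str:
    """Guess a term's URL slug (assumption: lowercase, spaces to hyphens)."""
    return term.strip().lower().replace(" ", "-")


def fetch_term(term: str) -> dict:
    """Fetch the per-term JSON export for a single glossary term."""
    with urllib.request.urlopen(f"{BASE}/{slugify(term)}.json") as resp:
        return json.load(resp)
```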


Schema.org Structured Data (JSON-LD)

Every term page includes a Schema.org DefinedTerm JSON-LD block in its HTML <head>. This enables semantic search engines and knowledge graphs to understand and index term definitions directly.
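If you want to harvest those blocks yourself, the JSON-LD can be pulled out of a term page's HTML with the standard library alone. A sketch using `html.parser` (the exact markup is assumed from the description above):

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect parsed <script type="application/ld+json"> blocks from a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buf: list[str] = []
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf, self._in_jsonld = [], False
```

Feed a page's HTML to `JSONLDExtractor.feed()` and read the parsed blocks from `.blocks`.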


MCP Server — AI Assistant Integration

The glossary ships with a Model Context Protocol (MCP) server that lets AI assistants like Claude Desktop query the glossary interactively.

Setup

```bash
cd mcp && npm install
```

Add to your claude_desktop_config.json:

```json
{
  "mcpServers": {
    "eo-glossary": {
      "command": "node",
      "args": ["/path/to/eo-glossary/mcp/server.js"]
    }
  }
}
```

Available tools

| Tool | Description | Parameters |
| --- | --- | --- |
| `list_terms` | List all terms, optionally filtered by tag | `tag` (optional): `"core"`, `"approved"`, etc. |
| `get_term` | Get full definition(s) for a specific term | `term` (required): name or slug |
| `search_terms` | Search terms by keyword (name + definition text) | `query` (required): search string |

Example interaction

Once configured, you can ask your AI assistant things like:

  • "What does the EO Glossary define as 'Calibration'?"
  • "List all controversial terms in the EO Glossary"
  • "Search the EO Glossary for terms related to uncertainty"

The MCP server loads data from exports/json/terms.json locally, or falls back to fetching from GitHub if the local file is not available.
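That local-first, remote-fallback behaviour can be sketched as follows (not the actual server code, which is JavaScript; the paths and raw-file URL mirror those shown elsewhere on this page):

```python
import json
import os
import urllib.request

LOCAL_PATH = "exports/json/terms.json"
REMOTE_URL = ("https://github.com/ceos-org/eo-glossary/raw/refs/heads/main/"
              "exports/json/terms.json")


def _fetch_remote() -> list[dict]:
    """Download the JSON export from GitHub."""
    with urllib.request.urlopen(REMOTE_URL) as resp:
        return json.load(resp)


def load_terms(local_path: str = LOCAL_PATH, fetch=_fetch_remote) -> list[dict]:
    """Prefer the local export; fall back to fetching from GitHub."""
    if os.path.exists(local_path):
        with open(local_path, encoding="utf-8") as fh:
            return json.load(fh)
    return fetch()
```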


RSS Feed

The /rss.xml feed tracks term updates. Subscribe in any feed reader to stay informed when definitions are added or modified.
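For automated change tracking, the feed can be parsed with the standard library. A sketch assuming standard RSS 2.0 item fields (`title`, `link`, `pubDate`):

```python
import xml.etree.ElementTree as ET


def feed_items(rss_text: str) -> list[dict]:
    """List title/link/pubDate for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"),
         "link": item.findtext("link"),
         "pubDate": item.findtext("pubDate")}
        for item in root.iter("item")
    ]
```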


Building Your Own Integration

The glossary's open data makes it straightforward to build custom integrations:

  • RAG pipeline: Ingest llms-full.txt or terms.json, chunk by term, embed with your model of choice
  • Search index: Parse terms.json and index into Elasticsearch, Meilisearch, Typesense, etc.
  • Knowledge graph: Use the per-term JSON (which includes source attributions) to build ontology-aware graphs
  • CI/CD checks: Fetch terms.json in your pipeline to validate that documentation uses consistent EO terminology
  • Chatbot / agent: Connect the MCP server, or fetch terms.json at startup for in-memory lookups
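As a concrete sketch of the CI/CD idea, a pipeline step could load `terms.json` and report which glossary terms a document actually uses (`glossary_terms_used` is a hypothetical helper; only the `term` field from the schema above is assumed):

```python
import json
import re


def glossary_terms_used(doc_text: str, entries: list[dict]) -> set[str]:
    """Return the glossary terms that appear (case-insensitively, as whole
    words) in doc_text -- a building block for a terminology CI check."""
    return {
        e["term"]
        for e in entries
        if re.search(rf"\b{re.escape(e['term'])}\b", doc_text, re.IGNORECASE)
    }


# with open("exports/json/terms.json", encoding="utf-8") as fh:
#     used = glossary_terms_used(readme_text, json.load(fh))
```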

All data is licensed under CC BY 4.0 — free to use, share, and adapt with attribution.