Parsing Narratives

When you POST text to /v1/notes/import, TNGS runs a five-stage rule-based pipeline entirely in memory before writing a single node to the graph. The pipeline requires no external NLP libraries or model downloads; it ships and runs in CI out of the box.


The five-stage pipeline

flowchart LR
    A[Raw text] --> B[Stage 1\nText prep]
    B --> C[Stage 2\nSegmentation]
    C --> D[Stage 3\nEntity extraction]
    C --> E[Stage 4\nEvent detection]
    D & E --> F[Stage 5\nAnnotation]
    F --> G[Pattern detection]
    G --> H[(Neo4j graph)]

Stages 1–2 — Text preparation and segmentation

The format field in the ingest request controls how the input is split into scenes. Two segmentation strategies are available:

Plain-text segmentation (format: "text", default)

Double newlines (\n\n or more) mark scene boundaries. Each paragraph becomes one Scene node with an empty summary. YAML front matter is stripped first if present.

Sentences → Atoms

Within each paragraph, a regex splits on sentence-ending punctuation (., !, ?) followed by an optional closing quote and whitespace before an upper-case letter, or at end-of-string:

(?<=[.!?])[\"']?\s+(?=[A-Z])  |  (?<=[.!?])[\"']?$

Each sentence becomes one candidate Atom.
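
The boundary regex can be exercised directly with Python's `re` module. A sketch; the character class accepts a straight or curly closing quote:

```python
import re

# Sentence boundary: terminal punctuation, an optional closing quote,
# then whitespace before an upper-case letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])["”\']?\s+(?=[A-Z])')

def split_sentences(paragraph: str) -> list[str]:
    # The end-of-string alternative creates no split point, so splitting
    # on the first alternative alone is sufficient here.
    return [s for s in SENTENCE_BOUNDARY.split(paragraph.strip()) if s]

print(split_sentences("Alice offered the book. She smiled."))
# → ['Alice offered the book.', 'She smiled.']
```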

Plain-text segmentation

Input:

Alice offered the book. She smiled.

Bob accepted it gratefully. He nodded once.

Output: 2 scenes, 4 atoms. Both scenes have summary = "".

| Scene   | Sequence | summary | Atoms |
| ------- | -------- | ------- | ----- |
| Scene 1 | 1        | (empty) | "Alice offered the book." · "She smiled." |
| Scene 2 | 2        | (empty) | "Bob accepted it gratefully." · "He nodded once." |
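
The whole format: "text" path can be sketched in a few lines (the sentence regex is repeated here so the snippet stands alone):

```python
import re

SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])["”\']?\s+(?=[A-Z])')

def segment_text(raw: str) -> list[dict]:
    """Sketch of format="text" segmentation: blank lines separate scenes."""
    scenes = []
    for seq, para in enumerate(re.split(r"\n{2,}", raw.strip()), start=1):
        atoms = [s for s in SENTENCE_BOUNDARY.split(para.strip()) if s]
        scenes.append({"sequence": seq, "summary": "", "atoms": atoms})
    return scenes

scenes = segment_text("Alice offered the book. She smiled.\n\n"
                      "Bob accepted it gratefully. He nodded once.")
# 2 scenes, 4 atoms, both summaries empty
```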

Markdown segmentation (format: "markdown")

ATX headings (# through ######) are scene boundaries. The heading text (without the # sigil) becomes scene.summary. Prose paragraphs under a heading are merged into that scene — the heading is the scene boundary, not the blank line. YAML front matter is stripped first.

heading:  ^#{1,6}\s+(.+)$      →  new SceneSection(summary=capture)
prose:    any non-empty line   →  accumulated into current section sentences

Markdown segmentation

Input:

## The House

It is very seldom that mere ordinary people secure ancestral halls.
I would say a haunted house—but that would be asking too much of fate!

## The Room

I do not like our room a bit. The color is revolting.

Output: 2 scenes, 4 atoms. Heading text is not an atom.

| Scene   | Sequence | summary   | Atoms |
| ------- | -------- | --------- | ----- |
| Scene 1 | 1        | The House | "It is very seldom…" · "I would say a haunted house…" |
| Scene 2 | 2        | The Room  | "I do not like our room a bit." · "The color is revolting." |

Multiple paragraphs within a section

All prose paragraphs under one heading are merged into a single scene. A chapter with five paragraphs and one ## heading produces one scene with atoms from all five paragraphs — not five scenes.

Headings with no prose are skipped

A heading line immediately followed by another heading (no prose between them) produces no scene. A scene is only created when there is at least one prose sentence under the heading.
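
The heading rules above can be sketched as follows (an illustration, not the pipeline's actual code):

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.+)$")

def segment_markdown(raw: str) -> list[dict]:
    """Sketch of format="markdown" segmentation: ATX headings open scenes."""
    sections, current = [], None
    for line in raw.splitlines():
        m = HEADING.match(line)
        if m:
            # A heading starts a new section; its text becomes the summary.
            current = {"summary": m.group(1).strip(), "lines": []}
            sections.append(current)
        elif line.strip() and current is not None:
            # All prose under the same heading merges into one section.
            current["lines"].append(line.strip())
    # Heading-only sections (no prose underneath) produce no scene.
    return [s for s in sections if s["lines"]]
```

Note the final filter: a heading immediately followed by another heading yields no scene, matching the skip rule above.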


Stage 3 — Entity extraction

Each scene's sentences are scanned for consecutive capitalised words (1–4 tokens matching [A-Z][a-z]+) that do not appear at the very start of a sentence. A stop-list of pronouns, conjunctions, and common sentence-initial words filters obvious false positives.

Confidence is assigned per entity:

confidence = min(0.95, 0.75 + 0.05 × (mention_count − 1))

A single mention → 0.75 confidence. Each additional mention raises it by 0.05, capped at 0.95.

Entity extraction example

Sentences: "Alice ran. Alice stopped. Bob arrived."

| Name  | Mentions | Confidence |
| ----- | -------- | ---------- |
| Alice | 2        | 0.80 |
| Bob   | 1        | 0.75 |
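
The formula as executable Python, reproducing the table above:

```python
def entity_confidence(mention_count: int) -> float:
    """0.75 base, +0.05 per additional mention, capped at 0.95."""
    return min(0.95, 0.75 + 0.05 * (mention_count - 1))

print(entity_confidence(1))   # single mention → 0.75
print(entity_confidence(10))  # many mentions → capped at 0.95
```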

Stage 4 — Event detection

Each sentence is scanned for a verb phrase using a regex that covers:

  • Modal + main verb (will run, should leave)
  • Be + -ing progressive (was walking, is running)
  • Have + past participle (had arrived, has seen)
  • Simple inflected verb (walked, stops, running)

The first match per sentence becomes a DetectedEvent. Tense is inferred from surface markers:

| Marker pattern | Tense |
| -------------- | ----- |
| will, shall, going to | future |
| was, were, had, -ed endings | past |
| Everything else | present |

All rule-based events receive a base confidence of 0.75.
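
A rough sketch of the verb-phrase match and tense inference; the pipeline's actual regex may differ in detail:

```python
import re

# Approximation of the four verb-phrase patterns listed above.
VERB_PHRASE = re.compile(
    r"\b(?:will|shall|would|should|can|could|may|might|must)\s+\w+\b"  # modal + verb
    r"|\b(?:am|is|are|was|were)\s+\w+ing\b"                            # be + -ing
    r"|\b(?:has|have|had)\s+\w+(?:ed|en)\b"                            # have + participle
    r"|\b\w+(?:ed|s|ing)\b",                                           # simple inflected
    re.IGNORECASE,
)

def infer_tense(sentence: str) -> str:
    """Surface-marker tense rule from the table above."""
    s = sentence.lower()
    if re.search(r"\b(?:will|shall)\b|\bgoing to\b", s):
        return "future"
    if re.search(r"\b(?:was|were|had)\b", s) or re.search(r"\b[a-z]+ed\b", s):
        return "past"
    return "present"
```

The simple-inflected alternative is deliberately loose (it will also match plural nouns), which is one reason every rule-based event carries only 0.75 confidence.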


Stage 5 — Confidence annotation and atom classification

The annotator wraps extracted entities and events in domain model objects with UUIDs and applies the confidence_threshold (default 0.6):

  • Items at or above the threshold proceed normally.
  • Items below the threshold are flagged needs_review=True but are not discarded. Ambiguity is first-class data; a human or downstream process resolves it.

Atom confidence penalties:

| Condition | Penalty |
| --------- | ------- |
| Sentence shorter than 10 characters | −0.15 |
| No terminal punctuation (.!?) | −0.05 |
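
The penalty and review-flag rules can be sketched as one function; the starting base confidence of 0.9 here is an assumption for illustration:

```python
def annotate_atom(text: str, base: float = 0.9, threshold: float = 0.6) -> dict:
    """Sketch of atom annotation: apply penalties, then flag low confidence."""
    confidence = base                       # base value assumed, not from the docs
    if len(text) < 10:
        confidence -= 0.15                  # very short sentence
    if not text.rstrip().endswith((".", "!", "?")):
        confidence -= 0.05                  # missing terminal punctuation
    return {
        "text": text,
        "confidence": round(confidence, 2),
        "needs_review": confidence < threshold,  # flagged, never discarded
    }
```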

Atom kind classification (heuristic, in priority order):

| Kind | Trigger |
| ---- | ------- |
| dialogic | Sentence starts with a straight or curly quote (" ' “ ‘), or contains said/asked/replied/whispered/shouted |
| reflexive | Contains thought/felt/wondered/realised/realized/knew |
| transitional | Starts with then/later/meanwhile/afterwards/next/finally |
| expository | Contains was a / were a / is a / are a / had been |
| descriptive | Default |
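
A sketch of the priority-ordered classifier (an illustration of the heuristic, not the pipeline's exact code):

```python
import re

def classify_atom(text: str) -> str:
    """Heuristic atom-kind classifier, checked in priority order."""
    s = text.strip().lower()
    if s.startswith(('"', "'", "“", "‘")) or re.search(
            r"\b(?:said|asked|replied|whispered|shouted)\b", s):
        return "dialogic"
    if re.search(r"\b(?:thought|felt|wondered|realised|realized|knew)\b", s):
        return "reflexive"
    if re.match(r"(?:then|later|meanwhile|afterwards|next|finally)\b", s):
        return "transitional"
    if re.search(r"\b(?:was a|were a|is a|are a|had been)\b", s):
        return "expository"
    return "descriptive"
```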

Character–event linking: A character is added as a PARTICIPATES_IN participant if their name (case-insensitive) appears anywhere in the source sentence of the event.
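
Since the rule is a plain case-insensitive substring check, it reduces to a one-liner:

```python
def link_participants(sentence: str, character_names: list[str]) -> list[str]:
    """Names whose case-insensitive text appears anywhere in the sentence."""
    lowered = sentence.lower()
    return [name for name in character_names if name.lower() in lowered]
```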


Pattern detection (parallel to annotation)

After atoms and events are assembled for a scene, PatternService runs keyword matching against all registered Pattern templates. Matches above the confidence threshold become PatternInstance nodes linked to the scene.
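
A toy version of the keyword matcher; the template shape (name mapped to a keyword list) and the scoring rule are assumptions for illustration:

```python
def detect_patterns(atom_texts: list[str],
                    templates: dict[str, list[str]],
                    threshold: float = 0.6) -> list[str]:
    """Toy matcher: confidence = fraction of a template's keywords present."""
    joined = " ".join(atom_texts).lower()
    matches = []
    for name, keywords in templates.items():
        score = sum(kw in joined for kw in keywords) / len(keywords)
        if score >= threshold:
            matches.append(name)
    return matches
```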


From pipeline to graph

All in-memory results are written to Neo4j in a single pass using idempotent MERGE statements. Re-ingesting the same text leaves the graph unchanged (no duplicate nodes).

sequenceDiagram
    actor User
    participant API
    participant IS as IngestService
    participant PS as PatternService
    participant GR as GraphRepository

    User->>API: POST /v1/notes/import
    API->>IS: ingest(payload)
    IS->>IS: _segment() → SceneSection list
    Note right of IS: markdown: headings → scene boundaries<br/>text: paragraphs → scene boundaries
    loop per scene
        IS->>IS: extract_entities()
        IS->>IS: detect_events()
        IS->>IS: annotate_atoms() / annotate_events()
        IS->>PS: detect_patterns(atoms, events)
        PS-->>IS: pattern_instances[]
        IS->>GR: save_scene(scene)
    end
    GR-->>IS: done
    IS-->>API: IngestResult
    API-->>User: 201 + JSON summary

Ingest result

A successful ingest returns:

{
  "narrative_id": "550e8400-e29b-41d4-a716-446655440000",
  "scene_count": 2,
  "atom_count": 4,
  "event_count": 3,
  "character_count": 2,
  "pattern_count": 1,
  "flagged_count": 0
}

flagged_count is the total number of atoms and events with needs_review=True. Review flags are queryable directly in Neo4j:

MATCH (n:Narrative {id: $id})-[:HAS_SCENE]->(s)-[:CONTAINS]->(a:Atom)
WHERE a.needs_review = true
RETURN s.sequence AS scene, a.text, a.confidence
ORDER BY a.confidence ASC

Replacing pipeline stages

Each stage is a Strategy. To swap in a heavier NLP backend (spaCy, stanza, or an LLM classifier), write a function that accepts the same inputs and returns the same output type (ExtractedEntity, DetectedEvent, Atom), then pass it to IngestService during construction. No other code changes.
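
For example, a custom entity extractor might look like the sketch below. The `ExtractedEntity` stand-in and the constructor keyword are hypothetical; check the real domain types and the `IngestService` signature before wiring anything in:

```python
import re
from dataclasses import dataclass

@dataclass
class ExtractedEntity:        # stand-in for the pipeline's domain type
    name: str
    mention_count: int

def my_entity_extractor(sentences: list[str]) -> list[ExtractedEntity]:
    """Toy Stage 3 replacement: count capitalised tokens anywhere."""
    counts: dict[str, int] = {}
    for sentence in sentences:
        for match in re.finditer(r"\b[A-Z][a-z]+\b", sentence):
            counts[match.group()] = counts.get(match.group(), 0) + 1
    return [ExtractedEntity(name, n) for name, n in counts.items()]

# Hypothetical wiring: the keyword name is assumed, check the actual
# IngestService constructor.
# service = IngestService(entity_extractor=my_entity_extractor)
```

The same pattern applies to the event detector and the atom annotator: match the input and output types, pass the callable at construction time.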