Parsing Narratives¶
When you POST text to /v1/notes/import, TNGS runs a five-stage rule-based
pipeline entirely in memory before writing a single node to the graph. The
pipeline requires no external NLP libraries or model downloads; it ships and
runs in CI out of the box.
The five-stage pipeline¶
```mermaid
flowchart LR
    A[Raw text] --> B[Stage 1\nText prep]
    B --> C[Stage 2\nSegmentation]
    C --> D[Stage 3\nEntity extraction]
    C --> E[Stage 4\nEvent detection]
    D & E --> F[Stage 5\nAnnotation]
    F --> G[Pattern detection]
    G --> H[(Neo4j graph)]
```
Stage 1 — Text preparation and segmentation¶
The format field in the ingest request controls how the input is split into
scenes. Two segmentation strategies are available:
Plain-text segmentation (format: "text", default)¶
Double newlines (\n\n or more) mark scene boundaries. Each paragraph becomes
one Scene node with an empty summary. YAML front matter is stripped first if
present.
Sentences → Atoms
Within each paragraph, a regex splits on sentence-ending punctuation (., !,
?), optionally followed by closing punctuation, when the next characters are
whitespace and an upper-case letter, or when the string ends.
Each sentence becomes one candidate Atom.
Plain-text segmentation
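Input: two paragraphs separated by a blank line. The first reads "Alice offered the book. She smiled." and the second reads "Bob accepted it gratefully. He nodded once."
Output: 2 scenes, 4 atoms. Both scenes have summary = "".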
| Scene | Sequence | summary | Atoms |
|---|---|---|---|
| Scene 1 | 1 | (empty) | "Alice offered the book." · "She smiled." |
| Scene 2 | 2 | (empty) | "Bob accepted it gratefully." · "He nodded once." |
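The plain-text rules above can be sketched in a few lines of Python. This is a minimal illustration, not the shipped implementation: the function name `segment_plain_text`, the dictionary shape, the front-matter pattern, and the sentence regex are all assumptions made for the example.

```python
import re

# Assumed front-matter shape: a leading block delimited by '---' lines.
FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
# Scene boundary: two or more newlines (whitespace-only lines tolerated).
SCENE_BREAK = re.compile(r"\n\s*\n")
# Sentence boundary: whitespace that follows . ! ? (optionally plus one
# closing quote or bracket) and precedes an upper-case letter.
# Illustrative only; the real regex is not reproduced here.
SENTENCE_BREAK = re.compile(
    r"""(?:(?<=[.!?])|(?<=[.!?]["')\]]))\s+(?=[A-Z])"""
)


def segment_plain_text(text: str) -> list[dict]:
    """Split raw text into scenes (paragraphs) and atoms (sentences)."""
    body = FRONT_MATTER.sub("", text, count=1).strip()
    scenes = []
    for paragraph in SCENE_BREAK.split(body):
        paragraph = " ".join(paragraph.split())  # collapse line wraps
        if not paragraph:
            continue
        atoms = [s for s in SENTENCE_BREAK.split(paragraph) if s]
        scenes.append(
            {"sequence": len(scenes) + 1, "summary": "", "atoms": atoms}
        )
    return scenes
```

Calling `segment_plain_text` on the two-paragraph input from the example yields two scenes with two atoms each, matching the table.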
Markdown segmentation (format: "markdown")¶
ATX headings (# through ######) are scene boundaries. The heading text
(without the # sigil) becomes scene.summary. Prose paragraphs under a
heading are merged into that scene — the heading is the scene boundary, not
the blank line. YAML front matter is stripped first.
(?<heading>) ^#{1,6}\s+(.+)$ → new SceneSection(summary=capture)
(?<prose>) non-empty line → accumulated into current section sentences
Markdown segmentation
Input:
## The House
It is very seldom that mere ordinary people secure ancestral halls.
I would say a haunted house—but that would be asking too much of fate!
## The Room
I do not like our room a bit. The color is revolting.
Output: 2 scenes, 4 atoms. Each scene's summary is its heading text.
| Scene | Sequence | summary | Atoms |
|---|---|---|---|
| Scene 1 | 1 | The House | "It is very seldom…" · "I would say a haunted house…" |
| Scene 2 | 2 | The Room | "I do not like our room a bit." · "The color is revolting." |
Multiple paragraphs within a section
All prose paragraphs under one heading are merged into a single scene.
A chapter with five paragraphs and one ## heading produces one scene
with atoms from all five paragraphs — not five scenes.
Headings with no prose are skipped
A heading line immediately followed by another heading (no prose between them) produces no scene. A scene is only created when there is at least one prose sentence under the heading.
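The heading rules can be sketched as follows. This is a simplified illustration under stated assumptions: `segment_markdown` and the dictionary shape are hypothetical names, front-matter stripping is omitted, the sentence splitter is a reduced stand-in, and prose that appears before the first heading is simply ignored because the behaviour for that case is not described here.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.+)$")             # ATX heading, # to ######
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # reduced splitter


def segment_markdown(text: str) -> list[dict]:
    """Turn headings into scene boundaries and merge the prose beneath them."""
    sections = []  # list of (summary, prose_lines) in document order
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(1).strip(), []))
        elif line.strip() and sections:
            # Blank lines are not boundaries; prose accumulates per heading.
            sections[-1][1].append(line.strip())

    scenes = []
    for summary, prose in sections:
        if not prose:
            continue  # a heading with no prose produces no scene
        atoms = SENTENCE_BREAK.split(" ".join(prose))
        scenes.append(
            {"sequence": len(scenes) + 1, "summary": summary, "atoms": atoms}
        )
    return scenes
```

Because sequence numbers are assigned only when a scene is emitted, skipped headings leave no gaps in the numbering.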
Stage 3 — Entity extraction¶
Each scene's sentences are scanned for consecutive capitalised words (1–4
tokens matching [A-Z][a-z]+) that do not appear at the very start of a
sentence. A stop-list of pronouns, conjunctions, and common sentence-initial
words filters obvious false positives.
Confidence is assigned per entity:
A single mention → 0.75 confidence. Each additional mention raises it by 0.05, capped at 0.95.
Entity extraction example
Sentences: "Alice ran. Alice stopped. Bob arrived."
| Name | Mentions | Confidence |
|---|---|---|
| Alice | 2 | 0.80 |
| Bob | 1 | 0.75 |
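The confidence rule is easy to check with a short sketch. The names `entity_confidence` and `extract_entities` and the tiny stop-list are assumptions for illustration. The sketch relies on the stop-list alone and does not skip sentence-initial positions, which is what lets it reproduce the table above.

```python
import re
from collections import Counter

# One to four consecutive capitalised tokens.
NAME_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){0,3}\b")
# Tiny illustrative stop-list; the real list is not shown on this page.
STOP_WORDS = {"The", "She", "He", "It", "They", "And", "But", "Then", "When"}


def entity_confidence(mentions: int) -> float:
    """0.75 for one mention, +0.05 per extra mention, capped at 0.95."""
    return round(min(0.95, 0.75 + 0.05 * (mentions - 1)), 2)


def extract_entities(text: str) -> dict[str, float]:
    """Count capitalised runs and map each name to its confidence."""
    counts: Counter[str] = Counter()
    for run in NAME_RUN.findall(text):
        tokens = run.split()
        # Drop stop-words that lead a run, e.g. "The Alice" becomes "Alice".
        while tokens and tokens[0] in STOP_WORDS:
            tokens.pop(0)
        if tokens:
            counts[" ".join(tokens)] += 1
    return {name: entity_confidence(n) for name, n in counts.items()}
```

With this formula the cap is reached at five mentions (0.75 + 4 × 0.05 = 0.95).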
Stage 4 — Event detection¶
Each sentence is scanned for a verb phrase using a regex that covers:
- Modal + main verb (will run, should leave)
- Be + -ing progressive (was walking, is running)
- Have + past participle (had arrived, has seen)
- Simple inflected verb (walked, stops, running)
The first match per sentence becomes a DetectedEvent. Tense is
inferred from surface markers:
| Marker pattern | Tense |
|---|---|
| will, shall, going to | future |
| was, were, had, -ed endings | past |
| Everything else | present |
All rule-based events receive a base confidence of 0.75.
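The tense table can be read as an ordered set of checks. The sketch below assumes that future markers are tested before past markers (the table's row order) and that the check runs on the matched verb phrase; `infer_tense` is a hypothetical name, and the `-ed` test is deliberately naive.

```python
import re

FUTURE_MARKERS = re.compile(r"\b(?:will|shall|going to)\b", re.IGNORECASE)
# was / were / had, or any word ending in -ed (naive: also matches "bed").
PAST_MARKERS = re.compile(r"\b(?:was|were|had)\b|\b\w+ed\b", re.IGNORECASE)

EVENT_BASE_CONFIDENCE = 0.75  # documented base for rule-based events


def infer_tense(verb_phrase: str) -> str:
    """Map surface markers in a verb phrase to future, past, or present."""
    if FUTURE_MARKERS.search(verb_phrase):
        return "future"
    if PAST_MARKERS.search(verb_phrase):
        return "past"
    return "present"
```

Note that under these rules "has seen" falls through to present, because "has" is not among the listed past markers.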
Stage 5 — Confidence annotation and atom classification¶
The annotator wraps extracted entities and events in domain model objects with
UUIDs and applies the confidence_threshold (default 0.6):
- Items at or above the threshold proceed normally.
- Items below the threshold are flagged needs_review=True but are not discarded. Ambiguity is first-class data; a human or downstream process resolves it.
Atom confidence penalties:
| Condition | Penalty |
|---|---|
| Sentence shorter than 10 characters | −0.15 |
| No terminal punctuation (.!?) | −0.05 |
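A minimal sketch of how the penalties and the threshold interact. Two things here are assumptions: atoms are taken to start from the same 0.75 base as events, and `AnnotatedAtom` and `annotate_atom` are hypothetical stand-ins for the real domain model.

```python
from dataclasses import dataclass

ATOM_BASE_CONFIDENCE = 0.75  # assumption: same base as rule-based events
CONFIDENCE_THRESHOLD = 0.6   # documented default


@dataclass
class AnnotatedAtom:  # hypothetical stand-in for the domain Atom model
    text: str
    confidence: float
    needs_review: bool


def annotate_atom(
    text: str, threshold: float = CONFIDENCE_THRESHOLD
) -> AnnotatedAtom:
    """Apply the two penalties, then flag the atom if it falls below threshold."""
    stripped = text.strip()
    confidence = ATOM_BASE_CONFIDENCE
    if len(stripped) < 10:
        confidence -= 0.15  # very short sentence
    if not stripped.endswith((".", "!", "?")):
        confidence -= 0.05  # no terminal punctuation
    confidence = round(confidence, 2)  # avoid float noise at the boundary
    # Flagged, never discarded: low-confidence atoms stay in the result.
    return AnnotatedAtom(stripped, confidence, needs_review=confidence < threshold)
```

Under these assumptions a single penalty never triggers a flag at the default threshold; only a short sentence that also lacks terminal punctuation (0.55) is marked for review.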
Atom kind classification (heuristic, in priority order):
| Kind | Trigger |
|---|---|
| dialogic | Sentence starts with " / ' / “ / ”, or contains said/asked/replied/whispered/shouted |
| reflexive | Contains thought/felt/wondered/realised/realized/knew |
| transitional | Starts with then/later/meanwhile/afterwards/next/finally |
| expository | Contains was a/ were a/ is a/ are a/ had been |
| descriptive | Default |
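The priority order matters when a sentence matches more than one trigger. The sketch below is an illustration under assumptions: `classify_atom` is a hypothetical name, keywords are matched on word boundaries, and the set of opening quote characters (straight and curly) is a guess at what the heuristic accepts.

```python
import re

OPENING_QUOTES = ('"', "'", "\u201c", "\u201d")  # assumed quote set
DIALOGIC = re.compile(r"\b(?:said|asked|replied|whispered|shouted)\b", re.I)
REFLEXIVE = re.compile(
    r"\b(?:thought|felt|wondered|realised|realized|knew)\b", re.I
)
TRANSITIONAL = re.compile(
    r"(?:then|later|meanwhile|afterwards|next|finally)\b", re.I
)
EXPOSITORY = re.compile(r"\b(?:was a|were a|is a|are a|had been)\b", re.I)


def classify_atom(sentence: str) -> str:
    """Return the first matching kind, checked in priority order."""
    text = sentence.strip()
    if text.startswith(OPENING_QUOTES) or DIALOGIC.search(text):
        return "dialogic"
    if REFLEXIVE.search(text):
        return "reflexive"
    if TRANSITIONAL.match(text):  # match() anchors at the sentence start
        return "transitional"
    if EXPOSITORY.search(text):
        return "expository"
    return "descriptive"
```

For example, "Then he said nothing." starts with a transitional word but is classified as dialogic, because the dialogic check runs first.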
Character–event linking: A character is added as a PARTICIPATES_IN
participant if their name (case-insensitive) appears anywhere in the source
sentence of the event.
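The linking rule amounts to a case-insensitive substring test. A minimal sketch, with `link_participants` as a hypothetical helper name:

```python
def link_participants(event_sentence: str, character_names: list[str]) -> list[str]:
    """Return the characters whose names occur in the event's source sentence."""
    haystack = event_sentence.casefold()
    # Plain substring matching, as described: no tokenisation or word boundaries.
    return [name for name in character_names if name.casefold() in haystack]
```

Because the test is a substring match, a short name such as "Al" would also link to a sentence that mentions "Alice"; that is a property of this sketch and may or may not hold for the real linker.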
Pattern detection (after annotation)¶
After atoms and events are assembled for a scene, PatternService runs
keyword matching against all registered Pattern templates. Matches above
the confidence threshold become PatternInstance nodes linked to the scene.
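How a keyword match turns into a confidence score is not specified here, so the sketch below makes one up for illustration: the score is the fraction of a template's keywords found in the scene text. `PatternTemplate`, `detect_patterns`, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternTemplate:  # hypothetical shape of a registered template
    name: str
    keywords: tuple[str, ...]


def detect_patterns(
    sentences: list[str],
    templates: list[PatternTemplate],
    threshold: float = 0.6,
) -> list[tuple[str, float]]:
    """Score each template by keyword coverage; keep those above threshold."""
    text = " ".join(sentences).casefold()
    matches = []
    for template in templates:
        if not template.keywords:
            continue
        found = sum(1 for kw in template.keywords if kw.casefold() in text)
        score = round(found / len(template.keywords), 2)
        if score > threshold:  # "above the threshold" read as strictly greater
            matches.append((template.name, score))
    return matches
```

Each returned pair would correspond to one PatternInstance linked to the scene.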
From pipeline to graph¶
All in-memory results are written to Neo4j in a single pass using
idempotent MERGE statements. Re-ingesting the same text leaves the graph
unchanged (no duplicate nodes).
```mermaid
sequenceDiagram
    actor User
    participant API
    participant IS as IngestService
    participant PS as PatternService
    participant GR as GraphRepository
    User->>API: POST /v1/notes/import
    API->>IS: ingest(payload)
    IS->>IS: _segment() → SceneSection list
    Note right of IS: markdown: headings → scene boundaries<br/>text: paragraphs → scene boundaries
    loop per scene
        IS->>IS: extract_entities()
        IS->>IS: detect_events()
        IS->>IS: annotate_atoms() / annotate_events()
        IS->>PS: detect_patterns(atoms, events)
        PS-->>IS: pattern_instances[]
        IS->>GR: save_scene(scene)
    end
    GR-->>IS: done
    IS-->>API: IngestResult
    API-->>User: 201 + JSON summary
```
Ingest result¶
A successful ingest returns:
```json
{
  "narrative_id": "550e8400-e29b-41d4-a716-446655440000",
  "scene_count": 2,
  "atom_count": 4,
  "event_count": 3,
  "character_count": 2,
  "pattern_count": 1,
  "flagged_count": 0
}
```
flagged_count is the total number of atoms and events with
needs_review=True. Review flags are queryable directly in Neo4j:
```cypher
MATCH (n:Narrative {id: $id})-[:HAS_SCENE]->(s)-[:CONTAINS]->(a:Atom)
WHERE a.needs_review = true
RETURN s.sequence AS scene, a.text, a.confidence
ORDER BY a.confidence ASC
```
Replacing pipeline stages¶
Each stage is a Strategy. To swap in a heavier NLP backend (spaCy, stanza,
or an LLM classifier), write a function that accepts the same inputs and
returns the same output type (ExtractedEntity, DetectedEvent, or Atom), then
pass it to IngestService during construction. No other code changes are
required.
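The shape of such a swap might look like the sketch below. Everything in it is hypothetical: `IngestServiceSketch` stands in for the real service, whose constructor parameters are not documented here, and `ExtractedEntity` is reduced to two fields. The gazetteer extractor plays the role of a heavier backend.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractedEntity:  # simplified stand-in for the real output type
    name: str
    confidence: float


# A stage is any callable with the agreed input and output types.
EntityExtractor = Callable[[list[str]], list[ExtractedEntity]]


def no_op_extractor(sentences: list[str]) -> list[ExtractedEntity]:
    """Placeholder default used only in this sketch."""
    return []


def make_gazetteer_extractor(known_names: set[str]) -> EntityExtractor:
    """Build an extractor that reports only names from a fixed list."""

    def extract(sentences: list[str]) -> list[ExtractedEntity]:
        text = " ".join(sentences)
        return [
            ExtractedEntity(name, 0.9) for name in sorted(known_names) if name in text
        ]

    return extract


class IngestServiceSketch:  # hypothetical; the real constructor may differ
    def __init__(self, entity_extractor: EntityExtractor = no_op_extractor):
        self._extract_entities = entity_extractor

    def entities_for(self, sentences: list[str]) -> list[ExtractedEntity]:
        # The service only depends on the callable's signature.
        return self._extract_entities(sentences)
```

A spaCy- or LLM-backed extractor would plug in the same way, as long as it returns the same output type.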