Field Manual № 01 Temporal × Ollama × Chroma Vol. 1 · Ed. 2026
◆ A Tutorial · Annotated Source · Live Walkthrough

Building a
retrieval-augmented
memory.

A line-by-line field guide to activities.py — the Temporal worker that turns scattered documents into something an LLM can actually remember.

Activities: 04 · extract · chunk · embed · query
Services: 03 · S3 · Ollama · Chroma
Endpoints hit: 02 · /api/embed · /api/generate
Algorithm: HNSW · cosine, k=8
Contents

Eleven
chapters.

01 · Chapter I

A one-paragraph briefing before the schematics arrive.

The big picture.

This module powers a “chat with your documents” system. Files arrive, get torn down into bite-size chunks, and each chunk is converted into a numerical fingerprint — a vector — that lives in a specialized database. When a user asks a question, we find the chunks whose fingerprints are nearest to the question's own fingerprint, and hand those chunks to a language model along with the question. The model writes the answer using only what we gave it.

Pipeline A
3 activities

Upload

extract → chunk → embed & store

One-time work per document. The expensive bit. After this, the document is searchable.

Pipeline B
1 activity

Query

embed → retrieve → generate

Runs on every question. Fast: one embed call, one DB lookup, one LLM call.

02 · Chapter II

Every component on one page.

Architecture, mapped.

Five parts cooperate: a Temporal worker hosting the four activities, two storage backends, and one LLM runtime that doubles as embedder.

Fig. 2.1 — System topology

USER (browser / CLI) ⇄ TEMPORAL WORKER (activities.py)
  · extract_text_activity: PDF/DOCX/XLSX → text
  · chunk_text_activity: sliding window, 500 w / 100 w overlap
  · embed_and_store_activity: N chunks → N vectors → Chroma
  · execute_query_activity: RAG, embed → retrieve → generate
  · heartbeats · retries · DI seams
The worker talks to three backends:
  · S3 / MinIO (read/write): raw/{doc}.pdf · text/{doc}.txt · answer/{q}.json
  · OLLAMA (HTTP): /api/embed · /api/generate
  · CHROMADB (add/query): collection local_context · HNSW · cosine

“Temporal remembers what ran.
S3 remembers what was made.
Chroma remembers what it means.”

— design principle, this module
03 · Chapter III

Four algorithms doing the heavy lifting. None of them are new. All of them earn their place.

Algorithms in service.

Algorithm № 1

Sliding-window chunking

O(n)
window = 500 words · overlap = 100 words · context preserved at boundaries

Split text into 500-word windows that overlap by 100 words. The overlap means a sentence sitting on a boundary still has its surrounding context preserved in the next chunk — retrieval doesn't miss it.
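In code, the window logic is a few lines. A minimal sketch assuming whitespace tokenization; the function name and signature are illustrative, not the module's actual API:

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Each window advances by (size - overlap) = 400 words, so the last
    # 100 words of one chunk reappear at the start of the next.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final window already reached the end of the text
    return chunks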

Algorithm № 2

Cosine similarity

O(d)
cos(θ) = (a · b) / (‖a‖ · ‖b‖), where a is the query vector and b a chunk vector

Two vectors are similar if they point in the same direction, regardless of length. Magnitude-blind, so a long passage and a short passage are compared on meaning alone. ChromaDB returns distance = 1 − cosine — smaller is better.
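The same arithmetic in numpy, as a sketch of the definition above (not Chroma's internal implementation):

import numpy as np

def chroma_distance(a: np.ndarray, b: np.ndarray) -> float:
    # cos(θ) = a·b / (‖a‖·‖b‖); ChromaDB reports distance = 1 − cos(θ)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cos_theta)

# A vector compared with a scaled copy of itself: same direction, distance ≈ 0.
v = np.array([0.012, -0.034, 0.567])
assert abs(chroma_distance(v, 3 * v)) < 1e-9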

Algorithm № 3

HNSW index

O(log n)
layer 2 (sparse) → layer 1 → layer 0 (dense, all nodes); search descends

Hierarchical Navigable Small Worlds. A multi-layer graph where the top layers are sparse highways and the bottom is dense local streets. Search starts at the top and descends greedily, finding an approximate nearest neighbor among millions of vectors in roughly logarithmic time.
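Chroma builds and tunes its HNSW index internally, but the same algorithm is available stand-alone through the hnswlib package. A minimal sketch; the index size and tuning parameters here are illustrative:

import hnswlib
import numpy as np

dim = 768
vectors = np.random.rand(10_000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)    # distance = 1 − cosine
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(vectors, np.arange(10_000))       # builds the layered graph

index.set_ef(64)                                  # search breadth: higher = better recall
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=8)   # greedy descent, top-8 neighbors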

Algorithm № 4

Transformer embeddings

768-d
"warranty" nomic-embed transformer [ 0.012, −0.034, 0.567, … ] (768 floats)

A neural network reads the text and produces a fixed-length vector. Texts with similar meanings — even using different words — produce vectors that point in similar directions. “warranty period” and “coverage duration” end up neighbors in 768-dimensional space.

04 · Chapter IV

Three activities, run in sequence, that turn a PDF into a searchable memory.

The upload pipeline.

A user drops a file into the system. Three activities fire — each one a discrete, retryable step.

Fig. 4.1 — Upload sequence

01 · EXTRACT: in s3_raw_key, out s3_text_key (pdfplumber, python-docx, openpyxl, plain text)
02 · CHUNK: in s3_text_key, out List[str] (size = 500 words, overlap = 100 words)
03 · EMBED & STORE: for each chunk, vec = ollama.embed() + heartbeat(); then one batched collection.add(...) (→ OLLAMA /api/embed, → CHROMA add())
⚬ each box is a Temporal activity — independently retried, independently observed
⚬ between activities, intermediate state lives in S3 — the workflow stays compact
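Step 03 in sketch form. The helper ollama_embed and the chromadb collection are stand-ins for the module's injected clients; the heartbeat and the deterministic IDs are the pattern to notice:

from temporalio import activity

@activity.defn
async def embed_and_store_activity(doc_id: str, chunks: list[str]) -> int:
    ids, vectors = [], []
    for i, chunk in enumerate(chunks):
        vectors.append(await ollama_embed(chunk))   # one POST /api/embed per chunk
        ids.append(f"{doc_id}_chunk_{i}")           # deterministic: a retry overwrites cleanly
        activity.heartbeat(f"chunk {i + 1}/{len(chunks)}")  # prove liveness to Temporal
    collection.add(ids=ids, embeddings=vectors, documents=chunks)  # one batched write
    return len(chunks)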
05 · Chapter V

A short interlude on what numbers can know.

What an embedding is.

An embedding is a list of numbers — typically 768 of them — that captures what a piece of text means. Two passages that say the same thing using different words land near each other; two passages about unrelated things point in different directions. That's the entire trick.

[Diagram: a 2-D projection. "warranty period is 24 months", "coverage lasts two years", and "valid for 730 days" cluster together; "shipping takes 3-5 days" and "delivered within a week" form a separate cluster; the query "how long is the warranty" (hollow ring) lands inside the warranty cluster.]

Picture vectors as points scattered through 768-dimensional space. We can't visualize 768 dimensions, so the diagram squashes it to 2. The principle survives the squash: related ideas cluster.

When the user asks “how long is the warranty?”, that question becomes its own vector (the hollow ring). It lands inside the warranty cluster. ChromaDB returns the nearest neighbors. The shipping cluster — far away in vector space — is correctly ignored.

06 · Chapter VI

The system's memory.
How vectors live, how they're found.

ChromaDB internals.

Three ideas worth knowing: the parallel-array add() shape, the metadata-driven filter, and the HNSW index sitting underneath.

Fig. 6.1 — How a chunk is stored

collection.add(
    ids=["abc-123_chunk_0", …],
    embeddings=[[0.012, -0.034, …], …],
    metadatas=[{doc_id, filename, chunk_index, total_chunks, uploaded_at}, …],
    documents=["The warranty period is 24 months from date of purchase, per page 4...", …],
)
All four arrays must be the same length — index i across them describes one chunk.
WRITE PATH

add() — bulk index

ChromaDB indexes the embeddings into the HNSW graph as a batch. One round-trip, one rebuild of the graph layers. Calling add per chunk would be ~10× slower for large documents.

READ PATH

query() — k nearest

Pass a query vector and n_results=8. ChromaDB walks the HNSW graph and returns the eight nearest chunks plus their metadata and cosine distances. The activity then filters the results to the enabled doc_ids, if any were specified.
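Both paths, through the real chromadb client. A sketch: the host, the query_vector, and the doc_ids list are placeholders, while the collection name and cosine space come from Fig. 2.1:

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(
    name="local_context",
    metadata={"hnsw:space": "cosine"},      # HNSW with cosine distance
)

results = collection.query(
    query_embeddings=[query_vector],        # the 768-d question vector
    n_results=8,                            # k = 8 nearest chunks
    where={"doc_id": {"$in": doc_ids}},     # optional metadata filter
)
# Parallel arrays again: results["documents"][0], results["metadatas"][0],
# and results["distances"][0] line up index-for-index.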

07 · Chapter VII

One activity. Three external calls. Many subtle decisions.

The query pipeline.

A user types a question. The system embeds it, retrieves nearby chunks from Chroma, constructs a prompt, and asks Ollama for an answer.

Fig. 7.1 — RAG sequence

01 · LOAD: read query JSON from S3 → query_text
02 · EMBED: ollama.embed(q) + heartbeat() → 768-d vector
03 · RETRIEVE: collection.query(), n_results = 8, filter by doc_id
04 · BUILD: join chunks with "\n\n---\n\n", cap at 12 000 chars, + system + question
05 · GENERATE: ollama.generate(), temp 0.3 · ctx 4096, strip <think>, cap 50 k chars → S3: answer/{q}.json + webhook to backend
⚬ the activity heartbeats twice — once before embed, once before generate
⚬ if Chroma is empty, short-circuit with "I do not know."; if no doc passes the filter, same
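The five steps stitched together as one activity skeleton. Every helper here (load_query, ollama_embed, build_prompt, ollama_generate, save_answer) is an illustrative stand-in, not the module's actual function:

from temporalio import activity

@activity.defn
async def execute_query_activity(query_key: str) -> str:
    question = load_query(query_key)                    # 01 · LOAD from S3
    activity.heartbeat("embedding question")
    qvec = await ollama_embed(question)                 # 02 · EMBED, 768-d
    if collection.count() == 0:
        return "I do not know."                         # short-circuit: nothing indexed yet
    hits = collection.query(query_embeddings=[qvec], n_results=8)  # 03 · RETRIEVE
    prompt = build_prompt(question, hits["documents"][0])          # 04 · BUILD
    activity.heartbeat("generating answer")
    answer = await ollama_generate(prompt)              # 05 · GENERATE
    save_answer(query_key, answer)                      # answer/{q}.json + webhook
    return answer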
08 · Chapter VIII

One service, two jobs. Different endpoints, different payload shapes, different failure modes.

Ollama, two ways.

Ollama runs LLMs locally. This module asks it for two things: turn a string into a vector, and turn a prompt into prose. Same server. Two endpoints. Two payload shapes.

EMBED
POST /api/embed

String → vector

Input: any text up to ~6 000 chars after pre-truncation. Output: a list of 768 floats. Used once per chunk at upload time and once per question at query time.

{
  "model": "nomic-embed-text",
  "input": "warranty period 24 months",
  "truncate": true
}
{ "embeddings": [[0.012, -0.034, …]] }
GENERATE
POST /api/generate

Prompt → answer

Input: a system prompt that constrains behavior plus a user prompt carrying the question and retrieved context. Output: natural-language text.

{
  "model": "llama3.1:8b",
  "system": "Answer ONLY from context...",
  "prompt": "CONTEXT: ... QUESTION: ...",
  "stream": false,
  "options": {
    "temperature": 0.3,
    "num_ctx": 4096
  }
}
{ "response": "The warranty is 24 months..." }
09 · Chapter IX

The single most important paragraph of this whole tour.

How Ollama gets context.

Ollama, the language model, knows nothing about your documents. It only knows what you put in its prompt. The art of RAG is composing that prompt so the model has exactly what it needs — and a clear instruction not to make things up.

Anatomy of a RAG prompt
① system prompt
You are Local Context Query. Answer ONLY from provided context.
If context lacks info say 'I do not know.'
Cite document names. Be concise.

— constrains the model's persona and forbids hallucination.

② user prompt — context block
CONTEXT:
[chunk 1: "The warranty period for product Z-9 is 24 months..."]
---
[chunk 2: "Coverage begins on the date of purchase as shown..."]
---
[chunk 3: "Extended warranty options include 36, 48, and 60..."]

— the eight (or fewer) chunks Chroma returned, joined by \n\n---\n\n, capped at 12 000 chars.

③ user prompt — the question
QUESTION:
How long is the warranty?

Answer from context only:

— the literal user question, plus a final framing line that re-anchors the constraint.

resulting answer
The warranty is 24 months from the date of purchase, per the product documentation. Extended options of 36, 48, and 60 months are also available.
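Assembled in code, the anatomy above is one string-building function. A sketch; the constants come straight from the sections marked ① to ③, while the function name is illustrative:

SYSTEM_PROMPT = (
    "You are Local Context Query. Answer ONLY from provided context. "
    "If context lacks info say 'I do not know.' "
    "Cite document names. Be concise."
)

# ① goes into the request's "system" field; ② and ③ form the "prompt" field.
def build_prompt(question: str, chunks: list[str], cap: int = 12_000) -> str:
    context = "\n\n---\n\n".join(chunks)[:cap]   # ② up to 8 chunks, capped at 12 000 chars
    return (
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION:\n{question}\n\n"
        "Answer from context only:"              # ③ re-anchors the constraint
    )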

“The model knows nothing.
The retrieval knows everything.
The system prompt is the bridge.”

10 · Chapter X

What makes this thing not break in production.

Reliability patterns.

Two-tier errors

ApplicationError(non_retryable=True) for permanent failures — empty file, model not pulled, schema violation. Plain Exception for transient — Temporal retries by policy.
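In the Temporal Python SDK the split looks like this; a sketch, with an illustrative check:

from temporalio.exceptions import ApplicationError

def require_content(raw_bytes: bytes) -> None:
    if not raw_bytes:
        # Permanent: retrying an empty file can never succeed, so fail fast
        raise ApplicationError("uploaded file is empty", non_retryable=True)
    # Transient failures (connection refused, HTTP 5xx) are raised as plain
    # exceptions and retried under the activity's RetryPolicy.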

Heartbeats

Inside the embed loop: activity.heartbeat(f"chunk {i}/{n}"). Without it, Temporal would assume the worker died on a long activity and reschedule. Visible as live progress in the Temporal UI.

Idempotent IDs

Chunk IDs {doc_id}_chunk_{i} are deterministic. A retried activity overwrites cleanly — no duplicates, no orphans.

Compact workflow history

Bulky data (raw bytes, extracted text, generated answers) lives in S3. Workflows pass S3 keys, not payloads. Temporal history stays small.

Dependency injection

S3, Chroma, Ollama wrapped in thin classes. Activities accept them via constructor. Tests pass mocks; production passes real clients. No monkeypatching.
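The seam in sketch form. Class and client names are illustrative; the point is that the activity reaches for self.s3, never a module-level global:

from temporalio import activity

class Activities:
    def __init__(self, s3, chroma, ollama):
        # Real clients in production, mocks under test; same code path, no monkeypatching
        self.s3, self.chroma, self.ollama = s3, chroma, ollama

    @activity.defn
    async def extract_text_activity(self, s3_raw_key: str) -> str:
        raw = self.s3.get(s3_raw_key)   # injected client
        ...                             # extract, write text/{doc}.txt, return the key

Bound methods register with the worker just like free functions, so production and tests construct the same class with different clients.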

Fire-and-forget webhooks

Backend notification wrapped in try/except: pass. Answer is already saved; the webhook is best-effort. The activity must not fail on a flaky downstream.
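The whole pattern is a guarded post; a sketch with an assumed payload shape:

import requests

def notify_backend(webhook_url: str, query_id: str) -> None:
    try:
        requests.post(webhook_url, json={"query_id": query_id, "status": "done"}, timeout=5)
    except Exception:
        pass  # the answer is already in S3; the notification is best-effort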

11 · Chapter XI

Where to read deeper. Primary sources only.

References & further reading.

End of manual

Happy retrieving.
