A line-by-line field guide to activities.py — the Temporal worker
that turns scattered documents into something an LLM can actually remember.
A one-paragraph briefing before the schematics arrive.
This module powers a “chat with your documents” system. Files arrive, get torn down into bite-size chunks, and each chunk is converted into a numerical fingerprint — a vector — that lives in a specialized database. When a user asks a question, we find the chunks whose fingerprints are nearest to the question's own fingerprint, and hand those chunks to a language model along with the question. The model writes the answer using only what we gave it.
extract → chunk → embed & store
One-time work per document. The expensive bit. After this, the document is searchable.
embed → retrieve → generate
Runs on every question. Fast: one embed call, one DB lookup, one LLM call.
Every component on one page.
Click any node for the detailed brief.
Five parts cooperate: a Temporal worker hosting the four activities, two storage backends, and one LLM runtime that doubles as embedder. Tap any component below for its full brief.
“Temporal remembers what ran.
S3 remembers what was made.
Chroma remembers what it means.”
Four algorithms doing the heavy lifting. None of them are new. All of them earn their place.
Split text into 500-word windows that overlap by 100 words. The overlap means a sentence sitting on a boundary still has its surrounding context preserved in the next chunk — retrieval doesn't miss it.
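A minimal sketch of that windowing, assuming whole-word splitting (function name and details are illustrative, not lifted from activities.py):

def chunk_words(text, size=500, overlap=100):
    """Overlapping word windows; each step advances size - overlap words."""
    words = text.split()
    step = size - overlap                  # 400 fresh words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                          # the last window already reaches the end
    return chunks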
Two vectors are similar if they point in the same direction, regardless of length.
Magnitude-blind, so a long passage and a short passage are compared on meaning alone.
ChromaDB returns distance = 1 − cosine — smaller is better.
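The same idea in a few lines of illustrative Python:

import numpy as np

def cosine_distance(a, b):
    """Chroma's cosine space: 1 - cos(angle between a and b); 0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))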
Hierarchical Navigable Small Worlds. A multi-layer graph where the top layers are sparse highways and the bottom is dense local streets. Search starts at the top, descending greedily. Find your nearest neighbor among millions of vectors in logarithmic time.
A neural network reads the text and produces a fixed-length vector. Texts with similar meanings — even using different words — produce vectors that point in similar directions. “warranty period” and “coverage duration” end up neighbors in 768-dimensional space.
Three activities, run in sequence, that turn a PDF into a searchable memory.
A user drops a file into the system. Three activities fire — each one a discrete, retryable step. Click any node in the flow below.
A short interlude on what numbers can know.
An embedding is a list of numbers — typically 768 of them — that captures what a piece of text means. Two passages that say the same thing using different words land near each other; two passages that disagree point in different directions. That's the entire trick.
Picture vectors as points scattered through 768-dimensional space. We can't visualize 768 dimensions, so the diagram squashes it to 2. The principle survives the squash: related ideas cluster.
When the user asks “how long is the warranty?”, that question becomes its own vector (the hollow ring). It lands inside the warranty cluster. ChromaDB returns the nearest neighbors. The shipping cluster — far away in vector space — is correctly ignored.
The system's memory.
How vectors live, how they're found.
Three ideas worth knowing: the parallel-array add() shape,
the metadata-driven filter, and the HNSW index sitting underneath.
ChromaDB indexes the embeddings into the HNSW graph as a batch. One round-trip,
one rebuild of the graph layers. Calling add per chunk
would be ~10× slower for large documents.
Pass a query vector and n_results=8. ChromaDB walks the
HNSW graph, returns the eight nearest chunks plus their metadata and cosine distances.
The activity then filters by enabled doc_id if specified.
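Both shapes, sketched with the names used elsewhere in this guide (collection, vectors, chunks, query_vector, and enabled_doc_ids are hypothetical variables; the real call sites may differ):

# Parallel arrays: ids[i], embeddings[i], documents[i], metadatas[i] all describe chunk i.
collection.add(
    ids=[f"{doc_id}_chunk_{i}" for i in range(len(chunks))],
    embeddings=vectors,                      # one 768-float list per chunk
    documents=chunks,                        # the raw chunk text
    metadatas=[{"doc_id": doc_id} for _ in chunks],
)

# One query vector in, the eight nearest chunks out, then a metadata filter.
hits = collection.query(query_embeddings=[query_vector], n_results=8,
                        include=["documents", "metadatas", "distances"])
kept = [
    (doc, meta, dist)
    for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0])
    if not enabled_doc_ids or meta["doc_id"] in enabled_doc_ids
]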
One activity. Three external calls. Many subtle decisions.
A user types a question. The system embeds it, retrieves nearby chunks from Chroma, constructs a prompt, and asks Ollama for an answer. Click any step.
One service, two jobs. Different endpoints, different payload shapes, different failure modes.
Ollama runs LLMs locally. This module asks it for two things: turn a string into a vector, and turn a prompt into prose. Same server. Two endpoints. Two payload shapes.
Input: any text up to ~6 000 chars after pre-truncation.
Output: a list of 768 floats. Used once per chunk at upload time and
once per question at query time.
{
"model": "nomic-embed-text",
"input": "warranty period 24 months",
"truncate": true
}
{ "embeddings": [[0.012, -0.034, …]] }
Input: a system prompt that constrains behavior plus a user prompt carrying the question and retrieved context. Output: natural-language text.
{
"model": "llama3.1:8b",
"system": "Answer ONLY from context...",
"prompt": "CONTEXT: ... QUESTION: ...",
"stream": false,
"options": {
"temperature": 0.3,
"num_ctx": 4096
}
}
{ "response": "The warranty is 24 months..." }
The single most important paragraph of this whole tour.
Ollama, the language model, knows nothing about your documents. It only knows what you put in its prompt. The art of RAG is composing that prompt so the model has exactly what it needs — and a clear instruction not to make things up.
System prompt — constrains the model's persona and forbids hallucination.
Context — the eight (or fewer) chunks Chroma returned, joined by \n\n---\n\n, capped at 12 000 chars.
Question — the literal user question, plus a final framing line that re-anchors the constraint.
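One way those three pieces might be stitched together; the strings here paraphrase the payload example above rather than quote the real prompts:

SYSTEM_PROMPT = ("You answer ONLY from the provided context. "
                 "If the context does not contain the answer, say you do not know.")

def build_prompt(question, chunks, max_context_chars=12_000):
    context = "\n\n---\n\n".join(chunks)[:max_context_chars]
    prompt = (f"CONTEXT:\n{context}\n\n"
              f"QUESTION: {question}\n\n"
              "Answer using only the context above.")
    return SYSTEM_PROMPT, prompt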
“The model knows nothing.
The retrieval knows everything.
The system prompt is the bridge.”
What makes this thing not break in production.
ApplicationError(non_retryable=True) for permanent failures
— empty file, model not pulled, schema violation. Plain
Exception for transient — Temporal retries by policy.
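A sketch of that split, assuming the Temporal Python SDK (temporalio); the helper name is made up:

from temporalio.exceptions import ApplicationError

def ensure_extraction(text: str) -> str:
    if not text.strip():
        # Permanent: a broken or empty PDF will still be broken on the next attempt.
        raise ApplicationError("empty extraction", non_retryable=True)
    return text

# Everything else (network blips, Ollama busy, S3 timeouts) is raised as a plain
# Exception, which Temporal retries according to the activity's retry policy.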
Inside the embed loop: activity.heartbeat(f"chunk {i}/{n}").
Without it, Temporal would assume the worker died on a long activity and reschedule.
Visible as live progress in the Temporal UI.
Chunk IDs {doc_id}_chunk_{i} are deterministic.
A retried activity overwrites cleanly — no duplicates, no orphans.
Bulky data (raw bytes, extracted text, generated answers) lives in S3. Workflows pass S3 keys, not payloads. Temporal history stays small.
S3, Chroma, Ollama wrapped in thin classes. Activities accept them via constructor. Tests pass mocks; production passes real clients. No monkeypatching.
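The shape of that wiring, sketched with illustrative names:

from temporalio import activity

class Activities:
    """Clients arrive via the constructor; tests inject mocks, production injects real ones."""

    def __init__(self, s3, chroma, ollama):
        self._s3 = s3          # S3Client wrapper (get_bytes / put_bytes)
        self._chroma = chroma  # Chroma collection handle
        self._ollama = ollama  # OllamaClient (embed / generate)

    @activity.defn
    async def chunk_document(self, doc_id: str) -> str:
        text = self._s3.get_bytes(f"text/{doc_id}.txt").decode("utf-8")
        ...  # split into windows, write chunks back to S3, return the key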
Backend notification wrapped in try/except: pass.
Answer is already saved; the webhook is best-effort. The activity must not fail
on a flaky downstream.
Where to read deeper. Primary sources only.
Happy retrieving.
Not part of activities.py, but shapes its API. The frontend
uploads files and writes query JSON directly to S3, then starts a Temporal workflow
that references the S3 key.
Why this indirection? It keeps the Temporal workflow's history compact — workflows remember the key, not the bytes. A 50 MB PDF doesn't bloat the orchestration database.
A long-running Python process that polls Temporal Server for activity tasks. When tasks arrive, it runs the appropriate Python function and reports the result back.
Three properties make it special: activities are discrete, retryable steps, so a failure inside embed_and_store retries just that activity, without the upload steps; transient errors are retried by policy while permanent ones are marked non-retryable; and long activities heartbeat, so a slow worker isn't mistaken for a dead one.
Object storage. Holds three categories of blob: raw uploads, extracted text, and generated answers.
S3Client wraps boto3 with two methods —
get_bytes and put_bytes. The
endpoint URL is configurable, so the same code talks to AWS, MinIO, or LocalStack.
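A minimal version of such a wrapper (bucket handling and defaults are assumptions):

import boto3

class S3Client:
    def __init__(self, bucket: str, endpoint_url: str | None = None):
        # endpoint_url lets the same code point at AWS, MinIO, or LocalStack.
        self._s3 = boto3.client("s3", endpoint_url=endpoint_url)
        self._bucket = bucket

    def get_bytes(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    def put_bytes(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)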
A local LLM runtime. Runs models on the user's machine — no API keys, no rate limits, no data leaving the host. This module talks to it via two HTTP endpoints:
Turns text into a 768-dimensional vector. Called once per chunk at upload, once per question at query.
Turns a (system + user) prompt into a natural-language answer. Called once per query.
The class also handles the legacy /api/embeddings endpoint
as a fallback for older Ollama versions.
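A sketch of the embed path with that fallback, using the payload shapes shown earlier (host, error handling, and the 404 check are assumptions):

import requests

class OllamaClient:
    def __init__(self, base_url="http://localhost:11434", embed_model="nomic-embed-text"):
        self.base_url, self.embed_model = base_url, embed_model

    def embed(self, text: str) -> list[float]:
        # Current Ollama: POST /api/embed returns {"embeddings": [[...]]}
        r = requests.post(f"{self.base_url}/api/embed",
                          json={"model": self.embed_model, "input": text, "truncate": True})
        if r.status_code == 404:
            # Legacy Ollama: POST /api/embeddings returns {"embedding": [...]}
            r = requests.post(f"{self.base_url}/api/embeddings",
                              json={"model": self.embed_model, "prompt": text})
            r.raise_for_status()
            return r.json()["embedding"]
        r.raise_for_status()
        return r.json()["embeddings"][0]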
A vector database — purpose-built for storing high-dimensional embeddings and finding nearest neighbors quickly.
One collection: local_context. Configured for cosine similarity:
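For instance (client flavor and connection details are assumptions; the metadata key is Chroma's standard way to request cosine):

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(
    name="local_context",
    metadata={"hnsw:space": "cosine"},   # Chroma defaults to L2; cosine must be asked for
)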
Underneath, Chroma uses HNSW (Hierarchical Navigable Small Worlds) to make nearest-neighbor lookups O(log n) instead of O(n). With 100 000 chunks, the difference is "instantaneous" vs. "noticeable lag."
→ Chroma docs
Reads a file from S3 and extracts its plain-text content. Dispatches to a parser based on file extension.
Output is written back to S3 at text/{doc_id}.txt. The
activity returns just the key — keeping the workflow history small.
Empty extraction → ApplicationError(non_retryable=True).
No point retrying a broken PDF.
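A sketch of that dispatch; the supported extensions and parser helpers (extract_pdf, extract_docx) are illustrative, not a list from activities.py:

from pathlib import Path
from temporalio.exceptions import ApplicationError

def extract_text(filename: str, raw: bytes) -> str:
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return extract_pdf(raw)                       # hypothetical PDF parser helper
    if suffix in (".txt", ".md"):
        return raw.decode("utf-8", errors="replace")
    if suffix == ".docx":
        return extract_docx(raw)                      # hypothetical DOCX parser helper
    raise ApplicationError(f"unsupported file type: {suffix}", non_retryable=True)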
Reads the extracted text from S3 and splits it into overlapping windows.
The 100-word overlap is critical: a sentence at a chunk boundary still has surrounding context preserved in the next chunk, so retrieval doesn't miss it.
Math: with size=500/overlap=100, each chunk advances 500 − 100 = 400 fresh words, so a 5000-word document → ⌈5000/400⌉ = 13 chunks.
The expensive one. For each chunk:
activity.heartbeat(f"chunk {i+1}/{n}")
ollama.embed(chunk) → 768 floats
After the loop, one batched collection.add(...) call
writes everything to Chroma. Per-chunk inserts would be ~10× slower at scale.
Chunk IDs are deterministic ({doc_id}_chunk_{i}), so retrying
after a partial failure overwrites instead of duplicating.
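The step end to end, as an illustrative sketch (argument list and return value are guesses):

from temporalio import activity

def embed_all(doc_id: str, chunks: list[str], ollama, collection) -> int:
    ids, vectors = [], []
    for i, chunk in enumerate(chunks):
        activity.heartbeat(f"chunk {i + 1}/{len(chunks)}")  # liveness signal + visible progress
        ids.append(f"{doc_id}_chunk_{i}")                   # deterministic, so retries can't duplicate
        vectors.append(ollama.embed(chunk))                 # 768 floats per chunk
    metadatas = [{"doc_id": doc_id} for _ in chunks]
    collection.add(ids=ids, embeddings=vectors, documents=chunks, metadatas=metadatas)  # one batched write
    return len(chunks)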
Read query/{query_id}.json from S3. The frontend wrote this JSON before starting the workflow. Why not pass it as a workflow argument? Same reason as uploads: workflows remember S3 keys, not payloads, so the Temporal event history stays small.
The same OllamaClient.embed() used at upload time is now
called on the user's question. Critical: the question and the chunks must use
the same embedding model. Different models produce vectors in incompatible
spaces.
A heartbeat is fired before the call: activity.heartbeat("Embedding query").
The Temporal UI shows this as live progress.
Ask Chroma for the 8 chunks whose vectors are nearest to the query vector
(n_results = min(8, count)). Chroma walks its HNSW index and
returns matches in milliseconds.
Then comes the per-document filter: if the user enabled only documents A and C,
drop any returned chunk whose doc_id is something else.
Edge cases handled: empty collection → "I do not know — no documents uploaded." All-filtered-out → "I do not know — no relevant info."
Three pieces are assembled: the system prompt that forbids answering outside the context; the retrieved chunks, joined by \n\n---\n\n and capped at 12 000 chars; and the user's question with a final line that re-anchors the constraint.
The 12 000-char cap exists because of the "lost in the middle" phenomenon — LLMs pay less attention to information buried mid-context. Keeping context tight keeps recall high.
ollama.generate(model, prompt, system). Critical settings:
temperature: 0.3 — low enough to suppress flowery rambling, high enough to handle paraphrasing.
num_ctx: 4096 — the model's attention window. Big enough for context + question + answer.
stream: false — we want the full string, not chunks.
Post-processing: strip <think>...</think>
blocks (chain-of-thought from reasoning models), cap output at 50 000 chars.
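The call plus the clean-up, sketched from the payload and settings above (the regex and the 50 000-char cap follow the text; the rest is assumption):

import re
import requests

def generate_answer(system: str, prompt: str, base_url="http://localhost:11434") -> str:
    r = requests.post(f"{base_url}/api/generate", json={
        "model": "llama3.1:8b",
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3, "num_ctx": 4096},
    })
    r.raise_for_status()
    answer = r.json()["response"]
    answer = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL)  # strip chain-of-thought
    return answer[:50_000].strip()                                       # hard output cap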
Final step: write {query_id, answer, sources} to
s3://answer/{query_id}.json and POST a fire-and-forget
webhook to the backend.
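A sketch of that final write and the best-effort notification (the webhook URL and payload shape are assumptions):

import json
import requests

def publish_answer(s3, query_id: str, answer: str, sources: list, backend_url: str) -> None:
    payload = {"query_id": query_id, "answer": answer, "sources": sources}
    s3.put_bytes(f"answer/{query_id}.json", json.dumps(payload).encode("utf-8"))
    try:
        requests.post(backend_url, json=payload, timeout=5)   # fire-and-forget webhook
    except Exception:
        pass  # the answer is already durable in S3; a flaky downstream must not fail the activity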