A line-by-line field guide to activities.py — the Temporal worker
that turns scattered documents into something an LLM can actually remember.
A one-paragraph briefing before the schematics arrive.
This module powers a “chat with your documents” system. Files arrive, get torn down into bite-size chunks, and each chunk is converted into a numerical fingerprint — a vector — that lives in a specialized database. When a user asks a question, we find the chunks whose fingerprints are nearest to the question's own fingerprint, and hand those chunks to a language model along with the question. The model writes the answer using only what we gave it.
extract → chunk → embed & store
One-time work per document. The expensive bit. After this, the document is searchable.
embed → retrieve → generate
Runs on every question. Fast: one embed call, one DB lookup, one LLM call.
Every component on one page.
Click any node for the detailed brief.
Five parts cooperate: a Temporal worker hosting the four activities, two storage backends, and one LLM runtime that doubles as embedder. Tap any component below for its full brief.
“Temporal remembers what ran.
S3 remembers what was made.
Chroma remembers what it means.”
Four algorithms doing the heavy lifting. None of them are new. All of them earn their place.
Split text into 500-word windows that overlap by 100 words. The overlap means a sentence sitting on a boundary still has its surrounding context preserved in the next chunk — retrieval doesn't miss it.
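A minimal sketch of that windowing, assuming whole-word splitting (function name and details are illustrative, not lifted from activities.py):

def chunk_words(text, size=500, overlap=100):
    """Overlapping word windows; each step advances size - overlap words."""
    words = text.split()
    step = size - overlap                  # 400 fresh words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                          # the last window already reaches the end
    return chunks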
Two vectors are similar if they point in the same direction, regardless of length.
Magnitude-blind, so a long passage and a short passage are compared on meaning alone.
ChromaDB returns distance = 1 − cosine — smaller is better.
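The same idea in a few lines of illustrative Python:

import numpy as np

def cosine_distance(a, b):
    """Chroma's cosine space: 1 - cos(angle between a and b); 0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))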
Hierarchical Navigable Small Worlds. A multi-layer graph where the top layers are sparse highways and the bottom is dense local streets. Search starts at the top, descending greedily. Find your nearest neighbor among millions of vectors in logarithmic time.
A neural network reads the text and produces a fixed-length vector. Texts with similar meanings — even using different words — produce vectors that point in similar directions. “warranty period” and “coverage duration” end up neighbors in 768-dimensional space.
Three activities, run in sequence, that turn a PDF into a searchable memory.
A user drops a file into the system. Three activities fire — each one a discrete, retryable step. Click any node in the flow below.
A short interlude on what numbers can know.
An embedding is a list of numbers — typically 768 of them — that captures what a piece of text means. Two passages that say the same thing using different words land near each other; two passages that disagree point in different directions. That's the entire trick.
Picture vectors as points scattered through 768-dimensional space. We can't visualize 768 dimensions, so the diagram squashes it to 2. The principle survives the squash: related ideas cluster.
When the user asks “how long is the warranty?”, that question becomes its own vector (the hollow ring). It lands inside the warranty cluster. ChromaDB returns the nearest neighbors. The shipping cluster — far away in vector space — is correctly ignored.
The system's memory.
How vectors live, how they're found.
Three ideas worth knowing: the parallel-array add() shape,
the metadata-driven filter, and the HNSW index sitting underneath.
ChromaDB indexes the embeddings into the HNSW graph as a batch. One round-trip,
one rebuild of the graph layers. Calling add per chunk
would be ~10× slower for large documents.
Pass a query vector and n_results=8. ChromaDB walks the
HNSW graph, returns the eight nearest chunks plus their metadata and cosine distances.
The activity then filters by enabled doc_id if specified.
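Both shapes, sketched with the names used elsewhere in this guide (collection, vectors, chunks, query_vector, and enabled_doc_ids are hypothetical variables; the real call sites may differ):

# Parallel arrays: ids[i], embeddings[i], documents[i], metadatas[i] all describe chunk i.
collection.add(
    ids=[f"{doc_id}_chunk_{i}" for i in range(len(chunks))],
    embeddings=vectors,                      # one 768-float list per chunk
    documents=chunks,                        # the raw chunk text
    metadatas=[{"doc_id": doc_id} for _ in chunks],
)

# One query vector in, the eight nearest chunks out, then a metadata filter.
hits = collection.query(query_embeddings=[query_vector], n_results=8,
                        include=["documents", "metadatas", "distances"])
kept = [
    (doc, meta, dist)
    for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0])
    if not enabled_doc_ids or meta["doc_id"] in enabled_doc_ids
]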
One activity. Three external calls. Many subtle decisions.
A user types a question. The system embeds it, retrieves nearby chunks from Chroma, constructs a prompt, and asks Ollama for an answer. Click any step.
One service, two jobs. Different endpoints, different payload shapes, different failure modes.
Ollama runs LLMs locally. This module asks it for two things: turn a string into a vector, and turn a prompt into prose. Same server. Two endpoints. Two payload shapes.
Input: any text up to ~6 000 chars after pre-truncation.
Output: a list of 768 floats. Used once per chunk at upload time and
once per question at query time.
{
"model": "nomic-embed-text",
"input": "warranty period 24 months",
"truncate": true
}
{ "embeddings": [[0.012, -0.034, …]] }
Input: a system prompt that constrains behavior plus a user prompt carrying the question and retrieved context. Output: natural-language text.
{
"model": "llama3.1:8b",
"system": "Answer ONLY from context...",
"prompt": "CONTEXT: ... QUESTION: ...",
"stream": false,
"options": {
"temperature": 0.3,
"num_ctx": 4096
}
}
{ "response": "The warranty is 24 months..." }
The single most important paragraph of this whole tour.
Ollama, the language model, knows nothing about your documents. It only knows what you put in its prompt. The art of RAG is composing that prompt so the model has exactly what it needs — and a clear instruction not to make things up.
System prompt — constrains the model's persona and forbids hallucination.
Context — the eight (or fewer) chunks Chroma returned, joined by \n\n---\n\n, capped at 12 000 chars.
Question — the literal user question, plus a final framing line that re-anchors the constraint.
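One way those three pieces might be stitched together; the strings here paraphrase the payload example above rather than quote the real prompts:

SYSTEM_PROMPT = ("You answer ONLY from the provided context. "
                 "If the context does not contain the answer, say you do not know.")

def build_prompt(question, chunks, max_context_chars=12_000):
    context = "\n\n---\n\n".join(chunks)[:max_context_chars]
    prompt = (f"CONTEXT:\n{context}\n\n"
              f"QUESTION: {question}\n\n"
              "Answer using only the context above.")
    return SYSTEM_PROMPT, prompt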
“The model knows nothing.
The retrieval knows everything.
The system prompt is the bridge.”
What makes this thing not break in production.
ApplicationError(non_retryable=True) for permanent failures
— empty file, model not pulled, schema violation. Plain
Exception for transient — Temporal retries by policy.
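A sketch of that split, assuming the Temporal Python SDK (temporalio); the helper name is made up:

from temporalio.exceptions import ApplicationError

def ensure_extraction(text: str) -> str:
    if not text.strip():
        # Permanent: a broken or empty PDF will still be broken on the next attempt.
        raise ApplicationError("empty extraction", non_retryable=True)
    return text

# Everything else (network blips, Ollama busy, S3 timeouts) is raised as a plain
# Exception, which Temporal retries according to the activity's retry policy.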
Inside the embed loop: activity.heartbeat(f"chunk {i}/{n}").
Without it, Temporal would assume the worker died on a long activity and reschedule.
Visible as live progress in the Temporal UI.
Chunk IDs {doc_id}_chunk_{i} are deterministic.
A retried activity overwrites cleanly — no duplicates, no orphans.
Bulky data (raw bytes, extracted text, generated answers) lives in S3. Workflows pass S3 keys, not payloads. Temporal history stays small.
S3, Chroma, Ollama wrapped in thin classes. Activities accept them via constructor. Tests pass mocks; production passes real clients. No monkeypatching.
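The shape of that wiring, sketched with illustrative names:

from temporalio import activity

class Activities:
    """Clients arrive via the constructor; tests inject mocks, production injects real ones."""

    def __init__(self, s3, chroma, ollama):
        self._s3 = s3          # S3Client wrapper (get_bytes / put_bytes)
        self._chroma = chroma  # Chroma collection handle
        self._ollama = ollama  # OllamaClient (embed / generate)

    @activity.defn
    async def chunk_document(self, doc_id: str) -> str:
        text = self._s3.get_bytes(f"text/{doc_id}.txt").decode("utf-8")
        ...  # split into windows, write chunks back to S3, return the key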
Backend notification wrapped in try/except: pass.
Answer is already saved; the webhook is best-effort. The activity must not fail
on a flaky downstream.
Where to read deeper. Primary sources only.
Happy retrieving.
Not part of activities.py, but shapes its API. The frontend
uploads files and writes query JSON directly to S3, then starts a Temporal workflow
that references the S3 key.
Why this indirection? It keeps the Temporal workflow's history compact — workflows remember the key, not the bytes. A 50 MB PDF doesn't bloat the orchestration database.
A long-running Python process that polls Temporal Server for activity tasks. When tasks arrive, it runs the appropriate Python function and reports the result back.
Three properties make it special: activities are discrete, retryable steps, so a failure inside embed_and_store retries just that activity, without the upload steps; transient errors are retried by policy while permanent ones are marked non-retryable; and long activities heartbeat, so a slow worker isn't mistaken for a dead one.
Object storage. Holds three categories of blob: raw uploads, extracted text, and generated answers.
S3Client wraps boto3 with two methods —
get_bytes and put_bytes. The
endpoint URL is configurable, so the same code talks to AWS, MinIO, or LocalStack.
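A minimal version of such a wrapper (bucket handling and defaults are assumptions):

import boto3

class S3Client:
    def __init__(self, bucket: str, endpoint_url: str | None = None):
        # endpoint_url lets the same code point at AWS, MinIO, or LocalStack.
        self._s3 = boto3.client("s3", endpoint_url=endpoint_url)
        self._bucket = bucket

    def get_bytes(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    def put_bytes(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)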
A local LLM runtime. Runs models on the user's machine — no API keys, no rate limits, no data leaving the host. This module talks to it via two HTTP endpoints:
Turns text into a 768-dimensional vector. Called once per chunk at upload, once per question at query.
Turns a (system + user) prompt into a natural-language answer. Called once per query.
The class also handles the legacy /api/embeddings endpoint
as a fallback for older Ollama versions.
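A sketch of the embed path with that fallback, using the payload shapes shown earlier (host, error handling, and the 404 check are assumptions):

import requests

class OllamaClient:
    def __init__(self, base_url="http://localhost:11434", embed_model="nomic-embed-text"):
        self.base_url, self.embed_model = base_url, embed_model

    def embed(self, text: str) -> list[float]:
        # Current Ollama: POST /api/embed returns {"embeddings": [[...]]}
        r = requests.post(f"{self.base_url}/api/embed",
                          json={"model": self.embed_model, "input": text, "truncate": True})
        if r.status_code == 404:
            # Legacy Ollama: POST /api/embeddings returns {"embedding": [...]}
            r = requests.post(f"{self.base_url}/api/embeddings",
                              json={"model": self.embed_model, "prompt": text})
            r.raise_for_status()
            return r.json()["embedding"]
        r.raise_for_status()
        return r.json()["embeddings"][0]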
A vector database — purpose-built for storing high-dimensional embeddings and finding nearest neighbors quickly.
One collection: local_context. Configured for cosine similarity:
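For instance (client flavor and connection details are assumptions; the metadata key is Chroma's standard way to request cosine):

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(
    name="local_context",
    metadata={"hnsw:space": "cosine"},   # Chroma defaults to L2; cosine must be asked for
)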
Underneath, Chroma uses HNSW (Hierarchical Navigable Small Worlds) to make nearest-neighbor lookups O(log n) instead of O(n). With 100 000 chunks, the difference is "instantaneous" vs. "noticeable lag."
→ Chroma docs
Reads a file from S3 and extracts its plain-text content. Dispatches to a parser based on file extension.
Output is written back to S3 at text/{doc_id}.txt. The
activity returns just the key — keeping the workflow history small.
Empty extraction → ApplicationError(non_retryable=True).
No point retrying a broken PDF.
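A sketch of that dispatch; the supported extensions and parser helpers (extract_pdf, extract_docx) are illustrative, not a list from activities.py:

from pathlib import Path
from temporalio.exceptions import ApplicationError

def extract_text(filename: str, raw: bytes) -> str:
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return extract_pdf(raw)                       # hypothetical PDF parser helper
    if suffix in (".txt", ".md"):
        return raw.decode("utf-8", errors="replace")
    if suffix == ".docx":
        return extract_docx(raw)                      # hypothetical DOCX parser helper
    raise ApplicationError(f"unsupported file type: {suffix}", non_retryable=True)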
Reads the extracted text from S3 and splits it into overlapping windows.
The 100-word overlap is critical: a sentence at a chunk boundary still has surrounding context preserved in the next chunk, so retrieval doesn't miss it.
Math: with size=500/overlap=100, each chunk advances 500 − 100 = 400 fresh words, so a 5000-word document → ⌈5000/400⌉ = 13 chunks.
The expensive one. For each chunk:
activity.heartbeat(f"chunk {i+1}/{n}")
ollama.embed(chunk) → 768 floats
After the loop, one batched collection.add(...) call
writes everything to Chroma. Per-chunk inserts would be ~10× slower at scale.
Chunk IDs are deterministic ({doc_id}_chunk_{i}), so retrying
after a partial failure overwrites instead of duplicating.
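The step end to end, as an illustrative sketch (argument list and return value are guesses):

from temporalio import activity

def embed_all(doc_id: str, chunks: list[str], ollama, collection) -> int:
    ids, vectors = [], []
    for i, chunk in enumerate(chunks):
        activity.heartbeat(f"chunk {i + 1}/{len(chunks)}")  # liveness signal + visible progress
        ids.append(f"{doc_id}_chunk_{i}")                   # deterministic, so retries can't duplicate
        vectors.append(ollama.embed(chunk))                 # 768 floats per chunk
    metadatas = [{"doc_id": doc_id} for _ in chunks]
    collection.add(ids=ids, embeddings=vectors, documents=chunks, metadatas=metadatas)  # one batched write
    return len(chunks)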
Read query/{query_id}.json from S3. The frontend wrote this JSON before starting the workflow. Why not pass it as a workflow argument? Same reason as uploads: workflows remember S3 keys, not payloads, so the Temporal event history stays small.
The same OllamaClient.embed() used at upload time is now
called on the user's question. Critical: the question and the chunks must use
the same embedding model. Different models produce vectors in incompatible
spaces.
A heartbeat is fired before the call: activity.heartbeat("Embedding query").
The Temporal UI shows this as live progress.
Ask Chroma for the 8 chunks whose vectors are nearest to the query vector
(n_results = min(8, count)). Chroma walks its HNSW index and
returns matches in milliseconds.
Then comes the per-document filter: if the user enabled only documents A and C,
drop any returned chunk whose doc_id is something else.
Edge cases handled: empty collection → "I do not know — no documents uploaded." All-filtered-out → "I do not know — no relevant info."
Three pieces are assembled: the system prompt that forbids answering outside the context; the retrieved chunks, joined by \n\n---\n\n and capped at 12 000 chars; and the user's question with a final line that re-anchors the constraint.
The 12 000-char cap exists because of the "lost in the middle" phenomenon — LLMs pay less attention to information buried mid-context. Keeping context tight keeps recall high.
ollama.generate(model, prompt, system). Critical settings:
temperature: 0.3 — low enough to suppress flowery rambling, high enough to handle paraphrasing.
num_ctx: 4096 — the model's attention window. Big enough for context + question + answer.
stream: false — we want the full string, not chunks.
Post-processing: strip <think>...</think>
blocks (chain-of-thought from reasoning models), cap output at 50 000 chars.
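The call plus the clean-up, sketched from the payload and settings above (the regex and the 50 000-char cap follow the text; the rest is assumption):

import re
import requests

def generate_answer(system: str, prompt: str, base_url="http://localhost:11434") -> str:
    r = requests.post(f"{base_url}/api/generate", json={
        "model": "llama3.1:8b",
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3, "num_ctx": 4096},
    })
    r.raise_for_status()
    answer = r.json()["response"]
    answer = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL)  # strip chain-of-thought
    return answer[:50_000].strip()                                       # hard output cap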
Final step: write {query_id, answer, sources} to
s3://answer/{query_id}.json and POST a fire-and-forget
webhook to the backend.
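A sketch of that final write and the best-effort notification (the webhook URL and payload shape are assumptions):

import json
import requests

def publish_answer(s3, query_id: str, answer: str, sources: list, backend_url: str) -> None:
    payload = {"query_id": query_id, "answer": answer, "sources": sources}
    s3.put_bytes(f"answer/{query_id}.json", json.dumps(payload).encode("utf-8"))
    try:
        requests.post(backend_url, json=payload, timeout=5)   # fire-and-forget webhook
    except Exception:
        pass  # the answer is already durable in S3; a flaky downstream must not fail the activity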