Product Block 06 · Sellable today
Document AI Search
Upload contracts, manuals, policies. Employees ask in plain language and get the answer with the citation.
Best fit: Law firms (precedent search), accountants (tax memos), HOAs (bylaws), large dental/medical groups (protocols), property managers (lease libraries).
Document AI Search — build it without writing code
Drop the prompt below into Claude Code or Codex. The agent picks a knowledge-worker firm with a doc-search pain, ingests 20-50 of their PUBLIC docs into Vectorize, builds the search + citation interface, runs 5-10 prospect-specific test queries, and produces a 90-second Loom + cold pitch.
You provide
You provide: (1) prospect firm name + URL, (2) link to their published bylaws / engagement letter / FAQ docs (or the human can email them and ask), (3) decision-maker email.
You get back
You get: a hosted password-gated search interface with their public docs ingested, 5-10 working query examples + citations, a 90-second Loom of search→answer→source-doc, and a draft cold pitch.
Runtime & cost
Roughly 90 minutes wall-clock (depends on doc count). ~$2-3 in embedding + Claude tokens.
You are building a Document AI Search (Block 06 in the Cafecito AI new-hire playbook). Full reference at https://cafecito-ai.com/new-hire/blocks/06-document-ai-search. Read it. Use plan mode. Stop at every [GATE].
INPUTS YOU NEED FROM THE HUMAN (ask before doing anything else):
1. Prospect firm name + URL (mid-size law / accounting / HOA management / dental group)
2. Link to or upload of 20-50 public docs (bylaws / engagement letters / FAQs / policies). If they don't share publicly, the human emails them with: "Want a free demo of doc search on YOUR docs? Send 5 sample PDFs."
3. Decision-maker email (managing partner / office manager / GM)
ENVIRONMENT (verify):
- Working dir: /home/eratner/cafecito-ai
- Cloudflare account: f7a9b24f679e1d3952921ee5e72e677e
- Reference scaffold: /home/eratner/pkm-wiki/
SECRETS TO CONFIRM:
- ANTHROPIC_API_KEY (Claude for synthesis + citation)
- (Vectorize uses Workers AI binding — no separate key)
THE PLAN:
STEP 1 — RESEARCH + DOC GATHERING (15 min, mostly waiting on human)
- Pull prospect site. Identify firm size, practice areas, public-facing doc URLs.
- Wait for human to provide 20-50 docs. PDFs preferred; DOCX + MD also OK.
- Pre-qualify: at least 20 docs OR at least 200 pages of text content. If less, ask the human to gather more before proceeding.
[GATE 1 — show doc count + page-count estimate, ask "proceed?"]
STEP 2 — SCAFFOLD + INGEST (20 min)
- Create /home/eratner/cafecito-ai/docsearch-<prospect-slug>/.
- Use Prompt #1 from the block page ("Ingestion pipeline scaffold") to generate worker.js + wrangler.jsonc + D1 migration + Vectorize index config.
- Bindings: R2 DOC_R2 (`wrangler r2 bucket create docsearch-<slug>-r2`), Vectorize CORPUS_INDEX (`wrangler vectorize create docsearch-<slug>-corpus --dimensions=1024 --metric=cosine`), D1 DOC_LEDGER (`wrangler d1 create docsearch-<slug>-db`), Workers AI binding.
- Deploy the scaffold (no docs yet).
- Run ingestion: POST /admin/upload with each doc. Verify chunk counts in D1 DOC_LEDGER.
[GATE 2 — show docs ingested + chunk count, ask "proceed to search?"]
STEP 3 — SEARCH + CITATION INTERFACE (15 min)
- Use Prompt #2 ("Search + citation prompt") for the synthesis logic.
- Build the public-facing search UI as a route on the worker (single page, search bar, top-5 chunks, "ask the corpus" button).
- Build the admin upload UI with Prompt #3 ("Admin upload UI").
- Password-gate both via Workers Basic auth (secret ADMIN_TOKEN).
[GATE 3 — show the live URL + password, ask "ready to test?"]
STEP 4 — FIVE-QUERY TEST PASS (15 min)
- Generate 5 prospect-specific test queries (questions their employees would actually ask).
- Run each. Verify: top chunk relevance, citation accuracy, refusal-on-low-confidence behavior, language detection if applicable.
- Edit the prompt or re-chunk if any query returns hallucination.
[GATE 4 — confirm 5/5 queries pass, ask "ready to record demo?"]
STEP 5 — RECORD THE LOOM (5 min, human action)
- Tell human: "Open the search URL. Type one of the test queries. Show the answer + the citation. Click the citation to open the source doc and highlight the page. Save the URL — that's the Loom."
STEP 6 — DRAFT THE COLD PITCH (5 min)
- Email subject: "How long does it take an associate at [FIRM] to find your standard force majeure clause?"
- Email body: 4 sentences, lead with the cost-of-search math ($500/hr partners answering "where is X?" questions), demo URL, $5k setup / $750/mo, one yes/no close.
[GATE 5 — show the draft, ask the human to approve / send manually]
STEP 7 — SHIP THE SUMMARY
- Single-line: "[FIRM] doc search live at [URL] · [N] docs ingested · demo: [LOOM] · pitch sent to [EMAIL]."
DONE.
GUARDRAILS: NEVER ingest non-public docs without a signed letter. Refuse-on-low-confidence (cosine < 0.7) MUST be working before you record the demo. Cost ceiling: $5 in tokens.
01Stack▾
- CF Worker
- R2 (docs)
- Vectorize (embeddings)
- Claude (synthesis + citation)
02Live references▾
Working production examples. Read the code. Steal the pattern.
- Karpathy LLM Wiki Personal-knowledge pattern — same architecture, different corpus
03Day-1 plan▾
A real prospect. A real demo. A real outbound message — all before 5pm.
-
09:00–09:30
Pick a mid-size firm with a known doc-search pain.
Mid-size law firm (10-30 attorneys), accounting practice, large HOA management. Find one where you can credibly ask "where do you store your standard force majeure clause?" and they wince.
-
09:30–10:30
Get a sample of their public documents.
Their published bylaws, public client engagement letter, FAQ docs. 20-50 docs is plenty for a demo. Don't ask for confidential material on Day 1 — public is enough.
-
10:30–11:30
Stand up the Vectorize pipeline.
Clone /home/eratner/pkm-wiki/. Replace corpus with theirs. Chunk size: 800 tokens with 100 overlap (sweet spot for legal/policy). Embed via @cf/baai/bge-large. Index goes into Vectorize.
-
11:30–12:30
Build the search + citation interface.
Single-page app: search bar + result cards (top 5 chunks with scores) + "ask the corpus" button → Claude synthesizes answer WITH citations to source doc + page. Always cite. Never answer without chunks.
-
12:30–13:30
Build the admin upload flow.
Password-gated /admin route. Drag-drop PDFs/DOCX. Worker extracts text, chunks, embeds, writes to Vectorize. Show progress. Version every doc upload. Build this on Day 1 — not optional.
-
13:30–14:30
Build the answer-rejection guard.
If top chunk score below threshold (cosine sim < 0.7), refuse to synthesize. Return "no clear match in your corpus — closest matches were [list]." This single rule prevents 90% of hallucinations.
-
14:30–15:30
Write 5-10 prospect-specific test queries.
Use questions their employees actually ask: "What's our standard NDA mutuality clause?" "How do we handle a refund request after 90 days?" Run each. Verify citations. Note hallucinations.
-
15:30–16:30
Record the Loom demo on real prospect data.
90 seconds: open search, type a real query, see answer + citation, click citation to open source doc. Highlight the page. The "click → source" is the entire trust play.
-
16:30–17:00
Send the cold pitch + demo URL.
Email the managing partner / office manager. Loom + hosted demo URL with their docs (password-gated). Subject: "How long does it take an associate at [FIRM] to find your standard force majeure clause?"
04Best practices & gotchas▾
-
Chunk size 800 tokens with 100 token overlap. Don't experiment.
Why: Legal, policy, and HR docs are dense and reference-heavy. 500-token chunks split mid-clause; 1500-token chunks lose specificity. 800/100 is the empirically good middle.
-
Always cite source doc + page. Never answer without it.
Why: A search system that returns prose without citations is indistinguishable from a chatbot that hallucinates. Lawyers/accountants will not trust an uncited answer.
-
Refuse to answer when top chunk similarity is below 0.7 cosine.
Why: The single biggest source of hallucination is the model trying to be helpful when there's no relevant context. A hard threshold + "no clear match" prevents 90% of bad answers.
-
Version every doc upload. Never overwrite.
Why: Compliance, legal hold, and "what did the bylaws say in 2023" are real questions. Vectorize entries should carry doc_id + version_id + uploaded_at. Storage is free; defensibility is priceless.
-
Detect query language. Search in the corpus language; respond in the query language.
Why: A bilingual workforce queries in Spanish but the corpus is in English. Embedding-search works fine cross-lingually with multilingual models. Synthesizing the answer in the query language closes the loop.
-
Employees can't upload. Admin only.
Why: A document index is only valuable if it's curated. Open uploads = duplicates, bad scans, irrelevant material, governance nightmares. Admin-controlled upload behind auth from day 1.
-
Per-user audit log: who searched what, when, and got which result.
Why: Compliance + discovery in legal/medical contexts ("did anyone search for X before incident Y?"); product improvement (which queries return "no match" → what to add to the corpus).
05Prompts (copy-paste)▾
Drop these into Claude Code. Replace the [BRACKETED] fields with the prospect's details.
Generates the full doc-ingestion CF Worker — upload, text extraction, chunking, embedding, Vectorize indexing.
Build me a CF Worker pipeline for ingesting documents into a Vectorize-backed search index for [BUSINESS NAME].
Pipeline:
- POST /admin/upload (auth: Bearer ADMIN_TOKEN)
- Accepts multipart upload of PDF/DOCX/TXT/MD.
- Stores raw file in R2 with key: docs/<doc_id>/v<version>/<filename>.
- Extracts text using @cf/svg2pdf (PDF) or unzip-based DOCX or plain read.
- Chunks at 800 tokens with 100 overlap.
- Embeds via @cf/baai/bge-m3 (multilingual).
- Writes to Vectorize index "<business-slug>-corpus" with metadata: doc_id, version_id, page_number, chunk_index, source_filename, uploaded_at.
- POST /admin/replace/:doc_id - marks all existing chunks as superseded=true, calls upload with version+1.
- GET /admin/docs - returns list of indexed docs + versions + chunk counts.
- POST /admin/delete/:doc_id - soft-delete with retention.
Bindings: R2 DOC_R2, Vectorize CORPUS_INDEX, D1 DOC_LEDGER (id, doc_id, version, filename, uploaded_at, uploaded_by, chunk_count, status), Workers AI binding AI, secret ADMIN_TOKEN.
Output: complete worker.js (Hono), wrangler.jsonc additions, D1 migration, Vectorize index config, and the admin upload HTML page (drag-drop, progress bar, version-aware).
The Claude prompt that takes top-K chunks + the user query and returns a synthesized answer with hard citations.
You are the search assistant for [BUSINESS NAME]. You answer employee questions using ONLY the document chunks provided. You always cite.
User query: [QUERY]
User's preferred response language: [en|es]
Top 5 retrieved chunks:
[
{ "chunk_index": "...", "source_filename": "...", "page": "...", "version": "...", "similarity": "...", "text": "..." },
... 4 more
]
Rules:
1. If highest similarity score below 0.70, respond ONLY with: "I couldn't find a clear match in your corpus. Closest matches: [list source_filename + page for top 3]." Do not synthesize.
2. If similarity scores above 0.70, synthesize using ONLY the provided chunk text. Do not introduce information from outside.
3. Every factual claim must end with a citation: [source_filename · page X · version Y]. Cite the SPECIFIC chunk(s) used.
4. If chunks contradict each other, surface the contradiction and cite both. Do not pick one.
5. If the question is ambiguous, surface the ambiguity and ask a clarifying question.
6. Respond in the user's preferred language. Chunks may be in any language.
7. Never use phrases like "Based on the documents provided" — just answer.
Output JSON:
{
"answer": "...",
"citations": [{"source_filename": "...", "page": ..., "version": ..., "quoted_text": "..."}],
"confidence": "high|medium|low",
"ambiguity_detected": false|true,
"clarifying_question": "...",
"no_match": false|true,
"no_match_alternatives": [...]
}
Generates the password-gated admin page where the firm's ops person uploads docs and manages versions.
Build me the admin upload page for the document-search system at [BUSINESS NAME].
URL: <domain>/admin (Workers Basic auth, secret ADMIN_TOKEN).
Sections:
1. **Upload zone** (drag-drop): accepts PDF/DOCX/TXT/MD, multi-file. Per-file progress bar. Success: "Indexed [N] chunks in [X]s." Failure shows error inline.
2. **Document library**: all indexed docs with filename | version | uploaded_at | uploaded_by | chunk_count | actions (Replace / View / Soft-delete). Sortable. Filter by status.
3. **Search test bar**: lets the admin run a sample query against the index. Shows top 5 chunks + scores + synthesized answer in real time.
4. **Audit log**: collapsible. Last 50 admin actions.
Style: editorial palette (cream/ink + business accent). Mobile-friendly. Vanilla HTML/CSS/JS. Inline everything.
Auth: Workers Basic auth via Authorization header. On 401, clean password prompt.
Output: complete index.html.
When a doc is replaced, notify the admin + every employee who searched related queries in the last 30 days.
Build me a notification system for the document search at [BUSINESS NAME].
Trigger: when a doc is replaced (POST /admin/replace/:doc_id).
Logic:
1. Pull chunk metadata for the OLD version of the doc.
2. Query the per-user audit log for any user who searched for queries that retrieved chunks from this doc in the last 30 days.
3. For each such user, generate a personalized email:
- Subject: "[FIRM] Knowledge Update: [filename] was updated [date]"
- Body: "Hi [user]. The doc you searched on [past_date] for \"[their query]\" was just updated. Previous version said: [excerpt]. New version says: [excerpt]. [link to new doc] [link to re-run query]"
4. Send via Resend. Cap at one email per user per day.
5. Send the admin a single summary email: "[doc] was replaced. [N] employees notified."
Output: the CF Worker handler function + the email template (HTML + text fallback).
06Selling script▾
Discovery question (ask this first)
"When a new associate at your firm needs to find your standard [force majeure / refund / liability] clause, where do they go? How long does it take?"
The frame
Knowledge worker firms have decades of high-value documents indexed in nothing but their senior partners' memory. Every time a senior partner answers a question that's already in the corpus, the firm pays them $500-1000 to retrieve information instead of using it. A searchable corpus turns the firm's collective IP into a 24/7 resource.
The demo play
Get them to give you ONE actual question they can't easily answer. On the discovery call, you've already ingested their sample docs. Type their question into the search bar live. The synthesized answer + citation lands in 4 seconds.
Objections
-
"What if it hallucinates?"
"Hard threshold: if top retrieval score is under 70%, the system refuses to answer and shows you closest near-misses. We measured: hallucination rate under 2% with the threshold + citation enforcement."
-
"Security and confidentiality."
"Documents stay in your CF R2 bucket — same security envelope as your existing CF infrastructure. No data leaves CF for embedding. Claude calls only see retrieved chunks for the active query, not the corpus. Per-user audit log."
-
"Our employees won't use it."
"Firms that wire it into the workflow (one Slack slash command + daily start-of-day digest of 'queries that returned no match') see 60-80% adoption in 60 days. The $750/mo includes the monthly adoption review."
-
"$5k setup is a lot."
"It's ingestion of 200-500 docs, the search interface, admin tools, citation system, change-notification system, and 2 hours of employee training. Closest enterprise tool quote is $40k setup + $3k/mo."
The close
"$5k now to ingest your first 200 docs and stand up search by next Friday. $750/mo from the day a single employee runs their first query. If your firm doesn't collectively run 100+ searches in the first 30 days, we extend the trial — no charge."
07Pricing notes▾
Anchor on senior-partner time. A firm where partners bill $500/hr that fields 10 "where is X?" questions/week is leaking $250k/yr. Per-query cost: ~$0.015. At 1,000 queries/mo (typical for 20-person firm) marginal cost is $15 against $750 fee. Setup includes: ingestion of up to 500 documents, search interface, admin upload tools, citation enforcement, change-notification, 2hr training. Does NOT include: SSO/SAML ($1.5k), multi-tenant ($2k), connector to existing DMS (NetDocuments, iManage — $3k each). Upsell paths: combine with Block 02 (intake) + Block 04 (review) → "Knowledge Stack" at $1.5k/mo.