Solution Architecture Document — The Broken Handle
Project: The Broken Handle — automated research pipeline & AI shorts generator
Authors: Jarrod E. Brown, William Lowdermilk
Platform: Cloudflare developer platform (Workers, D1, R2, Vectorize, KV, Workers AI, Email Routing, Secrets Store)
Status: Research pipeline live (v0.6); shorts generator planned
Repository: github.com/thebrokenhandle/the-broken-handle (private)
A formal SAD also exists in-repo at docs/architecture/TBH-Solution-Architecture-v1.2.docx. This document supersedes it.
Web & Social
- Website: thebrokenhandle.com
- Newsletter: brokenhandlepod.substack.com
- YouTube: @brokenhandlepod
- Instagram: @brokenhandlepod
- Threads: @brokenhandlepod
- X / Twitter: @brokenhandlepod
- BlueSky: @brokenhandlepod.bsky.social
- Facebook: @brokenhandlepod
1. Overview
The Broken Handle is a weekly business podcast on the modern job market — layoffs, AI displacement, and career pivots. This repository contains the engine behind the show: an automated research pipeline that ingests forward-looking labor and business indicators from 75+ registered sources across five ingestion methods, and a planned AI shorts generator that produces short-form “Pivot Briefs” as marketing for the weekly long-form episode. Everything runs on Cloudflare’s developer platform.
The pipeline implements a Research Synthesizer archetype: the data already exists across government agencies, executive surveys, hiring-intent trackers, and sector-specific publications, but the value is in aggregation, classification, and synthesis — turning a week’s worth of scattered signals into a ranked, citeable Monday planning brief that makes the show’s commentary defensible rather than anecdotal.
2. Problem & Context
Producing a credible weekly podcast on the labor market requires tracking dozens of forward-looking indicators that publish on different cadences (real-time layoff trackers, monthly government releases, quarterly executive surveys) across incompatible formats (RSS feeds, HTML landing pages, email newsletters, PDF reports, API endpoints). Manually checking 75+ sources each week is unsustainable for two part-time hosts. Worse, the resulting commentary risks confirmation bias — hosts naturally gravitate toward familiar sources and miss signals from sectors or geographies outside their experience.
No existing tool combines labor-market data aggregation with editorial synthesis. News aggregators (Feedly, Google Alerts) collect headlines but don’t extract structured indicators. Research platforms (Bloomberg Terminal, Statista) are prohibitively expensive for an independent podcast. The Broken Handle pipeline fills this gap: a serverless, zero-maintenance system that ingests, classifies, and synthesizes research on a cron schedule, delivering a ready-to-use planning brief every Monday morning.
3. Goals & Requirements
Functional
- Ingest forward-looking labor/business indicators from 75+ sources across five methods (API, RSS, HTML scrape, email ingestion, external scrape via GitHub Actions).
- Store raw content in R2, structured indicators and documents in D1, and semantic embeddings in Vectorize.
- Classify documents across seven balance axes (sector, geography, career stage, role level, org size, persona, signal type) using Gemini Flash.
- Run a weekly synthesis that produces a ranked, cited Monday planning brief and emails it to both hosts.
- (Planned) Generate 60–90s AI-narrated short-form videos (“Pivot Briefs”) from the curated data layer.
Non-functional
- Serverless and low-maintenance: cron-driven ingestion, no manual intervention for steady-state operation.
- Source configuration managed as data (D1 rows + JSON scrape configs), not code — adding most new sources requires zero TypeScript.
- Reproducible, versioned deployments via GitHub Actions CI/CD with TypeScript type checking.
- Balance guardrails prevent editorial blind spots by measuring and nudging topical coverage.
4. Decision Rationale
Why Cloudflare Workers over a traditional backend? The pipeline runs on cron schedules (hourly ingestion, weekly synthesis) with no persistent connections or long-running processes. Workers’ per-request billing means the system costs near-zero between crons. The integrated D1/R2/Vectorize/KV stack eliminates cross-service networking latency and authentication overhead. The same Cloudflare account hosts the podcast website (thebrokenhandle.com) and the companion TBH Editor, so infrastructure is consolidated under one billing and DNS umbrella.
Why five ingestion methods instead of standardizing on one? Source publishers don’t standardize their output. Government agencies (FRED, BLS) expose APIs with structured JSON. Think tanks and trade press (McKinsey, Construction Dive) publish RSS feeds. Research firms (Deloitte, Vistage) post reports on HTML landing pages with no feed. Executive newsletters (Apollo Academy, CB Insights, JPMorgan) deliver via email with no web archive. Anti-bot sites (TrueUp Layoffs) block Worker fetch requests entirely. Each ingestion method addresses a real constraint — the alternative would be losing coverage.
Why Gemini 2.5 Pro for synthesis? The weekly brief requires ranking, citation, and editorial voice — tasks that benefit from a large context window (the payload can include 100+ documents) and strong instruction-following for the WISER framework’s anti-fabrication rules. Gemini 2.5 Pro’s 1M-token context window handles the full payload without chunking, and its structured output reliability is high enough that the brief lands directly in the hosts’ inbox without human review. Gemini Flash handles the lighter classification workload (tagging documents across balance axes) at lower cost.
Why WISER prompting? The synthesis prompt defines the brief’s editorial voice, ranking criteria, anti-fabrication rules, and output structure. WISER (Who, Instructions, Sub-tasks, Examples, Review) provides a repeatable framework that makes each section auditable and independently tunable. The locked system prompt (~1,200 words) has been iterated through production runs; changes are version-controlled alongside the TypeScript source.
Why balance guardrails? Without measurement, the brief would over-index on whatever published most that week (typically tech layoffs and macro indicators) and under-represent sectors, geographies, and career stages that publish less frequently. The guardrail system measures Natural coverage (what the source pool ingested) versus Published coverage (what the brief cited), computes lift, and injects a directive into the synthesis prompt when floors are binding. This makes editorial balance a measured, tunable property rather than a subjective judgment.
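The coverage and lift arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the code in src/balance.ts: the FloorRule and TaggedDoc shapes, the function names, and the directive wording are hypothetical, and the sketch assumes a floor is "binding" when the Natural share of a bucket falls below it.

```typescript
// Hypothetical shapes; the real guardrail_config and document_tags columns may differ.
interface FloorRule {
  axis: string;   // e.g. "sector"
  bucket: string; // e.g. "healthcare"
  floor: number;  // minimum share, 0..1
}

interface TaggedDoc {
  id: string;
  tags: Record<string, string[]>; // axis -> buckets this document touches
}

// Fraction of documents touching a given axis/bucket (0 for an empty set).
function coverageShare(docs: TaggedDoc[], axis: string, bucket: string): number {
  if (docs.length === 0) return 0;
  const hits = docs.filter((d) => (d.tags[axis] ?? []).indexOf(bucket) >= 0).length;
  return hits / docs.length;
}

// Lift as defined in the text: Published share minus Natural share.
function lift(pool: TaggedDoc[], cited: TaggedDoc[], axis: string, bucket: string): number {
  return coverageShare(cited, axis, bucket) - coverageShare(pool, axis, bucket);
}

// Assumption: a floor binds when the pool's Natural share is below the floor.
function bindingFloors(pool: TaggedDoc[], rules: FloorRule[]): FloorRule[] {
  return rules.filter((r) => coverageShare(pool, r.axis, r.bucket) < r.floor);
}

// Builds the paragraph injected into the synthesis prompt; empty when nothing binds.
function balanceDirective(binding: FloorRule[]): string {
  if (binding.length === 0) return "";
  const items = binding.map(
    (r) => `${r.axis}/${r.bucket} (target share at least ${Math.round(r.floor * 100)}%)`
  );
  return `Coverage is thin for: ${items.join(", ")}. Include a well-sourced item from each if one exists.`;
}
```

With a pool of four documents where one touches sector/healthcare, the Natural share is 0.25; if the brief cites two documents and one is the healthcare item, the Published share is 0.5 and the lift is +0.25.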
Why email ingestion over scraping? Several high-value sources (Apollo Academy, CB Insights, JPMorgan Eye on the Market, Big Four CEO outlook newsletters) distribute exclusively via email with no public web archive. Rather than lose coverage, the pipeline uses Cloudflare Email Routing to accept newsletters at dedicated addresses (e.g., apollo@thebrokenhandle.com), parse them with PostalMime, and feed them through the same document/embedding pipeline as other sources. Sender-domain allowlists prevent spoofing.
5. Architecture Overview
See also: Data Flow Diagram for a visual trace of the full pipeline, and Deployment Diagram for Cloudflare resource layout.
The system is organized as a three-stage pipeline — Ingest → Store & Embed → Synthesize & Deliver — with a balance measurement layer that spans stages two and three.
Stage 1: Ingest. An hourly cron triggers the tbh-research Worker, which queries D1 for sources that are “due” based on their configured cadence. Each source dispatches to one of three registered handler types (API, RSS, HTML scrape). Email sources arrive asynchronously via Cloudflare Email Routing. External scrape sources are rendered by Playwright in a GitHub Actions workflow and POSTed to the Worker’s /ingest endpoint.
Stage 2: Store & Embed. Each handler writes raw payloads to R2 (tbh-scrape-raw), structured documents and indicators to D1 (tbh_structured_data), and generates 768-dimensional text embeddings via Workers AI (BGE-base-en-v1.5) stored in Vectorize (tbh-research-vectors).
Stage 3: Synthesize & Deliver. Every Monday at 13:00 UTC, the synthesis cron pulls the past week’s documents and indicators from D1, classifies them across seven balance axes using Gemini Flash, computes Natural coverage shares, injects balance directives where floors are binding, calls Gemini 2.5 Pro with the full WISER-framework system prompt to generate the brief, resolves [source_id] citations to APA-style parentheticals, writes the result to D1 as a draft content_item with provenance links, computes Published coverage, persists the week’s balance history, and emails the HTML-rendered brief to both hosts.
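The citation-resolution step in Stage 3 can be illustrated with a small sketch. The SourceMeta shape, the function name, and the exact parenthetical format are assumptions made for illustration; the real resolver in src/synthesis.ts may format citations differently.

```typescript
// Hypothetical metadata shape; the real sources table has more columns.
interface SourceMeta {
  author: string; // organization or author name used in the parenthetical
  year: number;
}

// Replaces [source_id] markers with APA-style parentheticals.
// Markers for unknown IDs are left untouched so they remain visible in review.
// The negative lookahead skips markdown links such as [text](url).
function resolveCitations(markdown: string, sources: Record<string, SourceMeta>): string {
  return markdown.replace(/\[([A-Za-z0-9_-]+)\](?!\()/g, (match: string, id: string) => {
    if (!Object.prototype.hasOwnProperty.call(sources, id)) return match;
    const meta = sources[id];
    return `(${meta.author}, ${meta.year})`;
  });
}

// Usage with made-up data:
const resolved = resolveCitations("Unemployment ticked up [fred_unrate].", {
  fred_unrate: { author: "Example Agency", year: 2025 },
});
console.log(resolved);
```

Leaving unresolved markers in place is a deliberate choice in this sketch: a stray [source_id] in the delivered brief is easier to spot than a silently dropped citation.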
6. Components
| # | Component | Responsibility |
|---|---|---|
| 1 | Worker tbh-research (src/index.ts) | Main entry point: hourly ingestion cron, Monday synthesis cron, admin HTTP API (/sources, /run-synthesis, /ingest, /status), email handler. |
| 2 | src/sources/fred.ts | FRED API handler: fetches economic time series via the FRED STLFED API, writes indicators (value, unit, period, direction, change). |
| 3 | src/sources/rss.ts | Generic RSS handler: parses Atom/RSS feeds using D1-driven source config (sources.url + url_fallbacks), writes documents with content hashing for dedup. |
| 4 | src/sources/generic_html.ts | Generic HTML scraper: three modes (list, snapshot, two_step) driven by JSON scrape_config in D1. Uses HTMLRewriter for server-rendered pages. Also exposes ingestExternalHtml for the GHA external-scrape path. |
| 5 | src/email.ts | Email ingestion handler: Cloudflare Email Routing dispatcher. Maps recipient addresses to source IDs, validates sender domains, parses RFC822 via PostalMime, stores raw .eml in R2, persists attachments to tbh-media, writes documents, embeds into Vectorize. |
| 6 | src/synthesis.ts | Weekly synthesis orchestrator: pulls recent docs/indicators, calls Gemini 2.5 Pro with the locked WISER system prompt, resolves citations, writes content items with provenance, emails the brief. Includes idempotency guard, failure alerting, and test mode. |
| 7 | src/tagging.ts | Balance classification: batches documents (12/batch) through Gemini Flash to tag across seven axes using a controlled vocabulary from the tags table. Best-effort; failures are logged and swallowed. |
| 8 | src/balance.ts | Balance measurement: loads guardrail rules from guardrail_config, computes Natural/Published coverage shares per bucket, generates synthesis directives for binding floors, persists weekly trends to balance_history. |
| 9 | src/embeddings.ts | Vectorize embedding: truncates text to ~512 tokens, runs Workers AI BGE-base-en-v1.5, upserts 768-dim vectors with source/document metadata. |
| 10 | src/db.ts | D1 query layer: source scheduling (getDueSources with cadence-aware due logic), document upsert with content-hash idempotency, indicator upsert, fetch-run lifecycle. |
| 11 | src/r2.ts | R2 storage helpers: raw payload persistence keyed by source_id/date/filename. |
| 12 | src/feed.ts | RSS/Atom feed parsing utilities. |
| 13 | src/scrape.ts | HTML scrape framework: shared logic for HTMLRewriter-based content extraction. |
| 14 | .github/workflows/hard-scrape.yml | External scrape via GitHub Actions: runs Playwright (headless Chromium) every 6 hours against JS-hydrated sites that block Worker fetch, POSTs rendered HTML to /ingest. |
| 15 | Worker tbh-shorts (planned) | AI short-form content generator; KV TBH_DATA_CACHE reserved. |
7. Data Flow
- Source scheduling. The hourly cron calls getDueSources(), which queries D1 for active sources whose last_fetched_at exceeds their cadence window (15 min for real-time, 7 days for weekly, 30/90/365 days for monthly/quarterly/annual). Sources are processed in tier order, with never-fetched sources prioritized.
- Ingestion dispatch. Each due source is dispatched to its registered handler in the FETCHERS registry. A fetch_run record is created in D1 to track the attempt. The handler fetches content, and on success: (a) stores the raw payload in R2, (b) writes structured documents and/or indicators to D1 with content-hash deduplication, (c) generates a text embedding via Workers AI and upserts it into Vectorize with source metadata. The fetch_run is completed with status, byte count, and document/indicator counts.
- Email arrival. Inbound emails hit Cloudflare Email Routing, which dispatches to handleIncomingEmail. The handler maps the recipient address to a source ID, validates the sender domain against an allowlist, parses the RFC822 message via PostalMime, stores the raw .eml in R2, persists attachments to tbh-media, and writes a document through the same upsert-and-embed path as other handlers.
- External scrape (GHA). Every 6 hours, the hard-scrape.yml GitHub Actions workflow launches Playwright to render JS-hydrated sites (e.g., TrueUp Layoffs), extracts content, and POSTs it to the Worker’s /ingest/:source_id endpoint. The Worker’s ingestExternalHtml handler stores and indexes the content identically to the built-in scraper.
- Weekly tagging. At synthesis time, tagWeekDocuments batches the week’s untagged documents (up to 60 per run) through Gemini Flash for classification across seven balance axes. Tags are written to document_tags via the controlled vocabulary in the tags table.
- Balance measurement. prepareBalance loads guardrail rules from guardrail_config, computes Natural coverage shares (fraction of the week’s pool touching each bucket), and generates a directive paragraph injected into the Gemini synthesis prompt when any floor is binding, nudging the model to include under-represented sectors or geographies.
- Synthesis. runSynthesis pulls all documents and indicators from the past 7 days, assembles the user payload (documents capped at 2,000 chars each, indicators as structured rows), prepends the balance directive, and calls Gemini 2.5 Pro with the locked WISER system prompt (temperature 0.4, max 8,192 output tokens). The returned markdown is citation-resolved ([source_id] → APA parentheticals), written to D1 as a draft content_item with provenance links via content_sources, and emailed to both hosts as rendered HTML.
- Balance finalization. After synthesis, finalizeBalance extracts the brief’s citations to compute Published coverage, calculates lift (Published − Natural) per bucket, persists the week’s balance snapshot to balance_history, and appends a coverage footer to the brief.
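The source-scheduling step at the top of this list can be sketched as an in-memory filter and sort. The real getDueSources runs as a D1 query; the cadence labels, the SourceRow shape, and the exact ordering rule here (never-fetched first, then ascending tier) are assumptions for illustration. The window lengths come from the text above.

```typescript
type Cadence = "realtime" | "weekly" | "monthly" | "quarterly" | "annual";

const MINUTE_MS = 60 * 1000;
const DAY_MS = 24 * 60 * MINUTE_MS;

// Cadence windows as stated in the document.
const CADENCE_WINDOW_MS: Record<Cadence, number> = {
  realtime: 15 * MINUTE_MS,
  weekly: 7 * DAY_MS,
  monthly: 30 * DAY_MS,
  quarterly: 90 * DAY_MS,
  annual: 365 * DAY_MS,
};

// Hypothetical subset of the sources table.
interface SourceRow {
  id: string;
  tier: number;                 // 1 = highest priority
  cadence: Cadence;
  lastFetchedAt: string | null; // ISO timestamp, null if never fetched
}

// A source is due if it was never fetched or its window has elapsed.
function isDue(source: SourceRow, now: Date): boolean {
  if (source.lastFetchedAt === null) return true;
  const elapsed = now.getTime() - Date.parse(source.lastFetchedAt);
  return elapsed >= CADENCE_WINDOW_MS[source.cadence];
}

// Returns due sources: never-fetched first, then by ascending tier.
function dueSources(sources: SourceRow[], now: Date): SourceRow[] {
  return sources
    .filter((s) => isDue(s, now))
    .sort((a, b) => {
      const aFetched = a.lastFetchedAt === null ? 0 : 1;
      const bFetched = b.lastFetchedAt === null ? 0 : 1;
      return aFetched - bFetched || a.tier - b.tier;
    });
}
```

Because the sweep runs hourly, a 15-minute window in this sketch simply means a real-time source is due on every sweep.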
8. Data Model
The D1 database (tbh_structured_data) contains the following core tables:
sources — the source registry. Each row defines a data source with its ID, name, URL, tier (priority ordering), category, ingestion method (api/rss/scrape/external_scrape/email/manual_pdf), cadence, outlook window, paywall/auth flags, status, and optional JSON columns for url_fallbacks (tried in order if the primary URL fails) and scrape_config (mode + selectors for the generic HTML scraper).
fetch_runs — audit log of every ingestion attempt. Tracks source ID, trigger (cron/manual/email), start/complete timestamps, HTTP status, R2 key, byte count, and error messages. Used for source-health monitoring and debugging.
documents — ingested content. Each row links to a source and fetch run, with title, URL, published date, full text content, summary, content hash (SHA-256 for dedup), and an optional vector_id linking to the Vectorize index.
indicators — structured numeric/text data points. Keyed by (source_id, name, period_end) for upsert idempotency. Stores value (numeric or text), unit, period label, direction (up/down/unchanged/mixed), change versus prior, and extraction confidence.
tags — controlled vocabulary for balance classification. Each tag has a type (axis: sector, geography, career_stage, role_level, org_size, persona, signal_type) and a name (bucket within that axis).
document_tags — many-to-many join linking documents to tags. Written by the Gemini Flash classifier during the weekly tagging pass.
guardrail_config — balance rules. Each row defines a floor, ceiling, min_count, or group_floor for a specific axis/bucket combination. Rules are data, not code — tuning a target is a one-row UPDATE.
balance_history — weekly snapshots of Natural/Published coverage and lift per bucket. Powers trend analysis to distinguish under-sourced buckets (add sources) from genuinely balanced ones.
content_items — generated content (weekly briefs, planned shorts). Stores type, title, slug, status (draft/published), and the full markdown body.
content_sources — provenance links from content items to the documents they cited.
content_tags / content_people — future tagging and attribution for generated content.
9. Source Taxonomy
The pipeline ingests from 75+ registered sources across five ingestion methods:
| Method | Handler | Count | Examples |
|---|---|---|---|
| API | fred.ts | 1 | FRED (Federal Reserve Economic Data): economic time series |
| RSS | rss.ts | 41 | Indeed Hiring Lab, McKinsey Global Institute, Challenger Gray, BCG podcasts, Construction Dive, Healthcare Dive, USDA NASS (4 feeds), USDA News/Blogs, EPI, NELP, Industry Dive family (8 verticals), Daily Yonder, Beef Magazine, Hechinger Report, Stateline |
| HTML Scrape | generic_html.ts | 33 | Layoffs.fyi (snapshot), Fed Beige Book (two-step), Brookings Metro, NFIB Optimism, Deloitte CFO Signals, ADP Research, Burning Glass, Gusto, Homebase, BofA Institute, USBR Lake Mead, US Drought Monitor |
| Email | email.ts | 15 | Apollo Academy, CB Insights, JPMorgan Eye on the Market, Axios Pro Rata, KPMG CEO Outlook, EY CEO Outlook, USDA ERS (Charts of Note + Amber Waves), Chicago Fed (catch-all for 15+ gov newsletters), Drovers/Farm Journal family |
| External Scrape | GHA + Playwright | 1 | TrueUp Layoffs (anti-bot blocks Worker fetch; rendered in headless Chromium) |
Sources are tiered (1–4) for priority ordering during ingestion. Categories span government agencies, executive surveys, hiring-intent trackers, sector-specific trade press, rural/agricultural, and edge-case executive research. Several sources have dual paths — for example, Deloitte CFO Signals and AGC Outlook have both a scrape pipeline and an email backup, writing to the same source ID for dedup.
The HTML scraper operates in three modes, configured per-source via JSON scrape_config in D1:
- list: parses a listing page for article links matching CSS selector and regex filters, writes each as a document (e.g., Brookings Metro, BCG Henderson Institute).
- snapshot: captures a single page’s content as one document plus a count indicator (e.g., Layoffs.fyi visible entries count).
- two_step: navigates a landing page to find the latest report link via regex, then fetches and extracts the report body (e.g., Fed Beige Book).
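One way to model the three modes is a discriminated union validated when the JSON is read from D1. The sketch below is illustrative only: the mode names come from this document, but every other field name (linkSelector, linkPattern, contentSelector, countSelector, landingLinkPattern) is hypothetical and may not match the real scrape_config schema.

```typescript
// Hypothetical field names; only the three mode values are taken from the document.
type ScrapeConfig =
  | { mode: "list"; linkSelector: string; linkPattern?: string }
  | { mode: "snapshot"; contentSelector: string; countSelector?: string }
  | { mode: "two_step"; landingLinkPattern: string; contentSelector: string };

function requireString(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`scrape_config.${key} must be a non-empty string`);
  }
  return value;
}

function optionalString(obj: Record<string, unknown>, key: string): string | undefined {
  const value = obj[key];
  return typeof value === "string" ? value : undefined;
}

// Parses and validates the JSON stored in the scrape_config column.
function parseScrapeConfig(json: string): ScrapeConfig {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("scrape_config must be a JSON object");
  }
  const raw = parsed as Record<string, unknown>;
  switch (raw.mode) {
    case "list":
      return {
        mode: "list",
        linkSelector: requireString(raw, "linkSelector"),
        linkPattern: optionalString(raw, "linkPattern"),
      };
    case "snapshot":
      return {
        mode: "snapshot",
        contentSelector: requireString(raw, "contentSelector"),
        countSelector: optionalString(raw, "countSelector"),
      };
    case "two_step":
      return {
        mode: "two_step",
        landingLinkPattern: requireString(raw, "landingLinkPattern"),
        contentSelector: requireString(raw, "contentSelector"),
      };
    default:
      throw new Error(`unknown scrape mode: ${String(raw.mode)}`);
  }
}
```

Validating at read time means a malformed row fails that one source's fetch run with a clear message rather than surfacing as an empty scrape.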
10. External Interfaces
| Interface | Endpoint / Address | Purpose | Auth | Cadence |
|---|---|---|---|---|
| FRED API | api.stlouisfed.org/fred/series/observations | Economic time series (unemployment, GDP, CPI) | API key (Secrets Store) | Weekly |
| Gemini 2.5 Pro | generativelanguage.googleapis.com | Weekly synthesis brief generation | API key (Secrets Store) | Weekly (Monday) |
| Gemini 2.5 Flash | generativelanguage.googleapis.com | Document balance classification (7 axes) | API key (Secrets Store) | Weekly (Monday) |
| Workers AI | @cf/baai/bge-base-en-v1.5 | 768-dim text embeddings for Vectorize | Workers AI binding | Per document |
| Cloudflare Email Routing | *@thebrokenhandle.com → Worker | Inbound newsletter ingestion (15 addresses) | Sender-domain allowlist | Async (on arrival) |
| Cloudflare Email Sending | pipeline@thebrokenhandle.com | Outbound brief delivery to hosts | SEND_EMAIL binding | Weekly + failure alerts |
| RSS/Atom feeds | Various publisher URLs | Content ingestion (41 sources) | Public | Hourly (cadence-gated) |
| HTML pages | Various publisher URLs | Content scraping (33 sources) | Public | Hourly (cadence-gated) |
| GitHub Actions | hard-scrape.yml | Playwright rendering for anti-bot sites | TRIGGER_KEY secret | Every 6 hours |
API keys — four keys are stored in Cloudflare Secrets Store (default_secrets_store): FRED_STLFED, BLS_GOV, CENSUS_GOV, GEMINI_GOOGLE. The Worker’s admin endpoints are protected by a TRIGGER_KEY secret passed as a query parameter.
11. Error Handling & Resilience
Source fetch failures. Each ingestion attempt is wrapped in a try/catch that writes a failed status to the fetch_runs table with the error message. Failed sources are retried on the next cron cycle — the cadence-based scheduling naturally handles transient failures without explicit retry logic. Sources with persistent failures accumulate failed fetch runs visible in the /status endpoint.
RSS fallback URLs. Sources with unreliable primary URLs can specify a url_fallbacks JSON array in D1. The RSS handler tries the primary URL first, then each fallback in order until one succeeds.
Content deduplication. Documents are upserted using a content hash (SHA-256). If a document with the same (source_id, url) or (source_id, content_hash) already exists, the fetch succeeds without writing a duplicate. This handles sources that republish or update content without changing the URL.
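The duplicate check can be illustrated with an in-memory stand-in for the D1 upsert. This is a sketch under assumptions: the DocInput shape and class name are hypothetical, the content hash is taken as a precomputed SHA-256 hex string, and the real implementation in src/db.ts performs the equivalent check in SQL.

```typescript
// Hypothetical subset of a documents row; contentHash is a precomputed SHA-256 hex digest.
interface DocInput {
  sourceId: string;
  url: string;
  contentHash: string;
}

type UpsertResult = "inserted" | "duplicate";

// In-memory model of the two uniqueness checks: (source_id, url) and (source_id, content_hash).
class DocumentIndex {
  private seenUrls: Record<string, true> = {};
  private seenHashes: Record<string, true> = {};

  upsert(doc: DocInput): UpsertResult {
    // JSON-encode the pair so that no combination of values can collide on a separator.
    const urlKey = JSON.stringify([doc.sourceId, doc.url]);
    const hashKey = JSON.stringify([doc.sourceId, doc.contentHash]);
    if (this.seenUrls[urlKey] || this.seenHashes[hashKey]) {
      return "duplicate"; // the fetch still counts as successful; nothing is written
    }
    this.seenUrls[urlKey] = true;
    this.seenHashes[hashKey] = true;
    return "inserted";
  }
}
```

Both keys are scoped by source ID, so two different sources carrying the same syndicated article each keep their own copy in this model.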
External scrape resilience. The GitHub Actions hard-scrape workflow runs on a separate schedule (every 6 hours) from the main Worker cron. If Playwright fails to render a site, the GHA run fails with a logged error; the Worker continues operating on its remaining sources. The /ingest endpoint validates the TRIGGER_KEY before accepting POSTed content.
Email sender validation. Each email source has an allowlist of sender domains (SOURCE_ALLOWED_SENDERS). Emails from unrecognized domains are rejected with a bounce, preventing spoofed content from entering the pipeline. One exception: the Drovers/Farm Journal address has no allowlist because the brand family sends from many domains; the trade-off is accepted for an unpublished, low-spoof-risk address.
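A sender-domain check along these lines might look as follows. The SOURCE_ALLOWED_SENDERS name comes from the text, but the source IDs, the example domains, the use of null for "no allowlist", and the decision to accept subdomains are all assumptions of this sketch. It also assumes the sender arrives as a bare address (user@domain), not a display-name form.

```typescript
// Hypothetical contents. null models a source with no allowlist (the Drovers/Farm Journal case).
const SOURCE_ALLOWED_SENDERS: Record<string, string[] | null> = {
  apollo: ["example.com"],
  drovers: null,
};

// Extracts the lowercase domain from a bare address, or null if there is none.
function senderDomain(from: string): string | null {
  const at = from.lastIndexOf("@");
  if (at < 0 || at === from.length - 1) return null;
  return from.slice(at + 1).trim().toLowerCase();
}

// Returns true when the message may enter the pipeline for the given source.
function isSenderAllowed(sourceId: string, from: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(SOURCE_ALLOWED_SENDERS, sourceId)) {
    return false; // unknown recipient mapping: reject
  }
  const allowed = SOURCE_ALLOWED_SENDERS[sourceId];
  if (allowed === null) return true; // source deliberately has no allowlist
  const domain = senderDomain(from);
  if (domain === null) return false;
  // Assumption: subdomains of an allowed domain are accepted (mail.example.com).
  return allowed.some((d) => domain === d || domain.endsWith("." + d));
}
```

The suffix match requires a leading dot, so a look-alike domain such as notexample.com does not pass for example.com.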
Synthesis idempotency. The synthesis pipeline checks for an existing content_item with the current week’s slug before calling Gemini. If a brief already exists, it returns skipped — preventing duplicate API spend on cron retries or manual re-triggers. Override with triggeredBy='force'.
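The guard reduces to a small decision function. In this sketch the slug format and the "cron" and "manual" trigger values are hypothetical; only the skipped outcome and the 'force' override are taken from the text, and the list of existing slugs stands in for the content_items lookup.

```typescript
// "force" is from the document; "cron" and "manual" are assumed trigger labels.
type TriggeredBy = "cron" | "manual" | "force";

// Hypothetical slug format: one slug per calendar date of the run (UTC).
function weekSlug(runDate: Date): string {
  return `weekly-brief-${runDate.toISOString().slice(0, 10)}`;
}

// Decides whether to call the synthesis model, given the slugs already stored.
function decideSynthesis(
  existingSlugs: string[],
  slug: string,
  triggeredBy: TriggeredBy
): "run" | "skipped" {
  if (triggeredBy === "force") return "run"; // explicit override
  return existingSlugs.indexOf(slug) >= 0 ? "skipped" : "run";
}
```

Checking before the model call, not after, is what saves the API spend on a retried cron or an accidental second manual trigger.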
Synthesis failure alerting. If the Gemini call or any downstream step fails, the pipeline catches the error, emails a failure alert to both hosts with the error message and retry instructions, and returns a failed result. This ensures missed briefs are visible without relying on Worker log monitoring.
Tagging best-effort. The balance tagging pass (tagWeekDocuments) is wrapped in error handling that logs and swallows failures per batch. A tagging failure never blocks the weekly brief — the synthesis proceeds with whatever tags were successfully applied, and untagged documents are picked up on the next run.
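The batching and best-effort behavior can be sketched as below. The 60-per-run cap and 12-per-batch size come from elsewhere in this document; the function names are hypothetical, and the classify callback is a synchronous stand-in for the Gemini Flash call, which would be asynchronous in the real Worker.

```typescript
// Splits the week's untagged document IDs into batches, honoring the per-run cap.
function planTaggingBatches(ids: string[], maxPerRun = 60, batchSize = 12): string[][] {
  const selected = ids.slice(0, maxPerRun);
  const batches: string[][] = [];
  for (let i = 0; i < selected.length; i += batchSize) {
    batches.push(selected.slice(i, i + batchSize));
  }
  return batches;
}

interface TaggingSummary {
  tagged: number;        // documents in batches that completed
  failedBatches: number; // batches whose classification threw
}

// Runs each batch; a failure is logged and swallowed so later batches still run.
function tagBestEffort(
  batches: string[][],
  classify: (batch: string[]) => void
): TaggingSummary {
  const summary: TaggingSummary = { tagged: 0, failedBatches: 0 };
  for (const batch of batches) {
    try {
      classify(batch);
      summary.tagged += batch.length;
    } catch (err) {
      summary.failedBatches += 1;
      console.warn(`tagging batch failed, continuing: ${String(err)}`);
    }
  }
  return summary;
}
```

Documents beyond the cap, and documents in failed batches, simply remain untagged and are eligible again on the next run.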
Email size limits. Inbound emails larger than 5 MB are rejected. Per-attachment size is capped at 25 MB. These limits prevent oversized messages from consuming R2 storage or timing out the Worker.
12. Non-Functional Requirements (Measured)
| NFR | Target | Basis |
|---|---|---|
| Ingestion coverage | 75+ sources across 5 methods | Registered fetchers + email mappings + GHA external scrape |
| Source cadence enforcement | Cadence-gated (15 min to annual) | getDueSources query with per-cadence datetime thresholds |
| Document deduplication | 0 duplicate documents | Content-hash (SHA-256) upsert idempotency in D1 |
| Synthesis delivery | Every Monday by 09:00 ET | Cron at 13:00 UTC (06:00 Phoenix; 09:00 EDT, or 08:00 EST when daylight saving time is not in effect) |
| Anti-fabrication | 0 invented figures in brief | WISER system prompt hard rules + source_id-only citation format |
| Balance coverage | Measured across 7 axes | Natural/Published share tracking with configurable floor/ceiling guardrails |
| Embedding dimensionality | 768-dim cosine | BGE-base-en-v1.5 via Workers AI, matching Vectorize index config |
| Brief word count | 800–1,200 words | Enforced in WISER Review checklist; Gemini maxOutputTokens=8192 |
| Email sender validation | Domain allowlist per source | Bounce on mismatch; prevents spoofed ingestion |
| CI/CD | TypeScript typecheck on every push/PR | GitHub Actions ci.yml with npm run typecheck |
13. Tech Stack
| Layer | Technology | Role |
|---|---|---|
| Compute | Cloudflare Workers (TypeScript) | Hourly ingestion cron, weekly synthesis cron, admin API, email handler |
| Database | Cloudflare D1 (SQLite) | Sources, documents, indicators, tags, content items, balance history, guardrail config |
| Object storage | Cloudflare R2 | Raw payloads (tbh-scrape-raw), media/attachments (tbh-media) |
| Vector search | Cloudflare Vectorize (768-dim, cosine) | Semantic embeddings for document retrieval |
| Cache | Cloudflare KV (TBH_DATA_CACHE) | Reserved for shorts generator; also used for synthesis context |
| Embeddings | Workers AI (@cf/baai/bge-base-en-v1.5) | 768-dim text embeddings, ~512 token context |
| Synthesis LLM | Google Gemini 2.5 Pro | Weekly brief generation with WISER framework |
| Classification LLM | Google Gemini 2.5 Flash | Document balance tagging across 7 axes |
| Prompt framework | WISER | Structured prompt: Who, Instructions, Sub-tasks, Examples, Review |
| Email inbound | Cloudflare Email Routing + PostalMime | Newsletter ingestion from 15 addresses |
| Email outbound | Cloudflare Email Sending (MailChannels) | Brief delivery and failure alerts from pipeline@thebrokenhandle.com |
| Secrets | Cloudflare Secrets Store | FRED, BLS, Census, Gemini API keys |
| External scrape | GitHub Actions + Playwright | Headless Chromium rendering for anti-bot sites |
| CI/CD | GitHub Actions | TypeScript typecheck, deployment via wrangler deploy |
| Source config | D1 rows + JSON scrape_config | Data-driven source management; no code changes for most new sources |
14. Security & Compliance
Secrets management. Four API keys (FRED, BLS, Census, Gemini) are stored in Cloudflare Secrets Store and accessed asynchronously via Worker bindings (await env.GEMINI_API_KEY.get()). The Worker’s admin endpoints are gated by a TRIGGER_KEY secret passed as a query parameter. No secrets appear in code, environment files, or logs.
Ingestion is read-only. All source fetching reads from public or permission-granted endpoints. The pipeline never modifies external data. Email ingestion accepts only inbound messages; the only outbound mail the Worker sends is the weekly brief and failure alerts to the two hosts, never to third parties.
Email sender validation. Each email source has a per-source sender-domain allowlist (SOURCE_ALLOWED_SENDERS). Emails from unrecognized domains receive a 5xx reject, generating a bounce to the sender. This prevents spoofed content from entering the D1/Vectorize pipeline.
Restricted email delivery. Outbound email is limited to two hardcoded recipient addresses (the hosts’ personal Gmail accounts). The RECIPIENTS array is a constant in synthesis.ts; there is no mechanism for external parties to add recipients. Test mode further restricts to a single recipient.
Admin endpoint authentication. All mutating HTTP endpoints (POST /sources, PATCH /sources/:id, POST /run-synthesis, POST /ingest/:id) require ?key=<TRIGGER_KEY>. Unauthenticated requests receive a 401.
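A minimal sketch of the key check, assuming the rule is exactly as stated above: POST and PATCH require a matching ?key= value, and other methods pass through. The function name and the example hostname are hypothetical, and the plain string comparison is a simplification; a timing-safe comparison may be preferable in practice.

```typescript
// Returns the HTTP status the request should proceed with (200) or be rejected with (401).
function authorizeAdminRequest(requestUrl: string, method: string, triggerKey: string): 200 | 401 {
  const mutating = method === "POST" || method === "PATCH";
  if (!mutating) return 200; // read-only endpoints are not gated in this sketch

  // An unset or empty secret must never authorize anything.
  if (triggerKey.length === 0) return 401;

  const supplied = new URL(requestUrl).searchParams.get("key");
  return supplied !== null && supplied === triggerKey ? 200 : 401;
}

// Usage with a made-up hostname and key:
const status = authorizeAdminRequest(
  "https://worker.example/run-synthesis?key=s3cret",
  "POST",
  "s3cret"
);
console.log(status);
```

The empty-secret guard covers the misconfiguration case where the binding is missing and a request with an empty key would otherwise compare equal.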
No PII. The pipeline ingests public/published data and stores only document text, metadata, and structured indicators. No personally identifiable information is collected or stored.
15. Deployment & Operations
The tbh-research Worker deploys via Wrangler from the project root. CI/CD runs on GitHub Actions: ci.yml performs TypeScript typechecking on every push/PR; deploy.yml handles production deployment to the tbh-research Worker on the the-broken-handle-account Cloudflare account.
Infrastructure provisioning. D1 database, R2 buckets, Vectorize index, KV namespace, and Secrets Store bindings are declared in wrangler.toml and provisioned via Wrangler. Source configuration lives in D1 and is managed through dated SQL migrations under scripts/migrations/.
Cron schedules. Two cron triggers are configured in wrangler.toml: 0 * * * * (hourly ingestion sweep) and 0 13 * * 1 (Monday 13:00 UTC synthesis). The hard-scrape GHA workflow runs on its own cron: 15 */6 * * * (every 6 hours at :15 past).
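Inside a Worker's scheduled handler, the cron expression that fired is available as a string, so the two triggers can be told apart by comparing against the configured expressions. The sketch below shows only that routing decision; the function and constant names are hypothetical, and the real dispatch in src/index.ts may be structured differently.

```typescript
// The two Worker cron expressions quoted in this section.
const INGEST_CRON = "0 * * * *";
const SYNTHESIS_CRON = "0 13 * * 1";

type ScheduledJob = "ingest" | "synthesize";

// Maps the fired cron expression to the job to run, or null if it is not recognized.
function routeCron(cron: string): ScheduledJob | null {
  switch (cron) {
    case INGEST_CRON:
      return "ingest";
    case SYNTHESIS_CRON:
      return "synthesize";
    default:
      return null;
  }
}
```

When both expressions match the same minute, each trigger arrives as its own scheduled event, so the hourly sweep and the synthesis run are routed independently. The hard-scrape schedule never reaches this function because it runs in GitHub Actions, not in the Worker.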
Observability. Worker observability is enabled in wrangler.toml ([observability] enabled = true). Console logs from ingestion and synthesis are visible in the Cloudflare dashboard. The /status endpoint returns a JSON health check including Worker version, registered fetcher count, total documents, total indicators, and per-source document counts.
Source management. The admin API exposes full CRUD on the sources table (GET /sources, GET /sources/:id, POST /sources, PATCH /sources/:id). Source fields are validated against a whitelist; ingestion methods and cadences are checked against valid sets. This enables adding, disabling, or reconfiguring sources without code changes.
Email routing. Inbound email addresses are configured in Cloudflare Email Routing for the thebrokenhandle.com zone, pointing to the tbh-research Worker. Adding a new email source requires three steps: (1) add the address→source mapping in email.ts, (2) create or update the D1 source row, (3) add an Email Routing rule in the Cloudflare dashboard.
16. Cross-Project Context
TBH Editor (shared Cloudflare infrastructure). TBH Editor is a browser-based, Descript-style podcast editor running on the same Cloudflare account (the-broken-handle-account). Both projects share the account-level billing, DNS zone (thebrokenhandle.com), and Secrets Store. The research pipeline’s tbh-media R2 bucket is architecturally adjacent to the editor’s MEDIA R2 bucket — both store podcast-related media, though they are separate buckets with separate access controls. The pipeline’s curated data layer (D1 documents + Vectorize embeddings) is designed to be consumable by the editor for episode research and show-notes generation, though this integration is not yet built.
Shared Cloudflare account resources. The the-broken-handle-account Cloudflare account hosts: the research pipeline Worker (tbh-research), the editor Worker + Durable Objects + Workflow + Container (tbh-editor), the podcast website (Cloudflare Pages), Email Routing for newsletter ingestion, and DNS for thebrokenhandle.com. Both Workers run on the Workers Paid plan, which is required for the editor’s Durable Objects and Containers.
WISER framework (shared with ServiceBay AI / HandyHome AI). The WISER prompting framework used for the synthesis pipeline is the same methodology applied in ServiceBay AI and HandyHome AI for agent behavior definition. In those projects, WISER defines agent instructions within IBM watsonx Orchestrate; here it defines the Gemini synthesis prompt. The framework’s portability across LLM providers (Gemini, GPT-OSS) and runtimes (Orchestrate, direct API) validates its utility as a cross-project standard.
17. Risks, Assumptions & Limitations
- Source fragility. Scraping depends on third-party site structure; CSS selectors and link patterns require maintenance when publishers redesign. The scrape_config data model minimizes code changes, but selector tuning is an ongoing operational task. Dead-RSS migrations (e.g., Brookings, BCG Henderson) show this is a recurring pattern.
- Anti-bot escalation. TrueUp Layoffs already requires Playwright via GitHub Actions because Cloudflare Workers are blocked. If more sources adopt aggressive anti-bot measures, the external-scrape path may need to expand, increasing GHA compute costs and operational complexity.
- Gemini API dependency. Both synthesis (Pro) and tagging (Flash) depend on Google Gemini. An outage blocks the weekly brief and tagging pass. The failure-alerting mechanism ensures missed briefs are visible, but there is no automatic fallback to an alternative LLM.
- Balance guardrail tuning. Floor/ceiling thresholds in guardrail_config are manually set. Over-aggressive floors can force the synthesis model to include low-signal stories; under-aggressive floors defeat the purpose. The balance_history trend data supports tuning but requires periodic human review.
- Data volume projections. At current ingestion rates (~100–200 documents/week, ~50–100 indicators/week), D1 and R2 growth is modest. Over 12 months: ~10,000 documents in D1 (~50 MB), ~5,000 raw payloads in R2 (~500 MB), ~10,000 vectors in Vectorize. All well within Cloudflare’s included allocations on the Workers Paid plan. If source count doubles, these projections scale linearly.
- Email deliverability. Outbound brief delivery uses Cloudflare Email Sending (MailChannels) from pipeline@thebrokenhandle.com. SPF/DKIM/DMARC are configured for the domain. Delivery is to two Gmail addresses; Gmail’s spam filtering has not flagged the briefs to date, but deliverability is not guaranteed and there is no fallback delivery mechanism. Failed sends are caught and logged but not retried.
- Shorts generator is planned. The tbh-shorts Worker and TBH_DATA_CACHE KV namespace are reserved but not yet built. Technical details (video generation model, narration voice, output format, distribution channels) are TBD.
18. Roadmap
Phase 1 — Research Pipeline (current). Automated ingestion from 75+ sources, weekly synthesis brief with WISER-guided Gemini, balance guardrails across seven editorial axes. The pipeline is the data foundation that makes everything else possible.
Phase 2 — Shorts Generator. AI-narrated 60–90s “Pivot Briefs” generated from the curated data layer. Each short highlights one story from the weekly brief, narrated in show voice, formatted for vertical video (YouTube Shorts, Instagram Reels, TikTok). The tbh-shorts Worker will read from the same D1/Vectorize layer and use KV for generation state. Technical stack TBD: likely a text-to-speech model for narration and a template-driven video composition pipeline.
Phase 3 — Source Health Monitoring. Automated alerting for source degradation: consecutive fetch failures, selector breakage (scrape returns zero documents), RSS feed staleness (no new items in 2× cadence window). Currently, source health is visible only through the /status endpoint and fetch_runs table — no proactive alerting exists.
Phase 4 — Semantic Search & Retrieval. Expose the Vectorize embeddings for interactive querying — “what did we see about healthcare layoffs in Q1?” — via an API endpoint or chat interface. The embeddings exist today but are write-only; no retrieval path is built.
Phase 5 — Audience & Revenue. Expand from internal research tool to audience-facing content. Newsletter automation (Substack API integration), social media posting, and eventually sponsorship/advertising infrastructure. This phase builds on the content generation capabilities from Phase 2.
Diagrams: Data Flow Diagram · Source Taxonomy · Deployment Diagram