Solution Architecture Document — The Broken Handle

Project: The Broken Handle — automated research pipeline & AI shorts generator
Authors: Jarrod E. Brown, William Lowdermilk
Platform: Cloudflare developer platform (Workers, D1, R2, Vectorize, KV, Workers AI, Email Routing, Secrets Store)
Status: Research pipeline live (v0.6); shorts generator planned
Repository: Private

A more formal SAD also exists for this project; this document supersedes it.

Web & Social

Website: thebrokenhandle.com
Newsletter: brokenhandlepod.substack.com
YouTube: @brokenhandlepod
Instagram: @brokenhandlepod
Threads: @brokenhandlepod
X / Twitter: @brokenhandlepod
BlueSky: @brokenhandlepod.bsky.social
Facebook: @brokenhandlepod


1. Overview

The Broken Handle is a weekly business podcast on the modern job market — layoffs, AI displacement, and career pivots. This repository contains the engine behind the show: an automated research pipeline that ingests forward-looking labor and business indicators from 75+ registered sources across five ingestion methods, and a planned AI shorts generator that produces short-form “Pivot Briefs” as marketing for the weekly long-form episode. Everything runs on Cloudflare’s developer platform.

The pipeline implements a Research Synthesizer archetype: the data already exists across government agencies, executive surveys, hiring-intent trackers, and sector-specific publications, but the value is in aggregation, classification, and synthesis — turning a week’s worth of scattered signals into a ranked, citeable Monday planning brief that makes the show’s commentary defensible rather than anecdotal.

2. Problem & Context

Producing a credible weekly podcast on the labor market requires tracking dozens of forward-looking indicators that publish on different cadences (real-time layoff trackers, monthly government releases, quarterly executive surveys) across incompatible formats (RSS feeds, HTML landing pages, email newsletters, PDF reports, API endpoints). Manually checking 75+ sources each week is unsustainable for two part-time hosts. Worse, the resulting commentary risks confirmation bias — hosts naturally gravitate toward familiar sources and miss signals from sectors or geographies outside their experience.

No existing tool combines labor-market data aggregation with editorial synthesis. News aggregators (Feedly, Google Alerts) collect headlines but don’t extract structured indicators. Research platforms (Bloomberg Terminal, Statista) are prohibitively expensive for an independent podcast. The Broken Handle pipeline fills this gap: a serverless, zero-maintenance system that ingests, classifies, and synthesizes research on a cron schedule, delivering a ready-to-use planning brief every Monday morning.

3. Goals & Requirements

Functional

Ingest forward-looking labor/business indicators from 75+ sources across five methods (API, RSS, HTML scrape, email ingestion, external scrape via GitHub Actions).
Store raw content in R2, structured indicators and documents in D1, and semantic embeddings in Vectorize.
Classify documents across seven balance axes (sector, geography, career stage, role level, org size, persona, signal type) using Gemini Flash.
Run a weekly synthesis that produces a ranked, cited Monday planning brief and emails it to both hosts.
(Planned) Generate 60–90s AI-narrated short-form videos (“Pivot Briefs”) from the curated data layer.

Non-functional

Serverless and low-maintenance: cron-driven ingestion, no manual intervention for steady-state operation.
Source configuration managed as data (D1 rows + JSON scrape configs), not code — adding most new sources requires zero TypeScript.
Reproducible, versioned deployments via GitHub Actions CI/CD with TypeScript type checking.
Balance guardrails prevent editorial blind spots by measuring and nudging topical coverage.

4. Decision Rationale

Why Cloudflare Workers over a traditional backend? The pipeline runs on cron schedules (hourly ingestion, weekly synthesis) with no persistent connections or long-running processes. Workers’ per-request billing means the system costs near-zero between crons. The integrated D1/R2/Vectorize/KV stack eliminates cross-service networking latency and authentication overhead. The shared Cloudflare account hosts the podcast website (thebrokenhandle.com) and the companion editor, so infrastructure is consolidated under one billing and DNS umbrella.

Why five ingestion methods instead of standardizing on one? Source publishers don’t standardize their output. Government agencies (FRED, BLS) expose APIs with structured JSON. Think tanks and trade press (McKinsey, Construction Dive) publish RSS feeds. Research firms (Deloitte, Vistage) post reports on HTML landing pages with no feed. Executive newsletters (Apollo Academy, CB Insights, JPMorgan) deliver via email with no web archive. Anti-bot sites (TrueUp Layoffs) block Worker fetch requests entirely. Each ingestion method addresses a real constraint — the alternative would be losing coverage.

Why Gemini 2.5 Pro for synthesis? The weekly brief requires ranking, citation, and editorial voice. These tasks benefit from a large context window (the payload can include 100+ documents) and strong instruction-following for the WISER framework’s anti-fabrication rules. Gemini 2.5 Pro’s 1M-token context window handles the full payload without chunking, and its structured output reliability is high enough that the brief lands directly in the hosts’ inboxes without human review. Gemini Flash handles the lighter classification workload (tagging documents across balance axes) at lower cost.

Why WISER prompting? The synthesis prompt defines the brief’s editorial voice, ranking criteria, anti-fabrication rules, and output structure. WISER (Who, Instructions, Sub-tasks, Examples, Review) provides a repeatable framework that makes each section auditable and independently tunable. The locked system prompt (~1,200 words) has been iterated through production runs; changes are version-controlled alongside the TypeScript source.

Why balance guardrails? Without measurement, the brief would over-index on whatever published most that week (typically tech layoffs and macro indicators) and under-represent sectors, geographies, and career stages that publish less frequently. The guardrail system measures Natural coverage (what the source pool ingested) versus Published coverage (what the brief cited), computes lift, and injects a directive into the synthesis prompt when floors are binding. This makes editorial balance a measured, tunable property rather than a subjective judgment.
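The measurement described above can be sketched in a few pure functions. This is a minimal illustration, not the code in `src/balance.ts`: the `FloorRule` shape, the bucket naming, the directive wording, and the reading that a floor "binds" when the Natural share falls below it are all assumptions made for the example.

```typescript
// Hypothetical shapes: bucket names and the FloorRule fields are assumptions,
// not the actual guardrail_config schema.
type Shares = Record<string, number>;

interface FloorRule {
  bucket: string; // e.g. "sector:healthcare"
  floor: number;  // minimum desired share, 0..1
}

// Fraction of items touching each bucket. One item may touch several buckets.
function coverageShares(items: string[][]): Shares {
  const shares: Shares = {};
  if (items.length === 0) return shares;
  for (const buckets of items) {
    const seen: Record<string, boolean> = {};
    for (const b of buckets) {
      if (seen[b]) continue; // count each bucket once per item
      seen[b] = true;
      shares[b] = (shares[b] ?? 0) + 1;
    }
  }
  for (const b of Object.keys(shares)) {
    shares[b] = (shares[b] ?? 0) / items.length;
  }
  return shares;
}

// Lift = Published share minus Natural share, per bucket.
function computeLift(natural: Shares, published: Shares): Shares {
  const lift: Shares = {};
  for (const b of Object.keys(natural).concat(Object.keys(published))) {
    lift[b] = (published[b] ?? 0) - (natural[b] ?? 0);
  }
  return lift;
}

// Assumption: a floor "binds" when the Natural share falls below it.
function bindingFloors(natural: Shares, rules: FloorRule[]): FloorRule[] {
  return rules.filter((r) => (natural[r.bucket] ?? 0) < r.floor);
}

// Builds the paragraph injected into the synthesis prompt; empty when nothing binds.
function buildDirective(binding: FloorRule[]): string {
  if (binding.length === 0) return "";
  return "Balance directive: give fair consideration to under-represented buckets: " +
    binding.map((r) => r.bucket).join(", ") + ".";
}
```

A negative lift for a bucket means the brief cited it less often than the source pool would suggest; a positive lift means the brief leaned into it.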

Why email ingestion over scraping? Several high-value sources (Apollo Academy, CB Insights, JPMorgan Eye on the Market, Big Four CEO outlook newsletters) distribute exclusively via email with no public web archive. Rather than lose coverage, the pipeline uses Cloudflare Email Routing to accept newsletters at dedicated per-source addresses on the podcast domain, parse them with PostalMime, and feed them through the same document/embedding pipeline as other sources. Sender-domain allowlists prevent spoofing.
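A rough sketch of the recipient-to-source mapping and the sender-domain check follows. The addresses, domains, and source IDs are placeholders on reserved example domains, and the function names are hypothetical; the real mapping lives in `src/email.ts`. The sketch also assumes the sender arrives as a plain address rather than a display-name form.

```typescript
// Hypothetical mapping; real addresses, domains, and source IDs live in src/email.ts.
const RECIPIENT_TO_SOURCE: Record<string, string> = {
  "apollo@pod.example": "apollo_academy",
  "cbinsights@pod.example": "cb_insights",
};

const SENDER_ALLOWLIST: Record<string, string[]> = {
  apollo_academy: ["apollo-news.example"],
  cb_insights: ["cbi-mail.example"],
};

type Decision =
  | { accept: true; sourceId: string }
  | { accept: false; reason: string };

// Extracts the domain from a plain address such as "digest@mail.example".
function senderDomain(from: string): string {
  const at = from.lastIndexOf("@");
  return at === -1 ? "" : from.slice(at + 1).trim().toLowerCase();
}

function routeInbound(to: string, from: string): Decision {
  const sourceId = RECIPIENT_TO_SOURCE[to.trim().toLowerCase()];
  if (!sourceId) return { accept: false, reason: "unknown recipient" };

  const allowed = SENDER_ALLOWLIST[sourceId] ?? [];
  const domain = senderDomain(from);
  // Accept the allowlisted domain itself or any subdomain of it.
  const ok = allowed.some((d) => domain === d || domain.endsWith("." + d));
  return ok
    ? { accept: true, sourceId }
    : { accept: false, reason: "sender domain not allowlisted" };
}
```

Matching on a leading dot keeps a lookalike such as `apollo-news.example.attacker.test` from passing. In a Worker email handler, a rejected decision would typically be surfaced through the message's `setReject()` method.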

5. Architecture Overview

See also: Data Flow Diagram for a visual trace of the full pipeline, and Deployment Diagram for Cloudflare resource layout.

The system is organized as a three-stage pipeline — Ingest → Store & Embed → Synthesize & Deliver — with a balance measurement layer that spans stages two and three.

Stage 1: Ingest. An hourly cron triggers the research Worker, which queries D1 for sources that are “due” based on their configured cadence. Each source dispatches to one of three registered handler types (API, RSS, HTML scrape). Email sources arrive asynchronously via Cloudflare Email Routing. External scrape sources are rendered by Playwright in a GitHub Actions workflow and POSTed to the Worker’s /ingest endpoint.
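As an illustration of the dispatch step, a registry lookup might look like the sketch below. Keying the registry by ingestion method is an assumption (the actual `FETCHERS` registry may be keyed differently, for example by source ID), and the handlers here are placeholders.

```typescript
// Method names follow the sources table; keying the registry by method is an assumption.
type IngestionMethod = "api" | "rss" | "scrape" | "external_scrape" | "email" | "manual_pdf";

interface Source {
  id: string;
  ingestion_method: IngestionMethod;
  url: string;
}

interface FetchResult {
  documents: number;
  indicators: number;
}

type Fetcher = (source: Source) => Promise<FetchResult>;

// Placeholder handlers standing in for fred.ts, rss.ts, and generic_html.ts.
const FETCHERS: Partial<Record<IngestionMethod, Fetcher>> = {
  api: async () => ({ documents: 0, indicators: 1 }),
  rss: async () => ({ documents: 3, indicators: 0 }),
  scrape: async () => ({ documents: 1, indicators: 0 }),
};

// Push-based methods (email, external_scrape) have no cron-dispatched handler.
function resolveFetcher(source: Source): Fetcher | null {
  return FETCHERS[source.ingestion_method] ?? null;
}

// Runs the handler when one is registered; otherwise reports the source as skipped.
async function ingestOne(source: Source): Promise<FetchResult | "skipped"> {
  const fetcher = resolveFetcher(source);
  return fetcher ? fetcher(source) : "skipped";
}
```

With this shape, adding a pull-based method means registering one more entry, while email and external-scrape sources simply fall through the cron sweep.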

Stage 2: Store & Embed. Each handler writes raw payloads to a raw-content R2 bucket, structured documents and indicators to the D1 database, and generates 768-dimensional text embeddings via Workers AI (BGE-base-en-v1.5) stored in the Vectorize index.

Stage 3: Synthesize & Deliver. Every Monday at 13:00 UTC, the synthesis cron pulls the past week’s documents and indicators from D1, classifies them across seven balance axes using Gemini Flash, computes Natural coverage shares, injects balance directives where floors are binding, calls Gemini 2.5 Pro with the full WISER-framework system prompt to generate the brief, resolves [source_id] citations to APA-style parentheticals, writes the result to D1 as a draft content_item with provenance links, computes Published coverage, persists the week’s balance history, and emails the HTML-rendered brief to both hosts.
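The citation-resolution step in this stage can be pictured as a single regex pass over the returned markdown. The sketch below is illustrative only: the `SourceMeta` shape, the "(Name, Year)" rendering, and the choice to leave unknown IDs untouched are assumptions, not the behavior of `src/synthesis.ts`.

```typescript
// Hypothetical metadata shape; the real lookup would come from the sources table.
interface SourceMeta {
  name: string;
  year: number;
}

// Replaces [source_id] markers with "(Name, Year)" parentheticals.
// Unknown IDs are left as-is so they remain visible for review.
// The negative lookahead skips markdown links of the form [text](url).
function resolveCitations(markdown: string, sources: Record<string, SourceMeta>): string {
  return markdown.replace(/\[([a-z0-9_]+)\](?!\()/g, (match: string, id: string) => {
    const meta = sources[id];
    return meta ? "(" + meta.name + ", " + meta.year + ")" : match;
  });
}
```

Leaving unresolved markers in place, rather than dropping them, makes a fabricated or mistyped source ID easy to spot in the delivered brief.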

6. Components

#	Component	Responsibility
1	Research Worker (`src/index.ts`)	Main entry point: hourly ingestion cron, Monday synthesis cron, admin HTTP API (`/sources`, `/run-synthesis`, `/ingest`, `/status`), email handler.
2	`src/sources/fred.ts`	FRED API handler: fetches economic time series from the St. Louis Fed FRED API, writes indicators (value, unit, period, direction, change).
3	`src/sources/rss.ts`	Generic RSS handler — parses Atom/RSS feeds using D1-driven source config (`sources.url` + `url_fallbacks`), writes documents with content hashing for dedup.
4	`src/sources/generic_html.ts`	Generic HTML scraper — three modes (`list`, `snapshot`, `two_step`) driven by JSON `scrape_config` in D1. Uses HTMLRewriter for server-rendered pages. Also exposes `ingestExternalHtml` for the GHA external-scrape path.
5	`src/email.ts`	Email ingestion handler — Cloudflare Email Routing dispatcher. Maps recipient addresses to source IDs, validates sender domains, parses RFC822 via PostalMime, stores raw `.eml` in R2, persists attachments to the media bucket, writes documents, embeds into Vectorize.
6	`src/synthesis.ts`	Weekly synthesis orchestrator — pulls recent docs/indicators, calls Gemini 2.5 Pro with the locked WISER system prompt, resolves citations, writes content items with provenance, emails the brief. Includes idempotency guard, failure alerting, and test mode.
7	`src/tagging.ts`	Balance classification — batches documents (12/batch) through Gemini Flash to tag across seven axes using a controlled vocabulary from the `tags` table. Best-effort; failures are logged and swallowed.
8	`src/balance.ts`	Balance measurement — loads guardrail rules from `guardrail_config`, computes Natural/Published coverage shares per bucket, generates synthesis directives for binding floors, persists weekly trends to `balance_history`.
9	`src/embeddings.ts`	Vectorize embedding — truncates text to ~512 tokens, runs Workers AI BGE-base-en-v1.5, upserts 768-dim vectors with source/document metadata.
10	`src/db.ts`	D1 query layer — source scheduling (`getDueSources` with cadence-aware due logic), document upsert with content-hash idempotency, indicator upsert, fetch-run lifecycle.
11	`src/r2.ts`	R2 storage helpers — raw payload persistence keyed by `source_id/date/filename`.
12	`src/feed.ts`	RSS/Atom feed parsing utilities.
13	`src/scrape.ts`	HTML scrape framework — shared logic for HTMLRewriter-based content extraction.
14	`.github/workflows/hard-scrape.yml`	External scrape via GitHub Actions — runs Playwright (headless Chromium) every 6 hours against JS-hydrated sites that block Worker fetch, POSTs rendered HTML to `/ingest`.
15	Shorts Worker (planned)	AI short-form content generator; a KV namespace reserved.

7. Data Flow

Figure: The Broken Handle Data Flow Diagram (visual trace of data through the pipeline)

Source scheduling. The hourly cron calls getDueSources(), which queries D1 for active sources whose last_fetched_at is older than their cadence window (15 minutes for real-time, 7 days for weekly, and 30, 90, or 365 days for monthly, quarterly, or annual). Because the sweep itself runs hourly, a 15-minute window means a real-time source is due on every sweep. Sources are processed in tier order, with never-fetched sources prioritized.
Ingestion dispatch. Each due source is dispatched to its registered handler in the FETCHERS registry. A fetch_run record is created in D1 to track the attempt. The handler fetches content, and on success: (a) stores the raw payload in R2, (b) writes structured documents and/or indicators to D1 with content-hash deduplication, (c) generates a text embedding via Workers AI and upserts it into Vectorize with source metadata. The fetch_run is completed with status, byte count, and document/indicator counts.
Email arrival. Inbound emails hit Cloudflare Email Routing, which dispatches to handleIncomingEmail. The handler maps the recipient address to a source ID, validates the sender domain against an allowlist, parses the RFC822 message via PostalMime, stores the raw .eml in R2, persists attachments to the media bucket, and writes a document through the same upsert-and-embed path as other handlers.
External scrape (GHA). Every 6 hours, the hard-scrape.yml GitHub Actions workflow launches Playwright to render JS-hydrated sites (e.g., TrueUp Layoffs), extracts content, and POSTs it to the Worker’s /ingest/:source_id endpoint. The Worker’s ingestExternalHtml handler stores and indexes the content identically to the built-in scraper.
Weekly tagging. At synthesis time, tagWeekDocuments batches the week’s untagged documents (up to 60 per run) through Gemini Flash for classification across seven balance axes. Tags are written to document_tags via the controlled vocabulary in the tags table.
Balance measurement. prepareBalance loads guardrail rules from guardrail_config, computes Natural coverage shares (fraction of the week’s pool touching each bucket), and generates a directive paragraph injected into the Gemini synthesis prompt when any floor is binding — nudging the model to include under-represented sectors or geographies.
Synthesis. runSynthesis pulls all documents and indicators from the past 7 days, assembles the user payload (documents capped at 2,000 chars each, indicators as structured rows), prepends the balance directive, and calls Gemini 2.5 Pro with the locked WISER system prompt (temperature 0.4, max 8,192 output tokens). The returned markdown is citation-resolved ([source_id] → APA parentheticals), written to D1 as a draft content_item with provenance links via content_sources, and emailed to both hosts as rendered HTML.
Balance finalization. After synthesis, finalizeBalance extracts the brief’s citations to compute Published coverage, calculates lift (Published − Natural) per bucket, persists the week’s balance snapshot to balance_history, and appends a coverage footer to the brief.
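The source-scheduling step that opens this flow can be sketched as a pure function over source rows. The cadence names, the `status` value, and the ordering (never-fetched first, then ascending tier) are assumptions for illustration; the real logic is a D1 query in `getDueSources`.

```typescript
const MINUTE = 60 * 1000;
const DAY = 24 * 60 * MINUTE;

// Cadence names are assumptions; the windows follow the values quoted in the text.
const CADENCE_WINDOW_MS: Record<string, number> = {
  realtime: 15 * MINUTE,
  weekly: 7 * DAY,
  monthly: 30 * DAY,
  quarterly: 90 * DAY,
  annual: 365 * DAY,
};

interface ScheduledSource {
  id: string;
  tier: number;
  cadence: string;
  status: string;
  last_fetched_at: string | null; // ISO 8601 timestamp, or null if never fetched
}

function isDue(source: ScheduledSource, now: Date): boolean {
  if (source.status !== "active") return false;
  if (source.last_fetched_at === null) return true; // never fetched: always due
  const windowMs = CADENCE_WINDOW_MS[source.cadence];
  if (windowMs === undefined) return false; // unknown cadence: skip rather than guess
  return now.getTime() - Date.parse(source.last_fetched_at) >= windowMs;
}

// One reading of "tier order, never-fetched prioritized": never-fetched first, then by tier.
function dueSources(all: ScheduledSource[], now: Date): ScheduledSource[] {
  return all
    .filter((s) => isDue(s, now))
    .sort((a, b) => {
      const aRank = a.last_fetched_at === null ? 0 : 1;
      const bRank = b.last_fetched_at === null ? 0 : 1;
      return aRank - bRank || a.tier - b.tier;
    });
}
```

Keeping the due check separate from the ordering makes each rule easy to test against fixed timestamps.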

8. Data Model

The D1 database contains the following core tables:

sources — the source registry. Each row defines a data source with its ID, name, URL, tier (priority ordering), category, ingestion method (api/rss/scrape/external_scrape/email/manual_pdf), cadence, outlook window, paywall/auth flags, status, and optional JSON columns for url_fallbacks (tried in order if the primary URL fails) and scrape_config (mode + selectors for the generic HTML scraper).

fetch_runs — audit log of every ingestion attempt. Tracks source ID, trigger (cron/manual/email), start/complete timestamps, HTTP status, R2 key, byte count, and error messages. Used for source-health monitoring and debugging.

documents — ingested content. Each row links to a source and fetch run, with title, URL, published date, full text content, summary, content hash (SHA-256 for dedup), and an optional vector_id linking to the Vectorize index.

indicators — structured numeric/text data points. Keyed by (source_id, name, period_end) for upsert idempotency. Stores value (numeric or text), unit, period label, direction (up/down/unchanged/mixed), change versus prior, and extraction confidence.

tags — controlled vocabulary for balance classification. Each tag has a type (axis: sector, geography, career_stage, role_level, org_size, persona, signal_type) and a name (bucket within that axis).

document_tags — many-to-many join linking documents to tags. Written by the Gemini Flash classifier during the weekly tagging pass.

guardrail_config — balance rules. Each row defines a floor, ceiling, min_count, or group_floor for a specific axis/bucket combination. Rules are data, not code — tuning a target is a one-row UPDATE.

balance_history — weekly snapshots of Natural/Published coverage and lift per bucket. Powers trend analysis to distinguish under-sourced buckets (add sources) from genuinely balanced ones.

content_items — generated content (weekly briefs, planned shorts). Stores type, title, slug, status (draft/published), and the full markdown body.

content_sources — provenance links from content items to the documents they cited.

content_tags / content_people — future tagging and attribution for generated content.

9. Source Taxonomy

Figure: The Broken Handle Source Taxonomy Diagram (sources grouped by handler type)

The pipeline ingests from 75+ registered sources across five ingestion methods:

Method	Handler	Count	Examples
API	`fred.ts`	1	FRED (Federal Reserve Economic Data) — economic time series
RSS	`rss.ts`	41	Indeed Hiring Lab, McKinsey Global Institute, Challenger Gray, BCG podcasts, Construction Dive, Healthcare Dive, USDA NASS (4 feeds), USDA News/Blogs, EPI, NELP, Industry Dive family (8 verticals), Daily Yonder, Beef Magazine, Hechinger Report, Stateline
HTML Scrape	`generic_html.ts`	33	Layoffs.fyi (snapshot), Fed Beige Book (two-step), Brookings Metro, NFIB Optimism, Deloitte CFO Signals, ADP Research, Burning Glass, Gusto, Homebase, BofA Institute, USBR Lake Mead, US Drought Monitor
Email	`email.ts`	15	Apollo Academy, CB Insights, JPMorgan Eye on the Market, Axios Pro Rata, KPMG CEO Outlook, EY CEO Outlook, USDA ERS (Charts of Note + Amber Waves), Chicago Fed (catch-all for 15+ gov newsletters), Drovers/Farm Journal family
External Scrape	GHA + Playwright	1	TrueUp Layoffs (anti-bot blocks Worker fetch; rendered in headless Chromium)

Sources are tiered (1–4) for priority ordering during ingestion. Categories span government agencies, executive surveys, hiring-intent trackers, sector-specific trade press, rural/agricultural, and edge-case executive research. Several sources have dual paths — for example, Deloitte CFO Signals and AGC Outlook have both a scrape pipeline and an email backup, writing to the same source ID for dedup.

The HTML scraper operates in three modes, configured per-source via JSON scrape_config in D1:

list — parses a listing page for article links matching CSS selector and regex filters, writes each as a document (e.g., Brookings Metro, BCG Henderson Institute).
snapshot — captures a single page’s content as one document plus a count indicator (e.g., Layoffs.fyi visible entries count).
two_step — navigates a landing page to find the latest report link via regex, then fetches and extracts the report body (e.g., Fed Beige Book).
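The three modes lend themselves to a discriminated union, so that each mode carries only the fields it needs. The field names below (`link_selector`, `link_pattern`, `content_selector`, `report_link_pattern`) are hypothetical; the actual `scrape_config` keys are not documented here.

```typescript
// Hypothetical field names; only the three mode values come from the document.
type ScrapeConfig =
  | { mode: "list"; link_selector: string; link_pattern?: string }
  | { mode: "snapshot"; content_selector: string }
  | { mode: "two_step"; report_link_pattern: string; content_selector: string };

// Parses the JSON stored in the scrape_config column and rejects malformed configs early.
function parseScrapeConfig(json: string): ScrapeConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("scrape_config must be a JSON object");
  }
  const cfg = raw as Record<string, unknown>;

  const required = (key: string): string => {
    const value = cfg[key];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error("scrape_config." + key + " must be a non-empty string");
    }
    return value;
  };

  switch (cfg["mode"]) {
    case "list": {
      const pattern = cfg["link_pattern"];
      const base = { mode: "list" as const, link_selector: required("link_selector") };
      return typeof pattern === "string" ? { ...base, link_pattern: pattern } : base;
    }
    case "snapshot":
      return { mode: "snapshot", content_selector: required("content_selector") };
    case "two_step":
      return {
        mode: "two_step",
        report_link_pattern: required("report_link_pattern"),
        content_selector: required("content_selector"),
      };
    default:
      throw new Error("unknown scrape mode: " + String(cfg["mode"]));
  }
}
```

Validating at parse time means a bad selector config fails loudly on the first fetch instead of silently producing zero documents.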

10. External Interfaces

Interface	Endpoint / Address	Purpose	Auth	Cadence
FRED API	`api.stlouisfed.org/fred/series/observations`	Economic time series (unemployment, GDP, CPI)	API key (Secrets Store)	Weekly
Gemini 2.5 Pro	`generativelanguage.googleapis.com`	Weekly synthesis brief generation	API key (Secrets Store)	Weekly (Monday)
Gemini 2.5 Flash	`generativelanguage.googleapis.com`	Document balance classification (7 axes)	API key (Secrets Store)	Weekly (Monday)
Workers AI	`@cf/baai/bge-base-en-v1.5`	768-dim text embeddings for Vectorize	Workers AI binding	Per document
Cloudflare Email Routing	per-source addresses on the podcast domain → Worker	Inbound newsletter ingestion (15 addresses)	Sender-domain allowlist	Async (on arrival)
Cloudflare Email Sending	a dedicated sending address on the podcast domain	Outbound brief delivery to hosts	Email-sending binding	Weekly + failure alerts
RSS/Atom feeds	Various publisher URLs	Content ingestion (41 sources)	Public	Hourly (cadence-gated)
HTML pages	Various publisher URLs	Content scraping (33 sources)	Public	Hourly (cadence-gated)
GitHub Actions	`hard-scrape.yml`	Playwright rendering for anti-bot sites	Authentication secret	Every 6 hours

API keys — four keys (FRED, BLS, Census, Gemini) are stored in Cloudflare Secrets Store. The Worker’s admin endpoints are protected by an authentication secret.

11. Error Handling & Resilience

Source fetch failures. Each ingestion attempt is wrapped in a try/catch that writes a failed status to the fetch_runs table with the error message. Failed sources are retried on the next cron cycle — the cadence-based scheduling naturally handles transient failures without explicit retry logic. Sources with persistent failures accumulate failed fetch runs visible in the /status endpoint.

RSS fallback URLs. Sources with unreliable primary URLs can specify a url_fallbacks JSON array in D1. The RSS handler tries the primary URL first, then each fallback in order until one succeeds.
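A minimal sketch of the fallback order, assuming `url_fallbacks` is stored as a JSON array of strings. `candidateUrls` and `fetchFirstAvailable` are hypothetical helper names, and `attempt` stands in for the real fetch-and-parse step.

```typescript
// Builds the ordered, de-duplicated list of URLs to try: primary first, then fallbacks.
function candidateUrls(primary: string, urlFallbacksJson: string | null): string[] {
  let fallbacks: string[] = [];
  if (urlFallbacksJson) {
    try {
      const parsed: unknown = JSON.parse(urlFallbacksJson);
      if (Array.isArray(parsed)) {
        fallbacks = parsed.filter((u): u is string => typeof u === "string");
      }
    } catch {
      // Malformed JSON: fall back to the primary URL only.
    }
  }
  const ordered: string[] = [];
  for (const url of [primary].concat(fallbacks)) {
    if (ordered.indexOf(url) === -1) ordered.push(url);
  }
  return ordered;
}

// Tries each URL in order and returns the first success; rethrows the last error if all fail.
async function fetchFirstAvailable<T>(
  urls: string[],
  attempt: (url: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error("no URLs configured");
  for (const url of urls) {
    try {
      return await attempt(url);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Rethrowing the last error lets the caller record a single failed fetch run for the source when every candidate is down.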

Content deduplication. Documents are upserted using a content hash (SHA-256). If a document with the same (source_id, url) or (source_id, content_hash) already exists, the fetch succeeds without writing a duplicate. This handles sources that republish or update content without changing the URL.
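The two-key check can be illustrated with an in-memory index. In the real pipeline this would be enforced in D1; the sketch below assumes the SHA-256 hash has already been computed and uses a hypothetical `Doc` shape.

```typescript
// Hypothetical document shape; content_hash is assumed to be a precomputed SHA-256 hex string.
interface Doc {
  source_id: string;
  url: string;
  content_hash: string;
}

// Returns an insert function that writes a document only if neither
// (source_id, url) nor (source_id, content_hash) has been seen before.
function makeDedupIndex(): (doc: Doc) => boolean {
  const seen: Record<string, boolean> = {};
  return function insertIfNew(doc: Doc): boolean {
    const urlKey = "url|" + doc.source_id + "|" + doc.url;
    const hashKey = "hash|" + doc.source_id + "|" + doc.content_hash;
    if (seen[urlKey] || seen[hashKey]) return false; // duplicate: skip the write
    seen[urlKey] = true;
    seen[hashKey] = true;
    return true;
  };
}
```

Both keys are scoped by `source_id`, so the same article syndicated by two different sources is still stored once per source.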

External scrape resilience. The GitHub Actions hard-scrape workflow runs on a separate schedule (every 6 hours) from the main Worker cron. If Playwright fails to render a site, the GHA run fails with a logged error; the Worker continues operating on its remaining sources. The /ingest endpoint validates an authentication secret before accepting POSTed content.

Email sender validation. Each email source has a sender-domain allowlist. Emails from unrecognized domains are rejected with a bounce, preventing spoofed content from entering the pipeline. One exception: the Drovers/Farm Journal address has no allowlist because the brand family sends from many domains; the trade-off is accepted for an unpublished, low-spoof-risk address.

Synthesis idempotency. The synthesis pipeline checks for an existing content_item with the current week’s slug before calling Gemini. If a brief already exists, it returns skipped — preventing duplicate API spend on cron retries or manual re-triggers. Override with triggeredBy='force'.
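One way to picture the guard: derive a deterministic slug for the week, then skip if a content item with that slug already exists. The `weekly-brief-YYYY-MM-DD` slug format (keyed to the Monday of the week, in UTC) is an assumption for illustration; the actual slug scheme is not documented here.

```typescript
type TriggeredBy = "cron" | "manual" | "force";

// Hypothetical slug scheme: the UTC date of the Monday that starts the week.
function weekSlug(now: Date): string {
  const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  const daysSinceMonday = (d.getUTCDay() + 6) % 7; // Sunday (0) -> 6, Monday (1) -> 0
  d.setUTCDate(d.getUTCDate() - daysSinceMonday);
  return "weekly-brief-" + d.toISOString().slice(0, 10);
}

// Decides whether to call the model: skip when this week's brief exists, unless forced.
function shouldRunSynthesis(
  existingSlugs: string[],
  now: Date,
  triggeredBy: TriggeredBy,
): { run: boolean; slug: string } {
  const slug = weekSlug(now);
  const exists = existingSlugs.indexOf(slug) !== -1;
  return { run: triggeredBy === "force" || !exists, slug };
}
```

Because the slug depends only on the date, a cron retry and a manual re-trigger in the same week resolve to the same key and hit the same guard.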

Synthesis failure alerting. If the Gemini call or any downstream step fails, the pipeline catches the error, emails a failure alert to both hosts with the error message and retry instructions, and returns a failed result. This ensures missed briefs are visible without relying on Worker log monitoring.

Tagging best-effort. The balance tagging pass (tagWeekDocuments) is wrapped in error handling that logs and swallows failures per batch. A tagging failure never blocks the weekly brief — the synthesis proceeds with whatever tags were successfully applied, and untagged documents are picked up on the next run.
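The log-and-swallow pattern can be sketched as follows, using the batch size of 12 and the cap of 60 documents per run mentioned elsewhere in this document. The sketch is synchronous for brevity, and `classifyBatch` is a hypothetical stand-in for the Gemini Flash call, which would be asynchronous in practice.

```typescript
interface TagResult {
  tagged: number;
  failedBatches: number;
}

// Splits a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// `classifyBatch` returns how many documents it tagged and may throw on failure.
function tagBestEffort(
  docIds: string[],
  classifyBatch: (batch: string[]) => number,
  batchSize: number = 12,
  maxPerRun: number = 60,
): TagResult {
  let tagged = 0;
  let failedBatches = 0;
  for (const batch of chunk(docIds.slice(0, maxPerRun), batchSize)) {
    try {
      tagged += classifyBatch(batch);
    } catch (err) {
      // Log and continue: one failed batch must not block the remaining batches or the brief.
      failedBatches++;
      console.warn("tagging batch failed; continuing", err);
    }
  }
  return { tagged, failedBatches };
}
```

Catching per batch, rather than around the whole pass, limits the damage of a single bad response to at most one batch of documents.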

Email size limits. Inbound emails larger than 25 MB are rejected, and each attachment is capped at 5 MB. These limits prevent oversized messages from consuming R2 storage or timing out the Worker.

12. Non-Functional Requirements (Measured)

NFR	Target	Basis
Ingestion coverage	75+ sources across 5 methods	Registered fetchers + email mappings + GHA external scrape
Source cadence enforcement	Cadence-gated (15 min to annual)	`getDueSources` query with per-cadence datetime thresholds
Document deduplication	0 duplicate documents	Content-hash (SHA-256) upsert idempotency in D1
Synthesis delivery	Every Monday by 09:00 ET	Cron at 13:00 UTC (06:00 Phoenix; 09:00 EDT, 08:00 EST)
Anti-fabrication	0 invented figures in brief	WISER system prompt hard rules + source_id-only citation format
Balance coverage	Measured across 7 axes	Natural/Published share tracking with configurable floor/ceiling guardrails
Embedding dimensionality	768-dim cosine	BGE-base-en-v1.5 via Workers AI, matching Vectorize index config
Brief word count	800–1,200 words	Enforced in WISER Review checklist; Gemini maxOutputTokens=8192
Email sender validation	Domain allowlist per source	Bounce on mismatch; prevents spoofed ingestion
CI/CD	TypeScript typecheck on every push/PR	GitHub Actions `ci.yml` with `npm run typecheck`

13. Tech Stack

Layer	Technology	Role
Compute	Cloudflare Workers (TypeScript)	Hourly ingestion cron, weekly synthesis cron, admin API, email handler
Database	Cloudflare D1 (SQLite)	Sources, documents, indicators, tags, content items, balance history, guardrail config
Object storage	Cloudflare R2	Raw payloads (raw-content bucket), media/attachments (media bucket)
Vector search	Cloudflare Vectorize (768-dim, cosine)	Semantic embeddings for document retrieval
Cache	Cloudflare KV	Reserved for shorts generator; also used for synthesis context
Embeddings	Workers AI (`@cf/baai/bge-base-en-v1.5`)	768-dim text embeddings, ~512 token context
Synthesis LLM	Google Gemini 2.5 Pro	Weekly brief generation with WISER framework
Classification LLM	Google Gemini 2.5 Flash	Document balance tagging across 7 axes
Prompt framework	WISER	Structured prompt: Who, Instructions, Sub-tasks, Examples, Review
Email inbound	Cloudflare Email Routing + PostalMime	Newsletter ingestion from 15 addresses
Email outbound	Cloudflare Email Sending (MailChannels)	Brief delivery and failure alerts from a dedicated sending address on the podcast domain
Secrets	Cloudflare Secrets Store	FRED, BLS, Census, Gemini API keys
External scrape	GitHub Actions + Playwright	Headless Chromium rendering for anti-bot sites
CI/CD	GitHub Actions	TypeScript typecheck, deployment via `wrangler deploy`
Source config	D1 rows + JSON `scrape_config`	Data-driven source management; no code changes for most new sources

14. Security & Compliance

Secrets management. Four API keys (FRED, BLS, Census, Gemini) are stored in Cloudflare Secrets Store and accessed asynchronously via Worker bindings. The Worker’s admin endpoints are gated by an authentication secret. No secrets appear in code, environment files, or logs.

Ingestion is read-only. All source fetching reads from public or permission-granted endpoints. The pipeline never modifies external data. Email ingestion accepts only inbound messages; the Worker’s only outbound mail is the weekly brief and failure alerts sent to the hosts.

Email sender validation. Each email source has a per-source sender-domain allowlist, with the single Drovers/Farm Journal exception described in Section 11. Emails from unrecognized domains receive a 5xx reject, generating a bounce to the sender. This prevents spoofed content from entering the D1/Vectorize pipeline.

Restricted email delivery. Outbound email is limited to a fixed list of recipient addresses (the hosts’ inboxes). The recipient list is a constant in synthesis.ts; there is no mechanism for external parties to add recipients. Test mode further restricts to a single recipient.

Admin endpoint authentication. All mutating HTTP endpoints (POST /sources, PATCH /sources/:id, POST /run-synthesis, POST /ingest/:id) require a valid authentication secret. Unauthenticated requests receive a 401.
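A sketch of the secret check, assuming the secret is presented as an `Authorization: Bearer <secret>` header. The header scheme and the helper names are assumptions; the document does not specify how the secret is transmitted.

```typescript
// Compares two strings without exiting early on the first mismatch,
// so response time does not reveal how many leading characters matched.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Assumed scheme: "Authorization: Bearer <secret>". Returns false for a missing or malformed header.
function isAuthorized(authorizationHeader: string | null, secret: string): boolean {
  const prefix = "Bearer ";
  if (!authorizationHeader || authorizationHeader.slice(0, prefix.length) !== prefix) {
    return false;
  }
  return constantTimeEqual(authorizationHeader.slice(prefix.length), secret);
}
```

A handler would call `isAuthorized` before any mutating work and return a 401 response when it yields `false`.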

No PII. The pipeline ingests public/published data and stores only document text, metadata, and structured indicators. No personally identifiable information is collected or stored.

15. Deployment & Operations

Figure: The Broken Handle Deployment Diagram (Cloudflare resource layout)

The research Worker deploys via Wrangler from the project root. CI/CD runs on GitHub Actions: ci.yml performs TypeScript typechecking on every push/PR; deploy.yml deploys the research Worker to the shared Cloudflare account.

Infrastructure provisioning. D1 database, R2 buckets, Vectorize index, KV namespace, and Secrets Store bindings are declared in wrangler.toml and provisioned via Wrangler. Source configuration lives in D1 and is managed through dated SQL migrations under scripts/migrations/.

Cron schedules. Two cron triggers are configured in wrangler.toml: 0 * * * * (hourly ingestion sweep) and 0 13 * * 1 (Monday 13:00 UTC synthesis). The hard-scrape GHA workflow runs on its own cron: 15 */6 * * * (every 6 hours at :15 past).
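Since one Worker serves both cron triggers, the scheduled handler needs to tell them apart. A common approach is to branch on the cron expression that fired, which the Workers runtime exposes on the scheduled event's `cron` property. The job names and the `CronEvent` stand-in type below are illustrative assumptions.

```typescript
const INGEST_CRON = "0 * * * *";     // hourly ingestion sweep
const SYNTHESIS_CRON = "0 13 * * 1"; // Monday 13:00 UTC synthesis

type Job = "ingest" | "synthesis" | "ignore";

// Minimal stand-in for the Workers scheduled event (only the fields used here).
interface CronEvent {
  cron: string;
  scheduledTime: number;
}

// Maps the cron expression that fired to the job that should run.
function jobForCron(cron: string): Job {
  switch (cron) {
    case INGEST_CRON:
      return "ingest";
    case SYNTHESIS_CRON:
      return "synthesis";
    default:
      return "ignore";
  }
}

function handleScheduled(event: CronEvent): Job {
  return jobForCron(event.cron);
}
```

At Monday 13:00 UTC both expressions match, so the Worker would receive two separate scheduled events, one per trigger, and each routes independently.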

Observability. Worker observability is enabled in wrangler.toml ([observability] enabled = true). Console logs from ingestion and synthesis are visible in the Cloudflare dashboard. The /status endpoint returns a JSON health check including Worker version, registered fetcher count, total documents, total indicators, and per-source document counts.

Source management. The admin API exposes create, read, and update operations on the sources table (GET /sources, GET /sources/:id, POST /sources, PATCH /sources/:id). Source fields are validated against a whitelist; ingestion methods and cadences are checked against valid sets. This enables adding, disabling, or reconfiguring sources without code changes.
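The validation described above might look roughly like this. The ingestion-method values follow the sources table description; the editable field list and the cadence names are assumptions for the sketch.

```typescript
// Assumed field list and cadence names; ingestion methods follow the sources table description.
const EDITABLE_FIELDS = ["name", "url", "tier", "category", "ingestion_method", "cadence", "status"];
const INGESTION_METHODS = ["api", "rss", "scrape", "external_scrape", "email", "manual_pdf"];
const CADENCES = ["realtime", "weekly", "monthly", "quarterly", "annual"];

type Validation =
  | { ok: true; patch: Record<string, unknown> }
  | { ok: false; errors: string[] };

// Rejects unknown fields and out-of-set enum values before anything reaches D1.
function validateSourcePatch(body: Record<string, unknown>): Validation {
  const errors: string[] = [];
  const patch: Record<string, unknown> = {};

  for (const key of Object.keys(body)) {
    if (EDITABLE_FIELDS.indexOf(key) === -1) {
      errors.push("unknown field: " + key);
      continue;
    }
    patch[key] = body[key];
  }

  const method = patch["ingestion_method"];
  if (method !== undefined && (typeof method !== "string" || INGESTION_METHODS.indexOf(method) === -1)) {
    errors.push("invalid ingestion_method");
  }
  const cadence = patch["cadence"];
  if (cadence !== undefined && (typeof cadence !== "string" || CADENCES.indexOf(cadence) === -1)) {
    errors.push("invalid cadence");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, patch };
}
```

Collecting every error before returning lets a single response tell the caller everything that needs fixing in the request.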

Email routing. Inbound email addresses are configured in Cloudflare Email Routing for the thebrokenhandle.com zone, pointing to the research Worker. Adding a new email source requires three steps: (1) add the address→source mapping in email.ts, (2) create or update the D1 source row, (3) add an Email Routing rule in the Cloudflare dashboard.

16. Cross-Project Context

TBH Editor (shared Cloudflare infrastructure). TBH Editor is a browser-based, Descript-style podcast editor running on the same shared Cloudflare account. Both projects share the account-level billing, DNS zone (thebrokenhandle.com), and Secrets Store. The research pipeline’s media R2 bucket is architecturally adjacent to the editor’s media R2 bucket — both store podcast-related media, though they are separate buckets with separate access controls. The pipeline’s curated data layer (D1 documents + Vectorize embeddings) is designed to be consumable by the editor for episode research and show-notes generation, though this integration is not yet built.

Shared Cloudflare account resources. The shared Cloudflare account hosts: the research pipeline Worker, the editor (Worker + Durable Objects + Workflow + Container), the podcast website (Cloudflare Pages), Email Routing for newsletter ingestion, and DNS for thebrokenhandle.com. Both Workers run on the Workers Paid plan, which is required for the editor’s Durable Objects and Containers.

WISER framework (shared with ServiceBay AI). The WISER prompting framework used for the synthesis pipeline is the same methodology applied in ServiceBay AI for agent behavior definition. In that project, WISER defines agent instructions within IBM watsonx Orchestrate; here it defines the Gemini synthesis prompt. The framework’s portability across LLM providers (Gemini, GPT-OSS) and runtimes (Orchestrate, direct API) supports its use as a cross-project standard.

17. Risks, Assumptions & Limitations

Source fragility. Scraping depends on third-party site structure; CSS selectors and link patterns require maintenance when publishers redesign. The scrape_config data model minimizes code changes, but selector tuning is an ongoing operational task. Dead-RSS migrations (e.g., Brookings, BCG Henderson) show this is a recurring pattern.
Anti-bot escalation. TrueUp Layoffs already requires Playwright via GitHub Actions because Cloudflare Workers are blocked. If more sources adopt aggressive anti-bot measures, the external-scrape path may need to expand, increasing GHA compute costs and operational complexity.
Gemini API dependency. Both synthesis (Pro) and tagging (Flash) depend on Google Gemini. An outage blocks the weekly brief and tagging pass. The failure-alerting mechanism ensures missed briefs are visible, but there is no automatic fallback to an alternative LLM.
Balance guardrail tuning. Floor/ceiling thresholds in guardrail_config are manually set. Over-aggressive floors can force the synthesis model to include low-signal stories; under-aggressive floors defeat the purpose. The balance_history trend data supports tuning but requires periodic human review.
Data volume projections. At current ingestion rates (~100–200 documents/week, ~50–100 indicators/week), D1 and R2 growth is modest. Over 12 months, the upper end of that range (200 documents × 52 weeks) works out to roughly 10,000 documents in D1 (~50 MB), ~5,000 raw payloads in R2 (~500 MB), and ~10,000 vectors in Vectorize. All are well within Cloudflare’s included allocations on the Workers Paid plan. If the source count doubles, these projections roughly double as well.
Email deliverability. Outbound brief delivery uses Cloudflare Email Sending (MailChannels) from a dedicated sending address on the podcast domain. SPF/DKIM/DMARC are configured for the domain. Delivery is to the hosts’ inboxes; spam filtering has not flagged the briefs to date, but deliverability is not guaranteed and there is no fallback delivery mechanism. Failed sends are caught and logged but not retried.
Shorts generator is planned. The shorts Worker and its KV namespace are reserved but not yet built. Technical details (video generation model, narration voice, output format, distribution channels) are TBD.

18. Roadmap

Phase 1 — Research Pipeline (current). Automated ingestion from 75+ sources, weekly synthesis brief with WISER-guided Gemini, balance guardrails across seven editorial axes. The pipeline is the data foundation that makes everything else possible.

Phase 2 — Shorts Generator. AI-narrated 60–90s “Pivot Briefs” generated from the curated data layer. Each short highlights one story from the weekly brief, narrated in show voice, formatted for vertical video (YouTube Shorts, Instagram Reels, TikTok). The shorts Worker will read from the same D1/Vectorize layer and use KV for generation state. Technical stack TBD: likely a text-to-speech model for narration and a template-driven video composition pipeline.

Phase 3 — Source Health Monitoring. Automated alerting for source degradation: consecutive fetch failures, selector breakage (scrape returns zero documents), RSS feed staleness (no new items in 2× cadence window). Currently, source health is visible only through the /status endpoint and fetch_runs table — no proactive alerting exists.

Phase 4 — Semantic Search & Retrieval. Expose the Vectorize embeddings for interactive querying — “what did we see about healthcare layoffs in Q1?” — via an API endpoint or chat interface. The embeddings exist today but are write-only; no retrieval path is built.

Phase 5 — Audience & Revenue. Expand from internal research tool to audience-facing content. Newsletter automation (Substack API integration), social media posting, and eventually sponsorship/advertising infrastructure. This phase builds on the content generation capabilities from Phase 2.

Diagrams: Data Flow Diagram · Source Taxonomy · Deployment Diagram