Jarrod E. Brown

Solution Architecture Document — TBH Editor

Project: TBH-Editor, a browser-based, Descript-style podcast editor
Author: Jarrod E. Brown
Platform: Cloudflare (Workers + Durable Objects + R2 + D1 + Workers AI + Containers)
Status: In development
Repository: github.com/thebrokenhandle/tbh-editor (private)


1. Overview

TBH Editor is a browser-based, Descript-style podcast editor where you edit audio by editing the transcript, in real time, alongside your co-host. It is built entirely on Cloudflare: a Workers API with Durable Objects for live collaboration, a Workflow for chunked transcription, Workers AI for transcription and voice work, R2 for media, D1 for metadata, and a CPU container running ffmpeg for export.

The system implements a non-destructive edit model: the original audio is never modified. All edits operate on the transcript layer — a structured sequence of timed word segments — and the final render reconstructs the output by reading the edit decision list against the source audio. This means every edit is reversible, and multiple edit branches can coexist without duplicating media files.

2. Problem & Context

Editing a two-host podcast in traditional DAWs is slow and not collaborative. The hosts record in a single session, then one person takes the file offline, edits in Audacity or Logic, and sends a bounce back for review. Feedback cycles are asynchronous and lossy — “cut that section around minute 12” is imprecise, and the editing host carries the full cognitive load of mapping notes to waveform positions.

Descript-style transcript editing is the desired workflow: you read and edit a document, and the audio follows. Commercial tools, however, are subscription-bound ($24–33/month per seat), not self-hostable, and not customizable to a two-person podcast workflow. Features like filler-word removal, overdub, and real-time collaboration exist in Descript’s business tier, but its per-seat pricing is hard to justify for a two-person show.

TBH Editor delivers transcript-driven, real-time collaborative editing on infrastructure the team already runs for The Broken Handle podcast, with AI overdub for fixing lines without re-recording. Both hosts see and edit the same transcript simultaneously, resolving the async feedback problem entirely.

3. Goals & Requirements

Functional

  • Upload audio (WAV, MP3, M4A, FLAC), transcribe it, and edit audio by editing the transcript.
  • Real-time collaborative editing between the two hosts.
  • AI “Overdub” — regenerate a corrected line in a host’s cloned voice via MiniMax voice cloning.
  • Non-destructive edit model: original audio is never modified; edits are tracked as operations against timed transcript segments.
  • Export a finished render (ffmpeg) with edits applied, producing MP3 or WAV output.
  • Project management: create, list, archive, and delete projects; each project contains multiple tracks.

Non-functional

  • Fully Cloudflare-native; serverless except the render container.
  • Authenticated access restricted to the host team (Cloudflare Access).
  • Resilient long-running transcription via a managed Workflow with per-chunk retry.
  • Upload files up to 500 MB per track (an application-level cap on presigned uploads, well below R2’s maximum object size).
  • Transcription word error rate (WER) below 10% for clear two-speaker English podcast audio.
  • Render export completes within 5 minutes for a 90-minute episode.
  • Sub-200ms WebSocket round-trip for edit operations between collaborators.

4. Decision Rationale

Why Cloudflare over alternatives? The Broken Handle already runs its website, research pipeline, and media assets on the Cloudflare developer platform. Adding TBH Editor to the same account consolidates billing, DNS management, and deployment tooling under one provider. The specific technical drivers for Cloudflare are:

  • Durable Objects provide a single-writer, globally consistent coordination primitive that maps directly to the “authoritative edit document” pattern, so no external database or message broker is needed for real-time collaboration.
  • R2 offers S3-compatible object storage with zero egress fees, which matters for a media-heavy application where hosts repeatedly download audio for playback.
  • Workers AI provides on-platform access to Whisper (transcription) and MiniMax (voice cloning) without managing GPU infrastructure or external API credentials.
  • Cloudflare Containers provide a managed runtime for CPU-intensive ffmpeg renders without maintaining a separate server or orchestration layer.

The alternative (hosting on AWS with Lambda + DynamoDB + S3 + ECS, or on a VPS) would require stitching together multiple services, managing cross-service auth, and paying S3 egress on every audio playback. The Cloudflare stack trades ecosystem breadth for deployment simplicity and cost predictability.

Why Durable Objects for collaboration over CRDTs or external real-time databases? The collaboration model requires exactly one authoritative document state at any time. CRDTs (e.g., Yjs, Automerge) are designed for eventual consistency across distributed peers, but TBH Editor’s two-user scenario doesn’t need eventual consistency — it needs immediate consistency with a single source of truth. A Durable Object running in one location provides this naturally: it holds the canonical document in memory, receives edit operations over WebSocket, applies them sequentially, and broadcasts the result. The tradeoff is that the Durable Object is a single point of coordination (not a single point of failure — Cloudflare manages failover), but for a two-user podcast editor, the simplicity of “one object owns the document” outweighs the complexity of distributed conflict resolution. If the editor needed to scale to many concurrent editors (like Google Docs), CRDTs would be the right choice; for two hosts editing one episode, the Durable Object pattern is simpler and correct.

Why a Workflow for transcription instead of a single Worker invocation? Transcribing a 90-minute podcast in one Worker invocation risks exceeding the Worker CPU time limit (30 seconds by default on the paid plan), and any failure would discard all progress. The Workflow primitive provides durable, multi-step execution with automatic retries and state persistence between steps. The transcription Workflow chunks the audio into segments, transcribes each chunk independently via Workers AI, and stitches the results, so an individual chunk failure does not restart the entire transcription.
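
As a rough illustration of the chunking step, the sketch below plans overlapping chunk boundaries for a track. It assumes the 30-second chunk target and 2-second overlap given in the Data Flow section; planChunks and ChunkSpan are hypothetical names, not taken from the repository, and the real Workflow may cut audio differently.

```typescript
// Hypothetical helper: plans overlapping chunk boundaries for one track.
interface ChunkSpan {
  index: number;
  startMs: number;
  endMs: number;
}

function planChunks(durationMs: number, chunkMs = 30000, overlapMs = 2000): ChunkSpan[] {
  if (chunkMs <= 0 || overlapMs < 0 || overlapMs >= chunkMs) {
    throw new Error("overlap must be non-negative and smaller than the chunk length");
  }
  const strideMs = chunkMs - overlapMs; // each chunk starts this far after the previous one
  const chunks: ChunkSpan[] = [];
  for (let startMs = 0; startMs < durationMs; startMs += strideMs) {
    const endMs = Math.min(startMs + chunkMs, durationMs);
    chunks.push({ index: chunks.length, startMs, endMs });
    if (endMs === durationMs) break; // the last chunk reached the end of the audio
  }
  return chunks;
}
```

Under these assumptions a 90-minute track (5,400,000 ms) yields 193 chunks, each of which could be transcribed and retried as its own Workflow step.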

Why MiniMax for voice cloning? MiniMax’s speech synthesis API is available through Cloudflare Workers AI, keeping the integration on-platform with no external API gateway or credential management. The model supports voice cloning from a short reference sample (15–30 seconds), which fits the podcast use case: each host provides one sample, and subsequent overdub requests generate speech in their cloned voice. The alternative — ElevenLabs or OpenAI TTS — would require external API keys, additional latency from cross-provider calls, and per-character pricing separate from the Cloudflare billing consolidation.

Why a non-destructive edit model? Podcast editing involves frequent experimentation — cutting a tangent, then deciding to keep it, trying different section orders. A destructive model (modifying the audio file directly) makes undo expensive and branching impossible. The non-destructive model stores edits as an ordered list of operations (delete segment, insert overdub, reorder sections) against the immutable source audio. The render step reads this edit decision list and produces the final output. This is the same model used by professional video editors (Premiere, Final Cut) and Descript.

5. Architecture Overview

Figure: TBH Editor Container Diagram (C4), full Cloudflare-native architecture

The system consists of a React SPA frontend served via Workers Assets, a Hono-based Worker API, three Cloudflare compute primitives (Durable Object, Workflow, Container), two storage backends (R2, D1), and Workers AI for model inference.

6. Components

# | Component | Responsibility
1 | Frontend (frontend/) | React SPA (Vite): project list, transcript editor with waveform visualization, playback controls, overdub UI. Communicates with the Worker via REST (project/upload/render CRUD) and WebSocket (live editing). Served as Workers Assets with SPA routing.
2 | Worker (src/index.ts) | Hono application: Cloudflare Access authentication middleware, route mounting, Durable Object namespace exports, Workflow binding, and R2/D1/AI bindings. Entry point for all API requests.
3 | EditDocument (Durable Object) | The authoritative edit document. Holds the current transcript state and edit operation history in memory, accepts edit commands over WebSocket, applies them sequentially, and broadcasts the updated state to all connected clients. Persists snapshots to its transactional storage on every write for durability.
4 | TranscribeWorkflow (Workflow) | Cloudflare Workflow for chunked transcription. Steps: (1) fetch audio from R2, (2) split into chunks, (3) transcribe each chunk with Whisper via Workers AI, (4) align timestamps across chunks, (5) write the stitched transcript to D1, (6) notify the EditDocument DO. Retries individual chunk failures up to 3 times.
5 | RenderContainer (DO + Container) | A Durable Object that manages an ffmpeg container instance. Receives a render request (project ID + edit decision list), fetches source audio and overdub clips from R2, runs ffmpeg to apply edits (cuts, inserts, reorders, crossfades), and writes the final output back to R2.
6 | lib/ai.ts | Workers AI integration module. Exposes three functions: transcribeChunk(audioBuffer) for Whisper, cloneVoice(referenceAudio, text) for MiniMax voice synthesis, and rewriteLine(originalText, instruction) for LLM-assisted line correction.
7 | routes/* | REST route handlers: projects (CRUD), uploads (presigned R2 URL generation + upload finalization), transcribe (start Workflow), render (trigger export), voice (overdub request).
8 | db/ | D1 schema definitions (SQL migrations) and typed query helpers using prepared statements.

7. Collaboration & Edit Model

Figure: TBH Editor Collaboration State Diagram (WebSocket architecture, edit operations, and D1 data model)

Real-time collaboration architecture. The EditDocument Durable Object is the single source of truth for each project’s edit state. When a host opens a project, the frontend establishes a WebSocket connection to the DO. The DO maintains an in-memory representation of the current document state: the transcript segments (with word-level timestamps), the edit operation stack, and the playback cursor positions of all connected clients.
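
The message flow described above can be modelled without the Workers runtime. The following is a minimal sketch, assuming a simplified operation that only flags words as deleted or restored; EditSessionSketch, its message shapes ("snapshot", "applied"), and the version counter are illustrative assumptions rather than the actual Durable Object implementation.

```typescript
// Sketch only: plain-TypeScript model of "apply sequentially, then broadcast".
type ClientListener = (message: string) => void;

interface WordFlagOp {
  wordIds: string[];
  deleted: boolean;
}

class EditSessionSketch {
  private version = 0;
  private listeners: { [clientId: string]: ClientListener } = {};
  private deletedWords: { [wordId: string]: boolean } = {};

  // A new or reconnecting client first receives a full snapshot.
  connect(clientId: string, listener: ClientListener): void {
    this.listeners[clientId] = listener;
    listener(JSON.stringify({ type: "snapshot", version: this.version, deleted: this.deletedIds() }));
  }

  disconnect(clientId: string): void {
    delete this.listeners[clientId];
  }

  // Operations are applied one at a time, then broadcast to every connected client.
  submit(clientId: string, op: WordFlagOp): number {
    for (const id of op.wordIds) {
      this.deletedWords[id] = op.deleted;
    }
    this.version += 1;
    const message = JSON.stringify({ type: "applied", version: this.version, by: clientId, op });
    for (const id of Object.keys(this.listeners)) {
      this.listeners[id](message);
    }
    return this.version;
  }

  private deletedIds(): string[] {
    return Object.keys(this.deletedWords).filter((id) => this.deletedWords[id]).sort();
  }
}
```

In the real system each listener would be a WebSocket held by the Durable Object, and the word flags would be persisted to transactional storage on every write.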

Edit operations. The editor supports five operation types, each represented as a JSON command sent over WebSocket:

Operation | Payload | Effect
delete_segment | { start_word_id, end_word_id } | Marks a contiguous range of words as deleted. The words remain in the transcript data but are flagged as excluded from playback and render.
restore_segment | { start_word_id, end_word_id } | Restores a previously deleted range.
insert_overdub | { after_word_id, overdub_r2_key, text, duration_ms } | Inserts a synthesized audio clip (already stored in R2) at a position in the transcript.
reorder_section | { section_id, new_position } | Moves a named section to a new position in the timeline.
update_word | { word_id, new_text } | Corrects a transcription error in the text layer (does not affect audio timing).
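
One way to type these commands is a discriminated union with a small validating parser, sketched below. The payload field names come from the table; the envelope field type, the use of string IDs, and the function name parseEditOperation are assumptions made for illustration.

```typescript
// Assumed wire format: the payload fields from the table plus a "type" discriminator.
type EditOperation =
  | { type: "delete_segment"; start_word_id: string; end_word_id: string }
  | { type: "restore_segment"; start_word_id: string; end_word_id: string }
  | { type: "insert_overdub"; after_word_id: string; overdub_r2_key: string; text: string; duration_ms: number }
  | { type: "reorder_section"; section_id: string; new_position: number }
  | { type: "update_word"; word_id: string; new_text: string };

// Returns the typed operation, or null when the message is malformed.
function parseEditOperation(raw: string): EditOperation | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as { [key: string]: unknown };
  const isText = (key: string): boolean => typeof fields[key] === "string";
  const isNumber = (key: string): boolean => {
    const v = fields[key];
    return typeof v === "number" && isFinite(v);
  };
  let valid = false;
  switch (fields["type"]) {
    case "delete_segment":
    case "restore_segment":
      valid = isText("start_word_id") && isText("end_word_id");
      break;
    case "insert_overdub":
      valid = isText("after_word_id") && isText("overdub_r2_key") && isText("text") && isNumber("duration_ms");
      break;
    case "reorder_section":
      valid = isText("section_id") && isNumber("new_position");
      break;
    case "update_word":
      valid = isText("word_id") && isText("new_text");
      break;
  }
  return valid ? (fields as unknown as EditOperation) : null;
}
```

Validating at the WebSocket boundary keeps malformed messages out of the operation log, and the union lets the compiler check that every operation type is handled.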

Conflict resolution. The Durable Object processes operations sequentially in the order they arrive. Because the DO is single-threaded and processes one message at a time, there are no concurrent write conflicts at the storage level. When two hosts submit conflicting operations (e.g., both delete overlapping ranges), the first operation to arrive is applied, and the second is rebased against the new state. The rebase logic is simple: if the target segment range has already been modified, the operation is adjusted to the remaining valid range or rejected with a conflict notification sent back to the submitting client. This is a first-arrival-wins strategy with notification: the later operation never overwrites the earlier one. It is simpler than operational transforms but sufficient for a two-editor scenario where conflicts are rare and immediately visible.
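
A minimal sketch of that rebase rule for delete operations follows. It assumes words are addressed by their sequential index and tracked in a boolean array; rebaseDelete and the outcome shapes are hypothetical, and the real logic may treat other operation types differently.

```typescript
interface IndexRange {
  start: number; // inclusive word index
  end: number;   // inclusive word index
}

type RebaseOutcome =
  | { kind: "applied"; ranges: IndexRange[]; adjusted: boolean }
  | { kind: "conflict"; reason: string };

// Applies a delete to whatever part of the requested range is still undeleted.
function rebaseDelete(deletedFlags: boolean[], requested: IndexRange): RebaseOutcome {
  if (requested.start < 0 || requested.end >= deletedFlags.length || requested.start > requested.end) {
    return { kind: "conflict", reason: "range is outside the transcript" };
  }
  const ranges: IndexRange[] = [];
  let runStart = -1;
  for (let i = requested.start; i <= requested.end; i++) {
    if (!deletedFlags[i]) {
      if (runStart === -1) runStart = i;
    } else if (runStart !== -1) {
      ranges.push({ start: runStart, end: i - 1 });
      runStart = -1;
    }
  }
  if (runStart !== -1) ranges.push({ start: runStart, end: requested.end });
  if (ranges.length === 0) {
    return { kind: "conflict", reason: "range was already deleted" };
  }
  for (const r of ranges) {
    for (let i = r.start; i <= r.end; i++) deletedFlags[i] = true;
  }
  const adjusted =
    ranges.length !== 1 || ranges[0].start !== requested.start || ranges[0].end !== requested.end;
  return { kind: "applied", ranges, adjusted };
}
```

If one host deletes words 2 to 5 and the other then deletes 4 to 7, the second request is narrowed to 6 to 7 and flagged as adjusted; a later request that falls entirely inside an already deleted range comes back as a conflict.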

Non-destructive edit model. All edit operations produce an Edit Decision List (EDL) — an ordered sequence of instructions that describe how to reconstruct the output from the source audio. The source audio files in R2 are never modified. The EDL is stored as a JSON array in the Durable Object’s transactional storage and is the input to the render step. The frontend reconstructs playback in real time by reading the EDL and seeking through the source audio accordingly, using the Web Audio API to skip deleted segments and insert overdub clips at the correct positions.
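
To make the idea concrete, the sketch below derives the ranges of source audio that survive a set of deletions, which is the core of what playback and render need from the EDL. The TimedWord shape and the function names are assumptions for illustration; overdub inserts and section reorders are left out.

```typescript
interface TimedWord {
  startMs: number;
  endMs: number;
  deleted: boolean;
}

interface KeepRange {
  startMs: number;
  endMs: number;
}

// Merges runs of consecutive kept words into continuous source ranges.
function buildKeepRanges(words: TimedWord[]): KeepRange[] {
  const ranges: KeepRange[] = [];
  let previousKept = false;
  for (const w of words) {
    if (w.deleted) {
      previousKept = false;
      continue;
    }
    if (previousKept) {
      // Extend the current range, keeping the natural pause between the two words.
      ranges[ranges.length - 1].endMs = w.endMs;
    } else {
      ranges.push({ startMs: w.startMs, endMs: w.endMs });
    }
    previousKept = true;
  }
  return ranges;
}

function renderedDurationMs(ranges: KeepRange[]): number {
  return ranges.reduce((total, r) => total + (r.endMs - r.startMs), 0);
}
```

Because the source audio is never touched, restoring a deleted word only flips its flag and the ranges are recomputed.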

Offline and degraded-mode behavior. If a host’s WebSocket connection drops, the frontend enters read-only mode: the last-known document state remains visible, but edits are disabled until the connection is re-established. The DO detects the disconnection via WebSocket close events and removes the client from the active-editors set. On reconnection, the client receives a full state snapshot from the DO, reconciling any edits made by the other host during the disconnection. There is no offline editing queue — edits require an active connection to the authoritative DO. This is a deliberate simplicity tradeoff: offline editing would require a CRDT or OT layer that adds complexity disproportionate to the two-user use case.

8. Data Flow

Figure: TBH Editor Sequence Diagram (upload → transcribe → edit → export workflow)
  1. Upload. A host selects an audio file in the frontend. The frontend requests a presigned R2 upload URL from the Worker (POST /uploads/presign), uploads the file directly to R2 via the presigned URL (bypassing the Worker for large files), then calls the finalize endpoint (POST /uploads/finalize) which records the track metadata in D1 and links it to the project.
  2. Transcribe. The frontend triggers transcription (POST /transcribe). The Worker starts a TranscribeWorkflow instance, passing the R2 object key and project/track IDs. The Workflow fetches the audio from R2, chunks it into segments (target: 30-second chunks with 2-second overlap for timestamp alignment), sends each chunk to Workers AI Whisper, collects the word-level timestamp results, deduplicates the overlap regions, and writes the unified transcript to D1. On completion, the Workflow notifies the EditDocument DO to load the new transcript into its in-memory state.
  3. Edit. Both hosts connect to the EditDocument DO via WebSocket. They see the transcript as an editable document. Selecting text and pressing delete marks those word segments as excluded. Typing replacement text triggers an overdub flow. Dragging sections reorders them. Every operation is applied immediately by the DO and broadcast to the other host’s client within the WebSocket round-trip.
  4. Overdub. When a host requests an overdub (corrected line in their voice), the frontend sends the request to the Worker (POST /voice/overdub). The Worker calls lib/ai.ts which: (a) optionally rewrites the line via an LLM for natural phrasing, (b) sends the text and the host’s voice reference sample to MiniMax via Workers AI for speech synthesis, (c) stores the generated audio clip in R2, and (d) returns the R2 key and duration to the frontend. The frontend then submits an insert_overdub operation to the DO.
  5. Export. The host triggers a render (POST /render). The Worker reads the current EDL from the EditDocument DO, starts the RenderContainer DO which provisions an ffmpeg container. The container fetches all referenced audio from R2, applies the EDL (cuts, inserts, reorders, crossfades), encodes the final output (MP3 at 192kbps or WAV), stores it in R2, and returns the download URL to the frontend.
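
For the cut-only case, the export step can be sketched as building an ffmpeg filter graph from the kept source ranges. This uses ffmpeg's atrim, asetpts, and concat audio filters; the function names, the single-input assumption, and the file paths are illustrative, and overdub inserts, reorders, and crossfades are not covered here.

```typescript
interface SourceRange {
  startMs: number;
  endMs: number;
}

// Builds a -filter_complex graph that trims each kept range and concatenates them in order.
function buildFilterGraph(ranges: SourceRange[]): string {
  if (ranges.length === 0) throw new Error("nothing to render");
  const parts: string[] = [];
  const labels: string[] = [];
  ranges.forEach((r, i) => {
    const start = (r.startMs / 1000).toFixed(3);
    const end = (r.endMs / 1000).toFixed(3);
    // asetpts resets timestamps so each trimmed piece starts at zero before concat.
    parts.push(`[0:a]atrim=start=${start}:end=${end},asetpts=PTS-STARTPTS[a${i}]`);
    labels.push(`[a${i}]`);
  });
  parts.push(`${labels.join("")}concat=n=${ranges.length}:v=0:a=1[out]`);
  return parts.join(";");
}

// Assembles the argument list for an MP3 export at 192 kbps.
function buildFfmpegArgs(inputPath: string, outputPath: string, ranges: SourceRange[]): string[] {
  return ["-i", inputPath, "-filter_complex", buildFilterGraph(ranges), "-map", "[out]", "-b:a", "192k", outputPath];
}
```

The container could pass the resulting array to the ffmpeg process directly, which avoids shell quoting issues with the filter graph.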

9. Data Model

The D1 database (tbh-editor) stores project metadata, track references, and transcript data. The edit state lives in the EditDocument Durable Object’s transactional storage, not in D1, because it requires single-writer consistency and sub-second read/write performance that a SQL database behind a Worker cannot guarantee.

Core entities:

projects — a podcast episode editing session.

Column | Type | Description
id | TEXT (ULID) | Primary key.
title | TEXT | Episode title (e.g., “S2E14: The AI Pivot”).
status | TEXT | draft · editing · rendering · exported · archived.
created_by | TEXT | Email of the creating host.
created_at | TEXT (ISO 8601) | Creation timestamp.
updated_at | TEXT (ISO 8601) | Last modification timestamp.

tracks — an audio file within a project (one per host, or a combined recording).

Column | Type | Description
id | TEXT (ULID) | Primary key.
project_id | TEXT | FK → projects.id.
label | TEXT | Display name (e.g., “Jarrod”, “Will”, “Combined”).
r2_key | TEXT | R2 object key for the source audio.
codec | TEXT | Source codec: wav, mp3, m4a, flac.
duration_ms | INTEGER | Duration in milliseconds.
file_size_bytes | INTEGER | File size for upload/download progress.
sample_rate | INTEGER | Audio sample rate (e.g., 44100, 48000).
channels | INTEGER | Channel count (1 = mono, 2 = stereo).
transcription_status | TEXT | pending · in_progress · completed · failed.
created_at | TEXT (ISO 8601) | Upload timestamp.

transcript_words — individual words with timestamps, the atomic unit of transcript editing.

Column | Type | Description
id | TEXT (ULID) | Primary key.
track_id | TEXT | FK → tracks.id.
word_index | INTEGER | Sequential position in the transcript.
text | TEXT | The transcribed word.
start_ms | INTEGER | Word start time in the source audio (milliseconds).
end_ms | INTEGER | Word end time in the source audio (milliseconds).
confidence | REAL | Whisper confidence score (0.0–1.0).
speaker | TEXT | Speaker label if diarization is available.

voice_profiles — stored voice clone references for overdub.

Column | Type | Description
id | TEXT (ULID) | Primary key.
host_email | TEXT | The host this voice belongs to.
label | TEXT | Display name (e.g., “Jarrod, Studio”).
reference_r2_key | TEXT | R2 key for the 15–30s reference audio sample.
minimax_voice_id | TEXT | MiniMax API voice ID after cloning registration, if applicable.
created_at | TEXT (ISO 8601) | Creation timestamp.

renders — completed export records.

Column | Type | Description
id | TEXT (ULID) | Primary key.
project_id | TEXT | FK → projects.id.
r2_key | TEXT | R2 object key for the rendered output.
format | TEXT | mp3 · wav.
duration_ms | INTEGER | Final output duration.
file_size_bytes | INTEGER | Output file size.
edl_snapshot | TEXT (JSON) | The EDL used for this render (frozen at render time).
status | TEXT | queued · rendering · completed · failed.
started_at | TEXT (ISO 8601) | Render start time.
completed_at | TEXT (ISO 8601) | Render completion time.

Edit state (Durable Object transactional storage). The EditDocument DO stores its state as a set of keys in its transactional storage: document_state (word statuses and overdub references), edl (the Edit Decision List as JSON), operation_log (append-only history for undo), and connected_clients (transient WebSocket connections and cursor positions).

10. External Interfaces

Interface | Provider | Protocol | Purpose | Auth | Rate Limits / Constraints
Whisper (transcription) | Cloudflare Workers AI | Workers AI binding | Speech-to-text with word-level timestamps | Platform binding (no key) | 720 req/min on Workers Paid; input ≤ ~25 MB per call
MiniMax Speech Synthesis | Cloudflare Workers AI | Workers AI binding | Text-to-speech with voice cloning | Platform binding (no key) | Subject to Workers AI rate limits; voice reference 15–30s
LLM (line rewrite) | Cloudflare Workers AI | Workers AI binding | Natural-language line correction before overdub synthesis | Platform binding (no key) | Standard Workers AI rate limits
R2 (media storage) | Cloudflare R2 | S3-compatible / Worker binding | Source audio, overdub clips, rendered exports | Worker binding; presigned URLs for direct upload | 1,000 PUT/s, 10,000 GET/s per bucket; objects up to 5 TB
D1 (metadata) | Cloudflare D1 | Worker binding | Project, track, transcript, voice profile, render metadata | Worker binding (no key) | Unlimited reads/writes on paid; 10 GB max database size
Cloudflare Access | Cloudflare Zero Trust | JWT validation middleware | Authentication: restricts access to allowed email addresses | JWT in CF-Access-JWT-Assertion header | N/A (policy-based)
ffmpeg (render) | Cloudflare Container | Container process | Audio processing: cut, concatenate, crossfade, encode | Internal (DO-to-Container communication) | Container CPU/memory limits; max_instances config

11. Error Handling & Resilience

Transcription failure (partial). The TranscribeWorkflow processes audio in chunks. If an individual chunk fails Whisper transcription (timeout, model error, corrupt audio segment), the Workflow retries that chunk up to 3 times with exponential backoff. If the chunk still fails after retries, the Workflow marks it as a gap in the transcript — the surrounding chunks are still written to D1, and the frontend displays the gap with a “transcription unavailable” placeholder. The host can manually transcribe the gap or re-upload a cleaner version of that audio segment.
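
The gap bookkeeping can be sketched as follows, assuming each chunk reports its time span and whether it ultimately succeeded after retries. Because chunks overlap, a failed chunk only leaves a gap where no successful neighbour covers the audio. ChunkOutcome and findTranscriptGaps are hypothetical names.

```typescript
interface ChunkOutcome {
  startMs: number;
  endMs: number;
  ok: boolean; // false when the chunk still failed after its retries
}

interface TranscriptGap {
  startMs: number;
  endMs: number;
}

// Returns the stretches of audio that no successfully transcribed chunk covers.
function findTranscriptGaps(outcomes: ChunkOutcome[]): TranscriptGap[] {
  if (outcomes.length === 0) return [];
  const sorted = outcomes.slice().sort((a, b) => a.startMs - b.startMs);
  const audioEndMs = sorted.reduce((max, c) => Math.max(max, c.endMs), 0);
  const gaps: TranscriptGap[] = [];
  let coveredUntilMs = sorted[0].startMs;
  for (const c of sorted) {
    if (!c.ok) continue;
    if (c.startMs > coveredUntilMs) {
      gaps.push({ startMs: coveredUntilMs, endMs: c.startMs });
    }
    coveredUntilMs = Math.max(coveredUntilMs, c.endMs);
  }
  if (coveredUntilMs < audioEndMs) {
    gaps.push({ startMs: coveredUntilMs, endMs: audioEndMs });
  }
  return gaps;
}
```

The frontend could render one “transcription unavailable” placeholder per returned gap.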

Transcription failure (complete). If the Workflow itself fails (e.g., the source audio cannot be fetched from R2), the track’s transcription_status is set to failed in D1 with an error message. The frontend shows the failure and offers a retry button that starts a new Workflow instance.

WebSocket disconnection. When a host’s WebSocket connection to the EditDocument DO drops, the DO removes the client from its active set and continues operating normally for the remaining connected client. The disconnected client’s frontend enters read-only mode and attempts to reconnect with exponential backoff (1s, 2s, 4s, 8s, max 30s). On reconnection, the client requests a full state snapshot from the DO, which overwrites the local state — any edits the other host made during the disconnection are immediately visible.
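
The reconnect schedule above amounts to a capped doubling delay. A minimal sketch, with reconnectDelayMs as a hypothetical helper name and without jitter (which a real client might add):

```typescript
// Delay before reconnect attempt N (0-based): 1s, 2s, 4s, 8s, 16s, then capped at 30s.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  if (attempt < 0 || Math.floor(attempt) !== attempt) {
    throw new Error("attempt must be a non-negative integer");
  }
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

The client would reset the attempt counter to zero once a connection succeeds and the state snapshot has been applied.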

Durable Object restart. Cloudflare may restart a Durable Object at any time (deployment, migration, inactivity eviction). The DO persists its state to transactional storage on every write operation, so restarts are transparent: on the next request, the DO loads its state from storage and resumes. Connected WebSocket clients receive a close event and reconnect, triggering a fresh state sync.

Render failure. If the ffmpeg container crashes or times out during a render, the RenderContainer DO detects the failure, sets the render status to failed in D1, and returns an error to the frontend. The host can retry the render, which starts a new container instance. Partial render output (if any) is cleaned up from R2.

R2 upload failure. Presigned URL uploads happen directly between the browser and R2. If the upload fails (network error, timeout), the frontend retries the upload. The finalize endpoint is idempotent — calling it multiple times with the same R2 key is safe. If a presigned URL expires before the upload completes, the frontend requests a new one.

Workers AI unavailability. If Workers AI is temporarily unavailable (for transcription, overdub, or LLM rewrite), the Worker returns a 503 with a Retry-After header. The frontend displays the error and allows the host to retry manually. Transcription retries are handled at the Workflow level; overdub and LLM calls are retried at the application level (up to 2 retries with 5-second delays).

Graceful degradation pattern. The system is designed so that each capability degrades independently. If transcription is unavailable, uploads still work and previously transcribed projects remain editable. If voice cloning is unavailable, all editing except overdub continues to function. If the render container is unavailable, editing and collaboration continue — the host just cannot export yet. The only hard dependency is R2: if R2 is unreachable, no uploads, playback, or renders can proceed.

12. Non-Functional Requirements (Measured)

NFR | Target | Basis
Upload file size | ≤ 500 MB per track | Application-level cap on presigned uploads (well below R2’s maximum object size)
Supported input codecs | WAV, MP3, M4A (AAC), FLAC | Common podcast recording formats; ffmpeg handles transcoding
Transcription word error rate (WER) | < 10% for clear English speech | Whisper large-v3 baseline on podcast-style audio
Transcription latency | < 3 minutes for a 90-minute episode | Chunked parallel processing via Workflow (est. roughly 190 overlapping 30s chunks)
WebSocket edit round-trip | < 200 ms | Measured between edit submission and broadcast receipt
Concurrent editors per project | 2 (both hosts) | Design constraint; DO single-writer model supports more but UI is optimized for two
Render time | < 5 minutes for 90 min episode | ffmpeg container with adequate CPU allocation
Export formats | MP3 (128/192/320 kbps), WAV (16-bit/44.1kHz) | Standard podcast distribution formats
R2 egress cost | $0 | Cloudflare R2 has zero egress fees
Playback latency | < 500 ms from click to audio | Web Audio API with preloaded buffer segments
Edit operation throughput | ≥ 50 ops/sec sustained | DO processes operations sequentially; each op < 20ms

13. Tech Stack

Layer | Technology | Role
Frontend | React 18, Vite, TypeScript | SPA: transcript editor, waveform visualization, project management
Audio playback | Web Audio API | Real-time playback with EDL-based seeking and segment skipping
Waveform rendering | wavesurfer.js (or custom Canvas) | Visual waveform display synchronized with transcript
API framework | Hono (on Cloudflare Workers) | Lightweight, typed HTTP router with middleware support
Authentication | Cloudflare Access (Zero Trust) | JWT-based access control restricted to allowed email addresses
Real-time collaboration | Durable Objects + WebSocket | Single-writer authoritative document with broadcast
Transcription pipeline | Cloudflare Workflows | Durable, multi-step chunked transcription with retry
Speech-to-text | Workers AI (Whisper large-v3) | Word-level transcription with timestamps and confidence scores
Voice cloning | Workers AI (MiniMax Speech-02-HD) | Text-to-speech with cloned voice from reference sample
LLM (line rewrite) | Workers AI (Llama 3.1 or equivalent) | Natural-language correction of lines before voice synthesis
Media storage | Cloudflare R2 | Source audio, overdub clips, rendered exports; zero egress
Metadata storage | Cloudflare D1 (SQLite) | Projects, tracks, transcript words, voice profiles, render records
Audio processing | ffmpeg in Cloudflare Container | Cut, concatenate, crossfade, encode; CPU-bound render workloads
Static assets | Cloudflare Workers Assets | SPA hosting with edge caching and SPA routing
Deployment | Wrangler CLI + GitHub Actions | CI/CD to editor.thebrokenhandle.com
Local development | Wrangler dev + Docker | Local Worker emulation; Docker for container build/test

14. Security & Compliance

Authentication and authorization. Access is gated by Cloudflare Access, configured as a Zero Trust application policy. Only the two hosts’ email addresses are in the ALLOWED_EMAILS list. Every request to the Worker is validated against the CF-Access-JWT-Assertion header — requests without a valid JWT are rejected before reaching any route handler. WebSocket upgrade requests are also subject to Access validation.
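
The application-level email check can be sketched as below. This assumes ALLOWED_EMAILS is a comma-separated string and that the JWT has already been verified against the Access signing keys before the email claim is trusted; signature verification is not shown, and both function names are hypothetical.

```typescript
// Assumption: ALLOWED_EMAILS is provided as a comma-separated string.
function parseAllowedEmails(raw: string): string[] {
  return raw
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
}

// Checks the email claim from an already-verified Access JWT against the allow-list.
function isAllowedEmail(email: string | undefined, allowed: string[]): boolean {
  if (!email) return false;
  return allowed.indexOf(email.trim().toLowerCase()) !== -1;
}
```

Normalizing case and whitespace on both sides avoids rejecting a host because of how the identity provider formats the address.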

Media isolation. All media (source audio, overdub clips, renders) lives in a private R2 bucket with no public access. Uploads use time-limited presigned URLs (expiry: 1 hour) generated by the authenticated Worker. Downloads are proxied through the Worker (which validates Access JWT) or served via short-lived presigned download URLs.

Secrets management. Any third-party credentials the application needs (for example, MiniMax API configuration or voice profile data used outside the Workers AI binding) are stored as Worker secrets via wrangler secret, never in code or wrangler.jsonc. Workers AI bindings use platform-level authentication, so the application manages no API keys for on-platform model calls.

Voice cloning consent. Voice cloning is restricted to the two consenting hosts. Each host explicitly provides their voice reference sample, and the voice_profiles table is scoped to their email addresses. No mechanism exists for cloning voices outside the allowed-email set, enforced at the application level.

Data residency. All data (D1, R2, DO storage) resides within the Cloudflare network. No user data is sent to external services — Workers AI calls stay on-platform.

15. Deployment & Operations

Infrastructure provisioning. D1 database and R2 bucket are pre-provisioned and wired in wrangler.jsonc. The Durable Object namespaces (EditDocument, RenderContainer), Workflow (TranscribeWorkflow), and Container configuration are declared in the same config file. Schema migrations for D1 are managed as numbered SQL files in db/migrations/ and applied via wrangler d1 migrations apply.

Deployment pipeline. The Worker, frontend assets, Durable Objects, Workflow, and Container image deploy together through a single wrangler deploy invocation. GitHub Actions runs this on push to main, targeting the custom domain editor.thebrokenhandle.com. The Container image is built from a Dockerfile in the repo and pushed to Cloudflare’s container registry as part of the deploy.

Local development. wrangler dev provides local Worker emulation with D1, R2, and DO support. The render container is built and tested locally with Docker. Workers AI calls in local dev are proxied to the live Cloudflare API (requires authentication).

Cost and resource consumption (estimated monthly, based on Workers Paid plan at $5/month base).

Resource | Estimated Usage | Cost
Workers requests | ~50,000/month (API + asset serving) | Included in paid plan (10M included)
Workers AI (Whisper) | ~20 episodes × 20 chunks = 400 calls | ~$0.40
Workers AI (MiniMax) | ~10 overdubs/episode × 20 = 200 calls | ~$1.00
Durable Object duration | ~100 hours/month (editing sessions) | ~$1.50
Durable Object requests | ~200,000/month (WebSocket messages) | ~$0.20
R2 storage | ~50 GB (growing; audio files accumulate) | ~$0.75
R2 operations | ~100,000 Class A + 500,000 Class B | ~$0.50
D1 reads/writes | ~500,000 reads, ~50,000 writes | Included in paid plan
Container compute | ~10 renders × 5 min = 50 min CPU | ~$0.50
Total estimated | | ~$10/month

The Workers Paid plan ($5/month) covers the base; incremental costs for AI, storage, and compute add approximately $5/month at current usage levels. This compares favorably to Descript’s $24–33/seat/month for two users ($48–66/month).
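
As an arithmetic check on the estimate, the incremental line items from the table can be summed directly. The figures are the table's own estimates, held in cents to avoid floating-point rounding; the helper names are illustrative.

```typescript
interface CostLine {
  item: string;
  cents: number;
}

// Incremental line items from the cost table (estimates, in US cents).
const incrementalCostLines: CostLine[] = [
  { item: "Workers AI (Whisper)", cents: 40 },
  { item: "Workers AI (MiniMax)", cents: 100 },
  { item: "Durable Object duration", cents: 150 },
  { item: "Durable Object requests", cents: 20 },
  { item: "R2 storage", cents: 75 },
  { item: "R2 operations", cents: 50 },
  { item: "Container compute", cents: 50 },
];

const BASE_PLAN_CENTS = 500; // Workers Paid plan base fee

function sumCents(lines: CostLine[]): number {
  return lines.reduce((total, line) => total + line.cents, 0);
}

function estimatedMonthlyTotalCents(): number {
  return BASE_PLAN_CENTS + sumCents(incrementalCostLines);
}
```

The line items add up to $4.85, so the total comes to $9.85, which the table rounds to about $10 per month.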

16. Cross-Project Context

The Broken Handle (shared Cloudflare infrastructure). TBH Editor and The Broken Handle’s research pipeline (tbh-research Worker) share the same Cloudflare account, DNS zone (thebrokenhandle.com), and billing. Both projects use D1, R2, and Workers on the same paid plan. The R2 bucket for podcast media (tbh-media) is separate from the editor’s media bucket (MEDIA) — the research pipeline stores raw scrape data in tbh-scrape-raw, while the editor stores audio in its own bucket. The separation is deliberate: the research pipeline’s cron-driven ingestion has different access patterns and retention policies than the editor’s interactive uploads. DNS routing is handled via Cloudflare’s custom domains: thebrokenhandle.com routes to the podcast website, editor.thebrokenhandle.com routes to TBH Editor, and the research pipeline runs on a scheduled cron trigger with no public-facing domain.

Shared Cloudflare account implications. Both projects contribute to the same Workers Paid plan billing. D1 database size limits, R2 storage, and Workers AI usage are pooled at the account level. The research pipeline’s Vectorize index and the editor’s Workers AI transcription calls both draw from the same account quota, which is important for capacity planning as both projects grow.

jarrodebrown.com (portfolio site). The SAD documents, including this one, are published as static HTML on jarrodebrown.com, deployed separately via Cloudflare Pages. The portfolio site references TBH Editor as a project and links to the live editor at editor.thebrokenhandle.com.

17. Risks, Assumptions & Limitations

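  • Workers Paid plan required. Cloudflare Containers are only available on the Workers Paid plan, and the Durable Object, Workflow, and Workers AI usage assumed in this document is sized against paid-plan limits. The entire system therefore depends on the $5/month Workers Paid plan.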
  • Voice cloning ethics. Voice cloning introduces consent and misuse considerations. The current implementation limits cloning to the two consenting hosts via application-level enforcement (allowed-email check), not a platform-level control. If the allowed-email list were expanded carelessly, voice cloning could be used without proper consent.
  • Container instance limits. Long renders depend on container instance availability. The max_instances setting in wrangler.jsonc caps concurrent renders. Under load (multiple renders queued), later renders wait for an available instance. For a two-user system, max_instances: 2 is sufficient.
  • Whisper accuracy on overlapping speech. Whisper’s word error rate degrades significantly when speakers talk over each other. The target < 10% WER assumes clear, turn-based speech — common in a two-host podcast but not guaranteed. Crosstalk sections may require manual transcript correction.
  • No offline editing. The system requires an active WebSocket connection to the Durable Object for editing. There is no offline editing capability — this is a deliberate simplicity tradeoff appropriate for a two-user system, but it means editing requires internet access.
  • D1 size constraints. A 90-minute episode with clear speech produces approximately 12,000–15,000 transcript words. At ~100 bytes per transcript_words row, each episode consumes roughly 1.5 MB of D1 storage. The paid plan supports 10 GB, allowing ~6,500 episodes before approaching the limit — not a near-term concern.
  • Durable Object single-region coordination. The EditDocument DO runs in one Cloudflare data center. If both hosts are in different geographic regions, one will experience higher WebSocket latency than the other. For a US-based two-host team, this is negligible; for geographically distributed teams, it could affect the real-time editing experience.
  • MiniMax voice quality. Voice cloning quality depends on the reference sample quality and length. Short or noisy reference samples produce less accurate voice clones. The 15–30 second reference requirement assumes a clean, studio-quality recording.
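
The D1 capacity estimate in the list above can be reproduced with a short calculation. It uses the document's own figures (about 100 bytes per transcript_words row, a 10 GB limit taken as 10 × 10^9 bytes) and ignores index and metadata overhead; episodesUntilD1Limit is a hypothetical helper.

```typescript
// Rough capacity estimate: how many episodes fit before the database size limit.
function episodesUntilD1Limit(wordsPerEpisode: number, bytesPerRow: number, limitBytes: number): number {
  if (wordsPerEpisode <= 0 || bytesPerRow <= 0) {
    throw new Error("words and bytes per row must be positive");
  }
  const bytesPerEpisode = wordsPerEpisode * bytesPerRow;
  return Math.floor(limitBytes / bytesPerEpisode);
}

const D1_LIMIT_BYTES = 10 * 1000 * 1000 * 1000; // 10 GB, decimal

// Upper end of the word-count range: 15,000 words, about 1.5 MB per episode.
const worstCaseEpisodes = episodesUntilD1Limit(15000, 100, D1_LIMIT_BYTES);
// Lower end of the range: 12,000 words, about 1.2 MB per episode.
const bestCaseEpisodes = episodesUntilD1Limit(12000, 100, D1_LIMIT_BYTES);
```

With these inputs the worst case is 6,666 episodes and the best case 8,333, consistent with the roughly 6,500 figure once some overhead is allowed for.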

18. Roadmap

Phase 1 — Core Editor (current). Upload, transcribe, collaborative editing, and export. The MVP delivers the end-to-end workflow for a two-host podcast: record externally, upload to TBH Editor, transcribe via Whisper, edit by editing the transcript, and export the finished episode. The core value proposition: replace the async “edit and bounce” workflow with real-time collaborative editing.

Phase 2 — Overdub & Voice. MiniMax voice cloning integration for AI overdub. Hosts can correct flubbed lines by typing the corrected text and generating speech in their own cloned voice. Requires voice profile setup (one-time reference sample upload per host) and the LLM rewrite pipeline for natural phrasing.

Phase 3 — Smart Editing. AI-assisted editing features: automatic filler-word detection and removal (“um”, “uh”, “like”, “you know”), silence trimming, and suggested cuts for long tangents. These features build on the transcript data already in D1 and the non-destructive edit model — they generate edit operations programmatically rather than requiring manual selection.

Phase 4 — Multi-episode Workflow. Project templates, episode numbering, show notes generation (LLM summary of transcript), and chapter markers for podcast apps. Integration with The Broken Handle’s research pipeline to pull relevant data points into show notes automatically.


Diagrams: Container Diagram · Sequence Diagram · Collaboration State Diagram