US-04: YouTube Video Transcription Implementation

User Story

As a content creator, researcher, or student, I want to transcribe audio from a YouTube video URL into readable text, so that I can read, search, quote, or repurpose spoken content without having to watch or listen to the entire video.

Business Value

Problem We're Solving

Current State: The app has a polished UI (US-03) but the "Transcribe" button shows placeholder text. Users cannot get actual transcripts from YouTube videos.

Impact: Without real transcription, we don't have a product—we have a demo. This story is the first true value delivery moment that transforms the prototype into a functional tool.

Value Delivered

First Real Functionality: Users can finally get actual YouTube transcripts
Privacy-Focused: Local transcription (using existing local-transcriber) keeps data private
Accuracy: Whisper-based transcription typically provides better results than YouTube auto-captions
Accessibility: Makes video content accessible to deaf/hard-of-hearing users
Content Utility: Enables searchable, quotable text from video content

Strategic Alignment

This is our MVP completion milestone. Without this story, every downstream feature (export formats, timestamp display, history) has nothing to build upon. This validates our technical approach and enables real user feedback.

Success Criteria:

Users successfully transcribe at least one video end-to-end
User experience during 2-10 minute wait is clear and reassuring
Error scenarios are handled gracefully with actionable messages

Context & Prerequisites

Completed Stories

✅ US-01: Next.js 15 + Vitest scaffold
✅ US-02: GitHub Actions PR verification pipeline
✅ US-03: UI foundation with UrlInput, TranscribeButton, TranscriptionOutput components

Existing Assets

Local Transcriber CLI: ~/Repos/LT/claude-code-playground/local-transcriber
- Uses: whisper-node (local Whisper.cpp), yt-dlp-wrap (YouTube downloads), fluent-ffmpeg (audio extraction)
- Architecture: Modular design with YouTubeDownloader, VideoProcessor, Transcriber, LLMEnhancer
- Progress callbacks: Already implemented via ProcessingProgress interface
- File cleanup: Automatic temp file management

Scope Constraints (Simplifications)

Local Development Only: No production deployment concerns (Vercel, Docker, etc.)
Single Transcription: Only one transcription at a time (no job queue, no concurrency)
Developer Machine: Assumes ffmpeg and yt-dlp installed on local development machine

Acceptance Criteria

AC1: YouTube URL Validation and Submission

GIVEN a valid YouTube URL (formats: watch?v=, youtu.be, embed, m.youtube.com)
- WHEN user enters URL and clicks Transcribe
- THEN transcription process starts immediately
GIVEN an invalid URL (empty, not YouTube, malformed)
- WHEN user clicks Transcribe
- THEN error message appears: "Please enter a valid YouTube URL"
- AND button remains disabled for empty input

Verification: Test with 10+ URL formats (5 valid variants, 5 invalid cases)
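
The URL formats listed in AC1 can be checked with a small validator. The following is a minimal sketch, not the local-transcriber implementation: the function name extractYouTubeVideoId, the host list, and the assumption that video IDs are 11 characters from [A-Za-z0-9_-] are all illustrative.

```typescript
// Hostnames treated as YouTube in this sketch (an assumption; extend as needed).
const YOUTUBE_HOSTS = new Set([
  'youtube.com',
  'www.youtube.com',
  'm.youtube.com',
  'youtu.be',
])

// Assumption: video IDs are 11 characters from [A-Za-z0-9_-].
const VIDEO_ID_PATTERN = /^[A-Za-z0-9_-]{11}$/

/** Returns the video ID for a recognized YouTube URL, or null otherwise. */
export function extractYouTubeVideoId(input: string): string | null {
  let parsed: URL
  try {
    parsed = new URL(input.trim())
  } catch {
    return null // empty, whitespace-only, or malformed input
  }

  if (parsed.protocol !== 'https:' && parsed.protocol !== 'http:') return null
  if (!YOUTUBE_HOSTS.has(parsed.hostname)) return null

  let candidate: string | null = null
  if (parsed.hostname === 'youtu.be') {
    candidate = parsed.pathname.split('/')[1] ?? null // youtu.be/<id>
  } else if (parsed.pathname === '/watch') {
    candidate = parsed.searchParams.get('v') // watch?v=<id>
  } else if (parsed.pathname.startsWith('/embed/')) {
    candidate = parsed.pathname.split('/')[2] ?? null // embed/<id>
  }

  return candidate !== null && VIDEO_ID_PATTERN.test(candidate) ? candidate : null
}
```

Extra query parameters such as t=120 or list=... are ignored, so timestamped and playlist URLs resolve to the same video ID. The UI can show "Please enter a valid YouTube URL" whenever the function returns null.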

AC2: Async Job Creation

GIVEN user submits valid YouTube URL
- WHEN Transcribe button clicked
- THEN API returns immediately (within 500ms) with job ID
- AND user sees loading state with initial message
GIVEN job submitted successfully
- WHEN API responds
- THEN frontend stores job ID and begins polling for status

Verification: Network tab shows POST /api/transcribe returns 202 Accepted with jobId

AC3: Real-Time Progress Indication

GIVEN transcription in progress
- WHEN user waits
- THEN loading indicator remains visible throughout process
- AND progress updates appear (percentage or status messages)
- AND elapsed time displays (e.g., "Started 2m 30s ago")
GIVEN transcription stages (downloading → extracting → transcribing)
- THEN user sees current stage in UI ("Downloading audio...", "Transcribing...")

Verification: Visual inspection shows progress updates every 2-5 seconds

AC4: Loading State Behavior

GIVEN transcription starts
- THEN loading indicator appears within 500ms
- AND Transcribe button becomes disabled
- AND URL input becomes disabled
- AND button text changes to "Transcribing..." with spinner
GIVEN transcription in progress
- THEN user cannot start another transcription
- AND previous transcription results (if any) remain visible until new results arrive

Verification: Try clicking button multiple times rapidly—only one transcription starts

AC5: Transcription Success

GIVEN transcription completes successfully
- THEN actual transcript text appears in TranscriptionOutput component
- AND loading indicator disappears
- AND Transcribe button re-enables
- AND URL input re-enables
- AND transcript text is selectable and copyable
GIVEN completed transcript displayed
- THEN user can start new transcription without page refresh

Verification: Test with 3 different real videos (30 seconds, 2 minutes, 5 minutes)

AC6: Video Availability Errors

GIVEN YouTube URL points to private/unavailable video
- THEN error message: "This video is private or unavailable. Please use a public video."
- AND loading indicator disappears
- AND form re-enables for retry
GIVEN YouTube URL points to deleted video
- THEN error message: "Video not found. It may have been removed."
GIVEN YouTube URL points to age-restricted video
- THEN error message: "Cannot access age-restricted videos. Please use a different video."

Verification: Test with known private, deleted, and age-restricted video URLs

AC7: Transcription Failures

GIVEN network error during download
- THEN error message: "Failed to download video. Please check your connection and try again."
- AND "Retry" button available
GIVEN transcription process fails (Whisper error, ffmpeg error)
- THEN error message: "Transcription failed. Please try again or contact support if this persists."
- AND "Retry" button available
GIVEN any error occurs
- THEN loading indicator disappears
- AND form re-enables
- AND error is announced to screen readers (aria-live)

Verification: Mock failures in backend, verify frontend error handling

AC8: Timeout Handling

GIVEN transcription exceeds 10 minutes (CLI timeout)
- THEN error message: "Transcription timed out. Please try a shorter video or retry later."
- AND process is killed on server
- AND temp files are cleaned up
GIVEN frontend timeout (12 minutes)
- THEN frontend shows timeout error even if backend still processing
- AND user can retry

Verification: Mock 10+ minute delay, verify timeout triggers correctly

AC9: Resource Management (Simplified - Single Transcription)

GIVEN transcription completes (success or failure)
- THEN server cleans up temp files within 60 seconds
GIVEN transcription is already in progress
- WHEN user tries to start another transcription
- THEN button remains disabled
- AND UI shows "Transcription in progress" message

Verification: Check temp directory is cleaned up, try starting second transcription while first is running

AC10: State Management

GIVEN user navigates away during transcription
- THEN transcription continues (or is canceled, team decision)
- AND server cleans up resources appropriately
GIVEN user refreshes page during transcription
- THEN in-progress job is lost (acceptable for MVP)
- AND server continues processing and eventually cleans up

Verification: Manual testing with browser back/forward/refresh

AC11: Accessibility

GIVEN transcription state changes (start, progress, complete, error)
- THEN changes are announced to screen readers via aria-live
GIVEN loading indicator visible
- THEN button has aria-busy="true"
- AND screen reader announces "Transcription in progress"
GIVEN progress updates occur
- THEN significant milestones announced (25%, 50%, 75%, complete)
- NOT every 1% change (too chatty)

Verification: Test with VoiceOver (macOS) or NVDA (Windows)
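
The milestone rule in AC11 (announce 25%, 50%, 75%, and completion, but not every 1% change) could be implemented along these lines. This is a sketch under assumptions: the function names and the announcement wording are hypothetical, and the returned string is meant to be written into an existing aria-live region.

```typescript
const MILESTONES = [25, 50, 75, 100] as const

/**
 * Returns the highest milestone crossed when progress moves from
 * `previous` to `current`, or null if no milestone was crossed.
 */
export function crossedMilestone(previous: number, current: number): number | null {
  let crossed: number | null = null
  for (const milestone of MILESTONES) {
    if (previous < milestone && current >= milestone) crossed = milestone
  }
  return crossed
}

/** Text for the aria-live region, or null when nothing should be announced. */
export function milestoneAnnouncement(previous: number, current: number): string | null {
  const milestone = crossedMilestone(previous, current)
  if (milestone === null) return null
  return milestone === 100
    ? 'Transcription complete'
    : `Transcription ${milestone}% complete`
}
```

Because polling may skip several percent between updates, the check compares the previous and current values rather than testing for exact equality with a milestone.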

AC12: API Contract Implementation

POST /api/transcribe endpoint exists and returns:
- 202 Accepted: { jobId: string, status: 'queued' | 'processing', estimatedDuration?: number }
- 400 Bad Request: { error: { code: string, message: string } } for invalid URL
- 503 Service Unavailable: { error: { code: 'SERVER_BUSY', message: string, retryAfter: number } }
GET /api/transcribe/status/:jobId endpoint exists and returns:
- 200 OK (processing): { jobId, status: 'processing', progress: { stage: string, percent: number, message: string } }
- 200 OK (complete): { jobId, status: 'complete', result: { transcript: string, duration: number, language: string } }
- 200 OK (failed): { jobId, status: 'failed', error: { code: string, message: string } }
- 404 Not Found: { error: { code: 'JOB_NOT_FOUND', message: string } }
Both endpoints have TypeScript types defined in /lib/types/api.ts (see Directory Structure)

Verification: Use curl or Postman to call endpoints, verify response structure
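
The three POST outcomes in AC12 (202, 400, 503) can be expressed as a pure decision function that the route handler wraps. This is a sketch, not the final handler: the Deps interface, the function name, and the retryAfter value of 30 seconds are assumptions made for illustration.

```typescript
type TranscribeResponse =
  | { status: 202; body: { jobId: string; status: 'queued' } }
  | { status: 400; body: { error: { code: 'INVALID_INPUT' | 'INVALID_URL'; message: string } } }
  | { status: 503; body: { error: { code: 'SERVER_BUSY'; message: string; retryAfter: number } } }

// Hypothetical dependencies, injected so the logic is testable without a server.
interface Deps {
  isValidUrl: (url: string) => boolean // URL validator
  isBusy: () => boolean                // true while a transcription is running
  createJob: (url: string) => string   // starts a job and returns its jobId
}

const INVALID_MESSAGE = 'Please enter a valid YouTube URL'

export function decideTranscribeResponse(body: unknown, deps: Deps): TranscribeResponse {
  const url =
    typeof body === 'object' && body !== null ? (body as { url?: unknown }).url : undefined

  if (typeof url !== 'string' || url.trim() === '') {
    return { status: 400, body: { error: { code: 'INVALID_INPUT', message: INVALID_MESSAGE } } }
  }
  if (!deps.isValidUrl(url.trim())) {
    return { status: 400, body: { error: { code: 'INVALID_URL', message: INVALID_MESSAGE } } }
  }
  if (deps.isBusy()) {
    // retryAfter of 30 seconds is an arbitrary placeholder
    return {
      status: 503,
      body: { error: { code: 'SERVER_BUSY', message: 'Transcription in progress', retryAfter: 30 } },
    }
  }
  return { status: 202, body: { jobId: deps.createJob(url.trim()), status: 'queued' } }
}
```

Validation runs before the busy check so that a malformed request gets a 400 even while another job is running.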

Technical Implementation

High-Level Architecture (Simplified - Single Job)

┌─────────────────┐
│  User Browser   │
│                 │
│  [React Form]   │  Polling every 2s
│       │         │  ◄─────────────┐
│       ▼         │                │
│  POST /api/     │                │
│  transcribe     │                │
│       │         │                │
└───────┼─────────┘                │
        │                          │
        ▼                          │
┌─────────────────────────────────┴──┐
│  Next.js API Routes                │
│                                    │
│  POST /api/transcribe              │
│    → Creates single job            │
│    → Returns jobId                 │
│    → Blocks if already processing  │
│                                    │
│  GET /api/transcribe/status/:id    │
│    → Returns job status            │
│                                    │
│  Simple State (in-memory)          │
│    → Single job at a time          │
│    → No queue needed               │
│                                    │
└────────────────┬───────────────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│  Transcription Service               │
│  (from local-transcriber)            │
│                                      │
│  YouTubeDownloader                   │
│    → yt-dlp-wrap                     │
│    → Downloads video/audio           │
│         │                            │
│         ▼                            │
│  VideoProcessor                      │
│    → fluent-ffmpeg                   │
│    → Extracts audio (WAV)            │
│         │                            │
│         ▼                            │
│  Transcriber                         │
│    → whisper-node                    │
│    → Transcribes audio               │
│         │                            │
│         ▼                            │
│  Returns transcript text             │
│                                      │
└──────────────────────────────────────┘

Directory Structure

/next-transcriber
  /app
    /api
      /transcribe
        /route.ts                  # POST - Create job
        /route.test.ts
        /status
          /[jobId]
            /route.ts              # GET - Poll status
            /route.test.ts
  /lib
    /transcription
      /core                        # Copied from local-transcriber
        /transcriber.ts
        /video-processor.ts
        /youtube-downloader.ts
      /queue
        /job-manager.ts            # In-memory job state (single job for MVP)
        /job-processor.ts          # Background worker
      /types.ts                    # TypeScript interfaces
    /types
      /api.ts                      # API request/response types

Technology Stack

Backend:

Next.js 15 API Routes (server-side only)
whisper-node (Whisper.cpp Node.js bindings)
fluent-ffmpeg (audio extraction)
yt-dlp-wrap (YouTube downloads)
In-memory job state (Map-based; one active job at a time for MVP)

Frontend:

React 19 with hooks (useReducer for state)
Polling mechanism (setInterval with cleanup)
Tailwind CSS for loading indicators

Dependencies to Install:

{
  "dependencies": {
    "whisper-node": "^1.1.1",
    "fluent-ffmpeg": "^2.1.3",
    "yt-dlp-wrap": "^2.3.12"
  },
  "devDependencies": {
    "@types/fluent-ffmpeg": "^2.1.27"
  }
}

System Dependencies (must be installed on the development machine):

ffmpeg (audio processing)
yt-dlp (YouTube downloads)
Whisper models (downloaded on first run, ~75MB for tiny up to ~2.9GB for large)

Backend Implementation

API Design

POST /api/transcribe

Request:

{
  url: string  // YouTube URL (required)
}

Response (202 Accepted):

{
  jobId: string           // UUID
  status: 'queued'
  estimatedDuration: number  // Rough estimate in seconds
}

Response (400 Bad Request):

{
  error: {
    code: 'INVALID_URL' | 'INVALID_INPUT'
    message: string
  }
}

GET /api/transcribe/status/:jobId

Response (200 OK - Processing):

{
  jobId: string
  status: 'queued' | 'processing'
  progress: {
    stage: 'downloading' | 'extracting' | 'transcribing'
    percent: number        // 0-100
    message: string
  }
  createdAt: string
  updatedAt: string
}

Response (200 OK - Complete):

{
  jobId: string
  status: 'complete'
  result: {
    transcript: string
    duration: number      // Video duration in seconds
    language: string      // Detected language
    wordCount: number
  }
  createdAt: string
  completedAt: string
}

Response (200 OK - Failed):

{
  jobId: string
  status: 'failed'
  error: {
    code: string
    message: string
  }
  createdAt: string
  failedAt: string
}
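
The status responses above can be produced from the in-memory job record with a single mapping function. The sketch below assumes an internal record shaped like TranscriptionJob, whose progress object uses a progress field that the API exposes as percent. The JobRecord type, the function name, and the whitespace-based word count are illustrative, and the timestamp fields are omitted for brevity.

```typescript
type Stage = 'queued' | 'downloading' | 'extracting' | 'transcribing'

// Simplified internal record (assumed shape, modeled on TranscriptionJob)
interface JobRecord {
  jobId: string
  status: 'queued' | 'processing' | 'complete' | 'failed'
  progress: { stage: Stage; progress: number; message: string }
  result?: { transcript: string; duration: number; language: string }
  error?: { code: string; message: string }
}

type StatusResponse =
  | { jobId: string; status: 'queued' | 'processing'; progress: { stage: Stage; percent: number; message: string } }
  | { jobId: string; status: 'complete'; result: { transcript: string; duration: number; language: string; wordCount: number } }
  | { jobId: string; status: 'failed'; error: { code: string; message: string } }

export function toStatusResponse(job: JobRecord): StatusResponse {
  if (job.status === 'complete' && job.result) {
    // Word count derived by splitting on whitespace (a simplification)
    const words = job.result.transcript.trim().split(/\s+/).filter(Boolean)
    return { jobId: job.jobId, status: 'complete', result: { ...job.result, wordCount: words.length } }
  }
  if (job.status === 'failed' || job.status === 'complete') {
    // A 'complete' job without a result is reported as a failure
    const error = job.error ?? { code: 'TRANSCRIPTION_ERROR', message: 'Transcription failed.' }
    return { jobId: job.jobId, status: 'failed', error }
  }
  // Clamp and round so the client always receives an integer from 0 to 100
  const percent = Math.min(100, Math.max(0, Math.round(job.progress.progress)))
  return {
    jobId: job.jobId,
    status: job.status,
    progress: { stage: job.progress.stage, percent, message: job.progress.message },
  }
}
```

Modeling the response as a discriminated union on status lets the frontend narrow to result or error without optional chaining.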

Integration with Local Transcriber

Step 1: Copy Core Modules

# Copy modules from local-transcriber
cp -r ~/Repos/LT/claude-code-playground/local-transcriber/src/modules \
      /next-transcriber/lib/transcription/core/

# Copy types
cp -r ~/Repos/LT/claude-code-playground/local-transcriber/src/types \
      /next-transcriber/lib/transcription/types/

# Copy utilities (selective)
cp ~/Repos/LT/claude-code-playground/local-transcriber/src/utils/{logger,file-utils,youtube-utils}.ts \
   /next-transcriber/lib/transcription/utils/

Step 2: Adapt for Web (Remove CLI Dependencies)

Remove commander imports
Keep progress callbacks (already async-friendly)
Adapt logger for Next.js (use console or structured logger)
Remove interactive prompts

Step 3: Implement Job Queue

// /lib/transcription/queue/job-manager.ts
interface TranscriptionJob {
  jobId: string
  url: string
  status: 'queued' | 'processing' | 'complete' | 'failed'
  progress: ProcessingProgress
  result?: TranscriptionResult
  error?: ErrorDetail
  createdAt: Date
  updatedAt: Date
}

class JobManager {
  private jobs = new Map<string, TranscriptionJob>()
  private queue: string[] = []
  private maxConcurrent = 1 // Single transcription at a time (see Scope Constraints)

  async createJob(url: string): Promise<string> {
    const jobId = crypto.randomUUID()
    const job: TranscriptionJob = {
      jobId,
      url,
      status: 'queued',
      progress: { stage: 'queued', progress: 0, message: 'Waiting to start...' },
      createdAt: new Date(),
      updatedAt: new Date()
    }

    this.jobs.set(jobId, job)
    this.queue.push(jobId)
    this.processQueue()

    return jobId
  }

  async getJobStatus(jobId: string): Promise<TranscriptionJob | null> {
    return this.jobs.get(jobId) || null
  }

  private async processQueue() {
    const processing = Array.from(this.jobs.values())
      .filter(j => j.status === 'processing').length

    if (processing >= this.maxConcurrent) return

    const nextJobId = this.queue.shift()
    if (!nextJobId) return

    const job = this.jobs.get(nextJobId)
    if (!job) return

    this.processJob(job)
  }

  private async processJob(job: TranscriptionJob) {
    job.status = 'processing'
    job.updatedAt = new Date()

    try {
      const transcriber = new TranscriberApp(config)

      const result = await transcriber.processFile(
        job.url,
        undefined,
        { mode: 'none' }, // No LLM enhancement for MVP
        (progress) => {
          // Update job progress
          job.progress = progress
          job.updatedAt = new Date()
        }
      )

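      job.status = 'complete'
      job.result = { transcript: result, /* ... */ }
    } catch (error) {
      job.status = 'failed'
      // `error` is `unknown` under strict TypeScript, so narrow before reading `.message`
      const message = error instanceof Error ? error.message : String(error)
      job.error = { code: 'TRANSCRIPTION_ERROR', message }
    } finally {
      // Record when the job reached its final state
      job.updatedAt = new Date()
    }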

    this.processQueue() // Start next job
  }
}

export const jobManager = new JobManager()

Timeout Implementation

Three-Layer Timeout Strategy:

// CLI Tool: 10 minutes
const CLI_TIMEOUT = 10 * 60 * 1000

// Backend API: 11 minutes (1 min buffer to kill CLI and respond)
const BACKEND_TIMEOUT = 11 * 60 * 1000

// Frontend: 12 minutes (1 min buffer to receive backend response)
const FRONTEND_TIMEOUT = 12 * 60 * 1000

Backend Implementation:

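async function processWithTimeout(job: TranscriptionJob) {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('TIMEOUT')), BACKEND_TIMEOUT)
  })

  try {
    // Whichever settles first wins: the transcription or the timeout
    return await Promise.race([
      transcriber.processFile(job.url /* , ...remaining arguments as in processJob */),
      timeoutPromise
    ])
  } catch (error) {
    if (error instanceof Error && error.message === 'TIMEOUT') {
      // Kill CLI process, cleanup files
      throw new Error('Transcription timed out')
    }
    throw error
  } finally {
    // Always clear the timer so it cannot fire after the race has settled
    clearTimeout(timer)
  }
}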

Frontend Implementation

State Management

Recommended: useReducer Pattern

// hooks/useTranscription.ts
interface TranscriptionState {
  url: string
  output: string
  status: 'idle' | 'validating' | 'transcribing' | 'complete' | 'error'
  jobId: string | null
  progress: number
  stage: string
  error: ErrorDetail | null
  startedAt: Date | null
  estimatedTimeRemaining: number | null
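}

// Starting state; also the state that RESET returns to
const initialState: TranscriptionState = {
  url: '',
  output: '',
  status: 'idle',
  jobId: null,
  progress: 0,
  stage: '',
  error: null,
  startedAt: null,
  estimatedTimeRemaining: null
}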

type TranscriptionAction =
  | { type: 'SET_URL'; url: string }
  | { type: 'START_TRANSCRIPTION'; jobId: string }
  | { type: 'UPDATE_PROGRESS'; progress: number; stage: string; estimatedTime?: number }
  | { type: 'SET_COMPLETE'; transcript: string }
  | { type: 'SET_ERROR'; error: ErrorDetail }
  | { type: 'RESET' }

// Explicit return type keeps `status` literals from widening to `string`
function transcriptionReducer(state: TranscriptionState, action: TranscriptionAction): TranscriptionState {
  switch (action.type) {
    case 'SET_URL':
      return { ...state, url: action.url }
    case 'START_TRANSCRIPTION':
      return {
        ...state,
        status: 'transcribing',
        jobId: action.jobId,
        startedAt: new Date(),
        error: null,
        progress: 0
      }
    case 'UPDATE_PROGRESS':
      return {
        ...state,
        progress: action.progress,
        stage: action.stage,
        estimatedTimeRemaining: action.estimatedTime ?? state.estimatedTimeRemaining
      }
    case 'SET_COMPLETE':
      return {
        ...state,
        status: 'complete',
        output: action.transcript,
        progress: 100
      }
    case 'SET_ERROR':
      return {
        ...state,
        status: 'error',
        error: action.error,
        progress: 0
      }
    case 'RESET':
      return initialState
    default:
      return state
  }
}

Polling Mechanism

useEffect(() => {
  if (!jobId || status !== 'transcribing') return

  const poll = async () => {
    try {
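      const response = await fetch(`/api/transcribe/status/${jobId}`)
      const data = await response.json()

      if (!response.ok) {
        // e.g. 404 JOB_NOT_FOUND after a server restart: surface the error, which also stops polling
        dispatch({ type: 'SET_ERROR', error: data.error })
      } else if (data.status === 'queued' || data.status === 'processing') {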
        dispatch({
          type: 'UPDATE_PROGRESS',
          progress: data.progress.percent,
          stage: data.progress.stage,
          estimatedTime: calculateEstimate(data)
        })
      } else if (data.status === 'complete') {
        dispatch({
          type: 'SET_COMPLETE',
          transcript: data.result.transcript
        })
      } else if (data.status === 'failed') {
        dispatch({
          type: 'SET_ERROR',
          error: data.error
        })
      }
    } catch (error) {
      console.error('Polling error:', error)
      // Retry logic here
    }
  }

  const intervalId = setInterval(poll, 2000) // Poll every 2 seconds

  return () => clearInterval(intervalId)
}, [jobId, status])
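
The polling effect above does not yet enforce the 12-minute frontend timeout from AC8. One way to add it is to check elapsed time on each poll tick and dispatch a timeout error when the limit is reached. The helpers below are a sketch; their names are hypothetical, and the constant mirrors FRONTEND_TIMEOUT from the timeout strategy.

```typescript
// Mirrors FRONTEND_TIMEOUT from the timeout strategy (12 minutes)
export const FRONTEND_TIMEOUT_MS = 12 * 60 * 1000

/** True once `now` is at least FRONTEND_TIMEOUT_MS past `startedAt`. */
export function hasFrontendTimedOut(startedAt: Date, now: Date = new Date()): boolean {
  return now.getTime() - startedAt.getTime() >= FRONTEND_TIMEOUT_MS
}

/** Milliseconds left before the frontend gives up; never negative. */
export function remainingBeforeTimeout(startedAt: Date, now: Date = new Date()): number {
  return Math.max(0, FRONTEND_TIMEOUT_MS - (now.getTime() - startedAt.getTime()))
}
```

Inside poll, a call such as hasFrontendTimedOut(startedAt) before the fetch would let the hook dispatch SET_ERROR with the AC8 timeout message even if the backend is still processing.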

Loading Indicator Component

interface LoadingIndicatorProps {
  progress: number
  stage: string
  startedAt: Date
  estimatedTimeRemaining?: number
}

function LoadingIndicator({ progress, stage, startedAt, estimatedTimeRemaining }: LoadingIndicatorProps) {
  const elapsed = Math.floor((Date.now() - startedAt.getTime()) / 1000)

  return (
    <div role="status" aria-live="polite" className="loading-container">
      {/* Progress Bar */}
      <div
        role="progressbar"
        aria-valuenow={progress}
        aria-valuemin={0}
        aria-valuemax={100}
        className="progress-bar"
      >
        <div className="progress-fill" style={{ width: `${progress}%` }} />
      </div>

      {/* Status Text */}
      <div className="status-text">
        <p>{stage} {progress > 0 && `${progress}%`}</p>
        <p className="text-muted">
          Started {formatDuration(elapsed)} ago
          {estimatedTimeRemaining != null && ` • ~${formatDuration(estimatedTimeRemaining)} remaining`}
        </p>
      </div>
    </div>
  )
}
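
LoadingIndicator calls formatDuration, which is not defined in this document. A possible implementation, matching the "Started 2m 30s ago" example from AC3, is sketched below; the exact output format is an assumption.

```typescript
/** Formats a number of seconds as "45s", "2m 30s", or "1h 2m 5s". */
export function formatDuration(totalSeconds: number): string {
  // Treat negative or non-finite input as zero
  const safe = Number.isFinite(totalSeconds) ? Math.max(0, Math.floor(totalSeconds)) : 0
  const hours = Math.floor(safe / 3600)
  const minutes = Math.floor((safe % 3600) / 60)
  const seconds = safe % 60

  if (hours > 0) return `${hours}h ${minutes}m ${seconds}s`
  if (minutes > 0) return `${minutes}m ${seconds}s`
  return `${seconds}s`
}
```

For example, formatDuration(150) returns "2m 30s".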

Error Display Component

interface ErrorDisplayProps {
  error: ErrorDetail
  onRetry: () => void
  onDismiss: () => void
}

function ErrorDisplay({ error, onRetry, onDismiss }: ErrorDisplayProps) {
  return (
    <div role="alert" aria-live="assertive" className="error-banner">
      <div className="error-icon">⚠️</div>
      <div className="error-content">
        <strong>Transcription Failed</strong>
        <p>{error.message}</p>
      </div>
      <div className="error-actions">
        {error.retryable && (
          <button onClick={onRetry} className="btn-primary">
            Retry
          </button>
        )}
        <button onClick={onDismiss} aria-label="Dismiss error" className="btn-secondary">
          ✕
        </button>
      </div>
    </div>
  )
}
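
ErrorDisplay reads error.retryable, but ErrorDetail is not defined in this document and the API error shape only carries code and message. One option is to derive the user-facing message and the retryable flag from the error code on the frontend. In the sketch below, the messages are taken from AC6 to AC8, while every code name except TRANSCRIPTION_ERROR and the choice of which errors are retryable are assumptions.

```typescript
export interface ErrorDetail {
  code: string
  message: string
  retryable: boolean
}

type CatalogEntry = Omit<ErrorDetail, 'code'>

const FALLBACK: CatalogEntry = {
  message: 'Transcription failed. Please try again or contact support if this persists.',
  retryable: true,
}

// Code names other than TRANSCRIPTION_ERROR are hypothetical placeholders
const ERROR_CATALOG: Record<string, CatalogEntry> = {
  VIDEO_UNAVAILABLE: {
    message: 'This video is private or unavailable. Please use a public video.',
    retryable: false,
  },
  VIDEO_NOT_FOUND: { message: 'Video not found. It may have been removed.', retryable: false },
  AGE_RESTRICTED: {
    message: 'Cannot access age-restricted videos. Please use a different video.',
    retryable: false,
  },
  DOWNLOAD_FAILED: {
    message: 'Failed to download video. Please check your connection and try again.',
    retryable: true,
  },
  TRANSCRIPTION_ERROR: FALLBACK,
  TIMEOUT: {
    message: 'Transcription timed out. Please try a shorter video or retry later.',
    retryable: true,
  },
}

/** Maps an API error code to the detail object the UI renders. */
export function toErrorDetail(code: string): ErrorDetail {
  const entry = ERROR_CATALOG[code] ?? FALLBACK
  return { code, ...entry }
}
```

Availability errors are marked non-retryable here because retrying the same URL cannot succeed; the form still re-enables so the user can enter a different video.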

QA & Testing Strategy

Unit Tests (Automated)

Backend Tests:

// /lib/transcription/utils/youtube-utils.test.ts
describe('YouTube URL Validation', () => {
  it('should accept standard youtube.com URLs')
  it('should accept youtu.be short URLs')
  it('should accept URLs with timestamps')
  it('should reject non-YouTube URLs')
  it('should extract video ID correctly from all formats')
})

// /lib/transcription/queue/job-manager.test.ts
describe('Job Manager', () => {
  it('should create job with valid URL')
  it('should reject invalid URL')
  it('should limit concurrent jobs to 1')
  it('should queue jobs when at capacity')
  it('should process queued jobs when slot available')
})

Frontend Tests:

// /app/transcribe/_components/TranscriptionForm.test.tsx
describe('TranscriptionForm', () => {
  it('should disable button when input empty')
  it('should enable button when URL entered')
  it('should show loading state when transcribing')
  it('should display progress updates')
  it('should display transcript on success')
  it('should display error on failure')
  it('should allow retry after error')
})

Integration Tests

// /app/api/transcribe/route.test.ts
describe('POST /api/transcribe', () => {
  it('should return 202 with jobId for valid URL')
  it('should return 400 for invalid URL')
  it('should return 503 when server busy')
})

describe('GET /api/transcribe/status/:jobId', () => {
  it('should return processing status')
  it('should return complete status with transcript')
  it('should return failed status with error')
  it('should return 404 for unknown jobId')
})

E2E Tests

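// tests/e2e/transcription.test.ts
import { describe, it, expect } from 'vitest'

// Assumes the dev server is already running; fetch in Node needs an absolute URL
const BASE_URL = process.env.BASE_URL ?? 'http://localhost:3000'
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

describe('Full Transcription Workflow', () => {
  it('should transcribe short video successfully', async () => {
    // Use 30-second test video
    const testUrl = 'https://youtube.com/watch?v=SHORT_TEST_VIDEO'

    // Submit job
    const { jobId } = await fetch(`${BASE_URL}/api/transcribe`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url: testUrl })
    }).then(r => r.json())

    // Poll until the job leaves the queued/processing states
    const statusUrl = `${BASE_URL}/api/transcribe/status/${jobId}`
    let result = await fetch(statusUrl).then(r => r.json())
    while (result.status === 'queued' || result.status === 'processing') {
      await sleep(2000)
      result = await fetch(statusUrl).then(r => r.json())
    }

    // Verify result (declared outside the loop so it is still in scope here)
    expect(result.status).toBe('complete')
    expect(result.result.transcript).toBeTruthy()
  }, 120000) // 2 minute timeout
})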

Manual Testing Checklist

Critical Manual Tests:

Test Data

Test Videos to Create/Find:

Short clear speech (10-30 seconds) - fast E2E test
Medium video (2-3 minutes) - typical use case
Long video (10 minutes) - stress test
Video with background noise - quality test
Private video URL - error handling test
Deleted video URL - error handling test

Edge Cases to Handle

URL Edge Cases

Empty input (handled by disabled button)
Whitespace-only input
Valid URL but not YouTube
Malformed YouTube URL
YouTube URL variations (watch?v=, youtu.be, embed, mobile)
URL with timestamp (t=120s)
URL with playlist parameter

Video Availability

Private video
Unlisted video (should work)
Deleted video
Age-restricted video
Region-restricted video
Premium/paid content
Live stream (currently streaming)
Upcoming scheduled video

Video Content

No audio track
Very short video (<5 seconds)
Very long video (>1 hour)
Multiple audio tracks/languages
Music-only (no speech)
Heavy background noise
Multiple overlapping speakers

System Failures

Network timeout during download
Disk full during processing
ffmpeg not installed
yt-dlp not installed
Whisper models not downloaded
Server restart during transcription
Multiple concurrent requests (second request rejected with 503 SERVER_BUSY while one is running)

Timeout Scenarios

Frontend timeout (12 min)
Backend timeout (11 min)
CLI timeout (10 min)
Network request timeout during polling

Out of Scope (Future Stories)

This story explicitly does NOT include:

❌ Concurrent Transcriptions - Single transcription at a time (defer job queue to future story)
❌ Copy to Clipboard Button - Deferred to US-05 (users can manually copy for now)
❌ LLM Enhancement (cleanup/formatting via Ollama) - Deferred to US-05
❌ YouTube URL Validation (detailed format checking) - Basic validation only
❌ YouTube Metadata Display (title, duration, thumbnail) - Future story
❌ SRT/VTT Export Formats - TXT only for MVP
❌ Timestamp Display (word-level timestamps) - Future story
❌ Transcript Editing - Future story
❌ Transcription History (saving past transcriptions) - Future story
❌ User Authentication - Not needed for MVP
❌ Database Persistence - Simple in-memory state for single job
❌ Cancel Button (abort in-progress transcription) - Nice-to-have, defer if complex
❌ Progress Percentage (if too complex) - Status messages sufficient for MVP
❌ Server-Sent Events - Polling is simpler for MVP
❌ Multiple Video Formats - YouTube only for MVP
❌ Production Deployment - Local development only

Implementation Notes

Recommended Development Approach

Simplified Implementation (Single Story)

No job queue complexity (single transcription only)
Local development only (no deployment concerns)
Clear single deliverable: "Transcription works end-to-end for one video at a time"
Estimated effort: 6-9 days (reduced from original 8-12 days due to simplified scope)

Development Order (if single story)

Week 1: Backend Foundation

Install dependencies (whisper-node, yt-dlp-wrap, fluent-ffmpeg)
Copy modules from local-transcriber to /lib/transcription/
Implement single-job state management (JobManager)
Implement POST /api/transcribe endpoint
Implement GET /api/transcribe/status/:id endpoint
Basic error handling and file cleanup

Week 2: Frontend Integration

Update TranscriptionForm state management (useReducer)
Implement polling mechanism
Build LoadingIndicator component
Build ErrorDisplay component
Wire up API calls in handleTranscribe
Test basic happy path

Week 3: Testing & Hardening

Write unit tests (backend and frontend)
Write integration tests (API routes)
Write E2E test with short video
Comprehensive error handling
Manual testing (cross-browser, mobile, accessibility)
Performance testing (long videos; second request rejected while one job is running)
Documentation

Team Roles

Backend Engineer:

Copy/adapt local-transcriber modules
Implement single-job state management
Implement API routes
Error handling and cleanup
Backend tests

Frontend Engineer:

State management refactor
Polling implementation
Loading/error UI components
Frontend tests
Accessibility

QA Engineer:

Create test video library
Write test cases
Manual testing execution
Cross-browser/device testing
Accessibility testing

Tech Lead:

Architecture decisions
Code reviews
Risk mitigation
Deployment planning

Product Owner:

Acceptance criteria validation
Error message review
UX feedback during development
Final acceptance testing

Local Development Setup

Prerequisites (Developer Machine)

This story targets local development only and requires the following to be installed on your development machine:

Required System Dependencies:

ffmpeg - For audio extraction
- macOS: brew install ffmpeg
- Ubuntu/Debian: apt install ffmpeg
- Windows: Download from https://ffmpeg.org/download.html
yt-dlp - For YouTube downloads
- macOS: brew install yt-dlp
- Python (all platforms): pip install yt-dlp
- Or: https://github.com/yt-dlp/yt-dlp#installation
Node.js 20+ - Already required by project
Sufficient disk space - 5GB+ recommended for temp files and Whisper models

Whisper Models:

Downloaded automatically by whisper-node on first transcription
Size: ~75MB (tiny) to ~2.9GB (large)
Default location: ~/.cache/whisper or configured path
Default model for MVP: base (~140MB, good quality/speed balance)

Dependencies

Prerequisites (Must Be Complete)

✅ US-01: Next.js scaffold
✅ US-02: GitHub Actions CI
✅ US-03: UI foundation (UrlInput, TranscribeButton, TranscriptionOutput)

External Dependencies

Existing local-transcriber CLI (completed and working)
System binaries: ffmpeg, yt-dlp
Whisper models (~75MB-2.9GB depending on model, downloaded on first run)

Blocking Issues

None for local development; deployment platform selection is out of scope for this story
System dependencies (ffmpeg, yt-dlp) must be installable on each developer machine

Open Questions

For Product Owner to Decide

Story Splitting: Implement as single story or split into 3 sub-stories?
- Recommendation: Single story for context continuity
- Risk: High complexity, might not finish in one sprint
- Decision: [PENDING]
Cancel Button: Should users be able to cancel in-progress transcription?
- Recommendation: OUT of scope for US-04. Add in US-04.1 if users request it
- Decision: [PENDING]
Job Persistence: Should job state persist across server restarts?
- Recommendation: NO for MVP (in-memory queue). Add database persistence in US-06
- Decision: [PENDING]
Progress Granularity: Percentage vs. stage messages only?
- Recommendation: If percentage is available from whisper-node, show it. Otherwise, stage messages sufficient.
- Decision: [PENDING]

For Tech Team to Research

Whisper Model Selection: Which model to use? (tiny, base, small, medium, large)
- Spike: Test transcription quality vs. speed with different models
- Recommendation: Start with base (balanced quality/speed)
Progress Callbacks: Does whisper-node provide progress during transcription?
- Spike: Review whisper-node documentation and test
- Affects: Whether we can show percentage or just stage messages
Concurrent Job Limit (future story; this story runs one job at a time): What's realistic for 4GB RAM server?
- Spike: Test memory usage with multiple concurrent jobs
- Recommendation: When a job queue is added, start with max 3 and adjust based on testing
Disk Space Management: How much space needed per video?
- Spike: Test with various video lengths
- Recommendation: 10GB total, cleanup after each job

Estimate

Complexity Assessment (Simplified)

Overall Complexity: MEDIUM (reduced from HIGH)

Long-running async operations
External service integrations (YouTube, Whisper)
State management across frontend/backend
Comprehensive error handling
~~Resource management~~ Simple cleanup (single job only)

Technical Risk: MEDIUM (reduced from MEDIUM-HIGH)

~~Deployment platform constraints~~ Local development only
System dependency requirements (local machine)
Timeout coordination across layers
File cleanup reliability

User Experience Risk: MEDIUM

Long wait times (2-10 minutes) require excellent progress indication
Error scenarios must be clearly communicated
First-time users may not understand processing time

Effort Estimate (Simplified Scope)

Backend Development: 2-3 days (reduced from 3-5 days)

Module integration: 1 day
~~Job queue: 1 day~~ Simple state: 0.5 day
API routes: 1 day
Error handling: 0.5 day
Testing: 0.5 day

Frontend Development: 2-3 days (unchanged)

State management: 0.5 day
Polling logic: 0.5 day
UI components: 1 day
Testing: 1 day

QA & Testing: 2-3 days (reduced from 3-4 days)

Test data preparation: 0.5 day
Unit/integration tests: 1 day
E2E tests: 0.5 day
Manual testing: 1 day
~~Bug fixes & retesting: 1 day~~ (built into above)

Total Estimated Effort: 6-9 days (reduced from 8-12 days)

Story Points: 8-13 points (Medium/Large, reduced from 13-21)

Recommendation: This can fit in a single 2-week sprint for a 1-2 person team.

Risk Assessment

Critical Risks (Must Mitigate)

Risk 1: User Experience During Long Waits ⚠️ HIGHEST PRIORITY

Impact: HIGH - Users abandon if unclear what's happening
Likelihood: MEDIUM
Mitigation:
- Clear progress indication
- Elapsed time display
- Estimated time remaining (if feasible)
- "This may take several minutes" upfront message

Risk 2: Resource Exhaustion (Memory/Disk/CPU) ⚠️

Impact: MEDIUM - Dev machine slowdown, disk fills up
Likelihood: LOW (single user, local machine)
Mitigation:
- ~~Implement job queue with concurrency limit (max 3)~~ Single job only
- Automatic temp file cleanup
- ~~Monitor server resources~~ Local dev, manual monitoring
- Set hard timeouts (10 min CLI)

Risk 3: System Dependencies Missing ⚠️

Impact: MEDIUM - Transcription fails to start
Likelihood: MEDIUM (first-time setup)
Mitigation:
- Clear documentation in README for ffmpeg/yt-dlp installation
- Helpful error messages if dependencies not found
- Verify dependencies at app startup
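
A startup check could look roughly like the sketch below. It assumes both tools are expected on the `PATH` (the project may instead let yt-dlp-wrap manage its own binary, in which case the list changes), and the function name `findMissingDependencies` is hypothetical. The version flags used, `ffmpeg -version` and `yt-dlp --version`, are the tools' standard ones. The command runner is injectable so the check can be unit-tested without the tools installed.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// A runner resolves if the command ran successfully and rejects otherwise.
type Runner = (command: string, args: string[]) => Promise<unknown>;

const defaultRunner: Runner = (command, args) => execFileAsync(command, args);

// Assumption: both tools are resolved from PATH.
const REQUIRED_TOOLS: ReadonlyArray<{ command: string; args: string[] }> = [
  { command: "ffmpeg", args: ["-version"] },
  { command: "yt-dlp", args: ["--version"] },
];

// Returns the names of tools that could not be executed (empty array means all are present).
async function findMissingDependencies(run: Runner = defaultRunner): Promise<string[]> {
  const missing: string[] = [];
  for (const tool of REQUIRED_TOOLS) {
    try {
      await run(tool.command, tool.args);
    } catch {
      missing.push(tool.command);
    }
  }
  return missing;
}
```

The returned names can feed the error message directly, for example "ffmpeg was not found; see the README for installation steps".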

High Risks

Risk 4: CLI Integration Brittleness

Impact: HIGH - Transcription fails silently or with cryptic errors
Likelihood: MEDIUM
Mitigation:
- Comprehensive error handling
- Test with various video types
- Version pin all dependencies
- Extensive integration tests

Risk 5: Timeout Coordination Failures

Impact: MEDIUM - Confusing errors, hung processes
Likelihood: MEDIUM
Mitigation:
- Clear timeout strategy (10/11/12 min cascade)
- Document in code comments
- Test timeout scenarios
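
The cascade works because each inner layer gives up before the layer that wraps it, so the outer layer receives a specific error instead of hitting its own timeout. The sketch below illustrates this. Only the 10-minute CLI limit is stated in this story; mapping 11 and 12 minutes to the API route and the client is an assumption, as are the names `TIMEOUTS_MS`, `TimeoutError`, and `withTimeout`.

```typescript
const MINUTE_MS = 60_000;

// Inner layers time out first so the outer layer can report a specific error.
// Assumption: 11 min applies to the API route and 12 min to the client.
const TIMEOUTS_MS = {
  cli: 10 * MINUTE_MS,
  api: 11 * MINUTE_MS,
  client: 12 * MINUTE_MS,
} as const;

class TimeoutError extends Error {
  readonly layer: string;
  readonly limitMs: number;

  constructor(layer: string, limitMs: number) {
    super(`${layer} timed out after ${limitMs} ms`);
    this.name = "TimeoutError";
    this.layer = layer;
    this.limitMs = limitMs;
  }
}

// Rejects with TimeoutError if `work` has not settled within `limitMs`.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, layer: string, limitMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(layer, limitMs)), limitMs);
  });
  // Clear the timer either way so a finished job does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Because `withTimeout` does not stop the work itself, the CLI layer still needs a real kill. Node's `child_process.execFile` accepts a `timeout` option that terminates the child process, which is what prevents hung processes.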

Risk 6: Poor Transcription Accuracy

Impact: MEDIUM - Users dissatisfied with results
Likelihood: LOW-MEDIUM (depends on audio quality)
Mitigation:
- Test with various audio qualities
- Use base Whisper model (good quality/speed balance)
- Document limitations in UI/docs
- Future: Allow model selection

Medium Risks

Risk 7: Multiple Concurrent Users

Impact: MEDIUM - Unexpected load, queue backlog
Likelihood: LOW (MVP, limited users)
Mitigation:
- Queue system ready from day one
- Monitor queue length
- Return 503 when overloaded
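
Under the simplified single-job scope used in the estimate, "overloaded" reduces to "a job is already running". A minimal sketch of that guard follows; the class name `SingleJobGate`, the result shape, and the 202 status for an accepted job are assumptions made for illustration. Only the 503 comes from the mitigation above.

```typescript
interface StartResult {
  status: 202 | 503;
  body: { jobId?: string; error?: string };
}

// In-memory gate that admits one job at a time (state is lost on restart).
class SingleJobGate {
  private activeJobId: string | null = null;

  // Admits the job if nothing is running; otherwise reports 503.
  tryStart(jobId: string): StartResult {
    if (this.activeJobId !== null) {
      return {
        status: 503,
        body: { error: "A transcription is already running. Please try again shortly." },
      };
    }
    this.activeJobId = jobId;
    return { status: 202, body: { jobId } };
  }

  // Releases the gate; ignores ids that do not match the active job.
  finish(jobId: string): void {
    if (this.activeJobId === jobId) {
      this.activeJobId = null;
    }
  }
}
```

A route handler would call `tryStart` before launching the transcription and `finish` in a `finally` block, so a failed job cannot leave the gate closed.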

Risk 8: YouTube Rate Limiting/Blocking

Impact: MEDIUM - Downloads fail
Likelihood: LOW-MEDIUM
Mitigation:
- Implement retry with exponential backoff
- Clear error messages to users
- Consider user-agent configuration
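
Retry with exponential backoff can be sketched as below. The delay doubles after each failed attempt and is capped at a maximum. All names (`retryWithBackoff`, `backoffDelayMs`, `RetryOptions`) and any concrete delay values are illustrative assumptions; the `sleep` function is injectable so tests do not have to wait.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first; assumed to be >= 1
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number; // upper bound on any single delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Delay before retry number `retry` (1-based): base * 2^(retry - 1), capped at maxDelayMs.
function backoffDelayMs(retry: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (retry - 1));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task` until it succeeds or maxAttempts is reached, then rethrows the last error.
async function retryWithBackoff<T>(task: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No sleep after the final attempt; the error is surfaced immediately.
      if (attempt < options.maxAttempts) {
        await sleep(backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs));
      }
    }
  }
  throw lastError;
}
```

Only transient failures are worth retrying. An invalid URL or a private video will fail the same way every time, so those errors should bypass the retry and go straight to the user-facing message.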

Definition of Done

Code Complete

Testing Complete

Quality Standards

Code reviewed by Tech Lead or senior engineer
Error messages reviewed for clarity (by Product Owner)
No critical or high-severity bugs
All automated tests pass in CI
Follows project coding conventions
TypeScript types defined for all API contracts

Documentation

README updated with:
- System dependencies required (ffmpeg, yt-dlp)
- Deployment requirements (not Vercel compatible)
- Whisper model information
API documentation created (request/response examples)
Code comments for complex logic (timeout handling, queue management)

Deployment Ready

System dependencies documented
Deployment platform selected
Environment variables documented (if any)
Dockerfile created (if using containers)
Tested on target deployment platform
Monitoring/logging configured

Product Owner Acceptance

Product Owner has transcribed 3+ videos successfully
Error messages are clear and actionable
Loading experience is acceptable (not confusing)
Performance is reasonable (5-min video < 10-min processing)
User experience during long waits is reassuring

Success Metrics

Functional Success

✅ Users can transcribe YouTube videos end-to-end
✅ Transcription accuracy is acceptable (>85% for clear speech)
✅ Error rate <5% (excluding user errors like invalid URLs)
✅ System stability (no crashes during transcription)

Performance Success

✅ 2-minute video transcribes in <5 minutes
✅ 5-minute video transcribes in <10 minutes
✅ Loading indicator appears within 500ms
✅ Progress updates every 2-5 seconds
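
The 2-5 second update target maps naturally onto the polling interval on the client. The sketch below shows one way to structure the loop, assuming a hypothetical `JobStatus` response shape and a caller-supplied `fetchStatus` function; the actual API contract is defined elsewhere in this story and may differ.

```typescript
// Hypothetical status payload returned by the status endpoint.
type JobStatus =
  | { state: "running"; message: string }
  | { state: "done"; transcript: string }
  | { state: "error"; error: string };

interface PollOptions {
  intervalMs: number; // 2000-5000 to meet the update target above
  maxPolls: number; // safety bound so the loop cannot run forever
  onUpdate?: (status: JobStatus) => void; // called on every poll, e.g. to update the UI
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Polls until the job leaves the "running" state or maxPolls is exhausted.
async function pollJob(
  fetchStatus: () => Promise<JobStatus>,
  options: PollOptions,
): Promise<JobStatus> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let i = 0; i < options.maxPolls; i++) {
    const status = await fetchStatus();
    options.onUpdate?.(status);
    if (status.state !== "running") {
      return status;
    }
    await sleep(options.intervalMs);
  }
  return { state: "error", error: "Timed out waiting for the transcription to finish." };
}
```

With a 3-second interval, a `maxPolls` of 240 bounds the wait at roughly 12 minutes. The elapsed-time display should run on its own timer so it keeps ticking between polls.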

User Experience Success

✅ Users understand what's happening during long waits
✅ Error messages are actionable (users know what to do)
✅ Retry works reliably after errors
✅ Zero instances of "I thought it was frozen"

Quality Success

✅ All automated tests passing
✅ Zero critical bugs in first week after release
✅ Temp files are cleaned up (disk doesn't fill)
✅ Server remains stable under concurrent load

References

Documentation

Whisper Node: https://github.com/ariym/whisper-node
yt-dlp-wrap: https://github.com/foxesdocode/yt-dlp-wrap
fluent-ffmpeg: https://github.com/fluent-ffmpeg/node-fluent-ffmpeg
Next.js API Routes: https://nextjs.org/docs/app/building-your-application/routing/route-handlers
React useReducer: https://react.dev/reference/react/useReducer

Tools

ffmpeg Download: https://ffmpeg.org/download.html
yt-dlp Download: https://github.com/yt-dlp/yt-dlp
Whisper Models: https://github.com/openai/whisper (automatic download via whisper-node)

Story Status

Status: ✅ Ready for Sprint Planning
Priority: CRITICAL - First value delivery, blocks all downstream features
Complexity: Medium/Large (8-13 story points, reduced from 13-21)
Recommended Sprint Allocation: Single 2-week sprint for a 1-2 person team
Refined By: Product Team (PO, Tech Lead, Backend Eng, Frontend Eng, QA Eng)
Date: 2025-11-19

Notes

This is the MOST IMPORTANT STORY in the backlog after scaffolding—it transforms prototype into product
Deployment constraint is critical: Must plan for VPS/container deployment (Vercel won't work)
Consider splitting into 3 sub-stories if team velocity or risk tolerance requires it
Defer LLM enhancement (Ollama) to US-05—keep this story focused on core transcription
Manual testing with real videos is essential—automated tests cannot verify transcription quality
User experience during 2-10 minute wait is CRITICAL—invest in clear progress indication
Error messages must be user-friendly—have Product Owner review all error text
Resource management (queue, cleanup) is non-negotiable—prevents server issues

Key Success Factor: This story must be solid and reliable because it's the foundation for all future features. Time invested in quality now prevents major rework later.

Product Owner Sign-Off: [PENDING]
Tech Lead Sign-Off: [PENDING]
Ready for Sprint Planning: YES