Loading...
Loading...
Loading...
As a **content creator, researcher, or student**
# US-04: YouTube Video Transcription Implementation
## User Story
As a **content creator, researcher, or student**
I want to **transcribe audio from a YouTube video URL into readable text**
So that **I can read, search, quote, or repurpose spoken content without having to watch or listen to the entire video**
---
## Business Value
### Problem We're Solving
**Current State:** The app has a polished UI (US-03) but the "Transcribe" button shows placeholder text. Users cannot get actual transcripts from YouTube videos.
**Impact:** Without real transcription, we don't have a product—we have a demo. This story is the **first true value delivery moment** that transforms the prototype into a functional tool.
### Value Delivered
- **First Real Functionality:** Users can finally get actual YouTube transcripts
- **Privacy-Focused:** Local transcription (using existing `local-transcriber`) keeps data private
- **Accuracy:** Whisper-based transcription typically provides better results than YouTube auto-captions
- **Accessibility:** Makes video content accessible to deaf/hard-of-hearing users
- **Content Utility:** Enables searchable, quotable text from video content
### Strategic Alignment
This is our **MVP completion milestone**. Without this story, every downstream feature (export formats, timestamp display, history) has nothing to build upon. This validates our technical approach and enables real user feedback.
**Success Criteria:**
- Users successfully transcribe at least one video end-to-end
- User experience during 2-10 minute wait is clear and reassuring
- Error scenarios are handled gracefully with actionable messages
---
## Context & Prerequisites
### Completed Stories
- ✅ **US-01:** Next.js 15 + Vitest scaffold
- ✅ **US-02:** GitHub Actions PR verification pipeline
- ✅ **US-03:** UI foundation with UrlInput, TranscribeButton, TranscriptionOutput components
### Existing Assets
- **Local Transcriber CLI:** `~/Repos/LT/claude-code-playground/local-transcriber`
- Uses: `whisper-node` (local Whisper.cpp), `yt-dlp-wrap` (YouTube downloads), `fluent-ffmpeg` (audio extraction)
- Architecture: Modular design with YouTubeDownloader, VideoProcessor, Transcriber, LLMEnhancer
- Progress callbacks: Already implemented via `ProcessingProgress` interface
- File cleanup: Automatic temp file management
### Scope Constraints (Simplifications)
- **Local Development Only:** No production deployment concerns (Vercel, Docker, etc.)
- **Single Transcription:** Only one transcription at a time (no job queue, no concurrency)
- **Developer Machine:** Assumes ffmpeg and yt-dlp installed on local development machine
---
## Acceptance Criteria
### AC1: YouTube URL Validation and Submission
- [ ] **GIVEN** a valid YouTube URL (formats: `watch?v=`, `youtu.be`, `embed`, `m.youtube.com`)
- **WHEN** user enters URL and clicks Transcribe
- **THEN** transcription process starts immediately
- [ ] **GIVEN** an invalid URL (empty, not YouTube, malformed)
- **WHEN** user clicks Transcribe
- **THEN** error message appears: "Please enter a valid YouTube URL"
- **AND** button remains disabled for empty input
**Verification:** Test with 10+ URL formats (5 valid variants, 5 invalid cases)
---
### AC2: Async Job Creation
- [ ] **GIVEN** user submits valid YouTube URL
- **WHEN** Transcribe button clicked
- **THEN** API returns immediately (within 500ms) with job ID
- **AND** user sees loading state with initial message
- [ ] **GIVEN** job submitted successfully
- **WHEN** API responds
- **THEN** frontend stores job ID and begins polling for status
**Verification:** Network tab shows POST /api/transcribe returns 202 Accepted with jobId
---
### AC3: Real-Time Progress Indication
- [ ] **GIVEN** transcription in progress
- **WHEN** user waits
- **THEN** loading indicator remains visible throughout process
- **AND** progress updates appear (percentage or status messages)
- **AND** elapsed time displays (e.g., "Started 2m 30s ago")
- [ ] **GIVEN** transcription stages (downloading → extracting → transcribing)
- **THEN** user sees current stage in UI ("Downloading audio...", "Transcribing...")
**Verification:** Visual inspection shows progress updates every 2-5 seconds
---
### AC4: Loading State Behavior
- [ ] **GIVEN** transcription starts
- **THEN** loading indicator appears within 500ms
- **AND** Transcribe button becomes disabled
- **AND** URL input becomes disabled
- **AND** button text changes to "Transcribing..." with spinner
- [ ] **GIVEN** transcription in progress
- **THEN** user cannot start another transcription
- **AND** previous transcription results (if any) remain visible until new results arrive
**Verification:** Try clicking button multiple times rapidly—only one transcription starts
---
### AC5: Transcription Success
- [ ] **GIVEN** transcription completes successfully
- **THEN** actual transcript text appears in TranscriptionOutput component
- **AND** loading indicator disappears
- **AND** Transcribe button re-enables
- **AND** URL input re-enables
- **AND** transcript text is selectable and copyable
- [ ] **GIVEN** completed transcript displayed
- **THEN** user can start new transcription without page refresh
**Verification:** Test with 3 different real videos (30 seconds, 2 minutes, 5 minutes)
---
### AC6: Video Availability Errors
- [ ] **GIVEN** YouTube URL points to private/unavailable video
- **THEN** error message: "This video is private or unavailable. Please use a public video."
- **AND** loading indicator disappears
- **AND** form re-enables for retry
- [ ] **GIVEN** YouTube URL points to deleted video
- **THEN** error message: "Video not found. It may have been removed."
- [ ] **GIVEN** YouTube URL points to age-restricted video
- **THEN** error message: "Cannot access age-restricted videos. Please use a different video."
**Verification:** Test with known private, deleted, and age-restricted video URLs
---
### AC7: Transcription Failures
- [ ] **GIVEN** network error during download
- **THEN** error message: "Failed to download video. Please check your connection and try again."
- **AND** "Retry" button available
- [ ] **GIVEN** transcription process fails (Whisper error, ffmpeg error)
- **THEN** error message: "Transcription failed. Please try again or contact support if this persists."
- **AND** "Retry" button available
- [ ] **GIVEN** any error occurs
- **THEN** loading indicator disappears
- **AND** form re-enables
- **AND** error is announced to screen readers (aria-live)
**Verification:** Mock failures in backend, verify frontend error handling
---
### AC8: Timeout Handling
- [ ] **GIVEN** transcription exceeds 10 minutes (CLI timeout)
- **THEN** error message: "Transcription timed out. Please try a shorter video or retry later."
- **AND** process is killed on server
- **AND** temp files are cleaned up
- [ ] **GIVEN** frontend timeout (12 minutes)
- **THEN** frontend shows timeout error even if backend still processing
- **AND** user can retry
**Verification:** Mock 10+ minute delay, verify timeout triggers correctly
---
### AC9: Resource Management (Simplified - Single Transcription)
- [ ] **GIVEN** transcription completes (success or failure)
- **THEN** server cleans up temp files within 60 seconds
- [ ] **GIVEN** transcription is already in progress
- **WHEN** user tries to start another transcription
- **THEN** button remains disabled
- **AND** UI shows "Transcription in progress" message
**Verification:** Check temp directory is cleaned up, try starting second transcription while first is running
---
### AC10: State Management
- [ ] **GIVEN** user navigates away during transcription
- **THEN** transcription continues (or is canceled, team decision)
- **AND** server cleans up resources appropriately
- [ ] **GIVEN** user refreshes page during transcription
- **THEN** in-progress job is lost (acceptable for MVP)
- **AND** server continues processing and eventually cleans up
**Verification:** Manual testing with browser back/forward/refresh
---
### AC11: Accessibility
- [ ] **GIVEN** transcription state changes (start, progress, complete, error)
- **THEN** changes are announced to screen readers via aria-live
- [ ] **GIVEN** loading indicator visible
- **THEN** button has aria-busy="true"
- **AND** screen reader announces "Transcription in progress"
- [ ] **GIVEN** progress updates occur
- **THEN** significant milestones announced (25%, 50%, 75%, complete)
- **NOT** every 1% change (too chatty)
**Verification:** Test with VoiceOver (macOS) or NVDA (Windows)
---
### AC12: API Contract Implementation
- [ ] **POST /api/transcribe** endpoint exists and returns:
- 202 Accepted: `{ jobId: string, status: 'queued' | 'processing', estimatedDuration?: number }`
- 400 Bad Request: `{ error: { code: string, message: string } }` for invalid URL
- 503 Service Unavailable: `{ error: { code: 'SERVER_BUSY', message: string, retryAfter: number } }`
- [ ] **GET /api/transcribe/status/:jobId** endpoint exists and returns:
- 200 OK (processing): `{ jobId, status: 'processing', progress: { stage: string, percent: number, message: string } }`
- 200 OK (complete): `{ jobId, status: 'complete', result: { transcript: string, duration: number, language: string } }`
- 200 OK (failed): `{ jobId, status: 'failed', error: { code: string, message: string } }`
- 404 Not Found: `{ error: { code: 'JOB_NOT_FOUND', message: string } }`
- [ ] Both endpoints have TypeScript types defined in `/lib/types/transcription.ts`
**Verification:** Use curl or Postman to call endpoints, verify response structure
---
## Technical Implementation
### High-Level Architecture (Simplified - Single Job)
```
┌─────────────────┐
│ User Browser │
│ │
│ [React Form] │ Polling every 2s
│ │ │ ◄─────────────┐
│ ▼ │ │
│ POST /api/ │ │
│ transcribe │ │
│ │ │ │
└───────┼─────────┘ │
│ │
▼ │
┌─────────────────────────────────┴──┐
│ Next.js API Routes │
│ │
│ POST /api/transcribe │
│ → Creates single job │
│ → Returns jobId │
│ → Blocks if already processing │
│ │
│ GET /api/transcribe/status/:id │
│ → Returns job status │
│ │
│ Simple State (in-memory) │
│ → Single job at a time │
│ → No queue needed │
│ │
└────────────────┬───────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Transcription Service │
│ (from local-transcriber) │
│ │
│ YouTubeDownloader │
│ → yt-dlp-wrap │
│ → Downloads video/audio │
│ │ │
│ ▼ │
│ VideoProcessor │
│ → fluent-ffmpeg │
│ → Extracts audio (WAV) │
│ │ │
│ ▼ │
│ Transcriber │
│ → whisper-node │
│ → Transcribes audio │
│ │ │
│ ▼ │
│ Returns transcript text │
│ │
└──────────────────────────────────────┘
```
### Directory Structure
```
/next-transcriber
/app
/api
/transcribe
/route.ts # POST - Create job
/route.test.ts
/status
/[jobId]
/route.ts # GET - Poll status
/route.test.ts
/lib
/transcription
/core # Copied from local-transcriber
/transcriber.ts
/video-processor.ts
/youtube-downloader.ts
/queue
/job-manager.ts # Job queue implementation
/job-processor.ts # Background worker
/types.ts # TypeScript interfaces
/types
/api.ts # API request/response types
```
### Technology Stack
**Backend:**
- Next.js 15 API Routes (server-side only)
- whisper-node (Whisper.cpp Node.js bindings)
- fluent-ffmpeg (audio extraction)
- yt-dlp-wrap (YouTube downloads)
- In-memory job queue (Map-based for MVP)
**Frontend:**
- React 19 with hooks (useReducer for state)
- Polling mechanism (setInterval with cleanup)
- Tailwind CSS for loading indicators
**Dependencies to Install:**
```json
{
"dependencies": {
"whisper-node": "^1.1.1",
"fluent-ffmpeg": "^2.1.3",
"yt-dlp-wrap": "^2.3.12",
"@types/fluent-ffmpeg": "^2.1.27"
}
}
```
**System Dependencies (must be installed on server):**
- ffmpeg (audio processing)
- yt-dlp (YouTube downloads)
- Whisper models (downloaded on first run, ~100MB-1.5GB)
---
## Backend Implementation
### API Design
#### POST /api/transcribe
**Request:**
```typescript
{
url: string // YouTube URL (required)
}
```
**Response (202 Accepted):**
```typescript
{
jobId: string // UUID
status: 'queued'
estimatedDuration: number // Rough estimate in seconds
}
```
**Response (400 Bad Request):**
```typescript
{
error: {
code: 'INVALID_URL' | 'INVALID_INPUT'
message: string
}
}
```
#### GET /api/transcribe/status/:jobId
**Response (200 OK - Processing):**
```typescript
{
jobId: string
status: 'queued' | 'processing'
progress: {
stage: 'downloading' | 'extracting' | 'transcribing'
percent: number // 0-100
message: string
}
createdAt: string
updatedAt: string
}
```
**Response (200 OK - Complete):**
```typescript
{
jobId: string
status: 'complete'
result: {
transcript: string
duration: number // Video duration in seconds
language: string // Detected language
wordCount: number
}
createdAt: string
completedAt: string
}
```
**Response (200 OK - Failed):**
```typescript
{
jobId: string
status: 'failed'
error: {
code: string
message: string
}
createdAt: string
failedAt: string
}
```
### Integration with Local Transcriber
**Step 1: Copy Core Modules**
```bash
# Copy modules from local-transcriber
cp -r ~/Repos/LT/claude-code-playground/local-transcriber/src/modules \
/next-transcriber/lib/transcription/core/
# Copy types
cp -r ~/Repos/LT/claude-code-playground/local-transcriber/src/types \
/next-transcriber/lib/transcription/types/
# Copy utilities (selective)
cp ~/Repos/LT/claude-code-playground/local-transcriber/src/utils/{logger,file-utils,youtube-utils}.ts \
/next-transcriber/lib/transcription/utils/
```
**Step 2: Adapt for Web (Remove CLI Dependencies)**
- Remove `commander` imports
- Keep progress callbacks (already async-friendly)
- Adapt logger for Next.js (use console or structured logger)
- Remove interactive prompts
**Step 3: Implement Job Queue**
```typescript
// /lib/transcription/queue/job-manager.ts
interface TranscriptionJob {
jobId: string
url: string
status: 'queued' | 'processing' | 'complete' | 'failed'
progress: ProcessingProgress
result?: TranscriptionResult
error?: ErrorDetail
createdAt: Date
updatedAt: Date
}
class JobManager {
private jobs = new Map<string, TranscriptionJob>()
private queue: string[] = []
private maxConcurrent = 3
async createJob(url: string): Promise<string> {
const jobId = crypto.randomUUID()
const job: TranscriptionJob = {
jobId,
url,
status: 'queued',
progress: { stage: 'queued', progress: 0, message: 'Waiting to start...' },
createdAt: new Date(),
updatedAt: new Date()
}
this.jobs.set(jobId, job)
this.queue.push(jobId)
this.processQueue()
return jobId
}
async getJobStatus(jobId: string): Promise<TranscriptionJob | null> {
return this.jobs.get(jobId) || null
}
private async processQueue() {
const processing = Array.from(this.jobs.values())
.filter(j => j.status === 'processing').length
if (processing >= this.maxConcurrent) return
const nextJobId = this.queue.shift()
if (!nextJobId) return
const job = this.jobs.get(nextJobId)
if (!job) return
this.processJob(job)
}
private async processJob(job: TranscriptionJob) {
job.status = 'processing'
job.updatedAt = new Date()
try {
const transcriber = new TranscriberApp(config)
const result = await transcriber.processFile(
job.url,
undefined,
{ mode: 'none' }, // No LLM enhancement for MVP
(progress) => {
// Update job progress
job.progress = progress
job.updatedAt = new Date()
}
)
job.status = 'complete'
job.result = { transcript: result, /* ... */ }
} catch (error) {
job.status = 'failed'
job.error = { code: 'TRANSCRIPTION_ERROR', message: error.message }
}
this.processQueue() // Start next job
}
}
export const jobManager = new JobManager()
```
### Timeout Implementation
**Three-Layer Timeout Strategy:**
```typescript
// CLI Tool: 10 minutes
const CLI_TIMEOUT = 10 * 60 * 1000
// Backend API: 11 minutes (1 min buffer to kill CLI and respond)
const BACKEND_TIMEOUT = 11 * 60 * 1000
// Frontend: 12 minutes (1 min buffer to receive backend response)
const FRONTEND_TIMEOUT = 12 * 60 * 1000
```
**Backend Implementation:**
```typescript
async function processWithTimeout(job: TranscriptionJob) {
const timeoutPromise = new Promise((_, reject) => {
setTimeout(() => reject(new Error('TIMEOUT')), BACKEND_TIMEOUT)
})
try {
await Promise.race([
transcriber.processFile(job.url, ...),
timeoutPromise
])
} catch (error) {
if (error.message === 'TIMEOUT') {
// Kill CLI process, cleanup files
throw new Error('Transcription timed out')
}
throw error
}
}
```
---
## Frontend Implementation
### State Management
**Recommended: useReducer Pattern**
```typescript
// hooks/useTranscription.ts
interface TranscriptionState {
url: string
output: string
status: 'idle' | 'validating' | 'transcribing' | 'complete' | 'error'
jobId: string | null
progress: number
stage: string
error: ErrorDetail | null
startedAt: Date | null
estimatedTimeRemaining: number | null
}
type TranscriptionAction =
| { type: 'SET_URL'; url: string }
| { type: 'START_TRANSCRIPTION'; jobId: string }
| { type: 'UPDATE_PROGRESS'; progress: number; stage: string; estimatedTime?: number }
| { type: 'SET_COMPLETE'; transcript: string }
| { type: 'SET_ERROR'; error: ErrorDetail }
| { type: 'RESET' }
function transcriptionReducer(state: TranscriptionState, action: TranscriptionAction) {
switch (action.type) {
case 'SET_URL':
return { ...state, url: action.url }
case 'START_TRANSCRIPTION':
return {
...state,
status: 'transcribing',
jobId: action.jobId,
startedAt: new Date(),
error: null,
progress: 0
}
case 'UPDATE_PROGRESS':
return {
...state,
progress: action.progress,
stage: action.stage,
estimatedTimeRemaining: action.estimatedTime ?? state.estimatedTimeRemaining
}
case 'SET_COMPLETE':
return {
...state,
status: 'complete',
output: action.transcript,
progress: 100
}
case 'SET_ERROR':
return {
...state,
status: 'error',
error: action.error,
progress: 0
}
case 'RESET':
return initialState
default:
return state
}
}
```
### Polling Mechanism
```typescript
useEffect(() => {
if (!jobId || status !== 'transcribing') return
const poll = async () => {
try {
const response = await fetch(`/api/transcribe/status/${jobId}`)
const data = await response.json()
if (data.status === 'processing') {
dispatch({
type: 'UPDATE_PROGRESS',
progress: data.progress.percent,
stage: data.progress.stage,
estimatedTime: calculateEstimate(data)
})
} else if (data.status === 'complete') {
dispatch({
type: 'SET_COMPLETE',
transcript: data.result.transcript
})
} else if (data.status === 'failed') {
dispatch({
type: 'SET_ERROR',
error: data.error
})
}
} catch (error) {
console.error('Polling error:', error)
// Retry logic here
}
}
const intervalId = setInterval(poll, 2000) // Poll every 2 seconds
return () => clearInterval(intervalId)
}, [jobId, status])
```
### Loading Indicator Component
```typescript
interface LoadingIndicatorProps {
progress: number
stage: string
startedAt: Date
estimatedTimeRemaining?: number
}
function LoadingIndicator({ progress, stage, startedAt, estimatedTimeRemaining }: LoadingIndicatorProps) {
const elapsed = Math.floor((Date.now() - startedAt.getTime()) / 1000)
return (
<div role="status" aria-live="polite" className="loading-container">
{/* Progress Bar */}
<div
role="progressbar"
aria-valuenow={progress}
aria-valuemin={0}
aria-valuemax={100}
className="progress-bar"
>
<div className="progress-fill" style={{ width: `${progress}%` }} />
</div>
{/* Status Text */}
<div className="status-text">
<p>{stage} {progress > 0 && `${progress}%`}</p>
<p className="text-muted">
Started {formatDuration(elapsed)} ago
{estimatedTimeRemaining && ` • ~${formatDuration(estimatedTimeRemaining)} remaining`}
</p>
</div>
</div>
)
}
```
### Error Display Component
```typescript
interface ErrorDisplayProps {
error: ErrorDetail
onRetry: () => void
onDismiss: () => void
}
function ErrorDisplay({ error, onRetry, onDismiss }: ErrorDisplayProps) {
return (
<div role="alert" aria-live="assertive" className="error-banner">
<div className="error-icon">⚠️</div>
<div className="error-content">
<strong>Transcription Failed</strong>
<p>{error.message}</p>
</div>
<div className="error-actions">
{error.retryable && (
<button onClick={onRetry} className="btn-primary">
Retry
</button>
)}
<button onClick={onDismiss} aria-label="Dismiss error" className="btn-secondary">
✕
</button>
</div>
</div>
)
}
```
---
## QA & Testing Strategy
### Unit Tests (Automated)
**Backend Tests:**
```typescript
// /lib/transcription/youtube-utils.test.ts
describe('YouTube URL Validation', () => {
it('should accept standard youtube.com URLs')
it('should accept youtu.be short URLs')
it('should accept URLs with timestamps')
it('should reject non-YouTube URLs')
it('should extract video ID correctly from all formats')
})
// /lib/transcription/job-manager.test.ts
describe('Job Manager', () => {
it('should create job with valid URL')
it('should reject invalid URL')
it('should limit concurrent jobs to 3')
it('should queue jobs when at capacity')
it('should process queued jobs when slot available')
})
```
**Frontend Tests:**
```typescript
// /app/transcribe/_components/TranscriptionForm.test.tsx
describe('TranscriptionForm', () => {
it('should disable button when input empty')
it('should enable button when URL entered')
it('should show loading state when transcribing')
it('should display progress updates')
it('should display transcript on success')
it('should display error on failure')
it('should allow retry after error')
})
```
### Integration Tests
```typescript
// /app/api/transcribe/route.test.ts
describe('POST /api/transcribe', () => {
it('should return 202 with jobId for valid URL')
it('should return 400 for invalid URL')
it('should return 503 when server busy')
})
describe('GET /api/transcribe/status/:jobId', () => {
it('should return processing status')
it('should return complete status with transcript')
it('should return failed status with error')
it('should return 404 for unknown jobId')
})
```
### E2E Tests
```typescript
// tests/e2e/transcription.test.ts
describe('Full Transcription Workflow', () => {
it('should transcribe short video successfully', async () => {
// Use 30-second test video
const testUrl = 'https://youtube.com/watch?v=SHORT_TEST_VIDEO'
// Submit job
const { jobId } = await fetch('/api/transcribe', {
method: 'POST',
body: JSON.stringify({ url: testUrl })
}).then(r => r.json())
// Poll until complete
let status = 'processing'
while (status === 'processing') {
await sleep(2000)
const result = await fetch(`/api/transcribe/status/${jobId}`).then(r => r.json())
status = result.status
}
// Verify result
expect(status).toBe('complete')
expect(result.result.transcript).toBeTruthy()
}, 120000) // 2 minute timeout
})
```
### Manual Testing Checklist
**Critical Manual Tests:**
- [ ] Transcribe 2-minute public YouTube video
- [ ] Transcribe 5-minute video (typical use case)
- [ ] Test with private video → verify error message
- [ ] Test with deleted video → verify error message
- [ ] Test with invalid URL → verify error message
- [ ] Loading indicator appears immediately (<500ms)
- [ ] Progress updates visible during processing
- [ ] Error messages are clear and actionable
- [ ] Retry works after error
- [ ] Multiple browser tabs don't conflict
- [ ] Browser back/forward during transcription
- [ ] Test on Chrome, Firefox, Safari
- [ ] Test on mobile (iOS Safari, Android Chrome)
- [ ] Screen reader test (VoiceOver or NVDA)
### Test Data
**Test Videos to Create/Find:**
- Short clear speech (10-30 seconds) - fast E2E test
- Medium video (2-3 minutes) - typical use case
- Long video (10 minutes) - stress test
- Video with background noise - quality test
- Private video URL - error handling test
- Deleted video URL - error handling test
---
## Edge Cases to Handle
### URL Edge Cases
- Empty input (handled by disabled button)
- Whitespace-only input
- Valid URL but not YouTube
- Malformed YouTube URL
- YouTube URL variations (watch?v=, youtu.be, embed, mobile)
- URL with timestamp (t=120s)
- URL with playlist parameter
### Video Availability
- Private video
- Unlisted video (should work)
- Deleted video
- Age-restricted video
- Region-restricted video
- Premium/paid content
- Live stream (currently streaming)
- Upcoming scheduled video
### Video Content
- No audio track
- Very short video (<5 seconds)
- Very long video (>1 hour)
- Multiple audio tracks/languages
- Music-only (no speech)
- Heavy background noise
- Multiple overlapping speakers
### System Failures
- Network timeout during download
- Disk full during processing
- ffmpeg not installed
- yt-dlp not installed
- Whisper models not downloaded
- Server restart during transcription
- Multiple concurrent requests (queue management)
### Timeout Scenarios
- Frontend timeout (12 min)
- Backend timeout (11 min)
- CLI timeout (10 min)
- Network request timeout during polling
---
## Out of Scope (Future Stories)
This story explicitly does **NOT** include:
- ❌ **Concurrent Transcriptions** - Single transcription at a time (defer job queue to future story)
- ❌ **Copy to Clipboard Button** - Deferred to US-05 (users can manually copy for now)
- ❌ **LLM Enhancement** (cleanup/formatting via Ollama) - Deferred to US-05
- ❌ **YouTube URL Validation** (detailed format checking) - Basic validation only
- ❌ **YouTube Metadata Display** (title, duration, thumbnail) - Future story
- ❌ **SRT/VTT Export Formats** - TXT only for MVP
- ❌ **Timestamp Display** (word-level timestamps) - Future story
- ❌ **Transcript Editing** - Future story
- ❌ **Transcription History** (saving past transcriptions) - Future story
- ❌ **User Authentication** - Not needed for MVP
- ❌ **Database Persistence** - Simple in-memory state for single job
- ❌ **Cancel Button** (abort in-progress transcription) - Nice-to-have, defer if complex
- ❌ **Progress Percentage** (if too complex) - Status messages sufficient for MVP
- ❌ **Server-Sent Events** - Polling is simpler for MVP
- ❌ **Multiple Video Formats** - YouTube only for MVP
- ❌ **Production Deployment** - Local development only
---
## Implementation Notes
### Recommended Development Approach
**Simplified Implementation (Single Story)**
- No job queue complexity (single transcription only)
- Local development only (no deployment concerns)
- Clear single deliverable: "Transcription works end-to-end for one video at a time"
- **Estimated effort:** 5-8 days (reduced from original 8-12 days due to simplified scope)
### Development Order (if single story)
**Week 1: Backend Foundation**
1. Install dependencies (whisper-node, yt-dlp-wrap, fluent-ffmpeg)
2. Copy modules from local-transcriber to /lib/transcription/
3. Implement job queue (JobManager)
4. Implement POST /api/transcribe endpoint
5. Implement GET /api/transcribe/status/:id endpoint
6. Basic error handling and file cleanup
**Week 2: Frontend Integration**
1. Update TranscriptionForm state management (useReducer)
2. Implement polling mechanism
3. Build LoadingIndicator component
4. Build ErrorDisplay component
5. Wire up API calls in handleTranscribe
6. Test basic happy path
**Week 3: Testing & Hardening**
1. Write unit tests (backend and frontend)
2. Write integration tests (API routes)
3. Write E2E test with short video
4. Comprehensive error handling
5. Manual testing (cross-browser, mobile, accessibility)
6. Performance testing (concurrent jobs)
7. Documentation
### Team Roles
**Backend Engineer:**
- Copy/adapt local-transcriber modules
- Implement job queue
- Implement API routes
- Error handling and cleanup
- Backend tests
**Frontend Engineer:**
- State management refactor
- Polling implementation
- Loading/error UI components
- Frontend tests
- Accessibility
**QA Engineer:**
- Create test video library
- Write test cases
- Manual testing execution
- Cross-browser/device testing
- Accessibility testing
**Tech Lead:**
- Architecture decisions
- Code reviews
- Risk mitigation
- Deployment planning
**Product Owner:**
- Acceptance criteria validation
- Error message review
- UX feedback during development
- Final acceptance testing
---
## Local Development Setup
### Prerequisites (Developer Machine)
This story targets **local development only** and requires the following to be installed on your development machine:
**Required System Dependencies:**
- [ ] **ffmpeg** - For audio extraction
- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `apt install ffmpeg`
- Windows: Download from https://ffmpeg.org/download.html
- [ ] **yt-dlp** - For YouTube downloads
- macOS: `brew install yt-dlp`
- Python (all platforms): `pip install yt-dlp`
- Or: https://github.com/yt-dlp/yt-dlp#installation
- [ ] **Node.js 20+** - Already required by project
- [ ] **Sufficient disk space** - 5GB+ recommended for temp files and Whisper models
**Whisper Models:**
- Downloaded automatically by `whisper-node` on first transcription
- Size: ~100MB (tiny) to ~1.5GB (large)
- Default location: `~/.cache/whisper` or configured path
- Default model for MVP: `base` (~140MB, good quality/speed balance)
---
## Dependencies
### Prerequisites (Must Be Complete)
- ✅ **US-01:** Next.js scaffold
- ✅ **US-02:** GitHub Actions CI
- ✅ **US-03:** UI foundation (UrlInput, TranscribeButton, TranscriptionOutput)
### External Dependencies
- Existing local-transcriber CLI (completed and working)
- System binaries: ffmpeg, yt-dlp
- Whisper models (~100MB-1.5GB downloaded on first run)
### Blocking Issues
- **Deployment platform must be selected** before implementation
- **System dependencies must be installable** on target platform
---
## Open Questions
### For Product Owner to Decide
1. **Story Splitting:** Implement as single story or split into 3 sub-stories?
- **Recommendation:** Single story for context continuity
- **Risk:** High complexity, might not finish in one sprint
- **Decision:** [PENDING]
2. **Cancel Button:** Should users be able to cancel in-progress transcription?
- **Recommendation:** OUT of scope for US-04. Add in US-04.1 if users request it
- **Decision:** [PENDING]
3. **Job Persistence:** Should job state persist across server restarts?
- **Recommendation:** NO for MVP (in-memory queue). Add database persistence in US-06
- **Decision:** [PENDING]
4. **Progress Granularity:** Percentage vs. stage messages only?
- **Recommendation:** If percentage is available from whisper-node, show it. Otherwise, stage messages sufficient.
- **Decision:** [PENDING]
### For Tech Team to Research
1. **Whisper Model Selection:** Which model to use? (tiny, base, small, medium, large)
- **Spike:** Test transcription quality vs. speed with different models
- **Recommendation:** Start with `base` (balanced quality/speed)
2. **Progress Callbacks:** Does whisper-node provide progress during transcription?
- **Spike:** Review whisper-node documentation and test
- **Affects:** Whether we can show percentage or just stage messages
3. **Concurrent Job Limit:** What's realistic for 4GB RAM server?
- **Spike:** Test memory usage with multiple concurrent jobs
- **Recommendation:** Start with max 3, adjust based on testing
4. **Disk Space Management:** How much space needed per video?
- **Spike:** Test with various video lengths
- **Recommendation:** 10GB total, cleanup after each job
---
## Estimate
### Complexity Assessment (Simplified)
**Overall Complexity:** MEDIUM (reduced from HIGH)
- Long-running async operations
- External service integrations (YouTube, Whisper)
- State management across frontend/backend
- Comprehensive error handling
- ~~Resource management~~ Simple cleanup (single job only)
**Technical Risk:** MEDIUM (reduced from MEDIUM-HIGH)
- ~~Deployment platform constraints~~ Local development only
- System dependency requirements (local machine)
- Timeout coordination across layers
- File cleanup reliability
**User Experience Risk:** MEDIUM
- Long wait times (2-10 minutes) require excellent progress indication
- Error scenarios must be clearly communicated
- First-time users may not understand processing time
### Effort Estimate (Simplified Scope)
**Backend Development:** 2-3 days (reduced from 3-5 days)
- Module integration: 1 day
- ~~Job queue: 1 day~~ Simple state: 0.5 day
- API routes: 1 day
- Error handling: 0.5 day
- Testing: 0.5 day
**Frontend Development:** 2-3 days (unchanged)
- State management: 0.5 day
- Polling logic: 0.5 day
- UI components: 1 day
- Testing: 1 day
**QA & Testing:** 2-3 days (reduced from 3-4 days)
- Test data preparation: 0.5 day
- Unit/integration tests: 1 day
- E2E tests: 0.5 day
- Manual testing: 1 day
- ~~Bug fixes & retesting: 1 day~~ (built into above)
**Total Estimated Effort:** 6-9 days (reduced from 8-12 days)
**Story Points:** 8-13 points (Medium/Large, reduced from 13-21)
**Recommendation:** This can fit in a single 2-week sprint for a 1-2 person team.
---
## Risk Assessment
### Critical Risks (Must Mitigate)
**Risk 1: User Experience During Long Waits** ⚠️ HIGHEST PRIORITY
- **Impact:** HIGH - Users abandon if unclear what's happening
- **Likelihood:** MEDIUM
- **Mitigation:**
- Clear progress indication
- Elapsed time display
- Estimated time remaining (if feasible)
- "This may take several minutes" upfront message
**Risk 2: Resource Exhaustion (Memory/Disk/CPU)** ⚠️
- **Impact:** MEDIUM - Dev machine slowdown, disk fills up
- **Likelihood:** LOW (single user, local machine)
- **Mitigation:**
- ~~Implement job queue with concurrency limit (max 3)~~ Single job only
- Automatic temp file cleanup
- ~~Monitor server resources~~ Local dev, manual monitoring
- Set hard timeouts (10 min CLI)
**Risk 3: System Dependencies Missing** ⚠️
- **Impact:** MEDIUM - Transcription fails to start
- **Likelihood:** MEDIUM (first-time setup)
- **Mitigation:**
- Clear documentation in README for ffmpeg/yt-dlp installation
- Helpful error messages if dependencies not found
- Verify dependencies at app startup
### High Risks
**Risk 4: CLI Integration Brittleness**
- **Impact:** HIGH - Transcription fails silently or with cryptic errors
- **Likelihood:** MEDIUM
- **Mitigation:**
- Comprehensive error handling
- Test with various video types
- Version pin all dependencies
- Extensive integration tests
**Risk 5: Timeout Coordination Failures**
- **Impact:** MEDIUM - Confusing errors, hung processes
- **Likelihood:** MEDIUM
- **Mitigation:**
- Clear timeout strategy (10/11/12 min cascade)
- Document in code comments
- Test timeout scenarios
**Risk 6: Poor Transcription Accuracy**
- **Impact:** MEDIUM - Users dissatisfied with results
- **Likelihood:** LOW-MEDIUM (depends on audio quality)
- **Mitigation:**
- Test with various audio qualities
- Use `base` Whisper model (good quality/speed balance)
- Document limitations in UI/docs
- Future: Allow model selection
### Medium Risks
**Risk 7: Multiple Concurrent Users**
- **Impact:** MEDIUM - Unexpected load, queue backlog
- **Likelihood:** LOW (MVP, limited users)
- **Mitigation:**
- Queue system ready from day one
- Monitor queue length
- Return 503 when overloaded
**Risk 8: YouTube Rate Limiting/Blocking**
- **Impact:** MEDIUM - Downloads fail
- **Likelihood:** LOW-MEDIUM
- **Mitigation:**
- Implement retry with exponential backoff
- Clear error messages to users
- Consider user-agent configuration
---
## Definition of Done
### Code Complete
- [ ] All acceptance criteria met (AC1-AC12)
- [ ] POST /api/transcribe endpoint implemented and tested
- [ ] GET /api/transcribe/status/:id endpoint implemented and tested
- [ ] Job queue implemented with concurrency limits
- [ ] Frontend polling mechanism implemented
- [ ] Loading indicator with progress display
- [ ] Error handling for all scenarios
- [ ] Timeout handling at all layers
- [ ] File cleanup implemented and tested
- [ ] TypeScript compiles without errors
- [ ] No console errors/warnings
### Testing Complete
- [ ] Unit tests written and passing (>80% coverage)
- [ ] Integration tests written and passing
- [ ] E2E test with short video passing
- [ ] Manual test: 2-minute video transcribes successfully
- [ ] Manual test: 5-minute video transcribes successfully
- [ ] Manual test: Error scenarios tested (private video, invalid URL)
- [ ] Cross-browser tested (Chrome, Firefox, Safari)
- [ ] Mobile tested (iOS Safari, Android Chrome)
- [ ] Accessibility tested (VoiceOver or NVDA)
- [ ] Loading indicator appears within 500ms
- [ ] Progress updates visible during processing
### Quality Standards
- [ ] Code reviewed by Tech Lead or senior engineer
- [ ] Error messages reviewed for clarity (by Product Owner)
- [ ] No critical or high-severity bugs
- [ ] All automated tests pass in CI
- [ ] Follows project coding conventions
- [ ] TypeScript types defined for all API contracts
### Documentation
- [ ] README updated with:
- System dependencies required (ffmpeg, yt-dlp)
- Deployment requirements (not Vercel compatible)
- Whisper model information
- [ ] API documentation created (request/response examples)
- [ ] Code comments for complex logic (timeout handling, queue management)
### Deployment Ready
- [ ] System dependencies documented
- [ ] Deployment platform selected
- [ ] Environment variables documented (if any)
- [ ] Dockerfile created (if using containers)
- [ ] Tested on target deployment platform
- [ ] Monitoring/logging configured
### Product Owner Acceptance
- [ ] Product Owner has transcribed 3+ videos successfully
- [ ] Error messages are clear and actionable
- [ ] Loading experience is acceptable (not confusing)
- [ ] Performance is reasonable (5-min video < 10-min processing)
- [ ] User experience during long waits is reassuring
---
## Success Metrics
### Functional Success
- ✅ Users can transcribe YouTube videos end-to-end
- ✅ Transcription accuracy is acceptable (>85% for clear speech)
- ✅ Error rate <5% (excluding user errors like invalid URLs)
- ✅ System stability (no crashes during transcription)
### Performance Success
- ✅ 2-minute video transcribes in <5 minutes
- ✅ 5-minute video transcribes in <10 minutes
- ✅ Loading indicator appears within 500ms
- ✅ Progress updates every 2-5 seconds
### User Experience Success
- ✅ Users understand what's happening during long waits
- ✅ Error messages are actionable (users know what to do)
- ✅ Retry works reliably after errors
- ✅ Zero instances of "I thought it was frozen"
### Quality Success
- ✅ All automated tests passing
- ✅ Zero critical bugs in first week after release
- ✅ Temp files are cleaned up (disk doesn't fill)
- ✅ Server remains stable under concurrent load
---
## Related Stories
### Prerequisites
- ✅ **US-01:** Scaffold Next.js + Vitest - COMPLETE
- ✅ **US-02:** GitHub Actions PR Verification - COMPLETE
- ✅ **US-03:** Transcription UI Foundation - COMPLETE
### Blocks
- None (but strongly recommended before adding advanced features)
### Future Enhancements (Unblocks These)
- **US-05:** Copy to Clipboard Button + LLM Enhancement (cleanup/formatting)
- **US-06:** Transcript Export (SRT, VTT formats)
- **US-07:** Timestamp Display & Video Navigation
- **US-08:** Transcription History & Persistence
- **US-09:** Advanced Settings (model selection, language override)
- **US-10:** YouTube Metadata Display (title, duration, thumbnail)
---
## References
### Documentation
- **Whisper Node:** https://github.com/ariym/whisper-node
- **yt-dlp-wrap:** https://github.com/foxesdocode/yt-dlp-wrap
- **fluent-ffmpeg:** https://github.com/fluent-ffmpeg/node-fluent-ffmpeg
- **Next.js API Routes:** https://nextjs.org/docs/app/building-your-application/routing/route-handlers
- **React useReducer:** https://react.dev/reference/react/useReducer
### Tools
- **ffmpeg Download:** https://ffmpeg.org/download.html
- **yt-dlp Download:** https://github.com/yt-dlp/yt-dlp
- **Whisper Models:** https://github.com/openai/whisper (automatic download via whisper-node)
---
## Story Status
**Status:** ✅ Ready for Sprint Planning
**Priority:** CRITICAL - First value delivery, blocks all downstream features
**Complexity:** Large (13-21 story points)
**Recommended Sprint Allocation:** Full 2-week sprint for 2-3 person team
**Refined By:** Product Team (PO, Tech Lead, Backend Eng, Frontend Eng, QA Eng)
**Date:** 2025-11-19
---
## Notes
- This is the **MOST IMPORTANT STORY** in the backlog after scaffolding—it transforms prototype into product
- **Deployment constraint is critical:** Must plan for VPS/container deployment (Vercel won't work)
- Consider splitting into 3 sub-stories if team velocity or risk tolerance requires it
- Defer LLM enhancement (Ollama) to US-05—keep this story focused on core transcription
- Manual testing with real videos is essential—automated tests cannot verify transcription quality
- User experience during 2-10 minute wait is CRITICAL—invest in clear progress indication
- Error messages must be user-friendly—have Product Owner review all error text
- Resource management (queue, cleanup) is non-negotiable—prevents server issues
**Key Success Factor:** This story must be **solid and reliable** because it's the foundation for all future features. Time invested in quality now prevents major rework later.
---
**Product Owner Sign-Off:** [PENDING]
**Tech Lead Sign-Off:** [PENDING]
**Ready for Sprint Planning:** YES
You are an autonomous senior full-stack engineer responsible for building and maintaining a complete SaaS product. You operate with minimal supervision, making independent decisions while consulting on major strategic changes.
<author>blefnk/rules</author>
trigger: model_decision
description: Authoritative guide for all software-writing agents in this repository