Asset Generation

How the visuals stage resolves images, video clips, stock footage, and music in parallel.

The Visuals stage is the most complex and expensive stage in the pipeline. It takes the DirectorScore and TTS word timings and produces all the media assets needed for rendering: images, video clips, and background music.

Parallel execution

The stage runs two major work streams concurrently using Promise.all:

Visuals Stage
  |
  +-- Scene asset resolution (all scenes in parallel)
  |     |-- ai_image scenes -> Image Prompter -> Image Gen
  |     |-- ai_video scenes -> Image Gen -> Video Gen
  |     |-- stock_image/stock_video -> Adaptive Stock Resolver
  |     |-- text_card scenes -> no asset needed
  |
  +-- Music resolution (runs alongside scenes)
        |-- Music Prompter -> Lyria API
        |-- (fallback) -> Bundled music

Each scene is resolved independently, so a slow stock search for scene 3 does not block AI image generation for scene 5.
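
The two concurrent streams can be pictured as nested Promise.all calls. This is a minimal sketch, not the orchestrator's actual code: the Scene and Asset shapes, and the functions resolveSceneAsset, resolveMusic, and runVisualsStage, are hypothetical stand-ins.

```typescript
// Hypothetical shapes; the real DirectorScore types are not shown in this document.
type Scene = { id: number; visual_type: string };
type Asset = { sceneId: number; path: string | null };

// Stand-in for per-scene resolution (assumed name and signature).
async function resolveSceneAsset(scene: Scene): Promise<Asset> {
  const path = scene.visual_type === "text_card" ? null : `scene-${scene.id}.png`;
  return { sceneId: scene.id, path };
}

// Stand-in for music resolution (assumed name and signature).
async function resolveMusic(mood: string): Promise<string> {
  return `music-${mood}.mp3`;
}

async function runVisualsStage(scenes: Scene[], mood: string) {
  // Both streams start immediately; neither waits for the other.
  const [assets, music] = await Promise.all([
    // Inner Promise.all: every scene resolves independently of the others.
    Promise.all(scenes.map((scene) => resolveSceneAsset(scene))),
    resolveMusic(mood),
  ]);
  return { assets, music };
}
```

Promise.all rejects as soon as any input promise rejects, so in a layout like this each resolver would need to handle its own fallbacks internally rather than throw.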

Visual asset resolution

The resolveVisualAsset function dispatches each scene based on its visual_type:

ai_image

  1. The Image Prompter agent optimizes the scene's visual_prompt into a detailed image-generation prompt, injecting the archetype's style bible (art style, color palette, lighting, composition rules, mood, anti-artifact guidance)
  2. The configured image provider (Gemini Imagen, OpenAI DALL-E) generates the image
  3. If the image provider rejects the prompt because of its safety filters, the stage retries with a softened prompt that conveys the same mood through atmosphere and implication
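
The retry in step 3 might look like the following sketch. The GenResult type, the callback signatures, and the function name generateWithSafetyRetry are assumptions made for illustration; real providers report safety rejections in their own ways.

```typescript
// Hypothetical result type: a provider either returns a path or reports a failure.
type GenResult =
  | { ok: true; path: string }
  | { ok: false; safetyRejected: boolean; message: string };

// Assumed signatures for the Image Prompter and the image provider.
type PromptFn = (visualPrompt: string, rejectionContext?: string) => Promise<string>;
type GenerateFn = (prompt: string) => Promise<GenResult>;

async function generateWithSafetyRetry(
  visualPrompt: string,
  optimizePrompt: PromptFn,
  generateImage: GenerateFn,
): Promise<GenResult> {
  const first = await generateImage(await optimizePrompt(visualPrompt));
  // Success, or a failure unrelated to safety filters, is returned unchanged.
  if (first.ok || !first.safetyRejected) return first;

  // Safety rejection: ask the prompter for a softened prompt and try again.
  const softened = await optimizePrompt(visualPrompt, first.message);
  return generateImage(softened);
}
```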

ai_video

A two-phase process:

  1. Phase 1: AI image -- generates a first-frame image using the same flow as ai_image
  2. Phase 2: Video animation -- the image is sent to a video provider (Veo, Kling via fal.ai) with a motion-aware prompt generated by the Image Prompter in "video" mode

The video resolver picks the smallest supported duration that fits the scene's voiceover timing. If all video providers fail, the scene falls back to the Phase 1 still image with Ken Burns motion applied. A module-level concurrency limiter (pLimit(3)) prevents overwhelming video generation APIs.
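
Two pieces of this paragraph can be sketched in a few lines: choosing the clip duration and capping concurrency. Both functions below are illustrative assumptions. pickVideoDuration is a hypothetical name, and its behavior when the scene is longer than every supported duration (returning the longest) is a guess, not documented behavior. createLimiter is a hand-rolled stand-in that mimics the call shape of pLimit so the sketch has no dependencies.

```typescript
// Smallest supported clip length (seconds) that covers the scene's voiceover.
// Assumption: if nothing is long enough, fall back to the longest supported length.
function pickVideoDuration(supported: number[], sceneSeconds: number): number {
  let smallestFit = Infinity;
  let longest = -Infinity;
  for (const d of supported) {
    if (d >= sceneSeconds && d < smallestFit) smallestFit = d;
    if (d > longest) longest = d;
  }
  if (longest === -Infinity) throw new Error("no supported durations");
  return smallestFit !== Infinity ? smallestFit : longest;
}

// Minimal concurrency limiter: at most `max` tasks run at once, the rest queue.
function createLimiter(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  const release = (): void => {
    active--;
    const next = waiting.shift();
    if (next) next();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        active++;
        Promise.resolve()
          .then(task)
          .then(
            (value) => { release(); resolve(value); },
            (error) => { release(); reject(error); },
          );
      };
      if (active < max) start();
      else waiting.push(start);
    });
  };
}

// Module-level limiter shared by all video generation calls.
const videoLimit = createLimiter(3);
```

A caller would wrap each provider request, for example `videoLimit(() => generateClip(scene))`, where generateClip is whatever function performs the request.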

stock_image / stock_video

The Adaptive Stock Resolver handles these with a multi-step flow:

  1. Original query -- search each configured stock provider (Pexels, Pixabay) with the scene's visual_prompt as a search query
  2. Verification -- if a verification model is configured, each candidate is downloaded and evaluated by a vision model for relevance to the script line. The top candidate above the confidence threshold is selected
  3. Reformulation -- if all candidates from the original query are rejected, the LLM generates alternative search queries and the process repeats
  4. AI fallback -- if all stock attempts are exhausted (up to maxAttempts, default 4), the scene falls back to AI image generation. Rejection context from failed stock searches is passed to the Image Prompter as negative examples

Without verification enabled, the first stock result is accepted directly.
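
The search, verify, reformulate, and fallback loop can be sketched as follows. Everything here is a hypothetical shape for illustration: the StockDeps interface, the resolveStock name, the 0.7 confidence threshold, and the choice to count each query (original or reformulated) as one attempt are all assumptions.

```typescript
type Candidate = { url: string; provider: string };
type Verdict = { confidence: number; reason: string };

interface StockDeps {
  search: (query: string) => Promise<Candidate[]>; // queries all configured providers
  verify?: (candidate: Candidate, scriptLine: string) => Promise<Verdict>; // optional vision check
  reformulate: (query: string, rejections: string[]) => Promise<string>; // LLM query rewrite
}

type StockResult =
  | { kind: "stock"; candidate: Candidate }
  | { kind: "ai_fallback"; rejections: string[] };

async function resolveStock(
  query: string,
  scriptLine: string,
  deps: StockDeps,
  maxAttempts = 4,
  threshold = 0.7, // assumed value; the real threshold is not documented here
): Promise<StockResult> {
  const verify = deps.verify;
  const rejections: string[] = [];
  let current = query;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidates = await deps.search(current);

    if (!verify) {
      // Without verification, the first result is accepted directly.
      const [first] = candidates;
      if (first) return { kind: "stock", candidate: first };
    } else {
      let best: { candidate: Candidate; confidence: number } | null = null;
      for (const candidate of candidates) {
        const verdict = await verify(candidate, scriptLine);
        if (verdict.confidence < threshold) {
          rejections.push(`${current}: ${verdict.reason}`);
        } else if (!best || verdict.confidence > best.confidence) {
          best = { candidate, confidence: verdict.confidence };
        }
      }
      if (best) return { kind: "stock", candidate: best.candidate };
    }

    if (attempt < maxAttempts) current = await deps.reformulate(current, rejections);
  }

  // Stock attempts exhausted: hand the rejection context to the AI image path.
  return { kind: "ai_fallback", rejections };
}
```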

text_card

No asset is generated. The renderer creates the text card from the scene's script_line and the archetype's color palette.
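
The dispatch on visual_type described in this section can be pictured as an exhaustive switch. In this minimal sketch, planVisualAsset and the step labels are hypothetical names chosen for illustration; it is not the signature of the real resolveVisualAsset.

```typescript
type VisualType = "ai_image" | "ai_video" | "stock_image" | "stock_video" | "text_card";

// Compile-time exhaustiveness guard: adding a new VisualType without a case fails to type-check.
function assertNever(value: never): never {
  throw new Error("Unhandled visual_type: " + String(value));
}

// Returns the ordered processing steps for a scene, as described above.
function planVisualAsset(visualType: VisualType): string[] {
  switch (visualType) {
    case "ai_image":
      return ["image_prompter", "image_gen"];
    case "ai_video":
      // Phase 1 reuses the ai_image flow; phase 2 animates the first frame.
      return ["image_prompter", "image_gen", "image_prompter:video", "video_gen"];
    case "stock_image":
    case "stock_video":
      return ["adaptive_stock_resolver"];
    case "text_card":
      return []; // the renderer builds the card; no asset is needed
    default:
      return assertNever(visualType);
  }
}
```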

Image Prompter agent

The Image Prompter is a shared agent used by both ai_image and ai_video pipelines, as well as the stock fallback path. It transforms a scene description into a provider-optimized prompt.

The prompt includes:

  • A style bible derived from the archetype config: art style, color palette, lighting, composition rules, cultural markers, mood, and anti-artifact guidance
  • The scene position (e.g., "Scene 2 of 6") for emotional intensity calibration
  • The narration text, so the image adds to the narration rather than repeating it
  • Rules covering vertical (9:16) composition, keeping text out of images, naming real people and places, and depicting dark themes through atmosphere rather than graphic content

For ai_video scenes, the prompt shifts focus to motion, camera movement, and temporal dynamics.

When retrying after a safety rejection, the rejection context is injected into the prompt, telling the LLM to rewrite it without explicit content while preserving the archetype style and emotional tone.
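
As a rough illustration of how these inputs could be assembled into the message sent to the Image Prompter, consider the sketch below. The StyleBible and SceneContext field names, the wording of each line, and the buildPrompterInput function are all hypothetical; the actual system prompt lives in prompts/image-prompter.md.

```typescript
// Hypothetical field names mirroring the style bible elements listed above.
interface StyleBible {
  artStyle: string;
  colorPalette: string[];
  lighting: string;
  composition: string;
  culturalMarkers: string;
  mood: string;
  antiArtifact: string;
}

interface SceneContext {
  visualPrompt: string;
  narration: string;
  sceneIndex: number; // 1-based position in the video
  sceneCount: number;
  mode: "image" | "video";
  rejectionContext?: string; // set only when retrying after a safety rejection
}

function buildPrompterInput(style: StyleBible, scene: SceneContext): string {
  const lines = [
    `Scene ${scene.sceneIndex} of ${scene.sceneCount}`,
    `Scene description: ${scene.visualPrompt}`,
    `Narration: ${scene.narration}`,
    `Art style: ${style.artStyle}`,
    `Color palette: ${style.colorPalette.join(", ")}`,
    `Lighting: ${style.lighting}`,
    `Composition: ${style.composition}`,
    `Cultural markers: ${style.culturalMarkers}`,
    `Mood: ${style.mood}`,
    `Avoid: ${style.antiArtifact}`,
    "Compose for vertical 9:16. No text in the image.",
  ];
  if (scene.mode === "video") {
    lines.push("Focus on motion, camera movement, and temporal dynamics.");
  }
  if (scene.rejectionContext) {
    lines.push(
      `Previous attempt was rejected: ${scene.rejectionContext}. ` +
        "Avoid explicit content; keep the archetype style and emotional tone.",
    );
  }
  return lines.join("\n");
}
```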

Music resolution

Music runs in parallel with scene asset generation. The resolver follows this flow:

Music Resolver
  |
  +-- Bundled provider? -> Select pre-packaged track by mood -> Done
  |
  +-- Lyria provider:
        |
        +-- Music Prompter agent -> Generate detailed Lyria prompt
        |     (genre, instruments, tempo, timestamp sections, constraints)
        |
        +-- Lyria 3 Pro API -> Generate audio file
        |
        +-- On failure -> Bundled fallback

Music Prompter agent

For AI music generation, the Music Prompter agent produces a rich text prompt for Lyria 3 Pro. It receives:

  • The music_mood from the DirectorScore (one of 8 mood tags)
  • The emotional_arc for dynamic progression
  • The archetype name and mood for production aesthetic
  • Per-scene durations with timestamps and narration text

The agent translates narrative emotion into musical direction -- it describes sound (instruments, dynamics, tempo, texture) and never references the video's topic or characters. This is a hard constraint because Lyria's safety filter blocks prompts referencing violence, politics, or specific artist names.

The output prompt includes:

  1. Opening direction with genre, instruments, and tempo
  2. Timestamp sections ([0:00 - 0:12]) matching scene timing with intensity levels
  3. Critical constraints: purely instrumental, no vocals, background level, exact duration
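
The timestamp sections in item 2 follow directly from the per-scene durations. A minimal sketch, assuming a hypothetical SceneTiming shape and hypothetical helper names; boundaries are rounded from the running total so rounding error does not accumulate across scenes.

```typescript
// Hypothetical input shape: one entry per scene.
type SceneTiming = { durationSeconds: number; intensity: string };

// Formats seconds as m:ss, e.g. 68 becomes "1:08".
function formatTimestamp(totalSeconds: number): string {
  const whole = Math.round(totalSeconds);
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

// Builds one "[start - end] intensity" line per scene, with contiguous ranges.
function buildTimestampSections(scenes: SceneTiming[]): string[] {
  let start = 0;
  return scenes.map((scene) => {
    const end = start + scene.durationSeconds;
    const line = `[${formatTimestamp(start)} - ${formatTimestamp(end)}] ${scene.intensity}`;
    start = end;
    return line;
  });
}
```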

Bundled fallback

If Lyria generation fails (safety rejection, API error), the resolver falls back to pre-packaged bundled tracks selected by mood. The rejected Lyria prompt is preserved in metadata for debugging.
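
The resolver flow, including the fallback and the preserved prompt, can be sketched like this. The MusicDeps interface, the MusicResult shape, and the resolveMusicTrack name are assumptions; the generate callback stands in for the Lyria call and is assumed to reject on any failure.

```typescript
type MusicResult = {
  source: "lyria" | "bundled";
  track: string;
  rejectedPrompt?: string; // kept in metadata for debugging when Lyria fails
};

interface MusicDeps {
  buildPrompt: (mood: string) => Promise<string>; // Music Prompter agent
  generate: (prompt: string) => Promise<string>; // Lyria call; rejects on failure
  bundledTrackFor: (mood: string) => string; // picks a pre-packaged track by mood
}

async function resolveMusicTrack(
  provider: "bundled" | "lyria",
  mood: string,
  deps: MusicDeps,
): Promise<MusicResult> {
  // Bundled provider: no prompting or generation at all.
  if (provider === "bundled") {
    return { source: "bundled", track: deps.bundledTrackFor(mood) };
  }

  const prompt = await deps.buildPrompt(mood);
  try {
    return { source: "lyria", track: await deps.generate(prompt) };
  } catch {
    // Safety rejection or API error: fall back, but keep the prompt that failed.
    return { source: "bundled", track: deps.bundledTrackFor(mood), rejectedPrompt: prompt };
  }
}
```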

Scene duration calculation

Scene durations are computed from TTS word timings using proportional scaling. The number of words the TTS engine actually speaks can differ from the script's word count because numbers and abbreviations are expanded, so the TTS words are split into per-scene groups in proportion to each scene's expected word count (from script_line). Each scene's duration is the time span of its word group plus a 0.5-second buffer, with a 2-second minimum.
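
One way to implement this calculation is sketched below. The WordTiming shape and the computeSceneDurations name are assumptions, as is the detail of measuring a group's span from the start of its first word to the end of its last word.

```typescript
type WordTiming = { word: string; start: number; end: number }; // times in seconds

function computeSceneDurations(
  expectedWordCounts: number[], // words per scene, counted from each script_line
  words: WordTiming[], // actual TTS output, in spoken order
  buffer = 0.5,
  minimum = 2,
): number[] {
  const expectedTotal = expectedWordCounts.reduce((sum, n) => sum + n, 0);
  if (expectedTotal === 0) return expectedWordCounts.map(() => minimum);

  const durations: number[] = [];
  let cumulative = 0;
  let from = 0;

  expectedWordCounts.forEach((count, i) => {
    cumulative += count;
    const isLast = i === expectedWordCounts.length - 1;
    // Scale the expected boundary onto the actual TTS word list.
    // The last scene always absorbs any remaining words.
    const to = isLast
      ? words.length
      : Math.min(words.length, Math.round((cumulative * words.length) / expectedTotal));

    const group = words.slice(from, to);
    const first = group[0];
    const last = group[group.length - 1];
    const span = first && last ? last.end - first.start : 0;

    durations.push(Math.max(minimum, span + buffer));
    from = to;
  });

  return durations;
}
```

For example, two scenes with three expected words each and eight spoken words (after number expansion) receive four TTS words apiece.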

Source files

File                                      Role
src/pipeline/orchestrator.ts              Visuals step, resolveVisualAsset dispatch
src/agents/image-prompter.ts              Image/video prompt optimization agent
src/agents/music-prompter.ts              Lyria prompt generation agent
src/pipeline/music-resolver.ts            Music resolution with Lyria/bundled fallback
src/providers/stock/adaptive-resolver.ts  Stock search, verify, reformulate, AI fallback
src/providers/stock/query-reformer.ts     LLM-based stock query reformulation
src/providers/stock/stock-verifier.ts     Vision model verification of stock candidates
src/providers/video/video-resolver.ts     AI video generation with provider failover
prompts/image-prompter.md                 Image Prompter system prompt
prompts/music-prompter.md                 Music Prompter system prompt