Building Dynamic Audio with Emotion & Pace: Gemini 3.1 Flash TTS, Angular & Firebase Cloud Functions

Google released the Gemini 3.1 Flash TTS Preview model for AI audio generation in the Gemini API, Gemini in Vertex AI, and Google AI Studio. This model introduces a new `Audio tags` feature to express human emotion, pace, and style.

This application uses Firebase AI Logic to analyze an uploaded image and generate recommendations, a description, alternative tags, and an obscure fact. The obscure fact is sent to a Firebase Cloud Function, which generates audio using a Gemini TTS model. The Cloud Function returns the stream to an Angular application that converts it to a Blob URL. An audio player uses the URL as its source, and users click the Play button to play the stream.

In this blog post, I migrate my application to use the Gemini 3.1 Flash TTS Preview model and create a signal form in Angular to input a scene, emotion, and pace. Then, the Angular application provides the form values and the obscure fact to the Firebase Cloud Function to generate an expressive voice using the GenAI TypeScript SDK.

## Prerequisites

The technical stack of the project:

- **Angular 21:** The latest version as of May 2026.
- **Node.js LTS:** The LTS version as of May 2026.
- **Firebase Remote Config:** To manage dynamic parameters.
- **Firebase Cloud Functions:** To generate an expressive human voice when called by the frontend.
- **Firebase Local Emulator Suite:** To test the functions locally at `http://localhost:5001`.
- **Gemini in Vertex AI:** To generate audio with the Gemini TTS model.

The public Google AI Studio API is restricted in my region (Hong Kong). However, Vertex AI (Google Cloud) offers enterprise access that works reliably here, so I chose Vertex AI for this demo.

```bash
npm i -g firebase-tools
```

Install `firebase-tools` globally using `npm`.

```bash
firebase logout
```

```bash
firebase login
```

Log out of Firebase and log in again so that the CLI is properly authenticated.
```bash
firebase init
```

Execute `firebase init` and follow the prompts to set up Firebase Cloud Functions, the Firebase Local Emulator Suite, Firebase Cloud Storage, and Firebase Remote Config.

If you have an existing project or multiple projects, you can specify the project ID on the command line.

```bash
firebase init --project <PROJECT_ID>
```

In both cases, the Firebase CLI automatically installs the `firebase-admin` and `firebase-functions` dependencies.

After completing the setup steps, the Firebase tools generate the functions emulator, functions, a storage rules file, remote config templates, and configuration files such as `.firebaserc` and `firebase.json`.

- Angular dependency

```bash
npm i firebase
```

The Angular application requires the `firebase` dependency to initialize a Firebase app, load remote config, and invoke the Firebase Cloud Functions to generate audio.

- Firebase dependencies

```bash
npm i @cfworker/json-schema @google/genai @modelcontextprotocol/sdk
```

Install the above dependencies to access Gemini in Vertex AI. `@google/genai` depends on `@cfworker/json-schema` and `@modelcontextprotocol/sdk`. Without these, the Cloud Functions cannot start.

With our project configured, let's look at how the frontend and backend communicate.

---

## Architecture

![High-level architecture of obscure fact generation](https://raw.githubusercontent.com/railsstudent/colab_images/refs/heads/main/blog-posts/gemini-tts-firebase-angular/generate-an-obscure-fact.jpg)

A user uploads an image in an Angular application and prompts the Gemini 3.1 Flash Lite Preview model to generate a few recommendations for improving the image, a description, and alternative tags. The user also uses the same model and the Google Search tool to find an obscure fact related to the image.
![High-level architecture of audio generation](https://raw.githubusercontent.com/railsstudent/colab_images/refs/heads/main/blog-posts/gemini-tts-firebase-angular/generate-tts-with-audio-tags.jpg)

A user inputs a scene, an emotion, and a pace in an experimental signal form. When a user clicks the generate audio button, the Angular application sends the form values and the obscure fact to the Firebase Cloud Function to generate an expressive voice using the GenAI TypeScript SDK and Gemini 3.1 Flash TTS Preview model.

---

## Limitations of Gemini 3.1 Flash TTS Preview Model

- The model can only accept text inputs and generate audio outputs.
- The context window is 32K tokens.
- TTS does not support streaming.
- The supported languages can be found in <https://ai.google.dev/gemini-api/docs/speech-generation#languages>. My mother tongue, Cantonese, is currently unsupported.

---

## Firebase Integration

### 1. Configure Environment Variables

Defining the environment variables in the Firebase project ensures the functions know the region of the Google Cloud project, the Firebase Cloud Function location, and the required TTS model.

**`.env.example`**

```env
GOOGLE_CLOUD_LOCATION="global"
GOOGLE_FUNCTION_LOCATION="asia-east2"
GEMINI_TTS_MODEL_NAME="gemini-3.1-flash-tts-preview"
WHITELIST="http://localhost:4200"
REFERER="http://localhost:4200/"
```

| Variable | Description |
| --- | --- |
| GOOGLE_CLOUD_LOCATION | The region of the Google Cloud project. I chose `global` so that the Firebase project has access to the newest Gemini 3.1 Flash TTS preview model. |
| GOOGLE_FUNCTION_LOCATION | The region of the Firebase Cloud Functions. I chose `asia-east2` because this is the region where I live. |
| GEMINI_TTS_MODEL_NAME | The Gemini TTS model that generates the audio. |
| WHITELIST | Requests must come from <http://localhost:4200>. |
| REFERER | Requests originate from <http://localhost:4200/>. |

<http://localhost:4200> is the host and port number of my local Angular application.

### 2. Validating Environment Variables

Before the Cloud Function proceeds with any AI calls, it is critical to ensure that all necessary environment variables are present. I implemented an `AUDIO_CONFIG` IIFE (Immediately Invoked Function Expression) to validate environment variables like the TTS model name, Google Cloud Project ID, and location.

```typescript
import logger from "firebase-functions/logger";

// Returns the value when present; otherwise logs the problem and records the field name.
export function validate(value: string | undefined, fieldName: string, missingKeys: string[]) {
  const err = `${fieldName} is missing.`;
  if (!value) {
    logger.error(err);
    missingKeys.push(fieldName);
    return "";
  }

  return value;
}
```

```typescript
import { HttpsError } from "firebase-functions/v2/https";

export const AUDIO_CONFIG = (() => {
  logger.info("AUDIO_CONFIG initialization: Loading environment variables and validating configuration...");

  const env = process.env;
  const missingKeys: string[] = [];
  const location = validate(env.GOOGLE_CLOUD_LOCATION, "Vertex Location", missingKeys);
  const model = validate(env.GEMINI_TTS_MODEL_NAME, "Gemini TTS Model Name", missingKeys);
  const project = validate(env.GCLOUD_PROJECT, "Google Cloud Project", missingKeys);

  // Report every missing variable at once instead of failing on the first one.
  if (missingKeys.length > 0) {
    throw new HttpsError("failed-precondition", `Missing environment variables: ${missingKeys.join(", ")}`);
  }

  return {
    genAIOptions: {
      project,
      location,
      vertexai: true,
    },
    model,
  };
})();
```

I am using Node 24 as of May 2026. Since Node 20.12, we can use the built-in `process.loadEnvFile` function that loads environment variables from the `.env` file.

In `env.ts`, the try-catch block attempts to load the environment variables from the `.env` file.

```typescript
try {
  process.loadEnvFile();
} catch {
  // Ignore error if .env file is not found (e.g., in production where env vars are set by the platform)
}
```

In `src/index.ts`, the first statement imports `env.ts` before importing other files and libraries.

```typescript
import "./env";
... other import statements ...
```

If you are using a Node version that does not support `process.loadEnvFile`, the alternative is to install `dotenv` to load the environment variables.

```bash
npm i dotenv
```

```typescript
import dotenv from "dotenv";

dotenv.config();
```

Firebase provides the `GCLOUD_PROJECT` variable, so it is not defined in the `.env` file.

When the `missingKeys` array is not empty, `AUDIO_CONFIG` throws an error that lists all the missing variable names. If the validation is successful, the `genAIOptions` and `model` are returned. The `genAIOptions` object is used to initialize `GoogleGenAI`, and `model` is the selected TTS model name.

### 3. Sanitize the Prompt Inputs

The Cloud Function sanitizes the scene and transcript before composing the audio prompt.

The `sanitizeScene` function replaces each real newline in the scene with an escaped `\n` (a backslash followed by `n`). A newline can create a blank line, which often signals the end of a block. The sanitization flattens the scene into one continuous line so that the model reads it as a single paragraph. It also strips leading `#` characters, so the scene cannot begin with an injected Markdown header.

```typescript
function sanitizeScene(text: string): string {
  return (text || "")
    .trim()
    .replace(/\r?\n/g, "\\n") // flatten real newlines into a literal backslash-n
    .replace(/^[#\s]+/gm, ""); // strip leading "#" characters and whitespace
}
```

The `sanitizeTranscript` function removes leading `#` characters from every line and collapses triple quotes into a single quote, so the transcript cannot inject Markdown headers or close the `"""` block early.

```typescript
function sanitizeTranscript(text: string): string {
  return (text || "")
    .trim()
    .replace(/^#+/gm, "") // strip Markdown header markers at the start of each line
    .replace(/"""/g, '"'); // collapse triple quotes
}
```

### 4. Build an Audio Prompt

The `AudioPrompt` type encapsulates the scene, emotion, pace, transcript, and voice option to set the location, audio tags, text, and persona of the audio.

```typescript
export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
};
```

The `SCENE_DICTIONARY` is an array of scenes.
When the user does not provide a scene, one is selected at random from the array.

```typescript
export const SCENE_DICTIONARY = [
  "A dimly lit, dusty library filled with ancient leather-bound books.\n" +
    "The air is thick with history. A scholarly archivist is leaning closely into a warm, vintage ribbon microphone.\n" +
    "They speak with an infectious, hushed intensity, eager to share a forgotten secret they just uncovered in a decaying manuscript.",
  "It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright.\n" +
    "The red 'ON AIR' tally light is blazing. The speaker is standing up, bouncing on the balls of their heels to the rhythm of a thumping backing track.\n" +
    "It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.",
  "A meticulously sound-treated bedroom in a suburban home.\n" +
    "The space is deadened by plush velvet curtains and a heavy rug, creating an intimate, close-up acoustic environment.\n" +
    "The speaker delivers the information like a trusted friend sharing an inside joke.",
  "A high-tech, minimalist laboratory humming with servers.\n" +
    "Crisp, clean acoustics reflect off glass and steel.\n" +
    "A brilliant but eccentric scientist is pacing back and forth, speaking rapidly and enthusiastically into a headset microphone, excited to explain a complex phenomenon.",
];
```

I define a `buildAudioPrompt` function to construct the advanced audio prompt. When an emotion is defined, the tag is `[<emotion>]`. When a pace is defined, the tag is `[<pace>]`. The combined audio tag is `[<emotion>] [<pace>]<a space>`; the trailing space creates a proper token boundary between the tag and the text.

The `insertAudioTagsToTranscript` function uses a regular expression to split the transcript into sentences, inserts the combined audio tag before each sentence, and then joins them with an empty string.

The `buildAudioPrompt` function concatenates the scene and the expressive transcript into a string before returning it.
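Before the full listing, here is a simplified sketch of the tagging rule described above. It is not the function used in the demo: `tagSentences` is a hypothetical name, and the sketch assumes sentences end with `.`, `!`, or `?` and ignores the abbreviation handling (`Mr`, `Dr`, `e.g`, and so on) that the real implementation includes.

```typescript
// Simplified illustration only; the demo's real function handles abbreviations and quotes.
function makeTag(value: string): string {
  const trimmed = value.trim();
  // The trailing space keeps the tag and the following word in separate tokens.
  return trimmed ? `[${trimmed}] ` : "";
}

// Hypothetical helper: prefixes every sentence with the combined audio tags.
function tagSentences(transcript: string, emotion: string, pace: string): string {
  const audioTags = `${makeTag(emotion)}${makeTag(pace)}`;
  // Each match is a sentence body, its terminators, and any trailing whitespace.
  const sentences = transcript.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  return sentences
    .map((sentence) => (sentence.trim() ? `${audioTags}${sentence}` : sentence))
    .join("");
}

const tagged = tagSentences("The lighthouse is old. It still works!", "excited", "fast");
console.log(tagged);
// [excited] [fast] The lighthouse is old. [excited] [fast] It still works!
```

When the pace is empty, `makeTag` returns an empty string, so only the emotion tag is emitted and no stray brackets reach the model.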
```typescript
import { SCENE_DICTIONARY } from './constants/scenes.const';
import { AudioPrompt } from './types/audio-prompt.type';

// Wraps a non-empty value in square brackets, followed by a space.
function makeTag(value: string) {
  const trimmedValue = value.trim();
  return trimmedValue ? `[${trimmedValue}] ` : "";
}

function insertAudioTagsToTranscript({ transcript, pace, emotion }: AudioPrompt): string {
  const audioTags = `${makeTag(emotion)}${makeTag(pace)}`;
  const cleanedTranscript = sanitizeTranscript(transcript);

  // Split on sentence terminators, but not after common abbreviations.
  // The capturing group keeps the delimiters at the odd indices of the result.
  const parts = cleanedTranscript.split(/(?<!\b(?:Mr|Mrs|Ms|Dr|St|i\.e|e\.g))([.!?\n\r]+[”"’']*\s*)/);

  return parts
    .map((text, i, arr) => {
      if (i % 2 !== 0) {
        return ""; // Skip delimiters, they are appended to the text blocks
      }
      const delimiter = arr[i + 1] || "";
      return text.trim() ? `${audioTags}${text.trim()}${delimiter}` : delimiter;
    })
    .join("");
}

export function buildAudioPrompt(data: AudioPrompt): string {
  // Fall back to a random scene when the user leaves the scene empty.
  const randomIndex = Math.floor(Math.random() * SCENE_DICTIONARY.length);
  const selectedScene = SCENE_DICTIONARY[randomIndex];
  const trimmedScene = (data.scene || "").trim() || selectedScene;
  const escapedScene = sanitizeScene(trimmedScene);
  const transcript = insertAudioTagsToTranscript(data);

  return `## Scene:
${escapedScene}

## Transcript:
"""
${transcript}
"""
`;
}
```

The output of the prompt looks like:

```markdown
## Scene:
<scene>

## Transcript:
"""
[<emotion>] [<pace>] <sentence 1>[<emotion>] [<pace>] <sentence 2>...[<emotion>] [<pace>] <sentence N>
"""
```

### 5. Generating Expressive Human Audio in a Firebase Cloud Function

The `createVoiceConfig` function constructs an instance of `GenerateContentConfig` that outputs speech narrated by the given voice name.
```typescript
import { GenerateContentConfig } from "@google/genai";

export function createVoiceConfig(voiceName = "Kore"): GenerateContentConfig {
  return {
    responseModalities: ["audio"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: {
          voiceName,
        },
      },
    },
  };
}
```

```typescript
// Drop empty entries so that an unset WHITELIST yields an empty list instead of [""].
const splitList = (whitelist?: string) =>
  (whitelist || "")
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0);

export const whitelist = splitList(process.env.WHITELIST);
export const cors = whitelist.length > 0 ? whitelist : true;
export const refererList = splitList(process.env.REFERER);
```

All Cloud Functions enforce App Check, CORS, and a timeout period of 600 seconds. If `WHITELIST` is unspecified, CORS defaults to true. While acceptable in a demo environment, configure CORS to a specific domain or `false` in production to prevent unauthorized access.

The `readFact` cloud function delegates to `readFactStreamFunction` when `isStreaming` is true. Otherwise, it delegates to `readFactFunction`. The `readFactFunction` function returns a `Promise<string>` that resolves to a base64-encoded data URL. The `readFactStreamFunction` function returns a `Promise<number[] | undefined>` that resolves to the bytes of the WAV header.

```typescript
import { onCall } from "firebase-functions/v2/https";
import { cors } from "../auth";
import { buildAudioPrompt } from './audio-prompt';
import { readFactFunction, readFactStreamFunction } from "./read-fact";
import { createVoiceConfig } from './voice-config';

const options = {
  cors,
  enforceAppCheck: true,
  timeoutSeconds: 600,
};

export const readFact = onCall(options, (request, response) => {
  const { data, acceptsStreaming } = request;
  // Stream only when the client asked for it and a response object is available.
  const isStreaming = acceptsStreaming && !!response;
  const prompt = buildAudioPrompt(data);
  const voiceOption = createVoiceConfig(data.voiceOption);

  return isStreaming
    ? readFactStreamFunction(prompt, voiceOption, response)
    : readFactFunction(prompt, voiceOption);
});
```

The `withAIAudio` function is a higher-order function that calls the callback to generate an audio stream.

```typescript
import { GenerateContentConfig, GenerateContentResponse, GoogleGenAI } from "@google/genai";
import { CallableResponse, HttpsError } from "firebase-functions/v2/https";
// AIAudio, RawAudioData, WavConversionOptions, convertToWav, parseMimeType, and createWavHeader
// are project-local and not listed in this post.

async function withAIAudio(callback: (ai: GoogleGenAI, model: string) => Promise<string | number[] | undefined>) {
  try {
    const variables = AUDIO_CONFIG;
    if (!variables) {
      return "";
    }

    const { genAIOptions, model } = variables;
    const ai = new GoogleGenAI(genAIOptions);

    return await callback(ai, model);
  } catch (e) {
    // Preserve errors that already carry an HTTPS status code.
    if (e instanceof HttpsError) {
      throw e;
    }
    throw new HttpsError("internal", "An internal error occurred while setting up the AI client.", {
      originalError: (e as Error).message,
    });
  }
}
```

`generateAudio` is a callback function that uses the Gemini 3.1 Flash TTS Preview model to generate a response. `getBase64DataUrl` invokes `extractInlineAudioData` to extract the raw data and the mime type from the response. The `encodeBase64String` function first converts the raw data to WAV format, then encodes it to base64 format, and finally returns the result as a data URL.

The `createAudioParams` function constructs the request parameters from the Gemini TTS model, the audio prompt, and the speech configuration.

```typescript
async function generateAudio(aiTTS: AIAudio, prompt: string, voiceOption: GenerateContentConfig) {
  try {
    const { ai, model } = aiTTS;
    const response = await ai.models.generateContent(createAudioParams(model, prompt, voiceOption));
    return getBase64DataUrl(response);
  } catch (error) {
    console.error(error);
    throw error;
  }
}

function createAudioParams(model: string, prompt: string, config?: GenerateContentConfig) {
  return {
    model,
    contents: [
      {
        role: "user",
        parts: [
          {
            text: prompt,
          },
        ],
      },
    ],
    config,
  };
}

// The audio is returned as inline data in the first part of the first candidate.
function extractInlineAudioData(response: GenerateContentResponse): {
  rawData: string | undefined;
  mimeType: string | undefined;
} {
  const { data: rawData, mimeType } = response.candidates?.[0]?.content?.parts?.[0]?.inlineData ?? {};
  return { rawData, mimeType };
}

function getBase64DataUrl(response: GenerateContentResponse) {
  const { rawData, mimeType } = extractInlineAudioData(response);

  if (!rawData || !mimeType) {
    throw new Error("Audio generation failed: No audio data received.");
  }

  return encodeBase64String({ rawData, mimeType });
}

export function encodeBase64String({ rawData, mimeType }: RawAudioData) {
  const wavBuffer = convertToWav(rawData, mimeType);
  const base64Data = wavBuffer.toString("base64");
  return `data:audio/wav;base64,${base64Data}`;
}
```

`generateAudioStream` is a callback function that uses the Gemini 3.1 Flash TTS Preview model to stream a list of audio chunks. Each chunk is passed to the `extractInlineAudioData` function to extract the raw data and the mime type. The function converts the chunk's raw data into a buffer and sends it to the client; the byte length accumulates to determine the total size of all chunks. After all the chunks are sent to the client, the `createWavHeader` function uses the total byte length and the audio options to construct a WAV header and returns it.
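The post does not list `parseMimeType` or `createWavHeader`, so the following is only a rough sketch of what such helpers might look like, not the demo's actual code. The `WavOptions` type, the default values (mono, 24000 Hz, 16-bit), and the sample mime type string `audio/L16;codec=pcm;rate=24000` are assumptions made for illustration. The header layout itself is the standard 44-byte RIFF/WAVE header for uncompressed PCM.

```typescript
// Hypothetical shape; the demo's WavConversionOptions may differ.
type WavOptions = { numChannels: number; sampleRate: number; bitsPerSample: number };

// Reads the sample width from "audio/L<bits>" and the sample rate from "rate=<hz>".
function parseMimeType(mimeType: string): WavOptions {
  // Assumed defaults, used when the mime type omits a value.
  const options: WavOptions = { numChannels: 1, sampleRate: 24000, bitsPerSample: 16 };
  const parts = mimeType.split(";").map((part) => part.trim());
  const format = (parts[0] ?? "").split("/")[1] ?? "";
  if (/^L\d+$/.test(format)) {
    options.bitsPerSample = Number(format.slice(1));
  }
  for (const part of parts.slice(1)) {
    const [key, value] = part.split("=");
    const rate = Number(value);
    if (key === "rate" && Number.isFinite(rate) && rate > 0) {
      options.sampleRate = rate;
    }
  }
  return options;
}

// Builds the 44-byte header that precedes raw PCM samples in a WAV file.
function createWavHeader(dataLength: number, { numChannels, sampleRate, bitsPerSample }: WavOptions): Uint8Array {
  const blockAlign = (numChannels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) {
      header[offset + i] = text.charCodeAt(i);
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataLength, true); // number of PCM bytes that follow
  return header;
}

const wavOptions = parseMimeType("audio/L16;codec=pcm;rate=24000");
const wavHeader = createWavHeader(48000, wavOptions);
console.log(wavHeader.length); // 44
```

Because the header stores the total data length, it can only be built after the last chunk has been counted. That is why the streaming function sends the raw chunks first and returns the header at the end, leaving the client to prepend it.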
```typescript
async function generateAudioStream(
  aiTTS: AIAudio,
  prompt: string,
  voiceOption: GenerateContentConfig,
  response: CallableResponse<unknown>,
): Promise<number[] | undefined> {
  try {
    const { ai, model } = aiTTS;
    const chunks = await ai.models.generateContentStream(createAudioParams(model, prompt, voiceOption));

    let byteLength = 0;
    let options: WavConversionOptions | undefined = undefined;

    for await (const chunk of chunks) {
      const { rawData, mimeType } = extractInlineAudioData(chunk);

      // Send the sample rate once, as soon as the first mime type is known.
      if (!options && mimeType) {
        options = parseMimeType(mimeType);
        response.sendChunk({
          type: "metadata",
          payload: {
            sampleRate: options.sampleRate,
          },
        });
      }

      if (rawData && mimeType) {
        const buffer = Buffer.from(rawData, "base64");
        byteLength = byteLength + buffer.length;
        response.sendChunk({
          type: "data",
          payload: {
            buffer,
          },
        });
      }
    }

    // The WAV header needs the total data size, so it is built after the last chunk.
    if (options && byteLength > 0) {
      const header = createWavHeader(byteLength, options);
      return [...header];
    }

    return undefined;
  } catch (error) {
    console.error(error);
    throw error;
  }
}
```

The `readFactFunction` invokes the `withAIAudio` higher-order function to generate a base64-encoded string.

The `readFactStreamFunction` function calls the `withAIAudio` higher-order function to write chunks to the response body and send them to the client. Then, the `generateAudioStream` function returns the bytes of the WAV header.

```typescript
export async function readFactFunction(prompt: string, voiceOption: GenerateContentConfig) {
  return withAIAudio((ai, model) => generateAudio({ ai, model }, prompt, voiceOption));
}

export async function readFactStreamFunction(prompt: string, voiceOption: GenerateContentConfig, response: CallableResponse<unknown>) {
  return withAIAudio((ai, model) => generateAudioStream({ ai, model }, prompt, voiceOption, response));
}
```

### 6. Firebase App Configuration and reCAPTCHA Site Key

I implemented a `FIREBASE_APP_CONFIG` IIFE (Immediately Invoked Function Expression) that runs once to validate the environment variables of the Firebase app.

```typescript
export const FIREBASE_APP_CONFIG = (() => {
  const env = process.env;
  const missingKeys: string[] = [];
  const apiKey = validate(env.APP_API_KEY, "API Key", missingKeys);
  const appId = validate(env.APP_ID, "App Id", missingKeys);
  const messagingSenderId = validate(env.APP_MESSAGING_SENDER_ID, "Messaging Sender ID", missingKeys);
  const recaptchaSiteKey = validate(env.RECAPTCHA_ENTERPRISE_SITE_KEY, "Recaptcha site key", missingKeys);
  const projectId = validate(env.GCLOUD_PROJECT, "Project ID", missingKeys);

  if (missingKeys.length > 0) {
    throw new Error(`Missing environment variables: ${missingKeys.join(", ")}`);
  }

  return {
    app: {
      apiKey,
      appId,
      projectId,
      messagingSenderId,
      authDomain: `${projectId}.firebaseapp.com`,
      storageBucket: `${projectId}.firebasestorage.app`,
    },
    recaptchaSiteKey,
  };
})();
```

The `getFirebaseConfig` function returns the `FIREBASE_APP_CONFIG` to the Angular application with a `Cache-Control` header that lets the response be cached for an hour. The Angular application receives the Firebase app configuration and reCAPTCHA site key from the Cloud Function to initialize Firebase AI Logic and protect resources from unauthorized access and abuse.

```typescript
export const getFirebaseConfig = onRequest({ cors }, (request, response) => {
  if (!validateRequest(request, response)) {
    return;
  }

  try {
    // Allow browsers and shared caches to reuse the response for one hour.
    response.set("Cache-Control", "public, max-age=3600, s-maxage=3600");
    response.json(FIREBASE_APP_CONFIG);
  } catch (err) {
    console.error(err);
    response.status(500).send("Internal Server Error");
  }
});
```

### 7. Local Development with Emulators

For local development, I used the Firebase Local Emulator Suite to save cost and time. In the `bootstrapFirebase` process, the application calls `connectFunctionsEmulator` to link to the Cloud Functions running at `http://localhost:5001`. The port number defaulted to 5001 when `firebase init` was executed.

```typescript
function connectEmulators(functions: Functions, remoteConfig: RemoteConfig) {
  // Only connect to the emulator when the app itself runs on localhost.
  if (location.hostname === 'localhost') {
    const host = getValue(remoteConfig, 'functionEmulatorHost').asString();
    const port = getValue(remoteConfig, 'functionEmulatorPort').asNumber();
    connectFunctionsEmulator(functions, host, port);
  }
}
```

`loadFirebaseConfig` is a helper function that makes a request to the Cloud Function to obtain the Firebase App configuration and the reCAPTCHA site key. The function URL is read from `public/config.json`.

```json
{
  "getFirebaseConfigUrl": "http://127.0.0.1:5001/vertexai-firebase-6a64f/us-central1/getFirebaseConfig"
}
```

```typescript
export type FirebaseConfigResponse = {
  app: FirebaseOptions;
  recaptchaSiteKey: string;
};
```

```typescript
import { HttpClient } from '@angular/common/http';
import { inject } from '@angular/core';
import { catchError, lastValueFrom, throwError } from 'rxjs';
import config from '../../public/config.json';
import { FirebaseConfigResponse } from './ai/types/firebase-config.type';

async function loadFirebaseConfig() {
  const httpService = inject(HttpClient);
  const firebaseConfig$ = httpService.get<FirebaseConfigResponse>(config.getFirebaseConfigUrl)
    .pipe(catchError((e) => throwError(() => e)));
  return lastValueFrom(firebaseConfig$);
}
```

The `bootstrapFirebase` function initializes the FirebaseApp and App Check, loads the Firebase remote configuration and cloud functions, and stores them in the config service for later use.
```typescript
export async function bootstrapFirebase() {
  try {
    const configService = inject(ConfigService);
    const firebaseConfig = await loadFirebaseConfig();
    const { app, recaptchaSiteKey } = firebaseConfig;
    const firebaseApp = initializeApp(app);
    const remoteConfig = await fetchRemoteConfig(firebaseApp);

    initializeAppCheck(firebaseApp, {
      provider: new ReCaptchaEnterpriseProvider(recaptchaSiteKey),
      isTokenAutoRefreshEnabled: true,
    });

    const functionRegion = getValue(remoteConfig, 'functionRegion').asString();
    const functions = getFunctions(firebaseApp, functionRegion);
    connectEmulators(functions, remoteConfig);

    configService.loadConfig(firebaseApp, remoteConfig, functions);
  } catch (err) {
    console.error(err);
  }
}
```

The AppConfig remains unchanged.

```typescript
import { ApplicationConfig, provideAppInitializer } from '@angular/core';
import { bootstrapFirebase } from './app.bootstrap';

export const appConfig: ApplicationConfig = {
  providers: [
    provideAppInitializer(async () => bootstrapFirebase()),
  ]
};
```

---

## 8. Angular Implementation

### 8.1 Audio Tags Component

I create an `AudioTagsComponent` and a new signal form to input the scene, emotion, pace, and voice name in the Angular frontend.

```html
<div>
  <h3>
    <span class="text-xl">🎙️</span> Customize Audio Generation
  </h3>
  <div class="grid grid-cols-1 md:grid-cols-2 gap-4">
    <div class="flex flex-col gap-1.5 md:col-span-2">
      <label for="scene">Scene Description</label>
      <textarea id="scene" [formField]="audioPromptForm.scene"></textarea>
    </div>
    <div class="flex flex-col gap-1.5">
      <label for="emotion">Vocal Emotion</label>
      <input type="text" id="emotion" [formField]="audioPromptForm.emotion" placeholder="e.g., panicked, whispers" />
    </div>
    <div class="flex flex-col gap-1.5">
      <label for="pace">Speaking Pace</label>
      <input type="text" id="pace" [formField]="audioPromptForm.pace" placeholder="e.g., very slow, rapid" />
    </div>
    <div class="flex flex-col gap-1.5 md:col-span-2">
      <label for="voiceOption">AI Voice Model</label>
      <select id="voiceOption" [formField]="audioPromptForm.voiceOption">
        <option value="" disabled selected>Select a voice...</option>
        @for (option of sortedVoiceOptions(); track option.name) {
          <option [value]="option.name" class="bg-slate-800">{{ option.label }}</option>
        }
      </select>
    </div>
  </div>
</div>
```

```typescript
import { ChangeDetectionStrategy, Component, computed, signal } from '@angular/core';
import { form, FormField } from '@angular/forms/signals';
import { VOICE_OPTIONS } from './constants/voice-options.const';
import { AudioPromptData } from './types/audio-prompt-data.type';

@Component({
  selector: 'app-audio-tags',
  imports: [FormField],
  templateUrl: './audio-tags.component.html',
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class AudioTagsComponent {
  #audioPromptModel = signal<AudioPromptData>({
    scene: 'A news anchor reading the news in a busy newsroom',
    emotion: 'professional, slightly serious',
    pace: 'moderate, clear enunciation',
    voiceOption: 'Kore'
  });

  audioPromptForm = form(this.#audioPromptModel);

  sortedVoiceOptions = computed(() => {
    // Copy before sorting so that the shared VOICE_OPTIONS constant is not mutated.
    const sortedList = [...VOICE_OPTIONS].sort((a, b) => a.name.localeCompare(b.name));
    return sortedList.map(option => ({
      name: option.name,
      label: `${option.name} - ${option.description}`
    }));
  });

  audioPromptModel = this.#audioPromptModel.asReadonly();
}
```

The `AudioTagsComponent` is imported into `ObscureFactComponent` so that users can input values into the experimental signal form.

In the HTML template of `ObscureFactComponent`, the `<app-audio-tags>` element has a template variable `audioTags`, and `audioTags.audioPromptModel()` resolves to an instance of `AudioPromptData`. The data is assigned to the `audioTags` property of the argument passed to the `generateSpeech` method.

```html
<div class="w-full mt-6">
  <app-audio-tags #audioTags />
  <h3>A surprising or obscure fact about the tags</h3>
  @if (interestingFact()) {
    <p>{{ interestingFact() }}</p>
    <app-error-display [error]="ttsError()" />
    <app-text-to-speech
      [isLoadingSync]="isLoadingSync()"
      [isLoadingStream]="isLoadingStream()"
      [isLoadingWebAudio]="isLoadingWebAudio()"
      [audioUrl]="audioUrl()"
      (generateSpeech)="generateSpeech({ mode: $event, audioTags: audioTags.audioPromptModel() })"
      [playbackRate]="playbackRate()"
    />
  } @else {
    <p>The tag(s) does not have any interesting or obscure fact.</p>
  }
</div>
```

```typescript
import { AudioPromptData } from './audio-prompt-data.type';
import { GenerateSpeechMode } from '../../generate-audio.util';

export type ModeWithAudioTags = {
  mode: GenerateSpeechMode;
  audioTags: AudioPromptData;
};

export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
};
```

The `generateSpeech` method uses the `fact` and `audioTags` to construct an instance of `AudioPrompt`.

When the `mode` is `stream`, the SpeechService calls `generateAudioBlobURL` to use the `audioPrompt` to construct a blob URL. When the `mode` is `sync`, the SpeechService calls `generateAudio` to use the `audioPrompt` to generate an encoded base64 string. When the `mode` is `web_audio_api`, the AudioPlayerService calls `playStream` to stream the audio.
```typescript
import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { ChangeDetectionStrategy, Component, inject, input, OnDestroy, signal } from '@angular/core';
import { revokeBlobURL } from '../blob.util';
import { AudioTagsComponent } from './audio-tags/audio-tags.component';
import { ModeWithAudioTags } from './audio-tags/types/mode-audio-tags.type';
import { generateSpeechHelper, streamSpeechWithWebAudio, ttsError } from './generate-audio.util';
import { AudioPlayerService } from './services/audio-player.service';
// The import statements for TextToSpeechComponent and the error display component are omitted here.

@Component({
  selector: 'app-obscure-fact',
  templateUrl: './obscure-fact.component.html',
  imports: [
    AudioTagsComponent,
    TextToSpeechComponent,
  ],
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class ObscureFactComponent implements OnDestroy {
  interestingFact = input<string | undefined>(undefined);

  speechService = inject(SpeechService);
  audioPlayerService = inject(AudioPlayerService);

  isLoadingSync = signal(false);
  isLoadingStream = signal(false);
  isLoadingWebAudio = signal(false);
  audioUrl = signal<string | undefined>(undefined);

  ttsError = ttsError;

  async generateSpeech({ mode, audioTags }: ModeWithAudioTags) {
    const fact = this.interestingFact();
    if (fact) {
      // Release the previous blob URL before generating a new one.
      revokeBlobURL(this.audioUrl);
      this.audioUrl.set(undefined);

      const audioPrompt = {
        ...audioTags,
        transcript: fact,
      };

      if (mode === 'sync' || mode === 'stream') {
        const loadingSignal = mode === 'stream' ? this.isLoadingStream : this.isLoadingSync;
        const speechFn = (audioPrompt: AudioPrompt) => mode === 'stream'
          ? this.speechService.generateAudioBlobURL(audioPrompt)
          : this.speechService.generateAudio(audioPrompt);
        await generateSpeechHelper(audioPrompt, loadingSignal, this.audioUrl, speechFn);
      } else if (mode === 'web_audio_api') {
        await streamSpeechWithWebAudio(
          audioPrompt,
          this.isLoadingWebAudio,
          (audioPrompt: AudioPrompt) => this.audioPlayerService.playStream(audioPrompt));
      }
    }
  }

  ngOnDestroy(): void {
    revokeBlobURL(this.audioUrl);
  }
}
```

### 8.2 Call Firebase Cloud Functions directly

The `SpeechService` has a `generateAudio` method that calls the `readFact` cloud function to obtain the encoded base64 string. Similarly, the service has a `generateAudioBlobURL` method that collects the streamed chunks into a buffer and prepends the WAV header to it. The `constructBlobURL` function creates a blob URL from the `BlobPart` array.

```typescript
export function constructBlobURL(parts: BlobPart[]) {
  return URL.createObjectURL(new Blob(parts, { type: 'audio/wav' }));
}
```

```typescript
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { constructBlobURL } from '@/photo-panel/blob.util';
import { inject, Injectable } from '@angular/core';
import { Functions, httpsCallable } from 'firebase/functions';
import { StreamMessage } from '../types/stream-message.type';
import { ConfigService } from './config.service';

@Injectable({
  providedIn: 'root'
})
export class SpeechService {
  private configService = inject(ConfigService);

  private get functions(): Functions {
    if (!this.configService.functions) {
      throw new Error('Firebase Functions has not been initialized.');
    }
    return this.configService.functions;
  }

  async generateAudio(audioPrompt: AudioPrompt) {
    const readFactFunction = httpsCallable<AudioPrompt, string>(
      this.functions, 'textToAudio-readFact'
    );

    const { data: audioUri } = await readFactFunction(audioPrompt);
    return audioUri;
  }

  async generateAudioStream(audioPrompt: AudioPrompt) {
    const readFactStreamFunction = httpsCallable<AudioPrompt, number[] | undefined, StreamMessage>(
      this.functions, 'textToAudio-readFact'
    );

    return readFactStreamFunction.stream(audioPrompt);
  }

  async generateAudioBlobURL(audioPrompt: AudioPrompt) {
    const { stream, data } = await this.generateAudioStream(audioPrompt);

    const audioParts: BlobPart[] = [];
    for await (const audioChunk of stream) {
      if (audioChunk && audioChunk.type === 'data') {
        audioParts.push(new Uint8Array(audioChunk.payload.buffer.data));
      }
    }

    // The WAV header arrives last, as the final result of the callable, and goes first in the blob.
    const wavHeader = await data;
    if (wavHeader && wavHeader.length) {
      audioParts.unshift(new Uint8Array(wavHeader));
    }

    return constructBlobURL(audioParts);
  }
}
```

Similar to `SpeechService.generateAudioBlobURL`, the `playStream` method of `AudioPlayerService` also calls `generateAudioStream` to get a stream of chunks and play each of them immediately.

```typescript
import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { inject, Injectable, OnDestroy, signal } from '@angular/core';

@Injectable({
  providedIn: 'root'
})
export class AudioPlayerService implements OnDestroy {
  private speechService = inject(SpeechService);

  async playStream(audioPrompt: AudioPrompt) {
    const { stream } = await this.speechService.generateAudioStream(audioPrompt);
    for await (const audioChunk of stream) {
      // ... process each chunk ...
    }
  }

  ngOnDestroy(): void {
    // ... release resources to prevent memory leak ...
  }
}
```

---

This is the end of the walkthrough for the demo. You should now be able to input different combinations of scene, emotion, and pace to create a unique personality that speaks the given text in an audio clip.

---

## 9. Caveats and Lessons Learned: Avoiding the Dynamic Prompt Trap

The examples in Google AI Studio and Vertex AI Studio use static audio tags and transcripts, and they work correctly for me. When I applied dynamic audio tags and transcripts in the demo, the Gemini 3.1 Flash TTS Preview model ignored the audio tags. I resolved the issue after hours of debugging with Gemini CLI. Here are the caveats and lessons learned:

1. The Token Boundary Trap.
The code originally concatenated the tags and the transcript without a space (for example, "[giggle][slow]Before"). The LLM tokenizer failed to recognize the tags as instructions to change the behavior and pace of the audio. My fix was to insert a space between each tag and the transcript, producing "[giggle] [slow] Before".
2. Sanitize inputs before injecting them into the prompt template. The sanitize functions remove Markdown headers (#) and triple quotes from the scene and transcript. The cleansed scene and transcript are then injected into the prompt template to construct the final audio prompt.
3. The model does not understand idioms. I typed "at a snail's pace" in the signal form and inserted "[at a snail's pace]" before the line. However, the model vocalized the tag literally, and no pace change occurred.
4. **"Repetitive Weighting" is a Real Strategy.** If standard tags like [slow] and [fast] are not dramatic enough, prepend "very" to the pace to intensify the effect. For example, [very, very, very slow] generated a longer audio clip than [slow].
5. Replace newline characters (\n) with an escaped \\n to flatten the lines into a single paragraph. Once the scene and transcript are cleansed and escaped, they are injected into the prompt template, and the structure is preserved for the LLM parser.

---

## Conclusion

Integrating text-to-speech with Firebase's serverless scalability enables real-time audio generation in Angular applications. The Angular application neither requires the `genai` dependency nor stores the Vertex AI environment variables in a `.env` file. Instead, the client application calls the Cloud Functions to perform the text-to-speech task. The Cloud Functions receive arguments from the client and execute a TTS operation that either returns the entire audio as a base64-encoded string or streams the audio bytes in chunks.
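
To make the prompt-building caveats from section 9 concrete (spacing between tags, sanitizing, repetitive weighting, and newline escaping), here is a minimal sketch. It is an illustration under assumptions, not the demo's actual code: `sanitize`, `weightedPace`, and `buildTaggedTranscript` are hypothetical names, and the real prompt template may be structured differently.

```typescript
// Hypothetical helpers illustrating the caveats; not taken from the demo repository.

/** Caveats 2 and 5: strip Markdown header markers and triple quotes, then escape newlines. */
function sanitize(text: string): string {
  return text
    .replace(/^[ \t]*#+[ \t]*/gm, '') // drop leading "#" header markers on every line
    .replace(/"""/g, '')              // drop triple quotes
    .trim()
    .replace(/\r?\n/g, '\\n');        // flatten into one line with a literal "\n" sequence
}

/** Caveat 4: prefix the pace with "very" `weight` times, e.g. "very, very, very slow". */
function weightedPace(pace: string, weight = 0): string {
  if (weight <= 0) {
    return pace;
  }
  const prefix = Array.from({ length: weight }, () => 'very').join(', ');
  return `${prefix} ${pace}`;
}

/** Caveat 1: wrap each tag in brackets and separate every part with a single space. */
function buildTaggedTranscript(tags: string[], transcript: string): string {
  const bracketed = tags
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0)
    .map((tag) => `[${tag}]`);
  return [...bracketed, sanitize(transcript)].join(' ');
}

// Usage: the tags and the transcript are always separated by spaces.
console.log(buildTaggedTranscript(['giggle', weightedPace('slow')], 'Before'));
// prints "[giggle] [slow] Before"
console.log(buildTaggedTranscript([weightedPace('slow', 3)], 'Before'));
// prints "[very, very, very slow] Before"
```

Keeping the three steps in separate functions makes each caveat easy to test in isolation before the result is injected into the prompt template.
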
During local development, the application calls the functions running in the Firebase Emulator at `http://localhost:5001` instead of the ones deployed on the Cloud Run platform, which saves cost. Try cloning the GitHub repository, uploading an image to generate an obscure fact, and using the Gemini 3.1 Flash TTS Preview model to speak it with the specified scene, emotion, and pace.

## Resources

- [Demo GitHub Repo](https://github.com/railsstudent/firebase-ai-hybrid-demo)
- [Firebase Cloud Functions](https://firebase.google.com/docs/functions?utm_campaign=deveco_gdemembers&utm_source=deveco)
- [Connect to the Cloud Functions Emulator](https://firebase.google.com/docs/emulator-suite/connect_functions?utm_campaign=deveco_gdemembers&utm_source=deveco)
- [Audio Tags](https://ai.google.dev/gemini-api/docs/speech-generation#audio-tags#generate-from-images&utm_campaign=deveco_gdemembers&utm_source=deveco)
- [Advanced Audio Prompting](https://ai.google.dev/gemini-api/docs/speech-generation#advanced-prompting?utm_campaign=deveco_gdemembers&utm_source=deveco)
- [Prompting Strategies](https://ai.google.dev/gemini-api/docs/speech-generation#prompting-strategies?utm_campaign=deveco_gdemembers&utm_source=deveco)
- [Previous Post about Gemini 2.5 Flash TTS, Angular and Firebase](https://dev.to/railsstudent/streaming-ai-speech-with-gemini-25-flash-tts-angular-v21-and-firebase-1odm?utm_campaign=deveco_gdemembers&utm_source=deveco)
