Vākya AI — A real-time AI voice agent that listens, understands and talks back . Powered by FastAPI, WebSockets, AssemblyAI, Gemini, and Murf AI.
Markdown # Vākya AI 🗣️ Vākya, which means "sentence" in Sanskrit, is a conversational AI that allows users to interact with a Large Language Model (LLM) using their voice. Speak. Understand. Reply. That’s **Vākya** , a full loop of human-like conversation, built with **Python**, driven by **FastAPI**, and delivered in a crisp, modern UI. *** ## Wanna know How I look?? [](https://raw.githubusercontent.com/Yasaswini38/Vakya-AI/main/demo_vakyaai.mp4) *** ### Features * **Voice-to-Voice Interaction**: Talk to the AI and hear it respond back in real time. * **Real-time Transcription**: See your speech transcribed instantly on screen. * **Conversational Memory**: Maintains context within a session for natural dialogue. * **Session Management**: Start new chats or revisit past conversations. * **Voice Customization**: Choose from multiple voices for responses. * **Fun Skills Built-in**: Weather updates, News headlines, Jokes on demand. * **Modern UI**: Cute & aesthetic responsive interface with chat history panel. *** ### Technologies Used * **FastAPI** — backend framework for REST + WebSocket support * **Jinja2** — template rendering for frontend * **AssemblyAI** — speech-to-text (transcription) * **Google Gemini** — LLM for text generation * **Murf.ai** — text-to-speech (streaming natural voices) * **python-dotenv** — for managing API keys securely * **HTML, CSS, JavaScript** — responsive UI (with chat history, persona selection) *** ### Architecture The application follows a simple client-server architecture: 1. User’s voice recorded via browser microphone 2. Audio streamed to FastAPI backend over WebSocket 3. AssemblyAI transcribes speech → text 4. Transcribed text + history → Gemini (LLM) for response generation 5. Response text → Murf.ai → natural voice audio stream 6. Frontend shows transcription, AI’s reply, and plays back
Google's AI-powered research notebook that ingests your documents and becomes an expert on your content. Generates audio overviews, study guides, FAQs, and interactive discussions from uploaded sources.
Google DeepMind's experimental AI agent that can navigate websites, fill forms, and complete multi-step browser tasks autonomously. Uses Gemini's multimodal understanding to interact with web interfaces.
Google DeepMind's universal AI assistant prototype that can see, hear, and respond in real-time through your device camera and microphone. Demonstrates the future of multimodal AI interaction.
Google Cloud's enterprise platform for building, deploying, and managing AI agents powered by Gemini. Supports multi-agent orchestration, tool integration, and enterprise governance.
Gemini's agentic research capability that autonomously browses the web, synthesizes information from dozens of sources, and produces comprehensive research reports on any topic.
Interactive coding and content creation agent that generates, previews, and iterates on code, documents, and interactive applications in a side panel. Supports HTML/CSS/JS, Python, and more.