AI-powered WhatsApp automation built with n8n. Supports text, audio, and image inputs with Google Gemini for intelligent responses.
# WhatsApp AI Agent with Multi-Modal Support

A sophisticated n8n workflow that creates an intelligent WhatsApp bot capable of processing text messages, voice notes, and images using Google Gemini AI.

## 📸 Demo Screenshots

For live demo screenshots showcasing the AI agent in action, including WhatsApp conversations, voice transcription, image analysis, and workflow overview, please check the docs/demo-screenshots/ folder in this repository.

### Architecture Diagram

*High-level architecture showing message flow and processing*

## 🌟 Features

- **Multi-Modal Processing**: Handles text, audio, and image inputs seamlessly
- **Voice Transcription**: Converts WhatsApp voice messages to text using Google Gemini
- **Image Analysis**: Analyzes and describes images sent via WhatsApp
- **Conversation Memory**: Maintains context across conversations using session-based memory
- **Smart Routing**: Automatically detects message type and routes to appropriate processing pipeline
- **Real-time Responses**: Instant AI-powered replies through WhatsApp Business API

## 🏗️ Architecture

The workflow uses a smart routing system that:

1. **Receives** WhatsA
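
The message-type routing described under Features can be illustrated outside n8n with a minimal Python sketch. It assumes the incoming webhook body follows the WhatsApp Cloud API shape, where the first message's kind is found at `entry[0].changes[0].value.messages[0].type`. The pipeline names and the `route_message` helper are hypothetical and are not taken from the workflow itself, which presumably does this with n8n nodes rather than code.

```python
from typing import Any, Dict, Optional

# Hypothetical pipeline names, used only for illustration; the real
# workflow's node names are not shown in the source.
PIPELINES = {
    "text": "text_pipeline",
    "audio": "audio_transcription_pipeline",
    "image": "image_analysis_pipeline",
}


def detect_message_type(payload: Dict[str, Any]) -> Optional[str]:
    """Return the type of the first message in the payload, or None.

    Assumes the WhatsApp Cloud API webhook layout:
    entry[0].changes[0].value.messages[0].type
    """
    try:
        return payload["entry"][0]["changes"][0]["value"]["messages"][0]["type"]
    except (KeyError, IndexError, TypeError):
        # Payload carries no message (or has an unexpected shape).
        return None


def route_message(payload: Dict[str, Any]) -> str:
    """Map a webhook payload to a pipeline name, or "unsupported"."""
    message_type = detect_message_type(payload)
    return PIPELINES.get(message_type, "unsupported")


if __name__ == "__main__":
    voice_note = {
        "entry": [{"changes": [{"value": {"messages": [{"type": "audio"}]}}]}]
    }
    print(route_message(voice_note))
```

Payloads without a `messages` array, such as delivery-status notifications, fall through to `"unsupported"` in this sketch instead of raising an error.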
Google's AI-powered research notebook that ingests your documents and becomes an expert on your content. Generates audio overviews, study guides, FAQs, and interactive discussions from uploaded sources.
Google DeepMind's experimental AI agent that can navigate websites, fill forms, and complete multi-step browser tasks autonomously. Uses Gemini's multimodal understanding to interact with web interfaces.
Google DeepMind's universal AI assistant prototype that can see, hear, and respond in real time through your device camera and microphone. Demonstrates the future of multimodal AI interaction.
Google Cloud's enterprise platform for building, deploying, and managing AI agents powered by Gemini. Supports multi-agent orchestration, tool integration, and enterprise governance.
Gemini's agentic research capability that autonomously browses the web, synthesizes information from dozens of sources, and produces comprehensive research reports on any topic.
Interactive coding and content creation agent that generates, previews, and iterates on code, documents, and interactive applications in a side panel. Supports HTML/CSS/JS, Python, and more.