# LocalLLM Technical Architecture Documentation


Soham-Kamathi

May 2, 2026


## 1. Executive Summary

LocalLLM is a high-performance Android application designed to run Large Language Models (LLMs) locally on-device. It uses the `llama.cpp` library for efficient inference, so users can interact with state-of-the-art AI models without an internet connection while their data stays on the device. The application supports text generation, vision capabilities (multimodal), and Retrieval Augmented Generation (RAG) for document analysis.

## 2. High-Level Architecture

The application follows a **Clean Architecture** pattern combined with **MVVM (Model-View-ViewModel)**. This ensures separation of concerns, testability, and maintainability.

### Architectural Layers

1. **Presentation Layer (UI)**: Jetpack Compose + Material 3.
2. **Domain Layer**: Pure Kotlin Use Cases defining business logic.
3. **Data Layer**: Repositories, Data Sources (Local DB, Network, File System).
4. **Inference Layer**: JNI Bridge to native C++ code.
5. **Native Layer**: `llama.cpp` library and custom C++ wrappers.

### System Architecture Diagram

```mermaid
graph TD
    subgraph "Presentation Layer"
        UI["Compose UI Screens"]
        VM["ViewModels"]
        UI --> VM
    end
    subgraph "Domain Layer"
        UC["Use Cases"]
        VM --> UC
    end
    subgraph "Data Layer"
        Repo["Repositories"]
        LocalDS["Local Data Source"]
        RemoteDS["Remote Data Source"]
        UC --> Repo
        Repo --> LocalDS
        Repo --> RemoteDS
    end
    subgraph "Inference Layer"
        LlamaWrapper["LlamaAndroid (Kotlin)"]
        JNI["JNI Bridge (C++)"]
        Repo --> LlamaWrapper
        LlamaWrapper --> JNI
    end
    subgraph "Native Layer"
        LlamaCPP["llama.cpp Library"]
        Vulkan["Vulkan GPU Backend"]
        JNI --> LlamaCPP
        LlamaCPP --> Vulkan
    end
    subgraph "RAG Subsystem"
        DocParser["Document Parser"]
        EmbedGen["Embedding Generator (ONNX)"]
        VectorStore["Vector Store (Room)"]
        Repo --> DocParser
        Repo --> EmbedGen
        Repo --> VectorStore
    end
```

## 3. Technology Stack

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Language** | Kotlin | 1.9.20 | Primary development language |
| **UI Toolkit** | Jetpack Compose | BOM 2023.10.01 | Declarative UI framework |
| **DI** | Hilt | 2.48.1 | Dependency Injection |
| **Async** | Coroutines & Flow | 1.7.3 | Asynchronous programming |
| **Database** | Room | 2.6.1 | Local SQLite abstraction |
| **Network** | Retrofit + OkHttp | 2.9.0 | API Client (Model Catalog) |
| **Native Interface** | JNI (Java Native Interface) | - | Bridge between Kotlin and C++ |
| **Inference Engine** | llama.cpp | Custom | LLM Inference |
| **Vector Embeddings** | ONNX Runtime | 1.17.0 | Running BGE-Small model |
| **PDF Parsing** | PDFBox-Android | 2.0.27.0 | Extracting text from PDFs |
| **Build System** | Gradle + CMake | 8.9 / 3.22.1 | Build automation |

## 4. Core Components Detail

### 4.1 Presentation Layer

Located in `com.localllm.app.ui`.

- **Screens**: Composable functions representing different views (Chat, Home, ModelLibrary).
- **ViewModels**: `HiltViewModel` classes that manage UI state and expose flows.
    - `ChatViewModel`: Handles message history, inference state, and streaming responses.
    - `ModelLibraryViewModel`: Manages model downloads and catalog state.
    - `RAGChatViewModel`: Specialized for document interaction.

### 4.2 Domain Layer

Located in `com.localllm.app.domain`. Contains **Use Cases** that encapsulate specific business rules.

- `SendMessageUseCase`: Orchestrates sending a message, saving to DB, and triggering inference.
- `GetCompatibleModelsUseCase`: Filters models based on device RAM.
- `GenerateResponseUseCase`: Connects to the inference engine to generate text.

### 4.3 Data Layer

Located in `com.localllm.app.data`.

- **Repositories**:
    - `ModelRepository`: Single source of truth for model data (downloaded vs available).
    - `ConversationRepository`: Manages chat history and sessions.
- **Local Data Source**:
    - `LocalLLMDatabase`: Room database with tables for `models`, `conversations`, `messages`, and `document_chunks`.
    - `PreferencesDataStore`: Stores user settings (theme, default parameters).
- **Remote Data Source**:
    - `HuggingFaceApi`: Fetches model files.
    - `ModelCatalogApi`: Fetches curated model lists.

### 4.4 Inference Layer

Located in `com.localllm.app.inference` and `cpp/`.

#### Kotlin Wrapper (`LlamaAndroid.kt`)

A Singleton class that manages the lifecycle of the native model.

- **Loading**: `loadModel()` calls native code to load GGUF files.
- **Generation**: `generateTokens()` initiates the inference loop.
- **Callbacks**: Uses the `TokenCallback` interface to stream tokens back to Kotlin.

#### JNI Bridge (`llama_jni.cpp`)

The C++ translation layer.

- Maps Java types to C++ types.
- Converts `llama_model` and `llama_context` pointers to and from the `Long` handles held on the Kotlin side.
- Catches C++ exceptions to prevent app crashes.

#### Native Implementation (`llama_android.cpp`)

- Manages the `llama_context` struct.
- Implements the token generation loop.
- Handles sampling (Temperature, Top-K, Top-P).
- Manages the KV Cache.

### 4.5 RAG (Retrieval Augmented Generation) Subsystem

Located in `com.localllm.app.rag`. Enables "Chat with Document" functionality.

1. **Ingestion**:
    - `DocumentParser`: Extracts text from PDF/TXT/MD files.
    - **Chunking**: Splits text into 800-character chunks with 200-character overlap.
2. **Embedding**:
    - `EmbeddingGenerator`: Uses **ONNX Runtime** to run the `bge-small-en-v1.5` model.
    - Converts text chunks into 384-dimensional float vectors.
    - Optimized for mobile (IntraOpNumThreads=4).
3. **Storage**:
    - `DocumentChunkDao`: Stores text chunks and their vector embeddings in Room.
4. **Retrieval**:
    - Calculates Cosine Similarity between the query embedding and stored chunk embeddings.
    - Retrieves the top-K (default 3) most relevant chunks.
5. **Generation**:
    - Injects retrieved chunks into the system prompt as context.

## 5. Native Implementation Details

### 5.1 Build Configuration (`CMakeLists.txt`)

- **Standard**: C++17 / C11.
- **Optimization**: `-O3 -DNDEBUG`.
- **Android Specifics**:
    - `max-page-size=16384`: Ensures compatibility with Android 15+ (16KB page size).
    - ABI Filters: `arm64-v8a`, `x86_64`.

### 5.2 Memory Management

- **Manual Management**: The Kotlin layer holds `Long` pointers to C++ objects (`modelPtr`, `contextPtr`).
- **Lifecycle**: `freeModel()` must be called explicitly to prevent memory leaks.
- **mmap**: Uses memory mapping (`use_mmap=true`) to load models, allowing the OS to manage memory paging efficiently, which is crucial for large models on mobile.

### 5.3 GPU Acceleration (Vulkan)

- **Backend**: Uses `ggml-vulkan` for GPU acceleration.
- **Shaders**: Requires pre-compiled SPIR-V shaders (`vulkan-shaders-hpp.hpp`) or runtime compilation.
- **Configuration**: Enabled via `-DLOCALLLM_ENABLE_VULKAN=ON` in Gradle.
- **Performance**: Provides 2-10x speedup over CPU inference.

## 6. Data Flow: Message Lifecycle

1. **User Input**: User types a message in `ChatScreen`.
2. **ViewModel**: `ChatViewModel` receives the event.
3. **Persistence**: `SendMessageUseCase` saves the user message to `Room`.
4. **Context Building**:
    - If RAG is enabled, `VectorStore` retrieves relevant context.
    - Previous messages are fetched to build conversation history.
    - System prompt is prepended.
5. **Inference Trigger**: `LlamaAndroid.generateTokens()` is called with the formatted prompt.
6. **Native Execution**:
    - JNI passes the prompt string to C++.
    - `llama_tokenize` converts text to tokens.
    - `llama_decode` processes the prompt.
    - Loop: `llama_sample_token` -> `llama_decode` -> Callback to Java.
7. **Streaming**: `TokenCallback.onToken()` updates `ChatViewModel` via a `Flow`.
8. **UI Update**: Compose UI recomposes to show the new token.
9. **Completion**: Full response is saved to `Room`.

## 7. Security & Privacy

- **Local Execution**: All inference happens on-device via `llama.cpp`. No user data leaves the device.
- **Storage**: Chat history is stored in a private app-scoped SQLite database.
- **Permissions**:
    - `INTERNET`: Used only for fetching the model catalog and downloading models.
    - `READ_EXTERNAL_STORAGE`: For accessing user documents (PDFs).

## 8. Performance Optimizations

- **Quantization**: Supports GGUF quantized models (Q4_K_M, Q8_0) to reduce memory footprint.
- **Threading**: Automatically detects CPU cores (`std::thread::hardware_concurrency`) to optimize thread count.
- **KV Cache**: Reuses previous context computation to speed up multi-turn conversations.
- **ONNX Runtime**: Uses NNAPI where available for embedding generation.
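
The KV Cache item above describes reusing earlier context computation across turns. One common way to do this is prefix matching: compare the tokens already held in the cache with the tokens of the new prompt, keep the shared prefix, and decode only what follows it. The source does not say how LocalLLM implements this, so the following is a minimal sketch under that assumption. `Token`, `ReusePlan`, `common_prefix_length`, and `plan_kv_reuse` are hypothetical names for illustration and are not part of LocalLLM or the `llama.cpp` API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration only; they are not taken
// from the LocalLLM source or from the llama.cpp API.
using Token = int;

struct ReusePlan {
    std::size_t reused;            // leading cached positions that can be kept
    std::vector<Token> to_decode;  // prompt tokens that still need decoding
};

// Length of the longest prefix shared by the cached tokens and the new prompt.
std::size_t common_prefix_length(const std::vector<Token>& cached,
                                 const std::vector<Token>& prompt) {
    const std::size_t limit = std::min(cached.size(), prompt.size());
    std::size_t n = 0;
    while (n < limit && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}

// Decide how much of the cache to keep and which tokens to decode.
ReusePlan plan_kv_reuse(const std::vector<Token>& cached,
                        const std::vector<Token>& prompt) {
    std::size_t reused = common_prefix_length(cached, prompt);
    assert(reused <= prompt.size());

    // If the whole prompt is already cached, still decode its last token so
    // that fresh logits are available for the next sampling step.
    if (reused > 0 && reused == prompt.size()) {
        --reused;
    }

    const auto first = prompt.begin() + static_cast<std::ptrdiff_t>(reused);
    return ReusePlan{reused, std::vector<Token>(first, prompt.end())};
}
```

With a cached sequence `[1, 2, 3, 4, 5]` and a new prompt `[1, 2, 3, 9, 10, 11]`, the plan keeps the first three cached positions and decodes only the last three tokens. A caller would also need to discard cache entries past the reused prefix before decoding the remainder.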
