# LocalLLM Technical Architecture Documentation
## 1. Executive Summary
LocalLLM is a high-performance Android application designed to run Large Language Models (LLMs) locally on-device. It leverages the `llama.cpp` library for efficient inference, enabling users to interact with state-of-the-art AI models without an internet connection, ensuring complete data privacy. The application supports text generation, vision capabilities (multimodal), and Retrieval Augmented Generation (RAG) for document analysis.
## 2. High-Level Architecture
The application follows a **Clean Architecture** pattern combined with **MVVM (Model-View-ViewModel)**. This ensures separation of concerns, testability, and maintainability.
### Architectural Layers
1. **Presentation Layer (UI)**: Jetpack Compose + Material 3.
2. **Domain Layer**: Pure Kotlin Use Cases defining business logic.
3. **Data Layer**: Repositories, Data Sources (Local DB, Network, File System).
4. **Inference Layer**: JNI Bridge to native C++ code.
5. **Native Layer**: `llama.cpp` library and custom C++ wrappers.
### System Architecture Diagram
```mermaid
graph TD
subgraph "Presentation Layer"
UI[Compose UI Screens]
VM[ViewModels]
UI --> VM
end
subgraph "Domain Layer"
UC[Use Cases]
VM --> UC
end
subgraph "Data Layer"
Repo[Repositories]
LocalDS[Local Data Source]
RemoteDS[Remote Data Source]
UC --> Repo
Repo --> LocalDS
Repo --> RemoteDS
end
subgraph "Inference Layer"
LlamaWrapper["LlamaAndroid (Kotlin)"]
JNI["JNI Bridge (C++)"]
Repo --> LlamaWrapper
LlamaWrapper --> JNI
end
subgraph "Native Layer"
LlamaCPP[llama.cpp Library]
Vulkan[Vulkan GPU Backend]
JNI --> LlamaCPP
LlamaCPP --> Vulkan
end
subgraph "RAG Subsystem"
DocParser[Document Parser]
EmbedGen["Embedding Generator (ONNX)"]
VectorStore["Vector Store (Room)"]
Repo --> DocParser
Repo --> EmbedGen
Repo --> VectorStore
end
```
## 3. Technology Stack
| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Language** | Kotlin | 1.9.20 | Primary development language |
| **UI Toolkit** | Jetpack Compose | BOM 2023.10.01 | Declarative UI framework |
| **DI** | Hilt | 2.48.1 | Dependency Injection |
| **Async** | Coroutines & Flow | 1.7.3 | Asynchronous programming |
| **Database** | Room | 2.6.1 | Local SQLite abstraction |
| **Network** | Retrofit + OkHttp | 2.9.0 | API Client (Model Catalog) |
| **Native Interface** | JNI (Java Native Interface) | - | Bridge between Kotlin and C++ |
| **Inference Engine** | llama.cpp | Custom | LLM Inference |
| **Vector Embeddings** | ONNX Runtime | 1.17.0 | Running BGE-Small model |
| **PDF Parsing** | PDFBox-Android | 2.0.27.0 | Extracting text from PDFs |
| **Build System** | Gradle + CMake | 8.9 / 3.22.1 | Build automation |
## 4. Core Components Detail
### 4.1 Presentation Layer
Located in `com.localllm.app.ui`.
- **Screens**: Composable functions representing different views (Chat, Home, ModelLibrary).
- **ViewModels**: Classes annotated with `@HiltViewModel` that manage UI state and expose flows.
  - `ChatViewModel`: Handles message history, inference state, and streaming responses.
  - `ModelLibraryViewModel`: Manages model downloads and catalog state.
  - `RAGChatViewModel`: Specialized for document interaction.
### 4.2 Domain Layer
Located in `com.localllm.app.domain`.
Contains **Use Cases** that encapsulate specific business rules.
- `SendMessageUseCase`: Orchestrates sending a message, saving to DB, and triggering inference.
- `GetCompatibleModelsUseCase`: Filters models based on device RAM.
- `GenerateResponseUseCase`: Connects to the inference engine to generate text.
### 4.3 Data Layer
Located in `com.localllm.app.data`.
- **Repositories**:
  - `ModelRepository`: Single source of truth for model data (downloaded vs available).
  - `ConversationRepository`: Manages chat history and sessions.
- **Local Data Source**:
  - `LocalLLMDatabase`: Room database with tables for `models`, `conversations`, `messages`, and `document_chunks`.
  - `PreferencesDataStore`: Stores user settings (theme, default parameters).
- **Remote Data Source**:
  - `HuggingFaceApi`: Fetches model files.
  - `ModelCatalogApi`: Fetches curated model lists.
### 4.4 Inference Layer
Located in `com.localllm.app.inference` and `cpp/`.
#### Kotlin Wrapper (`LlamaAndroid.kt`)
A Singleton class that manages the lifecycle of the native model.
- **Loading**: `loadModel()` calls native code to load GGUF files.
- **Generation**: `generateTokens()` initiates the inference loop.
- **Callbacks**: Uses `TokenCallback` interface to stream tokens back to Kotlin.
#### JNI Bridge (`llama_jni.cpp`)
The C++ translation layer.
- Maps Java types to C++ types.
- Casts `llama_model` and `llama_context` pointers to and from `jlong` handles so they can be passed between Java and C++.
- Catches C++ exceptions to prevent app crashes.
#### Native Implementation (`llama_android.cpp`)
- Manages the `llama_context` struct.
- Implements the token generation loop.
- Handles sampling (Temperature, Top-K, Top-P).
- Manages the KV Cache.
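
To make the sampling step concrete, here is a small self-contained sketch of temperature, Top-K, and Top-P filtering over raw logits. It is independent of `llama.cpp`, which ships its own samplers. The names `Candidate`, `filter_candidates`, and `sample_token` are hypothetical, and the order in which the filters are applied is an assumption rather than something the project confirms.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Candidate {
    int token;
    float prob;
};

// Applies temperature, then top-k, then top-p. Returns the surviving
// candidates sorted by descending probability and renormalized to sum to 1.
// Assumes temperature > 0; top_k == 0 disables the top-k filter.
std::vector<Candidate> filter_candidates(const std::vector<float>& logits,
                                         float temperature,
                                         std::size_t top_k,
                                         float top_p) {
    std::vector<Candidate> cands;
    if (logits.empty()) return cands;

    // Temperature-scaled softmax; subtracting the max keeps exp() stable.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const float p = std::exp((logits[i] - max_logit) / temperature);
        cands.push_back({static_cast<int>(i), p});
        sum += p;
    }
    for (Candidate& c : cands) c.prob /= sum;

    std::stable_sort(cands.begin(), cands.end(),
                     [](const Candidate& a, const Candidate& b) { return a.prob > b.prob; });

    // Top-K: keep only the k most probable tokens.
    if (top_k > 0 && top_k < cands.size()) cands.resize(top_k);

    // Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    float cumulative = 0.0f;
    std::size_t keep = cands.size();
    for (std::size_t i = 0; i < cands.size(); ++i) {
        cumulative += cands[i].prob;
        if (cumulative >= top_p) {
            keep = i + 1;
            break;
        }
    }
    cands.resize(keep);

    // Renormalize the survivors.
    float kept = 0.0f;
    for (const Candidate& c : cands) kept += c.prob;
    for (Candidate& c : cands) c.prob /= kept;
    return cands;
}

// Draw one token id according to the candidate probabilities.
// Expects a non-empty candidate list.
int sample_token(const std::vector<Candidate>& cands, std::mt19937& rng) {
    std::vector<float> weights;
    for (const Candidate& c : cands) weights.push_back(c.prob);
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return cands[dist(rng)].token;
}
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the Top-P cut and the output becomes more deterministic.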
### 4.5 RAG (Retrieval Augmented Generation) Subsystem
Located in `com.localllm.app.rag`.
Enables "Chat with Document" functionality.
1. **Ingestion**:
   - `DocumentParser`: Extracts text from PDF/TXT/MD files.
   - **Chunking**: Splits text into 800-character chunks with 200-character overlap.
2. **Embedding**:
   - `EmbeddingGenerator`: Uses **ONNX Runtime** to run the `bge-small-en-v1.5` model.
   - Converts text chunks into 384-dimensional float vectors.
   - Optimized for mobile (IntraOpNumThreads=4).
3. **Storage**:
   - `DocumentChunkDao`: Stores text chunks and their vector embeddings in Room.
4. **Retrieval**:
   - Calculates Cosine Similarity between query embedding and stored chunk embeddings.
   - Retrieves top-K (default 3) most relevant chunks.
5. **Generation**:
   - Injects retrieved chunks into the system prompt as context.
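
The chunking and retrieval steps above can be sketched in a few functions. The app's RAG code lives in Kotlin, so this C++ version is only a language-neutral illustration. The names `chunk_text`, `cosine_similarity`, and `top_k_chunks` are hypothetical, and the chunker counts bytes rather than Unicode characters, which is a simplification.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Split text into fixed-size chunks; consecutive chunks share `overlap` bytes.
std::vector<std::string> chunk_text(const std::string& text,
                                    std::size_t size = 800,
                                    std::size_t overlap = 200) {
    std::vector<std::string> chunks;
    if (size == 0 || overlap >= size) return chunks;  // invalid configuration
    const std::size_t step = size - overlap;
    for (std::size_t start = 0; start < text.size(); start += step) {
        chunks.push_back(text.substr(start, size));
        if (start + size >= text.size()) break;  // this chunk reached the end
    }
    return chunks;
}

// Cosine similarity of two equal-length vectors; 0 for mismatched or zero vectors.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0f;
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Return the indices of the k stored embeddings most similar to the query,
// best match first.
std::vector<std::size_t> top_k_chunks(const std::vector<float>& query,
                                      const std::vector<std::vector<float>>& embeddings,
                                      std::size_t k = 3) {
    std::vector<std::pair<float, std::size_t>> scored;
    for (std::size_t i = 0; i < embeddings.size(); ++i) {
        scored.emplace_back(cosine_similarity(query, embeddings[i]), i);
    }
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const std::pair<float, std::size_t>& l,
                         const std::pair<float, std::size_t>& r) { return l.first > r.first; });
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < k; ++i) result.push_back(scored[i].second);
    return result;
}
```

With the defaults, a 1000-character document yields two chunks covering characters 0-799 and 600-999. A linear scan like `top_k_chunks` is usually adequate for the few hundred chunks a single document produces; larger corpora would call for an approximate nearest-neighbor index.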
## 5. Native Implementation Details
### 5.1 Build Configuration (`CMakeLists.txt`)
- **Standard**: C++17 / C11.
- **Optimization**: `-O3 -DNDEBUG`.
- **Android Specifics**:
  - `max-page-size=16384`: Ensures compatibility with Android 15+ (16KB page size).
  - ABI Filters: `arm64-v8a`, `x86_64`.
### 5.2 Memory Management
- **Manual Management**: The Kotlin layer holds `Long` pointers to C++ objects (`modelPtr`, `contextPtr`).
- **Lifecycle**: `freeModel()` must be called explicitly to prevent memory leaks.
- **mmap**: Loads models with memory mapping (`use_mmap=true`), so the OS pages model weights in on demand instead of copying the whole file into RAM. This is crucial for large models on memory-constrained mobile devices.
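
`llama.cpp` performs the memory mapping internally when `use_mmap` is enabled. The sketch below only illustrates the underlying POSIX mechanism with a hypothetical `MappedFile` RAII wrapper; it is not the library's implementation.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only memory mapping of a file, released automatically on destruction.
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = open(path, O_RDONLY);
        if (fd < 0) return;  // leave the object in the "not ok" state
        struct stat st {};
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t length = static_cast<std::size_t>(st.st_size);
            // Pages are loaded lazily by the kernel as they are first touched.
            void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
            if (addr != MAP_FAILED) {
                data_ = addr;
                size_ = length;
            }
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_ != nullptr) munmap(data_, size_);
    }

    // Copying would lead to a double munmap, so it is disabled.
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    const unsigned char* data() const { return static_cast<const unsigned char*>(data_); }
    std::size_t size() const { return size_; }

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Because the mapping is read-only and file-backed, the kernel can evict its pages under memory pressure and reload them from storage later, which is what makes models larger than free RAM workable at the cost of extra I/O.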
### 5.3 GPU Acceleration (Vulkan)
- **Backend**: Uses `ggml-vulkan` for GPU acceleration.
- **Shaders**: Requires pre-compiled SPIR-V shaders (`vulkan-shaders-hpp.hpp`) or runtime compilation.
- **Configuration**: Enabled via `-DLOCALLLM_ENABLE_VULKAN=ON` in Gradle.
- **Performance**: Can provide roughly a 2-10x speedup over CPU inference, depending on the device GPU, driver, and model size.
## 6. Data Flow: Message Lifecycle
1. **User Input**: User types a message in `ChatScreen`.
2. **ViewModel**: `ChatViewModel` receives the event.
3. **Persistence**: `SendMessageUseCase` saves the user message to `Room`.
4. **Context Building**:
   - If RAG is enabled, `VectorStore` retrieves relevant context.
   - Previous messages are fetched to build conversation history.
   - System prompt is prepended.
5. **Inference Trigger**: `LlamaAndroid.generateTokens()` is called with the formatted prompt.
6. **Native Execution**:
   - JNI passes string to C++.
   - `llama_tokenize` converts text to tokens.
   - `llama_decode` processes the prompt.
   - Loop: `llama_sample_token` -> `llama_decode` -> Callback to Java.
7. **Streaming**: `TokenCallback.onToken()` updates `ChatViewModel` via a `Flow`.
8. **UI Update**: Compose UI recomposes to show the new token.
9. **Completion**: Full response is saved to `Room`.
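
The sample, decode, and callback loop from step 6 can be sketched with the model replaced by a plain function. Everything here is a hypothetical stand-in: `NextTokenFn` plays the role of decoding plus sampling, `TokenCallback` plays the role of `TokenCallback.onToken()`, and the cancel-by-return-value convention is an assumption, not the project's confirmed behavior.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for "decode the context, then sample the next token".
using NextTokenFn = std::function<int(const std::vector<int>& context)>;

// Stand-in for the streaming callback; returning false cancels generation.
using TokenCallback = std::function<bool(int token)>;

// Streams tokens one at a time until the end-of-sequence token appears,
// the token budget is used up, or the listener cancels.
// Returns the number of tokens produced.
std::size_t generate(std::vector<int> context,
                     const NextTokenFn& next_token,
                     const TokenCallback& on_token,
                     int eos_token,
                     std::size_t max_tokens) {
    std::size_t produced = 0;
    while (produced < max_tokens) {
        const int token = next_token(context);  // sampling step
        if (token == eos_token) break;           // model finished its answer
        context.push_back(token);                // the next decode sees this token
        ++produced;
        if (!on_token(token)) break;             // listener asked to stop
    }
    return produced;
}
```

Delivering each token through the callback as soon as it is sampled is what allows the UI to render the response incrementally rather than waiting for the full completion.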
## 7. Security & Privacy
- **Local Execution**: All inference runs on-device through the bundled `llama.cpp` library. No data leaves the device.
- **Storage**: Chat history is stored in a private app-scoped SQLite database.
- **Permissions**:
  - `INTERNET`: Only for downloading models.
  - `READ_EXTERNAL_STORAGE`: For accessing user documents (PDFs).
## 8. Performance Optimizations
- **Quantization**: Supports GGUF quantized models (Q4_K_M, Q8_0) to reduce memory footprint.
- **Threading**: Automatically detects CPU cores (`std::thread::hardware_concurrency`) to optimize thread count.
- **KV Cache**: Reuses previous context computation to speed up multi-turn conversations.
- **ONNX Runtime**: Uses NNAPI where available for embedding generation.
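
As a small illustration of the threading point, the sketch below derives a thread count from `std::thread::hardware_concurrency`. The helper names and the cap of 8 threads are assumptions chosen for the example, not values taken from the project.

```cpp
#include <algorithm>
#include <thread>

// Pick an inference thread count from a detected core count.
// A detected value of 0 means "unknown", which hardware_concurrency()
// is allowed to report, so fall back to a single thread.
unsigned choose_thread_count(unsigned detected, unsigned max_threads = 8) {
    if (detected == 0) return 1;
    return std::max(1u, std::min(detected, max_threads));
}

// Convenience wrapper that queries the current device.
unsigned default_thread_count() {
    return choose_thread_count(std::thread::hardware_concurrency());
}
```

Capping the count is a common choice on phones with mixed performance and efficiency cores, where using every core can slow generation down rather than speed it up.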