# Token Constraints Implementation

This document describes the token constraint system implemented to ensure that input to the local LLM (served via Ollama) never exceeds the model's maximum token limit.

## Overview

The system implements comprehensive token management with the following key features:

1. **Token Count Enforcement** - Uses tiktoken with GPT-2 encoding to count tokens
2. **Streaming Support** - Implements Ollama's streaming API for real-time responses
3. **Prompt Construction Optimization** - Token budgeting per section with intelligent truncation
4. **UI Warnings** - Displays warnings when input truncation occurs

## Components

### 1. TokenCounter (`src/tokenCounter.ts`)

The core token counting and optimization service:

- **Token Counting**: Uses tiktoken with GPT-2 encoding (an approximation for Deepseek Coder, which uses its own tokenizer)
- **Token Budgeting**: Configurable percentage allocation for different prompt sections
- **Smart Truncation**: Truncates from least relevant content upwards
- **Fallback**: Character-based estimation when tiktoken fails

#### Configuration Options:

- `maxTokens`: Maximum token limit (default: 8192 for Deepseek Coder)
- `systemTokenBudget`: Percentage for system prompt (default: 20%)
- `userTokenBudget`: Percentage for user input (default: 30%)
- `contextTokenBudget`: Percentage for RAG context (default: 50%)

### 2. StreamingService (`src/streamingService.ts`)

Handles streaming and non-streaming communication with Ollama:

- **Streaming API**: Real-time response chunks for better UX
- **Timeout Support**: Configurable timeouts for long responses
- **Cancellation**: AbortController support for request cancellation
- **Fallback**: Non-streaming mode when streaming fails

### 3. Enhanced ModelIntegration (`src/modelIntegration.ts`)

Updated to use token constraints and streaming:

- **Token Optimization**: Automatically optimizes prompts before sending
- **Usage Logging**: Detailed token usage information for debugging
- **Streaming Integration**: Uses streaming by default with fallback
- **Warning Generation**: Provides token usage information for UI warnings

## Configuration

Add these settings to your VS Code settings:

```json
{
  "therapist.maxTokens": 8192,
  "therapist.systemTokenBudget": 0.2,
  "therapist.userTokenBudget": 0.3,
  "therapist.contextTokenBudget": 0.5,
  "therapist.enableStreaming": true,
  "therapist.streamingTimeout": 30000
}
```

### Setting Descriptions:

- **maxTokens**: Maximum token limit for your model (8192 for Deepseek Coder)
- **systemTokenBudget**: Fraction of tokens reserved for system prompt (0.0-1.0)
- **userTokenBudget**: Fraction of tokens reserved for user input (0.0-1.0)
- **contextTokenBudget**: Fraction of tokens reserved for RAG context and completion buffer (0.0-1.0)
- **enableStreaming**: Enable streaming responses from Ollama
- **streamingTimeout**: Timeout for streaming responses in milliseconds

## Token Budgeting Strategy

The system allocates tokens as follows:

1. **System Prompt (20%)**: Reserved for the base system instructions
2. **User Input (30%)**: Reserved for the user's query/request
3. **Context (50%)**: Split between RAG context and completion buffer
   - **RAG Context (40%)**: Retrieved code chunks and conversation history
   - **Completion Buffer (10%)**: Reserved space for the model's response

## Truncation Logic

When content exceeds token limits, the system truncates in this order:

1. **RAG Context Chunks**: Removes least relevant chunks first
2. **Conversation History**: Truncates older conversation context
3. **User Input**: Only truncated as last resort (with warning)
4. **System Prompt**: Only truncated if absolutely necessary (with warning)

## Streaming Implementation

### Benefits:

- **Real-time Feedback**: Users see responses as they're generated
- **Better UX**: No waiting for complete responses
- **Cancellation**: Users can cancel long-running requests
- **Timeout Handling**: Prevents hanging requests

### Fallback:

- Automatically falls back to non-streaming if streaming fails
- Maintains compatibility with different Ollama versions
- Graceful error handling

## Usage Examples

### Basic Token Usage Check:

```typescript
// Inspect how many tokens the prompt uses and whether anything was truncated
const tokenUsage = modelIntegration.getTokenUsageInfo(prompt, context);
console.log(`Total tokens: ${tokenUsage.totalTokens}`);
console.log(`Truncated: ${tokenUsage.truncated}`);
```

### Streaming Response:

```typescript
// Uses streaming by default and falls back to non-streaming on failure
const response = await modelIntegration.generateResponse(prompt, context);
```

### Manual Token Optimization:

```typescript
// Arguments: maxTokens, systemTokenBudget, userTokenBudget, contextTokenBudget
const tokenCounter = new TokenCounter(8192, 0.2, 0.3, 0.5);
const optimization = tokenCounter.optimizePrompt(systemPrompt, userInput, contextChunks);
```

## UI Integration

The system provides user feedback through:

1. **Warning Messages**: Displayed when truncation occurs
2. **Token Usage Logs**: Detailed information in developer console
3. **System Messages**: Real-time feedback during streaming

### Warning Example:

```
WARNING: Input was truncated due to token limits. Truncated sections: context chunks. Consider reducing context or breaking down your request.
```

## Testing

Run the token counter test to verify functionality:

```bash
node test-token-counter.js
```

This test verifies:

- Basic token counting accuracy
- Token budget calculations
- Prompt optimization
- Large context truncation

## Performance Considerations

1. **Token Counting**: tiktoken is fast but has initialization overhead
2. **Streaming**: Reduces perceived latency but uses more resources
3. **Context Truncation**: Intelligent truncation preserves most relevant content
4. **Caching**: Token counts could be cached for repeated content

## Troubleshooting

### Common Issues:

1. **tiktoken Import Errors**: Falls back to character estimation
2. **Streaming Failures**: Automatically falls back to non-streaming
3. **Token Limit Exceeded**: Check configuration and reduce context
4. **Performance Issues**: Consider reducing maxTokens or context window

### Debug Information:

Enable detailed logging by checking the VS Code Developer Console for:

- Token usage statistics
- Truncation warnings
- Streaming status
- Optimization results

## Future Enhancements

Potential improvements:

1. **Dynamic Token Budgets**: Adjust budgets based on content type
2. **Semantic Truncation**: Use embeddings to preserve most relevant content
3. **Token Caching**: Cache token counts for repeated content
4. **Model-Specific Limits**: Auto-detect token limits per model
5. **Real-time UI Updates**: Show streaming progress in the UI
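
## Illustrative Sketch: Budgets and Chunk Truncation

The budgeting percentages and the "least relevant chunks first" rule described under Token Budgeting Strategy and Truncation Logic can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/tokenCounter.ts`: the names `ContextChunk`, `Budgets`, `estimateTokens`, `computeBudgets`, `selectChunks`, and `filler` are hypothetical, the `relevance` score is assumed to exist on each chunk, and the 4-characters-per-token ratio stands in for the character-based fallback rather than tiktoken.

```typescript
// Hypothetical types and names for illustration only.
interface ContextChunk {
  text: string;
  relevance: number; // assumed score: higher means more relevant
}

interface Budgets {
  system: number;
  user: number;
  ragContext: number;
  completionBuffer: number;
}

// Character-based estimate; the 4 chars/token ratio is an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split maxTokens using the documented defaults:
// 20% system, 30% user, 50% context (40% RAG context + 10% completion buffer).
function computeBudgets(
  maxTokens: number,
  systemFraction: number = 0.2,
  userFraction: number = 0.3,
  contextFraction: number = 0.5,
  completionFraction: number = 0.1
): Budgets {
  const context = Math.floor(maxTokens * contextFraction);
  const completionBuffer = Math.floor(maxTokens * completionFraction);
  return {
    system: Math.floor(maxTokens * systemFraction),
    user: Math.floor(maxTokens * userFraction),
    ragContext: context - completionBuffer,
    completionBuffer: completionBuffer,
  };
}

// Keep the most relevant chunks that fit the budget, which is equivalent
// to removing the least relevant chunks until the remainder fits.
function selectChunks(
  chunks: ContextChunk[],
  budget: number
): { kept: ContextChunk[]; truncated: boolean } {
  const byRelevance = chunks.slice().sort((a, b) => b.relevance - a.relevance);
  const kept: ContextChunk[] = [];
  let used = 0;
  for (let i = 0; i < byRelevance.length; i++) {
    const cost = estimateTokens(byRelevance[i].text);
    if (used + cost > budget) break; // all less relevant chunks are dropped
    kept.push(byRelevance[i]);
    used += cost;
  }
  return { kept: kept, truncated: kept.length < chunks.length };
}

// Build a string of n characters as a stand-in for real chunk text.
function filler(n: number): string {
  return new Array(n + 1).join("x");
}

const budgets = computeBudgets(8192);
const selection = selectChunks(
  [
    { text: filler(8000), relevance: 0.9 }, // about 2000 estimated tokens
    { text: filler(8000), relevance: 0.2 }, // about 2000 estimated tokens
    { text: filler(4000), relevance: 0.5 }, // about 1000 estimated tokens
  ],
  budgets.ragContext
);
console.log(budgets, selection.kept.length, selection.truncated);
```

With the default 8192-token limit, the RAG context budget in this sketch comes out below the combined size of the three sample chunks, so the lowest-scored chunk is dropped and `truncated` is set, which is the condition under which the UI warning would be shown.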
