# Token Constraints Implementation
This document describes the token constraint system implemented to ensure that input to the local LLM (served via Ollama) never exceeds the model's maximum token limit.
## Overview
The system implements comprehensive token management with the following key features:
1. **Token Count Enforcement** - Counts tokens with tiktoken's GPT-2 encoding, which approximates the model's own tokenizer
2. **Streaming Support** - Implements Ollama's streaming API for real-time responses
3. **Prompt Construction Optimization** - Token budgeting per section with intelligent truncation
4. **UI Warnings** - Displays warnings when input truncation occurs
## Components
### 1. TokenCounter (`src/tokenCounter.ts`)
The core token counting and optimization service:
- **Token Counting**: Uses tiktoken with GPT-2 encoding as an approximation for Deepseek Coder, which has its own tokenizer, so counts may differ slightly from the model's
- **Token Budgeting**: Configurable percentage allocation for different prompt sections
- **Smart Truncation**: Truncates from least relevant content upwards
- **Fallback**: Character-based estimation when tiktoken fails
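
The fallback behaviour can be sketched roughly as follows. This is a minimal illustration, not the code in `src/tokenCounter.ts`: the `Encoder` interface, the function names, and the 4-characters-per-token ratio are assumptions chosen for the example.

```typescript
// Hypothetical encoder shape; the real tiktoken binding may differ.
interface Encoder {
  encode(text: string): ArrayLike<number>;
}

// Assumed heuristic: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Character-based estimate used when no encoder is available or it throws.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Prefer the encoder's exact count; fall back to the estimate on any failure.
function countTokens(text: string, encoder?: Encoder): number {
  if (!encoder) return estimateTokens(text);
  try {
    return encoder.encode(text).length;
  } catch {
    return estimateTokens(text);
  }
}
```

Rounding up keeps the estimate on the conservative side, which matters when the goal is never to exceed the limit.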
#### Configuration Options:
- `maxTokens`: Maximum token limit (default: 8192 for Deepseek Coder)
- `systemTokenBudget`: Fraction of `maxTokens` for the system prompt (default: 0.2, i.e. 20%)
- `userTokenBudget`: Fraction of `maxTokens` for user input (default: 0.3, i.e. 30%)
- `contextTokenBudget`: Fraction of `maxTokens` for RAG context and the completion buffer (default: 0.5, i.e. 50%)
### 2. StreamingService (`src/streamingService.ts`)
Handles streaming and non-streaming communication with Ollama:
- **Streaming API**: Real-time response chunks for better UX
- **Timeout Support**: Configurable timeouts for long responses
- **Cancellation**: AbortController support for request cancellation
- **Fallback**: Non-streaming mode when streaming fails
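
As an illustration of the streaming, timeout, and cancellation points above, here is a minimal sketch against Ollama's `/api/generate` endpoint, which streams newline-delimited JSON objects carrying a `response` field. The function names, the default port URL, and the callback shape are assumptions for this example and are not taken from `src/streamingService.ts`.

```typescript
// Assumption: Ollama is listening on its default local port.
const OLLAMA_URL = "http://localhost:11434/api/generate";

// Pass the text of each complete JSON line to onText and return the
// unparsed remainder (a partial line that needs more data).
function drainNdjson(buffer: string, onText: (text: string) => void): string {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? "";
  for (const line of lines) {
    if (line.trim() === "") continue;
    const chunk = JSON.parse(line) as { response?: string };
    if (chunk.response) onText(chunk.response);
  }
  return rest;
}

async function streamGenerate(
  model: string,
  prompt: string,
  onText: (text: string) => void,
  timeoutMs = 30000,
): Promise<void> {
  // The same controller serves both the timeout and manual cancellation.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(OLLAMA_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: true }),
      signal: controller.signal,
    });
    if (!res.ok || !res.body) throw new Error(`Ollama request failed: ${res.status}`);
    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer = drainNdjson(buffer + decoder.decode(value, { stream: true }), onText);
    }
    drainNdjson(buffer + "\n", onText); // flush a final line with no newline
  } finally {
    clearTimeout(timer);
  }
}
```

Buffering the partial line is the important detail: a network chunk can end in the middle of a JSON object, so only complete lines are parsed.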
### 3. Enhanced ModelIntegration (`src/modelIntegration.ts`)
Updated to use token constraints and streaming:
- **Token Optimization**: Automatically optimizes prompts before sending
- **Usage Logging**: Detailed token usage information for debugging
- **Streaming Integration**: Uses streaming by default with fallback
- **Warning Generation**: Provides token usage information for UI warnings
## Configuration
Add these settings to your VS Code settings:
```json
{
"therapist.maxTokens": 8192,
"therapist.systemTokenBudget": 0.2,
"therapist.userTokenBudget": 0.3,
"therapist.contextTokenBudget": 0.5,
"therapist.enableStreaming": true,
"therapist.streamingTimeout": 30000
}
```
### Setting Descriptions:
- **maxTokens**: Maximum token limit for your model (8192 for Deepseek Coder)
- **systemTokenBudget**: Fraction of tokens reserved for system prompt (0.0-1.0)
- **userTokenBudget**: Fraction of tokens reserved for user input (0.0-1.0)
- **contextTokenBudget**: Fraction of tokens reserved for RAG context and completion buffer (0.0-1.0)
- **enableStreaming**: Enable streaming responses from Ollama
- **streamingTimeout**: Timeout for streaming responses in milliseconds
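
A small validation step can catch misconfigured budgets early. The sketch below is an assumption about how such a check might look; the `TokenSettings` interface and `validateSettings` function are hypothetical, and the rule that the three fractions should not sum to more than 1.0 is an inference from the budgeting strategy rather than a documented requirement.

```typescript
// Hypothetical shape mirroring the token-related settings above.
interface TokenSettings {
  maxTokens: number;
  systemTokenBudget: number;
  userTokenBudget: number;
  contextTokenBudget: number;
}

// Returns a list of human-readable problems; an empty list means valid.
function validateSettings(s: TokenSettings): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(s.maxTokens) || s.maxTokens <= 0) {
    problems.push("maxTokens must be a positive integer");
  }
  const budgets: Array<[string, number]> = [
    ["systemTokenBudget", s.systemTokenBudget],
    ["userTokenBudget", s.userTokenBudget],
    ["contextTokenBudget", s.contextTokenBudget],
  ];
  let sum = 0;
  for (const [name, value] of budgets) {
    // Written as a negated range check so that NaN is also rejected.
    if (!(value >= 0 && value <= 1)) problems.push(`${name} must be between 0.0 and 1.0`);
    sum += value;
  }
  // Assumed rule: the budgets together must fit inside the limit.
  if (sum > 1 + 1e-9) problems.push("token budgets sum to more than 1.0");
  return problems;
}
```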
## Token Budgeting Strategy
The system allocates tokens as follows:
1. **System Prompt (20%)**: Reserved for the base system instructions
2. **User Input (30%)**: Reserved for the user's query/request
3. **Context (50%)**: Split between RAG context and completion buffer (percentages are of the total limit)
   - **RAG Context (40%)**: Retrieved code chunks and conversation history
   - **Completion Buffer (10%)**: Reserved space for the model's response
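
To make the arithmetic concrete, the following sketch computes the per-section budgets for the default 8192-token limit. The `computeBudgets` function and its rounding choices (rounding down, and taking the completion buffer out of the context share) are assumptions for illustration; the actual `TokenCounter` may round differently.

```typescript
interface TokenBudgets {
  system: number;
  user: number;
  ragContext: number;
  completionBuffer: number;
}

function computeBudgets(
  maxTokens: number,
  systemFraction = 0.2,
  userFraction = 0.3,
  contextFraction = 0.5,
  completionFraction = 0.1, // assumed: carved out of the context share
): TokenBudgets {
  const context = Math.floor(maxTokens * contextFraction);
  const completionBuffer = Math.min(context, Math.floor(maxTokens * completionFraction));
  return {
    system: Math.floor(maxTokens * systemFraction),
    user: Math.floor(maxTokens * userFraction),
    ragContext: context - completionBuffer,
    completionBuffer,
  };
}

const budgets = computeBudgets(8192);
// budgets: system 1638, user 2457, ragContext 3277, completionBuffer 819
```

Rounding down means a few tokens of the limit go unused (8191 of 8192 here), which errs on the safe side.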
## Truncation Logic
When content exceeds token limits, the system truncates in this order:
1. **RAG Context Chunks**: Removes least relevant chunks first
2. **Conversation History**: Truncates older conversation context
3. **User Input**: Only truncated as last resort (with warning)
4. **System Prompt**: Only truncated if absolutely necessary (with warning)
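
The first step in this order, dropping the least relevant chunks, might look like the sketch below. The `ContextChunk` shape, its `relevance` score, and the `dropLeastRelevant` function are hypothetical; the document does not specify how relevance is represented.

```typescript
// Hypothetical chunk shape; tokens is assumed to be precomputed.
interface ContextChunk {
  text: string;
  relevance: number; // higher means more relevant
  tokens: number;
}

// Remove the least relevant chunk repeatedly until the total fits the budget.
// Surviving chunks keep their original order.
function dropLeastRelevant(
  chunks: ContextChunk[],
  budget: number,
): { kept: ContextChunk[]; dropped: number } {
  const kept = [...chunks];
  let total = kept.reduce((sum, c) => sum + c.tokens, 0);
  let dropped = 0;
  while (total > budget && kept.length > 0) {
    let worst = 0;
    for (let i = 1; i < kept.length; i++) {
      if (kept[i].relevance < kept[worst].relevance) worst = i;
    }
    total -= kept[worst].tokens;
    kept.splice(worst, 1);
    dropped++;
  }
  return { kept, dropped };
}
```

A non-zero `dropped` count is the natural signal for raising the truncation warning in the UI.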
## Streaming Implementation
### Benefits:
- **Real-time Feedback**: Users see responses as they're generated
- **Better UX**: No waiting for complete responses
- **Cancellation**: Users can cancel long-running requests
- **Timeout Handling**: Prevents hanging requests
### Fallback:
- Automatically falls back to non-streaming if streaming fails
- Maintains compatibility with different Ollama versions
- Graceful error handling
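
One way to express the fallback is a small wrapper that tries the streaming path first. This is a sketch under assumptions: `generateWithFallback` is a hypothetical helper, and the choice to rethrow errors named `AbortError` (so a user cancellation does not trigger a second request) is a design assumption rather than documented behaviour.

```typescript
// Try the streaming call first; on failure, retry once without streaming.
async function generateWithFallback(
  streaming: () => Promise<string>,
  nonStreaming: () => Promise<string>,
): Promise<{ text: string; usedFallback: boolean }> {
  try {
    return { text: await streaming(), usedFallback: false };
  } catch (err) {
    // Assumption: cancellations surface as errors named "AbortError"
    // and should not be retried.
    if (err instanceof Error && err.name === "AbortError") throw err;
    return { text: await nonStreaming(), usedFallback: true };
  }
}
```

Note that a timeout implemented through `AbortController.abort()` also surfaces as `AbortError`, so under this rule a timed-out request is not retried either.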
## Usage Examples
### Basic Token Usage Check:
```typescript
// Inspect how many tokens the optimized prompt uses and whether anything was cut
const tokenUsage = modelIntegration.getTokenUsageInfo(prompt, context);
console.log(`Total tokens: ${tokenUsage.totalTokens}`);
console.log(`Truncated: ${tokenUsage.truncated}`);
```
### Streaming Response:
```typescript
// Streams by default and falls back to non-streaming if streaming fails
const response = await modelIntegration.generateResponse(prompt, context);
```
### Manual Token Optimization:
```typescript
// Arguments mirror the defaults: max tokens, then system, user, and context fractions
const tokenCounter = new TokenCounter(8192, 0.2, 0.3, 0.5);
const optimization = tokenCounter.optimizePrompt(systemPrompt, userInput, contextChunks);
```
## UI Integration
The system provides user feedback through:
1. **Warning Messages**: Displayed when truncation occurs
2. **Token Usage Logs**: Detailed information in developer console
3. **System Messages**: Real-time feedback during streaming
### Warning Example:
```
WARNING: Input was truncated due to token limits. Truncated sections: context chunks.
Consider reducing context or breaking down your request.
```
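
A message of this form could be assembled from the list of truncated sections, as in the sketch below. The `buildTruncationWarning` function is hypothetical and only reproduces the wording of the example above.

```typescript
// Returns null when nothing was truncated, so callers can skip the warning.
function buildTruncationWarning(truncatedSections: string[]): string | null {
  if (truncatedSections.length === 0) return null;
  return (
    `WARNING: Input was truncated due to token limits. ` +
    `Truncated sections: ${truncatedSections.join(", ")}.\n` +
    `Consider reducing context or breaking down your request.`
  );
}
```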
## Testing
Run the token counter test to verify functionality:
```bash
node test-token-counter.js
```
This test verifies:
- Basic token counting accuracy
- Token budget calculations
- Prompt optimization
- Large context truncation
## Performance Considerations
1. **Token Counting**: tiktoken is fast but has initialization overhead
2. **Streaming**: Reduces perceived latency, at the cost of holding a connection open and processing chunks for the duration of the response
3. **Context Truncation**: Intelligent truncation preserves most relevant content
4. **Caching**: Token counts could be cached for repeated content
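
The caching idea in the last point could be prototyped by wrapping any counting function in a bounded memo. This is a sketch only: `memoizeCounter`, the `maxEntries` cap, and the oldest-first eviction are assumptions, and no such cache is described as existing in the current implementation.

```typescript
// Wrap a token-counting function with a bounded cache keyed by the text.
function memoizeCounter(
  count: (text: string) => number,
  maxEntries = 1000,
): (text: string) => number {
  const cache = new Map<string, number>();
  return (text: string): number => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;
    const tokens = count(text);
    if (cache.size >= maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    cache.set(text, tokens);
    return tokens;
  };
}
```

Keying on the full text is simple but memory-hungry for large chunks; hashing the text first would be a reasonable variation.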
## Troubleshooting
### Common Issues:
1. **tiktoken Import Errors**: Falls back to character estimation
2. **Streaming Failures**: Automatically falls back to non-streaming
3. **Token Limit Exceeded**: Check configuration and reduce context
4. **Performance Issues**: Consider reducing maxTokens or context window
### Debug Information:
Detailed logs are written to the VS Code Developer Console. Check it for:
- Token usage statistics
- Truncation warnings
- Streaming status
- Optimization results
## Future Enhancements
Potential improvements:
1. **Dynamic Token Budgets**: Adjust budgets based on content type
2. **Semantic Truncation**: Use embeddings to preserve most relevant content
3. **Token Caching**: Cache token counts for repeated content
4. **Model-Specific Limits**: Auto-detect token limits per model
5. **Real-time UI Updates**: Show streaming progress in the UI