# FastEmbed-rb Roadmap

This document outlines features from the original [FastEmbed Python library](https://github.com/qdrant/fastembed) that are not yet implemented in fastembed-rb.

## Current Status (v1.0.0)

### Implemented

- Dense text embeddings with 12 models
- Automatic model downloading from HuggingFace
- Lazy evaluation via `Enumerator`
- Query/passage prefixes for retrieval models
- Mean pooling and L2 normalization
- Configurable batch size and threading
- CoreML execution provider support
- CLI tool (`fastembed`)
- **Reranking / Cross-Encoder models** (5 models)

## Feature Gap Analysis

### High Priority

#### 1. Sparse Text Embeddings

The Python library supports sparse embedding models that return indices and values rather than dense vectors. These are useful for hybrid search combining keyword and semantic matching.

**Models to support:**

- `Qdrant/bm25` - Classic BM25 (0.010 GB)
- `Qdrant/bm42-all-minilm-l6-v2-attentions` - Attention-based sparse (0.090 GB)
- `prithivida/Splade_PP_en_v1` - SPLADE++ (0.532 GB)

**API design:**

```ruby
sparse = Fastembed::SparseTextEmbedding.new
result = sparse.embed(["hello world"]).first
# => { indices: [123, 456, 789], values: [0.5, 0.3, 0.2] }
```

**Implementation notes:**

- Need new `SparseTextEmbedding` class
- Different output format (sparse vectors instead of dense)
- May require different tokenization approach for BM25

#### 2. Late Interaction (ColBERT) Models

ColBERT-style models produce token-level embeddings rather than a single vector per document. This enables more fine-grained matching.

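
To make the token-level matching concrete, here is a minimal plain-Ruby sketch of MaxSim scoring, the operation usually used with ColBERT-style embeddings: each query token is matched to its most similar document token, and those maxima are summed. The helpers `dot` and `max_sim` are hypothetical names for illustration and are not part of fastembed-rb; the sketch assumes token embeddings arrive as nested arrays of shape `[num_tokens, dim]` and are already L2-normalized, so the dot product acts as cosine similarity.

```ruby
# Hypothetical helpers for illustration only; not part of fastembed-rb.

# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# MaxSim: for every query token, take the best-matching document token
# and sum those maxima into a single relevance score.
def max_sim(query_tokens, doc_tokens)
  query_tokens.sum do |q|
    doc_tokens.map { |d| dot(q, d) }.max
  end
end

# Toy 2-dimensional, unit-length token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, -1.0]]

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)

puts "doc_a: #{score_a}"
puts "doc_b: #{score_b}"
```

In this toy data `doc_a` outscores `doc_b` because both query tokens find a close document token. A real implementation would likely use a numeric array library rather than nested Ruby arrays, since the cost grows with query tokens × document tokens × dimensions.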

**Models to support:**

- `answerdotai/answerai-colbert-small-v1` (96 dim)
- `colbert-ir/colbertv2.0` (128 dim)
- `jinaai/jina-colbert-v2` (128 dim)

**API design:**

```ruby
colbert = Fastembed::LateInteractionTextEmbedding.new
result = colbert.embed(["hello world"]).first
# => Array of token embeddings, shape: [num_tokens, dim]
```

**Implementation notes:**

- Returns 2D array per document (tokens × dimensions)
- Different pooling strategy (no pooling, keep all tokens)
- Scoring requires MaxSim operation between query and document tokens

#### ~~3. Reranking / Cross-Encoder Models~~ ✅ IMPLEMENTED

See `Fastembed::TextCrossEncoder` class.

### Medium Priority

#### ~~4. Image Embeddings~~ ✅ IMPLEMENTED

Vision models for converting images to vectors. Requires `mini_magick` gem.

**Supported models:**

- `Qdrant/resnet50-onnx` (2048 dim)
- `Qdrant/clip-ViT-B-32-vision` (512 dim)
- `jinaai/jina-clip-v1` (768 dim)

**Usage:**

```ruby
# Add to Gemfile: gem "mini_magick"
image_embed = Fastembed::ImageEmbedding.new
vector = image_embed.embed(["path/to/image.jpg"]).first
```

#### ~~5. Custom Model Support~~ ✅ IMPLEMENTED

Implemented via `CustomModelRegistry` module. Users can register custom models:

```ruby
Fastembed.register_model(
  model_name: "my-org/my-model",
  dim: 768,
  sources: { hf: "my-org/my-model" }
)

embed = Fastembed::TextEmbedding.new(model_name: "my-org/my-model")
```

Also supports local model loading via `local_model_dir` parameter.

### Low Priority

#### 6. Multimodal Late Interaction (ColPali)

ColPali models that can embed both images and text for document retrieval.

**Models to support:**

- `vidore/colpali-v1.2`
- `vidore/colqwen2-v1.0`

**Implementation notes:**

- Combines image and text embedding
- Requires vision preprocessing
- Complex architecture, lower priority

#### 7. Quantized Models

Support for INT8/INT4 quantized models for faster inference and lower memory usage.

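
As a rough illustration of where the memory saving and the accuracy cost come from, the sketch below applies symmetric INT8 quantization to a small float vector in plain Ruby. This only shows the general idea and is not how ONNX Runtime or fastembed-rb quantize models; `quantize_int8` and `dequantize_int8` are hypothetical helper names, and the single per-vector scale is an assumption made to keep the example short.

```ruby
# Illustrative only: symmetric INT8 quantization with one scale per vector.
# These helpers are hypothetical and not part of fastembed-rb or ONNX Runtime.

# Map floats to integers in [-127, 127] and return them with the scale used.
def quantize_int8(values)
  return [[], 1.0] if values.empty?

  max_abs = values.map(&:abs).max
  scale = max_abs.zero? ? 1.0 : max_abs / 127.0
  quantized = values.map { |v| (v / scale).round.clamp(-127, 127) }
  [quantized, scale]
end

# Recover approximate floats from the stored integers and scale.
def dequantize_int8(quantized, scale)
  quantized.map { |q| q * scale }
end

weights = [0.12, -0.5, 0.33, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Rounding error per value is at most half a quantization step.
max_error = weights.zip(restored).map { |w, r| (w - r).abs }.max

puts "quantized: #{quantized.inspect}"
puts "max error: #{max_error}"
```

Each value is stored in one byte instead of four (for 32-bit floats), at the price of a rounding error bounded by half the scale. Whether that error matters for a given embedding model has to be measured on that model.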

**Implementation notes:**

- ONNX Runtime supports quantized models natively
- Need to add quantized model variants to registry
- Trade-off between speed and accuracy

## ~~CLI Enhancements~~ ✅ IMPLEMENTED

All planned CLI features have been implemented:

- ✅ `fastembed download <model>` - Pre-download models for offline use
- ✅ `fastembed benchmark` - Run performance benchmarks with configurable iterations
- ✅ `fastembed info <model>` - Show detailed model information including cache status
- ✅ `-i input.txt` - Read texts from file (one per line)
- ✅ `-p` / `--progress` - Show progress bar during embedding
- ✅ `-q` / `--quiet` - Suppress progress output for scripting

## Breaking Changes for v2.0

If we do a major version bump:

1. Consider making `embed()` return an Array instead of Enumerator by default
2. Rename `query_embed`/`passage_embed` to `embed_query`/`embed_passage` for consistency
3. Use keyword arguments consistently throughout

---

## Refactoring Plan

### Completed: Phase 1 - Extract Shared Helpers

- [x] Create `Validators` module for document validation
- [x] Extract `prepare_model_inputs` to BaseModel
- [x] Extract `setup_model_and_tokenizer` to BaseModel
- [x] Update all model classes to use shared helpers

**Result:** Reduced ~60 lines of duplicated code across 4 model classes.

---

### Completed: Phase 2 - Add Missing Features (Medium Risk)

Goal: Achieve API consistency across all model types.

#### 2.1 Add `passage_embed` to TextSparseEmbedding ✅ IMPLEMENTED

Added to TextSparseEmbedding.

```ruby
# lib/fastembed/sparse_embedding.rb
def passage_embed(passages, batch_size: 32)
  passages = [passages] if passages.is_a?(String)
  embed(passages, batch_size: batch_size)
end
```

#### 2.2 Add async methods to all embedding classes ✅ IMPLEMENTED

Added async methods to all model classes:

- TextSparseEmbedding: embed_async, query_embed_async, passage_embed_async
- LateInteractionTextEmbedding: embed_async, query_embed_async, passage_embed_async
- TextCrossEncoder: rerank_async, rerank_with_scores_async

```ruby
# Add to TextSparseEmbedding
def embed_async(documents, batch_size: 32)
  Async::Future.new { embed(documents, batch_size: batch_size).to_a }
end

def query_embed_async(queries, batch_size: 32)
  Async::Future.new { query_embed(queries, batch_size: batch_size).to_a }
end

def passage_embed_async(passages, batch_size: 32)
  Async::Future.new { passage_embed(passages, batch_size: batch_size).to_a }
end

# Add to TextCrossEncoder
def rerank_async(query:, documents:, batch_size: 64)
  Async::Future.new { rerank(query: query, documents: documents, batch_size: batch_size) }
end
```

#### 2.3 Add progress callback support to all embedding classes ✅ IMPLEMENTED

Added progress callback support to TextSparseEmbedding and LateInteractionTextEmbedding.

#### 2.4 Add `show_progress` parameter to TextCrossEncoder ✅ IMPLEMENTED

Made configurable (was hardcoded to true).

---

### Completed: Phase 3 - Unify Initialization (Higher Risk)

Goal: Consistent initialization API across all model types.

#### 3.1 Add quantization support to all models ✅ IMPLEMENTED

Added quantization parameter to all model classes (TextSparseEmbedding, LateInteractionTextEmbedding, TextCrossEncoder).

#### 3.2 Add local_model_dir support to all models ✅ IMPLEMENTED

Added local_model_dir, model_file, and tokenizer_file parameters to all model classes. Shared logic extracted to BaseModel (initialize_from_local, create_local_model_info).

#### 3.3 Document batch size rationale ✅ DOCUMENTED

Default batch sizes vary by model type based on memory requirements:

| Model Type | Default Batch Size | Rationale |
|------------|-------------------|-----------|
| TextEmbedding | 256 | Dense embeddings have fixed output size (e.g., 384 floats). Memory is predictable and efficient. |
| TextSparseEmbedding | 32 | SPLADE models output logits for entire vocabulary (~30k tokens) per sequence position. Much higher memory per document. |
| LateInteractionTextEmbedding | 32 | ColBERT keeps per-token embeddings (not pooled), so output size scales with sequence length × embedding dim. |
| TextCrossEncoder | 64 | Processes query-document pairs together. Each pair requires more memory than single documents, but less than sparse/late interaction. |

Users can override these defaults via the `batch_size` parameter if they have different memory constraints.

---

### Implementation Priority

| Task | Risk | Effort | Value |
|------|------|--------|-------|
| 2.1 Add passage_embed to Sparse | Low | Small | Medium |
| 2.2 Add async to all classes | Low | Medium | High |
| 2.3 Add progress to all classes | Medium | Medium | Medium |
| 2.4 Add show_progress to CrossEncoder | Low | Small | Low |
| 3.1 Add quantization to all | Medium | Medium | Medium |
| 3.2 Add local_model_dir to all | Medium | Large | Medium |
| 3.3 Document batch size rationale | Low | Small | Low |

---

## Contributing

Contributions are welcome! If you'd like to implement any of these features:

1. Open an issue to discuss the approach
2. Follow the existing code style (run `bundle exec rubocop`)
3. Add tests for new functionality
4. Update the README and CHANGELOG
