Loading...
Loading...
Loading...
This document explains **how** the screener works β what we index, how we score, and how decisions are made.
# π§ Logic & Algorithms
This document explains **how** the screener works β what we index, how we score, and how decisions are made.
(Updated to include **passage-level chunking** and **cross-encoder** re-ranking.)
> The earlier logic overview is retained and expanded here. :contentReference[oaicite:2]{index=2}
---
## Overview
1. **Index time (CLI)**
Parse each resume β PII mask β normalize β build 3 document-level signals (hash, lexical, semantic) β (optionally) build **chunk-level** semantic vectors β persist.
2. **Query time (API)**
Parse upload β **exact-dup gate** β candidate search (document & chunk levels) β per-candidate scoring β **cross-encoder re-rank** on the top set β **decision** β (optional) evidence report.
---
## A. Parsing & Normalization
- **Parsing:** PDFs via `pdfminer`, DOCX via `python-docx`. We keep `text` and a simple `layout`.
- **PII masking:** `mask_pii_if_enabled(text)` to reduce false positives and make **hashing stable** across runs.
- **Normalization:** lowercase, whitespace collapse, Unicode cleanup, and section/bullet extraction used for reporting.
---
## B. Signals & Indexes
### 1) Exact-duplicate hash (O(1))
Aggressive normalization β SHA-1 β perfect equality check for βeffectively the sameβ resume.
```python
import hashlib
def normalized_hash(t: str) -> str:
s = " ".join(t.lower().split())
return hashlib.sha1(s.encode("utf-8")).hexdigest()
```
We persist a **hash map**: `hash -> [doc_ids...]`.
### 2) Lexical fingerprint (MinHash + LSH)
* Build MinHash signatures over word 5-grams (or char 5-grams for more robustness).
* Insert into an LSH index to retrieve likely neighbors.
* Provides a fast **Jaccard** overlap estimate without comparing to every doc.<br/><br/><br/>
### <u><b>Lexical fingerprint:</b></u>
Lexical fingerprint (MinHash + LSH) is a technique that combines two powerful algorithms, MinHash and Locality Sensitive Hashing (LSH), to efficiently find and group together documents that are lexically similar.
Lexical fingerprint refers to a way of representing a document based on its words and phrases. Instead of treating a document as a whole, it's broken down into smaller, overlapping units called shingles (e.g., sequences of words or characters). These shingles become the building blocks for identifying similarity.<br/>
</b><u>MinHash:</b></u> MinHash is a technique that estimates the Jaccard similarity between two sets.
1. <u>Shingling:</u> Documents are first converted into sets of shingles.
2. <u>Hashing and Minimums:</u> Multiple hash functions are applied to these sets of shingles, and for each function, the minimum hash value is recorded.
3. <u>Signatures:</u> This process generates a compact "signature" (a vector of these minimum hash values) for each document.
4. <u>Similarity Estimation:</u> The number of matching values between two documents' MinHash signatures provides an approximation of their Jaccard similarity.
<b><u>NOTE:</b></u> <b>Jaccard similarity</b>, also known as the <b>Jaccard index</b>, is a statistic used for gauging the similarity and diversity of sample sets. It measures the overlap between two sets by dividing the size of their intersection by the size of their union. Essentially, it provides a ratio indicating how much the sets have in common compared to the total number of unique elements in both sets.
Formula:
The Jaccard similarity (J) between sets A and B is calculated as:
J(A, B) = |A β© B| / |A βͺ B|
### <u><b>Locality Sensitive Hashing (LSH): </b></u>
LSH is a technique for speeding up the search for similar items in large datasets. It's crucial because comparing every MinHash signature pairwise is still computationally expensive for large collections of documents.
1. <u>Banding:</u> LSH typically employs a technique called "banding." It divides each MinHash signature into several smaller "bands".
2. <u>Hashing Bands:</u> Each band is then hashed into a bucket.
3. <u>Candidate Identification:</u> The key idea is that documents with similar signatures are more likely to have at least one band that hashes to the same bucket. These documents become "candidate pairs" for further, more precise similarity comparison. This differs from traditional hashing, which aims to minimize collisions.
In harmony, they work like this:
1. Documents are transformed into sets of shingles (lexical fingerprints).
2. MinHash creates compact signatures (hash representations) of these shingle sets that preserve similarity.
3. LSH, using the banding technique, takes these MinHash signatures and efficiently groups potential similar documents into buckets, significantly reducing the number of pairwise comparisons needed to find near-duplicates or similar texts.
### 3) Semantic vector (embeddings)
A semantic vector is a numerical representation of a word, phrase, or document that captures its meaning and contextual relationships within a larger dataset or corpus (in our case, our corpus of resumes).
Semantic vector embeddings are numerical representations of data (like words, sentences, or documents) that capture their meaning or semantic relationships in a multi-dimensional space. Cosine similarity is a method to measure the similarity between these embeddings by calculating the cosine of the angle between them. Essentially, it tells you how closely the vectors point in the same direction, with a higher cosine value indicating greater similarity
* Encode **whole document** with a sentence-transformer (configurable).
* Similar docs β high **cosine** similarity.
* Stored as `sem.npy` + `sem_ids.json`; searched via <b>ANN</b> (Approximate Nearest Neighbor (ANN) search is a technique used to find data points that are similar to a given query point, but it doesn't necessarily find the absolute closest one. Instead, it finds a point (or points) within a certain "close enough" distance, prioritizing speed and efficiency, especially when dealing with large, high-dimensional datasets) or brute force for small corpora.
### 4) Layout vector (optional)
* A small numeric vector (e.g., section order, bullet density). If unused, set `w_lay = 0.0`.
---
## C. Passage-level Chunking
Why chunk? Whole-document embeddings can miss partial plagiarism (e.g., one copied project section). Chunking raises recall.
- **Chunking strategy:** sliding windows over tokens/sentences (e.g., ~140 tokens, stride ~70) β see `chunk.py`.
- **Indexing:** For each doc, we embed each chunk and store vectors + `(doc_id, chunk_id, offsets)`.
- **Query:** The uploaded resume is chunked the same way. We retrieve top-k neighbors **per chunk** using semantic ANN.
We then aggregate candidates across chunks by **doc_id** (vote/score sum).
---
## D. Cross-Encoder Re-Ranker (pairwise scoring)
A cross-encoder is a type of model in natural language processing (NLP) that analyzes the relationship between two pieces of text by processing them together as a single unit.
To sharpen precision, we run a **cross-encoder** on top of the ANN/LSH candidates β see `cross_encoder.py`.
- **What it does:** For each candidate doc (or its strongest chunk pair), the CE receives `(query_text, candidate_text)` and outputs a relevance score.
- **Why it helps:** Cross-encoders **jointly** attend to both texts, catching paraphrases and nuanced overlaps that bi-encoders may blur.
- **How we use it:** After initial scoring, we take the top N candidates (e.g., 20β50) and compute CE scores; we then **blend** CE into the fused score (weighted) or use it to re-order the top set.
---
## E. Candidate Generation (multi-signal)
We take the **union** of:
- Top-K **semantic** neighbors for the whole document
- **LSH** neighbors from MinHash (lexical)
- **Chunk**-level semantic neighbors (aggregated by doc)
- (Optional) **layout** neighbors
Deduplicate β cap to `k` (e.g., 20β50). This balances recall and speed.
---
## F. Scoring & Fusion
For each candidate `doc`:
1. **Semantic (document):** `s_sem_doc = cosine(emb_doc(q), emb_doc(doc))`
2. **Semantic (chunk, max/avg):** `s_sem_chunk = max/mean over best chunk pairs`
3. **Lexical:** `s_lex = MinHash-estimated Jaccard`
4. **Layout (optional):** `s_lay β [0,1]` or `0.0`
5. **Cross-encoder:** `s_ce β [0,1]` (after sigmoid/normalization)
**Fused total** (example default):
````
total = 0.40*s\_sem\_doc
\+ 0.20*s\_sem\_chunk
\+ 0.30*s\_lex
\+ 0.00*s\_lay
\+ 0.10\*s\_ce
````
> Weights are configurable in `config`/env. If youβre not using layout, keep `w_lay=0.0`.
> If you want a stronger CE influence, increase `w_ce` (with careβCE is slower than bi-encoder).
---
## G. Near-Duplicate & Decision Policy
**Near-dup hard gates** (short-circuit to 100% plagiarized):
- `best_sem_doc β₯ 0.985` **or** `best_lex β₯ 0.95`
**Component overrides** (to avoid βhigh semantic but low totalβ issues):
- Examples:
- `sem_doc β₯ 0.90 & lex β₯ 0.25` β plagiarized
- `sem_doc β₯ 0.70` β at least needs_review
**Fused thresholds** (after CE re-rank):
- `total β₯ 0.85` β **plagiarized_likely**
- `0.30 β€ total < 0.85` β **needs_review**
- `< 0.30` β **unique**
---
## H. Evidence Report (HTML)
For `/screen_with_report` we render:
- Header with **Overall / Lex / Sem / CE** (as configured)
- **Exact overlaps** (phrase highlighter) or **soft keyword** overlaps when exact is sparse
- **Similar bullets** (cosine-matched pairs) with inline token highlights
- Filenames for context
Reports are saved to `data/reports/<timestamp>.html` and linked from the API/UI.
---
## I. FAISS / Embedding Dimensions
Embedding models output different vector sizes (e.g., MiniLM: 384, MPNet: 768). The vector index is created for that dimension.
- Switching models with a different **dim** β **reindex** (`rm -rf ./data/index && cli index`)
- Same dim β vectors can be reused
---
## J. Complexity & Performance
- Exact hash: **O(1)**
- Candidate gen: **sublinear** with ANN + LSH (or brute force for small corpora)
- CE re-rank: **O(K)** forward passes (slower; keep K modest)
- Index build: linear in documents and chunks (embedding + MinHash dominate)
**Scaling tips**
- Use ANN (FAISS/ScaNN) for both doc and chunk indices
- Keep NUM_PERM consistent for MinHash & LSH
- Tune chunk sizes/stride and CE weight for balanced quality vs. latency
---
## K. Why scores can differ with/without report
Scores should match now that `/screen` and `/screen_with_report` share the same candidate & scoring path (including chunking + CE). If you ever see drift, reindex and ensure env variables match.
---
## L. Pseudocode (query)
```python
text, layout = parse(upload)
raw_text = text
text = mask_pii_if_enabled(text)
qdoc = normalize_and_split(text)
# 1) exact dup
if hash(qdoc["text"]) in hash_index or hash(raw_text) in hash_index:
return 100%, plagiarized_likely
# 2) candidates (doc ANN, LSH, chunk ANN) -> union
C = generate_candidates(qdoc)
# 3) score
v_q_doc = enc_doc(qdoc["text"])
scores = []
for cid in C:
odoc = store.get(cid)
s_lex = jaccard_minhash(qdoc["text"], odoc["text"])
s_semD = cosine(v_q_doc, enc_doc(odoc["text"]))
s_semC = max_chunk_cosine(qdoc_chunks, odoc_chunks)
s_ce = cross_encode(q_best_text, odoc_best_text) # pair with best chunk or doc
total = fuse(s_semD, s_semC, s_lex, s_ce, s_lay=0.0)
scores.append((cid, total, s_lex, s_semD, s_semC, s_ce))
best = max(scores, key=lambda x: x[1])
# 4) near-dup and decision policy
if best.s_semD >= 0.985 or best.s_lex >= 0.95:
return 100%, plagiarized_likely
decision = decide(best.total, best.s_lex, best.s_semD, best.s_semC, best.s_ce)
# 5) (optional) evidence
report = build_html_report(qdoc, top_doc, overlaps, similar_bullets, scores)
return result
````
This roadmap outlines planned enhancements to transform cheap-RAG from a functional document retrieval system into a production-ready, state-of-the-art RAG framework. Priorities are based on impact vs. effort analysis and alignment with mainstream RAG best practices.
See `specs/Semblance-MVP-Plan-v2.md` for full technical specification.
All notable changes to AvocadoDB will be documented in this file.
**Goal:** Stand up Toasty as a reliable service wired to BLT/GitHub events; deliver safe, useful summaries early.