Implementing a RAG system: Walk

Now that we've established the basics in our "Crawl" phase, it's time to pick up the pace. In this guid, we'll move beyond the initial setup to focus on optimizing core architectural components for better performance and accuracy.

Walk

We ended the previous "Crawl" design with a functioning AI HR agent with a RAG system. The responses, however, could be better. I've introduced some new elements to the architecture to perform better document processing and chunking, as well as re-ranker model to sort the semantic retrieval results by relevance:

"Walk" RAG architecture

The ugly Docling

IBM's Docling is an open-source document processing tool and easily one of the most effective ones I've tested. It can convert various file formats (e.g., PDF, docx, HTML) into clean, structured formats like Markdown and JSON. By integrating AI models and OCR, it doesn't just extract text, but also preserve the original layout's integrity. Through its hierarchical and hybrid chunking methods, Docling intelligently groups content by heading, merges smaller fragments for better context, and attaches rich metadata to streamline downstream searching and citations. Here's a Python function I use for chunking a PDF file:

def docling_chunk_pdf(file: str) -> tuple[list, list]:
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    result = converter.convert(file)
    doc = result.document
    chunker = HybridChunker()
    chunks = list(chunker.chunk(doc))
    chunk_texts = [c.text for c in chunks]

    return chunks, chunk_texts

I plan to take a deeper dive into Docling in a future article, so give me a follow so you won't miss it! 😄

Dot product vs cosine similarity

In the "Crawl" post, I talked briefly about cosine similarity and how it ignore magnitude and only focuses on the angle between two vectors. This is because normalization is baked into the cosine similarity formula. Dot product is effectively cosine similarity but without the final normalization step, which is why its result is affected by the magnitude of the vectors. Since many modern embedding models output pre-normalized unit vectors, the extra normalization step in cosine similarity becomes a redundant calculation. By using dot product on these pre-normalized vectors, you can achieve identical results with higher computational efficiency.

NOTE #1: While switching to dot product can increase your raw retrieval throughput, the latency gains may feel negligible when considering the entire end-to-end RAG pipeline depending on your particular use case and scale.

NOTE #2: A friendly reminder that choosing dot product over cosine similarity has the hard requirement that your vectors be normalized beforehand, or the magnitude will skew your search results. It's also quite easy to update your search configuration to use one or the other. If you're ever in doubt, just run a quick test with both settings to verify that both methods return the exact same nearest neighbours (top semantic matches).

Re-ranking

Standard search is built for speed and not deep understanding, so it can sometimes miss nuances. Re-ranking takes a crucial "second look" at the standard retrieval results to see which one(s) actually address the user's query. While the Cosine distance represents how similar the query and document align in the vector space, a "close" match doesn't guarantee an answer. The re-ranker's job to is to bridge this gap by scrutinizing the top results to assign a true relevance_score and ensure the most helpful contexts rise to the top. Here's what that snippet of code looks like:

co = cohere.ClientV2()

response = co.rerank(
    model=RERANKING_MODEL,
    query=user_query,
    documents=documents_to_rerank,
    top_n=len(candidate_responses),
)

reranked_results = []
for res in response.results:
    original_data = candidate_responses[res.index]

    reranked_results.append({
        "content": original_data["content"],
        "source": original_data["source"],
        "heading": original_data["heading"],
        "page": original_data["page"],
        "search_distance": original_data["search_distance"],
        "relevance_score": res.relevance_score
    })

As part of the full code that performs the re-ranking, I assign a threshold for the relevance score. Scores lower than this threshold is deemed irrelevant.

Updated example

I'm shaking up the stack for the "Walk" phase! In addition to using a different document processor, I will also be using a different embedding model and vector database. Since I wanted to try out Cohere's re-ranking model, I opted to lean into their full suite and use their embedding model as well. I made a deliberate choice here to set the embedding dimension to 384, which is a lower than the 768 I previous used in the "Crawl" example. I wanted to handicap the initial semantic search, and by doing so, we can more clearly see the re-ranker work its magic to fix the order or the results. I switched out ChromaDB with LanceDB to showcase just how many robust, easy-to-use open-source local vector databases are available for use.

Querying the HR agent

While I kept the core agent configuration from the "Crawl" phase the same, the addition of the re-ranking step made a significant impact. I asked the same two benchmark questions and this time the results were more refined and accurate:

HR RAG + re-ranking ADK Agent w/Gemini 3.1 Pro Preview

You can find the code for the "Walk" phase → here

Next steps

Now that we've manually optimized our retrieval and re-ranking, the next step is to scale. I will be migrating this architecture to Vertex AI's RAG Engine for a fully managed, high-performance RAG pipeline at an enterprise scale.

Additional learning

I used Cohere's embedding and re-ranking models in my example, but if you want to try out Vertex AI's re-ranking capabilities (and more), try out this Advanced RAG Techniques Codelab.

Walk

The ugly Docling

Dot product vs cosine similarity

Re-ranking

Updated example

Querying the HR agent

Next steps

Additional learning

Tags

Comments

More Blog

Five Gemma-4 models, one accelerator: what porting E2B 31B to AWS Inferentia2 taught me

Hey DEV, I'm Tobore. Let's actually connect.

I burned through thousands of AI tokens. Then a friend did it for free

Claude might be saturating your machine

Automated GitHub Code Reviews Using Google Gemini

What is an "agentic harness," actually?

Ready-made automations for this