Implementing a RAG system: Crawl — DeepSeek Blog | Neura Market
    Glen Yu March 23, 2026

> I'm starting a "Crawl, walk, run" series of posts on various topics and decided to start with Retrieval-Augmented Generation (RAG). Learn the basics and progress to a production-ready system!

## Crawl

In this phase of your journey, we're going to learn about the core concepts of a Retrieval-Augmented Generation (RAG) system and then apply them in a simple example.

We're going to build a Human Resources (HR) agent that can help answer and navigate HR-related questions. Using the [Government of British Columbia's HR Policy PDFs](https://www2.gov.bc.ca/gov/content/careers-myhr/managers-supervisors/employee-labour-relations/conditions-agreements/policy/hr-policy-pdf) as our knowledge base, we will process, chunk, and embed the documents into a local vector database. This allows the agent to provide grounded answers and ensures that every response is rooted directly in the ingested BC government policies.

!["Crawl" RAG architecture](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goafc5mkjuzhbn4xr9p6.png)

### Why RAG?

RAG is a very common design pattern that turns a standard LLM into an informed AI agent. Standard models can be a "black box", but RAG gives your agent an "open-book test". It bypasses knowledge cutoffs by linking directly to your documents, providing factual grounding and citations. No fine-tuning is required, and data can be updated quickly. RAG provides a real-time bridge between your LLM and your data.

This post will focus primarily on indexing and retrieval of your data. Let's get started!

## How do you eat an elephant? One bite at a time

It's not feasible to feed all the information into the AI every time you want to ask it a question. Instead, the information is broken down into smaller, more manageable pieces called chunks, which the AI can process and retrieve efficiently.

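
To make the idea of a "chunk" concrete, here is a minimal sketch of the simplest possible approach: a fixed-size sliding window with overlap, in plain Python. The function name `fixed_size_chunks`, the sample sentence, and the tiny sizes are assumptions made up for illustration; this is not the splitter used in the repository.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` characters.

    Each window shares its first `overlap` characters with the end of
    the previous window, so content near a cut appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text, not taken from the HR policy PDFs.
sample = "Employees accrue vacation monthly. Unused days carry over once."
chunks = fixed_size_chunks(sample, chunk_size=30, overlap=10)

for chunk in chunks:
    print(repr(chunk))
```

Printing the chunks shows that the window cuts wherever the character count lands, including mid-word, which is the weakness that a boundary-aware splitter tries to reduce.
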
We will use a "recursive character chunking" strategy, which is fast and smart: it tries to split at natural boundaries like paragraphs, but can still cut off mid-sentence if a chunk is too big. An overlap is used to ensure that context isn't lost at the edges of the cut if a split does occur.

Snippet of code used for splitting & chunking using LangChain:

```python
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# DATA_DIR is the folder containing the policy PDFs (defined elsewhere).
# Load every PDF found under it, including subfolders.
loader = DirectoryLoader(
    DATA_DIR,
    glob="./**/*.pdf",
    loader_cls=PyPDFLoader
)
docs = loader.load()

# Split into chunks of up to 1000 characters, with 100 characters
# shared between neighbouring chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
chunks = text_splitter.split_documents(docs)
```

![Recursive character chunking](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rb9dmvjp63cx85mif5kw.png)

Recursive character chunking is the successor to "fixed-size chunking", which is just a fixed sliding window. With fixed-size chunking, you always need the overlap because you never know where in a sentence the cut will land.

![Fixed-size chunking](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/11wlflm1ejdaamj0his0.png)

## What's the vector representation of 'Life'?

The embedding process transforms text chunks into vectors, which are mathematical arrays of floating-point numbers that capture semantic meaning. The length of each array is its dimensionality. However, higher dimensionality won't necessarily generate better results. For simple or straightforward documents, expanding the vector size often introduces latency and computational overhead without providing better search accuracy.

### Two sides of the same coin: Indexing & retrieval

Indexing and retrieval are two parts of the same conversation, and you must use the same embedding model for both. This is important because every embedding model weights the words in a sentence differently.
One might prioritize the subject, while another might prioritize the action, which would yield different results.

What happens when different embedding models try to embed “To be, or not to be…”:

{% embed https://www.youtube.com/watch?v=iQULEW2JwHE %}

Selecting the appropriate embedding type is also very important. The documents that you embed and index are usually long, structured documents where the focus is on the information they *provide*. This is in contrast to user queries, which are usually short, messy text where the focus is on the information being *looked for*.

## Finding a match

Once your user query is embedded, the RAG system performs a similarity search against the vector database to identify the most relevant chunks. This is commonly calculated using cosine similarity. This metric focuses exclusively on the angle between vectors rather than their magnitude; it measures how closely the semantic "intent" (angle) of the query aligns with the document, regardless of the text's length or word frequency. This is important because it means the AI can recognize that a short question and a long technical manual can share the same intent even if their scale (magnitude) is completely different.

## Putting it all together

Link to my GitHub repository → [here](https://github.com/Neutrollized/rag-systems-crawl-walk-run/tree/main/01_crawl)

To handle HR questions, I'm building an agent using [Google's Agent Development Kit (ADK)](https://google.github.io/adk-docs/get-started/python/) that connects directly to this RAG system:

```python
from google.adk.agents import LlmAgent
from google.adk.tools import FunctionTool

from .tools import query_hr

# Wrap the retrieval function so the agent can call it as a tool.
hr_rag_tool = FunctionTool(func=query_hr)

hr_agent = LlmAgent(
    name="hr_agent",
    model="gemini-3.1-pro-preview",
    description="Specialist in company HR policies and procedures.",
    instruction=(
        "You are a professional HR assistant. Your goal is to answer questions "
        "using ONLY the information retrieved from the 'query_hr' tool. "
        "When calling the 'query_hr' tool, ensure all string arguments are properly formatted as standard JSON strings with double quotes.\n\n"
        "RULES:\n"
        "1. If the tool returns relevant information, summarize it clearly.\n"
        "2. You MUST cite your sources using the format: (Source: [Source Name], Page: [Page Number]).\n"
        "3. If the tool results do not contain the answer, state: 'I'm sorry, I couldn't find that in our HR documents.'\n"
        "4. Do not use outside knowledge or make up facts about company policy."
    ),
    # Passing the bare function (query_hr) also works; ADK wraps it automatically.
    tools=[hr_rag_tool],
)
```

With clear instructions and the right tools to search our vector database, the agent should be able to pull precise answers for users in seconds:

![HR RAG ADK Agent w/Gemini 3.1 Pro Preview](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8pu768fxnl6o3x7eycx9.png)

It answers questions reliably, but if I'm being honest, I can't help but feel we're only scratching the surface of the "full" answers that we're looking for.

## Next steps

We have a working prototype, but there's still plenty of room to grow. To transform this from a simple RAG system into a high-performance engine, our next steps will focus on precision. We'll refine how we process and chunk documents and introduce a reranking layer to our search results to significantly boost the quality of the agent's responses.

### Additional learning

If you haven't used Agent Development Kit yet, but would like to learn more, check out this Codelab: "[ADK Crash Course - From Beginner to Expert](https://codelabs.developers.google.com/onramp/instructions#0?utm_campaign=CDR_0xe7f5807a_default_b479282946&utm_medium=external&utm_source=blog)" (it comes with a link to claim some free GCP credits to get you through the course).

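For readers who want to see the cosine similarity from the "Finding a match" section in code, here is a minimal sketch in plain Python. The helper names `cosine_similarity` and `top_k`, and the toy 3-dimensional vectors, are assumptions made up for illustration (real embeddings have far more dimensions, and a vector database does this search for you).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity to the query, highest first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for this example.
query = [1.0, 2.0, 0.5]
docs = {
    "long_manual": [10.0, 20.0, 5.0],  # same direction as the query, 10x the magnitude
    "partial": [1.0, 0.0, 0.0],        # overlaps with the query on one axis only
    "unrelated": [-2.0, 1.0, 0.0],     # orthogonal to the query
}

for name, score in top_k(query, docs, k=3):
    print(f"{name}: {score:.3f}")
```

The `long_manual` vector points in the same direction as the query, so it ranks first despite being ten times longer, which is the "angle, not magnitude" behaviour described in that section.
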
    Tags

    rag, genai, opensource, adk
