RAGAS Integration with LangGraph for Python RAG Pipeline

RAGAS (Retrieval-Augmented Generation Assessment) is an evaluation framework that measures RAG pipeline performance. Several of its metrics are reference-free, so they can score production traffic where no ground-truth answer exists. LangGraph is a state-based orchestration framework that structures AI workflows as directed graphs. Combining the two lets you build a complex RAG pipeline and evaluate it systematically.

Core Architecture and Integration Strategy

LangGraph manages your RAG workflow through state management, nodes (computation units), and edges (routing logic). The framework maintains a centralized GraphState object that flows through each node, enabling precise tracking of queries, retrieved documents, and generated answers. RAGAS evaluates the pipeline's outputs: state fields (or LangGraph message sequences, for multi-turn traces) are converted into RAGAS samples, and the quality metrics are computed on those samples.¹²³⁴

The integration follows this pattern: your LangGraph nodes handle retrieval and generation, while RAGAS metrics assess the outputs asynchronously without requiring ground truth data for online evaluation. This enables continuous evaluation of production traces.⁵

Key RAGAS Metrics for RAG Assessment

RAGAS provides metrics in three categories: retriever-focused, generator-focused, and agent-focused:⁶⁷

Retriever Metrics:

Context Precision: Measures whether retrieved contexts are ranked correctly (higher relevance first)
Context Recall: Determines if retrieved contexts contain all information needed to answer the question (requires ground truth)
Context Entities Recall: Evaluates entity-level retrieval accuracy
Noise Sensitivity: Measures how often the system produces incorrect claims when the retrieved documents contain relevant or irrelevant (noisy) content; lower scores are better

Generator Metrics:

Faithfulness: Checks if generated answers contain hallucinations or unsupported claims relative to retrieved context
Response Relevancy: Measures how relevant and on-topic the answer is to the question
Answer Correctness: Compares generated answers against reference answers

Agent/Tool Metrics:

Tool Call Accuracy: Evaluates whether the LLM correctly identifies and invokes required tools
Agent Goal Accuracy: Measures whether the LLM achieved the user's stated objective⁴
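
To make the scoring arithmetic concrete, the sketch below computes two of the metrics above from per-item verdicts that an LLM judge would normally produce. It assumes the commonly documented definitions: Context Precision as the mean of precision@k over the ranks that hold a relevant chunk, and Faithfulness as the fraction of answer claims supported by the retrieved context. The function names and the hard-coded verdict lists are illustrative only and are not part of the RAGAS API.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at a relevant rank
    return precision_sum / hits if hits else 0.0


def faithfulness(claim_supported: List[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judge verdicts for one query.
well_ranked = context_precision([True, True, False])    # relevant chunks first
badly_ranked = context_precision([False, True, True])   # relevant chunks last
answer_score = faithfulness([True, True, False, True])  # 3 of 4 claims supported

print(round(well_ranked, 3), round(badly_ranked, 3), answer_score)
```

The same two relevant chunks score 1.0 when ranked first and roughly 0.583 when ranked last, which is why Context Precision is sensitive to ordering rather than just to what was retrieved.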

Implementation Pattern: Converting LangGraph State to RAGAS Format

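RAGAS requires data in specific formats: SingleTurnSample for single-turn interactions or MultiTurnSample for multi-turn conversations. A SingleTurnSample is built directly from the query, retrieved contexts, and answer held in graph state, while LangGraph message sequences are converted with RAGAS's integration utilities before being passed to a MultiTurnSample.⁸⁹⁴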

from ragas.integrations.langgraph import convert_to_ragas_messages
from ragas.dataset_schema import SingleTurnSample, MultiTurnSample
from ragas.metrics import Faithfulness, ContextRecall

# After LangGraph node execution, convert messages
ragas_trace = convert_to_ragas_messages(
    messages=langgraph_result["messages"]
)

# Create evaluation sample
sample = SingleTurnSample(
    user_input=query,
    retrieved_contexts=contexts,
    response=answer,
    reference=ground_truth  # Optional for reference-free metrics
)

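# Score asynchronously (`await` is only valid inside a coroutine)
import asyncio

async def score_sample(sample):
    # llm_wrapper: the RAGAS-wrapped evaluator LLM (see "Initialization of RAGAS Metrics")
    scorer = Faithfulness(llm=llm_wrapper)
    return await scorer.single_turn_ascore(sample)

score = asyncio.run(score_sample(sample))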
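
A frequent stumbling block is that retrieved_contexts expects a list of plain strings, whereas LangGraph state typically holds Document objects. The helper below is a minimal sketch of that mapping. `state_to_sample_fields` and the `FakeDoc` stand-in are hypothetical names, and the state keys simply mirror the ones used in this guide.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def state_to_sample_fields(state: Dict[str, Any],
                           reference: Optional[str] = None) -> Dict[str, Any]:
    """Map pipeline state onto SingleTurnSample field names."""
    fields: Dict[str, Any] = {
        "user_input": state["question"],
        # Accept either Document-like objects or raw strings.
        "retrieved_contexts": [
            getattr(doc, "page_content", doc) for doc in state.get("documents", [])
        ],
        "response": state["generation"],
    }
    if reference is not None:
        fields["reference"] = reference  # only needed by reference-based metrics
    return fields


fields = state_to_sample_fields({
    "question": "What is RAG?",
    "documents": [FakeDoc("RAG combines retrieval with generation."), "raw text chunk"],
    "generation": "RAG augments an LLM with retrieved documents.",
})
print(fields["retrieved_contexts"])
```

The resulting dictionary could then be unpacked as `SingleTurnSample(**fields)`, since its keys match the field names used in the sample above.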

LangGraph Node Integration Pattern

LangGraph nodes should return state updates that RAGAS can consume. A typical RAG pipeline with evaluation looks like:¹⁰¹¹

from typing import TypedDict, List
from langchain_core.documents import Document
from langgraph.graph import StateGraph, START, END
from ragas.dataset_schema import SingleTurnSample

class GraphState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
    evaluation_scores: dict

def retrieval_node(state: GraphState):
    """Retrieve relevant documents"""
    documents = retriever.invoke(state["question"])
    return {"documents": documents}

def generation_node(state: GraphState):
    """Generate answer from retrieved documents"""
    context = "\n\n".join([d.page_content for d in state["documents"]])
    generation = rag_chain.invoke({
        "context": context,
        "question": state["question"]
    })
    return {"generation": generation}

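def evaluation_node(state: GraphState):
    """Evaluate the generated answer using RAGAS metrics"""
    # This state carries no "messages" key, so the sample is built directly
    # from the question, documents, and generation fields.
    sample = SingleTurnSample(
        user_input=state["question"],
        retrieved_contexts=[d.page_content for d in state["documents"]],
        response=state["generation"]
    )

    # Run evaluation metrics. `metrics` is the initialized list from
    # "Initialization of RAGAS Metrics"; reference-based metrics such as
    # ContextRecall also need `reference` set on the sample.
    # single_turn_score is the synchronous counterpart of single_turn_ascore,
    # which keeps this plain (non-async) node usable with graph.invoke().
    scores = {}
    for metric in metrics:
        scores[metric.name] = metric.single_turn_score(sample)

    return {"evaluation_scores": scores}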

# Build graph
workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieval_node)
workflow.add_node("generate", generation_node)
workflow.add_node("evaluate", evaluation_node)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", "evaluate")
workflow.add_edge("evaluate", END)

graph = workflow.compile()

Initialization of RAGAS Metrics

RAGAS metrics require LLM and embedding model initialization. Use LangchainLLMWrapper and LangchainEmbeddingsWrapper to integrate with LangChain-based models:¹²⁵

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import (
    Faithfulness, 
    ContextRecall, 
    ContextPrecision, 
    ResponseRelevancy
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.run_config import RunConfig

# Initialize LLMs and embeddings
evaluator_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
evaluator_embeddings = OpenAIEmbeddings()

# Wrap for RAGAS compatibility
llm_wrapper = LangchainLLMWrapper(evaluator_llm)
embeddings_wrapper = LangchainEmbeddingsWrapper(evaluator_embeddings)

# Initialize metrics
metrics = [
    Faithfulness(),
    ContextRecall(),
    ContextPrecision(),
    ResponseRelevancy()
]

# Configure metrics (one RunConfig shared by all of them)
run_config = RunConfig()
for metric in metrics:
    if hasattr(metric, 'llm'):
        metric.llm = llm_wrapper
    if hasattr(metric, 'embeddings'):
        metric.embeddings = embeddings_wrapper
    metric.init(run_config)

Evaluation Strategies: Offline vs. Online

Offline Evaluation involves batch evaluation on static datasets before deployment. This approach:¹³

Tests system changes before production impact
Runs comprehensive, computationally expensive metrics
Uses a "golden dataset" with known good answers and contexts
Allows regression testing against benchmarks
Cannot capture dynamic data drift or user behavior changes

Online Evaluation assesses live production traffic in real time. Benefits include:¹³

Measures actual user experience
Detects live performance degradation
Captures data drift and evolving query patterns
Enables A/B testing of system versions
Provides direct business impact metrics

For production RAG systems, integrate both: use offline evaluation to validate changes before deployment, then deploy with online monitoring to capture real-world performance.¹³
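
One way to act on this split is to decide per mode which metrics can run at all. The sketch below is a minimal illustration under the assumption that live traffic has no ground-truth answer. The metric labels are plain strings chosen for this example rather than imported RAGAS classes, and `select_metrics` is a hypothetical helper.

```python
from typing import Dict, List

# Whether each metric needs a reference (ground-truth) answer.
# Labels are for illustration only, not RAGAS identifiers.
METRIC_NEEDS_REFERENCE: Dict[str, bool] = {
    "faithfulness": False,
    "response_relevancy": False,
    "context_recall": True,
    "answer_correctness": True,
}


def select_metrics(mode: str) -> List[str]:
    """Return metric labels suitable for 'online' or 'offline' evaluation."""
    if mode == "online":
        # Live traffic has no ground truth, so keep reference-free metrics only.
        return [m for m, needs_ref in METRIC_NEEDS_REFERENCE.items() if not needs_ref]
    if mode == "offline":
        # Golden datasets carry references, so every metric is usable.
        return list(METRIC_NEEDS_REFERENCE)
    raise ValueError(f"unknown mode: {mode!r}")


online_metrics = select_metrics("online")
offline_metrics = select_metrics("offline")
print(online_metrics)
print(offline_metrics)
```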

Asynchronous Batch Evaluation

For production efficiency, RAGAS's evaluate() function scores an entire dataset in one call and runs the underlying metric calls concurrently:¹⁴

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

# Create evaluation dataset
dataset = EvaluationDataset(samples=[sample1, sample2, sample3])

# evaluate() is a regular (synchronous) function that schedules the
# per-sample metric coroutines internally, so it must not be awaited
results = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm_wrapper,
    embeddings=embeddings_wrapper
)

# Inspect per-sample scores as a DataFrame
print(results.to_pandas())
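
Before a batch run, production traces usually need to be sampled and trimmed to the fields the evaluation dataset expects. The sketch below shows one reproducible way to do that with the standard library. The trace layout (`question`, `contexts`, `answer`, `latency_ms`) and the `sample_traces` helper are assumptions made for this example.

```python
import random
from typing import Any, Dict, List


def sample_traces(traces: List[Dict[str, Any]], k: int, seed: int = 0) -> List[Dict[str, Any]]:
    """Pick up to k traces reproducibly and keep only the evaluation fields."""
    rng = random.Random(seed)  # fixed seed keeps reports comparable across runs
    chosen = rng.sample(traces, min(k, len(traces)))
    return [
        {
            "user_input": t["question"],
            "retrieved_contexts": list(t["contexts"]),
            "response": t["answer"],
        }
        for t in chosen
    ]


# Hypothetical production traces; real ones would come from your tracing store.
traces = [
    {"question": f"q{i}", "contexts": [f"ctx{i}"], "answer": f"a{i}", "latency_ms": 100 + i}
    for i in range(10)
]
records = sample_traces(traces, k=3)
print(len(records), sorted(records[0]))
```

Each record uses the same field names as SingleTurnSample, so a list like this can serve as the input for building the evaluation dataset.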

LangGraph Persistence for Evaluation State

LangGraph supports checkpointing to persist graph state across executions. This enables:¹⁵¹⁶

Memory: Access conversation history and prior states
Human-in-the-loop: Pause execution for validation before continuing
Fault tolerance: Resume from checkpoints after failures
Time travel: Inspect and replay past states

from langgraph.checkpoint.memory import InMemorySaver

# For production, use PostgresSaver or other durable checkpointers
checkpointer = InMemorySaver()

graph = workflow.compile(checkpointer=checkpointer)

# Invoke with thread_id for persistence
result = graph.invoke(
    {"question": "What is RAG?"},
    config={"configurable": {"thread_id": "user_123"}}
)

# Access checkpoint history
history = graph.get_state_history(
    config={"configurable": {"thread_id": "user_123"}}
)

Integration with LangSmith/Langfuse for Tracing

For comprehensive observability, integrate evaluation with tracing platforms:¹⁷⁵

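from langfuse import get_client
from langfuse.langchain import CallbackHandler

# Initialize the Langfuse client and the callback handler used for tracing
langfuse = get_client()
langfuse_handler = CallbackHandler()

# Invoke graph with tracing
result = graph.invoke(
    {"question": "What is RAG?"},
    config={"callbacks": [langfuse_handler]}
)

# RAGAS scores are not attached automatically; record each one explicitly.
# trace_id must be the ID of the trace created by the invocation above.
for metric_name, score in result["evaluation_scores"].items():
    langfuse.create_score(
        name=metric_name,
        value=float(score),
        trace_id=trace_id
    )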

Best Practices for Production Deployment

Use LangGraph for adaptive RAG architectures: Combine conditional routing with RAGAS evaluation to implement adaptive retrieval that adjusts depth and strategy based on query complexity. Route low-confidence retrievals to retry with modified queries, then evaluate results.¹⁸

Implement durable checkpointing: Use PostgresSaver or another production-grade checkpointer for persistent state management, enabling pause/resume functionality and process recovery.¹⁶

Minimize token costs: Run reference-free RAGAS metrics online for continuous monitoring, and reserve expensive ground-truth metrics for periodic offline batches.⁵

Establish evaluation data pipelines: Create feedback loops where insights from online evaluation (problematic queries, user feedback) augment offline golden datasets, ensuring evolving relevance.¹³

Monitor evaluation latency: RAGAS metrics may exceed timeout thresholds with local LLMs or network delays. Configure appropriate timeout windows for async scoring.¹⁹
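
As a sketch of the latency point above, a per-call timeout can be enforced with the standard library regardless of how the scorer is configured. The two scorer coroutines here are stand-ins for real metric calls, and `score_with_timeout` is a hypothetical helper rather than a RAGAS function.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def score_with_timeout(score_fn: Callable[[], Awaitable[float]],
                             timeout_s: float) -> Optional[float]:
    """Await a scoring coroutine, returning None if it exceeds timeout_s."""
    try:
        return await asyncio.wait_for(score_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # caller can log the miss and skip or retry later


async def fast_scorer() -> float:
    await asyncio.sleep(0.01)  # stands in for a quick judge call
    return 0.9


async def slow_scorer() -> float:
    await asyncio.sleep(1.0)  # stands in for a stalled local LLM
    return 0.5


async def main():
    fast = await score_with_timeout(fast_scorer, timeout_s=0.5)
    slow = await score_with_timeout(slow_scorer, timeout_s=0.05)
    return fast, slow


fast_result, slow_result = asyncio.run(main())
print(fast_result, slow_result)
```

Returning None instead of raising lets an online evaluation loop record the missing score and move on, so a slow judge does not block request handling.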

Common Integration Patterns

Self-correcting RAG: Use document grading nodes to evaluate retrieval quality, then conditionally route to query rewriting if scores fall below threshold. Evaluate final output with RAGAS metrics.²⁰

Multi-turn conversation evaluation: For chatbot applications, use MultiTurnSample and AgentGoalAccuracyWithReference to evaluate whether the agent achieves multi-step user objectives.⁴

Batch scoring for reporting: Periodically sample production traces, format as EvaluationDataset, and run offline batch evaluation to generate performance reports.⁵
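
The self-correcting pattern described above hinges on a routing decision made after retrieval is graded. The function below is a minimal sketch of that decision. The state keys (`retrieval_score`, `rewrite_count`), the node names, the threshold, and the retry budget are all assumptions chosen for illustration.

```python
from typing import Any, Dict

RELEVANCE_THRESHOLD = 0.7  # hypothetical cut-off; tune on your own data
MAX_REWRITES = 2           # hypothetical retry budget to avoid endless loops


def route_after_grading(state: Dict[str, Any]) -> str:
    """Return the name of the next node based on the retrieval grade."""
    score = state.get("retrieval_score", 0.0)
    rewrites = state.get("rewrite_count", 0)
    if score >= RELEVANCE_THRESHOLD:
        return "generate"
    if rewrites < MAX_REWRITES:
        return "rewrite_query"
    return "generate"  # budget exhausted: answer with what was retrieved


# With LangGraph this function would typically be registered via
# workflow.add_conditional_edges("grade_documents", route_after_grading),
# where "grade_documents" and "rewrite_query" are assumed node names.
decisions = [
    route_after_grading({"retrieval_score": 0.9, "rewrite_count": 0}),
    route_after_grading({"retrieval_score": 0.3, "rewrite_count": 0}),
    route_after_grading({"retrieval_score": 0.3, "rewrite_count": 2}),
]
print(decisions)
```

Capping the number of rewrites keeps the graph from cycling indefinitely on queries that the corpus cannot answer.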

This integrated approach provides systematic, measurable assessment of RAG pipeline quality while maintaining production performance through efficient asynchronous evaluation patterns.
