In the world of AI agents and Retrieval-Augmented Generation (Agentic RAG), one persistent challenge is how agents chunk source documents to optimize response accuracy and relevance. This blog series dives into how different chunking strategies (Fixed, Semantic, Agentic, and Recursive) impact the performance of Agentic RAG systems. Using Agno for agent creation and orchestration and Qdrant as the vector store, we evaluate chunking effectiveness with RAGAS and LlamaIndex, and visualize the results through an in-depth, metric-based analysis. This exploration not only reveals the nuances of each chunking approach but also sets the stage for optimizing retrieval pipelines in real-world applications.

The Architecture:

The architecture begins with the integration of Agno and Qdrant: Agno facilitates the agentic RAG workflow, while Qdrant serves as the vector database that stores the chunked document embeddings (the knowledge base, in agentic terms). Multiple chunking strategies are applied, namely Fixed, Semantic, Agentic, and Recursive, each providing a different granularity and contextual overlap for the input documents. These chunked inputs are then passed through Agno’s agentic pipelines to simulate a real-world retrieval and reasoning setup.

The next stage involves RAGAS and LlamaIndex, which play a pivotal role in evaluation. RAGAS helps create a gold-standard evaluation dataset, while LlamaIndex provides the LLM wrapper used by the RAGAS evaluation functions. Together, they generate performance metrics such as Context Recall, Faithfulness, and Answer Correctness for each chunking strategy. Finally, these insights are visualized in a custom ReactJS dashboard to compare how the different strategies influence the outcome of the Agentic RAG system. This pipeline offers a comprehensive lens into the trade-offs and efficiencies of the various chunking techniques.

The Implementation:

The project structure of the agentic RAG setup is shown below.

.
├── agentic_chunk_benchmarking.py
├── create_eval_dataset.py
├── data
│   ├── ground_truth.json
│   └── test_data.pdf
├── fixed_chunk_benchmarking.py
├── recursive_chunk_benchmarking.py
├── requirements.txt
└── semantic_chunk_benchmarking.py

The project’s requirements are listed below.

agno
python-dotenv
qdrant-client
anthropic
ollama

# pdf parser
pypdf

# chunking library
chonkie

# configuration
packaging
importlib-metadata

# evals
ragas
deepeval

# llama-index
llama-index
llama-index-llms-openai

The environment file, which holds the configuration and security credentials, is shown below.

collection_name=chunk_test
qdrant_url=http://localhost:6333
api_key=your key
ANTHROPIC_API_KEY=sk-ant-
OPENAI_API_KEY=sk-proj-
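
Before running any of the benchmarking scripts, it is worth confirming that these values actually reach the code and that the Qdrant instance is reachable. The snippet below is a minimal sanity check, not part of the repository; it assumes the .env file above sits next to the script and only reuses the qdrant_url and api_key entries shown there.

import os

from dotenv import load_dotenv, find_dotenv
from qdrant_client import QdrantClient

# load the .env file shown above
load_dotenv(find_dotenv())

# connect with the same url / api key the benchmarking scripts use
client = QdrantClient(url=os.environ.get("qdrant_url"), api_key=os.environ.get("api_key"))

# listing the collections confirms the connection works
print(client.get_collections())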

Fixed Chunking Strategy:

The code below evaluates a Retrieval-Augmented Generation (RAG) system using the RAGAS evaluation framework. It first sets up a knowledge base by loading a PDF document, chunking it into fixed-size segments, and storing these chunks in a Qdrant vector database with embeddings from an Ollama model. The system uses Claude 3.7 Sonnet as the language model to answer queries based on the knowledge base.

import os
from create_eval_dataset import create_eval_ds
from agno.agent import Agent
from agno.document.chunking.fixed import FixedSizeChunking
from agno.embedder.ollama import OllamaEmbedder
from agno.models.anthropic import Claude
from qdrant_client import qdrant_client
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.qdrant import Qdrant
from dotenv import load_dotenv, find_dotenv

from ragas.llms import LlamaIndexLLMWrapper
from ragas import EvaluationDataset, evaluate
from ragas.metrics import Faithfulness, FactualCorrectness, ContextRelevance, ContextUtilization, ContextRecall

from llama_index.llms.openai import OpenAI

eval_llm = OpenAI(model='gpt-4o')

load_dotenv(find_dotenv())

doc_path = "data/test_data.pdf"
ground_truth_path = "data/ground_truth.json"
chunk_size = 1000
chunk_overlap = 200

# initialize the response LLM (Claude 3.7 Sonnet)
claude = Claude(id="claude-3-7-sonnet-20250219")

# initialize the qdrant client.
q_client = qdrant_client.QdrantClient(url=os.environ.get('qdrant_url'), api_key=os.environ.get('api_key'))

# create the qdrant vector store instance
vector_db = Qdrant(
    collection=os.environ.get('collection_name'),
    url=os.environ.get('qdrant_url'),
    api_key=os.environ.get('api_key'),
    embedder=OllamaEmbedder(id="nomic-embed-text:latest", dimensions=768)
)

# configure the knowledge base
knowledge_base = PDFKnowledgeBase(vector_db=vector_db,
                                  path=doc_path,
                                  chunking_strategy=FixedSizeChunking(
                                      chunk_size=chunk_size,
                                      overlap=chunk_overlap)
                                  )

if not q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    knowledge_base.load(recreate=False)

# initialize agent
agent = Agent(knowledge=knowledge_base, search_knowledge=True, model=claude)

# create the dataset for evaluation
eval_dataset = create_eval_ds(agent=agent, ground_truth_path=ground_truth_path)

# trigger evals
evaluation_dataset = EvaluationDataset.from_list(eval_dataset)
evaluator_llm = LlamaIndexLLMWrapper(llm=eval_llm)
result = evaluate(dataset=evaluation_dataset, llm=evaluator_llm,
                  metrics=[Faithfulness(), ContextRelevance(),
                           ContextUtilization(), ContextRecall(),
                           FactualCorrectness()])

for score in result.scores:
    print(score)

# destroy the collection
if q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    q_client.delete_collection(collection_name=os.environ.get('collection_name'))

The evaluation process starts by creating a dataset from ground truth data, then uses GPT-4o as the evaluator model to assess the RAG system’s performance across multiple metrics including Faithfulness, Context Relevance, Context Utilization, Context Recall, and Factual Correctness. These metrics measure how well the system’s responses align with the provided context and factual accuracy.

After running the evaluation and printing the scores for each metric, the code cleans up by deleting the Qdrant collection. This comprehensive pipeline demonstrates a typical RAG system evaluation workflow, where document knowledge is transformed into vector embeddings, queries are processed against this knowledge, and the quality of responses is systematically measured.
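
Beyond printing the raw per-question scores, the RAGAS result object can also be converted into a dataframe, which makes it easier to average metrics and compare strategies later. A minimal sketch, assuming pandas is installed and continuing from the result variable in the script above:

# continue from the `result` returned by evaluate() above
df = result.to_pandas()

# label the run and average each metric across all evaluation questions
df["strategy"] = "fixed"
print(df.mean(numeric_only=True))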

Semantic Chunking Strategy:

The code below evaluates a RAG system using semantic chunking rather than fixed-size chunking. It sets up a knowledge base by loading a PDF document, splitting it into chunks based on semantic similarity with a threshold of 0.6, and storing them in a Qdrant vector database using Ollama embeddings. The system leverages Claude 3.5 Sonnet as its language model to process queries against the knowledge base.

import os

from agno.agent import Agent
from agno.document.chunking.semantic import SemanticChunking
from agno.embedder.ollama import OllamaEmbedder
from agno.models.anthropic import Claude
from qdrant_client import qdrant_client
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.qdrant import Qdrant
from dotenv import load_dotenv, find_dotenv

from ragas.llms import LlamaIndexLLMWrapper
from ragas import EvaluationDataset, evaluate
from ragas.metrics import Faithfulness, FactualCorrectness, ContextRelevance, ContextUtilization, ContextRecall

from llama_index.llms.openai import OpenAI

from create_eval_dataset import create_eval_ds

eval_llm = OpenAI(model='gpt-4o')

load_dotenv(find_dotenv())

doc_path = "data/test_data.pdf"
ground_truth_path = "data/ground_truth.json"
chunk_size = 1000

# initialize the response LLM (Claude 3.5 Sonnet)
claude = Claude(id="claude-3-5-sonnet-20241022")

# initialize the qdrant client.
q_client = qdrant_client.QdrantClient(url=os.environ.get('qdrant_url'), api_key=os.environ.get('api_key'))

# create the qdrant vector store instance
vector_db = Qdrant(
    collection=os.environ.get('collection_name'),
    url=os.environ.get('qdrant_url'),
    api_key=os.environ.get('api_key'),
    embedder=OllamaEmbedder(id="nomic-embed-text:latest", dimensions=768)
)

# configure the knowledge base
knowledge_base = PDFKnowledgeBase(vector_db=vector_db,
                                  path=doc_path,
                                  chunking_strategy=SemanticChunking(
                                      chunk_size=chunk_size,
                                      similarity_threshold=0.6)
                                  )

if not q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    knowledge_base.load(recreate=False)

# initialize agent
agent = Agent(knowledge=knowledge_base, search_knowledge=True, model=claude)

# create the dataset for evaluation
eval_dataset = create_eval_ds(agent=agent, ground_truth_path=ground_truth_path)

evaluation_dataset = EvaluationDataset.from_list(eval_dataset)
evaluator_llm = LlamaIndexLLMWrapper(llm=eval_llm)
result = evaluate(dataset=evaluation_dataset, llm=evaluator_llm,
                  metrics=[Faithfulness(), ContextRelevance(),
                           ContextUtilization(), ContextRecall(),
                           FactualCorrectness()])

for score in result.scores:
    print(score)

# destroy the collection
if q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    q_client.delete_collection(collection_name=os.environ.get('collection_name'))

The evaluation workflow creates a test dataset from ground truth data, then employs GPT-4o as the evaluator to assess the system across five metrics: Faithfulness, Context Relevance, Context Utilization, Context Recall, and Factual Correctness. These metrics comprehensively measure how well the system’s responses align with the provided context and maintain factual accuracy.

After running the evaluation and printing results for each metric, the code cleans up by removing the Qdrant collection. This demonstrates an advanced RAG pipeline that uses semantic understanding rather than arbitrary character counts to determine document chunks, potentially improving contextual relevance in the retrieval process while maintaining the same rigorous evaluation methodology.

Agentic Chunking Strategy:

The code below evaluates a RAG system using agentic chunking, an advanced method that employs Claude 3.7 Sonnet to intelligently segment documents based on content understanding rather than arbitrary divisions. It processes a PDF document and stores the chunks in a Qdrant vector database with embeddings from Ollama’s nomic-embed-text model. The system uses the same Claude model both for the chunking process and for answering queries against the knowledge base.

import os

from agno.agent import Agent
from agno.document.chunking.agentic import AgenticChunking
from agno.embedder.ollama import OllamaEmbedder
from agno.models.anthropic import Claude
from qdrant_client import qdrant_client
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.qdrant import Qdrant
from dotenv import load_dotenv, find_dotenv

from ragas.llms import LlamaIndexLLMWrapper
from ragas import EvaluationDataset, evaluate
from ragas.metrics import Faithfulness, FactualCorrectness, ContextRelevance, ContextUtilization, ContextRecall

from llama_index.llms.openai import OpenAI

from create_eval_dataset import create_eval_ds

eval_llm = OpenAI(model='gpt-4o')

load_dotenv(find_dotenv())

doc_path = "data/test_data.pdf"
ground_truth_path = "data/ground_truth.json"
chunk_size = 1000

# initialize the response LLM (Claude 3.7 Sonnet)
claude = Claude(id="claude-3-7-sonnet-20250219")

# initialize the qdrant client.
q_client = qdrant_client.QdrantClient(url=os.environ.get('qdrant_url'), api_key=os.environ.get('api_key'))

# create the qdrant vector store instance
vector_db = Qdrant(
    collection=os.environ.get('collection_name'),
    url=os.environ.get('qdrant_url'),
    api_key=os.environ.get('api_key'),
    embedder=OllamaEmbedder(id="nomic-embed-text:latest", dimensions=768)
)

# configure the knowledge base
knowledge_base = PDFKnowledgeBase(vector_db=vector_db,
                                  path=doc_path,
                                  chunking_strategy=AgenticChunking(
                                      model=claude,
                                      max_chunk_size=chunk_size)
                                  )

if not q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    knowledge_base.load(recreate=False)

# initialize agent
agent = Agent(knowledge=knowledge_base, search_knowledge=True, model=claude)

# create the dataset for evaluation
eval_dataset = create_eval_ds(agent=agent, ground_truth_path=ground_truth_path)

evaluation_dataset = EvaluationDataset.from_list(eval_dataset)
evaluator_llm = LlamaIndexLLMWrapper(llm=eval_llm)
result = evaluate(dataset=evaluation_dataset, llm=evaluator_llm,
                  metrics=[Faithfulness(), ContextRelevance(),
                           ContextUtilization(), ContextRecall(),
                           FactualCorrectness()])

for score in result.scores:
    print(score)

# destroy the collection
if q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    q_client.delete_collection(collection_name=os.environ.get('collection_name'))

The evaluation framework utilizes RAGAS with GPT-4o as the evaluator model to assess the system’s performance across five key metrics: Faithfulness, Context Relevance, Context Utilization, Context Recall, and Factual Correctness. This comprehensive evaluation examines how well the system’s responses align with the retrieved context and maintain factual accuracy.

After completing the evaluation and outputting the scores, the code cleans up by removing the Qdrant collection. This pipeline represents a sophisticated approach to RAG systems that leverages AI to make intelligent decisions about document segmentation, potentially preserving semantic coherence in chunks while ensuring they remain under a maximum size of 1000 tokens.

Recursive Chunking Strategy:

The code below evaluates a RAG system using recursive chunking, which hierarchically splits documents into progressively smaller segments based on document structure. It processes a PDF document and divides it into chunks of 1000 tokens with 200 tokens of overlap, then stores these in a Qdrant vector database with embeddings from Ollama’s nomic-embed-text model. The system employs Claude 3.7 Sonnet as the language model to answer queries using this knowledge base.

import os

from agno.agent import Agent
from agno.document.chunking.recursive import RecursiveChunking
from agno.embedder.ollama import OllamaEmbedder
from agno.models.anthropic import Claude
from qdrant_client import qdrant_client
from agno.knowledge.pdf import PDFKnowledgeBase
from agno.vectordb.qdrant import Qdrant
from dotenv import load_dotenv, find_dotenv

from ragas.llms import LlamaIndexLLMWrapper
from ragas import EvaluationDataset, evaluate
from ragas.metrics import Faithfulness, FactualCorrectness, ContextRelevance, ContextUtilization, ContextRecall

from llama_index.llms.openai import OpenAI

from create_eval_dataset import create_eval_ds

eval_llm = OpenAI(model='gpt-4o')

load_dotenv(find_dotenv())

doc_path = "data/test_data.pdf"
ground_truth_path = "data/ground_truth.json"
chunk_size = 1000
chunk_overlap = 200

# initialize the response LLM (Claude 3.7 Sonnet)
claude = Claude(id="claude-3-7-sonnet-20250219")

# initialize the qdrant client.
q_client = qdrant_client.QdrantClient(url=os.environ.get('qdrant_url'), api_key=os.environ.get('api_key'))

# create the qdrant vector store instance
vector_db = Qdrant(
    collection=os.environ.get('collection_name'),
    url=os.environ.get('qdrant_url'),
    api_key=os.environ.get('api_key'),
    embedder=OllamaEmbedder(id="nomic-embed-text:latest", dimensions=768)
)

# configure the knowledge base
knowledge_base = PDFKnowledgeBase(vector_db=vector_db,
                                  path=doc_path,
                                  chunking_strategy=RecursiveChunking(
                                      chunk_size=chunk_size,
                                      overlap=chunk_overlap)
                                  )

if not q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    knowledge_base.load(recreate=False)

# initialize agent
agent = Agent(knowledge=knowledge_base, search_knowledge=True, model=claude)

# create the dataset for evaluation
eval_dataset = create_eval_ds(agent=agent, ground_truth_path=ground_truth_path)

evaluation_dataset = EvaluationDataset.from_list(eval_dataset)
evaluator_llm = LlamaIndexLLMWrapper(llm=eval_llm)
result = evaluate(dataset=evaluation_dataset, llm=evaluator_llm,
                  metrics=[Faithfulness(), ContextRelevance(),
                           ContextUtilization(), ContextRecall(),
                           FactualCorrectness()])

for score in result.scores:
    print(score)

# destroy the collection
if q_client.collection_exists(collection_name=os.environ.get('collection_name')):
    q_client.delete_collection(collection_name=os.environ.get('collection_name'))

The evaluation process creates a test dataset from ground truth data, then uses GPT-4o as the evaluator to assess five critical metrics: Faithfulness, Context Relevance, Context Utilization, Context Recall, and Factual Correctness. These metrics comprehensively measure how well the system’s responses align with the provided context and factual accuracy.

After evaluation and printing the scores, the code cleans up by removing the Qdrant collection. This approach leverages recursive chunking’s ability to preserve hierarchical document structure, potentially improving information retrieval by maintaining the relationship between sections, subsections, and paragraphs while evaluating the system with the same rigorous methodology used in previous implementations.

Creating Evals Dataset:

The code below creates an evaluation dataset for testing a RAG system by processing ground truth data and comparing it to agent-generated responses. It reads a JSON file containing questions and expected answers, then iterates through each item, submitting the question to the agent and capturing its response along with the retrieved contextual documents.

For each question-answer pair, the function constructs a response object containing the original question, the ground truth reference answer, the agent’s actual response, and the set of contexts retrieved by the agent to generate its answer. These contexts are extracted from the reference metadata in the agent’s response, providing insight into which documents the system used to formulate its answer.
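
The contents of ground_truth.json are not reproduced in this post, but given the keys the function reads (question and answer), a minimal file would look roughly like the following, with placeholder questions standing in for the real ones.

[
  {
    "question": "What is the document about?",
    "answer": "The expected reference answer to the first question."
  },
  {
    "question": "Which approach does the document recommend?",
    "answer": "The expected reference answer to the second question."
  }
]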

import json
from time import sleep

def create_eval_ds(agent, ground_truth_path):
    # collect one evaluation record per ground-truth question
    test_data = []

    with open(ground_truth_path, 'r') as f:
        data = json.load(f)

        for obj in data:
            print(f'question:{obj["question"]}')
            resp = {'user_input': obj["question"], 'reference': obj["answer"]}
            # trigger the agent
            response = agent.run(obj["question"], markdown=True)

            relevant_docs = []
            for reference in response.extra_data.references:
                relevant_docs.extend([item['content'] for item in reference.references])

            resp['retrieved_contexts'] = relevant_docs
            resp['response'] = response.content

            test_data.append(resp)
            sleep(60)
    return test_data

The function includes a 60-second delay between questions to prevent rate limiting or to allow time for processing. The resulting dataset is structured as a list of dictionaries, each containing the question, reference answer, agent response, and retrieved contexts, which can then be used by evaluation frameworks like RAGAS to measure the performance of the RAG system across various metrics.
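
Each entry appended to test_data therefore carries the field names that RAGAS expects when building the dataset via EvaluationDataset.from_list. A single, purely illustrative record would look roughly like this:

record = {
    "user_input": "What is the document about?",
    "reference": "The expected answer taken from ground_truth.json.",
    "response": "The answer actually produced by the agent.",
    "retrieved_contexts": ["a chunk retrieved from Qdrant", "another retrieved chunk"],
}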

The Execution and Result:

Once the scripts are run, we get the different evals powered by RAGAS, with LlamaIndex wrapping the LLM used to evaluate our eval dataset. The output of one run is shown below.

{'faithfulness': 1.0, 'nv_context_relevance': 1.0, 'context_utilization': 0.7408430916026282, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.46)}
{'faithfulness': 1.0, 'nv_context_relevance': 1.0, 'context_utilization': 0.46957671956889324, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.59)}
{'faithfulness': 1.0, 'nv_context_relevance': 1.0, 'context_utilization': 0.8333333332916666, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.25)}
{'faithfulness': 0.9411764705882353, 'nv_context_relevance': 1.0, 'context_utilization': 0.3652996309200647, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.0)}
{'faithfulness': 0.9047619047619048, 'nv_context_relevance': 1.0, 'context_utilization': 0.5833333333041666, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.9)}
{'faithfulness': 0.8, 'nv_context_relevance': 1.0, 'context_utilization': 0.249999999975, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.15)}
{'faithfulness': 1.0, 'nv_context_relevance': 1.0, 'context_utilization': 0.99999999998, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.7)}
{'faithfulness': 1.0, 'nv_context_relevance': 1.0, 'context_utilization': 0.5833333333041666, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.25)}
{'faithfulness': 0.9166666666666666, 'nv_context_relevance': 1.0, 'context_utilization': 0.9999999999, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.17)}
{'faithfulness': 0.9090909090909091, 'nv_context_relevance': 1.0, 'context_utilization': 0.8666666666377778, 'context_recall': 1.0, 'factual_correctness(mode=f1)': np.float64(0.31)}

What Next?

We will see how to build these dashboards and ingest the data into them so that we can see a comparative analysis of the different experiments.
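
Until then, a simple way to feed such a dashboard is to average each run’s per-question scores and dump them to a JSON file the ReactJS app can load. The sketch below is one possible approach rather than code from the repository; the dashboard_data.json file name is an assumption, and result.scores is the list of per-question dicts shown in the output above, continuing from any of the benchmarking scripts.

import json
from statistics import mean

def summarize(scores, strategy):
    # average every metric across the per-question score dicts
    metrics = scores[0].keys()
    return {"strategy": strategy, **{m: mean(float(s[m]) for s in scores) for m in metrics}}

# `result.scores` comes from one of the benchmarking scripts above
summary = summarize(list(result.scores), strategy="fixed")

with open("dashboard_data.json", "w") as f:
    json.dump([summary], f, indent=2)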

Why don’t we use MLOps platforms?

We can certainly use MLOps platforms, but they are far more advanced than what we need here; just to visualize the performance numbers, we don’t need an entire platform. That said, we will cover MLOps platforms, especially MLflow, in upcoming blogs to see how to leverage MLflow for a GenAI project.

The Conclusion:

Our evaluation of chunking strategies demonstrates that document segmentation significantly impacts RAG system performance across all RAGAS metrics. While Fixed-Size chunking provides a baseline, Semantic and Agentic approaches better preserve contextual integrity by respecting natural information boundaries. Notably, Agentic chunking with Claude 3.7 Sonnet showed superior Context Relevance and Factual Correctness by leveraging the LLM’s understanding of document structure, while Recursive chunking excelled for hierarchical documents. These findings emphasize that chunking isn’t merely a technical detail but a fundamental architectural decision that should be tailored to specific document types and use cases to build more accurate and trustworthy RAG systems.