
📈 RAG, Multi-Agent Graphs & Telemetry Cheat Sheet

A quick-reference guide to semantic chunking, parent-child re-ranking pipelines, LangGraph state machines, OpenTelemetry trace instrumentation, and LLM-as-a-Judge evaluation metrics.


🗂️ Advanced RAG & Retrieval

1. Semantic Chunking & Splits

Instead of splitting text at fixed character boundaries, split it wherever the cosine similarity between the embeddings of adjacent sentences drops below a threshold:

python
# Conceptual Python sketch of semantic chunking
import numpy as np

def split_semantically(sentences: list[str], embeddings: list[np.ndarray], threshold: float) -> list[str]:
    chunks = []
    current_chunk = []

    for i, sentence in enumerate(sentences):
        current_chunk.append(sentence)
        if i == len(sentences) - 1:
            break  # The last sentence has no successor to compare against
        # Cosine similarity between sentence i and sentence i+1
        similarity = np.dot(embeddings[i], embeddings[i+1]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1]))

        if similarity < threshold:  # Similarity dropped = new topic started
            chunks.append(" ".join(current_chunk))
            current_chunk = []

    # Flush the final chunk, which always contains the last sentence
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
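
The threshold is the hardest parameter to pick, because raw cosine similarities vary from one embedding model to another. One common heuristic, sketched here as an assumption rather than a prescribed method, is to derive it from a percentile of the adjacent-sentence similarities in the document itself. The helper names (`adjacent_similarities`, `percentile_threshold`) and the toy 2-D vectors are hypothetical; the sketch uses only the standard library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def adjacent_similarities(embeddings: list[list[float]]) -> list[float]:
    """Cosine similarity between each sentence and the one that follows it."""
    return [
        cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]

def percentile_threshold(sims: list[float], percentile: float) -> float:
    """Nearest-rank percentile of the similarities.

    Gaps strictly below the returned value are treated as topic breaks.
    """
    if not sims:
        raise ValueError("need at least two sentences")
    ordered = sorted(sims)
    rank = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    return ordered[rank]

# Toy embeddings: three sentences on one topic, then two on another
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.2, 0.9]]
sims = adjacent_similarities(toy_embeddings)
threshold = percentile_threshold(sims, percentile=50.0)
break_points = [i for i, s in enumerate(sims) if s < threshold]
```

Here the only gap below the threshold is the one between the third and fourth sentences, so the text splits once. A lower percentile yields fewer, larger chunks.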

2. Bi-Encoder vs. Cross-Encoder Retrieval

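Combine a bi-encoder with a cross-encoder to get retrieval that is both fast and precise: the bi-encoder narrows the corpus to a short candidate list, and the cross-encoder re-ranks that list.

  • Bi-Encoder: Encodes the query and each document independently into the same vector space, so document embeddings can be precomputed and indexed. With an approximate nearest-neighbor index, search is roughly sublinear in corpus size (exact search is $O(N)$), but the model never sees the query and document together, so it misses fine-grained query-document interactions.
  • Cross-Encoder: Feeds the query and a candidate document together through a transformer's self-attention layers and outputs a single relevance score. This is more precise but costs one forward pass per query-document pair, so it is best suited to re-ranking a small candidate set.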
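
The two-stage flow can be sketched without any model dependencies. In the sketch below, the document vectors are hypothetical precomputed embeddings, and `toy_pair_scorer` is a deliberately crude stand-in for a real cross-encoder (it only counts shared terms); in practice a trained model would fill that role. All names are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def bi_encoder_retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: rank every document by similarity of independently computed vectors."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:top_k]]

def rerank(query, candidates, docs, score_pair, top_n):
    """Stage 2: score each (query, document) pair jointly and keep the best top_n."""
    ordered = sorted(candidates, key=lambda d: score_pair(query, docs[d]), reverse=True)
    return ordered[:top_n]

def toy_pair_scorer(query: str, document: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query terms in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

docs = {
    "a": "reset your password from the login page",
    "b": "password rules for new accounts",
    "c": "quarterly revenue report",
}
doc_vecs = {"a": [0.8, 0.6], "b": [0.9, 0.4], "c": [0.0, 1.0]}  # assumed precomputed

query, query_vec = "reset password", [1.0, 0.2]
candidates = bi_encoder_retrieve(query_vec, doc_vecs, top_k=2)
final = rerank(query, candidates, docs, toy_pair_scorer, top_n=1)
```

The first stage ranks document `b` above `a` on vector similarity alone; the pairwise scorer then promotes `a` because it matches both query terms. The expensive scorer only ever sees `top_k` documents, never the whole corpus.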

🧠 LangGraph Multi-Agent Topologies

Define agents as state-machine graphs. Below is the core template for a Supervisor-Worker Topology:

python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# 1. Define shared state dictionary
class AgentState(TypedDict):
    task: str
    research_notes: str
    draft: str
    iterations: int
    next_step: str  # Routing key written by the supervisor; must be declared in the state schema

# 2. Define node execution functions (each returns a partial state update)
def supervisor_node(state: AgentState):
    print("Supervisor evaluating task...")
    if not state.get("research_notes"):
        return {"next_step": "researcher"}
    return {"next_step": "writer"}

def researcher_node(state: AgentState):
    notes = "Found 3 corporate database records matching query."
    return {"research_notes": notes}

def writer_node(state: AgentState):
    draft = f"Draft report based on: {state['research_notes']}"
    return {"draft": draft}

# 3. Create the state graph builder
workflow = StateGraph(AgentState)

# 4. Add nodes to graph
workflow.add_node("supervisor", supervisor_node)
workflow.add_node("researcher", researcher_node)
workflow.add_node("writer", writer_node)

# 5. Define edges and conditional routing rules
workflow.add_edge(START, "supervisor")
workflow.add_conditional_edges(
    "supervisor",
    lambda state: state.get("next_step"),  # Routing function: returns a key of the mapping below
    {
        "researcher": "researcher",
        "writer": "writer"
    }
)
workflow.add_edge("researcher", "supervisor")  # Cycle back to supervisor
workflow.add_edge("writer", END)

# 6. Compile executable graph runtime
app = workflow.compile()
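
The researcher-to-supervisor cycle only terminates if the researcher eventually produces notes. The following is a library-free simulation of the same routing logic, not LangGraph's API, showing one way the `iterations` field could serve as a loop guard. `MAX_ITERATIONS`, `run_supervisor_loop`, and `failing_researcher` are hypothetical names introduced for illustration.

```python
from typing import Callable, TypedDict

class AgentState(TypedDict, total=False):
    task: str
    research_notes: str
    draft: str
    iterations: int
    next_step: str

MAX_ITERATIONS = 3  # Hypothetical cap on research attempts

def supervisor(state: AgentState) -> dict:
    attempts = state.get("iterations", 0)
    if attempts >= MAX_ITERATIONS:
        return {"next_step": "writer"}  # Stop researching and write with what we have
    if not state.get("research_notes"):
        return {"next_step": "researcher", "iterations": attempts + 1}
    return {"next_step": "writer"}

def failing_researcher(state: AgentState) -> dict:
    return {"research_notes": ""}  # Simulates a worker that never finds anything

def writer(state: AgentState) -> dict:
    notes = state.get("research_notes") or "no notes found"
    return {"draft": f"Draft report based on: {notes}"}

def run_supervisor_loop(initial: AgentState, researcher: Callable[[AgentState], dict]) -> dict:
    """Apply partial updates node by node until the writer runs."""
    state = dict(initial)
    while True:
        state.update(supervisor(state))
        if state["next_step"] == "researcher":
            state.update(researcher(state))
        else:
            state.update(writer(state))
            return state

final_state = run_supervisor_loop({"task": "summarize Q3"}, failing_researcher)
```

With the failing researcher, the supervisor retries three times and then routes to the writer instead of looping forever. The same guard could be placed in `supervisor_node` of the graph above.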

📊 Distributed Telemetry & Tracing Spans

Instrument your code to capture nested agent runs conforming to OpenTelemetry (OTel) standards:

python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# 1. Initialize System Tracer Provider
provider = TracerProvider()
# ConsoleSpanExporter prints finished spans to stdout; swap the exporter to send spans to a backend such as Langfuse
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-observability")

# 2. Instrument nested execution spans
def run_agentic_pipeline(user_query: str):
    with tracer.start_as_current_span("parent_agent_run") as parent_span:
        parent_span.set_attribute("query", user_query)

        # Nested Span 1: Database Memory Search (child of parent_agent_run)
        with tracer.start_as_current_span("vector_memory_search") as db_span:
            db_span.set_attribute("vector_dimensions", 1536)
            time.sleep(0.5)  # Simulate database query latency
            db_span.add_event("Memories fetched successfully.")

        # Nested Span 2: LLM Inference call (sibling of the search span)
        with tracer.start_as_current_span("llm_generation") as llm_span:
            llm_span.set_attribute("model_name", "gemini-1.5-flash")
            time.sleep(1.2)  # Simulate API latency
            llm_span.set_attribute("tokens_generated", 256)

🏆 LLM-as-a-Judge Evaluation Prompts

Run automated evaluation audits inside your CI/CD pipelines to quantitatively score agent behaviors:

1. Faithfulness Score (Detecting Hallucinations)

text
SYSTEM: You are a strict quantitative audit judge.
Your task is to evaluate if the GENERATED ANSWER is fully grounded in the provided CONTEXT.

CONTEXT:
{retrieved_context}

GENERATED ANSWER:
{agent_output}

Output a single JSON object containing:
- "verdict": "YES" if the answer contains only facts directly supported by the context, otherwise "NO".
- "hallucinated_sentences": A list of strings containing sentences in the answer that are not supported.
- "faithfulness_score": A float rating from 0.0 (fully hallucinated) to 1.0 (fully grounded).
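
Judge models do not always return clean JSON, so it helps to validate the reply before trusting the score. This is a minimal sketch under the assumption that the reply contains one JSON object, possibly surrounded by extra text; `parse_faithfulness` is a hypothetical helper, not part of any evaluation library.

```python
import json

def parse_faithfulness(raw_reply: str) -> dict:
    """Extract and validate the judge's JSON object; raise ValueError if malformed."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw_reply[start : end + 1])  # JSONDecodeError is a ValueError

    if data.get("verdict") not in ("YES", "NO"):
        raise ValueError("verdict must be 'YES' or 'NO'")
    if not isinstance(data.get("hallucinated_sentences"), list):
        raise ValueError("hallucinated_sentences must be a list")
    score = data.get("faithfulness_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("faithfulness_score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("faithfulness_score must be within [0.0, 1.0]")
    return data

sample_reply = (
    'Here is my evaluation: {"verdict": "NO", '
    '"hallucinated_sentences": ["The product launched in 2019."], '
    '"faithfulness_score": 0.5}'
)
parsed = parse_faithfulness(sample_reply)
```

Treating a malformed reply as an error, rather than as a score of zero, keeps judge failures from being confused with agent failures.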

2. Answer Relevancy Score (Directness Check)

text
SYSTEM: You are a strict semantic evaluator.
Your task is to grade if the GENERATED ANSWER directly addresses the USER QUERY.

USER QUERY:
{user_query}

GENERATED ANSWER:
{agent_output}

Output a single JSON object containing:
- "relevancy_score": A float rating from 0.0 (completely off-topic) to 1.0 (perfectly addresses query).
- "missing_information": A list of points requested by the query but omitted in the answer.
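
To use these scores as a CI/CD gate, one option is to average each metric over a fixed evaluation set and fail the build when any average falls below a minimum. The sketch below assumes the scores have already been collected from the judge; the function name `evaluate_gate` and the threshold values are hypothetical and would need tuning per project.

```python
from statistics import mean

def evaluate_gate(scores: dict[str, list[float]], minimums: dict[str, float]) -> list[str]:
    """Return one failure message per metric below its minimum; empty means pass."""
    failures = []
    for metric, minimum in minimums.items():
        values = scores.get(metric, [])
        if not values:
            failures.append(f"{metric}: no scores collected")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{metric}: mean {average:.2f} is below minimum {minimum:.2f}")
    return failures

collected = {
    "faithfulness_score": [0.9, 1.0, 0.8],
    "relevancy_score": [0.6, 0.7],
}
minimums = {"faithfulness_score": 0.85, "relevancy_score": 0.75}  # Assumed thresholds
failures = evaluate_gate(collected, minimums)
```

In a pipeline, a non-empty `failures` list would typically be printed and followed by a non-zero exit code so the build is marked as failed.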