🐳 Module 03: Local-First Lightweight Models & Serving

Welcome to Module 03. In this section, you will master the "Physics of Inference." You will learn to deploy lightweight, high-reasoning models (Gemma 2 2B, Llama 3.2) locally, moving beyond basic execution to understand VRAM geometry, quantization math, and high-concurrency serving architectures.

🏛️ 1. Architectural Deep Dive: VRAM Geometry & The KV-Cache Tax

To run models locally, you must move past "model size" and understand the two primary physical constraints: Static Weights and Dynamic Memory (KV-Cache).

The Static Weight Math

A model's memory footprint is determined by its parameter count and bit-precision.

Formula: Parameters (B) * (Bits / 8) = VRAM (GB)
Example: A Gemma 2 2B model at 4-bit quantization (Q4_K_M) consumes: 2.6B * (4 / 8) = 1.3GB for the weights alone.

The dynamic Bottleneck: KV-Cache

Inference is not just about weights. As an agent generates tokens, it must store "Key-Value" pairs for every previous token in the context window to avoid re-calculating them. This is the KV-Cache.

The Tax: Large context windows (e.g., 128k) can consume more VRAM than the model weights themselves.
The Fragmentation Problem: Traditional serving engines allocate contiguous memory blocks for the KV-Cache. If an agent generates 500 tokens but the engine reserved 4000, that "internal fragmentation" wastes VRAM, preventing you from running multiple agents concurrently.

📊 2. Structured Tradeoff Matrix: Local Serving Engines

Engine	Optimization	Primary Use Case	Scaling Type	Primary Production Bottleneck
Ollama	User Experience	Rapid prototyping, desktop agents.	Vertical (Single User)	Higher memory overhead per instance.
llama.cpp	CPU/Metal Portability	Edge devices, laptops without NVIDIA GPUs.	Hybrid (CPU/GPU Offloading)	Slow inference on large batches.
vLLM	Throughput	Production multi-agent gateways, APIs.	Horizontal (Batching)	High "Idle" VRAM reservation.
TGI	Reliability	Enterprise-grade deployment.	Distributed	Strict Docker dependency requirements.

🛠️ 3. Step-by-Step Mechanics Breakdown

Pattern: PagedAttention (vLLM)

In Lab 2, we utilize vLLM. Unlike standard engines, vLLM implements PagedAttention, inspired by virtual memory in operating systems.

Logical Mapping: It breaks the KV-cache into small, non-contiguous physical "pages."
Zero-Waste Allocation: It only maps a new page when the agent needs more context memory.
Result: This allows you to serve 5x more concurrent agents on the same hardware compared to standard Hugging Face implementations.

Pattern: GGUF & MMap (llama.cpp)

In Lab 1, we use llama.cpp's GGUF format.

Rationale: GGUF uses mmap (memory-mapping). Instead of the OS loading the entire model into RAM at once, it maps the model file directly to the virtual address space.
The Benefit: The OS manages the "loading" and "unloading" of weight blocks as needed, allowing you to run models that are slightly larger than your physical VRAM by "spilling over" into system RAM.

🛡️ 4. Failure Mode Analysis: Local Infrastructure Breaking Points

Failure Mode	Error/Log Signature	Root Cause	Code-Level Mitigation
CUDA OOM	`torch.cuda.OutOfMemoryError`	Combined Weights + KV-Cache > VRAM.	Reduce `max_model_len` or increase quantization level.
Context Window Truncation	`Token limit reached` (Silent error)	Prompt + Generation exceeds context max.	Implement sliding-window context management.
Model Spilling	Massive drop in Tokens/Sec (e.g. 50 ➡️ 2)	OS moved weights from GPU to System RAM.	Use `n_gpu_layers` to pin critical layers to VRAM.
Socket Hang-up	`Connection reset by peer`	Local inference server crashed due to heat/OOM.	Wrap server in a `systemd` watchdog or tmux restarter.

🧪 5. Runtime Verification: What to Observe

When running the labs, open a second tmux pane and monitor these metrics:

VRAM Profiling: Run nvidia-smi -l 1 (NVIDIA) or sudo htop (CPU).
- Observation: Watch the VRAM spike during model load, then watch it "creep" upward as the agent generates a long response (this is the KV-Cache growing).
The "Ollama Cold-Start": Run time ollama run gemma2:2b "hi".
- Observation: The first run will take seconds as the model is pulled from disk. Subsequent runs should respond in <100ms as the model remains "hot" in memory.
Throughput Test: While running the vLLM server, watch the logs for Avg prompt throughput: XX.X tokens/s. Compare this when running a 2B model vs. an 8B model on the same hardware.

Next Step: proceed to Module 04: Spec-Driven Development to learn how to structure the output from these local engines.

🐳 Module 03: Local-First Lightweight Models & Serving ​

🏛️ 1. Architectural Deep Dive: VRAM Geometry & The KV-Cache Tax ​

The Static Weight Math ​

The dynamic Bottleneck: KV-Cache ​

📊 2. Structured Tradeoff Matrix: Local Serving Engines ​

🛠️ 3. Step-by-Step Mechanics Breakdown ​

Pattern: PagedAttention (vLLM) ​

Pattern: GGUF & MMap (llama.cpp) ​

🛡️ 4. Failure Mode Analysis: Local Infrastructure Breaking Points ​

🧪 5. Runtime Verification: What to Observe ​