Appearance
๐ณ Module 03: Local-First Lightweight Models & Serving โ
Welcome to Module 03. In this section, you will master the "Physics of Inference." You will learn to deploy lightweight, high-reasoning models (Gemma 2 2B, Llama 3.2) locally, moving beyond basic execution to understand VRAM geometry, quantization math, and high-concurrency serving architectures.
๐๏ธ 1. Architectural Deep Dive: VRAM Geometry & The KV-Cache Tax โ
To run models locally, you must move past "model size" and understand the two primary physical constraints: Static Weights and Dynamic Memory (KV-Cache).
The Static Weight Math โ
A model's memory footprint is determined by its parameter count and bit-precision.
- Formula:
Parameters (B) * (Bits / 8) = VRAM (GB) - Example: A Gemma 2 2B model at 4-bit quantization (Q4_K_M) consumes:
2.6B * (4 / 8) = 1.3GBfor the weights alone.
The dynamic Bottleneck: KV-Cache โ
Inference is not just about weights. As an agent generates tokens, it must store "Key-Value" pairs for every previous token in the context window to avoid re-calculating them. This is the KV-Cache.
- The Tax: Large context windows (e.g., 128k) can consume more VRAM than the model weights themselves.
- The Fragmentation Problem: Traditional serving engines allocate contiguous memory blocks for the KV-Cache. If an agent generates 500 tokens but the engine reserved 4000, that "internal fragmentation" wastes VRAM, preventing you from running multiple agents concurrently.
๐ 2. Structured Tradeoff Matrix: Local Serving Engines โ
| Engine | Optimization | Primary Use Case | Scaling Type | Primary Production Bottleneck |
|---|---|---|---|---|
| Ollama | User Experience | Rapid prototyping, desktop agents. | Vertical (Single User) | Higher memory overhead per instance. |
| llama.cpp | CPU/Metal Portability | Edge devices, laptops without NVIDIA GPUs. | Hybrid (CPU/GPU Offloading) | Slow inference on large batches. |
| vLLM | Throughput | Production multi-agent gateways, APIs. | Horizontal (Batching) | High "Idle" VRAM reservation. |
| TGI | Reliability | Enterprise-grade deployment. | Distributed | Strict Docker dependency requirements. |
๐ ๏ธ 3. Step-by-Step Mechanics Breakdown โ
Pattern: PagedAttention (vLLM) โ
In Lab 2, we utilize vLLM. Unlike standard engines, vLLM implements PagedAttention, inspired by virtual memory in operating systems.
- Logical Mapping: It breaks the KV-cache into small, non-contiguous physical "pages."
- Zero-Waste Allocation: It only maps a new page when the agent needs more context memory.
- Result: This allows you to serve 5x more concurrent agents on the same hardware compared to standard Hugging Face implementations.
Pattern: GGUF & MMap (llama.cpp) โ
In Lab 1, we use llama.cpp's GGUF format.
- Rationale: GGUF uses
mmap(memory-mapping). Instead of the OS loading the entire model into RAM at once, it maps the model file directly to the virtual address space. - The Benefit: The OS manages the "loading" and "unloading" of weight blocks as needed, allowing you to run models that are slightly larger than your physical VRAM by "spilling over" into system RAM.
๐ก๏ธ 4. Failure Mode Analysis: Local Infrastructure Breaking Points โ
| Failure Mode | Error/Log Signature | Root Cause | Code-Level Mitigation |
|---|---|---|---|
| CUDA OOM | torch.cuda.OutOfMemoryError | Combined Weights + KV-Cache > VRAM. | Reduce max_model_len or increase quantization level. |
| Context Window Truncation | Token limit reached (Silent error) | Prompt + Generation exceeds context max. | Implement sliding-window context management. |
| Model Spilling | Massive drop in Tokens/Sec (e.g. 50 โก๏ธ 2) | OS moved weights from GPU to System RAM. | Use n_gpu_layers to pin critical layers to VRAM. |
| Socket Hang-up | Connection reset by peer | Local inference server crashed due to heat/OOM. | Wrap server in a systemd watchdog or tmux restarter. |
๐งช 5. Runtime Verification: What to Observe โ
When running the labs, open a second tmux pane and monitor these metrics:
- VRAM Profiling: Run
nvidia-smi -l 1(NVIDIA) orsudo htop(CPU).- Observation: Watch the VRAM spike during model load, then watch it "creep" upward as the agent generates a long response (this is the KV-Cache growing).
- The "Ollama Cold-Start": Run
time ollama run gemma2:2b "hi".- Observation: The first run will take seconds as the model is pulled from disk. Subsequent runs should respond in <100ms as the model remains "hot" in memory.
- Throughput Test: While running the
vLLMserver, watch the logs forAvg prompt throughput: XX.X tokens/s. Compare this when running a 2B model vs. an 8B model on the same hardware.
Next Step: proceed to Module 04: Spec-Driven Development to learn how to structure the output from these local engines.