Curiosity
Short, precise definitions with working examples — the things you need to know before you build.
Agentic memory is a set of techniques that give an LLM agent persistent state beyond its context window — short-term scratchpads, long-term semantic stores, and episodic logs — so it can learn from past interactions across sessions.
AI safety red-teaming is the practice of deliberately probing an AI system with adversarial prompts and scenarios — by humans, by other models, or automated tools — to uncover harmful, unsafe, or policy-violating behaviours before deployment.
Attention is a neural network operation that lets a model compute a weighted combination of input elements for each output position, where the weights are learned from the similarity between a query and a set of keys.
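In pure Python, for a single query and toy 2-dimensional vectors, the operation looks like this (no batching, multiple heads, or learned projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted combination of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# The query points along the first key, so the output is dominated
# by the first value vector.
out = attention(query=[1.0, 0.0],
                keys=[[10.0, 0.0], [0.0, 10.0]],
                values=[[1.0, 0.0], [0.0, 1.0]])
```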
Batching runs multiple LLM inference requests through the GPU together to amortize fixed costs; continuous batching, pioneered by Orca and vLLM, dynamically adds and removes requests from the batch at every decoding step for much higher throughput.
Beam search is a deterministic decoding algorithm that keeps the top-B partial sequences at every generation step and expands them — it approximates the highest-probability sequence better than greedy decoding but tends to produce bland, repetitive text in modern LLMs.
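A minimal sketch, with a toy bigram table standing in for the model's next-token distribution (tokens and probabilities are made up):

```python
import math

# Toy next-token model: log-probabilities conditioned on the last token only.
# A real LLM conditions on the whole prefix; this table is a stand-in.
LOGPROBS = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(beam_width=2, max_len=5):
    # Each beam entry: (cumulative log-prob, token list).
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished beams carry over
                continue
            for tok, lp in LOGPROBS[seq[-1]].items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the top-B partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams

best_score, best_seq = beam_search()[0]
```

Greedy decoding commits to "the" (p = 0.6) and can do no better than probability 0.6 × 0.5 = 0.30; width-2 beam search recovers "a cat" at 0.4 × 0.9 = 0.36.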
BM25 is the classical bag-of-words ranking function used by search engines to score documents against a query using term frequency, inverse document frequency, and document length normalization.
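The scoring function fits in a few lines of pure Python (this uses one common IDF variant; production engines such as Lucene differ in the details):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Rarer terms get a higher inverse-document-frequency weight.
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation plus document-length normalization.
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "cat", "cat"]]
scores = bm25_scores(["cat"], docs)
```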
Chain-of-thought (CoT) prompting is a technique where the model is asked to show its step-by-step reasoning before giving a final answer, which dramatically improves accuracy on math, logic, and multi-step tasks.
Chatbot Arena is a crowdsourced LLM evaluation platform where users submit a prompt, receive anonymous responses from two different models, and vote for the better one — producing Elo-style rankings from millions of head-to-head comparisons.
Chunking strategies are the rules by which a RAG pipeline splits documents into retrievable units — choice of size, overlap, and boundary (character, token, sentence, section, semantic) directly controls retrieval quality and answer grounding.
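A minimal fixed-size chunker with overlap, the simplest of these strategies (integer token IDs are stand-ins for real tokens; sentence- or section-boundary splitting builds on the same idea):

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token list into fixed-size chunks that overlap their neighbours."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks

chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
```

The one-token overlap means a sentence straddling a boundary appears in both neighbouring chunks, so retrieval does not lose it.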
Constitutional AI is Anthropic's alignment technique where a model is trained to critique and revise its own outputs against a written set of principles (the 'constitution'), producing preference data used to fine-tune a safer assistant — largely replacing human preference labeling with AI feedback.
The context window is the maximum number of tokens — prompt plus output — a language model can process in a single call, bounded by architecture and memory.
Cosine similarity is a metric that measures how close two vectors point in the same direction, computed as their dot product divided by the product of their magnitudes. It's the default similarity used for embeddings.
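The formula in pure Python:

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1.0 for parallel vectors, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Direction is all that matters: [2, 4] is just [1, 2] scaled.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # ~1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
```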
A decoder-only transformer is a stack of transformer blocks with causal (masked) self-attention that predicts the next token conditioned on all previous tokens — the architecture behind GPT, Claude, Llama, and most modern LLMs.
Direct Preference Optimization (DPO) is an alignment technique that fine-tunes a language model directly on pairs of preferred vs dispreferred responses, skipping the reward model and RL loop used in RLHF.
Embeddings are dense numerical vectors that represent words, sentences, images, or other objects in a space where semantic similarity corresponds to geometric closeness.
Few-shot prompting is a technique where you include a handful of input-output examples directly in the prompt so the LLM can infer the task format and respond in kind — no weights change, the model learns in-context.
Fine-tuning adapts a base LLM's weights to new task formats, style, or tone using labeled examples. Prefer RAG for new facts; fine-tune for new behavior.
FlashAttention is an IO-aware exact implementation of self-attention that tiles computation across GPU SRAM to avoid materializing the full attention matrix, giving large speedups and linear-in-sequence-length memory.
GAIA is a benchmark of 466 real-world questions that require multi-step tool use, web browsing, file handling, and reasoning — it is a standard evaluation for general AI assistants and agents, with humans scoring 92% and frontier agents historically far below.
GGUF is a single-file binary format for quantized LLM weights and metadata, designed for llama.cpp and its ecosystem — it packages tokenizer, architecture, and quantized tensors into one portable file that loads via mmap on CPU or GPU.
Group Relative Policy Optimization (GRPO) is the reinforcement-learning algorithm DeepSeek used to train R1 — it drops PPO's value network and estimates advantages by comparing multiple sampled responses within the same prompt group.
Grouped-Query Attention (GQA) is an attention variant where multiple query heads share a single key/value head — it cuts KV-cache memory and boosts inference throughput with almost no quality loss versus full multi-head attention.
Guardrails are input and output validation layers wrapped around an LLM — filters, classifiers, schema checks, and policy rules — that block unsafe, off-topic, or malformed generations before they reach users or downstream systems.
Hallucination is when a language model confidently generates content that is factually wrong, fabricated, or unsupported by any provided source — the single most important reliability problem in LLM applications.
Hybrid search is a retrieval strategy that combines sparse keyword scoring (usually BM25) with dense vector similarity, then fuses the two ranked lists — catching both exact-term matches and semantically related passages.
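One common fusion method is reciprocal rank fusion (RRF), which needs only the two rank orderings, not comparable scores; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each doc gets sum of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Docs ranked highly in both lists accumulate the largest totals.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]   # sparse keyword results
dense_ranking = ["d3", "d1", "d4"]  # vector-similarity results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```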
HyDE is a retrieval technique where the LLM first generates a hypothetical answer to the user query, then embeds that generated answer and uses it — not the query — to search the vector index.
Instruction tuning is the supervised fine-tuning stage where a pretrained language model is trained on (instruction, response) pairs so that it learns to follow natural-language commands instead of merely continuing text.
INT4 quantization compresses LLM weights from 16-bit floating point down to 4-bit integers — it cuts model memory by ~4x and typically doubles inference throughput with only small quality degradation when paired with modern algorithms like GPTQ or AWQ.
INT8 quantization stores LLM weights (and sometimes activations) as 8-bit integers — it halves memory versus FP16 while preserving near-baseline accuracy and is the safest first step for deploying a large model on cheaper hardware.
The KV cache stores the key and value tensors computed by self-attention for past tokens so that generating each new token becomes O(1) in sequence length instead of re-processing the entire prefix.
LLM KV-cache compression is a family of techniques — quantization, eviction, low-rank projection, token pruning — that shrink the key/value cache at inference time so long-context and high-batch serving fit on smaller GPUs.
LLM-as-judge is an evaluation pattern where a language model grades or ranks another model's outputs, serving as a scalable — if imperfect — substitute for human evaluation.
Local Attention is a family of attention patterns where each token only attends to a small local neighbourhood of tokens rather than the full sequence — it is the general technique behind sliding-window, block, and dilated attention designs.
LoRA is a parameter-efficient fine-tuning technique that freezes a base model's weights and trains small low-rank matrices injected into each layer, drastically cutting memory and storage cost.
Mixture of Experts is a neural architecture where a router sends each token to a small subset of 'expert' sub-networks, giving huge total parameter counts while keeping per-token compute low.
MMLU is a widely used LLM evaluation benchmark with about 16,000 multiple-choice questions across 57 subjects — from elementary math to professional law and medicine — designed to measure broad academic and professional knowledge.
Model distillation is a compression technique where a small 'student' model is trained to mimic a larger 'teacher' model's outputs, transferring capability into a cheaper, faster model.
Model parallelism is the set of techniques that split a single neural network across multiple GPUs when it is too large to fit on one — primarily tensor parallelism (splitting individual matrix multiplies) and pipeline parallelism (assigning different layers to different GPUs).
Multi-head Latent Attention (MLA) is the attention variant introduced by DeepSeek that compresses keys and values into a low-rank latent vector — it shrinks the KV cache by an order of magnitude while matching or beating multi-head attention quality.
Multi-Query Attention (MQA) is an attention variant where all query heads share a single key/value head — it shrinks the KV cache dramatically and speeds up autoregressive decoding at the cost of a small quality drop.
PagedAttention is a GPU memory-management technique from vLLM that stores each sequence's key-value cache in fixed-size non-contiguous blocks — like virtual-memory paging in an OS — eliminating the internal fragmentation that cripples naive KV-cache allocation.
Perplexity is the exponential of the average negative log-likelihood a language model assigns to a held-out text — lower is better, and it is the oldest and simplest measure of how well a model 'predicts' natural language.
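The computation from per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over a token sequence."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns uniform probability 1/10 to every token
# has perplexity 10: it is "as confused as" a 10-way coin flip.
uniform = perplexity([math.log(0.1)] * 5)
```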
Planning is the agent capability of breaking a high-level goal into a sequence (or tree) of concrete sub-steps before acting, and revising the plan as new information arrives from tool results.
Positional encoding is the technique that injects token-order information into a Transformer, since self-attention by itself is permutation-invariant and cannot distinguish sequence position.
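The original Transformer's fixed sinusoidal variant, sketched in pure Python:

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal encoding: each dimension pair uses a different wavelength,
    so every position gets a distinct, smoothly varying vector."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Added to (or concatenated with) token embeddings before the first layer.
vec0 = sinusoidal_position(0, 4)  # position 0 -> [0, 1, 0, 1]
vec1 = sinusoidal_position(1, 4)
```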
Prompt caching is a server-side optimization that stores the KV-cache state of a stable prompt prefix so repeated requests reuse it, cutting latency and cost for long system prompts, tools, and documents.
Prompt chaining is the pattern of decomposing a complex task into a sequence of simpler prompts, where each step's output feeds the next — trading latency for more reliable, auditable behavior than a single monolithic prompt.
Prompt injection is an attack where adversarial instructions hidden in untrusted input — a document, webpage, email, or tool output — override the developer's intended prompt and cause the LLM to behave maliciously.
Proximal Policy Optimization (PPO) is the on-policy reinforcement-learning algorithm that became the default optimizer for RLHF — it constrains updates with a clipped ratio between new and old policies for stable training on language models.
QLoRA is a fine-tuning method that quantizes a frozen base LLM to 4-bit NF4 weights and trains small LoRA adapters on top — it shrinks the memory footprint enough to fine-tune 65B-parameter models on a single 48 GB GPU.
Quantization is the technique of representing neural network weights and activations with fewer bits — typically INT8, INT4, or FP8 — to shrink memory use and speed up inference with minimal quality loss.
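A sketch of the simplest scheme, symmetric per-tensor INT8 (real methods such as GPTQ and AWQ are considerably more sophisticated about where the error goes):

```python
def quantize_int8(weights):
    """Symmetric quantization: w ~= scale * q with q an integer in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers plus one scale factor."""
    return [scale * v for v in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize(q, scale)
```

The round-trip error per weight is at most half a quantization step, which is why accuracy degrades so little at 8 bits.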
Query rewriting is the step in a RAG pipeline where the original user query is transformed — expanded, decomposed, or reformulated — before retrieval, to increase the chance of matching the right passages in the index.
RAGAS is an open-source evaluation framework for RAG pipelines that scores outputs along four LLM-graded dimensions — faithfulness, answer relevance, context precision, and context recall — without needing ground-truth labels.
ReAct is an agent pattern where an LLM interleaves reasoning traces with tool-using actions and observations, producing a Thought-Action-Observation loop until the task is solved.
Reflexion is an agent pattern where, after an attempt fails, the LLM writes a natural-language self-critique of what went wrong and stores it in episodic memory so the next attempt is better informed — learning by reflection instead of gradient descent.
Reinforcement Learning from AI Feedback (RLAIF) is a post-training technique where a strong AI model, rather than humans, produces the preference labels used to train a reward model — it scales alignment beyond what human annotation can cheaply provide.
RLHF is the training technique that aligns a language model's behavior with human preferences by using human-ranked outputs to train a reward model, then fine-tuning the LLM against that reward with reinforcement learning.
Reranking is a second-stage retrieval step where a heavier cross-encoder model rescores the top-k candidates from a fast first-stage retriever, reordering them so the most relevant passages end up in the prompt.
Retrieval-Augmented Generation (RAG) is a pattern where an LLM is grounded on retrieved passages at query time — fewer hallucinations, up-to-date answers, no retraining required.
Rotary Position Embeddings (RoPE) encode token position by rotating the query and key vectors inside self-attention, so relative position falls out of the attention dot product directly.
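A pure-Python sketch of the rotation and the relative-position property (the pairing of dimensions is simplified; implementations differ in how they split the vector):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of a query or key vector
    by position-dependent angles, one wavelength per pair."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The attention score depends only on the relative offset: shifting both
# positions by the same amount leaves the dot product unchanged.
q, k = [1.0, 0.0, 0.5, 0.5], [0.3, 0.7, 0.2, 0.1]
score_near = dot(rope(q, 3), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 107))  # same offset of 4
```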
Self-attention is the mechanism that lets a Transformer weigh how strongly each token in a sequence relates to every other token, producing context-aware representations.
Self-consistency is a decoding strategy that samples multiple chain-of-thought reasoning paths from an LLM at non-zero temperature, then picks the final answer by majority vote across the samples.
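The vote itself is a one-liner; the hard-coded answers below stand in for final answers parsed from real sampled chains of thought:

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over the final answers of several reasoning samples."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Five temperature > 0 samples of the same problem; three agree.
answers = ["42", "42", "17", "42", "35"]
voted = self_consistency(answers)
```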
Semantic chunking splits documents at points where the embedding similarity between consecutive sentences drops sharply — instead of fixed sizes, chunks naturally end when the topic changes, improving retrieval coherence.
Sliding Window Attention is an attention pattern where each token only attends to a fixed-size window of recent tokens — it turns quadratic full attention into linear-cost local attention and is the basis for Mistral's long-context design.
Speculative decoding is an inference acceleration technique where a small 'draft' model proposes several tokens and a large 'target' model verifies them in parallel, yielding 2-3x speedup with identical outputs.
Structured output is the capability of having an LLM return JSON, a typed schema, or a tool call that conforms exactly to a declared structure — the bridge between free-form language models and deterministic code.
Supervised Fine-Tuning (SFT) is the first post-training step for an LLM where the base model is trained on curated input-output pairs to follow instructions — it is the foundation every RLHF, DPO, or GRPO pipeline builds on top of.
SWE-bench is an LLM evaluation benchmark of real GitHub issues paired with their resolving pull requests from popular Python repositories, where the model must edit the codebase so that a set of hidden tests pass.
Temperature sampling is a decoding knob that divides the model's logits by a temperature T before softmax — lower T sharpens the distribution toward the argmax, higher T flattens it and increases randomness.
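The knob in code, showing how T reshapes the same logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then softmax; T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # top token dominates
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
```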
Tokenization is the process of breaking text into discrete units — usually subwords — that a language model actually consumes as input, using algorithms like BPE, WordPiece, or SentencePiece.
Tool calling is the capability where an LLM emits a structured request to invoke an external function — weather lookup, SQL query, code execution — the runtime executes it, returns the result, and the model continues with that result in context.
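A sketch of the runtime half of the loop, with two hypothetical tools (the exact JSON shape of a tool call varies by provider):

```python
import json

# Hypothetical tool registry; a real runtime maps names from the model's
# structured output to functions like these.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 18},  # stubbed result
    "add": lambda a, b: a + b,
}

def execute_tool_call(raw_call):
    """Parse a model-emitted JSON tool call, run it, return the JSON result
    that gets appended to the conversation for the model to continue from."""
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]
    result = fn(**call["arguments"])
    return json.dumps(result)

reply = execute_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```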
Top-k sampling restricts next-token choice to the k most-probable tokens, renormalizes those probabilities, and samples from the resulting distribution — a simple way to cut the long tail of low-probability garbage tokens.
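A sketch over an explicit token-to-probability map (a real decoder works on logits over the full vocabulary):

```python
import random

def top_k_sample(probs, k, rng):
    """Keep the k most-probable tokens, renormalize, and sample one."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)  # renormalize over the kept tokens
    r = rng.random() * total
    for tok, p in top:
        r -= p
        if r <= 0:
            return tok
    return top[-1][0]

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
rng = random.Random(0)
# With k=2, the low-probability tail "c" and "d" can never be sampled.
draws = {top_k_sample(probs, k=2, rng=rng) for _ in range(200)}
```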
Top-p sampling, also called nucleus sampling, restricts the next-token distribution to the smallest set of tokens whose cumulative probability exceeds p, then renormalizes — it adapts the candidate pool dynamically to the model's confidence.
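The candidate-pool selection, sketched in pure Python; note how a confident distribution yields a smaller pool than an uncertain one:

```python
def top_p_candidates(probs, p):
    """Smallest set of tokens whose cumulative probability reaches p,
    renormalized into a distribution to sample from."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

uncertain = top_p_candidates({"a": 0.6, "b": 0.25, "c": 0.1, "d": 0.05}, p=0.8)
confident = top_p_candidates({"a": 0.95, "b": 0.03, "c": 0.02}, p=0.9)
```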
Transformer architecture is a neural network design built around self-attention that replaced recurrent networks for sequence modeling and underpins virtually every modern large language model.
Tree of Thoughts is a prompting framework where the LLM explores a search tree of intermediate reasoning steps, evaluates each state, and uses BFS or DFS with pruning to find a solution — generalizing chain-of-thought from a straight line to a branching search.
A vector database is a specialized store that indexes high-dimensional embeddings and serves fast approximate nearest-neighbor (ANN) similarity search — the retrieval layer underneath most RAG and semantic-search systems.
Vision-Language Models are multimodal neural networks that accept images (and sometimes video) alongside text, producing language outputs grounded in what they see.
Zero-shot prompting is asking the LLM to perform a task from an instruction alone, with no worked examples in the prompt. It relies entirely on the model's pretrained knowledge and instruction-tuned capabilities.