Field Notes · Vol. 01 · Large Language Models

A Practical Dashboard for LLMs

How they actually work, why they fail, and how to build reliable systems around them. Built for the operators, not the theorists.

Field Notes · Vol. 01
01 Landscape & Evolution · from rules to reasoning

The field moved through four distinct phases in the last 70 years. Each one failed at the same thing (generalization) until transformers showed up and quietly changed what was computationally possible.

A brief, useful history
1950s–90s
Rule-based systems ELIZA, expert systems, handcrafted grammars. If you can write the rule, it works. You usually can't.
1990s–00s
Statistical NLP N-grams, hidden Markov models. The field figured out you could learn from data rather than program every case.
2013
Word embeddings (word2vec) Words become vectors. Meaning becomes geometry. The phrase "king − man + woman = queen" starts working.
2014–16
Sequence models (LSTMs, seq2seq) Neural nets that read and write sequences. Machine translation gets good. Still slow, still forgetful.
2017
"Attention Is All You Need" Google introduces the transformer. The paper that quietly started the current era.
2018–20
BERT, GPT-1, GPT-2, GPT-3 Scale works. Pretrain on everything, fine-tune on little. GPT-3 makes it clear: bigger is smarter.
Nov 2022
ChatGPT Not a research moment. A product moment. 100M users in two months. The world gets a chat interface to a language model.
2023–24
The platform era GPT-4, Claude, Gemini, Llama. Multimodality, long context, tool use, agents. RAG becomes standard practice.
2024–26
Reasoning & agents Models that "think before speaking" (o-series, Claude with extended thinking). Agentic systems that plan, call tools, and recover from errors.
Why transformers changed everything

Before transformers, models read text sequentially, word by word. This meant two things. First, they were slow: each step depended on the one before it, so computation couldn't be parallelized across the tokens of a sequence, no matter how many GPUs you had. Second, they forgot. A word at the start of a paragraph had very little influence on a word at the end.

The transformer's trick is attention. Every token looks at every other token simultaneously, and the model learns which ones matter. This made training parallelizable (faster and bigger), and it let the model handle long dependencies (more context). Combined with scaling laws (the discovery that bigger models + more data + more compute reliably produces better performance), the whole field took off.

In operator terms: transformers unlocked scale, and scale unlocked emergence. Capabilities like translation, summarization, and code generation weren't programmed in. They appeared as side effects of "just predict the next word, really well, on a lot of text."

Current ecosystem · the five that matter
OpenAI
GPT-5 · o-series
First mover with ChatGPT. Strong reasoning models (o3, o4) for hard problems. Deep developer platform, big enterprise push.
Strengths Coding, advanced reasoning, ecosystem breadth, enterprise sales motion.
Anthropic
Claude Opus 4.x · Sonnet
Research-first lab focused on safety and alignment. Claude leads on long-context, nuanced writing, and agentic coding workloads.
Strengths Long context, code, honest/calibrated answers, Constitutional AI approach.
Google DeepMind
Gemini 2.x
Natively multimodal. Huge context windows. Tightly integrated into Workspace, Search, Android. Strongest distribution of any lab.
Strengths Multimodal, context length, distribution via Google products, research depth.
Meta
Llama (open weights)
Bets on open models. Llama is the default for self-hosted / fine-tuned deployments. Changed the economics of AI dramatically.
Strengths Open weights, self-hosting, customization, enormous ecosystem.
xAI
Grok
Real-time data via X integration. Fast training, provocative branding. Catching up on benchmarks quickly with massive GPU investments.
Strengths Real-time info, speed of iteration, differentiated personality.
Operator's note

There is no single "best" model. There is the best model for a specific task at a specific price point under specific latency constraints. Production systems route across multiple models. Treat model choice as a design decision, not a loyalty decision.

02 How LLMs Actually Work · explained simply

Everything an LLM does reduces to one operation: given a sequence of tokens, guess the next one. That's the entire magic trick. Writing, reasoning, coding, translation: all of it emerges from this one primitive, applied at enormous scale.

Tokens · the atoms

LLMs don't read words. They read tokens, which are sub-word pieces. The word "unbelievable" might become three tokens: un, believ, able. Each token has an ID, a number. The model only ever sees numbers. The text you see is translated to numbers on the way in, and back to text on the way out.

A rough rule: 1 token ≈ 0.75 words in English. So 1,000 tokens is about 750 words. Pricing, context limits, and latency are all measured in tokens. It's the unit of everything.
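
To see tokenization concretely, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer library. Other providers ship their own tokenizers, so the exact splits and counts are illustrative, not universal:

  # pip install tiktoken  (OpenAI's open-source tokenizer library)
  import tiktoken

  # cl100k_base is one common encoding; other models use different ones,
  # so the same text can tokenize differently across providers.
  enc = tiktoken.get_encoding("cl100k_base")

  text = "Unbelievable results from a 750-word report."
  ids = enc.encode(text)        # text -> list of integer token IDs
  print(ids)                    # the model only ever sees these numbers
  print(len(ids), "tokens")     # rough sanity check of the ~0.75 words/token rule
  print(enc.decode(ids))        # IDs -> back to the original text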

Next-token prediction · the whole trick

The model is handed a sequence of tokens. Its job is to output a probability distribution over every possible next token in its vocabulary (typically 50,000 to 200,000 options). Pick one. Append it. Repeat.

  input:  "The capital of France is"
                │
                ▼
       ┌────────────────┐
       │  Transformer   │
       │    ~billions   │
       │   of params    │
       └────────┬───────┘
                │
                ▼
     Probability distribution
     ──────────────────────────
        Paris   ████████ 0.87
        Lyon    ▌        0.04
        the     ▌        0.02
        a       ▌        0.01
        ... (50k+ other tokens)
                │
                ▼
  sampled next token: "Paris"

Temperature controls how "confident" the sampling is. Temperature 0 means always pick the most likely token (deterministic). Higher temperatures (1.0 and up) flatten the distribution and sample from it more freely (creative, more variance). That single knob is the difference between "reliable classifier" and "creative writer."
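
A minimal sketch of what that knob does mechanically. The logits below are invented for four candidate tokens; a real model emits one score per token in its vocabulary:

  import math, random

  # Hypothetical raw scores (logits) for a handful of candidate next tokens.
  logits = {"Paris": 5.2, "Lyon": 2.1, "the": 1.4, "a": 0.8}

  def sample_next(logits, temperature):
      if temperature == 0:
          # Greedy decoding: always the single most likely token (deterministic).
          return max(logits, key=logits.get)
      # Softmax with temperature: low T sharpens the distribution, high T flattens it.
      scaled = {tok: score / temperature for tok, score in logits.items()}
      top = max(scaled.values())                       # subtract max for numerical stability
      exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
      total = sum(exps.values())
      probs = {tok: e / total for tok, e in exps.items()}
      return random.choices(list(probs), weights=list(probs.values()))[0]

  print(sample_next(logits, temperature=0))    # "Paris" every run
  print(sample_next(logits, temperature=1.0))  # usually "Paris", occasionally something else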

Training · the shortest version
  • Pretraining. Feed the model trillions of tokens from the internet, books, code. Task: predict the next token. It learns grammar, facts, reasoning patterns, styles, all as a side effect.
  • Post-training (RLHF / RLAIF). Humans (or other models) rank outputs. The model learns to prefer helpful, harmless, honest responses over reckless or rude ones.
  • Fine-tuning. Optional. Further train on a narrow task-specific dataset: customer support conversations, legal contracts, a specific tone of voice.
Context window · the working memory

The context window is everything the model can "see" at once: your system prompt, your message, any documents you passed in, prior turns of the conversation, and the response it's generating. Modern models range from 8K tokens (early models) to 1M+ tokens (Gemini, Claude).

Big context windows are not free memory. Attention cost grows quadratically with context length in a vanilla transformer, so long contexts are slower and more expensive. Models also suffer from lost-in-the-middle: information buried deep inside long contexts gets less attention than content at the start and end. So passing in your entire codebase doesn't mean the model actually uses all of it.

Embeddings · meaning in numbers

An embedding is a vector: a list of numbers, typically 768 to 3,072 dimensions, representing the meaning of a chunk of text. Text that means similar things ends up with similar vectors. "Dog" and "puppy" are close. "Dog" and "financial quarter" are far.

This lets you search by meaning instead of keywords. Ask "how do I cancel my subscription?" and the system can retrieve the help article titled "Managing billing and ending your plan" even though none of those words match exactly. That's how modern RAG works.
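
"Close" and "far" here are literal: similarity is measured as cosine similarity between vectors. A toy sketch with hand-made 4-dimensional vectors; real embeddings come from an embedding model and have hundreds to thousands of dimensions:

  import math

  def cosine_similarity(a, b):
      # 1.0 = pointing the same direction (similar meaning); near 0 = unrelated.
      dot = sum(x * y for x, y in zip(a, b))
      norm_a = math.sqrt(sum(x * x for x in a))
      norm_b = math.sqrt(sum(x * x for x in b))
      return dot / (norm_a * norm_b)

  # Hypothetical vectors; an embedding model would produce these from text.
  dog = [0.80, 0.10, 0.60, 0.05]
  puppy = [0.75, 0.15, 0.55, 0.10]
  financial_quarter = [0.05, 0.90, 0.02, 0.70]

  print(cosine_similarity(dog, puppy))              # high: similar meaning
  print(cosine_similarity(dog, financial_quarter))  # low: unrelated meaning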

STEP 1 · Raw text. "How to reset"
STEP 2 · Embedding model. text → vector
STEP 3 · Vector. [0.12, -0.45, 0.83, 0.21, ...×1536]
STEP 4 · Vector database. Pinecone · Weaviate · pgvector · Qdrant (indexed, searchable)
STEP 5 · Similarity search. A query comes in; cosine distance, top-k
STEP 6 · Retrieved chunks. Injected into the prompt
STEP 7 · LLM answers. Grounded response

The "R" in RAG: retrieve first, then generate. This is how LLMs access knowledge beyond their training.

This pipeline is the backbone of RAG (Retrieval-Augmented Generation), which is the most common production pattern for LLM applications. It's also why understanding embeddings matters more than understanding transformers for most applied work.
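
A compressed sketch of steps 1 through 7 as plumbing. The letter-frequency embed() is a deliberately crude stand-in for a real embedding model, the final prompt would go to your LLM of choice, and in production the vectors live in a vector database rather than a Python list:

  from math import sqrt
  import string

  def embed(text):
      # Toy "embedding": letter-frequency vector. A real system calls an embedding
      # model here; retrieval quality comes from that model, not from this stand-in.
      text = text.lower()
      return [text.count(c) for c in string.ascii_lowercase]

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      return dot / ((sqrt(sum(x * x for x in a)) or 1.0) * (sqrt(sum(x * x for x in b)) or 1.0))

  # STEPS 1-4: chunk documents, embed each chunk, store (vector, text) pairs.
  documents = [
      "Managing billing and ending your plan: Settings > Billing > Cancel plan.",
      "Shipping policy: orders leave the warehouse within two business days.",
  ]
  index = [(embed(doc), doc) for doc in documents]

  def retrieve(query, top_k=1):
      # STEP 5: similarity search, ranking stored chunks against the query vector.
      q = embed(query)
      ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
      return [text for _, text in ranked[:top_k]]

  def build_prompt(question):
      # STEP 6: inject retrieved chunks into the prompt. STEP 7: send it to the LLM.
      context = "\n".join(retrieve(question))
      return ("Answer using ONLY the context below. If the answer is not there, say you don't know.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")

  print(build_prompt("How do I cancel my subscription?"))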

03 Failure Modes & Why They Happen

This is the section that separates people who have shipped LLM products from people who have only demoed them. Every failure mode below has a root cause that traces back to one fact: the model is a next-token predictor without a grounded model of truth.

The seven failures you will encounter

Hallucinations

What The model generates plausible-sounding but factually wrong information, often stated with confidence.
Why The model's job is to produce a likely-sounding continuation, not a true one. When it doesn't know, it still has to output something. The most probable sequence of tokens can be completely fabricated and still "look right." No internal flag fires to say "I don't know this."
Real A lawyer submitted a legal brief in Mata v. Avianca citing six fake cases ChatGPT invented. Full fake citations, fake quotes, fake judges. Or every developer who has had a model import a Python library that does not exist.
Fix Ground in retrieved documents (RAG). Require citations with verifiable URLs. Add explicit "if you don't know, say so" instructions. Use models with better factual calibration. For high-stakes outputs, route through a verifier model or human review.

Inconsistent outputs

What The same prompt produces meaningfully different answers across runs.
Why Sampling is stochastic. At temperature above 0, the model samples from a probability distribution, so output varies. Many tasks also have multiple genuinely valid answers: summarization, classification on fuzzy boundaries, open-ended writing.
Real A sentiment classifier returns "positive" on Monday and "neutral" on Tuesday for the exact same review. A content generator produces three very different blog intros in three runs, and leadership can't decide which is "correct."
Fix Set temperature to 0 for deterministic tasks. Use structured outputs (JSON schema enforcement). Apply self-consistency by running N times and taking the majority. Monitor output drift via automated evals.

Overconfidence

What The model asserts incorrect information with the same tone of authority it uses for correct information.
Why Training optimizes for fluency and helpfulness, not calibration. The model doesn't maintain an internal "confidence score" for its claims. Hedging was often trained out because it frustrated early users.
Real Asking "does Python 3.14 support feature X?" and getting a detailed, confident "yes" with code examples, when Python 3.14 isn't even out yet. Or an internal knowledge assistant stating a company policy that was retired a year ago.
Fix Prompt for epistemic humility ("if you're uncertain, say so explicitly"). Ask for confidence ratings as part of the output. Pair with retrieval that forces grounding. Run critical outputs through a second model acting as a skeptic.

Prompt sensitivity

What Tiny, seemingly meaningless changes to a prompt produce dramatically different outputs.
Why The model is a conditional probability machine. Every word shifts the distribution over what comes next. Words that are strong signals in training data ("carefully," "step by step," "detailed") have outsized effects. Word order, punctuation, and formatting all matter.
Real Changing "summarize this" to "briefly summarize this" cuts output length by 60%. Adding "Let's think step by step" doubles response accuracy on math problems. Reordering few-shot examples changes the class distribution of a classifier.
Fix Treat prompts as versioned artifacts, like code. A/B test prompt variants on a fixed eval set. Use templates and variables instead of free-form prompts in production. Document which phrases matter and why.

Lack of grounding

What The model answers with no factual basis. Just pattern-matching from its training corpus.
Why Without retrieval or tool calls, the model's only knowledge is what it absorbed during training, which is stale (has a cutoff date), partial (didn't cover everything), and generic (no access to your internal data).
Real Asking "what's our Q3 refund policy?" to a vanilla LLM. It confidently invents a policy that sounds plausible. Or asking about recent news and getting information from 18 months ago stated as current.
Fix RAG for internal / proprietary knowledge. Tool use (API calls, database queries, web search) for fresh data. Design the system to refuse or escalate when no source can be retrieved rather than fabricate.

Context loss

What The model forgets or ignores parts of long conversations or long documents, even within the advertised context window.
Why Attention is not uniform across context. The "lost in the middle" phenomenon: information at the start and end of a long context gets more attention than information in the middle. Some instructions get overridden by later content. In long chats, the model also starts to drift in style and assumptions.
Real Uploading a 100-page contract and asking a specific question. The model misses a critical clause buried on page 42. Or a long chat where the model forgot the formatting rules you set at the start 30 messages ago.
Fix Chunk + retrieve instead of dumping full documents. Repeat critical instructions near the end of the prompt. Structure prompts so the most important content is at the top and bottom of the context. Benchmark real long-context performance on your data, not just the advertised max.

Bias and alignment issues

What Outputs reflect biases present in training data, or drift from the values the system should uphold.
Why Models are trained on internet-scale text, which contains cultural, gender, racial, and ideological biases. RLHF partially corrects this but can also introduce new biases based on the annotator pool. Models can also be coaxed into personas that override their intended behavior: the "jailbreak" problem.
Real Resume screeners penalizing women's names. Loan advisors offering different products based on inferred demographics. A customer support bot being talked into recommending competitor products. Generated images defaulting to a narrow set of demographic representations.
Fix Diverse eval sets that test for bias explicitly. Red-teaming by people with different backgrounds. System prompts with explicit value constraints. Constitutional / principle-based approaches. Human oversight on sensitive use cases. Audit outputs across demographic slices, not just aggregate accuracy.
How to make LLM outputs reliable in production

A reliable LLM product is rarely about picking the "best" model. It's about the system you build around the model. The model is one subroutine in a larger pipeline that handles grounding, evaluation, feedback, and failure.

R.01 · Grounding Retrieval-Augmented Generation Retrieve before generating. Index your documents, let the user's query pull the relevant chunks, pass those into the prompt. Hallucinations drop dramatically because the model has something to cite.
R.02 · Control Clear instructions & constraints System prompts with explicit behavior rules. Output schemas (JSON with validation). Refusal clauses for out-of-scope requests. Tool definitions with tight argument types.
R.03 · Evaluation Golden sets & rubrics A fixed set of 50–500 inputs with expected outputs or scoring criteria. Run on every model or prompt change. Add LLM-as-judge with a stronger model. Your golden set is your test suite; a minimal harness is sketched after this list.
R.04 · Oversight Human-in-the-loop High-stakes outputs (legal, medical, financial, customer-facing) get human review, at least early in deployment. Build the review UI as a first-class surface, not an afterthought.
R.05 · Learning Iterative feedback loops Collect thumbs up/down, edits, corrections, abandoned sessions. Feed these signals into prompt updates, retrieval improvements, or fine-tuning. Close the loop between usage and iteration.
R.06 · Routing Model selection & routing Don't use one model for everything. Route easy classification to small fast models, complex reasoning to flagship models, code to specialist coders. Cost and latency drop; quality rises where it matters.
R.07 · Observability Logging & monitoring Log every prompt, response, token count, latency, cost, and user signal. Build dashboards. Alert on regressions (accuracy drops, latency spikes, refusal rate shifts). You cannot fix what you cannot see.
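
To make R.03 concrete, a minimal golden-set harness, sketched. call_model() stands in for whatever prompt-plus-model pipeline you are testing; real harnesses add LLM-as-judge scoring, per-case tags, and CI integration:

  # A golden set is just labeled cases plus a loop that runs on every change.
  GOLDEN_SET = [
      {"input": "I love this product, works perfectly!", "expected": "positive"},
      {"input": "Arrived broken and support never replied.", "expected": "negative"},
      {"input": "It does what it says. Nothing special.", "expected": "neutral"},
  ]

  def call_model(text):
      raise NotImplementedError("call your prompt + model pipeline here")

  def run_eval(golden_set):
      failures = []
      for case in golden_set:
          got = call_model(case["input"]).strip().lower()
          if got != case["expected"]:
              failures.append({**case, "got": got})
      accuracy = 1 - len(failures) / len(golden_set)
      return accuracy, failures

  # Run on every prompt or model change, exactly like a test suite:
  #   accuracy, failures = run_eval(GOLDEN_SET)
  #   assert accuracy >= 0.95, failures
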
Pattern to remember

Every failure mode above has the same structural fix: add structure around the model. Retrieval adds factual grounding. Schemas add output control. Evals add observability. Humans add judgment. The model is a component. The product is a system.

04 Prompting as System Design

Treat a prompt the way you'd treat an API spec. You are specifying: the role, the inputs, the allowed operations, the output format, and the success criteria. The better the spec, the less ambiguity the model has to resolve on its own.

Why prompts matter · mechanically

Every word in a prompt changes the conditional probability distribution over what the model outputs next. Ambiguous prompts leave the distribution wide. You get high-variance, generic-sounding answers. Specific prompts narrow the distribution. You get focused, predictable answers.

Said differently: the model is already going to pick the most probable next tokens. Your job is to make sure the "most probable" tokens happen to be the ones you want.

How the model interprets instructions

The model has no theory of mind. It doesn't "understand" your goal. What it does is pattern-match the structure of your prompt against structures it has seen during training. A prompt that looks like a well-formed task gets a well-formed answer. A prompt that looks like a vague chat gets a vague chat reply.

Best practices, by priority
  • Be explicit. State what you want, what you don't want, and what the boundaries are. "Write a 3-sentence summary focused on financial outcomes. Do not include background context." beats "summarize this."
  • Provide structure. Use tags, headers, or numbered sections to separate instruction, input data, and output specification. Models handle structured prompts far better than free-form ones.
  • Define success criteria. Tell the model what a good output looks like. "A good answer cites specific numbers from the text, stays under 100 words, and ends with a clear recommendation."
  • Use examples (few-shot). If you can't easily describe what you want, show it. Two to five examples is usually the sweet spot.
  • Control tone and format. Specify voice (formal, casual, technical). Specify format (JSON, markdown, plain prose). Specify length. Specify forbidden elements (no em dashes, no emoji, no sycophancy).
  • Give it a role when relevant. "You are a senior clinical pharmacist reviewing a medication list" activates a whole pattern of careful, technical, safety-oriented output.
  • Iterate with evals, not vibes. Change one thing. Re-run your eval set. Keep what improves; revert what doesn't. Prompting without measurement is astrology.
The simplest good-prompt template

Role. Who is the model pretending to be?
Task. What is it doing?
Context. What data or background matters?
Constraints. What must it not do?
Output format. What should the result look like?
Examples. (Optional but powerful.)
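
The same six slots, sketched as a reusable template. The field names and example values are illustrative, not a standard:

  # Prompt-as-spec: one slot per line of the template above.
  PROMPT_TEMPLATE = (
      "Role: {role}\n"
      "Task: {task}\n\n"
      "Context:\n{context}\n\n"
      "Constraints:\n{constraints}\n\n"
      "Output format:\n{output_format}\n\n"
      "Examples:\n{examples}\n"
  )

  prompt = PROMPT_TEMPLATE.format(
      role="You are a senior financial analyst.",
      task="Summarize the attached quarterly report for an executive audience.",
      context="<paste the report text or retrieved chunks here>",
      constraints="- 3 sentences maximum\n- Cite specific numbers\n- No speculation",
      output_format="Plain prose, ending with one clear recommendation.",
      examples="(optional: one or two gold-standard summaries)",
  )
  print(prompt)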

05 Prompting Strategies & When to Use Them

These strategies are not a ranked list. They're tools. Different tasks need different tools. The skill is recognizing which one this problem calls for.

Zero-shot · S.01

What Just ask. No examples, no scaffolding. "Classify this review as positive, negative, or neutral."

When Simple, well-known tasks the model has seen thousands of variations of in training.

Why It's the cheapest, fastest option. If it works, stop. Don't add complexity you don't need.

Few-shot · S.02

What Show 2–5 input/output examples before the real input. The model learns the pattern in-context.

When The task has a specific format, tone, or judgment call the model needs to imitate. Custom classification schemes. Style-matching.

Why LLMs are in-context learners. Examples anchor the distribution far more efficiently than long prose instructions.

Watch out Order of examples matters. Balance the classes (don't show 4 positives and 1 negative). Last example has extra weight.
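
A minimal sketch of assembling a few-shot classification prompt. The reviews and labels are invented; note the classes are balanced and the final slot is left open for the model to complete:

  # One example per class, same format every time, real input last.
  EXAMPLES = [
      ("The new checkout flow is so smooth, great update.", "positive"),
      ("App crashes every time I open my cart.", "negative"),
      ("It works. Delivery took the usual three days.", "neutral"),
  ]

  def build_few_shot_prompt(review):
      lines = ["Classify the review as positive, negative, or neutral.", ""]
      for text, label in EXAMPLES:      # order matters; the last example carries extra weight
          lines += [f"Review: {text}", f"Label: {label}", ""]
      lines += [f"Review: {review}", "Label:"]   # the model completes this line
      return "\n".join(lines)

  print(build_few_shot_prompt("Refund took three weeks and two emails."))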

Chain-of-thought · S.03

What Prompt the model to reason step by step before answering. Either with the phrase "Let's think step by step" or by showing a worked example.

When Multi-step reasoning: math, logic, planning, complex extraction.

Why The model's thinking happens inside the output tokens. More output tokens = more computation spent on the problem. You're literally giving it more room to think.

Role / persona · S.04

What "You are an experienced X…" Cast the model in a specific professional identity.

When You want a specific disciplinary lens: legal, medical, editorial, technical. When tone and depth must match a role.

Why Activates a region of the model's learned representations associated with that role. Outputs shift toward the vocabulary, caution level, and conventions of that profession.

Instruction + constraints · S.05

What Explicit do's and don'ts. Output schemas. Refusal clauses. "Respond only in JSON. Do not include explanatory prose."

When Production systems, structured outputs, anywhere downstream code parses the result.

Why Cuts variance. Makes outputs parseable by regular code. Protects against jailbreaks and scope creep.
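
A minimal sketch of the JSON-only pattern with validation and retry on the way out. call_model() is a placeholder and the schema is invented; many providers also offer native JSON or schema-constrained output modes that enforce this server-side:

  import json

  SYSTEM_PROMPT = (
      "You are a support ticket triage assistant. Respond ONLY with JSON of the form "
      '{"category": "billing" | "bug" | "feature_request", "urgency": 1-5, '
      '"summary": "<one sentence>"}. No explanatory prose.'
  )

  def call_model(system, user):
      raise NotImplementedError("call your chat model here")

  def triage(ticket_text, max_attempts=3):
      for _ in range(max_attempts):
          raw = call_model(SYSTEM_PROMPT, ticket_text)
          try:
              data = json.loads(raw)
              # Validate before any downstream code trusts the output.
              assert data["category"] in {"billing", "bug", "feature_request"}
              assert 1 <= int(data["urgency"]) <= 5
              return data
          except (json.JSONDecodeError, AssertionError, KeyError, ValueError):
              continue                  # retry; in production, also log the bad output
      raise ValueError("model never produced valid JSON")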

Step-by-step decomposition · S.06

What Break a large task into sub-tasks. Prompt for each separately. Compose the results.

When Complex pipelines: analyze → categorize → summarize → recommend. When a single monolithic prompt produces a confused, shallow answer.

Why Each sub-task is easier to specify and easier to evaluate. You can fix a failing stage without breaking the rest.

Self-critique / reflection · S.07

What After generating, ask the model (or a second model) to review the output against the success criteria and revise.

When Quality matters more than speed. When errors have real cost. When you have budget for two model calls instead of one.

Why The "reviewer" pass catches errors the "writer" pass missed. It's the model equivalent of reading your draft before hitting send.

Tree-of-thought · S.08

What Explore multiple reasoning branches in parallel. Evaluate each. Pick the best or merge.

When Problems with multiple plausible paths where the right path isn't obvious upfront. Puzzles, planning, creative ideation.

Why Guards against committing to a bad reasoning path too early. The cost is higher compute and more prompt design.

Self-consistency · S.09

What Run the same prompt N times at non-zero temperature. Take the majority answer.

When Quantitative or categorical tasks where you can vote over outputs. Gives a cheap reliability boost.

Why The model converges on the correct answer more often than any single wrong answer. Ensemble over runs.

Watch out Majority bias. If the training data is skewed, so is the consensus.
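
A minimal majority-vote sketch. call_model() is a placeholder; the normalization step matters because trivially different phrasings of the same answer should count as one vote:

  from collections import Counter

  def call_model(prompt, temperature):
      raise NotImplementedError("call your chat model here")

  def self_consistent_answer(prompt, n=5, temperature=0.7):
      # Run the same prompt several times at non-zero temperature...
      answers = [call_model(prompt, temperature).strip().lower() for _ in range(n)]
      # ...then take the most common answer as the final output.
      winner, votes = Counter(answers).most_common(1)[0]
      return winner, votes / n   # vote share doubles as a crude confidence signal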

ReAct (Reason + Act) · S.10

What Interleave reasoning steps with tool calls. The model thinks, acts (calls an API or search), observes the result, and reasons again.

When Agents, research tasks, anything that requires real-world information or computation the model doesn't have.

Why This is the foundation of modern agentic systems. It turns the LLM from a text-completion engine into something that can get things done.
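
A compressed sketch of the loop: reason, act, observe, repeat. The ACTION/FINAL markers, the toy tools, and call_model() are all invented for illustration; production agents use the provider's native tool-calling format rather than parsing free text:

  import datetime

  def call_model(transcript):
      raise NotImplementedError("call your chat model here")

  TOOLS = {
      "search": lambda q: f"<top search results for {q!r}>",      # stand-in for a real search API
      "today": lambda _: datetime.date.today().isoformat(),       # fresh info the model lacks
  }

  def react(question, max_steps=5):
      transcript = (
          "Answer the question. Think step by step. To use a tool, write\n"
          "ACTION: <tool> | <input>. When you have the answer, write FINAL: <answer>.\n"
          f"Available tools: {', '.join(TOOLS)}\n\nQuestion: {question}\n"
      )
      for _ in range(max_steps):
          step = call_model(transcript)                    # reason (and maybe request an action)
          transcript += step + "\n"
          if "FINAL:" in step:
              return step.split("FINAL:", 1)[1].strip()
          if "ACTION:" in step:
              tool, arg = step.split("ACTION:", 1)[1].split("|", 1)
              observation = TOOLS[tool.strip()](arg.strip())     # act
              transcript += f"OBSERVATION: {observation}\n"      # observe, then loop
      return "No answer within the step budget."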

Why better prompts produce better outputs

The model generates tokens by picking from a probability distribution. Ambiguous prompts leave that distribution wide, so anything could come out. Clear prompts narrow it, and the model converges on what you actually want. Prompting is the practice of shaping the distribution in your favor.

06 Use Cases Across Functions

Most LLM failures in the real world aren't technical. They're about picking the wrong use case. Some work fantastically out of the box. Others need heavy scaffolding. Knowing the difference is most of the job.

OPS

Operations

Used for
SOP drafting, internal knowledge search, ticket triage and routing, meeting summaries, weekly operational reports, anomaly narration from dashboards.
Works
Summarization, classification, tone normalization, drafting-from-template. Anything with clear input and clear target format.
Fails
Novel judgment calls. Anything requiring the institutional context of "why we do it this way." Edge cases without documented precedent.
Improve
RAG over your internal docs. Human-in-loop for the first 90 days. Log edge cases and fold them back into the prompt or knowledge base.
PRD

Product Teams

Used for
User feedback analysis, PRD and spec drafts, persona-driven simulation of feature reactions, support-ticket theming, A/B test result summaries.
Works
Extracting themes from unstructured feedback at scale. First-draft writing that a PM then sharpens. Exploratory what-if simulations.
Fails
Strategic prioritization. Decisions that depend on company politics, roadmap dependencies, or historical context the model doesn't have.
Improve
Feed in real artifacts: past PRDs, user interviews, analytics dumps. Use the LLM to compress and theme; reserve final judgment for humans.
MKT

Marketing

Used for
Content variations, SEO copy, ad creative ideation, email personalization, brand voice consistency checks, press release drafts.
Works
Volume and variation. Generating 20 subject lines, 10 ad angles, 5 headline options. A/B test inputs.
Fails
Strategic positioning. Long-form that carries brand weight. Subtle emotional register. Anything that requires genuine original insight.
Improve
Build a brand voice prompt with 5–10 gold-standard examples. Use the LLM for drafts; have a senior editor review before anything goes public.
SLS

Sales

Used for
Lead enrichment, email drafting, call note summaries, objection-handling coaching, proposal drafting, CRM hygiene.
Works
Summarization of long discovery calls. Personalizing outreach at scale. Converting messy rep notes into structured CRM fields.
Fails
Actual deal judgment. Reading between the lines. Knowing when to push and when to pull back. Closing.
Improve
Feed in call transcripts + CRM context. Use the model to prep reps, not replace them. Pair automated drafts with rep review before send.
RSH

Research & Analysis

Used for
Literature scans, hypothesis generation, interview transcription and thematic coding, cross-document synthesis, bibliography building.
Works
Parallel reading across many sources. Surfacing themes from large unstructured corpora. Drafting literature reviews.
Fails
Claims requiring deep disciplinary expertise. Detecting subtle methodological flaws in a paper. Interpreting statistical nuance.
Improve
RAG over your source library. Require citations with page numbers. Spot-check outputs against the original documents before trusting.
QA

QA & Testing

Used for
Test case generation, bug report triage, flaky-test pattern detection, docs-vs-code consistency checks, edge case brainstorming.
Works
Brainstorming test cases you hadn't considered. Turning vague bug reports into structured tickets. Consistency checking across large artifacts.
Fails
Reasoning about subtle timing bugs, race conditions, or system-level interactions. Prioritizing which tests are actually worth the runtime.
Improve
Feed in specs, existing tests, and known failure modes. Use the LLM as a creative brainstorm partner, not an authority.
DTA

Data Annotation & Validation

Used for
Synthetic data generation, label proposals for human review, cross-annotator consistency checks, edge case generation, data quality scoring.
Works
Scaling human annotation: model pre-labels, humans verify. Generating hard negatives for training sets. Catching labeling inconsistencies.
Fails
Fully autonomous labeling on ambiguous tasks. Labeling data that requires genuine domain expertise (medical imaging, legal classification).
Improve
Treat model labels as suggestions. Human-in-loop on a sampled subset. Measure agreement rates and recalibrate when they drift.
A pattern across every function

The common thread: LLMs are brilliant first-draft engines and pattern recognizers at scale. They are poor final-judgment engines. The productive pattern is draft → human review → ship, not model → ship.

07 How to Think About LLMs

The mental model that separates good AI work from bad AI work is small. You can hold it in one sentence.

An LLM is not a magic tool. It is a probabilistic system that needs structure. The difference between average and excellent output is almost always the system you design around the model.

Everything in this dashboard folds into that idea. Hallucinations are what happens when there's no grounding. Inconsistency is what happens when there's no output control. Brittle prompts are what happens when there's no eval harness. Disappointing use cases are what happens when the model is asked to do the job without scaffolding around it.

The model is a component. A powerful one. The product is a system.

Three habits that compound
  • Write prompts like specs. Role, task, context, constraints, format, examples. Version them. Treat them as code.
  • Build evals before prompts. If you can't measure whether an output is good, you can't improve it. Golden sets are a product asset.
  • Design for graceful failure. Models will fail. The system should notice, contain, and recover, ideally before a user is affected.

The operators who build durable AI products are the ones who internalize this early. The rest spend their time surprised by failures that were entirely predictable.