LLM Cheat Sheet

Complete guide to Large Language Models, parameters, and how they work

🧠 Core Architecture

Large Language Models are built on the Transformer architecture, which revolutionized how AI processes text. Think of an LLM as a massive pattern-matching machine that learned to predict what word comes next by reading billions of text examples.

🔄 Transformer Model

The foundational architecture that allows LLMs to understand context and relationships between words, even when they're far apart in a sentence.

Self-Attention
Mechanism that lets the model focus on different parts of the input when generating each token
Layers
Stack of transformer blocks (12-96+ layers in modern LLMs)
Parameters
Learned weights that determine model behavior (7B = 7 billion parameters)
Tokens
Text broken into pieces (words, subwords, or characters) that the model processes
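To make tokenization concrete, here is a minimal sketch that counts tokens with the tiktoken library (an assumption here; any tokenizer behaves similarly, and different models use different vocabularies).

# Rough sketch: counting tokens with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI chat models

text = "Large Language Models are built on the Transformer architecture."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")  # words vs. tokens
print(tokens[:8])              # token IDs the model actually sees
print(enc.decode(tokens[:8]))  # decode back to text to inspect the pieces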

⚙️ Key Components

Essential building blocks inside each transformer layer that work together to process and understand text.

Embeddings
Convert tokens into numerical vectors
Attention Heads
Multiple parallel attention mechanisms in each layer
Feed-Forward Networks
Dense neural networks within each transformer block
Layer Normalization
Stabilizes training and improves performance
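To make the attention mechanism above concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. Real transformer layers add multiple heads, learned projections per head, causal masking, residual connections, and layer normalization.

# Minimal single-head scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16): one updated vector per token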

📚 Training Process

Training an LLM is like teaching it to read by showing it massive amounts of text and asking it to predict what comes next. This process happens in stages, starting with basic pattern learning and progressing to following human preferences.

📖 Pre-training

Foundational training phase where the model learns basic language patterns, facts, and reasoning abilities.

Objective
Predict the next token in a sequence
Data
Massive text datasets (books, web pages, code)
Self-Supervised
No human labels needed - learns from patterns in text
Compute
Requires enormous computational resources (thousands of GPUs)
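A hedged sketch of the pre-training objective: cross-entropy loss on next-token prediction. The tensors below are random stand-ins for a real model and corpus; only the shifting and loss computation matter.

# Next-token prediction: score the prediction at position t against the token at t+1 (toy data).
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
token_ids = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a tokenized corpus
logits = torch.randn(batch, seq_len, vocab_size)             # stand-in for model outputs

pred = logits[:, :-1, :].reshape(-1, vocab_size)             # predictions for positions 0..n-2
target = token_ids[:, 1:].reshape(-1)                        # the "next" tokens, positions 1..n-1

loss = F.cross_entropy(pred, target)                         # the quantity pre-training minimizes
print(loss.item())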

🎯 Fine-tuning

Additional training to make the model helpful, harmless, and honest by teaching it to follow instructions and align with human values.

Supervised Fine-tuning (SFT)
Training on human-written examples
RLHF
Reinforcement Learning from Human Feedback
Constitutional AI
Training models to follow principles and be helpful/harmless
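As an illustration of the SFT step specifically, the sketch below masks the prompt tokens so the loss is computed only on the desired response, using the common ignore_index=-100 convention. The tensors are toy stand-ins; RLHF and Constitutional AI add further stages on top of this.

# SFT sketch: compute loss only on response tokens, not on the prompt (toy tensors).
import torch
import torch.nn.functional as F

vocab_size, prompt_len, response_len = 1000, 10, 6
seq_len = prompt_len + response_len

input_ids = torch.randint(0, vocab_size, (1, seq_len))
labels = input_ids.clone()
labels[:, :prompt_len] = -100            # -100 positions are ignored by cross_entropy

logits = torch.randn(1, seq_len, vocab_size)   # stand-in for model outputs

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                   # prompt positions contribute nothing to the gradient
)
print(loss.item())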

🎛️ Inference Parameters

These are the "knobs" you can turn when generating text from an LLM. They control how creative, focused, or random the output will be. Understanding these parameters is crucial for getting the behavior you want.

🔥 Temperature

Controls the randomness in the model's word choices. Think of it as the "creativity dial" - mathematically, it's applied to the logits before softmax to control the probability distribution.

# Temperature effect on token selection:
# Low temp (0.1): Always picks most likely token
# High temp (2.0): More random selection from probability distribution

temperature = 0.3   # Focused, deterministic
temperature = 0.7   # Balanced creativity
temperature = 1.5   # Highly creative, less coherent
Range 0.0 to 2.0+
Typically used between 0.1 and 1.0 for most applications
Low (0.1-0.3)
Best for: Code generation, factual Q&A, translations, summaries
Medium (0.7-0.8)
Best for: General chat, explanations, balanced creative writing
High (1.0+)
Best for: Creative writing, brainstorming, poetry, experimental outputs
💡 Pro Tip: Start with temperature 0.3 for any new task, then increase gradually if you need more creativity. Most production applications use 0.1-0.5.
⚠️ Warning: Temperature above 1.5 often produces incoherent text. Above 2.0 typically results in near-random output.
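The sketch below shows what the temperature dial actually does to the logits before softmax; the three-logit vocabulary is made up purely for illustration.

# Temperature scaling: divide logits by T before softmax (toy 3-token vocabulary).
import numpy as np

def sample_distribution(logits, temperature):
    scaled = np.array(logits) / max(temperature, 1e-6)
    exp = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]                    # hypothetical scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    print(t, sample_distribution(logits, t).round(3))
# Low T sharpens the distribution toward the top token; high T flattens it.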

🎯 Top-p (Nucleus Sampling)

Controls how many word choices the model considers by looking at cumulative probability.

Range 0.0 to 1.0
Consider the smallest set of tokens whose cumulative probability mass reaches p
Example
Top-p=0.9 means consider tokens until cumulative probability reaches 90%
Effect
Filters out very unlikely tokens, maintains quality while allowing variety
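A minimal sketch of nucleus (top-p) filtering over a toy probability distribution; production samplers combine this with temperature and other filters.

# Top-p (nucleus) filtering: keep the smallest set of tokens whose probabilities sum to p.
import numpy as np

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                 # renormalize over the kept tokens

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_filter(probs, p=0.9).round(3))           # the 0.07 and 0.03 tails are dropped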

📊 Top-k Sampling

A simpler alternative that just looks at the k most likely next words.

Range 1 to vocab size
Only consider the k most likely next tokens
Example
Top-k=50 means only look at the 50 most probable tokens
Trade-off
Lower k = more focused, higher k = more diverse
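And a matching sketch for top-k filtering, which simply zeroes out everything outside the k most likely tokens.

# Top-k filtering: keep only the k highest-probability tokens, then renormalize.
import numpy as np

def top_k_filter(probs, k=2):
    keep = np.argsort(probs)[::-1][:k]               # indices of the k most likely tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, k=2).round(3))             # only the top two tokens remain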

📏 Other Parameters

Max Tokens
Limits the length of generated response
Repetition Penalty 1.0 to 1.2
Discourages repeating the same phrases
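To show how a repetition penalty works, here is a sketch of a commonly used formulation (divide positive logits and multiply negative logits for tokens that already appeared); exact behavior varies between libraries.

# Repetition penalty sketch: make already-generated tokens less likely to repeat.
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    logits = logits.copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink positive scores
        else:
            logits[token_id] *= penalty   # push negative scores further down
    return logits

logits = np.array([2.0, 0.5, -1.0, 1.5])
print(apply_repetition_penalty(logits, generated_ids=[0, 2], penalty=1.2))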

⚖️ Model Weights & Parameters

The "weights" are essentially the model's learned knowledge - billions of numbers that encode everything it knows about language, facts, and reasoning. These numbers determine how the model processes input and generates responses.

🧮 What Are Weights?

Think of weights as the model's "brain cells" - each one stores a tiny piece of learned information.

Definition
Numerical values learned during training that determine model behavior
Scale
Billions of floating-point numbers
Storage
Stored in formats like FP16, BF16, or INT8 for efficiency
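A back-of-the-envelope sketch for estimating how much memory the raw weights occupy in different storage formats; real deployments add activation and KV-cache overhead on top.

# Rough weight-memory estimate: parameter count × bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(num_params, dtype="FP16"):
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for size_name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(size_name, {d: round(weight_memory_gb(params, d), 1) for d in ("FP16", "INT8", "INT4")})
# 7B ≈ 14 GB in FP16, ≈ 7 GB in INT8 — matching the rough figures below.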

📐 Model Sizes & Performance

The number of parameters roughly correlates with capabilities - more parameters generally mean better performance but require more computational resources.

1B-7B
Small Models
13B-30B
Medium Models
70B+
Large Models
100B+
Mixture of Experts
Small Models 1B-7B params
Examples: Llama 2 7B, Code Llama 7B
Speed: ~50-100 tokens/sec
Memory: 4-14GB VRAM
Best for: Simple tasks, chat, basic coding, content generation
Medium Models 13B-30B params
Examples: Llama 2 13B, Mixtral 8x7B
Speed: ~20-50 tokens/sec
Memory: 16-60GB VRAM
Best for: Complex reasoning, professional writing, advanced coding
Large Models 70B+ params
Examples: Llama 2 70B, GPT-4
Speed: ~5-20 tokens/sec
Memory: 140GB+ VRAM
Best for: Expert-level tasks, research, complex analysis
Mixture of Experts 100B+ params
Examples: Mixtral 8x22B, Switch Transformer
Efficiency: Only activates 10-20% of parameters
Best for: Specialized domains while maintaining efficiency
# Memory requirements (rough estimates):
# 7B model: ~14GB (FP16) or ~7GB (INT8)
# 13B model: ~26GB (FP16) or ~13GB (INT8)
# 70B model: ~140GB (FP16) or ~70GB (INT8)

# Performance scaling (approximate):
# 7B: Good for 80% of tasks
# 13B: Good for 90% of tasks
# 70B: Good for 95% of tasks
💡 Scaling Laws: Model capability doesn't scale linearly with size. Going from 7B to 70B (10x parameters) typically provides 2-3x improvement in most benchmarks, but requires 10x more resources.

💾 Context and Memory

LLMs don't have permanent memory between conversations, but they can "remember" information within a single conversation through their context window. This is like their short-term memory during a chat session.

📄 Context Window & Memory

How much text the model can "see" and remember at once - like the size of its working memory. Longer contexts enable more complex tasks but cost significantly more.

4K
~3 pages
32K
~25 pages
128K
~100 pages
1M+
~800 pages
Short Context 2K-8K tokens
Best for: Chat, Q&A, simple tasks
Cost: Lowest
Speed: Fastest
Examples: Basic GPT-3.5, early Llama models
Medium Context 16K-32K tokens
Best for: Document analysis, code review
Cost: Moderate (2-4x short context)
Speed: Good
Examples: GPT-4, Claude 2
Long Context 128K+ tokens
Best for: Book analysis, large codebases
Cost: High (8-16x short context)
Speed: Slower
Examples: GPT-4 Turbo, Claude 3
Ultra Context 1M+ tokens
Best for: Research papers, complex analysis
Cost: Very High (50x+ short context)
Speed: Much slower
Examples: Gemini Pro 1.5
# Token counting examples:
# "Hello world" = ~2 tokens
# Average word = ~1.3 tokens
# 1 page of text = ~500-800 tokens
# Average book = ~80K-120K tokens
# Large codebase = ~500K-2M tokens

# Context vs Cost scaling:
# 4K context: $0.01 per 1K tokens
# 32K context: $0.03 per 1K tokens
# 128K context: $0.10 per 1K tokens
💡 Context Strategy: Use the shortest context window that fits your task. Attention compute scales quadratically with context length, so long contexts become disproportionately expensive.
⚠️ Lost in the Middle: Models often struggle to use information in the middle of very long contexts. Place important information at the beginning or end.
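As a sketch of the "shortest context that fits" strategy, the helper below trims a chat history to a token budget while always keeping the system prompt. count_tokens is a hypothetical stand-in for a real tokenizer, using the ~1.3 tokens-per-word rule of thumb above.

# Sketch: keep the system prompt plus the most recent messages under a token budget.
def count_tokens(text):
    # Hypothetical stand-in for a tokenizer: ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)

def trim_history(system_prompt, messages, budget=4000):
    used = count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):          # walk backwards from the newest message
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return [system_prompt] + list(reversed(kept))

history = ["hello there"] * 50 + ["what did we decide about pricing?"]
print(len(trim_history("You are a helpful assistant.", history, budget=60)))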

💡 Key Concepts

Important phenomena and capabilities that emerge from the complex interactions within large language models.

✨ Emergent Abilities

Capabilities that suddenly appear when models reach a certain size, much as individual neurons don't think but billions of them together can.

Examples
Complex reasoning, following instructions, code generation
Scale
Often emerge around 10B+ parameters

🎓 In-Context Learning

The model's ability to learn new tasks just from examples in your prompt, without additional training.

Few-Shot
Learning from a few examples
Zero-Shot
Performing tasks without examples
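To make the zero-shot vs. few-shot distinction concrete, here are two hypothetical prompts for the same sentiment-labeling task; the worked examples in the few-shot version are the only "training" the model gets.

# Zero-shot: just the instruction. Few-shot: instruction plus worked examples in the prompt.
zero_shot = "Classify the sentiment of this review as positive or negative:\n'The battery died after two days.'"

few_shot = """Classify the sentiment of each review as positive or negative.

Review: 'Absolutely loved the camera quality.'
Sentiment: positive

Review: 'Shipping took three weeks and the box was crushed.'
Sentiment: negative

Review: 'The battery died after two days.'
Sentiment:"""

print(zero_shot)
print(few_shot)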

🌟 Hallucination

When the model confidently states information that sounds plausible but is actually false.

Causes
Patterns in training data, lack of real-world grounding
Mitigation
Lower temperature, retrieval-augmented generation

💬 Advanced Prompt Engineering

Prompting is the art of communicating effectively with LLMs. Since these models are trained on human text, the way you phrase your request dramatically affects the quality of the response. Mastering these techniques pays off with consistently better results.

📝 Prompt Structure & Best Practices

The foundation of effective prompting: clear structure and specific instructions that guide the model to your desired output.

# Effective Prompt Template:

[CONTEXT] You are an expert data scientist.
[TASK] Analyze the following dataset and identify trends.
[FORMAT] Provide your analysis in 3 bullet points.
[EXAMPLES] For example: "• Trend 1: Sales increased 15% YoY"
[CONSTRAINTS] Keep each point under 50 words.
[INPUT] [Your data here]
Be Specific
Instead of "Write about dogs" → "Write a 300-word informative article about Golden Retriever training techniques for first-time owners"
Use Examples
Show desired format: "Format like this: Name: [name], Age: [age], Skills: [skill1, skill2]"
Few-shot prompting can improve accuracy by 20-50%
Set Constraints
Word limits, tone requirements, forbidden topics
"Write in a professional tone, avoid jargon, maximum 200 words"
Define Output Format
JSON, bullet points, tables, code blocks
"Return results as valid JSON with keys: name, score, explanation"
💡 The CLEAR Framework: Context, Length, Examples, Audience, Role. Always include these elements for maximum effectiveness.
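A small sketch that assembles the template above programmatically; the field names simply mirror the [CONTEXT]/[TASK]/... labels and are otherwise arbitrary.

# Sketch: assemble a structured prompt from labeled parts (labels mirror the template above).
def build_prompt(context, task, fmt, examples, constraints, user_input):
    sections = [
        ("CONTEXT", context),
        ("TASK", task),
        ("FORMAT", fmt),
        ("EXAMPLES", examples),
        ("CONSTRAINTS", constraints),
        ("INPUT", user_input),
    ]
    return "\n".join(f"[{label}] {text}" for label, text in sections)

print(build_prompt(
    context="You are an expert data scientist.",
    task="Analyze the following dataset and identify trends.",
    fmt="Provide your analysis in 3 bullet points.",
    examples='For example: "• Trend 1: Sales increased 15% YoY"',
    constraints="Keep each point under 50 words.",
    user_input="[Your data here]",
))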

🚀 Advanced Techniques

Proven methods that leverage how LLMs process information to achieve expert-level performance on complex tasks.

Chain of Thought (CoT)
Add "Let's think step by step" or "Show your reasoning"
Improves accuracy on math/logic problems by 30-50%
Example: "Solve this step by step, showing your work at each stage"
Role-Based Prompting
"You are a [expert role] with [years] of experience in [domain]"
Activates domain-specific knowledge patterns
Works better than generic instructions
Tree of Thoughts
Ask for multiple approaches: "Generate 3 different solutions, then evaluate each"
Helps with creative and complex problem-solving
Reduces single-path thinking limitations
Self-Criticism
"After providing your answer, critique it and suggest improvements"
Significantly improves output quality
Catches common errors and biases
# Chain of Thought Example:

BAD:  "What's 127 × 43?"

GOOD: "Calculate 127 × 43. Show your work step by step:
       1. Break down the multiplication
       2. Calculate each part
       3. Add the results
       4. Double-check your answer"

# Result: Much more accurate mathematical reasoning
⚠️ Common Mistakes: Avoid vague instructions like "make it good" or "be creative." LLMs need specific, actionable guidance to produce quality outputs.
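The two-pass "answer, then critique" pattern can be wired up as below; llm() is a hypothetical placeholder for whatever API or local model you call.

# Sketch of the self-criticism pattern: draft an answer, then ask the model to critique and revise it.
def llm(prompt):
    # Hypothetical placeholder: swap in a real API call or local model here.
    return f"<model output for: {prompt[:40]}...>"

def answer_with_self_critique(question):
    draft = llm(f"Answer the question. Show your reasoning step by step.\n\nQuestion: {question}")
    critique_prompt = (
        "Here is a question and a draft answer. Critique the draft for errors or gaps, "
        f"then produce an improved final answer.\n\nQuestion: {question}\n\nDraft: {draft}"
    )
    return llm(critique_prompt)

print(answer_with_self_critique("What's 127 × 43?"))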

🎯 Task-Specific Strategies

Optimized prompting approaches for different types of tasks, from coding to creative writing.

Code Generation
Specify language, libraries, error handling
"Write Python code using pandas to... Include error handling and comments"
Use temperature 0.1-0.3 for consistency
Data Analysis
Provide data context and analysis goals
"Analyze this sales data for seasonal trends. Focus on Q4 performance and suggest optimization strategies"
Creative Writing
Set scene, character details, style preferences
"Write a cyberpunk short story (500 words) in the style of Philip K. Dick, featuring an AI detective"
Business Strategy
Include market context, constraints, success metrics
"Develop a go-to-market strategy for [product] targeting [audience] with budget constraints of [amount]"

📊 Benchmarks & Performance Metrics

Standardized tests that allow us to compare different models' capabilities across various tasks. These benchmarks help understand where each model excels and how they compare to human performance.

🎯 Major Benchmarks

Industry-standard tests used to evaluate LLM capabilities across different domains.

MMLU General Knowledge
Massive Multitask Language Understanding
57 subjects from elementary to professional level
Human performance: ~89%
GPT-4: ~86%, GPT-3.5: ~70%
HumanEval Code Generation
164 hand-written programming problems
Measures functional correctness
GPT-4: ~67%, GPT-3.5: ~48%
Code Llama 34B: ~54%
HellaSwag Common Sense
Physical and social common sense reasoning
Human performance: ~95%
GPT-4: ~95%, GPT-3.5: ~85%
Tests everyday reasoning abilities
GSM8K Math Reasoning
Grade school math word problems
Human performance: ~90%
GPT-4: ~92%, GPT-3.5: ~57%
Tests step-by-step reasoning
# Benchmark Score Interpretation:
# 90%+: Superhuman/Expert level
# 80-90%: Professional/Advanced level
# 70-80%: College graduate level
# 60-70%: College student level
# 50-60%: High school level
# <50%: Below human baseline
💡 Benchmark Limitations: High benchmark scores don't guarantee real-world performance. Models can be good at tests but struggle with practical applications, and vice versa.

⚡ Performance Comparisons

How different model families compare across key metrics and real-world usage scenarios.

86%
GPT-4 MMLU
67%
GPT-4 HumanEval
92%
GPT-4 GSM8K
95%
GPT-4 HellaSwag
Speed vs Quality Trade-off
GPT-3.5: 10x faster than GPT-4, ~70% of quality
Claude 3 Haiku: 20x faster than Opus, ~80% of quality
Llama 2 7B: 50x faster than 70B, ~60% of quality
Cost vs Performance
GPT-3.5: $0.001/1K tokens, good performance
GPT-4: $0.03/1K tokens, best performance
Claude 3: $0.01-0.08/1K tokens, balanced
Specialized Models
Code Llama: 2x better at coding than base Llama
Mistral 7B: Matches larger models on specific tasks
Mixtral: Expert-level performance with MoE efficiency

Advanced Practical Considerations

Real-world factors that affect how you can actually deploy and use LLMs in production, from infrastructure requirements to optimization strategies.

🖥️ Hardware Requirements

Understanding computational requirements helps plan infrastructure and budget for LLM deployment.

GPU Memory (VRAM)
7B model: 14GB+ (RTX 4090, A6000)
13B model: 26GB+ (A100 40GB)
70B model: 140GB+ (8x A100 80GB)
Required for model weights + activation memory
CPU vs GPU Performance
GPU: 50-200 tokens/sec for 7B model
CPU: 1-10 tokens/sec for 7B model
GPU acceleration provides 10-50x speedup
Apple Silicon: 20-40 tokens/sec (unified memory)
Quantization Benefits
FP16 → INT8: 50% memory reduction, 5% quality loss
FP16 → INT4: 75% memory reduction, 10-15% quality loss
GPTQ/AWQ: Advanced quantization with minimal loss
Serving Optimization
Batch processing: 5-10x throughput improvement
KV caching: 3-5x faster for long conversations
Speculative decoding: 2-3x faster generation
# Memory calculation example:
# Model: Llama 2 7B
# Parameters: 7B × 2 bytes (FP16) = 14GB
# KV cache: batch_size × seq_len × hidden_dim × layers × 2
# Activation memory: ~20% of model size
# Total: ~17GB minimum for inference
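Expanding on the KV-cache line above, here is a hedged estimate for a Llama 2 7B-shaped model (32 layers, hidden size 4096, FP16); exact numbers depend on the attention variant and implementation.

# KV-cache memory sketch: 2 tensors (K and V) × layers × hidden_dim × bytes, per token per sequence.
def kv_cache_gb(seq_len, batch_size=1, layers=32, hidden_dim=4096, bytes_per_value=2):
    per_token = 2 * layers * hidden_dim * bytes_per_value   # ~0.5 MB per token for this shape
    return batch_size * seq_len * per_token / 1e9

for context in (2_048, 4_096, 32_768):
    print(context, "tokens ->", round(kv_cache_gb(context), 2), "GB of KV cache")
# On top of the ~14 GB of FP16 weights, long contexts add gigabytes of cache per sequence.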

💰 Cost Optimization Strategies

Understanding what drives LLM costs helps you optimize for your budget while maintaining quality. Costs can vary by 100x between different approaches.

$0.002
GPT-3.5 per 1K tokens
$0.03
GPT-4 per 1K tokens
$0.10+
GPT-4 long context
$0.00 per token
Self-hosted models (hardware costs apply)
Token Optimization
Shorter prompts = lower costs
Use abbreviations, remove unnecessary words
Cache repeated context across calls
Example: Save 30-50% by optimizing prompt length
Model Selection
Use smallest model that meets quality needs
GPT-3.5 for simple tasks (15x cheaper than GPT-4)
Claude 3 Haiku for speed (20x cheaper than Opus)
Context Management
Longer context = disproportionately higher costs
Summarize old conversation history
Use vector databases for long-term memory
32K context costs 4x more than 8K
Self-Hosting vs API
Break-even: typically tens of millions of tokens/month at current API prices
Self-hosted: Higher upfront, lower per-token
API: Zero upfront, higher per-token
Consider compliance and latency needs
# Cost comparison example (1B tokens/month):
# GPT-4 API: $30,000/month
# GPT-3.5 API: $2,000/month
# Claude 3 Opus: $15,000/month
# Self-hosted 70B: $2,000/month (amortized)
# Self-hosted 7B: $200/month (amortized)
💡 Cost-Saving Strategy: Use a cascade approach - start with a small/cheap model, escalate to larger models only when needed. Can reduce costs by 60-80% while maintaining quality.
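The cascade idea in the tip above can be sketched as a simple escalation loop; call_model, looks_good_enough, and the model names are hypothetical placeholders for your own client, quality check, and providers.

# Sketch of a model cascade: try cheap models first, escalate only when the output isn't good enough.
CASCADE = ["small-cheap-model", "mid-size-model", "large-expensive-model"]  # hypothetical names

def call_model(model_name, prompt):
    # Hypothetical placeholder for an API call or local inference.
    return f"<{model_name} output for: {prompt[:30]}...>"

def looks_good_enough(output):
    # Hypothetical quality check: length heuristics, a validator, or a grader model.
    return len(output) > 200

def cascade_generate(prompt):
    for model_name in CASCADE:
        output = call_model(model_name, prompt)
        if looks_good_enough(output):
            return model_name, output
    return CASCADE[-1], output            # fall back to the largest model's output

print(cascade_generate("Summarize our Q3 sales performance.")[0])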

⚖️ Performance vs Efficiency Trade-offs

Finding the optimal balance between quality, speed, and cost for your specific use case.

Speed Optimization
Streaming responses: Users see output immediately
Parallel processing: Batch multiple requests
Edge deployment: Reduce latency by 50-80%
Caching: Store common responses
Quality vs Cost
A/B test model performance on your specific tasks
Sometimes 7B models perform as well as 70B
Fine-tuning smaller models can beat larger general ones
Use RLHF data to improve smaller models
Latency Considerations
Real-time chat: <500ms response time
Content generation: 1-3 seconds acceptable
Batch processing: Minutes to hours OK
Choose model size based on latency requirements
Scaling Strategies
Load balancing across multiple model instances
Auto-scaling based on demand
Queue management for burst traffic
Graceful degradation to smaller models
⚠️ Hidden Costs: Don't forget infrastructure, monitoring, fine-tuning, and human oversight costs. These can double your total LLM deployment costs.