LLM Cheat Sheet

Complete guide to Large Language Models, parameters, and how they work

🧠 Core Architecture

Large Language Models are built on the Transformer architecture, which revolutionized how AI processes text. Think of an LLM as a massive pattern-matching machine that learned to predict what word comes next by reading billions of text examples.

🔄 Transformer Model

The foundational architecture that allows LLMs to understand context and relationships between words, even when they're far apart in a sentence.

Self-Attention
Mechanism that lets the model focus on different parts of the input when generating each token
Layers
Stack of transformer blocks (12-96+ layers in modern LLMs)
Parameters
Learned weights that determine model behavior (7B = 7 billion parameters)
Tokens
Text broken into pieces (words, subwords, or characters) that the model processes
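To make tokenization concrete, here is a minimal sketch that counts tokens with the tiktoken library (an assumption here; any tokenizer behaves similarly, and different models use different vocabularies).

# Rough sketch: counting tokens with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI chat models

text = "Large Language Models are built on the Transformer architecture."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")  # words vs. tokens
print(tokens[:8])              # token IDs the model actually sees
print(enc.decode(tokens[:8]))  # decode back to text to inspect the pieces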

⚙️ Key Components

Essential building blocks inside each transformer layer that work together to process and understand text.

Embeddings
Convert tokens into numerical vectors
Attention Heads
Multiple parallel attention mechanisms in each layer
Feed-Forward Networks
Dense neural networks within each transformer block
Layer Normalization
Stabilizes training and improves performance
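To make the attention mechanism above concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. Real transformer layers add multiple heads, learned projections per head, causal masking, residual connections, and layer normalization.

# Minimal single-head scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16): one updated vector per token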

📚 Training Process

Training an LLM is like teaching it to read by showing it massive amounts of text and asking it to predict what comes next. This process happens in stages, starting with basic pattern learning and progressing to following human preferences.

📖 Pre-training

Foundational training phase where the model learns basic language patterns, facts, and reasoning abilities.

Objective
Predict the next token in a sequence
Data
Massive text datasets (books, web pages, code)
Self-Supervised
No human labels needed - learns from patterns in text
Compute
Requires enormous computational resources (thousands of GPUs)
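A hedged sketch of the pre-training objective: cross-entropy loss on next-token prediction. The tensors below are random stand-ins for a real model and corpus; only the shifting and loss computation matter.

# Next-token prediction: score the prediction at position t against the token at t+1 (toy data).
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
token_ids = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a tokenized corpus
logits = torch.randn(batch, seq_len, vocab_size)             # stand-in for model outputs

pred = logits[:, :-1, :].reshape(-1, vocab_size)             # predictions for positions 0..n-2
target = token_ids[:, 1:].reshape(-1)                        # the "next" tokens, positions 1..n-1

loss = F.cross_entropy(pred, target)                         # the quantity pre-training minimizes
print(loss.item())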

🎯 Fine-tuning

Additional training to make the model helpful, harmless, and honest by teaching it to follow instructions and align with human values.

Supervised Fine-tuning (SFT)
Training on human-written examples
RLHF
Reinforcement Learning from Human Feedback
Constitutional AI
Training models to follow principles and be helpful/harmless
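As an illustration of the SFT step specifically, the sketch below masks the prompt tokens so the loss is computed only on the desired response, using the common ignore_index=-100 convention. The tensors are toy stand-ins; RLHF and Constitutional AI add further stages on top of this.

# SFT sketch: compute loss only on response tokens, not on the prompt (toy tensors).
import torch
import torch.nn.functional as F

vocab_size, prompt_len, response_len = 1000, 10, 6
seq_len = prompt_len + response_len

input_ids = torch.randint(0, vocab_size, (1, seq_len))
labels = input_ids.clone()
labels[:, :prompt_len] = -100            # -100 positions are ignored by cross_entropy

logits = torch.randn(1, seq_len, vocab_size)   # stand-in for model outputs

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                   # prompt positions contribute nothing to the gradient
)
print(loss.item())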

🎛️ Inference Parameters

These are the "knobs" you can turn when generating text from an LLM. They control how creative, focused, or random the output will be. Understanding these parameters is crucial for getting the behavior you want.

🔥 Temperature

Controls the randomness in the model's word choices. Think of it as the "creativity dial" - mathematically, it's applied to the logits before softmax to control the probability distribution.

# Temperature effect on token selection:
# Low temp (0.1): Always picks most likely token
# High temp (2.0): More random selection from probability distribution

temperature = 0.3   # Focused, deterministic
temperature = 0.7   # Balanced creativity
temperature = 1.5   # Highly creative, less coherent
Range 0.0 to 2.0+
Typically used between 0.1 and 1.0 for most applications
Low (0.1-0.3)
Best for: Code generation, factual Q&A, translations, summaries
Medium (0.7-0.8)
Best for: General chat, explanations, balanced creative writing
High (1.0+)
Best for: Creative writing, brainstorming, poetry, experimental outputs
💡 Pro Tip: Start with temperature 0.3 for any new task, then increase gradually if you need more creativity. Most production applications use 0.1-0.5.
⚠️ Warning: Temperature above 1.5 often produces incoherent text. Above 2.0 typically results in near-random output.
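The sketch below shows what the temperature dial actually does to the logits before softmax; the three-logit vocabulary is made up purely for illustration.

# Temperature scaling: divide logits by T before softmax (toy 3-token vocabulary).
import numpy as np

def sample_distribution(logits, temperature):
    scaled = np.array(logits) / max(temperature, 1e-6)
    exp = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]                    # hypothetical scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    print(t, sample_distribution(logits, t).round(3))
# Low T sharpens the distribution toward the top token; high T flattens it.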

🎯 Top-p (Nucleus Sampling)

Controls how many word choices the model considers by looking at cumulative probability.

Range 0.0 to 1.0
Consider the smallest set of tokens whose cumulative probability mass reaches p
Example
Top-p=0.9 means consider tokens until cumulative probability reaches 90%
Effect
Filters out very unlikely tokens, maintains quality while allowing variety
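A minimal sketch of nucleus (top-p) filtering over a toy probability distribution; production samplers combine this with temperature and other filters.

# Top-p (nucleus) filtering: keep the smallest set of tokens whose probabilities sum to p.
import numpy as np

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                 # renormalize over the kept tokens

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_p_filter(probs, p=0.9).round(3))           # the 0.07 and 0.03 tails are dropped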

📊 Top-k Sampling

A simpler alternative that just looks at the k most likely next words.

Range 1 to vocab size
Only consider the k most likely next tokens
Example
Top-k=50 means only look at the 50 most probable tokens
Trade-off
Lower k = more focused, higher k = more diverse
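And a matching sketch for top-k filtering, which simply zeroes out everything outside the k most likely tokens.

# Top-k filtering: keep only the k highest-probability tokens, then renormalize.
import numpy as np

def top_k_filter(probs, k=2):
    keep = np.argsort(probs)[::-1][:k]               # indices of the k most likely tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, k=2).round(3))             # only the top two tokens remain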

📏 Other Parameters

Max Tokens
Limits the length of generated response
Repetition Penalty 1.0 to 1.2
Discourages repeating the same phrases
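To show how a repetition penalty works, here is a sketch of a commonly used formulation (divide positive logits and multiply negative logits for tokens that already appeared); exact behavior varies between libraries.

# Repetition penalty sketch: make already-generated tokens less likely to repeat.
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    logits = logits.copy()
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink positive scores
        else:
            logits[token_id] *= penalty   # push negative scores further down
    return logits

logits = np.array([2.0, 0.5, -1.0, 1.5])
print(apply_repetition_penalty(logits, generated_ids=[0, 2], penalty=1.2))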

⚖️ Model Weights & Parameters

The "weights" are essentially the model's learned knowledge - billions of numbers that encode everything it knows about language, facts, and reasoning. These numbers determine how the model processes input and generates responses.

🧮 What Are Weights?

Think of weights as the model's "brain cells" - each one stores a tiny piece of learned information.

Definition
Numerical values learned during training that determine model behavior
Scale
Billions of floating-point numbers
Storage
Stored in formats like FP16, BF16, or INT8 for efficiency
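A back-of-the-envelope sketch for estimating how much memory the raw weights occupy in different storage formats; real deployments add activation and KV-cache overhead on top.

# Rough weight-memory estimate: parameter count × bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(num_params, dtype="FP16"):
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for size_name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(size_name, {d: round(weight_memory_gb(params, d), 1) for d in ("FP16", "INT8", "INT4")})
# 7B ≈ 14 GB in FP16, ≈ 7 GB in INT8 — matching the rough figures below.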

📐 Model Sizes & Performance

The number of parameters roughly correlates with capabilities - more parameters generally mean better performance but require more computational resources.

1B-7B
Small Models
13B-30B
Medium Models
70B+
Large Models
100B+
Mixture of Experts
Small Models 1B-7B params
Examples: Llama 2 7B, Code Llama 7B
Speed: ~50-100 tokens/sec
Memory: 4-14GB VRAM
Best for: Simple tasks, chat, basic coding, content generation
Medium Models 13B-30B params
Examples: Llama 2 13B, Mixtral 8x7B
Speed: ~20-50 tokens/sec
Memory: 16-60GB VRAM
Best for: Complex reasoning, professional writing, advanced coding
Large Models 70B+ params
Examples: Llama 2 70B, GPT-4
Speed: ~5-20 tokens/sec
Memory: 140GB+ VRAM
Best for: Expert-level tasks, research, complex analysis
Mixture of Experts 100B+ params
Examples: Mixtral 8x22B, Switch Transformer
Efficiency: Only activates 10-20% of parameters
Best for: Specialized domains while maintaining efficiency
# Memory requirements (rough estimates):
# 7B model: ~14GB (FP16) or ~7GB (INT8)
# 13B model: ~26GB (FP16) or ~13GB (INT8)
# 70B model: ~140GB (FP16) or ~70GB (INT8)

# Performance scaling (approximate):
# 7B: Good for 80% of tasks
# 13B: Good for 90% of tasks
# 70B: Good for 95% of tasks
💡 Scaling Laws: Model capability doesn't scale linearly with size. Going from 7B to 70B (10x parameters) typically provides 2-3x improvement in most benchmarks, but requires 10x more resources.

💾 Context and Memory

LLMs don't have permanent memory between conversations, but they can "remember" information within a single conversation through their context window. This is like their short-term memory during a chat session.

📄 Context Window & Memory

How much text the model can "see" and remember at once - like the size of its working memory. Longer contexts enable more complex tasks but cost significantly more.

4K
~3 pages
32K
~25 pages
128K
~100 pages
1M+
~800 pages
Short Context 2K-8K tokens
Best for: Chat, Q&A, simple tasks
Cost: Lowest
Speed: Fastest
Examples: Basic GPT-3.5, early Llama models
Medium Context 16K-32K tokens
Best for: Document analysis, code review
Cost: Moderate (2-4x short context)
Speed: Good
Examples: GPT-4, Claude 2
Long Context 128K+ tokens
Best for: Book analysis, large codebases
Cost: High (8-16x short context)
Speed: Slower
Examples: GPT-4 Turbo, Claude 3
Ultra Context 1M+ tokens
Best for: Research papers, complex analysis
Cost: Very High (50x+ short context)
Speed: Much slower
Examples: Gemini Pro 1.5
# Token counting examples:
# "Hello world" = ~2 tokens
# Average word = ~1.3 tokens
# 1 page of text = ~500-800 tokens
# Average book = ~80K-120K tokens
# Large codebase = ~500K-2M tokens

# Context vs Cost scaling:
# 4K context: $0.01 per 1K tokens
# 32K context: $0.03 per 1K tokens
# 128K context: $0.10 per 1K tokens
💡 Context Strategy: Use the shortest context window that fits your task. Attention compute scales quadratically with context length, so long contexts become disproportionately expensive.
⚠️ Lost in the Middle: Models often struggle to use information in the middle of very long contexts. Place important information at the beginning or end.
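As a sketch of the "shortest context that fits" strategy, the helper below trims a chat history to a token budget while always keeping the system prompt. count_tokens is a hypothetical stand-in for a real tokenizer, using the ~1.3 tokens-per-word rule of thumb above.

# Sketch: keep the system prompt plus the most recent messages under a token budget.
def count_tokens(text):
    # Hypothetical stand-in for a tokenizer: ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)

def trim_history(system_prompt, messages, budget=4000):
    used = count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):          # walk backwards from the newest message
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return [system_prompt] + list(reversed(kept))

history = ["hello there"] * 50 + ["what did we decide about pricing?"]
print(len(trim_history("You are a helpful assistant.", history, budget=60)))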

💡 Key Concepts

Important phenomena and capabilities that emerge from the complex interactions within large language models.

✨ Emergent Abilities

Capabilities that suddenly appear when models reach a certain size, much as individual neurons don't think but billions of them together can.

Examples
Complex reasoning, following instructions, code generation
Scale
Often emerge around 10B+ parameters

🎓 In-Context Learning

The model's ability to learn new tasks just from examples in your prompt, without additional training.

Few-Shot
Learning from a few examples
Zero-Shot
Performing tasks without examples
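To make the zero-shot vs. few-shot distinction concrete, here are two hypothetical prompts for the same sentiment-labeling task; the worked examples in the few-shot version are the only "training" the model gets.

# Zero-shot: just the instruction. Few-shot: instruction plus worked examples in the prompt.
zero_shot = "Classify the sentiment of this review as positive or negative:\n'The battery died after two days.'"

few_shot = """Classify the sentiment of each review as positive or negative.

Review: 'Absolutely loved the camera quality.'
Sentiment: positive

Review: 'Shipping took three weeks and the box was crushed.'
Sentiment: negative

Review: 'The battery died after two days.'
Sentiment:"""

print(zero_shot)
print(few_shot)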

🌟 Hallucination

When the model confidently states information that sounds plausible but is actually false.

Causes
Patterns in training data, lack of real-world grounding
Mitigation
Lower temperature, retrieval-augmented generation

💬 Advanced Prompt Engineering

Prompting is the art of communicating effectively with LLMs. Since these models are trained on human text, the way you phrase your request dramatically affects the quality of the response. Mastering these techniques pays off with consistently better results.

📝 Prompt Structure & Best Practices

The foundation of effective prompting: clear structure and specific instructions that guide the model to your desired output.

# Effective Prompt Template:

[CONTEXT] You are an expert data scientist.
[TASK] Analyze the following dataset and identify trends.
[FORMAT] Provide your analysis in 3 bullet points.
[EXAMPLES] For example: "• Trend 1: Sales increased 15% YoY"
[CONSTRAINTS] Keep each point under 50 words.
[INPUT] [Your data here]
Be Specific
Instead of "Write about dogs" → "Write a 300-word informative article about Golden Retriever training techniques for first-time owners"
Use Examples
Show desired format: "Format like this: Name: [name], Age: [age], Skills: [skill1, skill2]"
Few-shot prompting can improve accuracy by 20-50%
Set Constraints
Word limits, tone requirements, forbidden topics
"Write in a professional tone, avoid jargon, maximum 200 words"
Define Output Format
JSON, bullet points, tables, code blocks
"Return results as valid JSON with keys: name, score, explanation"
💡 The CLEAR Framework: Context, Length, Examples, Audience, Role. Always include these elements for maximum effectiveness.
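A small sketch that assembles the template above programmatically; the field names simply mirror the [CONTEXT]/[TASK]/... labels and are otherwise arbitrary.

# Sketch: assemble a structured prompt from labeled parts (labels mirror the template above).
def build_prompt(context, task, fmt, examples, constraints, user_input):
    sections = [
        ("CONTEXT", context),
        ("TASK", task),
        ("FORMAT", fmt),
        ("EXAMPLES", examples),
        ("CONSTRAINTS", constraints),
        ("INPUT", user_input),
    ]
    return "\n".join(f"[{label}] {text}" for label, text in sections)

print(build_prompt(
    context="You are an expert data scientist.",
    task="Analyze the following dataset and identify trends.",
    fmt="Provide your analysis in 3 bullet points.",
    examples='For example: "• Trend 1: Sales increased 15% YoY"',
    constraints="Keep each point under 50 words.",
    user_input="[Your data here]",
))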

🚀 Advanced Techniques

Proven methods that leverage how LLMs process information to achieve expert-level performance on complex tasks.

Chain of Thought (CoT)
Add "Let's think step by step" or "Show your reasoning"
Improves accuracy on math/logic problems by 30-50%
Example: "Solve this step by step, showing your work at each stage"
Role-Based Prompting
"You are a [expert role] with [years] of experience in [domain]"
Activates domain-specific knowledge patterns
Works better than generic instructions
Tree of Thoughts
Ask for multiple approaches: "Generate 3 different solutions, then evaluate each"
Helps with creative and complex problem-solving
Reduces single-path thinking limitations
Self-Criticism
"After providing your answer, critique it and suggest improvements"
Significantly improves output quality
Catches common errors and biases
# Chain of Thought Example:

BAD:  "What's 127 × 43?"

GOOD: "Calculate 127 × 43. Show your work step by step:
       1. Break down the multiplication
       2. Calculate each part
       3. Add the results
       4. Double-check your answer"

# Result: Much more accurate mathematical reasoning
⚠️ Common Mistakes: Avoid vague instructions like "make it good" or "be creative." LLMs need specific, actionable guidance to produce quality outputs.
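The two-pass "answer, then critique" pattern can be wired up as below; llm() is a hypothetical placeholder for whatever API or local model you call.

# Sketch of the self-criticism pattern: draft an answer, then ask the model to critique and revise it.
def llm(prompt):
    # Hypothetical placeholder: swap in a real API call or local model here.
    return f"<model output for: {prompt[:40]}...>"

def answer_with_self_critique(question):
    draft = llm(f"Answer the question. Show your reasoning step by step.\n\nQuestion: {question}")
    critique_prompt = (
        "Here is a question and a draft answer. Critique the draft for errors or gaps, "
        f"then produce an improved final answer.\n\nQuestion: {question}\n\nDraft: {draft}"
    )
    return llm(critique_prompt)

print(answer_with_self_critique("What's 127 × 43?"))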

🎯 Task-Specific Strategies

Optimized prompting approaches for different types of tasks, from coding to creative writing.

Code Generation
Specify language, libraries, error handling
"Write Python code using pandas to... Include error handling and comments"
Use temperature 0.1-0.3 for consistency
Data Analysis
Provide data context and analysis goals
"Analyze this sales data for seasonal trends. Focus on Q4 performance and suggest optimization strategies"
Creative Writing
Set scene, character details, style preferences
"Write a cyberpunk short story (500 words) in the style of Philip K. Dick, featuring an AI detective"
Business Strategy
Include market context, constraints, success metrics
"Develop a go-to-market strategy for [product] targeting [audience] with budget constraints of [amount]"

📊 Benchmarks & Performance Metrics

Standardized tests that allow us to compare different models' capabilities across various tasks. These benchmarks help understand where each model excels and how they compare to human performance.

🎯 Major Benchmarks

Industry-standard tests used to evaluate LLM capabilities across different domains.

MMLU General Knowledge
Massive Multitask Language Understanding
57 subjects from elementary to professional level
Human performance: ~89%
GPT-4: ~86%, GPT-3.5: ~70%
HumanEval Code Generation
164 hand-written programming problems
Measures functional correctness
GPT-4: ~67%, GPT-3.5: ~48%
Code Llama 34B: ~54%
HellaSwag Common Sense
Physical and social common sense reasoning
Human performance: ~95%
GPT-4: ~95%, GPT-3.5: ~85%
Tests everyday reasoning abilities
GSM8K Math Reasoning
Grade school math word problems
Human performance: ~90%
GPT-4: ~92%, GPT-3.5: ~57%
Tests step-by-step reasoning
# Benchmark Score Interpretation:
# 90%+: Superhuman/Expert level
# 80-90%: Professional/Advanced level
# 70-80%: College graduate level
# 60-70%: College student level
# 50-60%: High school level
# <50%: Below human baseline
💡 Benchmark Limitations: High benchmark scores don't guarantee real-world performance. Models can be good at tests but struggle with practical applications, and vice versa.

⚡ Performance Comparisons

How different model families compare across key metrics and real-world usage scenarios.

86%
GPT-4 MMLU
67%
GPT-4 HumanEval
92%
GPT-4 GSM8K
95%
GPT-4 HellaSwag
Speed vs Quality Trade-off
GPT-3.5: 10x faster than GPT-4, ~70% of quality
Claude 3 Haiku: 20x faster than Opus, ~80% of quality
Llama 2 7B: 50x faster than 70B, ~60% of quality
Cost vs Performance
GPT-3.5: $0.001/1K tokens, good performance
GPT-4: $0.03/1K tokens, best performance
Claude 3: $0.01-0.08/1K tokens, balanced
Specialized Models
Code Llama: 2x better at coding than base Llama
Mistral 7B: Matches larger models on specific tasks
Mixtral: Expert-level performance with MoE efficiency

Advanced Practical Considerations

Real-world factors that affect how you can actually deploy and use LLMs in production, from infrastructure requirements to optimization strategies.

🖥️ Hardware Requirements

Understanding computational requirements helps plan infrastructure and budget for LLM deployment.

GPU Memory (VRAM)
7B model: 14GB+ (RTX 4090, A6000)
13B model: 26GB+ (A100 40GB)
70B model: 140GB+ (8x A100 80GB)
Required for model weights + activation memory
CPU vs GPU Performance
GPU: 50-200 tokens/sec for 7B model
CPU: 1-10 tokens/sec for 7B model
GPU acceleration provides 10-50x speedup
Apple Silicon: 20-40 tokens/sec (unified memory)
Quantization Benefits
FP16 → INT8: 50% memory reduction, 5% quality loss
FP16 → INT4: 75% memory reduction, 10-15% quality loss
GPTQ/AWQ: Advanced quantization with minimal loss
Serving Optimization
Batch processing: 5-10x throughput improvement
KV caching: 3-5x faster for long conversations
Speculative decoding: 2-3x faster generation
# Memory calculation example:
# Model: Llama 2 7B
# Parameters: 7B × 2 bytes (FP16) = 14GB
# KV cache: batch_size × seq_len × hidden_dim × layers × 2
# Activation memory: ~20% of model size
# Total: ~17GB minimum for inference
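Expanding on the KV-cache line above, here is a hedged estimate for a Llama 2 7B-shaped model (32 layers, hidden size 4096, FP16); exact numbers depend on the attention variant and implementation.

# KV-cache memory sketch: 2 tensors (K and V) × layers × hidden_dim × bytes, per token per sequence.
def kv_cache_gb(seq_len, batch_size=1, layers=32, hidden_dim=4096, bytes_per_value=2):
    per_token = 2 * layers * hidden_dim * bytes_per_value   # ~0.5 MB per token for this shape
    return batch_size * seq_len * per_token / 1e9

for context in (2_048, 4_096, 32_768):
    print(context, "tokens ->", round(kv_cache_gb(context), 2), "GB of KV cache")
# On top of the ~14 GB of FP16 weights, long contexts add gigabytes of cache per sequence.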

💰 Cost Optimization Strategies

Understanding what drives LLM costs helps you optimize for your budget while maintaining quality. Costs can vary by 100x between different approaches.

$0.002
GPT-3.5 per 1K tokens
$0.03
GPT-4 per 1K tokens
$0.10+
GPT-4 long context
$0.00 per token
Self-hosted models (hardware costs apply)
Token Optimization
Shorter prompts = lower costs
Use abbreviations, remove unnecessary words
Cache repeated context across calls
Example: Save 30-50% by optimizing prompt length
Model Selection
Use smallest model that meets quality needs
GPT-3.5 for simple tasks (15x cheaper than GPT-4)
Claude 3 Haiku for speed (20x cheaper than Opus)
Context Management
Longer context = disproportionately higher costs
Summarize old conversation history
Use vector databases for long-term memory
32K context costs 4x more than 8K
Self-Hosting vs API
Break-even: typically tens of millions of tokens/month at current API prices
Self-hosted: Higher upfront, lower per-token
API: Zero upfront, higher per-token
Consider compliance and latency needs
# Cost comparison example (1B tokens/month):
# GPT-4 API: $30,000/month
# GPT-3.5 API: $2,000/month
# Claude 3 Opus: $15,000/month
# Self-hosted 70B: $2,000/month (amortized)
# Self-hosted 7B: $200/month (amortized)
💡 Cost-Saving Strategy: Use a cascade approach - start with a small/cheap model, escalate to larger models only when needed. Can reduce costs by 60-80% while maintaining quality.
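The cascade idea in the tip above can be sketched as a simple escalation loop; call_model, looks_good_enough, and the model names are hypothetical placeholders for your own client, quality check, and providers.

# Sketch of a model cascade: try cheap models first, escalate only when the output isn't good enough.
CASCADE = ["small-cheap-model", "mid-size-model", "large-expensive-model"]  # hypothetical names

def call_model(model_name, prompt):
    # Hypothetical placeholder for an API call or local inference.
    return f"<{model_name} output for: {prompt[:30]}...>"

def looks_good_enough(output):
    # Hypothetical quality check: length heuristics, a validator, or a grader model.
    return len(output) > 200

def cascade_generate(prompt):
    for model_name in CASCADE:
        output = call_model(model_name, prompt)
        if looks_good_enough(output):
            return model_name, output
    return CASCADE[-1], output            # fall back to the largest model's output

print(cascade_generate("Summarize our Q3 sales performance.")[0])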

⚖️ Performance vs Efficiency Trade-offs

Finding the optimal balance between quality, speed, and cost for your specific use case.

Speed Optimization
Streaming responses: Users see output immediately
Parallel processing: Batch multiple requests
Edge deployment: Reduce latency by 50-80%
Caching: Store common responses
Quality vs Cost
A/B test model performance on your specific tasks
Sometimes 7B models perform as well as 70B
Fine-tuning smaller models can beat larger general ones
Use RLHF data to improve smaller models
Latency Considerations
Real-time chat: <500ms response time
Content generation: 1-3 seconds acceptable
Batch processing: Minutes to hours OK
Choose model size based on latency requirements
Scaling Strategies
Load balancing across multiple model instances
Auto-scaling based on demand
Queue management for burst traffic
Graceful degradation to smaller models
⚠️ Hidden Costs: Don't forget infrastructure, monitoring, fine-tuning, and human oversight costs. These can double your total LLM deployment costs.