LLM Glossary

Showing 12 of 307 terms

Attention Mechanism

architecture

A technique that allows the model to focus on different parts of the input sequence when generating each output token. It's the core innovation that enables models to understand context and relationships across long sequences.

Example:

When translating 'The cat sat on the mat', attention helps the model know that 'cat' should influence the translation of pronouns later in the sentence.

Think of it like:

Like a spotlight that can illuminate different parts of a stage - the model can 'pay attention' to relevant words when deciding what comes next.
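
Code sketch:

A minimal sketch of the core computation, scaled dot-product attention, written with NumPy. The matrices Q, K, and V below are random toy values, not weights from a real model.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Weight each value by how strongly its key matches the query, then sum."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token
        return weights @ V                               # weighted sum of values

    # Three toy tokens, each represented by a 4-dimensional vector
    Q = K = V = np.random.rand(3, 4)
    print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)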

Transformer

architecture

The foundational neural network architecture behind modern LLMs. It uses self-attention to process all positions in a sequence in parallel rather than one at a time, enabling much faster training and better handling of long-range dependencies.

Example:

GPT, BERT, and T5 are all based on the Transformer architecture introduced in the 'Attention Is All You Need' paper.

Think of it like:

Like having multiple translators working on different parts of a document simultaneously, then combining their work intelligently.
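
Code sketch:

A minimal sketch of a single Transformer block using PyTorch; the sizes (64-dimensional tokens, 4 attention heads) are arbitrary choices for illustration, not settings from any real model.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One encoder-style block: self-attention, then a feed-forward network,
        each wrapped in a residual connection and layer normalization."""
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                    nn.Linear(4 * d_model, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)      # every position attends to every other in parallel
            x = self.norm1(x + attn_out)          # residual connection + layer norm
            return self.norm2(x + self.ff(x))     # feed-forward, then residual + layer norm

    tokens = torch.randn(1, 10, 64)               # a batch of one sequence with 10 token vectors
    print(TransformerBlock()(tokens).shape)       # torch.Size([1, 10, 64])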

Token

data

The smallest unit of text that a language model processes. Text is split into tokens, which can be whole words, parts of words, or even individual characters, depending on the tokenization scheme.

Example:

The word 'unhappiness' might be tokenized as ['un', 'happi', 'ness'] or ['unhapp', 'iness'], depending on the tokenizer.

Think of it like:

Like breaking a sentence into puzzle pieces - each piece (token) is small enough for the computer to understand and work with.
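
Code sketch:

A small illustration using the open-source tiktoken library (assuming it is installed); the exact split and ids depend on which tokenizer you load, so treat the comments as indicative only.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # a BPE tokenizer used by several OpenAI models
    ids = enc.encode("unhappiness")              # text -> list of integer token ids
    pieces = [enc.decode([i]) for i in ids]      # each id decoded back to its text piece
    print(ids, pieces)                           # the word splits into a handful of sub-word pieces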

Pre-training

training

The initial phase where a language model learns general language understanding by predicting the next token across billions of examples of text from the internet, books, and other sources.

Example:

Models like GPT-3 were pre-trained on hundreds of billions of tokens from web pages, books, and articles to learn basic language patterns.

Think of it like:

Like a student reading everything in a massive library to build general knowledge before specializing in any particular subject.
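
Code sketch:

A conceptual sketch of the next-token-prediction objective behind pre-training; the tiny model and the random "text" are stand-ins, nothing like a real pre-training setup.

    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 32
    # Stand-in "language model": an embedding layer plus a linear layer that scores the next token
    model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

    tokens = torch.randint(0, vocab_size, (1, 16))    # one fake 16-token training example
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # targets are the inputs shifted by one position

    logits = model(inputs)                            # (batch, sequence length - 1, vocab_size)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()                                   # gradients for one pre-training step
    print(float(loss))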

Fine-tuning

training

The process of taking a pre-trained model and training it further on a specific task or domain to improve performance for that particular use case.

Example:

Taking a general language model and fine-tuning it on medical literature to create a specialized medical AI assistant.

Think of it like:

Like a general doctor doing a residency to become a specialist - building on broad knowledge to excel in a specific area.
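
Code sketch:

A sketch of continued training with the Hugging Face transformers library, assuming it and PyTorch are installed; the "gpt2" checkpoint and the single medical-sounding sentence are placeholder choices, and a real fine-tune would loop over a full domain dataset for many steps.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Start from a pre-trained checkpoint ("gpt2" is just a small, convenient example)
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)   # small learning rate, to avoid erasing general knowledge

    # One placeholder "domain" example; same next-token objective as pre-training, new data
    batch = tok("Patient presents with elevated blood pressure.", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()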

Prompt Engineering

applications

The art and science of crafting input prompts to get the best possible outputs from language models. It involves understanding how to phrase requests and provide context effectively.

Example:

Instead of asking 'Write code', a better prompt might be 'Write a Python function that takes a list of numbers and returns the average, with error handling for empty lists.'

Think of it like:

Like learning how to ask the right questions to get the best answers from a very knowledgeable but literal-minded expert.
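
Code sketch:

A small illustration in Python; build_prompt is a hypothetical helper written for this example, not part of any library, and the prompt text is made up.

    vague_prompt = "Write code"   # the same model gives very different results than with the prompt below

    def build_prompt(task, language, constraints):
        """Hypothetical helper that assembles a structured, specific prompt."""
        return (f"You are an expert {language} developer.\n"
                f"Task: write {task}.\n"
                f"Requirements: {constraints}.\n"
                f"Return only the code, with comments.")

    better_prompt = build_prompt(
        task="a function that takes a list of numbers and returns the average",
        language="Python",
        constraints="raise ValueError for an empty list",
    )
    print(better_prompt)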

Hallucination

evaluation

When a language model generates information that sounds plausible but is factually incorrect or completely made up. This happens because models generate text based on patterns, not truth.

Example:

An LLM might confidently state that 'The capital of Australia is Sydney' or invent fake scientific studies with realistic-sounding details.

Think of it like:

Like a very creative storyteller who sometimes can't tell the difference between what they remember and what they're imagining.

RLHF (Reinforcement Learning from Human Feedback)

training

A training technique where human evaluators rate model outputs, and the model learns to produce responses that humans prefer. This helps align AI behavior with human values and preferences.

Example:

Training ChatGPT to be helpful, harmless, and honest by having humans rate different response options and teaching the model to prefer highly-rated responses.

Think of it like:

Like a student getting feedback from teachers on their essays and gradually learning to write in a way that teachers appreciate.
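
Code sketch:

A sketch of one piece of the RLHF pipeline, the pairwise preference loss used to train a reward model from human ratings; the "reward model" and the response representations below are toy stand-ins, and the full pipeline also includes a reinforcement-learning step that is not shown here.

    import torch
    import torch.nn as nn

    reward_model = nn.Linear(32, 1)   # stand-in: maps a response representation to a single score

    # Fake representations of a human-preferred response and a rejected one
    chosen, rejected = torch.randn(1, 32), torch.randn(1, 32)

    # Push the preferred response's reward above the rejected one's (Bradley-Terry style loss)
    loss = -nn.functional.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    loss.backward()
    print(float(loss))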

Context Window

architecture

The maximum amount of text (measured in tokens) that a language model can consider at once. Text beyond this limit is forgotten or ignored.

Example:

GPT-3.5 originally had a context window of about 4,000 tokens, while some GPT-4 variants (such as GPT-4 Turbo) can handle up to 128,000 tokens.

Think of it like:

Like short-term memory - you can only keep so much information in your head at once before older information gets pushed out.
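
Code sketch:

A minimal sketch of what hitting the context window means in practice; the 4,000-token limit and the fake history below are just example values.

    def fit_to_context(token_ids, max_tokens=4000):
        """Keep only the most recent tokens that fit in the window; older ones are dropped."""
        return token_ids[-max_tokens:]

    history = list(range(10_000))          # pretend conversation history of 10,000 token ids
    print(len(fit_to_context(history)))    # 4000; the oldest 6,000 tokens no longer reach the model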

Embedding

concepts

A way of representing words, sentences, or concepts as vectors (lists of numbers) that capture semantic meaning. Similar concepts have similar embeddings.

Example:

The words 'dog' and 'puppy' would have very similar embeddings, while 'dog' and 'mathematics' would be very different.

Think of it like:

Like giving every word a unique DNA sequence where related words have similar genetic codes.
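
Code sketch:

A small sketch of comparing embeddings with cosine similarity using NumPy; the 4-dimensional vectors are invented toy values, since real embeddings come from a model and typically have hundreds or thousands of dimensions.

    import numpy as np

    def cosine_similarity(a, b):
        """Similarity of two embedding vectors: near 1.0 means closely related."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy embeddings chosen so that related words point in similar directions
    dog = np.array([0.9, 0.8, 0.1, 0.0])
    puppy = np.array([0.85, 0.75, 0.2, 0.05])
    mathematics = np.array([0.0, 0.1, 0.9, 0.95])

    print(cosine_similarity(dog, puppy))          # close to 1.0
    print(cosine_similarity(dog, mathematics))    # much lower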

Zero-shot Learning

applications

The ability of a model to perform a task it has never been explicitly trained on, using only general knowledge and understanding of the task description.

Example:

Asking an LLM to translate between a language pair it was never explicitly trained to translate; it works it out from its general knowledge of both languages and the task description.

Think of it like:

Like a chef who's never made Thai food before but can create a decent Thai dish using their general cooking knowledge and a recipe description.

Few-shot Learning

applications

Providing a model with a few examples of a task within the prompt to help it understand what you want, without any additional training.

Example:

Showing the model 2-3 examples of sentiment analysis before asking it to analyze a new piece of text.

Think of it like:

Like showing someone a few examples of how to fold origami before asking them to fold a new design.
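
Code sketch:

A sketch of how a few-shot prompt is assembled; the reviews and labels are made up, and the final line is left unfinished on purpose so the model completes it with a sentiment label.

    examples = [
        ("The food was amazing and the staff were lovely.", "positive"),
        ("Waited an hour and my order was wrong.", "negative"),
        ("It was fine, nothing special.", "neutral"),
    ]
    new_review = "Great value, but the room was a little noisy."

    prompt = "Classify the sentiment of each review.\n\n"
    for text, label in examples:
        prompt += f"Review: {text}\nSentiment: {label}\n\n"
    prompt += f"Review: {new_review}\nSentiment:"   # the model continues from here
    print(prompt)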