A concise, personal comparison of key LLM architectures developed over the past few years.

This document reflects my individual understanding and curiosity-driven research, covering models released between 2017 and February 2025.

This is by no means an exhaustive list, and many other excellent models exist in the field.

🎯 List of LLMs Covered (2017–2025)

Transformer, BERT, GPT, GPT-2, XLNet, GPT-3, T5, Switch Transformer, GLaM, Gopher, Chinchilla, LaMDA, PaLM, InstructGPT / ChatGPT, GPT-4, LLaMA, PaLM 2, Claude, Gemini, Mistral, Qwen-1.5, LLaMA 3, DeepSeek-R1


🎯 2017–2019

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| Transformer (2017) [1] | Encoder–Decoder Transformer | Multi-head self-attention (encoder & decoder) + cross-attention | Fixed sinusoidal | Post-LayerNorm; ReLU | ~65M (base model); 512 tokens | WMT14 translation corpora (e.g., 4.5M sentence pairs En→De) | Introduced self-attention to replace recurrent networks, enabling parallel sequence processing | Supervised learning on translation tasks; residual connections, layer normalization, Adam optimizer | Dramatically improved machine translation quality and speed; became the foundational architecture for subsequent LLMs |
| BERT (2018) [2] | Transformer Encoder (bidirectional) | Full bidirectional self-attention (MLM objective) | Learned absolute | Post-LayerNorm; GELU | 110M (Base), 340M (Large); 512 tokens | BooksCorpus + English Wikipedia (3.3B words total) | Masked Language Modeling and Next Sentence Prediction for deep bidirectional context understanding | Unsupervised pre-training on a large text corpus, then task-specific fine-tuning (transfer learning) | Set new state of the art on many NLP tasks (GLUE, QA) via contextualized embeddings and fine-tuning |
| GPT (2018) [3] | Transformer Decoder (unidirectional) | Auto-regressive masked self-attention (causal LM) | Learned absolute | LayerNorm in transformer blocks; GELU | 117M; 512 tokens | BookCorpus (~700M words of novels) | First to use generative pre-training for language understanding tasks, demonstrating transfer learning from an unsupervised LM | Unsupervised language-model pre-training on unlabeled text, followed by supervised fine-tuning on each task | Outperformed task-specific architectures on 9 of 12 NLP tasks via pre-trained knowledge, showing the power of generative pre-training |
| GPT-2 (2019) [4] | Transformer Decoder (deep, unidirectional) | Masked multi-head self-attention (auto-regressive) | Learned absolute | LayerNorm in each layer; GELU | 1.5B; 1024 tokens | WebText (8M web pages from Reddit links, ~40 GB) | Demonstrated that much larger unsupervised language models can generate coherent long-form text | Generative pre-training on vast internet text; no fine-tuning, evaluated zero-shot on tasks | Achieved notable zero-shot performance on diverse tasks (QA, translation, summarization), indicating emergent multitask learning abilities |
| XLNet (2019) [5] | Transformer-XL Decoder (autoregressive) | Permutation-based full self-attention (two-stream) | Segment-aware relative positional encoding | Post-LayerNorm; GELU | 340M (Large); 512 tokens | Diverse large text corpora (Google Books, Wikipedia, Giga5, ClueWeb, Common Crawl) | Generalized autoregressive pre-training that leverages all context positions (permuted order) instead of masking | Memory-augmented Transformer (recurrence from Transformer-XL) with two-stream attention; trained with a permutation language modeling objective | Outperformed BERT on NLP benchmarks (e.g., GLUE) by capturing bidirectional context without an explicit mask, improving downstream task performance |
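
The 2017–2019 rows above revolve around two mechanisms: positional encodings (fixed sinusoidal in the Transformer, learned absolute in BERT/GPT) and self-attention that is either bidirectional (BERT) or causally masked (GPT, GPT-2). The NumPy sketch below is my own minimal illustration of both, with toy dimensions; it omits the learned Q/K/V projections, multiple heads, residual connections, and LayerNorm that the real models use.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) pair indices
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model).

    causal=False ~ BERT-style bidirectional attention (every token sees every token);
    causal=True  ~ GPT-style auto-regressive attention (token t sees only tokens <= t).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len) similarity logits
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)       # block attention to future positions
    return softmax(scores) @ x                        # weighted mix of the value vectors (here: x itself)

seq_len, d_model = 6, 16
x = np.random.randn(seq_len, d_model) + sinusoidal_positions(seq_len, d_model)
print(self_attention(x, causal=True).shape)           # (6, 16)
```

Real implementations wrap this core in learned projections, multiple heads, residuals, and LayerNorm (post-LN in these early models, pre-LN in most of the models in the next table).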

🎯 2020–2022

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3 (2020) [6] | Transformer Decoder (very deep) | Masked multi-head self-attention (auto-regressive) | Learned absolute (2048 positions) | Pre-LayerNorm; GELU | 175B; 2048 tokens | ~300B tokens from Common Crawl, WebText2, Books, Wikipedia | Massive scale showed emergent few-shot learning: the model can perform tasks from prompts without fine-tuning | Trained on an extremely large corpus with mixed precision and model parallelism across GPUs; no task-specific fine-tuning required for evaluation | Achieved state of the art in few-shot and zero-shot settings on many NLP tasks; demonstrated the benefits of scale for versatility |
| T5 (2020) [7] | Transformer Encoder–Decoder | Full self-attention (enc & dec) + cross-attention | Relative positional embeddings | Pre-LayerNorm; ReLU (with variants explored) | 11B (largest); 512 tokens | C4 (Colossal Clean Crawled Corpus, ~750 GB of text) | Unified "text-to-text" framework: the model treats every NLP task (translation, QA, summarization, etc.) as text generation | Unsupervised pre-training on C4 with a denoising objective, followed by task-specific fine-tuning in a text-to-text format | Achieved state of the art on numerous benchmarks with one model applicable to all tasks; open-sourced in various sizes for flexible fine-tuning |
| Switch Transformer (2021) [8] | Transformer Encoder–Decoder (Mixture-of-Experts, built on T5) | Multi-head self-attention; sparse MoE in the feed-forward layers (one expert routed per token) | Relative positional embeddings (T5-style) | Pre-LayerNorm; SwiGLU | 1.6T (64 experts; ~26B active per token); 2048 tokens | C4 corpus (same as T5) | Introduced conditional computation: routing activates a single expert feed-forward network per token, enabling extreme scale with efficient compute | MoE training with a load-balancing loss to ensure experts are utilized; scaled on TPU pods to reach trillion+ parameters | Matched dense-model quality with much lower computational cost; set new scale records (trillion+ parameters) while maintaining strong zero-shot and one-shot performance |
| GLaM (2022) [9] | Transformer Decoder (Mixture-of-Experts) | Multi-head self-attention; sparse MoE feed-forward layers (two experts per token) | Learned absolute | Pre-LayerNorm; GELU | 1.2T (64 experts, 2 active per token); 2048 tokens | Massive web corpus (filtered web pages, dialogues, code) | Scaled MoE further with a balanced gating approach (each token routed to 2 experts) for efficiency: roughly 7× the parameter count of GPT-3 at about 1/3 of the training energy cost | Pre-trained with sparsely activated experts to reduce FLOPs; required specialized initialization and auxiliary losses for expert balance | Outperformed GPT-3 on zero-/one-shot tasks while using significantly less inference compute per token; demonstrated efficient super-scaling of model capacity |
| Gopher (2021) [10] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Learned absolute [10] | Pre-LayerNorm; GELU | 280B; 2048 tokens | MassiveText dataset (multi-domain text: web, books, news, code) | Systematic study of scaling up to 280B parameters, with extensive evaluation on 152 tasks; highlighted strengths (knowledge recall) and weaknesses (logic, math) at scale | Trained on TPU v3 Pods with mixed precision; used distributed training and periodic evaluation to analyze performance trends across model sizes | Showed that increasing model size yields broad knowledge gains but plateaus on certain reasoning tasks, informing later research on data vs. model-size trade-offs |
| Chinchilla (2022) [11] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Learned absolute | Pre-LayerNorm; GELU | 70B; 2048 tokens | 1.4T tokens of text (MassiveText, 4× Gopher's data) | Established the compute-optimal training paradigm: a smaller model trained on more data can outperform a larger model trained on less | Used the same compute budget as Gopher but with 4× the training tokens and a 4× smaller model, following new scaling-law predictions | Outperformed the 280B Gopher on many benchmarks despite far fewer parameters, demonstrating the importance of adequately scaling data quantity for a given model size |
| LaMDA (2022) [12] | Transformer Decoder (dialogue-optimized) | Multi-head self-attention (conversation LM) | Learned absolute | Pre-LayerNorm; Swish (SiLU) | 137B; 2048 tokens | 1.56T words of public dialog data + web text (pre-training) | Specialized for open-ended dialogue, with fine-tuning to improve safety and factual grounding in responses | Pre-trained on a dialog-heavy corpus, then fine-tuned with human-annotated data for safety; allowed to consult external tools/APIs during generation to ground facts | Produced more engaging, contextually relevant, and safer conversational responses, marking a step toward AI that can hold human-like dialogue |
| PaLM (2022) [13] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Rotary positional embedding (RoPE) | Pre-LayerNorm; SwiGLU | 540B; 2048 tokens | 780B tokens (multilingual web, books, GitHub code, conversations) | Achieved breakthrough few-shot performance, exceeding the human average on BIG-bench, and enabled strong multi-step reasoning and code generation | Trained on the Pathways system across TPU v4 Pods, leveraging mixed parallelism; incorporated multitask fine-tuning (FLAN) after pre-training for broad capabilities | Set new state of the art on many NLP benchmarks; demonstrated emergent abilities at scale (complex reasoning, coding, multilingual understanding) |
| InstructGPT / ChatGPT (2022) [14] | Transformer Decoder (GPT-3.5 series) | Masked multi-head self-attention (with instruction tuning) | Learned absolute | Pre-LayerNorm; GELU | 175B (base model); 2048–4096 tokens | GPT-3's pre-training data + human-generated demonstrations and feedback data | Aligned the language model with user intentions using Reinforcement Learning from Human Feedback (RLHF), greatly improving helpfulness and safety | Supervised fine-tuning on demonstration data, then RLHF: model outputs rated by humans train a reward model, and the policy is optimized via PPO | Delivered far more user-friendly responses than raw GPT-3; reduced harmful outputs and followed instructions better, leading to ChatGPT's widespread adoption |
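
Two rows in the table above become clearer with a little concrete detail. For the Chinchilla row: using the common back-of-the-envelope approximation that training compute is C ≈ 6·N·D (parameters times tokens), Gopher at 280B parameters × ~300B tokens and Chinchilla at 70B parameters × 1.4T tokens both land near 5×10²³ FLOPs, so for roughly the same compute Chinchilla buys a ~20-tokens-per-parameter data ratio and wins on benchmarks. For the Switch Transformer and GLaM rows: both replace the dense feed-forward block with a bank of expert FFNs plus a router that activates only one (Switch) or two (GLaM) experts per token. The NumPy sketch below is my own simplified illustration of that routing step (toy dimensions, no load-balancing loss, no expert-capacity limits), not code from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_experts, top_k = 16, 64, 4, 1   # top_k=1 ~ Switch-style, top_k=2 ~ GLaM-style

# One tiny feed-forward "expert" per slot: W1 (d_model, d_ff) and W2 (d_ff, d_model).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02   # router projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(tokens):
    """Route each token to its top-k experts and mix their outputs by router probability."""
    probs = softmax(tokens @ router_w)                  # (n_tokens, n_experts) routing weights
    out = np.zeros_like(tokens)
    for t, (tok, p) in enumerate(zip(tokens, probs)):
        for e in np.argsort(p)[-top_k:]:                # indices of the top-k experts for this token
            w1, w2 = experts[e]
            out[t] += p[e] * (np.maximum(tok @ w1, 0.0) @ w2)   # ReLU FFN, gated by router prob
    return out

tokens = rng.standard_normal((8, d_model))
print(moe_ffn(tokens).shape)   # (8, 16): only 1 of the 4 experts' weights is touched per token
```

This is why total parameter count can grow with the number of experts while per-token compute grows only with `top_k`; production MoE systems add an auxiliary load-balancing loss and per-expert capacity limits so tokens spread evenly across experts.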

🎯 2023–2025

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data / Domain | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4 (2023) [15][16] | Transformer (multimodal) | Multi-head self-attention (text & vision inputs) | Enhanced positional encoding (8k–32k context) | Details not public | Not disclosed (estimated ≈1.8T, rumored MoE [16]); 8,192 tokens (32,768 in the extended version) | Web text (pre-training); fine-tuned with code and imagery (multimodal) | Demonstrated powerful few-shot and reasoning abilities, with added vision input capability (accepts images as part of the prompt) | Post-trained with human feedback and model self-evaluation for alignment (reinforcement learning from human & AI feedback) | Achieved top-level performance on a wide range of tasks (coding, math, vision-language understanding) and exams; significantly more reliable and creative than earlier models |
| LLaMA (2023) [17] | Transformer Decoder (open weights) | Multi-head self-attention (auto-regressive) | Rotary positional embeddings (RoPE) | RMSNorm (pre-normalization); SwiGLU | 7B–65B (65B largest); 2048 tokens | 1.0–1.4T tokens of publicly available text (Common Crawl, Wikipedia, GitHub, etc.) | Openly released high-performance foundation model; achieved GPT-3-level performance with roughly 10× fewer parameters through efficient training and architectural tweaks | Trained on a curated large-scale dataset with extensive data cleaning and deduplication; used training efficiencies such as mixed precision | Enabled broad research and downstream customization (e.g., fine-tuned chat models) due to open access; foundation for many derivative models (Alpaca, etc.), democratizing LLM research |
| PaLM 2 (2023) [18] | Transformer Decoder (dense) | Multi-head self-attention (enhanced) | ALiBi positional bias (longer context) | Pre-LayerNorm; GELU | 340B (reported, largest "Ultra" model); 4096 tokens | Improved dataset spanning multiple languages, code, and math reasoning data | More compute-efficient than PaLM with improved multilingual and reasoning skills; strong coding ability and domain expertise via focused training data | Trained with an updated mixture of objectives (e.g., supervised learning on reasoning and coding tasks in addition to LM); leveraged prior PaLM insights with a reduced parameter count | Achieved superior performance across many benchmarks, including logic and translation tasks; formed the backbone of Google's Bard and enterprise models with faster inference |
| Claude (2023) [19][20] | Transformer Decoder (aligned AI assistant) | Multi-head self-attention (with long-context support) | Learned absolute (expanded context window) | Pre-LayerNorm; GELU | 52B (Claude 1) to 100B+ (Claude 2, estimated); 100,000 tokens (Claude 2, extended context) | Conversational and knowledge domains (fine-tuned from Anthropic's proprietary base LM) | Pioneered "Constitutional AI" to align model behavior via AI feedback rather than only human feedback, yielding a safer yet minimally supervised assistant | Initially fine-tuned with human feedback similar to InstructGPT, then optimized via a set of written principles (a "constitution") that the AI uses to self-refine its answers | Exhibits high-quality, less toxic dialogue and can handle extremely long documents in a single prompt (100k tokens), enabling analysis of lengthy texts; one of the first serious competitors to OpenAI's models |
| Gemini (2023) [21] | Multimodal Transformer (text, code, vision, audio) | Multimodal self-attention integrating different data types | Learned positional + modality-specific encodings | Pre-LayerNorm; SwiGLU | >1T parameters (estimated); 128k tokens | Multimodal and multilingual dataset (web text, images, code, audio, video) | Natively multimodal from the ground up: trained on text and other modalities together, enabling fluid combination of modalities and advanced reasoning | Pre-trained jointly on diverse modalities, then fine-tuned with targeted multimodal datasets; incorporates tool use (e.g., search, APIs) and code execution during fine-tuning for "agentic" behavior | Achieved state of the art on vision-language and multimodal benchmarks; capable of complex reasoning and planning across text, images, and more, representing Google DeepMind's answer to GPT-4 |
| Mistral (2023) [22] | Transformer Decoder (dense, efficient) | Sliding-window attention + grouped-query attention (GQA) | Rotary positional embeddings (RoPE) | RMSNorm; SwiGLU | 7B; 8,192 tokens | Publicly available English text, code, and reasoning data (OpenWeb, StackExchange, etc.) | Optimized for efficiency and performance; improved context window and inference cost compared to LLaMA 2 | Trained on curated high-quality data with attention optimizations and efficient token handling | Open-weight model with strong results on open benchmarks and better inference efficiency on edge devices |
| Qwen-1.5 (2024) [23] | Transformer Decoder (dense) | Multi-head self-attention + GQA | RoPE with extended context window (up to 128K) | RMSNorm; SwiGLU | 0.5B–72B; 8k–128k tokens | Multilingual + code-heavy datasets, instruction tuning | Introduced a wide range of open models from lightweight to ultra-scale sizes with strong performance and multilingual support | Instruction tuning, data deduplication, and large-context training pipelines for global use | Versatile, competitive models across open benchmarks in both English and Chinese; notable open-source support via Hugging Face |
| LLaMA 3 (2024) [24] | Transformer Decoder (dense, efficient) | Multi-head self-attention + GQA | RoPE with longer context support (128k planned) | RMSNorm; SwiGLU | 8B–70B (current), 400B+ planned; 8k–128k tokens | Cleaned Common Crawl, GitHub, multilingual text, academic papers | Meta's next-generation open models with improved alignment, multilingual performance, and code/data reasoning | Fine-tuned on curated datasets with alignment objectives (chat, code, math); trained on Meta's Research SuperCluster | Intended as a GPT-4-class public alternative, with strong few-shot, multilingual, and tool-use capabilities |
| DeepSeek-R1 (2025) [25][26] | Transformer Decoder (Mixture-of-Experts) | Multi-head latent attention (MLA) + sparse MoE feed-forward layers | Rotary positional embeddings (extended for very long context) | RMSNorm; SwiGLU | 671B total (MoE; ~37B active per token); 128,000 tokens | Broad web and knowledge corpora; specialized logical-reasoning datasets | "Reasoning-centric" LLM optimized via large-scale reinforcement learning to excel at step-by-step problem solving and logic tasks, with an unusually long context | Multi-stage training: pre-trained on diverse text, then large-scale reinforcement learning on reasoning tasks (the R1-Zero variant skips supervised fine-tuning entirely), with reward-model guidance and distillation into smaller models | Matches or surpasses similar-sized dense models on math, coding, and logic benchmarks at a fraction of the training cost; open-sourced by a Chinese startup, sparking global competitive pressure in advanced AI capabilities |
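
Most of the open models in the 2023–2025 table (LLaMA, Mistral, Qwen-1.5, LLaMA 3) share the same two entries in the positional-encoding and normalization columns: rotary positional embeddings (RoPE) and RMSNorm. Below is a minimal NumPy sketch of both, with toy shapes of my own choosing and the learned Q/K projections omitted; it illustrates the mechanism rather than reproducing any model's actual code.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only; no mean-centering, no bias
    (the learnable per-channel gain is omitted here for brevity)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def apply_rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (even, odd) channel pair of a
    query/key vector at position p by an angle p * theta_i, so that relative
    offsets show up directly in the q.k dot products."""
    seq_len, d = x.shape
    half = d // 2
    theta = base ** (-np.arange(half) / half)               # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                         # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

h = np.random.randn(8, 64)                # toy hidden states: 8 positions, width 64
q = apply_rope(rmsnorm(h))                # real models apply learned Q/K projections first
k = apply_rope(rmsnorm(h))
scores = q @ k.T / np.sqrt(64)            # relative-position-aware attention logits
print(scores.shape)                       # (8, 8)
```

Grouped-query attention (the "GQA" entries in the same table) is an orthogonal efficiency trick: several query heads share one key/value head, shrinking the KV cache that dominates memory at the long context lengths listed above.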

🎯 References

1. Attention Is All You Need, Ashish Vaswani et al., 2017 – NeurIPS. [Paper]
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova, 2018 – NAACL. [Paper]
3. Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, & Ilya Sutskever, 2018 – OpenAI (Technical Report). [PDF]
4. Language Models are Unsupervised Multitask Learners, Alec Radford et al., 2019 – OpenAI (Technical Report). [PDF]
5. XLNet: Generalized Autoregressive Pretraining for Language Understanding, Zhilin Yang et al., 2019 – NeurIPS. [Paper]
6. Language Models are Few-Shot Learners, Tom B. Brown et al., 2020 – NeurIPS. [Paper]
7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel et al., 2020 – JMLR. [Paper]
8. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, & Noam Shazeer, 2021 – JMLR. [Paper]
9. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du et al., 2022 – ICML. [Paper]
10. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Jack W. Rae et al., 2021 – DeepMind (Technical Report). [Paper]
11. Training Compute-Optimal Large Language Models, Jordan Hoffmann et al., 2022 – DeepMind (NeurIPS). [Paper]
12. LaMDA: Language Models for Dialog Applications, Romal Thoppilan et al., 2022 – arXiv preprint. [Paper]
13. PaLM: Scaling Language Modeling with Pathways, Aakanksha Chowdhery et al., 2022 – arXiv preprint. [Paper]
14. Training language models to follow instructions with human feedback, Long Ouyang et al., 2022 – OpenAI (NeurIPS). [Paper]
15. GPT-4 Technical Report, OpenAI, 2023. [Paper]
16. GPT-4 has more than a trillion parameters – Report, Matthias Bastian, 2023 – The Decoder. [Article]
17. LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron et al., 2023 – Meta AI. [Paper]
18. PaLM 2 Technical Report, Rohan Anil et al., 2023 – Google. [Paper]
19. Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai et al., 2022 – Anthropic. [Paper]
20. Introducing 100K Context Windows, Anthropic, 2023. [Blog]
21. Gemini: A Family of Highly Capable Multimodal Models (Technical Report), Google DeepMind, 2023. [PDF]
22. Mistral 7B Technical Report, Mistral AI, 2023. [Blog]
23. Qwen: The Qwen-1.5 Series, Alibaba DAMO Academy, 2024. [HuggingFace]
24. LLaMA 3 Preview, Meta AI, 2024. [Blog]
25. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025. [Paper]
26. DeepSeek-R1 model now available in Amazon Bedrock Marketplace…, Vivek Gangasani, Banu Nagasundaram, Jonathan Evans, & Niithiyn Vijeaswaran, 2025 – AWS Blog. [Article]
27. Discover AI – YouTube channel

Note

  • The LLMs listed here represent a personal selection from the many published models in recent years, based on what I've studied or followed.
  • This is not an exhaustive list, and many excellent models may not be included here.
  • This document is designed to keep me updated with growing LLM architectures. It is based on my personal understanding and synthesis of various research papers.
  • If you spot any errors or have suggestions for improvement, please feel free to reach out.
  • The references mentioned above have been explored over the past few months. There may be more detailed explanations available, and I'd be happy to dig deeper into those if needed.

Thank you for taking the time to read this. This is part of my personal study, so if you notice anything missing or needing correction, please reach out. I'd love to connect on LinkedIn, or check out my Github-AI-ML-Repo.