A concise, personal comparison of key LLM architectures developed over the past few years.
This document reflects my individual understanding and curiosity-driven research, covering models from 2017 through February 2025.
This is by no means an exhaustive list, and many other excellent models exist in the field.

List of LLMs Covered (2017–2025)
Transformer, BERT, GPT, GPT-2, XLNet, GPT-3, T5, Switch Transformer, GLaM, Gopher, Chinchilla, LaMDA, PaLM, InstructGPT / ChatGPT, GPT-4, LLaMA, PaLM 2, Claude, Gemini, Mistral, Qwen-1.5, LLaMA 3, DeepSeek-R1
2017–2019
Model(Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
---|---|---|---|---|---|---|---|---|---|
Transformer (2017) [1] | Encoder-Decoder Transformer | Multi-head self-attention (encoder & decoder) + cross-attention | Fixed sinusoidal | Post-layernorm; ReLU | ~65M (base model); 512 tokens | WMT14 translation corpora (e.g., 4.5M sentence pairs En–De) | Introduced self-attention to replace recurrent networks, enabling parallel sequence processing (a minimal attention sketch follows this table) | Supervised learning on translation tasks; residual connections, layer normalization, Adam optimizer | Dramatically improved machine translation quality and speed; became the foundational architecture for subsequent LLMs |
BERT (2018) [2] | Transformer Encoder (bidirectional) | Full bidirectional self-attention (MLM objective) | Learned absolute | Post-layernorm; GELU | 110M (Base), 340M (Large); 512 tokens | BooksCorpus + English Wikipedia (3.3B words total) | Masked Language Modeling and Next Sentence Prediction for deep bidirectional context understanding | Unsupervised pre-training on large text corpus, then task-specific fine-tuning (transfer learning) | Set new state-of-the-art on many NLP tasks (GLUE, QA) via contextualized embeddings and fine-tuning |
GPT (2018) [3] | Transformer Decoder (unidirectional) | Auto-regressive masked self-attention (causal LM) | Learned absolute | Post-layernorm (in transformer blocks); GELU | 117M; 512 tokens | BooksCorpus (~800M words of novels) | First to use generative pre-training for language understanding tasks, demonstrating transfer learning from unsupervised LM | Unsupervised language model pre-training on unlabeled text, followed by supervised fine-tuning on each task | Outperformed task-specific architectures on 9 of 12 NLP tasks via pre-trained knowledge, showing the power of generative pre-training |
GPT-2 (2019) [4] | Transformer Decoder (deep, uni-directional) | Masked multi-head self-attention (auto-regressive) | Learned absolute | Pre-layernorm (moved to the input of each sub-block); GELU | 1.5 billion; 1024 tokens | WebText (8M web pages from Reddit links, ~40 GB) | Demonstrated that much larger unsupervised language models can generate coherent long-form text | Generative pre-training on vast internet text; no fine-tuning, evaluated zero-shot on tasks | Achieved notable zero-shot performance on diverse tasks (QA, translation, summarization), indicating emergent multitask learning abilities |
XLNet (2019) [5] | Transformer-XL Decoder (autoregressive) | Permutation-based full self-attention (two-stream) | Segment-aware relative positional encoding | Post-layernorm; GELU | 340M (Large); 512 tokens | Diverse large text corpora (Google Books, Wikipedia, Giga5, ClueWeb, Common Crawl) | Generalized autoregressive pre-training that leverages all context positions (permuted order) instead of masking | Memory-augmented Transformer (recurrence from Transformer-XL) with two-stream attention; trained with a permutation language modeling objective | Outperformed BERT on NLP benchmarks (e.g., GLUE) by capturing bidirectional context without an explicit mask, improving downstream task performance |
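
To make the attention column above concrete, here is a minimal NumPy sketch of scaled dot-product attention from the Transformer paper. With `causal=True` it applies the GPT-style mask (each position attends only to earlier positions); without it, attention is fully bidirectional as in BERT's encoder. Shapes and variable names are illustrative, not taken from any particular codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention; q, k, v each have shape (seq_len, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # pairwise similarities
    if causal:
        # Decoder-style (GPT) mask: position i may only attend to positions <= i
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)                     # attention distribution per query
    return weights @ v                                     # weighted sum of value vectors

# Toy usage: 4 tokens with an 8-dimensional head, self-attention (q = k = v)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x, causal=True).shape)   # (4, 8)
```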
2020–2022
Model(Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
---|---|---|---|---|---|---|---|---|---|
GPT-3 (2020) [6] | Transformer Decoder (very deep) | Masked multi-head self-attention (auto-regressive) | Learned absolute (2048 tokens) | Pre-layernorm; GELU | 175 billion; 2048 tokens | ~300B tokens from Common Crawl, WebText2, Books, Wikipedia | Massive scale showed emergent few-shot learning: the model can perform tasks from prompts without fine-tuning | Trained on an extremely large corpus with mixed precision and model parallelism across GPUs; no task-specific fine-tuning required for evaluation | Achieved state-of-the-art in few-shot and zero-shot settings on many NLP tasks; demonstrated the benefits of scale for versatility |
T5 (2020) [7] | Transformer Encoder-Decoder | Full self-attention (enc & dec) + cross-attention | Relative positional embeddings | Pre-layernorm; ReLU (with variants explored) | 11 billion (largest); 512 tokens | C4 (Colossal Clean Crawled Corpus, ~750 GB text) | Unified "text-to-text" framework: the model treats every NLP task (translation, QA, summarization, etc.) as text generation | Unsupervised pre-training on the C4 corpus with a denoising objective, followed by task-specific fine-tuning in a text-to-text format | Achieved state-of-the-art on numerous benchmarks with one model applicable to all tasks; open-sourced in various sizes for flexible fine-tuning |
Switch Transformer (2021) [8] | Transformer Encoder-Decoder (Mixture-of-Experts, T5-based) | Standard multi-head self-attention; sparse MoE routing in the feed-forward layers | Relative positional embeddings (T5-style) | Pre-layernorm; ReLU | 1.6 trillion (Switch-C, 2048 experts; one expert active per token); 2048 tokens | C4 corpus (same as T5) | Introduced conditional computation: a router activates one expert feed-forward network per token, enabling extreme scale with efficient compute (a toy routing sketch follows this table) | MoE training with a load-balancing loss to ensure experts are utilized; scaled on TPU pods to reach trillion+ parameters | Matched dense model quality with much lower computational cost; set new scale records (trillion+ parameters) while maintaining strong zero-shot and one-shot performance |
GLaM (2022) [9] | Transformer Decoder (Mixture-of-Experts) | Standard multi-head self-attention; MoE feed-forward layers (top-2 routing) | Learned absolute | Pre-layernorm; GELU | 1.2 trillion (64 experts, 2 active per token); 2048 tokens | Massive web corpus (filtered web pages, dialogues, code) | Scaled MoE further with a balanced gating approach (each token routed to 2 experts) for efficiency: about 7× the parameter count of GPT-3 at roughly one-third of the training energy cost | Pre-trained with sparsely activated experts to reduce FLOPs; required specialized initialization and auxiliary losses for expert balance | Outperformed GPT-3 on zero-/one-shot tasks while using significantly less inference compute per token; demonstrated efficient super-scaling of model capacity |
Gopher (2021) [10] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Relative positional encoding (Transformer-XL-style) [10] | RMSNorm (pre-normalization); GELU | 280 billion; 2048 tokens | MassiveText dataset (multi-domain text: web, books, news, code) | Systematic study of scaling up to 280B parameters with extensive evaluation on 152 tasks; highlighted strengths (knowledge recall) and weaknesses (logic, math) at scale | Trained on TPU v3 Pods with mixed precision; used distributed training and periodic evaluation to analyze performance trends across model sizes | Showed that increasing model size yields broad knowledge gains but plateaus on certain reasoning tasks, informing later research on data vs. model size trade-offs |
Chinchilla (2022) [11] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Relative positional encoding (same architecture family as Gopher) | RMSNorm (pre-normalization); GELU | 70 billion; 2048 tokens | 1.4 trillion tokens of text (MassiveText, 4× Gopher's data) | Established the compute-optimal model paradigm: a smaller model trained on more data can outperform a larger model trained on less data (a back-of-the-envelope comparison follows this table) | Used roughly the same compute budget as Gopher but with about 4× the training tokens and a 4× smaller model, following new scaling-law predictions | Outperformed the 280B Gopher on many benchmarks despite far fewer parameters, demonstrating the importance of adequately scaling data quantity for a given model size |
LaMDA (2022) [12] | Transformer Decoder (dialogue-optimized) | Multi-head self-attention (conversation LM) | Relative positional encoding (T5-style) | Pre-layernorm; gated-GELU (GEGLU) | 137 billion; 2048 tokens | 1.56T words of public dialog data + web text (pre-training) | Specialized for open-ended dialogue, with fine-tuning to improve safety and factual grounding in responses | Pre-trained on a dialog-heavy corpus, then fine-tuned with human-annotated data for safety; allowed to consult external tools/APIs during generation (to ground facts) | Produced more engaging, contextually relevant, and safer conversational responses, marking a step toward AI that can hold human-like dialogue |
PaLM (2022) [13] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Rotary positional embeddings (RoPE) | Pre-layernorm; SwiGLU | 540 billion; 2048 tokens | 780B tokens (multilingual web, books, GitHub code, conversations) | Achieved breakthrough few-shot performance, exceeding the average human score on BIG-bench, and enabled strong multi-step reasoning and code generation | Trained on the Pathways system across TPU v4 Pods, leveraging mixed parallelism; later instruction-tuned variants (e.g., FLAN-PaLM) broadened its capabilities further | Set new state-of-the-art on many NLP benchmarks; demonstrated emergent abilities at scale (complex reasoning, coding, multilingual understanding) |
InstructGPT / ChatGPT (2022) [14] | Transformer Decoder (GPT-3.5 series) | Masked multi-head self-attention (with instruction tuning) | Learned absolute | Pre-layernorm; GELU | 175B (base model); 2048–4096 tokens | GPT-3's pre-training data + human-written demonstrations and feedback data | Aligned the language model with user intentions using Reinforcement Learning from Human Feedback (RLHF), greatly improving helpfulness and safety (a reward-model loss sketch follows this table) | Supervised fine-tuning on demonstration data, then RLHF: model outputs rated by humans to train a reward model, and the policy optimized via PPO | Delivered far more user-friendly responses than raw GPT-3; reduced harmful outputs and followed instructions better, leading to ChatGPT's widespread adoption |
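
The Switch Transformer and GLaM rows describe token-to-expert routing; the following toy NumPy sketch shows top-1 (Switch-style) and top-2 (GLaM-style) gating over a few small ReLU "experts". Real systems add expert capacity limits, load-balancing losses, and expert parallelism, all omitted here; every name and shape below is illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_ws, top_k=1):
    """tokens: (n, d); router_w: (d, n_experts); expert_ws: list of (d, d) expert matrices.

    Each token is sent only to its top_k experts (a tiny ReLU FFN here),
    and the expert outputs are mixed using the renormalized router probabilities.
    """
    gate_probs = softmax(tokens @ router_w, axis=-1)           # (n, n_experts)
    top_experts = np.argsort(-gate_probs, axis=-1)[:, :top_k]  # chosen expert ids per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = top_experts[i]
        mix = gate_probs[i, chosen] / gate_probs[i, chosen].sum()
        for expert_id, w in zip(chosen, mix):
            out[i] += w * np.maximum(tok @ expert_ws[expert_id], 0.0)
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
router_w = rng.normal(size=(16, 4))
expert_ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
print(moe_layer(tokens, router_w, expert_ws, top_k=1).shape)   # Switch-style top-1 routing
print(moe_layer(tokens, router_w, expert_ws, top_k=2).shape)   # GLaM-style top-2 routing
```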
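
For the Chinchilla row, the compute-optimal argument can be checked with the common approximation that training compute is about 6 × parameters × tokens. Under that rough rule (my own back-of-the-envelope numbers, not the paper's exact fits), a 70B model on 1.4T tokens costs roughly the same as a 280B model on 300B tokens:

```python
def train_flops(n_params, n_tokens):
    """Widely used approximation: training compute ~= 6 * parameters * tokens (FLOPs)."""
    return 6 * n_params * n_tokens

gopher     = train_flops(280e9, 300e9)    # 280B params, ~300B training tokens
chinchilla = train_flops(70e9, 1.4e12)    # 70B params, ~1.4T training tokens (~20 tokens/param)

print(f"Gopher-style run:     ~{gopher:.2e} FLOPs")
print(f"Chinchilla-style run: ~{chinchilla:.2e} FLOPs")
# Roughly the same budget spent on a 4x smaller model and ~4.7x more data,
# which is what let Chinchilla overtake Gopher on most benchmarks.
```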
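
The RLHF recipe in the InstructGPT row trains a reward model on human preference pairs before running PPO. A minimal sketch of the pairwise (Bradley-Terry style) objective, using made-up reward scores, looks like this:

```python
import numpy as np

def reward_pairwise_loss(r_chosen, r_rejected):
    """Pairwise loss for an RLHF reward model: -log sigmoid(r_chosen - r_rejected),
    so the human-preferred response should receive the higher score."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Made-up reward-model scores for two (chosen, rejected) response pairs
print(reward_pairwise_loss(2.0, 0.5))   # small loss: ranking already agrees with the human label
print(reward_pairwise_loss(0.0, 1.5))   # large loss: the rejected answer is ranked higher
```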
2023–2025
Model(Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data / Domain | Innovations | Training Strategies | Capabilities |
---|---|---|---|---|---|---|---|---|---|
GPT-4 (2023) [15] | Transformer (multimodal; architecture details not public) | Multi-head self-attention (text & vision inputs) | Enhanced positional encoding (8k–32k context) | (Details not public) | Not disclosed (unofficial estimates around ~1.8T with a Mixture-of-Experts design [16]); 8,192 tokens (32,768 in the extended version) | Web text (pre-training); fine-tuned with code and imagery (multimodal) | Demonstrated powerful few-shot and reasoning abilities, with added vision input capability (accepts images as part of the prompt) | Post-trained with human feedback and model self-evaluation for alignment (reinforcement learning with human & AI feedback) | Achieved top-level performance on a wide range of tasks (coding, math, vision-language understanding) and exams; significantly more reliable and creative than earlier models |
LLaMA (2023) [17] | Transformer Decoder (openly released) | Multi-head self-attention (auto-regressive) | Rotary positional embeddings (RoPE) | RMSNorm (pre-normalization); SwiGLU | 7B–65B (65B largest); 2048 tokens | 1.0T–1.4T tokens of publicly available text (Common Crawl, Wikipedia, GitHub, etc.) | Openly released high-performance foundation model; LLaMA-13B matched or exceeded GPT-3 on many benchmarks with roughly 10× fewer parameters, thanks to efficient training and architecture tweaks (a RoPE sketch follows this table) | Trained on a curated large-scale dataset with extensive data cleaning and deduplication; used training efficiencies such as mixed precision | Enabled broad research and downstream customization (e.g., fine-tuned chat models) due to open access; foundation for many derivative models (Alpaca, etc.), democratizing LLM research |
PaLM 2 (2023) [18] | Transformer Decoder (dense) | Multi-head self-attention (details not fully disclosed) | Not publicly detailed | Not publicly detailed | Not disclosed (reportedly ~340B for the largest version); ~4096-token context in deployed versions | Improved dataset spanning multiple languages, code, and math reasoning data | More compute-efficient than PaLM with improved multilingual and reasoning skills; strong coding ability and domain expertise via focused training data | Trained with a tuned mixture of pre-training objectives and a more multilingual data mixture; leveraged scaling insights from PaLM while using a smaller, better-trained model | Achieved superior performance across many benchmarks including reasoning and translation tasks; formed the backbone of Google's Bard and enterprise models with faster inference |
Claude (2023) [19][20] | Transformer Decoder (alignment-focused) | Multi-head self-attention (with long-context support) | Not publicly detailed (expanded context window) | Not publicly detailed | Parameter count not disclosed (an earlier Anthropic research model was ~52B); 100,000 tokens (Claude 2 extended-context version) | Conversational and knowledge domains (proprietary pretrained base model) | Pioneered "Constitutional AI" to align model behavior via AI feedback rather than only human feedback, yielding a safer yet minimally supervised assistant | Initially fine-tuned with human feedback similar to InstructGPT, then optimized via a set of written principles (a "constitution") that the AI uses to critique and refine its own answers | Exhibits high-quality, less toxic dialogue and can handle extremely long documents in a single prompt (100k tokens), enabling analysis of lengthy texts; one of the first serious competitors to OpenAI's models |
Gemini (2023) [21] | Multimodal Transformer (text, code, vision, audio) | Multi-modal self-attention integrating different data types | Learned positional + modality-specific encodings (details not fully disclosed) | Not publicly detailed | Parameter counts not disclosed (Ultra / Pro / Nano sizes); 32k tokens (Gemini 1.0) | Multimodal and multilingual dataset (web text, images, code, audio, video) | Natively multimodal from the ground up: trained on text and other modalities together, enabling fluid combination of modalities and advanced reasoning abilities | Pre-trained jointly on diverse modalities, then fine-tuned with targeted multimodal datasets; incorporates tool use (e.g., search, APIs) and code execution during fine-tuning for "agentic" behavior | Achieved state-of-the-art on vision-language and multimodal benchmarks; capable of complex reasoning and planning across text, images, and more, representing Google DeepMind's answer to GPT-4 |
Mistral 7B (2023) [22] | Transformer Decoder (dense, efficient) | Sliding-window attention + grouped-query attention (GQA) | Rotary positional embeddings (RoPE) | RMSNorm; SwiGLU | 7B; 8,192 tokens (4,096-token sliding window) | English web text, code, and reasoning data (exact mixture not disclosed) | Optimized for efficiency and performance; outperforms the larger LLaMA 2 13B on most benchmarks at lower inference cost (a sliding-window/GQA sketch follows this table) | Trained on curated high-quality data with attention optimizations (GQA, rolling KV cache) and efficient token handling | Open-weight model with strong results on open benchmarks and good inference efficiency, including on smaller/edge deployments |
Qwen-1.5 (2024) [23] | Transformer Decoder (dense) | Multi-head self-attention (GQA on some variants) | RoPE (32K-token context support) | RMSNorm; SwiGLU | 0.5B–72B; 32k tokens | Multilingual + code-heavy datasets, instruction tuning | Released a wide range of open models, from lightweight to large sizes, with strong performance and multilingual support | Instruction tuning, data deduplication, and long-context training pipelines for global use | Versatile, competitive models across open benchmarks in both English and Chinese; notable open-source support via HuggingFace |
LLaMA 3 (2024) [24] | Transformer Decoder (dense, efficient) | Multi-head self-attention + GQA | RoPE (longer-context support planned) | RMSNorm; SwiGLU | 8B and 70B (released, 400B+ in training); 8k tokens at launch (longer context planned) | 15T+ tokens of cleaned public data (web, GitHub code, multilingual text, academic papers) | Meta's next-generation open models with improved alignment, multilingual performance, and code/data reasoning | Fine-tuned on curated datasets with alignment objectives (chat, code, math); trained on Meta's large-scale GPU clusters | Intended as a GPT-4-class open alternative, with strong few-shot, multilingual, and tool-use capabilities |
DeepSeek-R1 (2025) [25][26] | Transformer Decoder (Mixture-of-Experts, built on DeepSeek-V3-Base) | Multi-head Latent Attention (MLA); MoE feed-forward layers with a small subset of experts routed per token | Rotary positional embeddings (RoPE, extended for long context) | RMSNorm (pre-normalization); SwiGLU | 671 billion total (MoE; ~37B parameters active per token); 128,000 tokens | Broad web and knowledge corpora; specialized logical-reasoning data | "Reasoning-centric" LLM optimized via large-scale reinforcement learning to excel at step-by-step problem solving and logic tasks, with a very long context window | Multi-stage training: R1-Zero applied pure RL to a pretrained base (no supervised fine-tuning), while R1 adds a small cold-start SFT stage before RL, plus reward-model guidance and distillation into smaller dense models | Matches or surpasses comparable dense models on math, coding, and logic benchmarks at a fraction of the training cost; released with open weights by a Chinese startup, sparking global competitive pressure in advanced AI capabilities |
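
Several rows above (PaLM, LLaMA, Mistral, Qwen, DeepSeek) list rotary positional embeddings (RoPE). The sketch below shows the core idea in NumPy: consecutive dimension pairs of each query/key vector are rotated by a position-dependent angle, using the common base-10000 frequency schedule. It is a simplified illustration, not code from any of these models.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, d), with d even.

    Dimension pair (2i, 2i+1) at position p is rotated by angle p * base**(-2i/d),
    so relative offsets between positions show up directly in the attention dot products.
    """
    seq_len, d = x.shape
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)        # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)          # (seq_len, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # even / odd dimensions of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones((5, 8))
print(rope(q)[0])   # position 0 is left unrotated (all angles are zero)
```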
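
The Mistral row combines a sliding-window attention mask with grouped-query attention (GQA). This small sketch shows both ideas in isolation: the boolean mask that limits each token to a recent window of causal context, and the mapping that lets several query heads share one key/value head. The window and head counts are illustrative (Mistral 7B reportedly uses a 4,096-token window and 8 KV heads for 32 query heads).

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask (True = attention allowed): token i sees tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def query_to_kv_head(n_q_heads, n_kv_heads):
    """GQA head mapping: consecutive groups of query heads share one key/value head."""
    group_size = n_q_heads // n_kv_heads
    return [qh // group_size for qh in range(n_q_heads)]

print(sliding_window_causal_mask(6, window=3).astype(int))
print(query_to_kv_head(n_q_heads=32, n_kv_heads=8))   # 4 query heads per KV head
```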
References
1. Attention Is All You Need, Ashish Vaswani et al., 2017 – NeurIPS. [Paper]
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova, 2018 – NAACL. [Paper]
3. Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, & Ilya Sutskever, 2018 – OpenAI (Technical Report). [PDF]
4. Language Models are Unsupervised Multitask Learners, Alec Radford et al., 2019 – OpenAI (Technical Report). [PDF]
5. XLNet: Generalized Autoregressive Pretraining for Language Understanding, Zhilin Yang et al., 2019 – NeurIPS. [Paper]
6. Language Models are Few-Shot Learners, Tom B. Brown et al., 2020 – NeurIPS. [Paper]
7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel et al., 2020 – JMLR. [Paper]
8. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, & Noam Shazeer, 2021 – JMLR. [Paper]
9. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du et al., 2022 – ICML. [Paper]
10. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Jack W. Rae et al., 2021 – DeepMind (Technical Report). [Paper]
11. Training Compute-Optimal Large Language Models, Jordan Hoffmann et al., 2022 – DeepMind (NeurIPS). [Paper]
12. LaMDA: Language Models for Dialog Applications, Romal Thoppilan et al., 2022 – arXiv preprint. [Paper]
13. PaLM: Scaling Language Modeling with Pathways, Aakanksha Chowdhery et al., 2022 – arXiv preprint. [Paper]
14. Training language models to follow instructions with human feedback, Long Ouyang et al., 2022 – OpenAI (NeurIPS). [Paper]
15. GPT-4 Technical Report, OpenAI, 2023. [Paper]
16. GPT-4 has more than a trillion parameters – Report, Matthias Bastian, 2023 – The Decoder. [Article]
17. LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron et al., 2023 – Meta AI. [Paper]
18. PaLM 2 Technical Report, Rohan Anil et al., 2023 – Google. [Paper]
19. Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai et al., 2022 – Anthropic. [Paper]
20. Introducing 100K Context Windows, Anthropic, 2023. [Blog]
21. Gemini: A Family of Highly Capable Multimodal Models (Technical Report), Google DeepMind, 2023. [PDF]
22. Mistral 7B Technical Report, Mistral AI, 2023. [Blog]
23. Qwen: The Qwen-1.5 Series, Alibaba DAMO Academy, 2024. [HuggingFace]
24. LLaMA 3 Preview, Meta AI, 2024. [Blog]
25. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025. [Paper]
26. DeepSeek-R1 model now available in Amazon Bedrock Marketplace…, Vivek Gangasani, Banu Nagasundaram, Jonathan Evans, & Niithiyn Vijeaswaran, 2025 – AWS Blog. [Article]
27. Discover AI – YouTube channel
Note
- The LLMs listed here represent a personal selection from the many published models in recent years, based on what I've studied or followed.
- This is not an exhaustive list, and many excellent models may not be included here.
- This document is designed to keep me updated with growing LLM architectures. It is based on my personal understanding and synthesis of various research papers.
- If you spot any errors or have suggestions for improvement, please feel free to reach out.
- The references mentioned above have been explored over the past few months. There may be more detailed explanations available, and I'd be happy to dig deeper into those if needed.
Thank you for taking the time to read this. This is part of my personal study, so if you notice anything missing or needing correction, please let me know. I'd love to connect: find me on LinkedIn or check out my Github-AI-ML-Repo.