A concise, personal comparison of key LLM architectures developed over the past few years.

This document reflects my individual understanding and curiosity-driven research, covering models released between 2017 and February 2025.

This is by no means an exhaustive list, and many other excellent models exist in the field.

🎯 List of LLMs Covered (2017–2025)

Transformer, BERT, GPT, GPT-2, XLNet, GPT-3, T5, Switch Transformer, GLaM, Gopher, Chinchilla, LaMDA, PaLM, InstructGPT / ChatGPT, GPT-4, LLaMA, PaLM 2, Claude, Gemini, Mistral, Qwen-1.5, LLaMA 3, DeepSeek-R1


🎯 2017–2019

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| Transformer (2017) [1] | Encoder–Decoder Transformer | Multi-head self-attention (encoder & decoder) + cross-attention | Fixed sinusoidal | Post-LayerNorm; ReLU | ~65M (base model); 512 tokens | WMT14 translation corpora (e.g., 4.5M sentence pairs En→De) | Introduced self-attention to replace recurrent networks, enabling parallel sequence processing | Supervised learning on translation tasks; residual connections, layer normalization, Adam optimizer | Dramatically improved machine translation quality and speed; became the foundational architecture for subsequent LLMs |
| BERT (2018) [2] | Transformer Encoder (bidirectional) | Full bidirectional self-attention (MLM objective) | Learned absolute | Post-LayerNorm; GELU | 110M (Base), 340M (Large); 512 tokens | BooksCorpus + English Wikipedia (3.3B words total) | Masked Language Modeling and Next Sentence Prediction for deep bidirectional context understanding | Unsupervised pre-training on a large text corpus, then task-specific fine-tuning (transfer learning) | Set new state of the art on many NLP tasks (GLUE, QA) via contextualized embeddings and fine-tuning |
| GPT (2018) [3] | Transformer Decoder (unidirectional) | Auto-regressive masked self-attention (causal LM) | Learned absolute | LayerNorm in transformer blocks; GELU | 117M; 512 tokens | BookCorpus (~700M words of novels) | First to use generative pre-training for language understanding tasks, demonstrating transfer learning from an unsupervised LM | Unsupervised language-model pre-training on unlabeled text, followed by supervised fine-tuning on each task | Outperformed task-specific architectures on 9 of 12 NLP tasks via pre-trained knowledge, showing the power of generative pre-training |
| GPT-2 (2019) [4] | Transformer Decoder (deep, unidirectional) | Masked multi-head self-attention (auto-regressive) | Learned absolute | LayerNorm in each layer; GELU | 1.5B; 1024 tokens | WebText (8M web pages from Reddit links, ~40 GB) | Demonstrated that much larger unsupervised language models can generate coherent long-form text | Generative pre-training on vast internet text; no fine-tuning, evaluated zero-shot on tasks | Achieved notable zero-shot performance on diverse tasks (QA, translation, summarization), indicating emergent multitask learning abilities |
| XLNet (2019) [5] | Transformer-XL Decoder (autoregressive) | Permutation-based full self-attention (two-stream) | Segment-aware relative positional encoding | Post-LayerNorm; GELU | 340M (Large); 512 tokens | Diverse large text corpora (Google Books, Wikipedia, Giga5, ClueWeb, Common Crawl) | Generalized autoregressive pre-training that leverages all context positions (permuted order) instead of masking | Memory-augmented Transformer (recurrence from Transformer-XL) with two-stream attention; trained with a permutation language modeling objective | Outperformed BERT on NLP benchmarks (e.g., GLUE) by capturing bidirectional context without an explicit mask, improving downstream task performance |
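
The 2017–2019 rows above revolve around two mechanisms: positional encodings (fixed sinusoidal in the Transformer, learned absolute in BERT/GPT) and self-attention that is either bidirectional (BERT) or causally masked (GPT, GPT-2). The NumPy sketch below is my own minimal illustration of both, with toy dimensions; it omits the learned Q/K/V projections, multiple heads, residual connections, and LayerNorm that the real models use.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) pair indices
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model).

    causal=False ~ BERT-style bidirectional attention (every token sees every token);
    causal=True  ~ GPT-style auto-regressive attention (token t sees only tokens <= t).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len) similarity logits
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)       # block attention to future positions
    return softmax(scores) @ x                        # weighted mix of the value vectors (here: x itself)

seq_len, d_model = 6, 16
x = np.random.randn(seq_len, d_model) + sinusoidal_positions(seq_len, d_model)
print(self_attention(x, causal=True).shape)           # (6, 16)
```

Real implementations wrap this core in learned projections, multiple heads, residuals, and LayerNorm (post-LN in these early models, pre-LN in most of the models in the next table).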

🎯 2020–2022

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3 (2020) [6] | Transformer Decoder (very deep) | Masked multi-head self-attention (auto-regressive) | Learned absolute (2048 positions) | Pre-LayerNorm; GELU | 175B; 2048 tokens | ~300B tokens from Common Crawl, WebText2, Books, Wikipedia | Massive scale showed emergent few-shot learning: the model can perform tasks from prompts without fine-tuning | Trained on an extremely large corpus with mixed precision and model parallelism across GPUs; no task-specific fine-tuning required for evaluation | Achieved state of the art in few-shot and zero-shot settings on many NLP tasks; demonstrated the benefits of scale for versatility |
| T5 (2020) [7] | Transformer Encoder–Decoder | Full self-attention (enc & dec) + cross-attention | Relative positional embeddings | Pre-LayerNorm; ReLU (with variants explored) | 11B (largest); 512 tokens | C4 (Colossal Clean Crawled Corpus, ~750 GB of text) | Unified "text-to-text" framework: the model treats every NLP task (translation, QA, summarization, etc.) as text generation | Unsupervised pre-training on C4 with a denoising objective, followed by task-specific fine-tuning in a text-to-text format | Achieved state of the art on numerous benchmarks with one model applicable to all tasks; open-sourced in various sizes for flexible fine-tuning |
| Switch Transformer (2021) [8] | Transformer Encoder–Decoder (Mixture-of-Experts, built on T5) | Multi-head self-attention; sparse MoE in the feed-forward layers (one expert routed per token) | Relative positional embeddings (T5-style) | Pre-LayerNorm; SwiGLU | 1.6T (64 experts; ~26B active per token); 2048 tokens | C4 corpus (same as T5) | Introduced conditional computation: routing activates a single expert feed-forward network per token, enabling extreme scale with efficient compute | MoE training with a load-balancing loss to ensure experts are utilized; scaled on TPU pods to reach trillion+ parameters | Matched dense-model quality with much lower computational cost; set new scale records (trillion+ parameters) while maintaining strong zero-shot and one-shot performance |
| GLaM (2022) [9] | Transformer Decoder (Mixture-of-Experts) | Multi-head self-attention; sparse MoE feed-forward layers (two experts per token) | Learned absolute | Pre-LayerNorm; GELU | 1.2T (64 experts, 2 active per token); 2048 tokens | Massive web corpus (filtered web pages, dialogues, code) | Scaled MoE further with a balanced gating approach (each token routed to 2 experts) for efficiency: roughly 7× the parameter count of GPT-3 at about 1/3 of the training energy cost | Pre-trained with sparsely activated experts to reduce FLOPs; required specialized initialization and auxiliary losses for expert balance | Outperformed GPT-3 on zero-/one-shot tasks while using significantly less inference compute per token; demonstrated efficient super-scaling of model capacity |
| Gopher (2021) [10] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Learned absolute [10] | Pre-LayerNorm; GELU | 280B; 2048 tokens | MassiveText dataset (multi-domain text: web, books, news, code) | Systematic study of scaling up to 280B parameters, with extensive evaluation on 152 tasks; highlighted strengths (knowledge recall) and weaknesses (logic, math) at scale | Trained on TPU v3 Pods with mixed precision; used distributed training and periodic evaluation to analyze performance trends across model sizes | Showed that increasing model size yields broad knowledge gains but plateaus on certain reasoning tasks, informing later research on data vs. model-size trade-offs |
| Chinchilla (2022) [11] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Learned absolute | Pre-LayerNorm; GELU | 70B; 2048 tokens | 1.4T tokens of text (MassiveText, 4× Gopher's data) | Established the compute-optimal training paradigm: a smaller model trained on more data can outperform a larger model trained on less | Used the same compute budget as Gopher but with 4× the training tokens and a 4× smaller model, following new scaling-law predictions | Outperformed the 280B Gopher on many benchmarks despite far fewer parameters, demonstrating the importance of adequately scaling data quantity for a given model size |
| LaMDA (2022) [12] | Transformer Decoder (dialogue-optimized) | Multi-head self-attention (conversation LM) | Learned absolute | Pre-LayerNorm; Swish (SiLU) | 137B; 2048 tokens | 1.56T words of public dialog data + web text (pre-training) | Specialized for open-ended dialogue, with fine-tuning to improve safety and factual grounding in responses | Pre-trained on a dialog-heavy corpus, then fine-tuned with human-annotated data for safety; allowed to consult external tools/APIs during generation to ground facts | Produced more engaging, contextually relevant, and safer conversational responses, marking a step toward AI that can hold human-like dialogue |
| PaLM (2022) [13] | Transformer Decoder (dense) | Multi-head self-attention (auto-regressive LM) | Rotary positional embedding (RoPE) | Pre-LayerNorm; SwiGLU | 540B; 2048 tokens | 780B tokens (multilingual web, books, GitHub code, conversations) | Achieved breakthrough few-shot performance, exceeding the human average on BIG-bench, and enabled strong multi-step reasoning and code generation | Trained on the Pathways system across TPU v4 Pods, leveraging mixed parallelism; incorporated multitask fine-tuning (FLAN) after pre-training for broad capabilities | Set new state of the art on many NLP benchmarks; demonstrated emergent abilities at scale (complex reasoning, coding, multilingual understanding) |
| InstructGPT / ChatGPT (2022) [14] | Transformer Decoder (GPT-3.5 series) | Masked multi-head self-attention (with instruction tuning) | Learned absolute | Pre-LayerNorm; GELU | 175B (base model); 2048–4096 tokens | GPT-3's pre-training data + human-generated demonstrations and feedback data | Aligned the language model with user intentions using Reinforcement Learning from Human Feedback (RLHF), greatly improving helpfulness and safety | Supervised fine-tuning on demonstration data, then RLHF: model outputs rated by humans train a reward model, and the policy is optimized via PPO | Delivered far more user-friendly responses than raw GPT-3; reduced harmful outputs and followed instructions better, leading to ChatGPT's widespread adoption |
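
Two rows in the table above become clearer with a little concrete detail. For the Chinchilla row: using the common back-of-the-envelope approximation that training compute is C ≈ 6·N·D (parameters times tokens), Gopher at 280B parameters × ~300B tokens and Chinchilla at 70B parameters × 1.4T tokens both land near 5×10²³ FLOPs, so for roughly the same compute Chinchilla buys a ~20-tokens-per-parameter data ratio and wins on benchmarks. For the Switch Transformer and GLaM rows: both replace the dense feed-forward block with a bank of expert FFNs plus a router that activates only one (Switch) or two (GLaM) experts per token. The NumPy sketch below is my own simplified illustration of that routing step (toy dimensions, no load-balancing loss, no expert-capacity limits), not code from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_experts, top_k = 16, 64, 4, 1   # top_k=1 ~ Switch-style, top_k=2 ~ GLaM-style

# One tiny feed-forward "expert" per slot: W1 (d_model, d_ff) and W2 (d_ff, d_model).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02   # router projection

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(tokens):
    """Route each token to its top-k experts and mix their outputs by router probability."""
    probs = softmax(tokens @ router_w)                  # (n_tokens, n_experts) routing weights
    out = np.zeros_like(tokens)
    for t, (tok, p) in enumerate(zip(tokens, probs)):
        for e in np.argsort(p)[-top_k:]:                # indices of the top-k experts for this token
            w1, w2 = experts[e]
            out[t] += p[e] * (np.maximum(tok @ w1, 0.0) @ w2)   # ReLU FFN, gated by router prob
    return out

tokens = rng.standard_normal((8, d_model))
print(moe_ffn(tokens).shape)   # (8, 16): only 1 of the 4 experts' weights is touched per token
```

This is why total parameter count can grow with the number of experts while per-token compute grows only with `top_k`; production MoE systems add an auxiliary load-balancing loss and per-expert capacity limits so tokens spread evenly across experts.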

🎯 2023–2025

| Model (Year) | Architecture Type | Attention Type | Positional Encoding | Normalization & Activation | Parameters & Context Length | Training Data / Domain | Innovations | Training Strategies | Capabilities |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4 (2023) [15][16] | Transformer (multimodal) | Multi-head self-attention (text & vision inputs) | Enhanced positional encoding (8k–32k context) | Details not public | Not disclosed (estimated ≈1.8T, rumored MoE [16]); 8,192 tokens (32,768 in the extended version) | Web text (pre-training); fine-tuned with code and imagery (multimodal) | Demonstrated powerful few-shot and reasoning abilities, with added vision input capability (accepts images as part of the prompt) | Post-trained with human feedback and model self-evaluation for alignment (reinforcement learning from human & AI feedback) | Achieved top-level performance on a wide range of tasks (coding, math, vision-language understanding) and exams; significantly more reliable and creative than earlier models |
| LLaMA (2023) [17] | Transformer Decoder (open weights) | Multi-head self-attention (auto-regressive) | Rotary positional embeddings (RoPE) | RMSNorm (pre-normalization); SwiGLU | 7B–65B (65B largest); 2048 tokens | 1.0–1.4T tokens of publicly available text (Common Crawl, Wikipedia, GitHub, etc.) | Openly released high-performance foundation model; achieved GPT-3-level performance with roughly 10× fewer parameters through efficient training and architectural tweaks | Trained on a curated large-scale dataset with extensive data cleaning and deduplication; used training efficiencies such as mixed precision | Enabled broad research and downstream customization (e.g., fine-tuned chat models) due to open access; foundation for many derivative models (Alpaca, etc.), democratizing LLM research |
| PaLM 2 (2023) [18] | Transformer Decoder (dense) | Multi-head self-attention (enhanced) | ALiBi positional bias (longer context) | Pre-LayerNorm; GELU | 340B (reported, largest "Ultra" model); 4096 tokens | Improved dataset spanning multiple languages, code, and math reasoning data | More compute-efficient than PaLM with improved multilingual and reasoning skills; strong coding ability and domain expertise via focused training data | Trained with an updated mixture of objectives (e.g., supervised learning on reasoning and coding tasks in addition to LM); leveraged prior PaLM insights with a reduced parameter count | Achieved superior performance across many benchmarks, including logic and translation tasks; formed the backbone of Google's Bard and enterprise models with faster inference |
| Claude (2023) [19][20] | Transformer Decoder (aligned AI assistant) | Multi-head self-attention (with long-context support) | Learned absolute (expanded context window) | Pre-LayerNorm; GELU | 52B (Claude 1) to 100B+ (Claude 2, estimated); 100,000 tokens (Claude 2, extended context) | Conversational and knowledge domains (fine-tuned from Anthropic's proprietary base LM) | Pioneered "Constitutional AI" to align model behavior via AI feedback rather than only human feedback, yielding a safer yet minimally supervised assistant | Initially fine-tuned with human feedback similar to InstructGPT, then optimized via a set of written principles (a "constitution") that the AI uses to self-refine its answers | Exhibits high-quality, less toxic dialogue and can handle extremely long documents in a single prompt (100k tokens), enabling analysis of lengthy texts; one of the first serious competitors to OpenAI's models |
| Gemini (2023) [21] | Multimodal Transformer (text, code, vision, audio) | Multimodal self-attention integrating different data types | Learned positional + modality-specific encodings | Pre-LayerNorm; SwiGLU | >1T parameters (estimated); 128k tokens | Multimodal and multilingual dataset (web text, images, code, audio, video) | Natively multimodal from the ground up: trained on text and other modalities together, enabling fluid combination of modalities and advanced reasoning | Pre-trained jointly on diverse modalities, then fine-tuned with targeted multimodal datasets; incorporates tool use (e.g., search, APIs) and code execution during fine-tuning for "agentic" behavior | Achieved state of the art on vision-language and multimodal benchmarks; capable of complex reasoning and planning across text, images, and more, representing Google DeepMind's answer to GPT-4 |
| Mistral (2023) [22] | Transformer Decoder (dense, efficient) | Sliding-window attention + grouped-query attention (GQA) | Rotary positional embeddings (RoPE) | RMSNorm; SwiGLU | 7B; 8,192 tokens | Publicly available English text, code, and reasoning data (OpenWeb, StackExchange, etc.) | Optimized for efficiency and performance; improved context window and inference cost compared to LLaMA 2 | Trained on curated high-quality data with attention optimizations and efficient token handling | Open-weight model with strong results on open benchmarks and better inference efficiency on edge devices |
| Qwen-1.5 (2024) [23] | Transformer Decoder (dense) | Multi-head self-attention + GQA | RoPE with extended context window (up to 128K) | RMSNorm; SwiGLU | 0.5B–72B; 8k–128k tokens | Multilingual + code-heavy datasets, instruction tuning | Introduced a wide range of open models from lightweight to ultra-scale sizes with strong performance and multilingual support | Instruction tuning, data deduplication, and large-context training pipelines for global use | Versatile, competitive models across open benchmarks in both English and Chinese; notable open-source support via Hugging Face |
| LLaMA 3 (2024) [24] | Transformer Decoder (dense, efficient) | Multi-head self-attention + GQA | RoPE with longer context support (128k planned) | RMSNorm; SwiGLU | 8B–70B (current), 400B+ planned; 8k–128k tokens | Cleaned Common Crawl, GitHub, multilingual text, academic papers | Meta's next-generation open models with improved alignment, multilingual performance, and code/data reasoning | Fine-tuned on curated datasets with alignment objectives (chat, code, math); trained on Meta's Research SuperCluster | Intended as a GPT-4-class public alternative, with strong few-shot, multilingual, and tool-use capabilities |
| DeepSeek-R1 (2025) [25][26] | Transformer Decoder (Mixture-of-Experts) | Multi-head latent attention (MLA) + sparse MoE feed-forward layers | Rotary positional embeddings (extended for very long context) | RMSNorm; SwiGLU | 671B total (MoE; ~37B active per token); 128,000 tokens | Broad web and knowledge corpora; specialized logical-reasoning datasets | "Reasoning-centric" LLM optimized via large-scale reinforcement learning to excel at step-by-step problem solving and logic tasks, with an unusually long context | Multi-stage training: pre-trained on diverse text, then large-scale reinforcement learning on reasoning tasks (the R1-Zero variant skips supervised fine-tuning entirely), with reward-model guidance and distillation into smaller models | Matches or surpasses similar-sized dense models on math, coding, and logic benchmarks at a fraction of the training cost; open-sourced by a Chinese startup, sparking global competitive pressure in advanced AI capabilities |
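
Most of the open models in the 2023–2025 table (LLaMA, Mistral, Qwen-1.5, LLaMA 3) share the same two entries in the positional-encoding and normalization columns: rotary positional embeddings (RoPE) and RMSNorm. Below is a minimal NumPy sketch of both, with toy shapes of my own choosing and the learned Q/K projections omitted; it illustrates the mechanism rather than reproducing any model's actual code.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only; no mean-centering, no bias
    (the learnable per-channel gain is omitted here for brevity)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def apply_rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (even, odd) channel pair of a
    query/key vector at position p by an angle p * theta_i, so that relative
    offsets show up directly in the q.k dot products."""
    seq_len, d = x.shape
    half = d // 2
    theta = base ** (-np.arange(half) / half)               # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                         # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

h = np.random.randn(8, 64)                # toy hidden states: 8 positions, width 64
q = apply_rope(rmsnorm(h))                # real models apply learned Q/K projections first
k = apply_rope(rmsnorm(h))
scores = q @ k.T / np.sqrt(64)            # relative-position-aware attention logits
print(scores.shape)                       # (8, 8)
```

Grouped-query attention (the "GQA" entries in the same table) is an orthogonal efficiency trick: several query heads share one key/value head, shrinking the KV cache that dominates memory at the long context lengths listed above.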

🎯 References

1. Attention Is All You Need, Ashish Vaswani et al., 2017 – NeurIPS. [Paper]
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova, 2018 – NAACL. [Paper]
3. Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, & Ilya Sutskever, 2018 – OpenAI (Technical Report). [PDF]
4. Language Models are Unsupervised Multitask Learners, Alec Radford et al., 2019 – OpenAI (Technical Report). [PDF]
5. XLNet: Generalized Autoregressive Pretraining for Language Understanding, Zhilin Yang et al., 2019 – NeurIPS. [Paper]
6. Language Models are Few-Shot Learners, Tom B. Brown et al., 2020 – NeurIPS. [Paper]
7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel et al., 2020 – JMLR. [Paper]
8. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, & Noam Shazeer, 2021 – JMLR. [Paper]
9. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du et al., 2022 – ICML. [Paper]
10. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Jack W. Rae et al., 2021 – DeepMind (Technical Report). [Paper]
11. Training Compute-Optimal Large Language Models, Jordan Hoffmann et al., 2022 – DeepMind (NeurIPS). [Paper]
12. LaMDA: Language Models for Dialog Applications, Romal Thoppilan et al., 2022 – arXiv preprint. [Paper]
13. PaLM: Scaling Language Modeling with Pathways, Aakanksha Chowdhery et al., 2022 – arXiv preprint. [Paper]
14. Training language models to follow instructions with human feedback, Long Ouyang et al., 2022 – OpenAI (NeurIPS). [Paper]
15. GPT-4 Technical Report, OpenAI, 2023. [Paper]
16. GPT-4 has more than a trillion parameters – Report, Matthias Bastian, 2023 – The Decoder. [Article]
17. LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron et al., 2023 – Meta AI. [Paper]
18. PaLM 2 Technical Report, Rohan Anil et al., 2023 – Google. [Paper]
19. Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai et al., 2022 – Anthropic. [Paper]
20. Introducing 100K Context Windows, Anthropic, 2023. [Blog]
21. Gemini: A Family of Highly Capable Multimodal Models (Technical Report), Google DeepMind, 2023. [PDF]
22. Mistral 7B Technical Report, Mistral AI, 2023. [Blog]
23. Qwen: The Qwen-1.5 Series, Alibaba DAMO Academy, 2024. [HuggingFace]
24. LLaMA 3 Preview, Meta AI, 2024. [Blog]
25. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025. [Paper]
26. DeepSeek-R1 model now available in Amazon Bedrock Marketplace…, Vivek Gangasani, Banu Nagasundaram, Jonathan Evans, & Niithiyn Vijeaswaran, 2025 – AWS Blog. [Article]
27. Discover AI – YouTube channel

Note

  • The LLMs listed here represent a personal selection from the many published models in recent years, based on what I've studied or followed.
  • This is not an exhaustive list, and many excellent models may not be included here.
  • This document is designed to keep me updated with growing LLM architectures. It is based on my personal understanding and synthesis of various research papers.
  • If you spot any errors or have suggestions for improvement, please feel free to reach out.
  • The references mentioned above have been explored over the past few months. There may be more detailed explanations available, and I'd be happy to dig deeper into those if needed.

Thank you for taking the time to read this. This is part of my personal study, so if you notice anything missing or needing correction, please reach out. I'd love to connect on LinkedIn, or check out my Github-AI-ML-Repo.