How machines learn to speak our language one token at a time.
Imagine you’re trying to learn a new language, say Japanese. On your first day, you’re handed a paragraph in kanji. No spaces. No familiar letters. Just symbols.
How do you even begin?
That’s exactly how computers feel when we throw raw text at them.
Natural Language Processing (NLP) bridges human language and machine understanding. But before models like ChatGPT or Google Translate can make sense of words, they must first break down language into digestible chunks. This is where tokenization and embeddings come in.

Let’s dive into this world, and by the end, you’ll not only understand these concepts, you’ll also be able to explain them to your curious friend over coffee.
Why Language Needs Decoding
Computers are pretty good with numbers. Text? Not so much.
To them, “I love pizza” is just a sequence of characters like:
01001001 00100000 01101100 01101111 01110110 01100101 00100000 01110000 01101001 01111010 01111010 01100001
Not very useful, right?
To interpret, analyze, and respond to human language, machines first need to:
- Break down the text (tokenization)
- Represent words in a numerical form they understand (embeddings)
Let’s begin with the first step.
Step 1: Tokenization – Cutting Language into Pieces
So… what’s a token?
A token is a piece of text that acts as a building block for language models.
Depending on the method, a token could be:
- A word → "I", "love", "pizza"
- A subword → "piz", "za"
- A character → "p", "i", "z", "z", "a"
- Even punctuation → ".", "!", "?"

Let’s tokenize a sentence
Sentence: “Time-travel is mind-blowing!”
| Tokenizer Type | Tokens |
|---|---|
| Word-level | ["Time-travel", "is", "mind-blowing", "!"] |
| Subword-level (BPE) | ["Time", "-", "travel", "is", "mind", "-", "blow", "ing", "!"] |
| Character-level | ["T", "i", "m", "e", "-", "t", "r", "a", "v", "e", "l", ...] |
Different models tokenize differently, and each choice has trade-offs.
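To make the table concrete, here is a minimal sketch in plain Python: word-level splitting with a small regular expression and character-level splitting with list(). (Subword tokenizers need a learned vocabulary, so one appears in a later sketch.)

```python
import re

sentence = "Time-travel is mind-blowing!"

# Word-level: keep hyphenated words together, split punctuation into its own token.
word_tokens = re.findall(r"[\w-]+|[^\w\s]", sentence)
print(word_tokens)  # ['Time-travel', 'is', 'mind-blowing', '!']

# Character-level: every character becomes a token (spaces dropped here for brevity).
char_tokens = list(sentence.replace(" ", ""))
print(char_tokens[:11])  # ['T', 'i', 'm', 'e', '-', 't', 'r', 'a', 'v', 'e', 'l']
```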
Types of Tokenization
1. Word Tokenization
- Splits text based on spaces or punctuation.
- Simple, but fails on out-of-vocabulary words like "YOLO" or "GPT-4".
2. Subword Tokenization (e.g., Byte Pair Encoding – BPE)
- Breaks rare words into common chunks.
- Balances vocabulary size and efficiency.
Example: "unbelievable" → "un", "believ", "able"
3. SentencePiece / Unigram
- Used in models like T5 and ALBERT (BERT relies on the closely related WordPiece algorithm).
- Doesn’t need whitespace, which makes it a good fit for languages like Japanese.
4. Character Tokenization
- Each letter is a token.
- Simple, but sequences get much longer and need more compute.
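To watch subword tokenization handle rare words, here is a short, hedged sketch using the Hugging Face transformers library and GPT-2’s byte-level BPE tokenizer (the first run downloads a small vocabulary file; the exact splits depend on the tokenizer’s learned vocabulary and may differ from the illustrative splits above).

```python
from transformers import AutoTokenizer

# GPT-2 ships with a byte-level BPE tokenizer trained on web text.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Rare or unusual words are broken into smaller, known pieces instead of failing.
for word in ["unbelievable", "YOLO", "GPT-4"]:
    print(word, "->", tokenizer.tokenize(word))
```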
Step 2: Embeddings – Giving Meaning to Tokens
Once tokens are ready, we need to map them to numbers, but not just any numbers: numbers that carry meaning.
This is where word embeddings shine.
What Are Embeddings?
An embedding is a vector (a list of numbers) that captures the semantic meaning of a token.
It’s like giving every word its own unique “coordinate” in a multi-dimensional space.
📌 Think of a giant word map, where similar words live close together.
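Inside a neural network, that coordinate is literally a row in a lookup table. Here is a minimal sketch using PyTorch’s nn.Embedding (an illustrative assumption; the vectors start out random and only become meaningful after training).

```python
import torch
import torch.nn as nn

# Toy vocabulary: each token gets an integer id.
vocab = {"i": 0, "love": 1, "pizza": 2}

# An embedding layer is just a lookup table: one row per token, embedding_dim columns.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

token_ids = torch.tensor([vocab["i"], vocab["love"], vocab["pizza"]])
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([3, 4]) -> one 4-dimensional vector per token
print(vectors[2])     # the (randomly initialised) vector for "pizza"
```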

Why Do We Use Embeddings?
- Computers can’t understand words, but they can compare numbers.
- Embeddings allow models to grasp that:
  - "king" is close to "queen"
  - "run" is similar to "jog"
  - "banana" is far from "keyboard"
Example: Vector Magic
Let’s say we have a 3D embedding space:
| Word | Embedding (3D) |
|---|---|
| "cat" | [0.2, 0.1, 0.9] |
| "dog" | [0.3, 0.2, 0.85] |
| "banana" | [0.7, 0.4, 0.1] |
Now we can calculate similarity using cosine similarity or distance metrics.
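Here is a minimal NumPy sketch of that idea, using the toy 3D vectors from the table above (made-up numbers, purely for illustration):

```python
import numpy as np

# Toy 3-D embeddings from the table above (illustrative values only).
embeddings = {
    "cat":    np.array([0.2, 0.1, 0.9]),
    "dog":    np.array([0.3, 0.2, 0.85]),
    "banana": np.array([0.7, 0.4, 0.1]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means "pointing the same way".
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))      # ~0.99, very similar
print(cosine_similarity(embeddings["cat"], embeddings["banana"]))   # ~0.36, not so much
```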
Popular Embedding Techniques
1. Word2Vec
- Trained to predict a word from its context (CBOW) or the context from a word (Skip-gram).
- Captures relationships like: 🧠 king - man + woman ≈ queen
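You can try that analogy yourself. A hedged sketch, assuming the gensim library and its downloadable pretrained "word2vec-google-news-300" vectors (a large one-time download):

```python
import gensim.downloader as api

# Load Word2Vec vectors pretrained on Google News (downloaded on first use).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)]
```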
2. GloVe (Global Vectors)
- Learns from word co-occurrence in a corpus.
- Embeddings are based on how often words appear together.

3. FastText
- Like Word2Vec but includes subword information.
- Helps with rare and misspelled words.
💡 "walking"
, "walker"
, and "walk"
share components.
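A hedged sketch of that subword behaviour, assuming the gensim library and a tiny made-up corpus (real FastText models are trained on far more text):

```python
from gensim.models import FastText

# A tiny toy corpus, purely for illustration.
sentences = [
    ["walking", "to", "the", "park"],
    ["the", "walker", "likes", "to", "walk"],
    ["she", "walks", "every", "morning"],
]

# min_n/max_n control the character n-grams that give FastText its subword information.
model = FastText(sentences=sentences, vector_size=32, window=3,
                 min_count=1, min_n=3, max_n=5, epochs=50)

# Even a misspelled, never-seen word gets a vector, assembled from its character n-grams.
print(model.wv["wallking"][:5])
print(model.wv.similarity("walking", "walker"))
```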
4. Transformer-based Embeddings (BERT, GPT)
- Generate contextual embeddings: a word’s meaning changes with its context!
Example: "bank" in "river bank" vs. "bank loan" → different vectors.
Case Study: Embeddings in Action
🏥 Healthcare Chatbot
Imagine building a medical chatbot. It needs to understand the difference between:
"cold"
as an illness"cold"
as temperature
With contextual embeddings, models can adapt:
```python
from transformers import BertTokenizer, BertModel
import torch

# Load BERT's tokenizer and model (weights are downloaded on first run).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the sentence and run it through the model.
inputs = tokenizer("I have a cold", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 6, 768])
```
Each token now has a vector representation that takes the surrounding sentence into account.
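To see the difference in action, here is a hedged extension of the snippet above (same assumptions: transformers and torch installed) that compares the contextual vector of "cold" in an illness sentence and a temperature sentence. With a static embedding the two vectors would be identical; contextual embeddings pull them apart.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def cold_vector(sentence):
    # Return the contextual embedding of the token "cold" in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return outputs.last_hidden_state[0, tokens.index("cold")]

illness = cold_vector("I caught a nasty cold last week.")
weather = cold_vector("The water in the lake is freezing cold.")

# With one static vector per word this would be exactly 1.0; here it is noticeably lower.
print(torch.nn.functional.cosine_similarity(illness, weather, dim=0).item())
```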
From Words to Meaning: Wrapping It All Up
Tokenization and embeddings are like the alphabet and grammar of machine language.
- Tokenization breaks the text down, creating the LEGO blocks.
- Embeddings give those blocks meaning: shape, weight, and relationships.
Together, they unlock everything from:
✅ Sentiment analysis
✅ Chatbots
✅ Translation
✅ Search engines
✅ Voice assistants
Without them, NLP wouldn’t exist.
What’s Next?
Want to take it further?
- Explore https://huggingface.co/docs/tokenizers/en/index
- Play with Word2Vec visualizations
- Try embedding your own text using spaCy or transformers
Let’s Keep Learning
Understanding how language is decoded by machines is just the beginning.
If you found this helpful:
💬 Drop a comment with your questions
🔁 Share with someone curious about NLP
📌 Follow for more bite-sized AI breakdowns
Until next time, keep exploring the hidden language of machines.