Text Mining
Interactive Study Guide
A hands-on visual guide covering all lectures of the Text Mining 2025–2026 course at NOVA IMS by Bruno Jardim. Explore concepts, run interactive demos, and test your knowledge.
Course Roadmap
Click any card to dive into the lecture.
Introduction to NLP
What is NLP, its applications, challenges, the NLP pipeline, and the Bag-of-Words model.
N-grams & TF-IDF
Extending BoW with n-grams and weighting terms with TF-IDF for better representation.
Word Embeddings
Dense vector representations using Word2Vec (Skip-gram), PMI, and term-term matrices.
Sequential Models
RNNs, LSTMs and Seq2Seq architectures for processing text as sequences.
Attention & Transformers
Self-attention mechanism, Transformer architecture and BERT encoder models.
Large Language Models
Generative AI, model configuration, RAG, evaluation and agentic AI systems.
Key Progression of Ideas
Each technique addresses a limitation of the previous one: first the curse of dimensionality, then missing semantic relationships, lost word order, and finally lack of context.
6 Lectures
Complete course coverage from basics to state-of-the-art LLMs
Interactive Demos
Run BoW, TF-IDF, N-gram, attention and token generators live
Quizzes
Test your understanding with targeted questions per lecture
Formulas & Diagrams
All key formulas and architecture diagrams explained visually
Introduction to NLP
What is NLP?
Natural Language Processing (NLP) is an area of computer science and AI concerned with the interaction between computers and humans in natural language. The ultimate goal is to enable computers to understand language as well as we do.
Machine Translation
Translate text between languages automatically
Chatbots
Dialog systems that converse naturally with users
Search Engines
Understanding queries to retrieve relevant documents
Predictive Keyboards
Auto-correct and next-word prediction
Challenges of Natural Language
Variability (Paraphrasing)
Different sentences can have the same meaning: we can say the same thing in many ways.
Ambiguity
A single sentence can have different meanings. The only way to deal with ambiguity is through context.
"I saw the man with the telescope"
Generalization
NLP systems are often trained in one domain but used in another. They may encounter:
- OOD (Out-of-Domain): inputs from a domain the model was not trained on
- OOV (Out-of-Vocabulary): words the model has never seen before
The NLP Pipeline
Text Preprocessing Methods
Tokenization
Split text into individual tokens (words, subwords, characters)
Lowercasing
Convert all text to lowercase to reduce vocabulary size
Stop Word Removal
Remove common words ("the", "is", "in") that carry little meaning
Stemming
Reduce words to their root form (running → run)
Lemmatization
Reduce to dictionary form considering grammar context
POS Tagging
Assign grammatical roles: noun, verb, adjective, etc.
Bag of Words (BoW)
Each word in the vocabulary becomes a feature. Documents are represented as sparse vectors: a count of how many times each word appears.
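A minimal sketch of the idea, assuming naive preprocessing (lowercasing and whitespace tokenization only). The function name `bag_of_words` and the two example sentences are illustrative, not part of the course material.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column one vocabulary word. With a realistic vocabulary most entries are zero, which is why the vectors are called sparse.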
Interactive Bag-of-Words Demo
Type two sentences below and see the BoW matrix.
Quick Quiz
N-grams & TF-IDF
N-grams: Adding Word Order Context
An n-gram is a contiguous sequence of n items from text. By using n-grams instead of single words, we partially recover word order information.
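As a rough illustration, word-level n-grams can be produced with a sliding window over the token list. The helper name `ngrams` and the whitespace tokenization are assumptions made for this sketch.

```python
def ngrams(text, n):
    """Return all contiguous word n-grams of `text` as tuples."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("the movie was not good", 2)
print(bigrams)
# [('the', 'movie'), ('movie', 'was'), ('was', 'not'), ('not', 'good')]
```

The bigram ('not', 'good') keeps a negation that a unigram Bag-of-Words representation would split into two unrelated features.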
Interactive N-gram Generator
TF-IDF: Weighting Words by Importance
Not all words are equally important. TF-IDF (Term Frequency–Inverse Document Frequency) gives higher weight to words that are frequent in a document but rare across the corpus.
Term Frequency (TF)
How often does the term appear in the document?
Example: "dog" appears 3 times in doc1 → TF = 3
Inverse Document Frequency (IDF)
How rare is the term across all documents?
|D| = total docs, nₜ = docs containing term t
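A sketch of one common formulation consistent with these symbols, assuming raw counts for TF (as in the "dog" example) and IDF(t) = log(|D| / nₜ) with the natural logarithm. Other variants (smoothed IDF, log-scaled TF) exist, and the course may use a different one. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    num_docs = len(tokenized)                                  # |D|
    doc_freq = {t: sum(t in toks for toks in tokenized)        # n_t
                for t in vocab}
    idf = {t: math.log(num_docs / doc_freq[t]) for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)                                     # raw counts
        rows.append({t: tf[t] * idf[t] for t in vocab})
    return rows

weights = tf_idf(["dog dog dog cat", "cat bird"])
print(round(weights[0]["dog"], 4))  # 3 * ln(2/1)
print(weights[0]["cat"])            # 0.0, because "cat" occurs in every document
```

Under this unsmoothed variant, a term that appears in every document gets weight zero regardless of how often it occurs.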
Interactive TF-IDF Calculator
Enter two documents to compute their TF-IDF matrices.
Quick Quiz
Word Representations & Embeddings
Why Embeddings?
Famous word embedding property:
Embeddings capture semantic analogies mathematically!
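The analogy usually cited for this property is king - man + woman ≈ queen. The sketch below reproduces it with hand-made 2-D vectors, so the numbers are hypothetical and chosen only to make the arithmetic visible. Real embeddings are learned and the match is approximate.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)  # queen
```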
Ways to Generate Word Representations
Term-Term Matrix
A matrix of size |V| × |V|. Each cell records how many times a target word (row) and a context word (column) co-occur in the corpus.
Problem: the matrix is still |V| × |V|, which is huge and sparse.
Point-wise Mutual Information (PMI)
Measures how often two words co-occur compared to what we'd expect by chance:
Since negative values can be unreliable, we use Positive PMI (PPMI) which replaces all negatives with 0:
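A sketch assuming the standard definitions PMI(w, c) = log2( P(w, c) / (P(w) · P(c)) ) and PPMI(w, c) = max(PMI(w, c), 0), with probabilities estimated from co-occurrence counts. The function name `ppmi` and the toy counts are illustrative.

```python
import math

def ppmi(cooc):
    """cooc[w][c] = co-occurrence count of target w with context c."""
    total = sum(sum(row.values()) for row in cooc.values())
    row_tot = {w: sum(row.values()) for w, row in cooc.items()}
    col_tot = {}
    for row in cooc.values():
        for c, n in row.items():
            col_tot[c] = col_tot.get(c, 0) + n
    out = {}
    for w, row in cooc.items():
        out[w] = {}
        for c, n in row.items():
            if n == 0:
                out[w][c] = 0.0  # PMI would be -infinity; PPMI clips it to 0
                continue
            p_wc = n / total
            p_w = row_tot[w] / total
            p_c = col_tot[c] / total
            out[w][c] = max(math.log2(p_wc / (p_w * p_c)), 0.0)
    return out

# Hypothetical counts for illustration only.
counts = {
    "coffee": {"cup": 8, "the": 2},
    "tea":    {"cup": 2, "the": 8},
}
scores = ppmi(counts)
print(round(scores["coffee"]["cup"], 3))  # positive: co-occur more than chance
print(scores["coffee"]["the"])            # 0.0: negative PMI clipped
```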
Word2Vec (Skip-gram)
Word2Vec learns dense low-dimensional word vectors by training a shallow neural network. Words used in similar contexts end up with similar vectors.
For each word, consider a window of surrounding words as context. These are positive pairs (target=1).
Randomly sample words that don't appear in the same window β these are negative pairs (target=0).
Given (word, context), predict if they co-occur: P(context | word).
After training, the first weight matrix W (shape V × n) IS the embedding lookup table.
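The pair-construction steps above can be sketched as follows. This is only the data preparation, not the network training. Negatives are drawn uniformly here for simplicity, whereas the original Word2Vec implementation samples them from a smoothed unigram distribution. The function name and parameters are illustrative.

```python
import random

def skipgram_pairs(tokens, window=2, negatives_per_positive=1, seed=0):
    """Return (word, context, target) triples with target 1 or 0."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        for ctx in context:
            pairs.append((word, ctx, 1))          # positive pair
        # Negative candidates: vocabulary words outside this window.
        candidates = [w for w in vocab if w not in context and w != word]
        for _ in range(negatives_per_positive * len(context)):
            if candidates:
                pairs.append((word, rng.choice(candidates), 0))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=1)
print([p for p in pairs if p[0] == "quick"])
```

A binary classifier trained on these triples learns one vector per word, and those vectors are the rows of the lookup table described above.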
Word Embedding Space (2D Projection)
Simulated 2D embedding space showing semantic relationships between words.
Note: Real embeddings have hundreds of dimensions. This is a 2D PCA projection for visualization.
Using Word2Vec for Classification
To classify a sentence using word embeddings, a simple approach is to average the word vectors:
This gives a fixed-size dense vector that can be fed into any classifier. However, averaging loses word order; Sequential Models (L3) solve this!
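A sketch of the averaging step, i.e. the sentence vector is (1/n) · Σ vᵢ over the n known word vectors. The embeddings below are hypothetical 2-D values; the example also shows that two different word orders give the same vector.

```python
def sentence_vector(tokens, embeddings):
    """Average the vectors of all tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Hypothetical toy embeddings for illustration only.
toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "not": [-1.0, 0.0]}

a = sentence_vector(["not", "good", "movie"], toy)
b = sentence_vector(["good", "movie", "not"], toy)
print(a == b)  # True: word order has no effect on the average
```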
Quick Quiz
Sequential Models
Why Sequential Models?
Averaging word embeddings throws away word order. We need a model that processes text token by token, maintaining a memory of what came before.
Sequence Labeling
Assign a label to each token (e.g., NER)
Sequence Classification
Classify an entire sequence (e.g., sentiment)
Machine Translation
Map one sequence to another in a different language
Next Word Prediction
Predict the most likely next word given context
Recurrent Neural Networks (RNN)
An RNN contains a cycle in its connections: each hidden state depends on the previous one, enabling the network to "remember" past inputs.
RNN unrolled over time: each hidden state h depends on the previous h and the current word embedding.
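A scalar sketch of the recurrence, assuming the common Elman form h_t = tanh(w_x · x_t + w_h · h_{t-1} + b). Real RNNs use weight matrices and vector inputs; the weights here are arbitrary values picked for illustration.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over `inputs` and return every hidden state."""
    h = 0.0                      # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # new state from input + old state
        states.append(h)
    return states

forward = rnn_forward([1.0, 0.0])
reverse = rnn_forward([0.0, 1.0])
print(forward[-1], reverse[-1])  # different final states for different orders
```

Unlike the averaged embedding, the final hidden state changes when the same inputs arrive in a different order.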
Long Short-Term Memory (LSTM)
LSTMs are explicitly designed to avoid the long-term dependency problem. They split context management into two subproblems: forgetting old info and adding new info.
The Three LSTM Gates
Forget Gate
Decides what information to throw away from the cell state. Uses sigmoid: values close to 0 = forget, close to 1 = keep.
Input Gate
Decides what new information to store. Sigmoid selects what to update; tanh creates new candidate values.
Output Gate
Decides what the next hidden state should be: a filtered version of the cell state that becomes the output.
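The three gates can be sketched for a single scalar unit, assuming the standard LSTM update c_t = f · c_{t-1} + i · g and h_t = o · tanh(c_t). The parameter names (`wf`, `uf`, `bf`, ...) are invented for this example, and the bias values are chosen to force a gate open or closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a one-unit LSTM; `p` holds the scalar weights and biases."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g        # drop old info, then add new info
    h = o * math.tanh(c)          # filtered view of the cell state
    return h, c

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo"]
base = dict.fromkeys(names, 0.0)
keep = {**base, "bf": 10.0, "bi": -10.0}     # forget gate ~1: keep the cell state
forget = {**base, "bf": -10.0, "bi": -10.0}  # forget gate ~0: erase the cell state

_, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=keep)
_, c_forget = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=forget)
print(round(c_keep, 3), round(c_forget, 3))
```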
LSTM Gate Simulator
Adjust values to see how each gate affects the cell state.
Sequence-to-Sequence (Seq2Seq)
Seq2Seq models map one sequence to another and are used for machine translation, summarization, and chatbots.
The encoder compresses the input into a fixed-size context vector (sentence embedding). The decoder generates the output sequence token by token, conditioned on this vector.
Quick Quiz
Attention & Transformers
The Attention Mechanism
Instead of compressing the entire input into one context vector, Attention allows the decoder to look at all encoder hidden states and decide which ones to focus on at each decoding step.
For each encoder hidden state eⱼ, compute the dot product with the current decoder state hᵈᵢ₋₁: score(i,j) = hᵈᵢ₋₁ · eⱼ
Turn scores into a probability distribution: α = softmax(scores). Each αⱼ tells how much to attend to encoder state j.
Weighted average of encoder states: cᵢ = Σ αⱼ · eⱼ. This dynamic context is unique for each decoding step!
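The three steps above (scores, softmax, weighted average) can be sketched in plain Python. The encoder and decoder vectors are hypothetical 2-D values, and the function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoding step."""
    # Step 1: score each encoder state against the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Step 2: normalize the scores into attention weights.
    alphas = softmax(scores)
    # Step 3: context vector = weighted average of encoder states.
    dim = len(encoder_states[0])
    context = [sum(a * enc[k] for a, enc in zip(alphas, encoder_states))
               for k in range(dim)]
    return alphas, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = attend([2.0, 0.0], encoder_states)
print([round(a, 3) for a in alphas])
```

Calling `attend` again with a different decoder state produces different weights, which is what makes the context vector specific to each decoding step.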
Attention Heatmap Visualizer
Interactive attention weights between source (English) and target (Portuguese) words.
The Transformer Architecture
Encoder Stack
Reads the input sequence simultaneously. Each token can attend to all others.
Components:
- Multi-Head Self-Attention
- Feed-Forward Neural Network
- Layer Normalization + Residual Connections
Decoder Stack
Generates output one token at a time.
Components:
- Masked Self-Attention (can't see future tokens)
- Encoder-Decoder Attention
- Feed-Forward Neural Network
Self-Attention vs Regular Attention
In self-attention, every token in the sequence attends to every other token in the same sequence. This lets the model understand relationships like co-reference ("the animal is tired because it" → "it" = animal).
Q = Queries, K = Keys, V = Values: three different projections of the same token embeddings. Dividing by √dₖ keeps the dot products from growing with the key dimension, which would otherwise saturate the softmax and cause vanishing gradients.
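Assuming these symbols refer to the standard scaled dot-product attention, the computation can be written as below. Here X denotes the matrix of token embeddings and W^Q, W^K, W^V are learned projection matrices; these names are chosen for illustration.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q = X W^{Q},\quad K = X W^{K},\quad V = X W^{V}
```

Row i of the softmax output holds the attention weights of token i over all tokens in the sequence.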
BERT: Encoder-Only Transformer
Some Transformers use only the encoder because their goal is to understand text, not generate it.
BERT (Google, 2019)
- 24 encoder layers (BERT-Large)
- 16 self-attention heads
- 1024 hidden dimensions
- ~340M parameters
Pre-training Tasks
- MLM (Masked Language Modeling): predict masked tokens
- NSP (Next Sentence Prediction): does sentence B follow A?
[CLS] Token & Sentence Embeddings
BERT adds a special [CLS] token at the start of every input. During training, this token's final embedding becomes a summary of the entire sequence and is used for classification tasks.
Multi-Head Self-Attention Demo
See how different attention heads focus on different word relationships. Click a word to see its attention pattern.
Quick Quiz
Large Language Models
Decoder-Only Transformers (GPT-style)
While BERT uses only encoders for understanding, decoder-only models (like GPT, LLaMA) are designed to generate text autoregressively, one token at a time.
How Token Generation Works
Input tokens pass through all decoder layers with masked self-attention.
The output vector is projected to vocabulary size and normalized into probabilities.
The next token is chosen (greedily or by sampling). It's appended to the input and the process repeats.
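The generation loop above can be sketched with a lookup table standing in for the decoder stack. The table `TOY_LOGITS` is hypothetical and conditions only on the last token, whereas a real model conditions on the whole prefix; greedy selection is used here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

VOCAB = ["<eos>", "the", "cat", "sat"]

# Hypothetical stand-in for the decoder layers: last token -> logits over VOCAB.
TOY_LOGITS = {
    "<bos>": [0.0, 3.0, 1.0, 0.0],
    "the":   [0.0, 0.0, 3.0, 1.0],
    "cat":   [0.5, 0.0, 0.0, 3.0],
    "sat":   [3.0, 0.5, 0.0, 0.0],
}

def generate_greedy(max_tokens=10):
    tokens = ["<bos>"]
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])      # project + normalize
        next_token = VOCAB[probs.index(max(probs))]  # greedy choice
        if next_token == "<eos>":
            break
        tokens.append(next_token)                    # append and repeat
    return tokens[1:]

generated = generate_greedy()
print(generated)  # ['the', 'cat', 'sat']
```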
Token-by-Token Generation Demo
See how LLMs generate text one token at a time.
Note: This is a demonstration simulation β not a real language model.
Generative AI Models
Closed Source LLMs
Accessed via API. Architecture and weights are private.
- GPT-4 / GPT-4o (OpenAI)
- Claude 3 / Claude 3.5 (Anthropic)
- Gemini (Google)
Open Source LLMs
Weights are public, so the models can be run locally.
- LLaMA / LLaMA 3 (Meta)
- Mistral / Mixtral
- Falcon, Gemma
Model Configuration Parameters
Temperature
Controls randomness. Low (0) = deterministic; High (1+) = creative/random.
Top-P (Nucleus)
Sample from the smallest set of tokens whose cumulative probability ≥ P.
Top-K
Only consider the top K most probable tokens at each step.
Max Tokens
Maximum number of tokens the model will generate in a response.
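A sketch of how the first three parameters reshape the next-token distribution, assuming temperature divides the logits before the softmax and that Top-K and Top-P renormalize the surviving probabilities. Function names and logits are illustrative, and actual APIs may combine these steps differently.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # temperature must be > 0; values near 0 approach greedy selection.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

logits = [2.0, 1.0, 0.0]                       # hypothetical logits
probs = softmax_with_temperature(logits)       # T = 1
sharp = softmax_with_temperature(logits, 0.5)  # more peaked
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
nucleus = top_p_filter(probs, 0.9)
print([round(x, 3) for x in sharp], [round(x, 3) for x in flat])
```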
Temperature Effect Demo
See how temperature changes the probability distribution over next tokens.
Retrieval Augmented Generation (RAG)
LLMs can't know information after their training cutoff and may hallucinate. RAG grounds LLMs with specific, authoritative external data.
Split documents into chunks (sentence, paragraph, or overlapping windows). Embed each chunk into a vector database.
Embed the user query. Find the nearest chunks using cosine similarity (dense search), keyword matching (sparse), or both (hybrid).
Inject the retrieved context into the prompt. The LLM generates an answer grounded in the retrieved information.
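The retrieve-then-prompt flow above can be sketched with word counts standing in for an embedding model and a Python list standing in for a vector database. All names, chunks, and the prompt wording are hypothetical, and the final call to an LLM is omitted.

```python
import math
from collections import Counter

def embed(text):
    # Hypothetical stand-in for a real embedding model: word counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_n=1):
    """Return the top_n chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_n]

def build_prompt(query, context_chunks):
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "the library opens at nine on weekdays",
    "the cafeteria serves lunch at noon",
]
question = "when does the library open"
best = retrieve(question, chunks)
prompt = build_prompt(question, best)
print(prompt)
```

Note that "open" and "opens" do not match under this word-count stand-in, which is one reason dense embeddings or hybrid search are preferred in practice.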
Agentic AI
On its own, an LLM just predicts the next token. LLM Agents use the LLM as a reasoning brain augmented with memory, tools, and planning.
Memory
Short-term (context window) + Long-term (RAG / external database). Addresses the stateless nature of LLMs.
Tools
APIs, code execution, web search β external interfaces the LLM can call to interact with the world.
Planning (ReAct)
Reason + Act: the LLM interleaves reasoning steps with tool calls, iterating until the task is complete.
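A toy sketch of the Reason + Act loop, with a scripted function in place of a real LLM so that it runs without any API. The tool, the `scripted_llm` policy, and the step format are all hypothetical; a real agent would parse the model's text output to decide the next action.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calculator(expression):
    """Tool: evaluate a simple 'a op b' arithmetic expression."""
    a, op, b = expression.split()
    return OPS[op](float(a), float(b))

TOOLS = {"calculator": calculator}

def scripted_llm(history):
    """Hypothetical stand-in for an LLM: picks the next step from the history."""
    observations = [step for step in history if step[0] == "observation"]
    if not observations:
        return ("action", "calculator", "6 * 7")   # reason: a tool is needed
    return ("final", f"The answer is {observations[-1][1]}")

def react_loop(question, llm, tools, max_steps=5):
    history = [("question", question)]             # short-term memory
    for _ in range(max_steps):
        step = llm(history)
        if step[0] == "final":
            return step[1], history
        _, tool_name, tool_input = step
        history.append(("action", tool_name, tool_input))
        history.append(("observation", tools[tool_name](tool_input)))
    return None, history                           # gave up after max_steps

answer, trace = react_loop("What is 6 times 7?", scripted_llm, TOOLS)
print(answer)  # The answer is 42.0
```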