Text Mining
Interactive Study Guide

A hands-on visual guide covering all lectures of the Text Mining 2025–2026 course at NOVA IMS by Bruno Jardim. Explore concepts, run interactive demos and test your knowledge.

Course Roadmap

Lecture 1

Introduction to NLP

What is NLP, its applications, challenges, the NLP pipeline, and the Bag-of-Words model.

Corpus | BoW | Pipeline | Preprocessing
Lecture 2 (Extra)

N-grams & TF-IDF

Extending BoW with n-grams and weighting terms with TF-IDF for better representation.

N-grams | TF-IDF | Distance
Lecture 2

Word Embeddings

Dense vector representations using Word2Vec (Skip-gram), PMI, and term-term matrices.

Word2Vec | Skip-gram | PMI | Embeddings
Lecture 3

Sequential Models

RNNs, LSTMs and Seq2Seq architectures for processing text as sequences.

RNN | LSTM | Seq2Seq | Gates
Lecture 4

Attention & Transformers

Self-attention mechanism, Transformer architecture and BERT encoder models.

Attention | Transformer | BERT | Self-attention
Lecture 6

Large Language Models

Generative AI, model configuration, RAG, evaluation and agentic AI systems.

LLMs | RAG | Agents | Prompting

Key Progression of Ideas

BoW → N-grams / TF-IDF → Word2Vec → RNN / LSTM → Attention → Transformers / LLMs

Each technique builds on the limitations of the previous one, gradually solving the curse of dimensionality, semantic relationships, word order, and context.

📚

6 Lectures

Complete course coverage from basics to state-of-the-art LLMs

⚡

Interactive Demos

Run BoW, TF-IDF, N-gram, attention and token generators live

🎯

Quizzes

Test your understanding with targeted questions per lecture

📐

Formulas & Diagrams

All key formulas and architecture diagrams explained visually

Lecture 1

Introduction to NLP

What is NLP?

Natural Language Processing (NLP) is an area of computer science and AI concerned with the interaction between computers and humans in natural language. The ultimate goal is to enable computers to understand language as well as we do.

🌐

Machine Translation

Translate text between languages automatically

💬

Chatbots

Dialog systems that converse naturally with users

🔍

Search Engines

Understanding queries to retrieve relevant documents

⌨️

Predictive Keyboards

Auto-correct and next-word prediction

Challenges of Natural Language

Variability (Paraphrasing)

Different sentences can have the same meaning: we can say the same thing in many ways.

Sentence A
"The president greets the press"
Same meaning (paraphrase)
"Trump speaks to the media"

Ambiguity

A single sentence can have different meanings. The only way to deal with ambiguity is through context.

"I saw the man with the telescope"

Did I use a telescope to see the man?
Or did I see a man who had a telescope?

Generalization

NLP systems are often trained in one domain but used in another. They may encounter:

  • OOD (Out-of-Domain): inputs from a domain the model was not trained on
  • OOV (Out-of-Vocabulary): words the model has never seen before

The NLP Pipeline

📄 Corpus → Text Preprocessing → Feature Engineering → 🎯 Task / Model
💡
Corpus: A collection of text organized into datasets. A corpus can include news articles, Wikipedia pages, novels, tweets, etc. The plural of corpus is corpora. Always split the data into Train/Validation/Test sets (typically 80/10/10 for small datasets) and keep an untouched copy of the original!
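
A minimal sketch of the 80/10/10 split mentioned above, using only the standard library. The function name `split_corpus`, the fixed seed, and the toy documents are assumptions made for illustration.

```python
import random

def split_corpus(documents, train=0.8, validation=0.1, seed=42):
    """Shuffle a copy of the corpus and cut it into train/validation/test parts."""
    shuffled = list(documents)             # copy, so the original stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])    # whatever remains is the test set

docs = [f"document {i}" for i in range(20)]
train_set, val_set, test_set = split_corpus(docs)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting avoids a split that follows the original ordering of the corpus (for example, by date or source).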

Text Preprocessing Methods

🔑

Tokenization

Split text into individual tokens (words, subwords, characters)

🔽

Lowercasing

Convert all text to lowercase to reduce vocabulary size

🚫

Stop Word Removal

Remove common words ("the", "is", "in") that carry little meaning

🌿

Stemming

Reduce words to their root form (running → run)

📖

Lemmatization

Reduce words to their dictionary form (lemma), taking grammatical context into account

✏️

POS Tagging

Assign grammatical roles: noun, verb, adjective, etc.
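
The first three steps above can be sketched in a few lines. This is an illustrative simplification: the regular expression is a crude word-level tokenizer and `STOP_WORDS` is a tiny hand-picked list, both assumptions rather than anything prescribed by the course. Stemming, lemmatization and POS tagging usually rely on a dedicated NLP library and are left out.

```python
import re

# Tiny illustrative stop word list (real lists contain hundreds of entries).
STOP_WORDS = {"the", "is", "in", "a", "an", "of", "and"}

def preprocess(text):
    # Lowercase first, then split into word-level tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Drop common words that carry little meaning.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The dog is running in the park"))
# ['dog', 'running', 'park']
```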

Bag of Words (BoW)

Each word in the vocabulary becomes a feature. Documents are represented as sparse vectors that count how many times each word appears.

⚠️
Problems with BoW: (1) Curse of dimensionality (huge sparse vectors); (2) No semantic relationships ("king" and "queen" have no connection); (3) All words have equal importance; (4) Word order is discarded.
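
As a rough illustration of the representation described above, the sketch below builds a BoW matrix with whitespace tokenization. The helper name `bag_of_words` and the two toy sentences are assumptions for the example.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary word; missing words count as 0.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, bow = bag_of_words(["great movie", "movie great"])
print(vocab)  # ['great', 'movie']
print(bow)    # [[1, 1], [1, 1]]
```

Both sentences map to the same vector, which is problem (4) in action: the word order is gone.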

🔬 Interactive Bag-of-Words Demo

Type two sentences below and see the BoW matrix.

🧪 Quick Quiz

1. What is the main problem with Bag-of-Words representations?
It cannot represent documents of different lengths
Word order is discarded and all words are treated equally
It requires labeled data to build the vocabulary
It can only handle English text
Word order and semantics are both lost in BoW: "great movie" and "movie great" produce the same vector!
2. What is a Corpus in NLP?
A type of neural network layer
A tokenization algorithm
A collection of text organized into datasets
A text preprocessing step
Lecture 2 (Extra)

N-grams & TF-IDF

N-grams: Adding Word Order Context

An n-gram is a contiguous sequence of n items from text. By using n-grams instead of single words, we partially recover word order information.

Unigrams (1-gram)
"The" | "dog" | "barks"
Bigrams (2-gram)
"The dog" | "dog barks"
Trigrams (3-gram)
"The dog barks"
⚠️
Data Sparseness: As N increases, n-grams become increasingly rare, and most combinations never appear in training data. This is called the data sparseness problem. The phrase "Gollum loves his precious" may never appear exactly as-is!
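
The unigram, bigram and trigram examples above can be reproduced with a short helper. The function name `ngrams` and the whitespace tokenization are assumptions for this sketch.

```python
def ngrams(text, n):
    """Return all contiguous sequences of n tokens from the text."""
    tokens = text.split()
    # The range is empty when the text has fewer than n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The dog barks", 1))  # ['The', 'dog', 'barks']
print(ngrams("The dog barks", 2))  # ['The dog', 'dog barks']
print(ngrams("The dog barks", 3))  # ['The dog barks']
```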

🔬 Interactive N-gram Generator

TF-IDF: Weighting Words by Importance

Not all words are equally important. TF-IDF (Term Frequency – Inverse Document Frequency) gives higher weight to words that are frequent in a document but rare across the corpus.

Term Frequency (TF)

How often does the term appear in the document?

TF(t, d) = count(t in d)

Example: "dog" appears 3 times in doc1 → TF = 3

Inverse Document Frequency (IDF)

How rare is the term across all documents?

IDF(t, D) = log(|D| / nₜ)

|D| = total number of documents, nₜ = number of documents containing term t

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
💡
Intuition: The word "the" appears in every document → high TF but low IDF → low TF-IDF score. The word "transformer" appears in few docs but many times → high TF-IDF → very informative!
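
A minimal sketch that follows the two formulas above literally: raw counts for TF and log(|D| / nₜ) for IDF. The natural logarithm is an assumption, since the text does not fix the base, and library implementations often add smoothing terms that are not shown here.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # n_t: number of documents that contain each term at least once.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores

scores = tf_idf(["the dog barks at the dog", "the cat sleeps"])
print(scores[0]["the"])  # 0.0, because "the" appears in every document
print(scores[0]["dog"])  # 2 * log(2), "dog" is frequent here and absent elsewhere
```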

🔬 Interactive TF-IDF Calculator

Enter two documents to compute their TF-IDF matrices.

🧪 Quick Quiz

A word that appears in every document will have a TF-IDF score of…
0, because IDF = log(N/N) = 0
1, because it's the most common word
Very high, because it appears everywhere
Equal to its term frequency
Lecture 2

Word Representations & Embeddings

Why Embeddings?

📌
BoW still has problems: (1) Curse of dimensionality (2) No semantic relationships (3) Equal word importance (4) Context changes meaning. Solution: move from document-level to word/token-level representations.

Famous word embedding property:

king − man + woman ≈ queen

Embeddings capture semantic analogies mathematically!

Ways to Generate Word Representations

Term-Term Matrix

A matrix of size |V| × |V|. Each cell records how many times a target word (row) and a context word (column) co-occur in the corpus.

Problem: Dimensionality is still |V| × |V|, which is huge and sparse.

Point-wise Mutual Information (PMI)

Measures how often two words co-occur compared to what we'd expect by chance:

PMI(w,c) = log₂( P(w,c) / (P(w)·P(c)) )

Since negative values can be unreliable, we use Positive PMI (PPMI), which replaces all negatives with 0:

PPMI(w,c) = max(0, PMI(w,c))
💡
PPMI addresses the "equal importance" problem from BoW: rare words that co-occur more often than chance get high PPMI scores.
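
The PMI and PPMI formulas can be checked on a toy term-term table. In this sketch the probabilities are estimated as relative frequencies of the co-occurrence counts; the function name `ppmi` and the counts themselves are made up for illustration.

```python
import math

def ppmi(cooc):
    """cooc maps (word, context) -> co-occurrence count; returns PPMI per pair."""
    total = sum(cooc.values())
    word_totals, context_totals = {}, {}
    for (word, context), count in cooc.items():
        word_totals[word] = word_totals.get(word, 0) + count
        context_totals[context] = context_totals.get(context, 0) + count
    scores = {}
    for (word, context), count in cooc.items():
        if count == 0:
            scores[(word, context)] = 0.0  # log(0) is undefined; PPMI clips to 0
            continue
        p_wc = count / total
        p_w = word_totals[word] / total
        p_c = context_totals[context] / total
        scores[(word, context)] = max(0.0, math.log2(p_wc / (p_w * p_c)))
    return scores

counts = {("ice", "cold"): 8, ("ice", "the"): 2,
          ("fire", "cold"): 1, ("fire", "the"): 9}
values = ppmi(counts)
print(values[("ice", "cold")])  # positive: co-occurs more often than chance
print(values[("ice", "the")])   # 0.0: negative PMI is clipped
```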

Word2Vec (Skip-gram)

Word2Vec learns dense low-dimensional word vectors by training a shallow neural network. Words used in similar contexts end up with similar vectors.

1
Define word & context

For each word, consider a window of surrounding words as context. These are positive pairs.

2
Add negative samples

Randomly sample words that don't appear in the same window. These are negative pairs (target=0).

3
Train a binary classifier

Given (word, context), predict whether the pair is a true co-occurrence: P(+ | word, context).

4
Extract the weight matrix

After training, the first weight matrix W (shape V × n, where V is the vocabulary size and n the embedding dimension) is the embedding lookup table.
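
Steps 1 and 2 can be sketched as a function that builds the labelled training pairs. This is a simplification under stated assumptions: negatives are drawn uniformly, a fixed number per target word, whereas Word2Vec implementations typically sample per positive pair from a smoothed frequency distribution. The name `skipgram_pairs` and its parameters are hypothetical, and the classifier training itself is not shown.

```python
import random

def skipgram_pairs(tokens, window=2, negatives=2, seed=0):
    """Return (word, context, label) triples: label 1 = positive, 0 = negative."""
    rng = random.Random(seed)
    vocabulary = sorted(set(tokens))
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Positive pairs: words inside the window around the target word.
        for ctx in context:
            pairs.append((word, ctx, 1))
        # Negative pairs: random words that are not in this window.
        candidates = [w for w in vocabulary if w != word and w not in context]
        for _ in range(negatives):
            if candidates:
                pairs.append((word, rng.choice(candidates), 0))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=1)
print([p for p in pairs if p[0] == "quick" and p[2] == 1])
# [('quick', 'the', 1), ('quick', 'brown', 1)]
```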

🔬 Word Embedding Space (2D Projection)

Simulated 2D embedding space showing semantic relationships between words.

Note: Real embeddings have hundreds of dimensions. This is a 2D PCA projection for visualization.

Using Word2Vec for Classification

To classify a sentence using word embeddings, a simple approach is to average the word vectors:

sentence_vector = (v₁ + v₂ + ... + vₙ) / n

This gives a fixed-size dense vector that can be fed into any classifier. However, averaging loses word order, which Sequential Models (L3) address!
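
A small sketch of the averaging formula, using plain lists. The two-dimensional `toy_embeddings` table is invented for the example, and skipping out-of-vocabulary words is one possible design choice, not the only one.

```python
def sentence_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV words
    if not vectors:
        raise ValueError("no token of the sentence has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

toy_embeddings = {"great": [1.0, 0.0], "movie": [0.0, 1.0]}
print(sentence_vector(["great", "movie"], toy_embeddings))  # [0.5, 0.5]
print(sentence_vector(["movie", "great"], toy_embeddings))  # [0.5, 0.5]
```

The two word orders give the same vector, which is the limitation noted above.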

🧪 Quick Quiz

What does Word2Vec (Skip-gram) learn to predict?
The sentiment of a sentence
Whether a context word appears near a target word (binary classification)
The part-of-speech tag of each word
The TF-IDF score of each word
Lecture 3

Sequential Models

Why Sequential Models?

Averaging word embeddings throws away word order. We need a model that processes text token by token, maintaining a memory of what came before.

📝

Sequence Labeling

Assign a label to each token (e.g., NER)

🗂️

Sequence Classification

Classify an entire sequence (e.g., sentiment)

🌍

Machine Translation

Map one sequence to another in a different language

🔤

Next Word Prediction

Predict the most likely next word given context

Recurrent Neural Networks (RNN)

An RNN contains a cycle in its connections: each hidden state depends on the previous one, enabling the network to "remember" past inputs.

[Diagram: RNN unrolled over the inputs "Great", "movie", "was", with hidden states h₁, h₂, h₃ and outputs y₁, y₂, y₃]

RNN unrolled over time: each hidden state h depends on the previous h and the current word embedding.

⚠️
RNN Problems: (1) Long Dependencies (the final hidden state reflects more of the end of the sentence than the beginning); (2) Vanishing Gradients (gradients shrink to near-zero during backpropagation through many steps, making early weights hard to update).
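
The recurrence can be written out by hand. The sketch below assumes the common formulation hₜ = tanh(Wₓ·xₜ + Wₕ·hₜ₋₁ + b), which the text above does not spell out; the weights and the two-dimensional toy embeddings are arbitrary values chosen for illustration, not trained parameters.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One time step: combine the current input with the previous hidden state."""
    h = []
    for row_x, row_h, bias in zip(W_x, W_h, b):
        total = bias
        total += sum(w * xi for w, xi in zip(row_x, x))       # input contribution
        total += sum(w * hi for w, hi in zip(row_h, h_prev))  # memory contribution
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_x, W_h, b):
    """Process the sequence token by token, starting from a zero hidden state."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states

W_x = [[0.5, 0.0], [0.0, 0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings for "Great", "movie", "was"
states = run_rnn(inputs, W_x, W_h, b)
print(states[-1])
```

Unlike the averaged sentence vector, feeding the same inputs in reverse order produces a different final state.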

Long Short-Term Memory (LSTM)

LSTMs are explicitly designed to avoid the long-term dependency problem. They split context management into two subproblems: forgetting old info and adding new info.

🔑
Key innovation: The Cell State (C) acts as a "memory highway" running through the sequence. Information can be written, read, and erased via special gates.

The Three LSTM Gates

🚪

Forget Gate

Decides what information to throw away from the cell state. Uses sigmoid: values close to 0 = forget, close to 1 = keep.

✏️

Input Gate

Decides what new information to store. Sigmoid selects what to update; tanh creates new candidate values.

📤

Output Gate

Decides what the next hidden state should be: a filtered version of the cell state that becomes the output.
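
A scalar sketch of how the three gates interact, assuming the standard update c = f·c_prev + i·g and h = o·tanh(c). In a real LSTM the four pre-activation values come from learned weights applied to the previous hidden state and the current input; here they are passed in directly (the `*_logit` parameter names are hypothetical) so the effect of each gate is easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_update(c_prev, forget_logit, input_logit, candidate_logit, output_logit):
    f = sigmoid(forget_logit)       # forget gate: close to 0 = forget, close to 1 = keep
    i = sigmoid(input_logit)        # input gate: how much new information to write
    g = math.tanh(candidate_logit)  # candidate values for the cell state
    o = sigmoid(output_logit)       # output gate: how much of the cell state to expose
    c = f * c_prev + i * g          # erase old memory, then add new memory
    h = o * math.tanh(c)            # hidden state = filtered view of the cell state
    return c, h

# Forget gate open, input gate closed: the old memory passes through almost unchanged.
c_keep, _ = lstm_cell_update(2.0, 10.0, -10.0, 1.0, 0.0)
# Forget gate closed, input gate closed: the memory is wiped.
c_wipe, _ = lstm_cell_update(2.0, -10.0, -10.0, 1.0, 0.0)
print(round(c_keep, 3), round(c_wipe, 3))  # 2.0 0.0
```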

🔬 LSTM Gate Simulator

Adjust values to see how each gate affects the cell state.

Sequence-to-Sequence (Seq2Seq)

Seq2Seq models map one sequence to another and are used for machine translation, summarization, and chatbots.

"Great movie" (EN) → Encoder (LSTM) → Context Vector → Decoder (LSTM) → "Ótimo filme" (PT)

The encoder compresses the input into a fixed-size context vector (sentence embedding). The decoder generates the output sequence token by token, conditioned on this vector.

⚠️
Bottleneck problem: Everything about the input must be crammed into one fixed-size vector. For long sentences (50+ words), this is lossy: the first words are hard to recover from the final hidden state. → This motivates the Attention mechanism (L4)!

🧪 Quick Quiz

What is the primary advantage of LSTMs over vanilla RNNs?
LSTMs are faster to train
LSTMs use attention to focus on relevant tokens
LSTMs mitigate the vanishing gradient and long-term dependency problems via gating
LSTMs process tokens in parallel
Lecture 4

Attention & Transformers

The Attention Mechanism

Instead of compressing the entire input into one context vector, Attention allows the decoder to look at all encoder hidden states and decide which ones to focus on at each decoding step.

1
Compute attention scores

For each encoder hidden state eⱼ, compute the dot product with the current decoder state hᵈᵢ₋₁: score(i,j) = hᵈᵢ₋₁ · eⱼ

2
Normalize with Softmax

Turn scores into a probability distribution: α = softmax(scores). Each αⱼ tells how much to attend to encoder state j.

3
Compute context vector

Weighted average of encoder states: cᵢ = Σⱼ αⱼ · eⱼ. This dynamic context is unique for each decoding step!
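
The three steps above can be sketched for a single decoding step. The vectors are toy values and the function names are assumptions; the softmax subtracts the maximum score only for numerical stability, which does not change the result.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    # Step 1: dot-product score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    # Step 2: normalize the scores into attention weights.
    weights = softmax(scores)
    # Step 3: context vector = weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
print(weights)  # the first encoder state gets most of the attention
print(context)
```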

🔬 Attention Heatmap Visualizer

Interactive attention weights between source (English) and target (Portuguese) words.

The Transformer Architecture

📄
"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer: a model that replaces recurrent connections with self-attention, enabling full parallelization and much faster training on GPUs.

🔷 Encoder Stack

Reads the input sequence simultaneously. Each token can attend to all others.

Components:

  • Multi-Head Self-Attention
  • Feed-Forward Neural Network
  • Layer Normalization + Residual Connections

🔶 Decoder Stack

Generates output one token at a time.

Components:

  • Masked Self-Attention (can't see future tokens)
  • Encoder-Decoder Attention
  • Feed-Forward Neural Network

Self-Attention vs Regular Attention

In self-attention, every token in the sequence attends to every other token in the same sequence. This lets the model understand relationships like co-reference ("the animal is tired because it" → "it" = animal).

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

Q=Queries, K=Keys, V=Values: three different projections of the same token embeddings. Dividing by √dₖ (dₖ is the key dimension) keeps the dot products from growing with the dimension, which would otherwise saturate the softmax and leave it with very small gradients.
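
A plain-Python sketch of the formula for a single attention head. To keep it short, the learned projection matrices are omitted, so Q, K and V are all set to the same toy token embeddings; that shortcut is an assumption of this example, not how a trained Transformer works.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

X = [[1.0, 0.0], [0.0, 1.0]]  # two toy token embeddings
out = scaled_dot_product_attention(X, X, X)
print(out)
```

Each token's output mixes in information from the other token, with the largest weight on the token most similar to its own query.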

BERT: Encoder-Only Transformer

Some Transformers use only the encoder because their goal is to understand text, not generate it.

BERT (Google, 2019)

  • 24 encoder layers (BERT-Large)
  • 16 self-attention heads
  • 1024 hidden dimensions
  • ~340M parameters

Pre-training Tasks

  • MLM (Masked Language Modeling): predict masked tokens
  • NSP (Next Sentence Prediction): does sentence B follow A?
🏗️
Transfer Learning: BERT is first pre-trained on massive unlabeled data (Wikipedia + BookCorpus), then fine-tuned on a specific task (e.g., sentiment analysis) with a small labeled dataset.

[CLS] Token & Sentence Embeddings

BERT adds a special [CLS] token at the start of every input. During training, this token's final embedding becomes a summary of the entire sequence and is used for classification tasks.

🔬 Multi-Head Self-Attention Demo

See how different attention heads focus on different word relationships. Click a word to see its attention pattern.

🧪 Quick Quiz

What key innovation did the Transformer paper ("Attention Is All You Need") introduce?
Using LSTMs with bidirectional encoding
Adding more layers to existing RNNs
Self-attention enabling full parallelization without recurrent connections
Pre-training on masked language modeling
Lecture 6

Large Language Models

Decoder-Only Transformers (GPT-style)

While BERT uses only encoders for understanding, decoder-only models (like GPT, LLaMA) are designed to generate text autoregressively, one token at a time.

How Token Generation Works

1
Forward pass

Input tokens pass through all decoder layers with masked self-attention.

2
Linear + Softmax

The output vector is projected to vocabulary size and normalized into probabilities.

3
Sample / Argmax

The next token is chosen (greedily or by sampling). It's appended to the input and the process repeats.
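
The loop described in the three steps can be sketched with a stand-in for the model. `TOY_LOGITS` is a hand-written table that depends only on the last token, which is a strong simplification: a real decoder computes its logits from the whole sequence through masked self-attention. All names and values here are hypothetical.

```python
import math

# Hypothetical next-token logits, keyed by the last token only.
TOY_LOGITS = {
    "<s>":   {"the": 2.0, "a": 1.0},
    "the":   {"movie": 2.5, "dog": 1.5},
    "movie": {"was": 3.0, "</s>": 0.5},
    "was":   {"great": 2.0, "long": 1.0},
    "great": {"</s>": 3.0},
}

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = TOY_LOGITS.get(tokens[-1], {"</s>": 0.0})  # stand-in "forward pass"
        probs = softmax(logits)                             # normalize to probabilities
        next_token = max(probs, key=probs.get)              # greedy choice (argmax)
        if next_token == "</s>":
            break
        tokens.append(next_token)                           # append and repeat
    return tokens

print(generate(["<s>"]))  # ['<s>', 'the', 'movie', 'was', 'great']
```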

🔬 Token-by-Token Generation Demo

See how LLMs generate text one token at a time.

Note: This is a demonstration simulation, not a real language model.

Generative AI Models

🔒 Closed Source LLMs

Accessed via API. Architecture and weights are private.

  • GPT-4 / GPT-4o (OpenAI)
  • Claude 3 / Claude 3.5 (Anthropic)
  • Gemini (Google)

🔓 Open Source LLMs

Weights are public, so the models can be run locally.

  • LLaMA / LLaMA 3 (Meta)
  • Mistral / Mixtral
  • Falcon, Gemma

Model Configuration Parameters

🌡️

Temperature

Controls randomness. Low (0) = deterministic; High (1+) = creative/random.

🎯

Top-P (Nucleus)

Sample from the smallest set of tokens whose cumulative probability ≥ P.

🔢

Top-K

Only consider the top K most probable tokens at each step.

📏

Max Tokens

Maximum number of tokens the model will generate in a response.
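
A sketch of how temperature, Top-K and Top-P reshape a next-token distribution before sampling. The logits are toy values and the function names are assumptions; actual APIs apply these settings internally. Temperature must be greater than 0 in this sketch, since a temperature of exactly 0 is usually handled as a plain argmax.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.1]
probs = apply_temperature(logits, 1.0)
print(apply_temperature(logits, 0.5))  # sharper: the top token dominates
print(apply_temperature(logits, 2.0))  # flatter: closer to uniform
print(top_k_filter(probs, 2))          # third token removed
print(top_p_filter(probs, 0.5))        # only the top token survives
```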

🔬 Temperature Effect Demo

See how temperature changes the probability distribution over next tokens.

Retrieval Augmented Generation (RAG)

LLMs can't know information after their training cutoff and may hallucinate. RAG grounds LLMs with specific, authoritative external data.

1
Chunk & Embed documents

Split documents into chunks (sentences, paragraphs, or overlapping windows). Embed each chunk and store the resulting vectors in a vector database.

2
Semantic Search

Embed the user query. Find the nearest chunks using cosine similarity (dense search), keyword matching (sparse), or both (hybrid).

3
Augment & Generate

Inject the retrieved context into the prompt. The LLM generates an answer grounded in the retrieved information.

🔍
Search strategies in RAG: Dense Search (embeddings + cosine similarity), Keyword Search (TF-IDF BoW), Hybrid Search (combine both scores with a weighted formula).
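
The retrieve-then-augment flow can be sketched end to end without any external service. Everything here is a stand-in: `embed` uses word counts instead of a learned embedding model, the list of chunks plays the role of the vector database, and the prompt template is invented for the example. The final call to an LLM is left out.

```python
import math
from collections import Counter

def embed(text):
    # Hypothetical stand-in for an embedding model: a word-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_n=1):
    """Rank chunks by cosine similarity to the query and return the best ones."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_n]

def build_prompt(query, chunks):
    """Inject the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "BERT is an encoder-only Transformer",
    "RAG retrieves external documents before generation",
    "LSTMs use gates to control memory",
]
print(build_prompt("what does RAG retrieve before generation", chunks))
```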

Agentic AI

On its own, an LLM just predicts the next token. LLM Agents use the LLM as a reasoning brain augmented with memory, tools, and planning.

🧠

Memory

Short-term (context window) + Long-term (RAG / external database). Addresses the stateless nature of LLMs.

🔧

Tools

APIs, code execution, web search: external interfaces the LLM can call to interact with the world.

🗺️

Planning (ReAct)

Reason + Act: the LLM interleaves reasoning steps with tool calls, iterating until the task is complete.

🤖
Multi-Agent Systems: When a task is too complex for one agent, multiple specialized agents collaborate. A Supervisor agent coordinates subtasks across worker agents (frameworks: CrewAI, AutoGen).

🧪 Quick Quiz

What problem does RAG (Retrieval Augmented Generation) solve?
Making LLMs faster to run
Grounding LLMs with up-to-date or domain-specific knowledge they weren't trained on
Reducing the number of parameters in a model
Training models without labeled data
What does a high temperature value do to LLM output?
Makes the output more deterministic and focused
Flattens the probability distribution, making output more random and diverse
Increases the model's maximum context length
Enables the model to use external tools