AI Learning Paradigms

By Cuong Tran 2026-04-07 17 min read

From "teaching machines with answer keys" to "machines teaching themselves from the internet": the evolution of learning paradigms.

01. The Foundational Problem

At its core, machine learning is the search for a function f(x) → y without explicit programming. Instead of writing rules by hand, we let the machine learn them from data.

The core question that separates the learning paradigms: where does the learning signal come from?

---
config:
  theme: neutral
  look: classic
---
flowchart TB
    Q(["❓ Where does the
learning signal come from?"]):::orange

    Q --> S["👨‍🏫 Humans assign labels"]:::blue
    Q --> U["🔍 Discovered in the data"]:::green
    Q --> SS["🧩 Generated from the data"]:::purple
    Q --> R["🎮 Reward from the
environment"]:::red

    S --> SL["Supervised
Learning"]:::blue
    U --> UL["Unsupervised
Learning"]:::green
    SS --> SSL["Self-Supervised
Learning"]:::purple
    R --> RL["Reinforcement
Learning"]:::red

    classDef orange fill:#ffeeba,stroke:#856404,color:#1a1a1a
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
    classDef red fill:#f8d7da,stroke:#721c24,color:#1a1a1a

02. Supervised Learning: The Teacher Provides the Answers

Mechanism

The machine is given (input, label) pairs annotated by humans. It learns the input → label mapping by minimizing a loss function.

Training: {(x₁,y₁), (x₂,y₂), ..., (xₙ,yₙ)} → learn f such that f(x) ≈ y
Loss:     L = Σ distance(f(xᵢ), yᵢ) → minimize
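The training loop above can be sketched in a few lines. This is a minimal illustration only, assuming a linear model f(x) = w·x + b, mean squared error as the distance, and plain gradient descent; the function name `fit_linear` and the toy data are hypothetical and not from the original article.

```python
def fit_linear(pairs, lr=0.05, epochs=2000):
    """Fit f(x) = w*x + b to labelled (x, y) pairs by gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(epochs):
        # Gradients of L = (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy labelled dataset: the labels follow y = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_linear(data)
print(f"w={w:.3f}, b={b:.3f}")
```

The labels are the only source of learning signal here: without the y values there would be nothing to compute a gradient from.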

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    A["🖼 Input
(cat photo)"]:::blue --> B["🧠 Model
f(x)"]:::purple --> C["📤 Prediction
'cat'"]:::green
    D["🏷 Label
'cat'"]:::orange --> E["📉 Loss
= distance(prediction, label)"]:::red
    C --> E
    E -.->|"backprop"| B

    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a
    classDef orange fill:#ffeeba,stroke:#856404,color:#1a1a1a
    classDef red fill:#f8d7da,stroke:#721c24,color:#1a1a1a

Pros and Cons

Pros	Cons
Accurate: ground truth is available for evaluation	Labeling bottleneck: ImageNet needed ~49,000 annotators over several years¹
Easy to understand: the objective is explicit	Generalizes poorly outside the distribution seen in training
Proven: decades of research	Scale is capped by labeling cost

Evolution

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    A["Perceptron
(1958)"]:::dim --> B["Backprop
(1986)"]:::dim --> C["SVM
(1995)"]:::dim --> D["AlexNet
(2012)"]:::blue --> E["ResNet
(2015)"]:::blue --> F["Transfer
Learning"]:::green

    classDef dim fill:#f0f0f0,stroke:#999,color:#666
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a,stroke-width:2px

Turning point: AlexNet (Krizhevsky et al., 2012)² won the ImageNet challenge with a deep CNN trained on GPUs, kicking off the deep learning era. ResNet (He et al., 2016)³ used skip connections to ease the optimization problems (such as vanishing gradients) that held back very deep networks, making it possible to train networks with hundreds of layers.

Transfer learning is the answer to the labeling bottleneck: pretrain on a large dataset (ImageNet), then fine-tune on a small task with few labels. This idea leads directly to foundation models.

03. Unsupervised Learning: Finding Structure on Its Own

Mechanism

There are no labels. The goal is to find hidden structure in the data.

Training: {x₁, x₂, ..., xₙ} → discover patterns, clusters, representations

Main Branches

---
config:
  theme: neutral
  look: classic
---
flowchart TB
    UL["Unsupervised Learning"]:::green

    UL --> CL["Clustering
K-means, DBSCAN"]:::blue
    UL --> DR["Dimensionality Reduction
PCA, t-SNE, UMAP"]:::blue
    UL --> GEN["Generative Models
VAE, GAN"]:::purple

    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a,stroke-width:2px
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
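As a concrete instance of the clustering branch, here is a minimal K-means sketch on one-dimensional data. It assumes fixed initial centers and a fixed number of iterations; the function name `kmeans_1d` and the data are illustrative, not from the original article.

```python
def kmeans_1d(points, centers, iters=10):
    """Plain K-means on 1-D points: alternate assignment and mean-update steps."""
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assignment step: attach each point to its nearest center
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # no labels, only raw x values
centers = kmeans_1d(points, centers=[0.0, 6.0])
print(centers)
```

No label ever enters the loop: the two groups emerge purely from distances between the points.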

GAN — Generative Adversarial Network

Paper: "Generative Adversarial Nets" (Goodfellow et al., 2014)⁴

Two networks compete: the Generator produces fake data and the Discriminator tells real from fake. Each improves by training against the other.

min_G max_D  V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]
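To make the value function concrete, the sketch below estimates V(D, G) from a batch of discriminator outputs. It assumes D outputs probabilities strictly between 0 and 1; the function name `gan_value` is hypothetical, and no actual networks are trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, each in (0, 1)
    d_fake: discriminator outputs on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator scores real high and fake low
confident = gan_value([0.9, 0.9], [0.1, 0.1])
print(confused, confident)
```

The discriminator pushes this value up, the generator pushes it down; when D outputs 0.5 everywhere the value equals −log 4, the equilibrium value derived in the GAN paper.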

GANs dominated image generation from roughly 2014 to 2021 before diffusion models largely displaced them. GANs still matter theoretically: the idea of adversarial training has influenced many other areas.

VAE — Variational Autoencoder

Paper: "Auto-Encoding Variational Bayes" (Kingma & Welling, 2014)⁵

A VAE learns a latent representation by encoding data into a latent space and decoding it back. Unlike a GAN, a VAE optimizes an explicit objective (the ELBO), whereas a GAN is trained through an adversarial game.
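A rough sketch of what that explicit objective looks like in code. It assumes a diagonal Gaussian encoder, a standard normal prior, and squared error as a stand-in for the reconstruction term (which matches a Gaussian decoder only up to constants); the function names are illustrative.

```python
import math


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def negative_elbo(x, x_recon, mu, log_var):
    """Loss to minimize: reconstruction error plus KL regularizer on the latent code."""
    # Assumption: squared error stands in for -log p(x | z), up to constants
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + gaussian_kl(mu, log_var)


loss = negative_elbo(x=[1.0, 2.0], x_recon=[0.5, 2.0], mu=[1.0], log_var=[0.0])
print(loss)
```

The reconstruction term rewards faithful decoding, while the KL term keeps the latent distribution close to the prior so that sampling from the prior produces plausible data.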

Pros and Cons

Pros	Cons
No labels needed: runs on raw data	Hard to evaluate: what counts as "correct" without ground truth?
Discovers hidden structure humans did not know about	Results are hard to interpret
Useful for EDA and anomaly detection	Does not directly solve prediction tasks

04. Self-Supervised Learning: The Quiet Revolution

Mechanism

Labels are generated from the data itself: hide part of the input, then predict the hidden part.

Input:  "The cat sat on the ___"
Label:  "mat"                      ← generated automatically, no human annotation needed

This is arguably the biggest breakthrough in modern machine learning: it unlocks internet-scale training without any labeling.
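The label-from-data idea can be sketched directly. The example below assumes whitespace tokenization (real models use subword tokenizers) and uses hypothetical helper names; it shows how one raw sentence yields training pairs for both next-token prediction and masked prediction.

```python
def next_token_pairs(tokens):
    """Every prefix becomes an input; the token that follows it becomes the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pair(tokens, index, mask="[MASK]"):
    """Hide one token; the hidden token becomes the label."""
    masked = tokens[:index] + [mask] + tokens[index + 1:]
    return masked, tokens[index]


tokens = "The cat sat on the mat".split()

for context, label in next_token_pairs(tokens):
    print(" ".join(context), "->", label)

print(masked_pair(tokens, 1))
```

Nobody annotated anything: both the inputs and the labels come from the same raw sentence.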

Pretext Tasks

Method	Mechanism	Representative model
Next-token prediction	Predict the next token	GPT⁶⁷
Masked language modeling	Mask random tokens, predict them	BERT⁸
Denoising	Add noise, learn to remove it	Diffusion models
Contrastive learning	Same object = close, different = far	SimCLR⁹, CLIP¹⁰

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    subgraph GPT["Next-Token (GPT)"]
        G1["The cat sat"]:::blue --> G2["→ on"]:::green
    end

    subgraph BERT["Masked (BERT)"]
        B1["The [MASK] sat on"]:::blue --> B2["→ cat"]:::green
    end

    subgraph CLIP["Contrastive (CLIP)"]
        C1["🖼 image"]:::orange
        C2["📝 text"]:::orange
        C1 --> C3["match?"]:::green
        C2 --> C3
    end

    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a
    classDef orange fill:#ffeeba,stroke:#856404,color:#1a1a1a
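The contrastive "match?" step can be sketched as a cross-entropy over similarity scores. This is a simplified, one-directional version (image → text only) with hand-written toy embeddings and an assumed temperature of 0.1; CLIP itself uses a symmetric loss over learned embeddings, and the function names here are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Mean cross-entropy of picking the matching text (same index) for each image."""
    losses = []
    for i, img in enumerate(image_embs):
        logits = [dot(img, txt) / temperature for txt in text_embs]
        # Numerically stable log-sum-exp over all candidate texts
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])   # pairs match
shuffled = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
print(aligned, shuffled)
```

Matching pairs give a loss near zero and mismatched pairs a large one, which is the "same = close, different = far" signal from the table above.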

Why Is This a Breakthrough?

Supervised	Self-Supervised
Data needs labels (expensive, slow)	Raw data from the internet (no labeling cost, effectively unlimited)
ImageNet: 14M images, ~3 years of labeling	GPT-3 training data: terabytes of raw text, 0 human labels
Scale is bottlenecked by labeling	Nearly unlimited scale

This is why LLMs and diffusion models could scale so far (to hundreds of billions of parameters for LLMs): the learning signal is generated from the data itself, with no human labeling required.

BERT vs GPT: Two Schools of Thought

Paper BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019)⁸

Paper GPT-2: "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019)⁶

	BERT	GPT
Pretext task	Masked LM (bidirectional)	Next-token (autoregressive)
Context seen	Both left and right	Left to right only
Strong at	Understanding (NLU)	Generation (NLG)
Outcome	Dominated NLU 2019-2022	Dominated nearly everything since GPT-3

GPT won in the end not because autoregressive modeling is inherently better than bidirectional modeling, but because it scales better: next-token prediction yields a dense training signal (every token is a training example, whereas masked LM only predicts the roughly 15% of tokens that are masked), and generation falls out of it naturally.

Pros and Cons

Pros	Cons
Nearly unlimited scale: the internet is the training data	Needs enormous compute (GPT-4: estimated ~$100M+ to train)
General representations	Representations are not always useful for the downstream task
Foundation models: 1 pretrain, N fine-tunes	Hallucination: plausibility ≠ correctness

05. Reinforcement Learning: Learning by Trial and Error

Mechanism

An agent interacts with an environment: it takes an action, receives a reward, and learns an optimal policy.

Agent → Action → Environment → (State', Reward) → Agent → ...

Objective: maximize Σ γᵗ · rₜ  (discounted cumulative reward)
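The discounted sum is easy to compute by walking the reward sequence backwards. A minimal sketch, with an illustrative function name and toy rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum over t of gamma^t * r_t using the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
# A reward that only arrives at the end is worth less from the start
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

The discount factor γ controls how much the agent values delayed rewards relative to immediate ones.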

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    A["🤖 Agent"]:::purple -->|"action aₜ"| E["🌍 Environment"]:::green
    E -->|"state sₜ₊₁"| A
    E -->|"reward rₜ"| A

    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a

Core Differences from Supervised Learning

	Supervised	Reinforcement
Signal	"The correct answer is Y"	"Better or worse" (reward)
Feedback	Immediate	Often delayed (e.g., a chess win only arrives at the end of the game)
Data	Fixed dataset	Generated by the agent through exploration
Objective	Predict correctly	Maximize long-term reward

Evolution

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    A["Q-Learning
(1989)"]:::dim --> B["DQN
(2015)"]:::blue --> C["Policy
Gradient"]:::blue --> D["PPO
(2017)"]:::green --> E["RLHF
(2022)"]:::purple --> F["DPO
(2023)"]:::purple

    classDef dim fill:#f0f0f0,stroke:#999,color:#666
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a,stroke-width:2px

DQN (Mnih et al., 2015)¹¹: deep RL that played Atari games at or above human level on many titles, demonstrating that RL combined with deep learning is viable.
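DQN builds on the Q-learning update from the timeline above. The sketch below shows that update in its tabular form, with a hypothetical two-state table and assumed values for the learning rate and discount; DQN replaces the table with a neural network and adds experience replay and a target network, none of which is shown here.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical Q-table: two states, two actions each
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1")
print(new_value)
```

The update needs no "correct action" label, only the observed reward and the current estimate of the next state's value.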

PPO (Schulman et al., 2017)¹²: a stable policy gradient algorithm that became the "default RL algorithm" thanks to its simplicity and robustness. It is the algorithm behind RLHF.
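Much of PPO's stability comes from its clipped surrogate objective. The sketch below evaluates that objective for a single sample, assuming the probability ratio and advantage are already computed and using ε = 0.2 as an illustrative default; the function name is hypothetical.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio: pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage of the action taken
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_objective(1.5, 1.0))   # gain is capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(0.5, -1.0))  # the pessimistic (clipped) branch is selected
print(ppo_clip_objective(1.0, 2.0))   # unchanged policy: plain advantage
```

Taking the minimum removes the incentive to move the policy far from the old one in a single update, which is what keeps training stable.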

RLHF (Ouyang et al., 2022)¹³: uses human preferences as the reward signal, turning GPT-3 (autocomplete) into an instruction-following model, the recipe behind ChatGPT (helpful assistant). Here RL does not train from scratch; it fine-tunes an already pretrained LLM.
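Human preferences become a reward signal through a reward model trained on pairwise comparisons. A minimal sketch of the pairwise loss, assuming the reward model has already produced scalar scores for a preferred and a rejected response; the function name and the scores are illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(preference_loss(2.0, 0.0))  # model agrees with the human ranking: small loss
print(preference_loss(0.0, 2.0))  # model disagrees: large loss
```

Minimizing this loss pushes the reward model to score the human-preferred response higher; the RL step then optimizes the LLM against that learned reward.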

Pros and Cons

Pros	Cons
Solves sequential decision-making	Sample inefficiency: often needs millions of episodes
No "correct answer" required	Training is unstable and hard to reproduce
Optimizes long-term reward	Reward hacking: the agent exploits flaws in the reward function

06. Overall Comparison: Four Paradigms

	Supervised	Unsupervised	Self-Supervised	Reinforcement
Signal	Human labels	None	Generated from data	Reward from env
Data needed	(x, y) pairs	x only	x only	Environment
Scale	Limited (labeling)	Medium	Nearly unlimited	Simulator-dependent
Data cost	High (human labor)	Low	Low	Medium (env design)
Evaluation	Easy (compare with labels)	Hard	Medium	Medium (reward)
Main applications	Classification, prediction	Clustering, anomaly detection	LLM, Diffusion	Games, robotics, alignment
Representatives	ResNet, AlphaFold	K-means, GAN	GPT, BERT, CLIP	DQN, PPO, RLHF

07. Hybrid Paradigms: Combining Strengths

In practice, the paradigms rarely stand alone:

Semi-Supervised Learning

A few labels plus a lot of unlabeled data. Example: label 1% of the data, then use the model's predictions on the remaining 99% as pseudo-labels.
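One round of pseudo-labeling might look like the sketch below. It assumes a deliberately simple 1-D classifier (a threshold at the midpoint of the two class means, with class 1 on the larger side) and a distance margin as the confidence criterion; all names and data are hypothetical.

```python
def fit_threshold(labeled):
    """Toy classifier: threshold at the midpoint between the two class means."""
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def pseudo_label_round(labeled, unlabeled, margin=1.0):
    """Train on labels, pseudo-label confident unlabeled points, then retrain."""
    threshold = fit_threshold(labeled)
    # Keep only predictions far enough from the decision boundary
    confident = [(x, int(x > threshold)) for x in unlabeled if abs(x - threshold) >= margin]
    return fit_threshold(labeled + confident), confident


labeled = [(0.0, 0), (4.0, 1)]     # the small human-labeled set
unlabeled = [0.5, 1.8, 3.9, 5.0]   # the large unlabeled set
new_threshold, pseudo = pseudo_label_round(labeled, unlabeled)
print(pseudo, new_threshold)
```

Filtering by confidence matters: pseudo-labels near the decision boundary are the ones most likely to be wrong and to reinforce the model's own mistakes.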

Few-Shot / Zero-Shot Learning

No fine-tuning: the model only needs a few examples in the prompt (few-shot) or none at all (zero-shot). This is an emergent capability of large LLMs.

Paper: "Language Models are Few-Shot Learners" (Brown et al., 2020)¹⁴
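The difference between zero-shot and few-shot is only in how the prompt is assembled. A small sketch, assuming a hypothetical sentiment task and an "Input:/Label:" prompt format chosen for illustration; no model is called here.

```python
def build_prompt(examples, query, instruction="Classify the sentiment as positive or negative."):
    """Assemble a prompt: instruction, optional in-context examples, then the query."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"Input: {text}\nLabel: {label}")
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


zero_shot = build_prompt([], "The plot was dull.")
few_shot = build_prompt(
    [("I loved this film.", "positive"), ("Terrible acting.", "negative")],
    "The plot was dull.",
)
print(few_shot)
```

The model's weights are identical in both cases; the in-context examples are the only "training data" it sees for the task.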

Meta-Learning: Learning to Learn

Paper: "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (Finn et al., 2017)¹⁵

MAML trains a model across many different tasks so that it can adapt quickly to a new task with only a few gradient steps. Unlike transfer learning (pretrain → fine-tune), meta-learning optimizes directly for the ability to adapt.
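The inner/outer structure of MAML can be shown on a toy problem where gradients are available in closed form. This sketch assumes a single scalar parameter and tasks of the form loss_c(θ) = (θ − c)², with one inner gradient step per task; it is an illustration of the idea, not the paper's experimental setup, and all names are hypothetical.

```python
def maml_1d(task_centers, theta=0.0, inner_lr=0.1, outer_lr=0.1, steps=500):
    """Toy MAML: each task c has loss (theta - c)^2; meta-learn a good initial theta."""
    for _ in range(steps):
        meta_grad = 0.0
        for c in task_centers:
            # Inner loop: one gradient step on this task's loss
            adapted = theta - inner_lr * 2.0 * (theta - c)
            # Outer gradient: d/d(theta) of (adapted - c)^2, differentiating
            # through the inner step (d adapted / d theta = 1 - 2 * inner_lr)
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * inner_lr)
        theta -= outer_lr * meta_grad / len(task_centers)
    return theta


theta = maml_1d([-1.0, 1.0, 3.0])
print(theta)
```

The outer update scores the initialization by the loss after adaptation, not before it, which is the distinction from plain pretraining drawn above.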

Curriculum Learning

Teach from easy to hard, the way humans learn: show the model easy examples first and hard examples later. This can improve convergence and final performance.
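One simple way to build such a schedule is to sort examples by a difficulty score and grow the training pool in stages. The sketch below assumes a user-supplied difficulty function (word length here, purely for illustration); the function name and the staging rule are hypothetical choices.

```python
def curriculum_stages(examples, difficulty, n_stages=3):
    """Return cumulative training pools, each adding harder examples than the last."""
    ordered = sorted(examples, key=difficulty)
    stages = []
    for s in range(1, n_stages + 1):
        cutoff = max(1, round(len(ordered) * s / n_stages))
        stages.append(ordered[:cutoff])
    return stages


words = ["a", "elephant", "cat", "hippopotamus", "to", "giraffe"]
stages = curriculum_stages(words, difficulty=len)
for i, pool in enumerate(stages, start=1):
    print(f"stage {i}: {pool}")
```

In practice the hard part is the difficulty score itself, which is task-specific and often has to be estimated rather than read off the data.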

Self-Supervised + RL (The LLM Path)

---
config:
  theme: neutral
  look: classic
---
flowchart LR
    A["Raw text
(internet)"]:::dim --> B["Self-Supervised
Pretraining
(next-token)"]:::purple --> C["Supervised
Fine-tuning
(SFT)"]:::blue --> D["RL
(RLHF/DPO)"]:::red --> E["Aligned
LLM"]:::green

    classDef dim fill:#f0f0f0,stroke:#999,color:#666
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef red fill:#f8d7da,stroke:#721c24,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a,stroke-width:2px

ChatGPT = self-supervised (pretraining) + supervised (SFT) + RL (RLHF). No paradigm stands alone. Each step plays a different role: pretraining for knowledge, SFT for format, RLHF for alignment.

08. The Evolution: 70 Years of Machine Learning

---
config:
  theme: neutral
  look: classic
---
flowchart TB
    subgraph ERA1["1950s-1980s: Foundations"]
        direction LR
        E1["Perceptron
(1958)"]:::dim
        E2["Backprop
(1986)"]:::dim
    end

    subgraph ERA2["1990s-2000s: Classical ML"]
        direction LR
        E3["SVM (1995)"]:::dim
        E4["Random Forest"]:::dim
        E5["Deep Belief
Nets (2006)"]:::blue
    end

    subgraph ERA3["2012-2017: Deep Learning"]
        direction LR
        E6["AlexNet
(2012)"]:::blue
        E7["GAN (2014)"]:::blue
        E8["ResNet (2015)"]:::blue
    end

    subgraph ERA4["2018-2022: Self-Supervised Era"]
        direction LR
        E9["BERT (2018)"]:::purple
        E10["GPT-3 (2020)"]:::purple
        E11["DALL-E (2021)"]:::purple
    end

    subgraph ERA5["2022-now: Alignment + Scale"]
        direction LR
        E12["ChatGPT
(RLHF)"]:::green
        E13["DPO (2023)"]:::green
        E14["o1 (test-time
compute)"]:::green
    end

    ERA1 --> ERA2 --> ERA3 --> ERA4 --> ERA5

    classDef dim fill:#f0f0f0,stroke:#999,color:#666
    classDef blue fill:#cce5ff,stroke:#004085,color:#1a1a1a
    classDef purple fill:#e8daef,stroke:#8e44ad,color:#1a1a1a
    classDef green fill:#d4edda,stroke:#28a745,color:#1a1a1a,stroke-width:2px

Major Trends

Supervised → Self-supervised: from "labels required" to "labels generated automatically". This is the biggest shift, because it unlocks internet-scale training.
RL → Fine-tuning layer: for LLMs, RL is no longer a standalone paradigm. It has become an alignment layer on top of self-supervised pretraining.
Single paradigm → Pipeline: modern models combine several paradigms (pretrain + SFT + RLHF).
Train-time scaling → Test-time scaling: from "bigger models" to "thinking longer" (o1/o3).

09. Conclusion

Paradigm	Golden age	Current role
Supervised	2012-2018	Fine-tuning, specific tasks
Unsupervised	2014-2020 (GAN era)	Niche (clustering, anomaly detection)
Self-Supervised	2018-now	Dominant: the foundation of every foundation model
RL	Ongoing	Alignment layer (RLHF, DPO)

3 key insights:

Self-supervised learning is the dominant paradigm: LLMs, diffusion models, and CLIP all use it. The reason: nearly unlimited scale with no human labels.
RL does not replace self-supervised learning, it complements it: RLHF/DPO is the "final coat of paint" that turns knowledge into helpfulness.
The future may need new paradigms: causal learning (Pearl) and world models (LeCun) challenge the limits of every current paradigm. We can train models that capture correlation, but not yet causation.

Paradigm determines scale. Scale determines capability. Self-supervised learning won not because it is "smarter" but because it scales. The lesson: in AI, the ability to exploit more data often matters more than a more sophisticated algorithm.

References

1. Deng, J. et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009. DOI:10.1109/CVPR.2009.5206848
2. Krizhevsky, A. et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. papers.nips.cc
3. He, K. et al. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385
4. Goodfellow, I. et al. (2014). Generative Adversarial Nets. NeurIPS 2014. arXiv:1406.2661
5. Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. arXiv:1312.6114
6. Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. openai.com
7. Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report. openai.com
8. Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. arXiv:1810.04805
9. Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020. arXiv:2002.05709
10. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020
11. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI:10.1038/nature14236
12. Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
13. Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. arXiv:2203.02155
14. Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. arXiv:2005.14165
15. Finn, C. et al. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017. arXiv:1703.03400

AI Blog, updated 04/2026
