Study Notes on Introduction to Generative AI - 2

2024-09-05

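Study notes on the Introduction to Generative AI course, compiled from Professor Hung-yi Lee's lectures. This is the second post in the series.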

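Course Overview

The goal of this course is for students to understand the full landscape of generative AI, rather than merely learning to use a single tool such as ChatGPT. The course digs into the technical principles behind generative AI so that students understand how these techniques came about, developed, and are applied. It requires no prior knowledge of artificial intelligence, which makes it suitable as an introduction for beginners.

Generative AI changes quickly, so knowledge tied to the techniques of one particular period can soon become outdated. Professor Lee stresses that the course aims to teach material that will stay useful to students for years to come, or even for a lifetime. As he puts it: "Understanding the principles behind AI technology gives people a clearer picture of it and helps them avoid over-relying on it or misjudging what it can do."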

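The Training History of Large Language Models, Stage 1: Self-Learning to Build Up Strength

The lecture introduces the background of language models and stresses that their core function is next-token prediction (a "word chain" game): "The input to a language model is an unfinished sentence, and the output is a symbol (token) that can follow it."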
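To make the "word chain" idea concrete, here is a minimal sketch that counts which token follows which in a tiny made-up corpus and then extends an unfinished sentence. It is only an illustration: real language models are neural networks rather than count tables, and the corpus and function names are assumptions of this note.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the training text."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1
    return follow

def next_token(follow, prev):
    """Return the most frequent continuation of `prev`, or None if unseen."""
    if prev not in follow:
        return None
    return follow[prev].most_common(1)[0][0]

def complete(follow, prompt, max_new_tokens=5):
    """Repeatedly append the predicted next token to the unfinished sentence."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(follow, out[-1])
        if tok is None:
            break
        out.append(tok)
    return out

# Made-up training text, already split into word-level tokens.
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
model = train_bigram(corpus)
print(complete(model, ["the"], max_new_tokens=3))
```

The count table plays the role of the "parameters" here: training fills it in, and prediction only reads from it.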

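A language model is produced through training: "Training is the process of finding the parameters; testing verifies whether those parameters work." The model is optimized on a large amount of training data.

Training also involves setting hyperparameters. These settings directly affect how well training goes and usually need repeated adjustment.

Training can succeed while testing fails, which is the overfitting problem. The lecture illustrates it with a cat-vs-dog classifier: "The training data may lead the machine to rely too heavily on certain features, such as color, so it fails at test time."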
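The cat-vs-dog situation can be sketched with a toy classifier that looks only at color. The data below is made up for this note: every training cat is black and every training dog is yellow, so a color rule is perfect on the training set and wrong on a test set where the colors are swapped.

```python
def train_color_rule(examples):
    """Learn a lookup table color -> label from (color, ear_shape, label) rows.

    The toy classifier deliberately ignores ear shape, mimicking a model that
    latched onto a spurious feature of the training data.
    """
    rule = {}
    for color, _ear_shape, label in examples:
        rule[color] = label
    return rule

def accuracy(rule, examples):
    """Fraction of examples whose label matches the color rule's prediction."""
    correct = sum(1 for color, _ear, label in examples if rule.get(color) == label)
    return correct / len(examples)

# Made-up data: color happens to separate the classes in training only.
train = [
    ("black", "pointed", "cat"),
    ("black", "pointed", "cat"),
    ("yellow", "floppy", "dog"),
    ("yellow", "floppy", "dog"),
]
test = [
    ("yellow", "pointed", "cat"),
    ("black", "floppy", "dog"),
]

rule = train_color_rule(train)
train_acc = accuracy(rule, train)
test_acc = accuracy(rule, test)
print(train_acc, test_acc)
```

Training "succeeds" because the rule fits every training row, yet the same rule misclassifies both test rows.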

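The lecture explains how a model is trained from scratch and how initial parameters can make training more efficient: "Finding good initial parameters lets the model reach a reasonable result faster."

In the first stage of language model training, the emphasis is that the model needs to acquire enough knowledge of text and language: "A language model must have a correct grasp of grammar in order to predict a reasonable token."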

Summary

The first stage of LLM training involves a process called self-supervised learning. This means that the model learns to predict the next word in a sequence based on the preceding words, using massive amounts of text data scraped from the internet. Key points include:

Data collection: LLMs are trained on enormous datasets obtained from the web.
Tokenization: Text is broken down into smaller units called tokens (often words or subwords) for the model to process.
Parameter optimization: The model's parameters are adjusted through an optimization process to minimize the difference between its predicted tokens and the actual tokens in the training data.
Challenges: Overfitting, where the model becomes too specialized to the training data and performs poorly on new data, is a common issue.
Limitations of first-stage models: While these models can generate human-quality text, they often lack a deep understanding of the world and can produce nonsensical or irrelevant output.
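The "difference between predicted tokens and actual tokens" in the parameter optimization point is commonly measured with cross-entropy, which is an assumption here since the notes do not name a loss. The sketch below turns hypothetical model scores into probabilities and computes the loss for the actual next token; the vocabulary and scores are invented for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability the model assigns to the actual next token."""
    probs = softmax(scores)
    return -math.log(probs[target_index])

vocab = ["cat", "dog", "mat"]
target = vocab.index("mat")  # the token that actually comes next

# Hypothetical scores from two models for the same unfinished sentence.
confident = [0.0, 0.0, 5.0]  # puts most probability on "mat"
unsure = [1.0, 1.0, 1.0]     # spreads probability evenly

loss_confident = cross_entropy(confident, target)
loss_unsure = cross_entropy(unsure, target)
print(loss_confident, loss_unsure)
```

Optimization adjusts the parameters so that this loss, averaged over the training data, goes down.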

Key Takeaways

Self-supervised learning is a powerful technique for training LLMs, but it has limitations.
The quality and quantity of training data significantly impact the model's performance.
Overcoming overfitting is a crucial challenge in LLM training.
While these models can generate impressive text, they may not always understand the meaning behind the words.

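The Training History of Large Language Models, Stage 2: Guidance from a Master to Unlock Potential

Stage 1: The language model's self-learning
Key quote: In the first stage, the language model builds up strength through self-learning, but it lacks an effective way to use it.
Summary: The model has accumulated a lot of ability through self-learning, but it does not yet know how to apply it and needs further training.

Stage 2: Guidance and fine-tuning from human teachers
Key quote: The guidance from human teachers is called "instruction fine-tuning"; the language model is fine-tuned on these instructions.
Summary: In the second stage, human teachers provide question-and-answer data for instruction fine-tuning, so the language model learns to respond correctly to human instructions.

The importance of labeled data
Key quote: Marking which part is the user's words and which part is the AI's lets the AI respond more accurately.
Summary: The labels that human teachers add help the language model tell the input apart from the response, which prevents it from asking and answering its own questions.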
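A rough sketch of what such labeling can look like is below. The `USER:` and `AI:` tags and both function names are illustrative assumptions; actual models each define their own chat format.

```python
def format_dialogue(turns):
    """Join (speaker, text) turns into one training string with role tags."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)

def build_prompt(user_message):
    """Build the text fed to the model at inference time.

    The prompt ends with the AI tag, so whatever the model appends by
    next-token prediction is read as the AI's answer, not more user text.
    """
    return f"USER: {user_message}\nAI:"

example = format_dialogue([
    ("USER", "What is the tallest mountain in Taiwan?"),
    ("AI", "Yushan (Jade Mountain)."),
])
print(example)
print(build_prompt("What is the tallest mountain in Taiwan?"))
```

Because every training example marks where the user stops and the AI starts, the model learns to continue only the AI side.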

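Data annotation and its limits
Key quote: Training a language model only on human-annotated data leaves it unable to generalize, so it easily gives irrelevant answers.
Summary: Human-annotated data is accurate but limited in quantity, so on its own it cannot prepare the model for the variety of questions it will meet. Large-scale data is needed for pre-training.

Combining pre-training and fine-tuning
Key quote: The pre-trained parameters supply complex rules as a starting point, and these parameters give the language model a better ability to generalize.
Summary: Stage-one pre-training provides good initial parameters. During stage-two fine-tuning, these parameters let the model generalize from a few examples to similar problems.

Adapters and parameter optimization
Key quote: Adapter techniques let us optimize a small number of unknowns without changing the initial parameters.
Summary: An adapter reduces the computational burden by optimizing only a small number of parameters, which keeps the model close to its initial parameters. LoRA is a common example.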
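As a minimal sketch of the LoRA idea, the frozen weight matrix W stays untouched and a low-rank product B·A is added on top; only B and A would be trained. The sizes, values, and helper names below are illustrative assumptions, and real implementations use a tensor library rather than Python lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, r = 4, 1  # layer width and adapter rank (illustrative values)

# Frozen pre-trained weights (identity matrix as a stand-in).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
B = [[0.0] for _ in range(d)]  # d x r, trainable, starts at zero
A = [[0.1, 0.2, 0.3, 0.4]]     # r x d, trainable

def effective_weight(W, B, A):
    """Weights actually used by the layer: W plus the low-rank update B.A."""
    return add(W, matmul(B, A))

frozen_params = d * d
trainable_params = d * r + r * d
print(frozen_params, trainable_params)
```

With B at zero the adapted layer behaves exactly like the pre-trained one, and training only moves the 8 adapter values instead of the 16 original weights. The saving grows quickly as d gets large while r stays small.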

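The ability to generalize
Key quote: A pre-trained model that learns a task in one language can generalize to the same task in other languages.
Summary: Pre-trained models generalize remarkably well. For example, after being trained only on English reading comprehension, a model can handle Chinese reading comprehension without any training on it.

Fine-tuning applications and training specialists
Key quote: Each specialist model can focus on one specific task, such as translation or editing.
Summary: Fine-tuning can be applied to different tasks. Trained on task-specific data, a model becomes a specialist at one particular problem.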

Summary of the Article

Key Stages in LLM Training

Pre-training:

The model is trained on massive amounts of text data from the internet.
This stage equips the model with a strong foundation in language understanding and generation.

Instruction Fine-tuning:

Human-created instructions and examples are used to guide the model's behavior.
The model learns to follow specific instructions and generate appropriate responses.
This stage requires high-quality data, though far less of it than pre-training.

Importance of Pre-training

Provides a strong foundation: The pre-trained model serves as a solid starting point for fine-tuning.
Enables transfer learning: The model can learn new tasks more efficiently by leveraging its pre-trained knowledge.

Challenges and Solutions

Data quality: High-quality instruction data is crucial for effective fine-tuning.
Computational resources: Training large language models requires significant computational power.
Alignment: Ensuring that the model's responses align with human values and avoid harmful biases is a challenging task.

Recent Developments and Trends

Open-source models: The release of open-source models like Llama has democratized LLM development.
Instruction fine-tuning as a standard practice: Many researchers and companies are adopting instruction fine-tuning as a standard approach for improving LLM performance.
Focus on quality over quantity: There is a growing consensus that the quality of training data is more important than the quantity.

Key Concepts

Instruction fine-tuning: A method of training a language model to follow specific instructions.
Pre-trained model: A model that has been trained on a massive amount of data and can be used as a starting point for other tasks.
Transfer learning: The ability to apply knowledge gained from one task to a new task.
Alignment: Ensuring that a model's outputs are safe and aligned with human values.

The Training History of Large Language Models, Stage 3: Real-World Practice to Polish Skills (Reinforcement Learning from Human Feedback, RLHF)

Google Gemini Summary

Stage 3 of large-model training: RLHF

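RLHF (Reinforcement Learning from Human Feedback) strengthens the model's learning through human feedback: humans evaluate the answers the model generates, and those evaluations are used to adjust the model's parameters.

RLHF differs from the two earlier stages (pre-training and instruction fine-tuning) in both its training data and its learning objective. Compared with those stages, it pushes the model to consider the overall quality of what it generates rather than only local details.

The learning objectives of instruction fine-tuning and RLHF differ: instruction fine-tuning focuses on the process (getting each next token right), while RLHF focuses on the result (the quality of the whole answer).

Compared with instruction fine-tuning, RLHF requires less work from humans, because judging which answer is better is easier than writing a good answer.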

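Reward model

A reward model is a model trained to imitate human preferences. This part discusses how the feedback mechanism in RLHF is designed and why a common practice is to have the model generate several answers and let humans rank them.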
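One common way to train a reward model from rankings is sketched below. It is an assumption of this note, not something the lecture summary specifies. A human ranking is split into (better, worse) pairs, and the reward model is penalized whenever it scores the worse answer close to or above the better one. The scores are hypothetical numbers standing in for reward-model outputs.

```python
import math

def pairs_from_ranking(ranked_answers):
    """Turn a human ranking (best first) into (better, worse) pairs."""
    n = len(ranked_answers)
    return [
        (ranked_answers[i], ranked_answers[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Compute -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the preferred answer scores much higher than
    the rejected one, and large when the order is reversed.
    """
    diff = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-diff))

pairs = pairs_from_ranking(["A", "B", "C"])
print(pairs)
print(pairwise_ranking_loss(2.0, 0.0), pairwise_ranking_loss(0.0, 2.0))
```

Ranking several answers at once is efficient for annotators: one ranking of three answers already yields three training pairs.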

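Over-relying on the reward model can lead the model to produce strange or unexpected behavior.

Challenges of RLHF 🤔

Labeling feedback is hard: defining what counts as a "good" answer is itself a complex problem.
Human feedback may carry personal bias, which can steer the direction of the model's learning.
How can the model keep learning when it faces questions whose answers even humans cannot judge?
Could AI evaluate the answers generated by other AI in the future, reducing the dependence on humans?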

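AI Agents Built with Large Language Models

The difference between traditional AI and an AI agent: an AI agent can carry out complex, multi-step tasks.

How an AI agent operates: setting a goal, perceiving the environment, making a plan, executing actions, and updating memory.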
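The steps above can be sketched as a simple loop. Everything here is a hypothetical toy: the environment is just a counter, and the `plan` function stands in for the place where a real agent would call a large language model.

```python
def run_agent(goal, observe, plan, act, max_steps=10):
    """Generic observe -> plan -> act loop that records each step in memory."""
    memory = []
    for _ in range(max_steps):
        observation = observe()                  # perceive the environment
        if observation == goal:                  # goal reached, stop
            break
        action = plan(goal, observation, memory) # decide what to do next
        act(action)                              # change the environment
        memory.append((observation, action))     # update memory
    return memory

# Toy environment: a counter the agent must move to the goal value.
state = {"value": 0}

def observe():
    return state["value"]

def plan(goal, observation, memory):
    # Stand-in for an LLM call that would turn goal + observation + memory
    # into the next action.
    return "increment" if observation < goal else "decrement"

def act(action):
    state["value"] += 1 if action == "increment" else -1

history = run_agent(goal=3, observe=observe, plan=plan, act=act)
print(history)
```

The `max_steps` cap is a design choice for the sketch: it keeps the loop from running forever if the planner never reaches the goal.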

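The role of large language models in AI agents, with emphasis on their strengths in natural language processing, planning, and learning.

This part gives an accessible introduction to the concept of AI agents, how they work, and where they may be heading. By combining large language models with AI agents, future AI should be able to carry out more complex tasks and play a larger role across many fields.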

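How Today's Language Models Do Next-Token Prediction: A Brief Look at the Transformer

"The birth of the Transformer marked a whole new era for language models."

The lecture reviews how language models evolved from traditional N-gram models to deep learning models such as RNNs and the Transformer. It highlights the importance of the Transformer in modern language models and briefly covers the background of its creation.

It then walks through how a Transformer operates, from tokenization and embedding to attention and the feed-forward network, showing layer by layer how the Transformer processes text and produces output.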

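The attention mechanism

"Through the attention mechanism, the Transformer lets the model understand context and so generate more accurate and more meaningful text."

Attention computes how relevant tokens are to one another and combines contextual information to produce contextualized embeddings. The lecture also introduces multi-head attention and explains why attention only considers the tokens to the left.

Speeding up the attention computation, attention over unbounded lengths, and training on short text while handling long text are all important directions for future Transformer development.

Other potential language model architectures, such as Mamba and Jamba.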

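Concept	Definition
Tokenization	The process of splitting text into its smallest units (tokens).
Embedding	Mapping each token to a dense vector representation that the model can process.
Attention	A mechanism that computes the relevance between tokens and produces a new representation as a relevance-weighted sum.
Transformer Block	The basic building block of a Transformer, made up of multi-head attention and a feed-forward network.
Contextualized Embedding	An embedding representation that takes contextual information into account.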

Transformer

Processing order

Tokenization
Input Layer
Attention Layer
Feed Forward Layer
Output Layer

Embedding processing order

Words to tokens
Positional embedding
Attention (Context)
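The first two steps can be sketched as table lookups. This is a simplified illustration with assumptions of its own: a whitespace tokenizer stands in for a real subword tokenizer, the embedding values are made up rather than learned, and position information is simply added to the token vector, which is one scheme among several.

```python
def tokenize(text, vocab):
    """Map each whitespace-separated word to its integer token id."""
    return [vocab[word] for word in text.split()]

vocab = {"the": 0, "cat": 1, "sat": 2}

# Made-up tables; in a real model these values are learned parameters.
token_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]     # one row per token id
position_embedding = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]  # one row per position

def embed(token_ids):
    """Add the position vector to the token vector at every position."""
    return [
        [t + p for t, p in zip(token_embedding[tid], position_embedding[pos])]
        for pos, tid in enumerate(token_ids)
    ]

ids = tokenize("cat sat", vocab)
vectors = embed(ids)
print(ids, vectors)
```

At this point the same word at two different positions already gets two different vectors, but neither vector yet depends on the surrounding words. That is the job of attention.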

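The input is a sequence of vectors, and the output is a sequence of vectors of the same length.
Compute each vector's relevance to the other vectors, then use those scores as weights to compute the output vector as a weighted sum.
In practice only the preceding tokens are considered; this is called causal attention.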
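These three points can be sketched in a few lines. This is a simplified version under stated assumptions: relevance is a scaled dot product between the raw vectors, with no learned query, key, or value projections, which a real Transformer would add.

```python
import math

def softmax(scores):
    """Convert relevance scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_self_attention(vectors):
    """Return one output vector per input vector.

    Position i attends only to positions 0..i (itself and earlier tokens).
    """
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal: ignore later tokens
        scores = [dot(query, key) / math.sqrt(dim) for key in visible]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[k] for w, v in zip(weights, visible)) for k in range(dim)]
        )
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x)
print(out)
```

Because each position only looks backward, changing a later token leaves the outputs of earlier positions unchanged, which is what makes token-by-token generation consistent.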

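Words can be related to each other in several different ways; using several sets of attention to capture them is called multi-head attention.
Today's LLMs all use multi-head attention (16 heads in practice).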
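A minimal sketch of the multi-head idea: split every vector into equal slices, run attention separately on each slice, and concatenate the results. It rests on simplifying assumptions: the heads here have no learned projections, whereas real models give each head its own learned weights, and the example uses 2 heads only to stay small.

```python
import math

def attend(vectors):
    """Causal scaled dot-product attention without learned projections."""
    dim = len(vectors[0])
    out = []
    for i, q in enumerate(vectors):
        past = vectors[: i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in past]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, past)) for j in range(dim)])
    return out

def multi_head_attention(vectors, num_heads):
    """Run attention independently on each slice, then concatenate."""
    dim = len(vectors[0])
    assert dim % num_heads == 0, "dimension must divide evenly across heads"
    size = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        head_inputs = [v[h * size:(h + 1) * size] for v in vectors]
        head_outputs.append(attend(head_inputs))
    # Stitch the per-head outputs back into one vector per position.
    return [
        [value for head in head_outputs for value in head[i]]
        for i in range(len(vectors))
    ]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(x, num_heads=2)
print(out)
```

Each head computes its own attention weights, so one head can track one kind of relation between tokens while another head tracks a different one.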