Understanding and Building the GPT Tokenizer

A new lecture from legendary OpenAI researcher Andrej Karpathy: understanding and building the GPT Tokenizer.

He has a knack for explaining fairly complex LLM concepts in a very accessible way. If you want to understand LLMs, I strongly recommend his lectures, including the earlier ones in the series.

I translated this lecture with GPT-4; give it a listen if you're interested. The subtitle file download and the earlier lectures are at the end.

To add some context, here is the video description:

The tokenizer is a completely separate and critical stage of the LLM pipeline. It has its own training set and its own training algorithm, Byte Pair Encoding (BPE). Once trained, the tokenizer performs two core functions: encode() converts a plain text string into tokens, and decode() turns tokens back into the original text string. In this lecture, we walk step by step through how the tokenizers in OpenAI's GPT series are built.

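As a taste of what the lecture builds, here is a minimal toy sketch of BPE training. The helper names get_stats and merge follow Karpathy's minbpe exercise (see the chapter list below); the rest is my own illustrative code, not the lecture's exact implementation.

```python
def get_stats(ids):
    """Count occurrences of each consecutive token pair."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"              # classic BPE toy example
ids = list(text.encode("utf-8"))  # start from raw UTF-8 bytes, ids 0..255
merges = {}                       # pair -> new token id, in training order
for i in range(3):                # perform 3 merges
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent pair wins
    idx = 256 + i                     # mint a fresh token id
    ids = merge(ids, pair, idx)
    merges[pair] = idx
```
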
We will find that many of the odd behaviors and problems of LLMs actually trace back to tokenization. We will discuss these issues in detail, examine why tokenization is at fault, and consider why, ideally, someone would find a way to delete this stage entirely.

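For a quick feel for these quirks, here is a small illustrative example using the tiktoken library, which the lecture introduces at 01:11:38; the specific strings are my own choices:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's vocabulary

# The "same" word gets different token ids depending on case and leading space:
print(enc.encode("Egg"))    # one sequence of ids...
print(enc.encode(" egg"))   # ...and a different one

# Long digit strings are typically chopped into arbitrary chunks,
# one reason LLM arithmetic is flaky:
print(enc.encode("1234567890"))
```
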
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations

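The chapters at 00:42:47 and 00:48:21 cover decoding tokens back to strings and encoding strings to tokens. Continuing the toy sketch above, a hedged version of both might look like this (again patterned on minbpe's structure rather than quoting the lecture verbatim):

```python
# Build the vocabulary: token id -> bytes. Iterating `merges` in insertion
# order works because later merges only reference earlier token ids.
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    """Tokens -> text: concatenate byte chunks, then decode as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

def encode(text):
    """Text -> tokens: start from raw bytes, apply learned merges in order."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # of the pairs present, merge the one learned earliest (lowest id)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies; we're done
        ids = merge(ids, pair, merges[pair])
    return ids

assert decode(encode("aaabdaaabac")) == "aaabdaaabac"
```
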
https://pan.quark.cn/s/60d51adb8ecc#/list/share
