How to fine-tune Llama-2 on a Mac Studio

This is an end-to-end tutorial on fine-tuning a Llama-2 model on a Mac Studio with llama.cpp. It takes just three steps:

1. Build llama.cpp and convert the Llama-2 model to gguf format

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j

Assuming your Llama-2 7B model is located under ./models/llama-2-7b/, run:

python3 convert.py ./models/llama-2-7b/
./quantize ./models/llama-2-7b/ggml-model-f16.gguf ./models/llama-2-7b/ggml-model-q4_0.gguf q4_0
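
Before moving on, it can help to confirm that both output files really are gguf files. The sketch below is not part of llama.cpp; is_gguf is a hypothetical helper that only checks the 4-byte "GGUF" magic at the start of the file, so it catches a failed or misnamed conversion but says nothing about whether the rest of the file is intact.

```shell
#!/bin/bash
# Hypothetical helper (not part of llama.cpp):
# succeed if the file exists and starts with the 4-byte magic "GGUF".
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Example with the paths used in this tutorial:
#   is_gguf ./models/llama-2-7b/ggml-model-f16.gguf  && echo "f16 model looks like gguf"
#   is_gguf ./models/llama-2-7b/ggml-model-q4_0.gguf && echo "q4_0 model looks like gguf"
```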

If you don't have the Llama-2 model yet, see other articles for download instructions.

2. Prepare the fine-tuning data

In this tutorial we will use the TinyStories dataset to fine-tune the Llama-2 7B model.

Download the dataset into a folder named tinystories:

mkdir tinystories && cd tinystories
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStories_all_data.tar.gz
tar xf TinyStories_all_data.tar.gz

You will get 50 json files (data00.json through data49.json), each containing a large number of short children's stories.

Extract one part of the data (data49.json) into data49.txt:

./extract.sh data49.json

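The extract.sh script is shown below. Save it in the tinystories folder and make it executable (chmod +x extract.sh) before running it: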

#!/bin/bash

# Check for input argument
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <input_file.json>"
    exit 1
fi

input_file="$1"
output_file="${input_file%.json}.txt"

# Check if jq is installed
if ! command -v jq &> /dev/null; then
    echo "Please install 'jq' first."
    exit 1
fi

# Check if the input file exists
if [ ! -f "$input_file" ]; then
    echo "File '$input_file' not found!"
    exit 1
fi

# Extract stories and format them
jq -r '.[] | "<s>" + .story + "\n\n"' "$input_file" > "$output_file"
echo "Stories have been saved to $output_file"

The file data49.txt contains about 68,000 short stories separated by <s>. We will use it as the fine-tuning dataset.
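
A quick way to sanity-check the extraction is to count the stories in the output file. The sketch below assumes that each story begins at the start of a line with <s>, which is what the jq filter in extract.sh produces; count_stories is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: count lines that begin with the "<s>" story marker.
count_stories() {
    grep -c '^<s>' "$1"
}

# Example, run inside the tinystories folder:
#   count_stories data49.txt
```

The result should be close to the sample count that finetune reports later, although the two need not match exactly.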

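3. Start training (this needs to run for 24 hours or longer)

From the llama.cpp root directory (one level above tinystories), run: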

./finetune --model-base ./models/llama-2-7b/ggml-model-q4_0.gguf --train-data tinystories/data49.txt --threads 26 --sample-start "<s>" --ctx 512

You will see a long stream of log output in the terminal, including the following:

main: init model
print_params: n_vocab:   32000
print_params: n_ctx:     512
print_params: n_embd:    4096
print_params: n_ff:      11008
print_params: n_head:    32
print_params: n_head_kv: 32
print_params: n_layer:   32
...
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
...
tokenize_file: warning: found 942 samples (max length 1164) that exceed context length of 512. samples will be cut off.
tokenize_file: warning: found 66924 samples (min length 4) that are shorter than context length of 512.
tokenize_file: total number of samples: 67871
...
train_opt_callback: iter=     0 sample=1/67871 sched=0.000000 loss=0.000000 |->
train_opt_callback: iter=     1 sample=9/67871 sched=0.010000 loss=4.608573 dt=00:11:24 eta=2d 00:28:18 |->
train_opt_callback: iter=     2 sample=17/67871 sched=0.020000 loss=4.691439 dt=00:23:23 eta=4d 03:00:01 |>
...

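Explanation of some key information

Model initialization parameters:

n_vocab: the vocabulary size, set to 32,000.
n_ctx: the context length (maximum sequence length), set to 512.
n_embd: the embedding size (the model's hidden dimension), which is 4,096.
n_ff: the size of the feed-forward network inside each Transformer layer, which is 11,008.
n_head: the number of attention heads, which is 32. More heads let the model attend to different parts of the input at the same time.
n_layer: the number of Transformer layers, which is 32. More layers generally mean a more complex model that can capture more complex patterns.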
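
To see how these numbers relate to the "7B" in the model name, here is a rough parameter count in shell arithmetic. It is a sketch under the assumption of the standard Llama layer layout: four n_embd x n_embd attention projections per layer (n_head_kv equals n_head in the log, so the key and value projections are full size), three feed-forward matrices per layer, and separate input and output embedding matrices. The small normalization weights are ignored.

```shell
#!/bin/bash
# Values copied from the print_params log lines.
n_vocab=32000
n_embd=4096
n_ff=11008
n_head=32
n_layer=32

# Width of a single attention head.
head_dim=$(( n_embd / n_head ))

# Per-layer weights (assumed standard Llama layout, norms ignored).
attn=$(( 4 * n_embd * n_embd ))      # query, key, value, output projections
ffn=$(( 3 * n_embd * n_ff ))         # gate, up, down matrices

# Input embedding plus output head (assumed not shared).
embed=$(( 2 * n_vocab * n_embd ))

total=$(( n_layer * (attn + ffn) + embed ))

echo "head_dim=$head_dim"
echo "approx_params=$total"
```

This prints head_dim=128 and approx_params=6738149376, i.e. roughly 6.7 billion weights.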

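LoRA (Low-Rank Adaptation) model:

lora_size, opt_size, input_size, and compute_size: the memory footprint of each part of the training process, which tells you what compute resources are in use.
total number of samples: the training data contains 67,871 samples in total.
iter: the iteration number. Each iteration is one forward and one backward pass of a batch of data through the model.
sample: the position of the current sample among all training samples. In the log above it advances by 8 per iteration (1, 9, 17), so each batch holds 8 samples.
sched: the current value of the learning-rate schedule. Learning-rate scheduling adjusts the learning rate during training and can make training more efficient and more stable. In the log above it rises by 0.01 per iteration.
loss: the current loss value, which measures how well the model's predictions match the actual data. Lower is generally better and indicates more accurate predictions. The 0 reported at iter 0 appears to be printed before any batch has been evaluated; the first real values are around 4.6, and by iteration 12 in our run the loss had dropped to about 1.58. Monitoring this value and its trend is important for confirming that the model is learning.
dt: the time taken by the iteration. It gradually settles at roughly 28 to 30 minutes as more batches are processed.
eta: the estimated remaining training time. It rises at first and then falls, most likely because the estimate becomes more accurate as the per-iteration time stabilizes.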
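
To follow the loss trend without reading the whole log, you can pull out just the iteration number and the loss. The sketch below assumes the training output has been saved to a file (train.log is a hypothetical name) and that the lines have the train_opt_callback layout shown above; parse_train_log is a hypothetical helper, not a llama.cpp tool.

```shell
#!/bin/bash
# Hypothetical helper: read log lines on stdin and print "<iter> <loss>"
# for every line that contains both an iter= and a loss= field.
parse_train_log() {
    sed -n 's/.*iter= *\([0-9][0-9]*\) .*loss=\([0-9.][0-9.]*\).*/\1 \2/p'
}

# Example:
#   parse_train_log < train.log
```

Lines without these fields, such as the print_params lines, are skipped.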

Checkpoints: a checkpoint of the fine-tuned model (the LoRA model) is saved every 10 iterations. The model is saved with the .gguf extension.

How do I use the fine-tuned model?

To run the fine-tuned model, use the following command:

./main --model ./models/llama-2-7b/ggml-model-q4_0.gguf --lora ggml-lora-LATEST-f32.gguf --prompt "Can you please write a children's story with 200 words about father and son and friendship and bravery?"

Note that applying a LoRA adapter to a quantized model may reduce quality. It is recommended to supply an f16 or f32 base model with --lora-base.

Here is the full command with --lora-base:

./main --model ./models/llama-2-7b/ggml-model-q4_0.gguf --lora ggml-lora-LATEST-f32.gguf --lora-base ./models/llama-2-7b/ggml-model-f16.gguf --prompt "Can you please write a children's story with 200 words about father and son and friendship and bravery?"

Here is a sample output using llama-2-7b/ggml-model-f16.gguf as the lora-base:

Can you please write a children's story with 200 words about father and son and friendship and bravery?
...My Dad and I...

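How long does fine-tuning take?

Each iteration takes about 30 minutes, so a run of 100 iterations takes about 50 hours. You can set it to run any number of iterations N with --adam-iter N.

In our setup it runs 256 iterations by default, so at the 13th iteration (iter=12) the log shows a loss of 1.580555 and an estimated remaining training time of 4 days, 23 hours, 43 minutes, and 5 seconds (about five days).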

train_opt_callback: iter=    12 sample=97/67871 sched=0.120000 loss=1.580555 dt=00:29:26 eta=4d 23:43:05 |------------------------------->
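
The estimate in the log line above can be reproduced approximately by hand. The sketch below assumes that the remaining time is simply the number of remaining iterations multiplied by the latest dt; that formula is an assumption about how the figure is derived, not something taken from the llama.cpp source, and hms_to_seconds and estimate_eta are hypothetical helper names.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds (awk reads "08" as 8, avoiding shell octal issues).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# estimate_eta TOTAL_ITERS CURRENT_ITER DT
# Assumes: remaining time = (TOTAL_ITERS - CURRENT_ITER) * DT.
estimate_eta() (
    remaining=$(( $1 - $2 ))
    secs=$(( remaining * $(hms_to_seconds "$3") ))
    printf '%dd %02d:%02d:%02d\n' \
        $(( secs / 86400 )) \
        $(( secs % 86400 / 3600 )) \
        $(( secs % 3600 / 60 )) \
        $(( secs % 60 ))
)

estimate_eta 256 12 00:29:26
estimate_eta 100 0 00:30:00
```

The first call prints 4d 23:41:44, within about a minute and a half of the logged 4d 23:43:05 (the logged dt is rounded to whole seconds). The second prints 2d 02:00:00, which is the 50 hours quoted for 100 iterations.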

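Note that fine-tuning time also depends on the number of samples in the training data and on the context window size. If you use fewer samples (for example, 1,000 instead of about 68,000 stories), it may take less time. The early estimate below is roughly 46 hours, but keep in mind that it comes from iteration 1, where the full-dataset run also reported about two days before its estimate grew: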

iter=     1 sample=9/951 sched=0.010000 loss=4.199916 dt=00:10:55 eta=1d 22:26:21 |->
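
One way to build such a smaller training file is to keep only the first N stories of data49.txt. The sketch below assumes each story starts on a line beginning with <s>, as produced by extract.sh; first_n_stories and the output file name data49-1k.txt are hypothetical.

```shell
#!/bin/bash
# first_n_stories N FILE
# Print FILE up to, but not including, story number N+1.
first_n_stories() (
    awk -v n="$1" '
        /^<s>/    { count++ }   # a new story starts here
        count > n { exit }      # stop once we pass the Nth story
                  { print }
    ' "$2"
)

# Example, run from the llama.cpp root directory:
#   first_n_stories 1000 tinystories/data49.txt > tinystories/data49-1k.txt
```

You would then pass the smaller file to --train-data in place of tinystories/data49.txt.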