LoRA 对 Phi Silica 的微调

Name: Visual Studio Code
Author: Microsoft

您可以使用低秩适应（LoRA）微调Phi Silica模型，以提升其性能以满足您的具体需求。通过使用 LoRA 优化 Microsoft Windows 本地语言模型 Phi Silica，你可以获得更准确的结果。该过程包括训练LoRA适配器，并在推理过程中应用该适配器以提高模型的准确性。

注释

Phi Silica的功能在中国没有。

前提条件

你已经确定了一个提升Phi Silica响应的应用场景。
你已经选择了一个评估标准来决定什么是“良好回应”。
你已经尝试过Phi Silica API，但它们不符合你的评估标准。

训练你的适配器

要训练用于在Windows 11中微调Phi Silica模型的LoRA适配器，首先必须生成一个训练过程将使用的数据集。

生成用于LoRA适配器的数据集

要生成数据集，你需要将数据拆分为两个文件：

train.json：用于训练适配器。
test.json：用于评估适配器在培训期间及训练后的表现。

这两个文件都必须使用 JSON 格式，每行都是代表单个样本的独立 JSON 对象。每个样本应包含用户与助手之间交换的消息列表。

每个消息对象都需要两个字段：

内容：信息文本。
角色：要么“用户”或“助手”，表示发送者。

请参见以下示例：

{"messages": [{"content": "Hello, how do I reset my password?", "role": "user"}, {"content": "To reset your password, go to the settings page and click 'Reset Password'.", "role": "assistant"}]}

{"messages": [{"content": "Can you help me find nearby restaurants?", "role": "user"}, {"content": "Sure! Here are some restaurants near your location: ...", "role": "assistant"}]}

{"messages": [{"content": "What is the weather like today?", "role": "user"}, {"content": "Today's forecast is sunny with a high of 25°C.", "role": "assistant"}]}

训练技巧：

每行采样行末尾无需逗号。
尽可能多地包含高质量且多样化的例子。为了获得最佳效果，至少收集几千个训练样本train.json档案。
该test.json文件可以更小，但应该涵盖你期望模型处理的各种交互类型。
创作train.json以及test.json每行包含一个 JSON 对象的文件，每个对象包含用户与助手之间的简短对话。数据的质量和数量会极大影响LoRA适配器的有效性。

培训LoRA适配器

要训练LoRA适配器，你需要满足以下必要的前提条件：

Azure subscription with available quota in Azure Container Apps.
- 我们建议使用A100或更先进的GPU来高效地进行微调工作。
- 检查你在Azure门户中有可用的配额。如果你想帮助查找配额，请参见“查看配额”。

请按照以下步骤创建工作区并开始微调工作：

进入模型工具>微调，选择新项目。
从模型目录中选择“microsoft/phi-silica”，然后选择“下一步”。
在对话框中，选择一个项目文件夹并输入项目名称。项目将会开启一个新的 VS Code 窗口。
从方法列表中选择“LoRA”。
在“训练数据集名称>测试数据集名称”下，选择你的train.json以及test.json文件。
选择“云端运行”。
在对话框中，选择访问您的Azure订阅的Microsoft账户。
账户选定后，从订阅下拉菜单中选择资源组。
注意你的微调工作成功启动并显示一个职位状态。

使用刷新按钮手动更新状态。一个微调工作通常平均需要45到60分钟完成。
工作完成后，您可以选择“下载”并选择“显示指标”来下载新训练的LoRA适配器，以检查微调指标。

LoRA微调建议

超参数选择

LoRA微调的默认超参数设置应该能提供一个合理的基线微调可供比较。我们尽力寻找适用于大多数用例和数据集的默认值。

不过，我们也保留了灵活性，允许您根据需要进行筛选。

训练超参数

我们的标准参数搜索空间为：

参数名称	敏	马克斯	分布
learning_rate	1e-4	1e-2	对数均匀
weight_decay	1e-5	1e-1	对数均匀
adam_beta1	0.9	0.99	校服
adam_beta2	0.9	0.999	校服
adam_epsilon	1e-9	1e-6	对数均匀
num_warmup_steps	0	10000	校服
lora_dropout	0	0.5	校服

我们也会搜索学习率调度器，选择其中一个linear_with_warmup或cosine_with_warmup. 如果num_warmup_steps参数设置为0那么你可以等价地使用线性或余弦选项。

学习率、学习率调度器和热身步数之间相互作用。保持两个固定并调整第三个步数，能让你更好地了解它们如何影响数据集上的训练输出。

权重衰减和LoRA的退出参数是为了帮助控制过拟合。如果你发现适配器无法很好地从训练集泛化到评估集，试着提高这些参数的值。

该adam_参数影响Adam优化器在训练步骤中的行为。有关该优化器的更多信息，可以参考例如PyTorch文档。

许多其他暴露的参数与PEFT库中同名的参数类似。有关这些的更多信息，请参见变换器文档。

数据超参数

数据超参数train_nsamples以及test_nsamples控制训练和测试分别取的样本数量。使用训练集中的更多样本通常是个好主意。使用更多测试样本能让你的测试指标噪声更低，但每次评估运行都会花更长时间。

该train_batch_size以及test_batch_size参数控制每批训练和测试中使用的样本数量。通常你可以用比训练更多的批次来测试，因为运行测试示例比训练样本占用的GPU内存少。

该train_seqlen以及test_seqlen参数控制列车和测试序列的长度。一般来说，越长越好，直到达到GPU内存限制。默认设置应该能提供良好的平衡。

选择系统提示词

我们发现，在选择系统提示词进行训练时，效果良好的是保持其相对简单（1到2句），同时鼓励模型以你想要的格式输出。我们还发现，使用稍有不同的系统提示词进行训练和推理可以提升效果。

你期望的输出与基础模型的差异越大，系统提示词对你的帮助就越大。

例如，如果你只训练基础模型的风格做些微变化，比如用简化语言吸引年轻读者，可能根本不需要系统提示。

不过，如果你想要的输出更有结构，那么你需要用系统提示词来让模型完成部分目标。所以，如果你需要一个带有特定键的JSON表，系统提示词的第一句话可以描述模型响应如果用通俗语言应是什么样子。第二句话可以更具体地说明JSON表的格式。在训练中使用第一句，推理时用两句话，可以给你想要的结果。

参数

所有可微调参数的列表附于此处。如果某参数未出现在工作流程页面界面，请手动添加到<your_project_path>/<model_name>/lora/lora.yaml.

[

################## Basic config settings ##################
  {
    "groupId": "data",
    "fields": [
      {
        "name": "system_prompt",
        "type": "Optional",
        "defaultValue": null,
        "info": "Optional system prompt. If specified, the system prompt given here will be prepended to each example in the dataset as the system prompt when training the LoRA adapter. When running inference the same (or a very similar) system prompt should be used. Note: if a system prompt is specified in the training data, giving a system prompt here will overwrite the system prompt in the dataset.",
        "label": "System prompt"
      },
      {
        "name": "varied_seqlen",
        "type": "bool",
        "defaultValue": false,
        "info": "Varied sequence lengths in the calibration data. If False (default), training examples will be concatenated together until they are finetune_[train/test]_seqlen tokens long. This makes memory usage more consistent and predictable. If True, each individual example will be truncated to finetune_[train/test]_seqlen tokens. This can sometimes give better training performance, but also gives unpredictable memory usage. It can cause `out of memory` errors mid training, if there are long training examples in your dataset.",
        "label": "Allow varied sequence length in data"
      },
      {
        "name": "finetune_dataset",
        "type": "str",
        "defaultValue": "wikitext2",
        "info": "Dataset to finetune on.",
        "label": "Dataset name or path"
      },
      {
        "name": "finetune_train_nsamples",
        "type": "int",
        "defaultValue": 4096,
        "info": "Number of samples to load from the train set for finetuning.",
        "label": "Number of finetuning samples"
      },
      {
        "name": "finetune_test_nsamples",
        "type": "int",
        "defaultValue": 128,
        "info": "Number of samples to load from the test set for finetuning.",
        "label": "Number of test samples"
      },
      {
        "name": "finetune_train_batch_size",
        "type": "int",
        "defaultValue": 4,
        "info": "Batch size for finetuning training.",
        "label": "Training batch size"
      },
      {
        "name": "finetune_test_batch_size",
        "type": "int",
        "defaultValue": 8,
        "info": "Batch size for finetuning testing.",
        "label": "Test batch size"
      },
      {
        "name": "finetune_train_seqlen",
        "type": "int",
        "defaultValue": 2048,
        "info": "Maximum sequence length for finetuning training. Longer sequences will be truncated.",
        "label": "Max training sequence length"
      },
      {
        "name": "finetune_test_seqlen",
        "type": "int",
        "defaultValue": 2048,
        "info": "Maximum sequence length for finetuning testing. Longer sequences will be truncated.",
        "label": "Max test sequence length"
      }
    ]
  },
  {
    "groupId": "finetuning",
    "fields": [
      {
        "name": "early_stopping_patience",
        "type": "int",
        "defaultValue": 5,
        "info": "Number of evaluations with no improvement after which training will be stopped.",
        "label": "Early stopping patience"
      },
      {
        "name": "epochs",
        "type": "float",
        "defaultValue": 1,
        "info": "Number of total epochs to run.",
        "label": "Epochs"
      },
      {
        "name": "eval_steps",
        "type": "int",
        "defaultValue": 64,
        "info": "Number of training steps to perform before each evaluation.",
        "label": "Steps between evaluations"
      },
      {
        "name": "save_steps",
        "type": "int",
        "defaultValue": 64,
        "info": "Number of steps after which to save model checkpoint. This _must_ be a multiple of the number of steps between evaluations.",
        "label": "Steps between checkpoints"
      },
      {
        "name": "learning_rate",
        "type": "float",
        "defaultValue": 0.0002,
        "info": "Learning rate for training.",
        "label": "Learning rate"
      },
      {
        "name": "lr_scheduler_type",
        "type": "str",
        "defaultValue": "linear",
        "info": "Type of learning rate scheduler.",
        "label": "Learning rate scheduler",
        "optionValues": [
          "linear",
          "linear_with_warmup",
          "cosine",
          "cosine_with_warmup"
        ]
      },
      {
        "name": "num_warmup_steps",
        "type": "int",
        "defaultValue": 400,
        "info": "Number of warmup steps for learning rate scheduler. Only relevant for a _with_warmup scheduler.",
        "label": "Scheduler warmup steps (if supported)"
      }
    ]
  }



################## Advanced config settings ##################



  {
    "groupId": "advanced",
    "fields": [
      {
        "name": "seed",
        "type": "int",
        "defaultValue": 42,
        "info": "Seed for sampling the data.",
        "label": "Random seed"
      },
      {
        "name": "evaluation_strategy",
        "type": "str",
        "defaultValue": "steps",
        "info": "Evaluation strategy to use.",
        "label": "Evaluation strategy",
        "optionValues": [
          "steps",
          "epoch",
          "no"
        ]
      },
      {
        "name": "lora_dropout",
        "type": "float",
        "defaultValue": 0.1,
        "info": "Dropout rate for LoRA.",
        "label": "LoRA dropout"
      },
      {
        "name": "adam_beta1",
        "type": "float",
        "defaultValue": 0.9,
        "info": "Beta1 hyperparameter for Adam optimizer.",
        "label": "Adam beta 1"
      },
      {
        "name": "adam_beta2",
        "type": "float",
        "defaultValue": 0.95,
        "info": "Beta2 hyperparameter for Adam optimizer.",
        "label": "Adam beta 2"
      },
      {
        "name": "adam_epsilon",
        "type": "float",
        "defaultValue": 1e-08,
        "info": "Epsilon hyperparameter for Adam optimizer.",
        "label": "Adam epsilon"
      },
      {
        "name": "num_training_steps",
        "type": "Optional",
        "defaultValue": null,
        "info": "The number of training steps there will be. If not set (recommended), this will be calculated internally.",
        "label": "Number of training steps"
      },
      {
        "name": "gradient_accumulation_steps",
        "type": "int",
        "defaultValue": 1,
        "info": "Number of updates steps to accumulate before performing a backward/update pass.",
        "label": "gradient accumulation steps"
      },
      {
        "name": "eval_accumulation_steps",
        "type": "Optional",
        "defaultValue": null,
        "info": "Number of predictions steps to accumulate before moving the tensors to the CPU.",
        "label": "eval accumulation steps"
      },
      {
        "name": "eval_delay",
        "type": "Optional",
        "defaultValue": 0,
        "info": "Number of epochs or steps to wait for before the first evaluation can be performed, depending on the eval_strategy.",
        "label": "eval delay"
      },
      {
        "name": "weight_decay",
        "type": "float",
        "defaultValue": 0.0,
        "info": "Weight decay for AdamW if we apply some.",
        "label": "weight decay"
      },
      {
        "name": "max_grad_norm",
        "type": "float",
        "defaultValue": 1.0,
        "info": "Max gradient norm.",
        "label": "max grad norm"
      },
      {
        "name": "gradient_checkpointing",
        "type": "bool",
        "defaultValue": false,
        "info": "If True, use gradient checkpointing to save memory at the expense of slower backward pass.",
        "label": "gradient checkpointing"
      }
    ]
  }
]

修改 Azure 订阅和资源组

如果你想修改之前设置的 Azure 订阅和资源组，可以在<your_project_path>/model_lab.workspace.provision.config档案。

使用 Phi Silica LoRA 适配器进行推断

重要

Phi Silica API 属于有限访问功能（参见 LimitedAccessFeatures 类）。欲了解更多信息或申请解锁令牌，请使用LAF访问令牌请求表。

注释

目前仅支持 Phi Silica LoRA 适配器的推理，适用于搭载 ARM 处理器的 Copilot+ 个人电脑。

使用 Windows AI API 推断：Phi Silica 搭配 LoRA 适配器