Original: Fasih Ahmad, A Comprehensive Guide on Fine-Tuning LLMs: Mimicking Researcher Writing Style


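Introduction to Fine-Tuning LLMs

Fine-tuning large language models (LLMs) has become a key technique in natural language processing (NLP): it adapts a pretrained model to a specific domain or task and improves its performance there. In this article, we walk through the fine-tuning process in Python, with an emphasis on techniques for preprocessing text data efficiently.

Preprocessing Text Data

Cleaning Text Data

Before fine-tuning an LLM, it is essential to preprocess the text data so that it is consistent and free of noise. The following snippet shows one thorough way to clean text: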

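import re
from cleantext import clean
def clean_text(text):
    # Strip URLs, emails, phone numbers, currency symbols, and punctuation first.
    # This must run before the regex below: once punctuation is gone,
    # clean() can no longer recognize URLs or email addresses.
    cleaned_text = clean(text, no_line_breaks=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True, no_punct=True, replace_with_punct='', replace_with_url='', replace_with_email='', replace_with_phone_number='', replace_with_currency_symbol='', lang='en')
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)  # remove any remaining special characters
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)  # collapse runs of whitespace into a single space
    return cleaned_text.strip()  # remove leading and trailing whitespace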

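The clean_text() function combines regular expressions with the cleantext library to remove special characters, URLs, email addresses, and other noise from the text, so that the LLM receives clean input.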
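
The order of the cleaning steps matters: if punctuation is stripped first, a URL collapses into an ordinary-looking word and can no longer be detected. Here is a minimal standard-library sketch of that idea. It is not the cleantext implementation; the function name strip_noise and the two patterns are simplified assumptions made for illustration.

```python
import re

# Simplified, hypothetical patterns; real-world URL and email matching is more involved.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\S+@\S+\.\S+')

def strip_noise(text):
    """Remove URLs and emails first, then punctuation, then extra whitespace."""
    text = URL_PATTERN.sub('', text)
    text = EMAIL_PATTERN.sub('', text)
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation and other symbols
    text = re.sub(r'\s+', ' ', text)      # collapse whitespace runs
    return text.strip()

sample = "Contact: jane@example.org,  see https://example.org/paper.pdf  for details!"
print(strip_noise(sample))  # Contact see for details
```

Running the punctuation step first on the same sample would leave the token httpsexampleorgpaperpdf in the output, which is exactly the kind of noise the cleaning step is meant to remove.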

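Extracting Text from PDFs

Parsing PDF Documents

PDF documents often contain valuable text, but getting it out can be challenging. The following function, extract_text_from_pdf(), uses PyPDF2 to extract the text from a PDF file: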

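import PyPDF2
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can come back empty for image-only pages, so fall back to ''
        page_texts = [page.extract_text() or '' for page in reader.pages]
    # Join pages with a newline so words at page boundaries do not run together
    return clean_text('\n'.join(page_texts))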

The function iterates over every page of the PDF, extracts its text, and cleans the result with the clean_text() function defined earlier.
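
The chunking script later in this article reads a file named extracted_text.json, but the article does not show how that file is produced. The sketch below is one plausible way to build it, assuming the file is a JSON list with one text string per PDF. The function name build_corpus and the idea of passing the extractor in as an argument are assumptions for illustration, not part of the original workflow.

```python
import json
from pathlib import Path

def build_corpus(pdf_dir, extract_fn, out_path='extracted_text.json'):
    """Run extract_fn on every PDF in pdf_dir and save the texts as a JSON list."""
    texts = []
    for pdf_path in sorted(Path(pdf_dir).glob('*.pdf')):
        try:
            text = extract_fn(str(pdf_path))
        except Exception as exc:  # one unreadable PDF should not stop the whole run
            print(f"Skipping {pdf_path.name}: {exc}")
            continue
        if text:  # ignore PDFs that yielded no text
            texts.append(text)
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(texts, f, indent=4)
    return texts
```

With the function above in scope, a call such as build_corpus('papers', extract_text_from_pdf) would process a folder of papers; the folder name is a placeholder.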

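The Fine-Tuning Process

Extracting the Relevant Sections

Fine-tuning usually means focusing on specific parts of a document, which here are delimited by the abstract and the references. The extract_abstract_and_references() function keeps the text from the abstract up to, but not including, the reference list: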

import re
def extract_abstract_and_references(text):
    abstract_match = re.search(r'abstract', text, re.IGNORECASE)
    if not abstract_match:
        return text.lower()  # no abstract heading found: keep the whole text
    start = abstract_match.start()
    # Match "references" case-insensitively and take the last occurrence,
    # since the reference list sits at the end of the paper.
    reference_starts = [m.start() for m in re.finditer(r'references', text, re.IGNORECASE)]
    if reference_starts and reference_starts[-1] > start:
        end = reference_starts[-1]
    else:
        end = len(text)  # no reference list after the abstract: keep everything that follows
    return text[start:end].lower()

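Simplifying Text with LangChain

In this section we look at a Python script that uses LangChain, a framework for building applications on top of LLMs, to simplify text. Let's go through the code and see how it works.

Importing the Required Modules

The script starts by importing the modules it needs: Together from langchain_together, PromptTemplate from langchain.prompts, plus json and asyncio. These let us call a hosted model through LangChain to simplify text.

Defining the Example Texts

The script defines example texts, each pairing a raw passage with a corresponding neutral version. They illustrate the kind of rewrite we want. Note that in the script as shown, examples and example_prompt are defined but never passed to the prompt that is sent to the model.

Setting Up the Prompt

The prompt variable defines a prompt template that instructs the model to simplify the given text. It declares one input variable, text, which is replaced with the actual text at processing time.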
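
The example pairs described above could be folded into the prompt as few-shot demonstrations. The following is a minimal sketch using plain Python string formatting rather than LangChain's FewShotPromptTemplate; the function name build_few_shot_prompt and the shortened example texts are assumptions made for illustration.

```python
# Shortened stand-ins for the article's example pairs (hypothetical values).
examples = [
    {
        "Raw_text": "The purpose of this paper is to highlight advances in unsupervised learning.",
        "Neutral_text": "The objective of this document is to outline progress in unsupervised learning.",
    },
]

INSTRUCTION = "Simplify the given text. Your output should only be the simplified text nothing else."

def build_few_shot_prompt(text, examples):
    """Return the instruction, the worked examples, and the new text as one prompt string."""
    parts = [INSTRUCTION, ""]
    for ex in examples:
        parts.append(f"Raw_text: {ex['Raw_text']}")
        parts.append(f"Neutral_text: {ex['Neutral_text']}")
        parts.append("")
    # Leave the final Neutral_text empty so the model completes it.
    parts.append(f"Raw_text: {text}")
    parts.append("Neutral_text:")
    return "\n".join(parts)

print(build_few_shot_prompt("We propose a novel framework for traffic classification.", examples))
```

Ending the prompt on an open Neutral_text: line is a common way to cue the model to produce only the rewritten text.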

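Initializing the Model

The model is initialized through the Together class, which specifies the model to use (mistralai/Mistral-7B-Instruct-v0.2), the temperature, the top-k value, the maximum number of tokens, and the Together API key.

Processing the Text

The script reads the input text chunks from a JSON file, creates a prompt for each chunk, and sends the prompts to the model in batches. The simplified versions that come back are collected in a list and written to neutral_texts.json.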

from langchain_together import Together
from langchain.prompts import PromptTemplate
import json
import asyncio
from langchain_core.prompts.few_shot import FewShotPromptTemplate
import pprint
examples = [
    {
        "Raw_text": """The purpose of this paper is to highlight the important advances in unsupervised learning, and after providing a tutorial introduction to these techniques, to review how such techniques have been, or could be, used for various tasks in modern next-generation networks comprising both computer networks as well as mobile telecom networks.""",
        "Neutral_text": """The objective of this document is to outline significant progress in unsupervised learning. Following a tutorial-style introduction to these methods, it aims to examine the potential applications of such techniques in contemporary next-generation networks, encompassing computer networks and mobile telecommunications networks.""",
    },
    {
        "Raw_text": """The rapid advances in deep neural networks, the democratization of enormous computing capabilities through cloud computing and distributed computing, and the ability to store and process large swathes of data have motivated a surging interest in applying unsupervised ML techniques in the networking field.""",
        "Neutral_text": """The swift progress in deep neural networks, the widespread access to vast computing power via cloud and distributed computing, and the capacity to handle extensive data sets have spurred a growing enthusiasm for the application of unsupervised machine learning techniques in the networking domain.""",
    }
]
example_prompt = PromptTemplate(
    input_variables=["Raw_text", "Neutral_text"], template="Raw_text: {Raw_text}\nNeutral_text:{Neutral_text}"
)
prompt=PromptTemplate(
    template=
    """Simplify the given text. Your output should only be the simplified text nothing else.\n\n
Given text:\n{text}""",
    input_variables=["text"]
)
llm = Together(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    temperature=0.5,
    top_k=1,
    together_api_key="API_Key",
    max_tokens=300
)
with open('./output_chunks.json', 'r') as f:
    data = json.load(f)
# Build one fully formatted prompt string per chunk
all_prompts = [prompt.format(text=chunk) for chunk in data]
print(len(data))
batch_size = 100
count = 0
neutral_texts = []
while count < len(all_prompts):
    # The prompts are already formatted, so they go to the LLM directly;
    # a prompt | llm chain would expect {"text": ...} dicts instead of strings.
    # In a notebook, where an event loop is already running, use
    # `await llm.abatch(...)` instead of asyncio.run(...).
    ans = asyncio.run(llm.abatch(all_prompts[count:count + batch_size]))
    for a in ans:
        neutral_texts.append({"Raw_text": data[count], "Neutral_text": a})
        print("===========================================================================")
        print(data[count])
        print("---------------------------------------------------------------------------")
        print(a)
        print("===========================================================================")
        count += 1
    print("Processed ", count, " texts")
    # Rewrite the output file after every batch so progress survives an interruption
    with open('neutral_texts.json', 'w') as f:
        json.dump(neutral_texts, f, indent=4)
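
The batching pattern in the loop above can be tried without an API key. The sketch below swaps the hosted model for a stub coroutine, so it only illustrates the control flow. The names fake_simplify, run_batch, and process_in_batches, and the batch size of 3, are hypothetical.

```python
import asyncio

async def fake_simplify(prompt):
    """Stand-in for a model call: yield to the event loop, then return a dummy result."""
    await asyncio.sleep(0)
    return prompt.lower()

async def run_batch(prompts):
    # Issue all calls in the batch concurrently; gather keeps results in input order.
    return await asyncio.gather(*(fake_simplify(p) for p in prompts))

def process_in_batches(texts, batch_size=3):
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        outputs = asyncio.run(run_batch(batch))
        # zip pairs each input with its own output, so no manual counter is needed
        results.extend({"Raw_text": t, "Neutral_text": o} for t, o in zip(batch, outputs))
        print(f"Processed {len(results)} texts")
    return results

pairs = process_in_batches(["Alpha", "Beta", "Gamma", "Delta"])
```

Pairing inputs and outputs with zip avoids the index bookkeeping of the original loop, at the cost of assuming the model returns exactly one result per prompt, in order.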

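Efficient Text Chunking with LangChain

In this section we look at a Python script that uses LangChain to split text into smaller chunks. Let's break the code down step by step.

Importing the Required Modules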

import os
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

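The script starts by importing os, json, and RecursiveCharacterTextSplitter from langchain.text_splitter. The splitter is what does the chunking; json handles reading and writing the data files.

Initializing the Text Splitter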

text_splitter = RecursiveCharacterTextSplitter(
    # Maximum chunk length (in characters, as measured by length_function)
    chunk_size=1050,
    chunk_overlap=70,
    length_function=len,
    is_separator_regex=False,
)

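A RecursiveCharacterTextSplitter object is created to split the text into chunks. The chunk_size and chunk_overlap parameters control the process: chunk_size caps the length of each chunk, and chunk_overlap sets how many characters adjacent chunks share.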
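
To build intuition for how chunk_size and chunk_overlap interact, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not the algorithm RecursiveCharacterTextSplitter uses, since that splitter prefers to break on separators such as paragraph breaks, newlines, and spaces. The function name sliding_window_chunks is an assumption for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts chunk_size - chunk_overlap characters after the previous one, so with the settings in this article (1050 and 70) consecutive windows would advance by 980 characters.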

Loading the Text Data

# Load the text from the JSON file
with open('extracted_text.json', 'r') as f:
    data = json.load(f)

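The script reads the text data from a JSON file named 'extracted_text.json'. This file holds the long texts that need to be split into smaller, manageable chunks.

Splitting the Text into Chunks

# Initialize a list to hold the chunks from every text item
chunks_dict = []
# Split each text item into chunks and append them to the list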
for index, text_item in enumerate(data):
    chunks = text_splitter.split_text(text_item)
    chunks_dict.extend(chunks)

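The script iterates over the text items in the loaded data and splits each one into smaller chunks with the text splitter initialized above. The resulting chunks are collected in chunks_dict, which, despite its name, is a flat list.

Saving the Chunks to a JSON File

# Save the chunks to a new JSON file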
with open('output_chunks.json', 'w') as f:
    json.dump(chunks_dict, f, indent=4)

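Finally, the script saves the chunks to a new JSON file named 'output_chunks.json'. This file contains the split text chunks, ready for further processing or analysis.

Running Tiny Llama: A Step-by-Step Guide

In this section we walk through setting up and running Llama Factory, using code snippets taken from a Google Colab notebook. We go through each step in turn to see how to fine-tune Tiny Llama.

Step 1: Chunk the text with LangChain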

import os
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Maximum chunk length (in characters, as measured by length_function)
    chunk_size=1050,
    chunk_overlap=70,
    length_function=len,
    is_separator_regex=False,
)
# Load the text from the JSON file
with open('extracted_text.json', 'r') as f:
    data = json.load(f)
# Initialize a list to hold the chunks from every text item
chunks_dict = []
# Split each text item into chunks and append them to the list
for index, text_item in enumerate(data):
    chunks = text_splitter.split_text(text_item)
    chunks_dict.extend(chunks)
# Save the chunks to a new JSON file
with open('output_chunks.json', 'w') as f:
    json.dump(chunks_dict, f, indent=4)
print("Chunks have been saved to 'output_chunks.json'.")

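This code shows how to use LangChain to split large texts into smaller, manageable chunks, a key preprocessing step before the data is fed to our Tiny Llama model.

Step 2: Install the required package

!pip install bitsandbytes

This command installs bitsandbytes, a Python package for low-bit (8-bit and 4-bit) quantization. It is needed for the quantized training configured later in Llama Factory.

Step 3: Verify GPU availability

import torch
assert torch.cuda.is_available(), "No CUDA GPU detected"

Running Llama Factory efficiently requires a GPU. This assertion stops the notebook early if none is available.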

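Step 4: Log in to Hugging Face

!huggingface-cli login --token Hugging_Face_Token

This command logs in to Hugging Face with the supplied token (replace Hugging_Face_Token with your own), which gives access to the models and resources Llama Factory needs.

Step 5: Install TensorRT

!pip install tensorrt

TensorRT is a high-performance deep learning inference library. The notebook installs it to optimize model inference performance.

Step 6: Run the Llama Factory web UI

!CUDA_VISIBLE_DEVICES=0 llamafactory-cli webui

This command launches the Llama Factory web user interface (UI) on GPU 0. From the UI, users can configure and run fine-tuning jobs.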

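Step 7: Set the environment variable

%env GRADIO_SHARE=False

Setting this environment variable configures how the Llama Factory web interface is served. In a notebook, use %env (or os.environ) rather than !export: each ! command runs in its own subshell, so a variable exported there is gone by the time the next cell runs. Set the variable before launching the web UI so that the launch picks it up.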
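
Why a shell export does not carry over to later commands can be demonstrated in plain Python. This is a small sketch that assumes a POSIX shell is available; it starts from a clean state by removing any existing GRADIO_SHARE value from the current process.

```python
import os
import subprocess

os.environ.pop("GRADIO_SHARE", None)  # start from a clean state for the demonstration

# A variable exported inside a child shell disappears when that shell exits.
subprocess.run("export GRADIO_SHARE=False", shell=True, check=True)
before = os.environ.get("GRADIO_SHARE")
print(before)  # None: the parent process never saw the export

# Setting it on the current process makes it visible to anything launched afterwards.
os.environ["GRADIO_SHARE"] = "False"
child = subprocess.run('echo "$GRADIO_SHARE"', shell=True,
                       capture_output=True, text=True, check=True)
print(child.stdout.strip())  # False
```

The same reasoning applies in Colab: a value set through os.environ or %env lives in the notebook's own process and is inherited by commands started from later cells.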

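Step 8: Launch the training web interface

!CUDA_VISIBLE_DEVICES=0 python src/train_web.py

This command launches Llama Factory's training web interface from the repository's src/train_web.py script. From there, training can be customized and tuned for a specific use case.

Prerequisites Before Running the Interface

Before launching the Llama Factory web interface, a few file changes are needed. They make sure the interface works smoothly and that the model and dataset are integrated correctly. Let's look at the required configuration:

1. Modify interface.py

Open interface.py and find the launch() call. Right-click it to go to its definition, which opens blocks.py. In blocks.py, set share: bool = True. This enables Gradio's share feature, so the web interface gets a public link that can be opened from outside the Colab runtime.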

def launch():
    share: bool = True
    # other configuration and functionality

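2. Update constants.py

Open constants.py, located in the extras subfolder of the data folder. There, register your model using the snippet below, so that the Llama Factory web interface can recognize and load the specified model.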

"LLaMA-Tiny": {
    DownloadSource.DEFAULT: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
},

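3. Configure dataset_info.json

Open dataset_info.json and add your dataset so that it appears in the web interface. Define the dataset's specification, such as the file name and the column mapping, so that it can be selected and used from the interface.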

"PDF_Data": {
    "file_name": "data.json",
    "columns": {
        "prompt": "prompt",
        "response": "output"
    }
}
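
The entry above expects a data.json file whose records have prompt and output fields. The article does not show how that file is built, so the following is only a sketch under stated assumptions: the input is the neutral_texts.json file produced earlier, the neutral text serves as the prompt, and the researcher's original text serves as the target output. The function names to_training_records and convert are hypothetical.

```python
import json

def to_training_records(pairs):
    """Map {"Raw_text", "Neutral_text"} pairs to {"prompt", "output"} records."""
    records = []
    for pair in pairs:
        raw, neutral = pair["Raw_text"].strip(), pair["Neutral_text"].strip()
        if not raw or not neutral:
            continue  # skip pairs where either side is empty
        # Assumption: the model reads neutral text and learns to answer in the researcher's style.
        records.append({"prompt": neutral, "output": raw})
    return records

def convert(in_path="neutral_texts.json", out_path="data.json"):
    """Read the paired texts, convert them, and write the training file."""
    with open(in_path, "r", encoding="utf-8") as f:
        pairs = json.load(f)
    records = to_training_records(pairs)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
    return len(records)
```

If the goal were simplification rather than style imitation, the two fields would simply be swapped.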

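Using the Llama Factory Interface

Step 1: Model selection and dataset configuration

The first step is to choose Tiny Llama as the base model and select the dataset to train on. Tiny Llama is small enough to train cheaply, and Llama Factory is flexible enough that we can tailor the model to different text generation tasks; in this case, writing like a researcher.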


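Step 2: Configure the training settings

Once the model and dataset are selected, we configure the training settings. Here we set the key parameters: a learning rate of 2e-5, 1.0 training epochs, and a maximum of 10,000 samples. These settings strongly influence how well the trained model performs and whether training converges.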
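
These settings translate into a rough count of optimizer updates. The sketch below is a back-of-the-envelope estimate for a single-device run, assuming the dataset reaches the 10,000-sample cap. The batch size and gradient accumulation values are hypothetical, since the article does not state them, and the exact count depends on how the trainer handles a final partial batch.

```python
import math

def optimizer_steps(num_samples, epochs, batch_size, grad_accum_steps=1):
    """Estimate the number of optimizer updates for a single-device training run."""
    samples_per_step = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_samples / samples_per_step)
    return math.ceil(steps_per_epoch * epochs)

# 10,000 samples and 1.0 epochs come from the settings above;
# batch_size=4 and grad_accum_steps=4 are hypothetical values.
print(optimizer_steps(10_000, 1.0, batch_size=4, grad_accum_steps=4))  # 625
```

A number like this is useful mainly for sanity-checking the progress bar and for choosing how often to log or save checkpoints.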


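Step 3: Advanced configuration for optimization

Beyond the basic settings, we can explore advanced configuration to optimize training further. One such option is quantization, which we set to 4 bits. Lowering the numerical precision of the model shrinks its memory footprint, improving efficiency at what is usually a small cost in quality.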
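
The effect of 4-bit quantization on memory can be estimated with simple arithmetic. The sketch below counts the weights only, using the nominal 1.1 billion parameters of TinyLlama-1.1B. It ignores quantization metadata, activations, and optimizer state, so real usage will be higher; the function name weight_memory_gb is an assumption for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.1e9  # nominal size of TinyLlama-1.1B
for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
# 16-bit weights: 2.20 GB
# 4-bit weights: 0.55 GB
```

Going from 16-bit to 4-bit weights cuts this part of the footprint by a factor of four, which is what makes training on a small Colab GPU more comfortable.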


Step 4: Start training with Llama Factory

With the model, dataset, and settings configured, we can start the training run from Llama Factory.


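Conclusion

In this article, we covered how to set up and run Llama Factory on Google Colab, using LangChain for efficient text processing.

By following these steps, readers can combine LangChain with Llama Factory on Google Colab to process text and fine-tune text generation models for a range of domains.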
