
A comprehensive guide to fine-tuning large language models (LLMs): mimicking a researcher's writing style


Original article: Fasih Ahmad, "A Comprehensive Guide on Fine-Tuning LLMs: Mimicking Researcher Writing Style"


Introduction to fine-tuning LLMs

Fine-tuning large language models (LLMs) has emerged as a key technique in natural language processing (NLP), enabling models to adapt to specific domains or tasks and improve their performance. In this article, we will explore the process of fine-tuning LLMs using Python, focusing on techniques for efficiently preprocessing text data.

Preprocess text data

Clean up text data

Before fine-tuning an LLM, it is critical to preprocess the text data to ensure consistency and remove noise. The code snippet below demonstrates a comprehensive way to clean up text data:

import re
from cleantext import clean

def clean_text(text):
    # Remove special characters and extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into a single space
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)  # strip special characters
    cleaned_text = cleaned_text.strip()  # remove leading and trailing spaces
    cleaned_text = clean(cleaned_text, no_line_breaks=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True, no_punct=True, replace_with_punct='', replace_with_url='', replace_with_email='', replace_with_phone_number='', replace_with_currency_symbol='', lang='en')
    return cleaned_text
           

The clean_text() function uses regular expressions and the cleantext library to remove special characters, URLs, emails, and other noisy elements from the text, ensuring that the LLM input is clean.
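
A quick usage example (the sample string below is made up for illustration; the exact output depends on the installed cleantext version):

# Made-up sample text purely for illustration
sample = "  Fine-tuning   LLMs -- a practical guide!!  "
print(clean_text(sample))
# e.g. "finetuning llms a practical guide"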

Extract text from PDFs

Parse PDF documents

PDF documents often contain valuable textual data, but extracting this data can be challenging. The following function extract_text_from_pdf() uses PyPDF2 to efficiently extract text from PDF files:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        # Concatenate the text of every page in the document
        for page in reader.pages:
            text += page.extract_text()
    return clean_text(text)
           

The function traverses each page of the PDF document, extracts the text, and cleans it up using the previously defined clean_text() function.
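
The chunking step later in this guide reads from a file named extracted_text.json; the original article does not show how that file is produced, but a minimal sketch along those lines (the pdfs folder name is an assumption) might look like this:

import os
import json

# Hypothetical batch step: extract and clean every PDF in a folder, then save
# the results to the extracted_text.json file consumed by the chunking step.
pdf_dir = './pdfs'  # assumed location of the source papers
extracted = []
for filename in sorted(os.listdir(pdf_dir)):
    if filename.lower().endswith('.pdf'):
        extracted.append(extract_text_from_pdf(os.path.join(pdf_dir, filename)))

with open('extracted_text.json', 'w') as f:
    json.dump(extracted, f, indent=4)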

The fine-tuning process

Extract the relevant sections

Fine-tuning LLMs often requires focusing on specific parts of the text, such as the abstract and references. The extract_abstract_and_references() function isolates these parts efficiently:

import re

def extract_abstract_and_references(text):
    abstract_match = re.search(r'Abstract', text, re.IGNORECASE)
    # Use the last case-insensitive occurrence of "References" as the cut-off point
    references_matches = list(re.finditer(r'References', text, re.IGNORECASE))
    if abstract_match and references_matches:
        abstract_start = abstract_match.start()
        references_start = references_matches[-1].start()
        return text[abstract_start:references_start].lower()
    elif abstract_match:
        return text[abstract_match.start():].lower()
    else:
        return text.lower()
           
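Putting the two helpers together on a single paper might look like this (paper.pdf is a placeholder path):

# Placeholder path; combine PDF extraction with section isolation for one paper
raw_text = extract_text_from_pdf('paper.pdf')
body = extract_abstract_and_references(raw_text)
print(body[:500])  # preview the abstract-to-references span
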

Simplify text with LangChain

In this section, we'll explore a Python script that simplifies text using LangChain, a framework for building applications on top of large language models. Let's dive into the code and understand its features.

Import the necessary modules

The script first imports the required modules, including Together from langchain_together, PromptTemplate from langchain.prompts, json, and asyncio. These modules make it possible to use LangChain for text simplification.

Define sample text

The script defines example texts, each pairing an original passage with its corresponding neutral version. These examples illustrate the kind of raw-to-neutral rewriting expected of the model (they are set up for few-shot prompting, although the batched calls below use only the instruction prompt).

Set up prompts

The prompt variable defines a prompt template that instructs LangChain to simplify the given text. It specifies the input variable text, which will be replaced by the actual text during processing.

Initialize LangChain

Initialize the LLM using the Together class, specifying the model to use (mistralai/Mistral-7B-Instruct-v0.2), the temperature, the top-k value, and the Together API key.

Work with text

The script reads the input text chunks from the JSON file, builds a prompt for each chunk, and sends the prompts to the model in batches for simplification. The generated simplified versions are paired with their source chunks and stored in a list.

from langchain_together import Together
from langchain.prompts import PromptTemplate
from langchain_core.prompts.few_shot import FewShotPromptTemplate
import json
import asyncio
import pprint

# Raw/neutral example pairs (set up for optional few-shot prompting; the batched
# calls below rely on the simple instruction prompt only)
examples = [
    {
        "Raw_text": """The purpose of this paper is to highlight the important advances in unsupervised learning, and after providing a tutorial introduction to these techniques, to review how such techniques have been, or could be, used for various tasks in modern next-generation networks comprising both computer networks as well as mobile telecom networks.""",
        "Neutral_text": """The objective of this document is to outline significant progress in unsupervised learning. Following a tutorial-style introduction to these methods, it aims to examine the potential applications of such techniques in contemporary next-generation networks, encompassing computer networks and mobile telecommunications networks.""",
    },
    {
        "Raw_text": """The rapid advances in deep neural networks, the democratization of enormous computing capabilities through cloud computing and distributed computing, and the ability to store and process large swathes of data have motivated a surging interest in applying unsupervised ML techniques in the networking field.""",
        "Neutral_text": """The swift progress in deep neural networks, the widespread access to vast computing power via cloud and distributed computing, and the capacity to handle extensive data sets have spurred a growing enthusiasm for the application of unsupervised machine learning techniques in the networking domai""",
    }
]
example_prompt = PromptTemplate(
    input_variables=["Raw_text", "Neutral_text"],
    template="Raw_text: {Raw_text}\nNeutral_text:{Neutral_text}"
)

# Instruction prompt applied to every chunk of text
prompt = PromptTemplate(
    template="""Simplify the given text. Your output should only be the simplified text nothing else.\n\n
Given text:\n{text}""",
    input_variables=["text"]
)

# Initialize the Together-hosted Mistral model (replace API_Key with your own key)
llm = Together(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    temperature=0.5,
    top_k=1,
    together_api_key="API_Key",
    max_tokens=300
)

# Load the text chunks produced by the chunking step
with open('./output_chunks.json', 'r') as f:
    data = json.load(f)

# Format the instruction prompt once per chunk
all_prompts = []
for chunk in data:
    formatted_prompt = prompt.format(text=chunk)
    all_prompts.append(formatted_prompt)

chain = prompt | llm
print(len(data))

count = 0
neutral_texts = []
while len(all_prompts) > count:
    # The prompts are already fully formatted, so they are sent straight to the
    # model in batches of 100 (re-running them through the prompt template would
    # duplicate the instruction)
    ans = asyncio.run(llm.abatch(all_prompts[count:count + 100]))
    for a in ans:
        neutral_texts.append({"Raw_text": data[count], "Neutral_text": a})
        print("===========================================================================")
        print(data[count])
        print("---------------------------------------------------------------------------")
        print(a)
        print("===========================================================================")
        count += 1
    print("Processed ", count, " texts")
    # Persist progress after every batch
    with open('neutral_texts.json', 'w') as f:
        json.dump(neutral_texts, f, indent=4)
           

Use LangChain for efficient text chunking

In this section, we'll explore a Python script that uses LangChain to efficiently split text into smaller chunks. Let's break down the code step by step and understand its functionality.

Import the necessary modules

import os
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
           

The script first imports the necessary modules: os, json, and RecursiveCharacterTextSplitter from langchain.text_splitter. These modules support text chunking with LangChain.

Initialize the text chunker

text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, for demonstration purposes
    chunk_size=1050,
    chunk_overlap=70,
    length_function=len,
    is_separator_regex=False,
)
           

This initializes a RecursiveCharacterTextSplitter object that splits text into chunks. The chunking behavior is controlled by parameters such as chunk_size and chunk_overlap.

Load text data

# Load the text from the JSON file
with open('extracted_text.json', 'r') as f:
    data = json.load(f)
           

The script reads text data from a JSON file named 'extracted_text.json'. This data is likely to contain large pieces of text that need to be broken down into smaller, manageable chunks.

Split the text into chunks

# Initialize a list to store the chunks of every text item
chunks_dict = []
# Split each text into chunks and collect them in the list
for index, text_item in enumerate(data):
    chunks = text_splitter.split_text(text_item)
    chunks_dict.extend(chunks)
           

This iterates over the text items in the loaded data, splitting each item into smaller chunks with the initialized text splitter. The resulting chunks are collected in a list called chunks_dict.

Save the chunks to a JSON file

# Save the chunks to a new JSON file
with open('output_chunks.json', 'w') as f:
    json.dump(chunks_dict, f, indent=4)
           

Finally, the script saves the generated chunks to a new JSON file named 'output_chunks.json'. The file contains the split chunks of text, ready for further processing or analysis.
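
As an optional sanity check, you can reload the file and inspect what was written:

import json

# Optional check: reload the chunks written above and preview the first one
with open('output_chunks.json', 'r') as f:
    chunks = json.load(f)
print(f"Saved {len(chunks)} chunks")
print(chunks[0][:200])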

Fine-tuning Tiny Llama with Llama Factory: a step-by-step guide

In this section, we'll walk through the process of setting up and running Llama Factory, with code snippets taken from a Google Colab notebook. Let's break down each step to see how Tiny Llama can be fine-tuned effectively.

Step 1: Use LangChain for text chunking

import os
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, for demonstration purposes
    chunk_size=1050,
    chunk_overlap=70,
    length_function=len,
    is_separator_regex=False,
)
# Load the text from the JSON file
with open('extracted_text.json', 'r') as f:
    data = json.load(f)
# Initialize a list to store the chunks of every text item
chunks_dict = []
# Split each text into chunks and collect them in the list
for index, text_item in enumerate(data):
    chunks = text_splitter.split_text(text_item)
    chunks_dict.extend(chunks)
# Save the chunks to a new JSON file
with open('output_chunks.json', 'w') as f:
    json.dump(chunks_dict, f, indent=4)
print("Chunks have been saved to 'output_chunks.json'.")
           

This code demonstrates how to use LangChain to split large chunks of text into smaller, manageable chunks, which is a key preprocessing step before data is fed into our Tiny Llama model.

Step 2: Install the required packages

!pip install bitsandbytes
           

This command installs the bitsandbytes Python package, which Llama Factory needs for quantized (low-precision) training.

Step 3: Verify GPU availability

import torch
assert torch.cuda.is_available() is True
           

It is critical to ensure that the GPU is available to run Llama Factory effectively. This assertion validates the availability of the GPU.
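
If the assertion passes, you can also print which GPU Colab has allocated (purely informational):

import torch

# Report the GPU that Colab has assigned (e.g. a T4)
print(torch.cuda.get_device_name(0))
print(f"CUDA devices visible: {torch.cuda.device_count()}")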

Step 4: Log in to Hugging Face

!huggingface-cli login --token Hugging_Face_Token
           

This command uses the provided token to log in to Hugging Face, giving you access to the models and resources needed by Llama Factory.
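
If you prefer not to paste the token into a shell command, the huggingface_hub library provides an equivalent programmatic login (Hugging_Face_Token remains a placeholder):

from huggingface_hub import login

# Equivalent programmatic login; replace the placeholder with your own token
login(token="Hugging_Face_Token")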

Step 5: Install TensorRT

!pip install tensorrt
           

TensorRT is NVIDIA's high-performance deep learning inference library. Installing it helps optimize model inference performance.

Step 6: Launch the Llama Factory Web UI

!CUDA_VISIBLE_DEVICES=0 llamafactory-cli webui
           

This command launches Llama Factory's web user interface (UI), allowing users to interact with the factory and perform a variety of text-related tasks.

Step 7: Export environment variables

!export GRADIO_SHARE=False
           

Setting this environment variable tells the Gradio-based web interface not to create a public share link, keeping Llama Factory's UI local to the notebook.
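
Note that in a Colab notebook a line such as !export GRADIO_SHARE=False runs in its own subshell, so the variable does not persist for later cells; setting it from Python is a more reliable alternative:

import os

# Set the variable in the notebook process itself so that later cells
# (and commands launched from them) inherit GRADIO_SHARE=False
os.environ["GRADIO_SHARE"] = "False"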

Step 8: Launch the training web UI

!CUDA_VISIBLE_DEVICES=0 python src/train_web.py
           

This command launches Llama Factory's training web interface, from which training runs can be customized and started for specific use cases.

Some prerequisites before running the interface

Before launching the Llama Factory web interface, there are a few key file tweaks that are necessary. These adjustments ensure seamless functionality and proper integration of the model with the dataset. Let's take a look at the necessary pre-configurations:

1. Modify interface.py

Go to interface.py and find the launch() function. Right-click to jump to its definition, which opens blocks.py. In blocks.py, set share: bool = True. This step enables sharing capabilities within the web interface, enhancing collaboration and accessibility.

def launch():
    share: bool = True
    # ... other configuration and functionality
           

2. Update constants.py

Access the constants.py located in the data folder, under the extras subfolder. Here, register your model with the code snippet provided. This step ensures that the Llama Factory web interface can seamlessly identify and integrate the specified model.

"LLaMA-Tiny": {
    DownloadSource.DEFAULT: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
},
           

3. Configure dataset_info.json

Open dataset_info.json and register your dataset so it appears in the web interface. Define the specifications of the dataset, such as the file name and column mappings. This ensures that users can easily access and use the dataset within the interface.

"PDF_Data": {
    "file_name": "data.json",
    "columns": {
        "prompt": "prompt",
        "response": "output"
    }
}
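
The article does not show how data.json itself is built. Given the column mapping above and the neutral_texts.json file produced earlier, one possible conversion sketch (the instruction wording below is an assumption, not from the original article) is:

import json

# Hypothetical conversion: turn the Raw_text/Neutral_text pairs produced earlier
# into the prompt/output columns registered in dataset_info.json.
with open('neutral_texts.json', 'r') as f:
    pairs = json.load(f)

records = []
for pair in pairs:
    records.append({
        # The instruction wording below is an assumption for illustration
        "prompt": "Rewrite the following text in the style of a researcher:\n" + pair["Neutral_text"],
        "output": pair["Raw_text"],
    })

with open('data.json', 'w') as f:
    json.dump(records, f, indent=4)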
           

Use the Llama Factory interface

Step 1: Model selection and dataset configuration

The first step is to select the Tiny Llama model as the base model and choose the dataset to train on. With the efficiency and effectiveness of Tiny Llama, coupled with the flexibility of Llama Factory, we can tailor the model to different text generation tasks, in this case writing like a researcher.


Step 2: Configure training settings

Once we've selected the model and dataset, we configure the training settings. Here we set key parameters such as a learning rate of 2e-5, 1.0 training epochs, and a maximum of 10,000 samples. These settings play a crucial role in the performance and convergence of the trained model, ensuring the best possible results.


Step 3: Optimized advanced configuration

In addition to the basic settings, we can explore advanced configurations to further optimize the training process. One of these is quantization: we set the precision to 4-bit. By reducing the precision of the model weights and activations, we strike a balance between model size and performance, increasing efficiency without sacrificing much quality.
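
For reference, a comparable run can also be launched from the command line instead of the web UI. The sketch below is an approximation: flag names can vary between LLaMA Factory versions, and the template and output directory are placeholders.

!llamafactory-cli train \
    --stage sft \
    --do_train \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --dataset PDF_Data \
    --template default \
    --finetuning_type lora \
    --quantization_bit 4 \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --max_samples 10000 \
    --output_dir saves/tinyllama-researcher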


Step 4: Start the training with Llama Factory

Once the model, dataset, and settings are configured, we can start the training process using Llama Factory.


Conclusion

In this discussion, we describe the process of setting up and executing Llama Factory on Google Colab, leveraging LangChain's capabilities for efficient text processing.

By following these steps, users can effectively leverage Llama Factory together with LangChain on Google Colab, opening up many possibilities for text processing and generation in various fields.
