QLoRA Fine-Tuning on Intel Gaudi

This guide provides the steps required to enable QLoRA fine-tuning on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerators.

QLoRA is a novel approach that reduces memory usage for fine-tuning LLMs while maintaining the performance of full 16-bit fine-tuning. It backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). QLoRA introduces several innovations to minimize memory usage without compromising performance:

  • 4-bit NormalFloat (NF4) - A new data type that is theoretically optimal for representing normally distributed weights.

  • Double quantization - Reduces average memory usage by quantizing the quantization constants themselves; the sketch after this list illustrates the savings.

  • Paged optimizers - Help manage memory spikes. For further details, refer to the QLoRA paper.
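
To put these techniques in perspective, the following is a back-of-envelope sketch (not Gaudi-specific) of the weight-memory footprint of a 20B-parameter model. The block sizes are the values reported in the QLoRA paper (64 weights per NF4 quantization block, 256 constants per double-quantization block):

    # Rough weight-memory estimate for a 20B-parameter model (e.g., gpt-neox-20b).
    # Block sizes follow the QLoRA paper: 64 for NF4, 256 for double quantization.
    params = 20e9

    bf16_bits_per_param = 16
    nf4_bits_per_param = 4
    const_bits = 32 / 64                        # one FP32 absmax per 64-weight block
    dq_const_bits = 8 / 64 + 32 / (64 * 256)    # 8-bit absmax + FP32 second-level constants

    def gib(bits_per_param):
        return bits_per_param * params / 8 / 2**30

    print(f"bf16 weights:       {gib(bf16_bits_per_param):5.1f} GiB")
    print(f"NF4 weights:        {gib(nf4_bits_per_param + const_bits):5.1f} GiB")
    print(f"NF4 + double quant: {gib(nf4_bits_per_param + dq_const_bits):5.1f} GiB")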

Intel Gaudi uses the Bitsandbytes (BNB) and Hugging Face Transformers Python libraries to run QLoRA fine-tuning.

The following is an example to help you get started with QLoRA fine-tuning on Gaudi.

  1. Install Python packages:

    pip install -q -U git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
    pip install transformers
    pip install accelerate
    pip install peft
    pip install datasets
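    
    # Optional sanity check: confirm the libraries import correctly (the printed versions
    # are illustrative; any recent releases should work)
    python -c "import bitsandbytes, transformers, peft, datasets; print(bitsandbytes.__version__, transformers.__version__, peft.__version__, datasets.__version__)"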
    
  2. Load and quantize the model to NF4, then run QLoRA fine-tuning:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    
    # Model and 4-bit quantization configuration (NF4, double quantization, bfloat16 compute)
    model_id = "EleutherAI/gpt-neox-20b"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    
    # Prepare model for k-bit training
    from peft import prepare_model_for_kbit_training
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)
    
    # Configure LoRA
    from peft import LoraConfig, get_peft_model
    
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["query_key_value"],  # fused attention projection in GPT-NeoX
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, config)
    
    # Load and prepare dataset
    from datasets import load_dataset
    
    data = load_dataset("Abirate/english_quotes")
    data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
    
    # Setup training
    import transformers
    
    # The GPT-NeoX tokenizer has no pad token by default; reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token
    
    trainer = transformers.Trainer(
        model=model,
        train_dataset=data["train"],
        args=transformers.TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=10,
            learning_rate=2e-4,
            fp16=False,  # Float16 is not supported for QLoRA fine-tuning on Gaudi
            logging_steps=1,
            output_dir="outputs",
            optim="adamw_torch"  # paged optimizers are not supported on Gaudi
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
    trainer.train()
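    
    # --- Optional: save the adapter and try a quick generation ---
    # A minimal sketch continuing the example above; the output directory and the prompt
    # are illustrative placeholders, not part of the original recipe.
    model.save_pretrained("outputs/qlora-adapter")  # stores only the small LoRA adapter weights
    
    model.config.use_cache = True                   # re-enable the KV cache for inference
    model.eval()
    inputs = tokenizer("Two things are infinite: ", return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))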
    

It is recommended to use the Hugging Face Optimum-Habana Python library to further reduce QLoRA fine-tuning time for selected models (e.g., Llama) by applying the Gaudi-specific optimizations available in the library. For reference, see the QLoRA Test Cases section in the Hugging Face Optimum-Habana repository.
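
As a rough illustration, the sketch below swaps the Trainer call from the example above for Optimum-Habana's GaudiTrainer and GaudiTrainingArguments. It is not the exact recipe from the QLoRA test cases, and the Gaudi configuration name is only an example; refer to the Optimum-Habana documentation for the supported usage:

    # A minimal sketch, assuming the model, tokenizer, and dataset from the example above
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    training_args = GaudiTrainingArguments(
        output_dir="outputs",
        use_habana=True,                    # run on Gaudi (HPU)
        use_lazy_mode=True,                 # use Gaudi lazy execution mode
        gaudi_config_name="Habana/gpt2",    # example Gaudi configuration from the Hugging Face Hub
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=10,
        learning_rate=2e-4,
        logging_steps=1,
    )

    trainer = GaudiTrainer(
        model=model,
        args=training_args,
        train_dataset=data["train"],
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()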

Note

  • Paged optimizers are not supported on Gaudi.

  • Fine-tuning with QLoRA using the Float16 data type is not supported on Gaudi.