QLoRA Fine-Tuning on Intel Gaudi
This guide provides the steps required to enable QLoRA fine-tuning on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerators.
QLoRA is a novel approach that reduces memory usage for fine-tuning LLMs while maintaining the performance of full 16-bit fine-tuning. It backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). QLoRA introduces several innovations to minimize memory usage without compromising performance:
4-bit NormalFloat (NF4) - A new data type that is theoretically optimal for representing normally distributed weights.
Double quantization - Reduces average memory usage by quantizing the quantization constants themselves.
Paged optimizers - Help manage memory spikes.
For further details, refer to QLoRA. A rough estimate of the memory savings from 4-bit quantization is sketched after this list.
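To give a sense of why these techniques matter in practice, the following is an illustrative back-of-the-envelope estimate of the weight memory for a 20B-parameter model. The block sizes and per-parameter overheads follow the QLoRA paper and are assumptions, not measured Gaudi numbers:

# Illustrative weight-memory estimate for a 20B-parameter model.
# Block sizes and per-parameter overheads follow the QLoRA paper
# (assumptions, not values measured on Gaudi).
params = 20e9
bf16_gib = params * 2 / 2**30       # 16-bit weights: 2 bytes per parameter
nf4_gib = params * 0.5 / 2**30      # 4-bit NF4 weights: 0.5 bytes per parameter
# One fp32 scale per 64-weight block adds 32/64 = 0.5 bits per parameter;
# double quantization shrinks that overhead to roughly 0.127 bits per parameter.
dq_saving_gib = params * (0.5 - 0.127) / 8 / 2**30
print(f"bf16: {bf16_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB, "
      f"double quantization saves a further ~{dq_saving_gib:.1f} GiB")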
Intel Gaudi uses the Bitsandbytes (BNB) and Hugging Face Transformers Python libraries to run QLoRA fine-tuning.
The following is an example to help you get started with QLoRA fine-tuning on Gaudi.
Install Python packages:
pip install -q -U git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
pip install transformers
pip install accelerate
pip install peft
pip install datasets
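Optionally, verify the environment before training. The snippet below assumes the Intel Gaudi PyTorch bridge (habana_frameworks) is already installed as part of the Gaudi software stack:

# Optional sanity check after installation.
import bitsandbytes
import habana_frameworks.torch.hpu as hthpu  # ships with the Gaudi software stack

print("bitsandbytes version:", bitsandbytes.__version__)
print("HPU available:", hthpu.is_available())  # should print True on a working Gaudi setup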
Load and quantize the model to NF4, then run QLoRA fine-tuning:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model configuration
model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Prepare model for k-bit training
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# Configure LoRA
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

# Load and prepare dataset
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

# Setup training
import transformers

tokenizer.pad_token = tokenizer.eos_token  # needed for the gpt-neox tokenizer
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=False,
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_torch"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings; re-enable for inference
trainer.train()
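After training completes, you can save the LoRA adapter and run a quick generation test. The following is a minimal sketch continuing from the script above; the output directory name and prompt are arbitrary examples:

# Persist the trained LoRA adapter and re-enable the KV cache for inference.
model.config.use_cache = True
model.save_pretrained("outputs/qlora-adapter")      # saves only the adapter weights
tokenizer.save_pretrained("outputs/qlora-adapter")

# Quick generation test with the fine-tuned model.
inputs = tokenizer("A famous quote about life:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))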
It is recommended to use the Hugging Face Optimum-Habana Python library to further reduce QLoRA fine-tuning time for selected models (e.g., Llama) by applying the Gaudi-specific optimizations available in the library. For reference, see the QLoRA test cases provided in the Hugging Face Optimum-Habana repository.
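For example, a minimal sketch of switching the script above to Optimum-Habana's trainer classes might look as follows. The GaudiConfig settings shown are illustrative assumptions; refer to the Optimum-Habana documentation for recommended values for your model:

# Sketch of swapping in Optimum-Habana's Trainer classes
# (continues from the model, tokenizer, and data prepared above).
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

gaudi_config = GaudiConfig()
gaudi_config.use_fused_adam = True       # illustrative; check recommended settings
gaudi_config.use_fused_clip_norm = True

training_args = GaudiTrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=10,
    learning_rate=2e-4,
    logging_steps=1,
    use_habana=True,       # run on Gaudi (HPU)
    use_lazy_mode=True,    # Gaudi lazy-mode graph execution
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=data["train"],
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()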
Note
Paged optimizers are not supported on Gaudi.
Fine-tuning with QLoRA using Float16 data type is not supported on Gaudi.