QLoRA Fine-Tuning on Intel Gaudi

This guide provides the steps required to enable QLoRA fine-tuning on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerators.

QLoRA is a novel approach that reduces memory usage for fine-tuning LLMs while maintaining the performance of full 16-bit fine-tuning. It backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). QLoRA introduces several innovations to minimize memory usage without compromising performance:

  • 4-bit NormalFloat (NF4) - A new data type that is theoretically optimal for representing normally distributed weights.

  • Double quantization - Reduces average memory usage by quantizing the quantization constants themselves; the sketch after this list illustrates the savings.

  • Paged optimizers - Help manage memory spikes. For further details, refer to the QLoRA paper.
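
To put these techniques in perspective, the following is a back-of-envelope sketch (not Gaudi-specific) of the weight-memory footprint of a 20B-parameter model. The block sizes are the values reported in the QLoRA paper (64 weights per NF4 quantization block, 256 constants per double-quantization block):

    # Rough weight-memory estimate for a 20B-parameter model (e.g., gpt-neox-20b).
    # Block sizes follow the QLoRA paper: 64 for NF4, 256 for double quantization.
    params = 20e9

    bf16_bits_per_param = 16
    nf4_bits_per_param = 4
    const_bits = 32 / 64                        # one FP32 absmax per 64-weight block
    dq_const_bits = 8 / 64 + 32 / (64 * 256)    # 8-bit absmax + FP32 second-level constants

    def gib(bits_per_param):
        return bits_per_param * params / 8 / 2**30

    print(f"bf16 weights:       {gib(bf16_bits_per_param):5.1f} GiB")
    print(f"NF4 weights:        {gib(nf4_bits_per_param + const_bits):5.1f} GiB")
    print(f"NF4 + double quant: {gib(nf4_bits_per_param + dq_const_bits):5.1f} GiB")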

Intel Gaudi uses the Bitsandbytes (BNB) and Hugging Face Transformers Python libraries to run QLoRA fine-tuning.

The following is an example to help you get started with QLoRA fine-tuning on Gaudi.

  1. Install Python packages:

    pip install -q -U git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
    pip install transformers
    pip install accelerate
    pip install peft
    pip install datasets
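    
    # Optional sanity check: confirm the libraries import correctly (the printed versions
    # are illustrative; any recent releases should work)
    python -c "import bitsandbytes, transformers, peft, datasets; print(bitsandbytes.__version__, transformers.__version__, peft.__version__, datasets.__version__)"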
    
  2. Load and quantize the model to NF4, then run QLoRA fine-tuning:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    
    # Model and 4-bit quantization configuration (NF4, double quantization, bfloat16 compute)
    model_id = "EleutherAI/gpt-neox-20b"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    
    # Prepare model for k-bit training
    from peft import prepare_model_for_kbit_training
    
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)
    
    # Configure LoRA
    from peft import LoraConfig, get_peft_model
    
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["query_key_value"],  # fused attention projection in GPT-NeoX
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = get_peft_model(model, config)
    
    # Load and prepare dataset
    from datasets import load_dataset
    
    data = load_dataset("Abirate/english_quotes")
    data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
    
    # Setup training
    import transformers
    
    # The GPT-NeoX tokenizer has no pad token by default; reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token
    
    trainer = transformers.Trainer(
        model=model,
        train_dataset=data["train"],
        args=transformers.TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=10,
            learning_rate=2e-4,
            fp16=False,  # Float16 is not supported for QLoRA fine-tuning on Gaudi
            logging_steps=1,
            output_dir="outputs",
            optim="adamw_torch"  # paged optimizers are not supported on Gaudi
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
    trainer.train()
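    
    # --- Optional: save the adapter and try a quick generation ---
    # A minimal sketch continuing the example above; the output directory and the prompt
    # are illustrative placeholders, not part of the original recipe.
    model.save_pretrained("outputs/qlora-adapter")  # stores only the small LoRA adapter weights
    
    model.config.use_cache = True                   # re-enable the KV cache for inference
    model.eval()
    inputs = tokenizer("Two things are infinite: ", return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))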
    

It is recommended to use the Hugging Face Optimum-Habana Python library to further reduce QLoRA fine-tuning time for selected models (e.g., Llama) by applying the Gaudi-specific optimizations available in the library. For reference, see the QLoRA Test Cases section in the Hugging Face Optimum-Habana repository.
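
As a rough illustration, the sketch below swaps the Trainer call from the example above for Optimum-Habana's GaudiTrainer and GaudiTrainingArguments. It is not the exact recipe from the QLoRA test cases, and the Gaudi configuration name is only an example; refer to the Optimum-Habana documentation for the supported usage:

    # A minimal sketch, assuming the model, tokenizer, and dataset from the example above
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    training_args = GaudiTrainingArguments(
        output_dir="outputs",
        use_habana=True,                    # run on Gaudi (HPU)
        use_lazy_mode=True,                 # use Gaudi lazy execution mode
        gaudi_config_name="Habana/gpt2",    # example Gaudi configuration from the Hugging Face Hub
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=10,
        learning_rate=2e-4,
        logging_steps=1,
    )

    trainer = GaudiTrainer(
        model=model,
        args=training_args,
        train_dataset=data["train"],
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()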

Note

  • Paged optimizers are not supported on Gaudi.

  • Fine-tuning with QLoRA using the Float16 data type is not supported on Gaudi.