
FSDP-QLoRA: Enabling Efficient LLM Training on Consumer GPUs

DreamWeaver · 2024-07-17


FSDP-QLoRA is a powerful technique that combines the strengths of Fully Sharded Data Parallelism (FSDP), 4-bit quantization, and Low-Rank Adaptation (LoRA) to enable the training of Large Language Models (LLMs) with up to 70 billion parameters on relatively modest hardware setups, such as systems equipped with dual 24GB GPUs. Developed through a collaboration between Answer.AI and bitsandbytes, this technique significantly lowers the barrier to entry for LLM training, making it more accessible and efficient for researchers and developers.

This article provides a concise overview of how bitsandbytes facilitates FSDP-QLoRA training, focusing on its unique approach to quantized weight storage and its integration with the Hugging Face ecosystem for streamlined training workflows.

Quantized Data Storage with bitsandbytes

A key challenge in combining FSDP with quantized LLMs stems from the data types involved. FSDP shards model parameters, optimizer states, and gradients that are stored as float data types, whereas quantized weights are typically packed into integer (uint8) tensors for memory efficiency.

bitsandbytes overcomes this limitation through its innovative use of the StoreChar functionality. This allows bitsandbytes to read and write quantized weights seamlessly, regardless of the underlying storage data type. By introducing a quant_storage parameter to the Linear4bit and Params4bit classes, bitsandbytes can store quantized weights as any FSDP-supported data type (bfloat16, float16, float32), ensuring compatibility with FSDP's sharding mechanism.
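For illustration, the same mechanism can be exercised directly on a single layer. The sketch below assumes a recent bitsandbytes release that exposes the quant_storage argument on Linear4bit; the layer dimensions are arbitrary illustration values.

import torch
import bitsandbytes as bnb

# A minimal sketch: a 4-bit linear layer whose packed weights are stored
# in bfloat16 so FSDP can shard them like any other float parameter.
layer = bnb.nn.Linear4bit(
    4096, 4096,
    compute_dtype=torch.bfloat16,   # dtype used inside the CUDA kernels
    quant_type="nf4",               # NormalFloat4 quantization
    quant_storage=torch.bfloat16,   # FSDP-compatible storage dtype
)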

In practical terms, this is achieved by setting the bnb_4bit_quant_storage parameter within the transformers.BitsAndBytesConfig class. It is crucial to ensure that the quant_storage data type aligns with the data types employed throughout the model to guarantee correct sharding by FSDP.

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

In addition to quant_storage, bitsandbytes utilizes the compute_dtype parameter to specify the data type used for computations within CUDA kernels. During computation, the 4-bit quantized weights are dequantized from the quant_storage type to the compute_dtype, with torch.bfloat16 being the recommended choice for enhanced numerical stability if supported by the hardware.
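As a quick hardware check, one way to pick the compute dtype is to fall back to float16 on GPUs without bfloat16 support. This is a simple heuristic sketch, not an official recommendation:

import torch

# Prefer bfloat16 where the GPU supports it (Ampere and newer);
# otherwise fall back to float16.
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16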

Streamlined Training with Hugging Face Integration

The integration of bitsandbytes with the Hugging Face ecosystem simplifies the process of setting up and running FSDP-QLoRA training. Libraries like Transformers, PEFT (Parameter-Efficient Fine-Tuning), and TRL (Transformer Reinforcement Learning) work seamlessly with bitsandbytes.

PEFT, in particular, offers a convenient configuration file (fsdp_config_qlora.yaml), a launch script (run_peft_qlora_fsdp.sh), and a training script (train.py) specifically designed for FSDP-QLoRA training. Detailed instructions and examples can be found in the Hugging Face documentation on "Use PEFT QLoRA and FSDP for finetuning large models on multiple GPUs".

Before initiating training, ensure that the necessary libraries are installed:

pip install -U bitsandbytes accelerate transformers peft trl

The key to enabling FSDP-QLoRA training lies in setting the bnb_4bit_quant_storage parameter within the BitsAndBytesConfig class, as highlighted earlier. This configures the storage data type for quantized weights to a float data type compatible with FSDP.

The following code snippet demonstrates the setup process:

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import LoraConfig
from trl import SFTTrainer

# ... (bnb_config definition as shown previously)
# dataset, tokenizer, training_arguments, and max_seq_length are assumed
# to be defined earlier in the training script.

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16, # Match with bnb_4bit_quant_storage
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()
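Once training completes, only the LoRA adapter weights need to be persisted, since the 4-bit base model remains frozen. A minimal sketch (the output directory name is illustrative):

# Save the trained LoRA adapter; with a PEFT model, only the small adapter
# weights are written, not the full base model.
trainer.save_model("llama-2-70b-qlora-adapter")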

Conclusion

FSDP-QLoRA, powered by bitsandbytes and its seamless integration with the Hugging Face ecosystem, significantly enhances the accessibility and efficiency of LLM training. By enabling the training of large models on consumer-grade hardware, this technique empowers a wider range of researchers and developers to engage in cutting-edge LLM research and development.