Fine-Tuning BLIP-2 with LoRA on the Flickr8k Dataset for Image Captioning

May 1, 2024

Introduction

In my journey to dive deeper into multimodal AI systems, I decided to fine-tune BLIP-2, a powerful vision-language model, on the Flickr8k dataset to generate image captions. What made this more exciting was integrating LoRA (Low-Rank Adaptation) to fine-tune efficiently on limited compute. This blog walks through the entire process: understanding BLIP-2, preparing the data, applying LoRA, training, and analyzing the results.

Why BLIP-2?

BLIP-2 stands for Bootstrapping Language-Image Pretraining 2, and it bridges the gap between vision and language using a modular structure:

  • A frozen image encoder (like a Vision Transformer).
  • A Q-Former that translates visual features into language tokens.
  • A frozen language model (like T5 or GPT-style models).

This modularity makes BLIP-2 efficient and powerful without needing massive compute resources: only the lightweight Q-Former needs to learn, while the heavy encoder and language model stay frozen. The three components also show up as separate submodules in the transformers implementation, as the short sketch below illustrates.
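To make that modularity concrete, here is a minimal sketch (my own illustration, reusing the checkpoint from later in this post) that loads the model and lists the three submodules exposed by the transformers implementation:

from transformers import Blip2ForConditionalGeneration

# Load the same checkpoint used for fine-tuning later in this post.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

# The modular pieces are separate submodules of the wrapper model:
print(type(model.vision_model).__name__)    # frozen ViT image encoder
print(type(model.qformer).__name__)         # Q-Former bridging vision and language
print(type(model.language_model).__name__)  # frozen Flan-T5 language model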

Why Flickr8k?

Flickr8k is a small but well-annotated dataset containing:

  • 8,000 images
  • 5 captions per image

This makes it perfect for quick fine-tuning experiments.

What is LoRA?

LoRA (Low-Rank Adaptation) is a method that:

  • Injects small trainable rank-decomposition matrices into existing model weights.
  • Allows large models to be fine-tuned with fewer parameters and compute.

In my case, instead of updating the whole BLIP-2 model, I used LoRA to fine-tune only small parts — making training faster and memory-efficient.
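To get a feel for the savings, here is a quick back-of-the-envelope sketch (my own illustration, using generic layer sizes rather than the exact BLIP-2 dimensions). LoRA replaces the full update of a d_out x d_in weight matrix with two low-rank factors of rank r, so the trainable parameters per adapted layer drop from d_out * d_in to r * (d_in + d_out):

# Rough illustration of LoRA's parameter savings for a single linear layer.
def lora_params(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    full = d_in * d_out        # parameters updated in a full fine-tune of W
    lora = r * (d_in + d_out)  # parameters in the low-rank update B @ A
    return full, lora

full, lora = lora_params(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  savings: {full / lora:.0f}x")
# full: 16,777,216  lora: 65,536  savings: 256x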

Setup and Environment

Here’s what I used:

  • Python 3.10
  • PyTorch and Hugging Face Transformers
  • BLIP-2 (via the official Salesforce repo or the transformers implementation)
  • PEFT library for LoRA
  • Flickr8k dataset (loaded using datasets or a custom script)

Everything installs with a single command:

pip install torch transformers datasets accelerate peft bitsandbytes

Data Preprocessing

I processed the Flickr8k dataset into a format suitable for BLIP-2. Each example looks like:

{
  "image": <PIL.Image>,
  "caption": "A man is riding a bicycle down a hill."
}
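Flickr8k doesn't come in this form out of the box, so here is a minimal sketch of the kind of custom loading script I mean. The folder and file names (Flickr8k_Dataset/, captions.txt with "image,caption" rows) are assumptions based on the commonly distributed layout; adjust them to your download:

import os
from datasets import Dataset, Image

def load_flickr8k(image_dir="Flickr8k_Dataset", captions_file="captions.txt"):
    rows = []
    with open(captions_file) as f:
        next(f)  # skip the "image,caption" header row
        for line in f:
            name, caption = line.rstrip("\n").split(",", 1)
            rows.append({"image": os.path.join(image_dir, name), "caption": caption.strip()})
    # Casting to the Image feature decodes each path into a PIL.Image on access,
    # which gives exactly the {"image": <PIL.Image>, "caption": ...} format above.
    return Dataset.from_list(rows).cast_column("image", Image())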

Then, I used the BLIP-2 image processor and tokenizer:

from transformers import Blip2Processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
def preprocess(example):
    # Pad/truncate captions to a fixed length (64 tokens, my choice) so they batch cleanly.
    inputs = processor(images=example["image"], text=example["caption"],
                       padding="max_length", max_length=64, truncation=True, return_tensors="pt")
    inputs = {k: v.squeeze(0) for k, v in inputs.items()}  # drop the extra batch dimension
    labels = inputs["input_ids"].clone()                   # caption tokens are the targets
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return {**inputs, "labels": labels}
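The processed_dataset used below can then be assembled with datasets (building on the load_flickr8k sketch above); the 90/10 split and seed are my own choices for illustration:

# Map the preprocessing over the dataset and split off a small test set.
dataset = load_flickr8k()
processed = dataset.map(preprocess, remove_columns=dataset.column_names)
processed.set_format("torch")  # hand PyTorch tensors to the Trainer
processed_dataset = processed.train_test_split(test_size=0.1, seed=42)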

Model & LoRA Integration

Using the PEFT library:

from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    device_map="auto",
    load_in_8bit=True,  # optional, for low memory
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    # Target the Q-Former's attention projections plus the projection into the
    # language model; the vision encoder and the language model itself stay frozen.
    target_modules=["query", "value", "language_projection"],
    lora_dropout=0.05,
    bias="none",
    # No task_type: BLIP-2's conditional-generation wrapper isn't a plain causal LM.
)
model = get_peft_model(model, lora_config)

I only fine-tuned the Q-Former and the interface to the language model — not the frozen encoder or language model itself.
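A quick sanity check shows just how small the trainable footprint is (print_trainable_parameters comes with the PEFT wrapper):

# Report how many parameters the LoRA adapters add relative to the full model.
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...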

Training the Model

I used the Trainer API from Hugging Face:

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./blip2-flickr8k-lora",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-4,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    tokenizer=processor.tokenizer
)
trainer.train()
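Once training finishes, captions can be generated with the model's generate method. Here is a minimal sketch (the image path, the float16 cast for the 8-bit model, and max_new_tokens are my own choices):

import torch
from PIL import Image

# Caption a single image with the LoRA-adapted model; "example.jpg" is a placeholder path.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)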

Results

After just 5 epochs, the model began generating accurate and human-like captions. Sample output:

Input Image: (not included here)

Generated Caption: “A man in a red shirt is playing guitar on a street.”

Evaluation

I used BLEU, CIDEr, and ROUGE for evaluation. The LoRA fine-tuned model performed better than both training from scratch and using the model without LoRA.
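For reference, BLEU and ROUGE can be computed with the evaluate library; CIDEr is not bundled with it, and pycocoevalcap's scorer is a common choice for that one. The captions below are placeholder data, not my actual outputs:

import evaluate

# Placeholder data: one generated caption and its reference captions.
predictions = ["a man in a red shirt is playing guitar on a street"]
references = [["a man plays guitar on the sidewalk",
               "a street musician in a red shirt performs"]]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
print(bleu["bleu"], rouge["rougeL"])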

Learnings and Challenges

  • What worked: LoRA made fine-tuning on a consumer-grade GPU possible, and the modular structure of BLIP-2 helped isolate learning in the Q-Former.
  • Challenges: memory management with large models (I used load_in_8bit), and tuning the learning rate was critical to avoid overfitting on such a small dataset.

What’s Next?

  • Try BLIP-2 with larger datasets like COCO.
  • Test zero-shot VQA (Visual Question Answering).
  • Explore instruction-tuning with multimodal prompts.

Conclusion

Fine-tuning BLIP-2 with LoRA on Flickr8k was an insightful project into multimodal AI. It showcased how we can efficiently adapt large vision-language models for downstream tasks without needing massive compute. If you’re a researcher or developer curious about VLMs (Vision-Language Models), BLIP-2 is a great place to start — and LoRA makes it accessible.