Introduction
In my journey to dive deeper into multimodal AI systems, I decided to fine-tune BLIP-2, a powerful vision-language model, on the Flickr8k dataset to generate image captions. What made this more exciting was integrating LoRA (Low-Rank Adaptation) to fine-tune efficiently on limited compute. This post walks through the entire process: understanding BLIP-2, preparing the data, applying LoRA, training, and analyzing the results.
Why BLIP-2?
BLIP-2 stands for Bootstrapping Language-Image Pre-training, and it bridges the gap between vision and language using a modular structure:
- A frozen image encoder (a Vision Transformer).
- A Q-Former (Querying Transformer) that translates visual features into a small set of query tokens the language model can consume.
- A frozen large language model (such as Flan-T5 or OPT).

This modularity makes BLIP-2 efficient and powerful without needing massive compute resources. The quick config check below shows how the three pieces appear in the transformers implementation.
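The following snippet only reads the model config (no weights are downloaded) for the checkpoint used later in this post:

```python
# Inspect the three BLIP-2 components without downloading the model weights
from transformers import Blip2Config

config = Blip2Config.from_pretrained("Salesforce/blip2-flan-t5-xl")
print(config.vision_config.model_type)   # the frozen ViT image encoder
print(config.qformer_config.model_type)  # the Q-Former that bridges vision and language
print(config.text_config.model_type)     # the frozen language model (Flan-T5 for this checkpoint)
print(config.num_query_tokens)           # number of learned query tokens the Q-Former feeds the LM
```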
Why Flickr8k?
Flickr8k is a small but well-annotated dataset containing:
- 8,000 images
- 5 captions per image
This makes it perfect for quick fine-tuning experiments.
What is LoRA?
LoRA (Low-Rank Adaptation) is a method that:
- Injects small trainable rank-decomposition matrices into existing model weights.
- Allows large models to be fine-tuned with fewer parameters and compute.
In my case, instead of updating the whole BLIP-2 model, I used LoRA to fine-tune only small parts — making training faster and memory-efficient.
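As a toy illustration of the idea (this is a sketch, not the PEFT implementation used later): the pretrained weight stays frozen, and only a pair of small low-rank matrices A and B is trained, scaled by alpha / r.

```python
# Toy LoRA layer: y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # freeze the pretrained weight (and bias)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)     # rank-r down-projection (A)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)    # rank-r up-projection (B)
        nn.init.zeros_(self.lora_b.weight)                           # adapters start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # two small matrices vs. the full 768x768 weight
```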
Setup and Environment
Here’s what I used:
- Python 3.10
- PyTorch and Hugging Face Transformers
- BLIP-2 via the Hugging Face transformers implementation (the original Salesforce LAVIS repo is an alternative)
- PEFT library for LoRA
- Flickr8k dataset (loaded using datasets or a custom script)
```bash
pip install torch transformers datasets accelerate peft bitsandbytes
```
Data Preprocessing
I processed the Flickr8k dataset into a format suitable for BLIP-2. Each example looks like:
```python
{
    "image": <PIL.Image>,
    "caption": "A man is riding a bicycle down a hill."
}
```
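For context, here is a minimal loader sketch. It assumes the classic Flickr8k distribution: a folder of JPEGs plus Flickr8k.token.txt, whose lines pair "image_name.jpg#caption_index" with the caption text; the directory and file names below are placeholders for however the dataset was unpacked. It stores only the image path and the caption, and the image itself is opened lazily right before preprocessing (see the wrapper a little further down) to keep memory low.

```python
# Minimal Flickr8k loader sketch; paths are assumptions about how the dataset was unpacked
import os

def load_flickr8k(root="flickr8k", image_dir="Images"):
    records = []
    with open(os.path.join(root, "Flickr8k.token.txt"), encoding="utf-8") as f:
        for line in f:
            name_with_idx, caption = line.rstrip("\n").split("\t")
            image_path = os.path.join(root, image_dir, name_with_idx.split("#")[0])
            records.append({"image_path": image_path, "caption": caption})
    return records  # images are opened lazily later to keep memory low
```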
Then I used the BLIP-2 processor to turn each example into model inputs. Since the language model in this checkpoint is Flan-T5 (an encoder-decoder), the caption tokens become the decoder labels, while a short fixed prompt goes to the text encoder:

```python
from transformers import Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")

prompt = "a photo of"  # short, fixed prompt for the Flan-T5 encoder; the caption is the target

def preprocess(example):
    inputs = processor(images=example["image"], text=prompt, return_tensors="pt")
    labels = processor.tokenizer(example["caption"], padding="max_length", max_length=64, truncation=True, return_tensors="pt").input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs = {k: v.squeeze(0) for k, v in inputs.items()}      # drop the leading batch dimension
    inputs["labels"] = labels.squeeze(0)
    return inputs
```
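To feed these examples to the Trainer used later, one option is a small map-style wrapper that opens each image and builds the model inputs on the fly. This is a sketch that assumes the load_flickr8k helper from the earlier snippet; the 90/10 split is arbitrary, and grouping captions by image before splitting would give a cleaner evaluation split.

```python
# Wrap the records in train/test datasets that preprocess lazily
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kCaptions(Dataset):
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # Open the image only when the example is requested, then build the model inputs
        example = {"image": Image.open(record["image_path"]).convert("RGB"),
                   "caption": record["caption"]}
        return preprocess(example)

records = load_flickr8k()
split = int(0.9 * len(records))  # note: a cleaner split would group captions by image first
processed_dataset = {
    "train": Flickr8kCaptions(records[:split]),
    "test": Flickr8kCaptions(records[split:]),
}
```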
Model & LoRA Integration
Using the PEFT library:
```python
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    device_map="auto",
    load_in_8bit=True,  # optional, for low memory (requires bitsandbytes)
)
# Prepare the 8-bit model for training (freezes base weights, upcasts small fp16 params to fp32);
# gradient checkpointing is left off since only the small Q-Former adapters are trained
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    # "query" and "value" match the attention projections inside the Q-Former, so the
    # adapters land there rather than in the frozen ViT or the Flan-T5 language model
    target_modules=["query", "value"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
With this configuration, the LoRA adapters sit only in the Q-Former's attention layers; the frozen image encoder and the Flan-T5 language model stay untouched.
🏋️ Training the Model
I used the Hugging Face Trainer:
```python
from transformers import TrainingArguments, Trainer, default_data_collator

training_args = TrainingArguments(
    output_dir="./blip2-flickr8k-lora",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-4,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    data_collator=default_data_collator,  # inputs are fixed-size tensors, so simple stacking is enough
    tokenizer=processor.tokenizer,
)

trainer.train()
```
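After training, only the LoRA adapter weights need to be saved, which keeps checkpoints tiny compared to the full model. Here is a sketch of saving the adapter and later reloading it on top of the base model (the paths are just examples):

```python
# Save only the LoRA adapter, then reload it on top of a freshly loaded base model
from peft import PeftModel

model.save_pretrained("./blip2-flickr8k-lora/adapter")

base_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", device_map="auto", load_in_8bit=True
)
model = PeftModel.from_pretrained(base_model, "./blip2-flickr8k-lora/adapter")
```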
Results
After just 5 epochs, the model began generating accurate and human-like captions. Sample output:
Generated caption for one input image: “A man in a red shirt is playing guitar on a street.”
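For reference, here is a minimal sketch of how a caption like this is generated with the fine-tuned model. The image path is a placeholder, and the prompt mirrors the one used during preprocessing:

```python
import torch
from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # placeholder path for an input image

# Cast the pixel values to fp16 to match the 8-bit model and move everything to the model's device
inputs = processor(images=image, text="a photo of", return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```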
Evaluation
I used BLEU, CIDEr, and ROUGE for evaluation. The model performed better than when trained from scratch or without LoRA.
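BLEU and ROUGE can be computed with the evaluate library (installed separately with pip install evaluate); CIDEr is not bundled with it and needs a separate package such as pycocoevalcap. A scoring sketch with toy prediction and reference strings:

```python
# Caption scoring sketch; the prediction and reference strings below are toy examples
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["a man in a red shirt is playing guitar on a street"]
references = [["A man in a red shirt plays guitar on the street.",
               "A street performer plays his guitar."]]

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[refs[0] for refs in references]))
```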
Learnings and Challenges
- What worked: LoRA made fine-tuning on a consumer-grade GPU possible, and the modular structure of BLIP-2 helped isolate learning in the Q-Former.
- Challenges: Managing memory with such a large model (load_in_8bit helped), and tuning the learning rate was critical to avoid overfitting on such a small dataset.
What’s Next?
- Try BLIP-2 with larger datasets like COCO.
- Test zero-shot VQA (Visual Question Answering).
- Explore instruction-tuning with multimodal prompts.
Conclusion
Fine-tuning BLIP-2 with LoRA on Flickr8k was an insightful dive into multimodal AI. It showed how large vision-language models can be adapted to downstream tasks efficiently, without massive compute. If you’re a researcher or developer curious about vision-language models (VLMs), BLIP-2 is a great place to start, and LoRA makes it accessible.