DIPSY: Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance

Luc Boudier*1, Loris Manganelli*1, Eleftherios Tsonis*1, Nicolas Dufour1,2, Vicky Kalogeiton1
1LIX, École Polytechnique, IP Paris, CNRS 2LIGM, École des Ponts, IP Paris, CNRS, UGE
*Equal contribution

BMVC 2025

Qualitative comparison of synthetic image generation for visually similar class pairs across datasets: British Shorthair vs Russian Blue (Pets), Risotto vs Paella (Food101), and Boeing 747-400 vs 777-300 (FGVC Aircraft). DIPSY generates semantically faithful and visually distinct images, preserving class-specific cues such as eye color in pets, food-specific textures and toppings, and structural aircraft details. Competing methods (DISEF and DataDream) often produce ambiguous results. Real images included for reference.

Abstract

Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples.

DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering.

Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.

Method Overview

DIPSY generates synthetic training data for few-shot classification using a novel dual IP-Adapter approach. The method leverages IP-Adapter and Stable Diffusion to create highly discriminative images without requiring model fine-tuning or external tools.

Dual IP-Adapter generation pipeline. An image from the target class (CLS i) provides positive conditioning (IPA+, weight wim+), while an image from a similar class (CLS j) provides negative conditioning (IPA-, weight wim-). These guide the Denoising U-Net to produce images of the target class.

Key Innovations

Extended Classifier-Free Guidance: We extend CFG to independently control text, positive image, and negative image conditioning. This provides fine-grained control over the generation process, allowing us to simultaneously enhance class-specific features while suppressing features from related classes.

Class Similarity-Based Sampling: Our strategy selects effective negative image prompts from related classes, enhancing the discriminative power of the generated images. By identifying semantically similar classes, we create stronger contrastive examples that improve classifier training.
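
A minimal sketch of this selection step is given below, assuming class similarity is measured with CLIP image embeddings of the few-shot examples. The checkpoint id, helper names, and the prototype-averaging choice are illustrative assumptions, not necessarily the exact procedure used in the paper.

# Sketch: choose a negative (contrastive) class by embedding similarity.
# Assumes CLIP image embeddings of the few-shot examples; names are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def class_prototypes(few_shot_images: dict[str, list[Image.Image]]) -> dict[str, torch.Tensor]:
    """Average the normalized CLIP image embeddings of each class's few-shot examples."""
    protos = {}
    for cls, images in few_shot_images.items():
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        protos[cls] = emb.mean(dim=0)
    return protos

def most_similar_class(target: str, protos: dict[str, torch.Tensor]) -> str:
    """Return the non-target class whose prototype is closest to the target's."""
    sims = {cls: torch.cosine_similarity(protos[target], p, dim=0).item()
            for cls, p in protos.items() if cls != target}
    return max(sims, key=sims.get)

For each target class, an image from the class returned by most_similar_class would then supply the negative image prompt during generation.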

Training-Free Pipeline: DIPSY requires no model fine-tuning, external captioning, or filtering, making it practical for real-world applications. The entire pipeline operates using pre-trained models without any additional training steps.
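
To illustrate the training-free setup, the sketch below loads a pre-trained Stable Diffusion checkpoint with IP-Adapter through the diffusers library and generates an image conditioned on a single few-shot example. Only standard positive image conditioning is shown; DIPSY's dual positive/negative guidance additionally requires the extended CFG combination described in the next section. The checkpoint ids, adapter scale, and file paths are illustrative assumptions.

# Sketch: training-free generation with pre-trained Stable Diffusion + IP-Adapter.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

# Checkpoint id is illustrative; any IP-Adapter-compatible SD checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # illustrative weight, not a value from the paper

few_shot_example = load_image("british_shorthair_01.jpg")  # hypothetical path
image = pipe(
    prompt="a photo of a British Shorthair cat",
    ip_adapter_image=few_shot_example,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("synthetic_british_shorthair.png")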

Dual IP-Adapter Guidance Framework

DIPSY leverages IP-Adapter for image-to-image translation without requiring any model fine-tuning. Our key innovation lies in extending classical classifier-free guidance (CFG) to handle dual image conditioning through positive and negative image prompts.

The framework processes few-shot examples through IP-Adapter conditioning, where positive guidance enhances class-specific feature preservation while negative guidance increases inter-class feature discrimination. This dual guidance yields synthetic images that are faithful to the target class yet distinct from its closest neighbors, improving downstream classifier performance.

DIPSY's Guidance Scheme: Given one text prompt $c_{\text{text}}$, one positive image prompt $c_{\text{im}}^{+}$, and one negative image prompt $c_{\text{im}}^{-}$, with corresponding guidance scales $w_{\text{text}}$, $w_{\text{im}}^{+}$, and $-w_{\text{im}}^{-}$, our proposed extended CFG scheme yields the following noise prediction $\hat{\epsilon}_\theta$:

$$
\hat{\epsilon}_\theta(x_t, c_{\text{text}}, c_{\text{im}}^{+}, c_{\text{im}}^{-}) = \epsilon_\theta(x_t)
+ w_{\text{text}} \bigl(\epsilon_\theta(x_t \mid c_{\text{text}}) - \epsilon_\theta(x_t)\bigr)
+ w_{\text{im}}^{+} \bigl(\epsilon_\theta(x_t \mid c_{\text{text}}, c_{\text{im}}^{+}) - \epsilon_\theta(x_t \mid c_{\text{text}})\bigr)
- w_{\text{im}}^{-} \bigl(\epsilon_\theta(x_t \mid c_{\text{text}}, c_{\text{im}}^{+}, c_{\text{im}}^{-}) - \epsilon_\theta(x_t \mid c_{\text{text}}, c_{\text{im}}^{+})\bigr)
$$
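
To make the formula concrete, here is a minimal PyTorch sketch of the guided noise prediction. The unet_eps helper, its argument layout, and the default guidance scales are illustrative assumptions rather than the paper's implementation; in practice the four U-Net evaluations can be batched into a single forward pass.

# Sketch: extended CFG combination from the equation above, as plain tensor math.
# `unet_eps` is a hypothetical helper that runs the denoising U-Net (with
# IP-Adapter) under the given conditioning; None means that condition is dropped.
import torch

def dipsy_guided_noise(
    unet_eps,           # callable: (x_t, text_cond, pos_image_cond, neg_image_cond) -> eps
    x_t: torch.Tensor,  # current noisy latents
    c_text,             # text embedding for the target class
    c_im_pos,           # IP-Adapter embedding of the positive (target-class) image
    c_im_neg,           # IP-Adapter embedding of the negative (similar-class) image
    w_text: float = 7.5,    # illustrative scales, not values from the paper
    w_im_pos: float = 0.6,
    w_im_neg: float = 0.3,
) -> torch.Tensor:
    eps_uncond  = unet_eps(x_t, None, None, None)
    eps_text    = unet_eps(x_t, c_text, None, None)
    eps_pos     = unet_eps(x_t, c_text, c_im_pos, None)
    eps_pos_neg = unet_eps(x_t, c_text, c_im_pos, c_im_neg)

    # Text guidance, positive image guidance, then subtraction of the
    # negative-image direction, exactly mirroring the equation above.
    return (eps_uncond
            + w_text   * (eps_text - eps_uncond)
            + w_im_pos * (eps_pos - eps_text)
            - w_im_neg * (eps_pos_neg - eps_pos))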

BibTeX

@inproceedings{boudier2025dipsy,
  author    = {Luc Boudier and Loris Manganelli and Eleftherios Tsonis and Nicolas Dufour and Vicky Kalogeiton},
  title     = {Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance},
  booktitle = {British Machine Vision Conference (BMVC)},
  year      = {2025},
}