TryOffDiff:
Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Machine Learning Group, CITEC, Bielefeld University


Predictions of TryOffDiff on samples from the VITON-HD test set.

Abstract

This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baselines based on pose transfer and virtual try-on while requiring fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction.

Model Architecture

AI model architecture proposed in the paper

Overview of TryOffDiff. Given a reference image $I$ of size $(1024\times768\times3)$, the Image Encoder (SigLIP-B/16) extracts features, resulting in a $(1024\times768)$ map. These features are refined by the Adapter for use in the cross-attention layers, similar to IP-Adapter. The adapted features, now of shape $(77\times768)$, are integrated into the Denoising U-Net (Stable Diffusion-v1.4), where they replace the text features in the cross-attention layers. The output is then processed by the VAE Decoder of SD-v1.4, which decodes the latents into pixel space, producing the Predicted Garment $\hat{G}$. The model is trained end-to-end with the Mean Squared Error (MSE) loss, following standard diffusion model training practice.
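The key shape constraint in the conditioning path is that the adapted image features must match the $(77\times768)$ shape of the CLIP text embeddings they replace in SD-v1.4's cross-attention. The toy NumPy sketch below illustrates only this shape mapping: it pools a variable number of image tokens into 77 conditioning tokens via attention pooling with learned queries. The function name `adapter` and the random weights are ours for illustration; this is not the trained module from the paper.

```python
import numpy as np

def adapter(img_feats, W_q, W_kv):
    """Map image features (N, 768) to fixed (77, 768) conditioning
    tokens via attention pooling with 77 learned query vectors."""
    k = img_feats @ W_kv                       # (N, 768) projected keys/values
    attn = W_q @ k.T / np.sqrt(768)            # (77, N) scaled dot-product scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over image tokens
    return attn @ k                            # (77, 768) adapted features

rng = np.random.default_rng(0)
sig_feats = rng.standard_normal((196, 768))    # e.g. a grid of SigLIP patch tokens
W_q  = rng.standard_normal((77, 768)) * 0.02   # learned query tokens (random here)
W_kv = rng.standard_normal((768, 768)) * 0.02  # shared key/value projection

cond = adapter(sig_feats, W_q, W_kv)
print(cond.shape)  # (77, 768) -- same shape as SD-v1.4 text embeddings
```

Because the output matches the text-embedding shape, it can be passed to the U-Net in place of the prompt features without touching the cross-attention weights themselves.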

BibTeX

@article{velioglu2024tryoffdiff,
    title     = {TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models},
    author    = {Velioglu, Riza and Bevandic, Petra and Chan, Robin and Hammer, Barbara},
    journal   = {arXiv},
    year      = {2024},
    note      = {\url{https://doi.org/nt3n}}
}