[CVPR'25-Demo]

TryOffDiff:
Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Machine Learning Group, CITEC, Bielefeld University

Problem Statement

Illustration of the relationship between Virtual Try-On (VTON) and Virtual Try-Off (VTOFF).
Left: VTON focuses on transferring garments onto a person. Right: VTOFF aims to reconstruct the garment itself from a worn instance. These two tasks are inherently cyclic: the output of one can serve as the input to the other.
This cyclical connection offers two benefits: (1) improved training through consistency constraints in the loss function [54], and (2) the generation of synthetic data for both tasks [9, 40]; one illustrative form of such a constraint is sketched below. Such synergy not only enhances model robustness but also opens new possibilities in fashion image generation. Despite the growing interest in VTON, its inverse task, VTOFF, has received limited attention and has not been formally defined in the literature. Furthermore, current VTON methods often assume the availability of clean product images. Eliminating this dependency reduces the need for expensive photography and manual editing, thereby enabling smaller vendors to produce professional-quality visuals more affordably.
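As an illustrative form of such a consistency constraint (an assumption for exposition, not necessarily the formulation used in [54]), let $f$ denote a VTOFF model, $g$ a VTON model, $I$ a person image, $P$ the corresponding person representation, and $G$ the product image. A round-trip consistency loss could then read

$$\mathcal{L}_{\text{cyc}} = \big\| g\big(f(I),\, P\big) - I \big\|_2^2 \;+\; \big\| f\big(g(G,\, P)\big) - G \big\|_2^2,$$

penalizing both directions of the cycle: try-off followed by try-on should recover the person image, and try-on followed by try-off should recover the product image.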

Method

Overview of TryOffDiff. Given a reference image $I$ of size $(1024\times768\times3)$, the Image Encoder (SigLIP-B/16) extracts a feature map of shape $(1024\times768)$, i.e., 1024 tokens of dimension 768. These features are refined by the Adapter, following the IP-Adapter design, to align them with the U-Net's cross-attention layers. The adapted features, now of shape $(77\times768)$, are injected into the Denoising U-Net of Stable Diffusion-v1.4 by replacing the default text features in the cross-attention layers. A class label $c \in \{\text{"upper-body"},\ \text{"lower-body"},\ \text{"dresses"}\}$ is mapped to a learnable embedding and added to the timestep embedding of the diffusion model, conditioning the generation process on garment type. The final latent output is decoded into pixel space by the VAE Decoder of SD-v1.4, yielding the reconstructed garment image $\hat{G}$.
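This conditioning path can be sketched in a few lines of PyTorch with the diffusers library. The Adapter layers and all variable names below are assumptions for illustration, not the released implementation; `num_class_embeds` is diffusers' built-in mechanism for summing a learnable class embedding with the timestep embedding, which matches the class conditioning described above.

```python
import torch
from diffusers import UNet2DConditionModel

class Adapter(torch.nn.Module):
    """Minimal sketch: project SigLIP tokens (B, 1024, 768) to (B, 77, 768),
    the shape of the CLIP text embeddings they replace in SD-v1.4's
    cross-attention. The actual adapter layers may differ."""
    def __init__(self, in_tokens=1024, out_tokens=77, dim=768):
        super().__init__()
        self.token_proj = torch.nn.Linear(in_tokens, out_tokens)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, feats):                        # (B, 1024, 768)
        x = self.token_proj(feats.transpose(1, 2))   # (B, 768, 77)
        return self.norm(x.transpose(1, 2))          # (B, 77, 768)

# num_class_embeds=3 adds an nn.Embedding whose output is summed with the
# timestep embedding; its weights are newly initialized and must be trained.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", num_class_embeds=3
)

adapter = Adapter()
siglip_feats = torch.randn(2, 1024, 768)    # stand-in for SigLIP-B/16 output
cond = adapter(siglip_feats)                # (2, 77, 768)
noisy_latents = torch.randn(2, 4, 64, 64)   # noised VAE latents
timesteps = torch.randint(0, 1000, (2,))
garment_class = torch.tensor([0, 2])        # 0: upper, 1: lower, 2: dresses

noise_pred = unet(noisy_latents, timesteps,
                  encoder_hidden_states=cond,     # replaces text features
                  class_labels=garment_class).sample
```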
We train four separate models: one each for upper-body garments, lower-body garments, and dresses, plus a multi-garment model that handles all three classes. All models are trained end-to-end with a Mean Squared Error (MSE) loss, following standard diffusion model training, as sketched below.
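Concretely, "standard diffusion training" here means the usual noise-prediction objective: encode the garment image into VAE latents, add noise at a random timestep, predict that noise under the image and class conditioning, and regress it with MSE. The sketch below assumes the conditioning modules from above and hypothetical batch fields (`garment`, `reference`, `garment_class`); it is not the released training code.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler

vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae")
scheduler = DDPMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler")

def training_step(batch, unet, adapter, image_encoder):
    # Encode the clean garment image G into VAE latent space (VAE frozen).
    with torch.no_grad():
        latents = vae.encode(batch["garment"]).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

    # Sample noise and a random timestep, then noise the latents.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Condition on the reference image I and the garment class label.
    cond = adapter(image_encoder(batch["reference"]))  # (B, 77, 768)
    pred = unet(noisy_latents, t, encoder_hidden_states=cond,
                class_labels=batch["garment_class"]).sample

    # MSE between predicted and true noise (epsilon-prediction objective).
    return F.mse_loss(pred, noise)
```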

Results

Predictions of TryOffDiff on samples from the VITON-HD test set.

BibTeX

@article{velioglu2024tryoffdiff,
    title     = {TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models},
    author    = {Velioglu, Riza and Bevandic, Petra and Chan, Robin and Hammer, Barbara},
    journal   = {arXiv preprint arXiv:2411.18350},
    year      = {2024},
    note      = {\url{https://doi.org/nt3n}}
}
@article{velioglu2025enhancing,
    title     = {Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off},
    author    = {Velioglu, Riza and Bevandic, Petra and Chan, Robin and Hammer, Barbara},
    journal   = {arXiv preprint arXiv:2504.13078},
    year      = {2025},
    note      = {\url{https://doi.org/pn67}}
}