Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability as well as degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a DUal-Stream Distillation strategy that unifies distribution matching and adversarial supervision for One-step VSR. First, a Progressive Guided Distillation Initialization stabilizes subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real–Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision that leverages discriminative features from both the real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
While Distribution Matching Distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR reveals three key limitations: (1) training instability due to a distribution mismatch at initialization; (2) degraded supervision, as the frozen real score model produces spatially shifted or artifact-contaminated guidance; and (3) insufficient supervision, as the teacher distribution still falls short of real HR videos.
To address these challenges, we propose DUO-VSR, a three-stage distillation framework:
Stage I – Progressive Guided Distillation Initialization. We first perform classifier-free guidance (CFG) distillation to train a single model that matches the combined output of the conditional and unconditional branches. We then treat this model as the teacher and progressively halve the number of denoising steps until reaching a one-step model, providing a stable initialization for subsequent training.
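The step-halving schedule in Stage I can be sketched as follows. Here `distill_fn` is a hypothetical stand-in for one distillation round (the actual objective is trajectory-preserving distillation against the current teacher); each round's student becomes the next round's teacher.

```python
def halving_schedule(init_steps):
    """Step counts per distillation round, e.g. 16 -> [8, 4, 2, 1]."""
    steps, rounds = init_steps, []
    while steps > 1:
        steps //= 2
        rounds.append(steps)
    return rounds


def progressive_distillation(teacher, init_steps, distill_fn):
    """Repeatedly distill into a student with half the denoising steps.

    `distill_fn(teacher, steps)` is assumed to train and return a student
    that runs in `steps` denoising steps while matching the teacher's
    sampling trajectory; the final one-step model initializes Stage II.
    """
    model = teacher
    for steps in halving_schedule(init_steps):
        model = distill_fn(teacher=model, steps=steps)
    return model
```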
Stage II – Dual-Stream Distillation. The core of our method jointly optimizes two complementary streams through alternating updates: (1) The DMD Stream aligns the student distribution with the teacher via distribution matching distillation. (2) The RFS-GAN Stream employs both the frozen real score and fake score models as discriminator backbones, extracting intermediate features that are fed into convolutional discriminator heads with a hinge GAN objective. This provides complementary adversarial supervision from real HR videos, mitigating the adverse effects of degraded supervision and breaking the quality ceiling imposed by the teacher model.
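A minimal sketch of the hinge GAN objective used by the RFS-GAN stream, assuming the discriminator heads have already mapped the score-model features to scalar logits (the feature extraction and the convolutional heads themselves are abstracted away):

```python
import numpy as np


def hinge_d_loss(real_logits, fake_logits):
    """Hinge discriminator loss: push real logits above +1, fake below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_logits))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return real_term + fake_term


def hinge_g_loss(fake_logits):
    """Hinge generator loss: raise the discriminator's score on fakes."""
    return -np.mean(fake_logits)
```

In the alternating-update scheme described above, the discriminator heads would be trained with `hinge_d_loss` on logits from real HR videos and student outputs, while the student receives `hinge_g_loss` alongside the DMD objective.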
Stage III – Preference-Guided Refinement. We construct a synthetic preference dataset by generating multiple HR candidates per LR video and ranking them using video quality assessment models. The student is then fine-tuned with Direct Preference Optimization (DPO) to further align with perceptual quality preferences.
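For reference, the core DPO objective used in Stage III can be sketched as below. This is the generic preference loss on log-likelihoods of the preferred (`logp_w`) and dispreferred (`logp_l`) candidates relative to a frozen reference model, not the paper's exact diffusion-specific formulation, which requires a likelihood surrogate for the one-step student.

```python
import numpy as np


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's preferred-vs-dispreferred log-likelihood gap minus the
    frozen reference model's gap. Lower loss means the policy favors
    the preferred candidate more strongly than the reference does."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # logaddexp(0, -x) = log(1 + exp(-x)) = -log sigmoid(x), numerically stable
    return np.logaddexp(0.0, -beta * margin)
```

With a zero margin the loss is log 2; it decreases monotonically as the policy's preference margin over the reference grows.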
Visual comparison on synthetic (YouHQ40), real-world (VideoLQ), and AIGC (AIGC60) datasets. DUO-VSR demonstrates strong capability in reconstructing realistic textures and structures under diverse and challenging degradations, from brick-wall patterns and heavily degraded human faces to fine-grained natural fur.
DUO-VSR consistently achieves the highest or near-highest scores on no-reference perceptual metrics such as NIQE and MUSIQ across all datasets, demonstrating superior perceptual quality. On fidelity metrics, our method attains performance comparable to competing approaches. Moreover, DUO-VSR exhibits highly stable temporal coherence, as measured by the warping error (Ewarp).
@inproceedings{lv2026duovsr,
title = {DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution},
author = {Lv, Zhengyao and Xia, Menghan and Wang, Xintao and Wong, Kwan-Yee K.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}