Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability as well as degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a DUal-Stream Distillation strategy that unifies distribution matching and adversarial supervision for One-step VSR. First, a Progressive Guided Distillation Initialization stabilizes subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real–Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision that leverages discriminative features from both the real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
While Distribution Matching Distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR reveals three key limitations: (1) training instability due to a distribution mismatch at initialization; (2) degraded supervision, as the frozen real score model produces spatially shifted or artifact-contaminated guidance; and (3) insufficient supervision, as the teacher distribution still falls short of real HR videos.
To address these challenges, we propose DUO-VSR, a three-stage distillation framework:
Stage I – Progressive Guided Distillation Initialization. We first perform classifier-free guidance (CFG) distillation to train a single model that matches the combined output of the conditional and unconditional branches. We then treat this model as the teacher and progressively halve the number of denoising steps until reaching a one-step model, providing a stable initialization for subsequent training.
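The step-halving schedule in Stage I can be sketched as follows. Here `distill_fn` is a hypothetical stand-in for one distillation round (the actual objective is trajectory-preserving distillation against the current teacher); each round's student becomes the next round's teacher.

```python
def halving_schedule(init_steps):
    """Step counts per distillation round, e.g. 16 -> [8, 4, 2, 1]."""
    steps, rounds = init_steps, []
    while steps > 1:
        steps //= 2
        rounds.append(steps)
    return rounds


def progressive_distillation(teacher, init_steps, distill_fn):
    """Repeatedly distill into a student with half the denoising steps.

    `distill_fn(teacher, steps)` is assumed to train and return a student
    that runs in `steps` denoising steps while matching the teacher's
    sampling trajectory; the final one-step model initializes Stage II.
    """
    model = teacher
    for steps in halving_schedule(init_steps):
        model = distill_fn(teacher=model, steps=steps)
    return model
```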
Stage II – Dual-Stream Distillation. The core of our method jointly optimizes two complementary streams through alternating updates: (1) The DMD Stream aligns the student distribution with the teacher via distribution matching distillation. (2) The RFS-GAN Stream employs both the frozen real score and fake score models as discriminator backbones, extracting intermediate features that are fed into convolutional discriminator heads with a hinge GAN objective. This provides complementary adversarial supervision from real HR videos, mitigating the adverse effects of degraded supervision and breaking the quality ceiling imposed by the teacher model.
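A minimal sketch of the hinge GAN objective used by the RFS-GAN stream, assuming the discriminator heads have already mapped the score-model features to scalar logits (the feature extraction and the convolutional heads themselves are abstracted away):

```python
import numpy as np


def hinge_d_loss(real_logits, fake_logits):
    """Hinge discriminator loss: push real logits above +1, fake below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_logits))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return real_term + fake_term


def hinge_g_loss(fake_logits):
    """Hinge generator loss: raise the discriminator's score on fakes."""
    return -np.mean(fake_logits)
```

In the alternating-update scheme described above, the discriminator heads would be trained with `hinge_d_loss` on logits from real HR videos and student outputs, while the student receives `hinge_g_loss` alongside the DMD objective.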
Stage III – Preference-Guided Refinement. We construct a synthetic preference dataset by generating multiple HR candidates per LR video and ranking them using video quality assessment models. The student is then fine-tuned with Direct Preference Optimization (DPO) to further align with perceptual quality preferences.
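For reference, the core DPO objective used in Stage III can be sketched as below. This is the generic preference loss on log-likelihoods of the preferred (`logp_w`) and dispreferred (`logp_l`) candidates relative to a frozen reference model, not the paper's exact diffusion-specific formulation, which requires a likelihood surrogate for the one-step student.

```python
import numpy as np


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's preferred-vs-dispreferred log-likelihood gap minus the
    frozen reference model's gap. Lower loss means the policy favors
    the preferred candidate more strongly than the reference does."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # logaddexp(0, -x) = log(1 + exp(-x)) = -log sigmoid(x), numerically stable
    return np.logaddexp(0.0, -beta * margin)
```

With a zero margin the loss is log 2; it decreases monotonically as the policy's preference margin over the reference grows.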
Visual comparison on synthetic (YouHQ40), real-world (VideoLQ), and AIGC (AIGC60) datasets. DUO-VSR demonstrates strong capability in reconstructing realistic textures and structures under diverse and challenging degradations, from brick-wall patterns and heavily degraded human faces to fine-grained natural fur.
DUO-VSR consistently achieves the highest or near-highest scores on no-reference perceptual metrics such as NIQE and MUSIQ across all datasets, demonstrating superior perceptual quality. On fidelity metrics, our method attains performance comparable to competing approaches. Moreover, DUO-VSR exhibits highly stable temporal coherence, as measured by the warping error (Ewarp).
@inproceedings{lv2026duovsr,
title = {DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution},
author = {Lv, Zhengyao and Xia, Menghan and Wang, Xintao and Wong, Kwan-Yee K.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}