Text Image Inpainting via Global Structure-Guided Diffusion Models

ZHU S P, FANG P F, ZHU C J, et al. Text image inpainting via global structure-guided diffusion models[C]//Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI Press, 2024: 7775-7783. DOI: 10.1609/aaai.v38i7.28612

https://github.com/blackprotoss/GSDM

AAAI2024

Abstract

Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing.


InstructIR: High-Quality Image Restoration Following Human Instructions

Conde M V, Geigle G, Timofte R. InstructIR: High-Quality Image Restoration Following Human Instructions[C]//European Conference on Computer Vision. Springer, Cham, 2025. DOI: 10.1007/978-3-031-72764-1_1.

https://github.com/mv-lab/InstructIR

ECCV2024

Abstract

Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves +1dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement.


LIR: A Lightweight Baseline for Image Restoration

Fan D, Yue T, Zhao X, et al. LIR: A Lightweight Baseline for Image Restoration[J]. 2024.

https://github.com/Dongqi-Fan/LIR

Abstract

Recently, there have been significant advancements in Image Restoration based on CNNs and transformers. However, the inherent characteristics of the Image Restoration task are often overlooked in many works. They, instead, tend to focus on the basic block design and stack numerous such blocks into the model, leading to redundant parameters and unnecessary computation. Thus, the efficiency of image restoration is hindered. In this paper, we propose a Lightweight Baseline network for Image Restoration called LIR to efficiently restore the image and remove degradations. First of all, through an ingenious structural design, LIR removes the degradations existing in the local and global residual connections that are ignored by modern networks. Then, a Lightweight Adaptive Attention (LAA) Block is introduced, which is mainly composed of the proposed Adaptive Filters and Attention Blocks. The proposed Adaptive Filter is used to adaptively extract high-frequency information and enhance object contours in various IR tasks, and the Attention Block involves a novel Patch Attention module to approximate the self-attention part of the transformer. On the deraining task, our LIR achieves the state-of-the-art Structure Similarity Index Measure (SSIM) and comparable performance to state-of-the-art models on Peak Signal-to-Noise Ratio (PSNR). For denoising, dehazing, and deblurring tasks, LIR also achieves performance comparable to state-of-the-art models with only about 30% of their parameter size. In addition, it is worth noting that our LIR produces better visual results that are more in line with the human aesthetic.


Key-Graph Transformer for Image Restoration

Ren B, Li Y, Liang J, et al. Key-Graph Transformer for Image Restoration[J]. 2024.

Abstract

While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In response to these challenges, we introduce the Key-Graph Transformer (KGT) in this paper. Specifically, KGT views patch features as graph nodes. The proposed Key-Graph Constructor efficiently forms a sparse yet representative Key-Graph by selectively connecting essential nodes instead of all the nodes. Then the proposed Key-Graph Attention is conducted under the guidance of the Key-Graph only among selected nodes with linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed KGT's state-of-the-art performance, showcasing advancements both quantitatively and qualitatively.
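
The idea of attending only among selected nodes can be sketched in a few lines. This is a minimal illustration, assuming a top-k similarity rule for choosing each node's neighbours; the function names (`topk_indices`, `key_graph_attention`), the top-k rule, and the brute-force similarity computation are assumptions made for this sketch and are not taken from the paper's Key-Graph Constructor.

```python
import math


def topk_indices(scores, k):
    """Return the indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return sorted(order[:k])


def key_graph_attention(queries, keys, values, k):
    """For each query node, attend only to its k most similar key nodes.

    Hypothetical sketch: the graph is built by brute force here, so only the
    attention step (softmax and weighted sum) is sparse.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product similarity between this node and every node.
        sims = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        neighbours = topk_indices(sims, k)  # edges kept in the sparse graph
        # Softmax restricted to the selected neighbours (shifted for stability).
        peak = max(sims[j] for j in neighbours)
        weights = [math.exp(sims[j] - peak) for j in neighbours]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, neighbours):
            for c, v in enumerate(values[j]):
                out[c] += (w / total) * v
        outputs.append(out)
    return outputs
```

With `k` equal to the number of nodes this reduces to ordinary dense attention; smaller `k` drops the contribution of weakly related nodes, which is the effect the abstract describes.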


Plug-and-Play image restoration with Stochastic deNOising REgularization

Renaud M, Prost J, Leclaire A, et al. Plug-and-Play image restoration with Stochastic deNOising REgularization[J]. 2024.

https://github.com/Marien-RENAUD/SNORE

ICML2024

Abstract

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Even if they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that are less and less noisy along the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with noise of the adequate level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we prove that SNORE is competitive with respect to state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.
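
As a rough illustration of "applying the denoiser only on images with noise of the adequate level", the sketch below runs a stochastic gradient loop on a toy inpainting problem. It rests on several assumptions that are not stated in the abstract: the update uses the residual `(noisy - denoised) / sigma**2` as the regularization gradient, the denoiser is a closed-form MMSE denoiser for a Gaussian prior (a stand-in for a learned network), and all names and hyperparameters (`snore_inpaint`, `alpha`, `step`, and so on) are hypothetical.

```python
import random


def gaussian_mmse_denoiser(z, sigma, mu=0.5, tau=1.0):
    """Toy stand-in for a learned denoiser: exact MMSE for a N(mu, tau^2) prior."""
    gain = tau ** 2 / (tau ** 2 + sigma ** 2)
    return [mu + gain * (zi - mu) for zi in z]


def snore_inpaint(y, mask, sigma=0.1, alpha=1.0, step=0.1, iters=500, seed=0):
    """Stochastic gradient descent on data fidelity plus a denoising regularizer.

    y    : observed signal (values at masked-out positions are ignored)
    mask : 1 where y is observed, 0 where it is missing
    """
    rng = random.Random(seed)
    x = list(y)
    for _ in range(iters):
        # Re-noise the current iterate so the denoiser sees noise of level sigma.
        noisy = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        denoised = gaussian_mmse_denoiser(noisy, sigma)
        for i in range(len(x)):
            data_grad = mask[i] * (x[i] - y[i])  # gradient of 0.5 * mask * (x - y)^2
            reg_grad = (alpha / sigma ** 2) * (noisy[i] - denoised[i])
            x[i] -= step * (data_grad + reg_grad)
    return x
```

In this toy setting the missing entries drift toward the prior mean, while observed entries settle between the observation and the prior mean. A classical PnP loop would instead call the denoiser on `x` directly, without the re-noising line.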


Residual Denoising Diffusion Models

Liu J, Wang Q, Fan H, et al. Residual Denoising Diffusion Models[J]. IEEE, 2023. DOI: 10.1109/CVPR52733.2024.00268.

https://github.com/nachifur/RDDM

CVPR2024

Abstract

We propose residual denoising diffusion models (RDDM), a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models, initially uninterpretable for image restoration, into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically, our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration, while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty, while the noise emphasizes diversity, enabling RDDM to effectively unify tasks with varying certainty or diversity requirements, such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation, and propose a partially path-independent generation process to better understand the reverse process. Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a batch size of 1, to compete with state-of-the-art image restoration methods.
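
The split into residual diffusion and noise diffusion can be illustrated with a forward-process sketch. It assumes the residual is the degraded input minus the target image, that residual coefficients accumulate by summation, and that noise coefficients accumulate in quadrature. The function name `rddm_forward` and the uniform schedules in the test are hypothetical choices for this example.

```python
import math
import random


def rddm_forward(clean, degraded, alphas, betas, t, rng):
    """Sample the state after t forward steps of a residual-plus-noise diffusion.

    clean, degraded : flattened target image and degraded input
    alphas, betas   : per-step residual and noise coefficients (assumed schedules)
    """
    residual = [d - c for c, d in zip(clean, degraded)]
    alpha_bar = sum(alphas[:t])                            # accumulated residual weight
    beta_bar = math.sqrt(sum(b * b for b in betas[:t]))    # accumulated noise std
    return [
        c + alpha_bar * r + beta_bar * rng.gauss(0.0, 1.0)
        for c, r in zip(clean, residual)
    ]
```

When the residual coefficients sum to 1 and the noise coefficients are zero, the final state is exactly the degraded input, which is the directional path the abstract calls residual diffusion. Setting the residual to zero instead leaves pure noise diffusion.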


Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Yu F, Gu J, Li Z, et al. Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). [2025-03-19]. DOI: 10.1109/CVPR52733.2024.02425.

CVPR2024

Abstract

We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution, high-quality images for model training, each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts, broadening its application scope and potential. Moreover, we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts.


Boosting Image Restoration via Priors from Pre-trained Models

Xu X, Kong S, Hu T, et al. Boosting Image Restoration via Priors from Pre-trained Models[J]. IEEE, 2024. DOI: 10.1109/CVPR52733.2024.00280.

CVPR2024

Abstract

Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size (<1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.


From Posterior Sampling to Meaningful Diversity in Image Restoration

Cohen N, Manor H, Bahat Y, et al. From Posterior Sampling to Meaningful Diversity in Image Restoration[J]. 2023.

https://github.com/noa-cohen/MeaningfulDiversityInIR

ICLR2024

Abstract

Image restoration problems are typically ill-posed in the sense that each degraded image can be restored in infinitely many valid ways. To accommodate this, many works generate a diverse set of outputs by attempting to randomly sample from the posterior distribution of natural images given the degraded input. Here we argue that this strategy is commonly of limited practical value because of the heavy tail of the posterior distribution. Consider for example inpainting a missing region of the sky in an image. Since there is a high probability that the missing region contains no object but clouds, any set of samples from the posterior would be entirely dominated by (practically identical) completions of sky. However, arguably, presenting users with only one clear sky completion, along with several alternative solutions such as airships, birds, and balloons, would better outline the set of possibilities. In this paper, we initiate the study of meaningfully diverse image restoration. We explore several post-processing approaches that can be combined with any diverse image restoration method to yield semantically meaningful diversity. Moreover, we propose a practical approach for allowing diffusion-based image restoration methods to generate meaningfully diverse outputs, while incurring only negligible computational overhead. We conduct extensive user studies to analyze the proposed techniques, and find the strategy of reducing similarity between outputs to be significantly favorable over posterior sampling.
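
One simple way to turn a large pool of posterior samples into a small, dissimilar set is greedy farthest-point selection over feature vectors. The sketch below is only an illustration of "reducing similarity between outputs": the abstract does not say which post-processing rules the authors use, and the function name `farthest_point_subset`, the squared Euclidean distance, and the choice of the first sample as the seed are all assumptions.

```python
def farthest_point_subset(features, n_select):
    """Greedily pick indices of samples that are far apart in feature space.

    features : list of equal-length feature vectors, one per restored sample
    n_select : number of samples to present to the user
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [0]  # seed with the first sample (an arbitrary, assumed choice)
    while len(chosen) < min(n_select, len(features)):
        remaining = [i for i in range(len(features)) if i not in chosen]
        # Pick the sample whose nearest already-chosen neighbour is farthest away.
        best = max(
            remaining,
            key=lambda i: min(dist2(features[i], features[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen
```

In the sky example from the abstract, many near-identical cloud completions would cluster together in feature space, so a rule like this keeps one of them and spends the remaining slots on the rarer alternatives.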


Restoration by Generation with Constrained Priors

Ding Z, Zhang X, Tu Z, et al. Restoration by generation with constrained priors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 2567-2577.

https://github.com/adobe-research/gen2res

CVPR2024

Abstract

The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoising it. Our method is based on the observation that the space of a generative model needs to be constrained. We impose this constraint by finetuning the generative model with a set of anchor images that capture the characteristics of the input image. With the constrained space, we can then leverage the sampling strategy used for generation to do image restoration. We evaluate against previous methods and show superior performance on multiple real-world restoration datasets in preserving identity and image quality. We also demonstrate an important and practical application on personalized restoration, where we use a personal album as the anchor images to constrain the generative space. This approach allows us to produce results that accurately preserve high-frequency details, which previous works are unable to do.


VmambaIR: Visual State Space Model for Image Restoration

Shi Y, Xia B, Jin X, et al. VmambaIR: Visual State Space Model for Image Restoration[J]. IEEE Transactions on Circuits and Systems for Video Technology, PP [2025-03-19]. DOI: 10.1109/TCSVT.2025.3530090.

https://github.com/AlphacatPlus/VmambaIR

Abstract

Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To address these challenges, we propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. We utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions. Furthermore, we conducted a comprehensive evaluation of our VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that our proposed VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to the transformer and CNN architectures in serving as foundational frameworks for next-generation low-level visual tasks.


AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

Cui Y, Zamir S W, Khan S, et al. AdaIR: Adaptive all-in-one image restoration via frequency mining and modulation[J]. arXiv preprint arXiv:2403.14614, 2024.

https://github.com/c-yn/AdaIR

ICLR2025

Abstract

In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement.
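
The notion of separating low- and high-frequency content can be shown on a 1D signal with a discrete Fourier transform and a binary mask. This is a simplified sketch: the fixed `cutoff` stands in for the adaptively decoupled spectra described in the abstract, the signal is 1D rather than a feature map, and the helper names (`dft`, `idft`, `split_frequencies`) are assumptions for illustration.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed sequence."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_frequencies(signal, cutoff):
    """Split a signal into low- and high-frequency parts with a fixed mask.

    A bin k counts as low-frequency when min(k, n - k) <= cutoff, which keeps
    the mask symmetric so both parts stay real-valued.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([s * m for s, m in zip(spectrum, low_mask)])
    high = idft([s * (1.0 - m) for s, m in zip(spectrum, low_mask)])
    return low, high
```

Because the two masks are complementary, the low and high parts add back up to the original signal, so each band can be processed separately and then merged. Smooth degradations such as haze would mostly affect the low band, while noise or rain streaks would show up more in the high band, which is the kind of difference between degradation types the abstract points to.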


Priors in Deep Image Restoration and Enhancement: A Survey

Lu Y, Lin Y T, Wu H, et al. Priors in Deep Image Restoration and Enhancement: A Survey[J]. 2022.

https://github.com/VLIS2022/Awesome-Image-Prior

Abstract

Image restoration and enhancement is a process of improving the image quality by removing degradations, such as noise, blur, and resolution degradation. Deep learning (DL) has recently been applied to image restoration and enhancement. Due to its ill-posed property, plenty of works have explored priors to facilitate training deep neural networks (DNNs). However, the importance of priors has not been systematically studied and analyzed so far in the research community. Therefore, this paper serves as the first study that provides a comprehensive overview of recent advancements in priors for deep image restoration and enhancement. Our work covers five primary contents: (1) A theoretical analysis of priors for deep image restoration and enhancement; (2) A hierarchical and structural taxonomy of priors commonly used in the DL-based methods; (3) An insightful discussion on each prior regarding its principle, potential, and applications; (4) A summary of crucial problems by highlighting the potential future directions, especially adopting the large-scale foundation models as prior, to spark more research in the community; (5) An open-source repository that provides a taxonomy of all mentioned works and code links.


Distilling Semantic Priors from SAM to Efficient Image Restoration Models

Zhang Q, Liu X, Li W, et al. Distilling Semantic Priors from SAM to Efficient Image Restoration Models[J]. IEEE, 2024. DOI: 10.1109/CVPR52733.2024.02401.

CVPR2024

Abstract

In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR, compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM's semantic knowledge to boost existing IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation manner to distill the fused semantic priors to boost the performance of original IR models. Additionally, we design a semantic-guided relation (SGR) module for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks, including deraining, deblurring, and denoising.


Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration

Dudhane A, Thawakar O, Zamir S W, et al. Dynamic pre-training: Towards efficient and scalable all-in-one image restoration[J]. arXiv preprint arXiv:2404.02154, 2024.

https://github.com/akshaydudhane16/DyNet

Abstract

All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation. The requirement to tackle multiple degradations using the same model can lead to high-complexity designs with fixed configuration that lack the adaptability to more efficient alternatives. We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks. Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment with a single round of training. This seamless switching is enabled by our weights-sharing mechanism, forming the core of our architecture and facilitating the reuse of initialized module weights. Further, to establish robust weights initialization, we introduce a dynamic pre-training strategy that trains variants of the proposed DyNet concurrently, thereby achieving a 50% reduction in GPU hours. Our dynamic pre-training strategy eliminates the need for maintaining separate checkpoints for each variant, as all models share a common set of checkpoints, varying only in model depth. This efficient strategy significantly reduces storage overhead and enhances adaptability. To tackle the unavailability of large-scale dataset required in pre-training, we curate a high-quality, high-resolution image dataset named Million-IRD, having 2M image samples. We validate our DyNet for image denoising, deraining, and dehazing in all-in-one setting, achieving state-of-the-art results with 31.34% reduction in GFlops and a 56.75% reduction in parameters compared to baseline models.


SPIRE: Semantic Prompt-Driven Image Restoration

Qi C, Tu Z, Ye K, et al. SPIRE: Semantic Prompt-Driven Image Restoration[C]//European Conference on Computer Vision. Springer, Cham, 2025. DOI: 10.1007/978-3-031-73661-2_25.

ECCV2024

Abstract

Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it still remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework that leverages natural language as a user-friendly interface to control the image restoration process. We consider the capacity of prompt information in two dimensions. First, we use content-related prompts to enhance the semantic alignment, effectively alleviating identity ambiguity in the restoration outcomes. Second, our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength, without the need for explicit task-specific design. In addition, we introduce a novel fusion mechanism that augments the existing ControlNet architecture by learning to rescale the generative prior, thereby achieving better restoration fidelity. Our extensive experiments demonstrate the superior restoration performance of SPIRE compared to the state of the art, alongside offering the flexibility of text-based control over the restoration effects.

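Text-driven diffusion models are increasingly popular for image editing tasks such as inpainting, stylization, and object replacement. How to apply this language-vision paradigm to finer-grained image processing tasks (denoising, super-resolution, deblurring, and compression artifact removal) remains an open research question. This paper proposes SPIRE, a semantic and restoration prompt-driven image restoration framework that uses natural language as a user-friendly interface for controlling the restoration process. We explore the potential of prompt information along two dimensions. First, content-related prompts strengthen semantic alignment and effectively reduce identity ambiguity in the restored results. Second, the method is the first to support fine-grained control by specifying the restoration strength quantitatively in language, with no explicit task-specific design. We also propose a new fusion mechanism that extends the existing ControlNet architecture by learning to rescale the generative prior, which improves restoration fidelity. Extensive experiments show that SPIRE outperforms existing methods in restoration quality while offering flexible text-based control over the restoration effect.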


Referring Flexible Image Restoration

Guan R, Hu R, Zhou Z, et al. Referring flexible image restoration[J]. Expert Systems with Applications, 2025: 126857.

指向性灵活的图像修复

Abstract

In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases, individuals may not want to remove all degradations, for instance, a blurry lens revealing a beautiful snowy landscape (double degradations). In such scenarios, people may only desire to deblur. These situations and requirements shed light on a new challenge in image restoration, where a model must perceive and remove specific degradation types specified by human commands in images with multiple degradations. We term this task Referring Flexible Image Restoration (RFIR). To address this, we first construct a large-scale synthetic dataset called RFIR, comprising 153,423 samples with the degraded image, text prompt for specific degradation removal and restored image. RFIR consists of five basic degradation types: blur, rain, haze, low light and snow while six main sub-categories are included for varying degrees of degradation removal. To tackle the challenge, we propose a novel transformer-based multi-task model named TransRFIR, which simultaneously perceives degradation types in the degraded image and removes specific degradation upon text prompt. TransRFIR is based on two devised attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent Cross Attention (MHACA), where MHASA and MHACA introduce the agent token and reach the linear complexity, achieving lower computation cost than vanilla self-attention and cross-attention and obtaining competitive performances. Our TransRFIR achieves state-of-the-art performances compared with other counterparts and is proven as an effective architecture for image restoration.

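Real-world images often contain several degradations at once, for example rain and fog at night (a triple degradation). People do not always want every degradation removed: when a blurry lens captures a beautiful snowy landscape (a double degradation), the user may only want deblurring. These scenarios point to a new challenge in image restoration, in which a model must perceive and remove the specific degradation types named by a human instruction in an image with multiple degradations. We call this task Referring Flexible Image Restoration (RFIR). We first build a large-scale synthetic dataset, RFIR, with 153,423 samples, each consisting of a degraded image, a text prompt naming the degradation to remove, and the restored image. The dataset covers five basic degradation types (blur, rain, haze, low light, and snow) and six main sub-categories for varying degrees of degradation removal. For this task we propose TransRFIR, a transformer-based multi-task model that perceives the degradation types in the degraded image and removes the specified degradation according to the text prompt. TransRFIR is built on two attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent Cross Attention (MHACA). By introducing agent tokens, both reach linear complexity, cost less to compute than vanilla self-attention and cross-attention, and remain competitive in performance. Experiments show that TransRFIR achieves state-of-the-art performance and confirm that the architecture is effective for image restoration.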
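
To see why agent tokens give linear cost, here is a minimal single-head sketch of generic agent-style attention. It is not TransRFIR's code: the abstract does not say how MHASA and MHACA build their agent tokens or heads, so the choice of `agents` below and all function names are assumptions. With N tokens and m agents (m much smaller than N), both score matrices are m x N or N x m, never N x N.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def agent_attention(q, k, v, agents):
    """q, k, v: N x d lists of floats; agents: m x d list (hypothetical agent tokens)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # Step 1: the m agents attend to all keys and summarise the values (m x d).
    agent_scores = [[s * scale for s in row] for row in matmul(agents, transpose(k))]
    agent_values = matmul([softmax(r) for r in agent_scores], v)
    # Step 2: each query attends only to the m agents (N x m scores).
    query_scores = [[s * scale for s in row] for row in matmul(q, transpose(agents))]
    return matmul([softmax(r) for r in query_scores], agent_values)


tokens = [[0.1 * (i + j) for j in range(4)] for i in range(6)]  # N = 6, d = 4
agents = [tokens[0], tokens[-1]]                                # m = 2, chosen arbitrarily
out = agent_attention(tokens, tokens, tokens, agents)
```

Passing the same `tokens` as query, key, and value mimics self-attention; a cross-attention variant would take keys and values from the text prompt instead.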


Improving Image Restoration through Removing Degradations in Textual Representations

Lin J, Zhang Z, Wei Y, et al. Improving Image Restoration Through Removing Degradations in Textual Representations[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI:10.1109/CVPR52733.2024.00277.

https://github.com/mrluin/TextualDegRemoval

CVPR2024

通过去除文本表示中的退化来改进图像修复

Abstract

In this paper, we introduce a new perspective for improving image restoration by removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier in the text modality than in the image modality. For example, it can be easily conducted by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantages of images in detail description and those of text in degradation removal to perform restoration. To address the cross-modal assistance, we propose to map the degraded images into textual representations for removing the degradations, and then convert the restored textual representations into a guidance image for assisting image restoration. In particular, we ingeniously embed an image-to-text mapper and text restoration module into CLIP-equipped text-to-image models to generate the guidance. Then, we adopt a simple coarse-to-fine approach to dynamically inject multi-scale information from guidance to image restoration networks. Extensive experiments are conducted on various image restoration tasks, including deblurring, dehazing, deraining, denoising, and all-in-one image restoration. The results showcase that our method outperforms state-of-the-art ones across all these tasks.

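This paper offers a new perspective on improving image restoration: remove the degradation in the textual representation of a degraded image. Intuitively, restoration is much easier in the text modality than in the image modality; it can be done simply by deleting degradation-related words while keeping content-aware words. We therefore combine the strength of images in describing detail with the strength of text in removing degradation. To realize this cross-modal assistance, we map the degraded image into a textual representation, remove the degradation there, and then convert the restored textual representation into a guidance image that assists image restoration. Specifically, we embed an image-to-text mapper and a text restoration module into CLIP-equipped text-to-image models to generate the guidance. A simple coarse-to-fine approach then dynamically injects multi-scale information from the guidance into the image restoration network. We run extensive experiments on deblurring, dehazing, deraining, denoising, and all-in-one image restoration, and the results show that our method outperforms state-of-the-art methods on all of these tasks.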
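
The intuition that restoration is easy in the text modality can be shown at the word level. The sketch below is a toy only: the paper operates on textual representations inside a CLIP-equipped model, not on literal captions, and `DEGRADATION_WORDS` and `restore_caption` are hypothetical names invented for this illustration.

```python
# Hypothetical vocabulary of degradation-related words (assumption for illustration).
DEGRADATION_WORDS = {"blurry", "noisy", "hazy", "rainy", "foggy", "grainy"}


def restore_caption(caption: str) -> str:
    """Drop degradation-related words and keep the content-aware ones."""
    kept = [
        word for word in caption.split()
        if word.lower().strip(",.") not in DEGRADATION_WORDS
    ]
    return " ".join(kept)


clean = restore_caption("a blurry noisy photo of a dog on a hazy street")
```

The cleaned caption keeps every content word ("photo", "dog", "street"), which is the property the guidance image is meant to inherit.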


Exposure Bracketing Is All You Need For A High-Quality Image

Zhang Z, Zhang S, Wu R, et al. Exposure Bracketing Is All You Need For A High-Quality Image[J]. 2024.

https://github.com/cszhilu1998/BracketIRE

ICLR2025

包围曝光是获取高质量图像的关键

Abstract

It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple images. Motivated by the fact that multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution, we propose to utilize exposure bracketing photography to get a high-quality image by combining these tasks in this work. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data and then adapts it to real-world unlabeled images. In particular, a temporally modulated recurrent network (TMRNet) and self-supervised adaptation method are proposed. Moreover, we construct a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against the state-of-the-art multi-image processing ones.

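Acquiring high-quality photos with clear content in low-light environments is highly desirable but challenging. Multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress, but they typically target specific restoration or enhancement problems and do not fully exploit the potential of multiple images. Motivated by the fact that multi-exposure images are complementary for denoising, deblurring, high dynamic range imaging, and super-resolution, this work combines these tasks and uses exposure bracketing photography to obtain a high-quality image. Because real-world paired data is hard to collect, we first pre-train the model on synthetic paired data and then adapt it to real-world unlabeled images. Specifically, we propose a temporally modulated recurrent network (TMRNet) and a self-supervised adaptation method. We also build a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against state-of-the-art multi-image processing methods.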


UGPNet: Universal Generative Prior for Image Restoration

Lee H, Kang K, Lee H, et al. Ugpnet: Universal generative prior for image restoration[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024: 1598-1608.

WACV2024

UGPNet:用于图像修复的通用生成先验

Abstract

Recent image restoration methods can be broadly categorized into two classes: (1) regression methods that recover the rough structure of the original image without synthesizing high-frequency details and (2) generative methods that synthesize perceptually-realistic high-frequency details even though the resulting image deviates from the original structure of the input. While both directions have been extensively studied in isolation, merging their benefits with a single framework has been rarely studied. In this paper, we propose UGPNet, a universal image restoration framework that can effectively achieve the benefits of both approaches by simply adopting a pair of an existing regression model and a generative model. UGPNet first restores the image structure of a degraded input using a regression model and synthesizes a perceptually-realistic image with a generative model on top of the regressed output. UGPNet then combines the regressed output and the synthesized output, resulting in a final result that faithfully reconstructs the structure of the original image in addition to perceptually-realistic textures. Our extensive experiments on deblurring, denoising, and super-resolution demonstrate that UGPNet can successfully exploit both regression and generative methods for high-fidelity image restoration.

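Recent image restoration methods fall broadly into two classes: (1) regression methods, which recover the rough structure of the original image but do not synthesize high-frequency details, and (2) generative methods, which synthesize perceptually realistic high-frequency details even though the result may deviate from the original structure of the input. Both directions have been studied extensively in isolation, but combining their benefits in a single framework has rarely been explored. This paper proposes UGPNet, a universal image restoration framework that obtains the benefits of both by simply pairing an existing regression model with a generative model. UGPNet first restores the structure of the degraded input with the regression model, then synthesizes a perceptually realistic image with the generative model on top of the regressed output. It then combines the regressed and synthesized outputs into a final result that faithfully reconstructs the structure of the original image and also has perceptually realistic textures. Extensive experiments on deblurring, denoising, and super-resolution show that UGPNet successfully exploits both regression and generative methods for high-fidelity image restoration.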
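
The abstract does not say how UGPNet combines the two outputs. One simple way to picture "structure from regression, texture from generation" is a frequency split, sketched below on a 1-D signal. This is purely an assumed stand-in, not the paper's fusion module; `box_blur` and `fuse` are hypothetical helpers.

```python
def box_blur(signal, radius=1):
    """Moving average with edge clamping; a crude low-pass filter."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def fuse(regressed, synthesized, radius=1):
    """Assumed fusion: coarse structure of `regressed` plus fine detail of `synthesized`."""
    low = box_blur(regressed, radius)
    blurred_syn = box_blur(synthesized, radius)
    detail = [s - b for s, b in zip(synthesized, blurred_syn)]  # high-frequency residual
    return [l + d for l, d in zip(low, detail)]


regressed = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]    # smooth structure estimate
synthesized = [0.1, -0.1, 1.2, 0.8, 0.1, -0.1]  # same structure with added texture
fused = fuse(regressed, synthesized)
```

If the synthesized signal carries no fine detail (a constant), the result reduces to the blurred regression output; if both inputs are identical, the input is returned unchanged.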


BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Ju X, Liu X, Wang X, et al. BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion[C]//European Conference on Computer Vision. Springer, Cham, 2025. DOI:10.1007/978-3-031-72661-3_9.

https://github.com/TencentARC/BrushNet

ECCV2024

BrushNet:一种具有分解双分支扩散机制的即插即用图像修复模型

Abstract

Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these challenges, our work introduces a novel paradigm: the division of masked image features and noisy latent into separate branches. This division dramatically diminishes the model's learning load, facilitating a nuanced incorporation of essential masked image information in a hierarchical fashion. Herein, we present BrushNet, a novel plug-and-play dual-branch model engineered to embed pixel-level masked image features into any pre-trained DM, guaranteeing coherent and enhanced image inpainting outcomes. Additionally, we introduce BrushData and BrushBench to facilitate segmentation-based inpainting training and performance assessment. Our extensive experimental analysis demonstrates BrushNet's superior performance over existing models across seven key metrics, including image quality, mask region preservation, and textual coherence.

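Image inpainting, the process of restoring corrupted images, has advanced significantly with the arrival of diffusion models (DMs). Even so, current DM adaptations for inpainting, which either modify the sampling strategy or develop inpainting-specific DMs, frequently suffer from semantic inconsistency and reduced image quality. To address this, the paper introduces a new paradigm: masked image features and the noisy latent are split into separate branches. The split substantially reduces the model's learning load and allows the essential masked-image information to be incorporated hierarchically. On this basis we present BrushNet, a plug-and-play dual-branch model designed to embed pixel-level masked image features into any pre-trained DM, ensuring coherent, higher-quality inpainting results. We also introduce BrushData and BrushBench to support segmentation-based inpainting training and performance evaluation. Extensive experiments show that BrushNet outperforms existing models on seven key metrics, covering image quality, masked region preservation, and text coherence.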


MxT: Mamba x Transformer for Image Inpainting

Chen S, Atapour-Abarghouei A, Zhang H, et al. MxT: Mamba x Transformer for Image Inpainting[J]. arXiv preprint arXiv:2407.16126, 2024.

https://github.com/ChrisChen1023/MxT

BMVC2024

MxT: Mamba x Transformer 用于图像修复

Abstract

Image inpainting, or image completion, is a crucial task in computer vision that aims to restore missing or damaged regions of images with semantically coherent content. This technique requires a precise balance of local texture replication and global contextual understanding to ensure the restored image integrates seamlessly with its surroundings. Traditional methods using Convolutional Neural Networks (CNNs) are effective at capturing local patterns but often struggle with broader contextual relationships due to the limited receptive fields. Recent advancements have incorporated transformers, leveraging their ability to understand global interactions. However, these methods face computational inefficiencies and struggle to maintain fine-grained details. To overcome these challenges, we introduce MxT composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. Mamba is adept at efficiently processing long sequences with linear computational costs, making it an ideal complement to the transformer for handling long-scale data interactions. Our HM facilitates dual-level interaction learning at both pixel and patch levels, greatly enhancing the model to reconstruct images with high quality and contextual accuracy. We evaluate MxT on the widely-used CelebA-HQ and Places2-standard datasets, where it consistently outperformed existing state-of-the-art methods.

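Image inpainting, or image completion, is a key computer vision task that restores missing or damaged regions of an image with semantically coherent content. It requires a careful balance between local texture replication and global contextual understanding so that the restored region blends seamlessly with its surroundings. Traditional methods based on Convolutional Neural Networks (CNNs) capture local patterns well but, because of their limited receptive fields, often struggle with broader contextual relationships. Recent work has brought in transformers for their ability to model global interactions, but these methods are computationally inefficient and struggle to preserve fine-grained details. To overcome these challenges we introduce MxT, built from the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. Mamba processes long sequences efficiently at linear computational cost, which makes it an ideal complement to the transformer for handling long-range data interactions. The HM learns interactions at both the pixel level and the patch level, greatly improving the model's ability to reconstruct images with high quality and contextual accuracy. We evaluate MxT on the widely used CelebA-HQ and Places2-standard datasets, where it consistently outperforms existing state-of-the-art methods.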


Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

Ai Y, Huang H, Zhou X, et al. Multimodal prompt perceiver: Empower adaptiveness generalizability and fidelity for all-in-one image restoration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 25432-25444.

https://github.com/hhb072/MPerceiver-Code

CVPR2024

多模态提示感知器:增强一体化图像复原的适应性、泛化性与保真度

Abstract

Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.

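Despite substantial progress, all-in-one image restoration (IR) still struggles with complex real-world degradations. This paper introduces MPerceiver, a multimodal prompt learning approach that uses Stable Diffusion (SD) priors to improve adaptiveness, generalizability, and fidelity in all-in-one image restoration. Specifically, we develop a dual-branch module that learns two types of SD prompts: textual prompts for holistic representation and visual prompts for multiscale detail representation. Both prompts are adjusted dynamically by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. In addition, a plug-in detail refinement module improves restoration fidelity through direct encoder-to-decoder information transformation. To evaluate the method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods on most of them. After multitask pre-training, MPerceiver attains a generalized representation in low-level vision and shows remarkable zero-shot and few-shot capabilities on unseen tasks. Extensive experiments on 16 IR tasks confirm its advantages in adaptiveness, generalizability, and fidelity.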


Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model

Zheng D, Wu X M, Yang S, et al. Selective hourglass mapping for universal image restoration based on diffusion model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 25445-25455.

https://github.com/iSEE-Laboratory/DiffUIR

CVPR2024

基于扩散模型的选择性沙漏映射通用图像修复

Abstract

Universal image restoration is a practical and potential computer vision task for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g., prompt) to guide the model to learn different distributions separately, named multi-partite mapping. However, it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work, we propose an advanced selective hourglass mapping strategy based on diffusion model, termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly, we equip the model with strong condition guidance to obtain accurate generation direction of diffusion model (selective). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally, which gradually maps different distributions into a shared one. In the reverse process, combined with SDT and strong condition guidance, DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles, by only modifying the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks, 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly, by only using a lightweight model (only 0.89M), we could achieve outstanding performance.

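Universal image restoration is a practical computer vision task with strong potential for real-world applications. Its main challenge is handling different degradation distributions at once. Existing methods mainly use task-specific conditions (e.g., prompts) to guide the model to learn the different distributions separately, which is called multi-partite mapping. This is unsuitable for learning a universal model because it ignores the information shared between tasks. This work proposes an advanced selective hourglass mapping strategy based on the diffusion model, termed DiffUIR. Two considerations make DiffUIR non-trivial. First, the model is given strong condition guidance so that the diffusion model obtains an accurate generation direction (selective). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm, which gradually maps the different distributions into a shared one. In the reverse process, SDT and strong condition guidance together let DiffUIR iteratively guide the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles, and by modifying only the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks and 22 benchmarks in the universal setting and the zero-shot generalization setting. Surprisingly, even a lightweight model (only 0.89M parameters) achieves outstanding performance.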


Learning Diffusion Texture Priors for Image Restoration

Ye T, Chen S, Chai W, et al. Learning diffusion texture priors for image restoration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 2524-2534.

CVPR2024

基于扩散纹理先验的图像修复模型学习

Abstract

Diffusion Models have shown remarkable performance in image generation tasks which are capable of generating diverse and realistic image content. When adopting diffusion models for image restoration the crucial challenge lies in how to preserve high-level image fidelity in the randomness diffusion process and generate accurate background structures and realistic texture details. In this paper we propose a general framework and develop a Diffusion Texture Prior Model (DTPM) for image restoration tasks. DTPM explicitly models high-quality texture details through the diffusion process rather than global contextual content. In phase one of the training stage we pre-train DTPM on approximately 55K high-quality image samples after which we freeze most of its parameters. In phase two we insert conditional guidance adapters into DTPM and equip it with an initial predictor thereby facilitating its rapid adaptation to downstream image restoration tasks. Our DTPM could mitigate the randomness of traditional diffusion models by utilizing encapsulated rich and diverse texture knowledge and background structural information provided by the initial predictor during the sampling process. Our comprehensive evaluations of five image restoration tasks demonstrate DTPM's superiority over existing regression and diffusion-based image restoration methods in perceptual quality and its exceptional generalization capabilities.

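Diffusion models perform remarkably well in image generation and can produce diverse, realistic image content. When diffusion models are applied to image restoration, the key challenge is how to preserve high-level image fidelity within the random diffusion process while generating accurate background structures and realistic texture details. This paper proposes a general framework and develops a Diffusion Texture Prior Model (DTPM) for image restoration tasks. DTPM explicitly models high-quality texture details through the diffusion process, rather than global contextual content. In the first training phase, we pre-train DTPM on approximately 55K high-quality image samples and then freeze most of its parameters. In the second phase, we insert conditional guidance adapters into DTPM and equip it with an initial predictor, so that it adapts rapidly to downstream image restoration tasks. During sampling, DTPM reduces the randomness of traditional diffusion models by drawing on its encapsulated rich and diverse texture knowledge and on the background structural information supplied by the initial predictor. Comprehensive evaluation on five image restoration tasks shows that DTPM surpasses existing regression-based and diffusion-based restoration methods in perceptual quality and generalizes exceptionally well.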


A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Zhuang J, Zeng Y, Liu W, et al. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 195-211.

https://github.com/open-mmlab/PowerPaint

ECCV2024

一个任务对应一个词:通过任务提示学习实现高质量多功能图像修复

Abstract

Advancing image inpainting is challenging as it requires filling user-specified regions for various intents, such as background filling and object synthesis. Existing approaches focus on either context-aware filling or object synthesis using text descriptions. However, achieving both tasks simultaneously is challenging due to differing training strategies. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in multiple inpainting tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Moreover, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting, enhancing the model's applicability in shape-guided applications. Finally, we conduct extensive experiments and applications to verify the effectiveness of PowerPaint.

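Advancing image inpainting is challenging because it requires filling user-specified regions for various intents, such as background filling and object synthesis. Existing approaches focus on either context-aware filling or object synthesis from text descriptions, and achieving both at once is difficult because the two call for different training strategies. To overcome this, we introduce PowerPaint, the first high-quality, versatile inpainting model that excels at multiple inpainting tasks. First, we introduce learnable task prompts together with tailored fine-tuning strategies that explicitly guide the model's focus toward different inpainting targets. This lets PowerPaint accomplish various inpainting tasks by using different task prompts and reach state-of-the-art performance. Second, we show the versatility of the task prompt by using it as a negative prompt for object removal. We further use prompt interpolation to enable controllable shape-guided object inpainting, which broadens the model's use in shape-guided applications. Finally, extensive experiments and applications verify the effectiveness of PowerPaint.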
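
Prompt interpolation can be sketched as a linear blend of two task-prompt embeddings. The abstract only states that interpolation enables controllable shape-guided inpainting; which prompts are blended, the names `context_prompt` and `shape_prompt`, and the two-dimensional toy embeddings are assumptions made for this illustration.

```python
def interpolate_prompts(context_prompt, shape_prompt, alpha):
    """Blend two embeddings; alpha=0 returns the context prompt, alpha=1 the shape prompt."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(context_prompt) != len(shape_prompt):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - alpha) * c + alpha * s for c, s in zip(context_prompt, shape_prompt)]


context_prompt = [0.0, 2.0]  # toy embedding (assumption)
shape_prompt = [4.0, 6.0]    # toy embedding (assumption)
mixed = interpolate_prompts(context_prompt, shape_prompt, alpha=0.25)
```

In this reading, a larger `alpha` would ask the model to follow the mask shape more strictly, while a smaller one leaves more freedom to fill from context.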


Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Zhao L, Yang T, Shao W, et al. Diffree: Text-guided shape free object inpainting with diffusion model[J]. arXiv preprint arXiv:2407.16982, 2024.

https://github.com/OpenGVLab/Diffree

Diffree:基于扩散模型的文本引导自由形状物体修复

Abstract

This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.

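This paper addresses the important problem of adding objects to images with text guidance alone. The task is challenging because the new object must blend seamlessly into the image with a consistent visual context, such as lighting, texture, and spatial location. Existing text-guided image inpainting methods can add objects, but they either fail to preserve background consistency or require cumbersome human intervention to specify bounding boxes or user-scribbled masks. To tackle this, we introduce Diffree, a Text-to-Image (T2I) model that performs text-guided object addition with text as the only control. To this end, we curate OABench, a carefully built synthetic dataset created by removing objects with advanced image inpainting techniques. OABench contains 74K real-world tuples, each consisting of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree predicts the position of the new object and adds it with guidance from text alone. Extensive experiments show that Diffree adds new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.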


ALMRR: Anomaly Localization Mamba on Industrial Textured Surface with Feature Reconstruction and Refinement

Qu S, Tao X, Qu Z, et al. ALMRR: Anomaly Localization Mamba on Industrial Textured Surface with Feature Reconstruction and Refinement[C]//Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Singapore: Springer Nature Singapore, 2024: 378-391.

https://github.com/qsc1103/ALMRR

Note: open source in name only.

ALMRR:基于特征重建与优化的工业纹理表面异常定位 Mamba 方法

Abstract

Unsupervised anomaly localization on industrial textured images has achieved remarkable results through reconstruction-based methods, yet existing approaches based on image reconstruction and feature reconstruction each have their own shortcomings. Firstly, image-based methods tend to reconstruct both normal and anomalous regions well, which leads to over-generalization. Feature-based methods contain a large amount of distinguishable semantic information; however, their feature structure is redundant and lacks anomalous information, which leads to significant reconstruction errors. In this paper, we propose an Anomaly Localization method based on Mamba with Feature Reconstruction and Refinement (ALMRR), which reconstructs semantic features based on Mamba and then refines them through a feature refinement module. To equip the model with prior knowledge of anomalies, we enhance it by adding artificially simulated anomalies to the original images. Unlike image reconstruction or repair, the features of synthesized defects are repaired along with those of normal areas. Finally, the aligned features containing rich semantic information are fed into the refinement module to obtain the anomaly map. Extensive experiments have been conducted on the MVTec-AD-Textured dataset and other real-world industrial datasets, demonstrating superior performance compared to state-of-the-art (SOTA) methods.

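Reconstruction-based methods have achieved remarkable results in unsupervised anomaly localization on industrial textured images, yet existing approaches based on image reconstruction and on feature reconstruction each have shortcomings. Image-based methods tend to reconstruct both normal and anomalous regions well, which leads to over-generalization. Feature-based methods carry a large amount of distinguishable semantic information, but their feature structure is redundant and lacks anomalous information, which leads to significant reconstruction errors. This paper proposes an Anomaly Localization method based on Mamba with Feature Reconstruction and Refinement (ALMRR), which reconstructs semantic features with Mamba and then refines them through a feature refinement module. To give the model prior knowledge of anomalies, we augment the original images with artificially simulated anomalies. Unlike image reconstruction or repair, the features of the synthesized defects are repaired together with those of the normal areas. Finally, the aligned features, which contain rich semantic information, are fed into the refinement module to obtain the anomaly map. Extensive experiments on the MVTec-AD-Textured dataset and other real-world industrial datasets show superior performance compared with state-of-the-art (SOTA) methods.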


Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration

Zhang Y, Zhang H, Chai X, et al. Diff-restorer: Unleashing visual prompts for diffusion-based universal image restoration[J]. arXiv preprint arXiv:2407.03636, 2024.

Diff-Restorer:释放视觉提示的基于扩散的通用图像修复

Abstract

Image restoration is a classic low-level problem aimed at recovering high-quality images from low-quality images with various degradations such as blur, noise, rain, haze, etc. However, due to the inherent complexity and non-uniqueness of degradation in real-world images, it is challenging for a model trained for single tasks to handle real-world restoration problems effectively. Moreover, existing methods often suffer from over-smoothing and lack of realism in the restored results. To address these issues, we propose Diff-Restorer, a universal image restoration method based on the diffusion model, aiming to leverage the prior knowledge of Stable Diffusion to remove degradation while generating high perceptual quality restoration results. Specifically, we utilize the pre-trained visual language model to extract visual prompts from degraded images, including semantic and degradation embeddings. The semantic embeddings serve as content prompts to guide the diffusion model for generation. In contrast, the degradation embeddings modulate the Image-guided Control Module to generate spatial priors for controlling the spatial structure of the diffusion process, ensuring faithfulness to the original image. Additionally, we design a Degradation-aware Decoder to perform structural correction and convert the latent code to the pixel domain. We conducted comprehensive qualitative and quantitative analysis on restoration tasks with different degradations, demonstrating the effectiveness and superiority of our approach.

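Image restoration is a classic low-level problem that aims to recover high-quality images from low-quality images with degradations such as blur, noise, rain, and haze. Because degradation in real-world images is inherently complex and non-unique, a model trained for a single task struggles to handle real-world restoration effectively. Existing methods also tend to over-smooth and produce results that lack realism. To address these issues, we propose Diff-Restorer, a universal image restoration method based on the diffusion model that uses the prior knowledge of Stable Diffusion to remove degradation while generating restoration results of high perceptual quality. Specifically, we use a pre-trained visual language model to extract visual prompts from the degraded image, including semantic embeddings and degradation embeddings. The semantic embeddings serve as content prompts that guide the diffusion model's generation, while the degradation embeddings modulate the Image-guided Control Module to generate spatial priors that control the spatial structure of the diffusion process and keep the result faithful to the original image. We also design a Degradation-aware Decoder that performs structural correction and converts the latent code to the pixel domain. Comprehensive qualitative and quantitative analysis on restoration tasks with different degradations demonstrates the effectiveness and superiority of our approach.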


A Comparative Study of Image Restoration Networks for General Backbone Network Design

https://github.com/Andrew0613/X-Restormer

ECCV2024

Abstract

Despite the significant progress made by deep models in various image restoration tasks, existing image restoration networks still face challenges in terms of task generality. An intuitive manifestation is that networks which excel in certain tasks often fail to deliver satisfactory results in others. To illustrate this point, we select five representative networks and conduct a comparative study on five classic image restoration tasks. First, we provide a detailed explanation of the characteristics of different image restoration tasks and backbone networks. Following this, we present the benchmark results and analyze the reasons behind the performance disparity of different models across various tasks. Drawing from this comparative study, we propose that a general image restoration backbone network needs to meet the functional requirements of diverse tasks. Based on this principle, we design a new general image restoration backbone network, X-Restormer. Extensive experiments demonstrate that X-Restormer possesses good task generality and achieves state-of-the-art performance across a variety of tasks.


AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

https://github.com/jiangyitong/AutoDIR

ECCV2024

Abstract

We present AutoDIR, an innovative all-in-one image restoration system incorporating latent diffusion. AutoDIR excels in its ability to automatically identify and restore images suffering from a range of unknown degradations. AutoDIR offers intuitive open-vocabulary image editing, empowering users to customize and enhance images according to their preferences. Specifically, AutoDIR consists of two key stages: a Blind Image Quality Assessment (BIQA) stage, based on a semantic-agnostic vision-language model, which automatically detects unknown image degradations in input images, and an All-in-One Image Restoration (AIR) stage, which uses structural-corrected latent diffusion to handle multiple types of image degradation. Extensive experimental evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches for a wider range of image restoration tasks. The design of AutoDIR also enables flexible user control (via text prompt) and generalization to new tasks as a foundation model of image restoration.


Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems

https://github.com/mlvlab/DAVI

ECCV2024

Abstract

Recent studies on inverse problems have proposed posterior samplers that leverage the pre-trained diffusion models as powerful priors. These attempts have paved the way for using diffusion models in a wide range of inverse problems. However, the existing methods entail computationally demanding iterative sampling procedures and optimize a separate solution for each measurement, which leads to limited scalability and lack of generalization capability across unseen samples. To address these limitations, we propose a novel approach, Diffusion prior-based Amortized Variational Inference (DAVI) that solves inverse problems with a diffusion prior from an amortized variational inference perspective. Specifically, instead of separate measurement-wise optimization, our amortized inference learns a function that directly maps measurements to the implicit posterior distributions of corresponding clean data, enabling a single-step posterior sampling even for unseen measurements. Extensive experiments on image restoration tasks, e.g., Gaussian deblur, 4× super-resolution, and box inpainting with two benchmark datasets, demonstrate our approach's superior performance over strong baselines.


MambaIR: A Simple Baseline for Image Restoration with State-Space Model

https://github.com/csguoh/MambaIR

ECCV2024

Abstract

Recent years have seen significant advancements in image restoration, largely attributed to the development of modern deep neural networks, such as CNNs and Transformers. However, existing restoration backbones often face the dilemma between global receptive fields and efficient computation, hindering their application in practice. Recently, the Selective Structured State Space Model, especially the improved version Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a way to resolve the above dilemma. However, the standard Mamba still faces certain challenges in low-level vision such as local pixel forgetting and channel redundancy. In this work, we introduce a simple but effective baseline, named MambaIR, which introduces both local enhancement and channel attention to improve the vanilla Mamba. In this way, our MambaIR takes advantage of the local pixel similarity and reduces the channel redundancy. Extensive experiments demonstrate the superiority of our method, for example, MambaIR outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field.
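
Channel attention, one of the two additions the abstract names, is commonly built as a squeeze-and-excitation gate. The sketch below shows that generic form in plain Python; it is not MambaIR's module, and the weight matrices `w_down` and `w_up` as well as the toy feature maps are assumptions.

```python
import math


def channel_attention(feature_maps, w_down, w_up):
    """feature_maps: C lists of pixel values; w_down: r x C; w_up: C x r (r < C)."""
    # Squeeze: one global average per channel.
    pooled = [sum(channel) / len(channel) for channel in feature_maps]
    # Bottleneck with ReLU mixes information across channels.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    # Sigmoid gate per channel, each in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_up]
    # Excite: rescale every channel by its gate.
    scaled = [[gate * value for value in channel] for gate, channel in zip(gates, feature_maps)]
    return scaled, gates


features = [[2.0, 4.0], [1.0, 3.0]]  # C = 2 channels, 2 pixels each (toy data)
w_down = [[1.0, -1.0]]               # r = 1 (hypothetical weights)
w_up = [[1.0], [-1.0]]               # hypothetical weights
scaled, gates = channel_attention(features, w_down, w_up)
```

Channels that receive a small gate are suppressed, which is one way a network can down-weight redundant channels.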


MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

https://github.com/renyulin-f/MoE-DiffIR

ECCV2024

Abstract

We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develops the powerful mixture-of-experts (MoE) prompt module, where some basic prompts cooperate to excavate the task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, the degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design the visual-to-text adapter for MoE-DiffIR, which aims to adapt the embedding of low-quality images from the visual domain to the textual domain as the textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct one comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR have demonstrated the excellent robustness and texture restoration capability of our proposed MoE-DiffIR.
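To make the idea of degradation-aware routing over basic prompts more concrete, here is a minimal sketch under stated assumptions: a linear router scores each basic prompt from a degradation feature, the top-k prompts are kept, and their softmax-weighted sum forms the task prompt. The top-k rule, the linear router, and the name `route_prompts` are illustrative assumptions, not the authors' implementation.

```python
import math


def route_prompts(degradation_feat, router_weights, prompts, top_k=2):
    """Mix the top-k basic prompts selected by a linear router.

    degradation_feat: list of floats describing the degradation (assumed input).
    router_weights:   one weight row per basic prompt (hypothetical linear router).
    prompts:          list of basic prompt vectors of equal length.
    """
    # Router logits: one score per basic prompt.
    logits = [
        sum(w * x for w, x in zip(row, degradation_feat))
        for row in router_weights
    ]
    # Keep only the k highest-scoring prompts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax restricted to the selected prompts (shifted for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    # Weighted sum of the selected prompt vectors.
    dim = len(prompts[0])
    mixed = [sum(weights[i] * prompts[i][d] for i in top) for d in range(dim)]
    return mixed, weights


if __name__ == "__main__":
    mixed, weights = route_prompts(
        [1.0, 0.0],
        [[2.0, 0.0], [2.0, 0.0], [-5.0, 0.0]],
        [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    )
    print(mixed, weights)
```

Because different degradation features select different prompt subsets, a small pool of basic prompts can be shared across many compression tasks.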


When Fast Fourier Transform Meets Transformer for Image Restoration

https://github.com/deng-ai-lab/SFHformer

ECCV2024

Abstract

Natural images can suffer from various degradation phenomena caused by adverse atmospheric conditions or unique degradation mechanisms. Such diversity makes it challenging to design a universal framework for various kinds of restoration tasks. Instead of exploring the commonality across different degradation phenomena, existing image restoration methods focus on the modification of network architecture under limited restoration priors. In this work, we first review various degradation phenomena from a frequency perspective as a prior. Based on this, we propose an efficient image restoration framework, dubbed SFHformer, which incorporates the Fast Fourier Transform mechanism into the Transformer architecture. Specifically, we design a dual-domain hybrid structure for multi-scale receptive field modeling, in which the spatial domain and the frequency domain focus on local modeling and global modeling, respectively. Moreover, we design unique positional coding and frequency dynamic convolution for each frequency component to extract rich frequency-domain features. Extensive experiments on thirty-one restoration datasets for a range of ten restoration tasks, such as deraining, dehazing, deblurring, desnowing, denoising, super-resolution and underwater/low-light enhancement, demonstrate that our SFHformer surpasses the state-of-the-art approaches and achieves a favorable trade-off between performance, parameter size and computational cost.
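Why the frequency domain suits global modeling can be shown with the convolution theorem: multiplying each frequency component by a weight is equivalent to a circular convolution whose kernel spans the whole signal. The sketch below is a 1-D toy using a naive O(n²) DFT from the standard library rather than a real FFT, and the helper names are hypothetical; it illustrates the principle only, not SFHformer's actual modules.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); stands in for an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def frequency_filter(signal, freq_weights):
    """Scale every frequency component, then return to the spatial domain."""
    spectrum = dft(signal)
    filtered = [s * w for s, w in zip(spectrum, freq_weights)]
    return [v.real for v in dft(filtered, inverse=True)]


def circular_conv(signal, kernel):
    """Spatial-domain reference: full-length circular convolution."""
    n = len(signal)
    return [
        sum(signal[(i - j) % n] * kernel[j] for j in range(n))
        for i in range(n)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 5.0]
    kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
    via_freq = frequency_filter(signal, dft(kernel))
    via_space = circular_conv(signal, kernel)
    print(max(abs(a - b) for a, b in zip(via_freq, via_space)))
```

With an FFT the frequency-domain route costs O(n log n) regardless of kernel extent, which is the usual argument for pairing it with a local spatial branch.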


Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration

https://github.com/joshyZhou/AST

CVPR2024

Abstract

Transformer-based approaches have achieved promising performance in image restoration tasks, given their ability to model long-range dependencies, which is crucial for recovering clear images. Though diverse efficient attention mechanism designs have addressed the intensive computations associated with using transformers, they often involve redundant information and noisy interactions from irrelevant regions by considering all available tokens. In this work, we propose an Adaptive Sparse Transformer (AST) to mitigate the noisy interactions of irrelevant areas and remove feature redundancy in both spatial and channel domains. AST comprises two core designs, i.e., an Adaptive Sparse Self-Attention (ASSA) block and a Feature Refinement Feed-forward Network (FRFN). Specifically, ASSA is adaptively computed using a two-branch paradigm, where the sparse branch is introduced to filter out the negative impacts of low query-key matching scores for aggregating features, while the dense one ensures sufficient information flow through the network for learning discriminative representations. Meanwhile, FRFN employs an enhance-and-ease scheme to eliminate feature redundancy in channels, enhancing the restoration of clear latent images. Experimental results on commonly used benchmarks have demonstrated the versatility and competitive performance of our method in several tasks, including rain streak removal, real haze removal, and raindrop removal.
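The two-branch attention described above can be sketched as follows. The abstract does not specify how each branch is computed, so this sketch assumes a squared-ReLU sparse branch (scores at or below zero are dropped entirely), a softmax dense branch, and fixed blend weights `w_sparse` and `w_dense` in place of whatever adaptive weighting the authors use. All names here are hypothetical.

```python
import math


def two_branch_attention(q, k, v, w_sparse=0.5, w_dense=0.5):
    """Blend a sparse and a dense attention map over the same scores.

    q, k, v: lists of token vectors (lists of floats); k and v have equal length.
    w_sparse, w_dense: fixed blend weights (assumed; would be learned in practice).
    """
    d = len(q[0])
    out = []
    for q_row in q:
        # Scaled dot-product scores of this query against every key.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
            for k_row in k
        ]
        # Sparse branch (assumption): squared ReLU zeroes low-matching keys.
        sparse = [max(0.0, s) ** 2 for s in scores]
        # Dense branch: softmax keeps every key with a non-zero weight.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        dense = [e / z for e in exps]
        # Blend the two maps, then aggregate the values.
        attn = [w_sparse * a + w_dense * b for a, b in zip(sparse, dense)]
        out.append([
            sum(a * v_row[c] for a, v_row in zip(attn, v))
            for c in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [-1.0, 0.0]]
    v = [[1.0], [10.0]]
    print(two_branch_attention(q, k, v, w_sparse=1.0, w_dense=0.0))
    print(two_branch_attention(q, k, v, w_sparse=0.0, w_dense=1.0))
```

In the toy call, the sparse-only setting ignores the second key because its score is negative, while the dense-only setting still lets its value contribute, which mirrors the filtering versus information-flow roles the abstract assigns to the two branches.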