# Arxiv Papers in cs.CV on 2023-04-20

### Jedi: Entropy-based Localization and Removal of Adversarial Patches
- **Arxiv ID**: http://arxiv.org/abs/2304.10029v1
- **DOI**: None
- **Categories**: **cs.CR**, cs.CV, cs.LG
- **Links**: [PDF](http://arxiv.org/pdf/2304.10029v1)
- **Published**: 2023-04-20 00:56:59+00:00
- **Updated**: 2023-04-20 00:56:59+00:00
- **Authors**: Bilel Tarchoun, Anouar Ben Khalifa, Mohamed Ali Mahjoub, Nael Abu-Ghazaleh, Ihsen Alouani
- **Comment**: 9 pages, 11 figures. To appear in CVPR 2023
- **Journal**: None
- **Summary**: Real-world adversarial physical patches were shown to be successful in compromising state-of-the-art models in a variety of computer vision applications. Existing defenses that are based on either input gradient or features analysis have been compromised by recent GAN-based attacks that generate naturalistic patches. In this paper, we propose Jedi, a new defense against adversarial patches that is resilient to realistic patch attacks. Jedi tackles the patch localization problem from an information theory perspective and leverages two new ideas: (1) it improves the identification of potential patch regions using entropy analysis: we show that the entropy of adversarial patches is high, even in naturalistic patches; and (2) it improves the localization of adversarial patches, using an autoencoder that is able to complete patch regions from high entropy kernels. Jedi achieves high-precision adversarial patch localization, which we show is critical to successfully repair the images. Since Jedi relies on an input entropy analysis, it is model-agnostic, and can be applied on pre-trained off-the-shelf models without changes to the training or inference of the protected models. Jedi detects on average 90% of adversarial patches across different benchmarks and recovers up to 94% of successful patch attacks (compared to 75% and 65% for LGS and Jujutsu, respectively).
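
As a rough illustration of the entropy analysis mentioned in the Jedi summary, the sketch below computes the Shannon entropy of pixel intensities in sliding windows and flags high-entropy windows as candidate patch regions. The window size, stride, threshold, toy image, and function names are all assumptions made for this example and are not taken from the paper; the autoencoder-based completion step is not shown.

```python
import math
import random
from collections import Counter


def window_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of pixel intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_map(image, size=8, stride=8):
    """Entropy of each size x size window of a grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    return [
        [
            window_entropy([image[r + i][c + j] for i in range(size) for j in range(size)])
            for c in range(0, width - size + 1, stride)
        ]
        for r in range(0, height - size + 1, stride)
    ]


def candidate_windows(image, size=8, stride=8, threshold=4.0):
    """Grid indices (row, col) of windows whose entropy exceeds the threshold."""
    emap = entropy_map(image, size, stride)
    return [(i, j) for i, row in enumerate(emap) for j, h in enumerate(row) if h > threshold]


# Toy input (assumption): a flat 32 x 32 image with one noisy 8 x 8 block
# standing in for an adversarial patch.
rng = random.Random(0)
image = [[128] * 32 for _ in range(32)]
for r in range(8, 16):
    for c in range(16, 24):
        image[r][c] = rng.randrange(256)

print(candidate_windows(image))
```

On this toy image only the noisy block is flagged, since flat windows have zero entropy. Real images would need a threshold chosen from data rather than the fixed value used here.
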
### Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations
- **Arxiv ID**: http://arxiv.org/abs/2304.10036v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10036v1)
- **Published**: 2023-04-20 01:23:36+00:00
- **Updated**: 2023-04-20 01:23:36+00:00
- **Authors**: Benjamin Ramtoula, Matthew Gadd, Paul Newman, Daniele De Martini
- **Comment**: Published at CVPR 2023. Project page with code: https://bramtoula.github.io/vdna/
- **Journal**: None
- **Summary**: Selecting appropriate datasets is critical in modern computer vision. However, no general-purpose tools exist to evaluate the extent to which two datasets differ. For this, we propose representing images - and by extension datasets - using Distributions of Neuron Activations (DNAs). DNAs fit distributions, such as histograms or Gaussians, to activations of neurons in a pre-trained feature extractor through which we pass the image(s) to represent. This extractor is frozen for all datasets, and we rely on its generally expressive power in feature space. By comparing two DNAs, we can evaluate the extent to which two datasets differ with granular control over the comparison attributes of interest, providing the ability to customise the way distances are measured to suit the requirements of the task at hand. Furthermore, DNAs are compact, representing datasets of any size with less than 15 megabytes. We demonstrate the value of DNAs by evaluating their applicability on several tasks, including conditional dataset comparison, synthetic image evaluation, and transfer learning, and across diverse datasets, ranging from synthetic cat images to celebrity faces and urban driving scenes.
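
To make the idea of fitting and comparing distributions of activations concrete, here is a minimal sketch that fits one histogram per neuron and compares two datasets with a per-neuron earth mover's distance. It assumes the activations have already been extracted by some frozen network (not shown); the bin count, value range, choice of distance, and all function names are assumptions for illustration and do not reproduce the authors' code.

```python
from itertools import accumulate


def fit_histogram(values, bins=10, lo=0.0, hi=1.0):
    """Normalised histogram of one neuron's activations, clipped to [lo, hi]."""
    counts = [0] * bins
    for v in values:
        clipped = min(max(v, lo), hi)
        index = min(int((clipped - lo) / (hi - lo) * bins), bins - 1)
        counts[index] += 1
    return [n / len(values) for n in counts]


def fit_dna(activations, bins=10, lo=0.0, hi=1.0):
    """One histogram per neuron; activations holds one activation list per image."""
    return [fit_histogram(column, bins, lo, hi) for column in zip(*activations)]


def emd(hist_a, hist_b):
    """1-D earth mover's distance between two histograms, in units of bins."""
    return sum(abs(x) for x in accumulate(a - b for a, b in zip(hist_a, hist_b)))


def dna_distance(dna_a, dna_b, neurons=None):
    """Mean per-neuron EMD, optionally restricted to a subset of neuron indices."""
    selected = range(len(dna_a)) if neurons is None else neurons
    return sum(emd(dna_a[j], dna_b[j]) for j in selected) / len(selected)


# Toy activations (assumption): two images per dataset, two neurons each.
dna_a = fit_dna([[0.15, 0.5], [0.15, 0.5]])
dna_b = fit_dna([[0.65, 0.5], [0.65, 0.5]])
print(dna_distance(dna_a, dna_b), dna_distance(dna_a, dna_b, neurons=[1]))
```

The optional `neurons` argument is one simple way to mimic the "granular control" the summary mentions: restricting the comparison to a subset of neurons changes which attributes contribute to the distance.
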
### Open-World Continual Learning: Unifying Novelty Detection and Continual Learning
- **Arxiv ID**: http://arxiv.org/abs/2304.10038v1
- **DOI**: None
- **Categories**: **cs.LG**, cs.AI, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2304.10038v1)
- **Published**: 2023-04-20 01:32:32+00:00
- **Updated**: 2023-04-20 01:32:32+00:00
- **Authors**: Gyuhak Kim, Changnan Xiao, Tatsuya Konishi, Zixuan Ke, Bing Liu
- **Comment**: arXiv admin note: substantial text overlap with arXiv:2211.02633, arXiv:2208.09734
- **Journal**: None
- **Summary**: As AI agents are increasingly used in the real open world with unknowns or novelties, they need the ability to (1) recognize objects that (i) they have learned and (ii) detect items that they have not seen or learned before, and (2) learn the new items incrementally to become more and more knowledgeable and powerful. (1) is called novelty detection or out-of-distribution (OOD) detection and (2) is called class incremental learning (CIL), which is a setting of continual learning (CL). In existing research, OOD detection and CIL are regarded as two completely different problems. This paper theoretically proves that OOD detection actually is necessary for CIL. We first show that CIL can be decomposed into two sub-problems: within-task prediction (WP) and task-id prediction (TP). We then prove that TP is correlated with OOD detection. The key theoretical result is that regardless of whether WP and OOD detection (or TP) are defined explicitly or implicitly by a CIL algorithm, good WP and good OOD detection are necessary and sufficient conditions for good CIL, which unifies novelty or OOD detection and continual learning (CIL, in particular). A good CIL algorithm based on our theory can naturally be used in open world learning, which is able to perform both novelty/OOD detection and continual learning.
Based on the theoretical result, new CIL methods are also designed, which outperform strong baselines in terms of CIL accuracy and its continual OOD detection by a large margin.

### Brain tumor multi classification and segmentation in MRI images using deep learning
- **Arxiv ID**: http://arxiv.org/abs/2304.10039v2
- **DOI**: None
- **Categories**: **eess.IV**, cs.CV, cs.LG
- **Links**: [PDF](http://arxiv.org/pdf/2304.10039v2)
- **Published**: 2023-04-20 01:32:55+00:00
- **Updated**: 2023-06-23 21:29:42+00:00
- **Authors**: Belal Amin, Romario Sameh Samir, Youssef Tarek, Mohammed Ahmed, Rana Ibrahim, Manar Ahmed, Mohamed Hassan
- **Comment**: Accepted by artificial intelligence and applications Journal
- **Journal**: None
- **Summary**: This study proposes a deep learning model for the classification and segmentation of brain tumors from magnetic resonance imaging (MRI) scans. The classification model is based on the EfficientNetB1 architecture and is trained to classify images into four classes: meningioma, glioma, pituitary adenoma, and no tumor. The segmentation model is based on the U-Net architecture and is trained to accurately segment the tumor from the MRI images. The models are evaluated on a publicly available dataset and achieve high accuracy and segmentation metrics, indicating their potential for clinical use in the diagnosis and treatment of brain tumors.

### Neural Radiance Fields: Past, Present, and Future
- **Arxiv ID**: http://arxiv.org/abs/2304.10050v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10050v1)
- **Published**: 2023-04-20 02:17:08+00:00
- **Updated**: 2023-04-20 02:17:08+00:00
- **Authors**: Ansh Mittal
- **Comment**: 122 pages, 9 figures, 272 citations
- **Journal**: None
- **Summary**: The various aspects like modeling and interpreting 3D environments and surroundings have enticed humans to progress their research in 3D Computer Vision, Computer Graphics, and Machine Learning.
An attempt made by Mildenhall et al. in their paper about NeRFs (Neural Radiance Fields) led to a boom in Computer Graphics, Robotics, Computer Vision, and the possible scope of High-Resolution Low Storage Augmented Reality and Virtual Reality-based 3D models have gained traction from researchers with more than 500 preprints related to NeRFs published. This paper serves as a bridge for people starting to study these fields by building on the basics of Mathematics, Geometry, Computer Vision, and Computer Graphics to the difficulties encountered in Implicit Representations at the intersection of all these disciplines. This survey provides the history of rendering, Implicit Learning, and NeRFs, the progression of research on NeRFs, and the potential applications and implications of NeRFs in today's world. In doing so, this survey categorizes all the NeRF-related research in terms of the datasets used, objective functions, applications solved, and evaluation criteria for these applications.

### Complex Mixer for MedMNIST Classification Decathlon
- **Arxiv ID**: http://arxiv.org/abs/2304.10054v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10054v1)
- **Published**: 2023-04-20 02:34:36+00:00
- **Updated**: 2023-04-20 02:34:36+00:00
- **Authors**: Zhuoran Zheng, Xiuyi Jia
- **Comment**: None
- **Journal**: None
- **Summary**: With the development of the medical image field, researchers seek to develop a class of datasets to block the need for medical knowledge, such as MedMNIST (v2). MedMNIST (v2) includes a large number of small-sized ($28 \times 28$ or $28 \times 28 \times 28$) medical samples and the corresponding expert annotations (class label). The existing baseline model (Google AutoML Vision, ResNet-50+3D) can reach an average accuracy of over 70% on MedMNIST (v2) datasets, which is comparable to the performance of expert decision-making.
Nevertheless, we note that there are two insurmountable obstacles to modeling on MedMNIST (v2): 1) the raw images are cropped to low scales, which may cause effective recognition information to be dropped and the classifier to have difficulty in tracing accurate decision boundaries; 2) the labelers' subjective insight may cause many uncertainties in the label space. To address these issues, we develop a Complex Mixer (C-Mixer) with a pre-training framework to alleviate the problem of insufficient information and uncertainty in the label space by introducing an incentive imaginary matrix and a self-supervised scheme with random masking. Our method (incentive learning and self-supervised learning with masking) shows surprising potential on the standard MedMNIST (v2) dataset, the customized weakly supervised datasets, and other image enhancement tasks.

### Is Cross-modal Information Retrieval Possible without Training?
- **Arxiv ID**: http://arxiv.org/abs/2304.11095v1
- **DOI**: None
- **Categories**: **cs.LG**, cs.AI, cs.CL, cs.CV, cs.IR
- **Links**: [PDF](http://arxiv.org/pdf/2304.11095v1)
- **Published**: 2023-04-20 02:36:18+00:00
- **Updated**: 2023-04-20 02:36:18+00:00
- **Authors**: Hyunjin Choi, Hyunjae Lee, Seongho Joe, Youngjune L. Gwon
- **Comment**: None
- **Journal**: Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, Proceedings, Part II
- **Summary**: Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of its own, but it can be semantically aligned to another by a simple mapping without training a deep neural net.
In this paper, we take a simple mapping computed from the least squares and singular value decomposition (SVD) for a solution to the Procrustes problem to serve as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with the aforementioned simple cross-modal mappings in tasks of text-to-image and image-to-text retrieval. Despite simplicity, our mappings perform reasonably well, reaching the highest accuracy of 77% on recall@10, which is comparable to those requiring costly neural net training and fine-tuning. We have improved the simple mappings by contrastive learning on the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the cross-modal mapping quality. We have further improved the performance by multilayer perceptron with gating (gMLP), a simple neural architecture.

### Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation
- **Arxiv ID**: http://arxiv.org/abs/2304.10066v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10066v1)
- **Published**: 2023-04-20 03:18:03+00:00
- **Updated**: 2023-04-20 03:18:03+00:00
- **Authors**: Jacky Chen Long Chai, Tiong-Sik Ng, Cheng-Yaw Low, Jaewoo Park, Andrew Beng Jin Teoh
- **Comment**: Accepted to CVPR23
- **Journal**: None
- **Summary**: Very low-resolution face recognition (VLRFR) poses unique challenges, such as tiny regions of interest and poor resolution due to extreme standoff distance or wide viewing angle of the acquisition devices. In this paper, we study principled approaches to elevate the recognizability of a face in the embedding space instead of the visual quality.
We first formulate a robust learning-based face recognizability measure, namely recognizability index (RI), based on two criteria: (i) proximity of each face embedding against the unrecognizable faces cluster center and (ii) closeness of each face embedding against its positive and negative class prototypes. We then devise an index diversion loss to push the hard-to-recognize face embedding with low RI away from the unrecognizable faces cluster to boost the RI, which reflects better recognizability. Additionally, a perceptibility attention mechanism is introduced to attend to the most recognizable face regions, which offers better explanatory and discriminative traits for embedding learning. Our proposed model is trained end-to-end and simultaneously serves recognizability-aware embedding learning and face quality estimation. To address VLRFR, our extensive evaluations on three challenging low-resolution datasets and face quality assessment demonstrate the superiority of the proposed model over the state-of-the-art methods.

### Video-based Contrastive Learning on Decision Trees: from Action Recognition to Autism Diagnosis
- **Arxiv ID**: http://arxiv.org/abs/2304.10073v2
- **DOI**: 10.1145/3587819.3590988
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10073v2)
- **Published**: 2023-04-20 04:02:04+00:00
- **Updated**: 2023-04-21 06:17:01+00:00
- **Authors**: Mindi Ruan, Xiangxu Yu, Na Zhang, Chuanbo Hu, Shuo Wang, Xin Li
- **Comment**: Accepted by ACM Multimedia Systems Conference 2023
- **Journal**: None
- **Summary**: How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI).
The key idea is to translate the original multi-class action recognition into a series of binary classification tasks on a pre-constructed decision tree. Under the new framework of contrastive learning, we present the design of an interaction adjacent matrix (IAM) with skeleton graphs as the backbone for modeling various action-related attributes such as periodicity and symmetry. Through the construction of various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. Experimental justification for the potential of our approach in real-world applications ranges from interaction recognition to symmetry detection. In particular, we have demonstrated the promising performance of video-based autism spectrum disorder (ASD) diagnosis on the CalTech interview video database.

### Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering
- **Arxiv ID**: http://arxiv.org/abs/2304.10075v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10075v1)
- **Published**: 2023-04-20 04:05:22+00:00
- **Updated**: 2023-04-20 04:05:22+00:00
- **Authors**: Dongting Hu, Zhenkai Zhang, Tingbo Hou, Tongliang Liu, Huan Fu, Mingming Gong
- **Comment**: None
- **Journal**: None
- **Summary**: The rendering scheme in neural radiance field (NeRF) is effective in rendering a pixel by casting a ray into the scene. However, NeRF yields blurred rendering results when the training images are captured at non-uniform scales, and produces aliasing artifacts if the test images are taken in distant views. To address this issue, Mip-NeRF proposes a multiscale representation as a conical frustum to encode scale information. Nevertheless, this approach is only suitable for offline rendering since it relies on integrated positional encoding (IPE) to query a multilayer perceptron (MLP).
To overcome this limitation, we propose mip voxel grids (Mip-VoG), an explicit multiscale representation with a deferred architecture for real-time anti-aliasing rendering. Our approach includes a density Mip-VoG for scene geometry and a feature Mip-VoG with a small MLP for view-dependent color. Mip-VoG encodes scene scale using the level of detail (LOD) derived from ray differentials and uses quadrilinear interpolation to map a queried 3D location to its features and density from two neighboring downsampled voxel grids. To our knowledge, our approach is the first to offer multiscale training and real-time anti-aliasing rendering simultaneously. We conducted experiments on multiscale datasets, and the results show that our approach outperforms state-of-the-art real-time rendering baselines.

### NeUDF: Leaning Neural Unsigned Distance Fields with Volume Rendering
- **Arxiv ID**: http://arxiv.org/abs/2304.10080v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.GR
- **Links**: [PDF](http://arxiv.org/pdf/2304.10080v1)
- **Published**: 2023-04-20 04:14:42+00:00
- **Updated**: 2023-04-20 04:14:42+00:00
- **Authors**: Yu-Tao Liu, Li Wang, Jie yang, Weikai Chen, Xiaoxu Meng, Bo Yang, Lin Gao
- **Comment**: Accepted by CVPR 2023, 24 pages, project: http://geometrylearning.com/neudf/
- **Journal**: None
- **Summary**: Multi-view shape reconstruction has achieved impressive progress thanks to the latest advances in neural implicit surface rendering. However, existing methods based on signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as surface representation.
While a naive extension of an SDF-based neural renderer cannot scale to UDF, we propose two new formulations of weight function specially tailored for UDF-based volume rendering. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art method in the task of multi-view surface reconstruction, especially for complex shapes with open boundaries.

### MIPI 2023 Challenge on RGBW Fusion: Methods and Results
- **Arxiv ID**: http://arxiv.org/abs/2304.10089v2
- **DOI**: None
- **Categories**: **eess.IV**, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2304.10089v2)
- **Published**: 2023-04-20 05:02:03+00:00
- **Updated**: 2023-04-24 21:13:04+00:00
- **Authors**: Qianhui Sun, Qingyu Yang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Yuekun Dai, Wenxiu Sun, Qingpeng Zhu, Chen Change Loy, Jinwei Gu
- **Comment**: CVPR 2023 Mobile Intelligent Photography and Imaging (MIPI) Workshop--RGBW Sensor Fusion Challenge Report. Website: https://mipi-challenge.org/MIPI2023/. arXiv admin note: substantial text overlap with arXiv:2209.07530, arXiv:2209.08471, arXiv:2209.07060
- **Journal**: None
- **Summary**: Developing and integrating advanced image sensors with novel algorithms in camera systems are prevalent with the increasing demand for computational photography and imaging on mobile platforms. However, the lack of high-quality data for research and the rare opportunity for an in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI).
With the success of the 1st MIPI Workshop@ECCV 2022, we introduce the second MIPI challenge, including four tracks focusing on novel image sensors and imaging algorithms. This paper summarizes and reviews the RGBW Joint Fusion and Denoise track on MIPI 2023. In total, 69 participants were successfully registered, and 4 teams submitted results in the final testing phase. The final results are evaluated using objective metrics, including PSNR, SSIM, LPIPS, and KLD. A detailed description of the top three models developed in this challenge is provided in this paper. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2023/.

### MIPI 2023 Challenge on RGBW Remosaic: Methods and Results
- **Arxiv ID**: http://arxiv.org/abs/2304.10551v1
- **DOI**: None
- **Categories**: **eess.IV**, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2304.10551v1)
- **Published**: 2023-04-20 05:12:42+00:00
- **Updated**: 2023-04-20 05:12:42+00:00
- **Authors**: Qianhui Sun, Qingyu Yang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Yuekun Dai, Wenxiu Sun, Qingpeng Zhu, Chen Change Loy, Jinwei Gu
- **Comment**: CVPR 2023 Mobile Intelligent Photography and Imaging (MIPI) Workshop--RGBW Sensor Remosaic Challenge Report. Website: https://mipi-challenge.org/MIPI2023/. arXiv admin note: substantial text overlap with arXiv:2209.08471, arXiv:2209.07060, arXiv:2209.07530, arXiv:2304.10089
- **Journal**: None
- **Summary**: Developing and integrating advanced image sensors with novel algorithms in camera systems are prevalent with the increasing demand for computational photography and imaging on mobile platforms. However, the lack of high-quality data for research and the rare opportunity for an in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI).
With the success of the 1st MIPI Workshop@ECCV 2022, we introduce the second MIPI challenge, including four tracks focusing on novel image sensors and imaging algorithms. This paper summarizes and reviews the RGBW Joint Remosaic and Denoise track on MIPI 2023. In total, 81 participants were successfully registered, and 4 teams submitted results in the final testing phase. The final results are evaluated using objective metrics, including PSNR, SSIM, LPIPS, and KLD. A detailed description of the top three models developed in this challenge is provided in this paper. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2023/.

### Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition
- **Arxiv ID**: http://arxiv.org/abs/2304.10091v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.MM
- **Links**: [PDF](http://arxiv.org/pdf/2304.10091v1)
- **Published**: 2023-04-20 05:18:28+00:00
- **Updated**: 2023-04-20 05:18:28+00:00
- **Authors**: Jun Zhu, Jiandong Jin, Zihan Yang, Xiaohao Wu, Xiao Wang
- **Comment**: Accepted by CVPR-2023 Workshop@NFVLR (New Frontiers in Visual Language Reasoning: Compositionality, Prompts and Causality)
- **Journal**: None
- **Summary**: Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image. However, the performance is not reliable for images with challenging factors, such as heavy occlusion, motion blur, etc. In this work, we propose to understand human attributes using video frames that can make full use of temporal information. Specifically, we formulate the video-based PAR as a vision-language fusion problem and adopt the pre-trained big model CLIP to extract the feature embeddings of given video frames. To better utilize the semantic information, we take the attribute list as another input and transform the attribute words/phrases into the corresponding sentences via split, expand, and prompt.
Then, the text encoder of CLIP is utilized for language embedding. The averaged visual tokens and text tokens are concatenated and fed into a fusion Transformer for multi-modal interactive learning. The enhanced tokens will be fed into a classification head for pedestrian attribute prediction. Extensive experiments on a large-scale video-based PAR dataset fully validated the effectiveness of our proposed framework.

### Clustered-patch Element Connection for Few-shot Learning
- **Arxiv ID**: http://arxiv.org/abs/2304.10093v2
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10093v2)
- **Published**: 2023-04-20 05:22:27+00:00
- **Updated**: 2023-05-10 05:47:19+00:00
- **Authors**: Jinxiang Lai, Siqian Yang, Junhong Zhou, Wenlong Wu, Xiaochen Chen, Jun Liu, Bin-Bin Gao, Chengjie Wang
- **Comment**: None
- **Journal**: IJCAI 2023
- **Summary**: The weak feature representation problem has influenced the performance of the few-shot classification task for a long time. To alleviate this problem, recent researchers build connections between support and query instances through embedding patch features to generate discriminative representations. However, we observe that there exist semantic mismatches (foreground/background) among these local patches, because the location and size of the target object are not fixed. What is worse, these mismatches result in unreliable similarity confidences, and complex dense connection exacerbates the problem. According to this, we propose a novel Clustered-patch Element Connection (CEC) layer to correct the mismatch problem. The CEC layer leverages Patch Cluster and Element Connection operations to collect and establish reliable connections with high similarity patch features, respectively. Moreover, we propose a CECNet, including CEC layer based attention module and distance metric.
The former is utilized to generate a more discriminative representation benefiting from the global clustered-patch features, and the latter is introduced to reliably measure the similarity between pair-features. Extensive experiments demonstrate that our CECNet outperforms the state-of-the-art methods on classification benchmark. Furthermore, our CEC approach can be extended into few-shot segmentation and detection tasks, which achieves competitive performances.

### Scene Style Text Editing
- **Arxiv ID**: http://arxiv.org/abs/2304.10097v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10097v1)
- **Published**: 2023-04-20 05:36:49+00:00
- **Updated**: 2023-04-20 05:36:49+00:00
- **Authors**: Tonghua Su, Fuxiang Yang, Xiang Zhou, Donglin Di, Zhongjie Wang, Songze Li
- **Comment**: None
- **Journal**: None
- **Summary**: In this work, we propose a task called "Scene Style Text Editing (SSTE)", changing the text content as well as the text style of the source image while keeping the original text scene. Existing methods neglect fine-grained adjustment of the style of the foreground text, such as its rotation angle, color, and font type. To tackle this task, we propose a quadruple framework named "QuadNet" to embed and adjust foreground text styles in the latent feature space. Specifically, QuadNet consists of four parts, namely background inpainting, style encoder, content encoder, and fusion generator. The background inpainting erases the source text content and recovers the appropriate background with a highly authentic texture. The style encoder extracts the style embedding of the foreground text. The content encoder provides target text representations in the latent feature space to implement the content edits. The fusion generator combines the information yielded from the mentioned parts and generates the rendered text images.
Practically, our method is capable of performing promisingly on real-world datasets with merely string-level annotation. To the best of our knowledge, our work is the first to finely manipulate the foreground text content and style by deeply semantic editing in the latent feature space. Extensive experiments demonstrate that QuadNet has the ability to generate photo-realistic foreground text and avoid source text shadows in real-world scenes when editing text content.

### eTag: Class-Incremental Learning with Embedding Distillation and Task-Oriented Generation
- **Arxiv ID**: http://arxiv.org/abs/2304.10103v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2304.10103v1)
- **Published**: 2023-04-20 06:20:22+00:00
- **Updated**: 2023-04-20 06:20:22+00:00
- **Authors**: Libo Huang, Yan Zeng, Chuanguang Yang, Zhulin An, Boyu Diao, Yongjun Xu
- **Comment**: 12 pages, 12 figures
- **Journal**: None
- **Summary**: Class-Incremental Learning (CIL) aims to solve the neural networks' catastrophic forgetting problem, which refers to the fact that once the network updates on a new task, its performance on previously-learned tasks drops dramatically. Most successful CIL methods incrementally train a feature extractor with the aid of stored exemplars, or estimate the feature distribution with the stored prototypes. However, the stored exemplars would violate the data privacy concerns, while the stored prototypes might not reasonably be consistent with a proper feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a method of *e*mbedding distillation and *Ta*sk-oriented *g*eneration (*eTag*) for CIL, which requires neither the exemplar nor the prototype. Instead, eTag achieves a data-free manner to train the neural networks incrementally. To prevent the feature extractor from forgetting, eTag distills the embeddings of the network's intermediate blocks.
Additionally, eTag enables a generative network to produce suitable features, fitting the needs of the top incremental classifier. Experimental results confirmed that our proposed eTag considerably outperforms the state-of-the-art methods on CIFAR-100 and ImageNet-sub. Our code is available in the Supplementary Materials.

### Reinforcement Learning for Picking Cluttered General Objects with Dense Object Descriptors
- **Arxiv ID**: http://arxiv.org/abs/2304.10108v1
- **DOI**: None
- **Categories**: **cs.RO**, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2304.10108v1)
- **Published**: 2023-04-20 06:24:33+00:00
- **Updated**: 2023-04-20 06:24:33+00:00
- **Authors**: Hoang-Giang Cao, Weihao Zeng, I-Chen Wu
- **Comment**: Accepted to International Conference on Robotics and Automation (ICRA) 2022
- **Journal**: None
- **Summary**: Picking cluttered general objects is a challenging task due to the complex geometries and various stacking configurations. Many prior works utilize pose estimation for picking, but pose estimation is difficult on cluttered objects. In this paper, we propose Cluttered Objects Descriptors (CODs), a dense cluttered objects descriptor that can represent rich object structures, and use the pre-trained CODs network along with its intermediate outputs to train a picking policy. Additionally, we train the policy with reinforcement learning, which enables the policy to learn picking without supervision. We conduct experiments to demonstrate that our CODs is able to consistently represent seen and unseen cluttered objects, which allows the picking policy to robustly pick cluttered general objects. The resulting policy can pick 96.69% of unseen objects in our experimental environment, which is twice as cluttered as the training scenarios.
### SATA: Source Anchoring and Target Alignment Network for Continual Test Time Adaptation - **Arxiv ID**: http://arxiv.org/abs/2304.10113v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10113v1) - **Published**: 2023-04-20 06:38:33+00:00 - **Updated**: 2023-04-20 06:38:33+00:00 - **Authors**: Goirik Chakrabarty, Manogna Sreenivas, Soma Biswas - **Comment**: None - **Journal**: None - **Summary**: Adapting a trained model to perform satisfactorily on continually changing testing domains/environments is an important and challenging task. In this work, we propose a novel framework, SATA, which aims to satisfy the following characteristics required for online adaptation: 1) can work seamlessly with different (preferably small) batch sizes to reduce latency; 2) should continue to work well for the source domain; 3) should have minimal tunable hyper-parameters and storage requirements. Given a pre-trained network trained on source domain data, the proposed SATA framework modifies the batch-norm affine parameters using source anchoring based self-distillation. This ensures that the model incorporates the knowledge of the newly encountered domains, without catastrophically forgetting about the previously seen ones. We also propose a source-prototype driven contrastive alignment to ensure natural grouping of the target samples, while maintaining the already learnt semantic information. Extensive evaluation on three benchmark datasets under challenging settings justifies the effectiveness of SATA for real-world applications. 
### Learning Bottleneck Concepts in Image Classification - **Arxiv ID**: http://arxiv.org/abs/2304.10131v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10131v1) - **Published**: 2023-04-20 07:32:05+00:00 - **Updated**: 2023-04-20 07:32:05+00:00 - **Authors**: Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara - **Comment**: Accepted in CVPR 2023 - **Journal**: None - **Summary**: Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by providing per-pixel relevance to the decision. Yet, interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between some concepts and model decisions. This paper proposes Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training over the target task without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that learned concepts can be human-understandable. Using some image classification tasks as our testbed, we demonstrate BotCL's potential to rebuild neural networks for better interpretability. Code is available at https://github.com/wbw520/BotCL and a simple demo is available at https://botcl.liangzhili.com/. 
### Diversifying the High-level Features for better Adversarial Transferability - **Arxiv ID**: http://arxiv.org/abs/2304.10136v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10136v1) - **Published**: 2023-04-20 07:44:59+00:00 - **Updated**: 2023-04-20 07:44:59+00:00 - **Authors**: Zhiyuan Wang, Zeliang Zhang, Siyuan Liang, Xiaosen Wang - **Comment**: 15 pages - **Journal**: None - **Summary**: Given the great threat of adversarial attacks against Deep Neural Networks (DNNs), numerous works have been proposed to boost transferability to attack real-world applications. However, existing attacks often utilize advanced gradient calculation or input transformation but ignore the white-box model. Inspired by the fact that DNNs are over-parameterized for superior performance, we propose diversifying the high-level features (DHF) for more transferable adversarial examples. In particular, DHF perturbs the high-level features by randomly transforming the high-level features and mixing them with the feature of benign samples when calculating the gradient at each iteration. Due to the redundancy of parameters, such transformation does not affect the classification performance but helps identify the invariant features across different models, leading to much better transferability. Empirical evaluations on ImageNet dataset show that DHF could effectively improve the transferability of existing momentum-based attacks. Incorporated into the input transformation-based attacks, DHF generates more transferable adversarial examples and outperforms the baselines with a clear margin when attacking several defense models, showing its generalization to various attacks and high effectiveness for boosting transferability. 
### Motion Artifacts Detection in Short-scan Dental CBCT Reconstructions - **Arxiv ID**: http://arxiv.org/abs/2304.10154v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10154v1) - **Published**: 2023-04-20 08:28:44+00:00 - **Updated**: 2023-04-20 08:28:44+00:00 - **Authors**: Abdul Salam Rasmi Asraf Ali, Andrea Fusiello, Claudio Landi, Cristina Sarti, Anneke Annassia Putri Siswadi - **Comment**: None - **Journal**: None - **Summary**: Cone Beam Computed Tomography (CBCT) is widely used in dentistry for diagnostics and treatment planning. CBCT imaging has a long acquisition time and, consequently, the patient is likely to move. This motion causes significant artifacts in the reconstructed data, which may lead to misdiagnosis. Existing motion correction algorithms only address this issue partially, struggling with inconsistencies due to truncation, accuracy, and execution speed. On the other hand, a short-scan reconstruction using a subset of motion-free projections with appropriate weighting methods can have a sufficient clinical image quality for most diagnostic purposes. Therefore, a framework is used in this study to extract the motion-free part of the scanned projections with which a clean short-scan volume can be reconstructed without using correction algorithms. Motion artifacts are detected using deep learning with a slice-based prediction scheme followed by volume averaging to get the final result. A realistic motion simulation strategy and data augmentation have been implemented to address data scarcity. The framework has been validated by testing it with real motion-affected data while the model was trained only with simulated motion data. This shows the feasibility of applying the proposed framework to a broad variety of motion cases for further research. 
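As a rough illustration of the "slice-based prediction scheme followed by volume averaging" mentioned in the summary above, the sketch below averages per-slice motion probabilities into a single volume-level score. The function name, the 0.5 threshold, and the example scores are assumptions made for illustration; none of them are taken from the paper.

```python
from statistics import mean


def volume_motion_score(slice_probs, threshold=0.5):
    """Average per-slice motion probabilities into a volume-level decision.

    slice_probs: per-slice probabilities in [0, 1], standing in for the
                 outputs of a hypothetical slice classifier.
    threshold:   assumed decision cut-off (not specified in the abstract).
    """
    if not slice_probs:
        raise ValueError("slice_probs must not be empty")
    score = mean(slice_probs)          # volume averaging step
    return score, score >= threshold   # (score, motion detected?)


# Made-up per-slice scores for one reconstructed volume.
score, has_motion = volume_motion_score([0.9, 0.8, 0.7, 0.2])
print(f"{score:.2f}", has_motion)
```

Averaging lets a few uncertain slices be outvoted by the rest of the volume; a real system might weight slices or use a different aggregation rule.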
### High-Fidelity and Freely Controllable Talking Head Video Generation - **Arxiv ID**: http://arxiv.org/abs/2304.10168v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10168v1) - **Published**: 2023-04-20 09:02:41+00:00 - **Updated**: 2023-04-20 09:02:41+00:00 - **Authors**: Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu - **Comment**: CVPR 2023 - **Journal**: None - **Summary**: Talking head generation is to generate video based on a given source identity and target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often has unexpected deformation and severe distortions. Second, the driving image does not explicitly disentangle movement-relevant information, such as poses and expressions, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to have flickering artifacts due to the inconsistency of the extracted landmarks between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized talking head videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. More information is available at https://yuegao.me/PECHead. 
### Regularizing Second-Order Influences for Continual Learning - **Arxiv ID**: http://arxiv.org/abs/2304.10177v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10177v1) - **Published**: 2023-04-20 09:30:35+00:00 - **Updated**: 2023-04-20 09:30:35+00:00 - **Authors**: Zhicheng Sun, Yadong Mu, Gang Hua - **Comment**: CVPR 2023 - **Journal**: None - **Summary**: Continual learning aims to learn on non-stationary data streams without catastrophically forgetting previous knowledge. Prevalent replay-based methods address this challenge by rehearsing on a small buffer holding the seen data, for which a delicate sample selection strategy is required. However, existing selection schemes typically seek only to maximize the utility of the ongoing selection, overlooking the interference between successive rounds of selection. Motivated by this, we dissect the interaction of sequential selection steps within a framework built on influence functions. We manage to identify a new class of second-order influences that will gradually amplify incidental bias in the replay buffer and compromise the selection process. To regularize the second-order effects, a novel selection objective is proposed, which also has clear connections to two widely adopted criteria. Furthermore, we present an efficient implementation for optimizing the proposed criterion. Experiments on multiple continual learning benchmarks demonstrate the advantage of our approach over state-of-the-art methods. Code is available at https://github.com/feifeiobama/InfluenceCL. 
### SCoDA: Domain Adaptive Shape Completion for Real Scans - **Arxiv ID**: http://arxiv.org/abs/2304.10179v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10179v2) - **Published**: 2023-04-20 09:38:26+00:00 - **Updated**: 2023-04-24 06:31:59+00:00 - **Authors**: Yushuang Wu, Zizheng Yan, Ce Chen, Lai Wei, Xiao Li, Guanbin Li, Yihao Li, Shuguang Cui, Xiaoguang Han - **Comment**: None - **Journal**: CVPR 2023 - **Summary**: 3D shape completion from point clouds is a challenging task, especially from scans of real-world objects. Considering the paucity of 3D shape ground truths for real scans, existing works mainly focus on benchmarking this task on synthetic data, e.g. 3D computer-aided design models. However, the domain gap between synthetic and real data limits the generalizability of these methods. Thus, we propose a new task, SCoDA, for the domain adaptation of real scan shape completion from synthetic data. A new dataset, ScanSalon, is contributed with a bunch of elaborate 3D models created by skillful artists according to scans. To address this new task, we propose a novel cross-domain feature fusion method for knowledge transfer and a novel volume-consistent self-training framework for robust learning from real data. Extensive experiments prove our method is effective to bring an improvement of 6%~7% mIoU. ### Using Text-to-Image Generation for Architectural Design Ideation - **Arxiv ID**: http://arxiv.org/abs/2304.10182v1 - **DOI**: None - **Categories**: **cs.HC**, cs.AI, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10182v1) - **Published**: 2023-04-20 09:46:27+00:00 - **Updated**: 2023-04-20 09:46:27+00:00 - **Authors**: Ville Paananen, Jonas Oppenlaender, Aku Visuri - **Comment**: None - **Journal**: None - **Summary**: The recent progress of text-to-image generation has been recognized in architectural design. 
Our study is the first to investigate the potential of text-to-image generators in supporting creativity during the early stages of the architectural design process. We conducted a laboratory study with 17 architecture students, who developed a concept for a culture center using three popular text-to-image generators: Midjourney, Stable Diffusion, and DALL-E. Through standardized questionnaires and group interviews, we found that image generation could be a meaningful part of the design process when design constraints are carefully considered. Generative tools support serendipitous discovery of ideas and an imaginative mindset, enriching the design process. We identified several challenges of image generators and provided considerations for software development and educators to support creativity and emphasize designers' imaginative mindset. By understanding the limitations and potential of text-to-image generators, architects and designers can leverage this technology in their design process and education, facilitating innovation and effective communication of concepts. ### Efficient Uncertainty Estimation in Spiking Neural Networks via MC-dropout - **Arxiv ID**: http://arxiv.org/abs/2304.10191v1 - **DOI**: None - **Categories**: **cs.NE**, cs.AI, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10191v1) - **Published**: 2023-04-20 10:05:57+00:00 - **Updated**: 2023-04-20 10:05:57+00:00 - **Authors**: Tao Sun, Bojian Yin, Sander Bohte - **Comment**: None - **Journal**: None - **Summary**: Spiking neural networks (SNNs) have gained attention as models of sparse and event-driven communication of biological neurons, and as such have shown increasing promise for energy-efficient applications in neuromorphic hardware. As with classical artificial neural networks (ANNs), predictive uncertainties are important for decision making in high-stakes applications, such as autonomous vehicles, medical diagnosis, and high frequency trading. 
Yet, discussion of uncertainty estimation in SNNs is limited, and approaches for uncertainty estimation in artificial neural networks (ANNs) are not directly applicable to SNNs. Here, we propose an efficient Monte Carlo(MC)-dropout based approach for uncertainty estimation in SNNs. Our approach exploits the time-step mechanism of SNNs to enable MC-dropout in a computationally efficient manner, without introducing significant overheads during training and inference while demonstrating high accuracy and uncertainty quality. ### Spiking-Fer: Spiking Neural Network for Facial Expression Recognition With Event Cameras - **Arxiv ID**: http://arxiv.org/abs/2304.10211v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10211v1) - **Published**: 2023-04-20 10:59:56+00:00 - **Updated**: 2023-04-20 10:59:56+00:00 - **Authors**: Sami Barchid, Benjamin Allaert, Amel Aissaoui, José Mennesson, Chaabane Djéraba - **Comment**: None - **Journal**: None - **Summary**: Facial Expression Recognition (FER) is an active research domain that has shown great progress recently, notably thanks to the use of large deep learning models. However, such approaches are particularly energy intensive, which makes their deployment difficult for edge devices. To address this issue, Spiking Neural Networks (SNNs) coupled with event cameras are a promising alternative, capable of processing sparse and asynchronous events with lower energy consumption. In this paper, we establish the first use of event cameras for FER, named "Event-based FER", and propose the first related benchmarks by converting popular video FER datasets to event streams. To deal with this new task, we propose "Spiking-FER", a deep convolutional SNN model, and compare it against a similar Artificial Neural Network (ANN). Experiments show that the proposed approach achieves comparable performance to the ANN architecture, while consuming less energy by orders of magnitude (up to 65.39x). 
In addition, an experimental study of various event-based data augmentation techniques is performed to provide insights into the efficient transformations specific to event-based FER. ### Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning? - **Arxiv ID**: http://arxiv.org/abs/2304.10224v2 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10224v2) - **Published**: 2023-04-20 11:39:41+00:00 - **Updated**: 2023-08-04 09:19:43+00:00 - **Authors**: Haoyang Peng, Baopu Li, Bo Zhang, Xin Chen, Tao Chen, Hongyuan Zhu - **Comment**: 10 pages, 5 figures - **Journal**: None - **Summary**: Point cloud based 3D deep models have wide applications in areas such as autonomous driving and household robots. Inspired by the recent prompt learning in natural language processing, this work proposes a novel Multi-view Vision-Prompt Fusion Network (MvNet) for few-shot 3D point cloud classification. MvNet investigates the possibility of leveraging off-the-shelf 2D pre-trained models to achieve few-shot classification, which can alleviate the over-dependence of existing baseline models on large-scale annotated 3D point cloud data. Specifically, MvNet first encodes a 3D point cloud into multi-view image features for a number of different views. Then, a novel multi-view prompt fusion module is developed to effectively fuse information from different views to bridge the gap between 3D point cloud data and 2D pre-trained models. A set of 2D image prompts can then be derived to better describe the suitable prior knowledge for a large-scale pre-trained image model for few-shot 3D point cloud classification. Extensive experiments on ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that MvNet achieves new state-of-the-art performance for 3D few-shot point cloud image classification. The source code of this work will be available soon. 
### Domain Generalization for Mammographic Image Analysis with Contrastive Learning - **Arxiv ID**: http://arxiv.org/abs/2304.10226v4 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10226v4) - **Published**: 2023-04-20 11:40:21+00:00 - **Updated**: 2023-06-29 06:06:36+00:00 - **Authors**: Zheren Li, Zhiming Cui, Lichi Zhang, Sheng Wang, Chenjin Lei, Xi Ouyang, Dongdong Chen, Xiangyu Zhao, Yajia Gu, Zaiyi Liu, Chunling Liu, Dinggang Shen, Jie-Zhi Cheng - **Comment**: arXiv admin note: text overlap with arXiv:2111.10827 - **Journal**: None - **Summary**: Deep learning has been shown to effectively address several image analysis tasks in the computer-aided diagnosis scheme for mammography. The training of an efficacious deep learning model requires large data with diverse styles and qualities. The diversity of data often comes from the use of scanners from various vendors. However, it is often impractical to collect a sufficient amount of diverse data for training. To this end, a novel contrastive learning scheme is developed to equip the deep learning models with better style generalization capability. Specifically, the multi-style and multi-view unsupervised self-learning scheme is carried out to seek robust feature embedding against style diversity as a pretrained model. Afterward, the pretrained network is further fine-tuned to the downstream tasks, e.g., mass detection, matching, BI-RADS rating, and breast density classification. The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets. The experimental results suggest that the proposed domain generalization method can effectively improve performance of four mammographic image tasks on the data from both seen and unseen domains, and outperform many state-of-the-art (SOTA) generalization methods. 
### A geometry-aware deep network for depth estimation in monocular endoscopy - **Arxiv ID**: http://arxiv.org/abs/2304.10241v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10241v1) - **Published**: 2023-04-20 11:59:32+00:00 - **Updated**: 2023-04-20 11:59:32+00:00 - **Authors**: Yongming Yang, Shuwei Shao, Tao Yang, Peng Wang, Zhuo Yang, Chengdong Wu, Hao Liu - **Comment**: None - **Journal**: None - **Summary**: Monocular depth estimation is critical for endoscopists to perform spatial perception and 3D navigation of surgical sites. However, most of the existing methods ignore the important geometric structural consistency, which inevitably leads to performance degradation and distortion of 3D reconstruction. To address this issue, we introduce a gradient loss to penalize ambiguous edge fluctuations around stepped edge structures and a normal loss to explicitly express the sensitivity to frequent small structures, and propose a geometric consistency loss that spreads the spatial information across the sample grids to constrain the global geometric anatomy structures. In addition, we develop a synthetic RGB-Depth dataset that captures the anatomical structures under reflections and illumination variations. The proposed method is extensively validated across different datasets and clinical images and achieves mean RMSE values of 0.066 (stomach), 0.029 (small intestine), and 0.139 (colon) on the EndoSLAM dataset. To assess generalizability, the method is also evaluated on the ColonDepth dataset, where it achieves mean RMSE values of 12.604 (T1-L1), 9.930 (T2-L2), and 13.893 (T3-L3). The experimental results show that our method exceeds previous state-of-the-art competitors and generates more consistent depth maps and reasonable anatomical structures. The quality of intraoperative 3D structure perception from endoscopic videos of the proposed method meets the accuracy requirements of video-CT registration algorithms for endoscopic navigation. 
The dataset and the source code will be available at https://github.com/YYM-SIA/LINGMI-MR. ### Omni Aggregation Networks for Lightweight Image Super-Resolution - **Arxiv ID**: http://arxiv.org/abs/2304.10244v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10244v2) - **Published**: 2023-04-20 12:05:14+00:00 - **Updated**: 2023-04-24 09:03:56+00:00 - **Authors**: Hang Wang, Xuanhong Chen, Bingbing Ni, Yutian Liu, Jinfan Liu - **Comment**: Accepted by CVPR2023. Code is available at https://github.com/Francis0625/Omni-SR - **Journal**: None - **Summary**: While the lightweight ViT framework has made tremendous progress in image super-resolution, its uni-dimensional self-attention modeling, as well as its homogeneous aggregation scheme, limits its effective receptive field (ERF) from including more comprehensive interactions from both spatial and channel dimensions. To tackle these drawbacks, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) block is proposed based on the dense interaction principle, which can simultaneously model pixel-interaction from both spatial and channel dimensions, mining the potential correlations across omni-axis (i.e., spatial and channel). Coupled with mainstream window partitioning strategies, OSA can achieve superior performance with compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate sub-optimal ERF (i.e., premature saturation) in shallow models, which facilitates local propagation and meso-/global-scale interactions, rendering an omni-scale aggregation building block. Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95 dB@Urban100 $\times 4$ with only 792K parameters). Our code is available at https://github.com/Francis0625/Omni-SR. 
### Revisiting Implicit Neural Representations in Low-Level Vision - **Arxiv ID**: http://arxiv.org/abs/2304.10250v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10250v1) - **Published**: 2023-04-20 12:19:27+00:00 - **Updated**: 2023-04-20 12:19:27+00:00 - **Authors**: Wentian Xu, Jianbo Jiao - **Comment**: Published at the ICLR 2023 Neural Fields workshop. Project Webpage: https://wentxul.github.io/LINR-projectpage - **Journal**: None - **Summary**: Implicit Neural Representation (INR) has been emerging in computer vision in recent years. It has been shown to be effective in parameterising continuous signals such as dense 3D models from discrete image data, e.g. the neural radiance field (NeRF). However, INR is under-explored in 2D image processing tasks. Considering the basic definition and the structure of INR, we are interested in its effectiveness in low-level vision problems such as image restoration. In this work, we revisit INR and investigate its application in low-level image restoration tasks including image denoising, super-resolution, inpainting, and deblurring. Extensive experimental evaluations suggest the superior performance of INR in several low-level vision tasks with limited resources, outperforming its counterparts by over 2dB. 
Code and models are available at https://github.com/WenTXuL/LINR ### Towards replacing precipitation ensemble predictions systems using machine learning - **Arxiv ID**: http://arxiv.org/abs/2304.10251v1 - **DOI**: None - **Categories**: **physics.ao-ph**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10251v1) - **Published**: 2023-04-20 12:20:35+00:00 - **Updated**: 2023-04-20 12:20:35+00:00 - **Authors**: Rüdiger Brecht, Alex Bihlo - **Comment**: 12 pages, 7 figures, 2 tables - **Journal**: None - **Summary**: Precipitation forecasts are less accurate than forecasts of other meteorological fields because several key processes affecting precipitation distribution and intensity occur below the resolved scale of global weather prediction models. This requires the use of higher-resolution simulations. To generate an uncertainty prediction associated with the forecast, ensembles of simulations are run simultaneously. However, the computational cost is a limiting factor here. Thus, instead of generating an ensemble system from simulations, there is a trend of using neural networks. Unfortunately, the data for high-resolution ensemble runs is not available. We propose a new approach to generating ensemble weather predictions for high-resolution precipitation without requiring high-resolution training data. The method uses generative adversarial networks to learn the complex patterns of precipitation and produce diverse and realistic precipitation fields, allowing the generation of realistic precipitation ensemble members using only the available control forecast. We demonstrate the feasibility of generating realistic precipitation ensemble members on unseen higher resolutions. We use evaluation metrics such as RMSE, CRPS, rank histogram and ROC curves to demonstrate that our generated ensemble is almost identical to the ECMWF IFS ensemble. 
### A data augmentation perspective on diffusion models and retrieval - **Arxiv ID**: http://arxiv.org/abs/2304.10253v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10253v1) - **Published**: 2023-04-20 12:21:30+00:00 - **Updated**: 2023-04-20 12:21:30+00:00 - **Authors**: Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell - **Comment**: None - **Journal**: None - **Summary**: Diffusion models excel at generating photorealistic images from text-queries. Naturally, many approaches have been proposed to use these generative abilities to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large noisily supervised, but nonetheless, annotated datasets. It is an open question whether the generalization capabilities of diffusion models beyond using the additional data of the pre-training process for augmentation lead to improved downstream performance. We perform a systematic evaluation of existing methods to generate images from diffusion models and study new extensions to assess their benefit for data augmentation. While we find that personalizing diffusion models towards the target data outperforms simpler prompting strategies, we also show that using the training data of the diffusion model alone, via a simple nearest neighbor retrieval procedure, leads to even stronger downstream performance. Overall, our study probes the limitations of diffusion models for data augmentation but also highlights its potential in generating new training data to improve performance on simple downstream vision tasks. 
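To make the "simple nearest neighbor retrieval procedure" mentioned in the summary above more concrete, here is a minimal sketch that ranks a pool of embeddings by cosine similarity to a query embedding. The embedding vectors, the choice of cosine similarity, the function names, and `k` are all illustrative assumptions; the abstract does not specify how retrieval is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_neighbors(query, pool, k=2):
    """Return indices of the k pool embeddings most similar to the query."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query, pool[i]),
                    reverse=True)
    return ranked[:k]


# Toy 2-D embeddings standing in for a (hypothetical) pre-training image pool.
pool = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
neighbors = retrieve_neighbors([1.0, 0.0], pool, k=2)
print(neighbors)
```

In an augmentation setting, the retrieved items would be added to the downstream training set; at realistic scale an approximate nearest-neighbor index would likely replace the exhaustive sort used here.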
### Image-text Retrieval via Preserving Main Semantics of Vision - **Arxiv ID**: http://arxiv.org/abs/2304.10254v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10254v2) - **Published**: 2023-04-20 12:23:29+00:00 - **Updated**: 2023-04-28 08:09:54+00:00 - **Authors**: Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, Xudong Dai - **Comment**: 6 pages, 3 figures, accepted by ICME2023 - **Journal**: None - **Summary**: Image-text retrieval is one of the major tasks of cross-modal retrieval. Several approaches for this task map images and texts into a common space to create correspondences between the two modalities. However, due to the content (semantics) richness of an image, redundant secondary information in an image may cause false matches. To address this issue, this paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL), to assist the model in focusing on an image's main content. This approach is inspired by how people typically annotate the content of an image by describing its main content. Thus, we leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image, reducing the negative impact of secondary content. Extensive experiments on two benchmark datasets (MSCOCO and Flickr30K) demonstrate the superior performance of our method. The code is available at: https://github.com/ZhangXu0963/VSL. ### Indian Sign Language Recognition Using Mediapipe Holistic - **Arxiv ID**: http://arxiv.org/abs/2304.10256v1 - **DOI**: None - **Categories**: **cs.CV**, cs.CL, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10256v1) - **Published**: 2023-04-20 12:25:47+00:00 - **Updated**: 2023-04-20 12:25:47+00:00 - **Authors**: Dr. Velmathi G, Kaushal Goyal - **Comment**: 16 pages, 22 figures - **Journal**: None - **Summary**: Deaf individuals confront significant communication obstacles on a daily basis. 
Their inability to hear makes it difficult for them to communicate with those who do not understand sign language. Moreover, it presents difficulties in educational, occupational, and social contexts. By providing alternative communication channels, technology can play a crucial role in overcoming these obstacles. One such technology that can facilitate communication between deaf and hearing individuals is sign language recognition. We will create a robust system for sign language recognition in order to convert Indian Sign Language to text or speech. We will evaluate the proposed system and compare CNN and LSTM models. Since there are both static and gesture sign languages, a robust model is required to distinguish between them. In this study, we found that a CNN model captures letters and characters for recognition of static sign language better than an LSTM model, whereas the LSTM model outperforms the CNN on gesture sign language phrases and sentences by monitoring hands, faces, and pose. The creation of a text-to-sign language paradigm is essential since it will enhance the communication skills of the deaf and hard-of-hearing population that depends on sign language. Sign-to-text translation is just one side of communication, and not all deaf or hard-of-hearing people are proficient in reading or writing text. Some may have difficulty comprehending written language due to educational or literacy issues. Therefore, a text-to-sign language paradigm would allow them to comprehend text-based information and participate in a variety of social, educational, and professional settings. 
Keywords: deaf and hard-of-hearing, DHH, Indian sign language, CNN, LSTM, static and gesture sign languages, text-to-sign language model, MediaPipe Holistic, sign language recognition, SLR, SLT ### Social Distance Detection Using Deep Learning And Risk Management System - **Arxiv ID**: http://arxiv.org/abs/2304.10259v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10259v1) - **Published**: 2023-04-20 12:27:39+00:00 - **Updated**: 2023-04-20 12:27:39+00:00 - **Authors**: Dr. Sangeetha R. G, Jaya Aravindh V. V - **Comment**: None - **Journal**: None - **Summary**: Three years after its initial outbreak, the coronavirus disease has hit the world again in many evolved forms. The effects on the human race have already been profound. We can only safeguard ourselves against this pandemic by mandating face masks and maintaining social distancing. Many civil institutions in India require protective face masks at all gatherings. Given the substantial human resources it would demand, personally inspecting a country with a population as large as India's to determine whether mask wearing and social distancing are being observed is unfeasible. The COVID-19 Social Distancing Detector System is a single-stage detector that employs deep learning to integrate high-end semantic data into a CNN module in order to monitor social distancing and detect violations within a specified region. Using existing security footage, CCTV cameras, and computer vision (CV), it will also be able to identify those who violate social distancing. By providing tools for safety and security, this technology removes the need for a labor-based surveillance system, yet a manual governing body is still required to monitor, track, and report the violations that are committed. 
Any sort of infrastructure, including universities, hospitals, government offices, schools, and building sites, can employ the technology. Therefore, the risk management system created to report and analyze video streams, together with the social distance detector system, might help to ensure our protection and security as well as the security of our loved ones. Furthermore, we discuss the deployment and improvement of the project overall. ### PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image - **Arxiv ID**: http://arxiv.org/abs/2304.10263v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10263v1) - **Published**: 2023-04-20 12:33:56+00:00 - **Updated**: 2023-04-20 12:33:56+00:00 - **Authors**: Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, Jun Zhu - **Comment**: 20 pages, 21 figures - **Journal**: None - **Summary**: We study the 3D-aware image attribute editing problem in this paper, which has wide applications in practice. Recent methods solved the problem by training a shared encoder to map images into a 3D generator's latent space or by per-image latent code optimization and then edited images in the latent space. Despite their promising results near the input view, they still suffer from the 3D inconsistency of produced images at large camera poses and imprecise image attribute editing, like affecting unspecified attributes during editing. For more efficient image inversion, we train a shared encoder for all images. To alleviate 3D inconsistency at large camera poses, we propose two novel methods, an alternating training scheme and a multi-view identity loss, to maintain 3D consistency and subject identity. As for imprecise image editing, we attribute the problem to the gap between the latent space of real images and that of generated images. 
We compare the latent space and inversion manifold of GAN models and demonstrate that editing in the inversion manifold can achieve better results in both quantitative and qualitative evaluations. Extensive experiments show that our method produces more 3D consistent images and achieves more precise image editing than previous work. Source code and pretrained models can be found on our project page: https://mybabyyh.github.io/Preim3D/ ### Learning Self-Supervised Representations for Label Efficient Cross-Domain Knowledge Transfer on Diabetic Retinopathy Fundus Images - **Arxiv ID**: http://arxiv.org/abs/2304.11168v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.11168v1) - **Published**: 2023-04-20 12:46:34+00:00 - **Updated**: 2023-04-20 12:46:34+00:00 - **Authors**: Ekta Gupta, Varun Gupta, Muskaan Chopra, Prakash Chandra Chhipa, Marcus Liwicki - **Comment**: Accepted to International Joint Conference on Neural Networks (IJCNN) 2023 - **Journal**: None - **Summary**: This work presents a novel label-efficient self-supervised representation learning-based approach for classifying diabetic retinopathy (DR) images in cross-domain settings. Most of the existing DR image classification methods are based on supervised learning, which requires large amounts of training data annotated by medical domain experts, a process that is time-consuming and expensive. The proposed approach uses the prior learning from the source DR image dataset to classify images drawn from the target datasets. The image representations learned from the unlabeled source domain dataset through contrastive learning are used to classify DR images from the target domain dataset. Moreover, the proposed approach requires only a few labeled images to perform successfully on DR image classification tasks in cross-domain settings. 
The proposed work experiments with four publicly available datasets: EyePACS, APTOS 2019, MESSIDOR-I, and Fundus Images for self-supervised representation learning-based DR image classification in cross-domain settings. The proposed method achieves state-of-the-art results on binary and multi-class classification of DR images, even in cross-domain settings. The proposed method outperforms the existing DR image binary and multi-class classification methods proposed in the literature. The proposed method is also validated qualitatively using class activation maps, revealing that the method can learn explainable image representations. The source code and trained models are published on GitHub. ### Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis - **Arxiv ID**: http://arxiv.org/abs/2304.10278v1 - **DOI**: 10.1145/3591106.3592262 - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10278v1) - **Published**: 2023-04-20 13:00:46+00:00 - **Updated**: 2023-04-20 13:00:46+00:00 - **Authors**: Yankun Wu, Yuta Nakashima, Noa Garcia - **Comment**: None - **Journal**: None - **Summary**: The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style that may reflect the author's emotions, social trends, artistic movement, etc., and their deep comprehension undoubtedly requires handling both. A promising step towards a general paradigm for art analysis is to disentangle content and style, whereas relying on human annotations to cull a single aspect of artworks has limitations in learning semantic concepts and the visual appearance of paintings. 
We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images sufficiently serve as a proxy of the real distribution of artworks, allowing GOYA to separately represent the two elements of art while keeping more information than existing methods. ### Feature-compatible Progressive Learning for Video Copy Detection - **Arxiv ID**: http://arxiv.org/abs/2304.10305v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10305v2) - **Published**: 2023-04-20 13:39:47+00:00 - **Updated**: 2023-05-12 17:26:39+00:00 - **Authors**: Wenhao Wang, Yifan Sun, Yi Yang - **Comment**: The second place solutions for both tracks of Meta AI Video Similarity Challenge (VSC22), CVPR 2023 - **Journal**: None - **Summary**: Video Copy Detection (VCD) has been developed to identify instances of unauthorized or duplicated video content. This paper presents our second place solutions to the Meta AI Video Similarity Challenge (VSC22), CVPR 2023. In order to compete in this challenge, we propose Feature-Compatible Progressive Learning (FCPL) for VCD. FCPL trains various models that produce mutually-compatible features, meaning that the features derived from multiple distinct models can be directly compared with one another. We find this mutual compatibility enables feature ensemble. By implementing progressive learning and utilizing labeled ground truth pairs, we effectively gradually enhance performance. Experimental results demonstrate the superiority of the proposed FCPL over other competitors. Our code is available at https://github.com/WangWenhao0716/VSC-DescriptorTrack-Submission and https://github.com/WangWenhao0716/VSC-MatchingTrack-Submission. 
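The idea of mutually-compatible features in the FCPL summary above can be illustrated with a small sketch: if features from different models live in one comparable space, they can be compared directly and also averaged into an ensemble descriptor. This is not taken from the FCPL repositories; the function names, the toy vectors, and the choice of averaging L2-normalized features are assumptions made for illustration only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def ensemble(features):
    """Average L2-normalized features from several models, then re-normalize.

    Assumption: all models emit features of the same dimension in a
    shared (compatible) space, so element-wise averaging is meaningful.
    """
    normalized = [l2_normalize(f) for f in features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Hypothetical descriptors of the same video from two different models.
query = ensemble([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
reference = ensemble([[0.85, 0.15, 0.05], [0.9, 0.1, 0.0]])
print(cosine(query, reference))
```

If the features were not compatible, a cross-model comparison such as `cosine(feature_from_model_a, feature_from_model_b)` would carry no meaning, and the averaging step would blur rather than combine the descriptors.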
### FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits - **Arxiv ID**: http://arxiv.org/abs/2304.10306v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10306v1) - **Published**: 2023-04-20 13:40:49+00:00 - **Updated**: 2023-04-20 13:40:49+00:00 - **Authors**: Polina Karpikova, Radionova Ekaterina, Anastasia Yaschenko, Andrei Spiridonov, Leonid Kostyushko, Riccardo Fabbricatore, Aleksei Ivakhnenko - **Comment**: 12 pages, 22 figures - **Journal**: None - **Summary**: Generative DNNs are a powerful tool for image synthesis, but they are limited by their computational load. On the other hand, given a trained model and a task, e.g. face generation within a range of characteristics, the output image quality will be unevenly distributed among images with different characteristics. It follows that we might restrain the model's complexity on some instances while maintaining high quality. We propose a method for diminishing computations by adding so-called early exit branches to the original architecture, and dynamically switching the computational path depending on how difficult it will be to render the output. We apply our method on two different SOTA models performing generative tasks: generation from a semantic map, and cross-reenactment of face expressions, showing that it is able to output images with custom lower-quality thresholds. For a threshold of LPIPS <=0.1, we diminish their computations by up to a half. This is especially relevant for real-time applications such as synthesis of faces, when quality loss needs to be contained, but most of the inputs need fewer computations than the complex instances. 
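The path-switching step described in the FIANCEE summary can be sketched as a simple selection rule: given a predicted quality error for each exit branch, take the earliest exit that already satisfies the quality threshold. This is a minimal sketch, not the authors' method; `choose_exit`, the per-exit error predictions, and the fallback to the deepest exit are hypothetical.

```python
def choose_exit(predicted_errors, threshold):
    """Return the index of the earliest exit whose predicted error meets the threshold.

    predicted_errors: one predicted perceptual error per exit, ordered from
    the shallowest (cheapest) exit to the full-depth output. How these
    predictions are produced is outside this sketch (an assumption).
    """
    if not predicted_errors:
        raise ValueError("at least one exit is required")
    for index, error in enumerate(predicted_errors):
        if error <= threshold:
            return index
    # No exit is predicted to be good enough: use the full computation.
    return len(predicted_errors) - 1

# An easy input can leave early; a hard one runs the whole network.
print(choose_exit([0.30, 0.12, 0.08, 0.05], threshold=0.1))  # 2
print(choose_exit([0.50, 0.40, 0.30, 0.20], threshold=0.1))  # 3
```

The threshold plays the role of the LPIPS bound quoted in the summary: a looser threshold lets more inputs exit early and saves more computation.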
### Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning - **Arxiv ID**: http://arxiv.org/abs/2304.10309v1 - **DOI**: 10.1109/TASLP.2023.3244521 - **Categories**: **cs.CL**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10309v1) - **Published**: 2023-04-20 13:41:56+00:00 - **Updated**: 2023-04-20 13:41:56+00:00 - **Authors**: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Wei-Qiang Zhang - **Comment**: None - **Journal**: IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 31, 2023 - **Summary**: The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and less error propagation. However, it is non-trivial to train such a model well due to the task complexity and data scarcity. The speech-and-text modality differences result in the E2E-ST model performance usually inferior to the corresponding machine translation (MT) model. Based on the above observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than the MT model trained alone, which means that the knowledge transfer ability of this method is also limited. To deal with these problems, we propose the FCCL (Fine- and Coarse- Granularity Contrastive Learning) approach for E2E-ST, which makes explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both sentence- and frame-level to give the comprehensive guide for extracting speech representations containing rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrast learning. 
Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information. ### LA3: Efficient Label-Aware AutoAugment - **Arxiv ID**: http://arxiv.org/abs/2304.10310v1 - **DOI**: 10.1007/978-3-031-19803-8_16 - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10310v1) - **Published**: 2023-04-20 13:42:18+00:00 - **Updated**: 2023-04-20 13:42:18+00:00 - **Authors**: Mingjun Zhao, Shan Lu, Zixuan Wang, Xiaoli Wang, Di Niu - **Comment**: ECCV 2022 - **Journal**: None - **Summary**: Automated augmentation is an emerging and effective technique to search for data augmentation policies to improve generalizability of deep neural network training. Most existing work focuses on constructing a unified policy applicable to all data samples in a given dataset, without considering sample or class variations. In this paper, we propose a novel two-stage data augmentation algorithm, named Label-Aware AutoAugment (LA3), which takes advantage of the label information, and learns augmentation policies separately for samples of different labels. LA3 consists of two learning stages, where in the first stage, individual augmentation methods are evaluated and ranked for each label via Bayesian Optimization aided by a neural predictor, which allows us to identify effective augmentation techniques for each label under a low search cost. And in the second stage, a composite augmentation policy is constructed out of a selection of effective as well as complementary augmentations, which produces significant performance boost and can be easily deployed in typical model training. 
Extensive experiments demonstrate that LA3 achieves excellent performance matching or surpassing existing methods on CIFAR-10 and CIFAR-100, and achieves a new state-of-the-art ImageNet accuracy of 79.97% on ResNet-50 among auto-augmentation methods, while maintaining a low computational cost. ### Search-Map-Search: A Frame Selection Paradigm for Action Recognition - **Arxiv ID**: http://arxiv.org/abs/2304.10316v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10316v1) - **Published**: 2023-04-20 13:49:53+00:00 - **Updated**: 2023-04-20 13:49:53+00:00 - **Authors**: Mingjun Zhao, Yakun Yu, Xiaoli Wang, Lei Yang, Di Niu - **Comment**: CVPR 2023 - **Journal**: None - **Summary**: Despite the success of deep learning in video understanding tasks, processing every frame in a video is computationally expensive and often unnecessary in real-time applications. Frame selection aims to extract the most informative and representative frames to help a model better understand video content. Existing frame selection methods either individually sample frames based on per-frame importance prediction, without considering interaction among frames, or adopt reinforcement learning agents to find representative frames in succession, which are costly to train and may lead to potential stability issues. To overcome the limitations of existing methods, we propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead. Specifically, we first propose a hierarchical search method conducted on each training video to search for the optimal combination of frames with the lowest error on the downstream task. 
A feature mapping function is then learned to map the frames of a video to the representation of its target optimal frame combination. During inference, another search is performed on an unseen video to select a combination of frames whose feature representation is close to the projected feature representation. Extensive experiments based on several action recognition benchmarks demonstrate that our frame selection method effectively improves performance of action recognition models, and significantly outperforms a number of competitive baselines. ### Adaptive Consensus Optimization Method for GANs - **Arxiv ID**: http://arxiv.org/abs/2304.10317v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV, cs.NA, math.NA - **Links**: [PDF](http://arxiv.org/pdf/2304.10317v1) - **Published**: 2023-04-20 13:50:42+00:00 - **Updated**: 2023-04-20 13:50:42+00:00 - **Authors**: Sachin Kumar Danisetty, Santhosh Reddy Mylaram, Pawan Kumar - **Comment**: None - **Journal**: None - **Summary**: We propose a second order gradient based method with ADAM and RMSprop for the training of generative adversarial networks. The proposed method is fastest to obtain similar accuracy when compared to prominent second order methods. Unlike state-of-the-art recent methods, it requires neither solving a linear system nor additional mixed second derivative terms. We derive the fixed point iteration corresponding to the proposed method, and show that the proposed method is convergent. The proposed method produces better or comparable inception scores, and comparable quality of images compared to other recently proposed state-of-the-art second order methods. Compared to first order methods such as ADAM, it produces significantly better inception scores. The proposed method is compared and validated on popular datasets such as FFHQ, LSUN, CIFAR10, MNIST, and Fashion MNIST for image generation tasks (accepted in IJCNN 2023). 
Codes: https://github.com/misterpawan/acom ### DropDim: A Regularization Method for Transformer Networks - **Arxiv ID**: http://arxiv.org/abs/2304.10321v1 - **DOI**: 10.1109/LSP.2022.3140693 - **Categories**: **cs.CL**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10321v1) - **Published**: 2023-04-20 13:54:18+00:00 - **Updated**: 2023-04-20 13:54:18+00:00 - **Authors**: Hao Zhang, Dan Qu, Keji Shao, Xukui Yang - **Comment**: None - **Journal**: IEEE SIGNAL PROCESSING LETTERS, VOL. 29, 2022 - **Summary**: We introduce DropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information can be completely discarded. Thus, the excessive coadapting between different embedding dimensions can be broken, and the self-attention is forced to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks executed on the MUST-C English-German dataset show that DropDim can effectively improve model performance, reduce over-fitting, and show complementary effects with other regularization methods. When combined with label smoothing, the WER can be reduced from 19.1% to 15.1% on the ASR task, and the BLEU value can be increased from 26.90 to 28.38 on the MT task. On the ST task, the model can reach a BLEU score of 22.99, an increase of 1.86 BLEU points compared to the strong baseline. 
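The contrast drawn in the DropDim summary, dropping whole embedding dimensions rather than individual neurons, can be sketched in a few lines. This is an illustrative sketch and not the published implementation: the function name `drop_dim`, the independent per-dimension sampling, and the inverted-dropout rescaling by `1 / (1 - drop_rate)` are assumptions.

```python
import random

def drop_dim(embeddings, drop_rate, rng=None):
    """Zero the same randomly chosen embedding dimensions at every position.

    embeddings: list of token vectors (sequence_length x embedding_dim).
    Returns the masked sequence and the per-dimension keep mask.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    if not embeddings:
        return [], []
    dim = len(embeddings[0])
    # One keep/drop decision per embedding dimension, shared by all tokens.
    # Standard dropout would instead sample a decision per element.
    keep = [rng.random() >= drop_rate for _ in range(dim)]
    # Assumption: rescale kept dimensions as in inverted dropout.
    scale = 1.0 / (1.0 - drop_rate)
    dropped = [
        [value * scale if kept else 0.0 for value, kept in zip(row, keep)]
        for row in embeddings
    ]
    return dropped, keep

sequence = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.5, 1.0]]
masked, keep_mask = drop_dim(sequence, drop_rate=0.5, rng=random.Random(0))
print(keep_mask)
print(masked)
```

Because the mask is shared across positions, a dropped dimension carries no information anywhere in the sequence, which is the property the summary describes as completely discarding the semantic information in that dimension.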
### Ensembling Instance and Semantic Segmentation for Panoptic Segmentation - **Arxiv ID**: http://arxiv.org/abs/2304.10326v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10326v1) - **Published**: 2023-04-20 14:02:01+00:00 - **Updated**: 2023-04-20 14:02:01+00:00 - **Authors**: Mehmet Yildirim, Yogesh Langhe - **Comment**: None - **Journal**: None - **Summary**: We demonstrate our solution for the 2019 COCO panoptic segmentation task. Our method first performs instance segmentation and semantic segmentation separately, then combines the two to generate panoptic segmentation results. To enhance the performance, we add several expert models of Mask R-CNN in instance segmentation to tackle the data imbalance problem in the training data; an HTC model is also adopted, yielding our best instance segmentation results. In semantic segmentation, we train several models with various backbones and use an ensemble strategy, which further boosts the segmentation results. In the end, we analyze various combinations of instance and semantic segmentation, and report on their performance for the final panoptic segmentation results. Our best model achieves a PQ of 47.1 on 2019 COCO panoptic test-dev data. ### Noisy Universal Domain Adaptation via Divergence Optimization for Visual Recognition - **Arxiv ID**: http://arxiv.org/abs/2304.10333v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10333v1) - **Published**: 2023-04-20 14:18:38+00:00 - **Updated**: 2023-04-20 14:18:38+00:00 - **Authors**: Qing Yu, Atsushi Hashimoto, Yoshitaka Ushiku - **Comment**: None - **Journal**: None - **Summary**: To transfer the knowledge learned from a labeled source domain to an unlabeled target domain, many studies have worked on universal domain adaptation (UniDA), where there is no constraint on the label sets of the source domain and target domain. 
However, the existing UniDA methods rely on source samples with correct annotations. Due to the limited resources in the real world, it is difficult to obtain a large amount of perfectly clean labeled data in a source domain in some applications. As a result, we propose a novel realistic scenario named Noisy UniDA, in which classifiers are trained using noisy labeled data from the source domain as well as unlabeled domain data from the target domain that has an uncertain class distribution. A multi-head convolutional neural network framework is proposed in this paper to address all of the challenges faced in the Noisy UniDA at once. Our network comprises a single common feature generator and multiple classifiers with various decision bounds. We can detect noisy samples in the source domain, identify unknown classes in the target domain, and align the distribution of the source and target domains by optimizing the divergence between the outputs of the various classifiers. The proposed method outperformed the existing methods in most of the settings after a thorough analysis of the various domain adaptation scenarios. The source code is available at https://github.com/YU1ut/Divergence-Optimization. ### A baseline on continual learning methods for video action recognition - **Arxiv ID**: http://arxiv.org/abs/2304.10335v2 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10335v2) - **Published**: 2023-04-20 14:20:43+00:00 - **Updated**: 2023-04-26 09:49:18+00:00 - **Authors**: Giulia Castagnolo, Concetto Spampinato, Francesco Rundo, Daniela Giordano, Simone Palazzo - **Comment**: None - **Journal**: None - **Summary**: Continual learning has recently attracted attention from the research community, as it aims to solve long-standing limitations of classic supervisedly-trained models. However, most research on this subject has tackled continual learning in simple image classification scenarios. 
In this paper, we present a benchmark of state-of-the-art continual learning methods on video action recognition. Besides the increased complexity due to the temporal dimension, the video setting imposes stronger requirements on computing resources for top-performing rehearsal methods. To counteract the increased memory requirements, we present two method-agnostic variants for rehearsal methods, exploiting measures of either model confidence or data information to select memorable samples. Our experiments show that, as expected from the literature, rehearsal methods outperform other approaches; moreover, the proposed memory-efficient variants are shown to be effective at retaining a certain level of performance with a smaller buffer size. ### DocMAE: Document Image Rectification via Self-supervised Representation Learning - **Arxiv ID**: http://arxiv.org/abs/2304.10341v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10341v1) - **Published**: 2023-04-20 14:27:15+00:00 - **Updated**: 2023-04-20 14:27:15+00:00 - **Authors**: Shaokai Liu, Hao Feng, Wengang Zhou, Houqiang Li, Cong Liu, Feng Wu - **Comment**: Accepted to ICME 2023 - **Journal**: None - **Summary**: Tremendous efforts have been made on document image rectification, but how to learn effective representation of such distorted images is still under-explored. In this paper, we present DocMAE, a novel self-supervised framework for document image rectification. Our motivation is to encode the structural cues in document images by leveraging masked autoencoder to benefit the rectification, i.e., the document boundaries, and text lines. Specifically, we first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents by restoring document boundaries and missing text lines. 
Transfer performance in the downstream rectification task validates the effectiveness of our method. Extensive experiments are conducted to demonstrate the effectiveness of our method. ### Breast cancer detection using deep learning - **Arxiv ID**: http://arxiv.org/abs/2304.10386v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10386v1) - **Published**: 2023-04-20 15:22:12+00:00 - **Updated**: 2023-04-20 15:22:12+00:00 - **Authors**: Gayathri Girish, Ponnathota Spandana, Badrish Vasu - **Comment**: None - **Journal**: None - **Summary**: Objective: This paper proposes a deep learning model for breast cancer detection from reconstructed images of microwave imaging scan data and aims to improve the accuracy and efficiency of breast tumor detection, which could have a significant impact on breast cancer diagnosis and treatment. Methods: Our framework consists of different convolutional neural network (CNN) architectures for feature extraction and a region-based CNN for tumor detection. We use 7 different architectures: DenseNet201, ResNet50, InceptionV3, InceptionResNetV3, MobileNetV2, NASNetMobile and NASNetLarge, and compare their performance to find the best of the seven. An experimental dataset of MRI-derived breast phantoms was used. Results: NASNetLarge is the best architecture for the CNN model, with an accuracy of 88.41% and a loss of 27.82%. Given that the model's AUC is 0.786, it can be concluded that it is suitable for use in its present form, while it could be improved upon and trained on other comparable datasets. Impact: One of the main causes of death in women is breast cancer, and early identification is essential for improving patient outcomes. Due to its non-invasiveness and capacity to produce high-resolution images, microwave imaging is a potential tool for breast cancer screening. 
The complexity of tumors makes it difficult to adequately detect them in microwave images. The results of this research show that deep learning has a lot of potential for breast cancer detection in microwave images. ### LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields - **Arxiv ID**: http://arxiv.org/abs/2304.10406v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10406v2) - **Published**: 2023-04-20 15:44:37+00:00 - **Updated**: 2023-07-14 12:44:47+00:00 - **Authors**: Tang Tao, Longfei Gao, Guangrun Wang, Yixing Lao, Peng Chen, Hengshuang Zhao, Dayang Hao, Xiaodan Liang, Mathieu Salzmann, Kaicheng Yu - **Comment**: This paper introduces a new task of novel LiDAR view synthesis, and proposes a differentiable framework called LiDAR-NeRF with a structural regularization, as well as an object-centric multi-view LiDAR dataset called NeRF-MVL - **Journal**: None - **Summary**: We introduce a new task, novel view synthesis for LiDAR sensors. While traditional model-based LiDAR simulators with style-transfer neural networks can be applied to render novel views, they fall short of producing accurate and realistic LiDAR patterns because the renderers rely on explicit 3D reconstruction and exploit game engines that ignore important attributes of LiDAR points. We address this challenge by formulating, to the best of our knowledge, the first differentiable end-to-end LiDAR rendering framework, LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint learning of geometry and the attributes of 3D points. However, simply employing NeRF cannot achieve satisfactory results, as it only focuses on learning individual pixels while ignoring local information, especially in low-texture areas, resulting in poor geometry. To address this issue, we introduce a structural regularization method to preserve local structural details. 
To evaluate the effectiveness of our approach, we establish an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains observations of objects from 9 categories seen from 360-degree viewpoints captured with multiple LiDAR sensors. Our extensive experiments on the scene-level KITTI-360 dataset, and on our object-level NeRF-MVL show that our LiDAR-NeRF surpasses the model-based algorithms significantly. ### Radar-Camera Fusion for Object Detection and Semantic Segmentation in Autonomous Driving: A Comprehensive Review - **Arxiv ID**: http://arxiv.org/abs/2304.10410v2 - **DOI**: 10.1109/TIV.2023.3307157 - **Categories**: **cs.CV**, cs.AI, cs.RO - **Links**: [PDF](http://arxiv.org/pdf/2304.10410v2) - **Published**: 2023-04-20 15:48:50+00:00 - **Updated**: 2023-08-23 15:15:59+00:00 - **Authors**: Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, Yutao Yue - **Comment**: Accepted by IEEE Transactions on Intelligent Vehicles (T-IV) - **Journal**: None - **Summary**: Driven by deep learning techniques, perception technology in autonomous driving has developed rapidly in recent years, enabling vehicles to accurately detect and interpret surrounding environment for safe and efficient navigation. To achieve accurate and robust perception capabilities, autonomous vehicles are often equipped with multiple sensors, making sensor fusion a crucial part of the perception system. Among these fused sensors, radars and cameras enable a complementary and cost-effective perception of the surrounding environment regardless of lighting and weather conditions. 
This review aims to provide a comprehensive guideline for radar-camera fusion, particularly concentrating on perception tasks related to object detection and semantic segmentation. Based on the principles of the radar and camera sensors, we delve into the data processing process and representations, followed by an in-depth analysis and summary of radar-camera fusion datasets. In the review of methodologies in radar-camera fusion, we address interrogative questions, including "why to fuse", "what to fuse", "where to fuse", "when to fuse", and "how to fuse", subsequently discussing various challenges and potential research directions within this domain. To ease the retrieval and comparison of datasets and fusion methods, we also provide an interactive website: https://radar-camera-fusion.github.io. ### NTIRE 2023 Challenge on Light Field Image Super-Resolution: Dataset, Methods and Results - **Arxiv ID**: http://arxiv.org/abs/2304.10415v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10415v1) - **Published**: 2023-04-20 15:59:31+00:00 - **Updated**: 2023-04-20 15:59:31+00:00 - **Authors**: Yingqian Wang, Longguang Wang, Zhengyu Liang, Jungang Yang, Radu Timofte, Yulan Guo - **Comment**: None - **Journal**: None - **Summary**: In this report, we summarize the first NTIRE challenge on light field (LF) image super-resolution (SR), which aims at super-resolving LF images under the standard bicubic degradation with a magnification factor of 4. This challenge develops a new LF dataset called NTIRE-2023 for validation and test, and provides a toolbox called BasicLFSR to facilitate model development. Compared with single image SR, the major challenge of LF image SR lies in how to exploit complementary angular information from plenty of views with varying disparities. 
In total, 148 participants have registered for the challenge, and 11 teams have successfully submitted results with PSNR scores higher than the baseline method LF-InterNet. These newly developed methods have set a new state of the art in LF image SR, e.g., the winning method achieves around 1 dB PSNR improvement over the existing state-of-the-art method DistgSSR. We report the solutions proposed by the participants, and summarize their common trends and useful tricks. We hope this challenge can stimulate future research and inspire new ideas in LF image SR. ### SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation - **Arxiv ID**: http://arxiv.org/abs/2304.10417v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10417v2) - **Published**: 2023-04-20 16:01:55+00:00 - **Updated**: 2023-08-19 20:34:13+00:00 - **Authors**: Nikos Athanasiou, Mathis Petrovich, Michael J. Black, Gül Varol - **Comment**: ICCV 2023 Camera Ready - **Journal**: None - **Summary**: Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. 
However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). Our experiments show that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/. ### A Study on Reproducibility and Replicability of Table Structure Recognition Methods - **Arxiv ID**: http://arxiv.org/abs/2304.10439v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10439v1) - **Published**: 2023-04-20 16:30:58+00:00 - **Updated**: 2023-04-20 16:30:58+00:00 - **Authors**: Kehinde Ajayi, Muntabir Hasan Choudhury, Sarah Rajtmajer, Jian Wu - **Comment**: 10 pages, 5 figures - **Journal**: None - **Summary**: Concerns about reproducibility in artificial intelligence (AI) have emerged, as researchers have reported unsuccessful attempts to directly reproduce published findings in the field. Replicability, the ability to affirm a finding using the same procedures on new data, has not been well studied. In this paper, we examine both reproducibility and replicability of a corpus of 16 papers on table structure recognition (TSR), an AI task aimed at identifying cell locations of tables in digital documents. We attempt to reproduce published results using codes and datasets provided by the original authors. We then examine replicability using a dataset similar to the original as well as a new dataset, GenTSR, consisting of 386 annotated tables extracted from scientific papers. Out of 16 papers studied, we reproduce results consistent with the original in only four. Two of the four papers are identified as replicable using the similar dataset under certain IoU values. No paper is identified as replicable using the new dataset. 
We offer observations on the causes of irreproducibility and irreplicability. All code and data are available on Codeocean at https://codeocean.com/capsule/6680116/tree. ### OpenLane-V2: A Topology Reasoning Benchmark for Scene Understanding in Autonomous Driving - **Arxiv ID**: http://arxiv.org/abs/2304.10440v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10440v2) - **Published**: 2023-04-20 16:31:22+00:00 - **Updated**: 2023-06-10 17:22:09+00:00 - **Authors**: Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Yuting Wang, Shengyin Jiang, Peijin Jia, Bangjun Wang, Feng Wen, Hang Xu, Ping Luo, Junchi Yan, Wei Zhang, Hongyang Li - **Comment**: OpenLane-V2 dataset: https://github.com/OpenDriveLab/OpenLane-V2 - **Journal**: None - **Summary**: Accurately depicting the complex traffic scene is a vital component for autonomous vehicles to execute accurate judgments. However, existing benchmarks tend to oversimplify the scene by solely focusing on lane perception tasks. Observing that human drivers rely on both lanes and traffic signals to operate their vehicles safely, we present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure. The objective of the presented dataset is to advance research in understanding the structure of road scenes by examining the relationship between perceived entities, such as traffic elements and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes. It comprises three primary sub-tasks, including the 3D lane detection inherited from OpenLane, accompanied by corresponding metrics to evaluate the model's performance. We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes. 
### DPAF: Image Synthesis via Differentially Private Aggregation in Forward Phase - **Arxiv ID**: http://arxiv.org/abs/2304.12185v1 - **DOI**: None - **Categories**: **cs.CV**, cs.CR, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.12185v1) - **Published**: 2023-04-20 16:32:02+00:00 - **Updated**: 2023-04-20 16:32:02+00:00 - **Authors**: Chih-Hsun Lin, Chia-Yi Hsu, Chia-Mu Yu, Yang Cao, Chun-Ying Huang - **Comment**: None - **Journal**: None - **Summary**: Differentially private synthetic data is a promising alternative for sensitive data release. Many differentially private generative models have been proposed in the literature. Unfortunately, they all suffer from the low utility of the synthetic data, particularly for images of high resolutions. Here, we propose DPAF, an effective differentially private generative model for high-dimensional image synthesis. Different from the prior private stochastic gradient descent-based methods that add Gaussian noises in the backward phase during the model training, DPAF adds a differentially private feature aggregation in the forward phase, bringing advantages, including the reduction of information loss in gradient clipping and low sensitivity for the aggregation. Moreover, as an improper batch size has an adverse impact on the utility of synthetic data, DPAF also tackles the problem of setting a proper batch size by proposing a novel training strategy that asymmetrically trains different parts of the discriminator. We extensively evaluate different methods on multiple image datasets (up to images of 128x128 resolution) to demonstrate the performance of DPAF. 
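To make the idea of a differentially private aggregation step more concrete, here is a minimal sketch of a generic Gaussian mechanism applied to a sum of clipped feature vectors. It is not the DPAF implementation: the function name `dp_aggregate`, the parameters `clip_norm` and `noise_std`, and the choice of clipping followed by summation are all assumptions made for illustration. Clipping each vector to an L2 norm of at most `clip_norm` bounds how much adding or removing one sample can change the sum, which is what allows calibrated noise to be added.

```python
import math
import random


def dp_aggregate(features, clip_norm, noise_std, rng):
    """Sum per-sample feature vectors after L2 clipping, then add Gaussian noise.

    Hypothetical helper for illustration only; not the authors' code.
    """
    dim = len(features[0])
    total = [0.0] * dim
    for vec in features:
        norm = math.sqrt(sum(x * x for x in vec))
        # Scale the vector down only if its norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(vec):
            total[i] += x * scale
    # Independent Gaussian noise on each coordinate of the aggregate.
    return [t + rng.gauss(0.0, noise_std) for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_aggregate(batch, clip_norm=1.0, noise_std=0.5, rng=rng))
```

The noise scale needed for a given privacy budget depends on the accounting method used, which this sketch leaves out.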
### Securing Neural Networks with Knapsack Optimization - **Arxiv ID**: http://arxiv.org/abs/2304.10442v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10442v1) - **Published**: 2023-04-20 16:40:10+00:00 - **Updated**: 2023-04-20 16:40:10+00:00 - **Authors**: Yakir Gorski, Shai Avidan - **Comment**: None - **Journal**: None - **Summary**: Deep learning inference brings together the data and the Convolutional Neural Network (CNN). This is problematic in case the user wants to preserve the privacy of the data and the service provider does not want to reveal the weights of his CNN. Secure Inference allows the two parties to engage in a protocol that preserves their respective privacy concerns, while revealing only the inference result to the user. This is known as Multi-Party Computation (MPC). A major bottleneck of MPC algorithms is communication, as the parties must send data back and forth. The linear component of a CNN (i.e. convolutions) can be done efficiently with minimal communication, but the non-linear part (i.e., ReLU) requires the bulk of communication bandwidth. We propose two ways to accelerate Secure Inference. The first is based on the observation that the ReLU outcome of many convolutions is highly correlated. Therefore, we replace the per pixel ReLU operation by a ReLU operation per patch. Each layer in the network will benefit from a patch of a different size and we devise an algorithm to choose the optimal set of patch sizes through a novel reduction of the problem to a knapsack problem. The second way to accelerate Secure Inference is based on cutting the number of bit comparisons required for a secure ReLU operation. 
We demonstrate the cumulative effect of these tools in the semi-honest secure 3-party setting for four problems: Classifying ImageNet using ResNet50 backbone, classifying CIFAR100 using ResNet18 backbone, semantic segmentation of ADE20K using MobileNetV2 backbone and semantic segmentation of Pascal VOC 2012 using ResNet50 backbone. Our source code is publicly available: https://github.com/yg320/secure_inference ### Certified Adversarial Robustness Within Multiple Perturbation Bounds - **Arxiv ID**: http://arxiv.org/abs/2304.10446v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10446v1) - **Published**: 2023-04-20 16:42:44+00:00 - **Updated**: 2023-04-20 16:42:44+00:00 - **Authors**: Soumalya Nandi, Sravanti Addepalli, Harsh Rangwani, R. Venkatesh Babu - **Comment**: None - **Journal**: None - **Summary**: Randomized smoothing (RS) is a well known certified defense against adversarial attacks, which creates a smoothed classifier by predicting the most likely class under random noise perturbations of inputs during inference. While initial work focused on robustness to $\ell_2$ norm perturbations using noise sampled from a Gaussian distribution, subsequent works have shown that different noise distributions can result in robustness to other $\ell_p$ norm bounds as well. In general, a specific noise distribution is optimal for defending against a given $\ell_p$ norm based attack. In this work, we aim to improve the certified adversarial robustness against multiple perturbation bounds simultaneously. Towards this, we firstly present a novel *certification scheme*, that effectively combines the certificates obtained using different noise distributions to obtain optimal results against multiple perturbation bounds. 
We further propose a novel *training noise distribution* along with a *regularized training scheme* to improve the certification within both $\ell_1$ and $\ell_2$ perturbation norms simultaneously. Contrary to prior works, we compare the certified robustness of different training algorithms across the same natural (clean) accuracy, rather than across fixed noise levels used for training and certification. We also empirically invalidate the argument that training and certifying the classifier with the same amount of noise gives the best results. The proposed approach achieves improvements on the ACR (Average Certified Radius) metric across both $\ell_1$ and $\ell_2$ perturbation bounds. ### ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects - **Arxiv ID**: http://arxiv.org/abs/2304.10448v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10448v1) - **Published**: 2023-04-20 16:43:58+00:00 - **Updated**: 2023-04-20 16:43:58+00:00 - **Authors**: Marco Toschi, Riccardo De Matteo, Riccardo Spezialetti, Daniele De Gregorio, Luigi Di Stefano, Samuele Salti - **Comment**: Accepted at CVPR 2023 as a highlight - **Journal**: None - **Summary**: In this paper, we focus on the problem of rendering novel views from a Neural Radiance Field (NeRF) under unobserved light conditions. To this end, we introduce a novel dataset, dubbed ReNe (Relighting NeRF), framing real world objects under one-light-at-time (OLAT) conditions, annotated with accurate ground-truth camera and light poses. Our acquisition pipeline leverages two robotic arms holding, respectively, a camera and an omni-directional point-wise light source. We release a total of 20 scenes depicting a variety of objects with complex geometry and challenging materials. Each scene includes 2000 images, acquired from 50 different points of views under 40 different OLAT conditions. 
By leveraging the dataset, we perform an ablation study on the relighting capability of variants of the vanilla NeRF architecture and identify a lightweight architecture that can render novel views of an object under novel light conditions, which we use to establish a non-trivial baseline for the dataset. Dataset and benchmark are available at https://eyecan-ai.github.io/rene. ### Angle based dynamic learning rate for gradient descent - **Arxiv ID**: http://arxiv.org/abs/2304.10457v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10457v1) - **Published**: 2023-04-20 16:55:56+00:00 - **Updated**: 2023-04-20 16:55:56+00:00 - **Authors**: Neel Mishra, Pawan Kumar - **Comment**: 8 pages, 7 figures - **Journal**: None - **Summary**: In our work, we propose a novel yet simple approach to obtain an adaptive learning rate for gradient-based descent methods on classification tasks. Instead of the traditional approach of selecting adaptive learning rates via the decayed expectation of gradient-based terms, we use the angle between the current gradient and the new gradient: this new gradient is computed from the direction orthogonal to the current gradient, which further helps us in determining a better adaptive learning rate based on angle history, thereby, leading to relatively better accuracy compared to the existing state-of-the-art optimizers. On a wide variety of benchmark datasets with prominent image classification architectures such as ResNet, DenseNet, EfficientNet, and VGG, we find that our method leads to the highest accuracy in most of the datasets. Moreover, we prove that our method is convergent. 
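The quantity at the centre of the angle-based learning-rate abstract above is the angle between two gradient vectors. A minimal sketch of that computation follows. The function name `angle_between` is hypothetical, and the abstract does not say how the angle history is turned into a learning rate, so that step is deliberately left out.

```python
import math


def angle_between(g1, g2):
    """Return the angle in radians between two gradient vectors.

    Illustrative helper only; returns 0.0 if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.acos(cos_theta)


if __name__ == "__main__":
    current_grad = [1.0, 0.0]
    new_grad = [1.0, 1.0]
    print(angle_between(current_grad, new_grad))
```

A small angle means the two gradients point in nearly the same direction, while an angle near pi means they nearly oppose each other.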
### Implicit Temporal Modeling with Learnable Alignment for Video Recognition - **Arxiv ID**: http://arxiv.org/abs/2304.10465v2 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10465v2) - **Published**: 2023-04-20 17:11:01+00:00 - **Updated**: 2023-08-15 08:04:00+00:00 - **Authors**: Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang - **Comment**: ICCV 2023 oral. 14 pages, 7 figures. Code released at https://github.com/Francis-Rings/ILA - **Journal**: None - **Summary**: Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in various image tasks. However, how to extend CLIP with effective temporal modeling is still an open and crucial problem. Existing factorized or joint spatial-temporal modeling trades off between efficiency and performance. While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already provides enough essence without temporal attention. To this end, in this paper, we propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance. Specifically, for a frame pair, an interactive point is predicted in each frame, serving as a mutual information rich region. By enhancing the features around the interactive point, two frames are implicitly aligned. The aligned features are then pooled into a single token, which is leveraged in the subsequent spatial self-attention. Our method allows eliminating the costly or insufficient temporal self-attention in video. Extensive experiments on benchmarks demonstrate the superiority and generality of our module. Particularly, the proposed ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H. Code is released at https://github.com/Francis-Rings/ILA . 
### Reconstructing Signing Avatars From Video Using Linguistic Priors - **Arxiv ID**: http://arxiv.org/abs/2304.10482v1 - **DOI**: None - **Categories**: **cs.CV**, cs.GR - **Links**: [PDF](http://arxiv.org/pdf/2304.10482v1) - **Published**: 2023-04-20 17:29:50+00:00 - **Updated**: 2023-04-20 17:29:50+00:00 - **Authors**: Maria-Paola Forte, Peter Kulits, Chun-Hao Huang, Vasileios Choutas, Dimitrios Tzionas, Katherine J. Kuchenbecker, Michael J. Black - **Comment**: None - **Journal**: None - **Summary**: Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at http://sgnify.is.tue.mpg.de. 
### Segment Anything Model for Medical Image Analysis: an Experimental Study - **Arxiv ID**: http://arxiv.org/abs/2304.10517v3 - **DOI**: 10.1016/j.media.2023.102918 - **Categories**: **cs.CV**, cs.AI, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10517v3) - **Published**: 2023-04-20 17:50:18+00:00 - **Updated**: 2023-05-17 17:20:46+00:00 - **Authors**: Maciej A. Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, Yixin Zhang - **Comment**: Link to our code: https://github.com/mazurowski-lab/segment-anything-medical-evaluation - **Journal**: Medical Image Analysis, 102918 (2023) - **Summary**: Training segmentation models for medical images continues to be challenging due to the limited availability of data annotations. Segment Anything Model (SAM) is a foundation model that is intended to segment user-defined objects of interest in an interactive manner. While the performance on natural images is impressive, medical image domains pose their own set of challenges. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 19 medical imaging datasets from various modalities and anatomies. We report the following findings: (1) SAM's performance based on single prompts highly varies depending on the dataset and the task, from IoU=0.1135 for spine MRI to IoU=0.8650 for hip X-ray. (2) Segmentation performance appears to be better for well-circumscribed objects with prompts with less ambiguity and poorer in various other scenarios such as the segmentation of brain tumors. (3) SAM performs notably better with box prompts than with point prompts. (4) SAM outperforms similar methods RITM, SimpleClick, and FocalClick in almost all single-point prompt settings. (5) When multiple-point prompts are provided iteratively, SAM's performance generally improves only slightly while other methods' performance improves to the level that surpasses SAM's point-based performance. 
We also provide several illustrations for SAM's performance on all tested datasets, iterative segmentation, and SAM's behavior given prompt ambiguity. We conclude that SAM shows impressive zero-shot segmentation performance for certain medical imaging datasets, but moderate to poor performance for others. SAM has the potential to make a significant impact in automated medical image segmentation in medical imaging, but appropriate care needs to be applied when using it. ### Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget - **Arxiv ID**: http://arxiv.org/abs/2304.10520v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10520v1) - **Published**: 2023-04-20 17:51:09+00:00 - **Updated**: 2023-04-20 17:51:09+00:00 - **Authors**: Johannes Lehner, Benedikt Alkin, Andreas Fürst, Elisabeth Rumetshofer, Lukas Miklautz, Sepp Hochreiter - **Comment**: under review - **Journal**: None - **Summary**: Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, for adapting to downstream tasks, they require a sufficient amount of labeled data since their rich features capture not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that applies Nearest Neighbor Contrastive Learning (NNCLR) to a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. 
Applied to large and huge Vision Transformer (ViT) models, MAE-CT matches or excels previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy. Notably, similar results can be achieved without additional image augmentations. While ID methods generally rely on hand-crafted augmentations to avoid shortcut learning, we find that nearest neighbor lookup is sufficient and that this data-driven augmentation effect improves with model size. MAE-CT is compute efficient. For instance, starting from a MAE pre-trained ViT-L/16, MAE-CT increases the ImageNet 1% low-shot accuracy from 67.7% to 72.6%, linear probing accuracy from 76.0% to 80.2% and k-NN accuracy from 60.6% to 79.1% in just five hours using eight A100 GPUs. ### GenCorres: Consistent Shape Matching via Coupled Implicit-Explicit Shape Generative Models - **Arxiv ID**: http://arxiv.org/abs/2304.10523v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10523v1) - **Published**: 2023-04-20 17:52:58+00:00 - **Updated**: 2023-04-20 17:52:58+00:00 - **Authors**: Haitao Yang, Xiangru Huang, Bo Sun, Chandrajit Bajaj, Qixing Huang - **Comment**: None - **Journal**: None - **Summary**: This paper introduces GenCorres, a novel unsupervised joint shape matching (JSM) approach. The basic idea of GenCorres is to learn a parametric mesh generator to fit an unorganized deformable shape collection while constraining deformations between adjacent synthetic shapes to preserve geometric structures such as local rigidity and local conformality. GenCorres presents three appealing advantages over existing JSM techniques. First, GenCorres performs JSM among a synthetic shape collection whose size is much bigger than the input shapes and fully leverages the data-driven power of JSM. 
Second, GenCorres unifies consistent shape matching and pairwise matching (i.e., by enforcing deformation priors between adjacent synthetic shapes). Third, the generator provides a concise encoding of consistent shape correspondences. However, learning a mesh generator from an unorganized shape collection is challenging. It requires a good initial fitting to each shape and can easily get trapped by local minimums. GenCorres addresses this issue by learning an implicit generator from the input shapes, which provides intermediate shapes between two arbitrary shapes. We introduce a novel approach for computing correspondences between adjacent implicit surfaces and force the correspondences to preserve geometric structures and be cycle-consistent. Synthetic shapes of the implicit generator then guide initial fittings (i.e., via template-based deformation) for learning the mesh generator. Experimental results show that GenCorres considerably outperforms state-of-the-art JSM techniques on benchmark datasets. The synthetic shapes of GenCorres preserve local geometric features and yield competitive performance gains against state-of-the-art deformable shape generators. ### Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance - **Arxiv ID**: http://arxiv.org/abs/2304.10528v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10528v1) - **Published**: 2023-04-20 17:58:26+00:00 - **Updated**: 2023-04-20 17:58:26+00:00 - **Authors**: Haiwen Feng, Peter Kulits, Shichen Liu, Michael J. Black, Victoria Abrevaya - **Comment**: Project page: https://arteq.is.tue.mpg.de - **Journal**: None - **Summary**: We address the problem of fitting a parametric human body model (SMPL) to point cloud data. Optimization-based methods require careful initialization and are prone to becoming trapped in local optima. 
Learning-based methods address this but do not generalize well when the input pose is far from those seen during training. For rigid point clouds, remarkable generalization has been achieved by leveraging SE(3)-equivariant networks, but these methods do not work on articulated objects. In this work we extend this idea to human bodies and propose ArtEq, a novel part-based SE(3)-equivariant neural architecture for SMPL model estimation from point clouds. Specifically, we learn a part detection network by leveraging local SO(3) invariance, and regress shape and pose using articulated SE(3) shape-invariant and pose-equivariant networks, all trained end-to-end. Our novel equivariant pose regression module leverages the permutation-equivariant property of self-attention layers to preserve rotational equivariance. Experimental results show that ArtEq can generalize to poses not seen during training, outperforming state-of-the-art methods by 74.5%, without requiring an optimization refinement step. Further, compared with competing works, our method is more than three orders of magnitude faster during inference and has 97.3% fewer parameters. The code and model will be available for research purposes at https://arteq.is.tue.mpg.de. ### Collaborative Diffusion for Multi-Modal Face Generation and Editing - **Arxiv ID**: http://arxiv.org/abs/2304.10530v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10530v1) - **Published**: 2023-04-20 17:59:02+00:00 - **Updated**: 2023-04-20 17:59:02+00:00 - **Authors**: Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu - **Comment**: CVPR 2023. Project page: https://ziqihuangg.github.io/projects/collaborative-diffusion.html Code: https://github.com/ziqihuangg/Collaborative-Diffusion - **Journal**: None - **Summary**: Diffusion models have recently emerged as a powerful generative tool. 
Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash the users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary regarding the latent denoising steps, where bilateral connections can be established upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only collaborates generation capabilities from uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency. ### Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs - **Arxiv ID**: http://arxiv.org/abs/2304.10532v2 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.GR - **Links**: [PDF](http://arxiv.org/pdf/2304.10532v2) - **Published**: 2023-04-20 17:59:05+00:00 - **Updated**: 2023-04-21 22:41:20+00:00 - **Authors**: Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, Angjoo Kanazawa - **Comment**: https://ethanweber.me/nerfbusters - **Journal**: None - **Summary**: Casually captured Neural Radiance Fields (NeRFs) suffer from artifacts such as floaters or flawed geometry when rendered outside the camera trajectory. 
Existing evaluation protocols often do not capture these effects, since they usually only assess image quality at every 8th frame of the training capture. To push forward progress in novel-view synthesis, we propose a new dataset and evaluation procedure, where two camera trajectories are recorded of the scene: one used for training, and the other for evaluation. In this more challenging in-the-wild setting, we find that existing hand-crafted regularizers do not remove floaters nor improve scene geometry. Thus, we propose a 3D diffusion-based method that leverages local 3D priors and a novel density-based score distillation sampling loss to discourage artifacts during NeRF optimization. We show that this data-driven prior removes floaters and improves scene geometry for casual captures. ### Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion - **Arxiv ID**: http://arxiv.org/abs/2304.10535v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10535v1) - **Published**: 2023-04-20 17:59:34+00:00 - **Updated**: 2023-04-20 17:59:34+00:00 - **Authors**: Tomas Jakab, Ruining Li, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi - **Comment**: Project page: http://farm3d.github.io - **Journal**: None - **Summary**: We present Farm3D, a method to learn category-specific 3D reconstructors for articulated objects entirely from "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn, given a collection of single-view images of an object category, a monocular network to predict the 3D shape, albedo, illumination and viewpoint of any object occurrence. We propose a framework using an image generator like Stable Diffusion to generate virtual training data for learning such a reconstruction network from scratch. Furthermore, we include the diffusion model as a score to further improve learning. 
The idea is to randomise some aspects of the reconstruction, such as viewpoint and illumination, generating synthetic views of the reconstructed 3D object, and have the 2D network assess the quality of the resulting image, providing feedback to the reconstructor. Different from work based on distillation which produces a single 3D asset for each textual prompt in hours, our approach produces a monocular reconstruction network that can output a controllable 3D asset from a given image, real or generated, in only seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games. ### Learning Sparse and Low-Rank Priors for Image Recovery via Iterative Reweighted Least Squares Minimization - **Arxiv ID**: http://arxiv.org/abs/2304.10536v1 - **DOI**: None - **Categories**: **cs.CV**, eess.IV, I.4.5; I.4.10 - **Links**: [PDF](http://arxiv.org/pdf/2304.10536v1) - **Published**: 2023-04-20 17:59:45+00:00 - **Updated**: 2023-04-20 17:59:45+00:00 - **Authors**: Stamatios Lefkimmiatis, Iaroslav Koshelev - **Comment**: Paper has been accepted and will appear at ICLR 2023 proceedings. Link to OpenReview: https://openreview.net/forum?id=TXPN6MtdSE4 - **Journal**: None - **Summary**: We introduce a novel optimization algorithm for image recovery under learned sparse and low-rank constraints, which we parameterize as weighted extensions of the $\ell_p^p$-vector and $\mathcal S_p^p$ Schatten-matrix quasi-norms for $0\!<p\!\le1$, respectively. Our proposed algorithm generalizes the Iteratively Reweighted Least Squares (IRLS) method, used for signal recovery under $\ell_1$ and nuclear-norm constrained minimization. Further, we interpret our overall minimization approach as a recurrent network that we then employ to deal with inverse low-level computer vision problems. 
Thanks to the convergence guarantees that our IRLS strategy offers, we are able to train the derived reconstruction networks using a memory-efficient implicit back-propagation scheme, which does not pose any restrictions on their effective depth. To assess our networks' performance, we compare them against other existing reconstruction methods on several inverse problems, namely image deblurring, super-resolution, demosaicking and sparse recovery. Our reconstruction results are shown to be very competitive and in many cases outperform those of existing unrolled networks, whose number of parameters is orders of magnitude higher than that of our learned models. ### Learning Neural Duplex Radiance Fields for Real-Time View Synthesis - **Arxiv ID**: http://arxiv.org/abs/2304.10537v1 - **DOI**: None - **Categories**: **cs.CV**, cs.GR - **Links**: [PDF](http://arxiv.org/pdf/2304.10537v1) - **Published**: 2023-04-20 17:59:52+00:00 - **Updated**: 2023-04-20 17:59:52+00:00 - **Authors**: Ziyu Wan, Christian Richardt, Aljaž Božič, Chao Li, Vijay Rengarajan, Seonghyeon Nam, Xiaoyu Xiang, Tuotuo Li, Bo Zhu, Rakesh Ranjan, Jing Liao - **Comment**: CVPR 2023. Project page: http://raywzy.com/NDRF - **Journal**: None - **Summary**: Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations - for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. 
We represent scenes as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies in 3D surface reconstruction by learning the aggregated radiance information from a reliable interval of ray-surface intersections. To exploit local geometric relationships of nearby pixels, we leverage screen-space convolutions instead of the MLPs used in NeRFs to achieve high-quality appearance. Finally, the performance of the whole framework is further boosted by a novel multi-view distillation optimization strategy. We demonstrate the effectiveness and superiority of our approach via extensive experiments on a range of standard datasets. ### Invariant Scattering Transform for Medical Imaging - **Arxiv ID**: http://arxiv.org/abs/2304.10582v2 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10582v2) - **Published**: 2023-04-20 18:12:50+00:00 - **Updated**: 2023-05-31 17:02:48+00:00 - **Authors**: Md Manjurul Ahsan, Shivakumar Raman, Zahed Siddique - **Comment**: Accepted for Springer book chapter for a book "Data-driven approaches to Medical Imaging" - **Journal**: None - **Summary**: Over the years, the Invariant Scattering Transform (IST) technique has become popular for medical image analysis, including using wavelet transform computation using Convolutional Neural Networks (CNN) to capture patterns' scale and orientation in the input signal. IST aims to be invariant to transformations that are common in medical images, such as translation, rotation, scaling, and deformation, used to improve the performance in medical imaging applications such as segmentation, classification, and registration, which can be integrated into machine learning algorithms for disease detection, diagnosis, and treatment planning. Additionally, combining IST with deep learning approaches has the potential to leverage their strengths and enhance medical image analysis outcomes. 
This study provides an overview of IST in medical imaging by considering the types of IST, their application, limitations, and potential scopes for future researchers and practitioners. ### MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - **Arxiv ID**: http://arxiv.org/abs/2304.10592v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10592v1) - **Published**: 2023-04-20 18:25:35+00:00 - **Updated**: 2023-04-20 18:25:35+00:00 - **Authors**: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny - **Comment**: Project Website: https://minigpt-4.github.io/; Code, Pretrained Model, and Dataset: https://github.com/Vision-CAIR/MiniGPT-4; Deyao Zhu and Jun Chen contributed equally to this work - **Journal**: None - **Summary**: The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. 
To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. ### Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models - **Arxiv ID**: http://arxiv.org/abs/2304.10597v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2304.10597v1) - **Published**: 2023-04-20 18:39:41+00:00 - **Updated**: 2023-04-20 18:39:41+00:00 - **Authors**: Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Lan Mu, Mengxuan Hu, Sheng Li - **Comment**: 10 pages, 6 figures - **Journal**: None - **Summary**: Recent advancements in foundation models (FMs), such as GPT-4 and LLaMA, have attracted significant attention due to their exceptional performance in zero-shot learning scenarios. Similarly, in the field of visual learning, models like Grounding DINO and the Segment Anything Model (SAM) have exhibited remarkable progress in open-set detection and instance segmentation tasks. It is undeniable that these FMs will profoundly impact a wide range of real-world visual learning tasks, ushering in a new paradigm shift for developing such models. In this study, we concentrate on the remote sensing domain, where the images are notably dissimilar from those in conventional scenarios. We developed a pipeline that leverages multiple FMs to facilitate remote sensing image semantic segmentation tasks guided by text prompt, which we denote as Text2Seg. The pipeline is benchmarked on several widely-used remote sensing datasets, and we present preliminary results to demonstrate its effectiveness. 
Through this work, we aim to provide insights into maximizing the applicability of visual FMs in specific contexts with minimal model tuning. The code is available at https://github.com/Douglas2Code/Text2Seg. ### Multi-domain learning CNN model for microscopy image classification - **Arxiv ID**: http://arxiv.org/abs/2304.10616v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10616v1) - **Published**: 2023-04-20 19:32:23+00:00 - **Updated**: 2023-04-20 19:32:23+00:00 - **Authors**: Duc Hoa Tran, Michel Meunier, Farida Cheriet - **Comment**: None - **Journal**: None - **Summary**: For any type of microscopy image, getting a deep learning model to work well requires considerable effort to select a suitable architecture and time to train it. As there is a wide range of microscopes and experimental setups, designing a single model that can apply to multiple imaging domains, instead of having multiple per-domain models, becomes more essential. This task is challenging and somehow overlooked in the literature. In this paper, we present a multi-domain learning architecture for the classification of microscopy images that differ significantly in types and contents. Unlike previous methods that are computationally intensive, we have developed a compact model, called Mobincep, by combining the simple but effective techniques of depth-wise separable convolution and the inception module. We also introduce a new optimization technique to regulate the latent feature space during training to improve the network's performance. We evaluated our model on three different public datasets and compared its performance in single-domain and multiple-domain learning modes. The proposed classifier surpasses state-of-the-art results and is robust for limited labeled data. Moreover, it helps to eliminate the burden of designing a new network when switching to new experiments. 
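The Mobincep summary above credits depth-wise separable convolution for the model's compactness. The following is a minimal sketch of the weight-count arithmetic behind that general technique, not the Mobincep architecture itself; the channel sizes and kernel size are hypothetical values chosen only for illustration, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not taken from the paper.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std  # equals 1 / c_out + 1 / k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.4f}")
```

For a 3 x 3 kernel the ratio is dominated by the 1 / k² term, so the separable layer needs roughly one ninth of the weights plus a small channel-dependent overhead.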
### Enhancing object detection robustness: A synthetic and natural perturbation approach - **Arxiv ID**: http://arxiv.org/abs/2304.10622v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10622v1) - **Published**: 2023-04-20 19:55:51+00:00 - **Updated**: 2023-04-20 19:55:51+00:00 - **Authors**: Nilantha Premakumara, Brian Jalaian, Niranjan Suri, Hooman Samani - **Comment**: 09 pages, 4 figures - **Journal**: None - **Summary**: Robustness against real-world distribution shifts is crucial for the successful deployment of object detection models in practical applications. In this paper, we address the problem of assessing and enhancing the robustness of object detection models against natural perturbations, such as varying lighting conditions, blur, and brightness. We analyze four state-of-the-art deep neural network models, Detr-ResNet-101, Detr-ResNet-50, YOLOv4, and YOLOv4-tiny, using the COCO 2017 dataset and ExDark dataset. By simulating synthetic perturbations with the AugLy package, we systematically explore the optimal level of synthetic perturbation required to improve the models' robustness through data augmentation techniques. Our comprehensive ablation study meticulously evaluates the impact of synthetic perturbations on object detection models' performance against real-world distribution shifts, establishing a tangible connection between synthetic augmentation and real-world robustness. Our findings not only substantiate the effectiveness of synthetic perturbations in improving model robustness, but also provide valuable insights for researchers and practitioners in developing more robust and reliable object detection models tailored for real-world applications. 
### Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels - **Arxiv ID**: http://arxiv.org/abs/2304.10539v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10539v1) - **Published**: 2023-04-20 20:05:08+00:00 - **Updated**: 2023-04-20 20:05:08+00:00 - **Authors**: Wenqiao Zhang, Changshuo Liu, Lingze Zeng, Beng Chin Ooi, Siliang Tang, Yueting Zhuang - **Comment**: None - **Journal**: None - **Summary**: Conventional multi-label classification (MLC) methods assume that all samples are fully labeled and identically distributed. Unfortunately, this assumption is unrealistic in large-scale MLC data that has long-tailed (LT) distribution and partial labels (PL). To address the problem, we introduce a novel task, Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC), to jointly consider the above two imperfect learning environments. Not surprisingly, we find that most LT-MLC and PL-MLC approaches fail to solve the PLT-MLC, resulting in significant performance degradation on the two proposed PLT-MLC benchmarks. Therefore, we propose an end-to-end learning framework: \textbf{CO}rrection $\rightarrow$ \textbf{M}odificat\textbf{I}on $\rightarrow$ balan\textbf{C}e, abbreviated as \textbf{COMIC}. Our bootstrapping philosophy is to simultaneously correct the missing labels (Correction) with convinced prediction confidence over a class-aware threshold and to learn from these recall labels during training. We next propose a novel multi-focal modifier loss that simultaneously addresses head-tail imbalance and positive-negative imbalance to adaptively modify the attention to different samples (Modification) under the LT class distribution. 
In addition, we develop a balanced training strategy by distilling the model's learning effect from head and tail samples, and thus design a balanced classifier (Balance) conditioned on the head and tail learning effect to maintain stable performance for all samples. Our experimental study shows that the proposed COMIC significantly outperforms general MLC, LT-MLC and PL-MLC methods in terms of effectiveness and robustness on our newly created PLT-MLC datasets. ### HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer - **Arxiv ID**: http://arxiv.org/abs/2304.10628v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10628v1) - **Published**: 2023-04-20 20:09:59+00:00 - **Updated**: 2023-04-20 20:09:59+00:00 - **Authors**: Hao Xiang, Runsheng Xu, Jiaqi Ma - **Comment**: None - **Journal**: None - **Summary**: Vehicle-to-Vehicle technologies have enabled autonomous vehicles to share information to see through occlusions, greatly enhancing perception performance. Nevertheless, existing works all focused on homogeneous traffic where vehicles are equipped with the same type of sensors, which significantly hampers the scale of collaboration and benefit of cross-modality interactions. In this paper, we investigate the multi-agent hetero-modal cooperative perception problem where agents may have distinct sensor modalities. We present HM-ViT, the first unified multi-agent hetero-modal cooperative perception framework that can collaboratively predict 3D objects for highly dynamic vehicle-to-vehicle (V2V) collaborations with varying numbers and types of agents. To effectively fuse features from multi-view images and LiDAR point clouds, we design a novel heterogeneous 3D graph transformer to jointly reason inter-agent and intra-agent interactions. 
The extensive experiments on the V2V perception dataset OPV2V demonstrate that the HM-ViT outperforms SOTA cooperative perception methods for V2V hetero-modal cooperative perception. We will release codes to facilitate future research. ### Medical Image Deidentification, Cleaning and Compression Using Pylogik - **Arxiv ID**: http://arxiv.org/abs/2304.12322v5 - **DOI**: None - **Categories**: **eess.IV**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.12322v5) - **Published**: 2023-04-20 21:23:14+00:00 - **Updated**: 2023-05-10 13:55:49+00:00 - **Authors**: Adrienne Kline, Vinesh Appadurai, Yuan Luo, Sanjiv Shah - **Comment**: updates needed to manuscript - **Journal**: None - **Summary**: Leveraging medical record information in the era of big data and machine learning comes with the caveat that data must be cleaned and de-identified. Facilitating data sharing and harmonization for multi-center collaborations are particularly difficult when protected health information (PHI) is contained or embedded in image meta-data. We propose a novel library in the Python framework, called PyLogik, to help alleviate this issue for ultrasound images, which are particularly challenging because of the frequent inclusion of PHI directly on the images. PyLogik processes the image volumes through a series of text detection/extraction, filtering, thresholding, morphological and contour comparisons. This methodology de-identifies the images, reduces file sizes, and prepares image volumes for applications in deep learning and data sharing. To evaluate its effectiveness in processing ultrasound data, a random sample of 50 cardiac ultrasounds (echocardiograms) were processed through PyLogik, and the outputs were compared with the manual segmentations by an expert user. The Dice coefficient of the two approaches achieved an average value of 0.976. Next, an investigation was conducted to ascertain the degree of information compression achieved using the algorithm. 
Resultant data was found to be on average ~72% smaller after processing by PyLogik. Our results suggest that PyLogik is a viable methodology for data cleaning and de-identification, determining ROI, and file compression which will facilitate efficient storage, use, and dissemination of ultrasound data. Variants of the pipeline have also been created for use with other medical imaging data types. ### A Comparative Neural Radiance Field (NeRF) 3D Analysis of Camera Poses from HoloLens Trajectories and Structure from Motion - **Arxiv ID**: http://arxiv.org/abs/2304.10664v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2304.10664v1) - **Published**: 2023-04-20 22:17:28+00:00 - **Updated**: 2023-04-20 22:17:28+00:00 - **Authors**: Miriam Jäger, Patrick Hübner, Dennis Haitz, Boris Jutzi - **Comment**: 7 pages, 5 figures. Will be published in the ISPRS The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences - **Journal**: None - **Summary**: Neural Radiance Fields (NeRFs) are trained using a set of camera poses and associated images as input to estimate density and color values for each position. The position-dependent density learning is of particular interest for photogrammetry, enabling 3D reconstruction by querying and filtering the NeRF coordinate system based on the object density. While traditional methods like Structure from Motion are commonly used for camera pose calculation in pre-processing for NeRFs, the HoloLens offers an interesting interface for extracting the required input data directly. We present a workflow for high-resolution 3D reconstructions almost directly from HoloLens data using NeRFs. Thereby, different investigations are considered: Internal camera poses from the HoloLens trajectory via a server application, and external camera poses from Structure from Motion, both with an enhanced variant applied through pose refinement. 
Results show that the internal camera poses lead to NeRF convergence with a PSNR of 25\,dB with a simple rotation around the x-axis and enable a 3D reconstruction. Pose refinement enables comparable quality compared to external camera poses, resulting in improved training process with a PSNR of 27\,dB and a better 3D reconstruction. Overall, NeRF reconstructions outperform the conventional photogrammetric dense reconstruction using Multi-View Stereo in terms of completeness and level of detail. ### Feature point detection in HDR images based on coefficient of variation - **Arxiv ID**: http://arxiv.org/abs/2304.10666v1 - **DOI**: 10.1007/s11042-023-16055-9 - **Categories**: **cs.CV**, I.4.0 - **Links**: [PDF](http://arxiv.org/pdf/2304.10666v1) - **Published**: 2023-04-20 22:23:10+00:00 - **Updated**: 2023-04-20 22:23:10+00:00 - **Authors**: Artur Santos Nascimento, Welerson Augusto Lino de Jesus Melo, Daniel Oliveira Dantas, Beatriz Trinchão Andrade - **Comment**: None - **Journal**: None - **Summary**: Feature point (FP) detection is a fundamental step of many computer vision tasks. However, FP detectors are usually designed for low dynamic range (LDR) images. In scenes with extreme light conditions, LDR images present saturated pixels, which degrade FP detection. On the other hand, high dynamic range (HDR) images usually present no saturated pixels but FP detection algorithms do not take advantage of all the information present in such images. FP detection frequently relies on differential methods, which work well in LDR images. However, in HDR images, the differential operation response in bright areas overshadows the response in dark areas. As an alternative to standard FP detection methods, this study proposes an FP detector based on a coefficient of variation (CV) designed for HDR images. The CV operation adapts its response based on the standard deviation of pixels inside a window, working well in both dark and bright areas of HDR images. 
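The coefficient-of-variation response described in the summary above can be illustrated with a small sketch. This is only an assumption-laden illustration of the general idea (standard deviation divided by mean inside a window), not the authors' detector: the function name `cv_response`, the square window, the `radius` parameter, and the `eps` stabiliser are all hypothetical choices.

```python
import math


def cv_response(image, radius=1, eps=1e-12):
    """Per-pixel coefficient of variation (std / mean) over a square window.

    `image` is a list of rows of non-negative luminance values. Windows are
    clipped at the image border. `eps` avoids division by zero (an assumption).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            out[y][x] = math.sqrt(var) / (mean + eps)
    return out


# The same local pattern in a dark region and in a region 1000x brighter.
dark = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
bright = [[v * 1000.0 for v in row] for row in dark]

cv_dark = cv_response(dark)[1][1]
cv_bright = cv_response(bright)[1][1]
print(cv_dark, cv_bright)
```

Because the standard deviation is normalised by the local mean, the two responses are essentially equal, whereas a plain difference-based response would be about 1000 times larger in the bright region.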
The proposed and standard detectors are evaluated by measuring their repeatability rate (RR) and uniformity. Our proposed detector shows better performance when compared to other standard state-of-the-art detectors. In the uniformity metric, our proposed detector surpasses all the other algorithms. On the other hand, when using the repeatability rate metric, the proposed detector is worse than the Harris for HDR and SURF detectors. ### Edge-Aware Image Color Appearance and Difference Modeling - **Arxiv ID**: http://arxiv.org/abs/2304.10669v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2304.10669v1) - **Published**: 2023-04-20 22:55:16+00:00 - **Updated**: 2023-04-20 22:55:16+00:00 - **Authors**: Abhinau K. Venkataramanan - **Comment**: None - **Journal**: None - **Summary**: The perception of color is one of the most important aspects of human vision. From an evolutionary perspective, the accurate perception of color is crucial to distinguishing friend from foe, and food from fatal poison. As a result, humans have developed a keen sense of color and are able to detect subtle differences in appearance, while also robustly identifying colors across illumination and viewing conditions. In this paper, we shall briefly review methods for adapting traditional color appearance and difference models to complex image stimuli, and propose mechanisms to improve their performance. In particular, we find that applying contrast sensitivity functions and local adaptation rules in an edge-aware manner improves image difference predictions. 
### Point-supervised Single-cell Segmentation via Collaborative Knowledge Sharing - **Arxiv ID**: http://arxiv.org/abs/2304.10671v2 - **DOI**: None - **Categories**: **cs.CV**, q-bio.QM - **Links**: [PDF](http://arxiv.org/pdf/2304.10671v2) - **Published**: 2023-04-20 23:22:41+00:00 - **Updated**: 2023-07-10 21:59:38+00:00 - **Authors**: Ji Yu - **Comment**: None - **Journal**: None - **Summary**: Despite their superior performance, deep-learning methods often suffer from the disadvantage of needing large-scale well-annotated training data. In response, recent literature has seen a proliferation of efforts aimed at reducing the annotation burden. This paper focuses on a weakly-supervised training setting for single-cell segmentation models, where the only available training labels are the rough locations of individual cells. The specific problem is of practical interest due to the widely available nuclei counter-stain data in biomedical literature, from which the cell locations can be derived programmatically. Of more general interest is a proposed self-learning method called collaborative knowledge sharing, which is related to but distinct from the more well-known consistency learning methods. This strategy achieves self-learning by sharing knowledge between a principal model and a very light-weight collaborator model. Importantly, the two models are entirely different in their architectures, capacities, and model outputs: In our case, the principal model approaches the segmentation problem from an object-detection perspective, whereas the collaborator model approaches it from a semantic segmentation perspective. We assessed the effectiveness of this strategy by conducting experiments on LIVECell, a large single-cell segmentation dataset of bright-field images, and on the A431 dataset, a fluorescence image dataset in which the location labels are generated automatically from nuclei counter-stain data. 
Implementation code is available at https://github.com/jiyuuchc/lacss ### Deep reproductive feature generation framework for the diagnosis of COVID-19 and viral pneumonia using chest X-ray images - **Arxiv ID**: http://arxiv.org/abs/2304.10677v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2304.10677v1) - **Published**: 2023-04-20 23:52:21+00:00 - **Updated**: 2023-04-20 23:52:21+00:00 - **Authors**: Ceyhun Efe Kayan, Talha Enes Koksal, Arda Sevinc, Abdurrahman Gumus - **Comment**: None - **Journal**: None - **Summary**: The rapid and accurate detection of COVID-19 cases is critical for timely treatment and preventing the spread of the disease. In this study, a two-stage feature extraction framework using eight state-of-the-art pre-trained deep Convolutional Neural Networks (CNNs) and an autoencoder is proposed to determine the health conditions of patients (COVID-19, Normal, Viral Pneumonia) based on chest X-rays. The X-ray scans are divided into four equally sized sections and analyzed by deep pre-trained CNNs. Subsequently, an autoencoder with three hidden layers is trained to extract reproductive features from the concatenated output of the CNNs. To evaluate the performance of the proposed framework, three different classifiers, which are single-layer perceptron (SLP), multi-layer perceptron (MLP), and support vector machine (SVM) are used. Furthermore, the deep CNN architectures are used to create benchmark models and trained on the same dataset for comparison. The proposed framework outperforms other frameworks with pre-trained feature extractors in binary classification and shows competitive results in three-class classification. The proposed methodology is task-independent and suitable for addressing various problems. 
The results show that the discriminative features are a subset of the reproductive features, suggesting that extracting task-independent features is superior to extracting only task-based features. The flexibility and task-independence of the reproductive features make the conceptive information approach more favorable. The proposed methodology is novel and shows promising results for analyzing medical image data. ### A Plug-and-Play Defensive Perturbation for Copyright Protection of DNN-based Applications - **Arxiv ID**: http://arxiv.org/abs/2304.10679v2 - **DOI**: None - **Categories**: **cs.CV**, cs.MM - **Links**: [PDF](http://arxiv.org/pdf/2304.10679v2) - **Published**: 2023-04-20 23:57:39+00:00 - **Updated**: 2023-05-19 00:28:22+00:00 - **Authors**: Donghua Wang, Wen Yao, Tingsong Jiang, Weien Zhou, Lang Lin, Xiaoqian Chen - **Comment**: 9 pages, 7 figures - **Journal**: None - **Summary**: The wide deployment of deep neural network (DNN) based applications (e.g., style transfer, cartoonish) has stimulated the need for copyright protection of such applications' output. Although some traditional visible copyright techniques are available, they would introduce undesired traces and result in a poor user experience. In this paper, we propose a novel plug-and-play invisible copyright protection method based on defensive perturbation for DNN-based applications (i.e., style transfer). Rather than applying the perturbation to attack the DNN model, we explore the potential utilization of perturbation in copyright protection. Specifically, we project the copyright information to the defensive perturbation with the designed copyright encoder, which is added to the image to be protected. Then, we extract the copyright information from the encoded copyrighted image with the devised copyright decoder. 
Furthermore, we use a robustness module to strengthen the decoding capability of the decoder toward images with various distortions (e.g., JPEG compression), which may occur when the user posts the image on social media. To ensure the image quality of encoded images and decoded copyright images, a loss function was elaborately devised. Objective and subjective experiment results demonstrate the effectiveness of the proposed method. We have also conducted physical world tests on social media (i.e., WeChat and Twitter) by posting encoded copyright images. The results show that the copyright information in the encoded image saved from social media can still be correctly extracted.
