# Arxiv Papers in cs.CV on 2024-07-15

### Accessing Vision Foundation Models at ImageNet-level Costs
- **Arxiv ID**: http://arxiv.org/abs/2407.10366v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI, cs.LG
- **Links**: [PDF](http://arxiv.org/pdf/2407.10366v1)
- **Published**: 2024-07-15 00:13:53+00:00
- **Updated**: 2024-07-15 00:13:53+00:00
- **Authors**: Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu
- **Comment**: None
- **Journal**: None
- **Summary**: Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).
### Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
- **Arxiv ID**: http://arxiv.org/abs/2407.10373v1
- **DOI**: None
- **Categories**: **cs.SD**, cs.AI, cs.CV, eess.AS
- **Links**: [PDF](http://arxiv.org/pdf/2407.10373v1)
- **Published**: 2024-07-15 00:47:56+00:00
- **Updated**: 2024-07-15 00:47:56+00:00
- **Authors**: Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng
- **Comment**: ECCV 2024; Project page: https://hechang25.github.io/MVSD
- **Journal**: None
- **Summary**: Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the utilization of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ the diffusion model as foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM called reverberator and one for dereverberation called dereverberator. The dereverberator judges whether the reverberation audio generated by reverberator sounds like being in the conditional visual scenario, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data.
Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, show that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios.

### An Empirical Study of Mamba-based Pedestrian Attribute Recognition
- **Arxiv ID**: http://arxiv.org/abs/2407.10374v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI
- **Links**: [PDF](http://arxiv.org/pdf/2407.10374v1)
- **Published**: 2024-07-15 00:48:06+00:00
- **Updated**: 2024-07-15 00:48:06+00:00
- **Authors**: Xiao Wang, Weizhe Kong, Jiandong Jin, Shiao Wang, Ruichong Gao, Qingchuan Ma, Chenglong Li, Jin Tang
- **Comment**: In Peer Review
- **Journal**: None
- **Summary**: Current strong pedestrian attribute recognition models are developed based on Transformer networks, which are computationally heavy. Recently proposed models with linear complexity (e.g., Mamba) have garnered significant attention and have achieved a good balance between accuracy and computational cost across a variety of visual tasks. Relevant review articles also suggest that while these models can perform well on some pedestrian attribute recognition datasets, they are generally weaker than the corresponding Transformer models. To further tap into the potential of the novel Mamba architecture for PAR tasks, this paper designs and adapts Mamba into two typical PAR frameworks, i.e., the text-image fusion approach and pure vision Mamba multi-label recognition framework. It is found that interacting with attribute tags as additional input does not always lead to an improvement: specifically, Vim can be enhanced, but VMamba cannot. This paper further designs various hybrid Mamba-Transformer variants and conducts thorough experimental validations. These experimental results indicate that simply enhancing Mamba with a Transformer does not always lead to performance improvements but yields better results under certain settings.
We hope this empirical study can further inspire research in Mamba for PAR, and even extend into the domain of multi-label recognition, through the design of these network structures and comprehensive experimentation. The source code of this work will be released at https://github.com/Event-AHU/OpenPAR.

### Enhanced Self-supervised Learning for Multi-modality MRI Segmentation and Classification: A Novel Approach Avoiding Model Collapse
- **Arxiv ID**: http://arxiv.org/abs/2407.10377v2
- **DOI**: None
- **Categories**: **eess.IV**, cs.AI, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2407.10377v2)
- **Published**: 2024-07-15 01:11:30+00:00
- **Updated**: 2024-07-17 07:05:57+00:00
- **Authors**: Linxuan Han, Sa Xiao, Zimeng Li, Haidong Li, Xiuchao Zhao, Fumin Guo, Yeqing Han, Xin Zhou
- **Comment**: None
- **Journal**: None
- **Summary**: Multi-modality magnetic resonance imaging (MRI) can provide complementary information for computer-aided diagnosis. Traditional deep learning algorithms are suitable for identifying specific anatomical structures, segmenting lesions, and classifying diseases with magnetic resonance images. However, manual labels are limited due to high expense, which hinders further improvement of model accuracy. Self-supervised learning (SSL) can effectively learn feature representations from unlabeled data by pre-training and is demonstrated to be effective in natural image analysis. Most SSL methods ignore the similarity of multi-modality MRI, leading to model collapse. This limits the efficiency of pre-training, causing low accuracy in downstream segmentation and classification tasks. To solve this challenge, we establish and validate a multi-modality MRI masked autoencoder consisting of a hybrid mask pattern (HMP) and a pyramid Barlow twin (PBT) module for SSL on multi-modality MRI analysis. The HMP concatenates three masking steps, forcing the SSL to learn the semantic connections of multi-modality images by reconstructing the masked patches.
We have proved that the proposed HMP can avoid model collapse. The PBT module exploits the pyramidal hierarchy of the network to construct a Barlow twin loss between masked and original views, aligning the semantic representations of image patches at different vision scales in latent space. Experiments on BraTS2023, PI-CAI, and lung gas MRI datasets further demonstrate the superiority of our framework over the state-of-the-art. The performance of the segmentation and classification is substantially enhanced, supporting the accurate detection of small lesion areas. The code is available at https://github.com/LinxuanHan/M2-MAE.

### NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
- **Arxiv ID**: http://arxiv.org/abs/2407.10380v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI, cs.CL, cs.IR
- **Links**: [PDF](http://arxiv.org/pdf/2407.10380v1)
- **Published**: 2024-07-15 01:21:56+00:00
- **Updated**: 2024-07-15 01:21:56+00:00
- **Authors**: Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth
- **Comment**: 15 pages, 2 figures, 5 tables
- **Journal**: None
- **Summary**: Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models.
The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle different modalities (text and images) in the dataset instances.

### By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
- **Arxiv ID**: http://arxiv.org/abs/2407.10385v1
- **DOI**: None
- **Categories**: **cs.CL**, cs.AI, cs.CV, cs.LG
- **Links**: [PDF](http://arxiv.org/pdf/2407.10385v1)
- **Published**: 2024-07-15 01:33:54+00:00
- **Updated**: 2024-07-15 01:33:54+00:00
- **Authors**: Hyungjun Yoon, Biniyam Aschalew Tolera, Taesik Gong, Kimin Lee, Sung-Ju Lee
- **Comment**: 21 pages, 16 figures
- **Journal**: None
- **Summary**: Large language models (LLMs) have demonstrated exceptional abilities across various domains. However, utilizing LLMs for ubiquitous sensing applications remains challenging as existing text-prompt methods show significant performance degradation when handling long sensor data sequences. We propose a visual prompting approach for sensor data using multimodal LLMs (MLLMs). We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge. We evaluated our approach on nine sensory tasks involving four sensing modalities, achieving an average of 10% higher accuracy than text-based prompts and reducing token costs by 15.8x.
Our findings highlight the effectiveness and cost-efficiency of visual prompts with MLLMs for various sensory tasks.

### Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
- **Arxiv ID**: http://arxiv.org/abs/2407.10387v1
- **DOI**: None
- **Categories**: **cs.SD**, cs.AI, cs.CV, eess.AS
- **Links**: [PDF](http://arxiv.org/pdf/2407.10387v1)
- **Published**: 2024-07-15 01:49:59+00:00
- **Updated**: 2024-07-15 01:49:59+00:00
- **Authors**: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
- **Comment**: Accepted to ECCV 2024
- **Journal**: None
- **Summary**: Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression of conditioning sound generators on still images and then video features, focusing on quality and semantic matching while ignoring synchronization, or by sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on one hand, whilst being competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io.
### Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering
- **Arxiv ID**: http://arxiv.org/abs/2407.10389v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2407.10389v1)
- **Published**: 2024-07-15 01:58:54+00:00
- **Updated**: 2024-07-15 01:58:54+00:00
- **Authors**: Francesco Di Sario, Riccardo Renzulli, Enzo Tartaglione, Marco Grangetto
- **Comment**: None
- **Journal**: None
- **Summary**: Since the introduction of NeRFs, considerable attention has been focused on improving their training and inference times, leading to the development of Fast-NeRFs models. Despite demonstrating impressive rendering speed and quality, the rapid convergence of such models poses challenges for further improving reconstruction quality. Common strategies to improve rendering quality involve augmenting model parameters or increasing the number of sampled points. However, these computationally intensive approaches encounter limitations in achieving significant quality enhancements. This study introduces a model-agnostic framework inspired by Sparsely-Gated Mixture of Experts to enhance rendering quality without escalating computational complexity. Our approach enables specialization in rendering different scene components by employing a mixture of experts with varying resolutions. We present a novel gate formulation designed to maximize expert capabilities and propose a resolution-based routing technique to effectively induce sparsity and decompose scenes. Our work significantly improves reconstruction quality while maintaining competitive performance.

### Exploring the Impact of Moire Pattern on Deepfake Detectors
- **Arxiv ID**: http://arxiv.org/abs/2407.10399v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2407.10399v1)
- **Published**: 2024-07-15 02:39:24+00:00
- **Updated**: 2024-07-15 02:39:24+00:00
- **Authors**: Razaib Tariq, Shahroz Tariq, Simon S. Woo
- **Comment**: 7 pages, 4 figures, 1 table, Accepted for publication in IEEE International Conference on Image Processing (ICIP 2024)
- **Journal**: None
- **Summary**: Deepfake detection is critical in mitigating the societal threats posed by manipulated videos. While various algorithms have been developed for this purpose, challenges arise when detectors operate externally, such as on smartphones, when users take a photo of deepfake images and upload it to the Internet. One significant challenge in such scenarios is the presence of Moiré patterns, which degrade image quality and confound conventional classification algorithms, including deep neural networks (DNNs). The impact of Moiré patterns remains largely unexplored for deepfake detectors. In this study, we investigate how camera-captured deepfake videos from digital screens affect detector performance. We conducted experiments using two prominent datasets, CelebDF and FF++, comparing the performance of four state-of-the-art detectors on camera-captured deepfake videos with introduced Moiré patterns. Our findings reveal a significant decline in detector accuracy, with none achieving above 68% on average. This underscores the critical need to address Moiré pattern challenges in real-world deepfake detection scenarios.
### Towards Scale-Aware Full Surround Monodepth with Transformers
- **Arxiv ID**: http://arxiv.org/abs/2407.10406v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2407.10406v1)
- **Published**: 2024-07-15 02:54:46+00:00
- **Updated**: 2024-07-15 02:54:46+00:00
- **Authors**: Yuchen Yang, Xinyi Wang, Dong Li, Lu Tian, Ashish Sirasao, Xun Yang
- **Comment**: None
- **Journal**: None
- **Summary**: Full surround monodepth (FSM) methods can learn from multiple camera views simultaneously in a self-supervised manner to predict the scale-aware depth, which is more practical for real-world applications in contrast to scale-ambiguous depth from a standalone monocular camera. In this work, we focus on enhancing the scale-awareness of FSM methods for depth estimation. To this end, we propose to improve FSM from two perspectives: depth network structure optimization and training pipeline optimization. First, we construct a transformer-based depth network with neighbor-enhanced cross-view attention (NCA). The cross-attention modules can better aggregate the cross-view context in both global and neighboring views. Second, we formulate a transformer-based feature matching scheme with progressive training to improve the structure-from-motion (SfM) pipeline. That allows us to learn scale-awareness with sufficient matches and further facilitate network convergence by removing mismatches based on SfM loss. Experiments demonstrate that the resulting Scale-aware full surround monodepth (SA-FSM) method largely improves the scale-aware depth predictions without median-scaling at the test time, and performs favorably against the state-of-the-art FSM methods, e.g., surpassing SurroundDepth by 3.8% in terms of accuracy at delta<1.25 on the DDAD benchmark.
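
The "accuracy at delta<1.25" quoted in the summary above is commonly computed in depth estimation as the fraction of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The following is a minimal plain-Python sketch of that common definition, not the authors' evaluation code; the function name `delta_accuracy`, the validity rule (ground truth > 0), and the sample values are assumptions made for illustration.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Assumption: a pixel is valid when its ground-truth depth is positive;
    a non-positive prediction on a valid pixel counts as a miss.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depths")
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)


# Illustrative depths in metres (made-up values, not from the paper).
pred = [10.0, 5.0, 2.0, 8.0]
gt = [9.0, 5.0, 3.0, 8.5]
acc = delta_accuracy(pred, gt)  # three of the four ratios are below 1.25
```

Because no median scaling is applied before the ratio is taken, a metric of this form rewards predictions that are correct in absolute scale, which is the property the summary emphasizes.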
### Melon Fruit Detection and Quality Assessment Using Generative AI-Based Image Data Augmentation
- **Arxiv ID**: http://arxiv.org/abs/2407.10413v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI
- **Links**: [PDF](http://arxiv.org/pdf/2407.10413v1)
- **Published**: 2024-07-15 03:26:13+00:00
- **Updated**: 2024-07-15 03:26:13+00:00
- **Authors**: Seungri Yoon, Yunseong Cho, Tae In Ahn
- **Comment**: 8 pages, 7 figures
- **Journal**: None
- **Summary**: Monitoring and managing the growth and quality of fruits are very important tasks. To effectively train deep learning models like YOLO for real-time fruit detection, high-quality image datasets are essential. However, such datasets are often lacking in agriculture. Generative AI models can help create high-quality images. In this study, we used MidJourney and Firefly tools to generate images of melon greenhouses and post-harvest fruits through text-to-image, pre-harvest image-to-image, and post-harvest image-to-image methods. We evaluated these AI-generated images using PSNR and SSIM metrics and tested the detection performance of the YOLOv9 model. We also assessed the net quality of real and generated fruits. Our results showed that generative AI could produce images very similar to real ones, especially for post-harvest fruits. The YOLOv9 model detected the generated images well, and the net quality was also measurable. This shows that generative AI can create realistic images useful for fruit detection and quality assessment, indicating its great potential in agriculture. This study highlights the potential of AI-generated images for data augmentation in melon fruit detection and quality assessment and envisions a positive future for generative AI applications in agriculture.
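
For readers unfamiliar with the PSNR metric named in the summary above, it is defined as 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between a reference image and a compared image and MAX is the largest possible pixel value. The sketch below is a plain-Python illustration on flattened pixel lists, not the authors' pipeline; the function name `psnr`, the 8-bit range of 255, and the sample pixels are assumptions. SSIM is omitted because it needs windowed statistics that do not fit a short example.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Assumption: pixels are flattened into 1-D sequences and share the
    range [0, max_value]; identical inputs give infinite PSNR.
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative 8-bit pixels (made-up values, not from the paper).
real = [52, 55, 61, 59]
fake = [54, 55, 60, 57]
score = psnr(real, fake)  # higher means the generated pixels are closer
```

Higher values indicate that the generated image is closer, pixel for pixel, to the real one.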
### Teaching CORnet Human fMRI Representations for Enhanced Model-Brain Alignment
- **Arxiv ID**: http://arxiv.org/abs/2407.10414v1
- **DOI**: None
- **Categories**: **eess.IV**, cs.CV, cs.LG, q-bio.NC
- **Links**: [PDF](http://arxiv.org/pdf/2407.10414v1)
- **Published**: 2024-07-15 03:31:42+00:00
- **Updated**: 2024-07-15 03:31:42+00:00
- **Authors**: Zitong Lu, Yile Wang
- **Comment**: arXiv admin note: text overlap with arXiv:2401.17231
- **Journal**: None
- **Summary**: Deep convolutional neural networks (DCNNs) have demonstrated excellent performance in object recognition and have been found to share some similarities with brain visual processing. However, a substantial gap between DCNNs and human visual perception still exists. Functional magnetic resonance imaging (fMRI), a widely used technique in cognitive neuroscience, can record neural activation in the human visual cortex during the process of visual perception. Can we teach DCNNs human fMRI signals to achieve a more brain-like model? To answer this question, this study proposed ReAlnet-fMRI, a model based on the SOTA vision model CORnet but optimized using human fMRI data through a multi-layer encoding-based alignment framework. This framework has been shown to effectively enable the model to learn human brain representations. The fMRI-optimized ReAlnet-fMRI exhibited higher similarity to the human brain than both CORnet and the control model in within- and across-subject as well as within- and across-modality model-brain (fMRI and EEG) alignment evaluations. Additionally, we conducted in-depth analyses to investigate how the internal representations of ReAlnet-fMRI differ from CORnet in encoding various object dimensions. These findings provide the possibility of enhancing the brain-likeness of visual models by integrating human neural data, helping to bridge the gap between computer vision and visual neuroscience.
### Omni-Dimensional Frequency Learner for General Time Series Analysis
- **Arxiv ID**: http://arxiv.org/abs/2407.10419v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.LG
- **Links**: [PDF](http://arxiv.org/pdf/2407.10419v1)
- **Published**: 2024-07-15 03:48:16+00:00
- **Updated**: 2024-07-15 03:48:16+00:00
- **Authors**: Xianing Chen, Hanting Chen, Hailin Hu
- **Comment**: None
- **Journal**: None
- **Summary**: Frequency domain representation of time series feature offers a concise representation for handling real-world time series data with inherent complexity and dynamic nature. However, current frequency-based methods with complex operations still fall short of state-of-the-art time domain methods for general time series analysis. In this work, we present the Omni-Dimensional Frequency Learner (ODFL) model, based on an in-depth analysis of all three aspects of the spectrum feature: channel redundancy property among the frequency dimension, the sparse and un-salient frequency energy distribution among the frequency dimension, and the semantic diversity among the variable dimension. Technically, our method is composed of a semantic-adaptive global filter with attention to the un-salient frequency bands and partial operation among the channel dimension. Empirical results show that ODFL achieves consistent state-of-the-art in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection, offering a promising foundation for time series analysis.
### Transformer for Multitemporal Hyperspectral Image Unmixing
- **Arxiv ID**: http://arxiv.org/abs/2407.10427v1
- **DOI**: None
- **Categories**: **eess.IV**, cs.CV
- **Links**: [PDF](http://arxiv.org/pdf/2407.10427v1)
- **Published**: 2024-07-15 04:02:01+00:00
- **Updated**: 2024-07-15 04:02:01+00:00
- **Authors**: Hang Li, Qiankun Dong, Xueshuo Xie, Xia Xu, Tao Li, Zhenwei Shi
- **Comment**: None
- **Journal**: None
- **Summary**: Multitemporal hyperspectral image unmixing (MTHU) holds significant importance in monitoring and analyzing the dynamic changes of surface. However, compared to single-temporal unmixing, the multitemporal approach demands comprehensive consideration of information across different phases, rendering it a greater challenge. To address this challenge, we propose the Multitemporal Hyperspectral Image Unmixing Transformer (MUFormer), an end-to-end unsupervised deep learning model. To effectively perform multitemporal hyperspectral image unmixing, we introduce two key modules: the Global Awareness Module (GAM) and the Change Enhancement Module (CEM). The Global Awareness Module computes self-attention across all phases, facilitating global weight allocation. On the other hand, the Change Enhancement Module dynamically learns local temporal changes by comparing endmember changes between adjacent phases. The synergy between these modules allows for capturing semantic information regarding endmember and abundance changes, thereby enhancing the effectiveness of multitemporal hyperspectral image unmixing. We conducted experiments on one real dataset and two synthetic datasets, demonstrating that our model significantly enhances the effect of multitemporal hyperspectral image unmixing.
### A Multi-Stage Framework for 3D Individual Tooth Segmentation in Dental CBCT
- **Arxiv ID**: http://arxiv.org/abs/2407.10433v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI
- **Links**: [PDF](http://arxiv.org/pdf/2407.10433v1)
- **Published**: 2024-07-15 04:23:28+00:00
- **Updated**: 2024-07-15 04:23:28+00:00
- **Authors**: Chunshi Wang, Bin Zhao, Shuxue Ding
- **Comment**: Semi-supervised Tooth Segmentation MICCAI 2023 Challenge
- **Journal**: None
- **Summary**: Cone beam computed tomography (CBCT) is a common way of diagnosing dental related diseases. Accurate segmentation of 3D teeth is important for treatment. Although deep learning based methods have achieved convincing results in medical image processing, they need a large amount of annotated data for network training, making data collection and annotation very time-consuming. Besides, the domain shift that widely exists in the distribution of data acquired by different devices severely impacts model generalization. To resolve the problem, we propose a multi-stage framework for 3D tooth segmentation in dental CBCT, which achieved third place in the "Semi-supervised Teeth Segmentation" 3D (STS-3D) challenge. The experiments on the validation set, compared with other semi-supervised segmentation methods, further indicate the validity of our approach.

### PolyRoom: Room-aware Transformer for Floorplan Reconstruction
- **Arxiv ID**: http://arxiv.org/abs/2407.10439v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2407.10439v1)
- **Published**: 2024-07-15 04:53:10+00:00
- **Updated**: 2024-07-15 04:53:10+00:00
- **Authors**: Yuzhou Liu, Lingjie Zhu, Xiaodong Ma, Hanqiao Ye, Xiang Gao, Xianwei Zheng, Shuhan Shen
- **Comment**: Accepted by ECCV2024
- **Journal**: None
- **Summary**: Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research.
In this paper, we aim to reconstruct the floorplan with a vectorized representation from point clouds. Despite significant advancements achieved in recent years, current methods still encounter several challenges, such as missing corners or edges, inaccuracies in corner positions or angles, self-intersecting or overlapping polygons, and potentially implausible topology. To tackle these challenges, we present PolyRoom, a room-aware Transformer that leverages uniform sampling representation, room-aware query initialization, and room-aware self-attention for floorplan reconstruction. Specifically, we adopt a uniform sampling floorplan representation to enable dense supervision during training and effective utilization of angle information. Additionally, we propose a room-aware query initialization scheme to prevent non-polygonal sequences and introduce room-aware self-attention to enhance memory efficiency and model performance. Experimental results on two widely used datasets demonstrate that PolyRoom surpasses current state-of-the-art methods both quantitatively and qualitatively. Our code is available at: https://github.com/3dv-casia/PolyRoom/.

### Backdoor Attacks against Image-to-Image Networks
- **Arxiv ID**: http://arxiv.org/abs/2407.10445v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI
- **Links**: [PDF](http://arxiv.org/pdf/2407.10445v1)
- **Published**: 2024-07-15 05:14:17+00:00
- **Updated**: 2024-07-15 05:14:17+00:00
- **Authors**: Wenbo Jiang, Hongwei Li, Jiaming He, Rui Zhang, Guowen Xu, Tianwei Zhang, Rongxing Lu
- **Comment**: None
- **Journal**: None
- **Summary**: Recently, deep learning-based Image-to-Image (I2I) networks have become the predominant choice for I2I tasks such as image super-resolution and denoising. Despite their remarkable performance, the backdoor vulnerability of I2I networks has not been explored. To fill this research gap, we conduct a comprehensive investigation on the susceptibility of I2I networks to backdoor attacks.
Specifically, we propose a novel backdoor attack technique, where the compromised I2I network behaves normally on clean input images, yet outputs a predefined image of the adversary for malicious input images containing the trigger. To achieve this I2I backdoor attack, we propose a targeted universal adversarial perturbation (UAP) generation algorithm for I2I networks, where the generated UAP is used as the backdoor trigger. Additionally, in the backdoor training process that contains the main task and the backdoor task, multi-task learning (MTL) with dynamic weighting methods is employed to accelerate convergence rates. In addition to attacking I2I tasks, we extend our I2I backdoor to attack downstream tasks, including image classification and object detection. Extensive experiments demonstrate the effectiveness of the I2I backdoor on state-of-the-art I2I network architectures, as well as the robustness against different mainstream backdoor defenses.

### DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models
- **Arxiv ID**: http://arxiv.org/abs/2407.10459v1
- **DOI**: None
- **Categories**: **cs.CV**
- **Links**: [PDF](http://arxiv.org/pdf/2407.10459v1)
- **Published**: 2024-07-15 06:15:49+00:00
- **Updated**: 2024-07-15 06:15:49+00:00
- **Authors**: Yiwei Yang, Zheyuan Liu, Jun Jia, Zhongpai Gao, Yunhao Li, Wei Sun, Xiaohong Liu, Guangtao Zhai
- **Comment**: 9 pages, 7 figures; reference added; accepted at IJCAI2024 main track
- **Journal**: None
- **Summary**: Traditional image steganography focuses on concealing one image within another, aiming to avoid steganalysis by unauthorized entities. Coverless image steganography (CIS) enhances imperceptibility by not using any cover image. Recent works have utilized text prompts as keys in CIS through diffusion models.
However, this approach faces three challenges: invalidation when the private prompt is guessed, the difficulty of crafting public prompts for semantic diversity, and the risk of prompt leakage during frequent transmission. To address these issues, we propose DiffStega, an innovative training-free diffusion-based CIS strategy for universal application. DiffStega uses a password-dependent reference image as an image prompt alongside the text, ensuring that only authorized parties can retrieve the hidden information. Furthermore, we develop the Noise Flip technique to further secure the steganography against unauthorized decryption. To comprehensively assess our method across general CIS tasks, we create a dataset comprising various image steganography instances. Experiments indicate substantial improvements in our method over existing ones, particularly in aspects of versatility, password sensitivity, and recovery quality. Codes are available at https://github.com/evtricks/DiffStega.

### Kinetic Typography Diffusion Model
- **Arxiv ID**: http://arxiv.org/abs/2407.10476v1
- **DOI**: None
- **Categories**: **cs.CV**, cs.AI
- **Links**: [PDF](http://arxiv.org/pdf/2407.10476v1)
- **Published**: 2024-07-15 07:04:45+00:00
- **Updated**: 2024-07-15 07:04:45+00:00
- **Authors**: Seonmi Park, Inhwan Bae, Seunghyun Shin, Hae-Gon Jeon
- **Comment**: Accepted at ECCV 2024, Project page: https://seonmip.github.io/kinety
- **Journal**: None
- **Summary**: This paper introduces a method for realistic kinetic typography that generates user-preferred animatable 'text content'. We draw on recent advances in guided video diffusion models to achieve visually-pleasing text appearances. To do this, we first construct a kinetic typography dataset, comprising about 600K videos.
Our dataset is made from a variety of combinations in 584 templates designed by professional motion graphics designers and involves changing each letter's position, glyph, and size (i.e., flying, glitches, chromatic aberration, reflecting effects, etc.). Next, we propose a video diffusion model for kinetic typography. For this, there are three requirements: aesthetic appearances, motion effects, and readable letters. This paper identifies the requirements. For this, we present static and dynamic captions used as spatial and temporal guidance of a video diffusion model, respectively. The static caption describes the overall appearance of the video, such as colors, texture and glyph which represent a shape of each letter. The dynamic caption accounts for the movements of letters and backgrounds. We add one more guidance with zero convolution to determine which text content should be visible in the video. We apply the zero convolution to the text content, and impose it on the diffusion model. Lastly, our glyph loss, only minimizing a difference between the predicted word and its ground-truth, is proposed to make the prediction letters readable. Experiments show that our model generates kinetic typography videos with legible and artistic letter motions based on text prompts. ### NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis - **Arxiv ID**: http://arxiv.org/abs/2407.10482v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10482v1) - **Published**: 2024-07-15 07:08:03+00:00 - **Updated**: 2024-07-15 07:08:03+00:00 - **Authors**: Yubin Hu, Xiaoyang Guo, Yang Xiao, Jingwei Huang, Yong-Jin Liu - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: This paper presents NGP-RT, a novel approach for enhancing the rendering speed of Instant-NGP to achieve real-time novel view synthesis. 
As a classic NeRF-based method, Instant-NGP stores implicit features in multi-level grids or hash tables and applies a shallow MLP to convert the implicit features into explicit colors and densities. Although it achieves fast training speed, there is still a lot of room for improvement in its rendering speed due to the per-point MLP executions for implicit multi-level feature aggregation, especially for real-time applications. To address this challenge, our proposed NGP-RT explicitly stores colors and densities as hash features, and leverages a lightweight attention mechanism to disambiguate the hash collisions instead of using computationally intensive MLP. At the rendering stage, NGP-RT incorporates a pre-computed occupancy distance grid into the ray marching strategy to inform the distance to the nearest occupied voxel, thereby reducing the number of marching points and global memory access. Experimental results show that on the challenging Mip-NeRF360 dataset, NGP-RT achieves better rendering quality than previous NeRF-based methods, achieving 108 fps at 1080p resolution on a single Nvidia RTX 3090 GPU. Our approach is promising for NeRF-based real-time applications that require efficient and high-quality rendering. ### Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry - **Arxiv ID**: http://arxiv.org/abs/2407.10484v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10484v1) - **Published**: 2024-07-15 07:11:44+00:00 - **Updated**: 2024-07-15 07:11:44+00:00 - **Authors**: Ziheng Chen, Yue Song, Xiao-Jun Wu, Gaowen Liu, Nicu Sebe - **Comment**: 24 pages, 3 figures - **Journal**: None - **Summary**: Global Covariance Pooling (GCP) has been demonstrated to improve the performance of Deep Neural Networks (DNNs) by exploiting second-order statistics of high-level representations. 
GCP typically performs classification of the covariance matrices by applying matrix function normalization, such as matrix logarithm or power, followed by a Euclidean classifier. However, covariance matrices inherently lie in a Riemannian manifold, known as the Symmetric Positive Definite (SPD) manifold. The current literature does not provide a satisfactory explanation of why Euclidean classifiers can be applied directly to Riemannian features after the normalization of the matrix power. To mitigate this gap, this paper provides a comprehensive and unified understanding of the matrix logarithm and power from a Riemannian geometry perspective. The underlying mechanism of matrix functions in GCP is interpreted from two perspectives: one based on tangent classifiers (Euclidean classifiers on the tangent space) and the other based on Riemannian classifiers. Via theoretical analysis and empirical validation through extensive experiments on fine-grained and large-scale visual classification datasets, we conclude that the working mechanism of the matrix functions should be attributed to the Riemannian classifiers they implicitly respect. ### Effective Motion Modeling for UAV-platform Multiple Object Tracking with Re-Margin Loss - **Arxiv ID**: http://arxiv.org/abs/2407.10485v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10485v1) - **Published**: 2024-07-15 07:13:27+00:00 - **Updated**: 2024-07-15 07:13:27+00:00 - **Authors**: Mufeng Yao, Jinlong Peng, Qingdong He, Bo Peng, Hao Chen, Mingmin Chi, Chao Liu, Jon Atli Benediktsson - **Comment**: arXiv admin note: text overlap with arXiv:2308.07207 - **Journal**: None - **Summary**: Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling. This is because UAV-MOT faces tracking difficulties caused by large and irregular motion, and insufficient training due to the motion long-tailed distribution of current UAV-MOT datasets. 
Previous UAV-MOT methods either extract motion and detection features redundantly or supervise the motion model in a sparse scheme, which limits their tracking performance and speed. To this end, we propose a flowing-by-detection module to realize accurate motion modeling at minimal cost. Focusing on the motion long-tailed problem that was ignored by previous works, the flow-guided margin loss is designed to enable more complete training of large moving objects. Experiments on two widely used open-source datasets show that our proposed model can successfully track objects with large and irregular motion and outperform existing state-of-the-art methods in UAV-MOT tasks. ### Lite2Relight: 3D-aware Single Image Portrait Relighting - **Arxiv ID**: http://arxiv.org/abs/2407.10487v1 - **DOI**: 10.1145/3641519.3657470 - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10487v1) - **Published**: 2024-07-15 07:16:11+00:00 - **Updated**: 2024-07-15 07:16:11+00:00 - **Authors**: Pramod Rao, Gereon Fox, Abhimitra Meka, Mallikarjun B R, Fangneng Zhan, Tim Weyrich, Bernd Bickel, Hanspeter Pfister, Wojciech Matusik, Mohamed Elgharib, Christian Theobalt - **Comment**: Accepted at SIGGRAPH '24: ACM SIGGRAPH 2024 Conference Papers - **Journal**: None - **Summary**: Achieving photorealistic 3D view synthesis and relighting of human portraits is pivotal for advancing AR/VR applications. Existing methodologies in portrait relighting demonstrate substantial limitations in terms of generalization and 3D consistency, coupled with inaccuracies in physically realistic lighting and identity preservation. Furthermore, personalization from a single view is difficult to achieve and often requires multiview images during the testing phase or involves slow optimization processes. This paper introduces Lite2Relight, a novel technique that can predict 3D consistent head poses of portraits while performing physically plausible light editing at interactive speed. 
Our method uniquely extends the generative capabilities and efficient volumetric representation of EG3D, leveraging a lightstage dataset to implicitly disentangle face reflectance and perform relighting under target HDRI environment maps. By utilizing a pre-trained geometry-aware encoder and a feature alignment module, we map input images into a relightable 3D space, enhancing them with a strong face geometry and reflectance prior. Through extensive quantitative and qualitative evaluations, we show that our method outperforms the state-of-the-art methods in terms of efficacy, photorealism, and practical application. This includes producing 3D-consistent results of the full head, including hair, eyes, and expressions. Lite2Relight paves the way for large-scale adoption of photorealistic portrait editing in various domains, offering a robust, interactive solution to a previously constrained problem. Project page: https://vcai.mpi-inf.mpg.de/projects/Lite2Relight/ ### Learning to Unlearn for Robust Machine Unlearning - **Arxiv ID**: http://arxiv.org/abs/2407.10494v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV, I.2.6 - **Links**: [PDF](http://arxiv.org/pdf/2407.10494v1) - **Published**: 2024-07-15 07:36:00+00:00 - **Updated**: 2024-07-15 07:36:00+00:00 - **Authors**: Mark He Huang, Lin Geng Foo, Jun Liu - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Machine unlearning (MU) seeks to remove knowledge of specific data samples from trained models without the necessity for complete retraining, a task made challenging by the dual objectives of effective erasure of data and maintaining the overall performance of the model. Despite recent advances in this field, balancing between the dual objectives of unlearning remains challenging. 
From a fresh perspective of generalization, we introduce a novel Learning-to-Unlearn (LTU) framework, which adopts a meta-learning approach to optimize the unlearning process to improve forgetting and remembering in a unified manner. LTU includes a meta-optimization scheme that facilitates models to effectively preserve generalizable knowledge with only a small subset of the remaining set, while thoroughly forgetting the specific data samples. We also introduce a Gradient Harmonization strategy to align the optimization trajectories for remembering and forgetting via mitigating gradient conflicts, thus ensuring efficient and effective model updates. Our approach demonstrates improved efficiency and efficacy for MU, offering a promising solution to the challenges of data rights and model reusability. ### Improving Hyperbolic Representations via Gromov-Wasserstein Regularization - **Arxiv ID**: http://arxiv.org/abs/2407.10495v1 - **DOI**: None - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10495v1) - **Published**: 2024-07-15 07:37:31+00:00 - **Updated**: 2024-07-15 07:37:31+00:00 - **Authors**: Yifei Yang, Wonjun Lee, Dongmian Zou, Gilad Lerman - **Comment**: Accepted for ECCV 2024 - **Journal**: None - **Summary**: Hyperbolic representations have shown remarkable efficacy in modeling inherent hierarchies and complexities within data structures. Hyperbolic neural networks have been commonly applied for learning such representations from data, but they often fall short in preserving the geometric structures of the original feature spaces. In response to this challenge, our work applies the Gromov-Wasserstein (GW) distance as a novel regularization mechanism within hyperbolic neural networks. The GW distance quantifies how well the original data structure is maintained after embedding the data in a hyperbolic space. 
Specifically, we explicitly treat the layers of the hyperbolic neural networks as a transport map and calculate the GW distance accordingly. We validate that the GW distance computed based on a training set well approximates the GW distance of the underlying data distribution. Our approach demonstrates consistent enhancements over current state-of-the-art methods across various tasks, including few-shot image classification, as well as semi-supervised graph link prediction and node classification. ### Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10528v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10528v1) - **Published**: 2024-07-15 08:35:00+00:00 - **Updated**: 2024-07-15 08:35:00+00:00 - **Authors**: Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. 
This local-to-global paradigm reduces the complexity associated with direct global motion generation and promotes motion diversity via sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method provides flexibility in seamlessly combining various local actions and continuous guiding weight adjustment, accommodating diverse user preferences, which may hold potential significance for the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/. ### Automated Label Unification for Multi-Dataset Semantic Segmentation with GNNs - **Arxiv ID**: http://arxiv.org/abs/2407.10534v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10534v1) - **Published**: 2024-07-15 08:42:10+00:00 - **Updated**: 2024-07-15 08:42:10+00:00 - **Authors**: Rong Ma, Jie Chen, Xiangyang Xue, Jian Pu - **Comment**: None - **Journal**: None - **Summary**: Deep supervised models possess significant capability to assimilate extensive training data, thereby presenting an opportunity to enhance model performance through training on multiple datasets. However, conflicts arising from different label spaces among datasets may adversely affect model performance. In this paper, we propose a novel approach to automatically construct a unified label space across multiple datasets using graph neural networks. This enables semantic segmentation models to be trained simultaneously on multiple datasets, resulting in performance improvements. Unlike existing methods, our approach facilitates seamless training without the need for additional manual reannotation or taxonomy reconciliation. This significantly enhances the efficiency and effectiveness of multi-dataset segmentation model training. 
The results demonstrate that our method significantly outperforms other multi-dataset training methods when trained on seven datasets simultaneously, and achieves state-of-the-art performance on the WildDash 2 benchmark. ### An experimental evaluation of Siamese Neural Networks for robot localization using omnidirectional imaging in indoor environments - **Arxiv ID**: http://arxiv.org/abs/2407.10536v1 - **DOI**: 10.1007/s10462-024-10840-0 - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10536v1) - **Published**: 2024-07-15 08:44:37+00:00 - **Updated**: 2024-07-15 08:44:37+00:00 - **Authors**: J. J. Cabrera, V. Román, A. Gil, O. Reinoso, L. Payá - **Comment**: Published: 08 July 2024 Paper link: https://link.springer.com/content/pdf/10.1007/s10462-024-10840-0.pdf - **Journal**: Cabrera, J.J., Román, V., Gil, A., Reinoso, O. and Payá, L. An experimental evaluation of Siamese Neural Networks for robot localization using omnidirectional imaging in indoor environments. Artificial Intelligence Review 57, 198 (2024) - **Summary**: The objective of this paper is to address the localization problem using omnidirectional images captured by a catadioptric vision system mounted on the robot. For this purpose, we explore the potential of Siamese Neural Networks for modeling indoor environments using panoramic images as the only source of information. Siamese Neural Networks are characterized by their ability to generate a similarity function between two inputs, in this case, between two panoramic images. In this study, Siamese Neural Networks composed of two Convolutional Neural Networks (CNNs) are used. The output of each CNN is a descriptor which is used to characterize each image. The dissimilarity of the images is computed by measuring the distance between these descriptors. This fact makes Siamese Neural Networks particularly suitable to perform image retrieval tasks. 
First, we evaluate an initial task strongly related to localization that consists in detecting whether two images have been captured in the same or in different rooms. Next, we assess Siamese Neural Networks in the context of a global localization problem. The results outperform previous techniques for solving the localization task using the COLD-Freiburg dataset, in a variety of lighting conditions, especially when using images captured in cloudy and night conditions. ### Segmentation of Prostate Tumour Volumes from PET Images is a Different Ball Game - **Arxiv ID**: http://arxiv.org/abs/2407.10537v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10537v1) - **Published**: 2024-07-15 08:48:17+00:00 - **Updated**: 2024-07-15 08:48:17+00:00 - **Authors**: Shrajan Bhandary, Dejan Kuhn, Zahra Babaiee, Tobias Fechter, Simon K. B. Spohn, Constantinos Zamboglou, Anca-Ligia Grosu, Radu Grosu - **Comment**: None - **Journal**: None - **Summary**: Accurate segmentation of prostate tumours from PET images presents a formidable challenge in medical image analysis. Despite considerable work and improvement in delineating organs from CT and MR modalities, the existing standards do not transfer well or produce quality results in PET-related tasks. In particular, contemporary methods fail to accurately consider the intensity-based scaling applied by physicians during manual annotation of tumour contours. In this paper, we observe that the prostate-localised uptake threshold ranges are beneficial for suppressing outliers. Therefore, we utilize the intensity threshold values to implement a new custom-feature-clipping normalisation technique. We evaluate multiple established U-Net variants under different normalisation schemes, using the nnU-Net framework. All models were trained and tested on multiple datasets, obtained with two radioactive tracers: [68-Ga]Ga-PSMA-11 and [18-F]PSMA-1007. 
Our results show that the U-Net models achieve much better performance when the PET scans are preprocessed with our novel clipping technique. ### 3D Geometric Shape Assembly via Efficient Point Cloud Matching - **Arxiv ID**: http://arxiv.org/abs/2407.10542v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10542v1) - **Published**: 2024-07-15 08:50:02+00:00 - **Updated**: 2024-07-15 08:50:02+00:00 - **Authors**: Nahyuk Lee, Juhong Min, Junha Lee, Seungwook Kim, Kanghee Lee, Jaesik Park, Minsu Cho - **Comment**: Accepted to ICML 2024 - **Journal**: None - **Summary**: Learning to assemble geometric shapes into a larger target structure is a pivotal task in various practical applications. In this work, we tackle this problem by establishing local correspondences between point clouds of part shapes in both coarse- and fine-levels. To this end, we introduce Proxy Match Transform (PMT), an approximate high-order feature transform layer that enables reliable matching between mating surfaces of parts while incurring low costs in memory and computation. Building upon PMT, we introduce a new framework, dubbed Proxy Match TransformeR (PMTR), for the geometric assembly task. We evaluate the proposed PMTR on the large-scale 3D geometric shape assembly benchmark dataset of Breaking Bad and demonstrate its superior performance and efficiency compared to state-of-the-art methods. Project page: https://nahyuklee.github.io/pmtr. ### Understanding the Dependence of Perception Model Competency on Regions in an Image - **Arxiv ID**: http://arxiv.org/abs/2407.10543v1 - **DOI**: 10.1007/978-3-031-63797-1_8 - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10543v1) - **Published**: 2024-07-15 08:50:13+00:00 - **Updated**: 2024-07-15 08:50:13+00:00 - **Authors**: Sara Pohland, Claire Tomlin - **Comment**: None - **Journal**: In: Explainable Artificial Intelligence. xAI 2024. 
Communications in Computer and Information Science, vol 2154 - **Summary**: While deep neural network (DNN)-based perception models are useful for many applications, these models are black boxes and their outputs are not yet well understood. To confidently enable a real-world, decision-making system to utilize such a perception model without human intervention, we must enable the system to reason about the perception model's level of competency and respond appropriately when the model is incompetent. In order for the system to make an intelligent decision about the appropriate action when the model is incompetent, it would be useful for the system to understand why the model is incompetent. We explore five novel methods for identifying regions in the input image contributing to low model competency, which we refer to as image cropping, segment masking, pixel perturbation, competency gradients, and reconstruction loss. We assess the ability of these five methods to identify unfamiliar objects, recognize regions associated with unseen classes, and identify unexplored areas in an environment. We find that the competency gradients and reconstruction loss methods show great promise in identifying regions associated with low model competency, particularly when aspects of the image that are unfamiliar to the perception model are causing this reduction in competency. Both of these methods boast low computation times and high levels of accuracy in detecting image regions that are unfamiliar to the model, allowing them to provide potential utility in decision-making pipelines. The code for reproducing our methods and results is available on GitHub: https://github.com/sarapohland/explainable-competency. 
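The segment-masking idea named in the summary above can be illustrated with a toy occlusion test: mask one region at a time and record how much a competency score recovers. This is a minimal sketch under stated assumptions, not the authors' implementation. The `competency` function, the square patches, and the mean-value fill are all hypothetical stand-ins chosen only to make the example self-contained.

```python
def competency(image):
    """Hypothetical score in (0, 1]: drops as pixels stray from the 'familiar' value 0.5."""
    return 1.0 / (1.0 + sum((p - 0.5) ** 2 for row in image for p in row))


def mask_patch(image, top, left, size, fill):
    """Return a copy of `image` with one size x size patch overwritten by `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(image))):
        for c in range(left, min(left + size, len(image[0]))):
            out[r][c] = fill
    return out


def masking_scores(image, size, score_fn):
    """Map each patch origin to the competency gained by masking that patch."""
    pixels = [p for row in image for p in row]
    fill = sum(pixels) / len(pixels)  # assumed fill: the global image mean
    base = score_fn(image)
    scores = {}
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            masked = mask_patch(image, top, left, size, fill)
            scores[(top, left)] = score_fn(masked) - base
    return scores


# Toy image: uniform 'familiar' background with one unfamiliar 2x2 block.
image = [[0.5] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1.0

scores = masking_scores(image, 2, competency)
suspect = max(scores, key=scores.get)
print(suspect)  # (2, 4): masking this patch restores the most competency
```

In a real pipeline the score would come from the perception model itself, and the regions would more plausibly be produced by a segmentation step than by a fixed grid.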
### Efficient Continual Learning with Low Memory Footprint For Edge Device - **Arxiv ID**: http://arxiv.org/abs/2407.10545v2 - **DOI**: None - **Categories**: **cs.LG**, cs.AI, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10545v2) - **Published**: 2024-07-15 08:52:20+00:00 - **Updated**: 2024-07-17 14:34:43+00:00 - **Authors**: Zeqing Wang, Fei Cheng, Kangye Ji, Bohu Huang - **Comment**: None - **Journal**: None - **Summary**: Continual learning (CL) is a useful technique to acquire dynamic knowledge continually. Although powerful cloud platforms can fully exert the ability of CL, e.g., customized recommendation systems, similar personalized requirements for edge devices are almost disregarded. This phenomenon stems from the huge resource overhead involved in training neural networks and overcoming the forgetting problem of CL. This paper focuses on these scenarios and proposes a compact algorithm called LightCL. Unlike other CL methods, which consume huge resources to acquire generalizability among all tasks for delaying forgetting, LightCL compresses the resource consumption of already generalized components in neural networks and uses a few extra resources to improve memory in other parts. We first propose two new metrics of learning plasticity and memory stability to seek generalizability during CL. Based on the discovery that lower and middle layers have more generalizability while deeper layers show the opposite, we *Maintain Generalizability* by freezing the lower and middle layers. Then, we *Memorize Feature Patterns* to stabilize the feature extracting patterns of previous tasks to improve generalizability in deeper layers. In the experimental comparison, LightCL outperforms other SOTA methods in delaying forgetting and reduces the memory footprint by up to **6.16×**, proving the excellent performance of LightCL in efficiency. 
We also evaluate the efficiency of our method on an edge device, the Jetson Nano, which further proves our method's practical effectiveness. ### Learning Natural Consistency Representation for Face Forgery Video Detection - **Arxiv ID**: http://arxiv.org/abs/2407.10550v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10550v1) - **Published**: 2024-07-15 09:00:02+00:00 - **Updated**: 2024-07-15 09:00:02+00:00 - **Authors**: Daichi Zhang, Zihao Xiao, Shikun Li, Fanzhao Lin, Jianmin Li, Shiming Ge - **Comment**: None - **Journal**: None - **Summary**: Face forgery videos have elicited critical social public concerns and various detectors have been proposed. However, fully-supervised detectors may easily overfit to specific forgery methods or videos, and existing self-supervised detectors are strict on auxiliary tasks, such as requiring audio or multi-modalities, leading to limited generalization and robustness. In this paper, we examine whether we can address this issue by leveraging visual-only real face videos. To this end, we propose to learn the Natural Consistency representation (NACO) of real face videos in a self-supervised manner, which is inspired by the observation that fake videos struggle to maintain the natural spatiotemporal consistency even under unknown forgery methods and different perturbations. Our NACO first extracts spatial features of each frame by CNNs, then integrates them into a Transformer to learn the long-range spatiotemporal representation, leveraging the advantages of CNNs and Transformers on local spatial receptive field and long-term memory respectively. Furthermore, a Spatial Predictive Module (SPM) and a Temporal Contrastive Module (TCM) are introduced to enhance the natural consistency representation learning. 
The SPM aims to predict randomly masked spatial features from the spatiotemporal representation, and the TCM regularizes the latent distance of the spatiotemporal representation by shuffling the natural order to disturb the consistency, both of which can make NACO more sensitive to the natural spatiotemporal consistency. After the representation learning stage, an MLP head is fine-tuned to perform the usual forgery video classification task. Extensive experiments show that our method outperforms other state-of-the-art competitors with impressive generalization and robustness. ### ConTEXTure: Consistent Multiview Images to Texture - **Arxiv ID**: http://arxiv.org/abs/2407.10558v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10558v1) - **Published**: 2024-07-15 09:15:55+00:00 - **Updated**: 2024-07-15 09:15:55+00:00 - **Authors**: Jaehoon Ahn, Sumin Cho, Harim Jung, Kibeom Hong, Seonghoon Ban, Moon-Ryul Jung - **Comment**: 11 pages, 7 figures - **Journal**: None - **Summary**: We introduce ConTEXTure, a generative network designed to create a texture map/atlas for a given 3D mesh using images from multiple viewpoints. The process begins with generating a front-view image from a text prompt, such as 'Napoleon, front view', describing the 3D mesh. Additional images from different viewpoints are derived from this front-view image and camera poses relative to it. ConTEXTure builds upon the TEXTure network, which uses text prompts for six viewpoints (e.g., 'Napoleon, front view', 'Napoleon, left view', etc.). However, TEXTure often generates images for non-front viewpoints that do not accurately represent those viewpoints. To address this issue, we employ Zero123++, which generates multiple view-consistent images for the six specified viewpoints simultaneously, conditioned on the initial front-view image and the depth maps of the mesh for the six viewpoints. 
By utilizing these view-consistent images, ConTEXTure learns the texture atlas from all viewpoint images concurrently, unlike previous methods that do so sequentially. This approach ensures that the rendered images from various viewpoints, including back, side, bottom, and top, are free from viewpoint irregularities. ### LIP-CAR: contrast agent reduction by a deep learned inverse problem - **Arxiv ID**: http://arxiv.org/abs/2407.10559v1 - **DOI**: None - **Categories**: **cs.CV**, cs.NA, eess.IV, math.NA - **Links**: [PDF](http://arxiv.org/pdf/2407.10559v1) - **Published**: 2024-07-15 09:16:54+00:00 - **Updated**: 2024-07-15 09:16:54+00:00 - **Authors**: Davide Bianchi, Sonia Colombo Serra, Davide Evangelista, Pengpeng Luo, Elena Morotti, Giovanni Valbusa - **Comment**: None - **Journal**: None - **Summary**: The adoption of contrast agents in medical imaging protocols is crucial for accurate and timely diagnosis. While highly effective and characterized by an excellent safety profile, the use of contrast agents has its limitations, including a rare risk of allergic reactions, potential environmental impact, and economic burdens on patients and healthcare systems. In this work, we address the contrast agent reduction (CAR) problem, which involves reducing the administered dosage of contrast agent while preserving the visual enhancement. The current literature on the CAR task is based on deep learning techniques within a fully image processing framework. These techniques digitally simulate high-dose images from images acquired with a low dose of contrast agent. We investigate the feasibility of a "learned inverse problem" (LIP) approach, as opposed to the end-to-end paradigm in the state-of-the-art literature. Specifically, we learn the image-to-image operator that maps high-dose images to their corresponding low-dose counterparts, and we frame the CAR task as an inverse problem. We then solve this problem through a regularized optimization reformulation. 
Regularization methods are well-established mathematical techniques that offer robustness and explainability. Our approach combines these rigorous techniques with cutting-edge deep learning tools. Numerical experiments performed on pre-clinical medical images confirm the effectiveness of this strategy, showing improved stability and accuracy in the simulated high-dose images. ### Pathformer3D: A 3D Scanpath Transformer for 360° Images - **Arxiv ID**: http://arxiv.org/abs/2407.10563v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10563v1) - **Published**: 2024-07-15 09:24:27+00:00 - **Updated**: 2024-07-15 09:24:27+00:00 - **Authors**: Rong Quan, Yantao Lai, Mengyu Qiu, Dong Liang - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images perform scanpath prediction on the 2D equirectangular projection plane, which always results in large computation errors owing to the 2D plane's distortion and coordinate discontinuity. In this work, we perform scanpath prediction for 360° images in the 3D spherical coordinate system and propose a novel 3D scanpath Transformer named Pathformer3D. Specifically, a 3D Transformer encoder is first used to extract a 3D contextual feature representation for the 360° image. Then, the contextual feature representation and historical fixation information are input into a Transformer decoder to output the current time step's fixation embedding, where the self-attention module is used to imitate the visual working memory mechanism of the human visual system and directly model the time dependencies among the fixations. Finally, a 3D Gaussian distribution is learned from each fixation embedding, from which the fixation position can be sampled. 
Evaluation on four panoramic eye-tracking datasets demonstrates that Pathformer3D outperforms the current state-of-the-art methods. Code is available at https://github.com/lsztzp/Pathformer3D . ### PULPo: Probabilistic Unsupervised Laplacian Pyramid Registration - **Arxiv ID**: http://arxiv.org/abs/2407.10567v1 - **DOI**: None - **Categories**: **cs.CV**, eess.IV - **Links**: [PDF](http://arxiv.org/pdf/2407.10567v1) - **Published**: 2024-07-15 09:30:31+00:00 - **Updated**: 2024-07-15 09:30:31+00:00 - **Authors**: Leonard Siegert, Paul Fischer, Mattias P. Heinrich, Christian F. Baumgartner - **Comment**: Accepted as full paper to MICCAI 2024 - **Journal**: None - **Summary**: Deformable image registration is fundamental to many medical imaging applications. Registration is an inherently ambiguous task often admitting many viable solutions. While neural network-based registration techniques enable fast and accurate registration, the majority of existing approaches are not able to estimate uncertainty. Here, we present PULPo, a method for probabilistic deformable registration capable of uncertainty quantification. PULPo probabilistically models the distribution of deformation fields on different hierarchical levels combining them using Laplacian pyramids. This allows our method to model global as well as local aspects of the deformation field. We evaluate our method on two widely used neuroimaging datasets and find that it achieves high registration performance as well as substantially better calibrated uncertainty quantification compared to the current state-of-the-art. 
### Stacking-Enhanced Bagging Ensemble Learning for Breast Cancer Classification with CNN - **Arxiv ID**: http://arxiv.org/abs/2407.10574v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10574v1) - **Published**: 2024-07-15 09:44:43+00:00 - **Updated**: 2024-07-15 09:44:43+00:00 - **Authors**: Peihceng Wu, Runze Ma, Teoh Teik Toe - **Comment**: Published in: 2023 3rd International Conference on Electronic Engineering (ICEEM) - **Journal**: None - **Summary**: This paper proposes a CNN classification network based on Bagging and stacking ensemble learning methods for breast cancer classification. The model was trained and tested on the public dataset of DDSM. The model is capable of fast and accurate classification of input images. According to our research results, for binary classification (presence or absence of breast cancer), the accuracy reached 98.84%, and for five-class classification, the accuracy reached 98.34%. The model also achieved a micro-average recall rate of 94.80% and an F1 score of 94.19%. In comparative experiments, we compared the effects of different values of bagging_ratio and n_models on the model, as well as several methods for ensemble bagging models. Furthermore, under the same parameter settings, our BSECNN outperformed VGG16 and ResNet-50 in terms of accuracy by 8.22% and 6.33% respectively. 
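The two hyperparameters named in the summary above, `bagging_ratio` and `n_models`, can be illustrated with a minimal sketch of the bagging step. This is not the authors' BSECNN code: the helper names are hypothetical, and plain majority voting is used here as a stand-in combiner, whereas a stacking variant would replace the vote with a trained meta-learner.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, bagging_ratio, rng):
    """Draw one bag: sample indices with replacement, size = ratio * n_samples."""
    subset_size = max(1, int(round(bagging_ratio * n_samples)))
    return [rng.randrange(n_samples) for _ in range(subset_size)]


def make_bags(n_samples, n_models, bagging_ratio, seed=0):
    """Return one list of training indices per base model (hypothetical helper)."""
    rng = random.Random(seed)
    return [bootstrap_indices(n_samples, bagging_ratio, rng) for _ in range(n_models)]


def majority_vote(per_model_predictions):
    """Combine per-model label lists into a single label per sample."""
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Five base models, each trained on 60% of a 100-image training set.
bags = make_bags(n_samples=100, n_models=5, bagging_ratio=0.6)
# Toy predictions from three models on three test images.
labels = majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
```

Each base CNN would be trained only on the indices in its own bag; using an odd number of models avoids ties in the binary case.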
### A Survey of Defenses against AI-generated Visual Media: Detection, Disruption, and Authentication - **Arxiv ID**: http://arxiv.org/abs/2407.10575v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10575v1) - **Published**: 2024-07-15 09:46:02+00:00 - **Updated**: 2024-07-15 09:46:02+00:00 - **Authors**: Jingyi Deng, Chenhao Lin, Zhengyu Zhao, Shuai Liu, Qian Wang, Chao Shen - **Comment**: None - **Journal**: None - **Summary**: Deep generative models have demonstrated impressive performance in various computer vision applications, including image synthesis, video generation, and medical analysis. Despite their significant advancements, these models may be used for malicious purposes, such as misinformation, deception, and copyright violation. In this paper, we provide a systematic and timely review of research efforts on defenses against AI-generated visual media, covering detection, disruption, and authentication. We review existing methods and summarize the mainstream defense-related tasks within a unified passive and proactive framework. Moreover, we survey the derivative tasks concerning the trustworthiness of defenses, such as their robustness and fairness. For each task, we formulate its general pipeline and propose a taxonomy based on methodological strategies that are uniformly applicable to the primary subtasks. Additionally, we summarize the commonly used evaluation datasets, criteria, and metrics. Finally, by analyzing the reviewed studies, we provide insights into current research challenges and suggest possible directions for future research. 
### COSMU: Complete 3D human shape from monocular unconstrained images - **Arxiv ID**: http://arxiv.org/abs/2407.10586v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10586v1) - **Published**: 2024-07-15 10:06:59+00:00 - **Updated**: 2024-07-15 10:06:59+00:00 - **Authors**: Marco Pesavento, Marco Volino, Adrian Hilton - **Comment**: Accepted to ECCV24 - **Journal**: None - **Summary**: We present a novel framework to reconstruct complete 3D human shapes from a given target image by leveraging monocular unconstrained images. The objective of this work is to reproduce high-quality details in regions of the reconstructed human body that are not visible in the input target. The proposed methodology addresses the limitations of existing approaches for reconstructing 3D human shapes from a single image, which cannot reproduce shape details in occluded body regions. The missing information of the monocular input can be recovered by using multiple views captured from multiple cameras. However, multi-view reconstruction methods necessitate accurately calibrated and registered images, which can be challenging to obtain in real-world scenarios. Given a target RGB image and a collection of multiple uncalibrated and unregistered images of the same individual, acquired using a single camera, we propose a novel framework to generate complete 3D human shapes. We introduce a novel module to generate 2D multi-view normal maps of the person registered with the target input image. The module consists of body part-based reference selection and body part-based registration. The generated 2D normal maps are then processed by a multi-view attention-based neural implicit model that estimates an implicit representation of the 3D shape, ensuring the reproduction of details in both observed and occluded regions. 
Extensive experiments demonstrate that the proposed approach estimates higher quality details in the non-visible regions of the 3D clothed human shapes compared to related methods, without using parametric models. ### Deep-Learning-Based Markerless Pose Estimation Systems in Gait Analysis: DeepLabCut Custom Training and the Refinement Function - **Arxiv ID**: http://arxiv.org/abs/2407.10590v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10590v1) - **Published**: 2024-07-15 10:14:30+00:00 - **Updated**: 2024-07-15 10:14:30+00:00 - **Authors**: Giulia Panconi, Stefano Grasso, Sara Guarducci, Lorenzo Mucchi, Diego Minciacchi, Riccardo Bravi - **Comment**: None - **Journal**: None - **Summary**: The current gold standard for the study of human movement is the marker-based motion capture system, which offers high precision but is constrained by costs and controlled environments. Markerless pose estimation systems emerge as ecological alternatives, allowing unobtrusive data acquisition in natural settings. This study compares the performance of two popular markerless systems, OpenPose (OP) and DeepLabCut (DLC), in assessing locomotion. Forty healthy subjects walked along a 5-meter walkway equipped with four force platforms and a camera. Gait parameters were obtained using the OP BODY 25 Pre-Trained model (OPPT), the DLC Model Zoo full human Pre-Trained model (DLCPT) and the DLC Custom-Trained model (DLCCT), then compared with those acquired from the force platforms as the reference system. Our results demonstrated that DLCCT outperformed DLCPT and OPPT, highlighting the importance of leveraging DeepLabCut transfer learning to enhance the pose estimation performance with a custom-trained neural network. Moreover, DLCCT, with the implementation of the DLC refinement function, offers the most promising markerless pose estimation solution for evaluating locomotion. 
Therefore, our data provide insights into the DLC training and refinement processes required to achieve optimal performance. This study offers perspectives for clinicians and practitioners seeking accurate low-cost methods for movement assessment beyond laboratory settings. ### InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture - **Arxiv ID**: http://arxiv.org/abs/2407.10592v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10592v1) - **Published**: 2024-07-15 10:15:58+00:00 - **Updated**: 2024-07-15 10:15:58+00:00 - **Authors**: Phillip Mueller, Jannik Wiese, Ioan Craciun, Lars Mikelsons - **Comment**: None - **Journal**: None - **Summary**: Recent advancements in image synthesis are fueled by the advent of large-scale diffusion models. Yet, integrating realistic object visualizations seamlessly into new or existing backgrounds without extensive training remains a challenge. This paper introduces InsertDiffusion, a novel, training-free diffusion architecture that efficiently embeds objects into images while preserving their structural and identity characteristics. Our approach utilizes off-the-shelf generative models and eliminates the need for fine-tuning, making it ideal for rapid and adaptable visualizations in product design and marketing. We demonstrate superior performance over existing methods in terms of image realism and alignment with input conditions. By decomposing the generation task into independent steps, InsertDiffusion offers a scalable solution that extends the capabilities of diffusion models for practical applications, achieving high-quality visualizations that maintain the authenticity of the original objects. 
### An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots - **Arxiv ID**: http://arxiv.org/abs/2407.10596v1 - **DOI**: 10.1007/s12530-024-09604-6 - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10596v1) - **Published**: 2024-07-15 10:20:00+00:00 - **Updated**: 2024-07-15 10:20:00+00:00 - **Authors**: J. J. Cabrera, O. J. Céspedes, S. Cebollada, O. Reinoso, L. Payá - **Comment**: Published in Evolving Systems (2024): 08 July 2024 PDF link: https://link.springer.com/content/pdf/10.1007/s12530-024-09604-6.pdf - **Journal**: None - **Summary**: This work presents an evaluation of CNN models and data augmentation to carry out the hierarchical localization of a mobile robot by using omnidirectional images. In this sense, an ablation study of different state-of-the-art CNN models used as backbone is presented and a variety of data augmentation visual effects are proposed for addressing the visual localization of the robot. The proposed method is based on the adaptation and re-training of a CNN with a dual purpose: (1) to perform a rough localization step in which the model is used to predict the room from which an image was captured, and (2) to address the fine localization step, which consists in retrieving the most similar image of the visual map among those contained in the previously predicted room by means of a pairwise comparison between descriptors obtained from an intermediate layer of the CNN. In this sense, we evaluate the impact of different state-of-the-art CNN models such as ConvNeXt for addressing the proposed localization. Finally, a variety of data augmentation visual effects are separately employed for training the model and their impact is assessed. The performance of the resulting CNNs is evaluated under real operation conditions, including changes in the lighting conditions. 
Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git ### WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models - **Arxiv ID**: http://arxiv.org/abs/2407.10625v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10625v1) - **Published**: 2024-07-15 11:21:03+00:00 - **Updated**: 2024-07-15 11:21:03+00:00 - **Authors**: Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip H. S. Torr, Liang Lin - **Comment**: None - **Journal**: None - **Summary**: Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models including a video masked autoencoder for segment smoothness improvement and a self-supervised model for feature alignment of adjacent frame in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos. 
The project page website is at wildvidfit-project.github.io. ### Brain Tumor Classification From MRI Images Using Machine Learning - **Arxiv ID**: http://arxiv.org/abs/2407.10630v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10630v1) - **Published**: 2024-07-15 11:30:40+00:00 - **Updated**: 2024-07-15 11:30:40+00:00 - **Authors**: Vidhyapriya Ranganathan, Celshiya Udaiyar, Jaisree Jayanth, Meghaa P V, Srija B, Uthra S - **Comment**: None - **Journal**: None - **Summary**: Brain tumor is a life-threatening problem and hampers the normal functioning of the human body. The average five-year relative survival rate for malignant brain tumors is 35.6 percent. For proper diagnosis and efficient treatment planning, it is necessary to detect the brain tumor in early stages. Due to advancement in medical imaging technology, the brain images are taken in different modalities. The ability to extract relevant characteristics from magnetic resonance imaging (MRI) scans is a crucial step for brain tumor classifiers. Several studies have proposed various strategies to extract relevant features from different modalities of MRI to predict the growth of abnormal tumors. Most techniques used conventional methods of image processing for feature extraction and machine learning for classification. More recently, the use of deep learning algorithms in medical imaging has resulted in significant improvements in the classification and diagnosis of brain tumors. Since tumors are located at different regions of the brain, localizing the tumor and classifying it to a particular category is a challenging task. The objective of this project is to develop a predictive system for brain tumor detection using machine learning(ensembling). 
### Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model - **Arxiv ID**: http://arxiv.org/abs/2407.10632v1 - **DOI**: None - **Categories**: **eess.IV**, cs.AI, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10632v1) - **Published**: 2024-07-15 11:36:22+00:00 - **Updated**: 2024-07-15 11:36:22+00:00 - **Authors**: Zhening Liu, Xinjie Zhang, Jiawei Shao, Zehong Lin, Jun Zhang - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo image compression architecture, named BiSIC. Specifically, we propose a 3D convolution based codec backbone to capture local features and incorporate bidirectional attention blocks to exploit global features. Moreover, we design a novel cross-dimensional entropy model that integrates various conditioning factors, including the spatial context, channel context, and stereo dependency, to effectively estimate the distribution of latent representations for entropy coding. Extensive experiments demonstrate that our proposed BiSIC outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM. 
### Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction - **Arxiv ID**: http://arxiv.org/abs/2407.10636v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10636v1) - **Published**: 2024-07-15 11:48:57+00:00 - **Updated**: 2024-07-15 11:48:57+00:00 - **Authors**: Lin Zhu, Yunlong Zheng, Yijun Zhang, Xiao Wang, Lizhi Wang, Hua Huang - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene, resulting in over-smoothing and blurry artifacts. Addressing this challenge necessitates the integration of conditional information, encompassing temporal features, low-frequency texture, and high-frequency events, to guide the Denoising Diffusion Probabilistic Model (DDPM) in producing accurate and natural outputs. To tackle this issue, we introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors. Our framework incorporates three key conditioning modules: a pre-trained low-frequency intensity estimation module, a temporal recurrent encoder module, and an attention-based high-frequency prior enhancement module. In order to capture temporal scene variations from the events at the current moment, we employ a temporal-domain residual image as the target for the diffusion model. Through the combination of these three conditioning paths and the temporal residual framework, our framework excels in reconstructing high-quality videos from event flow, mitigating issues such as artifacts and over-smoothing commonly observed in previous approaches. 
Extensive experiments conducted on multiple benchmark datasets validate the superior performance of our framework compared to prior event-based reconstruction methods. ### Risk-aware Trajectory Prediction by Incorporating Spatio-temporal Traffic Interaction Analysis - **Arxiv ID**: http://arxiv.org/abs/2407.10639v1 - **DOI**: None - **Categories**: **cs.RO**, cs.AI, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10639v1) - **Published**: 2024-07-15 11:57:06+00:00 - **Updated**: 2024-07-15 11:57:06+00:00 - **Authors**: Divya Thuremella, Lewis Ince, Lars Kunze - **Comment**: None - **Journal**: 2023 IEEE International Conference on Robotics and Automation (ICRA) 2023 IEEE International Conference on Robotics and Automation (ICRA) 2024 IEEE International Conference on Robotics and Automation (ICRA) - **Summary**: To operate in open-ended environments where humans interact in complex, diverse ways, autonomous robots must learn to predict their behaviour, especially when that behavior is potentially dangerous to other agents or to the robot. However, reducing the risk of accidents requires prior knowledge of where potential collisions may occur and how. Therefore, we propose to gain this information by analyzing locations and speeds that commonly correspond to high-risk interactions within the dataset, and use it within training to generate better predictions in high risk situations. Through these location-based and speed-based re-weighting techniques, we achieve improved overall performance, as measured by most-likely FDE and KDE, as well as improved performance on high-speed vehicles, and vehicles within high-risk locations. 
### Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems - **Arxiv ID**: http://arxiv.org/abs/2407.10641v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10641v1) - **Published**: 2024-07-15 12:00:46+00:00 - **Updated**: 2024-07-15 12:00:46+00:00 - **Authors**: Hyungjin Chung, Jong Chul Ye - **Comment**: ECCV 2024, 25 pages, 8 figures - **Journal**: None - **Summary**: Recent inverse problem solvers that leverage generative diffusion priors have garnered significant attention due to their exceptional quality. However, adaptation of the prior is necessary when there exists a discrepancy between the training and testing distributions. In this work, we propose deep diffusion image prior (DDIP), which generalizes the recent adaptation method of SCD by introducing a formal connection to the deep image prior. Under this framework, we propose an efficient adaptation method dubbed D3IP, specified for 3D measurements, which accelerates DDIP by orders of magnitude while achieving superior performance. D3IP enables seamless integration of 3D inverse solvers and thus leads to coherent 3D reconstruction. Moreover, we show that meta-learning techniques can also be applied to yield even better performance. We show that our method is capable of solving diverse 3D reconstructive tasks from the generative prior trained only with phantom images that are vastly different from the training set, opening up new opportunities of applying diffusion inverse solvers even when training with gold standard data is impossible. 
Code: https://github.com/HJ-harry/DDIP3D ### APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation - **Arxiv ID**: http://arxiv.org/abs/2407.10649v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10649v1) - **Published**: 2024-07-15 12:10:05+00:00 - **Updated**: 2024-07-15 12:10:05+00:00 - **Authors**: Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao - **Comment**: None - **Journal**: None - **Summary**: Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework involves using image-level labels as training data to generate pixel-level pseudo-labels with refinements. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels, particularly in recognizing complete object regions, compared to CNN methods. However, current ViT-based approaches have limitations in their use of patch embeddings: they are prone to being dominated by certain abnormal patches, and many multi-stage methods are time-consuming to train and thus inefficient. Therefore, in this paper, we introduce a novel ViT-based WSSS method named *Adaptive Patch Contrast* (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC utilizes an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, thereby further improving the final results. Furthermore, we improve upon the existing multi-stage training framework without CAM by transforming it into an end-to-end single-stage training approach, thereby enhancing training efficiency. 
The experimental results show that our approach is effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets within a shorter training duration. ### OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer - **Arxiv ID**: http://arxiv.org/abs/2407.10655v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10655v1) - **Published**: 2024-07-15 12:15:27+00:00 - **Updated**: 2024-07-15 12:15:27+00:00 - **Authors**: Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang - **Comment**: 4 pages - **Journal**: None - **Summary**: Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment. We align the detector with the text encoder of the VLM by replacing the fixed classification layer weights in the detector with the class-name embeddings extracted from the text encoder. Without an additional fusing module, OVLW-DETR is flexible and deployment-friendly, making it easier to implement and modulate. improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark. Source code and pre-trained models are available at [https://github.com/Atten4Vis/LW-DETR]. 
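The alignment idea in the OVLW-DETR summary above, using class-name embeddings from a text encoder in place of fixed classification weights, can be sketched as follows. This is an illustrative toy rather than the released implementation: the function names, the cosine-similarity scoring, the `scale` parameter, and the three-dimensional embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def class_logits(query_feature, class_embeddings, scale=1.0):
    """Score one detector query against every class-name embedding.

    The embeddings play the role of classification-layer weights, so changing
    the vocabulary only requires swapping the embedding list.
    """
    q = l2_normalize(query_feature)
    return [
        scale * sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for emb in class_embeddings
    ]


# Toy stand-ins for text-encoder outputs; real ones would come from the VLM.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
names = list(toy_embeddings)
logits = class_logits([0.1, 0.2, 0.9], [toy_embeddings[n] for n in names])
best = names[max(range(len(names)), key=logits.__getitem__)]
```

Because the class list is just data, a category that was never seen during detector training can be added by appending its text embedding.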
### Spatio-temporal neural distance fields for conditional generative modeling of the heart - **Arxiv ID**: http://arxiv.org/abs/2407.10663v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10663v1) - **Published**: 2024-07-15 12:26:52+00:00 - **Updated**: 2024-07-15 12:26:52+00:00 - **Authors**: Kristine Sørensen, Paula Diez, Jan Margeta, Yasmin El Youssef, Michael Pham, Jonas Jalili Pedersen, Tobias Kühl, Ole de Backer, Klaus Kofoed, Oscar Camara, Rasmus Paulsen - **Comment**: Accepted for MICCAI2024 - **Journal**: None - **Summary**: The rhythmic pumping motion of the heart stands as a cornerstone in life, as it circulates blood to the entire human body through a series of carefully timed contractions of the individual chambers. Changes in the size, shape and movement of the chambers can be important markers for cardiac disease and modeling this in relation to clinical demography or disease is therefore of interest. Existing methods for spatio-temporal modeling of the human heart require shape correspondence over time or suffer from large memory requirements, making it difficult to use for complex anatomies. We introduce a novel conditional generative model, where the shape and movement is modeled implicitly in the form of a spatio-temporal neural distance field and conditioned on clinical demography. The model is based on an auto-decoder architecture and aims to disentangle the individual variations from that related to the clinical demography. It is tested on the left atrium (including the left atrial appendage), where it outperforms current state-of-the-art methods for anatomical sequence completion and generates synthetic sequences that realistically mimics the shape and motion of the real left atrium. 
In practice, this means we can infer functional measurements from a static image, generate synthetic populations with specified demography or disease and investigate how non-imaging clinical data affect the shape and motion of cardiac anatomies. ### GeoMix: Towards Geometry-Aware Data Augmentation - **Arxiv ID**: http://arxiv.org/abs/2407.10681v1 - **DOI**: 10.1145/3637528.3671700 - **Categories**: **cs.LG**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10681v1) - **Published**: 2024-07-15 12:58:04+00:00 - **Updated**: 2024-07-15 12:58:04+00:00 - **Authors**: Wentao Zhao, Qitian Wu, Chenxiao Yang, Junchi Yan - **Comment**: Published as a conference paper at KDD 2024 - **Journal**: None - **Summary**: Mixup has shown considerable success in mitigating the challenges posed by limited labeled data in image classification. By synthesizing samples through the interpolation of features and labels, Mixup effectively addresses the issue of data scarcity. However, it has rarely been explored in graph learning tasks due to the irregularity and connectivity of graph data. Specifically, in node classification tasks, Mixup presents a challenge in creating connections for synthetic data. In this paper, we propose Geometric Mixup (GeoMix), a simple and interpretable Mixup approach leveraging in-place graph editing. It effectively utilizes geometry information to interpolate features and labels with those from the nearby neighborhood, generating synthetic nodes and establishing connections for them. We conduct theoretical analysis to elucidate the rationale behind employing geometry information for node Mixup, emphasizing the significance of locality enhancement, a critical aspect of our method's design. Extensive experiments demonstrate that our lightweight Geometric Mixup achieves state-of-the-art results on a wide variety of standard datasets with limited labeled data. 
Furthermore, it significantly improves the generalization capability of underlying GNNs across various challenging out-of-distribution generalization tasks. Our code is available at https://github.com/WtaoZhao/geomix. ### Addressing Image Hallucination in Text-to-Image Generation through Factual Image Retrieval - **Arxiv ID**: http://arxiv.org/abs/2407.10683v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10683v1) - **Published**: 2024-07-15 12:59:03+00:00 - **Updated**: 2024-07-15 12:59:03+00:00 - **Authors**: Youngsun Lim, Hyunjung Shim - **Comment**: This paper has been accepted for oral presentation at the IJCAI 2024 Workshop on Trustworthy Interactive Decision-Making with Foundation Models - **Journal**: None - **Summary**: Text-to-image generation has shown remarkable progress with the emergence of diffusion models. However, these models often generate factually inconsistent images, failing to accurately reflect the factual information and common sense conveyed by the input text prompts. We refer to this issue as Image hallucination. Drawing from studies on hallucinations in language models, we classify this problem into three types and propose a methodology that uses factual images retrieved from external sources to generate realistic images. Depending on the nature of the hallucination, we employ off-the-shelf image editing tools, either InstructPix2Pix or IP-Adapter, to leverage factual information from the retrieved image. This approach enables the generation of images that accurately reflect the facts and common sense. 
### FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation - **Arxiv ID**: http://arxiv.org/abs/2407.10687v1 - **DOI**: None - **Categories**: **cs.CV**, cs.GR - **Links**: [PDF](http://arxiv.org/pdf/2407.10687v1) - **Published**: 2024-07-15 13:01:44+00:00 - **Updated**: 2024-07-15 13:01:44+00:00 - **Authors**: Honghao Xu, Juzhan Xu, Zeyu Huang, Pengfei Xu, Hui Huang, Ruizhen Hu - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: In this paper, we introduce a novel method called FRI-Net for 2D floorplan reconstruction from 3D point cloud. Existing methods typically rely on corner regression or box regression, which lack consideration for the global shapes of rooms. To address these issues, we propose a novel approach using a room-wise implicit representation with structural regularization to characterize the shapes of rooms in floorplans. By incorporating geometric priors of room layouts in floorplans into our training strategy, the generated room polygons are more geometrically regular. We have conducted experiments on two challenging datasets, Structured3D and SceneCAD. Our method demonstrates improved performance compared to state-of-the-art methods, validating the effectiveness of our proposed representation for floorplan reconstruction. ### Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification - **Arxiv ID**: http://arxiv.org/abs/2407.10694v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10694v1) - **Published**: 2024-07-15 13:08:42+00:00 - **Updated**: 2024-07-15 13:08:42+00:00 - **Authors**: Zhihao Chen, Yiyuan Ge, Qing Yue - **Comment**: 2024 International Conference on Intelligent Computing - **Journal**: None - **Summary**: Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. 
Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods. ### IE-NeRF: Inpainting Enhanced Neural Radiance Fields in the Wild - **Arxiv ID**: http://arxiv.org/abs/2407.10695v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10695v1) - **Published**: 2024-07-15 13:10:23+00:00 - **Updated**: 2024-07-15 13:10:23+00:00 - **Authors**: Shuaixian Wang, Haoran Xu, Yaokun Li, Jiwei Chen, Guang Tan - **Comment**: None - **Journal**: None - **Summary**: We present a novel approach for synthesizing realistic novel views using Neural Radiance Fields (NeRF) with uncontrolled photos in the wild. While NeRF has shown impressive results in controlled settings, it struggles with transient objects commonly found in dynamic and time-varying scenes. 
Our framework, called Inpainting Enhanced NeRF (IE-NeRF), enhances the conventional NeRF by drawing inspiration from the technique of image inpainting. Specifically, our approach extends the Multi-Layer Perceptrons (MLP) of NeRF, enabling it to simultaneously generate intrinsic properties (static color, density) and extrinsic transient masks. We introduce an inpainting module that leverages the transient masks to effectively exclude occlusions, resulting in improved volume rendering quality. Additionally, we propose a new training strategy with frequency regularization to address the sparsity issue of low-frequency transient components. We evaluate our approach on internet photo collections of landmarks, demonstrating its ability to generate high-quality novel views and achieve state-of-the-art performance. ### Deep ContourFlow: Advancing Active Contours with Deep Learning - **Arxiv ID**: http://arxiv.org/abs/2407.10696v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10696v1) - **Published**: 2024-07-15 13:12:34+00:00 - **Updated**: 2024-07-15 13:12:34+00:00 - **Authors**: Antoine Habis, Vannary Meas-Yedid, Elsa Angelini, Jean-Christophe Olivo-Marin - **Comment**: 11 pages, 12 figures - **Journal**: None - **Summary**: This paper introduces a novel approach that combines unsupervised active contour models with deep learning for robust and adaptive image segmentation. Indeed, traditional active contours provide a flexible framework for contour evolution, while deep learning offers the capacity to learn intricate features and patterns directly from raw data. Our proposed methodology leverages the strengths of both paradigms, presenting a framework for both unsupervised and one-shot approaches for image segmentation. It is capable of capturing complex object boundaries without the need for extensive labeled training data. 
This is particularly needed in histology, a field facing a significant shortage of annotations due to the challenging and time-consuming nature of the annotation process. We illustrate and compare our results to state-of-the-art methods on a histology dataset and show significant improvements. ### Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation - **Arxiv ID**: http://arxiv.org/abs/2407.10703v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10703v1) - **Published**: 2024-07-15 13:18:37+00:00 - **Updated**: 2024-07-15 13:18:37+00:00 - **Authors**: Yuhwan Jeong, Hoonhee Cho, Kuk-Jin Yoon - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Event cameras with high dynamic range ensure scene capture even in low-light conditions. However, night events exhibit patterns different from those captured during the day. This difference causes performance degradation when applying night events to a model trained solely on day events. This limitation persists due to a lack of annotated night events. To overcome the limitation, we aim to alleviate data imbalance by translating annotated day data into night events. However, generating events from different modalities makes it challenging to reproduce their unique properties. Accordingly, we propose an unpaired event-to-event day-to-night translation model that effectively learns to map from one domain to another using Diffusion GAN. The proposed translation model analyzes events in the spatio-temporal dimension with wavelet decomposition and disentangled convolution layers. We also propose a new temporal contrastive learning method with a novel shuffling and sampling strategy to regularize temporal continuity. To validate the efficacy of the proposed methodology, we redesign metrics for evaluating events translated in an unpaired setting, aligning them with the event modality for the first time. 
Our framework achieves successful day-to-night event translation while preserving the characteristics of events. In addition, through our translation method, we enable event-based models to learn about night events by translating annotated day events into night events. Our approach effectively mitigates the performance degradation of applying real night events to downstream tasks. The code is available at https://github.com/jeongyh98/UDNET. ### Quantized Prompt for Efficient Generalization of Vision-Language Models - **Arxiv ID**: http://arxiv.org/abs/2407.10704v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10704v1) - **Published**: 2024-07-15 13:19:56+00:00 - **Updated**: 2024-07-15 13:19:56+00:00 - **Authors**: Tianxiang Hao, Xiaohan Ding, Juexiao Feng, Yuhong Yang, Hui Chen, Guiguang Ding - **Comment**: 14 pages, 7 figures. Accepted by ECCV 2024 - **Journal**: None - **Summary**: In the past few years, large-scale pre-trained vision-language models like CLIP have achieved tremendous success in various fields. Naturally, how to transfer the rich knowledge in such huge pre-trained models to downstream tasks and datasets has become a hot topic. During downstream adaptation, the most challenging problems are overfitting and catastrophic forgetting, which can cause the model to overly focus on the current data and lose more crucial domain-general knowledge. Existing works use classic regularization techniques to solve the problems. As solutions become increasingly complex, the ever-growing storage and inference costs are also a significant problem that urgently needs to be addressed. In this paper, we start from the observation that proper random noise can suppress overfitting and catastrophic forgetting. We then regard quantization error as a kind of noise and explore quantization for regularizing vision-language models, which is both efficient and effective. 
Furthermore, to improve the model's generalization capability while maintaining its specialization capacity at minimal cost, we deeply analyze the characteristics of the weight distribution in prompts, derive several principles for quantization module design, and follow these principles to create several competitive baselines. The proposed method is highly efficient due to its inherent lightweight nature, making it possible to adapt on extremely resource-limited devices. Our method can be fruitfully integrated into many existing approaches like MaPLe, enhancing accuracy while reducing storage overhead, making it more powerful yet versatile. Extensive experiments on 11 datasets show the clear superiority of our method. Code is available at https://github.com/beyondhtx/QPrompt. ### Interactive Rendering of Relightable and Animatable Gaussian Avatars - **Arxiv ID**: http://arxiv.org/abs/2407.10707v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10707v1) - **Published**: 2024-07-15 13:25:07+00:00 - **Updated**: 2024-07-15 13:25:07+00:00 - **Authors**: Youyi Zhan, Tianjia Shao, He Wang, Yin Yang, Kun Zhou - **Comment**: None - **Journal**: None - **Summary**: Creating relightable and animatable avatars from multi-view or monocular videos is a challenging task for digital human creation and virtual reality applications. Previous methods rely on neural radiance fields or ray tracing, resulting in slow training and rendering processes. By utilizing Gaussian Splatting, we propose a simple and efficient method to decouple body materials and lighting from sparse-view or monocular avatar videos, so that the avatar can be rendered simultaneously under novel viewpoints, poses, and lightings at interactive frame rates (6.9 fps). Specifically, we first obtain the canonical body mesh using a signed distance function and assign attributes to each mesh vertex. 
The Gaussians in the canonical space then interpolate from nearby body mesh vertices to obtain the attributes. We subsequently deform the Gaussians to the posed space using forward skinning, and combine the learnable environment light with the Gaussian attributes for shading computation. To achieve fast shadow modeling, we rasterize the posed body mesh from dense viewpoints to obtain the visibility. Our approach is not only simple but also fast enough to allow interactive rendering of avatar animation under environmental light changes. Experiments demonstrate that, compared to previous works, our method can render higher quality results at a faster speed on both synthetic and real datasets. ### Detecting Omissions in Geographic Maps through Computer Vision - **Arxiv ID**: http://arxiv.org/abs/2407.10709v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10709v1) - **Published**: 2024-07-15 13:26:58+00:00 - **Updated**: 2024-07-15 13:26:58+00:00 - **Authors**: Phuc D. A. Nguyen, Anh Do, Minh Hoai - **Comment**: VinMap dataset: https://github.com/VinAIResearch/VinMap - **Journal**: None - **Summary**: This paper explores the application of computer vision technologies to the analysis of maps, an area with substantial historical, cultural, and political significance. Our focus is on developing and evaluating a method for automatically identifying maps that depict specific regions and feature landmarks with designated names, a task that involves complex challenges due to the diverse styles and methods used in map creation. We address three main subtasks: differentiating maps from non-maps, verifying the accuracy of the region depicted, and confirming the presence or absence of particular landmark names through advanced text recognition techniques. 
Our approach utilizes a Convolutional Neural Network and transfer learning to differentiate maps from non-maps, verify the accuracy of depicted regions, and confirm landmark names through advanced text recognition. We also introduce the VinMap dataset, containing annotated map images of Vietnam, to train and test our method. Experiments on this dataset demonstrate that our technique achieves an F1-score of 85.51% for identifying maps excluding specific territorial landmarks. This result suggests practical utility and indicates areas for future improvement. ### Anticipating Future Object Compositions without Forgetting - **Arxiv ID**: http://arxiv.org/abs/2407.10723v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10723v1) - **Published**: 2024-07-15 13:49:31+00:00 - **Updated**: 2024-07-15 13:49:31+00:00 - **Authors**: Youssef Zahran, Gertjan Burghouts, Yke Bauke Eisma - **Comment**: None - **Journal**: None - **Summary**: Despite the significant advancements in computer vision models, their ability to generalize to novel object-attribute compositions remains limited. Existing methods for Compositional Zero-Shot Learning (CZSL) mainly focus on image classification. This paper aims to enhance CZSL in object detection without forgetting previously learned knowledge. We use Grounding DINO, incorporate Compositional Soft Prompting (CSP) into it, and extend it with Compositional Anticipation. We achieve a 70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on the CLEVR dataset. Furthermore, we introduce Contrastive Prompt Tuning to incrementally address model confusion between similar compositions. We demonstrate the effectiveness of this method and achieve an increase of 14.5% in HM across the pretrain, increment, and unseen sets. 
Collectively, these methods provide a framework for learning various compositions with limited data, as well as improving the performance of underperforming compositions when additional data becomes available. ### ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation - **Arxiv ID**: http://arxiv.org/abs/2407.10730v1 - **DOI**: None - **Categories**: **cs.CV**, cs.PF - **Links**: [PDF](http://arxiv.org/pdf/2407.10730v1) - **Published**: 2024-07-15 13:58:24+00:00 - **Updated**: 2024-07-15 13:58:24+00:00 - **Authors**: Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo - **Comment**: 5 pages, 3 figures, presented at the MLArchSys workshop of ISCA'2024 - **Journal**: None - **Summary**: Convolution is a compute-intensive operation at the heart of Convolutional Neural Networks (CNNs). It has led to the development of many high-performance algorithms, such as Im2col-GEMM, Winograd, and Direct-Convolution. However, the comparison of different convolution algorithms is an error-prone task as it requires specific data layouts and system resources. Failure to address these requirements might lead to unwanted time penalties. Thus, considering all processing steps within convolution algorithms is essential to comprehensively evaluate and fairly compare their performance. Furthermore, most known convolution benchmarks adopt ad-hoc testing suites with limited coverage and handmade operations. This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms. It assesses 9243 convolution operations derived from 1097 real-world deep learning models, resulting in performance and execution breakdown graphs for a detailed evaluation. ConvBench's capability is evaluated on the Sliced Convolution (SConv) algorithm. In the experiments, SConv was faster than Im2col-GEMM in 93.6% of the convolutions. 
However, ConvBench made it possible to examine the remaining 6.4% of underperforming convolutions, uncovering a critical slowdown, 79.5% on average, in SConv's packing step. This analysis underscores a potential source of optimization for SConv, opening up new paths for convolution designers to improve their algorithms. ### Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture - **Arxiv ID**: http://arxiv.org/abs/2407.10733v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10733v1) - **Published**: 2024-07-15 14:01:03+00:00 - **Updated**: 2024-07-15 14:01:03+00:00 - **Authors**: Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho, Chanmin Park, Jinyoung Kim, Won Hwa Kim - **Comment**: 27 pages, 5 figures - **Journal**: None - **Summary**: In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries. Our approach addresses two critical challenges in self-supervised learning: 1) extracting comprehensive representations for universal image segmentation from a pixel decoder, and 2) effectively training the transformer decoder. The use of the transformer decoder as a predictor within the JEPA framework allows proficient training in universal image segmentation tasks. Through rigorous evaluations on datasets such as ADE20K, Cityscapes and COCO, Mask-JEPA demonstrates not only competitive results but also exceptional adaptability and robustness across various training scenarios. The architecture-agnostic nature of Mask-JEPA further underscores its versatility, allowing seamless adaptation to various members of the mask classification family. 
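The joint-embedding predictive idea named in the Mask-JEPA summary can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear encoders, the linear predictor, the zero-masking of patches, and the exponential-moving-average (EMA) target update (borrowed from I-JEPA-style training) are all assumptions made for illustration, and every function name below is hypothetical. The sketch only shows the general pattern of predicting target embeddings from context embeddings and measuring the error in latent space.

```python
import numpy as np

def encode(x, weights):
    """Toy encoder: one linear layer plus tanh (stand-in for a real backbone)."""
    return np.tanh(x @ weights)

def jepa_loss(context, target, w_context, w_target, w_predictor):
    """Mean squared error between predicted and actual target embeddings."""
    z_context = encode(context, w_context)
    z_target = encode(target, w_target)    # in real training, no gradient flows here
    z_predicted = z_context @ w_predictor  # hypothetical linear predictor
    return float(np.mean((z_predicted - z_target) ** 2))

def ema_update(w_target, w_context, momentum=0.99):
    """Move target-encoder weights slowly toward the context-encoder weights."""
    return momentum * w_target + (1.0 - momentum) * w_context

rng = np.random.default_rng(0)
w_context = rng.normal(size=(8, 4))   # context-encoder weights (assumed shapes)
w_target = w_context.copy()           # target encoder starts as a copy
w_predictor = np.eye(4)               # identity predictor for the demo

image_patches = rng.normal(size=(16, 8))  # 16 patches, 8 features each
masked_patches = image_patches.copy()
masked_patches[:4] = 0.0                  # hide 4 patches from the context branch

loss = jepa_loss(masked_patches, image_patches, w_context, w_target, w_predictor)
w_target = ema_update(w_target, w_context)
```

In a real system the context encoder and predictor would be trained by gradient descent on this loss; according to the summary, Mask-JEPA uses the transformer decoder in the predictor role, which this toy version does not attempt to reproduce.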
### When Synthetic Traces Hide Real Content: Analysis of Stable Diffusion Image Laundering - **Arxiv ID**: http://arxiv.org/abs/2407.10736v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.MM - **Links**: [PDF](http://arxiv.org/pdf/2407.10736v1) - **Published**: 2024-07-15 14:01:35+00:00 - **Updated**: 2024-07-15 14:01:35+00:00 - **Authors**: Sara Mandelli, Paolo Bestagini, Stefano Tubaro - **Comment**: None - **Journal**: None - **Summary**: In recent years, methods for producing highly realistic synthetic images have significantly advanced, allowing the creation of high-quality images from text prompts that describe the desired content. Even more impressively, Stable Diffusion (SD) models now provide users with the option of creating synthetic images in an image-to-image translation fashion, modifying images in the latent space of advanced autoencoders. This striking evolution, however, brings an alarming consequence: it is possible to pass an image through SD autoencoders to reproduce a synthetic copy of the image with high realism and almost no visual artifacts. This process, known as SD image laundering, can transform real images into lookalike synthetic ones and risks complicating forensic analysis for content authenticity verification. Our paper investigates the forensic implications of image laundering, revealing a serious potential to obscure traces of real content, including sensitive and harmful materials that could be mistakenly classified as synthetic, thereby undermining the protection of individuals depicted. To address this issue, we propose a two-stage detection pipeline that effectively differentiates between pristine, laundered, and fully synthetic images (those generated from text prompts), showing robustness across various conditions. 
Finally, we highlight another alarming property of image laundering, which appears to mask the unique artifacts exploited by forensic detectors to solve the camera model identification task, strongly undermining their performance. Our experimental code is available at https://github.com/polimi-ispl/synthetic-image-detection. ### Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models - **Arxiv ID**: http://arxiv.org/abs/2407.10737v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10737v1) - **Published**: 2024-07-15 14:06:13+00:00 - **Updated**: 2024-07-15 14:06:13+00:00 - **Authors**: Rining Wu, Feixiang Zhou, Ziwei Yin, Jian K. Liu - **Comment**: This article is accepted by ECCV 2024 (paper ID 12149). Accepted paper IDs can be found at: https://eccv2024.ecva.net/Conferences/2024/AcceptedPapers - **Journal**: None - **Summary**: Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are entrapped in the neuronal responses of the retina. It is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have paved an advanced way of understanding image pixels. Yet, neuronal coding in the brain largely lacks a deep understanding of its alignment with pixels. Most previous studies employ static images or artificial videos derived from static images for emulating more real and complicated stimuli. Although these simple scenarios effectively help to separate key factors influencing visual coding, complex temporal relationships receive no consideration. 
To decompose the temporal features of visual coding in natural scenes, here we propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, aimed at unraveling the temporal-based encoding patterns of retinal neuronal populations. The model demonstrates robust predictive performance in generalization tests. Furthermore, through detailed ablation experiments, we demonstrate the significance of each temporal module. In addition, we introduce a visual coding evaluation metric designed to integrate temporal considerations and compare the impact of different numbers of neuronal populations on complementary coding. In conclusion, our proposed Vi-ST demonstrates a novel modeling framework for neuronal coding of dynamic visual scenes in the brain, effectively aligning our brain representation of video with neuronal activity. The code is available at https://github.com/wurining/Vi-ST. ### AccDiffusion: An Accurate Method for Higher-Resolution Image Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10738v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10738v1) - **Published**: 2024-07-15 14:06:29+00:00 - **Updated**: 2024-07-15 14:06:29+00:00 - **Authors**: Zhihang Lin, Mingbao Lin, Meng Zhao, Rongrong Ji - **Comment**: 23 pages - **Journal**: None - **Summary**: This paper attempts to address the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals that an identical text prompt for different patches causes repeated object generation, while using no prompt compromises the image details. Therefore, our AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch. 
Besides, AccDiffusion also introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation. Experimental comparison with existing methods demonstrates that our AccDiffusion effectively addresses the issue of repeated object generation and leads to better performance in higher-resolution image generation. Our code is released at https://github.com/lzhxmu/AccDiffusion. ### SEED: A Simple and Effective 3D DETR in Point Clouds - **Arxiv ID**: http://arxiv.org/abs/2407.10749v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10749v1) - **Published**: 2024-07-15 14:21:07+00:00 - **Updated**: 2024-07-15 14:21:07+00:00 - **Authors**: Zhe Liu, Jinghua Hou, Xiaoqing Ye, Tong Wang, Jingdong Wang, Xiang Bai - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Recently, detection transformers (DETRs) have gradually taken a dominant position in 2D detection thanks to their elegant framework. However, DETR-based detectors for 3D point clouds still struggle to achieve satisfactory performance. We argue that the main challenges are twofold: 1) obtaining appropriate object queries is challenging due to the high sparsity and uneven distribution of point clouds; 2) how to implement effective query interaction by exploiting the rich geometric structure of point clouds has not been fully explored. To this end, we propose a simple and effective 3D DETR method (SEED) for detecting 3D objects from point clouds, which involves a dual query selection (DQS) module and a deformable grid attention (DGA) module. More concretely, to obtain appropriate queries, DQS first ensures a high recall to retain a large number of queries by the predicted confidence scores and then further picks out high-quality queries according to the estimated quality scores. 
DGA uniformly divides each reference box into grids as the reference points and then utilizes the predicted offsets to achieve a flexible receptive field, allowing the network to focus on relevant regions and capture more informative features. Extensive ablation studies on DQS and DGA demonstrate their effectiveness. Furthermore, our SEED achieves state-of-the-art detection performance on both the large-scale Waymo and nuScenes datasets, illustrating the superiority of our proposed method. The code is available at https://github.com/happinesslz/SEED ### OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection - **Arxiv ID**: http://arxiv.org/abs/2407.10753v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10753v1) - **Published**: 2024-07-15 14:29:15+00:00 - **Updated**: 2024-07-15 14:29:15+00:00 - **Authors**: Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors utilizing pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which is poorly suited to existing DETR-based 3D detectors because it lacks the depth of the 3D object center; 2) for distant objects, fine-grained depth estimation of the whole object is more challenging. Therefore, we argue that the object-wise depth (or 3D center of the object) is essential for accurate detection. In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. 
Specifically, we first employ an object-wise depth encoder, which takes the pixel-wise depth map as a prior, to accurately estimate the object-wise depth. Then, we utilize the proposed object-wise position embedding to encode the object-wise depth information into the transformer decoder, thereby producing 3D object-aware features for final detection. Extensive experiments verify the effectiveness of our proposed method. Furthermore, OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark. ### An Autonomous Drone Swarm for Detecting and Tracking Anomalies among Dense Vegetation - **Arxiv ID**: http://arxiv.org/abs/2407.10754v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10754v1) - **Published**: 2024-07-15 14:31:21+00:00 - **Updated**: 2024-07-15 14:31:21+00:00 - **Authors**: Rakesh John Amala Arokia Nathan, Sigrid Strand, Daniel Mehrwald, Dmitriy Shutin, Oliver Bimber - **Comment**: None - **Journal**: None - **Summary**: Swarms of drones offer an increased sensing aperture, and having them mimic behaviors of natural swarms enhances sampling by adapting the aperture to local conditions. We demonstrate that such an approach makes detecting and tracking heavily occluded targets practically feasible. While object classification applied to conventional aerial images generalizes poorly to the randomness of occlusion and is therefore inefficient even under lightly occluded conditions, anomaly detection applied to synthetic aperture integral images is robust for dense vegetation, such as forests, and is independent of pre-trained classes. Our autonomous swarm searches the environment for occurrences of the unknown or unexpected, tracking them while continuously adapting its sampling pattern to optimize for local viewing conditions. 
In our real-life field experiments with a swarm of six drones, we achieved an average positional accuracy of 0.39 m with an average precision of 93.2% and an average recall of 95.9%. Here, adapted particle swarm optimization considers detection confidences and predicted target appearance. We show that sensor noise can effectively be included in the synthetic aperture image integration process, removing the need for a computationally costly optimization of high-dimensional parameter spaces. Finally, we present a complete hard- and software framework that supports low-latency transmission (approx. 80 ms round-trip time) and fast processing (approx. 600 ms per formation step) of extensive (70-120 Mbit/s) video and telemetry data, and swarm control for swarms of up to ten drones. ### GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation - **Arxiv ID**: http://arxiv.org/abs/2407.10756v2 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10756v2) - **Published**: 2024-07-15 14:32:45+00:00 - **Updated**: 2024-07-16 14:32:21+00:00 - **Authors**: Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang - **Comment**: ECCV 2024 accepted - **Journal**: None - **Summary**: In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches face challenges of less applicability in the industrial community due to the large number of parametric quantities and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation primarily rely on CNNs, we propose the Group-based Token Pruning Transformer (GTPT) that fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner. 
It minimizes the computation overhead while ensuring high performance. Besides, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose the Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, the experimental results show that GTPT can achieve higher performance with less computation, especially in whole-body with numerous keypoints. ### Domain Generalization for 6D Pose Estimation Through NeRF-based Image Synthesis - **Arxiv ID**: http://arxiv.org/abs/2407.10762v1 - **DOI**: None - **Categories**: **cs.CV**, eess.IV - **Links**: [PDF](http://arxiv.org/pdf/2407.10762v1) - **Published**: 2024-07-15 14:42:08+00:00 - **Updated**: 2024-07-15 14:42:08+00:00 - **Authors**: Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer - **Comment**: None - **Journal**: None - **Summary**: This work introduces a novel augmentation method that increases the diversity of a train set to improve the generalization abilities of a 6D pose estimation network. For this purpose, a Neural Radiance Field is trained from synthetic images and exploited to generate an augmented set. Our method enriches the initial set by enabling the synthesis of images with (i) unseen viewpoints, (ii) rich illumination conditions through appearance extrapolation, and (iii) randomized textures. We validate our augmentation method on the challenging use-case of spacecraft pose estimation and show that it significantly improves the pose estimation generalization capabilities. On the SPEED+ dataset, our method reduces the error on the pose by 50% on both target domains. 
### Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities - **Arxiv ID**: http://arxiv.org/abs/2407.10785v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10785v1) - **Published**: 2024-07-15 15:03:01+00:00 - **Updated**: 2024-07-15 15:03:01+00:00 - **Authors**: Nhat Le, Ciyue Shen, Chintan Shah, Blake Martin, Daniel Shenker, Harshith Padigela, Jennifer Hipp, Sean Grullon, John Abel, Harsha Vardhan Pokkalla, Dinkar Juyal - **Comment**: None - **Journal**: None - **Summary**: Mechanistic interpretability has been explored in detail for large language models (LLMs). For the first time, we provide a preliminary investigation with similar interpretability methods for medical imaging. Specifically, we analyze the features from a ViT-Small encoder obtained from a pathology Foundation Model via application to two datasets: one dataset of pathology images, and one dataset of pathology images paired with spatial transcriptomics. We discover an interpretable representation of cell and tissue morphology, along with gene expression within the model embedding space. Our work paves the way for further exploration around interpretable feature dimensions and their utility for medical and clinical applications. 
### Mammographic Breast Positioning Assessment via Deep Learning - **Arxiv ID**: http://arxiv.org/abs/2407.10796v1 - **DOI**: None - **Categories**: **eess.IV**, cs.AI, cs.CV, J.3 - **Links**: [PDF](http://arxiv.org/pdf/2407.10796v1) - **Published**: 2024-07-15 15:14:10+00:00 - **Updated**: 2024-07-15 15:14:10+00:00 - **Authors**: Toygar Tanyel, Nurper Denizoglu, Mustafa Ege Seker, Deniz Alis, Esma Cerekci, Ercan Karaarslan, Erkin Aribal, Ilkay Oksuz - **Comment**: None - **Journal**: None - **Summary**: Breast cancer remains a leading cause of cancer-related deaths among women worldwide, with mammography screening as the most effective method for early detection. Ensuring proper positioning in mammography is critical, as poor positioning can lead to diagnostic errors, increased patient stress, and higher costs due to recalls. Despite advancements in deep learning (DL) for breast cancer diagnostics, limited focus has been given to evaluating mammography positioning. This paper introduces a novel DL methodology to quantitatively assess mammogram positioning quality, specifically in mediolateral oblique (MLO) views, using attention and coordinate convolution modules. Our method identifies key anatomical landmarks, such as the nipple and pectoralis muscle, and automatically draws a posterior nipple line (PNL), offering a robust and inherently explainable alternative to well-known classification and regression-based approaches. We compare the performance of the proposed methodology with various regression and classification-based models. The CoordAtt UNet model achieved the highest accuracy of 88.63% $\pm$ 2.84 and specificity of 90.25% $\pm$ 4.04, along with a noteworthy sensitivity of 86.04% $\pm$ 3.41. In landmark detection, the same model also recorded the lowest mean errors in key anatomical points and the smallest angular error of 2.42 degrees. 
Our results indicate that models incorporating attention mechanisms and CoordConv module increase the accuracy in classifying breast positioning quality and detecting anatomical landmarks. Furthermore, we make the labels and source codes available to the community to initiate an open research area for mammography, accessible at https://github.com/tanyelai/deep-breast-positioning. ### Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation - **Arxiv ID**: http://arxiv.org/abs/2407.10802v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG, cs.RO - **Links**: [PDF](http://arxiv.org/pdf/2407.10802v1) - **Published**: 2024-07-15 15:18:28+00:00 - **Updated**: 2024-07-15 15:18:28+00:00 - **Authors**: Friedhelm Hamann, Ziyun Wang, Ioannis Asmanis, Kenneth Chaney, Guillermo Gallego, Kostas Daniilidis - **Comment**: 24 pages, 8 figures, 8 tables, Project Page: https://github.com/tub-rip/MotionPriorCMax - **Journal**: European Conference on Computer Vision (ECCV), Milan, 2024 - **Summary**: Current optical flow and point-tracking methods rely heavily on synthetic datasets. Event cameras are novel vision sensors with advantages in challenging visual conditions, but state-of-the-art frame-based methods cannot be easily adapted to event data due to the limitations of current event simulators. We introduce a novel self-supervised loss combining the Contrast Maximization framework with a non-linear motion prior in the form of pixel-level trajectories and propose an efficient solution to solve the high-dimensional assignment problem between non-linear trajectories and events. Their effectiveness is demonstrated in two scenarios: In dense continuous-time motion estimation, our method improves the zero-shot performance of a synthetically trained model on the real-world dataset EVIMO2 by 29%. In optical flow estimation, our method elevates a simple UNet to achieve state-of-the-art performance among self-supervised methods on the DSEC optical flow benchmark. 
Our code is available at https://github.com/tub-rip/MotionPriorCMax. ### DINO Pre-training for Vision-based End-to-end Autonomous Driving - **Arxiv ID**: http://arxiv.org/abs/2407.10803v1 - **DOI**: None - **Categories**: **cs.CV**, cs.LG, cs.RO - **Links**: [PDF](http://arxiv.org/pdf/2407.10803v1) - **Published**: 2024-07-15 15:18:57+00:00 - **Updated**: 2024-07-15 15:18:57+00:00 - **Authors**: Shubham Juneja, Povilas Daniušis, Virginijus Marcinkevičius - **Comment**: None - **Journal**: None - **Summary**: In this article, we focus on the pre-training of visual autonomous driving agents in the context of imitation learning. Current methods often rely on a classification-based pre-training, which we hypothesise to be holding back from extending capabilities of implicit image understanding. We propose pre-training the visual encoder of a driving agent using the self-distillation with no labels (DINO) method, which relies on a self-supervised learning paradigm. Our experiments in the CARLA environment in accordance with the Leaderboard benchmark reveal that the proposed pre-training is more efficient than classification-based pre-training, and is on par with the recently proposed pre-training based on visual place recognition (VPRPre). ### Enhancing Robustness to Noise Corruption for Point Cloud Model via Spatial Sorting and Set-Mixing Aggregation Module - **Arxiv ID**: http://arxiv.org/abs/2407.10806v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10806v1) - **Published**: 2024-07-15 15:21:34+00:00 - **Updated**: 2024-07-15 15:21:34+00:00 - **Authors**: Dingxin Zhang, Jianhui Yu, Tengfei Xue, Chaoyi Zhang, Dongnan Liu, Weidong Cai - **Comment**: 22 pages, 9 figures - **Journal**: None - **Summary**: Current models for point cloud recognition demonstrate promising performance on synthetic datasets. However, real-world point cloud data inevitably contains noise, impacting model robustness. 
While recent efforts focus on enhancing robustness through various strategies, there still remains a gap in comprehensive analyses from the standpoint of network architecture design. Unlike traditional methods that rely on generic techniques, our approach optimizes model robustness to noise corruption through network architecture design. Inspired by the token-mixing technique applied in 2D images, we propose Set-Mixer, a noise-robust aggregation module which facilitates communication among all points to extract geometric shape information and mitigate the influence of individual noise points. A sorting strategy is designed to enable our module to be invariant to point permutation, which also tackles the unordered structure of point cloud and introduces consistent relative spatial information. Experiments conducted on ModelNet40-C indicate that Set-Mixer significantly enhances the model performance on noisy point clouds, underscoring its potential to advance real-world applicability in 3D recognition and perception tasks. ### FabGPT: An Efficient Large Multimodal Model for Complex Wafer Defect Knowledge Queries - **Arxiv ID**: http://arxiv.org/abs/2407.10810v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.AR, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10810v1) - **Published**: 2024-07-15 15:25:45+00:00 - **Updated**: 2024-07-15 15:25:45+00:00 - **Authors**: Yuqi Jiang, Xudong Lu, Qian Jin, Qi Sun, Hanming Wu, Cheng Zhuo - **Comment**: None - **Journal**: None - **Summary**: Intelligence is key to advancing integrated circuit (IC) fabrication. Recent breakthroughs in Large Multimodal Models (LMMs) have unlocked unparalleled abilities in understanding images and text, fostering intelligent fabrication. Leveraging the power of LMMs, we introduce FabGPT, a customized IC fabrication large multimodal model for wafer defect knowledge query. 
FabGPT manifests expertise in conducting defect detection in Scanning Electron Microscope (SEM) images, performing root cause analysis, and providing expert question-answering (Q&A) on fabrication processes. FabGPT matches enhanced multimodal features to automatically detect minute defects under complex wafer backgrounds and reduce the subjectivity of manual threshold settings. Besides, the proposed modulation module and interactive corpus training strategy embed wafer defect knowledge into the pre-trained model, effectively balancing Q&A queries related to defect knowledge and original knowledge and mitigating the modality bias issues. Experiments on in-house fab data (SEM-WaD) show that our FabGPT achieves significant performance improvement in wafer defect detection and knowledge querying. ### Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification - **Arxiv ID**: http://arxiv.org/abs/2407.10814v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10814v1) - **Published**: 2024-07-15 15:31:55+00:00 - **Updated**: 2024-07-15 15:31:55+00:00 - **Authors**: Linhao Qu, Dingkang Yang, Dan Huang, Qinhao Guo, Rongkui Luo, Shaoting Zhang, Xiaosong Wang - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Current multi-instance learning algorithms for pathology image analysis often require a substantial number of Whole Slide Images for effective training but exhibit suboptimal performance in scenarios with limited learning data. In clinical settings, restricted access to pathology slides is inevitable due to patient privacy concerns and the prevalence of rare or emerging diseases. The emergence of the Few-shot Weakly Supervised WSI Classification accommodates the significant challenge of the limited slide data and sparse slide-level labels for diagnosis. 
Prompt learning based on the pre-trained models (e.g., CLIP) appears to be a promising scheme for this setting; however, current research in this area is limited, and existing algorithms often focus solely on patch-level prompts or confine themselves to language prompts. This paper proposes a multi-instance prompt learning framework enhanced with pathology knowledge, i.e., integrating visual and textual prior knowledge into prompts at both patch and slide levels. The training process employs a combination of static and learnable prompts, effectively guiding the activation of pre-trained models and further facilitating the diagnosis of key pathology patterns. Lightweight Messenger (self-attention) and Summary (attention-pooling) layers are introduced to model relationships between patches and slides within the same patient data. Additionally, alignment-wise contrastive losses ensure the feature-level alignment between visual and textual learnable prompts for both patches and slides. Our method demonstrates superior performance in three challenging clinical tasks, significantly outperforming comparative few-shot methods. ### Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks - **Arxiv ID**: http://arxiv.org/abs/2407.10825v2 - **DOI**: None - **Categories**: **cs.LG**, cs.CR, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10825v2) - **Published**: 2024-07-15 15:38:21+00:00 - **Updated**: 2024-07-16 04:21:12+00:00 - **Authors**: Quang H. Nguyen, Nguyen Ngoc-Hieu, The-Anh Ta, Thanh Nguyen-Tang, Kok-Seng Wong, Hoang Thanh-Tung, Khoa D. Doan - **Comment**: None - **Journal**: None - **Summary**: Deep neural networks are vulnerable to backdoor attacks, a type of adversarial attack that poisons the training data to manipulate the behavior of models trained on such data. Clean-label attacks are a more stealthy form of backdoor attacks that can perform the attack without changing the labels of poisoned data. 
Early works on clean-label attacks added triggers to a random subset of the training set, ignoring the fact that samples contribute unequally to the attack's success. This results in high poisoning rates and low attack success rates. To alleviate the problem, several supervised learning-based sample selection strategies have been proposed. However, these methods assume access to the entire labeled training set and require training, which is expensive and may not always be practical. This work studies a new and more practical (but also more challenging) threat model where the attacker only provides data for the target class (e.g., in face recognition systems) and has no knowledge of the victim model or any other classes in the training set. We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate in this setting. Our threat model poses a serious threat in training machine learning models with third-party datasets, since the attack can be performed effectively with limited information. Experiments on benchmark datasets illustrate the effectiveness of our strategies in improving clean-label backdoor attacks. ### Temporal Event Stereo via Joint Learning with Stereoscopic Flow - **Arxiv ID**: http://arxiv.org/abs/2407.10831v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10831v1) - **Published**: 2024-07-15 15:43:08+00:00 - **Updated**: 2024-07-15 15:43:08+00:00 - **Authors**: Hoonhee Cho, Jae-Young Kang, Kuk-Jin Yoon - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: Event cameras are dynamic vision sensors inspired by the biological retina, characterized by their high dynamic range, high temporal resolution, and low power consumption. These features make them capable of perceiving 3D environments even in extreme conditions. 
Event data is continuous across the time dimension, which allows a detailed description of each pixel's movements. To fully utilize the temporally dense and continuous nature of event cameras, we propose a novel temporal event stereo, a framework that continuously uses information from previous time steps. This is accomplished through the simultaneous training of an event stereo matching network alongside stereoscopic flow, a new concept that captures all pixel movements from stereo cameras. Since obtaining ground truth for optical flow during training is challenging, we propose a method that uses only disparity maps to train the stereoscopic flow. The performance of event-based stereo matching is enhanced by temporally aggregating information using the flows. We have achieved state-of-the-art performance on the MVSEC and the DSEC datasets. The method is computationally efficient, as it stacks previous information in a cascading manner. The code is available at https://github.com/mickeykang16/TemporalEventStereo. ### MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration - **Arxiv ID**: http://arxiv.org/abs/2407.10833v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10833v1) - **Published**: 2024-07-15 15:43:27+00:00 - **Updated**: 2024-07-15 15:43:27+00:00 - **Authors**: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. 
Specifically, our MoE-DiffIR develops the powerful mixture-of-experts (MoE) prompt module, where some basic prompts cooperate to excavate the task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, the degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design the visual-to-text adapter for MoE-DiffIR, which aims to adapt the embedding of low-quality images from the visual domain to the textual domain as the textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct one comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR have demonstrated the excellent robustness and texture restoration capability of our proposed MoE-DiffIR. The project can be found at https://renyulin-f.github.io/MoE-DiffIR.github.io/. ### Physics-Inspired Generative Models in Medical Imaging: A Review - **Arxiv ID**: http://arxiv.org/abs/2407.10856v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV, physics.med-ph - **Links**: [PDF](http://arxiv.org/pdf/2407.10856v1) - **Published**: 2024-07-15 16:08:22+00:00 - **Updated**: 2024-07-15 16:08:22+00:00 - **Authors**: Dennis Hein, Afshin Bozorgpour, Dorit Merhof, Ge Wang - **Comment**: None - **Journal**: None - **Summary**: Physics-inspired generative models, in particular diffusion and Poisson flow models, enhance Bayesian methods and promise great utilities in medical imaging. This review examines the transformative role of such generative methods. First, a variety of physics-inspired generative models, including Denoising Diffusion Probabilistic Models (DDPM), Score-based Diffusion Models, and Poisson Flow Generative Models (PFGM and PFGM++), are revisited, with an emphasis on their accuracy, robustness as well as acceleration. 
Then, major applications of physics-inspired generative models in medical imaging are presented, comprising image reconstruction, image generation, and image analysis. Finally, future research directions are brainstormed, including unification of physics-inspired generative models, integration with vision-language models (VLMs), and potential novel applications of generative models. Since the development of generative methods has been rapid, this review will hopefully give peers and learners a timely snapshot of this new family of physics-driven generative models and help capitalize on their enormous potential for medical imaging. ### Human-Centric Transformer for Domain Adaptive Action Recognition - **Arxiv ID**: http://arxiv.org/abs/2407.10860v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10860v1) - **Published**: 2024-07-15 16:10:11+00:00 - **Updated**: 2024-07-15 16:10:11+00:00 - **Authors**: Kun-Yu Lin, Jiaming Zhou, Wei-Shi Zheng - **Comment**: Accepted by TPAMI - **Journal**: None - **Summary**: We study the domain adaptation task for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition power from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, it is crucial to exploit human cues in videos when recognizing actions across domains. However, existing methods are prone to losing human cues but prefer to exploit the correlation between non-human contexts and associated actions for recognition, and the contexts of interest agnostic to actions would reduce recognition performance in the target domain. To overcome this problem, we focus on uncovering human-centric action cues for domain adaptive action recognition, and our conception is to investigate two aspects of human-centric action cues, namely human cues and human-context interaction cues. 
Accordingly, our proposed Human-Centric Transformer (HCTransformer) develops a decoupled human-centric learning paradigm to explicitly concentrate on human-centric action cues in domain-variant video feature learning. Our HCTransformer first conducts human-aware temporal modeling by a human encoder, aiming to avoid a loss of human cues during domain-invariant video feature learning. Then, by a Transformer-like architecture, HCTransformer exploits domain-invariant and action-correlated contexts by a context encoder, and further models domain-invariant interaction between humans and action-correlated contexts. We conduct extensive experiments on three benchmarks, namely UCF-HMDB, Kinetics-NecDrone and EPIC-Kitchens-UDA, and the state-of-the-art performance demonstrates the effectiveness of our proposed HCTransformer. ### R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection - **Arxiv ID**: http://arxiv.org/abs/2407.10862v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10862v1) - **Published**: 2024-07-15 16:10:58+00:00 - **Updated**: 2024-07-15 16:10:58+00:00 - **Authors**: Zheyuan Zhou, Le Wang, Naiyu Fang, Zili Wang, Lemiao Qiu, Shuyou Zhang - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: 3D anomaly detection plays a crucial role in monitoring parts for localized inherent defects in precision manufacturing. Embedding-based and reconstruction-based approaches are among the most popular and successful methods. However, there are two major challenges to the practical application of the current approaches: 1) the embedded models suffer from prohibitive computation and storage costs due to the memory bank structure; 2) the reconstructive models based on the MAE mechanism fail to detect anomalies in the unmasked regions. In this paper, we propose R3D-AD, reconstructing anomalous point clouds by diffusion model for precise 3D anomaly detection. 
Our approach capitalizes on the data distribution conversion of the diffusion process to entirely obscure the input's anomalous geometry. It step-wisely learns a strict point-level displacement behavior, which methodically corrects the aberrant points. To increase the generalization of the model, we further present a novel 3D anomaly simulation strategy named Patch-Gen to generate realistic and diverse defect shapes, which narrows the domain gap between training and testing. Our R3D-AD ensures a uniform spatial transformation, which allows straightforwardly generating anomaly results by distance comparison. Extensive experiments show that our R3D-AD outperforms previous state-of-the-art methods, achieving 73.4% Image-level AUROC on the Real3D-AD dataset and 74.9% Image-level AUROC on the Anomaly-ShapeNet dataset with an exceptional efficiency. ### GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM - **Arxiv ID**: http://arxiv.org/abs/2407.10870v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.HC, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10870v1) - **Published**: 2024-07-15 16:18:06+00:00 - **Updated**: 2024-07-15 16:18:06+00:00 - **Authors**: Keshav Bimbraw, Ye Wang, Jing Liu, Toshiaki Koike-Akino - **Comment**: 8 pages, 9 figures - **Journal**: None - **Summary**: Large vision-language models (LVLMs), such as the Generative Pre-trained Transformer 4-omni (GPT-4o), are emerging multi-modal foundation models which have great potential as powerful artificial-intelligence (AI) assistance tools for a myriad of applications, including healthcare, industrial, and academic sectors. Although such foundation models perform well in a wide range of general tasks, their capability without fine-tuning is often limited in specialized tasks. However, full fine-tuning of large foundation models is challenging due to enormous computation/memory/dataset requirements. 
We show that GPT-4o can decode hand gestures from forearm ultrasound data even with no fine-tuning, and improves with few-shot, in-context learning. ### Random Channel Ablation for Robust Hand Gesture Classification with Multimodal Biosignals - **Arxiv ID**: http://arxiv.org/abs/2407.10874v1 - **DOI**: None - **Categories**: **cs.HC**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10874v1) - **Published**: 2024-07-15 16:23:53+00:00 - **Updated**: 2024-07-15 16:23:53+00:00 - **Authors**: Keshav Bimbraw, Jing Liu, Ye Wang, Toshiaki Koike-Akino - **Comment**: 5 pages, 4 figures - **Journal**: None - **Summary**: Biosignal-based hand gesture classification is an important component of effective human-machine interaction. For multimodal biosignal sensing, the modalities often face data loss due to missing channels in the data which can adversely affect the gesture classification performance. To make the classifiers robust to missing channels in the data, this paper proposes using Random Channel Ablation (RChA) during the training process. Ultrasound and force myography (FMG) data were acquired from the forearm for 12 hand gestures over 2 subjects. The resulting multimodal data had 16 total channels, 8 for each modality. The proposed method was applied to convolutional neural network architecture, and compared with baseline, imputation, and oracle methods. Using 5-fold cross-validation for the two subjects, on average, 12.2% and 24.5% improvement was observed for gesture classification with up to 4 and 8 missing channels respectively compared to the baseline. Notably, the proposed method is also robust to an increase in the number of missing channels compared to other methods. These results show the efficacy of using random channel ablation to improve classifier robustness for multimodal and multi-channel biosignal-based hand gesture classification. 
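The idea of ablating channels at random during training, as described in the Random Channel Ablation summary above, can be illustrated with a minimal sketch. The abstract does not say how an ablated channel is represented, so zero-filling is an assumption here, and the function name `random_channel_ablation`, the list-of-lists sample layout, and the `max_missing` parameter are all hypothetical.

```python
import random


def random_channel_ablation(sample, max_missing, rng=None):
    """Zero out a random subset of channels in one training sample.

    sample: list of channels, each a list of floats (hypothetical layout).
    max_missing: largest number of channels that may be ablated.
    Returns (ablated_copy, sorted indices of the ablated channels).
    Zero-filling is an assumption; the abstract does not specify it.
    """
    rng = rng or random.Random()
    n_channels = len(sample)
    # Draw how many channels to drop, then which ones.
    n_drop = rng.randint(0, min(max_missing, n_channels))
    dropped = set(rng.sample(range(n_channels), n_drop))
    ablated = [
        [0.0] * len(channel) if idx in dropped else list(channel)
        for idx, channel in enumerate(sample)
    ]
    return ablated, sorted(dropped)


# 16 channels (8 ultrasound + 8 FMG in the paper's setup), 4 values each.
sample = [[float(c + 1)] * 4 for c in range(16)]
ablated, dropped = random_channel_ablation(sample, max_missing=8, rng=random.Random(0))
print(len(dropped), "channels ablated:", dropped)
```

In a training loop, a transform like this would presumably be applied to each sample (or batch) before the forward pass, so the classifier repeatedly sees inputs with different channels missing.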
### RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception - **Arxiv ID**: http://arxiv.org/abs/2407.10876v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10876v1) - **Published**: 2024-07-15 16:25:07+00:00 - **Updated**: 2024-07-15 16:25:07+00:00 - **Authors**: Chunliang Li, Wencheng Han, Junbo Yin, Sanyuan Zhao, Jianbing Shen - **Comment**: Accepted by ECCV 2024 - **Journal**: None - **Summary**: Concurrent processing of multiple autonomous driving 3D perception tasks within the same spatiotemporal scene poses a significant challenge, in particular due to the computational inefficiencies and feature competition between tasks when using traditional multi-task learning approaches. This paper addresses these issues by proposing a novel unified representation, RepVF, which harmonizes the representation of various perception tasks such as 3D object detection and 3D lane detection within a single framework. RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model that significantly reduces computational redundancy and feature competition. Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks by utilizing a hierarchical structure of queries that implicitly model the relationships both between and within tasks. This approach eliminates the need for task-specific heads and parameters, fundamentally reducing the conflicts inherent in traditional multi-task learning paradigms. We validate our approach by combining labels from the OpenLane dataset with the Waymo Open dataset. Our work presents a significant advancement in the efficiency and effectiveness of multi-task perception in autonomous driving, offering a new perspective on handling multiple 3D perception tasks synchronously and in parallel. 
The code will be available at: https://github.com/jbji/RepVF ### Leveraging Multimodal CycleGAN for the Generation of Anatomically Accurate Synthetic CT Scans from MRIs - **Arxiv ID**: http://arxiv.org/abs/2407.10888v1 - **DOI**: None - **Categories**: **eess.IV**, cs.AI, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10888v1) - **Published**: 2024-07-15 16:38:59+00:00 - **Updated**: 2024-07-15 16:38:59+00:00 - **Authors**: Leonardo Crespi, Samuele Camnasio, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono - **Comment**: Currently submitted to: Scientific Reports - **Journal**: None - **Summary**: In many clinical settings, the use of both Computed Tomography (CT) and Magnetic Resonance (MRI) is necessary to pursue a thorough understanding of the patient's anatomy and to plan a suitable therapeutical strategy; this is often the case in MRI-based radiotherapy, where CT is always necessary to prepare the dose delivery, as it provides the essential information about the radiation absorption properties of the tissues. Sometimes, MRI is preferred to contour the target volumes. However, this approach is often not the most efficient, as it is more expensive, time-consuming and, most importantly, stressful for the patients. To overcome this issue, in this work, we analyse the capabilities of different configurations of Deep Learning models to generate synthetic CT scans from MRI, leveraging the power of Generative Adversarial Networks (GANs) and, in particular, the CycleGAN architecture, capable of working in an unsupervised manner and without paired images, which were not available. Several CycleGAN models were trained unsupervised to generate CT scans from different MRI modalities with and without contrast agents. 
To overcome the problem of not having a ground truth, distribution-based metrics were used to assess the model's performance quantitatively, together with a qualitative evaluation where physicians were asked to differentiate between real and synthetic images to understand how realistic the generated images were. The results show how, depending on the input modalities, the models can have very different performances; however, models with the best quantitative results, according to the distribution-based metrics used, can generate very difficult images to distinguish from the real ones, even for physicians, demonstrating the approach's potential. ### Optical Diffusion Models for Image Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10897v1 - **DOI**: None - **Categories**: **physics.optics**, cs.CV, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10897v1) - **Published**: 2024-07-15 16:46:14+00:00 - **Updated**: 2024-07-15 16:46:14+00:00 - **Authors**: Ilker Oguz, Niyazi Ulas Dinc, Mustafa Yildirim, Junjie Ke, Innfarn Yoo, Qifei Wang, Feng Yang, Christophe Moser, Demetri Psaltis - **Comment**: 14 pages, 6 figures - **Journal**: None - **Summary**: Diffusion models generate new samples by progressively decreasing the noise from the initially provided random distribution. This inference procedure generally utilizes a trained neural network numerous times to obtain the final output, creating significant latency and energy consumption on digital electronic hardware such as GPUs. In this study, we demonstrate that the propagation of a light beam through a semi-transparent medium can be programmed to implement a denoising diffusion model on image samples. This framework projects noisy image patterns through passive diffractive optical layers, which collectively only transmit the predicted noise term in the image. 
The optical transparent layers, which are trained with an online training approach, backpropagating the error to the analytical model of the system, are passive and kept the same across different steps of denoising. Hence this method enables high-speed image generation with minimal power consumption, benefiting from the bandwidth and energy efficiency of optical information processing. ### Interpreting Hand gestures using Object Detection and Digits Classification - **Arxiv ID**: http://arxiv.org/abs/2407.10902v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10902v1) - **Published**: 2024-07-15 16:53:04+00:00 - **Updated**: 2024-07-15 16:53:04+00:00 - **Authors**: Sangeetha K, Balaji VS, Kamalesh P, Anirudh Ganapathy PS - **Comment**: None - **Journal**: None - **Summary**: Hand gestures have evolved into a natural and intuitive means of engaging with technology. The objective of this research is to develop a robust system that can accurately recognize and classify hand gestures representing numbers. The proposed approach involves collecting a dataset of hand gesture images, preprocessing and enhancing the images, extracting relevant features, and training a machine learning model. The advancement of computer vision technology and object detection techniques, in conjunction with OpenCV's capability to analyze and comprehend hand gestures, presents a chance to transform the identification of numerical digits and its potential applications. The advancement of computer vision technology and object identification technologies, along with OpenCV's capacity to analyze and interpret hand gestures, has the potential to revolutionize human interaction, boosting people's access to information, education, and employment opportunities. 
Keywords: Computer Vision, Machine learning, Deep Learning, Neural Networks ### DataDream: Few-shot Guided Dataset Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10910v2 - **DOI**: None - **Categories**: **cs.CV**, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10910v2) - **Published**: 2024-07-15 17:10:31+00:00 - **Updated**: 2024-07-16 13:15:01+00:00 - **Authors**: Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, Zeynep Akata - **Comment**: Accepted to ECCV 2024 - **Journal**: None - **Summary**: While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance. The code is available at https://github.com/ExplainableML/DataDream. 
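The DataDream summary above relies on LoRA fine-tuning for both the image generator and CLIP. As a rough illustration of the general LoRA idea only (not the authors' code), the sketch below keeps a pretrained weight `W` frozen and adds a low-rank update `(alpha / r) * B @ A`. The toy sizes, the `alpha` value, and all helper names are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A; W stays frozen, only A and B would be trained."""
    rank = len(lora_a)  # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

rng = random.Random(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4.0  # hypothetical toy sizes

# Frozen "pretrained" weight, small random A, zero-initialized B.
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]

# With B = 0 the adapted layer starts out identical to the pretrained one.
w_start = lora_effective_weight(w, lora_a, lora_b, alpha)

# Stand-in for a training step: change one entry of B by hand.
lora_b[0][0] = 0.5
w_tuned = lora_effective_weight(w, lora_a, lora_b, alpha)

trainable = rank * (d_in + d_out)  # parameters in A and B
full = d_in * d_out                # parameters in W
```

The point of the low-rank form is the parameter count: here `A` and `B` hold fewer values than `W`, and the gap widens quickly for realistic layer sizes, which is what makes adapting a large model on a handful of real images plausible.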
### PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition - **Arxiv ID**: http://arxiv.org/abs/2407.10918v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10918v1) - **Published**: 2024-07-15 17:19:50+00:00 - **Updated**: 2024-07-15 17:19:50+00:00 - **Authors**: Xiao Li, Yining Liu, Na Dong, Sitian Qin, Xiaolin Hu - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: Deep learning-based object recognition systems can be easily fooled by various adversarial perturbations. One reason for the weak robustness may be that they do not have part-based inductive bias like the human recognition process. Motivated by this, several part-based recognition models have been proposed to improve the adversarial robustness of recognition. However, due to the lack of part annotations, the effectiveness of these methods is only validated on small-scale nonstandard datasets. In this work, we propose PIN++, short for PartImageNet++, a dataset providing high-quality part segmentation annotations for all categories of ImageNet-1K (IN-1K). With these annotations, we build part-based methods directly on the standard IN-1K dataset for robust recognition. Different from previous two-stage part-based models, we propose a Multi-scale Part-supervised Model (MPM), to learn a robust representation with part annotations. Experiments show that MPM yielded better adversarial robustness on the large-scale IN-1K over strong baselines across various attack settings. Furthermore, MPM achieved improved robustness on common corruptions and several out-of-distribution datasets. The dataset, together with these results, enables and encourages researchers to explore the potential of part-based models in more real applications. 
### Benchmarking Vision Language Models for Cultural Understanding - **Arxiv ID**: http://arxiv.org/abs/2407.10920v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.CL - **Links**: [PDF](http://arxiv.org/pdf/2407.10920v1) - **Published**: 2024-07-15 17:21:41+00:00 - **Updated**: 2024-07-15 17:21:41+00:00 - **Authors**: Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal - **Comment**: None - **Journal**: None - **Summary**: Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has been typically assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLM's geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures. 
### A Dual-Attention Aware Deep Convolutional Neural Network for Early Alzheimer's Detection - **Arxiv ID**: http://arxiv.org/abs/2407.10921v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV, cs.LG, F.2.2, I.2.7 - **Links**: [PDF](http://arxiv.org/pdf/2407.10921v1) - **Published**: 2024-07-15 17:22:16+00:00 - **Updated**: 2024-07-15 17:22:16+00:00 - **Authors**: Pandiyaraju V, Shravan Venkatraman, Abeshek A, Aravintakshan S A, Pavan Kumar S, Kannan A - **Comment**: 18 pages, 10 figures, 6 tables - **Journal**: None - **Summary**: Alzheimer's disease (AD) represents the primary form of neurodegeneration, impacting millions of individuals each year and causing progressive cognitive decline. Accurately diagnosing and classifying AD using neuroimaging data presents ongoing challenges in medicine, necessitating advanced interventions that will enhance treatment measures. In this research, we introduce a dual attention enhanced deep learning (DL) framework for classifying AD from neuroimaging data. Combined spatial and self-attention mechanisms play a vital role in emphasizing focus on neurofibrillary tangles and amyloid plaques from the MRI images, which are difficult to discern with regular imaging techniques. Results demonstrate that our model yielded remarkable performance in comparison to existing state of the art (SOTA) convolutional neural networks (CNNs), with an accuracy of 99.1%. Moreover, it recorded remarkable metrics, with an F1-Score of 99.31%, a precision of 99.24%, and a recall of 99.5%. These results highlight the promise of cutting edge DL methods in medical diagnostics, contributing to highly reliable and more efficient healthcare solutions. 
### OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting - **Arxiv ID**: http://arxiv.org/abs/2407.10923v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10923v1) - **Published**: 2024-07-15 17:23:00+00:00 - **Updated**: 2024-07-15 17:23:00+00:00 - **Authors**: Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang - **Comment**: None - **Journal**: None - **Summary**: In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. 
Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings. ### In-Loop Filtering via Trained Look-Up Tables - **Arxiv ID**: http://arxiv.org/abs/2407.10926v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10926v1) - **Published**: 2024-07-15 17:25:42+00:00 - **Updated**: 2024-07-15 17:25:42+00:00 - **Authors**: Zhuoyuan Li, Jiacheng Li, Yao Li, Li Li, Dong Liu, Feng Wu - **Comment**: 11 pages, 6 figures - **Journal**: None - **Summary**: In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods have achieved remarkable coding gains beyond the capability of advanced video coding standards, making them a powerful coding tool candidate for future video coding standards. However, deep neural networks bring heavy time and computational complexity and demand high-performance hardware, which makes them challenging to apply in general coding scenarios. To address this limitation, inspired by explorations in image restoration, we propose an efficient and practical in-loop filtering scheme by adopting the Look-up Table (LUT). We train the DNN of in-loop filtering within a fixed filtering reference range, and cache the output values of the DNN into a LUT via traversing all possible inputs. At testing time in the coding process, the filtered pixel is generated by locating input pixels (to-be-filtered pixel with reference pixels) and interpolating cached filtered pixel values. To further enable a large filtering reference range within the limited storage cost of the LUT, we introduce an enhanced indexing mechanism in the filtering process and a clipping/finetuning mechanism in training. The proposed method is implemented in the Versatile Video Coding (VVC) reference software, VTM-11.0. 
Experimental results show that the ultrafast, very fast, and fast modes of the proposed method achieve on average 0.13%/0.34%/0.51% and 0.10%/0.27%/0.39% BD-rate reduction under the all intra (AI) and random access (RA) configurations, respectively. Notably, our method has low time and computational complexity, with only 101%/102%-104%/108% time increase, 0.13-0.93 kMACs/pixel, and only 164-1148 KB of storage for a single model. Our solution may shed light on the journey of practical neural network-based coding tool evolution. ### STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences - **Arxiv ID**: http://arxiv.org/abs/2407.10935v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10935v1) - **Published**: 2024-07-15 17:35:38+00:00 - **Updated**: 2024-07-15 17:35:38+00:00 - **Authors**: Soroush Mehraban, Mohammad Javad Rajabi, Babak Taati - **Comment**: None - **Journal**: None - **Summary**: Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first uses a masked prediction stage using an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results in various benchmarks, including NTU-60, NTU-120, and PKU-MMD. 
In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions throughout pretraining. Project page: https://soroushmehraban.github.io/stars/ ### IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10937v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10937v1) - **Published**: 2024-07-15 17:36:54+00:00 - **Updated**: 2024-07-15 17:36:54+00:00 - **Authors**: Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang - **Comment**: ECCV 2024; project page: https://yhzhai.github.io/idol/ - **Journal**: None - **Summary**: Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields, leading to harmonized outputs. 
Additionally, a cross-attention map consistency loss is applied to align the cross-attention map of the video denoising with that of the depth denoising, further facilitating spatial alignment. Extensive experiments on the TikTok and NTU120 datasets show our superior performance, significantly surpassing existing methods in terms of video FVD and depth accuracy. ### GRUtopia: Dream General Robots in a City at Scale - **Arxiv ID**: http://arxiv.org/abs/2407.10943v1 - **DOI**: None - **Categories**: **cs.RO**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.10943v1) - **Published**: 2024-07-15 17:40:46+00:00 - **Updated**: 2024-07-15 17:40:46+00:00 - **Authors**: Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, Peizhou Cao, Wenye Yu, Zichao Ye, Jialun Li, Junfeng Long, Zirui Wang, Huiling Wang, Ying Zhao, Zhongying Tu, Yu Qiao, Dahua Lin, Jiangmiao Pang - **Comment**: None - **Journal**: None - **Summary**: Recent works have been exploring the scaling laws in the field of Embodied AI. Given the prohibitive costs of collecting real-world data, we believe the Simulation-to-Real (Sim2Real) paradigm is a crucial step for scaling the learning of embodied models. This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots. It features several advancements: (a) The scene dataset, GRScenes, includes 100k interactive, finely annotated scenes, which can be freely combined into city-scale environments. In contrast to previous works mainly focusing on home, GRScenes covers 89 diverse scene categories, bridging the gap of service-oriented environments where general robots would be initially deployed. (b) GRResidents, a Large Language Model (LLM) driven Non-Player Character (NPC) system that is responsible for social interaction, task generation, and task assignment, thus simulating social scenarios for embodied AI applications. 
(c) The benchmark, GRBench, supports various robots but focuses on legged robots as primary agents and poses moderately challenging tasks involving Object Loco-Navigation, Social Loco-Navigation, and Loco-Manipulation. We hope that this work can alleviate the scarcity of high-quality data in this field and provide a more comprehensive assessment of Embodied AI research. The project is available at https://github.com/OpenRobotLab/GRUtopia. ### Can Textual Semantics Mitigate Sounding Object Segmentation Preference? - **Arxiv ID**: http://arxiv.org/abs/2407.10947v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10947v1) - **Published**: 2024-07-15 17:45:20+00:00 - **Updated**: 2024-07-15 17:45:20+00:00 - **Authors**: Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. 
These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference ### Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes - **Arxiv ID**: http://arxiv.org/abs/2407.10957v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.10957v1) - **Published**: 2024-07-15 17:54:45+00:00 - **Updated**: 2024-07-15 17:54:45+00:00 - **Authors**: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. 
The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. The dataset is available at https://gewu-lab.github.io/Ref-AVS. ### InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models - **Arxiv ID**: http://arxiv.org/abs/2407.10958v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.10958v1) - **Published**: 2024-07-15 17:55:09+00:00 - **Updated**: 2024-07-15 17:55:09+00:00 - **Authors**: Nirat Saini, Navaneeth Bodla, Ashish Shrivastava, Avinash Ravichandran, Xiao Zhang, Abhinav Shrivastava, Bharat Singh - **Comment**: None - **Journal**: None - **Summary**: We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. InVi targets controlled manipulation of objects and blending them seamlessly into a background video, unlike existing video editing methods that focus on comprehensive re-styling or entire scene alterations. To achieve this goal, we tackle two key challenges. Firstly, for high quality control and blending, we employ a two-step process involving inpainting and matching. This process begins with inserting the object into a single frame using a ControlNet-based inpainting diffusion model, and then generating subsequent frames conditioned on features from an inpainted frame as an anchor to minimize the domain gap between the background and the object. Secondly, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers. The anchor frame features serve as the keys and values for these layers, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, presenting an efficient and adaptable solution. 
Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods. ### No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations - **Arxiv ID**: http://arxiv.org/abs/2407.10964v1 - **DOI**: None - **Categories**: **cs.CV**, cs.CL, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10964v1) - **Published**: 2024-07-15 17:58:42+00:00 - **Updated**: 2024-07-15 17:58:42+00:00 - **Authors**: Walter Simoncini, Spyros Gidaris, Andrei Bursuc, Yuki M. Asano - **Comment**: Preprint. Code available at https://github.com/WalterSimoncini/fungivision - **Journal**: None - **Summary**: This paper introduces FUNGI, Features from UNsupervised GradIents, a method to enhance the features of vision encoders by leveraging self-supervised gradients. Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input. These are projected to a lower dimension and then concatenated with the model's embedding. The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio. Across backbones spanning various sizes and pretraining strategies, FUNGI features provide consistent performance improvements over the embeddings. We also show that using FUNGI features can benefit linear classification and image retrieval, and that they significantly improve the retrieval-based in-context scene understanding abilities of pretrained models, for example improving upon DINO by +17% for semantic segmentation - without any training. 
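The FUNGI recipe summarized above (per-input gradient, projection to a lower dimension, concatenation with the embedding) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the "encoder" is a single linear map `W`, the self-supervised objective is a stand-in consistency loss between an input and a perturbed view, the projection `P` is a fixed random matrix, and both parts are L2-normalized before concatenation.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def toy_loss(w, x, x_aug):
    """Stand-in objective: 0.5 * ||W x - W x_aug||^2 (not one of the paper's losses)."""
    diff = [a - b for a, b in zip(x, x_aug)]
    return 0.5 * sum(d * d for d in matvec(w, diff))

def per_sample_gradient(w, x, x_aug):
    """Gradient of toy_loss w.r.t. W, flattened row-major: outer(W (x - x_aug), x - x_aug)."""
    diff = [a - b for a, b in zip(x, x_aug)]
    d = matvec(w, diff)
    return [di * dj for di in d for dj in diff]

def gradient_augmented_features(w, x, x_aug, p):
    """Concatenate the normalized embedding with the normalized, projected gradient."""
    embedding = l2_normalize(matvec(w, x))
    projected = l2_normalize(matvec(p, per_sample_gradient(w, x, x_aug)))
    return embedding + projected

rng = random.Random(0)
d_in, d_emb, d_proj = 4, 3, 2  # hypothetical toy sizes

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_emb)]
P = [[rng.gauss(0, 1) for _ in range(d_emb * d_in)] for _ in range(d_proj)]
x = [rng.gauss(0, 1) for _ in range(d_in)]
x_aug = [xi + rng.gauss(0, 0.1) for xi in x]  # perturbed view of the same input

features = gradient_augmented_features(W, x, x_aug, P)
```

The resulting vector has `d_emb + d_proj` entries and could be passed to any nearest-neighbor classifier in place of the plain embedding; no weights are updated at any point, which matches the "no training" framing of the summary.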
### VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation - **Arxiv ID**: http://arxiv.org/abs/2407.10972v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.10972v1) - **Published**: 2024-07-15 17:59:55+00:00 - **Updated**: 2024-07-15 17:59:55+00:00 - **Authors**: Bocheng Zou, Mu Cai, Jianrui Zhang, Yong Jae Lee - **Comment**: Project Page: https://vgbench.github.io - **Journal**: None - **Summary**: In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons or sketches. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io. 
### Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques - **Arxiv ID**: http://arxiv.org/abs/2407.11121v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.11121v1) - **Published**: 2024-07-15 18:00:01+00:00 - **Updated**: 2024-07-15 18:00:01+00:00 - **Authors**: Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish - **Comment**: None - **Journal**: None - **Summary**: Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they are becoming increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments. ### UFQA: Utility guided Fingerphoto Quality Assessment - **Arxiv ID**: http://arxiv.org/abs/2407.11141v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11141v1) - **Published**: 2024-07-15 18:06:07+00:00 - **Updated**: 2024-07-15 18:06:07+00:00 - **Authors**: Amol S. Joshi, Ali Dabouei, Jeremy Dawson, Nasser Nasrabadi - **Comment**: None - **Journal**: None - **Summary**: Quality assessment of fingerprints captured using digital cameras and smartphones, also called fingerphotos, is a challenging problem in biometric recognition systems. 
As contactless biometric modalities are gaining more attention, their reliability should also be improved. Many factors, such as illumination, image contrast, camera angle, etc., in fingerphoto acquisition introduce various types of distortion that may render the samples useless. Current quality estimation methods developed for fingerprints collected using contact-based sensors are inadequate for fingerphotos. We propose Utility guided Fingerphoto Quality Assessment (UFQA), a self-supervised dual encoder framework to learn meaningful feature representations to assess fingerphoto quality. A quality prediction model is trained to assess fingerphoto quality with additional supervision of quality maps. The quality metric is a predictor of the utility of fingerphotos in matching scenarios. Therefore, we use a holistic approach by including fingerphoto utility and local quality when labeling the training data. Experimental results verify that our approach performs better than the widely used fingerprint quality metric NFIQ2.2 and state-of-the-art image quality assessment algorithms on multiple publicly available fingerphoto datasets. ### Integrating Amortized Inference with Diffusion Models for Learning Clean Distribution from Corrupted Images - **Arxiv ID**: http://arxiv.org/abs/2407.11162v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11162v1) - **Published**: 2024-07-15 18:33:20+00:00 - **Updated**: 2024-07-15 18:33:20+00:00 - **Authors**: Yifei Wang, Weimin Bai, Weijian Luo, Wenzheng Chen, He Sun - **Comment**: None - **Journal**: None - **Summary**: Diffusion models (DMs) have emerged as powerful generative models for solving inverse problems, offering a good approximation of prior distributions of real-world image data. Typically, diffusion models rely on large-scale clean signals to accurately learn the score functions of ground truth clean image distributions. 
However, such a requirement for large amounts of clean data is often impractical in real-world applications, especially in fields where data samples are expensive to obtain. To address this limitation, in this work, we introduce FlowDiff, a novel joint training paradigm that leverages a conditional normalizing flow model to facilitate the training of diffusion models on corrupted data sources. The conditional normalizing flow learns to recover clean images through a novel amortized inference mechanism, and can thus effectively facilitate the diffusion model's training with corrupted data. Conversely, diffusion models provide strong priors, which in turn improve the quality of image recovery. The flow model and the diffusion model can therefore promote each other and demonstrate strong empirical performance. Our extensive experiments show that FlowDiff can effectively learn clean distributions across a wide range of corrupted data sources, such as noisy and blurry images. It consistently outperforms existing baselines by significant margins under identical conditions. We also study the learned diffusion prior, observing its superior performance in downstream computational imaging tasks, including inpainting, denoising, and deblurring. ### Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing - **Arxiv ID**: http://arxiv.org/abs/2407.11168v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11168v1) - **Published**: 2024-07-15 18:44:17+00:00 - **Updated**: 2024-07-15 18:44:17+00:00 - **Authors**: Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Ioannis Patras - **Comment**: Accepted at ECCV 2024 - **Journal**: None - **Summary**: Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. 
In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods' dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state-of-the-art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes. ### iHuman: Instant Animatable Digital Humans From Monocular Videos - **Arxiv ID**: http://arxiv.org/abs/2407.11174v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI - **Links**: [PDF](http://arxiv.org/pdf/2407.11174v1) - **Published**: 2024-07-15 18:51:51+00:00 - **Updated**: 2024-07-15 18:51:51+00:00 - **Authors**: Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli, Danda Pani Paudel, Jyoti Tandukar - **Comment**: 15 pages, eccv, 2024 - **Journal**: None - **Summary**: Personalized 3D avatars require an animatable representation of digital humans. 
Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work achieves and illustrates the need for accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrate the state-of-the-art results of our method, in limited time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under the change of poses. ### Expert-aware uncertainty estimation for quality control of neural-based blood typing - **Arxiv ID**: http://arxiv.org/abs/2407.11181v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11181v1) - **Published**: 2024-07-15 19:07:02+00:00 - **Updated**: 2024-07-15 19:07:02+00:00 - **Authors**: Ekaterina Zaychenkova, Dmitrii Iarchuk, Sergey Korchagin, Alexey Zaitsev, Egor Ershov - **Comment**: None - **Journal**: None - **Summary**: In medical diagnostics, accurate uncertainty estimation for neural-based models is essential for complementing second-opinion systems. 
Despite neural network ensembles' proficiency in this problem, a gap persists between actual uncertainties and predicted estimates. A major difficulty here is the lack of labels on the hardness of examples: a typical dataset includes only ground truth target labels, making the uncertainty estimation problem almost unsupervised. Our novel approach narrows this gap by integrating expert assessments of case complexity into the neural network's learning process, utilizing both definitive target labels and supplementary complexity ratings. We validate our methodology for blood typing, leveraging a new dataset "BloodyWell" unique in augmenting labeled reaction images with complexity scores from six medical specialists. Experiments demonstrate enhancement of our approach in uncertainty prediction, achieving a 2.5-fold improvement with expert labels and a 35% increase in performance with estimates of neural-based expert consensus. ### Efficient In-Context Medical Segmentation with Meta-driven Visual Prompt Selection - **Arxiv ID**: http://arxiv.org/abs/2407.11188v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11188v1) - **Published**: 2024-07-15 19:22:32+00:00 - **Updated**: 2024-07-15 19:22:32+00:00 - **Authors**: Chenwei Wu, David Restrepo, Zitao Shuai, Zhongming Liu, Liyue Shen - **Comment**: None - **Journal**: None - **Summary**: In-context learning (ICL) with Large Vision Models (LVMs) presents a promising avenue in medical image segmentation by reducing the reliance on extensive labeling. However, the ICL performance of LVMs highly depends on the choices of visual prompts and suffers from domain shifts. While existing works leveraging LVMs for medical tasks have focused mainly on model-centric approaches like fine-tuning, we study an orthogonal data-centric perspective on how to select good visual prompts to facilitate generalization to medical domain. 
In this work, we propose a label-efficient in-context medical segmentation method by introducing a novel Meta-driven Visual Prompt Selection mechanism (MVPS), where a prompt retriever obtained from a meta-learning framework actively selects the optimal images as prompts to promote model performance and generalizability. Evaluated on 8 datasets and 4 tasks across 3 medical imaging modalities, our proposed approach demonstrates consistent gains over existing methods under different scenarios, improving both computational and label efficiency. Finally, we show that MVPS is a flexible, finetuning-free module that could be easily plugged into different backbones and combined with other model-centric approaches. ### EyeDentify: A Dataset for Pupil Diameter Estimation based on Webcam Images - **Arxiv ID**: http://arxiv.org/abs/2407.11204v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.CY, cs.HC, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.11204v1) - **Published**: 2024-07-15 19:39:28+00:00 - **Updated**: 2024-07-15 19:39:28+00:00 - **Authors**: Vijul Shah, Ko Watanabe, Brian B. Moser, Andreas Dengel - **Comment**: None - **Journal**: None - **Summary**: In this work, we introduce EyeDentify, a dataset specifically designed for pupil diameter estimation based on webcam images. EyeDentify addresses the lack of available datasets for pupil diameter estimation, a crucial domain for understanding physiological and psychological states traditionally dominated by highly specialized sensor systems such as Tobii. Unlike these advanced sensor systems and associated costs, webcam images are more commonly found in practice. Yet, deep learning models that can estimate pupil diameters using standard webcam data are scarce. 
By providing a dataset of cropped eye images alongside corresponding pupil diameter information, EyeDentify enables the development and refinement of models designed specifically for less-equipped environments, democratizing pupil diameter estimation by making it more accessible and broadly applicable, which in turn contributes to multiple domains of understanding human activity and supporting healthcare. Our dataset is available at https://vijulshah.github.io/eyedentify/. ### Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion - **Arxiv ID**: http://arxiv.org/abs/2407.11211v1 - **DOI**: None - **Categories**: **cs.CV**, cs.AI, cs.CL - **Links**: [PDF](http://arxiv.org/pdf/2407.11211v1) - **Published**: 2024-07-15 19:53:02+00:00 - **Updated**: 2024-07-15 19:53:02+00:00 - **Authors**: Philipp Allgeuer, Kyra Ahrens, Stefan Wermter - **Comment**: None - **Journal**: None - **Summary**: We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. 
The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues. ### OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models - **Arxiv ID**: http://arxiv.org/abs/2407.11213v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11213v1) - **Published**: 2024-07-15 19:56:42+00:00 - **Updated**: 2024-07-15 19:56:42+00:00 - **Authors**: Zijian Zhou, Zheng Zhu, Holger Caesar, Miaojing Shi - **Comment**: ECCV 2024 - **Journal**: None - **Summary**: Panoptic Scene Graph Generation (PSG) aims to segment objects and recognize their relations, enabling the structured understanding of an image. Previous methods focus on predicting predefined object and relation categories, hence limiting their applications in the open world scenarios. With the rapid development of large multimodal models (LMMs), significant progress has been made in open-set object detection and segmentation, yet open-set relation prediction in PSG remains unexplored. In this paper, we focus on the task of open-set relation prediction integrated with a pretrained open-set panoptic segmentation model to achieve true open-set panoptic scene graph generation (OpenPSG). Our OpenPSG leverages LMMs to achieve open-set relation prediction in an autoregressive manner. We introduce a relation query transformer to efficiently extract visual features of object pairs and estimate the existence of relations between them. The latter can enhance the prediction efficiency by filtering irrelevant pairs. Finally, we design the generation and judgement instructions to perform open-set relation prediction in PSG autoregressively. To our knowledge, we are the first to propose the open-set PSG task. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-set relation prediction and panoptic scene graph generation. Code is available at https://github.com/franciszzj/OpenPSG. ### Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras - **Arxiv ID**: http://arxiv.org/abs/2407.11216v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11216v1) - **Published**: 2024-07-15 20:00:50+00:00 - **Updated**: 2024-07-15 20:00:50+00:00 - **Authors**: Hoonhee Cho, Sung-Hoon Yoon, Hyeokjun Kweon, Kuk-Jin Yoon - **Comment**: Accepted by ECCV2024 - **Journal**: None - **Summary**: Event cameras excel in capturing high-contrast scenes and dynamic objects, offering a significant advantage over traditional frame-based cameras. Despite active research into leveraging event cameras for semantic segmentation, generating pixel-wise dense semantic maps for such challenging scenarios remains labor-intensive. As a remedy, we present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation that utilizes sparse point annotations. To fully leverage the temporal characteristics of event data, the proposed framework performs asymmetric dual-student learning between 1) the original forward event data and 2) the longer reversed event data, which contain complementary information from the past and the future, respectively. Besides, to mitigate the challenges posed by sparse supervision, we propose feature-level contrastive learning based on class-wise prototypes, carefully aggregated at both spatial region and sample levels. Additionally, we further excavate the potential of our dual-student learning model by exchanging prototypes between the two learning paths, thereby harnessing their complementary strengths. 
With extensive experiments on various datasets, including DSEC Night-Point with sparse point annotations newly provided by this paper, the proposed method achieves substantial segmentation results even without relying on pixel-level dense ground truths. The code and dataset are available at https://github.com/Chohoonhee/EV-WSSS. ### TLRN: Temporal Latent Residual Networks For Large Deformation Image Registration - **Arxiv ID**: http://arxiv.org/abs/2407.11219v1 - **DOI**: None - **Categories**: **cs.CV**, eess.IV - **Links**: [PDF](http://arxiv.org/pdf/2407.11219v1) - **Published**: 2024-07-15 20:07:45+00:00 - **Updated**: 2024-07-15 20:07:45+00:00 - **Authors**: Nian Wu, Jiarui Xing, Miaomiao Zhang - **Comment**: 10 pages. Accepted by MICCAI 2024 - **Journal**: None - **Summary**: This paper presents a novel approach, termed *Temporal Latent Residual Network (TLRN)*, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase). To achieve accurate and robust registration results, we leverage the nature of motion continuity and exploit the temporal smoothness in consecutive image frames. Our proposed TLRN highlights a temporal residual network with residual blocks carefully designed in latent deformation spaces, which are parameterized by time-sequential initial velocity fields. We treat a sequence of residual blocks over time as a dynamic training system, where each block is designed to learn the residual function between desired deformation features and current input accumulated from previous time frames. We validate the effectiveness of TLRN on both synthetic data and real-world cine cardiac magnetic resonance (CMR) image videos. 
Our experimental results show that TLRN is able to achieve substantially improved registration accuracy compared to the state-of-the-art. Our code is publicly available at https://github.com/nellie689/TLRN. ### Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding - **Arxiv ID**: http://arxiv.org/abs/2407.11224v1 - **DOI**: None - **Categories**: **eess.IV**, cs.CV - **Links**: [PDF](http://arxiv.org/pdf/2407.11224v1) - **Published**: 2024-07-15 20:20:04+00:00 - **Updated**: 2024-07-15 20:20:04+00:00 - **Authors**: Danish Nazir, Timo Bartels, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt - **Comment**: accepted in ECCV2024 - **Journal**: None - **Summary**: Distributed computing in the context of deep neural networks (DNNs) implies the execution of one part of the network on edge devices and the other part typically on a large-scale cloud platform. Conventional methods propose to employ a serial concatenation of a learned image and source encoder, the latter projecting the image encoder output (bottleneck features) into a quantized representation for bitrate-efficient transmission. In the cloud, a respective source decoder reprojects the quantized representation to the original feature representation, serving as an input for the downstream task decoder performing, e.g., semantic segmentation. In this work, we propose joint source and task decoding, as it allows for a smaller network size in the cloud. This further enables the scalability of such services in large numbers without requiring extensive computational load on the cloud per channel. We demonstrate the effectiveness of our method by achieving a distributed semantic segmentation SOTA over a wide range of bitrates on the mean intersection over union metric, while using only $9.8 \%$ ... $11.59 \%$ of cloud DNN parameters used in the previous SOTA on the COCO and Cityscapes datasets. ### Unraveling the Truth: Do LLMs really Understand Charts? 
A Deep Dive into Consistency and Robustness - **Arxiv ID**: http://arxiv.org/abs/2407.11229v1 - **DOI**: None - **Categories**: **cs.CL**, cs.AI, cs.CV, cs.HC, cs.LG - **Links**: [PDF](http://arxiv.org/pdf/2407.11229v1) - **Published**: 2024-07-15 20:29:24+00:00 - **Updated**: 2024-07-15 20:29:24+00:00 - **Authors**: Srija Mukhopadhyay, Adnan Qidwai, Aparna Garimella, Pritika Ramu, Vivek Gupta, Dan Roth - **Comment**: 22 pages, 7 Tables, 3 Figures, 25 examples - **Journal**: None - **Summary**: Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets, developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the models' ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in the field. 
### Evaluating geometric accuracy of NeRF reconstructions compared to SLAM method - **Arxiv ID**: http://arxiv.org/abs/2407.11238v1 - **DOI**: None - **Categories**: **cs.CV**, cs.RO - **Links**: [PDF](http://arxiv.org/pdf/2407.11238v1) - **Published**: 2024-07-15 21:04:11+00:00 - **Updated**: 2024-07-15 21:04:11+00:00 - **Authors**: Adam Korycki, Colleen Josephson, Steve McGuire - **Comment**: None - **Journal**: None - **Summary**: As Neural Radiance Field (NeRF) implementations become faster, more efficient and accurate, their applicability to real world mapping tasks becomes more accessible. Traditionally, 3D mapping, or scene reconstruction, has relied on expensive LiDAR sensing. Photogrammetry can perform image-based 3D reconstruction but is computationally expensive and requires extremely dense image representation to recover complex geometry and photorealism. NeRFs perform 3D scene reconstruction by training a neural network on sparse image and pose data, achieving superior results to photogrammetry with less input data. This paper presents an evaluation of two NeRF scene reconstructions for the purpose of estimating the diameter of a vertical PVC cylinder. One of these is trained on commodity iPhone data and the other is trained on robot-sourced imagery and poses. This neural-geometry is compared to state-of-the-art lidar-inertial SLAM in terms of scene noise and metric-accuracy. ### Representation Learning and Identity Adversarial Training for Facial Behavior Understanding - **Arxiv ID**: http://arxiv.org/abs/2407.11243v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11243v1) - **Published**: 2024-07-15 21:13:28+00:00 - **Updated**: 2024-07-15 21:13:28+00:00 - **Authors**: Mang Ning, Albert Ali Salah, Itir Onal Ertugrul - **Comment**: None - **Journal**: None - **Summary**: Facial Action Unit (AU) detection has gained significant research attention as AUs contain complex expression information. 
In this paper, we unpack two fundamental factors in AU detection: data and subject identity regularization, respectively. Motivated by recent advances in foundation models, we highlight the importance of data and collect a diverse dataset Face9M, comprising 9 million facial images, from multiple public resources. Pretraining a masked autoencoder on Face9M yields strong performance in AU detection and facial expression tasks. We then show that subject identity in AU datasets provides a shortcut learning for the model and leads to sub-optimal solutions to AU predictions. To tackle this generic issue of AU tasks, we propose Identity Adversarial Training (IAT) and demonstrate that a strong IAT regularization is necessary to learn identity-invariant features. Furthermore, we elucidate the design space of IAT and empirically show that IAT circumvents the identity shortcut learning and results in a better solution. Our proposed methods, Facial Masked Autoencoder (FMAE) and IAT, are simple, generic and effective. Remarkably, the proposed FMAE-IAT approach achieves new state-of-the-art F1 scores on BP4D (67.1%), BP4D+ (66.8%), and DISFA (70.1%) databases, significantly outperforming previous work. We release the code and model at https://github.com/forever208/FMAE-IAT, the first open-sourced facial model pretrained on 9 million diverse images. ### Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation - **Arxiv ID**: http://arxiv.org/abs/2407.11266v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11266v1) - **Published**: 2024-07-15 22:17:35+00:00 - **Updated**: 2024-07-15 22:17:35+00:00 - **Authors**: Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li - **Comment**: Accepted to ECCV 2024 - **Journal**: None - **Summary**: Animating stylized characters to match a reference motion sequence is a highly demanded task in film and gaming industries. 
Existing methods mostly focus on rigid deformations of characters' bodies, neglecting local deformations on the apparel driven by physical dynamics. They deform apparel the same way as the body, leading to results with limited details and unrealistic artifacts, e.g. body-apparel penetration. In contrast, we present a novel method aiming for high-quality motion transfer with realistic apparel animation. As existing datasets lack annotations necessary for generating realistic apparel animations, we build a new dataset named MMDMC, which combines stylized characters from the MikuMikuDance community with real-world Motion Capture data. We then propose a data-driven pipeline that learns to disentangle body and apparel deformations via two neural deformation modules. For body parts, we propose a geodesic attention block to effectively incorporate semantic priors into skeletal body deformation to tackle complex body shapes for stylized characters. Since apparel motion can significantly deviate from respective body joints, we propose to model apparel deformation in a non-linear vertex displacement field conditioned on its historic states. Extensive experiments show that our method produces results with superior quality for various types of apparel. Our dataset is released at https://github.com/rongakowang/MMDMC. ### Differentiable Voxelization and Mesh Morphing - **Arxiv ID**: http://arxiv.org/abs/2407.11272v1 - **DOI**: None - **Categories**: **cs.CV**, math.DG - **Links**: [PDF](http://arxiv.org/pdf/2407.11272v1) - **Published**: 2024-07-15 22:39:03+00:00 - **Updated**: 2024-07-15 22:39:03+00:00 - **Authors**: Yihao Luo, Yikai Wang, Zhengrui Xiang, Yuliang Xiu, Guang Yang, ChoonHwai Yap - **Comment**: None - **Journal**: None - **Summary**: In this paper, we propose the differentiable voxelization of 3D meshes via the winding number and solid angles. 
The proposed approach achieves fast, flexible, and accurate voxelization of 3D meshes, admitting the computation of gradients with respect to the input mesh and GPU acceleration. We further demonstrate the application of the proposed voxelization in mesh morphing, where the voxelized mesh is deformed by a neural network. The proposed method is evaluated on the ShapeNet dataset and achieves state-of-the-art performance in terms of both accuracy and efficiency. ### M18K: A Comprehensive RGB-D Dataset and Benchmark for Mushroom Detection and Instance Segmentation - **Arxiv ID**: http://arxiv.org/abs/2407.11275v1 - **DOI**: None - **Categories**: **cs.CV** - **Links**: [PDF](http://arxiv.org/pdf/2407.11275v1) - **Published**: 2024-07-15 22:50:06+00:00 - **Updated**: 2024-07-15 22:50:06+00:00 - **Authors**: Abdollah Zakeri, Mulham Fawakherji, Jiming Kang, Bikram Koirala, Venkatesh Balan, Weihang Zhu, Driss Benhaddou, Fatima A. Merchant - **Comment**: None - **Journal**: None - **Summary**: Automating agricultural processes holds significant promise for enhancing efficiency and sustainability in various farming practices. This paper contributes to the automation of agricultural processes by providing a dedicated mushroom detection dataset related to automated harvesting, growth monitoring, and quality control of the button mushroom produced using Agaricus Bisporus fungus. With over 18,000 mushroom instances in 423 RGB-D image pairs taken with an Intel RealSense D405 camera, it fills the gap in mushroom-specific datasets and serves as a benchmark for detection and instance segmentation algorithms in smart mushroom agriculture. The dataset, featuring realistic growth environment scenarios with comprehensive annotations, is assessed using advanced detection and instance segmentation algorithms. 
The paper details the dataset's characteristics and evaluates algorithmic performance. For broader applicability, we have made all resources publicly available, including images, code, and trained models, via our GitHub repository https://github.com/abdollahzakeri/m18k.
