Arxiv Papers in cs.CV on 2024-09-18

Hyperspectral Image Classification Based on Faster Residual Multi-branch Spiking Neural Network

Arxiv ID: http://arxiv.org/abs/2409.11619v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 00:51:01+00:00
Updated: 2024-09-18 00:51:01+00:00
Authors: Yang Liu, Yahui Li, Rui Li, Liming Zhou, Lanxue Dang, Huiyu Mu, Qiang Ge
Comment: 15pages,12figures
Journal: None
Summary: Convolutional neural network (CNN) performs well in Hyperspectral Image (HSI) classification tasks, but its high energy consumption and complex network structure make it difficult to directly apply it to edge computing devices. At present, spiking neural networks (SNN) have developed rapidly in HSI classification tasks due to their low energy consumption and event driven characteristics. However, it usually requires a longer time step to achieve optimal accuracy. In response to the above problems, this paper builds a spiking neural network (SNN-SWMR) based on the leaky integrate-and-fire (LIF) neuron model for HSI classification tasks. The network uses the spiking width mixed residual (SWMR) module as the basic unit to perform feature extraction operations. The spiking width mixed residual module is composed of spiking mixed convolution (SMC), which can effectively extract spatial-spectral features. Secondly, this paper designs a simple and efficient arcsine approximate derivative (AAD), which solves the non-differentiable problem of spike firing by fitting the Dirac function. Through AAD, we can directly train supervised spike neural networks. Finally, this paper conducts comparative experiments with multiple advanced HSI classification algorithms based on spiking neural networks on six public hyperspectral data sets. Experimental results show that the AAD function has strong robustness and a good fitting effect. Meanwhile, compared with other algorithms, SNN-SWMR requires a time step reduction of about 84%, training time, and testing time reduction of about 63% and 70% at the same accuracy. This study solves the key problem of SNN based HSI classification algorithms, which has important practical significance for promoting the practical application of HSI classification algorithms in edge devices such as spaceborne and airborne devices.

Multimodal Generalized Category Discovery

Arxiv ID: http://arxiv.org/abs/2409.11624v1
DOI: None
Categories: cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 01:08:49+00:00
Updated: 2024-09-18 01:08:49+00:00
Authors: Yuchang Su, Renping Zhou, Siyu Huang, Xingjian Li, Tianyang Wang, Ziyue Wang, Min Xu
Comment: None
Journal: None
Summary: Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.

PainDiffusion: Can robot express pain?

Arxiv ID: http://arxiv.org/abs/2409.11635v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 01:55:00+00:00
Updated: 2024-09-18 01:55:00+00:00
Authors: Quang Tien Dam, Tri Tung Nguyen Nguyen, Dinh Tuan Tran, Joo-Ho Lee
Comment: Under reviewing
Journal: None
Summary: Pain is a more intuitive and user-friendly way of communicating problems, making it especially useful in rehabilitation nurse training robots. While most previous methods have focused on classifying or recognizing pain expressions, these approaches often result in unnatural, jiggling robot faces. We introduce PainDiffusion, a model that generates facial expressions in response to pain stimuli, with controllable pain expressiveness and emotion status. PainDiffusion leverages diffusion forcing to roll out predictions over arbitrary lengths using a conditioned temporal U-Net. It operates as a latent diffusion model within EMOCA's facial expression latent space, ensuring a compact data representation and quick rendering time. For training data, we process the BioVid Heatpain Database, extracting expression codes and subject identity configurations. We also propose a novel set of metrics to evaluate pain expressions, focusing on expressiveness, diversity, and the appropriateness of model-generated outputs. Finally, we demonstrate that PainDiffusion outperforms the autoregressive method, both qualitatively and quantitatively. Code, videos, and further analysis are available at: \href{https://damtien444.github.io/paindf/}{https://damtien444.github.io/paindf/}.

DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion

Arxiv ID: http://arxiv.org/abs/2409.11642v1
DOI: None
Categories: cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 02:14:08+00:00
Updated: 2024-09-18 02:14:08+00:00
Authors: Jian Xu, Xin He
Comment: 5pages,4figures
Journal: None
Summary: Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. However, due to the significant differences between the two modalities, preserving key features during the fusion process remains a challenge. To address this issue, we propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptive, which introduces Multi-Kernel Maximum Mean Discrepancy (MK-MMD) into the base encoder and designs a hybrid kernel function suitable for infrared and visible image fusion. The base encoder built on the Restormer network captures global structural information while the detail encoder based on Invertible Neural Networks (INN) focuses on extracting detail texture information. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images. Experimental results demonstrate that the proposed method outperforms existing techniques across multiple datasets, significantly enhancing both visual quality and fusion performance. The related Python code is available at https://github.com/xujian000/DAF-Net.

Few-Shot Learning Approach on Tuberculosis Classification Based on Chest X-Ray Images

Arxiv ID: http://arxiv.org/abs/2409.11644v1
DOI: None
Categories: eess.IV, cs.AI, cs.CV
Links: PDF
Published: 2024-09-18 02:15:01+00:00
Updated: 2024-09-18 02:15:01+00:00
Authors: A. A. G. Yogi Pramana, Faiz Ihza Permana, Muhammad Fazil Maulana, Dzikri Rahadian Fudholi
Comment: 6 pages. Pre-print
Journal: None
Summary: Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.

Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Arxiv ID: http://arxiv.org/abs/2409.11652v1
DOI: None
Categories: cs.CV, cs.CR
Links: PDF
Published: 2024-09-18 02:37:04+00:00
Updated: 2024-09-18 02:37:04+00:00
Authors: Hongyu Zhu, Xin Jin, Hongchao Liao, Yan Xiang, Mounim A. El-Yacoubi, Huafeng Qin
Comment: Accepted By CCBR 2024
Journal: None
Summary: Eye movement biometrics is a secure and innovative identification method. Deep learning methods have shown good performance, but their network architecture relies on manual design and combined priori knowledge. To address these issues, we introduce automated network search (NAS) algorithms to the field of eye movement recognition and present Relax DARTS, which is an improvement of the Differentiable Architecture Search (DARTS) to realize more efficient network search and training. The key idea is to circumvent the issue of weight sharing by independently training the architecture parameters $\alpha$ to achieve a more precise target architecture. Moreover, the introduction of module input weights $\beta$ allows cells the flexibility to select inputs, to alleviate the overfitting phenomenon and improve the model performance. Results on four public databases demonstrate that the Relax DARTS achieves state-of-the-art recognition performance. Notably, Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.

Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Arxiv ID: http://arxiv.org/abs/2409.11653v1
DOI: None
Categories: cs.LG, cs.CV
Links: PDF
Published: 2024-09-18 02:40:31+00:00
Updated: 2024-09-18 02:40:31+00:00
Authors: Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu
Comment: Under Review
Journal: None
Summary: Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion $\alpha$-Maximum Mean Discrepancy ($\alpha$-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing $\alpha$-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.

VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Arxiv ID: http://arxiv.org/abs/2409.11656v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 02:46:28+00:00
Updated: 2024-09-18 02:46:28+00:00
Authors: Humen Zhong, Zhibo Yang, Zhaohai Li, Peng Wang, Jun Tang, Wenqing Cheng, Cong Yao
Comment: Accepted by ACM-MM2024
Journal: None
Summary: Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework during pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader maintains consistency from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions. VL-reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement was even more significant on challenging datasets. The results demonstrate that vision and language reconstructor can serve as an effective scene text recognizer.

Bridging Domain Gap for Flight-Ready Spaceborne Vision

Arxiv ID: http://arxiv.org/abs/2409.11661v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 02:56:50+00:00
Updated: 2024-09-18 02:56:50+00:00
Authors: Tae Ha Park, Simone D'Amico
Comment: Submitted to Journal of Spacecraft and Rockets; Appeared as Chapter 4 of Tae Ha Park's PhD thesis
Journal: None
Summary: This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. As opposed to existing literature, SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals features such as data augmentation, transfer learning and vision transformer architecture as a few of those that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 can achieve state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed while having trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphical processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to a wide range of close-range rendezvous and proximity operations with target resident space objects. The code implementation of SPNv3 will be made publicly available.

Agent Aggregator with Mask Denoise Mechanism for Histopathology Whole Slide Image Analysis

Arxiv ID: http://arxiv.org/abs/2409.11664v1
DOI: 10.1145/3664647.3681425
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 03:02:19+00:00
Updated: 2024-09-18 03:02:19+00:00
Authors: Xitong Ling, Minxi Ouyang, Yizhi Wang, Xinrui Chen, Renao Yan, Hongbo Chu, Junru Cheng, Tian Guan, Sufang Tian, Xiaoping Liu, Yonghong He
Comment: None
Journal: None
Summary: Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interests (ROIs) localization can assist pathologists in diagnosis. The gigapixel resolution of WSI and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from agents-aggregated value, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL's superiority over state-of-the-art methods.

Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks

Arxiv ID: http://arxiv.org/abs/2409.11681v1
DOI: None
Categories: cs.CV, cs.RO
Links: PDF
Published: 2024-09-18 03:45:44+00:00
Updated: 2024-09-18 03:45:44+00:00
Authors: Joji Joseph, Bharadwaj Amrutur, Shalabh Bhatnagar
Comment: Preprint, Under review for ICRA 2025
Journal: None
Summary: 3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at https://jojijoseph.github.io/3dgs-segmentation.

SRIF: Semantic Shape Registration Empowered by Diffusion-based Image Morphing and Flow Estimation

Arxiv ID: http://arxiv.org/abs/2409.11682v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 03:47:24+00:00
Updated: 2024-09-18 03:47:24+00:00
Authors: Mingze Sun, Chen Guo, Puhua Jiang, Shiwei Mao, Yurun Chen, Ruqi Huang
Comment: None
Journal: None
Summary: In this paper, we propose SRIF, a novel Semantic shape Registration framework based on diffusion-based Image morphing and Flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multi-views, and then utilize an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are later fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process for intermediate point clouds respecting the image morphing processing. In the end, tailored for the above, we propose a novel registration module to estimate continuous normalizing flow, which deforms source shape consistently towards the target, with intermediate point clouds as weak guidance. Our key insight is to leverage large vision models (LVMs) to associate shapes and therefore obtain much richer semantic information on the relationship between shapes than the ad-hoc feature extraction and alignment. As a consequence, SRIF achieves high-quality dense correspondences on challenging shape pairs, but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence justifies the effectiveness and superiority of our method as well as specific design choices. The code is released at https://github.com/rqhuang88/SRIF.

Detecting Underdiagnosed Medical Conditions with Deep Learning-Based Opportunistic CT Imaging

Arxiv ID: http://arxiv.org/abs/2409.11686v1
DOI: None
Categories: cs.CV, cs.AI, cs.LG
Links: PDF
Published: 2024-09-18 03:56:56+00:00
Updated: 2024-09-18 03:56:56+00:00
Authors: Asad Aali, Andrew Johnston, Louis Blankemeier, Dave Van Veen, Laura T Derry, David Svec, Jason Hom, Robert D. Boutin, Akshay S. Chaudhari
Comment: None
Journal: None
Summary: Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT's potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

SLAM assisted 3D tracking system for laparoscopic surgery

Arxiv ID: http://arxiv.org/abs/2409.11688v1
DOI: None
Categories: cs.RO, cs.CV
Links: PDF
Published: 2024-09-18 04:00:54+00:00
Updated: 2024-09-18 04:00:54+00:00
Authors: Jingwei Song, Ray Zhang, Wenwei Zhang, Hao Zhou, Maani Ghaffari
Comment: Demo: https://youtu.be/B1xZW8bj3cM
Journal: None
Summary: A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.

GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

Arxiv ID: http://arxiv.org/abs/2409.11689v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 04:05:59+00:00
Updated: 2024-09-18 04:05:59+00:00
Authors: Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
Comment: None
Journal: None
Summary: Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose a framework with GUNet as the main model, PoseDiffusion. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.

ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation

Arxiv ID: http://arxiv.org/abs/2409.11692v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 04:21:04+00:00
Updated: 2024-09-18 04:21:04+00:00
Authors: Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong
Comment: None
Journal: None
Summary: Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce the cross-attention mechanism to enhance the explainability of PoseNet and have revealed that driving direction of the vehicle can be explained through attention weights, marking a novel exploration in this area. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.

Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects

Arxiv ID: http://arxiv.org/abs/2409.11702v1
DOI: None
Categories: cs.RO, cs.CV
Links: PDF
Published: 2024-09-18 04:53:38+00:00
Updated: 2024-09-18 04:53:38+00:00
Authors: Jianhua Sun, Yuxuan Li, Longfei Xu, Jiude Wei, Liang Chai, Cewu Lu
Comment: None
Journal: None
Summary: Human cognition can leverage fundamental conceptual knowledge, like geometric and kinematic ones, to appropriately perceive, comprehend and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability through performing at the conceptual level, in order to understand and then interact with articulated objects, especially for those in novel categories, which is challenging due to the intricate geometric structures and diverse joint types of articulated objects. To achieve this goal, we propose Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet driven by AOTs is designed accordingly to equip intelligent agents with these generalized concepts, and then empower the agents to effectively discover the conceptual knowledge on the structure and affordance of articulated objects. The AOT-driven approach yields benefits in three key perspectives: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct exhaustive experiments and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.

RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View

Arxiv ID: http://arxiv.org/abs/2409.11706v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 05:16:34+00:00
Updated: 2024-09-18 05:16:34+00:00
Authors: Jinrang Jia, Guangqi Yi, Yifeng Shi
Comment: None
Journal: None
Summary: Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, there currently lacks a multi-camera BEV solution in roadside. This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side. These challenges include the diversity in camera poses, the uncertainty in Camera numbers, the sparsity in perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.

LFIC-DRASC: Deep Light Field Image Compression Using Disentangled Representation and Asymmetrical Strip Convolution

Arxiv ID: http://arxiv.org/abs/2409.11711v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 05:33:42+00:00
Updated: 2024-09-18 05:33:42+00:00
Authors: Shiyu Feng, Yun Zhang, Linwei Zhu, Sam Kwong
Comment: None
Journal: None
Summary: Light-Field (LF) image is emerging 4D data of light rays that is capable of realistically presenting spatial and angular information of 3D scene. However, the large data volume of LF images becomes the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. Firstly, we formulate the LF image compression problem as learning a disentangled LF representation network and an image encoding-decoding network. Secondly, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions. Meanwhile, disentangled LF representation network is proposed to enhance the LF feature disentangling and decoupling. Thirdly, we propose the LFIC-DRASC for LF image compression, where two Asymmetrical Strip Convolution (ASC) operators, i.e. horizontal and vertical, are proposed to capture long-range correlation in LF feature space. These two ASC operators can be combined with the square convolution to further decouple LF features, which enhances the model ability in representing intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average of 20.5% bit rate reductions comparing with the state-of-the-art methods.

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Arxiv ID: http://arxiv.org/abs/2409.11718v2
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 05:55:01+00:00
Updated: 2024-09-22 08:23:33+00:00
Authors: Yuan Tian, Guo Lu, Guangtao Zhai
Comment: ECCV2024
Journal: None
Summary: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs. Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.

DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Arxiv ID: http://arxiv.org/abs/2409.11729v1
DOI: None
Categories: cs.MM, cs.CV, cs.SD, eess.AS
Links: PDF
Published: 2024-09-18 06:38:48+00:00
Updated: 2024-09-18 06:38:48+00:00
Authors: Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu
Comment: under review
Journal: None
Summary: Current audio-visual representation learning can capture rough object categories (e.g., animals'' and instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like dogs'' and flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method of audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.

InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models

Arxiv ID: http://arxiv.org/abs/2409.11734v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 06:43:40+00:00
Updated: 2024-09-18 06:43:40+00:00
Authors: Yan Zheng, Lemeng Wu
Comment: 8 pages, 6 figures
Journal: None
Summary: In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for GEO, an exceptionally versatile image editing technique designed to cater to customized user requirements at both local and global scales. Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes. Notably, our method operates without the need for training and is driven by two key contributions: (i) a novel geometric accumulation loss that enhances DDIM inversion to faithfully preserve pixel space geometry and layout, and (ii) an innovative boosted image prompt technique that combines pixel-level editing for text-only inversion with latent space geometry guidance for standard classifier-free reversion. Leveraging the publicly available Stable Diffusion model, our approach undergoes extensive evaluation across various image types and challenging prompt editing scenarios, consistently delivering high-fidelity editing results for real images.

Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing

Arxiv ID: http://arxiv.org/abs/2409.11738v2
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 06:51:29+00:00
Updated: 2024-09-19 03:54:58+00:00
Authors: Seongmin Hong, Jaehyeok Bae, Jongho Lee, Se Young Chun
Comment: 30 pages, 9.8 MB, Accepted to ECCV 2024
Journal: None
Summary: Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling. However, traditional optimization-based reconstruction is slow and can not yield an exact image in practice. Deep learning-based reconstruction has been a promising alternative to optimization-based reconstruction, outperforming it in accuracy and computation speed. Finding an efficient sampling method with deep learning-based reconstruction, especially for Fourier CS remains a challenge. Existing joint optimization of sampling-reconstruction works ($\mathcal{H}_1$) optimize the sampling mask but have low potential as it is not adaptive to each data point. Adaptive sampling ($\mathcal{H}2$) has also disadvantages of difficult optimization and Pareto sub-optimality. Here, we propose a novel adaptive selection of sampling-reconstruction ($\mathcal{H}{1.5}$) framework that selects the best sampling mask and reconstruction network for each input data. We provide theorems that our method has a higher potential than $\mathcal{H}_1$ and effectively solves the Pareto sub-optimality problem in sampling-reconstruction by using separate reconstruction networks for different sampling masks. To select the best sampling mask, we propose to quantify the high-frequency Bayesian uncertainty of the input, using a super-resolution space generation model. Our method outperforms joint optimization of sampling-reconstruction ($\mathcal{H}_1$) and adaptive sampling ($\mathcal{H}_2$) by achieving significant improvements on several Fourier CS problems.

Exploring Gaze Pattern in Autistic Children: Clustering, Visualization, and Prediction

Arxiv ID: http://arxiv.org/abs/2409.11744v1
DOI: None
Categories: cs.CV, cs.AI, cs.HC
Links: PDF
Published: 2024-09-18 06:56:06+00:00
Updated: 2024-09-18 06:56:06+00:00
Authors: Weiyan Shi, Haihong Zhang, Jin Yang, Ruiqing Ding, YongWei Zhu, Kenny Tsu Wei Choo
Comment: None
Journal: None
Summary: Autism Spectrum Disorder (ASD) significantly affects the social and communication abilities of children, and eye-tracking is commonly used as a diagnostic tool by identifying associated atypical gaze patterns. Traditional methods demand manual identification of Areas of Interest in gaze patterns, lowering the performance of gaze behavior analysis in ASD subjects. To tackle this limitation, we propose a novel method to automatically analyze gaze behaviors in ASD children with superior accuracy. To be specific, we first apply and optimize seven clustering algorithms to automatically group gaze points to compare ASD subjects with typically developing peers. Subsequently, we extract 63 significant features to fully describe the patterns. These features can describe correlations between ASD diagnosis and gaze patterns. Lastly, using these features as prior knowledge, we train multiple predictive machine learning models to predict and diagnose ASD based on their gaze behaviors. To evaluate our method, we apply our method to three ASD datasets. The experimental and visualization results demonstrate the improvements of clustering algorithms in the analysis of unique gaze patterns in ASD children. Additionally, these predictive machine learning models achieved state-of-the-art prediction performance ($81%$ AUC) in the field of automatically constructed gaze point features for ASD diagnosis. Our code is available at \url{https://github.com/username/projectname}.

RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

Arxiv ID: http://arxiv.org/abs/2409.11749v1
DOI: None
Categories: cs.CV, cs.RO
Links: PDF
Published: 2024-09-18 07:08:08+00:00
Updated: 2024-09-18 07:08:08+00:00
Authors: Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui
Comment: RockTrack establishes a new state-of-the-art with 59.1% AMOTA on the nuScenes vision-only test leaderboard with ResNet50-level backbone
Journal: None
Summary: 3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.

Neural Encoding for Image Recall: Human-Like Memory

Arxiv ID: http://arxiv.org/abs/2409.11750v1
DOI: None
Categories: cs.CV, 68T07, 68U10, I.4.10; I.2.6; I.5.1
Links: PDF
Published: 2024-09-18 07:09:19+00:00
Updated: 2024-09-18 07:09:19+00:00
Authors: Virgile Foussereau, Robin Dumas
Comment: 5 pages, 7 figures
Journal: None
Summary: Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging pre-trained models' embedding layers, we explore how different architectures encode images and their impact on memory recall. Our method achieves impressive results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial intelligence memory mechanisms.

Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation using Rein to Fine-tune Vision Foundation Models

Arxiv ID: http://arxiv.org/abs/2409.11752v2
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 07:10:24+00:00
Updated: 2024-09-19 09:39:49+00:00
Authors: Pengzhou Cai, Xueyuan Zhang, Libin Lan, Ze Zhao
Comment: None
Journal: None
Summary: In recent years, significant progress has been made in tumor segmentation within the field of digital pathology. However, variations in organs, tissue preparation methods, and image acquisition processes can lead to domain discrepancies among digital pathology images. To address this problem, in this paper, we use Rein, a fine-tuning method, to parametrically and efficiently fine-tune various vision foundation models (VFMs) for MICCAI 2024 Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation (COSAS2024). The core of Rein consists of a set of learnable tokens, which are directly linked to instances, improving functionality at the instance level in each layer. In the data environment of the COSAS2024 Challenge, extensive experiments demonstrate that Rein fine-tuned the VFMs to achieve satisfactory results. Specifically, we used Rein to fine-tune ConvNeXt and DINOv2. Our team used the former to achieve scores of 0.7719 and 0.7557 on the preliminary test phase and final test phase in task1, respectively, while the latter achieved scores of 0.8848 and 0.8192 on the preliminary test phase and final test phase in task2. Code is available at GitHub.

Knowledge Adaptation Network for Few-Shot Class-Incremental Learning

Arxiv ID: http://arxiv.org/abs/2409.11770v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 07:51:38+00:00
Updated: 2024-09-18 07:51:38+00:00
Authors: Ye Wang, Yaxiong Wang, Guoshuai Zhao, Xueming Qian
Comment: 13 pages;6 figures
Journal: None
Summary: Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, the CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed as Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.

Distilling Channels for Efficient Deep Tracking

Arxiv ID: http://arxiv.org/abs/2409.11785v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 08:09:20+00:00
Updated: 2024-09-18 08:09:20+00:00
Authors: Shiming Ge, Zhao Luo, Chunhui Zhang, Yingying Hua, Dacheng Tao
Comment: Published by IEEE TIP 2020
Journal: None
Summary: Deep trackers have proven success in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from massive data used in object classification and so they are capable to represent generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take discriminative correlation filter (DCF) and ECO for example. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem to adaptively select informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, as well as adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.

Efficient Low-Resolution Face Recognition via Bridge Distillation

Arxiv ID: http://arxiv.org/abs/2409.11786v1
DOI: None
Categories: cs.CV, cs.AI, cs.MM
Links: PDF
Published: 2024-09-18 08:10:35+00:00
Updated: 2024-09-18 08:10:35+00:00
Authors: Shiming Ge, Shengwei Zhao, Chenyu Li, Yu Zhang, Jia Li
Comment: This paper is published in IEEE TIP 2020
Journal: IEEE TIP 2020
Summary: Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.

Latent fingerprint enhancement for accurate minutiae detection

Arxiv ID: http://arxiv.org/abs/2409.11802v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 08:35:31+00:00
Updated: 2024-09-18 08:35:31+00:00
Authors: Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak
Comment: None
Journal: None
Summary: Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the fine-macroeconomic details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversary networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate our method's dominance over existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.

EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning

Arxiv ID: http://arxiv.org/abs/2409.11813v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 09:01:34+00:00
Updated: 2024-09-18 09:01:34+00:00
Authors: Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang
Comment: None
Journal: None
Summary: The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.

SymFace: Additional Facial Symmetry Loss for Deep Face Recognition

Arxiv ID: http://arxiv.org/abs/2409.11816v1
DOI: None
Categories: cs.CV, 68T45 (Primary), I.4.9
Links: PDF
Published: 2024-09-18 09:06:55+00:00
Updated: 2024-09-18 09:06:55+00:00
Authors: Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran
Comment: 11 Pages, 6 Figures, 5 Tables, Submitted for WACV 2025
Journal: None
Summary: Over the past decade, there has been a steady advancement in enhancing face recognition algorithms leveraging advanced machine learning methods. The role of the loss function is pivotal in addressing face verification problems and playing a game-changing role. These loss functions have mainly explored variations among intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemi faces has been widely used in many research areas in recent decades. This paper adopts this simple approach judiciously by splitting the face image vertically into two halves. With the assumption that the natural phenomena of facial symmetry can enhance face verification methodology, we hypothesize that the two output embedding vectors of split faces must project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity of embedding of the symmetrical pair of split faces. Symmetrical loss has the potential to minimize minor asymmetric features due to facial expression and lightning conditions, hence significantly increasing the inter-class variance among the classes and leading to more reliable face embedding. This loss function propels any network to outperform its baseline performance across all existing network architectures and configurations, enabling us to achieve SoTA results.

EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis

Arxiv ID: http://arxiv.org/abs/2409.11817v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 09:08:16+00:00
Updated: 2024-09-18 09:08:16+00:00
Authors: Shaojie Li, Zhaoshuo Diao
Comment: None
Journal: None
Summary: The recent development of deep learning large models in medicine shows remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but the slide-level gradients cannot be backpropagated for student model updates due to high-resolution pathological images and slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.

End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation

Arxiv ID: http://arxiv.org/abs/2409.11819v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 09:11:31+00:00
Updated: 2024-09-18 09:11:31+00:00
Authors: Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper
Comment: None
Journal: None
Summary: 6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.

RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets

Arxiv ID: http://arxiv.org/abs/2409.11831v1
DOI: None
Categories: cs.RO, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 09:30:03+00:00
Updated: 2024-09-18 09:30:03+00:00
Authors: Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian
Comment: None
Journal: None
Summary: Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.

Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better Steering

Arxiv ID: http://arxiv.org/abs/2409.12716v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 09:36:24+00:00
Updated: 2024-09-18 09:36:24+00:00
Authors: Fouad Makiyeh, Mark Bastourous, Anass Bairouk, Wei Xiao, Mirjana Maras, Tsun-Hsuan Wangb, Marc Blanchon, Ramin Hasani, Patrick Chareyre, Daniela Rus
Comment: None
Journal: None
Summary: Autonomous vehicle navigation is a key challenge in artificial intelligence, requiring robust and accurate decision-making processes. This research introduces a new end-to-end method that exploits multimodal information from a single monocular camera to improve the steering predictions for self-driving cars. Unlike conventional models that require several sensors which can be costly and complex or rely exclusively on RGB images that may not be robust enough under different conditions, our model significantly improves vehicle steering prediction performance from a single visual sensor. By focusing on the fusion of RGB imagery with depth completion information or optical flow data, we propose a comprehensive framework that integrates these modalities through both early and hybrid fusion techniques. We use three distinct neural network models to implement our approach: Convolution Neural Network - Neutral Circuit Policy (CNN-NCP) , Variational Auto Encoder - Long Short-Term Memory (VAE-LSTM) , and Neural Circuit Policy architecture VAE-NCP. By incorporating optical flow into the decision-making process, our method significantly advances autonomous navigation. Empirical results from our comparative study using Boston driving data show that our model, which integrates image and motion information, is robust and reliable. It outperforms state-of-the-art approaches that do not use optical flow, reducing the steering estimation error by 31%. This demonstrates the potential of optical flow data, combined with advanced neural network architectures (a CNN-based structure for fusing data and a Recurrence-based network for inferring a command from latent space), to enhance the performance of autonomous vehicles steering estimation.

NT-ViT: Neural Transcoding Vision Transformers for EEG-to-fMRI Synthesis

Arxiv ID: http://arxiv.org/abs/2409.11836v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 09:38:08+00:00
Updated: 2024-09-18 09:38:08+00:00
Authors: Romeo Lanzino, Federico Fontana, Luigi Cinque, Francesco Scarcello, Atsuto Maki
Comment: ECCV24 Workshop on Synthetic Data for Computer Vision
Journal: None
Summary: This paper introduces the Neural Transcoding Vision Transformer (\modelname), a generative model designed to estimate high-resolution functional Magnetic Resonance Imaging (fMRI) samples from simultaneous Electroencephalography (EEG) data. A key feature of \modelname is its Domain Matching (DM) sub-module which effectively aligns the latent EEG representations with those of fMRI volumes, enhancing the model's accuracy and reliability. Unlike previous methods that tend to struggle with fidelity and reproducibility of images, \modelname addresses these challenges by ensuring methodological integrity and higher-quality reconstructions which we showcase through extensive evaluation on two benchmark datasets; \modelname outperforms the current state-of-the-art by a significant margin in both cases, e.g. achieving a $10\times$ reduction in RMSE and a $3.14\times$ increase in SSIM on the Oddball dataset. An ablation study also provides insights into the contribution of each component to the model's overall effectiveness. This development is critical in offering a new approach to lessen the time and financial constraints typically linked with high-resolution brain imaging, thereby aiding in the swift and precise diagnosis of neurological disorders. Although it is not a replacement for actual fMRI but rather a step towards making such imaging more accessible, we believe that it represents a pivotal advancement in clinical practice and neuroscience research. Code is available at \url{https://github.com/rom42pla/ntvit}.

Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments

Arxiv ID: http://arxiv.org/abs/2409.11854v1
DOI: None
Categories: cs.RO, cs.CV
Links: PDF
Published: 2024-09-18 10:22:07+00:00
Updated: 2024-09-18 10:22:07+00:00
Authors: Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li
Comment: Accepted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
Journal: None
Summary: Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since the non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce the physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrated that our PBA method outperforms existing approaches in accuracy.

Distillation-free Scaling of Large SSMs for Images and Videos

Arxiv ID: http://arxiv.org/abs/2409.11867v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 10:48:10+00:00
Updated: 2024-09-18 10:48:10+00:00
Authors: Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
Comment: None
Journal: None
Summary: State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.

SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition

Arxiv ID: http://arxiv.org/abs/2409.11869v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 10:52:02+00:00
Updated: 2024-09-18 10:52:02+00:00
Authors: Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao
Comment: None
Journal: None
Summary: Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research predominantly employing 2D sensors to gather gait data has achieved notable advancements; nonetheless, they have unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly captures 3D spatial features but also diminishes the impact of lighting conditions while ensuring privacy protection.The essence of the problem lies in how to effectively extract discriminative 3D dynamic representation from point clouds.In this paper, we proposes a method named SpheriGait for extracting and enhancing dynamic features from point clouds for Lidar-based gait recognition. Specifically, it substitutes the conventional point cloud plane projection method with spherical projection to augment the perception of dynamic feature.Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. We conducted extensive experiments and the results demonstrated the SpheriGait achieved state-of-the-art performance on the SUSTech1K dataset, and verified that the spherical projection method can serve as a universal data preprocessing technique to enhance the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.

ABHINAW: A method for Automatic Evaluation of Typography within AI-Generated Images

Arxiv ID: http://arxiv.org/abs/2409.11874v1
DOI: None
Categories: cs.CV, eess.IV
Links: PDF
Published: 2024-09-18 11:04:35+00:00
Updated: 2024-09-18 11:04:35+00:00
Authors: Abhinaw Jagtap, Nachiket Tapas, R. G. Brajesh
Comment: None
Journal: None
Summary: In the fast-evolving field of Generative AI, platforms like MidJourney, DALL-E, and Stable Diffusion have transformed Text-to-Image (T2I) Generation. However, despite their impressive ability to create high-quality images, they often struggle to generate accurate text within these images. Theoretically, if we could achieve accurate text generation in AI images in a ``zero-shot'' manner, it would not only make AI-generated images more meaningful but also democratize the graphic design industry. The first step towards this goal is to create a robust scoring matrix for evaluating text accuracy in AI-generated images. Although there are existing bench-marking methods like CLIP SCORE and T2I-CompBench++, there's still a gap in systematically evaluating text and typography in AI-generated images, especially with diffusion-based methods. In this paper, we introduce a novel evaluation matrix designed explicitly for quantifying the performance of text and typography generation within AI-generated images. We have used letter by letter matching strategy to compute the exact matching scores from the reference text to the AI generated text. Our novel approach to calculate the score takes care of multiple redundancies such as repetition of words, case sensitivity, mixing of words, irregular incorporation of letters etc. Moreover, we have developed a Novel method named as brevity adjustment to handle excess text. In addition we have also done a quantitative analysis of frequent errors arise due to frequently used words and less frequently used words. Project page is available at: https://github.com/Abhinaw3906/ABHINAW-MATRIX.

Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

Arxiv ID: http://arxiv.org/abs/2409.11904v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 12:02:20+00:00
Updated: 2024-09-18 12:02:20+00:00
Authors: Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen
Comment: None
Journal: None
Summary: Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.

Tumor aware recurrent inter-patient deformable image registration of computed tomography scans with lung cancer

Arxiv ID: http://arxiv.org/abs/2409.11910v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 12:11:59+00:00
Updated: 2024-09-18 12:11:59+00:00
Authors: Jue Jiang, Chloe Min Seo Choi, Maria Thor, Joseph O. Deasy, Harini Veeraraghavan
Comment: Minor revision under the journal of Medical Physics
Journal: None
Summary: Background: Voxel-based analysis (VBA) for population level radiotherapy (RT) outcomes modeling requires topology preserving inter-patient deformable image registration (DIR) that preserves tumors on moving images while avoiding unrealistic deformations due to tumors occurring on fixed images. Purpose: We developed a tumor-aware recurrent registration (TRACER) deep learning (DL) method and evaluated its suitability for VBA. Methods: TRACER consists of encoder layers implemented with stacked 3D convolutional long short term memory network (3D-CLSTM) followed by decoder and spatial transform layers to compute dense deformation vector field (DVF). Multiple CLSTM steps are used to compute a progressive sequence of deformations. Input conditioning was applied by including tumor segmentations with 3D image pairs as input channels. Bidirectional tumor rigidity, image similarity, and deformation smoothness losses were used to optimize the network in an unsupervised manner. TRACER and multiple DL methods were trained with 204 3D CT image pairs from patients with lung cancers (LC) and evaluated using (a) Dataset I (N = 308 pairs) with DL segmented LCs, (b) Dataset II (N = 765 pairs) with manually delineated LCs, and (c) Dataset III with 42 LC patients treated with RT. Results: TRACER accurately aligned normal tissues. It best preserved tumors, blackindicated by the smallest tumor volume difference of 0.24%, 0.40%, and 0.13 % and mean square error in CT intensities of 0.005, 0.005, 0.004, computed between original and resampled moving image tumors, for Datasets I, II, and III, respectively. It resulted in the smallest planned RT tumor dose difference computed between original and resampled moving images of 0.01 Gy and 0.013 Gy when using a female and a male reference.

FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation

Arxiv ID: http://arxiv.org/abs/2409.12720v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 12:30:02+00:00
Updated: 2024-09-18 12:30:02+00:00
Authors: Thomas Pöllabauer, Ashwin Pramod, Volker Knauthe, Michael Wahl
Comment: None
Journal: None
Summary: 6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene and relative to a chosen coordinate system. This problem is of particular interest for many practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on enhancing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while keeping its high accuracy. We employ several techniques to reduce the model size and improve inference time. These techniques include using smaller and quicker backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state-of-the-art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D Object Pose Estimation models in real-world settings.

LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models

Arxiv ID: http://arxiv.org/abs/2409.11919v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 12:32:25+00:00
Updated: 2024-09-18 12:32:25+00:00
Authors: Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
Comment: EVAL-FoMo workshop, ECCV 2024
Journal: None
Summary: Vision Language Models (VLMs) have shown impressive performances on numerous tasks but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires white-box' access to the model's architecture and weights as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a black-box' manner by leveraging large language models (LLMs) so as to reason on their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, resulting in competitive results when compared with classic fine-tuning.

Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models

Arxiv ID: http://arxiv.org/abs/2409.11920v1
DOI: None
Categories: cs.CV, cs.LG, I.2.0
Links: PDF
Published: 2024-09-18 12:32:39+00:00
Updated: 2024-09-18 12:32:39+00:00
Authors: Lorenzo Mandelli, Stefano Berretti
Comment: 13 pages, 6 figures
Journal: None
Summary: In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPTs models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.

Agglomerative Token Clustering

Arxiv ID: http://arxiv.org/abs/2409.11923v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 12:37:58+00:00
Updated: 2024-09-18 12:37:58+00:00
Authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
Comment: ECCV 2024. Project webpage at https://vap.aau.dk/atc/
Journal: None
Summary: We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.

Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective

Arxiv ID: http://arxiv.org/abs/2409.11937v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 12:52:54+00:00
Updated: 2024-09-18 12:52:54+00:00
Authors: Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang
Comment: 16 pages, 13 figures
Journal: None
Summary: Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression. It could lead to poor perceptions of three-dimensional transformation. They also ignore the possible overlaps or gaps between teeth of predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network, decoupling predicting tasks and feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the related gestures between teeth, which can be easily extended to other 3D point cloud tasks. We propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance on accuracy and speed compared with existing methods.

GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations

Arxiv ID: http://arxiv.org/abs/2409.11951v1
DOI: None
Categories: cs.CV, cs.GR
Links: PDF
Published: 2024-09-18 13:05:43+00:00
Updated: 2024-09-18 13:05:43+00:00
Authors: Kartik Teotia, Hyeongwoo Kim, Pablo Garrido, Marc Habermann, Mohamed Elgharib, Christian Theobalt
Comment: ACM Transaction on Graphics (SIGGRAPH Asia 2024); Project page: https://vcai.mpi-inf.mpg.de/projects/GaussianHeads/
Journal: None
Summary: Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Albeit producing photorealistic head renderings, it often fails to represent complex motion changes such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time. At the core of our method is a hierarchical representation of head models that allows to capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against the related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application.

Tracking Any Point with Frame-Event Fusion Network at High Frame Rate

Arxiv ID: http://arxiv.org/abs/2409.11953v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 13:07:19+00:00
Updated: 2024-09-18 13:07:19+00:00
Authors: Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
Comment: None
Journal: None
Summary: Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point's trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24$%$ on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.

A Chinese Continuous Sign Language Dataset Based on Complex Environments

Arxiv ID: http://arxiv.org/abs/2409.11960v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 13:11:15+00:00
Updated: 2024-09-18 13:11:15+00:00
Authors: Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan
Comment: 11 pages, 3 figures
Journal: None
Summary: The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the complex environment - chinese sign language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.

Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model

Arxiv ID: http://arxiv.org/abs/2409.11969v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 13:20:46+00:00
Updated: 2024-09-18 13:20:46+00:00
Authors: Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li
Comment: None
Journal: None
Summary: End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms results in diminished development efficacy and impedes the establishment of trust. Pioneering in the issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes the module's feature maps against Ground Truth within a unified semantic Representation Space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.

MitoSeg: Mitochondria Segmentation Tool

Arxiv ID: http://arxiv.org/abs/2409.11974v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 13:26:45+00:00
Updated: 2024-09-18 13:26:45+00:00
Authors: Faris Serdar Taşel, Efe Çiftci
Comment: None
Journal: None
Summary: Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in Electron Microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. It is crucial to automatically segment mitochondria from these images to investigate the relationship between mitochondria and diseases. In this paper, we present a software solution for mitochondrial segmentation, highlighting mitochondria boundaries in electron microscopy tomography images and generating corresponding 3D meshes.

Intraoperative Registration by Cross-Modal Inverse Neural Rendering

Arxiv ID: http://arxiv.org/abs/2409.11983v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 13:40:59+00:00
Updated: 2024-09-18 13:40:59+00:00
Authors: Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine
Comment: Accepted at MICCAI 2024
Journal: None
Summary: We present in this paper a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field's appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine, which can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patients' data from clinical cases, showing that our method outperforms state-of-the-art while meeting current clinical standards for registration. Code and additional resources can be found at https://maxfehrentz.github.io/style-ngp/.

Towards Global Localization using Multi-Modal Object-Instance Re-Identification

Arxiv ID: http://arxiv.org/abs/2409.12002v1
DOI: None
Categories: cs.RO, cs.CV, 68T40, I.2.9; I.2.10
Links: PDF
Published: 2024-09-18 14:15:10+00:00
Updated: 2024-09-18 14:15:10+00:00
Authors: Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna
Comment: 8 pages, 5 figures, 3 tables. Submitted to ICRA 2025
Journal: None
Summary: Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.

Panoptic-Depth Forecasting

Arxiv ID: http://arxiv.org/abs/2409.12008v1
DOI: None
Categories: cs.CV, cs.RO
Links: PDF
Published: 2024-09-18 14:21:07+00:00
Updated: 2024-09-18 14:21:07+00:00
Authors: Juana Valeria Hurtado, Riya Mohan, Abhinav Valada
Comment: None
Journal: None
Summary: Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de.

ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation

Arxiv ID: http://arxiv.org/abs/2409.12010v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:24:29+00:00
Updated: 2024-09-18 14:24:29+00:00
Authors: Peiyu Li, Xiaobao Huang, Yijun Tian, Nitesh V. Chawla
Comment: None
Journal: None
Summary: Significant work has been conducted in the domain of food computing, yet these studies typically focus on single tasks such as t2t (instruction generation from food titles and ingredients), i2t (recipe generation from food images), or t2i (food image generation from recipes). None of these approaches integrate all modalities simultaneously. To address this gap, we introduce a novel food computing foundation model that achieves true multimodality, encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large language models (LLMs) and pre-trained image encoder and decoder models, our model can perform a diverse array of food computing-related tasks, including food understanding, food recognition, recipe generation, and food image generation. Compared to previous models, our foundation model demonstrates a significantly broader range of capabilities and exhibits superior performance, particularly in food image generation and recipe generation tasks. We open-sourced ChefFusion at GitHub.

Mixture of Prompt Learning for Vision Language Models

Arxiv ID: http://arxiv.org/abs/2409.12011v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:25:02+00:00
Updated: 2024-09-18 14:25:02+00:00
Authors: Yu Du, Tong Niu, Rong Zhao
Comment: None
Journal: None
Summary: As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to combine VLMs for downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, which only requiring a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module is able to capture a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retaining knowledge from hard prompts and improving selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applied a contrastive loss between the resulted text feature and hard prompt encoded text feature. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at \url{https://anonymous.4open.science/r/mocoop-6387}

BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling

Arxiv ID: http://arxiv.org/abs/2409.12014v2
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:28:52+00:00
Updated: 2024-09-22 14:57:35+00:00
Authors: Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger
Comment: None
Journal: None
Summary: Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).

Computational Imaging for Long-Term Prediction of Solar Irradiance

Arxiv ID: http://arxiv.org/abs/2409.12016v1
DOI: None
Categories: cs.CV, eess.IV
Links: PDF
Published: 2024-09-18 14:29:43+00:00
Updated: 2024-09-18 14:29:43+00:00
Authors: Leron Julian, Haejoon Lee, Soummya Kar, Aswin C. Sankaranarayanan
Comment: None
Journal: None
Summary: The occlusion of the sun by clouds is one of the primary sources of uncertainties in solar power generation, and is a factor that affects the wide-spread use of solar power as a primary energy source. Real-time forecasting of cloud movement and, as a result, solar irradiance is necessary to schedule and allocate energy across grid-connected photovoltaic systems. Previous works monitored cloud movement using wide-angle field of view imagery of the sky. However, such images have poor resolution for clouds that appear near the horizon, which reduces their effectiveness for long term prediction of solar occlusion. Specifically, to be able to predict occlusion of the sun over long time periods, clouds that are near the horizon need to be detected, and their velocities estimated precisely. To enable such a system, we design and deploy a catadioptric system that delivers wide-angle imagery with uniform spatial resolution of the sky over its field of view. To enable prediction over a longer time horizon, we design an algorithm that uses carefully selected spatio-temporal slices of the imagery using estimated wind direction and velocity as inputs. Using ray-tracing simulations as well as a real testbed deployed outdoors, we show that the system is capable of predicting solar occlusion as well as irradiance for tens of minutes in the future, which is an order of magnitude improvement over prior work.

LEMON: Localized Editing with Mesh Optimization and Neural Shaders

Arxiv ID: http://arxiv.org/abs/2409.12024v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:34:06+00:00
Updated: 2024-09-18 14:34:06+00:00
Authors: Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach
Comment: None
Journal: None
Summary: In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.

On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery

Arxiv ID: http://arxiv.org/abs/2409.12026v1
DOI: None
Categories: cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 14:36:50+00:00
Updated: 2024-09-18 14:36:50+00:00
Authors: BW Sheffield, Jeffrey Ellen, Ben Whitmore
Comment: None
Journal: None
Summary: Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.

PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba

Arxiv ID: http://arxiv.org/abs/2409.12031v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:48:50+00:00
Updated: 2024-09-18 14:48:50+00:00
Authors: Chaoqi Luo, Yiping Xie, Zitong Yu
Comment: Accepted by CCBR 2024
Journal: None
Summary: Facial-video based Remote photoplethysmography (rPPG) aims at measuring physiological signals and monitoring heart activity without any contact, showing significant potential in various applications. Previous deep learning based rPPG measurement are primarily based on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers also struggle with modeling long video sequences with high complexity. Recently, the state space models (SSMs) represented by Mamba are known for their impressive performance on capturing long-range dependencies from long sequences. In this paper, we propose the PhysMamba, a Mamba-based framework, to efficiently represent long-range physiological dependencies from facial videos. Specifically, we introduce the Temporal Difference Mamba block to first enhance local dynamic differences and further model the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is utilized to fuse the multi-scale temporal features. Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority and efficiency of PhysMamba. The codes are available at https://github.com/Chaoqi31/PhysMamba

Multi-Sensor Deep Learning for Glacier Mapping

Arxiv ID: http://arxiv.org/abs/2409.12034v1
DOI: None
Categories: cs.CV, eess.IV
Links: PDF
Published: 2024-09-18 14:51:36+00:00
Updated: 2024-09-18 14:51:36+00:00
Authors: Codruţ-Andrei Diaconu, Konrad Heidler, Jonathan L. Bamber, Harry Zekollari
Comment: This article will be a chapter of the book Deep Learning for Multi-Sensor Earth Observation, to be published by Elsevier
Journal: None
Summary: The more than 200,000 glaciers outside the ice sheets play a crucial role in our society by influencing sea-level rise, water resource management, natural hazards, biodiversity, and tourism. However, only a fraction of these glaciers benefit from consistent and detailed in-situ observations that allow for assessing their status and changes over time. This limitation can, in part, be overcome by relying on satellite-based Earth Observation techniques. Satellite-based glacier mapping applications have historically mainly relied on manual and semi-automatic detection methods, while recently, a fast and notable transition to deep learning techniques has started. This chapter reviews how combining multi-sensor remote sensing data and deep learning allows us to better delineate (i.e. map) glaciers and detect their temporal changes. We explain how relying on deep learning multi-sensor frameworks to map glaciers benefits from the extensive availability of regional and global glacier inventories. We also analyse the rationale behind glacier mapping, the benefits of deep learning methodologies, and the inherent challenges in integrating multi-sensor earth observation data with deep learning algorithms. While our review aims to provide a broad overview of glacier mapping efforts, we highlight a few setups where deep learning multi-sensor remote sensing applications have a considerable potential added value. This includes applications for debris-covered and rock glaciers that are visually difficult to distinguish from surroundings and for calving glaciers that are in contact with the ocean. These specific cases are illustrated through a series of visual imageries, highlighting some significant advantages and challenges when detecting glacier changes, including dealing with seasonal snow cover, changing debris coverage, and distinguishing glacier fronts from the surrounding sea ice.

SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency

Arxiv ID: http://arxiv.org/abs/2409.12040v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 14:59:30+00:00
Updated: 2024-09-18 14:59:30+00:00
Authors: Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen
Comment: None
Journal: None
Summary: Remote Photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling physiological metrics measurement. Traditional rPPG models often struggle with poor generalization capacity in unseen domains. Current solutions to this problem is to improve its generalization in the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both traditional methods require access to both source domain data and target domain data, which cannot be implemented in scenarios with limited access to source data, and another issue is the privacy of accessing source domain data. In this paper, we propose the first Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains effectively and further enforces the alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our proposed method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss for distributional alignment, providing a valuable reference for future research and applications. The source code is available at https://github.com/XieYiping66/SFDA-rPPG

PAD-FT: A Lightweight Defense for Backdoor Attacks via Data Purification and Fine-Tuning

Arxiv ID: http://arxiv.org/abs/2409.12072v1
DOI: None
Categories: cs.CR, cs.AI, cs.CV
Links: PDF
Published: 2024-09-18 15:47:23+00:00
Updated: 2024-09-18 15:47:23+00:00
Authors: Yukai Xu, Yujie Gu, Kouichi Sakurai
Comment: None
Journal: None
Summary: Backdoor attacks pose a significant threat to deep neural networks, particularly as recent advancements have led to increasingly subtle implantation, making the defense more challenging. Existing defense mechanisms typically rely on an additional clean dataset as a standard reference and involve retraining an auxiliary model or fine-tuning the entire victim model. However, these approaches are often computationally expensive and not always feasible in practical applications. In this paper, we propose a novel and lightweight defense mechanism, termed PAD-FT, that does not require an additional clean dataset and fine-tunes only a very small part of the model to disinfect the victim model. To achieve this, our approach first introduces a simple data purification process to identify and select the most-likely clean data from the poisoned training dataset. The self-purified clean dataset is then used for activation clipping and fine-tuning only the last classification layer of the victim model. By integrating data purification, activation clipping, and classifier fine-tuning, our mechanism PAD-FT demonstrates superior effectiveness across multiple backdoor attack methods and datasets, as confirmed through extensive experimental evaluation.

Online Refractive Camera Model Calibration in Visual Inertial Odometry

Arxiv ID: http://arxiv.org/abs/2409.12074v1
DOI: None
Categories: cs.RO, cs.CV
Links: PDF
Published: 2024-09-18 15:48:05+00:00
Updated: 2024-09-18 15:48:05+00:00
Authors: Mohit Singh, Kostas Alexis
Comment: Accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), 8 pages
Journal: None
Summary: This paper presents a general refractive camera model and online co-estimation of odometry and the refractive index of unknown media. This enables operation in diverse and varying refractive fluids, given only the camera calibration in air. The refractive index is estimated online as a state variable of a monocular visual-inertial odometry framework in an iterative formulation using the proposed camera model. The method was verified on data collected using an underwater robot traversing inside a pool. The evaluations demonstrate convergence to the ideal refractive index for water despite significant perturbations in the initialization. Simultaneously, the approach enables on-par visual-inertial odometry performance in refractive media without prior knowledge of the refractive index or requirement of medium-specific camera calibration.

Denoising diffusion models for high-resolution microscopy image restoration

Arxiv ID: http://arxiv.org/abs/2409.12078v1
DOI: None
Categories: eess.IV, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 15:53:45+00:00
Updated: 2024-09-18 15:53:45+00:00
Authors: Pamela Osuna-Vargas, Maren H. Wehrheim, Lucas Zinz, Johanna Rahm, Ashwin Balakrishnan, Alexandra Kaminer, Mike Heilemann, Matthias Kaschube
Comment: None
Journal: None
Summary: Advances in microscopy imaging enable researchers to visualize structures at the nanoscale level thereby unraveling intricate details of biological organization. However, challenges such as image noise, photobleaching of fluorophores, and low tolerability of biological samples to high light doses remain, restricting temporal resolutions and experiment durations. Reduced laser doses enable longer measurements at the cost of lower resolution and increased noise, which hinders accurate downstream analyses. Here we train a denoising diffusion probabilistic model (DDPM) to predict high-resolution images by conditioning the model on low-resolution information. Additionally, the probabilistic aspect of the DDPM allows for repeated generation of images that tend to further increase the signal-to-noise ratio. We show that our model achieves a performance that is better or similar to the previously best-performing methods, across four highly diverse datasets. Importantly, while any of the previous methods show competitive performance for some, but not all datasets, our method consistently achieves high performance across all four data sets, suggesting high generalizability.

Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance

Arxiv ID: http://arxiv.org/abs/2409.12099v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 16:19:57+00:00
Updated: 2024-09-18 16:19:57+00:00
Authors: Jaehoon Joo, Taejin Jeong, Seongjae Hwang
Comment: None
Journal: None
Summary: Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.

SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba

Arxiv ID: http://arxiv.org/abs/2409.12108v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 16:26:56+00:00
Updated: 2024-09-18 16:26:56+00:00
Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian
Comment: None
Journal: None
Summary: Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer but is now widely used for various gastrointestinal lesions. Computer-assisted Surgery systems have played a crucial role in improving the precision and safety of ESD procedures, however, their effectiveness is limited by the accurate recognition of surgical phases. The intricate nature of ESD, with different lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal contexts in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models like Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate the processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.

Applications of Knowledge Distillation in Remote Sensing: A Survey

Arxiv ID: http://arxiv.org/abs/2409.12111v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 16:30:49+00:00
Updated: 2024-09-18 16:30:49+00:00
Authors: Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
Comment: 50 pages, 11 figures and 9 tables
Journal: None
Summary: With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.

Optimal Visual Search with Highly Heuristic Decision Rules

Arxiv ID: http://arxiv.org/abs/2409.12124v1
DOI: None
Categories: q-bio.NC, cs.CV, stat.AP
Links: PDF
Published: 2024-09-18 16:46:36+00:00
Updated: 2024-09-18 16:46:36+00:00
Authors: Anqi Zhang, Wilson S. Geisler
Comment: None
Journal: None
Summary: Visual search is a fundamental natural task for humans and other animals. We investigated the decision processes humans use when searching briefly presented displays having well-separated potential target-object locations. Performance was compared with the Bayesian-optimal decision process under the assumption that the information from the different potential target locations is statistically independent. Surprisingly, humans performed slightly better than optimal, despite humans' substantial loss of sensitivity in the fovea, and the implausibility of the human brain replicating the optimal computations. We show that three factors can quantitatively explain these seemingly paradoxical results. Most importantly, simple and fixed heuristic decision rules reach near optimal search performance. Secondly, foveal neglect primarily affects only the central potential target location. Finally, spatially correlated neural noise causes search performance to exceed that predicted for independent noise. These findings have far-reaching implications for understanding visual search tasks and other identification tasks in humans and other animals.

MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion

Arxiv ID: http://arxiv.org/abs/2409.12140v1
DOI: None
Categories: cs.CV, cs.MM
Links: PDF
Published: 2024-09-18 17:03:30+00:00
Updated: 2024-09-18 17:03:30+00:00
Authors: Kalakonda Sai Shashank, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
Comment: None
Journal: None
Summary: We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: https://motion-rag.github.io/

Autopet III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT

Arxiv ID: http://arxiv.org/abs/2409.12155v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 17:16:57+00:00
Updated: 2024-09-18 17:16:57+00:00
Authors: Hamza Kalisch, Fabian Hörst, Ken Herrmann, Jens Kleesiek, Constantin Seibold
Comment: AutoPET III challenge submission
Journal: None
Summary: Lesion segmentation in PET/CT imaging is essential for precise tumor characterization, which supports personalized treatment planning and enhances diagnostic precision in oncology. However, accurate manual segmentation of lesions is time-consuming and prone to inter-observer variability. Given the rising demand and clinical use of PET/CT, automated segmentation methods, particularly deep-learning-based approaches, have become increasingly more relevant. The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images in a multitracer multicenter setting, addressing the clinical need for quantitative, robust, and generalizable solutions. Building on previous challenges, the third iteration of the autoPET challenge introduces a more diverse dataset featuring two different tracers (FDG and PSMA) from two clinical centers. To this extent, we developed a classifier that identifies the tracer of the given PET/CT based on the Maximum Intensity Projection of the PET scan. We trained two individual nnUNet-ensembles for each tracer where anatomical labels are included as a multi-label task to enhance the model's performance. Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% for the publicly available FDG and PSMA datasets, respectively. The code is available at https://github.com/hakal104/autoPETIII/ .

JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation

Arxiv ID: http://arxiv.org/abs/2409.12156v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 17:18:13+00:00
Updated: 2024-09-18 17:18:13+00:00
Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
Comment: Accepted by BMVC 2024. Project Page: https://starc52.github.io/publications/2024-07-19-JEAN
Journal: None
Summary: We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.

Precise Forecasting of Sky Images Using Spatial Warping

Arxiv ID: http://arxiv.org/abs/2409.12162v1
DOI: 10.1109/ICCVW54120.2021.00133
Categories: cs.CV
Links: PDF
Published: 2024-09-18 17:25:42+00:00
Updated: 2024-09-18 17:25:42+00:00
Authors: Leron Julian, Aswin C. Sankaranarayanan
Comment: None
Journal: None
Summary: The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide FOV imagers capture a distorted image of sky image, where regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon which especially affects prediction over longer time horizons. In this work, we combat the aforementioned constraint by introducing a deep learning method to predict a future sky image frame with higher resolution than previous methods. Our main contribution is to derive an optimal warping method to counter the adverse affects of clouds at the horizon, and learn a framework for future sky image prediction which better determines cloud evolution for longer time horizons.

multiPI-TransBTS: A Multi-Path Learning Framework for Brain Tumor Image Segmentation Based on Multi-Physical Information

Arxiv ID: http://arxiv.org/abs/2409.12167v1
DOI: None
Categories: eess.IV, cs.CV
Links: PDF
Published: 2024-09-18 17:35:19+00:00
Updated: 2024-09-18 17:35:19+00:00
Authors: Hongjun Zhu, Jiaohang Huang, Kuo Chen, Xuehui Ying, Ying Qian
Comment: None
Journal: None
Summary: Brain Tumor Segmentation (BraTS) plays a critical role in clinical diagnosis, treatment planning, and monitoring the progression of brain tumors. However, due to the variability in tumor appearance, size, and intensity across different MRI modalities, automated segmentation remains a challenging task. In this study, we propose a novel Transformer-based framework, multiPI-TransBTS, which integrates multi-physical information to enhance segmentation accuracy. The model leverages spatial information, semantic information, and multi-modal imaging data, addressing the inherent heterogeneity in brain tumor characteristics. The multiPI-TransBTS framework consists of an encoder, an Adaptive Feature Fusion (AFF) module, and a multi-source, multi-scale feature decoder. The encoder incorporates a multi-branch architecture to separately extract modality-specific features from different MRI sequences. The AFF module fuses information from multiple sources using channel-wise and element-wise attention, ensuring effective feature recalibration. The decoder combines both common and task-specific features through a Task-Specific Feature Introduction (TSFI) strategy, producing accurate segmentation outputs for Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET) regions. Comprehensive evaluations on the BraTS2019 and BraTS2020 datasets demonstrate the superiority of multiPI-TransBTS over the state-of-the-art methods. The model consistently achieves better Dice coefficients, Hausdorff distances, and Sensitivity scores, highlighting its effectiveness in addressing the BraTS challenges. Our results also indicate the need for further exploration of the balance between precision and recall in the ET segmentation task. The proposed framework represents a significant advancement in BraTS, with potential implications for improving clinical outcomes for brain tumor patients.

Massively Multi-Person 3D Human Motion Forecasting with Scene Context

Arxiv ID: http://arxiv.org/abs/2409.12189v1
DOI: None
Categories: cs.CV, cs.LG, I.2; I.4
Links: PDF
Published: 2024-09-18 17:58:51+00:00
Updated: 2024-09-18 17:58:51+00:00
Authors: Felix B Mueller, Julian Tanke, Juergen Gall
Comment: 14 pages, 6 figures
Journal: None
Summary: Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.

Bundle Adjustment in the Eager Mode

Arxiv ID: http://arxiv.org/abs/2409.12190v1
DOI: None
Categories: cs.RO, cs.CV
Links: PDF
Published: 2024-09-18 17:59:29+00:00
Updated: 2024-09-18 17:59:29+00:00
Authors: Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang
Comment: None
Journal: None
Summary: Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g$^2$o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Arxiv ID: http://arxiv.org/abs/2409.12191v1
DOI: None
Categories: cs.CV, cs.AI, cs.CL
Links: PDF
Published: 2024-09-18 17:59:32+00:00
Updated: 2024-09-18 17:59:32+00:00
Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
Comment: Code is available at https://github.com/QwenLM/Qwen2-VL
Journal: None
Summary: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at \url{https://github.com/QwenLM/Qwen2-VL}.

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Arxiv ID: http://arxiv.org/abs/2409.12192v1
DOI: None
Categories: cs.RO, cs.AI, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 17:59:43+00:00
Updated: 2024-09-18 17:59:43+00:00
Authors: Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto
Comment: None
Journal: None
Summary: Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives, and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io

Vista3D: Unravel the 3D Darkside of a Single Image

Arxiv ID: http://arxiv.org/abs/2409.12193v1
DOI: None
Categories: cs.CV, cs.AI, cs.GT, cs.MM
Links: PDF
Published: 2024-09-18 17:59:44+00:00
Updated: 2024-09-18 17:59:44+00:00
Authors: Qiuhong Shen, Xingyi Yang, Michael Bi Mi, Xinchao Wang
Comment: ECCV'2024
Journal: None
Summary: We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from learned Gaussian Splatting, optimizing it with a differentiable isosurface representation. Furthermore, it elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both visible and obscured aspects of objects. Additionally, it harmonizes gradients from 2D diffusion prior with 3D-aware diffusion priors by angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at https://github.com/florinshen/Vista3D.

GCA-SUN: A Gated Context-Aware Swin-UNet for Exemplar-Free Counting

Arxiv ID: http://arxiv.org/abs/2409.12249v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 18:14:00+00:00
Updated: 2024-09-18 18:14:00+00:00
Authors: Yuzhe Wu, Yipeng Xu, Tianyu Xu, Jialu Zhang, Jianfeng Ren, Xudong Jiang
Comment: None
Journal: None
Summary: Exemplar-Free Counting aims to count objects of interest without intensive annotations of objects or exemplars. To achieve this, we propose Gated Context-Aware Swin-UNet (GCA-SUN) to directly map an input image to the density map of countable objects. Specifically, a Gated Context-Aware Modulation module is designed in the encoder to suppress irrelevant objects or background through a gate mechanism and exploit the attentive support of objects of interest through a self-similarity matrix. The gate strategy is also incorporated into the bottleneck network and the decoder to highlight the features most relevant to objects of interest. By explicitly exploiting the attentive support among countable objects and eliminating irrelevant features through the gate mechanisms, the proposed GCA-SUN focuses on and counts objects of interest without relying on predefined categories or exemplars. Experimental results on the FSC-147 and CARPK datasets demonstrate that GCA-SUN outperforms state-of-the-art methods.

WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

Arxiv ID: http://arxiv.org/abs/2409.12259v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 18:46:51+00:00
Updated: 2024-09-18 18:46:51+00:00
Authors: Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, Stefanos Zafeiriou
Comment: Project Page https://rolpotamias.github.io/WiLoR
Journal: None
Summary: In recent years, 3D hand pose estimation methods have garnered significant attention due to their extensive applications in human-computer interaction, virtual reality, and robotics. In contrast, there has been a notable gap in hand detection pipelines, posing significant challenges in constructing effective real-world multi-hand reconstruction systems. In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model. To tackle the limitations of previous methods and build a robust and stable detection network, we introduce a large-scale dataset with over than 2M in-the-wild hand images with diverse lighting, illumination, and occlusion conditions. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks. Finally, we showcase the effectiveness of our pipeline to achieve smooth 3D hand tracking from monocular videos, without utilizing any temporal components. Code, models, and dataset are available https://rolpotamias.github.io/WiLoR.

Unsupervised Feature Orthogonalization for Learning Distortion-Invariant Representations

Arxiv ID: http://arxiv.org/abs/2409.12276v1
DOI: None
Categories: eess.IV, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 19:25:38+00:00
Updated: 2024-09-18 19:25:38+00:00
Authors: Sebastian Doerrich, Francesco Di Salvo, Christian Ledig
Comment: Accepted at RROW@BMVC 2024 (Workshop on Robust Recognition in the Open World at the British Machine Vision Conference)
Journal: None
Summary: This study introduces unORANIC+, a novel method that integrates unsupervised feature orthogonalization with the ability of a Vision Transformer to capture both local and global relationships for improved robustness and generalizability. The streamlined architecture of unORANIC+ effectively separates anatomical and image-specific attributes, resulting in robust and unbiased latent representations that allow the model to demonstrate excellent performance across various medical image analysis tasks and diverse datasets. Extensive experimentation demonstrates unORANIC+'s reconstruction proficiency, corruption resilience, as well as capability to revise existing image distortions. Additionally, the model exhibits notable aptitude in downstream tasks such as disease classification and corruption detection. We confirm its adaptability to diverse datasets of varying image sources and sample sizes which positions the method as a promising algorithm for advanced medical image analysis, particularly in resource-constrained environments lacking large, tailored datasets. The source code is available at https://github.com/sdoerrich97/unoranic-plus .

Self-Supervised Pre-training Tasks for an fMRI Time-series Transformer in Autism Detection

Arxiv ID: http://arxiv.org/abs/2409.12304v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 20:29:23+00:00
Updated: 2024-09-18 20:29:23+00:00
Authors: Yinchi Zhou, Peiyu Duan, Yuexi Du, Nicha C. Dvornek
Comment: None
Journal: None
Summary: Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that encompasses a wide variety of symptoms and degrees of impairment, which makes the diagnosis and treatment challenging. Functional magnetic resonance imaging (fMRI) has been extensively used to study brain activity in ASD, and machine learning methods have been applied to analyze resting state fMRI (rs-fMRI) data. However, fewer studies have explored the recent transformer-based models on rs-fMRI data. Given the superiority of transformer models in capturing long-range dependencies in sequence data, we have developed a transformer-based self-supervised framework that directly analyzes time-series fMRI data without computing functional connectivity. To address over-fitting in small datasets and enhance the model performance, we propose self-supervised pre-training tasks to reconstruct the randomly masked fMRI time-series data, investigating the effects of various masking strategies. We then finetune the model for the ASD classification task and evaluate it using two public datasets and five-fold cross-validation with different amounts of training data. The experiments show that randomly masking entire ROIs gives better model performance than randomly masking time points in the pre-training step, resulting in an average improvement of 10.8% for AUC and 9.3% for subject accuracy compared with the transformer model trained from scratch across different levels of training data availability. Our code is available on GitHub.

Measuring Sound Symbolism in Audio-visual Models

Arxiv ID: http://arxiv.org/abs/2409.12306v1
DOI: None
Categories: cs.CL, cs.CV, cs.SD, eess.AS
Links: PDF
Published: 2024-09-18 20:33:54+00:00
Updated: 2024-09-18 20:33:54+00:00
Authors: Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
Comment: SLT 2024
Journal: None
Summary: Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations$\unicode{x2013}$known as sound symbolism$\unicode{x2013}$which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.

Understanding Implosion in Text-to-Image Generative Models

Arxiv ID: http://arxiv.org/abs/2409.12314v1
DOI: 10.1145/3658644.3690205
Categories: cs.CR, cs.AI, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 21:06:45+00:00
Updated: 2024-09-18 21:06:45+00:00
Authors: Wenxin Ding, Cathy Y. Li, Shawn Shan, Ben Y. Zhao, Haitao Zheng
Comment: ACM CCS 2024
Journal: None
Summary: Recent works show that text-to-image generative models are surprisingly vulnerable to a variety of poisoning attacks. Empirical results find that these models can be corrupted by altering associations between individual text prompts and associated visual features. Furthermore, a number of concurrent poisoning attacks can induce "model implosion," where the model becomes unable to produce meaningful images for unpoisoned prompts. These intriguing findings highlight the absence of an intuitive framework to understand poisoning attacks on these models. In this work, we establish the first analytical framework on robustness of image generative models to poisoning attacks, by modeling and analyzing the behavior of the cross-attention mechanism in latent diffusion models. We model cross-attention training as an abstract problem of "supervised graph alignment" and formally quantify the impact of training data by the hardness of alignment, measured by an Alignment Difficulty (AD) metric. The higher the AD, the harder the alignment. We prove that AD increases with the number of individual prompts (or concepts) poisoned. As AD grows, the alignment task becomes increasingly difficult, yielding highly distorted outcomes that frequently map meaningful text prompts to undefined or meaningless visual representations. As a result, the generative model implodes and outputs random, incoherent images at large. We validate our analytical framework through extensive experiments, and we confirm and explain the unexpected (and unexplained) effect of model implosion while producing new, unforeseen insights. Our work provides a useful tool for studying poisoning attacks against diffusion models and their defenses.

A large-scale study of performance and equity of commercial remote identity verification technologies across demographics

Arxiv ID: http://arxiv.org/abs/2409.12318v1
DOI: None
Categories: cs.CV, ACM-class:I5
Links: PDF
Published: 2024-09-18 21:15:31+00:00
Updated: 2024-09-18 21:15:31+00:00
Authors: Kaniz Fatima, Michael Schuckers, Gerardo Cruz-Ortiz, Daqing Hou, Sandip Purnapatra, Tiffany Andrews, Ambuj Neupane, Brandeis Marshall, Stephanie Schuckers
Comment: None
Journal: None
Summary: As more types of transactions move online, there is an increasing need to verify someone's identity remotely. Remote identity verification (RIdV) technologies have emerged to fill this need. RIdV solutions typically use a smart device to validate an identity document like a driver's license by comparing a face selfie to the face photo on the document. Recent research has been focused on ensuring that biometric systems work fairly across demographic groups. This study assesses five commercial RIdV solutions for equity across age, gender, race/ethnicity, and skin tone across 3,991 test subjects. This paper employs statistical methods to discern whether the RIdV result across demographic groups is statistically distinguishable. Two of the RIdV solutions were equitable across all demographics, while two RIdV solutions had at least one demographic that was inequitable. For example, the results for one technology had a false negative rate of 10.5% +/- 4.5% and its performance for each demographic category was within the error bounds, and, hence, were equitable. The other technologies saw either poor overall performance or inequitable performance. For one of these, participants of the race Black/African American (B/AA) as well as those with darker skin tones (Monk scale 7/8/9/10) experienced higher false rejections. Finally, one technology demonstrated more favorable but inequitable performance for the Asian American and Pacific Islander (AAPI) demographic. This study confirms that it is necessary to evaluate products across demographic groups to fully understand the performance of remote identity verification technologies.

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Arxiv ID: http://arxiv.org/abs/2409.12319v1
DOI: None
Categories: cs.CV, cs.MM, cs.SD, eess.AS
Links: PDF
Published: 2024-09-18 21:17:27+00:00
Updated: 2024-09-18 21:17:27+00:00
Authors: Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic
Comment: The code will be made available at this link: https://github.com/umbertocappellazzo/AVSR-LLMs
Journal: None
Summary: Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recognition (ASR) abilities by just concatenating the audio tokens, computed with an audio encoder, and the text tokens to achieve state-of-the-art results. On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the resulting response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters as only modality-specific projectors and LoRA modules are trained whereas the multi-modal encoders and LLM are kept frozen. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.

Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus

Arxiv ID: http://arxiv.org/abs/2409.12323v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 21:36:37+00:00
Updated: 2024-09-18 21:36:37+00:00
Authors: Jinchang Zhang, Ningning Xu, Hao Zhang, Guoyu Lu
Comment: None
Journal: None
Summary: Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.

ReFu: Recursive Fusion for Exemplar-Free 3D Class-Incremental Learning

Arxiv ID: http://arxiv.org/abs/2409.12326v1
DOI: None
Categories: cs.CV
Links: PDF
Published: 2024-09-18 21:44:33+00:00
Updated: 2024-09-18 21:44:33+00:00
Authors: Yi Yang, Lei Zhong, Huiping Zhuang
Comment: None
Journal: None
Summary: We introduce a novel Recursive Fusion model, dubbed ReFu, designed to integrate point clouds and meshes for exemplar-free 3D Class-Incremental Learning, where the model learns new 3D classes while retaining knowledge of previously learned ones. Unlike existing methods that either rely on storing historical data to mitigate forgetting or focus on single data modalities, ReFu eliminates the need for exemplar storage while utilizing the complementary strengths of both point clouds and meshes. To achieve this, we introduce a recursive method which continuously accumulates knowledge by updating the regularized auto-correlation matrix. Furthermore, we propose a fusion module, featuring a Pointcloud-guided Mesh Attention Layer that learns correlations between the two modalities. This mechanism effectively integrates point cloud and mesh features, leading to more robust and stable continual learning. Experiments across various datasets demonstrate that our proposed framework outperforms existing methods in 3D class-incremental learning. Project Page: https://arlo397.github.io/ReFu/

Scale-specific auxiliary multi-task contrastive learning for deep liver vessel segmentation

Arxiv ID: http://arxiv.org/abs/2409.12333v1
DOI: 10.1109/ISBI53787.2023.10230364
Categories: eess.IV, cs.AI, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 22:03:22+00:00
Updated: 2024-09-18 22:03:22+00:00
Authors: Amine Sadikine, Bogdan Badic, Jean-Pierre Tasu, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Comment: 5 pages, 5 figures, conference
Journal: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 2023, pp. 1-5
Summary: Extracting hepatic vessels from abdominal images is of high interest for clinicians since it allows to divide the liver into functionally-independent Couinaud segments. In this respect, an automated liver blood vessel extraction is widely summoned. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi-scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend standard 3D UNet to multi-task learning by incorporating scale-specific auxiliary tasks and contrastive learning to encourage the discrimination between scales in the shared representation. Promising results, depicted in several evaluation metrics, are revealed on the public 3D-IRCADb dataset.

Deep vessel segmentation with joint multi-prior encoding

Arxiv ID: http://arxiv.org/abs/2409.12334v1
DOI: 10.1109/ISBI56570.2024.10635835
Categories: eess.IV, cs.AI, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 22:03:46+00:00
Updated: 2024-09-18 22:03:46+00:00
Authors: Amine Sadikine, Bogdan Badic, Enzo Ferrante, Vincent Noblet, Pascal Ballet, Dimitris Visvikis, Pierre-Henri Conze
Comment: 5 pages, 3 figures, conference
Journal: 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 2024, pp. 1-5
Summary: The precise delineation of blood vessels in medical images is critical for many clinical applications, including pathology detection and surgical planning. However, fully-automated vascular segmentation is challenging because of the variability in shape, size, and topology. Manual segmentation remains the gold standard but is time-consuming, subjective, and impractical for large-scale studies. Hence, there is a need for automatic and reliable segmentation methods that can accurately detect blood vessels from medical images. The integration of shape and topological priors into vessel segmentation models has been shown to improve segmentation accuracy by offering contextual information about the shape of the blood vessels and their spatial relationships within the vascular tree. To further improve anatomical consistency, we propose a new joint prior encoding mechanism which incorporates both shape and topology in a single latent space. The effectiveness of our method is demonstrated on the publicly available 3D-IRCADb dataset. More globally, the proposed approach holds promise in overcoming the challenges associated with automatic vessel delineation and has the potential to advance the field of deep priors encoding.

Axial Attention Transformer Networks: A New Frontier in Breast Cancer Detection

Arxiv ID: http://arxiv.org/abs/2409.12347v1
DOI: None
Categories: eess.IV, cs.AI, cs.CV, cs.LG
Links: PDF
Published: 2024-09-18 22:40:29+00:00
Updated: 2024-09-18 22:40:29+00:00
Authors: Weijie He, Runyuan Bao, Yiru Cang, Jianjun Wei, Yang Zhang, Jiacheng Hu
Comment: None
Journal: None
Summary: This paper delves into the challenges and advancements in the field of medical image segmentation, particularly focusing on breast cancer diagnosis. The authors propose a novel Transformer-based segmentation model that addresses the limitations of traditional convolutional neural networks (CNNs), such as U-Net, in accurately localizing and segmenting small lesions within breast cancer images. The model introduces an axial attention mechanism to enhance the computational efficiency and address the issue of global contextual information that is often overlooked by CNNs. Additionally, the paper discusses improvements tailored to the small dataset challenge, including the incorporation of relative position information and a gated axial attention mechanism to refine the model's focus on relevant features. The proposed model aims to significantly improve the segmentation accuracy of breast cancer images, offering a more efficient and effective tool for computer-aided diagnosis.

Advancing Cucumber Disease Detection in Agriculture through Machine Vision and Drone Technology

Arxiv ID: http://arxiv.org/abs/2409.12350v1
DOI: None
Categories: cs.CV, cs.AI
Links: PDF
Published: 2024-09-18 22:54:23+00:00
Updated: 2024-09-18 22:54:23+00:00
Authors: Syada Tasfia Rahman, Nishat Vasker, Amir Khabbab Ahammed, Mahamudul Hasan
Comment: 10 page and 6 figure
Journal: None
Summary: This study uses machine vision and drone technologies to propose a unique method for the diagnosis of cucumber disease in agriculture. The backbone of this research is a painstakingly curated dataset of hyperspectral photographs acquired under genuine field conditions. Unlike earlier datasets, this study included a wide variety of illness types, allowing for precise early-stage detection. The model achieves an excellent 87.5% accuracy in distinguishing eight unique cucumber illnesses after considerable data augmentation. The incorporation of drone technology for high-resolution images improves disease evaluation. This development has enormous potential for improving crop management, lowering labor costs, and increasing agricultural productivity. This research, which automates disease detection, represents a significant step toward a more efficient and sustainable agricultural future.