• @HaggunenonsOPM
    11 years ago

    Detailed summary by ChatGPT-4

    The paper "Category-Specific Prompts for Animal Action Recognition with Pretrained Vision-Language Models" presents a new framework for recognizing animal actions in videos, addressing challenges that set this task apart from human action recognition: scarce annotated training data, large intra-class variation, and interference from cluttered backgrounds. Video-based animal action recognition has remained largely unexplored because of these difficulties and the complex, diverse morphologies of animal species.

    The framework is built on CLIP, a contrastive vision-language pretrained model known for strong zero-shot generalization. The model is adapted to encode both video and text representations, with two transformer blocks added to model spatiotemporal information. The key innovation is a category-specific prompting module that generates adaptive prompts for both text and video based on the animal category detected in the input video. This yields more precise, customized descriptions for each animal-action pair, improving the alignment between the textual and visual spaces and reducing interference from background noise in videos.
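
    To make the idea concrete, here is a minimal sketch of how category-specific prompting on top of CLIP-style encoders could look. It is not the authors' implementation: the module names, the prompt template, the stand-in encoders, and all dimensions are assumptions for illustration only.

    ```python
    # Minimal sketch (NOT the paper's code) of category-specific prompting on top of
    # CLIP-style encoders. Module names, the prompt template, and all dimensions are
    # illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CategorySpecificPromptModel(nn.Module):
        def __init__(self, text_encoder, video_encoder, categories, actions, embed_dim=512):
            super().__init__()
            self.text_encoder = text_encoder    # assumed: list of strings -> (N, D) features
            self.video_encoder = video_encoder  # assumed: (B, T, D) frame features -> (B, D) clip features
            self.categories = categories        # animal category names, e.g. ["bird", "lion"]
            self.actions = actions              # action class names, e.g. ["flying", "eating"]
            # Lightweight head standing in for the paper's category feature extraction module:
            # it predicts the animal category from the video embedding.
            self.category_head = nn.Linear(embed_dim, len(categories))

        def build_prompts(self, category):
            # Category-specific textual prompts, one per action, conditioned on the
            # detected animal category (the template is a made-up example).
            return [f"a video of a {category} {action}" for action in self.actions]

        def forward(self, video):
            v = F.normalize(self.video_encoder(video), dim=-1)   # (B, D) video embeddings
            cat_idx = self.category_head(v).argmax(dim=-1)       # predicted category per clip
            logits = []
            for b in range(v.size(0)):
                prompts = self.build_prompts(self.categories[cat_idx[b]])
                t = F.normalize(self.text_encoder(prompts), dim=-1)  # (num_actions, D)
                logits.append(v[b] @ t.t())                          # cosine similarity per action
            return torch.stack(logits)                               # (B, num_actions) action scores

    if __name__ == "__main__":
        D = 512
        # Stand-in encoders so the sketch runs without CLIP weights.
        video_enc = lambda frames: frames.mean(dim=1)                # average frame features
        text_enc = lambda prompts: torch.randn(len(prompts), D)      # placeholder text features
        model = CategorySpecificPromptModel(
            text_enc, video_enc,
            categories=["bird", "lion"],
            actions=["flying", "eating", "grooming"],
            embed_dim=D,
        )
        scores = model(torch.randn(2, 8, D))   # 2 clips, 8 frames each
        print(scores.shape)                    # torch.Size([2, 3])
    ```

    The sketch only covers the text side; per the summary, the paper also adapts the video representation using the detected category, which is not shown here.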

    Experiments were conducted on the Animal Kingdom dataset, a diverse collection of 50 hours of video clips covering over 850 animal species and 140 action classes in varied environments and weather conditions. The dataset was split into training and testing sets, with a separate setting for action recognition on unseen animal categories.

    The proposed method was compared against five state-of-the-art action recognition models, spanning traditional approaches based on convolutional neural networks (CNNs) and transformers as well as methods built on image-language pretrained models. The Category-CLIP variant, which uses the category feature extraction module, outperformed the best image-language pretraining method by 3.67% mAP and the best traditional method by 30.47% mAP. It also generalized well to unseen animals, outperforming the other methods in that setting.
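
    For reference, a hedged sketch of how a macro-averaged mAP score of this kind is typically computed for multi-label action recognition; the shapes, random data, and use of scikit-learn are assumptions, not details taken from the paper.

    ```python
    # Illustrative only: macro-averaged mAP for multi-label action recognition, the kind
    # of metric the comparison above refers to. Shapes and random data are made up.
    import numpy as np
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)
    num_clips, num_actions = 100, 140                           # 140 action classes, as in Animal Kingdom
    y_true = rng.integers(0, 2, size=(num_clips, num_actions))  # ground-truth multi-label targets
    y_score = rng.random((num_clips, num_actions))              # model confidence per action class

    # Average precision is computed per action class, then averaged across classes.
    mAP = average_precision_score(y_true, y_score, average="macro")
    print(f"mAP: {mAP:.4f}")
    ```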

    In summary, this paper introduces an innovative approach to animal action recognition that addresses specific challenges in the field, showing superior performance and generalization ability compared to existing methods.