 
 
Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/14799
Full metadata record
| DC Field | Value | Language | 
|---|---|---|
| dc.contributor.author | Banda, Gourinath | en_US | 
| dc.date.accessioned | 2024-10-25T05:51:04Z | - | 
| dc.date.available | 2024-10-25T05:51:04Z | - | 
| dc.date.issued | 2024 | - | 
| dc.identifier.citation | Yadav, R., Halder, R., & Banda, G. (2024). Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition. IEEE Access. Scopus. https://doi.org/10.1109/ACCESS.2024.3457024 | en_US | 
| dc.identifier.issn | 2169-3536 | - | 
| dc.identifier.other | EID(2-s2.0-85204128369) | - | 
| dc.identifier.uri | https://doi.org/10.1109/ACCESS.2024.3457024 | - | 
| dc.identifier.uri | https://dspace.iiti.ac.in/handle/123456789/14799 | - | 
| dc.description.abstract | Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either feature a complex framework requiring individual action labels or need more adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction. Furthermore, we elucidate how this methodology can effectively capture task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) the generation of target-specific spatiotemporal features yields favourable outcomes for various datasets; and 3) this method demonstrates effectiveness even for datasets with a small number of videos, highlighting its capability with limited training data. Further, the existing GAR datasets have fewer videos per class and only a few actors are considered, restricting the existing model from being generalised effectively. To this aim, we introduce 923 videos for a crime activity named IITP Hostage, which contains two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves MCA of 96.8%, 97.0%, 97.0% on Collective Activity Dataset (CAD), new CAD, extended CAD datasets and 84.3%, 95.6%, 96.78% for IITP Hostage, hostage+CAD and subset of UCF crime datasets. The hostage and non-hostage scenarios introduce additional complexity, making it more challenging for the model to accurately recognize the activities compared to hostage+CAD and other datasets. This observation underscores the necessity to delve deeper into the complexity of GAR activities. © 2013 IEEE. | en_US | 
| dc.language.iso | en | en_US | 
| dc.publisher | Institute of Electrical and Electronics Engineers Inc. | en_US | 
| dc.source | IEEE Access | en_US | 
| dc.subject | Group activity recognition (GAR) | en_US | 
| dc.subject | hostage crime | en_US | 
| dc.subject | IITP hostage dataset | en_US | 
| dc.subject | masked autoencoder | en_US | 
| dc.subject | spatial and temporal interaction | en_US | 
| dc.subject | vision transformer | en_US | 
| dc.title | Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition | en_US | 
| dc.type | Journal Article | en_US | 
| dc.rights.license | All Open Access, Gold | - | 

Appears in Collections: Department of Computer Science and Engineering

Files in This Item:
There are no files associated with this item.
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
