Research
My research focuses on representation learning in long videos, leveraging vision-language models, multimodal inputs, and both egocentric and exocentric viewpoints. I am also interested in vision-language-action models for robot learning and the use of diffusion models for video generation.
News
2025
May: Outstanding Reviewer for CVPR 2025.
Apr: Serving as Area Chair for NeurIPS 2025.
Feb: 3rd place in the Elderly Action Recognition Challenge at WACV 2025.
2024
Dec: 2 papers accepted to AAAI 2025.
Oct: 3 papers accepted to NeurIPS 2024 workshops; early version of LLAVIDAL presented at VLM workshop.
Jul: 2 papers accepted to ECCV 2024, 1 to ACM MM 2024.
Apr: DeepFake Generation paper accepted at CVPRW 2024.
Feb: 3 papers accepted to CVPR 2024.
2023
Oct: Paper accepted to WACV 2024.
Aug: DeepFake detection paper at ICCVW 2023 and CLIP for Action Detection at BMVC 2023.
May: Serving as SPC at AAAI 2024 for the second time.
May: Dominick Reilly awarded Chateaubriand Fellowship.
Feb: First NSF Grant awarded – Link.
Jan: Paper accepted to ISBI 2023 with Stony Brook Medicine collaborators.
2022
Sep: Two papers accepted to NeurIPS 2022.
Aug: Paper accepted to WACV 2023 (first round).
Aug: Joined UNC Charlotte as Assistant Professor.
Lab members
Other Current Students
- Nitin Chandrasekhar (UG student at UNC Charlotte)
- Drew O’Donnell (UG student at UNC Charlotte)
Selected Publications (For full list of papers, visit my Google Scholar.)
Preprints
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
Dominick Reilly, Manish Kumar Govind, Le Xue, and Srijan Das.
Preprint
arXiv
/
Code
We leverage the complementary nature of egocentric views to enhance LVLMs’ understanding of exocentric ADL videos through online ego2exo distillation.
MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, and Srijan Das.
Preprint
arXiv
/
code
MS-Temba is the first Mamba-based architecture for action detection in long untrimmed videos that can be trained and tested on an NVIDIA Jetson Nano.
2025
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Saarthak Kapse, Pushpak Pati, Srikar Yellapragada, Srijan Das, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna.
To Appear in ICCV 2025
arXiv
/
Code
Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO) aligns WSIs with a Concept Prior for delivering clinically meaningful interpretability.
MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild
Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Mayur Jagdishbhai Patel, Hongfei Xue, Ahmed Helmy, Srijan Das, Pu Wang.
To Appear in ICCV 2025
arXiv
/
Website
A novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes.
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das.
CVPR 2025
arXiv
/
website
/
code
LLAVIDAL, a Large Language Vision Model, incorporates 3D poses and relevant object trajectories to understand the intricate spatiotemporal relationships within ADLs.
SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, and Srijan Das.
AAAI 2025
arXiv
/
Code
SKI models introduce 3D skeletons into the vision-language embedding space to enable effective zero-shot learning for ADL.
GenHMR: Generative Human Mesh Recovery
Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen.
AAAI 2025
arXiv
/
Website
A generative framework that reformulates monocular HMR as an image-conditioned generative task.
2024
Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer
Wenhan Wu, Ce Zheng, Zihao Yang, Chen Chen, Srijan Das, Aidong Lu.
ACM MM 2024
arXiv
/
code
A frequency-aware attention module to unweave skeleton frequency representations for action recognition.
Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier
Prantik Howlader, Srijan Das, Hieu Le, Dimitris Samaras.
ECCV 2024
arXiv
/
code
A novel plug-in module designed for existing semi-supervised segmentation frameworks that offers patch-level supervision.
BAMM: Bidirectional Autoregressive Motion Model
Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, Chen Chen.
ECCV 2024
arXiv
/
website
/
code
A novel text-to-motion generation framework. BAMM captures rich and bidirectional dependencies among motion tokens.
Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living
Dominick Reilly and Srijan Das.
CVPR 2024
arXiv
/
code
We introduce the first Pose Induced Video Transformer: PI-ViT (or π-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information.
SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology
Saarthak Kapse*, Pushpak Pati*, Srijan Das, Jingwei Zhang, Chao Chen, Maria Vakalopoulou, Joel Saltz, Dimitris Samaras, Rajarsi Gupta, Prateek Prasanna.
CVPR 2024
arXiv
/
code
Self-Interpretable MIL (SI-MIL), the first interpretable-by-design MIL method for gigapixel WSIs, which provides de novo feature-level interpretations grounded on pathological insights for a WSI.
Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?
Aritra Dutta, Srijan Das, Jacob Nielsen, Rajatsubhra Chakraborty, Mubarak Shah.
CVPR 2024
arXiv
/
Website
We present MAVREC, a video dataset where we record synchronized scenes from different perspectives -- ground camera and drone-mounted camera.
Attention de-sparsification Matters: Inducing Diversity in Digital Pathology Representation Learning
Saarthak Kapse, Srijan Das, Jingwei Zhang, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna.
Medical Image Analysis (IF 10.9)
arXiv
A diversity-inducing pretraining technique, tailored to enhance representation learning in digital pathology.
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, Soumyajit Karmakar, Shyam Marjit, Xiang Li, Abhijit Das, and Michael S. Ryoo.
WACV 2024
arXiv
/
code
/
Poster
/
Video
This paper shows that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task is surprisingly beneficial when the amount of training data is limited.
2023
Attributes-Aware Network for Temporal Action Detection
Rui Dai, Srijan Das, Michael S. Ryoo, Francois Bremond.
BMVC 2023
arXiv / video
This paper explains how to utilize OpenAI's CLIP for long-term action detection in videos.
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Srijan Das and Michael S. Ryoo.
18th International Conference on Machine Vision Applications, July 2023
arXiv
/
Poster
/
Best Poster Award
This paper focuses on designing video augmentation for self-supervised learning. We propose CMMC, which makes use of other modalities in videos for data mixing.
ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints
Srijan Das and Michael S. Ryoo.
WACV 2023
arXiv
A framework for learning self-supervised video representation that is invariant to unseen camera viewpoints.
2022
Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?
Xiang Li, Jinghuan Shang, Srijan Das, Michael S. Ryoo.
NeurIPS 2022
arXiv
/
code
The impact of existing self-supervised losses within a joint learning framework for RL is limited, and no single method dominates across all tasks.
Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space
Jinghuan Shang, Srijan Das, Michael S. Ryoo.
NeurIPS 2022
arXiv
/
Project Page
/
code
3DTRL is a lightweight, plug-and-play layer that recovers 3D information of visual tokens and leverages it for learning viewpoint-agnostic representations.
Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection
Rui Dai, Srijan Das, Saurav Sharma, Luca Minciullo, Lorenzo Garattoni, François Brémond, Gianpiero Francesca.
T-PAMI 2022
Project Link / Code
TSU is a new untrimmed daily-living dataset consisting of 51 activities performed in a spontaneous manner, captured from non-optimal viewpoints.
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond.
CVPR 2022
arXiv
/
code
A ConvTransformer network that explores global and local temporal relations at multiple resolutions.
2021
VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living
Srijan Das, Rui Dai, Di Yang, Francois Bremond.
TPAMI, 2021
arXiv
/
code
VPN++ is an extension of our VPN model (ECCV 2020). VPN++ hallucinates pose-driven features without requiring costly 3D poses at inference.
CTRN: Class Temporal Relational Network for Action Detection
Rui Dai, Srijan Das, Francois Bremond.
BMVC 2021, Oral
Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection
Rui Dai, Srijan Das, Francois Bremond.
ICCV 2021
PDAN: Pyramid Dilated Attention Network for Action Detection.
Rui Dai, Srijan Das, Luca Minciullo, Lorenzo Garattoni, Gianpiero Francesca and Francois Bremond.
WACV 2021
Code / Video / Poster
2020
VPN: Learning Video-Pose Embedding for Activities of Daily Living
Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat.
ECCV 2020
Code
Looking deeper into Time for Activities of Daily Living Recognition
Srijan Das, Monique Thonnat and Francois Bremond.
WACV 2020
2019
Toyota Smarthome: Real World Activities of Daily Living.
Srijan Das, Rui Dai, Michal Koperski, Luca Minciullo, Lorenzo Garattoni, Francois Bremond and Gianpiero Francesca.
ICCV 2019
Project Link / Code
Where to focus on for Human Action Recognition?
Srijan Das, Arpit Chaudhary, Francois Bremond and Monique Thonnat.
WACV 2019
Talks
- May 2025 Invited Talk on "LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living" in the first special WACV 2025 Meetup Series.
- May 2025 Invited Academic Talk on "Improved Reasoning in AI Models for Deepfake Detection" in Martigny Biometrics Workshop co-organised by the European Association for Biometrics (EAB), the Center for Identification Technology Research (CITeR) and the Idiap Research Institute at Idiap in Martigny, Switzerland.
- May 2025 Invited research poster presentation at the Computing Community Consortium (CCC) Computing Futures Symposium in Washington, DC, USA.
- Mar 2025 Guest Lecture on "Deep Neural Networks" at The University of Michigan-Dearborn.
- Mar 2024 Talk on "Computer Vision Projects in CharMLab" in a RoundTable discussion on AI in conjunction with the Defense Alliance of NC (DANC) and the Michael Best Law Firm.
- Feb 2024 Invited Online Tech Talk on "From Pixels to Robots: Recipes for Vision-Enabled Robot Learning" at Christ University, Bangalore, India.
- Dec 2023 Invited Talk on "Video Understanding using AI" as part of the "AI and ROS for Robotics: Theory and Practice" short-term training program at IIITDM.
- Jun 2023 Invited Talk on "Computer Vision for Robot Learning" as part of the "AI and Machine Vision for Robotics" short-term training program at IIITDM. (Virtually)
- Apr 2023 Talk on "From Few to More: Enhancing ViT Performance on Limited Data" at the PHPC Lab at UNC Charlotte.
- Mar 2023 Talk on "From Pixels to Robots: Recipes for Vision-Enabled Robot Learning" at the Seminar on Controls and Robotics at UNC Charlotte.
- Jan 2023 Talk on "Quo vadis, computer vision!" at the PhD seminar at UNC Charlotte.
- Mar 2022 Invited Talk in AICTE sponsored Short Term Course on "Multiple Modalities are all you need for Video Understanding!" at IIITDM Kancheepuram. (Virtually)
- Sep 2021 Talk on "Vision for understanding Activities of Daily Living" at SciTech Talks. [video]
- Apr 2021 Seminar talk on "How to combine modalities for understanding Activities of Daily Living?" for CSE 600 at Stony Brook University, NY, USA.
- Nov 2020 Seminar talk on "How to combine RGB & Poses for understanding Activities of Daily Living?" at Université Lumière Lyon 2.
- Nov 2019 Nice Data Science meetup. [slides]
- Aug 2018 Summer School Brain Innovation Generation @ UCA. [slides]
Academic Activities
- Area Chair for NeurIPS 2025.
- Program committee member of AAAI-24 Student Program.
- Associate Editor for ICRA 2024.
- Member of the DEI committee for CVPR 2023.
- Senior Program Committee Member for AAAI 2023 and AAAI 2024.
- Session chair for Image Understanding & Activity Recognition session at IPAS 2020.
- Mentor for B.E.N.J.I. in the GirlScript Summer of Code 2019 edition.
- Mentor for the Emerging Technology Business Incubator (ETBI) led by NIT Rourkela, a platform envisaged to transform the start-up ecosystem of the region.
- Reviewer at ICACIE 2017, 2018, SETIT 2018, KCST 2019, ICAML 2019, AVSS 2019, 2022, WACV 2020, 2021, 2022, CVPR 2021, 2022, 2023, 2024, 2025, ECCV 2022, 2024, ICCV 2021, 2023, 2025, AAAI 2023, NeurIPS 2023, IROS 2021, 2024.
- Reviewer at TPAMI, Pattern Recognition, Elsevier Journal of CVIU, Elsevier Journal of FGCS, Elsevier Journal of Computer & Electrical Engineering, MTAP, and Journal of Signal Processing: Image Communication.
- Volunteer at ICACNI 2014, ICACNI 2016, ICCV 2019, ICLR 2020 & ICML 2020.