Hi there, I am a second-year computer science Ph.D. student at the University of Electronic Science and Technology of China (UESTC), advised by Prof. Lianli Gao and Prof. Jingkuan Song. I obtained my bachelor’s degree at Xidian University, began my master’s studies at UESTC in 2020, and transferred to the Ph.D. program in 2022.
My research interests include multi-modal learning, such as cross-modal retrieval, image captioning, and VQA. My current focus is on building LLM-based agents and large-scale multimodal pre-training.
🔥 News
- 2024.04: I will be interning at the Tongyi Lab this summer.
- 2023.12: FT‐Data Ranker: Fine‐Tuning Data Processing Competition for LLMs, 7B‐Model Track (10/377)
- 2023.12: FT‐Data Ranker: Fine‐Tuning Data Processing Competition for LLMs, 1B‐Model Track (13/383)
- 2023.11: 🎉 One paper was accepted by TCSVT 2023.
- 2023.09: 🎉 One paper was accepted by TNNLS 2023.
- 2023.08: 🔥 I released a repo with a curated list of Awesome-Embodied-Agent-with-LLMs research.
- 2023.07: 🎉 One paper was accepted by TMM 2023.
- 2023.07: 🎉 One paper was accepted by ACM MM 2023.
- 2023.01: 🎉 One paper was accepted by PR 2023.
- 2022.08: 🎉 One paper was accepted by TIP 2022.
- 2022.06: 🎉 One paper was accepted by IJCAI 2022.
- 2022.05: 🎉 One paper was accepted by NeurIPS 2022.
- 2021.07: ICCV 2021 Multi‐Modal Video Reasoning and Analyzing Competition (MMVRAC) Track 1 Top 4.
📝 Publications
![glscl](images/GLSCL.png)
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Xinyu Lyu, Yihang Duan, Heng Tao Shen
arXiv 2024
Area: Query-based learning, Text-video retrieval, CLIP
We introduce Global-Local Semantic Consistent Learning (GLSCL), a method that capitalizes on latent shared semantics across modalities via lightweight queries. An inter-consistency loss and an intra-diversity loss are applied to ensure the consistency and diversity of learned concepts across and within modalities.
![ump](images/UMP.png)
UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
arXiv 2024
Area: Prompt tuning, Text-video retrieval, Video and Language Learning, CLIP
We present a Unified Modality-aware Prompt Tuning (UMP) method that encourages the mutual promotion of two branches by utilizing shared modality-aware prompt tokens and fine-grained spatial-temporal information via a parameter-free spatial-temporal shifting strategy.
![dast](images/DAST.png)
Depth-Aware Sparse Transformer for Video-Language Learning
Haonan Zhang, Lianli Gao, Pengpeng Zeng, Alan Hanjalic, Heng Tao Shen
ACM International Conference on Multimedia, MM 2023
[Paper] [Code] [Poster]
Area: Video and Language Learning, Vision Transformer, Depth Estimation, Sparse Attention, Hierarchical Structure
We propose a Depth-Aware Sparse Transformer (DAST) for video-language learning, which focuses on the geometrical relationship of instances by introducing depth information.
![snlc](images/SNLC.png)
Learning visual question answering on controlled semantic noisy labels
Haonan Zhang, Pengpeng Zeng, Yuxuan Hu, Jin Qian, Jingkuan Song, Lianli Gao
Pattern Recognition, PR 2023
[Paper] [Code]
Area: Visual question answering, Noisy datasets, Semantic labels, Contrastive learning
We propose a new and challenging task: learning visual question answering with controlled semantic noisy labels. It aims to explore more robust VQA models when the labels contain semantic noise.
![s2](images/S2.png)
$\mathcal{S}^2$ Transformer for Image Captioning
Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao
International Joint Conference on Artificial Intelligence, IJCAI 2022
[Paper] [Code]
Area: Image Captioning, Clustering, Transformer, Unsupervised learning
We study how to effectively and efficiently incorporate grid features into transformer-based architectures for image captioning. To this end, we propose the $\mathcal{S}^2$ Transformer, a simple yet effective approach that implicitly learns pseudo regions through a series of learnable clusters in an SP module and simultaneously explores both low- and high-level encoded features in an SR module.
Memory-based Augmentation Network for Video Captioning. Shuaiqi Jing, Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen. IEEE Transactions on Multimedia, TMM 2023
[Paper] [Code]
Visual Commonsense-aware Representation Network for Video Captioning. Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen. IEEE Transactions on Neural Networks and Learning Systems, TNNLS 2023
[arXiv] [Paper] [Code]
Video Question Answering with Prior Knowledge and Object-sensitive Learning. Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, Heng Tao Shen. IEEE Transactions on Image Processing, TIP 2022
[Paper] [Code]
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval. Hao Li, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Haonan Zhang, Gongfu Li. Advances in Neural Information Processing Systems, NeurIPS 2022
[Paper] [Code]
You should know more: Learning external knowledge for visual dialog. Lei Zhao, Haonan Zhang, Xiangpeng Li, Sen Yang, Yuanfeng Song. Neurocomputing 2022
[Paper]
🥇 Honors and Awards
- 2023.12 Shenzhen Stock Exchange Scholarship.
- 2023.11 First-class Scholarship.
- 2023.04 Outstanding Graduate Teaching Assistant Award.
- 2022.06 Outstanding Graduate Student Cadre.
- 2022.04 “Academic Youth” Graduate Student Honor Award.
- 2019.10 Individual Scholarship.
📖 Education
- 2022.06 - now, University of Electronic Science and Technology of China (UESTC), Ph.D. student of Computer Science and Technology.
- 2020.09 - 2022.06, University of Electronic Science and Technology of China (UESTC), Master of Computer Technology, transferred to Ph.D.
- 2016.09 - 2020.06, Xidian University, Bachelor of Computer Science and Technology.
💬 Services
- Reviewer for ECCV 2024, CVPR 2024/2023, WACV, ACM MM, TMM, etc.