Hi there, I am a second-year computer science Ph.D. student at the University of Electronic Science and Technology of China (UESTC), advised by Prof. Lianli Gao and Prof. Jingkuan Song. Before that, I obtained my bachelor’s degree from Xidian University. I began my master’s degree at UESTC in 2020 and transferred to the Ph.D. program in 2022.

My research interests include Multi-modal Learning, such as Cross-modal Retrieval, Image Captioning, and Visual Question Answering (VQA). My current focus is on building LLM-based Agents and Large-scale Multimodal Pre-training.

🔥 News

  • 2024.04: I will be interning at the Tongyi Lab this summer.
  • 2023.12: FT-Data Ranker: Fine-Tuning Data Processing Competition for LLMs, 7B-Model Track (ranked 10/377).
  • 2023.12: FT-Data Ranker: Fine-Tuning Data Processing Competition for LLMs, 1B-Model Track (ranked 13/383).
  • 2023.11: 🎉 One paper was accepted by TCSVT 2023.
  • 2023.09: 🎉 One paper was accepted by TNNLS 2023.
  • 2023.08: 🔥 I released Awesome-Embodied-Agent-with-LLMs, a curated list of research on embodied agents built with LLMs.
  • 2023.07: 🎉 One paper was accepted by TMM 2023.
  • 2023.07: 🎉 One paper was accepted by ACM MM 2023.
  • 2023.01: 🎉 One paper was accepted by PR 2023.
  • 2022.08: 🎉 One paper was accepted by TIP 2022.
  • 2022.06: 🎉 One paper was accepted by IJCAI 2022.
  • 2022.05: 🎉 One paper was accepted by NeurIPS 2022.
  • 2021.07: ICCV 2021 Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC), Track 1, Top 4.

📝 Publications

Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Xinyu Lyu, Yihang Duan, Heng Tao Shen
arXiv 2024
Area: Query-based learning, Text-video retrieval, CLIP

We introduce Global-Local Semantic Consistent Learning (GLSCL), which exploits latent shared semantics across modalities via lightweight queries. An inter-consistency loss and an intra-diversity loss ensure the consistency and diversity of the learned concepts across and within modalities, respectively.
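
As a rough illustration of the two objectives (not the paper’s official implementation), the sketch below assumes text and video concept embeddings of shape (B, K, D) produced by the lightweight queries; all function names, shapes, and pooling choices are my own assumptions.

```python
import torch
import torch.nn.functional as F

def inter_consistency_loss(text_concepts, video_concepts, temperature=0.07):
    """Symmetric contrastive loss aligning per-sample concept embeddings
    across modalities (illustrative sketch only)."""
    # Pool the K concept tokens into one vector per sample, then normalize.
    t = F.normalize(text_concepts.mean(dim=1), dim=-1)   # (B, D)
    v = F.normalize(video_concepts.mean(dim=1), dim=-1)  # (B, D)
    logits = t @ v.t() / temperature                      # (B, B)
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def intra_diversity_loss(concepts):
    """Penalize redundancy among the K concepts within one modality by
    pushing their pairwise cosine similarities toward zero."""
    c = F.normalize(concepts, dim=-1)                     # (B, K, D)
    sim = torch.matmul(c, c.transpose(1, 2))              # (B, K, K)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return off_diag.pow(2).mean()
```

In this reading, the inter-consistency term is a standard symmetric InfoNCE over pooled concepts, while the diversity term simply decorrelates the K concepts within each modality.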

UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval
Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
arXiv 2024
Area: Prompt tuning, Text-video retrieval, Video and Language Learning, CLIP

We present Unified Modality-aware Prompt Tuning (UMP), which encourages mutual promotion between the two branches by sharing modality-aware prompt tokens and by injecting fine-grained spatial-temporal information through a parameter-free spatial-temporal shifting strategy.
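
The parameter-free shifting strategy is reminiscent of temporal token shifting; the sketch below shows one plausible, purely illustrative version of such a shift (the tensor layout and shift rule are my assumptions, not the paper’s exact operation).

```python
import torch

def temporal_token_shift(x, shift_ratio=0.25):
    """Parameter-free temporal shift over frame-level tokens (illustrative).
    x: (B, T, N, D) features for T frames with N spatial tokens each."""
    B, T, N, D = x.shape
    k = int(D * shift_ratio)
    out = x.clone()
    out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift one channel slice forward in time
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift another slice backward in time
    return out
```

The shift mixes a slice of channels with neighboring frames at zero parameter cost, which is the spirit of a parameter-free spatial-temporal shifting strategy.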

Depth-Aware Sparse Transformer for Video-Language Learning
Haonan Zhang, Lianli Gao, Pengpeng Zeng, Alan Hanjalic, Heng Tao Shen
ACM International Conference on Multimedia, MM 2023
[Paper] [Code] [Poster]
Area: Video and Language Learning, Vision Transformer, Depth Estimation, Sparse Attention, Hierarchical Structure

We propose a Depth-Aware Sparse Transformer (DAST) for video-language learning, which captures the geometric relationships among instances by introducing depth information.
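
One way to picture a depth-aware sparse attention pattern (purely illustrative, not the paper’s exact rule) is to restrict attention to tokens whose estimated depth falls into the same or a neighboring depth bin:

```python
import torch

def depth_aware_sparse_mask(depth, num_bins=4, window=1):
    """Illustrative sparse-attention mask built from per-token depth.
    depth: (B, N) estimated depth values in [0, 1]. Returns a (B, N, N) bool
    mask where True means the token pair is allowed to attend."""
    bins = torch.clamp((depth * num_bins).long(), max=num_bins - 1)  # (B, N)
    diff = (bins.unsqueeze(2) - bins.unsqueeze(1)).abs()             # (B, N, N)
    return diff <= window
```

Such a mask could then be applied inside a standard attention layer to zero out the disallowed pairs.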

Learning visual question answering on controlled semantic noisy labels
Haonan Zhang, Pengpeng Zeng, Yuxuan Hu, Jin Qian, Jingkuan Song, Lianli Gao
Pattern Recognition, PR 2023
[Paper] [Code]
Area: Visual question answering, Noisy datasets, Semantic labels, Contrastive learning

We propose a new and challenging task: learning visual question answering with controlled semantic noisy labels. It aims to build VQA models that remain robust when the labels contain semantic noise.

$\mathcal{S}^2$ Transformer for Image Captioning
Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao
International Joint Conference on Artificial Intelligence, IJCAI 2022
[Paper] [Code]
Area: Image Captioning, Clustering, Transformer, Unsupervised learning

We study how to effectively and efficiently incorporate grid features into a transformer-based architecture for image captioning. To this end, we propose the $\mathcal{S}^2$ Transformer, a simple yet effective approach that implicitly learns pseudo regions through a series of learnable clusters in an SP module and simultaneously explores both low- and high-level encoded features in an SR module.
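
For intuition, here is a minimal sketch of forming pseudo regions from grid features via soft assignment to learnable clusters; the module name, shapes, and normalization are illustrative assumptions rather than the official SP module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoRegionPooling(nn.Module):
    """Soft-assign grid features to a small set of learnable cluster centers
    to form pseudo-region features (illustrative sketch, not the official code)."""
    def __init__(self, dim=512, num_clusters=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.02)

    def forward(self, grid_feats):                       # grid_feats: (B, N, D) flattened grid
        logits = grid_feats @ self.centers.t()           # similarity to each cluster, (B, N, K)
        assign = F.softmax(logits, dim=-1)               # soft assignment over clusters
        # Weighted average of grid features per cluster -> pseudo regions (B, K, D).
        regions = torch.einsum('bnk,bnd->bkd', assign, grid_feats)
        regions = regions / (assign.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return regions
```

Each learned cluster thus aggregates the grid cells it is most responsible for, playing the role of a pseudo region without any region detector.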

Memory-based Augmentation Network for Video Captioning. Shuaiqi Jing, Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen. IEEE Transactions on Multimedia, TMM 2023.
[Paper] [Code]

Visual Commonsense-aware Representation Network for Video Captioning. Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen. IEEE Transactions on Neural Networks and Learning Systems, TNNLS 2023.
[arXiv] [Paper] [Code]

Video Question Answering with Prior Knowledge and Object-sensitive Learning. Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, Heng Tao Shen. IEEE Transactions on Image Processing, TIP 2022.
[Paper] [Code]

A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval. Hao Li, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Haonan Zhang, Gongfu Li. Advances in Neural Information Processing Systems, NeurIPS 2022.
[Paper] [Code]

You should know more: Learning external knowledge for visual dialog. Lei Zhao, Haonan Zhang, Xiangpeng Li, Sen Yang, Yuanfeng Song. Neurocomputing 2022.
[Paper]

🥇 Honors and Awards

  • 2023.12 Shenzhen Stock Exchange Scholarship.
  • 2023.11 First-class Scholarship.
  • 2023.04 Outstanding Graduate Teaching Assistant Award.
  • 2022.06 Outstanding Graduate Student Cadre.
  • 2022.04 “Academic Youth” Graduate Student Honor Award.
  • 2019.10 Individual Scholarship.

📖 Education

  • 2022.06 - now, University of Electronic Science and Technology of China (UESTC), Ph.D. student of Computer Science and Technology.
  • 2020.09 - 2022.06, University of Electronic Science and Technology of China (UESTC), Master of Computer Technology, transferred to Ph.D.
  • 2016.09 - 2020.06, Xidian University, Bachelor of Computer Science and Technology.

💬 Services

  • Reviewer for ECCV 2024, CVPR 2024/2023, WACV, ACM MM, TMM, etc.
