I’m Baoqi Pei (裴宝琦), a third-year Ph.D. student at the College of Computer Science and Technology, Zhejiang University, supervised by Prof. Fei Wu and Prof. Yu Qiao. I work closely with Yifei Huang. Prior to this, I received my Bachelor’s degree from Beihang University in 2023.

My research interests include general video understanding, egocentric visual perception, and multimodal large language models.

🔥 News

📝 Publications

NeurIPS 2025

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, et al.

[Paper]  [Code]  [Data]

  • A framework that equips MLLMs with strong egocentric reasoning via the EgoRe-5M dataset, spatio-temporal chain-of-thought supervision, and a two-stage training pipeline.

ICLR 2025

EgoHOD: Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, et al.  

[Paper]  [Code]  [Data]

  • An egocentric video-language pretrained model that learns fine-grained egocentric video representations by modeling hand-object dynamics.

IMWUT 2025

Vinci: A real-time embodied smart assistant based on egocentric vision-language model

Yifei Huang*, Jilan Xu*, Baoqi Pei*, Lijin Yang, MingFang Zhang, Yuping He, Guo Chen, et al.

[Paper]  [Code] 

  • A real-time egocentric wearable assistant that helps users with daily tasks, including scene understanding, grounding, summarization, and future planning.

IJCV 2025

CoQo: Guiding Audio-Visual Question Answering with Collective Question Reasoning

Baoqi Pei, Yifei Huang, Guo Chen, Jilan Xu, et al.

[Paper] 

  • A multimodal model that tackles the AVQA task with a Question-Guided Transformer and a Collective Question-Answering Training strategy.

ECCV 2024

InternVideo2: Scaling foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Jilan Xu, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

[Paper]  [Code] 

  • A foundation model for video, text, and audio understanding that achieves state-of-the-art results on several benchmarks.

📖 Education

  • 2023.09 - Present, Ph.D. student, College of Computer Science and Technology, Zhejiang University.
  • 2019.09 - 2023.06, B.Sc., College of Computer Science, Beihang University.

🎖 Honors and Awards

  • Winner of 7 tracks at the 1st EgoVis Workshop @ CVPR 2024
  • Distinguished Paper Award at EgoVis 2023/2024
  • Outstanding Graduate Student at Zhejiang University, 2025
  • Outstanding Student Scholarship at Beihang University, 2021