Special Topic: Large Multimodal Models
RESEARCH PAPER

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
Chen, Zhe; Wang, Weiyun; Tian, Hao; Ye, Shenglong; Gao, Zhangwei; Cui, Erfei; Tong, Wenwen; Hu, Kongzhi; Luo, Jiapeng; Ma, Zheng; Ma, Ji; Wang, Jiaqi; Dong, Xiaoyi; Yan, Hang; Guo, Hewei; He, Conghui; Shi, Botian; Jin, Zhenjiang; Xu, Chao; Wang, Bin; Wei, Xingjian; Li, Wei; Zhang, Wenjian; Zhang, Bo; Cai, Pinlong; Wen, Licheng; Yan, Xiangchao; Dou, Min; Lu, Lewei; Zhu, Xizhou; Lu, Tong; Lin, Dahua; Qiao, Yu; Dai, Jifeng; Wang, Wenhai
Sci China Inf Sci, 2024, 67(12): 220101
Keywords: multimodal model; open-source; vision encoder; dynamic resolution; bilingual dataset; LMM
Cite as: Chen Z, Wang W Y, Tian H, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Sci China Inf Sci, 2024, 67(12): 220101, doi: 10.1007/s11432-024-4231-5

RESEARCH PAPER

OCRBench: on the hidden mystery of OCR in large multimodal models
Liu, Yuliang; Li, Zhang; Huang, Mingxin; Yang, Biao; Yu, Wenwen; Li, Chunyuan; Yin, Xu-Cheng; Liu, Cheng-Lin; Jin, Lianwen; Bai, Xiang
Sci China Inf Sci, 2024, 67(12): 220102
Keywords: large multimodal model; LMM; OCR; text recognition; scene text-centric VQA; document-oriented VQA; key information extraction; handwritten mathematical expression recognition
Cite as: Liu Y L, Li Z, Huang M X, et al. OCRBench: on the hidden mystery of OCR in large multimodal models. Sci China Inf Sci, 2024, 67(12): 220102, doi: 10.1007/s11432-024-4235-6

RESEARCH PAPER

MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity
Liu, Yangzhou; Cao, Yue; Gao, Zhangwei; Wang, Weiyun; Chen, Zhe; Wang, Wenhai; Tian, Hao; Lu, Lewei; Zhu, Xizhou; Lu, Tong; Qiao, Yu; Dai, Jifeng
Sci China Inf Sci, 2024, 67(12): 220103
Keywords: instruction tuning; multi-modal; multi-domain; dataset; vision large language model; LMM
Cite as: Liu Y Z, Cao Y, Gao Z W, et al. MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity. Sci China Inf Sci, 2024, 67(12): 220103, doi: 10.1007/s11432-024-4187-3

RESEARCH PAPER

Woodpecker: hallucination correction for multimodal large language models
Yin, Shukang; Fu, Chaoyou; Zhao, Sirui; Xu, Tong; Wang, Hao; Sui, Dianbo; Shen, Yunhang; Li, Ke; Sun, Xing; Chen, Enhong
Sci China Inf Sci, 2024, 67(12): 220105
Keywords: multimodal learning; multimodal large language models; hallucination correction; large language models; vision and language; LMM
Cite as: Yin S K, Fu C Y, Zhao S R, et al. Woodpecker: hallucination correction for multimodal large language models. Sci China Inf Sci, 2024, 67(12): 220105, doi: 10.1007/s11432-024-4251-x

RESEARCH PAPER

DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding
Feng, Hao; Liu, Qi; Liu, Hao; Tang, Jingqun; Zhou, Wengang; Li, Houqiang; Huang, Can
Sci China Inf Sci, 2024, 67(12): 220106
Keywords: document understanding; large multimodal model; LMM; OCR-free; high-resolution; frequency
Cite as: Feng H, Liu Q, Liu H, et al. DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Sci China Inf Sci, 2024, 67(12): 220106, doi: 10.1007/s11432-024-4250-y

RESEARCH PAPER

Modality-experts coordinated adaptation for large multimodal models
Zhang, Yan; Ji, Zhong; Pang, Yanwei; Han, Jungong; Li, Xuelong
Sci China Inf Sci, 2024, 67(12): 220107
Keywords: large multimodal model; LMM; multimodal learning; vision-language pretraining; parameter-efficient fine-tuning; adapter; modality expert
Cite as: Zhang Y, Ji Z, Pang Y W, et al. Modality-experts coordinated adaptation for large multimodal models. Sci China Inf Sci, 2024, 67(12): 220107, doi: 10.1007/s11432-024-4234-4

LETTER

COMET: "cone of experience" enhanced large multimodal model for mathematical problem generation
Liu, Sannyuya; Feng, Jintian; Yang, Zongkai; Luo, Yawei; Wan, Qian; Shen, Xiaoxuan; Sun, Jianwen
Sci China Inf Sci, 2024, 67(12): 220108
Keywords: mathematical problem generation; mathematical problem solving; large multimodal model; LMM; educational application; smart education
Cite as: Liu S N Y, Feng J T, Yang Z K, et al. COMET: "cone of experience" enhanced large multimodal model for mathematical problem generation. Sci China Inf Sci, 2024, 67(12): 220108, doi: 10.1007/s11432-024-4242-0

LETTER

ChemDFM-X: towards large multimodal model for chemistry
Zhao, Zihan; Chen, Bo; Li, Jingpiao; Chen, Lu; Wen, Liyang; Wang, Pengyu; Zhu, Zichen; Zhang, Danyang; Li, Yansi; Dai, Zhongyang; Chen, Xin; Yu, Kai
Sci China Inf Sci, 2024, 67(12): 220109
Keywords: LMM; AI for science; instruction tuning; cross-modality; chemistry
Cite as: Zhao Z H, Chen B, Li J P, et al. ChemDFM-X: towards large multimodal model for chemistry. Sci China Inf Sci, 2024, 67(12): 220109, doi: 10.1007/s11432-024-4243-0