OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents

**Overview of OmniCharacter++.** The benchmark moves beyond text-only and dyadic-only role-playing evaluation by combining rich role profiles, multi-role dialogues, vivid speech, goal-oriented scenarios, and multi-level evaluation.

Headline Results

10K+

Character profiles across games, fiction, and public domains

118K+

Dyadic and multi-party role-playing dialogues

1M+

Synthesized speech responses with varied styles and emotions

3,941.76 h

Total speech duration for text-speech driven interaction

Context Understanding — Multi-Choice Evaluation

Performance comparison with state-of-the-art models on the multi-party dialogue test set of OmniCharacter++. Models are evaluated with multi-choice QA under the Circular Evaluation Strategy for robust context understanding. Neg.: negotiation, Exc.: exchange, Free.: free-talk, Exp.: expert-domain, Inst.: instruction-giving, Per.: persuasion, Conf.: conflict-resolution, Pla.: planning. The number in parentheses after each average indicates the rank within the model group.

Models	Avg.	Multi-party Dialogue - Context Understanding (Multi-Choice)
Models	Avg.	Neg.	Exc.	Free.	Exp.	Inst.	Per.	Conf.	Pla.
Human Evaluation
Human	89.84 (-)	88.66±1.0	90.88±1.1	91.11±1.0	92.77±1.0	89.88±1.2	87.44±1.1	85.88±0.7	92.11±1.1
Blind Evaluation (w/o dialogue context)
Random Choice	23.14 (-)	24.02±1.1	23.88±1.2	18.44	27.77±1.0	22.66±1.4	20.11±1.2	24.66±1.0	23.55±1.3
Random Choice (circular eval.)	0.00 (-)	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
GPT-3.5-Turbo	21.51 (-)	18.44±1.1	23.77±1.0	25.88±1.2	18.55±1.1	26.11±1.3	27.66±1.1	11.22±1.1	20.44±1.0
GPT-4o	24.68 (-)	21.11±0.9	30.02±1.2	30.44±0.7	16.22±1.1	24.88±1.0	41.77±1.0	17.11±1.0	15.88±1.2
Proprietary Models
GPT-4.1	50.11 (1)	37.44±1.2	54.22±1.0	69.11±1.3	67.88±1.1	42.11±1.4	46.77±1.2	41.11±1.3	42.22±0.9
GPT-4.1-mini	40.55 (5)	29.44±1.2	38.11±1.2	57.44±1.1	50.88±1.2	40.11±1.1	34.11±1.1	34.22±1.0	40.11±1.2
GPT-4o	45.20 (4)	39.11±1.1	52.22±1.2	66.88±1.3	50.88±1.3	31.77±1.3	43.77±1.0	38.77±1.0	38.22±1.1
GPT-4o-mini	32.12 (7)	21.11±1.2	33.77±1.2	50.11±1.0	46.77±1.3	30.11±1.3	31.88±1.0	24.11±1.0	19.11±1.1
GPT-3.5-Turbo	22.90 (8)	23.88±1.2	18.44±1.0	27.11±0.9	22.88±0.9	20.11±1.0	23.88±1.3	24.11±1.2	22.77±1.0
DeepSeek-V3	39.94 (6)	33.77±0.9	43.88±1.3	57.44±0.6	46.77±1.1	38.77±1.0	36.22±0.9	28.44±1.2	34.22±1.1
Doubao-1.5-Pro-32K	47.66 (3)	37.44±0.4	47.88±1.1	50.11±1.0	52.88±1.2	56.11±1.1	46.22±1.3	43.88±1.2	46.77±0.9
Gemini-2.0-flash-preview	48.36 (2)	42.77±1.1	52.22±1.3	66.88±1.1	46.77±0.9	48.88±1.0	48.11±1.4	41.11±1.2	40.11±0.7
Open-source Models
LLaMA-3.1-405B-Instruct	39.75 (2)	34.77±1.3	38.88±1.4	39.22±1.2	41.88±1.2	39.11±0.9	41.11±1.3	40.88±1.1	42.11±1.0
LLaMA-3.1-70B-Instruct	36.21 (5)	34.77±1.0	39.11±1.0	36.11±1.1	35.11±1.2	39.11±1.2	35.22±1.1	38.11±1.0	32.11±1.2
LLaMA-3.1-8B-Instruct	22.94 (7)	25.11±1.1	19.11±0.9	20.11±1.2	12.11±1.2	31.77±1.1	25.11±1.0	28.11±0.8	22.11±1.3
Qwen2.5-72B-Instruct	43.59 (1)	34.88±1.0	36.77±0.9	54.11±0.8	59.11±1.1	49.11±1.2	36.88±1.2	41.11±1.0	36.77±1.3
Qwen2.5-32B-Instruct	38.49 (3)	30.11±1.0	39.11±0.8	54.11±1.1	48.11±1.0	36.11±1.0	45.11±1.0	27.11±1.0	28.11±1.2
Qwen2.5-14B-Instruct	36.91 (4)	25.11±1.2	26.77±1.3	57.11±0.9	52.11±0.6	35.11±1.2	30.88±1.0	34.11±1.3	34.11±1.2
Qwen2.5-7B-Instruct	23.33 (6)	19.11±1.0	21.11±0.7	35.11±1.1	32.88±1.1	14.11±1.0	27.11±1.0	21.11±1.2	16.11±1.2
Reasoning Models
o4-mini	38.91 (4)	34.88±1.1	40.11±1.1	49.11±1.2	48.11±1.0	36.11±1.1	39.11±0.8	27.11±1.0	36.77±1.0
o3-mini	41.15 (3)	37.44±1.2	44.11±1.1	54.11±1.2	48.11±1.4	36.11±1.2	39.11±1.2	30.11±1.2	40.11±1.2
o1-mini	35.82 (5)	30.11±1.0	36.77±1.0	42.11±1.1	48.11±1.1	36.11±1.2	37.11±1.0	29.11±1.2	27.11±1.4
Gemini-2.5-flash	42.33 (2)	34.77±1.0	34.77±0.7	43.11±1.2	49.11±1.1	37.11±1.2	45.11±1.2	46.77±1.3	47.88±0.8
Gemini-2.5-pro-preview-05-06	45.62 (1)	40.11±0.8	40.11±1.1	52.11±1.0	54.11±1.2	41.11±1.1	45.88±1.3	46.77±1.1	44.77±1.0
Role-playing Models
CharacterGLM	36.47 (6)	34.77±1.0	35.11±1.1	60.88±1.1	42.77±0.8	20.11±0.9	39.11±0.9	32.88±1.2	26.11±1.1
Baichuan-NPC	36.23 (7)	29.88±1.0	38.77±1.1	49.11±1.1	42.77±1.0	31.77±1.1	30.88±1.3	32.88±1.1	33.77±0.9
Minimax-abab6-chat	42.26 (4)	41.77±1.0	39.88±1.2	40.88±1.0	64.88±1.0	41.77±1.2	36.88±1.1	36.11±1.1	35.88±0.6
Xingchen-Plus	41.62 (5)	42.88±1.2	41.88±1.1	64.77±1.0	41.77±1.1	36.77±1.2	36.11±1.3	32.88±0.9	35.88±0.9
Qwen2.5-7B-Instruct w/ our data	42.58 (3)	39.11±1.1	45.88±1.0	68.11±1.0	36.77±1.4	43.77±0.9	36.77±1.2	36.11±1.4	34.11±1.3
OmniCharacter-7B (Ours)	43.31 (2)	34.77±1.1	48.88±1.1	66.88±1.3	34.77±1.0	35.88±1.3	36.77±1.2	46.77±1.2	41.77±0.9
UniCharacter-7B (Ours)	47.80 (1)	43.88±1.1	46.77±0.7	70.11±1.1	41.77±1.2	44.77±1.3	44.11±1.0	50.88±1.3	40.11±1.2

For dyadic dialogue, generation ability, human perception, and the full experimental breakdown, please see the paper.
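The Circular Evaluation Strategy named in the table caption is not spelled out on this page. The sketch below assumes the common formulation in which each multi-choice question is asked once per circular shift of its options and is credited only if every shifted version is answered correctly. The `Predictor` signature, the helper names, and the toy question are hypothetical and only serve the illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical predictor interface: given a question and an ordered list of
# options, return the index of the chosen option.
Predictor = Callable[[str, Sequence[str]], int]


def rotations(options: Sequence[str], answer_idx: int) -> Iterator[Tuple[List[str], int]]:
    """Yield each circular shift of the options with the answer's new index."""
    n = len(options)
    for shift in range(n):
        rotated = list(options[shift:]) + list(options[:shift])
        yield rotated, (answer_idx - shift) % n


def circular_correct(predict: Predictor, question: str,
                     options: Sequence[str], answer_idx: int) -> bool:
    """Credit the question only if every rotation is answered correctly."""
    return all(predict(question, opts) == idx
               for opts, idx in rotations(options, answer_idx))


def always_first(question: str, options: Sequence[str]) -> int:
    """Position-biased guesser: always picks the first option."""
    return 0


def knows_answer(question: str, options: Sequence[str]) -> int:
    """Content-aware toy model: finds the right option wherever it sits."""
    return list(options).index("Bob")


options = ["Alice", "Bob", "Carol", "Dave"]
question = "Who proposed the trade?"
print(circular_correct(always_first, question, options, 1))
print(circular_correct(knows_answer, question, options, 1))
```

Under this assumption, a guesser that relies on option position gets no credit, while a model that tracks the content of the options does. With four options, independent uniform guessing would pass all four rotations with probability (1/4)^4, roughly 0.4%, which would explain a near-zero random baseline under circular evaluation.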

Model Framework

OmniCharacter++ builds a speech-language collaborative model for realistic role-playing agents. The framework aligns character profiles, dialogue context, text queries, and speech queries, then adapts the response through role-aware speech decoding, emotion preference learning, and role-contextual dialogue adaptation.
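To make the four inputs named above concrete, here is a minimal sketch of how one sample could bundle a character profile, dialogue context, a text query, and a speech query. Every class, field, and function name is a hypothetical illustration and not the released data format or model interface. The speech side is reduced to a file path, and none of the training components (role-aware speech decoding, emotion preference learning, role-contextual dialogue adaptation) are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One utterance in a (possibly multi-party) dialogue. Hypothetical schema."""
    speaker: str
    text: str
    speech_path: Optional[str] = None  # assumed: path to the spoken version


@dataclass
class RolePlaySample:
    """Hypothetical container for the inputs the framework is said to align."""
    profile: str
    context: List[Turn] = field(default_factory=list)
    text_query: Optional[str] = None
    speech_query_path: Optional[str] = None

    def modalities(self) -> List[str]:
        """Report which query modalities are present in this sample."""
        present = []
        if self.text_query is not None:
            present.append("text")
        if self.speech_query_path is not None:
            present.append("speech")
        return present


def build_text_prompt(sample: RolePlaySample, role_name: str) -> str:
    """Flatten profile, context, and text query into one text-only prompt."""
    lines = [f"[Profile of {role_name}] {sample.profile}"]
    lines += [f"{turn.speaker}: {turn.text}" for turn in sample.context]
    if sample.text_query is not None:
        lines.append(f"User: {sample.text_query}")
    lines.append(f"{role_name}:")
    return "\n".join(lines)


sample = RolePlaySample(
    profile="A stoic blacksmith who speaks in short sentences.",
    context=[Turn("Mira", "Can you fix my sword?")],
    text_query="How long will it take?",
)
prompt = build_text_prompt(sample, "Garrick")
print(prompt)
```

A real speech-language model would consume the audio itself instead of a path and would not necessarily flatten everything into a single text prompt. The sketch only shows how the four input types can travel together in one record.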

Generalization on CharacterEval

Quantitative results on the generalizability of state-of-the-art models and UniCharacter on the CharacterEval dataset. We evaluate Character Consistency, Conversational Ability, and Role-playing Attractiveness. KE: Knowledge-Exposure, KA: Knowledge-Accuracy, KH: Knowledge-Hallucination, PB: Persona-Behavior, PU: Persona-Utterance, Flu.: Fluency, Coh.: Coherency, Cons.: Consistency, HL: Human-Likeness, CS: Communication Skill, ED: Expression Diversity, Emp.: Empathy.

Models	Character Consistency						Conversational Ability				Role-playing Attractiveness					Avg.
Models	KE	KA	KH	PB	PU	Avg.	Flu.	Coh.	Cons.	Avg.	HL	CS	ED	Emp.	Avg.	Avg.
Proprietary Models
Baichuan-NPC	1.802	2.964	2.993	2.910	3.151	2.764	3.578	3.898	3.916	3.798	3.836	2.643	2.336	2.971	2.946	3.169
MiniMax	1.835	2.910	2.944	2.774	3.125	2.718	3.609	3.932	3.811	3.784	3.768	2.672	2.150	3.017	2.902	3.134
GPT-3.5	1.716	2.339	2.212	1.921	2.316	2.101	2.629	2.917	2.700	2.749	2.565	2.422	1.660	2.526	2.293	2.381
GPT-4	2.250	2.855	2.785	2.721	2.873	2.697	3.332	3.669	3.343	3.448	3.143	3.184	2.153	3.010	2.873	3.006
Open-source Models
ChatGLM3-6B	2.016	2.792	2.704	2.455	2.812	2.556	3.269	3.647	3.283	3.399	3.064	2.932	1.969	2.993	2.739	2.898
Baichuan2-7B	1.813	2.849	2.929	2.830	3.081	2.700	3.551	3.894	3.827	3.757	3.670	2.728	2.115	2.984	2.874	3.110
Baichuan2-13B	1.802	2.869	2.946	2.808	3.081	2.701	3.596	3.924	3.864	3.759	3.700	2.703	2.136	3.021	2.890	3.116
InternLM-7B	1.782	2.800	2.781	2.719	3.016	2.620	3.527	3.823	3.744	3.698	3.546	2.622	2.070	2.897	2.784	2.983
InternLM-20B	1.945	2.916	2.920	2.753	3.041	2.715	3.576	3.943	3.717	3.745	3.582	2.885	2.132	3.047	2.911	3.123
CharacterGLM	1.640	2.819	2.738	2.301	2.969	2.493	3.414	3.717	3.737	3.623	3.738	2.265	1.966	2.812	2.695	2.937
Llama-3.1-8B	2.197	2.701	2.615	3.130	2.704	2.669	3.059	3.477	3.071	3.202	2.922	2.934	2.634	2.759	2.812	2.894
Qwen-7B	1.956	2.728	2.633	2.605	2.780	2.540	3.187	3.564	3.229	3.327	3.036	2.791	2.052	2.838	2.679	2.848
Qwen-14B	1.988	2.800	2.811	2.744	2.900	2.649	3.351	3.765	3.510	3.542	3.354	2.871	2.237	2.970	2.858	3.016
Qwen2-7B-Instruct	1.966	2.537	2.412	2.313	2.436	2.333	2.864	3.171	2.743	2.926	2.655	2.612	1.867	2.654	2.447	2.569
Qwen2.5-7B-Instruct	2.207	2.740	2.633	2.700	2.614	2.579	3.125	3.401	2.971	3.166	2.786	2.871	2.180	2.826	2.666	2.804
OmniCharacter-7B (Ours)	2.230	3.040	2.918	3.531	2.988	2.941	3.369	3.768	3.410	3.516	3.374	3.261	3.002	3.187	3.206	3.221
UniCharacter-7B (Ours)	1.897	3.083	2.998	3.210	3.349	2.907	3.658	4.023	4.122	3.934	3.958	2.821	2.319	3.188	3.072	3.304
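The page does not state how the Avg. columns are derived. The numbers are consistent with unweighted means: each dimension average is the mean of its sub-metrics, and the final column is the mean of the three dimension averages. The sketch below applies that inferred rule to the UniCharacter-7B sub-metric scores from the table. The function names are illustrative.

```python
from statistics import mean
from typing import Dict, List

# Sub-metric scores for UniCharacter-7B, copied from the table above.
uni_character: Dict[str, List[float]] = {
    "Character Consistency": [1.897, 3.083, 2.998, 3.210, 3.349],   # KE KA KH PB PU
    "Conversational Ability": [3.658, 4.023, 4.122],                # Flu. Coh. Cons.
    "Role-playing Attractiveness": [3.958, 2.821, 2.319, 3.188],    # HL CS ED Emp.
}


def dimension_averages(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean of the sub-metrics inside each dimension."""
    return {name: mean(values) for name, values in scores.items()}


def overall_average(scores: Dict[str, List[float]]) -> float:
    """Unweighted mean of the three dimension averages (inferred rule)."""
    return mean(list(dimension_averages(scores).values()))


dims = dimension_averages(uni_character)
overall = overall_average(uni_character)
for name, value in dims.items():
    print(f"{name}: {value:.3f}")
print(f"Overall: {overall:.3f}")
```

Under this rule the overall score comes to about 3.304, which matches the final column for UniCharacter-7B. Because the final column averages dimensions and not individual sub-metrics, the three-metric Conversational Ability dimension carries the same weight as the five-metric Character Consistency dimension.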

Abstract

Existing Role-Playing Agents (RPAs), powered by large language models, are predominantly evaluated on static, text-only, dyadic conversations, which inadequately reflect the complexity of realistic human interactions involving multiple interlocutors and multi-modal communication. To bridge this gap, we propose OmniCharacter++, the first benchmark for evaluating multi-character interactions in a joint text-speech context. Specifically, OmniCharacter++ contributes: (1) a large-scale dataset comprising 10,287 characters, 118,017 multi-turn dialogues, and over one million audio responses across 8 open-world topics and 31 subfields, covering diverse multi-modal role-playing scenarios; (2) a comprehensive evaluation suite for dialogue understanding, generation quality, and perceptual naturalness; and (3) UniCharacter-7B, a unified text-speech model trained on this dataset to manage complex multi-character dynamics, ensuring both role-specific vocal fidelity and cross-participant semantic alignment. Experimental results demonstrate that UniCharacter-7B produces more realistic role-playing responses, improving both attractiveness and consistency, and that OmniCharacter++ poses substantial challenges for state-of-the-art models, charting a clear path for future research.

BibTeX

@article{zhang2026omnicharacter++,
  title={OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents},
  author={Zhang, Haonan and Zeng, Pengpeng and Zhang, Ji and Song, Jingkuan and Sebe, Nicu and Shen, Heng Tao and Gao, Lianli},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2026},
  publisher={IEEE}
}

@inproceedings{zhang2025omnicharacter,
  title={{OmniCharacter}: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
  author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
  booktitle={ACL (main)},
  pages={26318--26331},
  year={2025}
}