OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents

Haonan Zhang¹, Pengpeng Zeng²✉, Ji Zhang³, Jingkuan Song², Nicu Sebe⁴, Heng Tao Shen², Lianli Gao¹
¹University of Electronic Science and Technology of China, ²Tongji University, ³Southwest Jiaotong University, ⁴University of Trento
✉ Corresponding author
Overview of OmniCharacter++. The benchmark moves beyond text-only and dyadic-only role-playing evaluation by combining rich role profiles, multi-role dialogues, vivid speech, goal-oriented scenarios, and multi-level evaluation.

Headline Results

10K+: Character profiles across games, fiction, and public domains
118K+: Dyadic and multi-party role-playing dialogues
1M+: Synthesized speech responses with varied styles and emotions
3,941.76 h: Total speech duration for text-speech driven interaction

Context Understanding — Multi-Choice Evaluation

Performance comparison with state-of-the-art models on the multi-party dialogue test set of OmniCharacter++. Models are evaluated with multi-choice QA under the Circular Evaluation Strategy for robust context understanding. Neg.: negotiation, Exc.: exchange, Free.: free-talk, Exp.: expert-domain, Inst.: instruction-giving, Per.: persuasion, Conf.: conflict-resolution, Pla.: planning. The number in parentheses is the rank within each model group.

Multi-party Dialogue - Context Understanding (Multi-Choice)

Models | Avg. | Neg. | Exc. | Free. | Exp. | Inst. | Per. | Conf. | Pla.

Human Evaluation
Human | 89.84 (-) | 88.66±1.0 | 90.88±1.1 | 91.11±1.0 | 92.77±1.0 | 89.88±1.2 | 87.44±1.1 | 85.88±0.7 | 92.11±1.1

Blind Evaluation (w/o dialogue context)
Random Choice | 23.14 (-) | 24.02±1.1 | 23.88±1.2 | 18.44 | 27.77±1.0 | 22.66±1.4 | 20.11±1.2 | 24.66±1.0 | 23.55±1.3
Random Choice (circular eval.) | 0.00 (-) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
GPT-3.5-Turbo | 21.51 (-) | 18.44±1.1 | 23.77±1.0 | 25.88±1.2 | 18.55±1.1 | 26.11±1.3 | 27.66±1.1 | 11.22±1.1 | 20.44±1.0
GPT-4o | 24.68 (-) | 21.11±0.9 | 30.02±1.2 | 30.44±0.7 | 16.22±1.1 | 24.88±1.0 | 41.77±1.0 | 17.11±1.0 | 15.88±1.2

Proprietary Models
GPT-4.1 | 50.11 (1) | 37.44±1.2 | 54.22±1.0 | 69.11±1.3 | 67.88±1.1 | 42.11±1.4 | 46.77±1.2 | 41.11±1.3 | 42.22±0.9
GPT-4.1-mini | 40.55 (5) | 29.44±1.2 | 38.11±1.2 | 57.44±1.1 | 50.88±1.2 | 40.11±1.1 | 34.11±1.1 | 34.22±1.0 | 40.11±1.2
GPT-4o | 45.20 (4) | 39.11±1.1 | 52.22±1.2 | 66.88±1.3 | 50.88±1.3 | 31.77±1.3 | 43.77±1.0 | 38.77±1.0 | 38.22±1.1
GPT-4o-mini | 32.12 (7) | 21.11±1.2 | 33.77±1.2 | 50.11±1.0 | 46.77±1.3 | 30.11±1.3 | 31.88±1.0 | 24.11±1.0 | 19.11±1.1
GPT-3.5-Turbo | 22.90 (8) | 23.88±1.2 | 18.44±1.0 | 27.11±0.9 | 22.88±0.9 | 20.11±1.0 | 23.88±1.3 | 24.11±1.2 | 22.77±1.0
DeepSeek-V3 | 39.94 (6) | 33.77±0.9 | 43.88±1.3 | 57.44±0.6 | 46.77±1.1 | 38.77±1.0 | 36.22±0.9 | 28.44±1.2 | 34.22±1.1
Doubao-1.5-Pro-32K | 47.66 (3) | 37.44±0.4 | 47.88±1.1 | 50.11±1.0 | 52.88±1.2 | 56.11±1.1 | 46.22±1.3 | 43.88±1.2 | 46.77±0.9
Gemini-2.0-flash-preview | 48.36 (2) | 42.77±1.1 | 52.22±1.3 | 66.88±1.1 | 46.77±0.9 | 48.88±1.0 | 48.11±1.4 | 41.11±1.2 | 40.11±0.7

Open-source Models
LLaMA-3.1-405B-Instruct | 39.75 (2) | 34.77±1.3 | 38.88±1.4 | 39.22±1.2 | 41.88±1.2 | 39.11±0.9 | 41.11±1.3 | 40.88±1.1 | 42.11±1.0
LLaMA-3.1-70B-Instruct | 36.21 (5) | 34.77±1.0 | 39.11±1.0 | 36.11±1.1 | 35.11±1.2 | 39.11±1.2 | 35.22±1.1 | 38.11±1.0 | 32.11±1.2
LLaMA-3.1-8B-Instruct | 22.94 (7) | 25.11±1.1 | 19.11±0.9 | 20.11±1.2 | 12.11±1.2 | 31.77±1.1 | 25.11±1.0 | 28.11±0.8 | 22.11±1.3
Qwen2.5-72B-Instruct | 43.59 (1) | 34.88±1.0 | 36.77±0.9 | 54.11±0.8 | 59.11±1.1 | 49.11±1.2 | 36.88±1.2 | 41.11±1.0 | 36.77±1.3
Qwen2.5-32B-Instruct | 38.49 (3) | 30.11±1.0 | 39.11±0.8 | 54.11±1.1 | 48.11±1.0 | 36.11±1.0 | 45.11±1.0 | 27.11±1.0 | 28.11±1.2
Qwen2.5-14B-Instruct | 36.91 (4) | 25.11±1.2 | 26.77±1.3 | 57.11±0.9 | 52.11±0.6 | 35.11±1.2 | 30.88±1.0 | 34.11±1.3 | 34.11±1.2
Qwen2.5-7B-Instruct | 23.33 (6) | 19.11±1.0 | 21.11±0.7 | 35.11±1.1 | 32.88±1.1 | 14.11±1.0 | 27.11±1.0 | 21.11±1.2 | 16.11±1.2

Reasoning Models
o4-mini | 38.91 (4) | 34.88±1.1 | 40.11±1.1 | 49.11±1.2 | 48.11±1.0 | 36.11±1.1 | 39.11±0.8 | 27.11±1.0 | 36.77±1.0
o3-mini | 41.15 (3) | 37.44±1.2 | 44.11±1.1 | 54.11±1.2 | 48.11±1.4 | 36.11±1.2 | 39.11±1.2 | 30.11±1.2 | 40.11±1.2
o1-mini | 35.82 (5) | 30.11±1.0 | 36.77±1.0 | 42.11±1.1 | 48.11±1.1 | 36.11±1.2 | 37.11±1.0 | 29.11±1.2 | 27.11±1.4
Gemini-2.5-flash | 42.33 (2) | 34.77±1.0 | 34.77±0.7 | 43.11±1.2 | 49.11±1.1 | 37.11±1.2 | 45.11±1.2 | 46.77±1.3 | 47.88±0.8
Gemini-2.5-pro-preview-05-06 | 45.62 (1) | 40.11±0.8 | 40.11±1.1 | 52.11±1.0 | 54.11±1.2 | 41.11±1.1 | 45.88±1.3 | 46.77±1.1 | 44.77±1.0

Role-playing Models
CharacterGLM | 36.47 (6) | 34.77±1.0 | 35.11±1.1 | 60.88±1.1 | 42.77±0.8 | 20.11±0.9 | 39.11±0.9 | 32.88±1.2 | 26.11±1.1
Baichuan-NPC | 36.23 (7) | 29.88±1.0 | 38.77±1.1 | 49.11±1.1 | 42.77±1.0 | 31.77±1.1 | 30.88±1.3 | 32.88±1.1 | 33.77±0.9
Minimax-abab6-chat | 42.26 (4) | 41.77±1.0 | 39.88±1.2 | 40.88±1.0 | 64.88±1.0 | 41.77±1.2 | 36.88±1.1 | 36.11±1.1 | 35.88±0.6
Xingchen-Plus | 41.62 (5) | 42.88±1.2 | 41.88±1.1 | 64.77±1.0 | 41.77±1.1 | 36.77±1.2 | 36.11±1.3 | 32.88±0.9 | 35.88±0.9
Qwen2.5-7B-Instruct w/ our data | 42.58 (3) | 39.11±1.1 | 45.88±1.0 | 68.11±1.0 | 36.77±1.4 | 43.77±0.9 | 36.77±1.2 | 36.11±1.4 | 34.11±1.3
OmniCharacter-7B (Ours) | 43.31 (2) | 34.77±1.1 | 48.88±1.1 | 66.88±1.3 | 34.77±1.0 | 35.88±1.3 | 36.77±1.2 | 46.77±1.2 | 41.77±0.9
UniCharacter-7B (Ours) | 47.80 (1) | 43.88±1.1 | 46.77±0.7 | 70.11±1.1 | 41.77±1.2 | 44.77±1.3 | 44.11±1.0 | 50.88±1.3 | 40.11±1.2
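
The Circular Evaluation Strategy named in the table caption is not spelled out on this page. The sketch below assumes the common definition: each multi-choice question is asked once per rotation of its option list, and it counts as correct only if the model picks the gold option under every rotation. The names `AnswerFn`, `circular_shifts`, `circular_correct`, and `circular_accuracy` are hypothetical and are not taken from the OmniCharacter++ code.

```python
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not the benchmark's API): a model is a
# function that receives the question and the ordered option texts and returns
# the index of the option it selects.
AnswerFn = Callable[[str, Sequence[str]], int]
Item = tuple[str, list[str], str]  # (question, options, gold option text)


def circular_shifts(options: Sequence[str]) -> list[list[str]]:
    """Return every rotation of the option list (N rotations for N options)."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]


def circular_correct(answer_fn: AnswerFn, question: str,
                     options: Sequence[str], gold: str) -> bool:
    """True only if the gold option is selected under every rotation."""
    for rotated in circular_shifts(options):
        picked = answer_fn(question, rotated)
        if rotated[picked] != gold:
            return False
    return True


def circular_accuracy(answer_fn: AnswerFn, items: Sequence[Item]) -> float:
    """Percentage of questions answered correctly under all rotations."""
    if not items:
        raise ValueError("items must not be empty")
    hits = sum(circular_correct(answer_fn, q, opts, gold)
               for q, opts, gold in items)
    return 100.0 * hits / len(items)


if __name__ == "__main__":
    items = [
        ("Who proposed the trade?", ["Ann", "Bo", "Cy", "Di"], "Ann"),
        ("Who objected first?", ["Ann", "Bo", "Cy", "Di"], "Cy"),
    ]
    gold_by_question = {q: gold for q, _, gold in items}

    # A position-biased "model" that always picks the first option.
    always_first: AnswerFn = lambda q, opts: 0
    # An oracle that locates the gold option wherever it is placed.
    oracle: AnswerFn = lambda q, opts: list(opts).index(gold_by_question[q])

    print(circular_accuracy(always_first, items))
    print(circular_accuracy(oracle, items))
```

Under this definition a guesser that picks uniformly at random passes a question with N options with probability (1/N)^N, which is about 0.4% if N = 4 (the option count is an assumption here). That is in line with the near-zero "Random Choice (circular eval.)" row, and it is why position bias or lucky guessing contributes little to the reported scores.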

For dyadic dialogue, generation ability, human perception, and the full experimental breakdown, please see the paper.

Model Framework

OmniCharacter++ builds a speech-language collaborative model for realistic role-playing agents. The framework aligns character profiles, dialogue context, text queries, and speech queries, then adapts the response through role-aware speech decoding, emotion preference learning, and role-contextual dialogue adaptation.

Framework of OmniCharacter++. The model integrates text and speech features, learns preferred emotional speech tokens, and retrieves role-contextual memory during inference to produce character-consistent speech-language responses.

Generalization on CharacterEval

Quantitative results on the generalizability of state-of-the-art models and UniCharacter on the CharacterEval dataset. We evaluate Character Consistency, Conversational Ability, and Role-playing Attractiveness. KE: Knowledge-Exposure, KA: Knowledge-Accuracy, KH: Knowledge-Hallucination, PB: Persona-Behavior, PU: Persona-Utterance, Flu.: Fluency, Coh.: Coherency, Cons.: Consistency, HL: Human-Likeness, CS: Communication Skill, ED: Expression Diversity, Emp.: Empathy.

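Column groups: Character Consistency (KE, KA, KH, PB, PU, Avg.); Conversational Ability (Flu., Coh., Cons., Avg.); Role-playing Attractiveness (HL, CS, ED, Emp., Avg.); overall Avg.

Models | KE | KA | KH | PB | PU | Character Consistency Avg. | Flu. | Coh. | Cons. | Conversational Ability Avg. | HL | CS | ED | Emp. | Role-playing Attractiveness Avg. | Overall Avg.

Proprietary Models
Baichuan-NPC | 1.802 | 2.964 | 2.993 | 2.910 | 3.151 | 2.764 | 3.578 | 3.898 | 3.916 | 3.798 | 3.836 | 2.643 | 2.336 | 2.971 | 2.946 | 3.169
MiniMax | 1.835 | 2.910 | 2.944 | 2.774 | 3.125 | 2.718 | 3.609 | 3.932 | 3.811 | 3.784 | 3.768 | 2.672 | 2.150 | 3.017 | 2.902 | 3.134
GPT-3.5 | 1.716 | 2.339 | 2.212 | 1.921 | 2.316 | 2.101 | 2.629 | 2.917 | 2.700 | 2.749 | 2.565 | 2.422 | 1.660 | 2.526 | 2.293 | 2.381
GPT-4 | 2.250 | 2.855 | 2.785 | 2.721 | 2.873 | 2.697 | 3.332 | 3.669 | 3.343 | 3.448 | 3.143 | 3.184 | 2.153 | 3.010 | 2.873 | 3.006

Open-source Models
ChatGLM3-6B | 2.016 | 2.792 | 2.704 | 2.455 | 2.812 | 2.556 | 3.269 | 3.647 | 3.283 | 3.399 | 3.064 | 2.932 | 1.969 | 2.993 | 2.739 | 2.898
Baichuan2-7B | 1.813 | 2.849 | 2.929 | 2.830 | 3.081 | 2.700 | 3.551 | 3.894 | 3.827 | 3.757 | 3.670 | 2.728 | 2.115 | 2.984 | 2.874 | 3.110
Baichuan2-13B | 1.802 | 2.869 | 2.946 | 2.808 | 3.081 | 2.701 | 3.596 | 3.924 | 3.864 | 3.759 | 3.700 | 2.703 | 2.136 | 3.021 | 2.890 | 3.116
InternLM-7B | 1.782 | 2.800 | 2.781 | 2.719 | 3.016 | 2.620 | 3.527 | 3.823 | 3.744 | 3.698 | 3.546 | 2.622 | 2.070 | 2.897 | 2.784 | 2.983
InternLM-20B | 1.945 | 2.916 | 2.920 | 2.753 | 3.041 | 2.715 | 3.576 | 3.943 | 3.717 | 3.745 | 3.582 | 2.885 | 2.132 | 3.047 | 2.911 | 3.123
CharacterGLM | 1.640 | 2.819 | 2.738 | 2.301 | 2.969 | 2.493 | 3.414 | 3.717 | 3.737 | 3.623 | 3.738 | 2.265 | 1.966 | 2.812 | 2.695 | 2.937
Llama-3.1-8B | 2.197 | 2.701 | 2.615 | 3.130 | 2.704 | 2.669 | 3.059 | 3.477 | 3.071 | 3.202 | 2.922 | 2.934 | 2.634 | 2.759 | 2.812 | 2.894
Qwen-7B | 1.956 | 2.728 | 2.633 | 2.605 | 2.780 | 2.540 | 3.187 | 3.564 | 3.229 | 3.327 | 3.036 | 2.791 | 2.052 | 2.838 | 2.679 | 2.848
Qwen-14B | 1.988 | 2.800 | 2.811 | 2.744 | 2.900 | 2.649 | 3.351 | 3.765 | 3.510 | 3.542 | 3.354 | 2.871 | 2.237 | 2.970 | 2.858 | 3.016
Qwen2-7B-Instruct | 1.966 | 2.537 | 2.412 | 2.313 | 2.436 | 2.333 | 2.864 | 3.171 | 2.743 | 2.926 | 2.655 | 2.612 | 1.867 | 2.654 | 2.447 | 2.569
Qwen2.5-7B-Instruct | 2.207 | 2.740 | 2.633 | 2.700 | 2.614 | 2.579 | 3.125 | 3.401 | 2.971 | 3.166 | 2.786 | 2.871 | 2.180 | 2.826 | 2.666 | 2.804
OmniCharacter-7B (Ours) | 2.230 | 3.040 | 2.918 | 3.531 | 2.988 | 2.941 | 3.369 | 3.768 | 3.410 | 3.516 | 3.374 | 3.261 | 3.002 | 3.187 | 3.206 | 3.221
UniCharacter-7B (Ours) | 1.897 | 3.083 | 2.998 | 3.210 | 3.349 | 2.907 | 3.658 | 4.023 | 4.122 | 3.394 | 3.958 | 2.821 | 2.319 | 3.188 | 3.072 | 3.304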
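
The page does not state how the Avg. columns in the CharacterEval table are aggregated. The sketch below checks one plausible rule against the Baichuan-NPC row: each dimension average is the plain mean of its sub-metrics, and the overall average is the mean of the three dimension averages. This rule is inferred from the numbers, not confirmed by the source, and the helper name `summarize` is hypothetical.

```python
from statistics import mean

# Baichuan-NPC sub-metric scores, copied from the table above.
baichuan_npc = {
    "Character Consistency": [1.802, 2.964, 2.993, 2.910, 3.151],  # KE, KA, KH, PB, PU
    "Conversational Ability": [3.578, 3.898, 3.916],               # Flu., Coh., Cons.
    "Role-playing Attractiveness": [3.836, 2.643, 2.336, 2.971],   # HL, CS, ED, Emp.
}

# Averages reported in the same row, used only for comparison.
reported = {
    "Character Consistency": 2.764,
    "Conversational Ability": 3.798,
    "Role-playing Attractiveness": 2.946,
    "Overall": 3.169,
}


def summarize(groups: dict[str, list[float]]) -> tuple[dict[str, float], float]:
    """Mean per dimension, then the mean of those dimension means.

    Assumed aggregation rule; note it differs from averaging all twelve
    sub-metrics directly, because the dimensions have different sizes.
    """
    per_dimension = {name: mean(scores) for name, scores in groups.items()}
    overall = mean(list(per_dimension.values()))
    return per_dimension, overall


if __name__ == "__main__":
    per_dimension, overall = summarize(baichuan_npc)
    for name, value in per_dimension.items():
        print(f"{name}: computed {value:.3f}, reported {reported[name]:.3f}")
    print(f"Overall: computed {overall:.3f}, reported {reported['Overall']:.3f}")
```

For this row the computed and reported values agree to within about 0.001, which is what rounding of the published sub-scores would allow. Not every row has to satisfy the rule this closely, so the same check can be used to pick out cells worth re-verifying against the paper.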

Abstract

Existing Role-Playing Agents (RPAs), powered by large language models, are predominantly evaluated on static, text-only, dyadic conversations, which inadequately reflect the complexity of realistic human interactions involving multiple interlocutors and multi-modal communication. To bridge this gap, we propose OmniCharacter++, the first benchmark for evaluating multi-character interactions in a joint text-speech context. Specifically, OmniCharacter++ contributes: (1) a large-scale dataset comprising 10,287 characters, 118,017 multi-turn dialogues, and over one million audio responses across 8 open-world topics and 31 subfields, covering diverse multi-modal role-playing scenarios; (2) a comprehensive evaluation suite for dialogue understanding, generation quality, and perceptual naturalness; and (3) UniCharacter-7B, a unified text-speech model trained on this dataset to manage complex multi-character dynamics, ensuring both role-specific vocal fidelity and cross-participant semantic alignment. Experimental results demonstrate that UniCharacter-7B produces more realistic role-playing responses, with gains in both attractiveness and consistency. They also show that OmniCharacter++ poses substantial challenges for state-of-the-art models, charting a clear path for future research.

BibTeX

@article{zhang2026omnicharacterpp,
  title={OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents},
  author={Zhang, Haonan and Zeng, Pengpeng and Zhang, Ji and Song, Jingkuan and Sebe, Nicu and Shen, Heng Tao and Gao, Lianli},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  doi={10.1109/TPAMI.2026.3690447},
  year={2026}
}