# Leaderboard of CoLLaM

## Submit your results

The performance of top models at each level is reported below.

We welcome submissions to the CoLLaM leaderboard: please send your result scores to this email.

**Note:** the scores below are shown for illustration only and are not fully accurate.

## Leaderboard

Zero-shot performance (%) of various models at the Memorization (tasks 1-1 to 1-3), Understanding (tasks 2-1 to 2-5), and Logic Inference (tasks 3-1 to 3-6) levels, all scored by accuracy (Acc.). The best score in each column is marked in bold; a scoring sketch follows the table.
| Model | 1-1 | 1-2 | 1-3 | 2-1 | 2-2 | 2-3 | 2-4 | 2-5 | 3-1 | 3-2 | 3-3 | 3-4 | 3-5 | 3-6 |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| GPT-4 | 27.2 | 34.8 | 14.0 | 79.8 | **51.0** | **94.0** | 77.2 | **96.2** | 79.2 | 68.3 | 62.4 | **33.2** | **66.0** | **51.0** |
| Qwen-14B-Chat | **29.4** | 38.6 | 11.4 | **93.0** | 44.7 | 90.0 | **86.0** | 91.6 | **80.0** | **90.5** | **66.4** | 30.4 | 44.7 | 49.2 |
| Qwen-7B-Chat | 22.4 | **38.8** | 8.4 | 79.8 | 43.3 | 88.0 | 67.0 | 92.6 | 79.4 | 84.0 | 25.8 | 24.6 | 45.6 | 30.6 |
| ChatGPT | 20.1 | 26.3 | 9.0 | 57.3 | 42.3 | 83.2 | 77.0 | 80.0 | 77.8 | 58.9 | 57.1 | 18.9 | 39.6 | 40.2 |
| Baichuan-13B-Chat | 14.4 | 33.7 | 10.0 | 54.4 | 35.0 | 73.0 | 62.2 | 75.6 | 76.8 | 57.5 | 34.6 | 20.0 | 33.5 | 21.2 |
| InternLM-7B-Chat | 20.6 | 36.4 | 10.4 | 59.4 | 41.7 | 88.0 | 48.6 | 54.6 | 75.5 | 76.6 | 22.8 | 22.6 | 37.3 | 42.6 |
| ChatGLM3 | 20.2 | 28.7 | 6.4 | 40.0 | 36.7 | 69.0 | 64.0 | 79.4 | 71.3 | 58.8 | 16.8 | 20.2 | 42.9 | 37.6 |
| ChatGLM2 | 28.8 | 25.9 | **16.1** | 24.0 | 30.7 | 64.0 | 53.2 | 66.6 | 77.7 | 57.2 | 4.0 | 24.0 | 29.9 | 14.0 |
| Baichuan-13B-base | 20.0 | 14.0 | 8.4 | 35.4 | 25.7 | 67.0 | 59.2 | 74.6 | 58.8 | 24.1 | 38.4 | 23.4 | 30.5 | 12.2 |
| Chinese-Alpaca-2-7B | 16.0 | 20.3 | 15.4 | 34.0 | 26.7 | 64.0 | 54.4 | 30.8 | 63.6 | 48.5 | 60.2 | 14.8 | 21.8 | 13.2 |
| Fuzi-Mingcha | 13.0 | 25.0 | 6.7 | 62.0 | 29.0 | 61.0 | 46.4 | 24.8 | 68.5 | 58.6 | 15.6 | 16.0 | 28.9 | 18.2 |
| ChatLaw-33B | 16.0 | 25.9 | 7.0 | 51.4 | 31.3 | 76.0 | 67.6 | 62.2 | 60.0 | 53.2 | 12.2 | 15.4 | 23.6 | 26.2 |
| InternLM-7B | 20.4 | 9.4 | 13.0 | 2.6 | 28.3 | 58.0 | 60.0 | 58.4 | 71.7 | 43.6 | 63.8 | 21.8 | 35.0 | 15.0 |
| TigerBot-base | 16.6 | 28.4 | 10.7 | 22.2 | 27.0 | 61.0 | 53.2 | 24.4 | 71.7 | 36.5 | 26.2 | 20.0 | 30.7 | 18.8 |
| BELLE-LLAMA-2-Chat | 15.6 | 23.2 | 8.0 | 30.4 | 25.0 | 67.0 | 53.6 | 42.8 | 63.1 | 44.2 | 23.6 | 17.6 | 30.2 | 19.4 |
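
All tasks in the table above are scored by accuracy: the fraction of examples where the answer extracted from the model's output exactly matches the gold label. The snippet below is a minimal sketch of such a scorer, not CoLLaM's official evaluation code; the record fields (`prediction`, `answer`) and the option-letter extraction rule are assumptions.

```python
import re

def extract_choice(output: str) -> str:
    """Pull the first option letter (A-D) out of a model's free-form output.
    This extraction rule is an assumption, not CoLLaM's official one."""
    match = re.search(r"[ABCD]", output)
    return match.group(0) if match else ""

def accuracy(examples: list[dict]) -> float:
    """Exact-match accuracy over {'prediction': ..., 'answer': ...} records."""
    if not examples:
        return 0.0
    correct = sum(extract_choice(ex["prediction"]) == ex["answer"] for ex in examples)
    return 100.0 * correct / len(examples)  # reported as a percentage

# Toy usage with hypothetical records:
records = [
    {"prediction": "The answer is B.", "answer": "B"},
    {"prediction": "A", "answer": "C"},
]
print(f"Acc. = {accuracy(records):.1f}%")  # -> Acc. = 50.0%
```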
Zero-shot performance (%) of various models at the Discrimination (tasks 4-1 and 4-2, Acc.), Generation (tasks 5-1 to 5-4, Rouge-L), and Ethic (tasks 6-1 to 6-3, Acc.) levels, together with the average over all 23 tasks and the resulting rank. The best score in each column is marked in bold; a scoring sketch follows the table.
| Model | 4-1 | 4-2 | 5-1 | 5-2 | 5-3 | 5-4 | 6-1 | 6-2 | 6-3 | Average | Rank |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|---------|------|
| GPT-4 | **34.0** | **39.1** | 26.9 | 14.2 | **38.9** | 15.7 | **65.2** | **55.2** | **75.8** | **52.1** | 1 |
| Qwen-14B-Chat | 28.6 | 31.6 | 33.4 | **24.1** | 35.7 | 18.6 | 31.2 | 42.2 | 63.2 | 50.2 | 2 |
| Qwen-7B-Chat | 25.4 | 28.9 | 31.5 | 19.2 | 34.7 | 18.3 | 22.1 | 39.1 | 56.6 | 43.4 | 3 |
| ChatGPT | 22.7 | 22.4 | 24.0 | 10.7 | 38.0 | 17.1 | 33.7 | 32.1 | 55.8 | 41.1 | 4 |
| Baichuan-13B-Chat | 20.0 | 20.4 | 32.0 | 6.7 | 35.7 | 17.3 | 16.4 | 22.0 | 40.8 | 35.4 | 5 |
| InternLM-7B-Chat | 0.2 | 9.5 | 17.7 | 2.1 | 29.2 | 11.6 | 22.6 | 28.1 | 48.4 | 35.1 | 6 |
| ChatGLM3 | 25.6 | 14.8 | 27.2 | 17.1 | 29.0 | 14.3 | 21.1 | 30.7 | 49.0 | 34.9 | 7 |
| ChatGLM2 | 22.8 | 18.4 | 29.4 | 15.0 | 26.0 | 14.4 | 35.0 | 26.1 | 52.0 | 32.8 | 8 |
| Baichuan-13B-base | 15.8 | 22.4 | 27.3 | 7.8 | 23.8 | **20.1** | 15.9 | 27.5 | 43.4 | 30.2 | 9 |
| Chinese-Alpaca-2-7B | 25.2 | 15.5 | 28.5 | 15.6 | 31.9 | 13.5 | 17.8 | 20.4 | 31.2 | 29.7 | 10 |
| Fuzi-Mingcha | 18.8 | 16.1 | **54.0** | 20.7 | 21.4 | 17.4 | 10.8 | 13.1 | 25.0 | 29.1 | 11 |
| ChatLaw-33B | 10.0 | 17.1 | 21.4 | 7.0 | 14.2 | 13.2 | 15.3 | 19.1 | 34.2 | 28.6 | 12 |
| InternLM-7B | 3.0 | 15.8 | 2.2 | 5.7 | 19.3 | 7.4 | 21.9 | 30.6 | 50.6 | 28.5 | 13 |
| TigerBot-base | 26.0 | 23.4 | 21.4 | 11.6 | 30.4 | 13.4 | 16.7 | 20.4 | 40.6 | 28.3 | 14 |
| BELLE-LLAMA-2-Chat | 10.0 | 21.7 | 21.6 | 7.5 | 22.1 | 13.3 | 24.5 | 22.5 | 39.2 | 28.0 | 15 |
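
The Generation tasks (5-1 to 5-4) are scored with Rouge-L rather than accuracy, and the Average column condenses all 23 task scores into the single number that determines Rank. The sketch below shows one plausible version of both computations; the character-level Rouge-L for Chinese text and the unweighted mean are assumptions, not confirmed details of CoLLaM's official scorer.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (rolling-row DP)."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value from the previous row, one column to the left
        for j, y in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(reference: str, prediction: str) -> float:
    """Character-level Rouge-L F1, as a percentage. Character-level matching
    for Chinese is an assumption about CoLLaM's pipeline."""
    ref, pred = list(reference), list(prediction)
    lcs = lcs_len(ref, pred)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(pred)
    return 100.0 * 2 * precision * recall / (precision + recall)

def average_and_rank(task_scores: dict[str, list[float]]) -> list[tuple[str, float, int]]:
    """Mean over all 23 task scores, then rank by descending mean.
    The unweighted mean is an assumption about the Average column."""
    means = {m: sum(s) / len(s) for m, s in task_scores.items()}
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return [(m, avg, r) for r, (m, avg) in enumerate(ordered, start=1)]

# Toy usage: LCS = 7, recall = 7/7, precision = 7/9 -> F1 = 87.5
print(f"Rouge-L = {rouge_l('中华人民共和国', '中华人民共和国刑法'):.1f}")
```

Character-level matching sidesteps Chinese word segmentation entirely; a word-level Rouge-L computed over segmented text would produce different numbers.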