Leaderboard of CoLLaM
Submit your results
The performance of top models at each level is reported below.
We welcome submissions to CoLLaM: please send your result scores to this email.
Note: the scores below are for illustration only; the data are not fully accurate.
Leaderboard
Columns 1-1 to 1-3 report Memorization (Acc.), 2-1 to 2-5 Understanding (Acc.), and 3-1 to 3-6 Logic Inference (Acc.).

Model | 1-1 | 1-2 | 1-3 | 2-1 | 2-2 | 2-3 | 2-4 | 2-5 | 3-1 | 3-2 | 3-3 | 3-4 | 3-5 | 3-6
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GPT-4 | 27.2 | 34.8 | 14.0 | 79.8 | 51.0 | 94.0 | 77.2 | 96.2 | 79.2 | 68.3 | 62.4 | 33.2 | 66.0 | 51.0
Qwen-14B-Chat | 29.4 | 38.6 | 11.4 | 93.0 | 44.7 | 90.0 | 86.0 | 91.6 | 80.0 | 90.5 | 66.4 | 30.4 | 44.7 | 49.2
Qwen-7B-Chat | 22.4 | 38.8 | 8.4 | 79.8 | 43.3 | 88.0 | 67.0 | 92.6 | 79.4 | 84.0 | 25.8 | 24.6 | 45.6 | 30.6
ChatGPT | 20.1 | 26.3 | 9.0 | 57.3 | 42.3 | 83.2 | 77.0 | 80.0 | 77.8 | 58.9 | 57.1 | 18.9 | 39.6 | 40.2
Baichuan-13B-Chat | 14.4 | 33.7 | 10.0 | 54.4 | 35.0 | 73.0 | 62.2 | 75.6 | 76.8 | 57.5 | 34.6 | 20.0 | 33.5 | 21.2
InternLM-7B-Chat | 20.6 | 36.4 | 10.4 | 59.4 | 41.7 | 88.0 | 48.6 | 54.6 | 75.5 | 76.6 | 22.8 | 22.6 | 37.3 | 42.6
ChatGLM3 | 20.2 | 28.7 | 6.4 | 40.0 | 36.7 | 69.0 | 64.0 | 79.4 | 71.3 | 58.8 | 16.8 | 20.2 | 42.9 | 37.6
ChatGLM2 | 28.8 | 25.9 | 16.1 | 24.0 | 30.7 | 64.0 | 53.2 | 66.6 | 77.7 | 57.2 | 4.0 | 24.0 | 29.9 | 14.0
Baichuan-13B-base | 20.0 | 14.0 | 8.4 | 35.4 | 25.7 | 67.0 | 59.2 | 74.6 | 58.8 | 24.1 | 38.4 | 23.4 | 30.5 | 12.2
Chinese-Alpaca-2-7B | 16.0 | 20.3 | 15.4 | 34.0 | 26.7 | 64.0 | 54.4 | 30.8 | 63.6 | 48.5 | 60.2 | 14.8 | 21.8 | 13.2
Fuzi-Mingcha | 13.0 | 25.0 | 6.7 | 62.0 | 29.0 | 61.0 | 46.4 | 24.8 | 68.5 | 58.6 | 15.6 | 16.0 | 28.9 | 18.2
ChatLaw-33B | 16.0 | 25.9 | 7.0 | 51.4 | 31.3 | 76.0 | 67.6 | 62.2 | 60.0 | 53.2 | 12.2 | 15.4 | 23.6 | 26.2
InternLM-7B | 20.4 | 9.4 | 13.0 | 2.6 | 28.3 | 58.0 | 60.0 | 58.4 | 71.7 | 43.6 | 63.8 | 21.8 | 35.0 | 15.0
TigerBot-base | 16.6 | 28.4 | 10.7 | 22.2 | 27.0 | 61.0 | 53.2 | 24.4 | 71.7 | 36.5 | 26.2 | 20.0 | 30.7 | 18.8
BELLE-LLAMA-2-Chat | 15.6 | 23.2 | 8.0 | 30.4 | 25.0 | 67.0 | 53.6 | 42.8 | 63.1 | 44.2 | 23.6 | 17.6 | 30.2 | 19.4
Columns 4-1 and 4-2 report Discrimination (Acc.), 5-1 to 5-4 Generation (ROUGE-L), and 6-1 to 6-3 Ethics (Acc.); sketches of how ROUGE-L and the Average/Rank columns are computed follow the table.

Model | 4-1 | 4-2 | 5-1 | 5-2 | 5-3 | 5-4 | 6-1 | 6-2 | 6-3 | Average | Rank
---|---|---|---|---|---|---|---|---|---|---|---
GPT-4 | 34.0 | 39.1 | 26.9 | 14.2 | 38.9 | 15.7 | 65.2 | 55.2 | 75.8 | 52.1 | 1
Qwen-14B-Chat | 28.6 | 31.6 | 33.4 | 24.1 | 35.7 | 18.6 | 31.2 | 42.2 | 63.2 | 50.2 | 2
Qwen-7B-Chat | 25.4 | 28.9 | 31.5 | 19.2 | 34.7 | 18.3 | 22.1 | 39.1 | 56.6 | 43.4 | 3
ChatGPT | 22.7 | 22.4 | 24.0 | 10.7 | 38.0 | 17.1 | 33.7 | 32.1 | 55.8 | 41.1 | 4
Baichuan-13B-Chat | 20.0 | 20.4 | 32.0 | 6.7 | 35.7 | 17.3 | 16.4 | 22.0 | 40.8 | 35.4 | 5
InternLM-7B-Chat | 0.2 | 9.5 | 17.7 | 2.1 | 29.2 | 11.6 | 22.6 | 28.1 | 48.4 | 35.1 | 6
ChatGLM3 | 25.6 | 14.8 | 27.2 | 17.1 | 29.0 | 14.3 | 21.1 | 30.7 | 49.0 | 34.9 | 7
ChatGLM2 | 22.8 | 18.4 | 29.4 | 15.0 | 26.0 | 14.4 | 35.0 | 26.1 | 52.0 | 32.8 | 8
Baichuan-13B-base | 15.8 | 22.4 | 27.3 | 7.8 | 23.8 | 20.1 | 15.9 | 27.5 | 43.4 | 30.2 | 9
Chinese-Alpaca-2-7B | 25.2 | 15.5 | 28.5 | 15.6 | 31.9 | 13.5 | 17.8 | 20.4 | 31.2 | 29.7 | 10
Fuzi-Mingcha | 18.8 | 16.1 | 54.0 | 20.7 | 21.4 | 17.4 | 10.8 | 13.1 | 25.0 | 29.1 | 11
ChatLaw-33B | 10.0 | 17.1 | 21.4 | 7.0 | 14.2 | 13.2 | 15.3 | 19.1 | 34.2 | 28.6 | 12
InternLM-7B | 3.0 | 15.8 | 2.2 | 5.7 | 19.3 | 7.4 | 21.9 | 30.6 | 50.6 | 28.5 | 13
TigerBot-base | 26.0 | 23.4 | 21.4 | 11.6 | 30.4 | 13.4 | 16.7 | 20.4 | 40.6 | 28.3 | 14
BELLE-LLAMA-2-Chat | 10.0 | 21.7 | 21.6 | 7.5 | 22.1 | 13.3 | 24.5 | 22.5 | 39.2 | 28.0 | 15
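
The Average column appears to be the unweighted mean of a model's scores over all 23 tasks, and Rank simply orders models by that mean: GPT-4's 23 scores sum to 1199.3, and 1199.3 / 23 ≈ 52.1. A minimal Python sketch of this aggregation (the function name and data layout are illustrative, not part of CoLLaM's tooling):

```python
def rank_by_average(model_scores):
    """model_scores maps model name -> {task id: score} over the 23 tasks.

    Returns (rank, model, average) tuples, best average first.
    """
    # Unweighted mean over all tasks, regardless of category.
    averages = {
        model: sum(scores.values()) / len(scores)
        for model, scores in model_scores.items()
    }
    # Higher average -> better rank (1 is best).
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, model, round(avg, 1))
        for rank, (model, avg) in enumerate(ordered, start=1)
    ]
```

Applied to the rows above, this reproduces the published Average and Rank values for the rows spot-checked (GPT-4: 52.1, rank 1; Qwen-14B-Chat: 50.2, rank 2).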
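
The generation tasks (5-1 to 5-4) are scored with ROUGE-L, which rewards the longest common subsequence (LCS) shared between a model's output and the reference answer. CoLLaM's exact tokenization and F-measure weighting are not stated here, so the following token-level ROUGE-L F1 is only a sketch of the metric itself:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # Classic dynamic program, kept to a single row for O(len(b)) memory.
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp[j-1] from the previous row
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between candidate and reference token sequences."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_len(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For Chinese legal text, ROUGE is often computed over character sequences rather than word tokens; that only changes how `candidate` and `reference` are produced before calling `rouge_l_f1`.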