Leaderboard of CoLLaM
Submit your results
The performance of top models at each level is reported below.
We welcome submissions to CoLLaM: please send your result scores to this email.
Note: the scores below are for illustration only; the data are not fully accurate.
Leaderboard
Columns 1-1 to 1-3 report Memorization (Acc.), 2-1 to 2-5 Understanding (Acc.), and 3-1 to 3-6 Logic Inference (Acc.).

Model | 1-1 | 1-2 | 1-3 | 2-1 | 2-2 | 2-3 | 2-4 | 2-5 | 3-1 | 3-2 | 3-3 | 3-4 | 3-5 | 3-6
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GPT-4 | 27.2 | 34.8 | 14.0 | 79.8 | 51.0 | 94.0 | 77.2 | 96.2 | 79.2 | 68.3 | 62.4 | 33.2 | 66.0 | 51.0
Qwen-14B-Chat | 29.4 | 38.6 | 11.4 | 93.0 | 44.7 | 90.0 | 86.0 | 91.6 | 80.0 | 90.5 | 66.4 | 30.4 | 44.7 | 49.2
Qwen-7B-Chat | 22.4 | 38.8 | 8.4 | 79.8 | 43.3 | 88.0 | 67.0 | 92.6 | 79.4 | 84.0 | 25.8 | 24.6 | 45.6 | 30.6
ChatGPT | 20.1 | 26.3 | 9.0 | 57.3 | 42.3 | 83.2 | 77.0 | 80.0 | 77.8 | 58.9 | 57.1 | 18.9 | 39.6 | 40.2
Baichuan-13B-Chat | 14.4 | 33.7 | 10.0 | 54.4 | 35.0 | 73.0 | 62.2 | 75.6 | 76.8 | 57.5 | 34.6 | 20.0 | 33.5 | 21.2
InternLM-7B-Chat | 20.6 | 36.4 | 10.4 | 59.4 | 41.7 | 88.0 | 48.6 | 54.6 | 75.5 | 76.6 | 22.8 | 22.6 | 37.3 | 42.6
ChatGLM3 | 20.2 | 28.7 | 6.4 | 40.0 | 36.7 | 69.0 | 64.0 | 79.4 | 71.3 | 58.8 | 16.8 | 20.2 | 42.9 | 37.6
ChatGLM2 | 28.8 | 25.9 | 16.1 | 24.0 | 30.7 | 64.0 | 53.2 | 66.6 | 77.7 | 57.2 | 4.0 | 24.0 | 29.9 | 14.0
Baichuan-13B-base | 20.0 | 14.0 | 8.4 | 35.4 | 25.7 | 67.0 | 59.2 | 74.6 | 58.8 | 24.1 | 38.4 | 23.4 | 30.5 | 12.2
Chinese-Alpaca-2-7B | 16.0 | 20.3 | 15.4 | 34.0 | 26.7 | 64.0 | 54.4 | 30.8 | 63.6 | 48.5 | 60.2 | 14.8 | 21.8 | 13.2
Fuzi-Mingcha | 13.0 | 25.0 | 6.7 | 62.0 | 29.0 | 61.0 | 46.4 | 24.8 | 68.5 | 58.6 | 15.6 | 16.0 | 28.9 | 18.2
ChatLaw-33B | 16.0 | 25.9 | 7.0 | 51.4 | 31.3 | 76.0 | 67.6 | 62.2 | 60.0 | 53.2 | 12.2 | 15.4 | 23.6 | 26.2
InternLM-7B | 20.4 | 9.4 | 13.0 | 2.6 | 28.3 | 58.0 | 60.0 | 58.4 | 71.7 | 43.6 | 63.8 | 21.8 | 35.0 | 15.0
TigerBot-base | 16.6 | 28.4 | 10.7 | 22.2 | 27.0 | 61.0 | 53.2 | 24.4 | 71.7 | 36.5 | 26.2 | 20.0 | 30.7 | 18.8
BELLE-LLAMA-2-Chat | 15.6 | 23.2 | 8.0 | 30.4 | 25.0 | 67.0 | 53.6 | 42.8 | 63.1 | 44.2 | 23.6 | 17.6 | 30.2 | 19.4
Columns 4-1 and 4-2 report Discrimination (Acc.), 5-1 to 5-4 Generation (ROUGE-L), and 6-1 to 6-3 Ethics (Acc.); sketches of how ROUGE-L and the Average/Rank columns are computed follow the table.

Model | 4-1 | 4-2 | 5-1 | 5-2 | 5-3 | 5-4 | 6-1 | 6-2 | 6-3 | Average | Rank
---|---|---|---|---|---|---|---|---|---|---|---
GPT-4 | 34.0 | 39.1 | 26.9 | 14.2 | 38.9 | 15.7 | 65.2 | 55.2 | 75.8 | 52.1 | 1
Qwen-14B-Chat | 28.6 | 31.6 | 33.4 | 24.1 | 35.7 | 18.6 | 31.2 | 42.2 | 63.2 | 50.2 | 2
Qwen-7B-Chat | 25.4 | 28.9 | 31.5 | 19.2 | 34.7 | 18.3 | 22.1 | 39.1 | 56.6 | 43.4 | 3
ChatGPT | 22.7 | 22.4 | 24.0 | 10.7 | 38.0 | 17.1 | 33.7 | 32.1 | 55.8 | 41.1 | 4
Baichuan-13B-Chat | 20.0 | 20.4 | 32.0 | 6.7 | 35.7 | 17.3 | 16.4 | 22.0 | 40.8 | 35.4 | 5
InternLM-7B-Chat | 0.2 | 9.5 | 17.7 | 2.1 | 29.2 | 11.6 | 22.6 | 28.1 | 48.4 | 35.1 | 6
ChatGLM3 | 25.6 | 14.8 | 27.2 | 17.1 | 29.0 | 14.3 | 21.1 | 30.7 | 49.0 | 34.9 | 7
ChatGLM2 | 22.8 | 18.4 | 29.4 | 15.0 | 26.0 | 14.4 | 35.0 | 26.1 | 52.0 | 32.8 | 8
Baichuan-13B-base | 15.8 | 22.4 | 27.3 | 7.8 | 23.8 | 20.1 | 15.9 | 27.5 | 43.4 | 30.2 | 9
Chinese-Alpaca-2-7B | 25.2 | 15.5 | 28.5 | 15.6 | 31.9 | 13.5 | 17.8 | 20.4 | 31.2 | 29.7 | 10
Fuzi-Mingcha | 18.8 | 16.1 | 54.0 | 20.7 | 21.4 | 17.4 | 10.8 | 13.1 | 25.0 | 29.1 | 11
ChatLaw-33B | 10.0 | 17.1 | 21.4 | 7.0 | 14.2 | 13.2 | 15.3 | 19.1 | 34.2 | 28.6 | 12
InternLM-7B | 3.0 | 15.8 | 2.2 | 5.7 | 19.3 | 7.4 | 21.9 | 30.6 | 50.6 | 28.5 | 13
TigerBot-base | 26.0 | 23.4 | 21.4 | 11.6 | 30.4 | 13.4 | 16.7 | 20.4 | 40.6 | 28.3 | 14
BELLE-LLAMA-2-Chat | 10.0 | 21.7 | 21.6 | 7.5 | 22.1 | 13.3 | 24.5 | 22.5 | 39.2 | 28.0 | 15
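
The Average column appears to be the unweighted mean of a model's scores over all 23 tasks, and Rank simply orders models by that mean: GPT-4's 23 scores sum to 1199.3, and 1199.3 / 23 ≈ 52.1. A minimal Python sketch of this aggregation (the function name and data layout are illustrative, not part of CoLLaM's tooling):

```python
def rank_by_average(model_scores):
    """model_scores maps model name -> {task id: score} over the 23 tasks.

    Returns (rank, model, average) tuples, best average first.
    """
    # Unweighted mean over all tasks, regardless of category.
    averages = {
        model: sum(scores.values()) / len(scores)
        for model, scores in model_scores.items()
    }
    # Higher average -> better rank (1 is best).
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, model, round(avg, 1))
        for rank, (model, avg) in enumerate(ordered, start=1)
    ]
```

Applied to the rows above, this reproduces the published Average and Rank values for the rows spot-checked (GPT-4: 52.1, rank 1; Qwen-14B-Chat: 50.2, rank 2).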
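
The generation tasks (5-1 to 5-4) are scored with ROUGE-L, which rewards the longest common subsequence (LCS) shared between a model's output and the reference answer. CoLLaM's exact tokenization and F-measure weighting are not stated here, so the following token-level ROUGE-L F1 is only a sketch of the metric itself:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # Classic dynamic program, kept to a single row for O(len(b)) memory.
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp[j-1] from the previous row
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between candidate and reference token sequences."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_len(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For Chinese legal text, ROUGE is often computed over character sequences rather than word tokens; that only changes how `candidate` and `reference` are produced before calling `rouge_l_f1`.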