Text
MMLU: 71.4% • HumanEval: 31.7%
Text
MMLU: 62.2% • HumanEval: 0%
Text
MMLU: 68.1% • HumanEval: 0%
Text
MMLU: 55.1% • HumanEval: 18.1%
Text
MMLU: 57.8% • HumanEval: 22.7%
Text
MMLU: 75% • HumanEval: 71.2%
Text
MMLU: 76.2% • HumanEval: 72.5%
Text
MMLU: 77.5% • HumanEval: 73.8%
Text
MMLU: 78.5% • HumanEval: 71.2%
Text
MMLU: 81.2% • HumanEval: 75.5%
Text • Vision
MMLU: 75.2% • HumanEval: 75.9%
Text • Vision
MMLU: 86.8% • HumanEval: 84.9%
Text • Vision
MMLU: 79% • HumanEval: 73%
Text • Vision
MMLU: 82.2% • HumanEval: 85.9%
Text • Vision
MMLU: 88.3% • HumanEval: 92%
Text • Vision
MMLU: 88.7% • HumanEval: 92%
Text • Vision • Document
MMLU: 92.3% • HumanEval: 94.7%
Text • Vision • Document
MMLU: 89.7% • HumanEval: 92.8%
Text
MMLU: 56.8% • HumanEval: 48.8%
Text
MMLU: 62.1% • HumanEval: 83.5%
Text
MMLU: 67.1% • HumanEval: 81.4%
Text
MMLU: 69.4% • HumanEval: 84.2%
Text
MMLU: 35% • HumanEval: 47%
Text
MMLU: 68.3% • HumanEval: 25.7%
Text
MMLU: 73.8% • HumanEval: 40.7%
Text
MMLU: 80.2% • HumanEval: 56.1%
Text
MMLU: 37.8% • HumanEval: 65.8%
Text
MMLU: 58.8% • HumanEval: 78.6%
Text
MMLU: 75.9% • HumanEval: 90.2%
Text
MMLU: 60.1% • HumanEval: 81.1%
Text
MMLU: 71.3% • HumanEval: 37.6%
Text
MMLU: 48.2% • HumanEval: 26.6%
Text
MMLU: 64.7% • HumanEval: 43.6%
Text
MMLU: 67.2% • HumanEval: 45.1%
Text
MMLU: 78.5% • HumanEval: 89.6%
Text
MMLU: 88.5% • HumanEval: 92.2%
Text
MMLU: 35.7% • HumanEval: 0%
Text
MMLU: 28.3% • HumanEval: 0%
Text
MMLU: 52.4% • HumanEval: 22%
Text
MMLU: 55.1% • HumanEval: 30.2%
Text
MMLU: 70.4% • HumanEval: 35%
Text
MMLU: 20% • HumanEval: 0%
Text
MMLU: 29% • HumanEval: 1.8%
Text
MMLU: 28% • HumanEval: 1.5%
Text
MMLU: 25% • HumanEval: 1%
Text
MMLU: 30% • HumanEval: 2%
Text
MMLU: 25% • HumanEval: 0%
Text
MMLU: 30% • HumanEval: 0%
Text
MMLU: 35% • HumanEval: 0%
Text
MMLU: 43.9% • HumanEval: 0%
Text
MMLU: 70% • HumanEval: 48.1%
Text
MMLU: 70% • HumanEval: 48.1%
Text • Vision
MMLU: 86.4% • HumanEval: 67%
Text • Vision
MMLU: 86.4% • HumanEval: 67%
Text • Vision
MMLU: 86.4% • HumanEval: 67%
Text • Vision
MMLU: 88.9% • HumanEval: 89.2%
Text • Vision
MMLU: 86.5% • HumanEval: 85.7%
Text • Vision
MMLU: 80.1% • HumanEval: 75.4%
Text • Vision • Audio
MMLU: 88.7% • HumanEval: 90.2%
Text • Vision
MMLU: 82% • HumanEval: 87.2%
Text
MMLU: 42.1% • HumanEval: 11.6%
Text
MMLU: 51.6% • HumanEval: 15.4%
Text • Vision • Audio
MMLU: 78.9% • HumanEval: 74.2%
Text • Vision • Audio
MMLU: 85.9% • HumanEval: 71.9%
Text • Vision • Audio
MMLU: 85.8% • HumanEval: 85.4%
Text • Vision • Audio
MMLU: 88.7% • HumanEval: 88.9%
Text
MMLU: 83.7% • HumanEval: 67.7%
Text
MMLU: 90% • HumanEval: 74.4%
Text
MMLU: 75.4% • HumanEval: 58.9%
Text • Vision
MMLU: 86% • HumanEval: 79.2%
Text
MMLU: 60% • HumanEval: 26.2%
Text
MMLU: 80.4% • HumanEval: 58.1%
Text
MMLU: 64.1% • HumanEval: 23.4%
Text
MMLU: 71.2% • HumanEval: 31.9%
Text
MMLU: 63.4% • HumanEval: 23.7%
Text
MMLU: 35.1% • HumanEval: 10.5%
Text
MMLU: 68.9% • HumanEval: 29.9%
Text
MMLU: 48.9% • HumanEval: 13.1%
Text
MMLU: 82% • HumanEval: 81.7%
Text
MMLU: 68.4% • HumanEval: 62.2%
Text
MMLU: 88.6% • HumanEval: 89%
Text • Vision
MMLU: 86.3% • HumanEval: 84.7%
Mistral 7B Instruct v0.2
INSTRUCTION FOLLOWING • Mistral AI • 2023 • Instruction-tuned Transformer with Sliding Window
Text
MMLU: 65.4% • HumanEval: 36.8%
Text
MMLU: 60.1% • HumanEval: 30.5%
Text
MMLU: 84% • HumanEval: 73%
Text
MMLU: 75.3% • HumanEval: 61.4%
Text
MMLU: 72.2% • HumanEval: 58.4%
Text
MMLU: 77.8% • HumanEval: 45.1%
Text
MMLU: 78.9% • HumanEval: 61.4%
Text
MMLU: 71.4% • HumanEval: 54.8%
Text
MMLU: 70.6% • HumanEval: 40.2%
Text
MMLU: 81.8% • HumanEval: 73.2%
Text • Vision • Video
MMLU: 82.8% • HumanEval: 68.4%
Text
MMLU: 25.8% • HumanEval: 12.2%
Text
MMLU: 42.3% • HumanEval: 18.9%
Text
MMLU: 78.3% • HumanEval: 37.6%
Text
MMLU: 70.7% • HumanEval: 26.2%
Text
MMLU: 42.1% • HumanEval: 50.6%
Text
MMLU: 52.7% • HumanEval: 47%
Text
MMLU: 75.3% • HumanEval: 70.4%
Text
MMLU: 69.2% • HumanEval: 61.8%
Text
MMLU: 69% • HumanEval: 61.2%
Text
MMLU: 70.9% • HumanEval: 68.1%
Text
MMLU: 78.9% • HumanEval: 75.8%
Text
MMLU: 84.7% • HumanEval: 82.6%
Text
MMLU: 47.2% • HumanEval: 13.2%
Text
MMLU: 77.4% • HumanEval: 64.6%
Text
MMLU: 84.2% • HumanEval: 80.7%
Text
MMLU: 46.8% • HumanEval: 54.1%
Text
MMLU: 42.9% • HumanEval: 20.9%
Text
MMLU: 33.6% • HumanEval: 33.6%
Text
MMLU: 46.2% • HumanEval: 46.2%
Text
MMLU: 68.7% • HumanEval: 26.2%
Text
MMLU: 52.4% • HumanEval: 15.8%
Text
MMLU: 75.2% • HumanEval: 42.8%
Text
MMLU: 45.8% • HumanEval: 12.4%
Vicuna 33B v1.3
CONVERSATIONAL • UC Berkeley, CMU, Stanford, UC San Diego, MBZUAI • 2023 • LLaMA-based Instruction-tuned
Text
MMLU: 59.2% • HumanEval: 25.6%
Text • Vision
MMLU: 94.8% • HumanEval: 92.3%
Text
MMLU: 85.2% • HumanEval: 87%
Text
MMLU: 90.8% • HumanEval: 89.7%