<aside> ❓
What's the best Python library for LLM evals on datasets such as GSM8K and MMLU? I want to benchmark local models as well as API-call models on them.
</aside>
For LLM evaluations on datasets like GSM8K and MMLU, I'd recommend using the following Python libraries:
- EleutherAI/lm-evaluation-harness: An open-source framework that offers standardized evaluation across many benchmarks, including GSM8K and MMLU.
- LangChain: For API-based models, LangChain provides tools for evaluation and benchmarking.
- HuggingFace Evaluate: Part of the HuggingFace ecosystem, it provides evaluation metrics that pair with the datasets library (a short sketch follows this list).
- TruLens: Good for evaluating both local and API models, with a focus on quality and bias metrics.
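If you only need to score already-generated outputs rather than run a full benchmark, HuggingFace Evaluate works as a lightweight complement. A minimal sketch, with made-up predictions and references as placeholders:

```python
import evaluate

# "exact_match" is the kind of string-match metric used for GSM8K-style answer checking
exact_match = evaluate.load("exact_match")

# Placeholder model answers vs. gold answers
score = exact_match.compute(
    predictions=["72", "10", "5"],
    references=["72", "10", "6"],
)
print(score)  # e.g. {'exact_match': 0.667}
```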
EleutherAI's lm-evaluation-harness is the most established framework purpose-built for standardized LLM benchmarking, and it has become the de facto standard for consistent evaluation across models and benchmarks.
A basic run with the harness's Python API (v0.4-style; older releases used a different interface) looks like this:
```python
from lm_eval import evaluator

# Evaluate a local Hugging Face model
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["gsm8k", "mmlu"],
    num_fewshot=0,   # zero-shot evaluation
    batch_size=8,
    device="cuda",
)

# Or evaluate an API model; the OpenAI backends read OPENAI_API_KEY from the
# environment. Chat endpoints only support generative tasks (e.g. GSM8K),
# not loglikelihood-based ones like MMLU.
api_results = evaluator.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4",
    tasks=["gsm8k"],
    num_fewshot=0,
)

# Per-task metrics are nested under the "results" key
print(results["results"])
```
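The returned dictionary nests per-task scores under a "results" key. As an illustrative follow-up (the exact metric key names vary by task and harness version), you can pull out one task's numbers and save everything for later cross-model comparison:

```python
import json

# Inspect one task's metrics; key names (e.g. "exact_match,strict-match") vary by task/version
print(results["results"]["gsm8k"])

# Persist the full results for comparing models later
with open("mistral-7b-instruct_evals.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```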
The harness has specialized handling for these benchmarks: