<aside> ❓
What's the best Python library for LLM evals on datasets such as GSM8K and MMLU? I want to benchmark local models as well as API-call models on them.
</aside>
For LLM evaluations on datasets like GSM8K and MMLU, I'd recommend using the following Python libraries:
- EleutherAI/lm-evaluation-harness: An open-source framework that offers standardized evaluation across many benchmarks, including GSM8K and MMLU.
- LangChain: For API-based models, LangChain provides tools for evaluation and benchmarking.
- HuggingFace Evaluate: Part of the HuggingFace ecosystem, it provides evaluation metrics that pair with the datasets library (a short sketch follows this list).
- TruLens: Good for evaluating both local and API models, with a focus on quality and bias metrics.
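If you only need to score already-generated outputs rather than run a full benchmark, HuggingFace Evaluate works as a lightweight complement. A minimal sketch, with made-up predictions and references as placeholders:

```python
import evaluate

# "exact_match" is the kind of string-match metric used for GSM8K-style answer checking
exact_match = evaluate.load("exact_match")

# Placeholder model answers vs. gold answers
score = exact_match.compute(
    predictions=["72", "10", "5"],
    references=["72", "10", "6"],
)
print(score)  # e.g. {'exact_match': 0.667}
```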
EleutherAI's lm-evaluation-harness is the most established framework purpose-built for standardized LLM benchmarking, and it has become the de facto standard for consistent evaluation across models and benchmarks.
A basic run with the harness's Python API (v0.4-style; older releases used a different interface) looks like this:
```python
from lm_eval import evaluator

# Evaluate a local Hugging Face model
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["gsm8k", "mmlu"],
    num_fewshot=0,   # zero-shot evaluation
    batch_size=8,
    device="cuda",
)

# Or evaluate an API model; the OpenAI backends read OPENAI_API_KEY from the
# environment. Chat endpoints only support generative tasks (e.g. GSM8K),
# not loglikelihood-based ones like MMLU.
api_results = evaluator.simple_evaluate(
    model="openai-chat-completions",
    model_args="model=gpt-4",
    tasks=["gsm8k"],
    num_fewshot=0,
)

# Per-task metrics are nested under the "results" key
print(results["results"])
```
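The returned dictionary nests per-task scores under a "results" key. As an illustrative follow-up (the exact metric key names vary by task and harness version), you can pull out one task's numbers and save everything for later cross-model comparison:

```python
import json

# Inspect one task's metrics; key names (e.g. "exact_match,strict-match") vary by task/version
print(results["results"]["gsm8k"])

# Persist the full results for comparing models later
with open("mistral-7b-instruct_evals.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```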
The harness has specialized handling for these benchmarks: