ProLLM Benchmarks

LLM as a Judge

Description: Evaluates an LLM's ability to judge the acceptability of other LLM answers to given technical and non-technical questions, including some coding questions.

Number of Samples: 136

Language: English

Provider: Toqan and Stack Overflow

Evaluation Method: Binary classification accuracy. The judge model labels each generated answer as acceptable or not acceptable, and its labels are scored against the ground truth. The ground truth was labeled initially by two domain experts and finalized by a third expert after a thorough review.
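As a rough illustration of the metric, the sketch below computes binary classification accuracy as the fraction of answers where the judge's verdict matches the expert label. The function name `judge_accuracy` and the sample labels are hypothetical and made up for this example; this is not the benchmark's actual scoring code or data.

```python
from typing import Sequence


def judge_accuracy(predicted: Sequence[bool], ground_truth: Sequence[bool]) -> float:
    """Return the fraction of verdicts that match the expert labels.

    True means "acceptable", False means "not acceptable".
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one labeled sample is required")
    # Count the positions where the judge agrees with the ground truth.
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Hypothetical labels for four answers (illustration only, not benchmark data).
predicted = [True, False, True, True]
ground_truth = [True, False, False, True]
accuracy = judge_accuracy(predicted, ground_truth)
print(f"accuracy = {accuracy:.2f}")
```

Here the judge agrees with the experts on three of the four answers, so the sketch reports an accuracy of 0.75.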

Data Collection Period: February 2024 - April 2024

Last updated: June 5, 2024

#	Model	Provider	Accuracy
No results.