LLM as a Judge
Description: Evaluates an LLM's ability to judge the acceptability of other LLMs' answers to technical and non-technical questions, including coding questions.
Number of Samples: 136
Language: English
Provider: Toqan and Stack Overflow
Evaluation Method: Binary classification accuracy, i.e. the fraction of samples where the judge's accept/reject verdict matches the ground truth. Ground truth was curated by initial labeling from two domain experts and finalized by a third expert after a thorough review.
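As a minimal sketch, the metric amounts to comparing each judge verdict against the expert label and taking the fraction that agree. The field names (`judge_verdict`, `ground_truth`) and the toy data are illustrative assumptions, not the benchmark's actual schema:

```python
# Sketch of binary classification accuracy for an LLM-as-judge eval.
# Field names and sample data are hypothetical, for illustration only.

def accuracy(samples):
    """Fraction of samples where the judge's accept/reject verdict
    matches the expert-curated ground-truth label."""
    correct = sum(1 for s in samples if s["judge_verdict"] == s["ground_truth"])
    return correct / len(samples)

# Toy example: the judge agrees with the experts on 3 of 4 samples.
demo = [
    {"judge_verdict": "acceptable",   "ground_truth": "acceptable"},
    {"judge_verdict": "acceptable",   "ground_truth": "unacceptable"},
    {"judge_verdict": "unacceptable", "ground_truth": "unacceptable"},
    {"judge_verdict": "acceptable",   "ground_truth": "acceptable"},
]
print(accuracy(demo))  # 0.75
```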
Data Collection Period: February 2024 - April 2024
Last updated: June 5, 2024
| # | Model | Provider | Accuracy |
|---|---|---|---|
| *No results.* | | | |