LLM as a Judge
**Description:** Evaluates an LLM's ability to judge the acceptability of other LLMs' answers to technical and non-technical questions, including some coding questions.
**Number of Samples:** 136
**Language:** English
**Provider:** Toqan and Stack Overflow
**Evaluation Method:** Binary classification accuracy on the acceptability of generated answers. Ground truth was curated by initial labeling from two domain experts and finalized by a third expert after a thorough review.
**Data Collection Period:** February 2024 - April 2024
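The scoring can be sketched as plain binary classification accuracy: the judge model's accept/reject verdict for each answer is compared against the expert-curated label, and the score is the fraction of matches. The function and sample data below are illustrative, not taken from the benchmark itself.

```python
def accuracy(judge_verdicts, expert_labels):
    """Fraction of samples where the judge's verdict matches the expert label."""
    assert len(judge_verdicts) == len(expert_labels)
    correct = sum(j == e for j, e in zip(judge_verdicts, expert_labels))
    return correct / len(judge_verdicts)

# Hypothetical verdicts: True means "answer is acceptable".
judge_verdicts = [True, False, True, True]
expert_labels = [True, False, False, True]  # curated ground truth
print(accuracy(judge_verdicts, expert_labels))  # 3 of 4 agree -> 0.75
```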
Last updated: September 10, 2024
| # | Model | Provider | Size | Accuracy |
|---|-------|----------|------|----------|
| No results. | | | | |