LLM as a Judge

Description: Evaluates an LLM's ability to judge whether other LLMs' answers to technical and non-technical questions, including some coding questions, are acceptable.

Number of Samples: 136

Language: English

Provider: Toqan and Stack Overflow

Evaluation Method: Binary classification accuracy. The judging model labels each generated answer as acceptable or not acceptable, and those labels are scored against the ground truth. Two domain experts produced the initial ground-truth labels, and a third expert finalized them after a thorough review.

Data Collection Period: February 2024 - April 2024

Last updated: June 5, 2024
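
The accuracy metric named under Evaluation Method can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the function name `judge_accuracy` and the sample labels are assumptions made up for the example, and `True` stands for "acceptable".

```python
from typing import Sequence


def judge_accuracy(predicted: Sequence[bool], ground_truth: Sequence[bool]) -> float:
    """Fraction of samples where the judge's verdict matches the expert label.

    Hypothetical helper: `predicted` holds the judging model's verdicts and
    `ground_truth` holds the expert-finalized labels (True = acceptable).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one sample is required")
    # Count the samples on which the judge agrees with the ground truth.
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Illustrative labels only, not taken from the benchmark data.
judge_verdicts = [True, False, True, True]
expert_labels = [True, False, False, True]
print(judge_accuracy(judge_verdicts, expert_labels))
```

On the full benchmark the same ratio would be computed over all 136 samples, so each sample contributes roughly 0.74 percentage points to the reported accuracy.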

#    Model    Provider    Accuracy
No results.

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a lower cost than proprietary models. We can also add custom filters that give you more insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

Leaderboard

An open-source model beating GPT-4 Turbo on our interactive leaderboard.
