Summarization

Description: Evaluates an LLM's ability to accurately summarize long texts from diverse sources such as YouTube video transcripts, websites, PDFs, and direct text inputs. It also assesses the model's capacity to follow detailed user instructions to extract specific data insights. The dataset consists of 41 unique entries in English, which have been translated into Afrikaans, Brazilian Portuguese, and Polish using machine translation.
Number of Samples: 164
Language: English, Afrikaans, Brazilian Portuguese, Polish
Provider: Toqan
Evaluation Method: Auto-evaluation with GPT-4 Turbo over ground-truth summaries.
Data Collection Period: February 2022 - October 2023
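The evaluation method is LLM-as-judge scoring against ground-truth summaries. Below is a minimal sketch of what one judging call could look like, assuming the judge is reached through the OpenAI chat completions API; the prompt wording, the 1-to-5 scale, the JSON output format, and the judge_summary helper are illustrative assumptions, not Toqan's actual pipeline.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judging prompt; the benchmark's real prompt and scale are not published here.
JUDGE_PROMPT = """You are grading a model-generated summary against a ground-truth summary.
Score each criterion from 1 to 5 and reply with JSON only, e.g.:
{{"adherence_to_instructions": 0, "accuracy_of_content": 0, "quality_of_writing": 0}}

Instructions given to the model:
{instructions}

Ground-truth summary:
{reference}

Model summary:
{candidate}
"""


def judge_summary(instructions: str, reference: str, candidate: str) -> dict:
    """Ask GPT-4 Turbo to score one candidate summary against its ground truth."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instructions=instructions, reference=reference, candidate=candidate
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```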

Language: Language of the source document.
Complexity: The complexity level of the summary requests.

Last updated: November 12, 2024

Leaderboard columns: Rank (#), Model, Provider, Size, Chunk Size, Adherence to Instructions, Accuracy of Content, Quality of Writing.
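The three score columns are per-sample judge ratings aggregated per model. Here is a sketch of that roll-up, assuming each judged sample records the model name alongside its three scores; the field names and the aggregate_scores helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(per_sample_scores: list[dict]) -> dict[str, dict[str, float]]:
    """Roll per-sample judge scores up into one leaderboard row per model.

    Each input row is assumed to look like:
    {"model": "some-model", "adherence_to_instructions": 4,
     "accuracy_of_content": 5, "quality_of_writing": 4}
    """
    criteria = ("adherence_to_instructions", "accuracy_of_content", "quality_of_writing")
    by_model: dict[str, list[dict]] = defaultdict(list)
    for row in per_sample_scores:
        by_model[row["model"]].append(row)
    return {
        model: {c: round(mean(r[c] for r in rows), 2) for c in criteria}
        for model, rows in by_model.items()
    }
```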

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a better cost than proprietary models. We can also add custom filters to give you deeper insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

Leaderboard: an open-source model beating GPT-4 Turbo on our interactive leaderboard.


Please briefly describe your use case and motivation. We'll get back to you with details on how we can add your benchmark.