Function Calling

Description: Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web searches, code execution, and planning multiple function calls. The input is a conversation history and a list of available tools.
Number of Samples: 788
Language: English
Provider: Toqan
Evaluation Method: Multi-class classification accuracy using human-labeled data, and auto-evaluation with GPT-4 Turbo against ground truth.
Data Collection Period: January 2024 – May 2024
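To make the input format concrete, the sketch below shows what a single sample of this kind could look like: a conversation history, a list of candidate tools, and a ground-truth label naming the function that should be called. The field names follow the common OpenAI-style tool schema and are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical shape of one benchmark sample (field names are assumed,
# not the benchmark's real format): a conversation history plus the
# tools the model may choose from, with a ground-truth expected call.
sample = {
    "messages": [
        {"role": "user", "content": "What is the weather in Amsterdam today?"}
    ],
    "tools": [
        {
            "name": "web_search",
            "description": "Search the web for up-to-date information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
        {
            "name": "execute_code",
            "description": "Run a Python snippet and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    ],
    # Ground-truth label: which function to call, and with which arguments.
    "expected_call": {
        "name": "web_search",
        "arguments": {"query": "weather in Amsterdam today"},
    },
}

print(sample["expected_call"]["name"])  # the tool the model should pick
```

A model is shown `messages` and `tools` and must produce a call; its choice of function and its arguments are then compared against `expected_call`.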

Function: The function types the models were tested on.
Inference Method: The approach used to query the model for function use.

Last updated: August 30, 2024

Leaderboard columns: #, Model, Provider, Size, Inference, Function Accuracy, Argument Correctness
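The two metric columns can be read as follows: Function Accuracy asks whether the model called the right function at all, and Argument Correctness asks whether it also supplied the right arguments. A minimal sketch of how these could be scored is below; exact-match comparison here is a stand-in for the human-labeled and GPT-4 Turbo judging described above, and all names are illustrative.

```python
def score(predictions, ground_truth):
    """Toy scorer for the two leaderboard columns.

    Function Accuracy: fraction of samples where the predicted
    function name matches the ground truth.
    Argument Correctness: fraction where, in addition, the predicted
    arguments exactly match the ground-truth arguments (a
    simplification of the benchmark's actual judging).
    """
    fn_hits = arg_hits = 0
    for pred, gold in zip(predictions, ground_truth):
        if pred["name"] == gold["name"]:
            fn_hits += 1
            if pred["arguments"] == gold["arguments"]:
                arg_hits += 1
    n = len(ground_truth)
    return {
        "function_accuracy": fn_hits / n,
        "argument_correctness": arg_hits / n,
    }

preds = [
    {"name": "web_search", "arguments": {"query": "weather Amsterdam"}},
    {"name": "execute_code", "arguments": {"code": "print(1 + 1)"}},
]
golds = [
    {"name": "web_search", "arguments": {"query": "weather Amsterdam"}},
    {"name": "web_search", "arguments": {"query": "latest news"}},
]
print(score(preds, golds))
# → {'function_accuracy': 0.5, 'argument_correctness': 0.5}
```

Under this definition Argument Correctness can never exceed Function Accuracy, since arguments are only checked when the function choice is already correct.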

Have a unique use-case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a better cost than proprietary models. We can also add custom filters to sharpen your insights into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

Leaderboard

An open-source model beating GPT-4 Turbo on our interactive leaderboard.


Please briefly describe your use case and motivation. We'll get back to you with details on how we can add your benchmark.