Function Calling

Description: Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web searches, code execution, and planning multiple function calls. The input for each sample is a conversation history and a list of available tools.
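
The exact sample schema isn't published here, but as a rough sketch, one sample could pair a conversation history with OpenAI-style tool definitions. All field and tool names below (web_search, execute_code, label) are assumptions for illustration:

```python
# Hypothetical sketch of a single benchmark sample: a conversation history
# plus the tools the model may call. Field names follow the widely used
# OpenAI-style function-calling schema; the benchmark's real format may differ.
sample = {
    "messages": [
        {"role": "user",
         "content": "Find the latest Python release and confirm 2**10 == 1024."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "web_search",  # hypothetical tool name
                "description": "Search the web and return the top results.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "execute_code",  # hypothetical tool name
                "description": "Run a Python snippet and return its output.",
                "parameters": {
                    "type": "object",
                    "properties": {"code": {"type": "string"}},
                    "required": ["code"],
                },
            },
        },
    ],
    # Human label: the tool the model is expected to call, with its arguments.
    "label": {"tool": "web_search",
              "arguments": {"query": "latest Python release"}},
}
```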

Number of Samples: 936

Language: English

Provider: Toqan

Evaluation Method: Multi-class classification accuracy using human-labeled data, and auto-evaluation with GPT-4 Turbo
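
Tool Accuracy (the leaderboard column below) is plain multi-class classification accuracy against the human labels; a minimal sketch of that computation, with hypothetical tool names:

```python
def tool_accuracy(predicted_tools, labeled_tools):
    """Fraction of samples where the model chose the human-labeled tool."""
    assert len(predicted_tools) == len(labeled_tools)
    correct = sum(p == l for p, l in zip(predicted_tools, labeled_tools))
    return correct / len(labeled_tools)

# Example with hypothetical tool names: 2 of 3 choices match the label.
print(tool_accuracy(
    ["web_search", "execute_code", "web_search"],
    ["web_search", "execute_code", "planner"],
))  # 0.666...
```

The argument-correctness columns are presumably where the GPT-4 Turbo auto-evaluation comes in, judging whether the arguments passed to the chosen tool are correct rather than requiring an exact string match.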

Data Collection Period: January 2024 - May 2024

Last updated: July 2, 2024

[Leaderboard table columns: #, Model, Provider, Size, Tool Accuracy, Argument Correctness (Web Search), Argument Correctness (Code Execution), Argument Correctness (Planning). Rows not captured in this view.]

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a better cost than proprietary models. We can also add custom filters to deepen your insight into LLM capabilities. Each time a new model is released, we’ll provide you with updated performance results.

Leaderboard

[Screenshot: an open-source model beating GPT-4 Turbo on our interactive leaderboard.]

Please briefly describe your use case and motivation. We’ll get back to you with details on how we can add your benchmark.