Function Calling
Description: Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web searches, code execution, and planning multiple function calls. The input is a conversation history and a list of available tools.
Number of Samples: 936
Language: English
Provider: Toqan
Evaluation Method: Multi-class classification accuracy using human-labeled data, and auto-evaluation with GPT-4 Turbo
Data Collection Period: January 2024 - May 2024
Last updated: July 2, 2024
# | Model | Provider | Size | Tool Accuracy | Argument Correctness (Web Search) | Argument Correctness (Code Execution) | Argument Correctness (Planning) |
---|---|---|---|---|---|---|---|
No results. |
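
As a rough illustration of the Tool Accuracy column, multi-class classification accuracy can be read as the fraction of samples where the model picked the same tool as the human label. The sketch below assumes that reading. The function name `tool_accuracy` and the tool labels are hypothetical and are not taken from the benchmark itself.

```python
from typing import Sequence


def tool_accuracy(predicted: Sequence[str], labeled: Sequence[str]) -> float:
    """Fraction of samples where the predicted tool matches the human label."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        raise ValueError("at least one sample is required")
    # Count exact matches between the model's tool choice and the label.
    correct = sum(p == t for p, t in zip(predicted, labeled))
    return correct / len(labeled)


# Hypothetical tool labels, not the benchmark's actual label set.
labeled = ["web_search", "code_execution", "planning", "web_search"]
predicted = ["web_search", "code_execution", "web_search", "web_search"]
print(tool_accuracy(predicted, labeled))  # 0.75
```

The argument-correctness columns are described as auto-evaluated by a model, so they would not reduce to an exact-match comparison like this one.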