<h1> 🛠️ToolBench🤖</h1>
Model • Data Release • Web Demo • Tool Eval • Paper • Citation

🔨This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction-tuning (SFT) data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset, constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which has been upgraded with enhanced function-call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model, ToolLLaMA, fine-tuned on ToolBench.
2024.8 Update: We have updated the RapidAPI server with a new IP; please make sure you have the latest code. You can also build it locally using the code here.
💁♂️💁💁♀️ Join Us on Discord!
Read this in 中文.
[2024/3/17] Welcome to StableToolBench: a stable and reliable local toolbench server based on API response simulation. Dive deeper into the tech behind StableToolBench with the paper here and explore more on the project homepage. The code is available here.
[2023/9/29] A new version of ToolEval is released, which is more stable and covers more models, including GPT4! Please refer to ToolEval for more details. In addition, ToolLLaMA-2-7b-v2 is released with stronger tool-use capabilities. Please use the ToolLLaMA-2-7b-v2 model to reproduce our latest experimental results with the new version of ToolEval.
[2023/8/30] Data update, with more than 120,000 solution path annotations and intact reasoning thoughts! Please find data.zip on Google Drive.
[2023/8/8] No more hallucination! ToolLLaMA-2-7b-v1 (fine-tuned from LLaMA-2-7b) is released with lower API hallucination than ChatGPT.
[2023/8/4] We provide a RapidAPI backend service to free you from using your own RapidAPI key and subscribing to the APIs. Please fill out our form. We will review it as soon as possible and send you the ToolBench key so you can get started!
[2023/8/1] Our paper is released.
[2023/7/27] A new version of ToolBench is released.
✨Here is an overview of the dataset construction, training, and evaluation.

✨✨Features:
- API Collection: we gather 16464 representational state transfer (REST) APIs from RapidAPI, a platform that hosts a massive number of real-world APIs provided by developers.
- Instruction Generation: we curate instructions that involve both single-tool and multi-tool scenarios.
- Answer Annotation: we develop a novel depth-first search-based decision tree (DFSDT) to bolster the planning and reasoning ability of LLMs. It significantly improves annotation efficiency and successfully annotates complex instructions that cannot be answered with CoT or ReACT. We provide responses that include not only the final answer but also the model's reasoning process, tool execution, and tool execution results.
- API Retriever: we incorporate API retrieval to equip ToolLLaMA with open-domain tool-using abilities.
- All the data is automatically generated by the OpenAI API and filtered by us; the whole data creation process is easy to scale up.
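
To make the idea behind the depth-first search-based decision tree more concrete, here is a minimal, illustrative sketch in Python. It is not the authors' implementation: the `expand` and `is_solution` callbacks, the depth limit, and the toy string-building task are all assumptions made for illustration. The only point it shows is that, unlike a single linear reasoning chain, a depth-first search can abandon a failing branch and try a sibling action.

```python
from typing import Callable, List, Optional, Sequence


def dfs_decision_tree(
    state: str,
    expand: Callable[[str], Sequence[str]],
    is_solution: Callable[[str], bool],
    max_depth: int = 5,
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Return the first path of states that reaches a solution, or None."""
    path = (path or []) + [state]
    if is_solution(state):
        return path
    if len(path) > max_depth:
        return None  # depth limit reached: give up on this branch
    for child in expand(state):
        found = dfs_decision_tree(child, expand, is_solution, max_depth, path)
        if found is not None:
            return found
    return None  # every child failed, so the caller backtracks


def toy_expand(state: str) -> List[str]:
    """Hypothetical stand-in for a model proposing candidate next steps."""
    return [state + "a", state + "b"]


# The search first tries "aa", fails, then backtracks and finds "ab".
solution = dfs_decision_tree("", toy_expand, lambda s: s == "ab", max_depth=2)
print(solution)
```

In a real annotation pipeline the states would be conversation histories and `expand` would call a model and execute API calls; none of that is modeled here.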

We also provide a demo of using ToolLLaMA:
https://github.com/OpenBMB/ToolBench/assets/25274507/f1151d85-747b-4fac-92ff-6c790d8d9a31
Currently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, we will continually improve the data quality and increase the coverage of real-world tools.

Here is the old version of ToolBench.
👐ToolBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under Apache License 2.0. Below are the statistics of the data:
| Tool Nums | API Nums | Instance Nums | Real API Call | Avg. Reasoning Traces |
|---|---|---|---|---|
| 3451 | 16464 | 126486 | 469585 | 4.0 |
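
As a quick sanity check on the scale of the dataset, the snippet below derives two ratios from the table. These ratios are not reported in the source; they are simple arithmetic on the figures above, and the variable names are chosen here for illustration.

```python
# Figures copied from the statistics table.
tool_nums = 3451
api_nums = 16464
instance_nums = 126486
real_api_calls = 469585

# Derived ratios (not part of the original table).
apis_per_tool = api_nums / tool_nums
calls_per_instance = real_api_calls / instance_nums

print(f"APIs per tool: {apis_per_tool:.2f}")
print(f"Real API calls per instance: {calls_per_instance:.2f}")
```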
We crawl 16000+ real-world APIs from RapidAPI, and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.

ToolBench contains both single-tool and multi-tool scenarios. The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We use the DFSDT method for data creation in all scenarios. Here is an illustration of the data creation process using the DFSDT method:

Please download our dataset using the following link: Google Drive or Tsinghua Cloud. Note that data_0801 is the old version of the data.
The file structure is as follows:
```
├── /data/
│  ├── /instruction/
│  ├── /answer/
│  ├── /toolenv/
│  ├── /retrieval/
│  ├── /test_instruction/
│  ├── /test_query_ids/
│  ├── /retrieval_test_query_ids/
│  ├── toolllama_G123_dfs_train.json
│  └── toolllama_G123_dfs_eval.json
├── /reproduction_data/
│  ├── /chatgpt_cot/
│  ├── /chatgpt_dfs/
│  ├── ...
│  └── /toolllama_dfs/
```
Here are some descriptions for the data directory:
- instruction and answer: The instruction data and solution path annotation data. G1, G2, and G3 refer to single-tool, intra-category multi-tool, and intra-collection multi-tool data, respectively. We also have an Atlas Explorer for visualization.
- toolenv: The tool environment related data, containing API jsons, API codes and API example responses.
- retrieval: The data used for tool retrieval is included in this directory.
- test_instruction and test_query_ids: We sample 200 instances from every test set. The test_instruction directory contains test queries for each test set, and the test_query_ids contains query ids of the test instances in each test set.
- retrieval_test_query_ids: This directory contains query ids of the test instances for retriever.
- toolllama_G123_dfs_train.json and toolllama_G123_dfs_eval.json: Preprocessed data that can be used to train ToolLLaMA directly and reproduce our results. For preprocessing details, we split the G1, G2, and G3 data into train, eval, and test parts, and combine the train data for training in our main experiments.
Please make sure you have downloaded the necessary data and put the directory (e.g. data/) under ToolBench/, so that the following bash scripts can navigate to the related data.
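
Before running the scripts, it can help to confirm that the data directory has the expected layout. The following is a small sketch that checks for the entries listed in the file structure above; the helper name `missing_data_entries` is hypothetical and is not part of the repository.

```python
from pathlib import Path
from typing import List

# Entries taken from the file structure listed in this README.
EXPECTED_ENTRIES = [
    "instruction",
    "answer",
    "toolenv",
    "retrieval",
    "test_instruction",
    "test_query_ids",
    "retrieval_test_query_ids",
    "toolllama_G123_dfs_train.json",
    "toolllama_G123_dfs_eval.json",
]


def missing_data_entries(data_dir: str = "data") -> List[str]:
    """Return the expected entries that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_data_entries("data")
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("All expected data entries are present.")
```

Run it from the ToolBench/ folder so that the relative path `data` resolves to the downloaded directory.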
We release ToolLLaMA-2-7b-v2, which is trained on the latest version of the data, as well as ToolLLaMA-7b-v1 and ToolLLaMA-7b-LoRA-v1, which are trained on the 0801 version of the data. All models are trained on the released dataset in a multi-task fashion. We also release the tool retriever trained under our experimental setting.
Clone this repository and navigate to the ToolBench folder.
```bash
git clone git@github.com:OpenBMB/ToolBench.git
cd ToolBench
```
Install the packages (python>=3.9):
```bash
pip install -r requirements.txt
```
or, for ToolEval only:
```bash
pip install -r toolbench/tooleval/requirements.txt
```
Prepare the data and tool environment:
```bash
wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1XFjDxVZdUY7TXYF2yvzx3pJlS2fy78jk&confirm=yes' -O data.zip
unzip data.zip
```
Preprocess the data for the tool retriever (shown here for G1):
```bash
export PYTHONPATH=./
python preprocess/preprocess_retriever_data.py \
    --query_file data/instruction/G1_query.json \
    --index_file data/test_query_ids/G1_instruction_test_query_ids.json \
    --dataset_name G1 \
    --output_dir data/retrieval/G1
```
Then train the tool retriever:
```bash
export PYTHONPATH=./
python toolbench/retrieval/train.py \
    --data_path data/retrieval/G1/ \
    --model_name bert-base-uncased \
    --output_path retrieval_model \
    --num_epochs 5 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --warmup_steps 500 \
    --max_seq_length 256
```
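
For readers unfamiliar with dense retrieval, the sketch below shows the general idea of ranking APIs against a query by embedding similarity. It is only an illustration: the three-dimensional vectors are made up and stand in for the output of a trained encoder, the API names are hypothetical, and the use of cosine similarity is an assumption here, not a statement about how the repository's retriever scores candidates.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_apis(
    query_vec: Sequence[float],
    api_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the k API names whose embeddings are most similar to the query."""
    ranked = sorted(
        api_vecs,
        key=lambda name: cosine(query_vec, api_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; a real system would compute these with the encoder.
api_embeddings = {
    "weather_forecast": [0.9, 0.1, 0.0],
    "currency_convert": [0.0, 0.2, 0.9],
    "air_quality": [0.7, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]

best = top_k_apis(query_embedding, api_embeddings, k=2)
print(best)
```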
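Preprocess the solution path annotations for ToolLLaMA training (shown here for G1_answer):
```bash
export PYTHONPATH=./
python preprocess/preprocess_toolllama_data.py \
    --tool_data_dir data/answer/G1_answer \
    --method DFS_woFilter_w2 \
    --output_file data/answer/toolllama_G1_dfs.json
```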
Train ToolLLaMA with the preprocessed data data/toolllama_G123_dfs_train.json. For preprocessing details, we split the G1, G2, and G3 data into train, eval, and test parts, and combine the train data for training in our main experiments:
```bash
export PYTHONPATH=./
torchrun --nproc_per_node=2 --master_port=20001 toolbench/train/train_mem.py \
    --model_name_or_path huggyllama/llama-7b \
    --data_path data/toolllama_G123_dfs_train.json \
    --eval_data_path data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --source_model_max_length 2048 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to none
```
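
The flags in the command above jointly determine how many samples contribute to each optimizer step. The short calculation below makes that explicit. It assumes a single node with one GPU per process; the helper name is chosen here for illustration.

```python
def effective_batch_size(
    num_processes: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
) -> int:
    """Number of samples contributing to one optimizer step."""
    return num_processes * per_device_batch_size * grad_accum_steps


# Values from the torchrun command: --nproc_per_node=2,
# --per_device_train_batch_size 2, --gradient_accumulation_steps 8.
full_finetune_batch = effective_batch_size(2, 2, 8)
print(full_finetune_batch)  # 32
```

If you change the number of GPUs, adjusting `--gradient_accumulation_steps` so that this product stays the same keeps the optimization setup comparable.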
To train the LoRA version:
```bash
export PYTHONPATH=./
deepspeed --master_port=20001 toolbench/train/train_lora.py \
    --model_name_or_path huggyllama/llama-7b \
    --data_path data/toolllama_G123_dfs_train.json \
    --eval_data_path data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama_lora \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --source_model_max_length 2048 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ds_configs/stage2.json \
    --report_to none
```
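
Both training commands use `--learning_rate 5e-5`, `--warmup_ratio 0.04`, and `--lr_scheduler_type "cosine"`. The sketch below shows the commonly used shape of such a schedule: linear warmup followed by cosine decay to zero. It is an approximation for intuition, assuming the standard formula; the exact step counts and rounding used by the trainer may differ, and `total_steps=1000` is a made-up value.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    base_lr: float = 5e-5,
    warmup_ratio: float = 0.04,
) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, used only to print a few sample points.
total_steps = 1000
for s in (0, 20, 40, 500, 1000):
    print(s, lr_at_step(s, total_steps))
```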
Please fill out the form first; after reviewing it, we will send you the ToolBench key. Then set your ToolBench key:
```bash
export TOOLBENCH_KEY="your_toolbench_key"
```
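
If you write your own scripts around the key, a small guard can fail early when the variable is missing. This is a sketch only: `get_toolbench_key` is a hypothetical helper, and how the repository's own scripts consume the key is not shown in this section.

```python
import os
from typing import Mapping, Optional


def get_toolbench_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read TOOLBENCH_KEY from the environment, failing early if it is unset."""
    env = os.environ if env is None else env
    key = env.get("TOOLBENCH_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TOOLBENCH_KEY is not set; export it before running inference."
        )
    return key
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real environment.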
To run inference with ToolLLaMA, use the following commands:
```bash
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path ToolBench/ToolLLaMA-7b \
    --max_observation_length 1024 \
    --ob