Hi all,
It’s great to get back to writing regularly. Writing a newsletter provides me with an opportunity to delve into topics I’m excited about. In this edition, we’ll explore tool use—arguably one of the hottest new capabilities of LLMs. We’ll look at types of tools, benefits of tool use, recent developments, and future directions.
Update 24.09.23: 🇰🇷 This article has been translated into Korean by Park Ji Ho. Thanks!
What is Tool Use?
Language models are useful for a wide range of applications such as creative content generation, virtual assistants, customer support, and search. By definition, however, they are limited to producing natural language, which does not allow them to interact with the real world.1
This can be ameliorated by allowing the model to access external tools—by predicting special tokens or commands. A tool can take various forms: it can be a) the model itself or another neural network; b) a retrieval component such as a search engine; c) a symbolic computation or code module; or d) a module for controlling a physical robot or virtual agent as in the previous newsletter.
More broadly, a tool can be an arbitrary API. Three examples of tools that can be useful for language modeling are question answering, machine translation, and a calculator. Mialon et al. (2023) provide a great overview of this emerging topic in their survey.
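To make this concrete, a tool-augmented model might emit an inline API call as special tokens during generation, which is then executed and its result inserted back into the text. The syntax below is purely illustrative, loosely following the inline format used by Schick et al. (2023):

```
Out of 1,400 participants, 400 [Calculator(400 / 1400) → 0.29] (29%) passed the test.
```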
Benefits of Tools
Tools provide a practical way to address some of the limitations of current LLMs:
❌ LLMs are bad at math (e.g., Hendrycks et al., 2021).2 ✅ Calling a calculator may improve models’ arithmetic capabilities.
❌ LLMs’ pre-training data quickly becomes outdated. ✅ Calling a search engine allows the LLM to produce up-to-date information.
❌ LLMs may hallucinate information. ✅ Allowing an LLM to cite its sources may improve its trustworthiness.
❌ LLMs are black boxes. ✅ A trace of the API calls an LLM used to obtain a prediction provides some degree of interpretability.
How to Teach Tool Use
Many tools are just an API call away—but how do we teach an LLM to use them? Few-shot prompting is a standard way to condition current models. However, a few-shot prompt may not provide enough supervision to enable an LLM to effectively use a tool, particularly if tools have complex arguments or multiple tools are required.
Instead of showing a few demonstrations of tool use to a model, we can provide it with tool documentation. While a demonstration showcases how a tool should be used for a specific task, documentation describes the general functionality of different tools. Hsieh et al. (2023) find that tool documentation outperforms few-shot prompting with demonstrations on new domains.
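As a rough illustration, the two prompting styles might look as follows (the CALC tool and its syntax are hypothetical):

```
Few-shot demonstrations:
Q: What is 17 * 24?
A: CALC(17 * 24) → 408. The answer is 408.
Q: What is 132 / 6?
A:

Tool documentation:
Tool: CALC
Description: evaluates an arithmetic expression and returns the result.
Usage: CALC(<expression>)
Q: What is 132 / 6?
A:
```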
Fine-tuning on data augmented with API calls seems like the preferred choice.3 As many different API calls are possible for a given example, the data can be filtered to retain only ‘correct’ API calls. In practice, an LLM can be prompted in a few-shot manner, and API calls that do not lead to the correct final prediction are discarded. Parisi et al. (2022), for instance, generate sample API calls for Natural Questions examples. The API calls are executed and used to produce a model response. Examples where the model produced an incorrect output are filtered out, and the model is fine-tuned on the remaining dataset augmented with API calls.
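A minimal sketch of this self-training loop might look as follows. The `llm_generate` and `execute_api_call` helpers, the prompt constant, and the data fields are all hypothetical stand-ins, not the authors’ actual implementation:

```python
# Sketch of filtering API-call-augmented data, in the spirit of
# Parisi et al. (2022). `llm_generate` and `execute_api_call` are
# hypothetical stand-ins for the model and the tool interface.

FEW_SHOT_PROMPT = "..."  # a handful of demonstrations of the API-call syntax

def build_finetuning_data(examples, llm_generate, execute_api_call):
    finetuning_data = []
    for example in examples:
        # 1. Prompt the LLM few-shot to propose an API call for the question.
        api_call = llm_generate(FEW_SHOT_PROMPT + example["question"])
        # 2. Execute the call and let the model answer given the result.
        result = execute_api_call(api_call)
        answer = llm_generate(f"{example['question']}\n{api_call} → {result}\n")
        # 3. Keep only examples where the tool-augmented answer is correct.
        if answer.strip() == example["gold_answer"]:
            finetuning_data.append({
                "input": example["question"],
                "target": f"{api_call} → {result} {answer}",
            })
    return finetuning_data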
Schick et al. (2023) use a similar strategy applied to an unlabeled text dataset (a subset of Common Crawl). Rather than only retaining API calls that lead to correct responses, they retain calls that reduce the LLM’s loss over the next tokens. As annotating large unlabeled texts with calls from multiple APIs is expensive, they use heuristics that inform when each API should be selected.4
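In pseudo-Python, their filtering criterion can be sketched roughly as below. This is a simplification (Schick et al. additionally compare against the loss when the call is inserted without its result), and `lm_loss(prefix, continuation)` is an assumed helper returning the LM’s loss over `continuation` when conditioned on `prefix`:

```python
# Rough sketch of a Toolformer-style filtering criterion
# (Schick et al., 2023). `lm_loss` is a hypothetical helper.

TAU = 1.0  # minimum loss reduction required to keep an API call

def keep_api_call(lm_loss, prefix, api_call, api_result, continuation):
    # Loss over the next tokens without any API call.
    loss_plain = lm_loss(prefix, continuation)
    # Loss when the API call and its result are inserted before them.
    loss_with_call = lm_loss(f"{prefix} [{api_call} → {api_result}]", continuation)
    # Retain the call only if it reduces the loss by at least TAU.
    return loss_with_call + TAU < loss_plain
```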
Models can also be trained using reinforcement learning with hard-coded reward functions or from human feedback (RLHF) although this may lead to instability issues during training.5
Platforms for Tool-Augmented LLMs
Given the versatility of current models, tool-augmented LLMs have quickly captured researchers’ attention, with multiple recent papers claiming that tool use paves the way towards artificial general intelligence (AGI; Li et al., 2023; Ge et al., 2023). A central challenge for tool-augmented LLMs is the accessibility of APIs and models. The following platforms for tool-augmented LLMs have been proposed recently:
TaskMatrix.AI (March 2023), a vision for an ecosystem that enables LLMs to seamlessly interface with millions of APIs. Their framework includes a base LLM, an API platform, and an API search engine. The authors envision that models will mainly learn how to use APIs via RLHF, which may be difficult to scale to millions of APIs. They include a case study using ChatGPT to interface with the PowerPoint API.
API-Bank (April 2023), a benchmark to evaluate the tool use of LLMs in a few-shot prompting setting. In order to make tool use in the few-shot setting feasible, the model needs to produce a query for an API search engine, which returns documentation for the most relevant API.
OpenAGI (April 2023), a benchmark consisting of synthetic multi-step multi-modal datasets that require chaining calls to different domain-specific models. Models can be evaluated in zero-shot, few-shot, fine-tuning, or RL-based settings.
Gentopia (August 2023), a platform for creating and sharing tool-augmented agents.
Looking Back
It is inspiring to look back to see how far the field has progressed in just a few years. There are a few trends and developments in particular that have brought us to where we are.
Tool use then and now. The idea of having a model interface with auxiliary modules is not new. For instance, the Neural Programmer-Interpreter (Reed & de Freitas, 2016) required a complex neural network architecture to learn to execute different domain-specific programs; similarly, equipping BERT with a calculator (Andor et al., 2019) required defining vector-based operations for a limited set of arithmetic operations. What has changed is that current LLMs are much more versatile than prior models, which allows the use of arbitrary APIs.
Embeddings→modules→tools. Five years ago, we had approaches that learned to select the best combination of embeddings for a given task (e.g., Kiela et al., 2018). Last year, approaches selected new parameter-efficient modules for a given task (e.g., Mao et al., 2022). Now we are at a stage where models learn to select and use entire models and arbitrary black-box tools.
Chitchat→goal-oriented dialogue. End-to-end goal or task-oriented dialogue has been a challenging task in NLP for a long time (Bordes et al., 2016). While prior models have already queried database information based on their belief states (Hosseini-Asl et al., 2020), tool-augmented LLMs will be able to more seamlessly transition from chitchat to goal-oriented dialogue.
The Future of Tool-Augmented LLMs
Looking ahead, there are several challenges and directions for tool-augmented LLMs:
Making APIs accessible for model use. There are millions of APIs available that models can interact with. API platforms (see above) as well as ChatGPT Plugins and others aim to centralize access to APIs, which may risk locking in users. To ensure research progress in this area, it will be key to ensure that a standard set of APIs is available openly and free to use.
API search and extensibility. The problem of finding the most relevant API is similar to finding the most appropriate skill for virtual assistants such as Alexa (Kim et al., 2018). It will be key to have a search component that reliably returns the most relevant API from a growing API pool, as well as to enable LLMs to be easily extended with new tools; one simple baseline is sketched below.
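A minimal sketch of such a search component, assuming a hypothetical text-embedding function `embed` that returns unit-normalized vectors, is to retrieve APIs by similarity between the user request and each API’s documentation:

```python
import numpy as np

def retrieve_api(query, api_docs, embed, top_k=1):
    """Return the top_k APIs whose documentation is most similar to the
    query. `embed` is an assumed embedding function returning a
    unit-normalized numpy vector; `api_docs` maps API names to docs."""
    query_vec = embed(query)
    scores = [
        (name, float(np.dot(query_vec, embed(doc))))
        for name, doc in api_docs.items()
    ]
    # Sort by similarity, highest first, and keep the top_k candidates.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```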
Learning to use tools. How to best teach an LLM to use tools remains an open problem. The approach of Schick et al. (2023) is restricted to using a single tool at a time and requires tool-specific heuristics in order to augment a dataset efficiently. It will be important to investigate methods that can provide (multi-step) supervision and scale to 100s and 1000s of APIs.
Pre-training tool-augmented LLMs. Given the diversity of APIs and their use cases, it makes sense to dedicate larger budgets to training tool-augmented LLMs. While pre-trained models can be fine-tuned for tool use, pre-training a tool-augmented LLM allows the model to off-load certain behavior to tools early in training and focus on learning what is not captured by the tools.
Improving reasoning and problem decomposition. Reasoning and tool use are closely intertwined (Mialon et al., 2023). In order to call the right APIs, a problem needs to be decomposed into potentially simpler subtasks. How to best decompose open-ended problems is an open challenge.
Compensating for API errors and preventing error cascades. API calls to other models or tools such as search engines may produce erroneous results, which can lead to downstream failures. LLMs should learn to assess the reliability of APIs and recover from API failures.
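As a simple illustration of recovering from API failures, a tool-calling wrapper might retry failed calls with backoff and then fall back to the model’s parametric knowledge. The `call_api` and `llm_generate` helpers are hypothetical:

```python
import time

def robust_tool_call(call_api, llm_generate, api_call, prompt,
                     max_retries=3, backoff_seconds=1.0):
    """Try an API call a few times; on persistent failure, fall back
    to answering from the model's parametric knowledge."""
    for attempt in range(max_retries):
        try:
            return call_api(api_call)
        except Exception:
            # Exponential backoff before the next retry.
            time.sleep(backoff_seconds * (2 ** attempt))
    # All retries failed: answer without the tool, flagging the gap.
    return llm_generate(prompt + "\n(Note: external tool unavailable.)")
```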
Gaining a better understanding of tool use. Many aspects of how models learn to use and interface with tools are poorly understood. For instance, it is unclear to what extent models use predicted reasoning steps to support the final prediction (Yu et al., 2022). It is thus important to develop analysis methods and diagnostic tools together with new tool-augmented LLMs.
Overall, tool use allows us to address some of the current models’ limitations and has the potential to make them more capable and more interpretable at the same time. I’m excited to see what future progress in this area will look like.
The lack of grounding in the real world has been highlighted as a limitation of LLMs in the past (Bender & Koller, 2020); tool use provides a way to establish such grounding.
While recent models have much improved mathematical capabilities, they are not yet able to solve graduate-level math problems (Frieder et al., 2023). They are, however, useful as assistants to mathematicians and to guide human intuition (Davies et al., 2021).
We can also refer to this as ‘behavioral fine-tuning’ as we aim to teach the model something about its intended target behavior.
For instance, texts should only be considered for the calculator tool if they contain at least three numbers.
See Mialon et al. (2023) for an overview of this area.