Endpoints Not Tools

TL;DR

The default instinct when building agentic systems is to encode every capability as a tool. This doesn’t scale. As the tool count grows, agent performance degrades: the model spends more tokens reasoning about which tool to pick than actually solving the problem. The fix is a separation of concerns: tools are generic bridges (run code, read files), and endpoints are the business logic surface that agents compose through code. This plays directly to what LLMs are already good at, writing code that calls APIs, and pairs naturally with progressive discovery.

The Tool Proliferation Problem

You wire up your first agent with three tools. It works beautifully. Clean tool selection, correct parameters, fast execution. You ship it.

Six weeks later the tool count is fourteen. The agent now spends half its reasoning budget selecting between create_invoice, create_draft_invoice, create_recurring_invoice, and create_invoice_from_template. It picks wrong roughly 20% of the time. Worse, when it picks wrong, it often doesn’t know it picked wrong because the tool executed successfully, it just did the wrong thing.

This is not a failure of the underlying model. This is an architectural failure. We have fundamentally misunderstood the purpose of a “Tool” in an agentic system. When you give an agent 50 discrete tools, you are forcing the model to evaluate the relevance of every single tool, along with its specific JSON schema, at every step of the reasoning loop. You are over-exposing the agent. It collapses at medium scale because tool selection is, fundamentally, a classification problem, and every tool you add is a new class the model must disambiguate under ambiguity.

The common response is to write increasingly elaborate tool descriptions, add “when to use this” hints, or create routing logic that pre-filters tools. These are patches. They treat the symptom (bad selection) without addressing the cause (too many things to select from).

What Tools Actually Are

Strip away the framework abstractions and a tool is a simple thing: it’s a way for the model to do something in the world it cannot do with text alone. Read a file. Execute code. Make a network request. Send a message.

That’s it. Tools are bridges between the model’s reasoning and the external environment. They are I/O primitives.

The problem is that we’ve been overloading this primitive. We take domain-specific business logic, “create an invoice with line items, tax calculation, and currency conversion”, and pack it into a tool. The model can invoke it, but it can’t compose it with other operations flexibly, or adapt when requirements shift slightly.

When I say tools should be minimal and generic, I mean something specific:

Tool (Bridge)	Purpose
`execute_code`	Run code in a sandboxed environment
`read_file`	Read a file from the filesystem
`write_file`	Write a file to the filesystem
`http_request`	Make an HTTP call

This is a small, stable surface. An agent with these four tools can, in principle, do almost anything, because the actual logic lives in the code it writes and the endpoints it calls. The tools don’t grow as your domain grows. The endpoints do.

Endpoints for “Business Logic”

▸ What is an Endpoint?

An endpoint is a stable programmatic operation exposed by a service. POST /invoices, GET /customers/{id}/allergies, PUT /menu/{id}/substitutions. They have typed inputs, typed outputs, documented behavior, and predictable error codes.

LLMs are already excellent at calling APIs through code. This isn’t a theoretical claim; it’s an empirical observation rooted in training data. The internet is saturated with examples of “how to call the Stripe API,” “how to use the GitHub REST API,” “how to compose multiple API calls in a script.” When you give an agent endpoint documentation and a code execution tool, you are operating in the exact modality where LLMs are strongest.

▸ Dinner menu for allergens. Comparison Example

Approach A: Tool-per-operation

Tool-heavy approach

You define dedicated tools:

check_ingredient_allergens(ingredient: string, allergens: string[])
get_menu_ingredients(menu_id: string)
substitute_ingredient(menu_id: string, original: string, replacement: string)
validate_substitution_safety(menu_id: string, allergens: string[])

The agent must:

Reason about which tool to call first
Call get_menu_ingredients, parse the result
Loop (conceptually) through ingredients, calling check_ingredient_allergens for each
Decide on substitutions, calling substitute_ingredient
Call validate_substitution_safety

Each step is a separate tool invocation. Each invocation requires the model to re-enter the tool-selection loop, re-read the tool schemas, and format the parameters. For a menu with 15 ingredients and 3 allergens, that’s potentially dozens of round trips. Token cost spikes. Latency compounds. And the model has no ability to optimize the sequence, it’s locked into one-call-at-a-time orchestration.

Approach B: Endpoints composed through code

Endpoint + code approach

You expose the same operations as API endpoints, documented in an OpenAPI-style spec:

POST /ingredients/check-allergens
  Body: { ingredients: string[], allergens: string[] }
  Response: { flagged: [{ ingredient, allergen, severity }] }

GET /menus/{id}/ingredients
  Response: { ingredients: string[] }

PUT /menus/{id}/substitutions
  Body: { substitutions: [{ original, replacement }] }
  Response: { updated_menu }

The agent writes a single script:

import requests

BASE = "http://menu-service:8080"
menu_id = "dinner-feb-28"
allergens = ["peanut", "gluten"]

# 1. Get all ingredients in one call
ingredients = requests.get(
    f"{BASE}/menus/{menu_id}/ingredients"
).json()["ingredients"]

# 2. Batch check allergens in one call
flagged = requests.post(
    f"{BASE}/ingredients/check-allergens",
    json={"ingredients": ingredients, "allergens": allergens},
).json()["flagged"]

# 3. Build substitutions for flagged items
substitutions = [
    {"original": f["ingredient"], "replacement": suggest_safe(f)}
    for f in flagged
]

# 4. Apply all substitutions in one call
updated = requests.put(
    f"{BASE}/menus/{menu_id}/substitutions",
    json={"substitutions": substitutions},
).json()

print(updated)

Three HTTP calls. One tool invocation (execute_code). The model controls the flow, handles edge cases in code, and can add logic (retries, logging, conditional branches) without needing new tools.

The difference isn’t subtle. In Approach A, the model is an operator pressing buttons one at a time. In Approach B, the model is a programmer composing an API surface, which is exactly what it was trained to do.

Promoting Progressive Discovery

In a progressive discovery system (also called progressive disclosure), optional capabilities stay invisible until a relevant skill is loaded. Endpoints fit this model cleanly.

Consider a dinner party agent. Its base context includes always-needed capabilities: gather preferences, propose a menu, produce a shopping list. No allergy tools. No allergy endpoints. No allergy anything.

When the user mentions a peanut allergy, the agent loads the dietary_restrictions skill. That skill reveals:

Knowledge about how to handle allergen constraints
Endpoint documentation for ingredient_checker and the substitution API

Before skill load, those endpoint specs didn’t exist in the agent’s context. After skill load, the agent has exactly the API surface it needs, no more, no less. It writes code to compose those endpoints, executes it through the single execute_code tool, and returns the result.

Now contrast this with tools. If you’d modeled those allergy operations as tools, you’d need the framework to dynamically register and deregister tools at runtime as skills load and unload. Some frameworks support this; many don’t. And even the ones that do often struggle with the model’s tool schema cache, the model “remembers” tools it saw earlier in the conversation, creating ghost tool hallucinations after deregistration.

Endpoints sidestep this entirely. The model doesn’t need framework-level registration to call an API. It just needs to know the URL, the method, and the schema, all of which arrive as text inside the skill document. The progressive disclosure mechanism is just text injection, which is the simplest and most reliable operation you can perform on an LLM context.

The Advantage of Code Execution

There’s a deeper structural benefit to the endpoint-over-tool approach, and it has to do with control flow.

Tools are atomic. You call one, you get a result, and then the model reasons about what to call next. The model is the orchestrator, and it orchestrates one step at a time. This is fine for simple linear sequences. It’s painful for anything involving:

Loops: Process each item in a list (the model must “loop” by making repeated tool calls across turns)
Conditionals: If X then do Y else do Z (the model must reason through the branch each time)
Error handling: Retry with backoff on failure (the model must track attempt counts and decide when to stop)
Aggregation: Collect results from multiple calls and produce a summary (the model must hold all intermediate results in context)

Code handles all of these natively. for loops, if/else, try/except, list comprehensions, these are the primitive control structures that programming languages were invented to provide. When the model writes code that calls endpoints, it gets access to the full power of a programming language’s control flow for free. The model writes the plan once, the runtime executes it deterministically.

This is not a new observation. The advantage of code execution in tool use is well documented. But the implication for system design is underappreciated: if your agent can write code, most of your “tools” should be endpoints that code calls, not framework-level tool registrations.

When You Still Need Tools

I’m not arguing for zero tools. Tools are important, but for a specific, narrow purpose: providing the bridges that code alone cannot cross.

Code execution itself is a tool. The agent can’t run code without a tool that accepts code and returns output. This is the foundational bridge.

File I/O is a tool. Reading and writing artifacts, loading skill documents, persisting results, these require filesystem access that the model can’t conjure from a script alone (unless the script is running in an environment with pre-configured access, in which case it becomes an endpoint concern).

User interaction is a tool. Asking for confirmation, presenting choices, collecting input, these require a channel back to the human that pure code execution doesn’t provide.

The principle is: tools provide generic environment interaction. Endpoints provide domain-specific business logic. If you find yourself encoding domain knowledge, business rules, or workflow steps into a tool definition, stop. That’s an endpoint.

▸ Decision heuristic: Tool or Endpoint?

Tool or Endpoint?

Ask these questions in order:

Is this a generic I/O operation (run code, read/write file, make HTTP call, talk to user)? → Tool. It’s environment interaction.
Does the agent need to compose this with other operations in a flexible sequence? → Endpoint. Let the agent write code that orchestrates the composition.
Does the operation encode domain-specific business logic (validation rules, data transformation, policy checks)? → Endpoint. Keep business logic in a service, not in a tool schema.
Will this operation only ever be called in isolation, never composed? → Could be either, but lean toward endpoint unless there’s a strong reason for the framework to manage it. Endpoints are testable, documentable, and reusable outside the agent.
Does the framework require this to be a registered tool for technical reasons (e.g., modifies agent state)? → Tool, reluctantly. Document why it can’t be an endpoint.

Limitations and Trade-offs

Security Sandboxing

A tool to execute code is powerful and dangerous. It requires robust sandboxing (network policies, filesystem restrictions, resource limits, rollback). This architecture only pays off for systems with clearly defined, complex tasks. If your agent just needs to call one API and return the result, the overhead isn’t worth it.

Endpoint Discovery Overhead

Agents need to “see” endpoint documentation. In the Progressive Discovery model, this happens via skill loading. If your skill descriptions are vague, the agent won’t know which endpoint to use. This requires disciplined API documentation.

Not all frameworks support this cleanly

Some agentic frameworks are heavily tool-centric. Their orchestration, logging, and state management assume tool calls are the primary action primitive. Moving to an endpoint-through-code model may require working against the framework’s grain, or building custom bridges. This is getting better, but it’s not frictionless today.

An Architectural Solution

This is an architectural issue. It’s the difference between building a system where every new capability requires a new tool registration, new disambiguation logic, and new selection testing, versus a system where new capabilities are new endpoints, documented in skill files, composed through code the model already knows how to write.

Tools are bridges. Keep them few, generic, and stable. Let endpoints carry the business logic. Let code carry the composition. Let progressive discovery control what the agent can see.

If you’re implementing this, start by auditing your current tool list. Any tool that performs “business logic” (calculations, data lookups, state changes) is a candidate for conversion to an endpoint. Reserve tools for interaction with the outside world.

macastro@blog:~

>Endpoints Not Tools