
# LLM-first design — why Bindly is built for the model, not just the user

"LLM-first" is a phrase that gets thrown around loosely. For us, it has a specific meaning: the API is designed as if the primary client is a language model, not a web browser. The human uses Bindly through the LLM. The LLM uses Bindly directly.

This changes almost everything about how the API works.

## The fundamental difference

A web browser making an API call has no meaningful bandwidth constraint. If the API returns 50KB of JSON, the browser parses it and renders what's relevant. The JavaScript doesn't care about token counts.

An LLM making an API call has a hard context window limit. Every token in a tool response is a token that can't go to reasoning, to other context, to the output. An API that dumps 50KB of raw content into every response will consume the entire context window on the first few tool calls.

This constraint shapes every API design decision in Bindly.

## Token awareness in every response

Every Binding response includes `contentTokenCount`:

```json
{
  "id": "bnd_abc123",
  "name": "Cloudflare Workers limits",
  "summary": "Overview of CPU time limits, bundle size...",
  "contentTokenCount": 2840,
  "textUrl": "https://bind.ly/@bindly/cloudflare-workers-limits?format=md"
}
```

Before deciding whether to fetch the full content (Tier 2), the LLM knows it costs 2,840 tokens. It can compare this against its remaining context budget and decide: load the full content, or work with the summary?

This is information a human interface never needs. A human scrolls and reads what's relevant. An LLM must decide in advance whether to load content into its context, because once content is loaded, the tokens are spent whether or not it turns out to be useful.

`summaryTokenCount` and `totalTokenCount` (on Sets) serve the same purpose — budget before fetching.
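The budget check itself is simple. Here's a minimal sketch of the decision an LLM (or an agent harness) can make with the counts Bindly returns — `shouldFetchFullContent`, the budget numbers, and the reasoning reserve are illustrative, not part of the API:

```typescript
// Shape mirrors the Tier 1 response fields shown above.
interface BindingStub {
  id: string;
  summary: string;
  contentTokenCount: number;
}

// Load Tier 2 content only if it fits after reserving room for reasoning
// and output. The reserve amount is a policy choice, not a Bindly concept.
function shouldFetchFullContent(
  binding: BindingStub,
  remainingBudget: number,
  reservedForReasoning: number
): boolean {
  return binding.contentTokenCount <= remainingBudget - reservedForReasoning;
}

const stub: BindingStub = {
  id: "bnd_abc123",
  summary: "Overview of CPU time limits, bundle size...",
  contentTokenCount: 2840,
};

console.log(shouldFetchFullContent(stub, 10_000, 4_000)); // fits: true
console.log(shouldFetchFullContent(stub, 5_000, 4_000)); // doesn't: false
```

The point is that the comparison happens *before* the fetch — the expensive mistake this design prevents is loading 2,840 tokens and only then discovering they weren't needed.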

## The Tier system as bandwidth management

Tier 1 (~300 tokens), Tier 1.5 (~800 tokens), Tier 2 (full content) aren't just "how much detail do you want?" They're a bandwidth management system.

A typical LLM workflow for finding relevant knowledge:

1. Search returns 10 results at Tier 1 — costs ~3,000 tokens total
2. LLM identifies 2 that seem relevant, fetches at Tier 1.5 — costs ~1,600 tokens
3. LLM confirms one is directly relevant, fetches at Tier 2 — costs whatever the document is

Total: 3,000 + 1,600 + 2,840 = 7,440 tokens to find and load one 2,840-token document.

Without the Tier system (everything at Tier 2): 10 × 2,840 = 28,400 tokens just for the search results. The LLM burned 3.8× more context and probably couldn't fit everything anyway.
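The arithmetic above is easy to verify. A quick back-of-envelope check using the per-tier numbers from the text:

```typescript
// Approximate per-result costs from the workflow described above.
const tier1PerResult = 300; // Tier 1: ~300 tokens
const tier15PerResult = 800; // Tier 1.5: ~800 tokens
const docTokens = 2840; // the one document loaded at Tier 2

// Tiered: 10 search results at Tier 1, 2 candidates at Tier 1.5, 1 full doc.
const tiered = 10 * tier1PerResult + 2 * tier15PerResult + docTokens;

// Flat: everything at Tier 2 from the start.
const flat = 10 * docTokens;

console.log(tiered); // 7440
console.log(flat); // 28400
console.log((flat / tiered).toFixed(1)); // "3.8"
```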

The Tier system is what makes knowledge retrieval viable at scale. You can have thousands of Bindings and still search efficiently.

## Batch operations as N+1 prevention

The N+1 problem is familiar in database access: fetching a list, then fetching details for each item individually. For an LLM with tool calls, this is worse — each tool call has fixed overhead (the invocation itself, response parsing, context).

`mcp_get_bindings` takes an array of IDs:

```typescript
mcp_get_bindings({
  ids: ["bnd_abc", "bnd_def", "bnd_ghi"],
  tier: "1"
})
```

Three items, one tool call. Without this, fetching three bindings by ID requires three tool calls with three round-trips.
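The same pattern extends when the ID list grows beyond what a single call comfortably carries: split it into a few batched calls rather than one call per ID. A generic sketch — the batch size here is arbitrary for illustration, not a documented Bindly limit:

```typescript
// Split a list into fixed-size batches, preserving order.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const ids = Array.from({ length: 7 }, (_, i) => `bnd_${i}`);
const batches = chunk(ids, 3);
console.log(batches.length); // 3 tool calls instead of 7
```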

`mcp_get_set_context` is more sophisticated:

```typescript
mcp_get_set_context({
  setId: "set_xyz",
  maxTokens: 8000,
  tier: "1.5"
})
```

This assembles up to 8,000 tokens of content from a Set in one call, ordered by position, stopping when the budget runs out. The LLM tells the API its budget; the API fills it optimally. Without this, loading a Set's contents would require `mcp_get_set` (to list Version IDs) + N × `mcp_get_version`. For a 15-item Set: 16 tool calls versus 1.
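The server-side behavior can be pictured as a greedy walk over the Set. This is a plausible sketch based on the description above — the item shape and the exact cutoff behavior (stop at the first item that doesn't fit) are assumptions, not the actual implementation:

```typescript
interface SetItem {
  versionId: string;
  position: number; // ordering within the Set
  tokenCount: number; // cost at the requested tier
}

// Walk the Set in position order, including items until the budget runs out.
function fillBudget(items: SetItem[], maxTokens: number): SetItem[] {
  const ordered = [...items].sort((a, b) => a.position - b.position);
  const selected: SetItem[] = [];
  let used = 0;
  for (const item of ordered) {
    if (used + item.tokenCount > maxTokens) break; // budget exhausted
    selected.push(item);
    used += item.tokenCount;
  }
  return selected;
}

const items: SetItem[] = [
  { versionId: "v1", position: 0, tokenCount: 3000 },
  { versionId: "v2", position: 1, tokenCount: 4000 },
  { versionId: "v3", position: 2, tokenCount: 2000 },
];

// v1 and v2 fit within 8,000 tokens; v3 would push the total to 9,000.
console.log(fillBudget(items, 8000).map((i) => i.versionId));
```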

## Provenance tracking as a first-class feature

Every Binding version has `source: "human" | "llm"` and an optional `model` field:

```json
{
  "bindingId": "bnd_abc123",
  "version": 3,
  "source": "llm",
  "model": "claude-opus-4-6",
  "createdAt": 1744000000
}
```

When an LLM creates or updates a Binding via MCP, `source` is automatically `"llm"`. When a human edits the same Binding in the web UI, `source` is `"human"`.

This matters for two reasons:

**Trust signals**: An LLM reading a Binding can see whether it was written by a human or generated by another model. Human-written content may be more reliable for certain use cases. LLM-generated content may be more structured.

**Audit trail**: You can see which model created which knowledge. If a model generates incorrect information and saves it as a Binding, you can identify all content created by that model version and review it.

Comments have the same provenance: `mcp_add_comment` marks comments as `source: "llm"`. Human comments from the web UI are `source: "human"`. The UI shows this distinction — AI annotations are visually distinct from human notes.
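The audit query described above is a straightforward filter over version history. A sketch — the `VersionMeta` shape mirrors the JSON example, but `findByModel` is illustrative, not a Bindly API:

```typescript
interface VersionMeta {
  bindingId: string;
  version: number;
  source: "human" | "llm";
  model?: string; // present when source is "llm"
  createdAt: number;
}

// All versions written by a specific model, for review after a bad generation.
function findByModel(versions: VersionMeta[], model: string): VersionMeta[] {
  return versions.filter((v) => v.source === "llm" && v.model === model);
}

const history: VersionMeta[] = [
  { bindingId: "bnd_abc", version: 1, source: "human", createdAt: 1743000000 },
  { bindingId: "bnd_abc", version: 2, source: "llm", model: "claude-opus-4-6", createdAt: 1743500000 },
  { bindingId: "bnd_def", version: 1, source: "llm", model: "claude-opus-4-6", createdAt: 1744000000 },
];

console.log(findByModel(history, "claude-opus-4-6").length); // 2
```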

## Removing LLM summarization from the web UI

v0 generated summaries and key points automatically using the Claude API whenever a user created a Binding via the web UI. This seemed like a feature. In practice it was a cost center and a quality problem.

The cost was real: every web-side Binding creation made an LLM API call. At scale, this would dominate the infrastructure bill.

The quality problem was subtler: auto-generated summaries captured what the LLM thought was important, not what the user thought was important. A user would create a Binding about a specific technical decision, and the auto-summary would focus on the general topic rather than the decision itself.

v1 removes this entirely. Web-side Binding creation is manual: the user writes the title, the content, the summary, the key points. This produces better metadata because the user knows what's important.

The MCP path is different. When Claude creates a Binding via MCP, *Claude is the author* — the summary and key points Claude provides reflect Claude's understanding of what it just processed. This is accurate provenance: the LLM created and described the content.

Better summaries, clearer provenance, no unexpected API costs.

## What the LLM Context Block communicates

Every MCP response includes structured metadata alongside the Markdown text:

```json
{
  "content": [{ "type": "text", "text": "..." }],
  "_meta": {
    "context": {
      "binding": {
        "id": "bnd_abc",
        "contentTokenCount": 2840,
        "publicUrl": "https://bind.ly/@bindly/...",
        "textUrl": "https://bind.ly/@bindly/...?format=md"
      },
      "agentMeta": {
        "source": "llm",
        "model": "claude-opus-4-6"
      },
      "pagination": {
        "total": 47,
        "hasMore": true
      }
    }
  }
}
```

This metadata answers questions the LLM might have without needing to make additional tool calls:

- "Can I link to this?" → `publicUrl`
- "Can I give someone a Markdown URL?" → `textUrl`
- "How much context budget will the full content use?" → `contentTokenCount`
- "Are there more results I haven't seen?" → `pagination.hasMore`
- "Who created this?" → `agentMeta.source` + `agentMeta.model`

A human API client ignores this block. An LLM uses it to make better decisions about what to do next.
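Those decisions can be made mechanically. Here's a sketch of agent-side logic over the `_meta` block — the response shape follows the example above, but the decision rules and `decideNext` itself are assumptions about how a consumer might behave, not part of the protocol:

```typescript
// Subset of the context block from the example response.
interface ContextBlock {
  binding?: { contentTokenCount: number };
  pagination?: { total: number; hasMore: boolean };
}

type NextAction = "page_next" | "fetch_full" | "use_summary";

// A simple policy: finish paging first, then fetch full content only
// if it fits in the remaining budget.
function decideNext(ctx: ContextBlock, remainingBudget: number): NextAction {
  if (ctx.pagination?.hasMore) return "page_next";
  if (ctx.binding && ctx.binding.contentTokenCount <= remainingBudget) {
    return "fetch_full";
  }
  return "use_summary";
}

console.log(decideNext({ pagination: { total: 47, hasMore: true } }, 8000)); // "page_next"
console.log(decideNext({ binding: { contentTokenCount: 2840 } }, 8000)); // "fetch_full"
console.log(decideNext({ binding: { contentTokenCount: 2840 } }, 1000)); // "use_summary"
```

Every branch here is answered by metadata already in hand — no extra tool call needed to decide what to do next.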

## The help system as LLM documentation

`mcp_help` fetches help content from `bind.ly/help/*` as Markdown. Help documentation is updated without deploying new MCP code — update the KV store, and `mcp_help({ topic: "mcp/tools" })` returns the new content immediately.

It also means the help content is designed for LLMs to read, not just humans. `/help/mcp/tools` explains not just what each tool does, but *when* to use it, *how* to combine tools, and *what to do* when something goes wrong. Documentation written for a model that will read it in the middle of a task, not a developer who reads it once.

This is what LLM-first means in practice: every part of the system — the API, the responses, the documentation — is designed for a consumer that reasons over tokens, operates in batches, and needs to make decisions with incomplete information.