LowRouter

Reduce your footprint

The methodology gives you a number. This page is what to actually do about it. Each section is a lever, with the order of magnitude of its effect, and the trade-off it carries.

Pick a smaller model when you can

The largest single lever. Energy scales roughly linearly with active parameters (see the methodology formula). A 7B-active model is ~10× lower energy per token than a 70B model.

When to use a smaller model:

  • Classification, extraction, and structured-output tasks.
  • High-volume background work (summarisation, tagging).
  • Anything where the output is verified by a downstream system.

When not to:

  • Tasks that the smaller model fails at and your application has to retry on a larger one anyway. Two failed cheap calls + one big call

    one big call.

The pseudo-model lowrouter/auto-cheap biases toward the smallest model that can plausibly handle the request. Try it on your traffic; if quality holds, keep it.

Cache prompts where the upstream supports it

Several providers offer prompt caching: a long system prompt sent repeatedly with different user messages is charged at a discount on the cached portion. Where supported, this cuts both cost and energy on the cached part.

Practical:

  • Place stable instructions, examples, and reference material first in the messages array.
  • Place the variable part (the user’s question) last.
  • Keep the stable prefix above the upstream’s caching threshold (for example, ≥1024 tokens).

The dashboard’s per-transaction view shows cached_tokens when an upstream applied a cache hit.

Trim prompts

Energy scales linearly with total_tokens. A 50% prompt-length reduction is a 50% energy reduction for the prompt portion.

  • Drop preamble that doesn’t change the model’s behaviour.
  • Drop few-shot examples that the model no longer needs.
  • Compress reference material (use IDs instead of full descriptions when the model has been trained on them).

This compounds with prompt caching: a shorter cached prefix is cheaper and faster to cache.

Choose the cleaner region when residency permits

The grid intensity in eu-north (mostly hydro/nuclear) is roughly 5–8× lower than in coal-heavy regions. If your data residency allows EU-North, you can pick it explicitly:

JSON
{
  "model": "lowrouter/auto",
  "messages": [...],
  "route": {"region": "eu-north", "prefer_low_carbon": true}
}

Or, if you’d rather let the router pick whichever region is cleanest and available right now, just set prefer_low_carbon: true and leave region unset.

Bound completion length

max_tokens lets you stop generation when “enough is enough”. For classification or extraction, set it to the actual answer length plus a small margin. The carbon savings are linear with the saved tokens.

Some prompts respond well to “Answer in one sentence.” instructions; others ignore them. Both are worth trying — the first time you check the dashboard, you’ll see if the average completion length actually came down.

Rate-limit your retries

A retry storm can multiply your footprint by 3–10× of the underlying call. Use exponential backoff with jitter on retries, cap the retry count, and never retry on a 4xx that is not a 408 timeout.

Memoise

If the same user asks the same question twice, an in-application cache returns the previous answer at zero gateway cost. This is the cheapest watt: the one not spent.

A few patterns that work:

  • Hash the prompt (after normalisation) and cache the response by that hash.
  • Cache lookup tables generated by the model (taxonomies, slot schemas) and refresh them on a schedule, not per-request.
  • For chat, cache the last few responses in memory keyed by the full conversation; reuse when the user re-asks immediately.

Aggregate where you can

Many small completions cost more than one larger one with multiple items. Examples:

  • Classify a batch of 20 items in a single request rather than 20 requests.
  • Extract structured fields for a list of inputs in one structured- output call.

Watch out for context-window limits and for the cost of re-prompting when one item in the batch fails — sometimes individual calls are cheaper net.

Order of magnitude summary

LeverTypical reduction
Smaller model5–10× per request
Region pinning to clean grid3–8× on the carbon term
Prompt caching30–80% on the cached portion
Prompt trimminglinear with the % trimmed
Memoisation of repeats100% on the cached call
max_tokens boundinglinear with completion-tokens saved
Aggregation / batching2–5× on overhead

These are independent — applying several stacks. The first two are where most teams start.

Confirm with the dashboard

After any of these changes, check the eco-impact widget on the dashboard for the same time window before/after. If the change you made should have reduced the per-1K-tokens carbon number and didn’t, something is off — the dashboard’s transaction-detail page tells you which model and provider actually served each request.