Reduce your footprint

The methodology gives you a number. This page is what to actually do about it. Each section is a lever, with the order of magnitude of its effect, and the trade-off it carries.

Pick a smaller model when you can

The largest single lever. Energy scales roughly linearly with active parameters (see the methodology formula). A 7B-active model is ~10× lower energy per token than a 70B model.

When to use a smaller model:

Classification, extraction, and structured-output tasks.
High-volume background work (summarisation, tagging).
Anything where the output is verified by a downstream system.

When not to:

Tasks that the smaller model fails at and your application has to retry on a larger one anyway. Two failed cheap calls + one big call

one big call.

The pseudo-model lowrouter/auto-cheap biases toward the smallest model that can plausibly handle the request. Try it on your traffic; if quality holds, keep it.

Cache prompts where the upstream supports it

Several providers offer prompt caching: a long system prompt sent repeatedly with different user messages is charged at a discount on the cached portion. Where supported, this cuts both cost and energy on the cached part.

Practical:

Place stable instructions, examples, and reference material first in the messages array.
Place the variable part (the user’s question) last.
Keep the stable prefix above the upstream’s caching threshold (for example, ≥1024 tokens).

The dashboard’s per-transaction view shows cached_tokens when an upstream applied a cache hit.

Trim prompts

Energy scales linearly with total_tokens. A 50% prompt-length reduction is a 50% energy reduction for the prompt portion.

Drop preamble that doesn’t change the model’s behaviour.
Drop few-shot examples that the model no longer needs.
Compress reference material (use IDs instead of full descriptions when the model has been trained on them).

This compounds with prompt caching: a shorter cached prefix is cheaper and faster to cache.

Choose the cleaner region when residency permits

The grid intensity in eu-north (mostly hydro/nuclear) is roughly 5–8× lower than in coal-heavy regions. If your data residency allows EU-North, you can pick it explicitly:

JSON

{
  "model": "lowrouter/auto",
  "messages": [...],
  "route": {"region": "eu-north", "prefer_low_carbon": true}
}

Or, if you’d rather let the router pick whichever region is cleanest and available right now, just set prefer_low_carbon: true and leave region unset.

Bound completion length

max_tokens lets you stop generation when “enough is enough”. For classification or extraction, set it to the actual answer length plus a small margin. The carbon savings are linear with the saved tokens.

Some prompts respond well to “Answer in one sentence.” instructions; others ignore them. Both are worth trying — the first time you check the dashboard, you’ll see if the average completion length actually came down.

Rate-limit your retries

A retry storm can multiply your footprint by 3–10× of the underlying call. Use exponential backoff with jitter on retries, cap the retry count, and never retry on a 4xx that is not a 408 timeout.

Memoise

If the same user asks the same question twice, an in-application cache returns the previous answer at zero gateway cost. This is the cheapest watt: the one not spent.

A few patterns that work:

Hash the prompt (after normalisation) and cache the response by that hash.
Cache lookup tables generated by the model (taxonomies, slot schemas) and refresh them on a schedule, not per-request.
For chat, cache the last few responses in memory keyed by the full conversation; reuse when the user re-asks immediately.

Aggregate where you can

Many small completions cost more than one larger one with multiple items. Examples:

Classify a batch of 20 items in a single request rather than 20 requests.
Extract structured fields for a list of inputs in one structured- output call.

Watch out for context-window limits and for the cost of re-prompting when one item in the batch fails — sometimes individual calls are cheaper net.

Order of magnitude summary

Lever	Typical reduction
Smaller model	5–10× per request
Region pinning to clean grid	3–8× on the carbon term
Prompt caching	30–80% on the cached portion
Prompt trimming	linear with the % trimmed
Memoisation of repeats	100% on the cached call
`max_tokens` bounding	linear with completion-tokens saved
Aggregation / batching	2–5× on overhead

These are independent — applying several stacks. The first two are where most teams start.

Confirm with the dashboard

After any of these changes, check the eco-impact widget on the dashboard for the same time window before/after. If the change you made should have reduced the per-1K-tokens carbon number and didn’t, something is off — the dashboard’s transaction-detail page tells you which model and provider actually served each request.