Reduce your footprint
The methodology gives you a number. This page is what to actually do about it. Each section is a lever, with the order of magnitude of its effect, and the trade-off it carries.
Pick a smaller model when you can
The largest single lever. Energy scales roughly linearly with active parameters (see the methodology formula). A 7B-active model is ~10× lower energy per token than a 70B model.
When to use a smaller model:
- Classification, extraction, and structured-output tasks.
- High-volume background work (summarisation, tagging).
- Anything where the output is verified by a downstream system.
When not to:
- Tasks that the smaller model fails at and your application has to
retry on a larger one anyway. Two failed cheap calls + one big call
one big call.
The pseudo-model lowrouter/auto-cheap biases toward the smallest
model that can plausibly handle the request. Try it on your traffic;
if quality holds, keep it.
Cache prompts where the upstream supports it
Several providers offer prompt caching: a long system prompt sent repeatedly with different user messages is charged at a discount on the cached portion. Where supported, this cuts both cost and energy on the cached part.
Practical:
- Place stable instructions, examples, and reference material first in the messages array.
- Place the variable part (the user’s question) last.
- Keep the stable prefix above the upstream’s caching threshold (for example, ≥1024 tokens).
The dashboard’s per-transaction view shows cached_tokens when an
upstream applied a cache hit.
Trim prompts
Energy scales linearly with total_tokens. A 50% prompt-length
reduction is a 50% energy reduction for the prompt portion.
- Drop preamble that doesn’t change the model’s behaviour.
- Drop few-shot examples that the model no longer needs.
- Compress reference material (use IDs instead of full descriptions when the model has been trained on them).
This compounds with prompt caching: a shorter cached prefix is cheaper and faster to cache.
Choose the cleaner region when residency permits
The grid intensity in eu-north (mostly hydro/nuclear) is roughly
5–8× lower than in coal-heavy regions. If your data residency
allows EU-North, you can pick it explicitly:
{
"model": "lowrouter/auto",
"messages": [...],
"route": {"region": "eu-north", "prefer_low_carbon": true}
}Or, if you’d rather let the router pick whichever region is cleanest
and available right now, just set prefer_low_carbon: true and
leave region unset.
Bound completion length
max_tokens lets you stop generation when “enough is enough”. For
classification or extraction, set it to the actual answer length plus
a small margin. The carbon savings are linear with the saved tokens.
Some prompts respond well to “Answer in one sentence.” instructions; others ignore them. Both are worth trying — the first time you check the dashboard, you’ll see if the average completion length actually came down.
Rate-limit your retries
A retry storm can multiply your footprint by 3–10× of the underlying call. Use exponential backoff with jitter on retries, cap the retry count, and never retry on a 4xx that is not a 408 timeout.
Memoise
If the same user asks the same question twice, an in-application cache returns the previous answer at zero gateway cost. This is the cheapest watt: the one not spent.
A few patterns that work:
- Hash the prompt (after normalisation) and cache the response by that hash.
- Cache lookup tables generated by the model (taxonomies, slot schemas) and refresh them on a schedule, not per-request.
- For chat, cache the last few responses in memory keyed by the full conversation; reuse when the user re-asks immediately.
Aggregate where you can
Many small completions cost more than one larger one with multiple items. Examples:
- Classify a batch of 20 items in a single request rather than 20 requests.
- Extract structured fields for a list of inputs in one structured- output call.
Watch out for context-window limits and for the cost of re-prompting when one item in the batch fails — sometimes individual calls are cheaper net.
Order of magnitude summary
| Lever | Typical reduction |
|---|---|
| Smaller model | 5–10× per request |
| Region pinning to clean grid | 3–8× on the carbon term |
| Prompt caching | 30–80% on the cached portion |
| Prompt trimming | linear with the % trimmed |
| Memoisation of repeats | 100% on the cached call |
max_tokens bounding | linear with completion-tokens saved |
| Aggregation / batching | 2–5× on overhead |
These are independent — applying several stacks. The first two are where most teams start.
Confirm with the dashboard
After any of these changes, check the eco-impact widget on the dashboard for the same time window before/after. If the change you made should have reduced the per-1K-tokens carbon number and didn’t, something is off — the dashboard’s transaction-detail page tells you which model and provider actually served each request.