
# Reduce your footprint

The methodology gives you a number. This page is what to actually do
about it. Each section is a lever, with the order of magnitude of its
effect, and the trade-off it carries.

## Pick a smaller model when you can

The largest single lever. Energy scales roughly linearly with active
parameters (see the [methodology](methodology) formula). A 7B-active
model is ~10× lower energy per token than a 70B model.

When to use a smaller model:

- Classification, extraction, and structured-output tasks.
- High-volume background work (summarisation, tagging).
- Anything where the output is verified by a downstream system.

When **not** to:

- Tasks that the smaller model fails at and your application has to
  retry on a larger one anyway. Two failed cheap calls + one big call
  > one big call.

The pseudo-model `lowrouter/auto-cheap` biases toward the smallest
model that can plausibly handle the request. Try it on your traffic;
if quality holds, keep it.

## Cache prompts where the upstream supports it

Several providers offer prompt caching: a long system prompt sent
repeatedly with different user messages is charged at a discount on
the cached portion. Where supported, this cuts both cost and energy
on the cached part.

Practical:

- Place stable instructions, examples, and reference material **first**
  in the messages array.
- Place the variable part (the user's question) **last**.
- Keep the stable prefix above the upstream's caching threshold (for
  example, ≥1024 tokens).

The dashboard's per-transaction view shows `cached_tokens` when an
upstream applied a cache hit.

## Trim prompts

Energy scales linearly with `total_tokens`. A 50% prompt-length
reduction is a 50% energy reduction for the prompt portion.

- Drop preamble that doesn't change the model's behaviour.
- Drop few-shot examples that the model no longer needs.
- Compress reference material (use IDs instead of full descriptions
  when the model has been trained on them).

This compounds with prompt caching: a shorter cached prefix is
cheaper *and* faster to cache.

## Choose the cleaner region when residency permits

The grid intensity in `eu-north` (mostly hydro/nuclear) is roughly
**5–8× lower** than in coal-heavy regions. If your data residency
allows EU-North, you can pick it explicitly:

```json
{
  "model": "lowrouter/auto",
  "messages": [...],
  "route": {"region": "eu-north", "prefer_low_carbon": true}
}
```

Or, if you'd rather let the router pick whichever region is cleanest
*and* available right now, just set `prefer_low_carbon: true` and
leave `region` unset.

## Bound completion length

`max_tokens` lets you stop generation when "enough is enough". For
classification or extraction, set it to the actual answer length plus
a small margin. The carbon savings are linear with the saved tokens.

Some prompts respond well to "Answer in one sentence." instructions;
others ignore them. Both are worth trying — the first time you check
the dashboard, you'll see if the average completion length actually
came down.

## Rate-limit your retries

A retry storm can multiply your footprint by 3–10× of the underlying
call. Use exponential backoff with jitter on retries, cap the retry
count, and **never** retry on a 4xx that is not a 408 timeout.

## Memoise

If the same user asks the same question twice, an in-application
cache returns the previous answer at zero gateway cost. This is the
cheapest watt: the one not spent.

A few patterns that work:

- Hash the prompt (after normalisation) and cache the response by
  that hash.
- Cache lookup tables generated by the model (taxonomies, slot
  schemas) and refresh them on a schedule, not per-request.
- For chat, cache the last few responses in memory keyed by the full
  conversation; reuse when the user re-asks immediately.

## Aggregate where you can

Many small completions cost more than one larger one with multiple
items. Examples:

- Classify a batch of 20 items in a single request rather than 20
  requests.
- Extract structured fields for a list of inputs in one structured-
  output call.

Watch out for context-window limits and for the cost of re-prompting
when one item in the batch fails — sometimes individual calls are
cheaper net.

## Order of magnitude summary

| Lever | Typical reduction |
|-------|-------------------|
| Smaller model | 5–10× per request |
| Region pinning to clean grid | 3–8× on the carbon term |
| Prompt caching | 30–80% on the cached portion |
| Prompt trimming | linear with the % trimmed |
| Memoisation of repeats | 100% on the cached call |
| `max_tokens` bounding | linear with completion-tokens saved |
| Aggregation / batching | 2–5× on overhead |

These are independent — applying several stacks. The first two are
where most teams start.

## Confirm with the dashboard

After any of these changes, check the eco-impact widget on the
dashboard for the same time window before/after. If the change you
made should have reduced the per-1K-tokens carbon number and didn't,
something is off — the dashboard's transaction-detail page tells you
which model and provider actually served each request.
