Use Case · Code Generation

Coding assistants with model choice and per-seat budgets.

A coding tool mixes fast autocomplete with deep reasoning. Nemo Router lets you pick the model per task from one catalog, route latency-sensitive calls to the quickest endpoint, and cap spend per developer seat.

code-assistant · seat key

One catalog, the right model per task

Inline completion: gemini-2.5-flash
Refactor / tests: gemini-2.5-pro
Route strategy: least-busy
Streaming: token-by-token
Seat budget: $31 / $80
Provider config: none
model choice · low-latency · per-seat budget
Model catalog: 20+ models on Google Vertex AI

Gateway overhead: 95 ms p50 (LLM inference dominates)

Per-seat spend: Budgeted. One key per developer, capped

Every call: Logged. Model, latency, and cost per request

Why Nemo for code

The right model, fast, and within budget

A coding assistant has two competing needs — speed for completion, depth for reasoning — and cost that scales with every seat. Nemo Router handles all three behind one key.

Pick the model per task

Inline autocomplete wants a fast model; a multi-file refactor or test-suite generation wants a strong reasoning model. The catalog exposes every model behind one key, so you set the model per request.

  • Choose any catalog model per chat.completions call
  • Tag-filtered routing keeps code tasks on code-capable models
  • Swap models as the catalog grows — no SDK change
  • 20+ models live, Anthropic and OpenAI shipping next
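
As a rough illustration of per-task model choice, an assistant could keep a small task-to-model table and pass the result as the model field. The task names and the pick_model and build_request helpers below are hypothetical; the two model strings are the ones shown on this page.

```python
# Hypothetical task names mapped to the catalog model strings shown on this page.
MODEL_BY_TASK = {
    "inline_completion": "gemini-2.5-flash",  # fast model for keystroke latency
    "refactor": "gemini-2.5-pro",             # stronger reasoning model
    "tests": "gemini-2.5-pro",
}

# Assumption: unknown tasks fall back to the fast model.
DEFAULT_MODEL = "gemini-2.5-flash"


def pick_model(task: str) -> str:
    """Return the model string for a task, falling back to the fast default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


def build_request(task: str, prompt: str) -> dict:
    """Build keyword arguments for a chat.completions call (nothing is sent here)."""
    return {
        "model": pick_model(task),
        "messages": [{"role": "user", "content": prompt}],
    }
```

The returned dict can be unpacked into client.chat.completions.create(**request), so swapping a model means editing the table rather than the call site.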

Low-latency routing for completion

A developer is waiting on every keystroke. Latency-based routing steers each completion request to the quickest healthy endpoint; least-busy avoids the most loaded deployment.

  • The gateway adds ~95 ms at p50; LLM inference time dominates
  • Latency-based and least-busy strategies for hot paths
  • Streaming proxied transparently for token-by-token completion
  • Failover keeps the editor responsive during a provider blip
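
To make the two strategies concrete, here is a conceptual sketch of how latency-based and least-busy selection differ. It is not Nemo Router's implementation: the Deployment fields, endpoint names, and numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str              # hypothetical endpoint label
    healthy: bool          # False while the endpoint is failing
    avg_latency_ms: float  # recent average response latency
    in_flight: int         # requests currently being served


def _healthy(deployments):
    """Drop unhealthy endpoints; this is the failover step."""
    candidates = [d for d in deployments if d.healthy]
    if not candidates:
        raise RuntimeError("no healthy deployment available")
    return candidates


def latency_based(deployments):
    """Pick the healthy deployment with the lowest recent latency."""
    return min(_healthy(deployments), key=lambda d: d.avg_latency_ms)


def least_busy(deployments):
    """Pick the healthy deployment with the fewest in-flight requests."""
    return min(_healthy(deployments), key=lambda d: d.in_flight)


# Illustrative pool: endpoint-c looks fastest but is unhealthy, so it is skipped.
pool = [
    Deployment("endpoint-a", healthy=True, avg_latency_ms=420.0, in_flight=9),
    Deployment("endpoint-b", healthy=True, avg_latency_ms=610.0, in_flight=2),
    Deployment("endpoint-c", healthy=False, avg_latency_ms=150.0, in_flight=0),
]
```

With this pool, latency_based picks endpoint-a and least_busy picks endpoint-b, which shows why the two strategies can disagree on the same traffic.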

Per-seat budgets

A coding assistant scales cost with headcount. Issue one virtual key per developer or per team and set a per-key budget — spend is visible and capped per seat.

  • One virtual key per developer or team
  • Per-key budget with a hard 402 ceiling
  • Per-key RPM/TPM limits prevent one seat hogging capacity
  • Spend analytics break cost down by key
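
On the client side, an editor extension may want to treat a capped seat differently from other failures. The sketch below assumes the gateway reports an exhausted budget as HTTP 402 and that the client raises an exception carrying a status_code attribute, as the OpenAI Python SDK's APIStatusError does. The helper name is hypothetical.

```python
def complete_or_none_when_capped(client, model, prompt):
    """Return completion text, or None when the seat key has hit its budget cap."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception as exc:
        # Assumption: a capped key surfaces as an error whose status_code is 402.
        # With the OpenAI SDK you could catch openai.APIStatusError instead.
        if getattr(exc, "status_code", None) == 402:
            return None
        raise  # any other failure is not a budget problem; let it propagate
    return response.choices[0].message.content
```

Returning None lets the editor show a "budget reached" notice instead of a generic error.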

Observability for every generation

When a suggestion is wrong or slow, you want to see the exact request behind it. Every completion lands in the log with the model, latency, token counts, and real cost.

  • Request log records model, latency, tokens, and cost per call
  • Filter by a developer’s virtual key to inspect their traffic
  • Export to Langfuse, Datadog, or S3 via a logging callback
  • A/B test two models on real completion traffic
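
Once log records are exported, per-seat questions reduce to simple grouping. The sketch below assumes each exported record is a dict with key, model, latency_ms, and cost_usd fields. Those field names and the sample rows are invented for illustration and are not a documented export schema.

```python
from collections import defaultdict


def spend_by_key(records):
    """Sum cost per virtual key."""
    totals = defaultdict(float)
    for record in records:
        totals[record["key"]] += record["cost_usd"]
    return dict(totals)


def slowest_calls(records, key, limit=3):
    """Return one key's slowest requests, slowest first."""
    mine = [r for r in records if r["key"] == key]
    return sorted(mine, key=lambda r: r["latency_ms"], reverse=True)[:limit]


# Made-up sample rows in the assumed shape.
sample_log = [
    {"key": "alex", "model": "gemini-2.5-flash", "latency_ms": 310, "cost_usd": 0.25},
    {"key": "alex", "model": "gemini-2.5-pro", "latency_ms": 2400, "cost_usd": 1.50},
    {"key": "priya", "model": "gemini-2.5-flash", "latency_ms": 290, "cost_usd": 0.25},
]
```

The same grouping, keyed on model instead of key, would give the per-model comparison an A/B test needs.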
How it works

An editor request, end to end

Each developer carries a seat key. Completion and reasoning requests route to the model the task needs, stream back token-by-token, and land in the log attributed to that seat.

Code-assistant request flow

  1. Editor request

    completion or chat

    A keystroke completion or a refactor prompt from the IDE.

  2. Seat key

    sk-nemo-... · per developer

    Budget and rate limit scope to the developer’s key.

  3. Model + latency route

    catalog · least-busy

    Fast model for completion, strong model for reasoning.

  4. Stream back

    token-by-token

    Streaming proxied transparently to the editor.

  5. Logged per seat

    request log

    Model, latency, tokens, cost — attributed to the seat.

The gateway adds about 95 ms at p50 — LLM inference is the dominant latency factor. Streaming is proxied with no hot-path buffering.
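
For the stream-back step, a minimal sketch using the OpenAI Python SDK's stream=True option could look like the following. The helper name and the prompt in the usage comment are illustrative; the base URL and environment variable are the ones used in the code sample on this page.

```python
def stream_completion(client, model, prompt):
    """Yield text fragments from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask for incremental chunks instead of one final response
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example a role-only chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage (requires `pip install openai` and a real seat key):
#
#   import os
#   from openai import OpenAI
#
#   client = OpenAI(
#       api_key=os.environ["NEMOROUTER_API_KEY"],
#       base_url="https://api.nemorouter.ai/v1",
#   )
#   for piece in stream_completion(client, "gemini-2.5-flash", "Complete: def add(a, b):"):
#       print(piece, end="", flush=True)
```

Writing the helper as a generator lets the editor render each fragment as soon as it arrives rather than waiting for the full response.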

Budgets

Cost that scales with headcount, capped per seat

Per-seat budgets

One key per developer — spend you can see and cap

A coding assistant’s bill grows with every engineer you add. Give each developer their own virtual key with a per-key budget. The request log, rate limits, and spend all scope to that key, so cost per seat is a number, not a guess. Once a key reaches its budget, further calls return a 402 instead of running up an overage.

  • One virtual key per developer or per team
  • Per-key budget enforced with a hard 402 ceiling
  • Per-key RPM/TPM limits keep one seat from hogging capacity
  • Spend analytics attribute every dollar to a key

budgets · engineering team

Spend per seat

dev-key · alex: $31 / $80
dev-key · priya: $58 / $80
dev-key · sam: $80 / $80
sam · next call: 402 (capped)
Team budget: $169 / $400
per-key · hard ceiling · tracked
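
The arithmetic behind the panel can be checked with a small model of a hard ceiling. This is an illustrative sketch only: the SeatBudget class is hypothetical, and the rule that a key is refused once spend reaches the cap is an assumption about how the ceiling behaves. The dollar figures are the ones in the panel.

```python
class SeatBudget:
    """Track spend for one key and refuse calls once the cap is reached."""

    def __init__(self, cap_usd: float, spent_usd: float = 0.0):
        self.cap_usd = cap_usd
        self.spent_usd = spent_usd

    def authorize(self) -> int:
        """Return 200 if the key may make another call, 402 if it is capped."""
        # Assumption: the ceiling applies as soon as spend reaches the cap.
        return 402 if self.spent_usd >= self.cap_usd else 200


# Figures from the "Spend per seat" panel.
seats = {
    "alex": SeatBudget(cap_usd=80, spent_usd=31),
    "priya": SeatBudget(cap_usd=80, spent_usd=58),
    "sam": SeatBudget(cap_usd=80, spent_usd=80),
}

# 31 + 58 + 80 = 169, matching the team line of $169 / $400.
team_spend = sum(seat.spent_usd for seat in seats.values())
```

Here only sam's key is refused; alex and priya still have headroom under their own $80 caps.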
The code

Set the model per request

A coding assistant just sets the model field per call: fast for completion, strong for reasoning. This snippet comes from the same SDK examples the playground uses; change the model string and the catalog does the rest.

Install: pip install openai

# Cache: enabled (org default). Pass nemo_cache: false to skip.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["NEMOROUTER_API_KEY"],
    base_url="https://api.nemorouter.ai/v1",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    temperature=1,
    max_tokens=1024,
    top_p=1,
    messages=[
        {"role": "user", "content": "Hello! What models do you support?"},
    ],
    extra_body={
        # "nemo_cache": False,  # Uncomment to skip cache
    },
)

print(response.choices[0].message.content)

One key reaches every model in the catalog — no per-model provider account to manage.

Model choice, low latency, per-seat budgets

Build a coding assistant your finance team can read

Pick the model per task, route for latency, and cap spend per developer — all unlocked on every plan.

OpenAI-compatible — works with any IDE extension that targets the OpenAI SDK.