Can I use different models for autocomplete versus deep code reasoning?

Yes. The model catalog exposes every model behind one key. Set the model per request — a fast model for inline completion, a stronger reasoning model for refactors or test generation — with no separate provider account or SDK swap.

How does Nemo Router keep code completion latency low?

Routing decisions happen in memory and add roughly 95 ms at p50, so LLM inference time still dominates end-to-end latency. Latency-based routing steers each request to the quickest healthy endpoint, and least-busy routing avoids the most loaded deployment.

Can I budget LLM spend per developer seat?

Yes. Issue one virtual key per developer or per team and set a per-key budget. Spend, rate limits, and the request log all scope to that key, so you can see and cap cost per seat.

Does streaming work for token-by-token completion?

Yes. Nemo Router is OpenAI-compatible and proxies streaming responses transparently, so an editor extension gets tokens as they are generated — the gateway adds no buffering on the hot path.
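
As a rough illustration of the editor side, the sketch below joins streamed deltas into the final completion text. It assumes the usual OpenAI Python SDK streaming shape (`stream=True`, with text arriving in `chunk.choices[0].delta.content`). The helper name `collect_stream` and the `on_token` callback are hypothetical and not part of Nemo Router.

```python
from typing import Callable, Iterable, Optional


def collect_stream(
    chunks: Iterable,
    on_token: Optional[Callable[[str], None]] = None,
) -> str:
    """Join streamed deltas into the full completion text.

    `chunks` is any iterable of OpenAI-style streaming chunks, such as the
    object returned by client.chat.completions.create(..., stream=True).
    """
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices (for example a trailing usage chunk).
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have content set to None
            parts.append(text)
            if on_token is not None:
                on_token(text)  # e.g. append to the inline suggestion
    return "".join(parts)


# Usage against the gateway (not executed here; needs a key and network):
#   stream = client.chat.completions.create(
#       model="gemini-2.5-flash",
#       messages=[{"role": "user", "content": "def fib(n):"}],
#       stream=True,
#   )
#   completion = collect_stream(stream, on_token=print)
```

Passing a callback keeps the rendering logic in the editor extension while the helper only handles chunk bookkeeping.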

Use Case · Code Generation

Coding assistants with model choice and per-seat budgets.

A coding tool mixes fast autocomplete with deep reasoning. Nemo Router lets you pick the model per task from one catalog, route latency-sensitive calls to the quickest endpoint, and cap spend per developer seat.

code-assistant · seat key

One catalog, the right model per task

Inline completion: gemini-2.5-flash
Refactor / tests: gemini-2.5-pro
Route strategy: least-busy
Streaming: token-by-token
Seat budget: $31 / $80
Provider config: none

model choice · low-latency · per-seat budget

Model catalog: 20+
Gateway overhead: 95 ms
Per-seat spend: Budgeted
Every call: Logged

Why Nemo for code

The right model, fast, and within budget

A coding assistant has two competing needs — speed for completion, depth for reasoning — and cost that scales with every seat. Nemo Router handles all three behind one key.

Pick the model per task

Inline autocomplete wants a fast model; a multi-file refactor or a test-suite generation wants a strong reasoning model. The catalog exposes every model behind one key — set it per request.

Choose any catalog model per chat.completions call
Tag-filtered routing keeps code tasks on code-capable models
Swap models as the catalog grows — no SDK change
20+ models live, Anthropic and OpenAI shipping next
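
One way a client might pick the model per task is a small lookup that builds the arguments for a `chat.completions` call. This is a minimal sketch: the task labels and the `build_request` helper are hypothetical, and the model names are simply the two shown in the example card on this page.

```python
# Hypothetical task labels mapped to the models named on this page.
MODEL_BY_TASK = {
    "completion": "gemini-2.5-flash",  # fast model for inline autocomplete
    "refactor": "gemini-2.5-pro",      # stronger model for multi-file edits
    "tests": "gemini-2.5-pro",         # stronger model for test generation
}


def build_request(task: str, prompt: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    # Unknown tasks fall back to the fast model (an assumption of this sketch).
    model = MODEL_BY_TASK.get(task, MODEL_BY_TASK["completion"])
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


# Usage (not executed here; needs a configured OpenAI-compatible client):
#   client.chat.completions.create(**build_request("refactor", "Extract this class"))
```

Only the `model` string changes between the two kinds of call, so the same client and key serve both paths.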

Low-latency routing for completion

A developer is waiting on every keystroke. Latency-based routing steers each completion request to the quickest healthy endpoint; least-busy avoids the most loaded deployment.

Routing decisions add ~95 ms p50 — LLM time dominates
Latency-based and least-busy strategies for hot paths
Streaming proxied transparently for token-by-token completion
Failover keeps the editor responsive during a provider blip
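
To make the two strategies concrete, here is a toy sketch of how a router could choose between deployments. It is an illustration of the general idea only, not Nemo Router's implementation; the `Endpoint` fields and function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    name: str
    healthy: bool
    avg_latency_ms: float  # recent average response latency (assumed metric)
    in_flight: int         # requests currently being served (assumed metric)


def _healthy(endpoints: List[Endpoint]) -> List[Endpoint]:
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        raise RuntimeError("no healthy endpoint available")
    return candidates


def pick_latency_based(endpoints: List[Endpoint]) -> Endpoint:
    """Choose the healthy endpoint with the lowest recent latency."""
    return min(_healthy(endpoints), key=lambda e: e.avg_latency_ms)


def pick_least_busy(endpoints: List[Endpoint]) -> Endpoint:
    """Choose the healthy endpoint with the fewest in-flight requests."""
    return min(_healthy(endpoints), key=lambda e: e.in_flight)
```

Skipping unhealthy endpoints before ranking is what lets the same selection step double as failover in this sketch.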

Per-seat budgets

A coding assistant scales cost with headcount. Issue one virtual key per developer or per team and set a per-key budget — spend is visible and capped per seat.

One virtual key per developer or team
Per-key budget with a hard 402 ceiling
Per-key RPM/TPM limits prevent one seat hogging capacity
Spend analytics break cost down by key
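
On the client, a capped seat shows up as an HTTP 402 on the next call. The sketch below shows one way an editor extension might detect that case. It assumes the OpenAI Python SDK surfaces the response as an `openai.APIStatusError` with a `status_code` attribute; the helper names and the messages are hypothetical.

```python
def is_budget_exhausted(exc: Exception) -> bool:
    """True when an API error carries HTTP 402 (the per-key budget ceiling)."""
    return getattr(exc, "status_code", None) == 402


def user_message_for(exc: Exception) -> str:
    """Map an API error to a short message for the editor UI."""
    if is_budget_exhausted(exc):
        return "Seat budget reached. Ask an admin to raise the limit."
    return "Completion failed. Retrying may help."


# Usage with the OpenAI SDK (not executed here; show_in_editor is hypothetical):
#   import openai
#   try:
#       client.chat.completions.create(...)
#   except openai.APIStatusError as exc:
#       show_in_editor(user_message_for(exc))
```

Treating 402 separately from other errors avoids pointless retries once a seat has hit its ceiling.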

Observability for every generation

When a suggestion is wrong or slow, you want the request. Every completion lands in the log with the model, latency, token counts, and real cost.

Request log records model, latency, tokens, and cost per call
Filter by a developer’s virtual key to inspect their traffic
Export to Langfuse, Datadog, or S3 via a logging callback
A/B test two models on real completion traffic
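
Once log records are exported, per-key analysis is a short aggregation. The sketch below assumes each record is a dictionary with `key`, `model`, `latency_ms`, `tokens`, and `cost_usd` fields. Those field names and the sample values are invented for illustration and are not the actual log schema.

```python
from collections import defaultdict
from typing import Dict, List

# Made-up sample records; the real export format may differ.
records = [
    {"key": "dev-key-alex", "model": "gemini-2.5-flash", "latency_ms": 210, "tokens": 340, "cost_usd": 0.0004},
    {"key": "dev-key-alex", "model": "gemini-2.5-pro", "latency_ms": 4200, "tokens": 5200, "cost_usd": 0.0120},
    {"key": "dev-key-priya", "model": "gemini-2.5-flash", "latency_ms": 180, "tokens": 290, "cost_usd": 0.0003},
]


def spend_by_key(rows: List[dict]) -> Dict[str, float]:
    """Sum cost per virtual key."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["key"]] += row["cost_usd"]
    return dict(totals)


def slowest_for(rows: List[dict], key: str, n: int = 1) -> List[dict]:
    """Return the n slowest requests made with a given key."""
    mine = [row for row in rows if row["key"] == key]
    return sorted(mine, key=lambda row: row["latency_ms"], reverse=True)[:n]
```

Filtering by key first is the same step you would take to inspect one developer's traffic when a suggestion was slow.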

How it works

An editor request, end to end

Each developer carries a seat key. Completion and reasoning requests route to the model the task needs, stream back token-by-token, and land in the log attributed to that seat.

Code-assistant request flow

Editor request (completion or chat): A keystroke completion or a refactor prompt from the IDE.
Seat key (sk-nemo-... · per developer): Budget and rate limit scope to the developer’s key.
Model + latency route (catalog · least-busy): Fast model for completion, strong model for reasoning.
Stream back (token-by-token): Streaming proxied transparently to the editor.
Logged per seat (request log): Model, latency, tokens, cost, all attributed to the seat.

The gateway adds about 95 ms at p50 — LLM inference is the dominant latency factor. Streaming is proxied with no hot-path buffering.

Budgets

Cost that scales with headcount, capped per seat

Per-seat budgets

One key per developer — spend you can see and cap

A coding assistant’s bill grows with every engineer you add. Give each developer their own virtual key with a per-key budget. The request log, rate limits, and spend all scope to that key, so cost per seat is a number, not a guess. Once a key reaches its budget, further calls are rejected with a 402 instead of running up an overage.

One virtual key per developer or per team
Per-key budget enforced with a hard 402 ceiling
Per-key RPM/TPM limits keep one seat from hogging capacity
Spend analytics attribute every dollar to a key

budgets · engineering team

Spend per seat

dev-key · alex: $31 / $80
dev-key · priya: $58 / $80
dev-key · sam: $80 / $80
sam · next call: 402 (capped)
Team budget: $169 / $400

per-key · hard ceiling · tracked
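
The numbers in the card can be reproduced with a small ledger. This is a toy model of a hard ceiling, assuming a call is rejected once spend reaches the limit (which matches sam at $80 of $80 receiving a 402). The `SeatBudget` class is hypothetical and not the gateway's implementation.

```python
class SeatBudget:
    """Toy per-key budget with a hard ceiling."""

    def __init__(self, limit_usd: float, spent_usd: float = 0.0) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = spent_usd

    def authorize(self) -> int:
        """Status a new call would receive: 200 under budget, 402 at the cap."""
        return 402 if self.spent_usd >= self.limit_usd else 200

    def remaining(self) -> float:
        return max(self.limit_usd - self.spent_usd, 0.0)


# Figures taken from the spend-per-seat card.
seats = {
    "alex": SeatBudget(80, 31),
    "priya": SeatBudget(80, 58),
    "sam": SeatBudget(80, 80),
}

team_spend = sum(seat.spent_usd for seat in seats.values())  # 31 + 58 + 80 = 169
```

The three seat totals add up to the $169 shown against the $400 team budget.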

The code

Set the model per request

A coding assistant just sets the model field per call — fast for completion, strong for reasoning. These snippets come from the same SDK examples the playground uses; change the model string and the catalog does the rest.

Install: pip install openai

```python
# Cache: enabled (org default). Pass nemo_cache: false to skip.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["NEMOROUTER_API_KEY"],
    base_url="https://api.nemorouter.ai/v1",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    temperature=1,
    max_tokens=1024,
    top_p=1,
    messages=[
        {"role": "user", "content": "Hello! What models do you support?"},
    ],
    extra_body={
        # "nemo_cache": False,  # Uncomment to skip cache
    },
)

print(response.choices[0].message.content)
```

One key reaches every model in the catalog — no per-model provider account to manage.

FAQ

Common code-assistant questions

Model choice, low latency, per-seat budgets

Build a coding assistant your finance team can read

Pick the model per task, route for latency, and cap spend per developer — all unlocked on every plan.

OpenAI-compatible — works with any IDE extension that targets the OpenAI SDK.