Use Case · RAG
Retrieval-augmented generation, one key for embed and chat.
A RAG pipeline calls two model families — embeddings to index and query, a chat model to synthesize the answer. Route both through one Nemo Router endpoint, cache the repeats, and track cost per stage.
Two model families, repetitive traffic, real cost pressure, and providers that occasionally fail. Nemo Router handles all four behind one key.
Embeddings and chat, one key
A RAG pipeline touches two model families. The same endpoint that answers chat.completions also serves the embeddings call that builds and queries your index — one key, one bill, zero provider config.
embeddings + chat.completions on the same base URL
Swap embedding or chat models without re-keying
Catalog tags surface long-context models for big retrievals
No separate embedding-provider account to manage
Caching for repetitive traffic
RAG traffic repeats — FAQs re-asked, identical context windows, popular queries. Caching is on by default, so an exact-match request is served without ever hitting the provider.
Response cache enabled by default per org
Exact-match repeats skip the provider call entirely
Override per request with nemo_cache: false for fresh output
Cache decision recorded in the request log
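A minimal sketch of the per-request override: the helper sends one chat request and adds nemo_cache: false only when a fresh generation is wanted. The names ask and fresh are illustrative, not part of any SDK, and client is assumed to be an OpenAI client pointed at the Nemo Router base URL.

```python
def ask(client, question, model="gemini-2.5-flash", fresh=False):
    """Send one chat request; skip the response cache when fresh=True.

    `client` is assumed to be an openai.OpenAI instance configured with the
    Nemo Router base URL. `ask` and `fresh` are illustrative names.
    """
    # Caching is on by default, so the flag is only sent when opting out.
    extra_body = {"nemo_cache": False} if fresh else {}
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        extra_body=extra_body,
    )
    return response.choices[0].message.content
```

A FAQ lookup would typically leave fresh at its default, while a question about live data might pass fresh=True.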
Cost tracking per pipeline stage
LiteLLM reports the real cost of every call. Tag your embedding calls and your synthesis calls separately and the dashboard attributes spend to each stage of the pipeline.
Real per-call cost from the response-cost header
Tag index vs. query traffic via request metadata
Per-org, per-team, and per-key budgets cap runaway spend
Spend analytics break down cost by model and tag
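To make per-stage attribution concrete, here is a rough sketch that builds a stage tag for a request and sums cost per tag from your own records. The metadata.tags shape is an assumption about the request-metadata format (this page does not spell out the field names), and the cost figures are made up for illustration.

```python
from collections import defaultdict

def stage_metadata(stage):
    """Build an extra_body payload that tags a request with its pipeline stage.

    ASSUMPTION: the `metadata.tags` shape is a guess at the request-metadata
    format; check the Nemo Router docs for the actual field names.
    """
    return {"metadata": {"tags": [f"rag:{stage}"]}}

def spend_by_stage(records):
    """Sum cost per tag from (tag, cost_usd) pairs taken from your own logs."""
    totals = defaultdict(float)
    for tag, cost_usd in records:
        totals[tag] += cost_usd
    return dict(totals)

# Hypothetical numbers, for illustration only.
example = [
    ("rag:index", 0.002),
    ("rag:query", 0.0001),
    ("rag:synthesis", 0.004),
    ("rag:synthesis", 0.003),
]
totals = spend_by_stage(example)
```

The payload from stage_metadata would be passed as extra_body on the embeddings or chat call, so each request carries the stage it belongs to.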
Failover keeps retrieval answering
When an embedding or chat provider degrades, the fallback chain retries the next link transparently. Your index build finishes and your query path keeps returning answers.
Ordered fallback chain per model group
Timeouts, 5xx, and circuit-breaks all trigger the next link
Retries honor cooldown and provider rate-limit hints
Every fallback logged for replay
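The fallback chain runs on the router side, so no client code is needed for it. Purely to illustrate the behaviour described above, here is a client-side analogue that tries each link in order and moves on when one fails. All names are hypothetical, and this is not a description of how Nemo Router implements failover.

```python
def call_with_fallback(links, request):
    """Try each (name, handler) link in order; return the first success.

    Returns (name_of_link_that_answered, result, log_of_failed_links).
    Raises RuntimeError if every link fails.
    """
    failures = []
    for name, handler in links:
        try:
            return name, handler(request), failures
        except Exception as exc:  # e.g. timeout, 5xx, open circuit
            # Record the failure so it can be inspected later.
            failures.append((name, repr(exc)))
    raise RuntimeError(f"all links failed: {failures}")
```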
How it works
A RAG request, end to end
Index once, then query: embed the question, retrieve context from your own vector store, and synthesize with a chat model. Nemo sits on the two LLM hops — embeddings and synthesis — and logs the cost of each.
RAG pipeline flow
1. Index documents (POST /v1/embeddings): Chunk and embed your corpus once; store vectors in your DB.
2. Query embedding (POST /v1/embeddings): Embed the user question with the same model.
3. Retrieve context (your vector store): Nearest-neighbour search runs in your own database.
4. Synthesize answer (POST /v1/chat/completions): Chat model answers from retrieved context, cached if repeated.
5. Settled + logged (cost per stage): Embed cost, chat cost, and cache hit are all in the request log.
Nearest-neighbour search stays in your database. Nemo Router handles the two LLM hops — embeddings and synthesis — with caching, failover, and per-stage cost tracking.
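The retrieval step in the middle of the flow runs outside the router. As a minimal sketch, assuming an in-memory list stands in for a real vector store, the functions below rank chunks by cosine similarity and assemble the chat messages for the synthesis call. Function names and the system-prompt wording are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k chunks whose vectors are most similar to query_vec.

    `index` is a list of (chunk_text, vector) pairs, standing in for a
    real vector database.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_messages(question, chunks):
    """Assemble chat messages that ground the answer in the retrieved chunks."""
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": "Answer using only the context below.\n\n" + context},
        {"role": "user", "content": question},
    ]
```

The list returned by build_messages can be passed as messages to client.chat.completions.create, which is the second of the two hops the router handles.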
Caching
Repeated questions never hit the provider twice
Response caching
Exact-match repeats are served from cache
Knowledge-base RAG answers the same questions over and over. With caching on by default, an identical request — same model, same context, same prompt — returns from cache instead of paying for another generation. The cache decision lands in the request log so you can see the hit rate.
Caching enabled by default per org
Exact-match repeats skip the provider call and the cost
nemo_cache: false forces a fresh generation when freshness matters
Cache hit / miss recorded per request for observability
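"Exact match" means every field of the request matters: a different model, context, or even capitalization produces a miss. The sketch below shows one way such a cache could key requests, by hashing a canonical JSON encoding of the model and messages. It is an illustration of the idea only; this page does not describe Nemo Router's actual keying scheme.

```python
import hashlib
import json

def cache_key(model, messages):
    """Hash the request so identical requests map to the same key."""
    # sort_keys gives a canonical encoding regardless of dict key order.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ExactMatchCache:
    """Tiny in-memory cache that records hits and misses."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_generate(self, model, messages, generate):
        """Return the cached answer, or call generate() once and store it."""
        key = cache_key(model, messages)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = generate(model, messages)
        return self.store[key]
```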
cache · knowledge-base RAG
Cache behaviour
Question: "reset my password?"
First ask: miss · generated
Re-ask: hit · 0 ms
Provider call: skipped
Cost on hit: $0.00
default-on · per-request override · logged
The code
Same client for embeddings and chat
A RAG pipeline is just two endpoint calls against one key. These snippets come straight from the SDK examples the playground and dashboard use — set NEMOROUTER_API_KEY and the chat call runs as-is; the embeddings call uses the same client and base URL.
Install: pip install openai
# Cache: enabled (org default). Pass nemo_cache: false to skip.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["NEMOROUTER_API_KEY"],
    base_url="https://api.nemorouter.ai/v1",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    temperature=1,
    max_tokens=1024,
    top_p=1,
    messages=[
        {"role": "user", "content": "Hello! What models do you support?"},
    ],
    extra_body={
        # "nemo_cache": False,  # Uncomment to skip cache
    },
)

print(response.choices[0].message.content)
The same client object also calls client.embeddings.create() — one key covers the whole pipeline.
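As a sketch of the embeddings side, the helper below embeds a batch of strings through the same client. The helper name embed_texts is illustrative, and no embedding model name is given on this page, so the caller supplies one from the catalog.

```python
def embed_texts(client, texts, model):
    """Embed a batch of strings and return one vector per input, in order.

    `client` is assumed to be the same openai.OpenAI instance used for chat,
    configured with the Nemo Router base URL.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the output order matches the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Use the same model value for indexing and for query embedding, so question vectors and document vectors live in the same space.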
FAQ
Common RAG questions
One key for the whole pipeline
Ship a RAG pipeline without juggling providers
Embeddings, chat, caching, and per-stage cost tracking, all behind one Nemo Router key. Every feature is unlocked on every plan.