Status

All systems operational — and we publish the proof.

NemoRouter runs on managed Cloud Run with multi-region failover, synthetic probes, and revision-scoped post-deploy log scans. Targeted uptime is 99.9% across every plan tier — the same SLA on Tier 1 as on Enterprise.

Subscribe to incidents Report an issue

Status board

Every component, monitored independently

Failures rarely take down everything at once. Each surface is reported separately so partial degradations show up honestly — never papered over by a single green light.

Some systems are experiencing issues

As of May 19, 2026

Gateway

API proxy — US-Central (primary)
OpenAI-compatible /v1 endpoint, in-process router
99.99%Operational
Multi-region failover
Automatic failover when the primary degrades
100%Operational
Provider routing & fallback
Routing strategies, fallback chains, retry policies
99.99%Operational

Identity & Auth

Authentication
Login, signup, SSO/SAML, session management
99.99%Operational
RLS-scoped Supabase
Tenant-isolated Postgres with Row-Level Security
99.99%Operational
Virtual key validation
Per-key auth, RPM/TPM enforcement, budget caps
99.99%Operational

Billing & Credits

Credit ledger
Reserve + settle, atomic balance mutations
100%Operational
Stripe billing & webhooks
Checkout, top-ups, subscription, webhook ingest
99.99%Operational

Dashboard & Observability

Management dashboard
Keys, teams, analytics, guardrails, settings
99.99%Operational
Observability & logging
Request logs, callbacks, alerts, latency metrics
99.99%Operational

Provider routing

Google Vertex AI
Gemini, Imagen, Veo, embeddings — live today
99.98%Operational
Anthropic · OpenAI · AWS Bedrock
Shipping next — not yet routing production traffic
Under maintenance

Anthropic, OpenAI, and AWS Bedrock are marked “under maintenance” because they are shipping next — they do not yet route production traffic. See the model catalog for what is live today.

Uptime SLA: 99.9%
API proxy availability: 99.99%
Avg latency overhead: ~8 ms
Incidents this quarter: 0

Incident history

No open or recent customer-impacting incidents

Major incidents (P0 / P1) get a published postmortem within 7 days. When this section has entries, each will carry a timeline, root cause, and remediation.

0 incidents this quarter

As of May 19, 2026. Postmortems are published to the changelog.

How we monitor

Probes, gates, and alerts on every surface

Uptime is an output. The inputs are synthetic probes, deploy-time gates, and customer-configurable alerts that fire before a ticket gets filed.

Synthetic probes

Synthetic checks run against /v1/chat/completions and /health/readiness from every active region every 60 seconds. A miss alerts on-call before any customer ticket.

Hits the real proxy path, not a mock endpoint
Validates streaming + non-streaming separately
Probes a virtual key — same path as real customer traffic

Readiness gating

Every Cloud Run revision must pass a revision-scoped log scan before it serves traffic. A 200 on /health is necessary, not sufficient.

Boot logs scanned for DataError + 5xx for 60s
Prisma drift detection blocks promotion
Auto-rollback if the SLO window degrades

Customer alerts

Configure per-org alerts for budget burn, error spikes, latency regressions, or provider outages. Webhook, email, Slack, and Teams channels are all built in.

8 alert types out of the box
Per-key + per-org thresholds
Slack / Teams / webhook delivery

Incident pipeline

From probe to rollback in minutes

When something breaks, the path from detection to mitigation runs through this pipeline — every step automated, every step audited.

Detection → response

Synthetic probe
every 60s
Cross-region health checks
Cloud Run revision
:8090
Readiness + revision-scoped log scan
Alerting
webhook · slack · email
On-call paged on degradation
Auto-rollback
SLO breach
Previous revision restored

Common questions

What this page covers — and what it does not

Is this a live-polled status feed?

Not yet — and we will not pretend otherwise. This board reflects the operational posture as of the snapshot date and changes with the deploy that resolves any incident. A live, server-backed feed is on the roadmap. We chose not to embed a third-party vendor widget (Statuspage / Better Stack) because it requires a separate account and a public page that drifts from our deploys.

How do you measure uptime?

Synthetic probes against the live API proxy from every active region every 60 seconds. Customer-impacting incidents — the kind that surface in real production traffic — count against the trailing 30-day window. Cosmetic and self-healing transient errors do not.

Where is the incident history?

Major incidents (P0 / P1) get a postmortem in the changelog within 7 days, with timeline, root cause, and remediation. As of the snapshot date there are no open or recent customer-impacting incidents.

Notice an issue?

Tell us before our probes do

If you are seeing routing errors or latency spikes that are not reflected here, our engineers want to know. Reports are triaged within an hour during business windows.

Email support Security disclosure

Security-sensitive reports — please email security@nemorouter.ai instead of filing a public issue.