Amperes API reference
OpenAI-compatible inference control plane. Drop-in replacement for api.openai.com
with smart routing, governance, and observability.
Quickstart
Amperes runs in one of two deployment modes today. The API is identical across both — only the base_url changes.
| If you want… | Use | Onboarding |
|---|---|---|
| Fastest start, full visibility | Hosted | ~5 minutes · we provision a key |
| Zero data egress, your AWS account | In-VPC deploy | ~30 min · CloudFormation in your AWS account |
An in-process SDK that keeps the routing decision inside your service (we only receive metadata) is in development. Book a demo if you'd like to be on the early-access list.
1. Get credentials
Book a demo to discuss your team's deployment mode and expected monthly volume. Keys are provisioned out of the call (Hosted SaaS) or as part of the CloudFormation stack (In-VPC).
2. Set environment variables
3. Swap one line in your code
That's it. Existing call sites work without modification. The response shape is byte-identical to OpenAI's.
Authentication
All non-public endpoints require a bearer token in the Authorization header.
Keys are per-customer. Lost or compromised keys can be rotated by booking a quick call. We hash keys before storage (SHA-256, hmac.compare_digest on lookup) so a SQL leak cannot recover them.
What's public
One endpoint requires no auth: GET /health. Use it for load-balancer liveness probes.
Deployment modes
Same routing engine. Two places it can live today. The choice is about where your prompt data is processed — not what the product does.
Hosted
~5-minute onboarding · your traffic passes through our proxy
The fastest start. We run the proxy, dashboard, and audit log on AWS in us-east-2. By default we store only prompt hashes (SHA-256); full prompt storage is opt-in per customer via store_full_prompts=true.
SOC 2 Type I is planned (not yet started). If procurement needs a current attestation today, In-VPC mode keeps your data in your own account.
Good fit for: AI-native startups and mid-market teams where the data-residency question is "US is fine."
In-VPC deploy · CloudFormation
~30-minute deploy · zero data egress to Amperes
Amperes runs as an ECS Fargate task inside your AWS account. The stack (infra/aws/cloudformation.yaml) provisions a VPC, RDS Postgres (encrypted at rest), a Bedrock VPC endpoint, a scoped IAM role, and an internal ALB.
Your inference traffic stays inside your VPC end to end — we never see a token. We can publish the proxy image to your private ECR or supply a tagged build on request.
Your application talks to the internal ALB URL (for example https://amperes.internal.acme.com/v1). All other code is unchanged from the Hosted snippet above — same headers, same response shape.
Good fit for: healthcare (HIPAA), fintech, insurance, government, any F500 where a CISO has to sign off. Book a demo to get the template walked through.
Chat completions
POST/v1/chat/completions
The primary routing endpoint. Mirrors OpenAI's chat completion API; accepts every field OpenAI does. The proxy classifies, picks a model, calls upstream, and returns the response unchanged.
Request body
Same as OpenAI's spec, with two notable behaviors:
modelaccepts the literal string"auto"to let the router decide. Pass a specific model ID (fromGET /v1/models) to force a model — the policy still applies governance + region filters.- Credential / mock / callback fields are stripped before reaching upstream.
Streaming
Set "stream": true for SSE. The proxy forwards each chunk as it arrives, preserves OpenAI's data: {...}\n\n framing, and emits data: [DONE] at the end. Time to first token (TTFB) and total latency are logged separately.
Tools
Tool calls work end-to-end. Models that don't support tools are filtered out of the candidate set automatically.
Embeddings
POST/v1/embeddings
Passthrough to your preferred embedding provider with governance enforcement. PII detection and region constraints apply. HIPAA-only customers receive 403 until a HIPAA-eligible embedding provider is configured.
Why governance on embeddings?
Embeddings move the prompt data through the same network and providers as chat. Without governance, a HIPAA customer could embed patient notes through a US-only model — same compliance risk as a chat completion. We close that gap.
Model registry
GET/v1/models
OpenAI-shape model list with our routing metadata attached. Each model entry:
Response headers
Every chat completion (streaming or not) carries control-plane headers. Read them in your client to audit routing decisions:
| Header | Value type | Description |
|---|---|---|
| x-router-model | string | The model that served this request |
| x-router-tier | low/medium/high | Classified complexity |
| x-router-task-type | string | coding / extraction / qa / planning / etc. |
| x-router-cost-usd | float | What this request actually cost |
| x-router-baseline-cost-usd | float | What the same prompt on Opus would have cost |
| x-router-policy | string | Decision reasoning with score breakdown |
| x-router-escalated | true / absent | Cheap-first failed confidence check; escalated to strong model |
| x-router-pii-detected | true / absent | PII detected and handled per policy |
| x-router-pii-categories | comma-list | email, phone, ssn, credit_card, … |
| x-router-governance-action | redacted / blocked | What we did with PII |
| x-router-agent-step | string | planning / retrieval_synthesis / formatting / tool_use |
| x-router-drift-active | comma-list | Model ids currently flagged by the drift detector (sent only when there is one) |
| x-router-request-id | uuid | For correlation with our audit log |
Errors
OpenAI-shape error responses. Same JSON envelope, same client retry logic.
| Status | type | When |
|---|---|---|
| 400 | invalid_request_error | Bad JSON, missing fields, malformed messages |
| 401 | authentication_error | Missing / invalid Bearer token |
| 403 | permission_error | Tenant isolation, PII block, HIPAA mismatch, region mismatch |
| 413 | request_too_large | Body exceeds 5 MB |
| 429 | rate_limit_error | Per-customer rate limit; Retry-After header included |
| 502 | upstream_error | All routing candidates failed; non-retryable provider error |
Policies
A policy is a tier-stratified allowlist of candidate models. The router picks the best candidate via multi-objective scoring.
| Policy | Profile |
|---|---|
| balanced | Default. Cheapest viable per tier across providers. |
| aggressive | Maximum savings. Downshifts tiers when possible. |
| conservative | Quality-first. Upshifts tiers when in doubt. |
| anthropic_only | Anthropic direct + Bedrock-Claude. |
| openai_only | OpenAI direct. |
| bedrock_only | AWS Bedrock-Claude only. HIPAA-friendly. |
| local_only | On-prem models in your VPC. $0/token. HIPAA + EU + air-gap. |
| hybrid_local_cloud | Local for 80% of traffic; cloud Opus for the hardest 20%. |
Task types
Detected from prompt content (regex + keyword scoring; embedding fallback). Drives task-affinity scoring in the policy.
qa · summarization · extraction · structured_output · coding · reasoning · planning · agentic · creative · tool_use · long_context · general
Confidence escalation
When eligible (tier > low, no tools, no structured output), the router calls a cheaper model first and inspects the response. Signals that trigger escalation:
- Hedge markers ("I'm not sure", "perhaps")
- Refusal markers ("I cannot", "I'm unable")
- Truncation, repetition, or very short output
- finish_reason of
lengthorcontent_filter - Expected tool call missing
- Response echoes the prompt
Below a configurable confidence threshold (default 0.55), we re-route to the strong model. Both calls are logged, and the dashboard tracks escalation rate so you can validate the trade is net-positive.
Provider failover
We track rolling-window error rate and p99 latency per provider. As a provider degrades, its score drops and traffic shifts to healthier candidates; when it goes "down", its models are excluded until health recovers.
PII detection
Regex-based detection across 11 categories. Luhn-validated credit card recognition. Actions configurable per customer:
| action | behavior |
|---|---|
| redact | Replace matched spans with [REDACTED_<CATEGORY>] before sending upstream. Default. |
| block | 403 the request entirely. Use for HIPAA / PCI workloads where leaks are unrecoverable. |
| allow | Log the detection but pass through. Audit trail only. |
Region routing
Models carry region tags (us, eu, apac, global). Customers configure allowed_regions. Models whose region list doesn't intersect are filtered from candidates. Empty constraint = no restriction.
HIPAA mode
Set require_hipaa_models=true on the customer config and the router only selects models flagged hipaa_compliant: true — today the AWS Bedrock-hosted family (Claude Haiku / Sonnet / Opus, plus Amazon Nova Lite / Micro). The filter runs before scoring, so a HIPAA-on customer cannot route to a non-compliant model regardless of policy.
Audit log
Append-only record of every routing decision, escalation, fallback, PII redaction, and policy block. Queryable via GET /admin/audit?days=30. Exportable to S3 via POST /admin/export/s3.
Rate limits
Per-customer sliding-window. Default 60 req/min for chat, separate bucket for embeddings. Configurable on the CustomerConfig. 429 includes a Retry-After header.
In In-VPC mode the rate limit is yours to tune — you control the ECS task scaling.