Documentation

Amperes API reference

OpenAI-compatible inference control plane. Drop-in replacement for api.openai.com with smart routing, governance, and observability.


Quickstart

Amperes runs in one of two deployment modes today. The API is identical across both — only the base_url changes.

If you want…                         Use             Onboarding
Fastest start, full visibility       Hosted          ~5 minutes · we provision a key
Zero data egress, your AWS account   In-VPC deploy   ~30 min · CloudFormation in your AWS account

An in-process SDK that keeps the routing decision inside your service (we only receive metadata) is in development. Book a demo if you'd like to be on the early-access list.

1. Get credentials

Book a demo to discuss your team's deployment mode and expected monthly volume. Keys are provisioned after the call (Hosted SaaS) or as part of the CloudFormation stack (In-VPC).

2. Set environment variables

    $ export AMPERES_API_KEY="amperes-..."
    $ export AMPERES_BASE_URL="https://api.amperes.pro/v1"  # Hosted SaaS
    # Or your in-VPC ALB URL: https://amperes.<your-domain>.internal/v1

3. Swap one line in your code

    from openai import OpenAI
    import os

    client = OpenAI(
        api_key=os.environ["AMPERES_API_KEY"],
        base_url=os.environ["AMPERES_BASE_URL"],  # ← only change
    )

    # with_raw_response is only needed to read HTTP headers;
    # .parse() returns the usual ChatCompletion object.
    raw = client.chat.completions.with_raw_response.create(
        model="auto",
        messages=[{"role": "user", "content": "Parse this CSV..."}],
    )
    response = raw.parse()
    print(response.choices[0].message.content)
    print(raw.headers.get("x-router-model"))  # which model we routed to

    import OpenAI from 'openai';

    const client = new OpenAI({
      apiKey: process.env.AMPERES_API_KEY,
      baseURL: process.env.AMPERES_BASE_URL, // ← only change
    });

    const response = await client.chat.completions.create({
      model: 'auto',
      messages: [{ role: 'user', content: 'Parse this CSV...' }],
    });
    console.log(response.choices[0].message.content);

    $ curl $AMPERES_BASE_URL/chat/completions \
        -H "Authorization: Bearer $AMPERES_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{
          "model": "auto",
          "messages": [{"role": "user", "content": "Parse this CSV..."}]
        }'

That's it. Existing call sites work without modification. The response shape is byte-identical to OpenAI's.


Authentication

All non-public endpoints require a bearer token in the Authorization header.

Authorization: Bearer amperes-...

Keys are per-customer. Lost or compromised keys can be rotated by booking a quick call. We hash keys before storage (SHA-256, hmac.compare_digest on lookup) so a SQL leak cannot recover them.
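
A minimal sketch of the storage scheme described above, assuming the stored value is a hex SHA-256 digest. The function names hash_key and verify_key and the example key are illustrative, not Amperes internals.

```python
import hashlib
import hmac


def hash_key(api_key: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_key(presented_key: str, stored_digest: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    return hmac.compare_digest(hash_key(presented_key), stored_digest)


stored = hash_key("amperes-example-key")  # hypothetical key value
print(verify_key("amperes-example-key", stored))
print(verify_key("amperes-wrong-key", stored))
```

Only the digest is persisted, so a leaked table exposes hashes rather than usable bearer tokens, and the constant-time comparison avoids leaking how many leading characters matched.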

What's public

One endpoint requires no auth: GET /health. Use it for load-balancer liveness probes.


Deployment modes

Same routing engine. Two places it can live today. The choice is about where your prompt data is processed — not what the product does.

Hosted

~5-minute onboarding · your traffic passes through our proxy

The fastest start. We run the proxy, dashboard, and audit log on AWS in us-east-2. By default we store only prompt hashes (SHA-256); full prompt storage is opt-in per customer via store_full_prompts=true.

SOC 2 Type I is planned (not yet started). If procurement needs a current attestation today, In-VPC mode keeps your data in your own account.

    $ curl https://api.amperes.pro/v1/chat/completions \
        -H "Authorization: Bearer $AMPERES_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{"model": "auto", "messages": [...]}'

Good fit for: AI-native startups and mid-market teams where the data-residency question is "US is fine."

In-VPC deploy · CloudFormation

~30-minute deploy · zero data egress to Amperes

Amperes runs as an ECS Fargate task inside your AWS account. The stack (infra/aws/cloudformation.yaml) provisions a VPC, RDS Postgres (encrypted at rest), a Bedrock VPC endpoint, a scoped IAM role, and an internal ALB.

Your inference traffic stays inside your VPC end to end — we never see a token. We can publish the proxy image to your private ECR or supply a tagged build on request.

    $ aws cloudformation deploy \
        --template-file infra/aws/cloudformation.yaml \
        --stack-name amperes-prod \
        --parameter-overrides \
          VpcId=vpc-0123abcd \
          SubnetIds=subnet-aaa,subnet-bbb \
          ContainerImage=<your-ecr-uri>/amperes-proxy:<tag> \
          BedrockRegion=us-east-1 \
        --capabilities CAPABILITY_NAMED_IAM

Your application talks to the internal ALB URL (for example https://amperes.internal.acme.com/v1). All other code is unchanged from the Hosted snippet above — same headers, same response shape.

Good fit for: healthcare (HIPAA), fintech, insurance, government, any F500 where a CISO has to sign off. Book a demo to get the template walked through.


Chat completions

POST /v1/chat/completions

The primary routing endpoint. Mirrors OpenAI's chat completion API; accepts every field OpenAI does. The proxy classifies, picks a model, calls upstream, and returns the response unchanged.

Request body

Same as OpenAI's spec, with two notable behaviors:

  • model accepts the literal string "auto" to let the router decide. Pass a specific model ID (from GET /v1/models) to force a model — the policy still applies governance + region filters.
  • Credential / mock / callback fields are stripped before reaching upstream.

Streaming

Set "stream": true for SSE. The proxy forwards each chunk as it arrives, preserves OpenAI's data: {...}\n\n framing, and emits data: [DONE] at the end. Time to first token (TTFT) and total latency are logged separately.
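
If you consume the stream without an SDK, the framing above can be decoded roughly as follows. This is a sketch under the assumption that the body has already been split into lines; iter_sse_chunks and the sample payloads are made up for illustration, and the official OpenAI SDKs do this for you.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from `data:` lines until `[DONE]`."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(payload)


# Hypothetical stream body, split into lines for illustration.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
    "",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in iter_sse_chunks(sample)
)
print(text)
```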

Tools

Tool calls work end-to-end. Models that don't support tools are filtered out of the candidate set automatically.


Embeddings

POST /v1/embeddings

Passthrough to your preferred embedding provider with governance enforcement. PII detection and region constraints apply. HIPAA-only customers receive 403 until a HIPAA-eligible embedding provider is configured.

Why governance on embeddings?

Embeddings move prompt data through the same network and providers as chat. Without governance, a HIPAA customer could embed patient notes through a model that is not HIPAA-eligible, which carries the same compliance risk as a chat completion. We close that gap.


Model registry

GET /v1/models

OpenAI-shape model list with our routing metadata attached. Each model entry:

    {
      "id": "claude-sonnet-4-6",
      "object": "model",
      "owned_by": "anthropic",
      "tier": "medium",
      "context_window": 200000,
      "supports_tools": true,
      "supports_streaming": true,
      "supports_structured_output": true,
      "cost_per_1m_input_usd": 3.00,
      "cost_per_1m_output_usd": 15.00,
      "regions": ["us", "eu", "global"],
      "hipaa_compliant": false,
      "soc2_compliant": true
    }

Response headers

Every chat completion (streaming or not) carries control-plane headers. Read them in your client to audit routing decisions:

Header                      Value type          Description
x-router-model              string              The model that served this request
x-router-tier               low/medium/high     Classified complexity
x-router-task-type          string              coding / extraction / qa / planning / etc.
x-router-cost-usd           float               What this request actually cost
x-router-baseline-cost-usd  float               What the same prompt on Opus would have cost
x-router-policy             string              Decision reasoning with score breakdown
x-router-escalated          true / absent       Cheap-first failed confidence check; escalated to strong model
x-router-pii-detected       true / absent       PII detected and handled per policy
x-router-pii-categories     comma-list          email, phone, ssn, credit_card, …
x-router-governance-action  redacted / blocked  What we did with PII
x-router-agent-step         string              planning / retrieval_synthesis / formatting / tool_use
x-router-drift-active       comma-list          Model ids currently flagged by the drift detector (sent only when there is one)
x-router-request-id         uuid                For correlation with our audit log
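
One way to audit these headers client-side is to fold them into a small record per request. The sketch below assumes the float headers arrive as decimal strings and that header names are looked up in lowercase; summarize_routing, the savings_usd field, and the sample values are hypothetical.

```python
from typing import Mapping


def summarize_routing(headers: Mapping[str, str]) -> dict:
    """Collect a few x-router-* headers into a plain dict for logging."""
    cost = float(headers.get("x-router-cost-usd", "0"))
    baseline = float(headers.get("x-router-baseline-cost-usd", "0"))
    categories = headers.get("x-router-pii-categories", "")
    return {
        "model": headers.get("x-router-model"),
        "tier": headers.get("x-router-tier"),
        # "true / absent" headers: anything other than "true" counts as False
        "escalated": headers.get("x-router-escalated") == "true",
        "pii_categories": [c.strip() for c in categories.split(",") if c.strip()],
        # derived on the client, not sent by the proxy
        "savings_usd": round(baseline - cost, 6),
        "request_id": headers.get("x-router-request-id"),
    }


# Hypothetical header values for illustration only.
sample = {
    "x-router-model": "claude-sonnet-4-6",
    "x-router-tier": "medium",
    "x-router-cost-usd": "0.0030",
    "x-router-baseline-cost-usd": "0.0150",
    "x-router-pii-categories": "email, phone",
}
summary = summarize_routing(sample)
print(summary["model"], summary["savings_usd"], summary["pii_categories"])
```

Storing x-router-request-id alongside your own trace id makes it easier to match a client-side record to the corresponding audit log entry.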

Errors

OpenAI-shape error responses. Same JSON envelope, same client retry logic.

    {
      "error": {
        "message": "Rate limit: 60 requests/minute for chat",
        "type": "rate_limit_error",
        "code": 429
      }
    }

Status  type                   When
400     invalid_request_error  Bad JSON, missing fields, malformed messages
401     authentication_error   Missing / invalid Bearer token
403     permission_error       Tenant isolation, PII block, HIPAA mismatch, region mismatch
413     request_too_large      Body exceeds 5 MB
429     rate_limit_error       Per-customer rate limit; Retry-After header included
502     upstream_error         All routing candidates failed; non-retryable provider error
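
If you write your own retry loop, the table suggests a policy along these lines. This is a sketch, not a prescribed client: retry_delay is a hypothetical helper, it assumes Retry-After carries a number of seconds, and it reads the 502 row as non-retryable.

```python
from typing import Mapping, Optional

RETRYABLE_STATUSES = {429}  # assumption: only rate limits are retried here


def retry_delay(
    status: int, headers: Mapping[str, str], default: float = 1.0
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no retry is advised."""
    if status not in RETRYABLE_STATUSES:
        return None  # 400/401/403/413/502 need a fix, not a retry
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return max(0.0, float(value))
    except ValueError:
        return default  # e.g. an HTTP-date value, which this sketch does not parse


print(retry_delay(429, {"Retry-After": "7"}))
print(retry_delay(403, {}))
```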

Policies

A policy is a tier-stratified allowlist of candidate models. The router picks the best candidate via multi-objective scoring.

Policy              Profile
balanced            Default. Cheapest viable per tier across providers.
aggressive          Maximum savings. Downshifts tiers when possible.
conservative        Quality-first. Upshifts tiers when in doubt.
anthropic_only      Anthropic direct + Bedrock-Claude.
openai_only         OpenAI direct.
bedrock_only        AWS Bedrock-Claude only. HIPAA-friendly.
local_only          On-prem models in your VPC. $0/token. HIPAA + EU + air-gap.
hybrid_local_cloud  Local for 80% of traffic; cloud Opus for the hardest 20%.

Task types

Detected from prompt content (regex + keyword scoring; embedding fallback). Drives task-affinity scoring in the policy.

qa · summarization · extraction · structured_output · coding · reasoning · planning · agentic · creative · tool_use · long_context · general

Confidence escalation

When eligible (tier > low, no tools, no structured output), the router calls a cheaper model first and inspects the response. Signals that trigger escalation:

  • Hedge markers ("I'm not sure", "perhaps")
  • Refusal markers ("I cannot", "I'm unable")
  • Truncation, repetition, or very short output
  • finish_reason of length or content_filter
  • Expected tool call missing
  • Response echoes the prompt

Below a configurable confidence threshold (default 0.55), we re-route to the strong model. Both calls are logged, and the dashboard tracks escalation rate so you can validate the trade is net-positive.
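
To make the mechanism concrete, here is a toy version of the check. Only the marker phrases and the 0.55 default come from the text above; the penalty weights, the "very short" cutoff, and the function names are invented for illustration, and the sketch skips the truncation, repetition, and missing-tool-call signals.

```python
HEDGE_MARKERS = ("i'm not sure", "perhaps")
REFUSAL_MARKERS = ("i cannot", "i'm unable")
THRESHOLD = 0.55  # default threshold from the text above


def confidence(text: str, finish_reason: str, prompt: str) -> float:
    """Start at 1.0 and subtract a made-up penalty for each signal present."""
    lowered = text.lower()
    score = 1.0
    if any(marker in lowered for marker in HEDGE_MARKERS):
        score -= 0.5
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        score -= 0.6
    if finish_reason in ("length", "content_filter"):
        score -= 0.5
    if len(text.split()) < 3:  # "very short output"; cutoff is assumed
        score -= 0.3
    if prompt.strip() and prompt.strip().lower() in lowered:  # echoes the prompt
        score -= 0.5
    return max(0.0, score)


def should_escalate(text: str, finish_reason: str, prompt: str) -> bool:
    """True when the cheap model's answer falls below the threshold."""
    return confidence(text, finish_reason, prompt) < THRESHOLD


prompt = "What is the capital of France?"
print(should_escalate("The capital of France is Paris.", "stop", prompt))
print(should_escalate("I'm not sure, perhaps Lyon?", "stop", prompt))
```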

Provider failover

We track rolling-window error rate and p99 latency per provider. As a provider degrades, its score drops and traffic shifts to healthier candidates; when it goes "down", its models are excluded until health recovers.
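
A rough sketch of the error-rate half of this, assuming a fixed-size window of recent calls. The ProviderHealth class, the window size, the 0.5 "down" threshold, and the linear score multiplier are all hypothetical, and p99 latency tracking is left out.

```python
from collections import deque


class ProviderHealth:
    """Rolling window of recent call outcomes for one provider."""

    def __init__(self, window: int = 100, down_error_rate: float = 0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.down_error_rate = down_error_rate  # hypothetical threshold

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)  # oldest outcome falls off automatically

    @property
    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    @property
    def is_down(self) -> bool:
        return self.error_rate >= self.down_error_rate

    def score_multiplier(self) -> float:
        """Scale a candidate's score down as errors rise; 0.0 excludes it."""
        return 0.0 if self.is_down else 1.0 - self.error_rate


health = ProviderHealth(window=4)
for ok in (True, True, True, False):
    health.record(ok)
print(health.error_rate, health.score_multiplier())
```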


PII detection

Regex-based detection across 11 categories. Luhn-validated credit card recognition. Actions configurable per customer:

action  behavior
redact  Replace matched spans with [REDACTED_<CATEGORY>] before sending upstream. Default.
block   403 the request entirely. Use for HIPAA / PCI workloads where leaks are unrecoverable.
allow   Log the detection but pass through. Audit trail only.
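
The redact action can be pictured as below for two of the categories. The regular expressions are simplified stand-ins rather than the production patterns, and the uppercase placeholder spelling is an assumption about how <CATEGORY> is rendered.

```python
import re

# Simplified patterns for illustration; real detectors are stricter.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact(text: str) -> str:
    """Replace emails and Luhn-valid card numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub(
        lambda m: "[REDACTED_CREDIT_CARD]" if luhn_valid(m.group()) else m.group(),
        text,
    )


print(redact("Card 4111 1111 1111 1111, mail a@b.co"))
```

The Luhn step is what keeps arbitrary 16-digit identifiers, such as order numbers, from being redacted as card numbers.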

Region routing

Models carry region tags (us, eu, apac, global). Customers configure allowed_regions. Models whose region list doesn't intersect are filtered from candidates. Empty constraint = no restriction.
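
The filter amounts to a set intersection. In this sketch the model ids are placeholders, filter_by_region is a hypothetical name, and "global" is treated as an ordinary tag, which may differ from how the router interprets it.

```python
from typing import Iterable, List


def filter_by_region(models: List[dict], allowed_regions: Iterable[str]) -> List[dict]:
    """Keep models whose region tags intersect the customer's allowed_regions."""
    allowed = set(allowed_regions)
    if not allowed:
        return list(models)  # empty constraint = no restriction
    return [m for m in models if allowed & set(m.get("regions", []))]


# Placeholder registry entries for illustration.
models = [
    {"id": "model-a", "regions": ["us", "eu"]},
    {"id": "model-b", "regions": ["us"]},
    {"id": "model-c", "regions": ["apac"]},
]
print([m["id"] for m in filter_by_region(models, ["eu", "apac"])])
```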

HIPAA mode

Set require_hipaa_models=true on the customer config and the router only selects models flagged hipaa_compliant: true — today the AWS Bedrock-hosted family (Claude Haiku / Sonnet / Opus, plus Amazon Nova Lite / Micro). The filter runs before scoring, so a HIPAA-on customer cannot route to a non-compliant model regardless of policy.
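
The ordering matters more than the filter itself: because non-compliant models are removed before scoring, no score can bring them back. A sketch with placeholder models and a stand-in scoring function (cheapest input price wins); select_model is a hypothetical name.

```python
from typing import Callable, List, Optional


def select_model(
    models: List[dict],
    require_hipaa_models: bool,
    score: Callable[[dict], float],
) -> Optional[dict]:
    """Apply the HIPAA filter first, then pick the highest-scoring survivor."""
    if require_hipaa_models:
        candidates = [m for m in models if m.get("hipaa_compliant")]
    else:
        candidates = list(models)
    if not candidates:
        return None  # caller decides how to fail; not specified here
    return max(candidates, key=score)


# Placeholder entries; prices are invented for the example.
models = [
    {"id": "cheap-model", "hipaa_compliant": False, "cost_per_1m_input_usd": 0.10},
    {"id": "bedrock-model", "hipaa_compliant": True, "cost_per_1m_input_usd": 3.00},
]


def cheapest(m: dict) -> float:
    return -m["cost_per_1m_input_usd"]


print(select_model(models, False, cheapest)["id"])
print(select_model(models, True, cheapest)["id"])
```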

Audit log

Append-only record of every routing decision, escalation, fallback, PII redaction, and policy block. Queryable via GET /admin/audit?days=30. Exportable to S3 via POST /admin/export/s3.


Rate limits

Per-customer sliding-window. Default 60 req/min for chat, separate bucket for embeddings. Configurable on the CustomerConfig. 429 includes a Retry-After header.
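
For intuition, a sliding-window limiter can be as small as the sketch below. It is an in-memory illustration only; SlidingWindowLimiter is a hypothetical name, and a per-customer setup would keep one instance per customer and bucket.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any trailing `window_s` seconds."""

    def __init__(self, limit: int = 60, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = deque()  # timestamps of accepted requests, oldest first

    def check(self, now: Optional[float] = None) -> float:
        """Return 0.0 and record the hit if allowed, else seconds to wait."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return 0.0
        # Time until the oldest hit expires; a natural Retry-After value.
        return self.window_s - (now - self.hits[0])


limiter = SlidingWindowLimiter(limit=2, window_s=60.0)
print(limiter.check(now=0.0), limiter.check(now=1.0), limiter.check(now=2.0))
```

A non-zero return value maps directly onto the Retry-After header sent with a 429.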

In In-VPC mode the rate limit is yours to tune — you control the ECS task scaling.