ADR-008: Use Azure AI Foundry (serverless) for LLM inference
Context
Aucert's 5-layer AI testing pipeline requires LLM inference for test generation (L1), analysis (L3), decision-making (L4), and reporting (L5). We need a platform that:
- Is credit-eligible under our Azure Founders Hub credits (~$5K)
- Has zero or near-zero idle cost at our current low volume
- Requires no GPU expertise (no vLLM tuning, CUDA drivers, or OOM debugging)
- Supports our cloud-agnostic architecture (swappable via environment variable)
- Provides OpenAI-compatible API format for broad library support
We evaluated four options:
| Option | Credit-eligible | Idle cost | GPU ops needed | Cloud-agnostic |
|---|---|---|---|---|
| Azure AI Foundry (Global Standard) | Yes ("Direct from Azure") | Zero | No | Yes (OpenAI-compatible) |
| Self-hosted on AKS GPU nodes | N/A (compute cost) | $2-8/hr | Yes | Yes (but heavy ops) |
| AWS Bedrock | Yes (via Activate) | Zero | No | Yes |
| Direct API (Anthropic, OpenAI) | No | Zero | No | Yes |
Decision
Use Azure AI Foundry with serverless Global Standard deployments for all product LLM inference. Models are deployed as API endpoint configurations within an azurerm_cognitive_account resource, managed by Terraform in the foundation tier (infra/terraform/foundation/foundry.tf).
Key properties:
- Resource: aucert-ai in aucert-foundation-rg (westus)
- Deployment type: Global Standard (serverless, pay-per-token, zero idle cost)
- Auth: Managed Identity from AKS (primary), API key from Key Vault (local dev)
- API format: OpenAI-compatible — any library that works with OpenAI works with Foundry
- Model management: Terraform azurerm_cognitive_deployment blocks (version-controlled, reviewable)
The Foundry resource is a billing/configuration wrapper. Microsoft hosts the model weights on their GPU infrastructure. No GPU nodes in our VNet, no model weights on our disks.
Alternatives considered
| Option | Pros | Cons |
|---|---|---|
| Azure AI Foundry (chosen) | Zero idle cost, credit-eligible, no GPU ops, instant model switching via Terraform, OpenAI-compatible API | Cannot use Anthropic Claude models with credits (marketplace items). Dependent on Microsoft availability/rate limits. No data residency without private endpoints. |
| Self-hosted on AKS | Full control, lowest per-token cost at high volume, data stays in VNet | GPU node pool $2-8/hr idle cost, vLLM/TGI ops burden, model weight management. Overkill at our volume. |
| AWS Bedrock | Zero idle cost, broad model selection | Requires AWS infrastructure. Activate credits not yet approved. Cross-cloud complexity. |
| Direct API | Simplest integration, widest model selection | No credit coverage. Higher cost. No unified billing. |
Consequences
What becomes easier
- Adding new models: one Terraform block + terraform apply, with no GPU provisioning
- Cost management: all models are pay-per-token with zero idle cost; $1K of credits lasts 10-22 months at the projected $45-105/month burn rate
- Local development: same API format as OpenAI — developers can use familiar libraries
- Model switching: update ConfigMap deployment name, pods pick up on next request
What becomes harder
- Using Anthropic Claude models: marketplace items are NOT credit-eligible — separate billing required
- Debugging inference issues: model internals are opaque (Microsoft-managed)
- Latency optimization: no control over GPU placement or model serving configuration
Risks
- Microsoft may deprecate or change pricing for specific models (mitigated by multi-model strategy)
- Rate limits may throttle during usage spikes (mitigated by TPM quota config, can increase)
- Founders Hub credits may expire before revenue (mitigated by low burn rate: $45-105/month)
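The 10-22 month runway quoted under Cost management follows from dividing the credit figure by the burn-rate range. A minimal sketch of that arithmetic, assuming the $1K and $45-105/month figures from this ADR and a constant monthly burn (the function name is illustrative):

```python
def runway_months(credits_usd: float, monthly_burn_usd: float) -> float:
    """Months of inference the credits cover at a constant monthly burn."""
    if monthly_burn_usd <= 0:
        raise ValueError("monthly burn must be positive")
    return credits_usd / monthly_burn_usd


CREDITS_USD = 1_000            # credit figure from the Consequences section
LOW_BURN, HIGH_BURN = 45, 105  # USD/month, from the Risks section

worst_case = runway_months(CREDITS_USD, HIGH_BURN)  # highest burn, shortest runway
best_case = runway_months(CREDITS_USD, LOW_BURN)    # lowest burn, longest runway
print(f"runway: {worst_case:.1f} to {best_case:.1f} months")
```

At the high end of the burn range the runway is a little under 10 months, so the quoted lower bound is a rounded figure.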
Amendment — 2026-04-20: region strategy, rollout staggering, dual-account pattern
What we learned
While deploying three approved gated models (gpt-5.4, gpt-5.4-pro, gpt-5.3-codex) to our existing aucert-ai account in westus, two deployments failed:
- gpt-5.3-codex: wrong model version (caller used 2025-12-20; actual catalog is 2026-02-24 in westus)
- gpt-5.4-pro: not available in westus at all; only in eastus2 and southcentralus as of 2026-04-20
Root-cause analysis surfaced three structural gaps in the original decision:
- Region was inherited, not evaluated. The Foundry account was placed in westus because AKS/VNet/KV/PG were already there. The original decision did not treat region as an independent dimension; the "West US" line in §Decision was a statement, not a justification. (Additional context: at creation time, our Azure subscription did not have eastus access approved; westus was the only region available to us, which reinforces that the choice was constraint-driven.)
- "GlobalStandard" created a false inference. The SKU name suggests region doesn't matter. It does: GlobalStandard governs how inference requests route (globally across Microsoft's GPU pool), but the deployment registration still requires the model to be rolled out to the control-plane region. Two models can both be GlobalStandard and have different regional availability.
- Availability was checked Day-1 only. The four Day-1 models (GPT-5.1-Codex, Kimi K2.5, DeepSeek V3.2, GPT-4o-mini) all happened to be in westus at deploy time. No forward-looking check was run against the "Pending gated models" list we already knew we'd want. This class of silent failure was reintroduced when the gated models were finally added.
Updated decision
Region strategy:
- Primary Foundry account (aucert-ai) remains in westus, co-located with AKS and the rest of foundation. Hosts the majority of product inference models.
- Secondary Foundry account (aucert-ai-east) in eastus2, provisioned specifically to host models that Microsoft has not yet rolled out to westus. As of 2026-04-20, this is only gpt-5.4-pro. More models may land here if regional rollout lag recurs.
- Both accounts are in the same resource group (aucert-foundation-rg). The Key Vault (in westus) stores credentials for both (foundry-api-key + foundry-endpoint for primary; foundry-east-api-key + foundry-east-endpoint for secondary).
- AKS kubelet managed identity has Cognitive Services User on both accounts.
- Because SKU is GlobalStandard, inference latency from westus AKS to an eastus2 endpoint is expected to be equivalent to latency to the westus endpoint: Microsoft routes to the nearest GPU regardless of the account's control-plane region.
Why two accounts, not migration: migrating aucert-ai to eastus2 would require destroying and recreating the resource (breaking all existing deployments, AKS RBAC, Key Vault secrets, backend endpoint config) for the benefit of one currently-unavailable model. The dual-account pattern isolates the risk and is reversible — once gpt-5.4-pro ships to westus, we can move the deployment back and retire the east account.
Backend routing: ConfigMap llm-config now exposes FOUNDRY_ENDPOINT (primary) plus per-model overrides like FOUNDRY_ENDPOINT_REASONING_PRO (east endpoint). The backend adapter must check for a per-model override first and fall back to the primary endpoint otherwise. TODO: wire per-model endpoint selection into FoundryOpenAIAdapter.
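The override-then-fallback lookup described above can be sketched as follows. This is a minimal illustration, not the FoundryOpenAIAdapter implementation: the function names and the key normalization (uppercase, non-alphanumerics collapsed to underscores) are assumptions; only the variable names FOUNDRY_ENDPOINT and FOUNDRY_ENDPOINT_REASONING_PRO come from this ADR.

```python
import re
from typing import Mapping

PRIMARY_VAR = "FOUNDRY_ENDPOINT"


def override_var(model_key: str) -> str:
    """Build the per-model variable name, e.g. 'reasoning-pro' -> FOUNDRY_ENDPOINT_REASONING_PRO.

    The normalization rule is an assumption made for this sketch.
    """
    suffix = re.sub(r"[^A-Za-z0-9]+", "_", model_key).strip("_").upper()
    return f"{PRIMARY_VAR}_{suffix}"


def resolve_endpoint(model_key: str, env: Mapping[str, str]) -> str:
    """Return the per-model override if set and non-empty, else the primary endpoint."""
    override = env.get(override_var(model_key), "").strip()
    if override:
        return override
    primary = env.get(PRIMARY_VAR, "").strip()
    if not primary:
        raise KeyError(f"{PRIMARY_VAR} is not set")
    return primary


# Placeholder URLs for illustration only; real values come from the llm-config ConfigMap.
example_env = {
    "FOUNDRY_ENDPOINT": "https://primary.example.invalid",
    "FOUNDRY_ENDPOINT_REASONING_PRO": "https://east.example.invalid",
}
print(resolve_endpoint("reasoning-pro", example_env))  # east override
print(resolve_endpoint("generation", example_env))     # falls back to primary
```

In a running pod the same function would be called with os.environ. Treating an empty override as unset keeps a blank ConfigMap entry from silently routing requests to an empty URL.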
Pre-deploy verification is now mandatory:
A script at infra/terraform/foundation/scripts/verify-foundry-models.sh parses every azurerm_cognitive_deployment block across foundry*.tf, extracts (account, model.name, model.version, sku.name), and verifies each tuple is available in the account's region via az cognitiveservices model list. It must be run before terraform plan whenever foundry*.tf is modified. Intent is to gate this in CI next.
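The core comparison the script performs can be illustrated in Python. This is a sketch of the check, not the contents of verify-foundry-models.sh: it assumes the catalog for each region has already been fetched as a list of entries shaped like {"model": {"name", "version", "skus": [{"name"}]}}, which is an assumption about the az cognitiveservices model list JSON output, and all type and function names are hypothetical.

```python
from typing import Iterable, Mapping, NamedTuple


class Deployment(NamedTuple):
    region: str   # region of the Foundry account that owns the deployment
    model: str    # model.name from the Terraform block
    version: str  # model.version
    sku: str      # sku.name


def catalog_index(entries: Iterable[dict]) -> set:
    """Flatten catalog entries into a set of (model, version, sku) tuples."""
    index = set()
    for entry in entries:
        model = entry.get("model", {})
        for sku in model.get("skus", []):
            index.add((model.get("name"), model.get("version"), sku.get("name")))
    return index


def unavailable(deployments: Iterable[Deployment],
                catalogs: Mapping[str, Iterable[dict]]) -> list:
    """Return deployments whose (model, version, sku) is absent from their region's catalog."""
    indexes = {region: catalog_index(entries) for region, entries in catalogs.items()}
    return [d for d in deployments
            if (d.model, d.version, d.sku) not in indexes.get(d.region, set())]


# Sample mirroring the gpt-5.3-codex version mismatch described in this amendment.
westus_catalog = [
    {"model": {"name": "gpt-5.3-codex", "version": "2026-02-24",
               "skus": [{"name": "GlobalStandard"}]}},
]
planned = [
    Deployment("westus", "gpt-5.3-codex", "2025-12-20", "GlobalStandard"),  # stale version
    Deployment("westus", "gpt-5.3-codex", "2026-02-24", "GlobalStandard"),  # matches catalog
]
for d in unavailable(planned, {"westus": westus_catalog}):
    print(f"NOT AVAILABLE in {d.region}: {d.model} {d.version} ({d.sku})")
```

A region with no fetched catalog is treated as having nothing available, so a missing lookup fails closed rather than passing silently.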
What this amendment changes
| Aspect | Before | After |
|---|---|---|
| Number of Foundry accounts | 1 (aucert-ai in westus) | 2 (aucert-ai in westus + aucert-ai-east in eastus2) |
| Decision dimension "region" | Inherited from AKS | Explicitly evaluated per deployment; secondary account allowed when westus rollout lags |
| Pre-deploy check | None | scripts/verify-foundry-models.sh mandatory before terraform plan on foundry*.tf |
| Backend endpoint config | Single FOUNDRY_ENDPOINT | Primary + per-model overrides (FOUNDRY_ENDPOINT_<KEY>) |
| gpt-5.4-pro home | (undefined — failed to deploy in westus) | aucert-ai-east in eastus2 |
Revisit triggers
Revisit this amendment if any of the following becomes true:
- Microsoft rolls out gpt-5.4-pro (or whatever the east account's sole occupant is) to westus: collapse back to a single account.
- A third model is available in neither westus nor eastus2: re-evaluate whether a third account, or migration of one existing account, is cheaper.
- Cross-region latency measurements reveal GlobalStandard routing is NOT region-agnostic in practice: this would invalidate the "two accounts, negligible latency delta" assumption.
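One way to evaluate the latency trigger is to compare median request latency across the two endpoints. The sketch below assumes latency samples (in milliseconds) have already been collected from the westus AKS cluster against each account; the function names and the tolerance parameter are hypothetical, and this ADR does not define what counts as a negligible delta.

```python
from statistics import median
from typing import Sequence


def latency_delta_ms(primary_ms: Sequence[float], secondary_ms: Sequence[float]) -> float:
    """Median latency of the secondary endpoint minus that of the primary."""
    if not primary_ms or not secondary_ms:
        raise ValueError("need at least one sample per endpoint")
    return median(secondary_ms) - median(primary_ms)


def routing_looks_region_agnostic(primary_ms: Sequence[float],
                                  secondary_ms: Sequence[float],
                                  tolerance_ms: float) -> bool:
    """True when the median delta stays within the chosen tolerance (a team-defined value)."""
    return abs(latency_delta_ms(primary_ms, secondary_ms)) <= tolerance_ms
```

Medians are used instead of means so that a handful of slow outlier requests does not trip the trigger on its own.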
Amendment — 2026-05-03: deployment-name-equals-model-name + Kimi K2.6 swap
What changed
- All Foundry deployment name values are now identical (case + dots + hyphens) to the underlying model name. This removes the dash-vs-dot divergence, a footgun documented in the original "API compatibility note" of the foundry usage guide.
- Kimi K2.5 retired; Kimi K2.6 (Moonshot version 2026-04-20) deployed as the L1 Generation primary.
- gpt-4o-mini deployment removed. L5 Reporting now shares the Kimi-K2.6 deployment as a placeholder: the cost delta vs. a dedicated mini deployment is rounding error at MVP volume, and L1+L5 sharing one deployment is acceptable until quota contention is observed.
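The naming rule is easy to enforce mechanically. A minimal sketch, assuming the deployment-to-model mapping has already been extracted from the Terraform blocks into a dict (the function name and input shape are hypothetical):

```python
def name_mismatches(deployments: dict) -> list:
    """Return (deployment_name, model_name) pairs that are not character-for-character identical."""
    return [(dep, model) for dep, model in deployments.items() if dep != model]


# Keys are deployment names, values are model names.
# The first entry uses the old dash-only style and should be flagged.
example = {
    "gpt-5-1-codex": "gpt-5.1-codex",
    "Kimi-K2.6": "Kimi-K2.6",
}
for dep, model in name_mismatches(example):
    print(f"deployment '{dep}' does not match model '{model}'")
```

The comparison is deliberately case-sensitive, since the convention covers case as well as dots and hyphens.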
Why
- opencode + portal triage: when the deployment name and model name differ in case or punctuation, both opencode integration and the Foundry portal triage path become harder to navigate. The dash-vs-dot inconsistency was a long-standing footgun that contradicted the "no surprises" goal.
- K2.6 capability: confirmed available in westus on 2026-05-03 via az cognitiveservices model list -l westus; supersedes K2.5 in raw capability.
- gpt-4o-mini retirement window: Microsoft retires gpt-4o-mini on 2026-09-30. Removing it now and using Kimi-K2.6 for L5 Reporting buys runway and simplifies the Foundry inventory by one deployment.
What this amendment changes
| Aspect | Before | After |
|---|---|---|
| Deployment name convention | Kebab-style, dashes only (kimi-k2-5, gpt-5-1-codex) | Match model name verbatim (Kimi-K2.6, gpt-5.1-codex) |
| L1 Generation model | Kimi K2.5 | Kimi K2.6 |
| L5 Reporting model | gpt-4o-mini (dedicated deployment) | Kimi-K2.6 (shared with L1) |
| Foundry deployments in westus | 6 (incl. gpt-4o-mini, kimi-k2-5) | 5 (incl. Kimi-K2.6, no gpt-4o-mini) |
Revisit triggers
- L1 + L5 contention causes 429s on the shared Kimi-K2.6 deployment → split L5 onto a dedicated deployment (consider gpt-5.4-mini or gpt-5-nano).
- Kimi K2.6 retired or replaced by K2.7+ → swap and update both env-var pointers in lockstep.