ADR-008: Use Azure AI Foundry (serverless) for LLM inference

Context

Aucert's 5-layer AI testing pipeline requires LLM inference for test generation (L1), analysis (L3), decision-making (L4), and reporting (L5). We need a platform that:

Is credit-eligible under our Azure Founders Hub credits (~$5K)
Has zero or near-zero idle cost at our current low volume
Requires no GPU expertise (no vLLM tuning, CUDA drivers, or OOM debugging)
Supports our cloud-agnostic architecture (swappable via environment variable)
Provides OpenAI-compatible API format for broad library support

We evaluated four options:

Option	Credit-eligible	Idle cost	GPU ops needed	Cloud-agnostic
Azure AI Foundry (Global Standard)	Yes ("Direct from Azure")	Zero	No	Yes (OpenAI-compatible)
Self-hosted on AKS GPU nodes	N/A (compute cost)	$2-8/hr	Yes	Yes (but heavy ops)
AWS Bedrock	Yes (via Activate)	Zero	No	Yes
Direct API (Anthropic, OpenAI)	No	Zero	No	Yes

Decision

Use Azure AI Foundry with serverless Global Standard deployments for all product LLM inference. Models are deployed as API endpoint configurations within an azurerm_cognitive_account resource, managed by Terraform in the foundation tier (infra/terraform/foundation/foundry.tf).

Key properties:

Resource: aucert-ai in aucert-foundation-rg (westus)
Deployment type: Global Standard (serverless, pay-per-token, zero idle cost)
Auth: Managed Identity from AKS (primary), API key from Key Vault (local dev)
API format: OpenAI-compatible — any library that works with OpenAI works with Foundry
Model management: Terraform azurerm_cognitive_deployment blocks — version-controlled, reviewable

The Foundry resource is a billing/configuration wrapper. Microsoft hosts the model weights on their GPU infrastructure. No GPU nodes in our VNet, no model weights on our disks.

Alternatives considered

Option	Pros	Cons
Azure AI Foundry (chosen)	Zero idle cost, credit-eligible, no GPU ops, instant model switching via Terraform, OpenAI-compatible API	Cannot use Anthropic Claude models with credits (marketplace items). Dependent on Microsoft availability/rate limits. No data residency without private endpoints.
Self-hosted on AKS	Full control, lowest per-token cost at high volume, data stays in VNet	GPU node pool $2-8/hr idle cost, vLLM/TGI ops burden, model weight management. Overkill at our volume.
AWS Bedrock	Zero idle cost, broad model selection	Requires AWS infrastructure. Activate credits not yet approved. Cross-cloud complexity.
Direct API	Simplest integration, widest model selection	No credit coverage. Higher cost. No unified billing.

Consequences

What becomes easier

Adding new models: one Terraform block + terraform apply — no GPU provisioning
Cost management: all models are pay-per-token with zero idle cost; at the projected burn of $45-105/month, $1K of credits lasts roughly 10-22 months
Local development: same API format as OpenAI — developers can use familiar libraries
Model switching: update ConfigMap deployment name, pods pick up on next request

What becomes harder

Using Anthropic Claude models: marketplace items are NOT credit-eligible — separate billing required
Debugging inference issues: model internals are opaque (Microsoft-managed)
Latency optimization: no control over GPU placement or model serving configuration

Risks

Microsoft may deprecate or change pricing for specific models (mitigated by multi-model strategy)
Rate limits may throttle requests during usage spikes (mitigated by TPM quota configuration, which can be increased)
Founders Hub credits may expire before revenue (mitigated by low burn rate: $45-105/month)
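
The runway figure quoted under "What becomes easier" follows from the burn rate above by simple division. A minimal sketch of that arithmetic, assuming a constant monthly burn (the function and constant names are illustrative, not part of the codebase):

```python
def runway_months(credits_usd: float, monthly_burn_usd: float) -> float:
    """Months until credits run out at a constant monthly burn."""
    if monthly_burn_usd <= 0:
        raise ValueError("monthly burn must be positive")
    return credits_usd / monthly_burn_usd


CREDITS_USD = 1_000            # credit figure quoted under "What becomes easier"
BURN_LOW, BURN_HIGH = 45, 105  # monthly burn range quoted under "Risks"

best_case = runway_months(CREDITS_USD, BURN_LOW)    # lowest burn, longest runway
worst_case = runway_months(CREDITS_USD, BURN_HIGH)  # highest burn, shortest runway
print(f"runway: {worst_case:.1f} to {best_case:.1f} months")
```

Rounded to whole months, the two ends of the range come out at 10 and 22.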

Amendment — 2026-04-20: region strategy, rollout staggering, dual-account pattern

What we learned

While deploying three approved gated models (gpt-5.4, gpt-5.4-pro, gpt-5.3-codex) to our existing aucert-ai account in westus, two deployments failed:

gpt-5.3-codex: wrong model version (caller used 2025-12-20; actual catalog is 2026-02-24 in westus)
gpt-5.4-pro: not available in westus at all — only in eastus2 and southcentralus as of 2026-04-20

Root-cause analysis surfaced three structural gaps in the original decision:

Region was inherited, not evaluated. The Foundry account was placed in westus because AKS/VNet/KV/PG were already there. The original decision did not treat region as an independent dimension; the "westus" entry in §Decision was a statement, not a justification. (Additional context: at creation time, our Azure subscription did not have eastus access approved, so westus was the only region available to us, which reinforces that the choice was constraint-driven.)
"GlobalStandard" created a false inference. The SKU name suggests region doesn't matter. It does — GlobalStandard governs how inference requests route (globally across Microsoft's GPU pool), but the deployment registration still requires the model to be rolled out to the control-plane region. Two models can both be GlobalStandard and have different regional availability.
Availability was checked Day-1 only. The four Day-1 models (GPT-5.1-Codex, Kimi K2.5, DeepSeek V3.2, GPT-4o-mini) all happened to be in westus at deploy time. No forward-looking check was run against the "Pending gated models" list we already knew we'd want. The gap stayed silent until the gated models were finally added, at which point it surfaced as failed deployments.

Updated decision

Region strategy:

Primary Foundry account (aucert-ai) remains in westus — co-located with AKS and the rest of foundation. Hosts the majority of product inference models.
Secondary Foundry account (aucert-ai-east) in eastus2 — provisioned specifically to host models that Microsoft has not yet rolled out to westus. As of 2026-04-20, this is only gpt-5.4-pro. More models may land here if regional rollout lag recurs.
Both accounts are in the same resource group (aucert-foundation-rg). The Key Vault (in westus) stores credentials for both (foundry-api-key + foundry-endpoint for primary; foundry-east-api-key + foundry-east-endpoint for secondary).
AKS kubelet managed identity has Cognitive Services User on both accounts.
Because the SKU is GlobalStandard, we assume inference latency from westus AKS to an eastus2 endpoint is equivalent to latency to the westus endpoint: Microsoft routes to the nearest GPU regardless of the account's control-plane region. See the revisit triggers for what happens if measurements contradict this.

Why two accounts, not migration: migrating aucert-ai to eastus2 would require destroying and recreating the resource (breaking all existing deployments, AKS RBAC, Key Vault secrets, backend endpoint config) for the benefit of one currently-unavailable model. The dual-account pattern isolates the risk and is reversible — once gpt-5.4-pro ships to westus, we can move the deployment back and retire the east account.

Backend routing: ConfigMap llm-config now exposes FOUNDRY_ENDPOINT (primary) plus per-model overrides like FOUNDRY_ENDPOINT_REASONING_PRO (east endpoint). The backend adapter must check for a per-model override first and fall back to the primary endpoint otherwise. TODO: wire per-model endpoint selection into FoundryOpenAIAdapter.
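
The lookup order described above (per-model override first, primary endpoint as fallback) can be sketched as follows. This is an illustration under assumptions, not the FoundryOpenAIAdapter implementation: the `resolve_endpoint` name, the upper-casing of the model key, and the example URLs are all hypothetical.

```python
import os
from typing import Mapping, Optional

PRIMARY_VAR = "FOUNDRY_ENDPOINT"


def resolve_endpoint(model_key: Optional[str],
                     env: Mapping[str, str] = os.environ) -> str:
    """Return the per-model override if one is set, else the primary endpoint."""
    if model_key:
        # Overrides follow the FOUNDRY_ENDPOINT_<KEY> pattern from the ConfigMap.
        override = env.get(f"{PRIMARY_VAR}_{model_key.upper()}")
        if override:
            return override
    try:
        return env[PRIMARY_VAR]
    except KeyError:
        raise RuntimeError(f"{PRIMARY_VAR} is not set") from None


# Placeholder URLs; the real values come from the llm-config ConfigMap.
demo_env = {
    "FOUNDRY_ENDPOINT": "https://primary.example.invalid",
    "FOUNDRY_ENDPOINT_REASONING_PRO": "https://east.example.invalid",
}
print(resolve_endpoint("reasoning_pro", demo_env))  # override wins
print(resolve_endpoint("generation", demo_env))     # falls back to primary
```

Treating an empty override value as "not set" keeps a blank ConfigMap entry from routing traffic to an empty URL.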

Pre-deploy verification is now mandatory:

A script at infra/terraform/foundation/scripts/verify-foundry-models.sh parses every azurerm_cognitive_deployment block across foundry*.tf, extracts (account, model.name, model.version, sku.name), and verifies each tuple is available in the account's region via az cognitiveservices model list. It must be run before terraform plan whenever foundry*.tf is modified. The intent is to enforce this as a CI gate next.
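
The core of that check is a set-membership test per deployment tuple. The sketch below shows the logic in Python rather than shell, and is not the script itself. It assumes each catalog entry carries a `model` object with `name`, `version`, and a `skus` list, which is how the check is pictured here and should be confirmed against real `az cognitiveservices model list` output. The `placeholder-version` string stands in for a version the ADR does not state.

```python
from typing import Iterable, NamedTuple


class Deployment(NamedTuple):
    account: str
    model_name: str
    model_version: str
    sku_name: str


def catalog_index(entries: Iterable[dict]) -> set:
    """Flatten catalog entries into (model name, version, SKU name) tuples."""
    index = set()
    for entry in entries:
        model = entry.get("model") or {}
        for sku in model.get("skus") or []:
            index.add((model.get("name"), model.get("version"), sku.get("name")))
    return index


def find_unavailable(deployments, account_regions, catalogs):
    """Return deployments whose tuple is missing from their account's region."""
    indexes = {region: catalog_index(e) for region, e in catalogs.items()}
    missing = []
    for dep in deployments:
        region = account_regions[dep.account]
        key = (dep.model_name, dep.model_version, dep.sku_name)
        if key not in indexes.get(region, set()):
            missing.append(dep)
    return missing


# Illustrative data modelled on the two failures described in this amendment.
ACCOUNT_REGIONS = {"aucert-ai": "westus", "aucert-ai-east": "eastus2"}
CATALOGS = {
    "westus": [{"model": {"name": "gpt-5.3-codex", "version": "2026-02-24",
                          "skus": [{"name": "GlobalStandard"}]}}],
    "eastus2": [{"model": {"name": "gpt-5.4-pro", "version": "placeholder-version",
                           "skus": [{"name": "GlobalStandard"}]}}],
}
DEPLOYMENTS = [
    # Wrong version for westus: caught.
    Deployment("aucert-ai", "gpt-5.3-codex", "2025-12-20", "GlobalStandard"),
    # Model not rolled out to westus: caught.
    Deployment("aucert-ai", "gpt-5.4-pro", "placeholder-version", "GlobalStandard"),
    # Same model on the east account: passes.
    Deployment("aucert-ai-east", "gpt-5.4-pro", "placeholder-version", "GlobalStandard"),
]

for dep in find_unavailable(DEPLOYMENTS, ACCOUNT_REGIONS, CATALOGS):
    print(f"NOT AVAILABLE in {ACCOUNT_REGIONS[dep.account]}: {dep}")
```

Keying the check on the full tuple rather than the model name alone is what catches the version mismatch as well as the missing-region case.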

What this amendment changes

Aspect	Before	After
Number of Foundry accounts	1 (`aucert-ai` in westus)	2 (`aucert-ai` in westus + `aucert-ai-east` in eastus2)
Decision dimension "region"	Inherited from AKS	Explicitly evaluated per deployment; secondary account allowed when westus rollout lags
Pre-deploy check	None	`scripts/verify-foundry-models.sh` mandatory before `terraform plan` on `foundry*.tf`
Backend endpoint config	Single `FOUNDRY_ENDPOINT`	Primary + per-model overrides (`FOUNDRY_ENDPOINT_<KEY>`)
gpt-5.4-pro home	(undefined — failed to deploy in westus)	`aucert-ai-east` in eastus2

Revisit triggers

Revisit this amendment if any of the following becomes true:

Microsoft rolls out gpt-5.4-pro (or whatever the east account's sole occupant is) to westus — then collapse back to a single account.
A third model requires a region neither westus nor eastus2 supports — re-evaluate whether a third account, or migration of one existing account, is cheaper.
Cross-region latency measurements reveal GlobalStandard routing is NOT region-agnostic in practice — would invalidate the "two accounts, negligible latency delta" assumption.

Amendment — 2026-05-03: deployment-name-equals-model-name + Kimi K2.6 swap

What changed

All Foundry deployment name values are now identical (case, dots, and hyphens) to the underlying model name. This removes the dash-vs-dot divergence between the two names, the footgun documented in the original "API compatibility note" of the foundry usage guide.
Kimi K2.5 retired; Kimi K2.6 (Moonshot version 2026-04-20) deployed as the L1 Generation primary.
gpt-4o-mini deployment removed. L5 Reporting now shares the Kimi-K2.6 deployment as a placeholder — the cost delta vs. a dedicated mini deployment is rounding error at MVP volume, and L1+L5 sharing one deployment is acceptable until quota contention is observed.
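
The naming rule is an exact string comparison, so it is easy to check mechanically. A minimal sketch, assuming the deployment-to-model mapping has already been extracted from the Terraform files (the function name and the mapping shape are hypothetical):

```python
def mismatched_deployments(deployments: dict) -> list:
    """Return deployment names that differ from their model name in any character."""
    return sorted(name for name, model in deployments.items() if name != model)


# Keys are deployment names, values are the underlying model names.
example = {
    "Kimi-K2.6": "Kimi-K2.6",          # matches verbatim
    "gpt-5-1-codex": "gpt-5.1-codex",  # old dash-for-dot style: flagged
}
print(mismatched_deployments(example))
```

The comparison is deliberately case-sensitive, since the convention covers case as well as punctuation.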

Why

opencode + portal triage: when the deployment name and model name differ in case or punctuation, both opencode integration and the Foundry portal triage path become harder to navigate. The dash-vs-dot inconsistency was a long-standing footgun that contradicted the "no surprises" goal.
K2.6 capability: confirmed available in westus on 2026-05-03 via az cognitiveservices model list -l westus; supersedes K2.5 in raw capability.
gpt-4o-mini retirement window: Microsoft retires gpt-4o-mini on 2026-09-30. Removing it now and using Kimi-K2.6 for L5 Reporting buys runway and simplifies the Foundry inventory by one deployment.

What this amendment changes

Aspect	Before	After
Deployment name convention	Kebab-style (`kimi-k2-5`, `gpt-5-1-codex`)	Match model name verbatim (`Kimi-K2.6`, `gpt-5.1-codex`)
L1 Generation model	Kimi K2.5	Kimi K2.6
L5 Reporting model	gpt-4o-mini (dedicated deployment)	Kimi-K2.6 (shared with L1)
Foundry deployments in westus	6 (incl. gpt-4o-mini, kimi-k2-5)	5 (incl. Kimi-K2.6, no gpt-4o-mini)

Revisit triggers

L1 + L5 contention causes 429s on the shared Kimi-K2.6 deployment → split L5 onto a dedicated deployment (consider gpt-5.4-mini or gpt-5-nano).
Kimi K2.6 retired or replaced by K2.7+ → swap and update both env-var pointers in lockstep.

Amendment — 2026-06-26: Foundry reached via the Tower gateway (SPEC-042)

Status: introduced by SPEC-042 (draft, under review). Documents the direction; implementation follows spec approval. Foundry remains the sole runtime provider — the choice of Foundry (this ADR) stands; what changes is how the backend reaches it.

What changes

Product LLM inference no longer calls Foundry directly from backend app code. It flows through Tower — an in-house, provider-agnostic gateway port (TowerInternalApi) — backed by a self-hosted LiteLLM engine that holds the provider integration: backend (Tower port) → LiteLLM engine → Foundry. Foundry stays the only runtime provider at MVP; additional providers become config-only additions to the engine with zero app-code change.
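
The shape of that seam can be sketched as a port plus one engine-backed implementation. Only the TowerInternalApi name comes from SPEC-042 as cited here; the `chat` method, the `LiteLLMEngineTower` class, the injectable `opener`, and the base URL are assumptions for illustration. Authentication to the engine is omitted.

```python
import json
from typing import Protocol
from urllib import request


class TowerInternalApi(Protocol):
    """Provider-agnostic port the backend depends on (method name is hypothetical)."""

    def chat(self, model: str, messages: list) -> str: ...


class LiteLLMEngineTower:
    """Port implementation that speaks OpenAI-compatible HTTP to the engine."""

    def __init__(self, base_url: str, opener=request.urlopen):
        self._base_url = base_url.rstrip("/")
        self._opener = opener  # injectable so tests need no network

    def build_request(self, model: str, messages: list) -> request.Request:
        body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
        return request.Request(
            f"{self._base_url}/v1/chat/completions",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def chat(self, model: str, messages: list) -> str:
        with self._opener(self.build_request(model, messages)) as resp:
            payload = json.load(resp)
        # Standard OpenAI chat-completions response shape.
        return payload["choices"][0]["message"]["content"]
```

Backend code would hold a TowerInternalApi reference and never import a provider SDK, which is the property Hard Rule #3 asks for. Swapping or adding providers then changes engine configuration only.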

Why

Provider/model-agnostic LLM integration is an MVP MUST that Foundry-direct cannot satisfy.
Hard Rule #3 (no cloud SDKs in app code): the engine holds provider SDKs/keys; the backend speaks OpenAI-compatible HTTP to the engine.
Future BYOK / data-residency for customer app screenshots needs a per-tenant credential seam the engine provides.

What this amendment changes

Aspect	Before	After
Backend → Foundry access	Direct (`FoundryOpenAIAdapter`, ConfigMap deployment names)	Via `TowerInternalApi` → LiteLLM engine → Foundry
Foundry API key location	Injected into backend pods	Held by the LiteLLM engine; backend no longer holds it
Provider posture	"all product inference via Foundry"	Foundry is the sole runtime provider at MVP, behind Tower; more providers are engine-config additions
Model routing	`llm-config` ConfigMap	Engine config-as-code (see ADR-009 amendment 2026-06-26)
Dual-account endpoint selection (westus/eastus2)	Backend adapter per-model override	The engine (holds both account endpoints/keys)

Revisit triggers

A second provider goes live at runtime → Foundry is no longer sole; update the "sole runtime provider" statement.
BYOK ships → per-tenant provider credentials.
The single-instance engine proves an unacceptable SPOF → revisit HA (SPEC-042 §K).
