Skip to main content

ADR-008: Use Azure AI Foundry (serverless) for LLM inference

Context

Aucert's 5-layer AI testing pipeline requires LLM inference for test generation (L1), analysis (L3), decision-making (L4), and reporting (L5). We need a platform that:

  1. Is credit-eligible under our Azure Founders Hub credits (~$5K)
  2. Has zero or near-zero idle cost at our current low volume
  3. Requires no GPU expertise (no vLLM tuning, CUDA drivers, or OOM debugging)
  4. Supports our cloud-agnostic architecture (swappable via environment variable)
  5. Provides OpenAI-compatible API format for broad library support

We evaluated four options:

OptionCredit-eligibleIdle costGPU ops neededCloud-agnostic
Azure AI Foundry (Global Standard)Yes ("Direct from Azure")ZeroNoYes (OpenAI-compatible)
Self-hosted on AKS GPU nodesN/A (compute cost)$2-8/hrYesYes (but heavy ops)
AWS BedrockYes (via Activate)ZeroNoYes
Direct API (Anthropic, OpenAI)NoZeroNoYes

Decision

Use Azure AI Foundry with serverless Global Standard deployments for all product LLM inference. Models are deployed as API endpoint configurations within an azurerm_cognitive_account resource, managed by Terraform in the foundation tier (infra/terraform/foundation/foundry.tf).

Key properties:

  • Resource: aucert-ai in aucert-foundation-rg (westus)
  • Deployment type: Global Standard (serverless, pay-per-token, zero idle cost)
  • Auth: Managed Identity from AKS (primary), API key from Key Vault (local dev)
  • API format: OpenAI-compatible — any library that works with OpenAI works with Foundry
  • Model management: Terraform azurerm_cognitive_deployment blocks — version-controlled, reviewable

The Foundry resource is a billing/configuration wrapper. Microsoft hosts the model weights on their GPU infrastructure. No GPU nodes in our VNet, no model weights on our disks.

Alternatives considered

OptionProsCons
Azure AI Foundry (chosen)Zero idle cost, credit-eligible, no GPU ops, instant model switching via Terraform, OpenAI-compatible APICannot use Anthropic Claude models with credits (marketplace items). Dependent on Microsoft availability/rate limits. No data residency without private endpoints.
Self-hosted on AKSFull control, lowest per-token cost at high volume, data stays in VNetGPU node pool $2-8/hr idle cost, vLLM/TGI ops burden, model weight management. Overkill at our volume.
AWS BedrockZero idle cost, broad model selectionRequires AWS infrastructure. Activate credits not yet approved. Cross-cloud complexity.
Direct APISimplest integration, widest model selectionNo credit coverage. Higher cost. No unified billing.

Consequences

What becomes easier

  • Adding new models: one Terraform block + terraform apply — no GPU provisioning
  • Cost management: all models are pay-per-token with zero idle cost; $1K credits last 10-22 months
  • Local development: same API format as OpenAI — developers can use familiar libraries
  • Model switching: update ConfigMap deployment name, pods pick up on next request

What becomes harder

  • Using Anthropic Claude models: marketplace items are NOT credit-eligible — separate billing required
  • Debugging inference issues: model internals are opaque (Microsoft-managed)
  • Latency optimization: no control over GPU placement or model serving configuration

Risks

  • Microsoft may deprecate or change pricing for specific models (mitigated by multi-model strategy)
  • Rate limits may throttle during usage spikes (mitigated by TPM quota config, can increase)
  • Founders Hub credits may expire before revenue (mitigated by low burn rate: $45-105/month)

Amendment — 2026-04-20: region strategy, rollout staggering, dual-account pattern

What we learned

While deploying three approved gated models (gpt-5.4, gpt-5.4-pro, gpt-5.3-codex) to our existing aucert-ai account in westus, two deployments failed:

  • gpt-5.3-codex: wrong model version (caller used 2025-12-20; actual catalog is 2026-02-24 in westus)
  • gpt-5.4-pro: not available in westus at all — only in eastus2 and southcentralus as of 2026-04-20

Root-cause analysis surfaced three structural gaps in the original decision:

  1. Region was inherited, not evaluated. The Foundry account was placed in westus because AKS/VNet/KV/PG were already there. The original decision did not treat region as an independent dimension; the "West US" line in §Decision was a statement, not a justification. (Additional context: at creation time, our Azure subscription did not have eastus access approved — westus was the only region available to us, which reinforces that the choice was constraint-driven.)
  2. "GlobalStandard" created a false inference. The SKU name suggests region doesn't matter. It does — GlobalStandard governs how inference requests route (globally across Microsoft's GPU pool), but the deployment registration still requires the model to be rolled out to the control-plane region. Two models can both be GlobalStandard and have different regional availability.
  3. Availability was checked Day-1 only. The four Day-1 models (GPT-5.1-Codex, Kimi K2.5, DeepSeek V3.2, GPT-4o-mini) all happened to be in westus at deploy time. No forward-looking check was run against the "Pending gated models" list we already knew we'd want. This class of silent failure was reintroduced when the gated models were finally added.

Updated decision

Region strategy:

  • Primary Foundry account (aucert-ai) remains in westus — co-located with AKS and the rest of foundation. Hosts the majority of product inference models.
  • Secondary Foundry account (aucert-ai-east) in eastus2 — provisioned specifically to host models that Microsoft has not yet rolled out to westus. As of 2026-04-20, this is only gpt-5.4-pro. More models may land here if regional rollout lag recurs.
  • Both accounts are in the same resource group (aucert-foundation-rg). The Key Vault (in westus) stores credentials for both (foundry-api-key + foundry-endpoint for primary; foundry-east-api-key + foundry-east-endpoint for secondary).
  • AKS kubelet managed identity has Cognitive Services User on both accounts.
  • Because SKU is GlobalStandard, inference latency from westus AKS to an eastus2 endpoint is equivalent to latency to the westus endpoint — Microsoft routes to the nearest GPU regardless of the account's control-plane region.

Why two accounts, not migration: migrating aucert-ai to eastus2 would require destroying and recreating the resource (breaking all existing deployments, AKS RBAC, Key Vault secrets, backend endpoint config) for the benefit of one currently-unavailable model. The dual-account pattern isolates the risk and is reversible — once gpt-5.4-pro ships to westus, we can move the deployment back and retire the east account.

Backend routing: ConfigMap llm-config now exposes FOUNDRY_ENDPOINT (primary) plus per-model overrides like FOUNDRY_ENDPOINT_REASONING_PRO (east endpoint). The backend adapter must check for a per-model override first and fall back to the primary endpoint otherwise. TODO: wire per-model endpoint selection into FoundryOpenAIAdapter.

Pre-deploy verification is now mandatory:

A script at infra/terraform/foundation/scripts/verify-foundry-models.sh parses every azurerm_cognitive_deployment block across foundry*.tf, extracts (account, model.name, model.version, sku.name), and verifies each tuple is available in the account's region via az cognitiveservices model list. It must be run before terraform plan whenever foundry*.tf is modified. Intent is to gate this in CI next.

What this amendment changes

AspectBeforeAfter
Number of Foundry accounts1 (aucert-ai in westus)2 (aucert-ai in westus + aucert-ai-east in eastus2)
Decision dimension "region"Inherited from AKSExplicitly evaluated per deployment; secondary account allowed when westus rollout lags
Pre-deploy checkNonescripts/verify-foundry-models.sh mandatory before terraform plan on foundry*.tf
Backend endpoint configSingle FOUNDRY_ENDPOINTPrimary + per-model overrides (FOUNDRY_ENDPOINT_<KEY>)
gpt-5.4-pro home(undefined — failed to deploy in westus)aucert-ai-east in eastus2

Revisit triggers

Revisit this amendment if any of the following becomes true:

  • Microsoft rolls out gpt-5.4-pro (or whatever the east account's sole occupant is) to westus — then collapse back to a single account.
  • A third model requires a region neither westus nor eastus2 supports — re-evaluate whether a third account, or migration of one existing account, is cheaper.
  • Cross-region latency measurements reveal GlobalStandard routing is NOT region-agnostic in practice — would invalidate the "two accounts, negligible latency delta" assumption.

Amendment — 2026-05-03: deployment-name-equals-model-name + Kimi K2.6 swap

What changed

  • All Foundry deployment name values are now identical (case + dots + hyphens) to the underlying model name. This unifies what previously diverged across the dash-vs-dot footgun documented in the original "API compatibility note" of the foundry usage guide.
  • Kimi K2.5 retired; Kimi K2.6 (Moonshot version 2026-04-20) deployed as the L1 Generation primary.
  • gpt-4o-mini deployment removed. L5 Reporting now shares the Kimi-K2.6 deployment as a placeholder — the cost delta vs. a dedicated mini deployment is rounding error at MVP volume, and L1+L5 sharing one deployment is acceptable until quota contention is observed.

Why

  • opencode + portal triage: when the deployment name and model name differ in case or punctuation, both opencode integration and the Foundry portal triage path become harder to navigate. The dash-vs-dot inconsistency was a long-standing footgun that contradicted the "no surprises" goal.
  • K2.6 capability: confirmed available in westus on 2026-05-03 via az cognitiveservices model list -l westus; supersedes K2.5 in raw capability.
  • gpt-4o-mini retirement window: Microsoft retires gpt-4o-mini 2026-09-30. Removing it now and using Kimi-K2.6 for L5 Reporting buys runway and simplifies the Foundry inventory by one deployment.

What this amendment changes

AspectBeforeAfter
Deployment name conventionSnake-style (kimi-k2-5, gpt-5-1-codex)Match model name verbatim (Kimi-K2.6, gpt-5.1-codex)
L1 Generation modelKimi K2.5Kimi K2.6
L5 Reporting modelgpt-4o-mini (dedicated deployment)Kimi-K2.6 (shared with L1)
Foundry deployments in westus6 (incl. gpt-4o-mini, kimi-k2-5)5 (incl. Kimi-K2.6, no gpt-4o-mini)

Revisit triggers

  • L1 + L5 contention causes 429s on the shared Kimi-K2.6 deployment → split L5 onto a dedicated deployment (consider gpt-5.4-mini or gpt-5-nano).
  • Kimi K2.6 retired or replaced by K2.7+ → swap and update both env-var pointers in lockstep.