Azure AI Foundry architecture

This document explains how Azure AI Foundry works and why there are no LLM models running inside our AKS cluster or VNet. It is written for both technical and non-technical team members.

What Foundry IS

Azure AI Foundry is a managed SaaS platform where Microsoft hosts LLM models on their GPU infrastructure. When we "deploy a model," we are creating an API endpoint configuration — not provisioning hardware.

We create a "resource" (billing wrapper) and "deployments" (API endpoint configs)
Microsoft handles GPU allocation, model weights, scaling, uptime, and security patches
We call it over HTTPS, the same way we call Stripe for payments or Twilio for SMS

The Foundry resource in our Azure resource group (aucert-ai) is a configuration object. The actual inference (the GPU crunching numbers) happens on Microsoft's shared infrastructure, somewhere in their data centers.

What Foundry is NOT

Not infrastructure in our VNet (aucert-vnet, 10.0.0.0/16) — no subnets, no NICs, no private IPs
Not pods in our AKS (aucert-aks) — no deployments, no containers, no CPU/memory allocation
Not GPU nodes we manage — no CUDA drivers, no vLLM tuning, no OOM errors
Not model weights on our disks — no storage, no downloads, no version management
Not something with idle cost — Global Standard deployments are pay-per-token; a deployment that receives zero requests costs $0

Architecture diagram

Request flow

Why this is the right choice for our stage

Zero idle cost — We pay per token, not per GPU-hour. At our volume ($45-105/month), this saves thousands compared to self-hosting. A GPU node pool costs $2-8/hour even when idle.
No GPU expertise needed — We don't need to tune vLLM serving parameters, manage CUDA drivers, handle GPU memory fragmentation, or debug OOM errors. Microsoft handles all of that.
Instant model switching — Adding a new model is a Terraform block + terraform apply (2 minutes). Switching which model a pipeline layer uses is a ConfigMap update (1 minute). No new node pools, no image builds, no Helm upgrades.
Credit-eligible — "Direct from Azure" models are covered by our Founders Hub credits. Our $1,000 in credits lasts 10-22 months at current estimates.
Same API everywhere — OpenAI-compatible format means our backend code works with Foundry, Bedrock, Vertex, or direct API by changing one environment variable (LLM_PROVIDER). This preserves our cloud-agnostic architecture (ADR-001).

When self-hosting would make sense

Self-hosting open models (Llama, DeepSeek, Mistral) on GPU VMs becomes cheaper when API costs exceed ~$10,000/month. At that volume, a dedicated GPU node pool amortizes the idle cost across enough requests to beat per-token pricing.

We are nowhere near that volume. Our estimated monthly cost is $45-105. Revisit self-hosting at Month 12+ if we have paying customers generating high inference volume.

Analogy for non-technical readers

Using Foundry is like using Gmail.

Google runs the email servers — you don't install an email server in your office
You call Gmail's API to send email — you don't manage mail queues or spam filters
You pay for the service (or get it free) — you don't buy server hardware

Similarly:

Microsoft runs the AI models — we don't run AI models on our own servers
We call their API to get predictions — we don't manage GPUs or model weights
We pay per usage (covered by credits) — we don't buy GPU hardware

The "AI Foundry resource" in our Azure subscription is like your Gmail account. It's a configuration that says "this is the Aucert account, these are the models it has access to." The actual email servers (GPU servers) are somewhere in Google's (Microsoft's) data centers.

Deployment details

The Foundry resource is managed by Terraform in infra/terraform/foundation/foundry.tf. Each model deployment is a azurerm_cognitive_deployment resource:

Deployment names match the Foundry model name exactly (case + dots).

Deployment	Model	SKU	Tier	TPM limit
`Kimi-K2.6`	Kimi K2.6	GlobalStandard	Standard	Per-token
`DeepSeek-V3.2`	DeepSeek V3.2	GlobalStandard	Standard	Per-token
`gpt-5.1-codex`	GPT-5.1-Codex	GlobalStandard	Standard	Per-token
`gpt-5.4`	GPT-5.4	GlobalStandard	Standard	Per-token
`gpt-5.3-codex`	GPT-5.3-Codex	GlobalStandard	Standard	Per-token

All deployments use the Global Standard SKU, which means:

Pay-per-token pricing (no reserved capacity)
Requests may be served from any Azure region (Microsoft routes for lowest latency)
No guaranteed throughput — acceptable for our development-stage volume

tip

To add a new model, add a azurerm_cognitive_deployment block to foundry.tf, run terraform apply, then update the llm-config ConfigMap to point the desired pipeline layer to the new deployment name.

What's next

Foundry usage guide — How to call models from different environments
Model orchestration — Pipeline layer to model mapping
ADR-008 — Why we chose Foundry
ADR-009 — Why these specific models

What Foundry IS​

What Foundry is NOT​

Architecture diagram​

Request flow​

Why this is the right choice for our stage​

When self-hosting would make sense​

Analogy for non-technical readers​

Deployment details​

What's next​