Skip to main content

Azure AI Foundry architecture

This document explains how Azure AI Foundry works and why there are no LLM models running inside our AKS cluster or VNet. It is written for both technical and non-technical team members.

What Foundry IS

Azure AI Foundry is a managed SaaS platform where Microsoft hosts LLM models on their GPU infrastructure. When we "deploy a model," we are creating an API endpoint configuration — not provisioning hardware.

  • We create a "resource" (billing wrapper) and "deployments" (API endpoint configs)
  • Microsoft handles GPU allocation, model weights, scaling, uptime, and security patches
  • We call it over HTTPS, the same way we call Stripe for payments or Twilio for SMS

The Foundry resource in our Azure resource group (aucert-ai) is a configuration object. The actual inference (the GPU crunching numbers) happens on Microsoft's shared infrastructure, somewhere in their data centers.

What Foundry is NOT

  • Not infrastructure in our VNet (aucert-vnet, 10.0.0.0/16) — no subnets, no NICs, no private IPs
  • Not pods in our AKS (aucert-aks) — no deployments, no containers, no CPU/memory allocation
  • Not GPU nodes we manage — no CUDA drivers, no vLLM tuning, no OOM errors
  • Not model weights on our disks — no storage, no downloads, no version management
  • Not something with idle cost — Global Standard deployments are pay-per-token; a deployment that receives zero requests costs $0

Architecture diagram

Request flow

Why this is the right choice for our stage

  1. Zero idle cost — We pay per token, not per GPU-hour. At our volume ($45-105/month), this saves thousands compared to self-hosting. A GPU node pool costs $2-8/hour even when idle.

  2. No GPU expertise needed — We don't need to tune vLLM serving parameters, manage CUDA drivers, handle GPU memory fragmentation, or debug OOM errors. Microsoft handles all of that.

  3. Instant model switching — Adding a new model is a Terraform block + terraform apply (2 minutes). Switching which model a pipeline layer uses is a ConfigMap update (1 minute). No new node pools, no image builds, no Helm upgrades.

  4. Credit-eligible — "Direct from Azure" models are covered by our Founders Hub credits. Our $1,000 in credits lasts 10-22 months at current estimates.

  5. Same API everywhere — OpenAI-compatible format means our backend code works with Foundry, Bedrock, Vertex, or direct API by changing one environment variable (LLM_PROVIDER). This preserves our cloud-agnostic architecture (ADR-001).

When self-hosting would make sense

Self-hosting open models (Llama, DeepSeek, Mistral) on GPU VMs becomes cheaper when API costs exceed ~$10,000/month. At that volume, a dedicated GPU node pool amortizes the idle cost across enough requests to beat per-token pricing.

We are nowhere near that volume. Our estimated monthly cost is $45-105. Revisit self-hosting at Month 12+ if we have paying customers generating high inference volume.

Analogy for non-technical readers

Using Foundry is like using Gmail.

  • Google runs the email servers — you don't install an email server in your office
  • You call Gmail's API to send email — you don't manage mail queues or spam filters
  • You pay for the service (or get it free) — you don't buy server hardware

Similarly:

  • Microsoft runs the AI models — we don't run AI models on our own servers
  • We call their API to get predictions — we don't manage GPUs or model weights
  • We pay per usage (covered by credits) — we don't buy GPU hardware

The "AI Foundry resource" in our Azure subscription is like your Gmail account. It's a configuration that says "this is the Aucert account, these are the models it has access to." The actual email servers (GPU servers) are somewhere in Google's (Microsoft's) data centers.

Deployment details

The Foundry resource is managed by Terraform in infra/terraform/foundation/foundry.tf. Each model deployment is a azurerm_cognitive_deployment resource:

Deployment names match the Foundry model name exactly (case + dots).

DeploymentModelSKUTierTPM limit
Kimi-K2.6Kimi K2.6GlobalStandardStandardPer-token
DeepSeek-V3.2DeepSeek V3.2GlobalStandardStandardPer-token
gpt-5.1-codexGPT-5.1-CodexGlobalStandardStandardPer-token
gpt-5.4GPT-5.4GlobalStandardStandardPer-token
gpt-5.3-codexGPT-5.3-CodexGlobalStandardStandardPer-token

All deployments use the Global Standard SKU, which means:

  • Pay-per-token pricing (no reserved capacity)
  • Requests may be served from any Azure region (Microsoft routes for lowest latency)
  • No guaranteed throughput — acceptable for our development-stage volume
tip

To add a new model, add a azurerm_cognitive_deployment block to foundry.tf, run terraform apply, then update the llm-config ConfigMap to point the desired pipeline layer to the new deployment name.

What's next