Troubleshooting

Common errors and their solutions, organized by platform.

Backend (Kotlin)

"Connection refused" — Database not accessible

Symptom: org.postgresql.util.PSQLException: Connection to localhost:5432 refused

Cause: PostgreSQL is not running or not port-forwarded.

Fix:

# Port-forward the product database
kubectl port-forward -n aucert-dev svc/product-pg 5432:5432

# Or the internal platform database
kubectl port-forward -n internal-platform svc/internal-pg 5433:5432

Ensure your DATABASE_URL environment variable matches the forwarded port.
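One way to confirm the match is to pull the port out of DATABASE_URL and compare it with the local port you forwarded. This is a minimal sketch, assuming the URL has the usual `scheme://host:port/dbname` shape. The helper name `db_url_port` and the fallback URL are illustrative and not taken from the project.

```shell
# Hypothetical helper: print the port from a URL shaped like
# scheme://[user:pass@]host:port/dbname (a jdbc: prefix is also accepted).
db_url_port() {
  # Drop everything up to "://", then keep the part before the first "/".
  hostport=${1#*://}
  hostport=${hostport%%/*}
  # Drop optional credentials before "@".
  hostport=${hostport##*@}
  case "$hostport" in
    *:*) printf '%s\n' "${hostport##*:}" ;;
    *)   printf '%s\n' 5432 ;;  # PostgreSQL's default port when none is given
  esac
}

# Example: warn when DATABASE_URL does not point at the forwarded port.
expected_port=5432   # local side of: kubectl port-forward ... 5432:5432
actual_port=$(db_url_port "${DATABASE_URL:-postgresql://localhost:5432/app}")
if [ "$actual_port" != "$expected_port" ]; then
  echo "DATABASE_URL uses port $actual_port, but port $expected_port is forwarded" >&2
fi
```

For the internal platform database, set `expected_port=5433` to match the second port-forward command.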

"Class not found" errors during build

Symptom: ClassNotFoundException or NoClassDefFoundError during the build.

Fix:

cd backend/platform
./gradlew clean build

If using generated Protobuf classes:

bazel build //proto:all
./gradlew clean build

Gradle daemon issues

Symptom: Build hangs, OOM errors, or stale cache.

Fix:

./gradlew --stop
rm -rf ~/.gradle/caches/
./gradlew clean build

Frontend (TypeScript)

pnpm install fails with peer dependency errors

Symptom: ERR_PNPM_PEER_DEP_ISSUES during install.

Fix:

# Clear the pnpm store and reinstall
pnpm store prune
rm -rf node_modules
pnpm install

Check that your Node.js version is 22 or newer:

node --version  # Should be v22.x or newer

API calls fail in development (proxy issues)

Symptom: ERR_CONNECTION_REFUSED or 502 errors when the frontend calls the API.

Cause: The backend is not running on the expected port.

Fix:

  1. Ensure the backend is running on port 8080
  2. Check the proxy config in next.config.ts
  3. Verify with: curl http://localhost:8080/health

TypeScript type errors after proto changes

Symptom: Type errors referencing generated types in schemas/generated/.

Fix:

# Regenerate from proto
bazel build //proto:all

# Restart the TS server in your IDE
# VS Code: Cmd+Shift+P (Ctrl+Shift+P on Windows/Linux) → "TypeScript: Restart TS Server"

Kubernetes / AKS

Pod stuck in Pending

Symptom: Pod stays in Pending state indefinitely.

Diagnose:

kubectl describe pod -n <namespace> <pod-name>

Common causes:

  • Insufficient resources: Node pool doesn't have enough CPU/memory. Check Events section.
  • PVC not bound: Persistent volume claim waiting for storage.
  • Node affinity: Pod can't be scheduled to any available node.
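The three causes above usually show up as recognizable scheduler messages in the Events section. The sketch below maps those messages to a cause. It assumes the wording used by recent kube-scheduler versions, which can differ between Kubernetes releases, and `pending_cause` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and name
# the likely cause, based on common kube-scheduler event messages.
pending_cause() {
  events=$(cat)
  case "$events" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "insufficient resources" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "PVC not bound" ;;
    *"node affinity"*)
      echo "node affinity" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage (same placeholders as the diagnose command):
#   kubectl describe pod -n <namespace> <pod-name> | pending_cause
```

An "unknown" result only means none of these three patterns matched, so read the Events section directly in that case.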

Fix: Scale up the node pool or reduce resource requests in the Helm values file.

Pod in CrashLoopBackOff

Symptom: Pod starts, crashes, restarts repeatedly.

Diagnose:

# Check crash logs
kubectl logs -n <namespace> <pod-name> --previous

# Check environment variables
kubectl exec -n <namespace> <pod-name> -- env | sort
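To turn the env listing into a quick check, you can filter it for the variables the service needs. This is a sketch under assumptions: `missing_env_vars` is a hypothetical helper, and the list of required names is yours to supply. Note that `kubectl exec` only works while the container is running, so it may fail between restarts.

```shell
# Hypothetical helper: read `env` output (NAME=value lines) on stdin and print
# each name from the argument list that is absent or set to an empty value.
missing_env_vars() {
  env_dump=$(cat)
  for name in "$@"; do
    # Present means: a line starts with NAME= followed by at least one character.
    if ! printf '%s\n' "$env_dump" | grep -q "^${name}=."; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (DATABASE_URL is the variable mentioned in the Backend section):
#   kubectl exec -n <namespace> <pod-name> -- env | missing_env_vars DATABASE_URL
```

No output means every listed variable is set to a non-empty value.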

Common causes:

  • Missing environment variables (especially database URLs)
  • Database connection failure (PG not accessible from pod)
  • Application startup error (check logs for stack trace)

Pod in ImagePullBackOff

Symptom: Pod can't pull the Docker image from ACR.

Fix:

# Re-authenticate with ACR
az acr login --name aucertacr41e0x5

# Verify the image exists
az acr repository show-tags --name aucertacr41e0x5 --repository <image-name>

# Check AKS has ACR pull permissions
az aks check-acr --name aucert-aks --resource-group aucert-foundation-rg \
  --acr aucertacr41e0x5.azurecr.io

Helm upgrade fails

Symptom: helm upgrade returns an error about conflicting resources.

Fix:

# Check current release status (--all includes releases stuck in a pending
# state, which helm list hides by default)
helm list -n <namespace> --all
helm history <release-name> -n <namespace>

# If stuck in "pending-upgrade" state
helm rollback <release-name> -n <namespace>

Terraform

State lock — "Error acquiring the state lock"

Symptom: Terraform can't acquire the state lock.

Diagnose: Another terraform apply may be running. Check with your team.
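Terraform's lock error prints a "Lock Info" block that includes the lock ID and who holds it, which tells you whom to ask. The sketch below extracts both fields. It assumes the standard `ID:` and `Who:` labels in that block, and `lock_info` is a hypothetical helper name.

```shell
# Hypothetical helper: read Terraform's lock error output on stdin and print
# the lock ID together with its holder (the "Who:" field).
lock_info() {
  awk '{
         # Scan every field so decorated lines (e.g. a leading "│") still match.
         for (i = 1; i < NF; i++) {
           if ($i == "ID:"  && id  == "") id  = $(i + 1)
           if ($i == "Who:" && who == "") who = $(i + 1)
         }
       }
       END { if (id != "") printf "lock %s held by %s\n", id, who }'
}

# Usage:
#   terraform plan -no-color 2>&1 | lock_info
```

The printed ID is the value that `terraform force-unlock` expects as `<lock-id>`.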

Fix (only if confirmed no other process is running):

terraform force-unlock <lock-id>

Danger: Only force-unlock if you are certain no other process is running. Forcing an unlock while another apply is in progress can corrupt state.

"Resource already exists" on apply

Symptom: Terraform tries to create a resource that already exists (created manually or by another process).

Fix:

# Import the existing resource into state
terraform import <resource_type>.<name> <azure-resource-id>

# Then plan to verify no changes
terraform plan

Provider authentication errors

Fix:

# Re-authenticate with Azure
az login

# Set the correct subscription
az account set --subscription "<subscription-id>"

# Verify
az account show

Database / Flyway

Migration failed — "Detected resolved migration not applied"

Symptom: Flyway detects a migration file that should have been applied before an already-applied one.

Cause: Migration version numbers are out of sequence (e.g., V003 was applied but V002 was added later).
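A quick way to spot gaps is to list the migration files and look for missing version numbers. This sketch assumes integer versions in file names of the form `V<number>__<description>.sql` (dotted versions such as V1.1 are not handled), and `missing_versions` is a hypothetical helper name.

```shell
# Hypothetical helper: read migration file names on stdin and print every
# version number missing between the lowest and highest version found.
missing_versions() {
  # Extract the numeric version (leading zeros dropped), sort numerically,
  # then print each number skipped between consecutive versions.
  sed -n 's/^.*V0*\([0-9][0-9]*\)__.*$/\1/p' | sort -n | awk '
    NR > 1 { for (v = prev + 1; v < $1; v++) print v }
    { prev = $1 }'
}

# Usage (replace the placeholder with your migrations directory):
#   ls <migrations-dir> | missing_versions
```

A gap in the files is not an error by itself, but a number that appears here and is later filled in after higher versions were applied will trigger this failure.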

Fix: Ensure migration version numbers are strictly sequential. If a migration was skipped, either:

  1. Apply it manually and update the flyway_schema_history table
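  2. Run Flyway with -outOfOrder=true so that it applies the skipped migration. Note that -baselineOnMigrate=true (already set in our CI workflow) only affects non-empty schemas that have no history table, so it does not resolve this error.

"Connection refused" from Flyway pod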

Cause: The Flyway pod can't reach PostgreSQL on the private VNet.

Check:

# Verify the PG service is accessible from within the cluster
kubectl exec -n internal-platform -it <any-pod> -- \
pg_isready -h internal-pg -p 5432

Ensure the Flyway pod is running in the correct namespace with VNet access.

CI / GitHub Actions

Workflow not triggering on push

Check:

  1. Verify the paths filter matches your changed files
  2. Check that the workflow file is on the main branch
  3. Look at the Actions tab for skipped runs

# View recent workflow runs
gh run list --workflow=ci.yml --limit=5

OIDC authentication failure in CI

Symptom: azure/login step fails with OIDC token error.

Check:

  1. AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID secrets are set
  2. App registration federated credentials match the repo/branch
  3. permissions: id-token: write is set in the workflow
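The third check can be scripted with a simple grep. This sketch assumes the permission is written in block style, on its own line, and `has_id_token_write` is a hypothetical helper name. It does not detect the inline `permissions: { id-token: write }` form.

```shell
# Hypothetical helper: succeed if the given workflow file contains a line
# granting the OIDC permission (id-token: write), optionally followed by a comment.
has_id_token_write() {
  grep -Eq '^[[:space:]]*id-token:[[:space:]]*write[[:space:]]*(#.*)?$' "$1"
}

# Usage (ci.yml as in the gh command above; .github/workflows is the standard location):
#   has_id_token_write .github/workflows/ci.yml || echo "id-token: write is missing"
```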

What's next