
AMI-Based Launch Pipeline

Automated pipeline for launching isolated staging environments from pre-built AMIs, running E2E tests, and tearing them down, all via GitHub Actions.

Overview

┌─────────────────────────────────────────────────────────────────┐
│                    GitHub Actions Workflow                      │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐                            │
│  │ Build        │   │ Launch EC2   │                            │
│  │ Playwright   │   │ from AMI     │   (parallel)               │
│  │ Image (OCIR) │   │ + Service    │                            │
│  └──────┬───────┘   │   Update     │                            │
│         │           └──────┬───────┘                            │
│         │                  │                                    │
│         └────────┬─────────┘                                    │
│                  ▼                                              │
│         ┌────────────────┐                                      │
│         │ Run Playwright │  (OCI Container Instances            │
│         │ Tests          │   hit mentorai.stgX.iblai.org)       │
│         └──────┬─────────┘                                      │
│                ▼                                                │
│         ┌──────────────┐                                        │
│         │ Terminate    │                                        │
│         │ EC2 Instance │                                        │
│         └──────────────┘                                        │
└─────────────────────────────────────────────────────────────────┘

Architecture

Each staging environment (stg1–stg4) has permanent AWS infrastructure:

| Resource | Purpose | Persists between launches |
|---|---|---|
| VPC + Subnets | Networking | Yes |
| ALB + Target Group | Load balancer with TLS termination | Yes |
| ACM Certificates | SSL for *.stgX.iblai.org | Yes |
| Route53 Records | DNS → ALB | Yes |
| Security Groups | Firewall rules | Yes |
| S3 Buckets | Media + static storage | Yes |
| EC2 Instance | Platform server | No (ephemeral) |

The EC2 instance is the only component created and destroyed on each pipeline run. Everything else is pre-provisioned via Terraform and reused.

Pre-Built AMI Contents

Each AMI is a snapshot of a fully configured staging environment:

  • OS: Ubuntu 22.04 with Docker, pyenv, Python 3.11.8, AWS CLI
  • Platform CLI: iblai-cli-ops installed via iblai-prod-images
  • Services (Docker containers):
    • iblai-dm-pro (Django, PostgreSQL, Redis, Celery, Langfuse, ClickHouse, MinIO)
    • iblai-edx-pro (LMS, CMS, MySQL, MongoDB, Redis, Elasticsearch, MFE)
    • Auth SPA, Mentor SPA, Skills SPA
    • Nginx reverse proxy
  • Data: Test platforms, users, RBAC, analytics views pre-seeded
  • Config: S3 buckets, AWS credentials, TimescaleDB enabled

Pipeline Steps: Detailed

Step 1: Build Playwright Image

What: Builds a Docker image containing the Playwright test suite from the mentorai repo and pushes it to Oracle Cloud Container Registry (OCIR).

Where: GitHub Actions runner (ubuntu-latest) → OCIR

Image: iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright:{tag}

Contents: Playwright browsers (Chromium, Firefox, WebKit), test specs from e2e/journeys/, page objects, test utilities, AWS CLI for S3 log upload.

Caching: If an image with the same tag already exists in OCIR, the build is skipped.
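
The tag-based cache check can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: it assumes the check is done with `docker manifest inspect` (which exits non-zero when the tag is absent) and that the runner is already logged in to OCIR. The helper names `image_exists` and `should_build` are hypothetical.

```python
import subprocess
from typing import Callable

# Repository path taken from the "Image" line above.
IMAGE_REPO = "iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright"


def image_exists(tag: str, run: Callable = subprocess.run) -> bool:
    """Return True if the registry already holds IMAGE_REPO:tag.

    Assumption: `docker manifest inspect` is used for the lookup; the real
    workflow may query OCIR differently.
    """
    result = run(
        ["docker", "manifest", "inspect", f"{IMAGE_REPO}:{tag}"],
        capture_output=True,
        check=False,  # a missing tag is an expected outcome, not an error
    )
    return result.returncode == 0


def should_build(tag: str, run: Callable = subprocess.run) -> bool:
    """Build only when no image with this tag has been pushed yet."""
    return not image_exists(tag, run)
```

The `run` parameter exists only so the lookup can be replaced with a stub in tests; in the pipeline it would stay at its default.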

Step 2: Launch EC2 from AMI

What: Provisions a fresh EC2 instance from the pre-built AMI into the existing VPC/subnet/security group.

How (via boto3 in the iblai-infra-cli tool):

  1. ec2:RunInstances with the AMI ID, instance type (t3.2xlarge), 200GB gp3 volume
  2. Wait for instance to enter running state
  3. Get public IP address
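
A minimal sketch of these three calls, assuming a boto3 EC2 client is passed in as `ec2` (for example `boto3.client("ec2")`). The function names are hypothetical and this is not the iblai-infra-cli source. The root device name `/dev/sda1` is an assumption about the AMI and should be checked against the image's block device mapping.

```python
from typing import Any, Dict, Optional, Tuple


def build_run_instances_params(
    ami_id: str, subnet_id: str, sg_id: str, key_name: str
) -> Dict[str, Any]:
    """Assemble RunInstances arguments for one ephemeral staging server."""
    return {
        "ImageId": ami_id,
        "InstanceType": "t3.2xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "KeyName": key_name,
        "SubnetId": subnet_id,
        "SecurityGroupIds": [sg_id],
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1",  # assumption: root device of the AMI
                "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},
            }
        ],
    }


def launch_instance(ec2: Any, params: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Launch the instance, wait for the running state, return (id, public IP)."""
    instance_id = ec2.run_instances(**params)["Instances"][0]["InstanceId"]
    # The waiter polls DescribeInstances until the state is "running".
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    described = ec2.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    # PublicIpAddress is absent when the subnet does not assign public IPs.
    return instance_id, instance.get("PublicIpAddress")
```

Passing the client in, rather than creating it inside the function, keeps the sketch testable without AWS credentials.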

Security: The workflow opens port 22 on the security group for the GitHub Actions runner IP, and revokes it after completion (always, even on failure).
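
The open-then-always-revoke pattern can be expressed as a context manager. This is a sketch under assumptions: the real workflow does this in GitHub Actions steps, `ec2` is a boto3 EC2 client supplied by the caller, and `temporary_ssh_access` is a hypothetical name.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_ssh_access(ec2: Any, sg_id: str, runner_ip: str) -> Iterator[None]:
    """Allow SSH from a single runner IP for the duration of the block."""
    rule = {
        "GroupId": sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 22,
                "ToPort": 22,
                # /32 restricts the rule to exactly one address.
                "IpRanges": [{"CidrIp": f"{runner_ip}/32"}],
            }
        ],
    }
    ec2.authorize_security_group_ingress(**rule)
    try:
        yield
    finally:
        # Runs on success and on failure, so the port never stays open.
        ec2.revoke_security_group_ingress(**rule)
```

Anything run inside `with temporary_ssh_access(ec2, sg_id, ip):` gets SSH access, and the rule is removed even if that work raises.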

Step 3: Service Update (Ansible)

What: Ensures all services on the launched EC2 are running and configured correctly.

Tool: iblai infra service-update --host from iblai-infra-cli

Ansible Playbook (service_update_playbook.yml, 2 roles):

Role: ibl_cli_ops
  • Installs latest iblai-prod-images package from iblai/iblai-prod-images@main
  • This pins all container image versions and includes ibl-cli-ops
Role: ibl_service_update
  1. Restore postgres data dir ownership to uid 999 (fixes chown from pre-tasks)
  2. ECR login: authenticate Docker with AWS ECR (using the server's existing AWS credentials)
  3. Save platform config: ibl config save regenerates all compose files
  4. Save edX tutor config: ibl tutor config save
  5. Ensure edX running: ibl edx start -d
  6. Wait for LMS: curl localhost:8600/heartbeat (40 retries × 15s)
  7. Ensure DM containers running: docker compose up -d in background (avoids timeout on collectstatic)
  8. Wait for DM: curl localhost:8400 (60 retries × 15s = 15 min max for collectstatic)
  9. Run DM migrations: docker compose exec web ./manage.py migrate --noinput
  10. Restart SPAs: docker compose down; docker compose up -d for auth, mentor, skills (with auto-restart for Mentor empty reply)
  11. OAuth/OIDC integrations: ibl launch --ibl-oauth --ibl-oidc --ibl-edx-manager + ibl dm auth-setup
  12. Sync edX users: ibl edx sync-with-manager --users
  13. Sync SSO credentials: reads spa-sso and ibl_web client IDs from LMS database, writes to config, restarts Auth SPA
  14. Reload proxy + restart nginx
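
The two wait steps (LMS heartbeat and DM) are bounded retry loops. The playbook implements them with Ansible and curl; the Python below is only an equivalent sketch to make the retry arithmetic concrete. The names `http_ok` and `wait_until` are hypothetical, and the `http://` scheme in the usage note is an assumption.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, 4xx/5xx, or an empty reply.
        return False


def wait_until(
    probe: Callable[[], bool],
    retries: int,
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `retries` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(delay)
    return False
```

With these helpers, the LMS wait would read `wait_until(lambda: http_ok("http://localhost:8600/heartbeat"), retries=40, delay=15)`, roughly a 10-minute ceiling, and the DM wait would use `localhost:8400` with `retries=60`, roughly the 15-minute ceiling quoted in step 8.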

Step 4: Register in ALB Target Group

What: Deregisters any existing targets from the ALB target group, then registers the new EC2 instance.

Why deregister first: Prevents split-brain routing where the ALB sends some requests to an old instance with stale OAuth credentials.

Health check: ALB verifies the instance returns HTTP 200-399 on / before routing traffic.
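
A minimal sketch of the deregister-then-register sequence, assuming a boto3 `elbv2` client is passed in. The function name `swap_target` is hypothetical. The health check path and the accepted status codes are configured on the target group itself, so they do not appear in the code.

```python
from typing import Any, List


def swap_target(elbv2: Any, target_group_arn: str, new_instance_id: str) -> List[str]:
    """Replace every registered target with the new instance.

    Returns the IDs that were deregistered.
    """
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    old = [{"Id": d["Target"]["Id"]} for d in health["TargetHealthDescriptions"]]

    # Deregister first so the ALB never balances between old and new servers.
    if old:
        elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=old)

    new = [{"Id": new_instance_id}]
    elbv2.register_targets(TargetGroupArn=target_group_arn, Targets=new)

    # Block until the target group's health check reports the target healthy.
    elbv2.get_waiter("target_in_service").wait(
        TargetGroupArn=target_group_arn, Targets=new
    )
    return [t["Id"] for t in old]
```

All three ELB calls, plus the waiter (which polls `DescribeTargetHealth`), are covered by the IAM policy shown later in this document.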

Step 5: Run Playwright Tests (OCI)

What: Launches Docker containers on Oracle Cloud Infrastructure (OCI) Container Instances that run the Playwright test suite against the staging environment.

Test target: mentorai.stgX.iblai.org (via ALB → EC2)

Configuration:

  • Browsers: chrome, firefox, safari, edge (configurable; default: all 4 in parallel)
  • Workers: 3 per browser
  • Max wait: 5400s (90 minutes)
  • Retries: 2 per test

Test users: Each browser has its own dedicated test user to avoid conflicts:

  • Chrome: iblaiuserchromenew
  • Firefox: iblaiuserfirefoxnew
  • Safari: iblaiusersafarinew
  • Edge: iblaiuseredgenew
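
The configuration and the browser-to-user mapping can be modeled as one job description per browser. This is an illustrative sketch only: `BrowserJob` and `plan_jobs` are hypothetical names, and the actual test runner's inputs are not shown in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Dedicated test user per browser, as listed above.
TEST_USERS = {
    "chrome": "iblaiuserchromenew",
    "firefox": "iblaiuserfirefoxnew",
    "safari": "iblaiusersafarinew",
    "edge": "iblaiuseredgenew",
}


@dataclass(frozen=True)
class BrowserJob:
    browser: str
    username: str
    workers: int = 3              # workers per browser
    retries: int = 2              # retries per test
    max_wait_seconds: int = 5400  # 90 minutes


def plan_jobs(browsers: Iterable[str] = tuple(TEST_USERS)) -> List[BrowserJob]:
    """Return one job per requested browser; defaults to all four."""
    browsers = list(browsers)
    unknown = [b for b in browsers if b not in TEST_USERS]
    if unknown:
        raise ValueError(f"no test user configured for: {unknown}")
    return [BrowserJob(b, TEST_USERS[b]) for b in browsers]
```

Because each job carries its own username, two browsers never share a login, which is the conflict the dedicated users are meant to avoid.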

Results: Uploaded to S3 for resumption on subsequent runs.

Step 6: Terminate EC2

What: aws ec2 terminate-instances --instance-ids

When: Always runs, even if tests fail. The if: always() condition ensures cleanup.

What persists: VPC, ALB, Route53, S3 buckets; all are reused on the next launch.
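
In the workflow, cleanup is guaranteed by the `if: always()` step condition. The same guarantee written in Python, purely as an analogue, is a `finally` block. The sketch assumes a boto3 EC2 client passed in as `ec2`; `run_then_terminate` is a hypothetical name.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def run_then_terminate(ec2: Any, instance_id: str, run_tests: Callable[[], T]) -> T:
    """Run the tests, then terminate the instance whether or not they passed."""
    try:
        return run_tests()
    finally:
        # Cleanup runs even when run_tests() raises.
        ec2.terminate_instances(InstanceIds=[instance_id])
        # Optional: block until AWS reports the instance as terminated.
        ec2.get_waiter("instance_terminated").wait(InstanceIds=[instance_id])
```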

Timing

| Step | Duration |
|---|---|
| Build Playwright image | 2-5 min (cached: instant) |
| Launch EC2 | ~20s |
| SSH ready | ~45s |
| Service update (Ansible) | 20-40 min (DM collectstatic dominates) |
| ALB health check | ~30s |
| Playwright tests (4 browsers) | 15-90 min |
| Terminate | instant |
| Total | 40-90 min |

Repository Map

| Repository | Role |
|---|---|
| iblai-infra-cli | CLI tool with service-update command, Ansible playbooks, Terraform templates |
| iblai-web-ops | Reusable GitHub Actions workflows (OCI test runner, Docker builds, domain locking) |
| iblai-prod-images | Container image version pins (DM, edX, SPAs) |
| mentorai | SPA source code, Playwright tests, PR validation workflows |

Secrets & Variables

Variables (on mentorai repo)

| Variable | Example |
|---|---|
| STG1_AMI_ID | ami-02dff3992891505ba |
| STG1_SUBNET_ID | subnet-022ff062fe90b23b1 |
| STG1_SG_ID | sg-0d56a7433d4b2a364 |
| STG1_TG_ARN | arn:aws:elasticloadbalancing:... |
| STG1_KEY_PAIR | stg1-staging-key |

Repeat for STG2, STG3, STG4.
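
Because every environment uses the same five suffixes, a per-environment lookup can be derived from the environment name. The sketch below assumes the variables are exposed to the process as environment variables; `load_env_config` is a hypothetical helper, not part of any listed repository.

```python
import os
from typing import Dict, Mapping

ENVIRONMENTS = ("STG1", "STG2", "STG3", "STG4")
SUFFIXES = ("AMI_ID", "SUBNET_ID", "SG_ID", "TG_ARN", "KEY_PAIR")


def load_env_config(env_name: str, source: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the STGx_* variables for one environment into a dict."""
    prefix = env_name.upper()
    if prefix not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env_name}")
    names = {suffix: f"{prefix}_{suffix}" for suffix in SUFFIXES}
    missing = [name for name in names.values() if name not in source]
    if missing:
        # Fail early, before any AWS resource is created.
        raise KeyError(f"missing variables: {', '.join(missing)}")
    return {suffix.lower(): source[name] for suffix, name in names.items()}
```

For example, `load_env_config("stg1")["ami_id"]` would return the value of `STG1_AMI_ID`.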

Secrets

| Secret | Purpose |
|---|---|
| SERVICE_UPDATE_ACCESS_KEY | AWS IAM key for EC2 launch/terminate + SG rule management |
| SERVICE_UPDATE_SECRET_KEY | AWS IAM secret |
| STG1_SSH_KEY – STG4_SSH_KEY | SSH private keys for each stg environment |
| GIT_TOKEN | GitHub PAT for private repo access |
| SSH_PRIVATE_DEPLOY_OPS | SSH key for OCI/deployment operations |
| OCI secrets | Oracle Cloud credentials for container instances |
| S3 secrets | AWS credentials for test log storage |

IAM Policy (SERVICE_UPDATE keys)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances", "ec2:DescribeInstances", "ec2:DescribeImages",
        "ec2:CreateTags", "ec2:TerminateInstances",
        "ec2:AuthorizeSecurityGroupIngress", "ec2:RevokeSecurityGroupIngress"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    }
  ]
}

Known Behaviors

DM collectstatic (15-20 min cold boot)

The DM container entrypoint runs collectstatic --noinput before starting gunicorn. This takes 15-20 minutes on a fresh AMI boot at 100% CPU. The service-update flow uses docker compose up -d (idempotent, no recreate) to avoid triggering collectstatic unnecessarily.

Mentor SPA empty reply

Mentor SPA occasionally returns empty HTTP replies for 60-90s after startup despite reporting "Ready". The service-update role detects this and auto-restarts the container, with ignore_errors so the pipeline continues.

ALB split-brain routing

If old EC2 instances remain registered in the ALB target group, the ALB load-balances between old and new instances with different OAuth credentials, causing intermittent 409 auth errors. The pipeline deregisters all existing targets before registering the new instance.

OAuth credential sync

ibl config save regenerates auth.yml but doesn't preserve SSO credentials. The pipeline reads spa-sso and ibl_web client credentials directly from the LMS database and writes them to config before restarting the Auth SPA.

Creating New AMIs

When the platform or test data changes, create new AMIs:

  1. Launch a stg env from an existing AMI
  2. Make changes (add platforms, users, config)
  3. Verify all services healthy
  4. Create AMI from the EC2 instance
  5. Update STGx_AMI_ID variables on mentorai (and skillsai)
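
Step 4 can be scripted with boto3. The sketch below assumes an EC2 client passed in as `ec2`; `create_ami` is a hypothetical name. Note that `ec2:CreateImage` is not in the SERVICE_UPDATE policy shown above, so this would need different credentials.

```python
from typing import Any


def create_ami(ec2: Any, instance_id: str, name: str) -> str:
    """Snapshot a verified staging instance into a new AMI and return its ID."""
    # By default CreateImage reboots the instance so the filesystem is consistent.
    image_id = ec2.create_image(InstanceId=instance_id, Name=name)["ImageId"]
    # Wait until the image state is "available" before pointing the pipeline at it.
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```

The returned ID is the value to store in the matching STGx_AMI_ID variable in step 5.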

AMI requirements:

  • All containers must be in a startable state (they do not need to be running, because service-update handles startup)
  • S3 config must be baked in (ENABLE_S3_BUCKET_STORAGE=True, bucket names, region, credentials)
  • Test platforms and users must be pre-seeded
  • iblai-cli-ops virtualenv must exist with pyenv

Copyright © ibl.ai | support@iblai.zendesk.com