AMI-Based Launch Pipeline
Automated pipeline for launching isolated staging environments from pre-built AMIs, running E2E tests, and tearing down, all via GitHub Actions.
Overview
┌─────────────────────────────────────────────────────────────┐
│                   GitHub Actions Workflow                   │
│                                                             │
│  ┌──────────────┐    ┌──────────────┐                       │
│  │    Build     │    │  Launch EC2  │                       │
│  │  Playwright  │    │   from AMI   │  (parallel)           │
│  │ Image (OCIR) │    │  + Service   │                       │
│  └──────┬───────┘    │    Update    │                       │
│         │            └──────┬───────┘                       │
│         │                   │                               │
│         └─────────┬─────────┘                               │
│                   ▼                                         │
│           ┌───────────────┐                                 │
│           │ Run Playwright│  (OCI Container Instances       │
│           │     Tests     │   hit mentorai.stgX.iblai.org)  │
│           └───────┬───────┘                                 │
│                   ▼                                         │
│           ┌───────────────┐                                 │
│           │   Terminate   │                                 │
│           │ EC2 Instance  │                                 │
│           └───────────────┘                                 │
└─────────────────────────────────────────────────────────────┘
Architecture
Each staging environment (stg1 through stg4) has permanent AWS infrastructure:
| Resource | Purpose | Persists between launches |
|---|---|---|
| VPC + Subnets | Networking | Yes |
| ALB + Target Group | Load balancer with TLS termination | Yes |
| ACM Certificates | SSL for *.stgX.iblai.org | Yes |
| Route53 Records | DNS → ALB | Yes |
| Security Groups | Firewall rules | Yes |
| S3 Buckets | Media + static storage | Yes |
| EC2 Instance | Platform server | No (ephemeral) |
The EC2 instance is the only component created and destroyed per pipeline run. Everything else is pre-provisioned via Terraform and reused.
Pre-Built AMI Contents
Each AMI is a snapshot of a fully configured staging environment:
- OS: Ubuntu 22.04 with Docker, pyenv, Python 3.11.8, AWS CLI
- Platform CLI: iblai-cli-ops installed via iblai-prod-images
- Services (Docker containers):
  - iblai-dm-pro (Django, PostgreSQL, Redis, Celery, Langfuse, ClickHouse, MinIO)
  - iblai-edx-pro (LMS, CMS, MySQL, MongoDB, Redis, Elasticsearch, MFE)
  - Auth SPA, Mentor SPA, Skills SPA
  - Nginx reverse proxy
- Data: Test platforms, users, RBAC, analytics views pre-seeded
- Config: S3 buckets, AWS credentials, TimescaleDB enabled
Pipeline Steps (Detailed)
Step 1: Build Playwright Image
What: Builds a Docker image containing the Playwright test suite from the mentorai repo and pushes it to Oracle Cloud Container Registry (OCIR).
Where: GitHub Actions runner (ubuntu-latest) → OCIR
Image: iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright:{tag}
Contents: Playwright browsers (Chromium, Firefox, WebKit), test specs from e2e/journeys/, page objects, test utilities, AWS CLI for S3 log upload.
Caching: Checks whether an image with the same tag already exists in OCIR and skips the build if so.
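The tag check described above can be sketched as follows. The source does not say how the workflow performs the check; this sketch assumes `docker manifest inspect` (which exits non-zero when the tag is absent) and that the runner is already logged in to OCIR. The function names are illustrative, not taken from the workflow.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def image_ref(tag: str,
              repo: str = "iad.ocir.io/idcwyla5j5cr/ibl-mentor-playwright") -> str:
    """Full image reference for a given tag."""
    return f"{repo}:{tag}"


def build_if_missing(tag: str, context: str = ".", run: Runner = _run) -> bool:
    """Build and push only when the tag is absent. Returns True if a build ran."""
    ref = image_ref(tag)
    # Assumption: `docker manifest inspect` exits 0 when the registry has the tag.
    if run(["docker", "manifest", "inspect", ref]) == 0:
        return False
    for cmd in (["docker", "build", "-t", ref, context], ["docker", "push", ref]):
        if run(cmd) != 0:
            raise RuntimeError(f"command failed: {' '.join(cmd)}")
    return True
```

Passing `run` as a parameter keeps the decision logic testable without a Docker daemon.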
Step 2: Launch EC2 from AMI
What: Provisions a fresh EC2 instance from the pre-built AMI into the existing VPC/subnet/security group.
How (via boto3 in the iblai-infra-cli tool):
- ec2:RunInstances with the AMI ID, instance type (t3.2xlarge), 200 GB gp3 volume
- Wait for the instance to enter the running state
- Get the public IP address
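The three launch steps above could look roughly like this with boto3. This is a sketch, not the actual iblai-infra-cli code: the function names are hypothetical, the root device name `/dev/sda1` is an assumption (check the AMI's block device mapping), and a public IP is only returned if the subnet assigns one.

```python
from typing import Any, Optional, Tuple


def run_instances_params(ami_id: str, subnet_id: str,
                         sg_id: str, key_name: str) -> dict:
    """Request parameters for a single t3.2xlarge with a 200 GB gp3 root volume."""
    return {
        "ImageId": ami_id,
        "InstanceType": "t3.2xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "KeyName": key_name,
        "SubnetId": subnet_id,
        "SecurityGroupIds": [sg_id],
        "BlockDeviceMappings": [
            # Assumption: the AMI's root device is /dev/sda1.
            {"DeviceName": "/dev/sda1",
             "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"}}
        ],
    }


def launch_from_ami(ec2: Any, params: dict) -> Tuple[str, Optional[str]]:
    """Launch, wait for the running state, and return (instance_id, public_ip)."""
    instance_id = ec2.run_instances(**params)["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    desc = ec2.describe_instances(InstanceIds=[instance_id])
    instance = desc["Reservations"][0]["Instances"][0]
    return instance_id, instance.get("PublicIpAddress")
```

In real use, `ec2` would be `boto3.client("ec2")` and the arguments would come from the `STGx_*` repository variables.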
Security: The workflow opens port 22 on the security group for the GitHub Actions runner IP, and revokes it after completion (always, even on failure).
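One way to guarantee the revoke step runs even on failure is a context manager with `try`/`finally`. The sketch below uses the boto3 EC2 calls that correspond to the IAM actions listed later in this document; the helper names are hypothetical, and how the real workflow discovers the runner IP is not covered here.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List


def ssh_permission(runner_ip: str) -> List[dict]:
    """Ingress rule allowing TCP/22 from a single IPv4 address."""
    return [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": f"{runner_ip}/32"}],
    }]


@contextmanager
def temporary_ssh_access(ec2: Any, sg_id: str, runner_ip: str) -> Iterator[None]:
    """Open port 22 for the runner, then always revoke it on exit."""
    perms = ssh_permission(runner_ip)
    ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=perms)
    try:
        yield
    finally:
        # Runs on success, failure, or exception inside the with-block.
        ec2.revoke_security_group_ingress(GroupId=sg_id, IpPermissions=perms)
```

The Ansible run would sit inside `with temporary_ssh_access(ec2, sg_id, runner_ip):`, so the rule never outlives the job.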
Step 3: Service Update (Ansible)
What: Ensures all services on the launched EC2 are running and configured correctly.
Tool: iblai infra service-update --host from iblai-infra-cli
Ansible Playbook (service_update_playbook.yml, 2 roles):
Role: ibl_cli_ops
- Installs the latest iblai-prod-images package from iblai/iblai-prod-images@main
- This pins all container image versions and includes ibl-cli-ops
Role: ibl_service_update
- Restore postgres data dir ownership to uid 999 (fixes chown from pre-tasks)
- ECR login: authenticate Docker with AWS ECR (using the server's existing AWS creds)
- Save platform config: ibl config save regenerates all compose files
- Save edX tutor config: ibl tutor config save
- Ensure edX running: ibl edx start -d
- Wait for LMS: curl localhost:8600/heartbeat (40 retries × 15s)
- Ensure DM containers running: docker compose up -d in background (avoids timeout on collectstatic)
- Wait for DM: curl localhost:8400 (60 retries × 15s = 15 min max for collectstatic)
- Run DM migrations: docker compose exec web ./manage.py migrate --noinput
- Restart SPAs: docker compose down; docker compose up -d for auth, mentor, skills (with auto-restart for Mentor empty reply)
- OAuth/OIDC integrations: ibl launch --ibl-oauth --ibl-oidc --ibl-edx-manager + ibl dm auth-setup
- Sync edX users: ibl edx sync-with-manager --users
- Sync SSO credentials: reads spa-sso and ibl_web client IDs from the LMS database, writes to config, restarts Auth SPA
- Reload proxy + restart nginx
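The two "wait for" tasks in this role are bounded polling loops. The real implementation is an Ansible task; the Python sketch below only illustrates the retry arithmetic. Treating any 2xx/3xx response as ready is an assumption, since the source only says the playbook curls the endpoint.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """True when the URL answers with a 2xx/3xx status (assumed readiness rule)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, retries: int, delay: float = 15.0,
                     probe: Callable[[str], bool] = http_ok,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False
```

With the values from the list above, `wait_until_ready("http://localhost:8600/heartbeat", retries=40)` gives up after roughly 10 minutes, and `wait_until_ready("http://localhost:8400", retries=60)` after roughly 15 minutes.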
Step 4: Register in ALB Target Group
What: Deregisters any existing targets from the ALB target group, then registers the new EC2 instance.
Why deregister first: Prevents split-brain routing where the ALB sends some requests to an old instance with stale OAuth credentials.
Health check: ALB verifies the instance returns HTTP 200-399 on / before routing traffic.
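The deregister-then-register sequence could be sketched with the boto3 `elbv2` client as below. The function name is hypothetical; the three API calls match the `elasticloadbalancing:*` actions in the IAM policy later in this document, and the `target_in_service` waiter polls `DescribeTargetHealth` until the health check passes.

```python
from typing import Any, List


def swap_target(elbv2: Any, tg_arn: str, instance_id: str) -> List[str]:
    """Replace every registered target with `instance_id`; return the removed IDs."""
    health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
    old = [d["Target"] for d in health["TargetHealthDescriptions"]]
    if old:
        # Pass Id (and Port when present) so port-specific registrations match.
        elbv2.deregister_targets(
            TargetGroupArn=tg_arn,
            Targets=[{k: t[k] for k in ("Id", "Port") if k in t} for t in old],
        )
    new = [{"Id": instance_id}]
    elbv2.register_targets(TargetGroupArn=tg_arn, Targets=new)
    # Block until the ALB health check reports the new instance healthy.
    elbv2.get_waiter("target_in_service").wait(TargetGroupArn=tg_arn, Targets=new)
    return [t["Id"] for t in old]
```

Deregistering before registering means the group is briefly empty, which is acceptable here because the environment is not serving traffic until tests start.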
Step 5: Run Playwright Tests (OCI)
What: Launches Docker containers on Oracle Cloud Infrastructure (OCI) Container Instances that run the Playwright test suite against the staging environment.
Test target: mentorai.stgX.iblai.org (via ALB → EC2)
Configuration:
- Browsers: chrome, firefox, safari, edge (configurable, default: all 4 in parallel)
- Workers: 3 per browser
- Max wait: 5400s (90 minutes)
- Retries: 2 per test
Test users: Each browser has its own dedicated test user to avoid conflicts:
- Chrome: iblaiuserchromenew
- Firefox: iblaiuserfirefoxnew
- Safari: iblaiusersafarinew
- Edge: iblaiuseredgenew
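The configuration and per-browser users above amount to a small job plan, one entry per browser. The sketch below is illustrative only; the `BrowserJob` type and `plan_jobs` function are hypothetical names, not part of the actual workflow.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Dedicated user per browser, as listed above.
TEST_USERS = {
    "chrome": "iblaiuserchromenew",
    "firefox": "iblaiuserfirefoxnew",
    "safari": "iblaiusersafarinew",
    "edge": "iblaiuseredgenew",
}


@dataclass(frozen=True)
class BrowserJob:
    browser: str
    user: str
    workers: int = 3              # workers per browser
    retries: int = 2              # retries per test
    max_wait_seconds: int = 5400  # 90 minutes


def plan_jobs(browsers: Sequence[str] = tuple(TEST_USERS)) -> List[BrowserJob]:
    """One job per requested browser; the default is all four."""
    unknown = [b for b in browsers if b not in TEST_USERS]
    if unknown:
        raise ValueError(f"unknown browsers: {unknown}")
    return [BrowserJob(b, TEST_USERS[b]) for b in browsers]
```

Because each job carries its own user, the four browser runs can execute in parallel without sharing session state.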
Results: Uploaded to S3 for resumption on subsequent runs.
Step 6: Terminate EC2
What: aws ec2 terminate-instances --instance-ids
When: Always runs, even if tests fail. The if: always() condition ensures cleanup.
What persists: VPC, ALB, Route53, and S3 buckets are all reused on the next launch.
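The workflow gets its "always terminate" guarantee from `if: always()`. In plain Python the same guarantee is a `try`/`finally`, sketched below with a hypothetical `run_with_cleanup` helper and the boto3 `terminate_instances` call.

```python
from typing import Any, Callable


def run_with_cleanup(ec2: Any, instance_id: str,
                     run_tests: Callable[[], bool]) -> bool:
    """Run the tests, then terminate the instance whether or not they passed."""
    try:
        return run_tests()
    finally:
        # Executes on a normal return and on any exception from run_tests().
        ec2.terminate_instances(InstanceIds=[instance_id])
```

The test outcome (or exception) still propagates to the caller, so a failed run is reported as failed after cleanup.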
Timing
| Step | Duration |
|---|---|
| Build Playwright image | 2-5 min (cached: instant) |
| Launch EC2 | ~20s |
| SSH ready | ~45s |
| Service update (Ansible) | 20-40 min (DM collectstatic dominates) |
| ALB health check | ~30s |
| Playwright tests (4 browsers) | 15-90 min |
| Terminate | instant |
| Total | 40-90 min |
Repository Map
| Repository | Role |
|---|---|
| iblai-infra-cli | CLI tool with service-update command, Ansible playbooks, Terraform templates |
| iblai-web-ops | Reusable GitHub Actions workflows (OCI test runner, Docker builds, domain locking) |
| iblai-prod-images | Container image version pins (DM, edX, SPAs) |
| mentorai | SPA source code, Playwright tests, PR validation workflows |
Secrets & Variables
Variables (on mentorai repo)
| Variable | Example |
|---|---|
| STG1_AMI_ID | ami-02dff3992891505ba |
| STG1_SUBNET_ID | subnet-022ff062fe90b23b1 |
| STG1_SG_ID | sg-0d56a7433d4b2a364 |
| STG1_TG_ARN | arn:aws:elasticloadbalancing:... |
| STG1_KEY_PAIR | stg1-staging-key |
Repeat for STG2, STG3, STG4.
Secrets
| Secret | Purpose |
|---|---|
| SERVICE_UPDATE_ACCESS_KEY | AWS IAM key for EC2 launch/terminate + SG rule management |
| SERVICE_UPDATE_SECRET_KEY | AWS IAM secret |
| STG1_SSH_KEY through STG4_SSH_KEY | SSH private keys for each stg environment |
| GIT_TOKEN | GitHub PAT for private repo access |
| SSH_PRIVATE_DEPLOY_OPS | SSH key for OCI/deployment operations |
| OCI secrets | Oracle Cloud credentials for container instances |
| S3 secrets | AWS credentials for test log storage |
IAM Policy (SERVICE_UPDATE keys)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances", "ec2:DescribeInstances", "ec2:DescribeImages",
        "ec2:CreateTags", "ec2:TerminateInstances",
        "ec2:AuthorizeSecurityGroupIngress", "ec2:RevokeSecurityGroupIngress"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    }
  ]
}
Known Behaviors
DM collectstatic (15-20 min cold boot)
The DM container entrypoint runs collectstatic --noinput before starting gunicorn. This takes 15-20 minutes on a fresh AMI boot at 100% CPU. The service-update flow uses docker compose up -d (idempotent, no recreate) to avoid triggering collectstatic unnecessarily.
Mentor SPA empty reply
Mentor SPA occasionally returns empty HTTP replies for 60-90s after startup despite reporting "Ready". The service-update role detects this and auto-restarts the container, with ignore_errors so the pipeline continues.
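The detect-and-restart behavior might look like the following sketch. It is not the Ansible role's actual logic: the helper names are hypothetical, the SPA's host and port are left as parameters because the source does not state them, and "empty reply" is interpreted as the server closing the connection without sending a response, which `http.client` reports as `RemoteDisconnected`.

```python
import http.client
from typing import Callable


def replies_empty(host: str, port: int, timeout: float = 10.0) -> bool:
    """True when the server closes the connection without sending a response."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
        return False
    except http.client.RemoteDisconnected:
        return True
    finally:
        conn.close()


def restart_if_empty(probe: Callable[[], bool],
                     restart: Callable[[], None]) -> bool:
    """Restart when the probe sees an empty reply; never raise (like ignore_errors)."""
    try:
        if probe():
            restart()
            return True
    except Exception as exc:
        # Swallow errors so the surrounding pipeline continues.
        print(f"mentor restart check skipped: {exc}")
    return False
```

Here `restart` would wrap whatever restarts the Mentor container, for example a `docker compose` call run via `subprocess`.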
ALB split-brain routing
If old EC2 instances remain registered in the ALB target group, the ALB load-balances between old and new instances with different OAuth credentials, causing intermittent 409 auth errors. The pipeline deregisters all existing targets before registering the new instance.
OAuth credential sync
ibl config save regenerates auth.yml but doesn't preserve SSO credentials. The pipeline reads spa-sso and ibl_web client credentials directly from the LMS database and writes them to config before restarting the Auth SPA.
Creating New AMIs
When the platform or test data changes, create new AMIs:
- Launch a stg env from an existing AMI
- Make changes (add platforms, users, config)
- Verify all services healthy
- Create AMI from the EC2 instance
- Update STGx_AMI_ID variables on mentorai (and skillsai)
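The "create AMI" step can also be scripted. The sketch below uses boto3's `create_image` and the `image_available` waiter; the naming scheme and function names are hypothetical. Note that it needs the `ec2:CreateImage` permission, which is not part of the SERVICE_UPDATE policy shown earlier, so it assumes separate credentials.

```python
from datetime import datetime
from typing import Any


def ami_name(env: str, when: datetime) -> str:
    """Timestamped name (hypothetical scheme); AMI names must be unique per region."""
    return f"{env}-{when:%Y%m%d-%H%M%S}"


def create_ami(ec2: Any, instance_id: str, name: str) -> str:
    """Snapshot the instance into an AMI and wait until it is available."""
    image_id = ec2.create_image(InstanceId=instance_id, Name=name)["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    return image_id
```

The returned image ID is the value to store in the matching `STGx_AMI_ID` variable.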
AMI requirements:
- All containers must be in a startable state (they need not be running; service-update handles startup)
- S3 config must be baked in (ENABLE_S3_BUCKET_STORAGE=True, bucket names, region, credentials)
- Test platforms and users must be pre-seeded
- iblai-cli-ops virtualenv must exist with pyenv