[MAICE Dev Log 5] Zero-downtime deployment for heavy AI containers (Blue-Green with Jenkins)
0. 2026-03 update note
This post records the period when Blue-Green deployment was first introduced in MAICE backend.
Core ideas are still valid (zero-downtime, health checks, fast rollback), while current operations are more segmented:
- separate build and deploy nodes
- private registry with tag-based release history
- service-specific health endpoints for deployment judgment
Code examples below focus on principles.
1. Why deployment is harder for AI services
Classic web servers restart fast. AI stacks do not.
Model/runtime initialization (libraries, vector DB connections, warm-up) can take 30-60 seconds. If every deploy causes downtime, study flow is interrupted.
So zero-downtime deployment became a practical priority.
During the 3-week study period, measured uptime was 99.2% under experiment-scale traffic. Blue-Green was one contributing factor.
2. Blue-Green strategy
We operated two equivalent environments:
- Blue live, Green idle
- deploy new build to idle side
- wait for health checks to pass
- switch traffic at proxy layer
- keep previous side as fallback for rollback window
3. Jenkins pipeline automation
Deployment was automated via Jenkins Groovy scripts.
def activeColor = sh(script: "./scripts/get_active_color.sh", returnStdout: true).trim()
def targetColor = (activeColor == "blue") ? "green" : "blue"
sh "docker-compose -f docker-compose.${targetColor}.yml up -d"
sh "./scripts/health_check.sh ${targetColor}"
4. Nginx traffic switch
Traffic switch used lightweight upstream config replacement + reload.
echo "upstream backend { server <BACKEND_HOST>:${TARGET_PORT}; }" > nginx/conf.d/backend_upstream.new
mv nginx/conf.d/backend_upstream.new nginx/conf.d/backend_upstream.conf
nginx -s reload
Existing connections remain stable; new requests move to the new side.
5. Rollback design
AI behavior can regress unexpectedly after release.
With rollback_backend_blue_green.sh, we could revert to the previous color in roughly 10 seconds, because the previous container was intentionally kept alive for a short fallback window.
This allowed releases even during evening study hours with reduced operational risk.
💬 댓글
이 글에 대한 의견을 남겨주세요