ai

[MAICE Dev Log 5] Zero-downtime deployment for heavy AI containers (Blue-Green with Jenkins)

#agent

0. 2026-03 update note

This post records the period when Blue-Green deployment was first introduced in MAICE backend.

Core ideas are still valid (zero-downtime, health checks, fast rollback), while current operations are more segmented:

  • separate build and deploy nodes
  • private registry with tag-based release history
  • service-specific health endpoints for deployment judgment

Code examples below focus on principles.


1. Why deployment is harder for AI services

Classic web servers restart fast. AI stacks do not.

Model/runtime initialization (libraries, vector DB connections, warm-up) can take 30-60 seconds. If every deploy causes downtime, study flow is interrupted.

So zero-downtime deployment became a practical priority.

During the 3-week study period, measured uptime was 99.2% under experiment-scale traffic. Blue-Green was one contributing factor.


2. Blue-Green strategy

We operated two equivalent environments:

  1. Blue live, Green idle
  2. deploy new build to idle side
  3. wait for health checks to pass
  4. switch traffic at proxy layer
  5. keep previous side as fallback for rollback window

3. Jenkins pipeline automation

Deployment was automated via Jenkins Groovy scripts.

def activeColor = sh(script: "./scripts/get_active_color.sh", returnStdout: true).trim()
def targetColor = (activeColor == "blue") ? "green" : "blue"

sh "docker-compose -f docker-compose.${targetColor}.yml up -d"
sh "./scripts/health_check.sh ${targetColor}"

4. Nginx traffic switch

Traffic switch used lightweight upstream config replacement + reload.

echo "upstream backend { server <BACKEND_HOST>:${TARGET_PORT}; }" > nginx/conf.d/backend_upstream.new
mv nginx/conf.d/backend_upstream.new nginx/conf.d/backend_upstream.conf
nginx -s reload

Existing connections remain stable; new requests move to the new side.


5. Rollback design

AI behavior can regress unexpectedly after release.

With rollback_backend_blue_green.sh, we could revert to the previous color in roughly 10 seconds, because the previous container was intentionally kept alive for a short fallback window.

This allowed releases even during evening study hours with reduced operational risk.

💬 댓글

이 글에 대한 의견을 남겨주세요