From Local Training to Live API: A Minimal, Reliable CI/CD Workflow

Goal: Train your ML model on your local machine, package it as a Docker image, push it to GitHub Container Registry (GHCR), and let GitHub Actions deploy the new version to your server automatically. No cloud training required, no surprises.


Why this setup?

  • Local-first: Train where you have full control (and maybe a GPU).
  • Simple CI/CD: GitHub Actions only deploys (no heavy training in the cloud).
  • Fast rollouts & rollbacks: Version your releases with Docker tags and Git tags.
  • Portable: Runs on any VPS with Docker + Compose, or behind Nginx/Traefik.

Prerequisites

  • Local machine with Python 3.11+, Docker, Git.
  • A VPS/server with Docker and Docker Compose installed (port 8000 open or behind a reverse proxy).
  • A GitHub repo (public or private).
  • GHCR credentials (a Personal Access Token for local pushes; the deploy workflow uses GITHUB_TOKEN).

Project Structure

ml-api-template/
├── data/                     # sample/small data (big data stays out of git)
├── models/                   # local training output (e.g., model.pkl)
├── src/
│   ├── train.py              # trains and writes models/model.pkl + metrics
│   ├── inference.py          # loads model, predicts
│   └── app.py                # FastAPI app (/health, /predict)
├── tests/                    # optional: unit/smoke tests
├── requirements.txt
├── Dockerfile
├── docker-compose.yml        # runs on the VPS
├── Makefile                  # local helper commands
└── .github/workflows/
    └── deploy.yml            # deployment-only workflow

Core Code

src/train.py — minimal training + metrics

# src/train.py
import json, joblib
from pathlib import Path
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

def main():
    X, y = load_iris(return_X_y=True)  # replace with your data pipeline
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

    clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
    clf.fit(Xtr, ytr)
    acc = clf.score(Xte, yte)

    Path("models").mkdir(exist_ok=True)
    joblib.dump(clf, "models/model.pkl")
    Path("metrics").mkdir(exist_ok=True)
    json.dump({"accuracy": acc}, open("metrics/eval_metrics.json", "w"))
    print(f"[train] accuracy={acc:.4f}  -> models/model.pkl")

if __name__ == "__main__":
    main()

src/inference.py — inference wrapper

# src/inference.py
import joblib, numpy as np

class Predictor:
    def __init__(self, model_path: str):
        self.model = joblib.load(model_path)

    def predict(self, inputs):
        """
        inputs: list[list[float]], e.g., [[5.1, 3.5, 1.4, 0.2], ...]
        """
        X = np.array(inputs, dtype=float)
        return self.model.predict(X).tolist()

src/app.py — FastAPI service

# src/app.py
from fastapi import FastAPI
from pydantic import BaseModel
from src.inference import Predictor
import os

MODEL_PATH = os.getenv("MODEL_PATH", "models/model.pkl")
predictor = Predictor(MODEL_PATH)

app = FastAPI(title="ML Inference API")

class PredictRequest(BaseModel):
    inputs: list[list[float]]

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    return {"predictions": predictor.predict(req.inputs)}

requirements.txt

fastapi==0.115.0
uvicorn==0.30.6
scikit-learn==1.5.2
joblib==1.4.2
numpy==2.1.1
pydantic==2.9.2

Containerization & Compose

Dockerfile

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src ./src
# simplest path: bake your locally trained model into the image
COPY models ./models
ENV MODEL_PATH=/app/models/model.pkl
EXPOSE 8000
CMD ["uvicorn", "src.app:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml (on the VPS)

services:
  api:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:latest  # first manual run uses :latest; the workflow then pins a release tag
    ports: ["8000:8000"]
    environment:
      - MODEL_PATH=/app/models/model.pkl
    healthcheck:
      # python:3.11-slim ships without wget/curl, so probe with Python itself
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 10s
      timeout: 3s
      retries: 5
    restart: always

One-time GHCR Login (Local)

A classic PAT needs the write:packages scope to push images (GITHUB_TOKEN only exists inside Actions):

echo <YOUR_GITHUB_PAT_OR_TOKEN> | docker login ghcr.io -u <your_github_username> --password-stdin

Deployment Workflow (GitHub Actions)

.github/workflows/deploy.yml is deployment-only: it pulls the image you built locally and restarts the service on the VPS whenever you push a Git tag like v0.2.0.

name: Deploy (pull image & restart)
on:
  push:
    tags:
      - "v*.*.*"        # e.g., v0.2.0
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy via SSH to VPS
        uses: appleboy/ssh-action@v1.2.0
        with:
          host: ${{ secrets.VPS_HOST }}
          username: ${{ secrets.VPS_USER }}
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            set -e
            # the tag comes from github.ref_name, which Actions expands before the
            # script reaches the VPS; a plain $GITHUB_REF_NAME would be empty there
            IMAGE=ghcr.io/${{ github.repository }}/ml-api-template:${{ github.ref_name }}
            echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
            docker pull $IMAGE
            # swap the image tag in your compose file
            sed -i "s#ghcr.io/.*/ml-api-template:.*#$IMAGE#g" /opt/ml-api/docker-compose.yml
            cd /opt/ml-api
            docker compose up -d
            docker image prune -f

Add repo secrets (GitHub → Settings → Secrets and variables → Actions):

  • VPS_HOST: your server's IP or domain
  • VPS_USER: the SSH login user (root or not)
  • VPS_SSH_KEY: the private key contents (PEM)

No separate GHCR credential is needed: the workflow's default GITHUB_TOKEN can pull the image (for private repos/images, make sure its permissions match).

Local Release Flow (Your Daily Routine)

Makefile (recommended):

IMAGE=ghcr.io/<your-org>/<repo>/ml-api-template
TAG?=v0.1.0

train:
\tpython -m src.train

build:
\tdocker build -t $(IMAGE):$(TAG) -t $(IMAGE):latest .

push:
\tdocker push $(IMAGE):$(TAG)
\tdocker push $(IMAGE):latest

release: build push
\tgit add -A && git commit -m "release: $(TAG)" || true
\tgit tag $(TAG) || true
\tgit push origin main --tags

Usage:

make train TAG=v0.2.0        # train locally, outputs models/model.pkl
make release TAG=v0.2.0      # build & push image, tag repo → auto-deploy

(No Makefile? The same flow works by hand; the commands are spelled out below.)
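
# train
python -m src.train

# build the image
IMAGE=ghcr.io/<your-org>/<repo>/ml-api-template
TAG=v0.1.0
docker build -t $IMAGE:$TAG -t $IMAGE:latest .

# push the image
docker push $IMAGE:$TAG
docker push $IMAGE:latest

# commit, tag, and push (triggers the deployment)
git add -A && git commit -m "release: $TAG"
git tag $TAG
git push origin main --tags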


First-time Setup on the VPS

# copy your docker-compose.yml to the server
# path used in the workflow: /opt/ml-api/docker-compose.yml
cd /opt/ml-api
docker compose up -d
curl http://localhost:8000/health   # {"status":"ok"}

Open port 8000 or place behind Nginx/Traefik.
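
If you go the reverse-proxy route, a minimal Nginx sketch might look like this (api.example.com is a placeholder; TLS is left out for brevity):

server {
    listen 80;
    server_name api.example.com;   # placeholder domain

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}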


Test the API

curl -X POST http://<server>:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1,3.5,1.4,0.2],[6.2,2.8,4.8,1.8]]}'
# {"predictions":[0,2]}  # with the iris example

Rollbacks & Gradual Releases

  • Blue/Green or Canary (optional):
    • Compose: run api_blue and api_green side by side and switch traffic at the reverse proxy (sketch below).
    • K3s/Kubernetes: use Deployment rolling updates.
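
A rough Compose sketch of the blue/green idea (tags and host ports are illustrative; the reverse proxy decides which service receives traffic):

services:
  api_blue:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:v0.1.0
    ports: ["8001:8000"]    # proxy upstream A
  api_green:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:v0.2.0
    ports: ["8002:8000"]    # proxy upstream B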

Rollback: redeploy an older build by pushing a tag whose image already exists in GHCR:

git tag v0.1.1
git push origin v0.1.1
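
If no image exists for v0.1.1 yet, one option (assuming v0.1.0 was the last good release) is to retag the known-good image in GHCR before pushing the Git tag:

IMAGE=ghcr.io/<your-org>/<repo>/ml-api-template
docker pull $IMAGE:v0.1.0
docker tag $IMAGE:v0.1.0 $IMAGE:v0.1.1
docker push $IMAGE:v0.1.1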

Optional: Don’t Bake the Model into the Image

A. Volume mount

services:
  api:
    image: ghcr.io/<org>/<repo>/ml-api-template:latest
    volumes:
      - /opt/ml-api/models:/app/models
    environment:
      - MODEL_PATH=/app/models/model.pkl

Update the model by dropping a new model.pkl into /opt/ml-api/models, then docker compose restart api.
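
From your local machine, the update could look like this (user and host are placeholders):

scp models/model.pkl <user>@<server>:/opt/ml-api/models/model.pkl
ssh <user>@<server> "cd /opt/ml-api && docker compose restart api"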

B. Fetch at startup

  • Add scripts/fetch_model.py (download from NAS/S3/HTTP).

In Dockerfile:

CMD python scripts/fetch_model.py && uvicorn src.app:app --host 0.0.0.0 --port 8000
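
A minimal fetch_model.py sketch, assuming the model is reachable over plain HTTP via a MODEL_URL environment variable (both the variable and the URL scheme are assumptions; swap in boto3 or your NAS client as needed):

# scripts/fetch_model.py  (hypothetical startup fetcher; MODEL_URL is an assumed env var)
import os
import urllib.request

MODEL_URL = os.environ["MODEL_URL"]                       # e.g., https://example.com/models/model.pkl
MODEL_PATH = os.getenv("MODEL_PATH", "models/model.pkl")

# make sure the target directory exists, then download the model
os.makedirs(os.path.dirname(MODEL_PATH) or ".", exist_ok=True)
urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
print(f"[fetch_model] {MODEL_URL} -> {MODEL_PATH}")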

Security & Ops Tips

  • Keep all credentials in GitHub Actions Secrets.
  • Lock down server ports (UFW/nftables) and expose only what you need.
  • Add logging/metrics (FastAPI middleware, OpenTelemetry, Promtail, etc.); see the middleware sketch after this list.
  • For private repos/images, verify GITHUB_TOKEN/PAT permissions.
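
A minimal request-logging middleware sketch for src/app.py (logger name and format are up to you):

# add to src/app.py
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ml-api")

@app.middleware("http")
async def log_requests(request, call_next):
    # time each request and log method, path, status, and latency
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("%s %s -> %d (%.1f ms)", request.method, request.url.path,
                response.status_code, elapsed_ms)
    return response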

Troubleshooting

  • Image won’t pull: verify GHCR login and the exact image name/tag. Try docker pull ghcr.io/...:<tag> on the VPS.
  • Container restarting: docker logs <container>; check model path, dependencies, /health response.
  • Port unreachable: firewall/security group/reverse proxy configuration.
  • Tag didn’t trigger: ensure you pushed the tag to origin and your workflow on.push.tags pattern matches.
  • Private image access: confirm the token used by the workflow can pull from GHCR.

Replace These Placeholders

  • <your-org> — your GitHub org/username
  • <repo> — your repository name
  • <server> — your server’s domain or IP

TL;DR

Train locally → build a Docker image (with or without the model inside) → push to GHCR → push a Git tag → GitHub Actions SSHes into your server, pulls the image, and docker compose up -d → your API is live.
