From Local Training to Live API: A Minimal, Reliable CI/CD Workflow
Goal: Train your ML model on your local machine, package it as a Docker image, push it to GitHub Container Registry (GHCR), and let GitHub Actions deploy the new version to your server automatically. No cloud training required, no surprises.
Why this setup?
- Local-first: Train where you have full control (and maybe a GPU).
- Simple CI/CD: GitHub Actions only deploys (no heavy training in the cloud).
- Fast rollouts & rollbacks: Version your releases with Docker tags and Git tags.
- Portable: Runs on any VPS with Docker + Compose, or behind Nginx/Traefik.
Prerequisites
- Local machine with Python 3.11+, Docker, Git.
- A VPS/server with Docker and Docker Compose installed (port 8000 open or behind a reverse proxy).
- A GitHub repo (public or private).
- GHCR login (via GITHUB_TOKEN or a Personal Access Token with the write:packages scope).
Project Structure
ml-api-template/
├── data/                  # sample/small data (big data stays out of git)
├── models/                # local training output (e.g., model.pkl)
├── src/
│   ├── train.py           # trains and writes models/model.pkl + metrics
│   ├── inference.py       # loads model, predicts
│   └── app.py             # FastAPI app (/health, /predict)
├── tests/                 # optional: unit/smoke tests
├── requirements.txt
├── Dockerfile
├── docker-compose.yml     # runs on the VPS
├── Makefile               # local helper commands
└── .github/workflows/
    └── deploy.yml         # deployment-only workflow
Core Code
src/train.py — minimal training + metrics
# src/train.py
import json
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def main():
    X, y = load_iris(return_X_y=True)  # replace with your data pipeline
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
    clf.fit(Xtr, ytr)
    acc = clf.score(Xte, yte)

    Path("models").mkdir(exist_ok=True)
    joblib.dump(clf, "models/model.pkl")

    Path("metrics").mkdir(exist_ok=True)
    with open("metrics/eval_metrics.json", "w") as f:
        json.dump({"accuracy": acc}, f)

    print(f"[train] accuracy={acc:.4f} -> models/model.pkl")

if __name__ == "__main__":
    main()
src/inference.py — inference wrapper
# src/inference.py
import joblib
import numpy as np

class Predictor:
    def __init__(self, model_path: str):
        self.model = joblib.load(model_path)

    def predict(self, inputs):
        """
        inputs: list[list[float]], e.g., [[5.1, 3.5, 1.4, 0.2], ...]
        """
        X = np.array(inputs, dtype=float)
        return self.model.predict(X).tolist()
src/app.py — FastAPI service
# src/app.py
import os

from fastapi import FastAPI
from pydantic import BaseModel

from src.inference import Predictor

MODEL_PATH = os.getenv("MODEL_PATH", "models/model.pkl")
predictor = Predictor(MODEL_PATH)
app = FastAPI(title="ML Inference API")

class PredictRequest(BaseModel):
    inputs: list[list[float]]

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    return {"predictions": predictor.predict(req.inputs)}
requirements.txt
fastapi==0.115.0
uvicorn==0.30.6
scikit-learn==1.5.2
joblib==1.4.2
numpy==2.1.1
pydantic==2.9.2
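Before containerizing, a quick local sanity check (a typical sequence; adapt the venv activation to your shell):

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train              # writes models/model.pkl + metrics
uvicorn src.app:app --reload     # http://127.0.0.1:8000/health should return {"status":"ok"}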
Containerization & Compose
Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src ./src
# simplest path: bake your locally trained model into the image
COPY models ./models
ENV MODEL_PATH=/app/models/model.pkl
EXPOSE 8000
CMD ["uvicorn", "src.app:app", "--host", "0.0.0.0", "--port", "8000"]
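A .dockerignore is not shown in the tree above but is worth adding so data and VCS files stay out of the build context (keep models/ off this list, since the Dockerfile copies it):

# .dockerignore (suggested)
.git
data/
tests/
__pycache__/
*.pyc
.venv/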
docker-compose.yml (on the VPS)
services:
  api:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:latest  # first run uses latest
    ports: ["8000:8000"]
    environment:
      - MODEL_PATH=/app/models/model.pkl
    healthcheck:
      # python:3.11-slim ships neither wget nor curl, so probe with Python itself
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 10s
      timeout: 3s
      retries: 5
    restart: always
One-time GHCR Login (Local)
echo <YOUR_GITHUB_PAT_OR_TOKEN> | docker login ghcr.io -u <your_github_username> --password-stdin
Deployment Workflow (GitHub Actions)
.github/workflows/deploy.yml — deployment-only: pull the image you built locally and restart the service on the VPS whenever you push a Git tag like v0.2.0.
name: Deploy (pull image & restart)
on:
push:
tags:
- "v*.*.*" # e.g., v0.2.0
workflow_dispatch:
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Deploy via SSH to VPS
uses: appleboy/ssh-action@v1.2.0
with:
host: ${{ secrets.VPS_HOST }}
username: ${{ secrets.VPS_USER }}
key: ${{ secrets.VPS_SSH_KEY }}
script: |
set -e
IMAGE=ghcr.io/${{ github.repository }}/ml-api-template:${GITHUB_REF_NAME}
docker login ghcr.io -u ${{ github.actor }} -p ${{ secrets.GITHUB_TOKEN }}
docker pull $IMAGE
# swap the image tag in your compose file
sed -i "s#ghcr.io/.*/ml-api-template:.*#$IMAGE#g" /opt/ml-api/docker-compose.yml
cd /opt/ml-api
docker compose up -d
docker image prune -f
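If your repository restricts the default GITHUB_TOKEN permissions, grant the deploy job package read access explicitly (goes under jobs.deploy):

permissions:
  contents: read
  packages: read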
Add repo secrets (GitHub → Settings → Secrets and variables → Actions):
- VPS_HOST — your server's IP or domain
- VPS_USER — the SSH login user
- VPS_SSH_KEY — the private key contents (PEM)
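If you don't have a dedicated deploy key yet, one way to create and install one (the key filename is arbitrary):

ssh-keygen -t ed25519 -f ~/.ssh/deploy_key -C "gh-actions-deploy"
ssh-copy-id -i ~/.ssh/deploy_key.pub <your_user>@<server>
# paste the contents of the *private* key (~/.ssh/deploy_key) into VPS_SSH_KEY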
Local Release Flow (Your Daily Routine)
Makefile (recommended)
# note: recipe lines must be indented with a real TAB character
IMAGE=ghcr.io/<your-org>/<repo>/ml-api-template
TAG?=v0.1.0

train:
	python -m src.train

build:
	docker build -t $(IMAGE):$(TAG) -t $(IMAGE):latest .

push:
	docker push $(IMAGE):$(TAG)
	docker push $(IMAGE):latest

release: build push
	git add -A && git commit -m "release: $(TAG)" || true
	git tag $(TAG) || true
	git push origin main --tags
Usage:
make train                 # train locally, outputs models/model.pkl (TAG not needed here)
make release TAG=v0.2.0    # build & push image, tag repo → auto-deploy
(No Makefile? Do it manually: python -m src.train → docker build → docker push → git tag && git push --tags.)
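Spelled out, those manual steps are:

# train
python -m src.train
# build the image
IMAGE=ghcr.io/<your-org>/<repo>/ml-api-template
TAG=v0.1.0
docker build -t $IMAGE:$TAG -t $IMAGE:latest .
# push the image
docker push $IMAGE:$TAG
docker push $IMAGE:latest
# commit and tag (pushing the tag triggers the deploy)
git add -A && git commit -m "release: $TAG"
git tag $TAG
git push origin main --tags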
First-time Setup on the VPS
# copy your docker-compose.yml to the server
# path used in the workflow: /opt/ml-api/docker-compose.yml
cd /opt/ml-api
docker compose up -d
curl http://localhost:8000/health # {"status":"ok"}
Open port 8000 or place behind Nginx/Traefik.
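For Nginx, a minimal reverse-proxy block might look like this (a sketch; the server name is a placeholder):

# /etc/nginx/sites-available/ml-api — sketch
server {
    listen 80;
    server_name api.example.com;   # placeholder

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}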
Test the API
curl -X POST http://<server>:8000/predict \
-H "Content-Type: application/json" \
-d '{"inputs": [[5.1,3.5,1.4,0.2],[6.2,2.8,4.8,1.8]]}'
# {"predictions":[0,2]} # with the iris example
Rollbacks & Gradual Releases
- Blue/Green or Canary (optional):
  - Compose: run api_blue and api_green, switch via reverse proxy (see the sketch below).
  - K3s/Kubernetes: use Deployment rolling updates.
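A rough blue/green sketch in Compose (hypothetical service names and host ports; your reverse proxy decides which one receives traffic):

services:
  api_blue:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:v0.1.0
    ports: ["8001:8000"]
  api_green:
    image: ghcr.io/<your-org>/<repo>/ml-api-template:v0.2.0
    ports: ["8002:8000"]
# point Nginx/Traefik at 8001 (blue) or 8002 (green) and flip the upstream to cut over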
Rollback: redeploy an older tag (ensure the image still exists in GHCR). Pushing a tag that already exists on the remote is a no-op, so delete and re-push it to re-trigger the workflow:
git push origin :refs/tags/v0.1.1   # remove the remote tag
git push origin v0.1.1              # re-push → re-deploys v0.1.1
Optional: Don’t Bake the Model into the Image
A. Volume mount
services:
  api:
    image: ghcr.io/<org>/<repo>/ml-api-template:latest
    volumes:
      - /opt/ml-api/models:/app/models   # mount the host directory over the baked-in path
    environment:
      - MODEL_PATH=/app/models/model.pkl
Update the model by dropping a new model.pkl into /opt/ml-api/models, then docker compose restart api.
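For example (same paths as above):

scp models/model.pkl <your_user>@<server>:/opt/ml-api/models/model.pkl
ssh <your_user>@<server> "cd /opt/ml-api && docker compose restart api"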
B. Fetch at startup
- Add scripts/fetch_model.py (download from NAS/S3/HTTP).
- In the Dockerfile, change CMD to:
CMD python scripts/fetch_model.py && uvicorn src.app:app --host 0.0.0.0 --port 8000
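A sketch of such a script, assuming the model is served over plain HTTP and the URL comes from a hypothetical MODEL_URL environment variable:

# scripts/fetch_model.py — sketch; MODEL_URL is an assumed env var
import os
import urllib.request
from pathlib import Path

def main():
    url = os.environ["MODEL_URL"]  # e.g., https://nas.example/models/model.pkl
    dest = Path(os.getenv("MODEL_PATH", "models/model.pkl"))
    dest.parent.mkdir(parents=True, exist_ok=True)
    print(f"[fetch_model] {url} -> {dest}")
    urllib.request.urlretrieve(url, dest)

if __name__ == "__main__":
    main()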
Security & Ops Tips
- Keep all credentials in GitHub Actions Secrets.
- Lock down server ports (UFW/nftables) and expose only what you need.
- Add logging/metrics (FastAPI middleware, OpenTelemetry, Promtail, etc.).
- For private repos/images, verify GITHUB_TOKEN/PAT permissions.
Troubleshooting
- Image won’t pull: verify GHCR login and the exact image name/tag; try docker pull ghcr.io/...:<tag> on the VPS.
- Container restarting: docker logs <container>; check model path, dependencies, /health response.
- Port unreachable: firewall/security group/reverse proxy configuration.
- Tag didn’t trigger: ensure you pushed the tag to origin and your workflow on.push.tags pattern matches.
- Private image access: confirm the token used by the workflow can pull from GHCR.
Replace These Placeholders
<your-org>— your GitHub org/username<repo>— your repository name<server>— your server’s domain or IP
TL;DR
Train locally → build a Docker image (with or without the model inside) → push to GHCR → push a Git tag → GitHub Actions SSHes into your server, pulls the image, and runs docker compose up -d → your API is live.