## TL;DR
AWS ECR tokens expire every 12 hours, and every time the cron job that refreshes the pull secret fails, image pulls break cluster-wide. Meanwhile, CI builds that pull `nginx:alpine` and `python:3.12-slim` started hitting Docker Hub’s anonymous rate limit (100 pulls per 6 hours). I replaced both with self-hosted Harbor for container images and Gitea for package registries (PyPI, npm), backed by NFS on the NAS, deployed via Ansible and Helm, with Trivy vulnerability scanning on push. Thirteen CI workflows were updated in a single commit. Pull secrets never expire, images never rate-limit, and the monthly ECR cost drops to zero.
## The Problem with ECR
ECR worked for over a month without incident. Then three things happened in the same week:
1. **Token cron failed silently.** The Kubernetes CronJob that runs `aws ecr get-login-password` and patches the `ecr-pull-secret` hit a transient AWS API error. It did not retry. The secret expired. Pods that restarted after that point got `ImagePullBackOff`. The cluster looked healthy until a node reboot forced every pod to re-pull, and half of them failed.
2. **Docker Hub rate limit during CI.** Self-hosted ARC runners pull base images from Docker Hub during builds. With 8 runners building in parallel, the anonymous rate limit (100 pulls per 6 hours per IP) was exhausted by mid-afternoon. Builds started failing with `toomanyrequests: You have reached your pull rate limit`.
3. **ECR cost creep.** ECR charges per GB stored and per GB transferred. With 13 service images averaging 200 MB each, plus multiple tagged versions, the monthly bill was small but non-zero and growing. For a homelab, any recurring AWS cost that can be eliminated should be.
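For context on the first failure, the token-refresh machinery being eliminated looked roughly like the sketch below. The secret name `ecr-pull-secret` is from the post; the schedule, image, region, and registry variable are assumptions, and the RBAC wiring is elided:

```yaml
# Sketch of the fragile ECR token-refresh CronJob (schedule, image, and
# region are assumptions; the secret name ecr-pull-secret is from the post).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-token-refresh
spec:
  schedule: "0 */8 * * *"        # refresh well inside the 12-hour expiry
  jobTemplate:
    spec:
      backoffLimit: 0            # the failure mode: one transient error, no retry
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: refresh
              image: refresh-tool:latest   # assumed image with aws-cli + kubectl
              command:
                - /bin/sh
                - -c
                - |
                  TOKEN=$(aws ecr get-login-password --region us-east-1)
                  kubectl create secret docker-registry ecr-pull-secret \
                    --docker-server="$ECR_REGISTRY" \
                    --docker-username=AWS \
                    --docker-password="$TOKEN" \
                    --dry-run=client -o yaml | kubectl apply -f -
```

With `backoffLimit: 0` and no alerting on Job failure, a single flaky AWS API call silently leaves an expired secret behind, which is exactly the outage described above.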
The Docker Hub problem had an immediate fix: mirror `nginx:alpine` to ECR and reference the ECR image in Dockerfiles. But that just moved the dependency from one external registry to another.
## Why Harbor
Harbor is a CNCF-graduated container registry. It does four things that ECR does not:
| Feature | ECR | Harbor |
|---|---|---|
| Token expiry | 12 hours | Never (static credentials) |
| Vulnerability scanning | Separate service (Inspector) | Built-in Trivy |
| Image promotion | Manual re-tag | Project-based staging → production |
| Cost | Per-GB storage + transfer | Zero (self-hosted) |
Harbor also supports replication policies, robot accounts for CI, and a web UI for browsing repositories. But the four features above are the ones that justified the migration.
## Deployment
Harbor runs on the k3s cluster itself, deployed via Ansible + Helm. The Ansible playbook handles the Helm repo, namespace, and values override. Storage is NFS-backed via the NAS DXP4800.
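The Ansible side can be sketched as a pair of tasks using the `kubernetes.core` collection. The module names are real; the values file path and namespace are assumptions:

```yaml
# Hypothetical Ansible tasks deploying Harbor via Helm
# (values file path and namespace are assumptions).
- name: Add the Harbor Helm repository
  kubernetes.core.helm_repository:
    name: harbor
    repo_url: https://helm.goharbor.io

- name: Deploy Harbor with the values override
  kubernetes.core.helm:
    name: harbor
    chart_ref: harbor/harbor
    release_namespace: harbor
    create_namespace: true
    values_files:
      - files/harbor-values.yaml
```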
The Helm values configure:
- **Ingress**: Traefik with TLS via cert-manager (`harbor.k3s.internal.strommen.systems`)
- **Storage**: NFS PVCs for registry blobs, the database, Redis, and the Trivy cache
- **Trivy**: enabled on push – every image is scanned before it lands in a project
- **Projects**: `staging` for CI pushes, `production` for deployed images
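A values override matching that description might look like the following. The hostname is from the post; the StorageClass name, issuer name, TLS secret name, and sizes are assumptions:

```yaml
# Sketch of the Harbor Helm values (StorageClass, issuer, and sizes
# are assumptions; the hostname is from the post).
expose:
  type: ingress
  tls:
    certSource: secret
    secret:
      secretName: harbor-tls            # assumed cert-manager-managed secret
  ingress:
    hosts:
      core: harbor.k3s.internal.strommen.systems
    annotations:
      cert-manager.io/cluster-issuer: internal-issuer   # assumed issuer name
externalURL: https://harbor.k3s.internal.strommen.systems
persistence:
  persistentVolumeClaim:
    registry:
      storageClass: nfs-client          # assumed NFS StorageClass name
      size: 100Gi
    database:
      storageClass: nfs-client
    trivy:
      storageClass: nfs-client
trivy:
  enabled: true
```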
Robot accounts replace IAM users. Each CI workflow gets a robot account scoped to push to the `staging` project only. Promotion to `production` happens via a `workflow_dispatch` GitHub Action that re-tags and pushes.
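A promotion workflow along these lines covers the re-tag-and-push step. The registry hostname and project names are from the post; the workflow name, input name, and secret names are assumptions:

```yaml
# Hypothetical promotion workflow: re-tag staging -> production on demand.
name: Promote image to production
on:
  workflow_dispatch:
    inputs:
      image:
        description: "Image name and tag, e.g. myservice:1.2.3"
        required: true
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - name: Log in to Harbor
        run: |
          echo "${{ secrets.HARBOR_ROBOT_TOKEN }}" | docker login \
            harbor.k3s.internal.strommen.systems \
            -u "${{ secrets.HARBOR_ROBOT_USER }}" --password-stdin
      - name: Re-tag and push
        run: |
          SRC=harbor.k3s.internal.strommen.systems/staging/${{ inputs.image }}
          DST=harbor.k3s.internal.strommen.systems/production/${{ inputs.image }}
          docker pull "$SRC"
          docker tag "$SRC" "$DST"
          docker push "$DST"
```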
## Gitea as Package Registry
Container images are only half the artifact story. Python packages (for services like OpenClaw and Polymarket Lab) and npm packages need a registry too. PyPI.org and npmjs.com work, but publishing private packages to public registries is not an option.
Gitea already supports PyPI, npm, Maven, Go, and generic package registries. It runs as a lightweight deployment alongside Harbor, also on NFS storage. The org is `k3s-homelab`. Publishing is a `twine upload` or `npm publish` with a Gitea token.
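A publish step reduces to pointing twine at Gitea's PyPI endpoint. The `/api/packages/<owner>/pypi` path is Gitea's standard package-registry convention and `k3s-homelab` is from the post; the Gitea hostname and secret names are assumptions:

```yaml
# Hypothetical CI step publishing a Python package to the Gitea PyPI registry
# (hostname and secret names are assumptions).
- name: Publish to Gitea PyPI
  env:
    TWINE_USERNAME: ${{ secrets.GITEA_USER }}
    TWINE_PASSWORD: ${{ secrets.GITEA_TOKEN }}
  run: |
    twine upload \
      --repository-url https://gitea.k3s.internal.strommen.systems/api/packages/k3s-homelab/pypi \
      dist/*
```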
## CI Workflow Changes
All 13 CI workflows were updated to dual-push: build once, push to both ECR (legacy, still referenced by some running pods) and Harbor `staging`. The migration is gradual – as manifests are updated to reference `harbor.k3s.internal.strommen.systems/production/<image>`, the ECR push can be removed.
The key workflow change:
```yaml
- name: Push to Harbor staging
  run: |
    docker tag $IMAGE harbor.k3s.internal.strommen.systems/staging/$IMAGE
    docker push harbor.k3s.internal.strommen.systems/staging/$IMAGE
```
Pull secrets in Kubernetes switched from `ecr-pull-secret` (12-hour token, CronJob-refreshed) to `harbor-pull-secret` (static credential, no expiry, no CronJob). This single change eliminated the most fragile piece of infrastructure in the cluster.
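The static secret is a one-time manifest rather than a refreshed token. Harbor project-level robot accounts follow the `robot$<project>+<name>` naming convention; the namespace and the credential values below are placeholders:

```yaml
# Static Harbor pull secret: created once, never rotated by a CronJob.
# Robot account name, namespace, and credential values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: harbor-pull-secret
  namespace: default              # create in each namespace that pulls images
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
      "auths": {
        "harbor.k3s.internal.strommen.systems": {
          "username": "robot$production+pull",
          "password": "<robot-account-token>",
          "auth": "<base64 of username:password>"
        }
      }
    }
```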
## Monitoring
A Grafana dashboard tracks Harbor health: registry blob storage usage, push/pull rates, Trivy scan results, and project sizes. PrometheusRule alerts fire if the registry becomes unreachable or if storage exceeds 80% of the NFS quota. ServiceMonitors scrape Harbor’s built-in metrics endpoint.
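The two alerts can be sketched as a PrometheusRule. The 80% threshold is from the post; the rule names, job label, and exact PromQL expressions are assumptions:

```yaml
# Sketch of the Harbor alerting rules (rule names, job label, and exact
# expressions are assumptions; the 80% threshold is from the post).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: harbor-alerts
  namespace: harbor
spec:
  groups:
    - name: harbor
      rules:
        - alert: HarborRegistryDown
          expr: up{job="harbor-exporter"} == 0
          for: 5m
          labels:
            severity: critical
        - alert: HarborStorageNearQuota
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"harbor-.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"harbor-.*"} > 0.8
          for: 15m
          labels:
            severity: warning
```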
## The Docker Hub Problem
The Docker Hub rate limit fix landed in the same session. For base images like `nginx:alpine` used in multi-stage Dockerfile builds, the solution was to mirror them to ECR (and eventually Harbor) and hardcode the registry in the Dockerfile:
```dockerfile
# Before (rate-limited)
FROM nginx:alpine

# After (self-hosted)
FROM harbor.k3s.internal.strommen.systems/production/nginx:alpine
```
This is a one-time change per Dockerfile. CI workflows log into Docker Hub with a PAT for authenticated pulls (200 pulls/6 hours instead of 100), but the Dockerfiles no longer depend on it at all.
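The mirroring itself is a pull-tag-push that fits in a small CI step. The authenticated Docker Hub pull and the `production` project are from the post; the step and secret names are a sketch:

```yaml
# Hypothetical mirror step: authenticated Docker Hub pull, push to Harbor
# (secret names are assumptions).
- name: Mirror nginx:alpine into Harbor
  run: |
    echo "${{ secrets.DOCKERHUB_PAT }}" | docker login \
      -u "${{ secrets.DOCKERHUB_USER }}" --password-stdin
    docker pull nginx:alpine
    docker tag nginx:alpine harbor.k3s.internal.strommen.systems/production/nginx:alpine
    docker push harbor.k3s.internal.strommen.systems/production/nginx:alpine
```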
## Migration Status
Harbor and Gitea are deployed and serving traffic. All 13 CI workflows push to Harbor staging. The promotion workflow is operational. ECR still exists and will be decommissioned after all Kubernetes manifests are updated to reference Harbor images. The ECR token refresh CronJob is still running as a safety net but is no longer critical path.
The goal is full ECR decommission within two weeks. At that point, the only AWS dependency for container infrastructure will be the S3 bucket backing Terraform state – and that one is worth keeping.