TL;DR

A year ago my stack was the usual mix — GitHub for code, ECR for images, GitHub Actions for CI, Docker Hub for upstreams, Route53 + S3 + CloudFront for the blog. Most of that’s still where it should be. About a third of it isn’t. This post is the retrospective on what came home, what stayed rented, and the rule of thumb I now use when deciding which side of the line a new service goes on. The short version: self-host the things you operate; rent the things you’d never have time to operate.

The pattern I didn’t see at the time

I didn’t set out to “move off the cloud”. The first thing to come home — Harbor in place of ECR — was a surrender to a single annoyance: ECR pull-secret tokens expire every 12 hours and the cron that refreshed them kept failing. The second — Actions Runner Controller in the cluster — was about cost on a CI pipeline that runs frequently. The third, fourth, fifth — Actions cache, blog deploys, GitLab — each had its own local trigger.

It wasn’t until I’d done four of them and was looking at the fifth (the GitLab migration, last month) that I realized I’d been writing the same rationale every time:

“This managed service has an operational failure mode I can’t fix from inside the service, and a local replacement is one Helm chart away.”

That’s the rule. Everything in this series is a different instance of that rule.

The trigger events

Specific incidents, in chronological order:

ECR’s 12-hour token

Every namespace pulling from ECR needed a regcred Secret regenerated every 12 hours via a CronJob that called aws ecr get-login-password. When the CronJob’s pod got scheduled on a node with stale IRSA credentials, it failed silently. The first I’d hear about it was ImagePullBackOff on whatever I deployed the next morning. A Harbor pull secret can be set to never expire. Replacing ECR with Harbor was the first move home.
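For illustration, here is a minimal sketch of the payload that refresh job has to rebuild every cycle. It assumes the password has already been fetched with aws ecr get-login-password; the function names, the one-hour safety margin, and the registry hostname are hypothetical, not what my CronJob actually ran. The point is that an empty token raises an error instead of quietly writing a broken secret.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

# ECR authorization tokens are valid for 12 hours.
ECR_TOKEN_LIFETIME = timedelta(hours=12)


def build_dockerconfigjson(registry: str, password: str) -> str:
    """Return the JSON body for a kubernetes.io/dockerconfigjson Secret.

    `password` is assumed to be the output of `aws ecr get-login-password`.
    ECR token logins use the fixed username "AWS".
    """
    if not password:
        # Fail loudly: an empty token is the silent failure to avoid.
        raise ValueError("empty ECR password; refusing to write a broken secret")
    auth = base64.b64encode(f"AWS:{password}".encode()).decode()
    return json.dumps(
        {"auths": {registry: {"username": "AWS", "password": password, "auth": auth}}}
    )


def needs_refresh(issued_at: datetime, now: datetime,
                  margin: timedelta = timedelta(hours=1)) -> bool:
    """True once the token is within `margin` of its 12-hour expiry.

    The one-hour margin is an arbitrary, hypothetical safety buffer.
    """
    return now - issued_at >= ECR_TOKEN_LIFETIME - margin


if __name__ == "__main__":
    # Hypothetical registry hostname and placeholder password.
    registry = "123456789012.dkr.ecr.us-east-1.amazonaws.com"
    print(build_dockerconfigjson(registry, "example-password"))
    issued = datetime.now(timezone.utc) - timedelta(hours=11, minutes=30)
    print(needs_refresh(issued, datetime.now(timezone.utc)))
```

A real job would still have to write this payload into each namespace and exit non-zero on any error, so that the failure shows up as a failed Job rather than as ImagePullBackOff the next morning.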

Docker Hub’s anonymous rate limit

A k3s cluster behind a single NAT IP is one IP to Docker Hub’s rate limiter. 100 anonymous pulls per six hours sounds generous until your CI pulls python:3.12-slim for every job. Authentication doubled the budget, but the same problem came back six months later. A proxy cache fetches each image from upstream once and serves it locally after that. Harbor as a proxy cache was the structural fix.
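The arithmetic is worth spelling out. The sketch below uses the 100-pull anonymous limit and the doubled authenticated limit mentioned above; the figure of three image pulls per job and the function names are assumptions for illustration, and the cached case is simplified to one upstream fetch per distinct image, ignoring tag updates and cache eviction.

```python
def jobs_until_throttled(pull_limit: int, pulls_per_job: int) -> int:
    """How many CI jobs fit inside one rate-limit window."""
    if pulls_per_job <= 0:
        raise ValueError("pulls_per_job must be positive")
    return pull_limit // pulls_per_job


def upstream_pulls(jobs: int, pulls_per_job: int,
                   distinct_images: int, cached: bool) -> int:
    """Pulls that actually reach Docker Hub in one window.

    Without a cache, every job pulls every image from upstream.
    With a proxy cache, assume each distinct image is fetched once.
    """
    if jobs <= 0:
        return 0
    return distinct_images if cached else jobs * pulls_per_job


if __name__ == "__main__":
    # Assumed workload: each job pulls 3 images, always the same 3.
    print(jobs_until_throttled(100, 3))             # anonymous budget
    print(jobs_until_throttled(200, 3))             # authenticated budget
    print(upstream_pulls(60, 3, 3, cached=False))   # direct to Docker Hub
    print(upstream_pulls(60, 3, 3, cached=True))    # through a proxy cache
```

Under those assumptions, authentication only moves the ceiling from 33 jobs to 66 per window, while the cached path stays flat no matter how many jobs run.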

GitHub Actions on self-hosted runners hitting per-installation API budgets

I was running ARC runners on k3s for cost, but the ARC listener still talks to GitHub’s API to receive job dispatches. Per-installation rate limits ran out during noisy weeks, and when the listener crashlooped, no jobs got scheduled on the cluster. There was no complete fix available from my side of GitHub; the only way to remove the API dependency was to remove GitHub. GitLab runners poll the GitLab server, which in my case runs locally. Moving the source of truth to GitLab was the consequence.

A self-hosted runner pulling its cache from the public internet

The most ridiculous one. My self-hosted runners were caching builds via actions/cache, which talks to GitHub’s cache backend, which lives in Microsoft Azure. So a runner ten feet from the NAS was pushing gigabytes of cache out through the ISP and back. A cache server costs next to nothing to run in a homelab. A self-hosted cache closed that gap.

None of these were catastrophic. All of them were avoidable. The pattern was that I’d been paying a coordination tax to keep a homelab and a cloud-hosted toolchain talking to each other, and the tax had quietly grown.

The math of small-blast-radius sovereignty

There’s a sovereignty argument here too, but I want to be honest about how much weight to give it.

The strong form — “my data is on someone else’s lawyers’ servers” — is real but rarely actionable. GitHub is not going to delete home_k3s_cluster on a TOS interpretation; the worst plausible scenario is account suspension, and my account has been fine for years.

The weak form — “I can’t fix the failure mode from inside the service” — is the one that actually drove the moves. When ECR breaks, I can’t kubectl exec into ECR. When GitHub’s API rate-limits me, I can’t tail its logs. The cloud’s encapsulation is exactly what makes managed services valuable on the happy path and exactly what makes them frustrating on the unhappy one.

A useful test: if this service has a 90-minute outage, can I work around it?

  • Harbor down? Restart the Helm release, look at the logs, check Longhorn. Yes, I can work around it.
  • ECR down? Wait for AWS to fix it. No, I cannot.
  • GitLab down? Same as Harbor. Yes.
  • GitHub down? Wait. No.

When the answer is “wait”, the service is a vendor. When the answer is “fix”, it’s a system. Some things are appropriately vendor-shaped (electricity, DNS, certificate authorities); some are not (your CI runner, your image registry, your source of truth).

When self-hosting is the wrong call

I want to be very clear: most of my stack is still rented. The next post in this series (the seam) is the explicit list. The short version of “wrong call” looks like:

  • The service has compounding security obligations. A managed CA, a managed identity-of-last-resort, a managed payment processor — these are not “one Helm chart”. Don’t.
  • The service is the long pole on availability and you don’t run a 24/7 NOC. DNS is the classic example. Authoritative DNS on Route53 means my blog stays up at 3 a.m. while I’m asleep. DNS on the cluster means I’m the on-call rotation. No.
  • The cost difference is rounding error. S3 at single-digit GB scale is cheaper than the time you’ll spend operating MinIO. Don’t.
  • The service is your last working tool when everything else is broken. I keep one repo (k3s_bootstrap) on GitHub specifically because, if the cluster is down, I can’t pull break-glass scripts from a GitLab that is hosted on the cluster. That is a circular dependency I refuse to debug at 2 a.m.
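The circular-dependency point in the last bullet can be checked mechanically. The sketch below is a plain depth-first cycle search over a hand-written dependency map; the node names and the two example graphs are hypothetical simplifications of the break-glass situation, not an inventory of my actual services.

```python
from __future__ import annotations


def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    `deps` maps each node to the nodes it needs in order to work.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            seen = state.get(dep, WHITE)
            if seen == GREY:
                # dep is already on the current path: that is a cycle.
                return path[path.index(dep):] + [dep]
            if seen == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Recovering the cluster needs the scripts; where do the scripts live?
    scripts_on_cluster = {
        "cluster": ["break-glass scripts"],
        "break-glass scripts": ["gitlab"],
        "gitlab": ["cluster"],
    }
    scripts_off_cluster = {
        "cluster": ["break-glass scripts"],
        "break-glass scripts": ["github"],
        "github": [],
    }
    print(find_cycle(scripts_on_cluster))
    print(find_cycle(scripts_off_cluster))
```

With the scripts hosted on the cluster's own GitLab, the search reports a loop back to the cluster; with them on GitHub, it reports none.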

What’s still on the chopping block

The honest list of “things I’m considering bringing home”:

  • PKI for internal services. Let’s Encrypt + cert-manager handles external; internal services have a mix of self-signed and “we’ll fix this later”. Vault’s PKI engine is the natural pull, paired with the Vault deployment I just stood up. Likely the next post-worthy move.
  • Internal DNS. Currently Route53 with split-horizon via dnsmasq inside the cluster. Bringing internal DNS fully local with CoreDNS or Pi-hole is doable but adds an on-call surface.
  • The container registry’s “push” role. Harbor stays as a proxy cache; GitLab’s registry takes the push duty. Both can stay local; this isn’t a self-hosting decision so much as a tool-consolidation one.

What’s staying rented forever

Detailed in the next post, but the headline:

  • DNS (Route53)
  • Public TLS issuance (Let’s Encrypt, plus ACM where applicable)
  • Off-site backups (S3 with lifecycle to Glacier)
  • Vault auto-unseal (AWS KMS)
  • LLMs (Anthropic API direct for Claude; Bedrock for Amazon-only models)
  • Email delivery (a managed transactional sender, not my own MTA)
  • One “break-glass” Git repo on GitHub

The rule of thumb I now apply

When I’m about to add a new external dependency, I ask three questions:

  1. Could a 90-minute outage of this service be fixed by me, or do I just wait?
  2. If I self-hosted the equivalent, would the operational burden be one Helm chart or three weeks of expertise I don’t have?
  3. Does self-hosting introduce a circular dependency on something I’m already running?

“Self-host” is the answer if (1) says “wait”, (2) says “one Helm chart”, and (3) says “no”. Anything else, rent.
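Written out as code, the rule is a three-way conjunction. This is a minimal sketch: the Candidate type and its field names are hypothetical, and the example inputs are illustrative encodings of cases discussed earlier in this post rather than a complete inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    outage_means_wait: bool     # Q1: during an outage, can I only wait?
    one_helm_chart: bool        # Q2: is the local equivalent one Helm chart?
    circular_dependency: bool   # Q3: would it depend on what it must rescue?


def decide(c: Candidate) -> str:
    """Return "self-host" only when all three answers line up; else "rent"."""
    if c.outage_means_wait and c.one_helm_chart and not c.circular_dependency:
        return "self-host"
    return "rent"


if __name__ == "__main__":
    examples = [
        Candidate("image registry (ECR to Harbor)", True, True, False),
        Candidate("managed CA", True, False, False),
        Candidate("break-glass repo on cluster GitLab", True, True, True),
    ]
    for c in examples:
        print(f"{c.name}: {decide(c)}")
```

The asymmetry is deliberate: a single unfavorable answer is enough to keep renting, so the default is always the managed service.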

Lessons

  • You don’t decide to self-host up front. You self-host the things that bite you. I didn’t set out with a strategy; I noticed five years of similar incidents and started naming them.
  • The coordination tax is the hidden cost of “free” managed services. Refreshing tokens, rotating credentials, proxying caches, working around rate limits — none of those land on a budget line but all of them eat hours.
  • Sovereignty arguments are weak; ops arguments are strong. Lead with “I want to fix it when it breaks”, not “I don’t trust the vendor”. The first is true. The second is a fancier way of saying the first.
  • Keep one foot outside. The k3s_bootstrap repo on GitHub and the S3 bucket holding the backups exist precisely because they’re not on my cluster. The seam is the feature.

What’s next in this series

The next two posts are the keystones: the Saturday DR drill that proves the local stack survives a planned wipe, and the seam that lists everything I deliberately did not bring home. The first proves the architecture; the second admits its limits.