The seam — what I deliberately left in the cloud and why

TL;DR

This is the counterpart to the manifesto and the DR drill. After moving a chunk of the stack home, a list of things deliberately stayed rented: Route53, ACM, S3, AWS KMS, the Anthropic API for Claude, Bedrock for Amazon-only models, a transactional email sender, and one repo on GitHub. Each of them earns its place by being either the long pole on availability or the dependency that has to outlive the cluster. Self-hosting maximalism is a trap; the seam is the feature.

The principle: self-hosting is not maximization

The dumbest version of “move everything off the cloud” produces a homelab where the on-call rotation is one person and the failure modes include “the DNS is down so I can’t fix the DNS”.

A useful test, repeated from the manifesto: if this service has a 90-minute outage, what do I do about it? If the answer is “wait”, it’s a vendor-shaped problem and should stay rented. If the answer is “fix”, it’s a system-shaped problem and might come home.

Some things stay rented even though the answer would be “fix” in theory, because the consequence of a real outage at 3 a.m. is worse than the operational tax of renting. DNS is the example everyone gets wrong.
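The test and its override fit in a few lines. This is a minimal sketch, not something I run in production: the `Service` fields and the two return labels are names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    outage_response: str          # "wait" or "fix": my answer for a 90-minute outage
    outage_worse_than_rent: bool  # does a real 3 a.m. outage cost more than renting?


def placement(svc: Service) -> str:
    """Return "rented" or "candidate-for-home" (hypothetical labels)."""
    if svc.outage_response not in ("wait", "fix"):
        raise ValueError(f"unknown outage response: {svc.outage_response!r}")
    if svc.outage_response == "wait":
        return "rented"  # vendor-shaped problem
    if svc.outage_worse_than_rent:
        return "rented"  # fixable in theory, but the blast radius overrides it
    return "candidate-for-home"  # system-shaped problem


print(placement(Service("public-dns", "fix", True)))
print(placement(Service("gitlab", "fix", False)))
```

Public DNS lands on the rented side even though I could fix it myself, which is the point of the override.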

What stayed rented

Route53 — DNS

Why it stays: DNS is the long pole on every other service’s availability. If blog.zolty.systems resolves wrong, nothing else matters. Route53’s SLA is 100% — they pay credits if they miss — and their global anycast outperforms anything I could stand up on residential hardware.

What it costs: ~$0.50/month per hosted zone plus negligible per-query charges. Single-digit dollars.

The break-glass case: If Route53 ever drifted out of acceptable territory (which it hasn’t), I’d move to a comparable managed authoritative DNS — Cloudflare or NS1, not “the cluster”. Authoritative DNS on residential hardware is “I’m the on-call for DNS forever”. No.

Internal DNS is a different story — split-horizon via dnsmasq plus a small CoreDNS deployment, fully on-cluster. The seam is “external authoritative = rented, internal = local.”
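The split-horizon rule reduces to a suffix match. The sketch below is illustrative only: the internal zone names are placeholders I made up, and the real decision lives in dnsmasq and CoreDNS config rather than in Python.

```python
# Hypothetical internal zones; substitute whatever the local resolvers serve.
INTERNAL_SUFFIXES = ("internal.example", "svc.cluster.local")


def is_internal(name: str, suffixes=INTERNAL_SUFFIXES) -> bool:
    """True if name equals an internal zone or is a subdomain of one."""
    fqdn = name.rstrip(".").lower()
    return any(fqdn == s or fqdn.endswith("." + s) for s in suffixes)


def pick_resolver(name: str) -> str:
    """Internal names stay on-cluster; everything else goes to rented DNS."""
    return "local" if is_internal(name) else "public"


print(pick_resolver("grafana.internal.example"))
print(pick_resolver("blog.zolty.systems"))
```

Matching on a leading dot matters: a name like notinternal.example must not be mistaken for a member of internal.example.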

ACM + Let’s Encrypt — public TLS

Why it stays: Public-facing certs need to chain to a root the world’s browsers trust. That requires being a real CA, which I’m not. Let’s Encrypt issues via cert-manager + Route53 DNS-01 for free. ACM issues for CloudFront because a CloudFront distribution can only use an ACM certificate if it lives in us-east-1.

The seam: I could absolutely stand up an internal PKI (Vault PKI engine, smallstep CA) for internal services. That’s on the chopping block for the next quarter. External certs stay rented forever.

S3 — backups

Why it stays: A backup that lives on the same cluster it’s backing up is not a backup. Off-site is the entire point. S3 at single-digit GB scale costs less than the electricity to spin a NAS for that purpose.

What it costs: A few dollars a month for everything I back up — GitLab tarballs, Vault snapshots, Authentik Postgres dumps, miscellaneous app data. Glacier lifecycle on anything older than 30 days drops the cost another tier.
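The 30-day Glacier transition is a single lifecycle rule. Here is a sketch of how it could be built and applied with boto3; the rule ID and the empty prefix are my assumptions, and `apply_lifecycle` is a hypothetical helper that needs real AWS credentials to do anything.

```python
def glacier_lifecycle(prefix: str = "", days: int = 30) -> dict:
    """Build a lifecycle config that moves objects to Glacier after `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"to-glacier-after-{days}d",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},       # "" matches every object
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Push the config to S3. Imported lazily so the builder runs without boto3."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


print(glacier_lifecycle())
```

Keeping the builder separate from the API call means the rule can be reviewed or diffed without touching the bucket.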

The break-glass case: If S3 is unavailable for the few hours per decade it goes down, my drills are temporarily blocked. Acceptable. If S3 is unavailable during a real disaster, I’m in trouble. That’s why I also have a second backup target: encrypted tarballs synced nightly to a friend’s storage at a different geographic location. S3 is primary; the friend-storage is the back-back-up. I would not run with S3 as the only off-site.
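Two off-site targets only help if both are fresh. A sketch of the check I have in mind follows; the target names and the 36-hour threshold are assumptions for illustration, not values from my actual jobs.

```python
from datetime import datetime, timedelta, timezone


def stale_targets(
    last_success: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=36),  # assumed slack for a nightly job
) -> list[str]:
    """Return the backup targets whose last good run is older than max_age."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)


now = datetime(2025, 1, 10, 6, 0, tzinfo=timezone.utc)
runs = {
    "s3": datetime(2025, 1, 10, 2, 0, tzinfo=timezone.utc),             # 4 hours ago
    "friend-storage": datetime(2025, 1, 7, 2, 0, tzinfo=timezone.utc),  # 3 days ago
}
print(stale_targets(runs, now))
```

An alert on a non-empty result catches the failure mode where the secondary target quietly stops syncing while the primary looks healthy.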

AWS KMS — Vault auto-unseal

Why it stays: Auto-unseal is the line between Vault as a tool I use and Vault as a tool I avoid because manually unsealing three replicas after every node reboot is a chore. KMS is the easiest auto-unseal backend if you’re already in AWS for other things. The same applies to GCP KMS, Azure Key Vault, or a hardware HSM — pick whichever cloud you live in.

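The break-glass case: Recovery keys are stored in Bitwarden. A precision that matters here: with auto-unseal, Vault’s recovery keys authorize operations such as generating a root token or migrating the seal, but they cannot decrypt the root key on their own, so they are not a manual unseal path once the KMS key is gone. The actual escape hatch is a seal migration from KMS to Shamir keys, and that has to run while KMS is still reachable. Permanent KMS loss has never happened and would be a generational event for AWS, so I accept that tail risk. The seam holds for the ordinary case: replicas that are already unsealed keep serving through a KMS outage, and a replica that restarts mid-outage stays sealed until KMS returns.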
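To see at a glance which replicas came back after a reboot, here is a sketch that interprets the JSON returned by Vault’s `sys/seal-status` endpoint. The replica names are made up, the `awskms` seal type is what I would expect for this setup, and fetching the JSON is left out so the logic runs offline.

```python
def unseal_report(statuses: dict[str, dict], expected_type: str = "awskms") -> dict[str, str]:
    """Map each replica to "ok", "sealed", "uninitialized", or "wrong-seal-type".

    `statuses` holds one parsed sys/seal-status response per replica.
    Missing fields are treated pessimistically.
    """
    report = {}
    for replica, st in statuses.items():
        if not st.get("initialized", False):
            report[replica] = "uninitialized"
        elif st.get("sealed", True):
            report[replica] = "sealed"
        elif st.get("type") != expected_type:
            report[replica] = "wrong-seal-type"
        else:
            report[replica] = "ok"
    return report


sample = {  # hypothetical replica names and responses
    "vault-0": {"initialized": True, "sealed": False, "type": "awskms"},
    "vault-1": {"initialized": True, "sealed": True, "type": "awskms"},
    "vault-2": {"initialized": True, "sealed": False, "type": "shamir"},
}
print(unseal_report(sample))
```

A replica that reports `sealed` while KMS is reachable is the interesting case; that points at IAM or network trouble rather than at Vault itself.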

Bedrock — only for Amazon-only models

Why it stays (in a narrow way): Per my own rule — Anthropic direct API for Claude models, Bedrock for the Amazon-only models like Nova and Titan when I want to try them. Bedrock is the only way to access those models, so it’s not a choice.

Why it does not stay for Claude: Bedrock’s Claude pricing has historically been at parity-or-worse with Anthropic’s API, with worse rate limits and slower access to new model versions. Bedrock offers no operational benefit for Claude, and going direct has a real cost-and-latency benefit.
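The routing rule is simple enough to state as code. This is a sketch under loud assumptions: the model ID strings and the provider labels are illustrative, and a real client would map them onto actual SDK calls.

```python
def pick_provider(model_id: str) -> str:
    """Claude goes to the Anthropic API directly; Amazon-only models go to Bedrock."""
    m = model_id.lower()
    if "claude" in m:
        return "anthropic-direct"  # even if the ID is written Bedrock-style
    if m.startswith("amazon.") or m.startswith(("nova", "titan")):
        return "bedrock"           # Nova and Titan are only reachable there
    raise ValueError(f"no routing rule for model: {model_id!r}")


print(pick_provider("claude-sonnet"))    # hypothetical ID
print(pick_provider("amazon.nova-pro"))  # hypothetical ID
```

Raising on unknown models is deliberate: a new model family should force a decision about which side of the seam it sits on.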

Anthropic API — Claude

Why it stays: Local inference on the Mac Studio pair handles a chunk of workloads (the trillion-param MoE post is the clearest example), but Claude — specifically Opus 4.7 and Sonnet 4.6 — is the LLM I actually want for non-trivial tasks. There’s no local equivalent in the same league. The API stays rented.

The honest cost story: My API spend is bounded by token volume, not by a per-month commitment. A heavy month is in the low triple digits. The Mac Studio capex was equivalent to several years of API spend, paid up front; the Mac Studios are for the kinds of workloads where the API doesn’t fit, not a wholesale replacement.

A transactional email sender — outbound mail

Why it stays: Outbound email is a deliverability problem, not a software problem. Running your own MTA from residential IP space gets your mail sent to spam or dropped outright by every major receiver. A managed sender solves the deliverability piece for less per month than a coffee.

The seam: Inbound mail is a different category — receiving on a custom domain is fine via a forwarding service. The seam is “delivery to inbox = rented, mailbox storage = whatever’s cheap.”

One repo on GitHub — the break-glass

Why it stays: k3s_bootstrap is on GitHub forever. The cluster’s recovery procedure starts with “clone this repo and run the playbook”. If the playbook is on GitLab running on the cluster I’m trying to bootstrap, I have a circular dependency at the worst possible moment. So one repo lives outside the seam.

GitHub also serves as the read-only mirror for the other repos that publish externally (zolty-blog, public projects). That’s not a sovereignty consideration; that’s where the audience reads.

The principle behind the seam, more honestly

If I had to compress the rationale to one sentence:

Anything that has to be working before the cluster comes up should not live on the cluster.

That’s why DNS stays rented (you can’t resolve cluster services without it), KMS stays rented (Vault can’t unseal without it), backups stay rented (you can’t restore from a backup that died with the cluster), and one repo stays on GitHub (the bootstrap script can’t live on a system that isn’t bootstrapped yet).
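The one-sentence rule can be checked mechanically: walk everything the bootstrap transitively needs and flag anything that lives on the cluster. The sketch below uses a made-up dependency map; the node names are illustrative, not an inventory of my stack.

```python
def bootstrap_violations(
    needs: dict[str, set[str]],
    on_cluster: set[str],
    root: str = "cluster-bootstrap",  # hypothetical name for the recovery entry point
) -> list[str]:
    """Return on-cluster services that the bootstrap transitively depends on."""
    seen: set[str] = set()
    stack = [root]
    while stack:  # iterative depth-first walk; tolerates cycles
        node = stack.pop()
        for dep in needs.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return sorted(seen & on_cluster)


bad = {
    "cluster-bootstrap": {"bootstrap-repo", "public-dns", "unseal-key", "backups"},
    "bootstrap-repo": {"gitlab"},  # playbook hosted on the cluster it rebuilds
}
good = {**bad, "bootstrap-repo": {"github"}}
print(bootstrap_violations(bad, on_cluster={"gitlab"}))
print(bootstrap_violations(good, on_cluster={"gitlab"}))
```

An empty list is the goal. Any entry is a service that has to be running before the thing that runs it exists.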

A second principle, less load-bearing but real:

Anything where my time replacing the vendor exceeds my time paying the vendor for the foreseeable future should stay rented.

Email deliverability. Public CA. Authoritative DNS. These are not impossible problems; they’re just problems that consume more of me than they’re worth.

Where the line might shift

A short list of things currently rented that I’m watching:

Internal PKI. Vault’s PKI engine is one Helm refactor away from issuing internal certs. Likely the next post.
Object storage for non-backup uses. Spaces, MinIO, or Garage on the cluster could absorb a chunk of the small-object workloads currently on S3 that don’t need 11-nines durability. Cost is fine on either side; the deciding factor will be operational simplicity.
Telemetry shipping. The Grafana Cloud free tier handles a slice of metrics; everything else is local. If the free tier shrinks, the slice comes home.

A short list of things currently rented that will never shift:

Authoritative DNS for public domains.
Public CA.
LLM API access for frontier models.
The break-glass GitHub repo.

Lessons

Self-hosting is not a religion. It’s a tool with a cost and a benefit. The benefit is faster fixes when things break; the cost is operational tax forever. Apply selectively.
Draw the seam first. Decide which side of the line a service lives on before you stand it up. Retrofitting the seam costs more than building to it.
Circular dependencies are the actual enemy. Self-hosting until you depend on yourself to recover from yourself is how you get stuck at 2 a.m.
The vendor exists to absorb the 0.1% problem you don’t want. Most of your stack should be in the 99.9%. Some of it should not.

What’s next in the series

This closes the three-post series (manifesto → DR drill → seam). The next natural post is the one-year retrospective: hours spent, incidents prevented, total cost moved on and off-prem, the things I got wrong. That one wants more elapsed time than I have right now. Bookmark it for early autumn.
