TL;DR

I deployed Whisper (speech-to-text), Piper (text-to-speech), and openWakeWord (wake word detection) as Kubernetes workloads on my k3s cluster. Home Assistant connects to them over the Wyoming protocol for fully local voice pipelines. Total resource cost: ~1 CPU core and 1.75GB RAM. Total cloud cost: $0.

Why run voice services in Kubernetes

Home Assistant’s voice pipeline needs three things: something to listen for a wake word, something to transcribe speech, and something to speak back. The usual approach is running these on the same box as HA, or on a dedicated Pi. Both work fine until you want the services to survive node failures, be independently upgradeable, or share resources with other workloads.

I already have a k3s cluster with spare capacity. Running voice services as standard Kubernetes deployments means they get health checks, resource limits, and the same deployment workflow as everything else. No special infrastructure.

The Wyoming protocol

All three services speak Wyoming — a lightweight TCP protocol designed for voice assistants. HA discovers Wyoming services and chains them into a pipeline:

  1. openWakeWord listens for “Hey Jarvis” (or whatever you configure)
  2. Whisper transcribes the audio that follows
  3. HA processes the text through its conversation agent
  4. Piper speaks the response

The protocol is simple enough that each service is a single container with a single TCP port. No REST APIs, no message queues, no service mesh required.
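The framing is easy to sketch. Below is a simplified Python illustration of the idea: one JSON header per line, optionally followed by a binary payload whose length the header declares. Field names mirror the wyoming Python package, but treat this as a toy illustration, not a wire-compatible client.

```python
import io
import json

# Simplified sketch of Wyoming-style framing: a JSON header line,
# optionally followed by `payload_length` bytes of binary audio.
# Field names are borrowed from the wyoming Python package, but this
# is an illustration of the idea, not a wire-compatible client.
def encode_event(event_type, data=None, payload=b""):
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

def decode_event(stream):
    header = json.loads(stream.readline())
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], header.get("data", {}), payload

# Round-trip an audio-chunk event through an in-memory stream.
raw = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01")
etype, data, payload = decode_event(io.BytesIO(raw))
print(etype, data["rate"], payload)
```

Newline-delimited JSON plus raw payload bytes is why a bare TCP socket is all each service needs.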

The manifests

Everything lives in a single manifest file: one namespace, three deployments, three services. Here are the interesting bits.

Whisper (speech-to-text)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper
  namespace: voice-pipeline
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: whisper
  template:
    metadata:
      labels:
        app: whisper
    spec:
      containers:
      - name: whisper
        image: harbor.k3s.internal.zolty.systems/dockerhub-cache/rhasspy/wyoming-whisper:latest
        args:
          - --model
          - base
          - --language
          - en
          - --beam-size
          - "1"
        ports:
        - containerPort: 10300
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "2"
            memory: 1Gi

The --beam-size 1 flag is worth calling out. Beam search with width 1 is greedy decoding — it picks the most likely token at each step without exploring alternatives. For home voice commands (“turn off the kitchen lights”), greedy decoding is fast and accurate enough. You’re not transcribing parliamentary debates.
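To make the distinction concrete, here is a toy decoder with invented probabilities (nothing Whisper-specific): greedy locks in the locally best first token, while a width-2 beam recovers a sequence with a higher joint probability.

```python
import math

# Toy conditional log-probabilities, invented purely for illustration.
# The step-2 distribution depends on the step-1 token, which is what
# lets a wider beam beat greedy decoding.
STEP1 = {"A": math.log(0.6), "B": math.log(0.4)}
STEP2 = {
    "A": {"A": math.log(0.5), "B": math.log(0.5)},   # after choosing "A"
    "B": {"A": math.log(0.99), "B": math.log(0.01)}, # after choosing "B"
}

def greedy():
    """Beam size 1: take the locally best token at each step."""
    first = max(STEP1, key=STEP1.get)
    second = max(STEP2[first], key=STEP2[first].get)
    return [first, second]

def beam(width=2):
    """Keep the `width` best prefixes, then pick the best full sequence."""
    prefixes = sorted(((STEP1[t], [t]) for t in STEP1), reverse=True)[:width]
    finished = [
        (score + lp, seq + [tok])
        for score, seq in prefixes
        for tok, lp in STEP2[seq[-1]].items()
    ]
    return max(finished)[1]

print(greedy())  # ['A', 'A'], joint probability 0.6 * 0.5 = 0.30
print(beam())    # ['B', 'A'], joint probability 0.4 * 0.99 = 0.396
```

The gap only matters when an early, slightly less likely token leads to a much better continuation, which is exactly the longer-sentence territory where beam search earns its extra compute.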

The base model balances speed and accuracy for conversational speech. Whisper has models from tiny (39M params) to large-v3 (1.5B params). For voice commands, base is the sweet spot — tiny struggles with compound sentences, large is overkill for “what’s the weather.”
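For reference, the matching Service is a plain ClusterIP object. A sketch, assuming the Deployment's pods carry the label app: whisper:

```yaml
# Sketch of the matching Service; assumes the pod label app: whisper.
apiVersion: v1
kind: Service
metadata:
  name: whisper
  namespace: voice-pipeline
spec:
  type: ClusterIP
  selector:
    app: whisper
  ports:
  - port: 10300
    targetPort: 10300
    protocol: TCP
```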

Piper (text-to-speech)

- name: piper
  image: harbor.k3s.internal.zolty.systems/dockerhub-cache/rhasspy/wyoming-piper:latest
  args:
    - --voice
    - en_US-lessac-medium
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi

The en_US-lessac-medium voice is a neural TTS model trained on the Lessac dataset. It sounds natural without being uncanny. The medium quality variant is about 200MB — large enough for good prosody, small enough to fit in a minimal resource envelope.

Piper is the lightest of the three services. TTS synthesis on modern CPUs is fast — a typical response generates in under 100ms.

openWakeWord

- name: openwakeword
  image: harbor.k3s.internal.zolty.systems/dockerhub-cache/rhasspy/wyoming-openwakeword:latest
  args:
    - --preload-model
    - hey_jarvis
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi

Wake word detection is the lightest workload — models are embedded in the container image, and the detector runs continuously on a tiny audio stream. 128MB is more than enough.

The --preload-model hey_jarvis flag loads the wake word model into memory at startup instead of on first detection attempt. Without it, the first wake word takes noticeably longer to detect.

Resource footprint

| Service | CPU Request | CPU Limit | RAM Request | RAM Limit |
|---|---|---|---|---|
| Whisper | 500m | 2000m | 512Mi | 1Gi |
| Piper | 250m | 1000m | 256Mi | 512Mi |
| openWakeWord | 100m | 500m | 128Mi | 256Mi |
| Total | 850m | 3500m | 896Mi | 1.75Gi |
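The totals can be sanity-checked with a few lines of plain Python (the quantity parser below handles only the unit suffixes that appear in this table):

```python
# Sanity check of the resource totals. Kubernetes quantities:
# "500m" is 0.5 CPU, "2" is 2 full cores, "512Mi"/"1Gi" are binary units.
def millicores(q):
    return int(q[:-1]) if q.endswith("m") else int(q) * 1000

def mebibytes(q):
    return int(q[:-2]) if q.endswith("Mi") else int(q[:-2]) * 1024  # Gi -> Mi

services = {
    "whisper":      ("500m", "2",    "512Mi", "1Gi"),
    "piper":        ("250m", "1",    "256Mi", "512Mi"),
    "openwakeword": ("100m", "500m", "128Mi", "256Mi"),
}

cpu_req = sum(millicores(s[0]) for s in services.values())
cpu_lim = sum(millicores(s[1]) for s in services.values())
ram_req = sum(mebibytes(s[2]) for s in services.values())
ram_lim = sum(mebibytes(s[3]) for s in services.values())
print(cpu_req, cpu_lim, ram_req, ram_lim)  # 850 3500 896 1792 (1792Mi = 1.75Gi)
```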

Under 1 CPU core at request level. These services sit idle 99% of the time — they only work when someone is actively talking to HA. The burst limits give Whisper room to transcribe quickly when a command comes in.

Health checks

All three use TCP socket probes on their Wyoming ports. The protocol doesn’t expose HTTP health endpoints, but a successful TCP connection means the service is accepting Wyoming clients. Whisper’s probes, for example:

readinessProbe:
  tcpSocket:
    port: 10300
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 10300
  initialDelaySeconds: 60
  periodSeconds: 30

Whisper gets the longest initial delay (60s for liveness) because model loading can take a while on first boot — the base model downloads into an emptyDir volume on startup. openWakeWord starts fast (20s) since its models are baked into the image.

Model storage

Whisper and Piper download models on first boot into emptyDir volumes:

volumeMounts:
- name: whisper-cache
  mountPath: /data
volumes:
- name: whisper-cache
  emptyDir:
    sizeLimit: 1Gi

This is a deliberate trade-off. Using emptyDir means models re-download if the pod restarts, but it avoids managing PVCs for data that’s freely available. Whisper’s base model is about 150MB — it downloads in seconds on a local network.

If you’re on a slow connection or don’t want the startup delay, mount a PVC instead. But for a homelab with local bandwidth to spare, ephemeral storage keeps things simple.
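A sketch of that alternative, assuming k3s's default local-path storage class (the claim name is arbitrary). With the Recreate strategy above, a ReadWriteOnce claim is safe because only one pod mounts it at a time:

```yaml
# Hypothetical PVC alternative to emptyDir; persists the model cache
# across pod restarts. local-path is k3s's default storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: whisper-cache
  namespace: voice-pipeline
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
---
# Then in the Deployment, swap the emptyDir volume for:
# volumes:
# - name: whisper-cache
#   persistentVolumeClaim:
#     claimName: whisper-cache
```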

Connecting Home Assistant

The services are ClusterIP only — no external ingress needed. HA connects via internal DNS:

whisper.voice-pipeline.svc.cluster.local:10300
piper.voice-pipeline.svc.cluster.local:10200
openwakeword.voice-pipeline.svc.cluster.local:10400

In HA’s UI: Settings → Devices & Services → Add Integration → Wyoming Protocol. Add each service by hostname and port. Then create an Assist pipeline that chains them: wake word → STT → conversation agent → TTS.

If HA is running in the same cluster (mine is), these DNS names just work. If HA is external, you’d need to expose the services — but for a homelab, keep them cluster-internal.

What it sounds like in practice

“Hey Jarvis, turn off the kitchen lights.”

The wake word triggers in about 200ms. Whisper transcribes the command in under a second. HA processes the intent and sends the response to Piper, which speaks it back in about 100ms. The whole round-trip is fast enough to feel responsive — noticeably faster than cloud-based assistants that need to round-trip to AWS or Google.

And if your internet goes down, it still works. Every part of the pipeline is local.

What I’d change

The one thing I’d improve is adding a PVC for Whisper’s model cache if I were running this on a cluster with slower storage or frequent pod churn. The emptyDir approach is fine for my setup but wouldn’t be ideal on a cluster where pods get rescheduled frequently.

I’d also consider the small Whisper model if I had more CPU headroom. The accuracy improvement over base is noticeable for longer sentences, and the resource cost is modest — about 2x the RAM.

Lessons

  • Wyoming is the right protocol for this. It’s simple, well-supported by HA, and each service is independently deployable. No coupling between STT and TTS.
  • CPU-only inference is fine for voice. You don’t need a GPU for home voice commands. The latency is already sub-second on commodity amd64 nodes.
  • emptyDir for models keeps it simple. Don’t over-engineer storage for freely downloadable data. PVCs add complexity that isn’t justified until pod churn becomes a problem.
  • Greedy decoding is enough for voice commands. Save the beam search for transcription workloads where accuracy matters more than latency.

Don’t have a homelab? The same manifests work on any k3s or K8s cluster. A DigitalOcean Kubernetes cluster with a small node pool handles this workload easily.