Kubernetes persists state in etcd, persistent volumes, and CRDs — and losing any of the three can end your company’s week. Velero is the open-source standard for backing up Kubernetes clusters, handling both the API objects and the volumes behind them. In 2026, with Velero 1.15 and its CSI snapshot data mover, you can ship encrypted backups of entire namespaces directly to S3 or any S3-compatible target and restore them into a different cluster, region, or cloud. This tutorial walks through a full installation on a production cluster, S3 setup, scheduled backups, disaster-recovery restores, and common pitfalls.
## Why Velero and Not `kubectl` YAML Dumps
A hand-rolled `kubectl get -o yaml` script looks tempting, but it misses CRDs, secrets in encrypted etcd, PVC data, and ordering dependencies on restore. Velero understands the full object graph, snapshots volumes via CSI, streams data to object storage, and handles namespace remapping and label selectors on restore. It is also the backup engine used under the hood by Rancher Backup, OpenShift OADP, and VMware Tanzu Mission Control.
## Prerequisites
You need a Kubernetes cluster running 1.28 or newer, `kubectl` configured, an S3-compatible bucket, and an IAM user or access key with read-write permissions to that bucket. This guide uses AWS S3, but Backblaze B2, MinIO, Wasabi, and Cloudflare R2 all work with identical configuration.
Install the Velero CLI on your workstation:
```bash
VELERO_VERSION=v1.15.0
curl -L https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz -o velero.tar.gz
tar -xzf velero.tar.gz
sudo install velero-${VELERO_VERSION}-linux-amd64/velero /usr/local/bin/
velero version --client-only
```
## Creating the S3 Bucket and IAM
Create a dedicated bucket with versioning and encryption enabled:
```bash
aws s3api create-bucket --bucket acme-velero-backups --region us-east-1
aws s3api put-bucket-versioning --bucket acme-velero-backups --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket acme-velero-backups \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
```
Create the IAM policy in a file `velero-policy.json`:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes", "ec2:DescribeSnapshots", "ec2:CreateTags",
        "ec2:CreateVolume", "ec2:CreateSnapshot", "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:DeleteObject", "s3:PutObject", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"],
      "Resource": "arn:aws:s3:::acme-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::acme-velero-backups"
    }
  ]
}
```
Attach the policy to a dedicated IAM user and save its access keys locally in the INI format Velero expects (substitute your own key pair):
```bash
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<YOUR_ACCESS_KEY_ID>
aws_secret_access_key=<YOUR_SECRET_ACCESS_KEY>
EOF
```
## Monitoring Backup Health
Velero exposes Prometheus metrics from its server deployment. Alert on failed backups and on the absence of a recent success (172800 seconds is two days):
```yaml
- alert: VeleroBackupFailure
  expr: velero_backup_failure_total > 0
  for: 10m
- alert: VeleroBackupPartialFailure
  expr: velero_backup_partial_failure_total > 0
  for: 10m
- alert: VeleroNoRecentSuccess
  expr: time() - velero_backup_last_successful_timestamp > 172800
```
## Common Pitfalls
Velero restores do not bring back running pods exactly as they were — it recreates controllers which then recreate pods. Services of type LoadBalancer get new external IPs. StatefulSet PVCs are restored, but if your storage class has `volumeBindingMode: WaitForFirstConsumer` you may need to restore into a cluster with the same topology.
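One way to sidestep collisions with still-running workloads is to restore into a differently named namespace. A sketch using Velero's namespace-mapping support (the backup and namespace names here are illustrative):

```shell
# Restore the acme-prod namespace from a backup into acme-prod-restore,
# leaving the live namespace untouched.
velero restore create prod-rehearsal \
  --from-backup daily-20260114 \
  --include-namespaces acme-prod \
  --namespace-mappings acme-prod:acme-prod-restore
```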
Secrets encrypted with sealed-secrets restore correctly only if the controller’s private key is also backed up — back up `kube-system/sealed-secrets-key*` explicitly.
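A hedged sketch of backing up that key explicitly; the label selector below is the one the sealed-secrets controller conventionally sets on its active key, but verify the label in your own cluster first:

```shell
# Capture only the sealed-secrets controller key from kube-system.
velero backup create sealed-secrets-key \
  --include-namespaces kube-system \
  --include-resources secrets \
  --selector sealedsecrets.bitnami.com/sealed-secrets-key=active
```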
## FAQ
**Is Velero free for commercial use?** Yes, Apache 2.0 license. VMware offers paid support through Tanzu if you need it.
**Does Velero back up etcd?** Not directly. It uses the Kubernetes API. For a full etcd snapshot, also run `etcdctl snapshot save` on your control plane.
**How long does a 500 GB namespace restore take?** Typical is 20–40 minutes depending on network to S3 and storage class provisioning time.
**Can I encrypt backups at rest?** S3 server-side encryption handles this. For client-side encryption, enable Kopia repository encryption in the node agent.
**Does Velero work with OpenShift?** Yes, via the OADP operator which wraps Velero with OpenShift-specific defaults.
**How do I encrypt the backup metadata, not just the volumes?** Enable Kopia repository encryption in the node-agent ConfigMap. The encryption key is stored as a Kubernetes secret in the velero namespace — back that secret up out-of-band.
**Can I exclude specific resources from a backup?** Yes, use `--exclude-resources` for resource types and label selectors with `--selector` for fine-grained filtering. A common pattern is excluding `events` and `events.events.k8s.io` to avoid backing up audit noise.
**What is the difference between snapshot-based and filesystem-based backups?** Snapshot-based uses CSI VolumeSnapshots and is much faster because it relies on the storage driver. Filesystem-based reads file by file via the node agent and works on any volume type, but it is slower and more CPU-intensive.
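The exclusion and filesystem-backup options from the answers above, sketched as commands (namespace and backup names are illustrative):

```shell
# Exclude noisy event objects from a backup
velero backup create app-backup \
  --include-namespaces acme-prod \
  --exclude-resources events,events.events.k8s.io

# Force filesystem-based backup via the node agent for volumes
# whose storage driver lacks CSI snapshot support
velero backup create app-fs-backup \
  --include-namespaces acme-prod \
  --default-volumes-to-fs-backup
```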
## Pre and Post Hooks
Some workloads need special handling around backup time — flushing buffers, locking tables, dumping a database. Velero supports pre- and post-backup hooks defined as pod annotations:
```yaml
metadata:
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/sh","-c","mysqldump -uroot -p$MYSQL_ROOT_PASSWORD --all-databases > /backup/dump.sql"]'
    pre.hook.backup.velero.io/container: mysql
    post.hook.backup.velero.io/command: '["/bin/sh","-c","rm /backup/dump.sql"]'
```
The pre hook runs inside the container before the volume snapshot is taken so the dump is included in the backup. Post hooks clean up after. The same pattern works for PostgreSQL with `pg_dumpall`, MongoDB with `mongodump`, and any application that needs a quiesce point.
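Hooks also accept timeout and error-handling annotations. A PostgreSQL-flavored sketch (the command itself is illustrative; adjust to your schema and credentials):

```yaml
metadata:
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/sh","-c","pg_dumpall -U postgres > /backup/dump.sql"]'
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/timeout: 120s
    # Fail the backup if the dump fails, rather than silently continuing
    pre.hook.backup.velero.io/on-error: Fail
```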
## Multi-Cluster Disaster Recovery
A real DR strategy assumes the entire cluster is gone. Velero is designed for this: install it in a standby cluster (in a different region or even cloud) pointing at the same backup location, and you can restore the production cluster’s namespaces with one command. Combine with external DNS automation to flip user-facing hostnames to the standby ingress and you have warm standby DR with minutes of RTO.
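A minimal sketch of the standby side, assuming the same bucket and credentials file from earlier; the plugin version is illustrative, so check Velero's compatibility matrix for your release. Marking the location `ReadOnly` keeps the standby from ever writing to or pruning production backups:

```shell
# Install Velero in the standby cluster against the production bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.11.0 \
  --bucket acme-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1

# Make the backup location read-only on the standby
kubectl -n velero patch backupstoragelocation default \
  --type merge -p '{"spec":{"accessMode":"ReadOnly"}}'
```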
For active-passive multi-region setups, schedule backups every 15 minutes for stateful workloads and use Velero's `--from-schedule` filter on the standby side to apply a steady stream of incremental restores. Test the failover quarterly and document the runbook. The most common DR rehearsal failure is missing CRDs — operators install their own CRDs at deploy time, and those CRDs must exist in the target cluster before Velero can restore custom resources.
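A quick pre-restore check for the CRD gap, comparing the two clusters (the kubectl context names are illustrative):

```shell
# Any line prefixed with "<" is a CRD present in prod but missing in standby
diff <(kubectl --context prod get crd -o name | sort) \
     <(kubectl --context standby get crd -o name | sort)
```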
## Cost Optimization
Backups in S3 are cheap, but they grow forever if you do not curate them. Beyond Velero's TTL, apply S3 lifecycle policies: transition to Infrequent Access after 30 days, Glacier after 90, and expire after 365. For Backblaze B2 and Wasabi, comparable lifecycle tiers exist. Encrypt buckets at rest with KMS-managed keys and enable bucket versioning to protect against accidental deletions, for example a misdirected restic or Kopia `forget --prune` run against the wrong repository.
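The tiering schedule above can be sketched as a single lifecycle configuration on the bucket from this guide:

```shell
# Tier Velero backups: IA at 30 days, Glacier at 90, expire at 365
aws s3api put-bucket-lifecycle-configuration \
  --bucket acme-velero-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "velero-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 365}
    }]
  }'
```

Let Velero's own TTL handle deletion of backup metadata; the lifecycle expiration is a backstop, and expiring objects underneath Velero before their TTL will surface as missing-data errors on restore.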
## Verifying Backup Integrity
A backup that does not restore is not a backup. Schedule a weekly job that picks the latest Velero backup and restores it into a scratch namespace, runs a smoke test (e.g., `kubectl exec` into a pod and check a known row in the database), then deletes the namespace. Alert if the smoke test fails. This catches corruption, missing CSI snapshot data, and certificate expirations long before a real incident. A short bash wrapper around `velero backup get -o json` and `velero restore create` is enough to automate this entirely.
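A sketch of that wrapper, assuming `velero backup get -o json` returns a `BackupList` and that the namespace, deployment, and smoke test are stand-ins for your own:

```shell
#!/usr/bin/env bash
# Weekly restore rehearsal: restore the newest completed backup into a
# scratch namespace, smoke-test it, then clean up.
set -euo pipefail

# Newest backup whose phase is Completed
latest=$(velero backup get -o json |
  jq -r '[.items[] | select(.status.phase=="Completed")]
         | sort_by(.status.completionTimestamp) | last | .metadata.name')

velero restore create "verify-${latest}" \
  --from-backup "${latest}" \
  --include-namespaces acme-prod \
  --namespace-mappings acme-prod:restore-test \
  --wait

# Smoke test: the restored workload must become ready
kubectl -n restore-test rollout status deploy/web --timeout=5m

# Clean up the scratch namespace and the restore record
kubectl delete namespace restore-test
velero restore delete "verify-${latest}" --confirm
```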
## Velero with GitOps
If your cluster is reconciled by ArgoCD or Flux, restoring deployments and services is easier — those reconcile from Git automatically. The hard part is restoring stateful data: PVCs, secrets that are not in Git, and CRDs from operators. Use Velero specifically for the stateful tier and let GitOps handle the rest. The two tools are complementary: GitOps for desired state, Velero for persistent data. A pragmatic Velero policy in a GitOps shop excludes Deployments, Services, and ConfigMaps (because Git restores them) and focuses entirely on PersistentVolumeClaims, Secrets, and operator-managed custom resources.
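That stateful-tier-only policy might look like the following schedule; the resource list and cron expression are a starting point, not a prescription:

```shell
# Daily backup of only the state GitOps cannot reconstruct from Git
velero schedule create stateful-daily \
  --schedule "0 2 * * *" \
  --include-resources persistentvolumeclaims,persistentvolumes,secrets \
  --exclude-namespaces kube-system \
  --ttl 720h
```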
## Security Considerations
Treat the backup bucket as if it contained your entire database in plaintext, because it does. Use a dedicated IAM principal with least privilege, enable MFA-delete on the bucket where supported, log every access via CloudTrail or equivalent, and rotate the credentials annually. The Velero service account inside the cluster should bind to a project-scoped role, not a cluster-wide one, where possible. Audit who can run `velero restore` — that command can resurrect deleted data, including secrets, in any namespace, which is a privilege-escalation path that auditors care about. Consider gating restore operations behind a CI/CD pipeline with approval rather than letting individual operators run them ad-hoc.
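One way to enforce the restore gate in RBAC, sketched under the assumption that restores are created only by a pipeline service account (role name and namespace are illustrative). Human operators get this role, which can create backups and schedules but only read restores:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: velero-backup-operator
  namespace: velero
rules:
  - apiGroups: ["velero.io"]
    resources: ["backups", "schedules"]
    verbs: ["create", "get", "list", "watch"]
  - apiGroups: ["velero.io"]
    resources: ["restores"]
    # read-only: restore creation is reserved for the CI/CD service account
    verbs: ["get", "list"]
```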
## Troubleshooting Failed Backups
When a backup ends in `PartiallyFailed`, run `velero backup describe <backup-name> --details` to see which resources or volumes failed, then `velero backup logs <backup-name>` for the underlying errors.