
MariaDB Galera Cluster High-Availability Setup on AlmaLinux 9 in 2026

If your business cannot afford to lose a single write or tolerate even a few minutes of database downtime, a single MariaDB instance is a time bomb. MariaDB Galera Cluster gives you synchronous multi-master replication across three or more nodes: every write lands on every node before the client sees success, any node can accept writes, and failure of any single node is invisible to the application. In 2026 MariaDB 11.4 LTS with Galera 4 has become the default pick for teams who want HA without paying for Oracle RAC or depending on managed cloud databases. This guide walks through a full three-node Galera cluster installation on AlmaLinux 9.

## How Galera Works

Galera is a wsrep-based replication provider that ships with MariaDB. Replication is certification-based: a transaction executes locally on one node, and at commit time its writeset is broadcast to all nodes in a total order; each node then independently certifies it against concurrent writesets. Certification is deterministic, so every node reaches the same verdict — if the writeset conflicts, the transaction rolls back everywhere. The result is true synchronous replication: no replica lag, and no split brain as long as you run at least three nodes.

The catch: commit latency is bounded by the slowest node and the network round-trip, because every transaction waits for its writeset to reach the whole cluster before the client sees success. Galera buys availability, not write throughput.
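When two nodes modify the same row concurrently, the first writeset to certify wins and the other client's COMMIT fails with a deadlock error that the application must catch and retry. A sketch of what the losing session sees (table and values are illustrative):

```sql
-- Session A on db01 and session B on db02, both with the same
-- uncommitted statement:
UPDATE accounts SET balance = balance - 10 WHERE id = 1;

-- A commits first and wins certification. B's COMMIT then returns:
-- ERROR 1213 (40001): Deadlock found when trying to get lock;
-- try restarting transaction
```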

## Prerequisites

Three AlmaLinux 9 hosts, each with 2+ vCPU and 4+ GB RAM, on the same low-latency network. Assign static IPs. This guide uses 10.0.0.11, 10.0.0.12, and 10.0.0.13. Synchronize time with chronyd β€” clock skew kills Galera:

```bash
sudo dnf install -y chrony
sudo systemctl enable --now chronyd
```

Open firewall ports 3306, 4567, 4568, and 4444 between all three nodes:

```bash
sudo firewall-cmd --permanent --add-port=3306/tcp
sudo firewall-cmd --permanent --add-port=4567/tcp
sudo firewall-cmd --permanent --add-port=4568/tcp
sudo firewall-cmd --permanent --add-port=4444/tcp
sudo firewall-cmd --permanent --add-port=4567/udp
sudo firewall-cmd --reload
```

Either put SELinux in permissive mode for the initial cluster bootstrap (`sudo setenforce 0`, re-enabling enforcing once the cluster is up) or install the `mariadb-server-galera` package, which includes the SELinux policy Galera needs.

## Installing MariaDB 11.4

Add the MariaDB repo for 11.4:

“`bash
```bash
sudo tee /etc/yum.repos.d/mariadb.repo <<EOF
[mariadb]
name = MariaDB
baseurl = https://rpm.mariadb.org/11.4/rhel/9/x86_64/
gpgkey = https://rpm.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck = 1
EOF
sudo dnf install -y MariaDB-server MariaDB-client galera-4
```

Repeat on all three nodes.

## Galera Configuration

On each node, create `/etc/my.cnf.d/60-galera.cnf`:

```ini
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name="acme-prod"
wsrep_cluster_address="gcomm://10.0.0.11,10.0.0.12,10.0.0.13"
wsrep_node_address="10.0.0.11"
wsrep_node_name="db01"
wsrep_sst_method=mariabackup
wsrep_sst_auth="sstuser:strongpass"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=2
bind-address=0.0.0.0
```

Change `wsrep_node_address` and `wsrep_node_name` on each node.

## Creating the SST User

State Snapshot Transfer (SST) uses mariabackup to copy data to joining nodes. Create this user on the first node before the second and third nodes attempt to join — the donor runs mariabackup with these credentials, and the SST fails without them:

```sql
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'strongpass';
GRANT PROCESS, RELOAD, LOCK TABLES, BINLOG MONITOR, REPLICA MONITOR ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```

## Bootstrapping the Cluster

On the first node only, bootstrap:

```bash
sudo galera_new_cluster
sudo systemctl status mariadb
```

Check cluster size:

```bash
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
```

It should report 1. Now start MariaDB normally on the second and third nodes:

```bash
sudo systemctl enable --now mariadb
```

Rerun the status query on the first node; it should now report 3. If it does not, check `/var/log/mariadb/mariadb.log` on the joining node — 90% of issues are firewall rules or `wsrep_cluster_address` typos.

## Load Balancing in Front

Direct all writes to a single node to minimize certification conflicts, but fail over to the others if it dies. HAProxy is the standard choice:

```
frontend mysql-front
    bind *:3306
    mode tcp
    default_backend mysql-back

backend mysql-back
    mode tcp
    option mysql-check user haproxy
    server db01 10.0.0.11:3306 check
    server db02 10.0.0.12:3306 check backup
    server db03 10.0.0.13:3306 check backup
```

For read scaling, point read-only traffic at a separate HAProxy backend that load-balances across all three.

## Testing Failure

Stop MariaDB on db02:

```bash
sudo systemctl stop mariadb
```

On db01, re-check `wsrep_cluster_size` — it is now 2. Applications continue. Start db02 again; it rejoins via IST (Incremental State Transfer), and cluster size goes back to 3. Stop two nodes simultaneously, and the survivor goes read-only to prevent split brain. That is correct behavior — Galera requires quorum.

## Backups

Even with three-node synchronous replication, you still need logical and physical backups. A dropped table replicates to every node in milliseconds. Schedule mariabackup nightly on one node:

```bash
mariabackup --backup --target-dir=/backup/$(date +%F) \
  --user=backup --password=...
mariabackup --prepare --target-dir=/backup/$(date +%F)
```

Ship the backup off-host to object storage with Restic or rclone.

## Monitoring Galera

Export metrics with `mysqld_exporter` and build a Grafana dashboard tracking `wsrep_cluster_size`, `wsrep_local_state_comment`, `wsrep_flow_control_paused`, and `wsrep_local_recv_queue_avg`. Alert on cluster size drops or flow control > 0.1.
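A minimal shell sketch of the kind of check that sits behind such an alert — it parses the two-column `SHOW STATUS` output and flags anything that is not `Synced`. The status variable name is real; the wrapper function and sample inputs are illustrative:

```shell
#!/usr/bin/env bash
# Feed this a "name value" line as produced by:
#   mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
check_wsrep_state() {
  local state
  state=$(echo "$1" | awk '{print $2}')
  if [ "$state" = "Synced" ]; then
    echo "OK: node is Synced"
  else
    echo "CRITICAL: node state is ${state:-unknown}"
    return 2
  fi
}

check_wsrep_state "wsrep_local_state_comment Synced"
check_wsrep_state "wsrep_local_state_comment Donor/Desynced" || true
```

Wire the non-zero exit code into Nagios, a systemd timer, or whatever alerting you already run.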

## Tuning for Low Latency

If your three nodes share a rack or datacenter, keep `wsrep_slave_threads` at 8–16 and `innodb_flush_log_at_trx_commit=2` for a good balance of durability and latency. For cross-AZ clusters, measure round-trip time and do not attempt synchronous replication over more than 10 ms latency.
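As a concrete starting point, a sketch of the relevant settings — the values are this article's suggestions, not universal defaults. Note that `wsrep_slave_threads` is named `wsrep_applier_threads` in recent MariaDB releases; the old name still works as an alias:

```ini
[galera]
wsrep_slave_threads = 8             # parallel applier threads; raise toward 16 under load
innodb_flush_log_at_trx_commit = 2  # flush to OS on commit, fsync roughly once per second
```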

## FAQ

**Can I run Galera on two nodes?** Technically yes, but any node failure will cause the survivor to go read-only. Three nodes is the minimum for real HA.

**Is Percona XtraDB Cluster different?** Same Galera engine under a different brand. Configuration is interchangeable.

**Does Galera support foreign keys and triggers?** Yes, with some caveats around DDL statements and non-deterministic triggers.

**What is the write performance hit?** Typical is 10–20% vs a standalone node. High-latency networks amplify this.

**Can I mix Galera versions across nodes?** Briefly during upgrades, yes. Not long term.

**Does Galera support row-based logical backups?** Yes — `mysqldump`, `mariadb-dump`, and mydumper all work normally. For consistent multi-table dumps use `--single-transaction` against any node.

**What is the right hardware for a Galera node?** Identical specs across all nodes, fast NVMe storage, 10 Gbps networking minimum, and at least 16 GB RAM for any production workload. Mixed-spec clusters slow down to the weakest node.

**Can I do a rolling upgrade across major versions?** Within the same major series (11.4.x to 11.4.y), yes. Across major versions (10.11 to 11.4), do a rolling upgrade with one node at a time, validating cluster health after each, and never run mixed for more than the upgrade window.

## Split Brain Prevention with Arbitrators

A two-node Galera cluster cannot survive a network partition because neither side has quorum. The galera arbitrator (`garbd`) is a lightweight daemon that participates in the quorum vote without storing data. Run it on a third small VM and a two-node “real” cluster behaves like a three-node one for quorum purposes:

```bash
sudo dnf install -y galera-4
garbd --address gcomm://10.0.0.11,10.0.0.12 --group acme-prod --daemon
```

This is the cheapest way to get HA on a budget that does not stretch to three full database servers.
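To survive reboots, run garbd under systemd rather than with `--daemon`. The galera package on RHEL-family systems ships a `garb` service that reads its settings from `/etc/sysconfig/garb`; a sketch, assuming that stock layout:

```
# /etc/sysconfig/garb
GALERA_NODES="10.0.0.11:4567 10.0.0.12:4567"
GALERA_GROUP="acme-prod"
```

Then enable it with `sudo systemctl enable --now garb`.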

## Schema Changes Without Downtime

DDL is the painful part of multi-master replication. By default Galera serializes ALTER TABLE across the cluster as a Total Order Isolation (TOI) operation, blocking writes everywhere for the duration. For large tables this is a multi-hour outage in disguise. Use Rolling Schema Upgrade (RSU) mode to apply DDL one node at a time:

```sql
SET SESSION wsrep_OSU_method = 'RSU';
ALTER TABLE orders ADD COLUMN region VARCHAR(16);
SET SESSION wsrep_OSU_method = 'TOI';
```

The node temporarily desyncs, applies the DDL locally, then rejoins. Repeat on each node. Combine with `pt-online-schema-change` from Percona Toolkit for the most demanding migrations β€” it builds a shadow table, copies data in chunks, and swaps atomically.

## Read Scaling and Connection Routing

Galera’s synchronous replication means every node has identical data. Send write traffic to a single node to minimize certification conflicts and load-balance reads across all three. A SQL-aware proxy like ProxySQL handles this natively with query routing rules:

```sql
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT', 20, 1), (2, 1, '^.*', 10, 1);
```

Hostgroup 10 is the writer pool with one node; hostgroup 20 is the reader pool with all three. ProxySQL handles failover automatically when the writer dies.
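The hostgroups themselves are populated through ProxySQL's admin interface (port 6032 by default). A sketch, with hostgroup numbers matching the rules above:

```sql
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES
  (10, '10.0.0.11', 3306),
  (20, '10.0.0.11', 3306),
  (20, '10.0.0.12', 3306),
  (20, '10.0.0.13', 3306);
LOAD MYSQL SERVERS TO RUNTIME;       SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;   SAVE MYSQL QUERY RULES TO DISK;
```

Both the servers and the query rules must be loaded to runtime before ProxySQL routes anything.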

## Encryption at Rest and in Transit

Modern compliance regimes require both. For replication encryption, configure SSL on each node:

```ini
[mysqld]
ssl-ca = /etc/mysql/ssl/ca.pem
ssl-cert = /etc/mysql/ssl/server-cert.pem
ssl-key = /etc/mysql/ssl/server-key.pem
require_secure_transport = ON

wsrep_provider_options = "socket.ssl=yes;socket.ssl_key=/etc/mysql/ssl/server-key.pem;socket.ssl_cert=/etc/mysql/ssl/server-cert.pem;socket.ssl_ca=/etc/mysql/ssl/ca.pem"
```

For data at rest, enable InnoDB tablespace encryption with a keyring plugin. The `file_key_management` plugin works with a local key file for testing; for production, use HashiCorp Vault or AWS KMS via the corresponding plugin.
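A sketch of the test-only `file_key_management` setup — the plugin and variable names are standard MariaDB options; the key file path is illustrative:

```ini
[mysqld]
plugin_load_add = file_key_management
file_key_management_filename = /etc/mysql/encryption/keyfile
innodb_encrypt_tables = ON
innodb_encrypt_log = ON
innodb_encryption_threads = 4
```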

## Operational Runbooks

Have written, tested procedures for the routine emergencies: the entire cluster going down, the writer node failing under load, a SST taking longer than expected, and a node refusing to rejoin after a crash. The most common production incident is “the cluster lost quorum at 3 a.m.” The recovery is `galera_new_cluster` on the most up-to-date node β€” but identifying which node has the highest `seqno` from the `grastate.dat` file is the step everyone forgets at 3 a.m. Document it, test it on a staging cluster monthly, and your real failures become routine.
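The seqno comparison is easy to script ahead of time. A sketch, assuming you have copied each node's `grastate.dat` somewhere reachable (the helper function and file paths are illustrative). Note that a node that crashed hard often records `seqno: -1`; recover its real position with `mariadbd --wsrep-recover` before comparing:

```shell
#!/usr/bin/env bash
# Print the grastate.dat file (i.e. the node) with the highest committed seqno.
highest_seqno() {
  local best_file="" best_seqno=-2
  local f seqno
  for f in "$@"; do
    # grastate.dat lines look like: "seqno:   1234"
    seqno=$(awk '/^seqno:/ {print $2}' "$f")
    seqno=${seqno:--1}
    if [ "$seqno" -gt "$best_seqno" ]; then
      best_seqno=$seqno
      best_file=$f
    fi
  done
  echo "$best_file seqno=$best_seqno"
}
```

Run it as `highest_seqno /tmp/grastate.db01 /tmp/grastate.db02 /tmp/grastate.db03` and bootstrap the node it names.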

## Performance Benchmarks

A baseline three-node Galera cluster on modern hardware (8 vCPU, 32 GB RAM, NVMe, 10 Gbps network) handles roughly 5,000–8,000 transactions per second with a typical OLTP mix. Read scaling is near-linear across nodes when read queries are distributed. Write throughput is bounded by the slowest node and the certification round-trip latency, so a cluster spanning two data centers 5 ms apart will commit substantially slower than one in a single rack. Benchmark with sysbench under your real workload before committing to a topology.
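A typical sysbench run against the writer node looks like this — database credentials, table counts, and durations are placeholders; `oltp_read_write` is one of sysbench's bundled workloads:

```bash
sysbench oltp_read_write \
  --mysql-host=10.0.0.11 --mysql-user=sbtest --mysql-password=... \
  --tables=10 --table-size=1000000 --threads=32 --time=300 \
  prepare
# then repeat with 'run' instead of 'prepare', and 'cleanup' when done
```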

About Ramesh Sundararamaiah

Red Hat Certified Architect
