Red Hat 2-Node Cluster Setup: Complete Guide with Pacemaker and Corosync
Introduction to High Availability Clustering in Red Hat Enterprise Linux
High availability (HA) clustering is essential for mission-critical applications that require maximum uptime. In this comprehensive guide, you’ll learn how to set up a 2-node cluster using Red Hat Enterprise Linux (RHEL) with Pacemaker and Corosync, the industry-standard cluster stack for Linux systems.
Table of Contents
- Introduction to High Availability Clustering in Red Hat Enterprise Linux
- Understanding Cluster Terminology
- What is a Cluster?
- Active vs Passive Cluster Configurations
- Quorum: The Cluster’s Decision-Making Mechanism
- Corosync: The Cluster Communication Layer
- Heartbeat: Keeping the Cluster Alive
- Pacemaker: The Cluster Resource Manager
- Fencing (STONITH): Shoot The Other Node In The Head
- Resources: Services Managed by the Cluster
- Constraints: Resource Placement Rules
- Prerequisites for 2-Node Cluster Setup
- Hardware Requirements
- Network Configuration
- Software Requirements
- Step 1: Prepare Both Nodes
- 1.1 Update System and Set Hostnames
- 1.2 Configure /etc/hosts
- 1.3 Test Network Connectivity
- 1.4 Configure Firewall
- 1.5 Set SELinux to Permissive (Testing Only)
- Step 2: Install Cluster Packages
- For Red Hat Enterprise Linux (RHEL)
- For CentOS Stream / Rocky Linux / AlmaLinux
- For CentOS 7 (Legacy)
- Troubleshooting Repository Issues
- Package Breakdown
- Step 3: Configure Cluster Authentication
- 3.1 Start and Enable pcsd Service
- 3.2 Set hacluster Password
- 3.3 Authenticate Cluster Nodes
- Step 4: Create the Cluster
- 4.1 Initialize the Cluster
- 4.2 Start the Cluster
- 4.3 Verify Cluster Status
- Step 5: Configure 2-Node Cluster Quorum
- 5.1 Disable Quorum Policy
- 5.2 Verify Two-Node Mode
- Step 6: Configure Fencing (STONITH)
- Option A: IPMI/iLO Fencing (Physical Servers)
- Option B: VMware Fencing (Virtual Machines)
- Option C: Libvirt/KVM Fencing (KVM Virtual Machines)
- 6.1 Enable STONITH
- 6.2 Test Fencing (Optional but Recommended)
- Step 7: Configure Cluster Resources
- 7.1 Create a Floating IP Resource
- 7.2 Add Apache Web Server Resource (Example)
- 7.3 Configure Resource Stickiness
- Step 8: Configure Resource Constraints
- 8.1 Location Constraints (Preferred Node)
- 8.2 View All Constraints
- Step 9: Testing Cluster Failover
- 9.1 Check Current Resource Location
- 9.2 Test Manual Failover
- 9.3 Test Node Failure
- 9.4 Verify Floating IP
- Step 10: Cluster Maintenance and Monitoring
- 10.1 Monitor Cluster Status
- 10.2 Cluster Logs
- 10.3 Enter Maintenance Mode
- 10.4 Backup Cluster Configuration
- Common Cluster Management Commands
- Resource Management
- Node Management
- Cluster Properties
- Troubleshooting Common Issues
- Issue 1: Nodes Cannot See Each Other
- Issue 2: Resources Won’t Start
- Issue 3: Quorum Lost
- Issue 4: Fencing Failures
- Issue 5: Split-Brain Scenario
- Advanced Configuration
- Configure Multiple Cluster Networks (Redundancy)
- Configure Resource Monitoring
- Best Practices for Production Clusters
- Conclusion
By the end of this tutorial, you’ll have a fully functional 2-node cluster capable of automatic failover, ensuring your services remain available even when one node fails.
Understanding Cluster Terminology
Before diving into the configuration, let’s understand the key concepts and components that make high availability clustering work.
What is a Cluster?
A cluster is a group of independent servers (nodes) working together as a single system to provide high availability, load balancing, or parallel processing. In an HA cluster, if one node fails, the cluster automatically transfers resources to surviving nodes, minimizing downtime.
Active vs Passive Cluster Configurations
Understanding the difference between active and passive clusters is crucial for designing the right high-availability solution:
Active/Passive Cluster (Failover Cluster)
What it is: In an active/passive configuration, only one node actively runs the application/service at any given time while the other node(s) remain in standby mode, ready to take over if the active node fails.
Characteristics:
- Single Active Node: Resources run on only one node at a time
- Standby Nodes: Passive nodes wait idle, consuming resources but not serving requests
- Automatic Failover: When the active node fails, passive node becomes active
- Resource Inefficiency: Standby hardware sits unused until needed
- Simple Configuration: Easier to set up and manage
- No Load Balancing: All traffic goes to the active node
Best for:
- Databases that don’t support clustering (single-master)
- Applications that can’t run on multiple nodes simultaneously
- Stateful applications with shared storage
- Legacy applications not designed for distributed architectures
Example Use Cases: Oracle Database, SAP systems, legacy ERP applications, file servers with exclusive locks
Note: The 2-node cluster configuration in this tutorial uses an active/passive model. The Apache web server and virtual IP will run on only one node at a time, automatically failing over to the standby node when needed.
Active/Active Cluster (Load Balancing Cluster)
What it is: In an active/active configuration, all nodes actively run the application simultaneously, sharing the workload and providing both high availability and load distribution.
Characteristics:
- All Nodes Active: Resources run on all nodes concurrently
- Load Distribution: Requests distributed across all nodes
- Better Resource Utilization: All hardware actively serving requests
- Higher Complexity: Requires application support for distributed operation
- Session Synchronization: May require shared state or session replication
- Scalability: Easy to add more nodes for capacity
Best for:
- Stateless web applications
- Load-balanced web servers (Nginx, Apache)
- Multi-master databases (PostgreSQL with replication, Galera MySQL)
- Distributed applications designed for clustering
- Microservices architectures
Example Use Cases: Web server farms, Galera MySQL clusters, Elasticsearch clusters, Redis clusters, Kubernetes
Comparison Table: Active/Passive vs Active/Active
| Feature | Active/Passive | Active/Active |
|---|---|---|
| Resource Usage | 50% (one node idle) | ~100% (all nodes working) |
| Complexity | Low to Medium | Medium to High |
| Load Balancing | No | Yes |
| Failover Time | 30 seconds – 2 minutes | Instant (no failover needed) |
| Cost Efficiency | Lower (wasted capacity) | Higher (full utilization) |
| Application Support | Works with any application | Requires cluster-aware apps |
| Best Use Case | Databases, legacy apps | Web servers, stateless services |
Important: This tutorial demonstrates an active/passive cluster configuration where resources (Virtual IP and Apache) run on one node at a time and fail over to the passive node when needed. This is the most common and reliable configuration for 2-node clusters, especially for applications that don’t support multi-master operation.
Quorum: The Cluster’s Decision-Making Mechanism
Quorum is the minimum number of nodes required to be online for the cluster to function properly. It prevents “split-brain” scenarios where network partitions cause multiple nodes to believe they’re the only active cluster.
Why Quorum Matters:
- Split-Brain Prevention: Ensures only one set of nodes can run resources at a time
- Data Integrity: Prevents multiple nodes from writing to shared storage simultaneously
- Voting Mechanism: Nodes vote to determine cluster state and resource placement
Quorum Formula: Required votes = (Total votes ÷ 2) + 1
For a 2-node cluster: (2 ÷ 2) + 1 = 2 votes required, which is problematic because losing one node means losing quorum. We’ll solve this in Step 5 using a special two-node configuration.
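Corosync implements this vote counting in its votequorum service, and two-node clusters rely on a special flag there. For reference, this is how it appears in the quorum section of /etc/corosync/corosync.conf on RHEL 8/9 (a sketch of what pcs generates; Step 5 covers it in practice):
quorum {
    provider: corosync_votequorum
    two_node: 1
}
With two_node: 1, the cluster stays quorate when one of the two nodes fails; the trade-off is that both nodes must be up the first time the cluster starts (wait_for_all behavior).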
Corosync: The Cluster Communication Layer
Corosync is the messaging layer that provides:
- Membership Management: Tracks which nodes are part of the cluster
- Message Passing: Enables communication between cluster nodes
- Quorum Calculation: Determines if the cluster has enough nodes to operate
- Configuration Synchronization: Distributes cluster configuration across nodes
Corosync sends heartbeat messages over UDP to maintain cluster membership; on RHEL 8/9 the default knet transport uses unicast, while the legacy udp transport also supports multicast.
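For reference, here is a trimmed sketch of the totem and nodelist sections that pcs generates for the example cluster in this guide on RHEL 8/9 (illustrative only; hand edits must be synced to all nodes):
totem {
    version: 2
    cluster_name: ha_cluster
    transport: knet
}
nodelist {
    node {
        ring0_addr: rhel-node1
        name: rhel-node1
        nodeid: 1
    }
    node {
        ring0_addr: rhel-node2
        name: rhel-node2
        nodeid: 2
    }
}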
Heartbeat: Keeping the Cluster Alive
Heartbeat refers to regular messages sent between cluster nodes to verify they’re still operational:
- Purpose: Detects node failures quickly
- Frequency: Typically sent every 1-2 seconds
- Timeout: If heartbeats stop, the node is considered failed
- Multiple Paths: Often sent over multiple network interfaces for redundancy
If a node misses several consecutive heartbeats, the cluster initiates failover procedures.
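In Corosync, this detection window is controlled by the token timeout in the totem section of corosync.conf. A quick way to inspect the live value (in milliseconds; defaults vary by release):
# Show the runtime token timeout from the Corosync key-value store
sudo corosync-cmapctl | grep totem.token
# To raise it, add e.g. 'token: 5000' to the totem block on every node,
# sync with 'pcs cluster sync', and restart the cluster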
Pacemaker: The Cluster Resource Manager
Pacemaker is the brain of the cluster, responsible for:
- Resource Management: Starting, stopping, and monitoring cluster resources
- Resource Placement: Deciding which node should run each resource
- Constraint Enforcement: Honoring location, ordering, and colocation rules
- Failover Orchestration: Moving resources when nodes fail
- Recovery Actions: Restarting failed resources or moving them to healthy nodes
Fencing (STONITH): Shoot The Other Node In The Head
Fencing is the most critical safety mechanism in clustering. When a node becomes unresponsive, the cluster must guarantee it’s truly offline before reassigning its resources.
STONITH (Shoot The Other Node In The Head) forcibly powers off or reboots unresponsive nodes to prevent:
- Split-Brain: Two nodes thinking they’re the primary
- Data Corruption: Multiple nodes accessing shared storage
- Resource Conflicts: Same service running on multiple nodes
Common Fencing Methods:
- Power Fencing: IPMI, iLO, DRAC, iDRAC (physically power off the node)
- Network Fencing: Disable switch ports
- Storage Fencing: Revoke storage access (SAN zoning)
- Virtual Fencing: VM hypervisor APIs (for virtual clusters)
Important: STONITH is mandatory in production clusters. A cluster without fencing is not a true high-availability cluster.
Resources: Services Managed by the Cluster
A resource is any service, application, or component managed by the cluster (pcs examples follow the list):
- Primitive Resources: Basic services (Apache, MySQL, IP addresses)
- Clone Resources: Services running on all nodes simultaneously
- Multi-state Resources: Services with master/slave roles (DRBD, PostgreSQL replication)
- Resource Groups: Multiple resources that move together as a unit
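A hedged sketch of how each type is expressed with pcs (the resource names here are hypothetical):
# Primitive: a single service instance
sudo pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24
# Clone: run copies of an existing resource (here named 'locking') on all nodes
sudo pcs resource clone locking
# Multi-state / promotable: a clone with promoted and unpromoted roles, e.g. DRBD
sudo pcs resource promotable drbd_data
# Group: members start, stop, and move together as a unit
sudo pcs resource group add webservice vip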
Constraints: Resource Placement Rules
Constraints define how and where resources run (see the examples after this list):
- Location Constraints: Prefer or avoid specific nodes
- Colocation Constraints: Keep resources together on the same node
- Order Constraints: Start/stop resources in specific sequences
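Step 8 demonstrates a location constraint; for completeness, here are hedged examples of the other two types, using the resource names created later in Step 7 (our webservice group already implies both behaviors for its members):
# Colocation: keep Apache on the same node as the virtual IP
sudo pcs constraint colocation add WebServer with ClusterIP INFINITY
# Order: bring up the virtual IP before starting Apache
sudo pcs constraint order ClusterIP then WebServer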
Prerequisites for 2-Node Cluster Setup
Hardware Requirements
- 2 Physical or Virtual Servers: RHEL 8 or RHEL 9
- RAM: Minimum 2GB per node (4GB+ recommended)
- Network: Dedicated network interface for cluster communication (recommended)
- Fencing Device: IPMI/iLO access or VM fence agent
- Shared Storage (Optional): For clustered filesystems or databases
Network Configuration
Example Environment:
| Component | Node 1 | Node 2 |
|---|---|---|
| Hostname | rhel-node1.example.com | rhel-node2.example.com |
| Management IP | 192.168.1.10 | 192.168.1.11 |
| Cluster IP | 10.0.0.10 | 10.0.0.11 |
| Virtual IP (VIP) | 192.168.1.100 (floats between nodes) | |
Software Requirements
- Red Hat Enterprise Linux 8.x or 9.x
- Active Red Hat subscription
- pcs (Pacemaker Configuration System)
- pacemaker
- corosync
- fence-agents
Step 1: Prepare Both Nodes
Execute these commands on both nodes unless specified otherwise.
1.1 Update System and Set Hostnames
# Update system packages
sudo dnf update -y
# Set hostname on Node 1
sudo hostnamectl set-hostname rhel-node1.example.com
# Set hostname on Node 2 (run on node 2 only)
sudo hostnamectl set-hostname rhel-node2.example.com
# Verify hostname
hostnamectl
1.2 Configure /etc/hosts
Add cluster nodes to /etc/hosts on both nodes:
sudo tee -a /etc/hosts <<EOF
# Management Network
192.168.1.10 rhel-node1.example.com rhel-node1
192.168.1.11 rhel-node2.example.com rhel-node2
# Cluster Network
10.0.0.10 rhel-node1-cluster
10.0.0.11 rhel-node2-cluster
EOF
1.3 Test Network Connectivity
# From node1, ping node2
ping -c 3 rhel-node2
# From node2, ping node1
ping -c 3 rhel-node1
# Test cluster network
ping -c 3 10.0.0.11 # From node1
ping -c 3 10.0.0.10 # From node2
1.4 Configure Firewall
Open required ports for cluster communication:
# Enable and start firewalld
sudo systemctl enable --now firewalld
# Add high availability service (includes Corosync and Pacemaker ports)
sudo firewall-cmd --permanent --add-service=high-availability
# Explicitly add ports:
# - 2224/tcp: pcsd web UI and node-to-node communication
# - 3121/tcp: Pacemaker remote
# - 5403/tcp: corosync-qnetd (quorum device)
# - 5404-5405/udp: Corosync
# - 21064/tcp: DLM (Distributed Lock Manager)
sudo firewall-cmd --permanent --add-port=2224/tcp
sudo firewall-cmd --permanent --add-port=3121/tcp
sudo firewall-cmd --permanent --add-port=5403/tcp
sudo firewall-cmd --permanent --add-port=5404-5405/udp
sudo firewall-cmd --permanent --add-port=21064/tcp
# Reload firewall
sudo firewall-cmd --reload
# Verify rules
sudo firewall-cmd --list-all
1.5 Set SELinux to Permissive (Testing Only)
Note: In production, keep SELinux enforcing and configure appropriate policies.
# Set SELinux to permissive (temporary)
sudo setenforce 0
# Make it permanent
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
# Verify
getenforce
Step 2: Install Cluster Packages
Install cluster software on both nodes:
For Red Hat Enterprise Linux (RHEL)
# Enable High Availability repository for RHEL 8
sudo subscription-manager repos --enable=rhel-8-for-x86_64-highavailability-rpms
# For RHEL 9:
sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
# Install cluster packages
sudo dnf install -y pcs pacemaker corosync fence-agents-all
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
For CentOS Stream / Rocky Linux / AlmaLinux
CentOS Stream, Rocky Linux, and AlmaLinux include HighAvailability packages in their base repositories:
# For CentOS Stream 8
sudo dnf config-manager --set-enabled ha
sudo dnf install -y pcs pacemaker corosync fence-agents-all
# For CentOS Stream 9 / Rocky Linux 9 / AlmaLinux 9
# HA packages are in the HighAvailability repository
sudo dnf install -y pcs pacemaker corosync fence-agents-all --enablerepo=highavailability
# Alternative: Enable PowerTools/CRB repository if needed
sudo dnf config-manager --set-enabled powertools # CentOS 8
sudo dnf config-manager --set-enabled crb # CentOS 9/Rocky/Alma
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
For CentOS 7 (Legacy)
# On RHEL 7, enable the High Availability repository:
sudo subscription-manager repos --enable=rhel-ha-for-rhel-7-server-rpms
# CentOS 7 ships the HA packages in its base repositories, so no extra repo is needed:
sudo yum install -y pcs pacemaker corosync fence-agents-all
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
Troubleshooting Repository Issues
If you encounter repository errors:
# List available repositories
sudo dnf repolist all
# For CentOS Stream, ensure you have the base repos
sudo dnf install -y centos-release-ha
# For Rocky Linux/AlmaLinux, HA packages are typically in base
# Check available groups
sudo dnf group list --available | grep -i "high\|availability"
# Install HA group (alternative method)
sudo dnf group install -y "High Availability" --nobest
Package Breakdown
- pcs: Command-line tool for cluster configuration and management
- pacemaker: Cluster resource manager that controls resource placement and failover
- corosync: Cluster communication engine for messaging and membership
- fence-agents-all: Complete collection of fencing/STONITH agents for various hardware platforms
Note: The exact repository names may vary between distributions. CentOS Stream 9, Rocky Linux 9, and AlmaLinux 9 typically include HA packages in their standard repositories without requiring additional configuration.
Step 3: Configure Cluster Authentication
3.1 Start and Enable pcsd Service
# Start pcsd daemon on both nodes
sudo systemctl start pcsd
sudo systemctl enable pcsd
# Verify service is running
sudo systemctl status pcsd
3.2 Set hacluster Password
The hacluster user is created during package installation. Set the same password on both nodes:
# Set password (use the same password on both nodes)
sudo passwd hacluster
# Example: Enter "RedHat123!" (use a strong password in production)
3.3 Authenticate Cluster Nodes
Run this on Node 1 only:
# Authenticate all cluster nodes
sudo pcs host auth rhel-node1 rhel-node2 -u hacluster
# Enter the password you set for hacluster
# You should see:
# rhel-node1: Authorized
# rhel-node2: Authorized
Step 4: Create the Cluster
4.1 Initialize the Cluster
Run on Node 1 only:
# Create cluster named "ha_cluster"
sudo pcs cluster setup ha_cluster rhel-node1 rhel-node2
# This command:
# 1. Generates Corosync configuration
# 2. Distributes configuration to all nodes
# 3. Prepares cluster for startup
4.2 Start the Cluster
# Start cluster on all nodes
sudo pcs cluster start --all
# Enable cluster to start at boot
sudo pcs cluster enable --all
# Verify cluster is running
sudo pcs cluster status
4.3 Verify Cluster Status
# Check overall cluster status
sudo pcs status
# Check Corosync membership
sudo pcs status corosync
# Check node status
sudo pcs status nodes
# Detailed cluster information
sudo crm_mon -1
Step 5: Configure 2-Node Cluster Quorum
By default, a 2-node cluster cannot maintain quorum if one node fails (2 of 2 votes are required). Two settings address this: the cluster’s no-quorum-policy and Corosync’s two-node mode.
5.1 Disable Quorum Policy
# Allow cluster to operate with 1 node
sudo pcs property set no-quorum-policy=ignore
# Verify configuration
sudo pcs property list --all | grep quorum
What this does: Allows the cluster to continue operating even when it loses quorum (i.e., when one node fails). This is safe for 2-node clusters with proper fencing configured.
5.2 Verify Two-Node Mode
On RHEL 8 and 9, pcs cluster setup automatically adds two_node: 1 to the quorum section of corosync.conf when a cluster has exactly two nodes, so there is no separate command to run here. While we are at it, disable STONITH temporarily; Step 6 re-enables it once fence devices are configured:
# Temporarily disable STONITH (re-enabled in Step 6 after fencing is configured)
sudo pcs property set stonith-enabled=false
# View all cluster properties
sudo pcs property
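You can confirm two-node mode from the Corosync side:
# Confirm two_node is set in the Corosync configuration
grep two_node /etc/corosync/corosync.conf
# Check runtime quorum state (the Flags line should include 2Node)
sudo corosync-quorumtool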
Step 6: Configure Fencing (STONITH)
Fencing is mandatory for production clusters. We’ll configure fence devices based on your environment.
Option A: IPMI/iLO Fencing (Physical Servers)
If your servers have IPMI or HP iLO:
# Install IPMI tools on both nodes
sudo dnf install -y ipmitool fence-agents-ipmilan
# Configure fence device for Node 1
sudo pcs stonith create fence_node1 fence_ipmilan \
pcmk_host_list="rhel-node1" \
ipaddr="192.168.1.20" \
login="admin" \
passwd="ipmi_password" \
lanplus=1 \
op monitor interval=60s
# Configure fence device for Node 2
sudo pcs stonith create fence_node2 fence_ipmilan \
pcmk_host_list="rhel-node2" \
ipaddr="192.168.1.21" \
login="admin" \
passwd="ipmi_password" \
lanplus=1 \
op monitor interval=60s
# Test fencing (this will reboot node2!)
# sudo stonith_admin --reboot rhel-node2
Option B: VMware Fencing (Virtual Machines)
For VMware environments:
# Install VMware fence agent
sudo dnf install -y fence-agents-vmware-rest
# Configure fence device
sudo pcs stonith create fence_vmware fence_vmware_rest \
ip="vcenter.example.com" \
ssl_insecure=1 \
username="administrator@vsphere.local" \
password="vcenter_password" \
pcmk_host_map="rhel-node1:VM-Node1;rhel-node2:VM-Node2" \
op monitor interval=60s
Option C: Libvirt/KVM Fencing (KVM Virtual Machines)
# For KVM/libvirt VMs
sudo pcs stonith create fence_kvm fence_virsh \
ip="192.168.1.5" \
login="root" \
identity_file="/root/.ssh/id_rsa" \
pcmk_host_map="rhel-node1:vm-node1;rhel-node2:vm-node2" \
op monitor interval=60s
6.1 Enable STONITH
# Enable STONITH in cluster
sudo pcs property set stonith-enabled=true
# Verify fencing configuration
sudo pcs stonith status
sudo pcs stonith config   # 'pcs stonith show' on older pcs releases
6.2 Test Fencing (Optional but Recommended)
# Test fence agent for node2 (this will reboot node2!)
sudo pcs stonith fence rhel-node2
# Watch cluster recover
watch -n 2 'sudo pcs status'
Step 7: Configure Cluster Resources
Let’s create a simple floating IP address resource that will move between nodes during failover.
7.1 Create a Floating IP Resource
# Create IPaddr2 resource (Virtual IP)
sudo pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 \
cidr_netmask=24 \
op monitor interval=10s \
--group webservice
# Verify resource is running
sudo pcs status resources
7.2 Add Apache Web Server Resource (Example)
# Install Apache on both nodes
sudo dnf install -y httpd
# Create test page on both nodes
sudo tee /var/www/html/index.html <<EOF
<html>
<head><title>HA Cluster Test</title></head>
<body>
<h1>High Availability Cluster</h1>
<p>Served from: $(hostname)</p>
</body>
</html>
EOF
# Allow Apache in firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload
# Disable Apache from starting at boot (cluster will manage it)
sudo systemctl disable httpd
# Create Apache resource in the same group
sudo pcs resource create WebServer ocf:heartbeat:apache \
configfile=/etc/httpd/conf/httpd.conf \
statusurl="http://127.0.0.1/server-status" \
op monitor interval=10s \
--group webservice
# The group ensures IP and Apache move together
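The apache resource agent’s monitor operation polls the statusurl defined above, so mod_status must answer on localhost. A minimal sketch, assuming the stock RHEL httpd layout (run on both nodes):
# Enable the server-status endpoint used by the agent's monitor operation
sudo tee /etc/httpd/conf.d/status.conf <<EOF
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF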
7.3 Configure Resource Stickiness
Prevent unnecessary failback when the preferred node recovers:
# Set resource stickiness (prefer to stay on current node)
sudo pcs resource defaults update resource-stickiness=100
# View all defaults
sudo pcs resource defaults
Step 8: Configure Resource Constraints
8.1 Location Constraints (Preferred Node)
# Prefer to run webservice on node1
sudo pcs constraint location webservice prefers rhel-node1=50
# View location constraints
sudo pcs constraint location config   # 'pcs constraint location show' on older pcs releases
8.2 View All Constraints
# Show all configured constraints
sudo pcs constraint config --full   # 'pcs constraint show --full' on older pcs releases
Step 9: Testing Cluster Failover
9.1 Check Current Resource Location
# See where resources are running
sudo pcs status resources
# Monitor cluster in real-time
sudo crm_mon -Afr1
9.2 Test Manual Failover
# Move resource to node2
sudo pcs resource move webservice rhel-node2
# Check status
sudo pcs status
# Clear the move constraint (allows resource to move back if needed)
sudo pcs resource clear webservice
9.3 Test Node Failure
Method 1: Stop Cluster on Active Node
# On the node currently running resources
sudo pcs cluster stop
# Watch failover from the other node
watch -n 2 'sudo pcs status'
# Start the cluster again
sudo pcs cluster start
Method 2: Put Node in Standby
# Put node1 in standby mode (resources will move)
sudo pcs node standby rhel-node1
# Verify resources moved
sudo pcs status
# Bring node back online
sudo pcs node unstandby rhel-node1
Method 3: Network Failure Simulation
# On node1, block cluster network (10.0.0.x)
sudo iptables -A INPUT -s 10.0.0.11 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.11 -j DROP
# Watch cluster detect failure and fence the node
# The node will be rebooted by fencing
# After recovery, remove only the test rules (iptables -F would flush every rule)
sudo iptables -D INPUT -s 10.0.0.11 -j DROP
sudo iptables -D OUTPUT -d 10.0.0.11 -j DROP
9.4 Verify Floating IP
# From external machine, ping the VIP
ping 192.168.1.100
# Access web server
curl http://192.168.1.100
# While monitoring, stop cluster on active node and verify VIP moves
Step 10: Cluster Maintenance and Monitoring
10.1 Monitor Cluster Status
# Real-time cluster monitoring
sudo crm_mon -Afr
# Cluster summary
sudo pcs status
# Node status
sudo pcs status nodes
# Resource status
sudo pcs status resources
# Show cluster configuration
sudo pcs config
10.2 Cluster Logs
# View Pacemaker logs
sudo journalctl -u pacemaker -f
# View Corosync logs
sudo journalctl -u corosync -f
# View pcsd logs
sudo journalctl -u pcsd -f
# Combined cluster logs
sudo tail -f /var/log/messages | grep -E 'corosync|pacemaker'
10.3 Enter Maintenance Mode
# Put cluster in maintenance mode (stops monitoring)
sudo pcs property set maintenance-mode=true
# Perform maintenance tasks...
# Exit maintenance mode
sudo pcs property set maintenance-mode=false
10.4 Backup Cluster Configuration
# Backup cluster configuration
sudo pcs config export pcs-commands | tee cluster-backup-$(date +%Y%m%d).txt
# Backup Corosync configuration
sudo cp /etc/corosync/corosync.conf /root/corosync.conf.backup
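pcs can also produce a restorable tarball of the entire cluster configuration, which complements the exports above:
# Create a restorable backup (restore later with 'pcs config restore')
sudo pcs config backup cluster-backup-$(date +%Y%m%d)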
Common Cluster Management Commands
Resource Management
# List all resources
sudo pcs resource status
# Show resource configuration
sudo pcs resource config ResourceName   # 'pcs resource show' on older pcs releases
# Start a resource
sudo pcs resource enable ResourceName
# Stop a resource
sudo pcs resource disable ResourceName
# Delete a resource
sudo pcs resource delete ResourceName
# Clean up failed resource
sudo pcs resource cleanup ResourceName
# Force resource to restart
sudo pcs resource restart ResourceName
Node Management
# Put node in standby
sudo pcs node standby rhel-node1
# Remove node from standby
sudo pcs node unstandby rhel-node1
# Remove node from cluster
sudo pcs cluster node remove rhel-node2
# Add node to cluster
sudo pcs cluster node add rhel-node3
Cluster Properties
# List all properties
sudo pcs property list
# Set property
sudo pcs property set property-name=value
# Unset property (restore default)
sudo pcs property unset property-name
Troubleshooting Common Issues
Issue 1: Nodes Cannot See Each Other
Symptoms: pcs status shows nodes as offline
Solutions:
# Check Corosync status
sudo pcs status corosync
# Verify network connectivity
ping rhel-node2
# Check firewall
sudo firewall-cmd --list-all
# Restart Corosync
sudo systemctl restart corosync
# Check Corosync logs
sudo journalctl -u corosync -n 50
Issue 2: Resources Won’t Start
Symptoms: Resources stuck in “Starting” or “Stopped” state
Solutions:
# Check resource details
sudo pcs resource status ResourceName
# View failed actions
sudo pcs status --full
# Cleanup resource
sudo pcs resource cleanup ResourceName
# Check resource agent logs
sudo grep ResourceName /var/log/messages
# Test the resource directly with verbose output from the resource agent
sudo pcs resource debug-start ResourceName
Issue 3: Quorum Lost
Symptoms: Cluster stops working when one node fails
Solutions:
# Verify quorum settings
sudo pcs property | grep quorum
# Set no-quorum-policy to ignore (2-node clusters)
sudo pcs property set no-quorum-policy=ignore
# Check current quorum status
sudo corosync-quorumtool
Issue 4: Fencing Failures
Symptoms: Node failures don’t trigger fencing
Solutions:
# Check STONITH status
sudo pcs stonith status
# Verify STONITH is enabled
sudo pcs property | grep stonith
# Test fence agent manually
sudo fence_ipmilan -a 192.168.1.20 -l admin -p password -o status
# Check fence device configuration
sudo pcs stonith config fence_node1   # 'show' on older pcs releases
# View fencing history
sudo stonith_admin --history rhel-node2
Issue 5: Split-Brain Scenario
Symptoms: Both nodes think they’re primary
Prevention:
- Always enable STONITH/fencing
- Use redundant cluster networks
- Configure quorum correctly
- Monitor cluster regularly
Recovery:
# Stop cluster on both nodes
sudo pcs cluster stop --all
# On the node whose view of the cluster is stale, back up and clear its local
# copy of the CIB (Cluster Information Base) so it re-syncs from the surviving node
sudo cp -a /var/lib/pacemaker/cib /root/cib.backup
sudo rm -f /var/lib/pacemaker/cib/*
# Start cluster on one node first
sudo pcs cluster start rhel-node1
# Wait for it to stabilize, then start second node
sudo pcs cluster start rhel-node2
Advanced Configuration
Configure Multiple Cluster Networks (Redundancy)
On RHEL 8/9, Corosync 3 uses the knet transport, which supports multiple redundant links per node. The simplest approach is to define the links when creating the cluster, giving each node one address per network:
# Create the cluster with two links (management and cluster networks from our example)
sudo pcs cluster setup ha_cluster \
rhel-node1 addr=192.168.1.10 addr=10.0.0.10 \
rhel-node2 addr=192.168.1.11 addr=10.0.0.11
For an existing cluster, add a second address to each node entry in /etc/corosync/corosync.conf:
sudo vi /etc/corosync/corosync.conf
# In each nodelist node block, add a second ring address, e.g.:
# ring1_addr: 172.16.0.10
# Sync configuration to all nodes
sudo pcs cluster sync
# Restart cluster
sudo pcs cluster stop --all
sudo pcs cluster start --all
Configure Resource Monitoring
# Add monitoring to existing resource
sudo pcs resource update WebServer \
op monitor interval=10s timeout=20s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s
# View resource operations
sudo pcs resource op defaults
Best Practices for Production Clusters
- Always Enable Fencing: Never run production clusters without STONITH
- Use Dedicated Cluster Network: Separate cluster traffic from application traffic
- Implement Redundant Networks: Multiple Corosync rings for reliability
- Monitor Continuously: Set up alerts for cluster events
- Test Failover Regularly: Verify failover works before you need it
- Document Everything: Keep detailed records of cluster configuration
- Use Resource Stickiness: Prevent unnecessary failback
- Implement Proper Backup: Regular cluster configuration backups
- Keep Software Updated: Apply security and stability patches
- Use NTP Time Synchronization: Essential for cluster timing (see the example below)
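For the NTP item, RHEL 8 and 9 use chrony by default; a minimal setup for both nodes:
# Install and enable chrony, then verify synchronization
sudo dnf install -y chrony
sudo systemctl enable --now chronyd
chronyc tracking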
Conclusion
You’ve successfully built a production-ready 2-node high availability cluster using Red Hat Enterprise Linux, Pacemaker, and Corosync. This cluster provides:
- Automatic Failover: Services move to healthy nodes when failures occur
- Resource Management: Centralized control of cluster services
- Fencing Protection: Prevents split-brain and data corruption
- High Availability: Minimized downtime for critical applications
Understanding cluster terminology like quorum, Corosync, heartbeat, Pacemaker, and fencing is essential for managing and troubleshooting your HA infrastructure. With this foundation, you can now expand your cluster to include databases, shared storage, and more complex resource configurations.
Next Steps:
- Add shared storage with GFS2 or OCFS2
- Configure database clustering (PostgreSQL, MySQL)
- Implement load balancing with HAProxy
- Set up monitoring with Nagios or Zabbix
- Explore multi-state resources for master/slave configurations
High availability clustering is a critical skill for Red Hat system administrators and DevOps engineers. Keep practicing failover scenarios, and always test thoroughly before deploying to production!