Red Hat 2-Node Cluster Setup: Complete Guide with Pacemaker and Corosync
Introduction to High Availability Clustering in Red Hat Enterprise Linux
High availability (HA) clustering is essential for mission-critical applications that require maximum uptime. In this comprehensive guide, you’ll learn how to set up a 2-node cluster using Red Hat Enterprise Linux (RHEL) with Pacemaker and Corosync, the industry-standard cluster stack for Linux systems.
Table of Contents
- Introduction to High Availability Clustering in Red Hat Enterprise Linux
- Understanding Cluster Terminology
- What is a Cluster?
- Active vs Passive Cluster Configurations
- Quorum: The Cluster’s Decision-Making Mechanism
- Corosync: The Cluster Communication Layer
- Heartbeat: Keeping the Cluster Alive
- Pacemaker: The Cluster Resource Manager
- Fencing (STONITH): Shoot The Other Node In The Head
- Resources: Services Managed by the Cluster
- Constraints: Resource Placement Rules
- Prerequisites for 2-Node Cluster Setup
- Hardware Requirements
- Network Configuration
- Software Requirements
- Step 1: Prepare Both Nodes
- 1.1 Update System and Set Hostnames
- 1.2 Configure /etc/hosts
- 1.3 Test Network Connectivity
- 1.4 Configure Firewall
- 1.5 Set SELinux to Permissive (Testing Only)
- Step 2: Install Cluster Packages
- For Red Hat Enterprise Linux (RHEL)
- For CentOS Stream / Rocky Linux / AlmaLinux
- For CentOS 7 (Legacy)
- Troubleshooting Repository Issues
- Package Breakdown
- Step 3: Configure Cluster Authentication
- 3.1 Start and Enable pcsd Service
- 3.2 Set hacluster Password
- 3.3 Authenticate Cluster Nodes
- Step 4: Create the Cluster
- 4.1 Initialize the Cluster
- 4.2 Start the Cluster
- 4.3 Verify Cluster Status
- Step 5: Configure 2-Node Cluster Quorum
- 5.1 Disable Quorum Policy
- 5.2 Verify Two-Node Mode
- Step 6: Configure Fencing (STONITH)
- Option A: IPMI/iLO Fencing (Physical Servers)
- Option B: VMware Fencing (Virtual Machines)
- Option C: Libvirt/KVM Fencing (KVM Virtual Machines)
- 6.1 Enable STONITH
- 6.2 Test Fencing (Optional but Recommended)
- Step 7: Configure Cluster Resources
- 7.1 Create a Floating IP Resource
- 7.2 Add Apache Web Server Resource (Example)
- 7.3 Configure Resource Stickiness
- Step 8: Configure Resource Constraints
- 8.1 Location Constraints (Preferred Node)
- 8.2 View All Constraints
- Step 9: Testing Cluster Failover
- 9.1 Check Current Resource Location
- 9.2 Test Manual Failover
- 9.3 Test Node Failure
- 9.4 Verify Floating IP
- Step 10: Cluster Maintenance and Monitoring
- 10.1 Monitor Cluster Status
- 10.2 Cluster Logs
- 10.3 Enter Maintenance Mode
- 10.4 Backup Cluster Configuration
- Common Cluster Management Commands
- Resource Management
- Node Management
- Cluster Properties
- Troubleshooting Common Issues
- Issue 1: Nodes Cannot See Each Other
- Issue 2: Resources Won’t Start
- Issue 3: Quorum Lost
- Issue 4: Fencing Failures
- Issue 5: Split-Brain Scenario
- Advanced Configuration
- Configure Multiple Cluster Networks (Redundancy)
- Configure Resource Monitoring
- Best Practices for Production Clusters
- Conclusion
By the end of this tutorial, you’ll have a fully functional 2-node cluster capable of automatic failover, ensuring your services remain available even when one node fails.
Understanding Cluster Terminology
Before diving into the configuration, let’s understand the key concepts and components that make high availability clustering work.
What is a Cluster?
A cluster is a group of independent servers (nodes) working together as a single system to provide high availability, load balancing, or parallel processing. In an HA cluster, if one node fails, the cluster automatically transfers resources to surviving nodes, minimizing downtime.
Active vs Passive Cluster Configurations
Understanding the difference between active and passive clusters is crucial for designing the right high-availability solution:
Active/Passive Cluster (Failover Cluster)
What it is: In an active/passive configuration, only one node actively runs the application/service at any given time while the other node(s) remain in standby mode, ready to take over if the active node fails.
Characteristics:
- Single Active Node: Resources run on only one node at a time
- Standby Nodes: Passive nodes wait idle, consuming resources but not serving requests
- Automatic Failover: When the active node fails, passive node becomes active
- Resource Inefficiency: Standby hardware sits unused until needed
- Simple Configuration: Easier to set up and manage
- No Load Balancing: All traffic goes to the active node
Best for:
- Databases that don’t support clustering (single-master)
- Applications that can’t run on multiple nodes simultaneously
- Stateful applications with shared storage
- Legacy applications not designed for distributed architectures
Example Use Cases: Oracle Database, SAP systems, legacy ERP applications, file servers with exclusive locks
Note: The 2-node cluster configuration in this tutorial uses an active/passive model. The Apache web server and virtual IP will run on only one node at a time, automatically failing over to the standby node when needed.
Active/Active Cluster (Load Balancing Cluster)
What it is: In an active/active configuration, all nodes actively run the application simultaneously, sharing the workload and providing both high availability and load distribution.
Characteristics:
- All Nodes Active: Resources run on all nodes concurrently
- Load Distribution: Requests distributed across all nodes
- Better Resource Utilization: All hardware actively serving requests
- Higher Complexity: Requires application support for distributed operation
- Session Synchronization: May require shared state or session replication
- Scalability: Easy to add more nodes for capacity
Best for:
- Stateless web applications
- Load-balanced web servers (Nginx, Apache)
- Multi-master databases (PostgreSQL with replication, Galera MySQL)
- Distributed applications designed for clustering
- Microservices architectures
Example Use Cases: Web server farms, Galera MySQL clusters, Elasticsearch clusters, Redis clusters, Kubernetes
Comparison Table: Active/Passive vs Active/Active
| Feature | Active/Passive | Active/Active |
|---|---|---|
| Resource Usage | 50% (one node idle) | ~100% (all nodes working) |
| Complexity | Low to Medium | Medium to High |
| Load Balancing | No | Yes |
| Failover Time | 30 seconds – 2 minutes | Instant (no failover needed) |
| Cost Efficiency | Lower (wasted capacity) | Higher (full utilization) |
| Application Support | Works with any application | Requires cluster-aware apps |
| Best Use Case | Databases, legacy apps | Web servers, stateless services |
Important: This tutorial demonstrates an active/passive cluster configuration where resources (Virtual IP and Apache) run on one node at a time and fail over to the passive node when needed. This is the most common and reliable configuration for 2-node clusters, especially for applications that don’t support multi-master operation.
Quorum: The Cluster’s Decision-Making Mechanism
Quorum is the minimum number of nodes required to be online for the cluster to function properly. It prevents “split-brain” scenarios where network partitions cause multiple nodes to believe they’re the only active cluster.
Why Quorum Matters:
- Split-Brain Prevention: Ensures only one set of nodes can run resources at a time
- Data Integrity: Prevents multiple nodes from writing to shared storage simultaneously
- Voting Mechanism: Nodes vote to determine cluster state and resource placement
Quorum Formula: Required votes = (Total votes ÷ 2) + 1
For a 2-node cluster: (2 ÷ 2) + 1 = 2 votes required, which is problematic because losing one node means losing quorum. We’ll solve this in Step 5 using a special two-node configuration.
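Corosync implements this vote counting in its votequorum service, and two-node clusters rely on a special flag there. For reference, this is how it appears in the quorum section of /etc/corosync/corosync.conf on RHEL 8/9 (a sketch of what pcs generates; Step 5 covers it in practice):
quorum {
    provider: corosync_votequorum
    two_node: 1
}
With two_node: 1, the cluster stays quorate when one of the two nodes fails; the trade-off is that both nodes must be up the first time the cluster starts (wait_for_all behavior).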
Corosync: The Cluster Communication Layer
Corosync is the messaging layer that provides:
- Membership Management: Tracks which nodes are part of the cluster
- Message Passing: Enables communication between cluster nodes
- Quorum Calculation: Determines if the cluster has enough nodes to operate
- Configuration Synchronization: Distributes cluster configuration across nodes
Corosync sends heartbeat messages over UDP to maintain cluster membership; on RHEL 8/9 the default knet transport uses unicast, while the legacy udp transport also supports multicast.
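For reference, here is a trimmed sketch of the totem and nodelist sections that pcs generates for the example cluster in this guide on RHEL 8/9 (illustrative only; hand edits must be synced to all nodes):
totem {
    version: 2
    cluster_name: ha_cluster
    transport: knet
}
nodelist {
    node {
        ring0_addr: rhel-node1
        name: rhel-node1
        nodeid: 1
    }
    node {
        ring0_addr: rhel-node2
        name: rhel-node2
        nodeid: 2
    }
}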
Heartbeat: Keeping the Cluster Alive
Heartbeat refers to regular messages sent between cluster nodes to verify they’re still operational:
- Purpose: Detects node failures quickly
- Frequency: Typically sent every 1-2 seconds
- Timeout: If heartbeats stop, the node is considered failed
- Multiple Paths: Often sent over multiple network interfaces for redundancy
If a node misses several consecutive heartbeats, the cluster initiates failover procedures.
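In Corosync, this detection window is controlled by the token timeout in the totem section of corosync.conf. A quick way to inspect the live value (in milliseconds; defaults vary by release):
# Show the runtime token timeout from the Corosync key-value store
sudo corosync-cmapctl | grep totem.token
# To raise it, add e.g. 'token: 5000' to the totem block on every node,
# sync with 'pcs cluster sync', and restart the cluster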
Pacemaker: The Cluster Resource Manager
Pacemaker is the brain of the cluster, responsible for:
- Resource Management: Starting, stopping, and monitoring cluster resources
- Resource Placement: Deciding which node should run each resource
- Constraint Enforcement: Honoring location, ordering, and colocation rules
- Failover Orchestration: Moving resources when nodes fail
- Recovery Actions: Restarting failed resources or moving them to healthy nodes
Fencing (STONITH): Shoot The Other Node In The Head
Fencing is the most critical safety mechanism in clustering. When a node becomes unresponsive, the cluster must guarantee it’s truly offline before reassigning its resources.
STONITH (Shoot The Other Node In The Head) forcibly powers off or reboots unresponsive nodes to prevent:
- Split-Brain: Two nodes thinking they’re the primary
- Data Corruption: Multiple nodes accessing shared storage
- Resource Conflicts: Same service running on multiple nodes
Common Fencing Methods:
- Power Fencing: IPMI, iLO, DRAC, iDRAC (physically power off the node)
- Network Fencing: Disable switch ports
- Storage Fencing: Revoke storage access (SAN zoning)
- Virtual Fencing: VM hypervisor APIs (for virtual clusters)
Important: STONITH is mandatory in production clusters. A cluster without fencing is not a true high-availability cluster.
Resources: Services Managed by the Cluster
A resource is any service, application, or component managed by the cluster (pcs examples follow the list):
- Primitive Resources: Basic services (Apache, MySQL, IP addresses)
- Clone Resources: Services running on all nodes simultaneously
- Multi-state Resources: Services with master/slave roles (DRBD, PostgreSQL replication)
- Resource Groups: Multiple resources that move together as a unit
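A hedged sketch of how each type is expressed with pcs (the resource names here are hypothetical):
# Primitive: a single service instance
sudo pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24
# Clone: run copies of an existing resource (here named 'locking') on all nodes
sudo pcs resource clone locking
# Multi-state / promotable: a clone with promoted and unpromoted roles, e.g. DRBD
sudo pcs resource promotable drbd_data
# Group: members start, stop, and move together as a unit
sudo pcs resource group add webservice vip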
Constraints: Resource Placement Rules
Constraints define how and where resources run (see the examples after this list):
- Location Constraints: Prefer or avoid specific nodes
- Colocation Constraints: Keep resources together on the same node
- Order Constraints: Start/stop resources in specific sequences
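Step 8 demonstrates a location constraint; for completeness, here are hedged examples of the other two types, using the resource names created later in Step 7 (our webservice group already implies both behaviors for its members):
# Colocation: keep Apache on the same node as the virtual IP
sudo pcs constraint colocation add WebServer with ClusterIP INFINITY
# Order: bring up the virtual IP before starting Apache
sudo pcs constraint order ClusterIP then WebServer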
Prerequisites for 2-Node Cluster Setup
Hardware Requirements
- 2 Physical or Virtual Servers: RHEL 8 or RHEL 9
- RAM: Minimum 2GB per node (4GB+ recommended)
- Network: Dedicated network interface for cluster communication (recommended)
- Fencing Device: IPMI/iLO access or VM fence agent
- Shared Storage (Optional): For clustered filesystems or databases
Network Configuration
Example Environment:
| Component | Node 1 | Node 2 |
|---|---|---|
| Hostname | rhel-node1.example.com | rhel-node2.example.com |
| Management IP | 192.168.1.10 | 192.168.1.11 |
| Cluster IP | 10.0.0.10 | 10.0.0.11 |
| Virtual IP (VIP) | 192.168.1.100 (floats between nodes) | |
Software Requirements
- Red Hat Enterprise Linux 8.x or 9.x
- Active Red Hat subscription
- pcs (Pacemaker Configuration System)
- pacemaker
- corosync
- fence-agents
Step 1: Prepare Both Nodes
Execute these commands on both nodes unless specified otherwise.
1.1 Update System and Set Hostnames
# Update system packages
sudo dnf update -y
# Set hostname on Node 1
sudo hostnamectl set-hostname rhel-node1.example.com
# Set hostname on Node 2 (run on node 2 only)
sudo hostnamectl set-hostname rhel-node2.example.com
# Verify hostname
hostnamectl
1.2 Configure /etc/hosts
Add cluster nodes to /etc/hosts on both nodes:
sudo tee -a /etc/hosts <<EOF
# Management Network
192.168.1.10 rhel-node1.example.com rhel-node1
192.168.1.11 rhel-node2.example.com rhel-node2
# Cluster Network
10.0.0.10 rhel-node1-cluster
10.0.0.11 rhel-node2-cluster
EOF
1.3 Test Network Connectivity
# From node1, ping node2
ping -c 3 rhel-node2
# From node2, ping node1
ping -c 3 rhel-node1
# Test cluster network
ping -c 3 10.0.0.11 # From node1
ping -c 3 10.0.0.10 # From node2
1.4 Configure Firewall
Open required ports for cluster communication:
# Enable and start firewalld
sudo systemctl enable --now firewalld
# Add high availability service (includes Corosync and Pacemaker ports)
sudo firewall-cmd --permanent --add-service=high-availability
# Explicitly add ports:
# - 2224/tcp: pcsd web UI and node-to-node communication
# - 3121/tcp: Pacemaker remote
# - 5403/tcp: corosync-qnetd (quorum device)
# - 5404-5405/udp: Corosync
# - 21064/tcp: DLM (Distributed Lock Manager)
sudo firewall-cmd --permanent --add-port=2224/tcp
sudo firewall-cmd --permanent --add-port=3121/tcp
sudo firewall-cmd --permanent --add-port=5403/tcp
sudo firewall-cmd --permanent --add-port=5404-5405/udp
sudo firewall-cmd --permanent --add-port=21064/tcp
# Reload firewall
sudo firewall-cmd --reload
# Verify rules
sudo firewall-cmd --list-all
1.5 Set SELinux to Permissive (Testing Only)
Note: In production, keep SELinux enforcing and configure appropriate policies.
# Set SELinux to permissive (temporary)
sudo setenforce 0
# Make it permanent
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
# Verify
getenforce
Step 2: Install Cluster Packages
Install cluster software on both nodes:
For Red Hat Enterprise Linux (RHEL)
# Enable High Availability repository for RHEL 8
sudo subscription-manager repos --enable=rhel-8-for-x86_64-highavailability-rpms
# For RHEL 9:
sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
# Install cluster packages
sudo dnf install -y pcs pacemaker corosync fence-agents-all
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
For CentOS Stream / Rocky Linux / AlmaLinux
CentOS Stream, Rocky Linux, and AlmaLinux include HighAvailability packages in their base repositories:
# For CentOS Stream 8
sudo dnf config-manager --set-enabled ha
sudo dnf install -y pcs pacemaker corosync fence-agents-all
# For CentOS Stream 9 / Rocky Linux 9 / AlmaLinux 9
# HA packages are in the HighAvailability repository
sudo dnf install -y pcs pacemaker corosync fence-agents-all --enablerepo=highavailability
# Alternative: Enable PowerTools/CRB repository if needed
sudo dnf config-manager --set-enabled powertools # CentOS 8
sudo dnf config-manager --set-enabled crb # CentOS 9/Rocky/Alma
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
For CentOS 7 (Legacy)
# On RHEL 7, enable the High Availability repository:
sudo subscription-manager repos --enable=rhel-ha-for-rhel-7-server-rpms
# CentOS 7 ships the HA packages in its base repositories, so no extra repo is needed:
sudo yum install -y pcs pacemaker corosync fence-agents-all
# Verify installation
rpm -qa | grep -E 'pcs|pacemaker|corosync|fence'
Troubleshooting Repository Issues
If you encounter repository errors:
# List available repositories
sudo dnf repolist all
# For CentOS Stream, ensure you have the base repos
sudo dnf install -y centos-release-ha
# For Rocky Linux/AlmaLinux, HA packages are typically in base
# Check available groups
sudo dnf group list --available | grep -i "high\|availability"
# Install HA group (alternative method)
sudo dnf group install -y "High Availability" --nobest
Package Breakdown
- pcs: Command-line tool for cluster configuration and management
- pacemaker: Cluster resource manager that controls resource placement and failover
- corosync: Cluster communication engine for messaging and membership
- fence-agents-all: Complete collection of fencing/STONITH agents for various hardware platforms
Note: The exact repository names may vary between distributions. CentOS Stream 9, Rocky Linux 9, and AlmaLinux 9 typically include HA packages in their standard repositories without requiring additional configuration.
Step 3: Configure Cluster Authentication
3.1 Start and Enable pcsd Service
# Start pcsd daemon on both nodes
sudo systemctl start pcsd
sudo systemctl enable pcsd
# Verify service is running
sudo systemctl status pcsd
3.2 Set hacluster Password
The hacluster user is created during package installation. Set the same password on both nodes:
# Set password (use the same password on both nodes)
sudo passwd hacluster
# Example: Enter "RedHat123!" (use a strong password in production)
3.3 Authenticate Cluster Nodes
Run this on Node 1 only:
# Authenticate all cluster nodes
sudo pcs host auth rhel-node1 rhel-node2 -u hacluster
# Enter the password you set for hacluster
# You should see:
# rhel-node1: Authorized
# rhel-node2: Authorized
Step 4: Create the Cluster
4.1 Initialize the Cluster
Run on Node 1 only:
# Create cluster named "ha_cluster"
sudo pcs cluster setup ha_cluster rhel-node1 rhel-node2
# This command:
# 1. Generates Corosync configuration
# 2. Distributes configuration to all nodes
# 3. Prepares cluster for startup
4.2 Start the Cluster
# Start cluster on all nodes
sudo pcs cluster start --all
# Enable cluster to start at boot
sudo pcs cluster enable --all
# Verify cluster is running
sudo pcs cluster status
4.3 Verify Cluster Status
# Check overall cluster status
sudo pcs status
# Check Corosync membership
sudo pcs status corosync
# Check node status
sudo pcs status nodes
# Detailed cluster information
sudo crm_mon -1
Step 5: Configure 2-Node Cluster Quorum
By default, a 2-node cluster cannot maintain quorum if one node fails (2 of 2 votes are required). Two settings address this: the cluster’s no-quorum-policy and Corosync’s two-node mode.
5.1 Disable Quorum Policy
# Allow cluster to operate with 1 node
sudo pcs property set no-quorum-policy=ignore
# Verify configuration
sudo pcs property list --all | grep quorum
What this does: Allows the cluster to continue operating even when it loses quorum (i.e., when one node fails). This is safe for 2-node clusters with proper fencing configured.
5.2 Verify Two-Node Mode
On RHEL 8 and 9, pcs cluster setup automatically adds two_node: 1 to the quorum section of corosync.conf when a cluster has exactly two nodes, so there is no separate command to run here. While we are at it, disable STONITH temporarily; Step 6 re-enables it once fence devices are configured:
# Temporarily disable STONITH (re-enabled in Step 6 after fencing is configured)
sudo pcs property set stonith-enabled=false
# View all cluster properties
sudo pcs property
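You can confirm two-node mode from the Corosync side:
# Confirm two_node is set in the Corosync configuration
grep two_node /etc/corosync/corosync.conf
# Check runtime quorum state (the Flags line should include 2Node)
sudo corosync-quorumtool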
Step 6: Configure Fencing (STONITH)
Fencing is mandatory for production clusters. We’ll configure fence devices based on your environment.
Option A: IPMI/iLO Fencing (Physical Servers)
If your servers have IPMI or HP iLO:
# Install IPMI tools on both nodes
sudo dnf install -y ipmitool fence-agents-ipmilan
# Configure fence device for Node 1
sudo pcs stonith create fence_node1 fence_ipmilan \
pcmk_host_list="rhel-node1" \
ipaddr="192.168.1.20" \
login="admin" \
passwd="ipmi_password" \
lanplus=1 \
op monitor interval=60s
# Configure fence device for Node 2
sudo pcs stonith create fence_node2 fence_ipmilan \
pcmk_host_list="rhel-node2" \
ipaddr="192.168.1.21" \
login="admin" \
passwd="ipmi_password" \
lanplus=1 \
op monitor interval=60s
# Test fencing (this will reboot node2!)
# sudo stonith_admin --reboot rhel-node2
Option B: VMware Fencing (Virtual Machines)
For VMware environments:
# Install VMware fence agent
sudo dnf install -y fence-agents-vmware-rest
# Configure fence device
sudo pcs stonith create fence_vmware fence_vmware_rest \
ip="vcenter.example.com" \
ssl_insecure=1 \
username="administrator@vsphere.local" \
password="vcenter_password" \
pcmk_host_map="rhel-node1:VM-Node1;rhel-node2:VM-Node2" \
op monitor interval=60s
Option C: Libvirt/KVM Fencing (KVM Virtual Machines)
# For KVM/libvirt VMs
sudo pcs stonith create fence_kvm fence_virsh \
ip="192.168.1.5" \
login="root" \
identity_file="/root/.ssh/id_rsa" \
pcmk_host_map="rhel-node1:vm-node1;rhel-node2:vm-node2" \
op monitor interval=60s
6.1 Enable STONITH
# Enable STONITH in cluster
sudo pcs property set stonith-enabled=true
# Verify fencing configuration
sudo pcs stonith status
sudo pcs stonith config   # 'pcs stonith show' on older pcs releases
6.2 Test Fencing (Optional but Recommended)
# Test fence agent for node2 (this will reboot node2!)
sudo pcs stonith fence rhel-node2
# Watch cluster recover
watch -n 2 'sudo pcs status'
Step 7: Configure Cluster Resources
Let’s create a simple floating IP address resource that will move between nodes during failover.
7.1 Create a Floating IP Resource
# Create IPaddr2 resource (Virtual IP)
sudo pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 \
cidr_netmask=24 \
op monitor interval=10s \
--group webservice
# Verify resource is running
sudo pcs status resources
7.2 Add Apache Web Server Resource (Example)
# Install Apache on both nodes
sudo dnf install -y httpd
# Create test page on both nodes
sudo tee /var/www/html/index.html <<EOF
<html>
<head><title>HA Cluster Test</title></head>
<body>
<h1>High Availability Cluster</h1>
<p>Served from: $(hostname)</p>
</body>
</html>
EOF
# Allow Apache in firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --reload
# Disable Apache from starting at boot (cluster will manage it)
sudo systemctl disable httpd
# Create Apache resource in the same group
sudo pcs resource create WebServer ocf:heartbeat:apache \
configfile=/etc/httpd/conf/httpd.conf \
statusurl="http://127.0.0.1/server-status" \
op monitor interval=10s \
--group webservice
# The group ensures IP and Apache move together
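The apache resource agent’s monitor operation polls the statusurl defined above, so mod_status must answer on localhost. A minimal sketch, assuming the stock RHEL httpd layout (run on both nodes):
# Enable the server-status endpoint used by the agent's monitor operation
sudo tee /etc/httpd/conf.d/status.conf <<EOF
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF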
7.3 Configure Resource Stickiness
Prevent unnecessary failback when the preferred node recovers:
# Set resource stickiness (prefer to stay on current node)
sudo pcs resource defaults update resource-stickiness=100
# View all defaults
sudo pcs resource defaults
Step 8: Configure Resource Constraints
8.1 Location Constraints (Preferred Node)
# Prefer to run webservice on node1
sudo pcs constraint location webservice prefers rhel-node1=50
# View location constraints
sudo pcs constraint location config   # 'pcs constraint location show' on older pcs releases
8.2 View All Constraints
# Show all configured constraints
sudo pcs constraint config --full   # 'pcs constraint show --full' on older pcs releases
Step 9: Testing Cluster Failover
9.1 Check Current Resource Location
# See where resources are running
sudo pcs status resources
# Monitor cluster in real-time
sudo crm_mon -Afr1
9.2 Test Manual Failover
# Move resource to node2
sudo pcs resource move webservice rhel-node2
# Check status
sudo pcs status
# Clear the move constraint (allows resource to move back if needed)
sudo pcs resource clear webservice
9.3 Test Node Failure
Method 1: Stop Cluster on Active Node
# On the node currently running resources
sudo pcs cluster stop
# Watch failover from the other node
watch -n 2 'sudo pcs status'
# Start the cluster again
sudo pcs cluster start
Method 2: Put Node in Standby
# Put node1 in standby mode (resources will move)
sudo pcs node standby rhel-node1
# Verify resources moved
sudo pcs status
# Bring node back online
sudo pcs node unstandby rhel-node1
Method 3: Network Failure Simulation
# On node1, block cluster network (10.0.0.x)
sudo iptables -A INPUT -s 10.0.0.11 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.11 -j DROP
# Watch cluster detect failure and fence the node
# The node will be rebooted by fencing
# After recovery, remove only the test rules (iptables -F would flush every rule)
sudo iptables -D INPUT -s 10.0.0.11 -j DROP
sudo iptables -D OUTPUT -d 10.0.0.11 -j DROP
9.4 Verify Floating IP
# From external machine, ping the VIP
ping 192.168.1.100
# Access web server
curl http://192.168.1.100
# While monitoring, stop cluster on active node and verify VIP moves
Step 10: Cluster Maintenance and Monitoring
10.1 Monitor Cluster Status
# Real-time cluster monitoring
sudo crm_mon -Afr
# Cluster summary
sudo pcs status
# Node status
sudo pcs status nodes
# Resource status
sudo pcs status resources
# Show cluster configuration
sudo pcs config
10.2 Cluster Logs
# View Pacemaker logs
sudo journalctl -u pacemaker -f
# View Corosync logs
sudo journalctl -u corosync -f
# View pcsd logs
sudo journalctl -u pcsd -f
# Combined cluster logs
sudo tail -f /var/log/messages | grep -E 'corosync|pacemaker'
10.3 Enter Maintenance Mode
# Put cluster in maintenance mode (stops monitoring)
sudo pcs property set maintenance-mode=true
# Perform maintenance tasks...
# Exit maintenance mode
sudo pcs property set maintenance-mode=false
10.4 Backup Cluster Configuration
# Backup cluster configuration
sudo pcs config export pcs-commands | tee cluster-backup-$(date +%Y%m%d).txt
# Backup Corosync configuration
sudo cp /etc/corosync/corosync.conf /root/corosync.conf.backup
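pcs can also produce a restorable tarball of the entire cluster configuration, which complements the exports above:
# Create a restorable backup (restore later with 'pcs config restore')
sudo pcs config backup cluster-backup-$(date +%Y%m%d)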
Common Cluster Management Commands
Resource Management
# List all resources
sudo pcs resource status
# Show resource configuration
sudo pcs resource config ResourceName   # 'pcs resource show' on older pcs releases
# Start a resource
sudo pcs resource enable ResourceName
# Stop a resource
sudo pcs resource disable ResourceName
# Delete a resource
sudo pcs resource delete ResourceName
# Clean up failed resource
sudo pcs resource cleanup ResourceName
# Force resource to restart
sudo pcs resource restart ResourceName
Node Management
# Put node in standby
sudo pcs node standby rhel-node1
# Remove node from standby
sudo pcs node unstandby rhel-node1
# Remove node from cluster
sudo pcs cluster node remove rhel-node2
# Add node to cluster
sudo pcs cluster node add rhel-node3
Cluster Properties
# List all properties
sudo pcs property list
# Set property
sudo pcs property set property-name=value
# Unset property (restore default)
sudo pcs property unset property-name
Troubleshooting Common Issues
Issue 1: Nodes Cannot See Each Other
Symptoms: pcs status shows nodes as offline
Solutions:
# Check Corosync status
sudo pcs status corosync
# Verify network connectivity
ping rhel-node2
# Check firewall
sudo firewall-cmd --list-all
# Restart Corosync
sudo systemctl restart corosync
# Check Corosync logs
sudo journalctl -u corosync -n 50
Issue 2: Resources Won’t Start
Symptoms: Resources stuck in “Starting” or “Stopped” state
Solutions:
# Check resource details
sudo pcs resource status ResourceName
# View failed actions
sudo pcs status --full
# Cleanup resource
sudo pcs resource cleanup ResourceName
# Check resource agent logs
sudo grep ResourceName /var/log/messages
# Test the resource directly with verbose output from the resource agent
sudo pcs resource debug-start ResourceName
Issue 3: Quorum Lost
Symptoms: Cluster stops working when one node fails
Solutions:
# Verify quorum settings
sudo pcs property | grep quorum
# Set no-quorum-policy to ignore (2-node clusters)
sudo pcs property set no-quorum-policy=ignore
# Check current quorum status
sudo corosync-quorumtool
Issue 4: Fencing Failures
Symptoms: Node failures don’t trigger fencing
Solutions:
# Check STONITH status
sudo pcs stonith status
# Verify STONITH is enabled
sudo pcs property | grep stonith
# Test fence agent manually
sudo fence_ipmilan -a 192.168.1.20 -l admin -p password -o status
# Check fence device configuration
sudo pcs stonith config fence_node1   # 'show' on older pcs releases
# View fencing history
sudo stonith_admin --history rhel-node2
Issue 5: Split-Brain Scenario
Symptoms: Both nodes think they’re primary
Prevention:
- Always enable STONITH/fencing
- Use redundant cluster networks
- Configure quorum correctly
- Monitor cluster regularly
Recovery:
# Stop cluster on both nodes
sudo pcs cluster stop --all
# On the node whose view of the cluster is stale, back up and clear its local
# copy of the CIB (Cluster Information Base) so it re-syncs from the surviving node
sudo cp -a /var/lib/pacemaker/cib /root/cib.backup
sudo rm -f /var/lib/pacemaker/cib/*
# Start cluster on one node first
sudo pcs cluster start rhel-node1
# Wait for it to stabilize, then start second node
sudo pcs cluster start rhel-node2
Advanced Configuration
Configure Multiple Cluster Networks (Redundancy)
On RHEL 8/9, Corosync 3 uses the knet transport, which supports multiple redundant links per node. The simplest approach is to define the links when creating the cluster, giving each node one address per network:
# Create the cluster with two links (management and cluster networks from our example)
sudo pcs cluster setup ha_cluster \
rhel-node1 addr=192.168.1.10 addr=10.0.0.10 \
rhel-node2 addr=192.168.1.11 addr=10.0.0.11
For an existing cluster, add a second address to each node entry in /etc/corosync/corosync.conf:
sudo vi /etc/corosync/corosync.conf
# In each nodelist node block, add a second ring address, e.g.:
# ring1_addr: 172.16.0.10
# Sync configuration to all nodes
sudo pcs cluster sync
# Restart cluster
sudo pcs cluster stop --all
sudo pcs cluster start --all
Configure Resource Monitoring
# Add monitoring to existing resource
sudo pcs resource update WebServer \
op monitor interval=10s timeout=20s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s
# View resource operations
sudo pcs resource op defaults
Best Practices for Production Clusters
- Always Enable Fencing: Never run production clusters without STONITH
- Use Dedicated Cluster Network: Separate cluster traffic from application traffic
- Implement Redundant Networks: Multiple Corosync rings for reliability
- Monitor Continuously: Set up alerts for cluster events
- Test Failover Regularly: Verify failover works before you need it
- Document Everything: Keep detailed records of cluster configuration
- Use Resource Stickiness: Prevent unnecessary failback
- Implement Proper Backup: Regular cluster configuration backups
- Keep Software Updated: Apply security and stability patches
- Use NTP Time Synchronization: Essential for cluster timing (see the example below)
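For the NTP item, RHEL 8 and 9 use chrony by default; a minimal setup for both nodes:
# Install and enable chrony, then verify synchronization
sudo dnf install -y chrony
sudo systemctl enable --now chronyd
chronyc tracking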
Conclusion
You’ve successfully built a production-ready 2-node high availability cluster using Red Hat Enterprise Linux, Pacemaker, and Corosync. This cluster provides:
- Automatic Failover: Services move to healthy nodes when failures occur
- Resource Management: Centralized control of cluster services
- Fencing Protection: Prevents split-brain and data corruption
- High Availability: Minimized downtime for critical applications
Understanding cluster terminology like quorum, Corosync, heartbeat, Pacemaker, and fencing is essential for managing and troubleshooting your HA infrastructure. With this foundation, you can now expand your cluster to include databases, shared storage, and more complex resource configurations.
Next Steps:
- Add shared storage with GFS2 or OCFS2
- Configure database clustering (PostgreSQL, MySQL)
- Implement load balancing with HAProxy
- Set up monitoring with Nagios or Zabbix
- Explore multi-state resources for master/slave configurations
High availability clustering is a critical skill for Red Hat system administrators and DevOps engineers. Keep practicing failover scenarios, and always test thoroughly before deploying to production!