SteelEye LifeKeeper is a high availability clustering solution that provides application protection and failover capabilities for Linux and Windows servers. It ensures continuous service availability by automatically detecting failures and transferring resources to backup servers, minimizing downtime for critical business applications. This comprehensive guide covers essential SteelEye LifeKeeper commands, configuration tasks, and best practices for managing clustered environments.
📑 Table of Contents
- Introduction to SteelEye LifeKeeper
- Checking SteelEye Cluster Status
- Verify Running Services
- Increasing Heartbeat Values in SteelEye Cluster
- Step-by-Step Heartbeat Configuration
- Essential SteelEye Commands
- Stop and Start LifeKeeper Services
- Managing Individual Resources
- Email Notification Configuration
- Troubleshooting Corrupt Flags
- Generate Support Configuration
- Best Practices for SteelEye Management
- Regular Status Monitoring
- Heartbeat Tuning Guidelines
- Maintenance Windows
- Common SteelEye Status Codes
- Frequently Asked Questions
- What is the purpose of increasing heartbeat values in SteelEye?
- How do I verify that heartbeat changes have taken effect?
- What should I do if resources don’t come back online after lkstart?
- Can I adjust heartbeat values without restarting the cluster?
- What is the difference between perform_action with and without the -b flag?
- How often should I generate lksupport bundles?
- Why would I want to disable email notifications?
- What does a corrupt flag in SteelEye indicate?
- How can I monitor SteelEye status automatically?
- What are the risks of setting heartbeat values too high?
- Conclusion
Introduction to SteelEye LifeKeeper
SteelEye LifeKeeper (now part of SIOS Technology) is an enterprise-class high availability solution designed to eliminate single points of failure in IT infrastructure. By monitoring applications, services, and system resources across cluster nodes, LifeKeeper can detect failures within seconds and automatically restart services on healthy backup servers. This capability is crucial for mission-critical applications that require 99.99% uptime or better.
Checking SteelEye Cluster Status
Verify Running Services
To check if LifeKeeper services are currently running and in service:
# Check services that are running (ISP = In Service Primary)
/opt/LifeKeeper/bin/lcdstatus | grep -i ISP
# Check services that are stopped (OSU = Out of Service Unprotected)
/opt/LifeKeeper/bin/lcdstatus | grep -i OSU
# View complete cluster status
/opt/LifeKeeper/bin/lcdstatus
The lcdstatus command displays the current state of all protected resources across the cluster. Resources showing “ISP” are actively running and protected, while “OSU” indicates stopped or unprotected services that may require attention.
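The status checks above can be wrapped in a small helper. This is a minimal sketch, not part of LifeKeeper itself: `summarize_states` is a hypothetical function that tallies ISP and OSU tokens in whatever `lcdstatus` output you pass it.

```shell
# Hypothetical helper: tally resource states from lcdstatus output.
# On a real node: summarize_states "$(/opt/LifeKeeper/bin/lcdstatus)"
summarize_states() {
  local out="$1" isp osu
  isp=$(grep -ci 'ISP' <<<"$out")   # lines reporting In Service
  osu=$(grep -ci 'OSU' <<<"$out")   # lines reporting Out of Service
  echo "in-service: ${isp:-0}, out-of-service: ${osu:-0}"
}

# Demo against sample output lines (not real lcdstatus output):
summarize_states $'app1: ISP\ndb1: ISP\nweb1: OSU'
```

Anything other than zero in the out-of-service count warrants a closer look with the full `lcdstatus` listing.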
Increasing Heartbeat Values in SteelEye Cluster
Heartbeat configuration is critical for cluster stability. The heartbeat mechanism allows cluster nodes to monitor each other’s health. Adjusting heartbeat timing can prevent false failovers caused by network latency or temporary slowdowns.
Step-by-Step Heartbeat Configuration
1. Edit the LifeKeeper configuration file:
vim /etc/default/LifeKeeper
2. Add or modify heartbeat parameters at the end of the file:
LCMNUMHBEATS=6
LCMBEATTIME=5
These parameters control:
- LCMNUMHBEATS: Number of consecutive missed heartbeats before declaring a node failed (default is typically 3-4)
- LCMBEATTIME: Time in seconds between heartbeat messages (default is often 2 seconds)
With these settings, a node will be declared failed after missing 6 heartbeats at 5-second intervals (30 seconds total), reducing the likelihood of false failovers due to temporary network issues.
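The arithmetic behind that 30-second figure is simply the two parameters multiplied together:

```shell
# Worst-case failure-detection window implied by the settings above:
# LCMNUMHBEATS missed beats, spaced LCMBEATTIME seconds apart.
LCMNUMHBEATS=6
LCMBEATTIME=5
DETECTION=$((LCMNUMHBEATS * LCMBEATTIME))
echo "node declared failed after ~${DETECTION}s of silence"
```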
3. Restart LifeKeeper services to apply changes:
# Stop LifeKeeper services (force stop)
lkstop -f
# Start LifeKeeper services
lkstart
4. Validate that all resources are back in service:
# Check for resources "In Service"
/opt/LifeKeeper/bin/lcdstatus | grep -i ISP
# Verify no resources are "Out of Service"
/opt/LifeKeeper/bin/lcdstatus | grep -i OSU
If the second command shows any output, it indicates resources that failed to come back online and require investigation.
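That validation step can be expressed as a reusable check. The sketch below assumes the OSU token marks out-of-service resources, as described earlier; `all_in_service` is a hypothetical helper, not a LifeKeeper command.

```shell
# Hypothetical check: succeed only when the supplied lcdstatus output
# contains no out-of-service (OSU) entries.
all_in_service() {
  ! grep -qi 'OSU' <<<"$1"
}

# On a real node you might poll until the cluster settles, e.g.:
#   until all_in_service "$(/opt/LifeKeeper/bin/lcdstatus)"; do sleep 10; done
```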
Essential SteelEye Commands
Stop and Start LifeKeeper Services
Stop all LifeKeeper services:
lkstop -f
The -f flag forces a stop even if resources are currently in service.
Start all LifeKeeper services:
lkstart
Managing Individual Resources
Remove (stop) a specific resource hierarchy:
# Stop resource on all nodes
/opt/LifeKeeper/bin/perform_action -a remove -t <Service_Name>
# Stop resource on backup node only
/opt/LifeKeeper/bin/perform_action -a remove -t <Service_Name> -b
Restore (start) a specific resource hierarchy:
# Start resource on all nodes
/opt/LifeKeeper/bin/perform_action -a restore -t <Service_Name>
# Start resource on backup node only
/opt/LifeKeeper/bin/perform_action -a restore -t <Service_Name> -b
The -b flag limits the action to the backup server, useful when you need to bring resources online on the secondary node without affecting the primary.
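Combining the two actions gives a planned switchover: take the hierarchy out of service cluster-wide, then restore it on the backup only. The sketch below just prints the commands for review rather than running them; `app-hierarchy` is a placeholder resource tag, and the function itself is hypothetical.

```shell
# Sketch of a planned switchover using perform_action: remove the
# hierarchy everywhere, then restore it on the backup node only.
# "app-hierarchy" is a made-up tag; substitute your real resource name.
switch_to_backup() {
  local tag="$1"
  echo "/opt/LifeKeeper/bin/perform_action -a remove -t ${tag}"
  echo "/opt/LifeKeeper/bin/perform_action -a restore -t ${tag} -b"
}

switch_to_backup "app-hierarchy"
```

Printing the commands first is a deliberate safety choice: you can eyeball the tag before piping the output to `sh` on the cluster node.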
Email Notification Configuration
To stop LifeKeeper from sending email notifications (useful during maintenance):
# Edit configuration file
vim /etc/default/LifeKeeper
# Comment out the notification alias line
#LK_NOTIFY_ALIAS=admin@example.com,ops@example.com
# Verify notification configuration
/opt/LifeKeeper/bin/lk_confignotifyalias --query
The query command displays the current notification recipients. If properly disabled, it should show no email addresses.
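Commenting the line in and out by hand is error-prone during a busy maintenance window, so a pair of `sed` one-liners can help. These helpers are a sketch, not LifeKeeper tooling; test them on a copy before touching the real `/etc/default/LifeKeeper`.

```shell
# Sketch: toggle the LK_NOTIFY_ALIAS line in a config file with sed.
# Try it on a copy before touching /etc/default/LifeKeeper.
disable_lk_notify() {
  # Prefix the line with '#' so LifeKeeper ignores it.
  sed -i 's/^\(LK_NOTIFY_ALIAS=\)/#\1/' "$1"
}

enable_lk_notify() {
  # Strip the leading comment marker to restore notifications.
  sed -i 's/^#\(LK_NOTIFY_ALIAS=\)/\1/' "$1"
}
```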
Troubleshooting Corrupt Flags
If you encounter corrupt flag errors in SteelEye logs, check for corrupted resource files:
cd /opt/LifeKeeper/subsys/scsi/resources/netraid
ls -l | grep -i corrupt
Any files with “corrupt” in their name indicate resources that may need recreation or recovery from backup.
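Rather than deleting such files immediately, it can be safer to quarantine them for later analysis. The helper below is a sketch with an assumed quarantine location; only the netraid resource path comes from this article.

```shell
# Sketch: move "corrupt" flag files aside so they can be inspected (or
# attached to a support case) instead of being deleted outright.
# The destination directory is a made-up quarantine location.
quarantine_corrupt_flags() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  # Match any file in the directory whose name contains "corrupt".
  find "$src" -maxdepth 1 -iname '*corrupt*' -exec mv -t "$dest" {} +
}

# Example (paths per the article plus an assumed quarantine dir):
# quarantine_corrupt_flags /opt/LifeKeeper/subsys/scsi/resources/netraid \
#     /root/lk-quarantine
```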
Generate Support Configuration
When working with SteelEye support, they often request a complete system configuration bundle:
/opt/LifeKeeper/bin/lksupport
This command generates a compressed archive containing all LifeKeeper configuration files, logs, and system information needed for troubleshooting. The file can be uploaded to SIOS support for analysis.
Best Practices for SteelEye Management
Regular Status Monitoring
- Schedule automated checks of lcdstatus every 5-10 minutes
- Alert on any resources showing OSU (Out of Service) status
- Monitor heartbeat communications for packet loss
- Review LifeKeeper logs in /var/log regularly
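The monitoring advice above can be folded into a Nagios-style check. This is a sketch assuming the OSU/OSF tokens described in this article; the cron line and script path are illustrative, not standard LifeKeeper components.

```shell
# Nagios-style check sketch: exit code 2 signals CRITICAL.
# Wire real output in via cron, e.g. (illustrative path):
#   */5 * * * * /opt/LifeKeeper/bin/lcdstatus | /usr/local/bin/check_lk
check_lk() {
  local status="$1"
  if grep -Eqi 'OSU|OSF' <<<"$status"; then
    echo "CRITICAL: out-of-service LifeKeeper resources found"
    return 2
  fi
  echo "OK: all monitored resources in service"
  return 0
}
```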
Heartbeat Tuning Guidelines
- Increase LCMNUMHBEATS in environments with occasional network latency
- Keep LCMBEATTIME low (2-5 seconds) for faster failure detection
- Test heartbeat settings under normal and stressed conditions
- Document all heartbeat changes for future reference
Maintenance Windows
- Always test resource failover before maintenance windows
- Disable email notifications during planned maintenance
- Verify all resources return to ISP status after maintenance
- Keep detailed logs of all manual interventions
Common SteelEye Status Codes
Understanding LifeKeeper status codes helps with troubleshooting:
- ISP: In Service Primary – resource is running on the primary server
- ISB: In Service Backup – resource is running on the backup server
- OSU: Out of Service Unprotected – resource is down with no protection
- OSF: Out of Service Failed – resource failed and couldn’t restart
- OSP: Out of Service Protected – resource is intentionally stopped but protected
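A tiny lookup function makes these codes self-documenting in scripts. The expansions below are exactly the ones given in this article's glossary.

```shell
# Decoder for the LifeKeeper status codes listed above
# (expansions as given in this article).
lk_state_desc() {
  case "$1" in
    ISP) echo "In Service Primary" ;;
    ISB) echo "In Service Backup" ;;
    OSU) echo "Out of Service Unprotected" ;;
    OSF) echo "Out of Service Failed" ;;
    OSP) echo "Out of Service Protected" ;;
    *)   echo "unknown state: $1" ;;
  esac
}

lk_state_desc OSF
```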
Frequently Asked Questions
What is the purpose of increasing heartbeat values in SteelEye?
Increasing heartbeat values (LCMNUMHBEATS and LCMBEATTIME) prevents unnecessary failovers caused by temporary network delays or server slowdowns. By requiring more consecutive missed heartbeats before declaring a node failed, you reduce false positives while still maintaining high availability. This is especially important in environments with occasional network congestion or when running resource-intensive applications that may briefly delay heartbeat responses.
How do I verify that heartbeat changes have taken effect?
After modifying /etc/default/LifeKeeper and restarting services with lkstop -f and lkstart, verify changes by checking the LifeKeeper logs in /var/log. You can also monitor cluster behavior under load to ensure failover timing matches your new settings. The lcdstatus command should show all resources returning to ISP status, confirming the cluster is functioning properly with the new heartbeat configuration.
What should I do if resources don’t come back online after lkstart?
If resources remain in OSU or OSF status after restarting LifeKeeper, first check /var/log/lifekeeper.log for specific error messages. Common issues include network connectivity problems, application configuration errors, or filesystem issues. Use the perform_action -a restore command to attempt manual resource restoration. If problems persist, generate an lksupport bundle and review it for configuration inconsistencies or contact SIOS support.
Can I adjust heartbeat values without restarting the cluster?
No, changes to heartbeat parameters in /etc/default/LifeKeeper require a full LifeKeeper service restart (lkstop -f followed by lkstart) to take effect. This is because heartbeat timing is initialized when the LifeKeeper daemon starts. Plan heartbeat changes during maintenance windows to minimize service disruption, and ensure you have a backup node available to take over services during the restart.
What is the difference between perform_action with and without the -b flag?
Without the -b flag, perform_action affects the entire resource hierarchy across all nodes in the cluster. With the -b flag, the action is limited to the backup server only. Use -b when you need to start or stop resources on the secondary node without impacting the primary, such as when performing maintenance on one node or manually controlling where resources run.
How often should I generate lksupport bundles?
Generate lksupport bundles before and after major configuration changes, during troubleshooting sessions, or when requested by SIOS support. It’s good practice to save periodic lksupport bundles (monthly or quarterly) as configuration baselines. These bundles are invaluable for comparing configurations when issues arise and for documenting your cluster setup history.
Why would I want to disable email notifications?
Disable email notifications during planned maintenance windows to prevent alert fatigue from expected service transitions. When testing failover scenarios or performing cluster upgrades, temporary email disabling prevents flooding administrators with notifications for intentional state changes. Always remember to re-enable notifications after maintenance by uncommenting the LK_NOTIFY_ALIAS line and verifying with lk_confignotifyalias --query.
What does a corrupt flag in SteelEye indicate?
Corrupt flags in /opt/LifeKeeper/subsys/scsi/resources/netraid indicate that LifeKeeper’s internal resource state tracking files have become inconsistent or damaged. This can occur after unexpected system shutdowns, filesystem issues, or software bugs. Corrupt flags may prevent resources from starting properly. Resolution typically involves removing corrupt flag files and recreating the affected resource hierarchies from a known-good configuration.
How can I monitor SteelEye status automatically?
Create a monitoring script that runs lcdstatus regularly and alerts on any OSU, OSF, or unexpected ISB status codes. Many organizations integrate LifeKeeper monitoring with tools like Nagios, Zabbix, or Splunk by parsing lcdstatus output. SIOS also provides SNMP MIBs for integration with enterprise monitoring platforms. Automated monitoring ensures rapid detection and response to cluster issues before they impact users.
What are the risks of setting heartbeat values too high?
Setting LCMNUMHBEATS or LCMBEATTIME too high increases the time before the cluster detects and responds to actual node failures, extending service outage duration. For example, with LCMNUMHBEATS=10 and LCMBEATTIME=10, a true node failure won’t trigger failover for 100 seconds. Balance false-positive prevention against acceptable failover time based on your application’s RTO (Recovery Time Objective). Most environments target a total detection time between 15 and 30 seconds.
Conclusion
SteelEye LifeKeeper provides robust high availability clustering capabilities for enterprise Linux environments. By understanding essential commands for status checking, resource management, and heartbeat configuration, administrators can maintain reliable clustered services with minimal downtime. Proper heartbeat tuning prevents unnecessary failovers while ensuring rapid response to genuine failures. Regular monitoring, careful configuration management, and following best practices ensure that your SteelEye cluster continues to protect critical applications effectively.
Whether you’re performing routine maintenance, troubleshooting service issues, or optimizing cluster behavior, the commands and procedures outlined in this guide provide the foundation for successful SteelEye LifeKeeper administration.