SteelEye LifeKeeper is a high availability clustering solution that provides application protection and failover capabilities for Linux and Windows servers. It ensures continuous service availability by automatically detecting failures and transferring resources to backup servers, minimizing downtime for critical business applications. This comprehensive guide covers essential SteelEye LifeKeeper commands, configuration tasks, and best practices for managing clustered environments.
📑 Table of Contents
- Introduction to SteelEye LifeKeeper
- Checking SteelEye Cluster Status
- Verify Running Services
- Increasing Heartbeat Values in SteelEye Cluster
- Step-by-Step Heartbeat Configuration
- Essential SteelEye Commands
- Stop and Start LifeKeeper Services
- Managing Individual Resources
- Email Notification Configuration
- Troubleshooting Corrupt Flags
- Generate Support Configuration
- Best Practices for SteelEye Management
- Regular Status Monitoring
- Heartbeat Tuning Guidelines
- Maintenance Windows
- Common SteelEye Status Codes
- Frequently Asked Questions
- What is the purpose of increasing heartbeat values in SteelEye?
- How do I verify that heartbeat changes have taken effect?
- What should I do if resources don’t come back online after lkstart?
- Can I adjust heartbeat values without restarting the cluster?
- What is the difference between perform_action with and without the -b flag?
- How often should I generate lksupport bundles?
- Why would I want to disable email notifications?
- What does a corrupt flag in SteelEye indicate?
- How can I monitor SteelEye status automatically?
- What are the risks of setting heartbeat values too high?
- Conclusion
Introduction to SteelEye LifeKeeper
SteelEye LifeKeeper (now part of SIOS Technology) is an enterprise-class high availability solution designed to eliminate single points of failure in IT infrastructure. By monitoring applications, services, and system resources across cluster nodes, LifeKeeper can detect failures within seconds and automatically restart services on healthy backup servers. This capability is crucial for mission-critical applications that require 99.99% uptime or better.
Checking SteelEye Cluster Status
Verify Running Services
To check if LifeKeeper services are currently running and in service:
# Check services that are running (ISP = In Service Primary)
/opt/LifeKeeper/bin/lcdstatus | grep -i ISP
# Check services that are stopped (OSU = Out of Service Unprotected)
/opt/LifeKeeper/bin/lcdstatus | grep -i OSU
# View complete cluster status
/opt/LifeKeeper/bin/lcdstatus
The lcdstatus command displays the current state of all protected resources across the cluster. Resources showing “ISP” are actively running and protected, while “OSU” indicates stopped or unprotected services that may require attention.
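The status checks above can be wrapped in a small helper. This is a minimal sketch, not part of LifeKeeper itself: `summarize_states` is a hypothetical function that tallies ISP and OSU tokens in whatever `lcdstatus` output you pass it.

```shell
# Hypothetical helper: tally resource states from lcdstatus output.
# On a real node: summarize_states "$(/opt/LifeKeeper/bin/lcdstatus)"
summarize_states() {
  local out="$1" isp osu
  isp=$(grep -ci 'ISP' <<<"$out")   # lines reporting In Service
  osu=$(grep -ci 'OSU' <<<"$out")   # lines reporting Out of Service
  echo "in-service: ${isp:-0}, out-of-service: ${osu:-0}"
}

# Demo against sample output lines (not real lcdstatus output):
summarize_states $'app1: ISP\ndb1: ISP\nweb1: OSU'
```

Anything other than zero in the out-of-service count warrants a closer look with the full `lcdstatus` listing.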
Increasing Heartbeat Values in SteelEye Cluster
Heartbeat configuration is critical for cluster stability. The heartbeat mechanism allows cluster nodes to monitor each other’s health. Adjusting heartbeat timing can prevent false failovers caused by network latency or temporary slowdowns.
Step-by-Step Heartbeat Configuration
1. Edit the LifeKeeper configuration file:
vim /etc/default/LifeKeeper
2. Add or modify heartbeat parameters at the end of the file:
LCMNUMHBEATS=6
LCMBEATTIME=5
These parameters control:
- LCMNUMHBEATS: Number of consecutive missed heartbeats before declaring a node failed (default is typically 3-4)
- LCMBEATTIME: Time in seconds between heartbeat messages (default is often 2 seconds)
With these settings, a node will be declared failed after missing 6 heartbeats at 5-second intervals (30 seconds total), reducing the likelihood of false failovers due to temporary network issues.
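The arithmetic behind that 30-second figure is simply the two parameters multiplied together:

```shell
# Worst-case failure-detection window implied by the settings above:
# LCMNUMHBEATS missed beats, spaced LCMBEATTIME seconds apart.
LCMNUMHBEATS=6
LCMBEATTIME=5
DETECTION=$((LCMNUMHBEATS * LCMBEATTIME))
echo "node declared failed after ~${DETECTION}s of silence"
```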
3. Restart LifeKeeper services to apply changes:
# Stop LifeKeeper services (force stop)
lkstop -f
# Start LifeKeeper services
lkstart
4. Validate that all resources are back in service:
# Check for resources "In Service"
/opt/LifeKeeper/bin/lcdstatus | grep -i ISP
# Verify no resources are "Out of Service"
/opt/LifeKeeper/bin/lcdstatus | grep -i OSU
If the second command shows any output, it indicates resources that failed to come back online and require investigation.
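That validation step can be expressed as a reusable check. The sketch below assumes the OSU token marks out-of-service resources, as described earlier; `all_in_service` is a hypothetical helper, not a LifeKeeper command.

```shell
# Hypothetical check: succeed only when the supplied lcdstatus output
# contains no out-of-service (OSU) entries.
all_in_service() {
  ! grep -qi 'OSU' <<<"$1"
}

# On a real node you might poll until the cluster settles, e.g.:
#   until all_in_service "$(/opt/LifeKeeper/bin/lcdstatus)"; do sleep 10; done
```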
Essential SteelEye Commands
Stop and Start LifeKeeper Services
Stop all LifeKeeper services:
lkstop -f
The -f flag forces a stop even if resources are currently in service.
Start all LifeKeeper services:
lkstart
Managing Individual Resources
Remove (stop) a specific resource hierarchy:
# Stop resource on all nodes
/opt/LifeKeeper/bin/perform_action -a remove -t <Service_Name>
# Stop resource on backup node only
/opt/LifeKeeper/bin/perform_action -a remove -t <Service_Name> -b
Restore (start) a specific resource hierarchy:
# Start resource on all nodes
/opt/LifeKeeper/bin/perform_action -a restore -t <Service_Name>
# Start resource on backup node only
/opt/LifeKeeper/bin/perform_action -a restore -t <Service_Name> -b
The -b flag limits the action to the backup server, useful when you need to bring resources online on the secondary node without affecting the primary.
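Combining the two actions gives a planned switchover: take the hierarchy out of service cluster-wide, then restore it on the backup only. The sketch below just prints the commands for review rather than running them; `app-hierarchy` is a placeholder resource tag, and the function itself is hypothetical.

```shell
# Sketch of a planned switchover using perform_action: remove the
# hierarchy everywhere, then restore it on the backup node only.
# "app-hierarchy" is a made-up tag; substitute your real resource name.
switch_to_backup() {
  local tag="$1"
  echo "/opt/LifeKeeper/bin/perform_action -a remove -t ${tag}"
  echo "/opt/LifeKeeper/bin/perform_action -a restore -t ${tag} -b"
}

switch_to_backup "app-hierarchy"
```

Printing the commands first is a deliberate safety choice: you can eyeball the tag before piping the output to `sh` on the cluster node.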
Email Notification Configuration
To stop LifeKeeper from sending email notifications (useful during maintenance):
# Edit configuration file
vim /etc/default/LifeKeeper
# Comment out the notification alias line
#LK_NOTIFY_ALIAS=admin@example.com,ops@example.com
# Verify notification configuration
/opt/LifeKeeper/bin/lk_confignotifyalias --query
The query command displays the current notification recipients. If properly disabled, it should show no email addresses.
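Commenting the line in and out by hand is error-prone during a busy maintenance window, so a pair of `sed` one-liners can help. These helpers are a sketch, not LifeKeeper tooling; test them on a copy before touching the real `/etc/default/LifeKeeper`.

```shell
# Sketch: toggle the LK_NOTIFY_ALIAS line in a config file with sed.
# Try it on a copy before touching /etc/default/LifeKeeper.
disable_lk_notify() {
  # Prefix the line with '#' so LifeKeeper ignores it.
  sed -i 's/^\(LK_NOTIFY_ALIAS=\)/#\1/' "$1"
}

enable_lk_notify() {
  # Strip the leading comment marker to restore notifications.
  sed -i 's/^#\(LK_NOTIFY_ALIAS=\)/\1/' "$1"
}
```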
Troubleshooting Corrupt Flags
If you encounter corrupt flag errors in SteelEye logs, check for corrupted resource files:
cd /opt/LifeKeeper/subsys/scsi/resources/netraid
ls -l | grep -i corrupt
Any files with “corrupt” in their name indicate resources that may need recreation or recovery from backup.
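Rather than deleting such files immediately, it can be safer to quarantine them for later analysis. The helper below is a sketch with an assumed quarantine location; only the netraid resource path comes from this article.

```shell
# Sketch: move "corrupt" flag files aside so they can be inspected (or
# attached to a support case) instead of being deleted outright.
# The destination directory is a made-up quarantine location.
quarantine_corrupt_flags() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  # Match any file in the directory whose name contains "corrupt".
  find "$src" -maxdepth 1 -iname '*corrupt*' -exec mv -t "$dest" {} +
}

# Example (paths per the article plus an assumed quarantine dir):
# quarantine_corrupt_flags /opt/LifeKeeper/subsys/scsi/resources/netraid \
#     /root/lk-quarantine
```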
Generate Support Configuration
When working with SteelEye support, they often request a complete system configuration bundle:
/opt/LifeKeeper/bin/lksupport
This command generates a compressed archive containing all LifeKeeper configuration files, logs, and system information needed for troubleshooting. The file can be uploaded to SIOS support for analysis.
Best Practices for SteelEye Management
Regular Status Monitoring
- Schedule automated checks of lcdstatus every 5-10 minutes
- Alert on any resources showing OSU (Out of Service) status
- Monitor heartbeat communications for packet loss
- Review LifeKeeper logs in /var/log regularly
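The monitoring advice above can be folded into a Nagios-style check. This is a sketch assuming the OSU/OSF tokens described in this article; the cron line and script path are illustrative, not standard LifeKeeper components.

```shell
# Nagios-style check sketch: exit code 2 signals CRITICAL.
# Wire real output in via cron, e.g. (illustrative path):
#   */5 * * * * /opt/LifeKeeper/bin/lcdstatus | /usr/local/bin/check_lk
check_lk() {
  local status="$1"
  if grep -Eqi 'OSU|OSF' <<<"$status"; then
    echo "CRITICAL: out-of-service LifeKeeper resources found"
    return 2
  fi
  echo "OK: all monitored resources in service"
  return 0
}
```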
Heartbeat Tuning Guidelines
- Increase LCMNUMHBEATS in environments with occasional network latency
- Keep LCMBEATTIME low (2-5 seconds) for faster failure detection
- Test heartbeat settings under normal and stressed conditions
- Document all heartbeat changes for future reference
Maintenance Windows
- Always test resource failover before maintenance windows
- Disable email notifications during planned maintenance
- Verify all resources return to ISP status after maintenance
- Keep detailed logs of all manual interventions
Common SteelEye Status Codes
Understanding LifeKeeper status codes helps with troubleshooting:
- ISP: In Service Primary – resource is running on the primary server
- ISB: In Service Backup – resource is running on the backup server
- OSU: Out of Service Unprotected – resource is down with no protection
- OSF: Out of Service Failed – resource failed and couldn’t restart
- OSP: Out of Service Protected – resource is intentionally stopped but protected
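A tiny lookup function makes these codes self-documenting in scripts. The expansions below are exactly the ones given in this article's glossary.

```shell
# Decoder for the LifeKeeper status codes listed above
# (expansions as given in this article).
lk_state_desc() {
  case "$1" in
    ISP) echo "In Service Primary" ;;
    ISB) echo "In Service Backup" ;;
    OSU) echo "Out of Service Unprotected" ;;
    OSF) echo "Out of Service Failed" ;;
    OSP) echo "Out of Service Protected" ;;
    *)   echo "unknown state: $1" ;;
  esac
}

lk_state_desc OSF
```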
Frequently Asked Questions
What is the purpose of increasing heartbeat values in SteelEye?
Increasing heartbeat values (LCMNUMHBEATS and LCMBEATTIME) prevents unnecessary failovers caused by temporary network delays or server slowdowns. By requiring more consecutive missed heartbeats before declaring a node failed, you reduce false positives while still maintaining high availability. This is especially important in environments with occasional network congestion or when running resource-intensive applications that may briefly delay heartbeat responses.
How do I verify that heartbeat changes have taken effect?
After modifying /etc/default/LifeKeeper and restarting services with lkstop -f and lkstart, verify changes by checking the LifeKeeper logs in /var/log. You can also monitor cluster behavior under load to ensure failover timing matches your new settings. The lcdstatus command should show all resources returning to ISP status, confirming the cluster is functioning properly with the new heartbeat configuration.
What should I do if resources don’t come back online after lkstart?
If resources remain in OSU or OSF status after restarting LifeKeeper, first check /var/log/lifekeeper.log for specific error messages. Common issues include network connectivity problems, application configuration errors, or filesystem issues. Use the perform_action -a restore command to attempt manual resource restoration. If problems persist, generate an lksupport bundle and review it for configuration inconsistencies or contact SIOS support.
Can I adjust heartbeat values without restarting the cluster?
No, changes to heartbeat parameters in /etc/default/LifeKeeper require a full LifeKeeper service restart (lkstop -f followed by lkstart) to take effect. This is because heartbeat timing is initialized when the LifeKeeper daemon starts. Plan heartbeat changes during maintenance windows to minimize service disruption, and ensure you have a backup node available to take over services during the restart.
What is the difference between perform_action with and without the -b flag?
Without the -b flag, perform_action affects the entire resource hierarchy across all nodes in the cluster. With the -b flag, the action is limited to the backup server only. Use -b when you need to start or stop resources on the secondary node without impacting the primary, such as when performing maintenance on one node or manually controlling where resources run.
How often should I generate lksupport bundles?
Generate lksupport bundles before and after major configuration changes, during troubleshooting sessions, or when requested by SIOS support. It’s good practice to save periodic lksupport bundles (monthly or quarterly) as configuration baselines. These bundles are invaluable for comparing configurations when issues arise and for documenting your cluster setup history.
Why would I want to disable email notifications?
Disable email notifications during planned maintenance windows to prevent alert fatigue from expected service transitions. When testing failover scenarios or performing cluster upgrades, temporary email disabling prevents flooding administrators with notifications for intentional state changes. Always remember to re-enable notifications after maintenance by uncommenting the LK_NOTIFY_ALIAS line and verifying with lk_confignotifyalias --query.
What does a corrupt flag in SteelEye indicate?
Corrupt flags in /opt/LifeKeeper/subsys/scsi/resources/netraid indicate that LifeKeeper’s internal resource state tracking files have become inconsistent or damaged. This can occur after unexpected system shutdowns, filesystem issues, or software bugs. Corrupt flags may prevent resources from starting properly. Resolution typically involves removing corrupt flag files and recreating the affected resource hierarchies from a known-good configuration.
How can I monitor SteelEye status automatically?
Create a monitoring script that runs lcdstatus regularly and alerts on any OSU, OSF, or unexpected ISB status codes. Many organizations integrate LifeKeeper monitoring with tools like Nagios, Zabbix, or Splunk by parsing lcdstatus output. SIOS also provides SNMP MIBs for integration with enterprise monitoring platforms. Automated monitoring ensures rapid detection and response to cluster issues before they impact users.
What are the risks of setting heartbeat values too high?
Setting LCMNUMHBEATS or LCMBEATTIME too high increases the time before the cluster detects and responds to actual node failures, extending service outage duration. For example, with LCMNUMHBEATS=10 and LCMBEATTIME=10, a true node failure won’t trigger failover for 100 seconds. Balance false-positive prevention against acceptable failover time based on your application’s RTO (Recovery Time Objective). Most environments target a total detection time between 15 and 30 seconds.
Conclusion
SteelEye LifeKeeper provides robust high availability clustering capabilities for enterprise Linux environments. By understanding essential commands for status checking, resource management, and heartbeat configuration, administrators can maintain reliable clustered services with minimal downtime. Proper heartbeat tuning prevents unnecessary failovers while ensuring rapid response to genuine failures. Regular monitoring, careful configuration management, and following best practices ensure that your SteelEye cluster continues to protect critical applications effectively.
Whether you’re performing routine maintenance, troubleshooting service issues, or optimizing cluster behavior, the commands and procedures outlined in this guide provide the foundation for successful SteelEye LifeKeeper administration.