System performance issues and hangs are among the most critical problems Linux administrators face. When applications slow down or systems become unresponsive, quick and systematic troubleshooting is essential. This guide covers comprehensive steps to diagnose and resolve system hang and performance issues using industry-standard tools.
📑 Table of Contents
- Understanding System Performance Problems
- Initial Health Check Commands
- 1. Quick System Overview
- 2. Check Running Processes
- CPU Performance Troubleshooting
- Identify CPU-Intensive Processes
- Analyze High Load Average
- Memory Performance Analysis
- Check Memory Usage
- Identify Memory Leaks
- Analyze Swap Activity
- Disk I/O Performance Troubleshooting
- Identify I/O Bottlenecks
- Find I/O Intensive Processes
- Check Disk Health and Errors
- Using SAR for Historical Analysis
- Install and Enable SAR
- SAR Analysis Commands
- Interpret SAR Output
- Using NMON for Performance Monitoring
- Install and Use NMON
- NMON Interactive Mode Keys
- Application Health Check During Performance Issues
- Web Application Checks
- Database Application Checks
- Java Application Checks
- System Hang Troubleshooting
- When System Becomes Unresponsive
- Analyze Kernel Messages
- Network Performance Impact
- Performance Tuning Quick Fixes
- Immediate Actions for High Load
- Long-term Tuning
- Creating Performance Reports
- Comprehensive System Report
- Best Practices for Performance Monitoring
- Conclusion
- Frequently Asked Questions
- 1. What is a safe load average for my system?
- 2. How do I know if my system is swapping excessively?
- 3. What does high I/O wait (%wa) indicate?
- 4. How can I access SAR data from previous days?
- 5. What should I do if ‘top’ shows high CPU but no process uses much CPU?
Understanding System Performance Problems
Performance issues typically stem from four main resource bottlenecks:
- CPU: High processor utilization or load average
- Memory: RAM exhaustion, excessive swapping
- Disk I/O: Slow read/write operations, I/O wait
- Network: Bandwidth saturation, packet loss
Initial Health Check Commands
1. Quick System Overview
Start with these immediate diagnostic commands:
# Overall system status
uptime
# Current load and processes
top
# Enhanced process viewer
htop
# Quick resource summary
vmstat 1 5
# I/O statistics
iostat -x 2 5
The uptime command shows load averages for 1, 5, and 15 minutes. A load average that stays above the number of CPU cores indicates potential issues: tasks are queuing for CPU time or, on Linux, are blocked in uninterruptible I/O.
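The rule of thumb above can be checked with a few lines of shell. This is a minimal sketch, assuming a Linux /proc/loadavg and the coreutils nproc command; the function name load_per_core is illustrative and not part of any tool.

```shell
# Print a load average divided by the number of CPU cores.
# Arguments: load average, core count (both supplied by the caller).
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Compare the 1-minute load average against the core count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    ratio=$(load_per_core "$load1" "$(nproc)")
    echo "1-min load per core: $ratio"
    # A ratio above 1.00 means more runnable or blocked tasks than cores.
    if awk -v r="$ratio" 'BEGIN { exit !(r > 1) }'; then
        echo "WARNING: load exceeds core count"
    fi
fi
```

A ratio near or below 1.00 is usually unremarkable; what counts as a problem still depends on the workload.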
2. Check Running Processes
# List all processes sorted by CPU usage
ps aux --sort=-%cpu | head -20
# List processes sorted by memory usage
ps aux --sort=-%mem | head -20
# Show process tree
pstree -p
# Detailed process information
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head
CPU Performance Troubleshooting
Identify CPU-Intensive Processes
# Real-time CPU monitoring
top -b -n 1 | head -20
# CPU usage per core
mpstat -P ALL 2 5
# Find processes using most CPU
pidstat -u 2 5
# Check CPU frequency and throttling
lscpu | grep -i mhz
cat /proc/cpuinfo | grep -i mhz
Analyze High Load Average
# Check processes in uninterruptible sleep (usually I/O)
# Match any STAT value starting with D (for example D, D+, Ds)
ps aux | awk '$8 ~ /^D/'
# Count processes by state
ps -eo state | sort | uniq -c
# Show load average details
cat /proc/loadavg
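When the load average is high but CPU usage is low, the number of tasks in uninterruptible sleep is often the figure to watch. The sketch below counts them; count_d_state is a hypothetical helper name, and it assumes the one-letter state codes printed by ps -eo state.

```shell
# Count tasks in uninterruptible sleep (state D) from one-state-per-line input.
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# "state=" suppresses the header line so only state codes are printed.
if command -v ps >/dev/null 2>&1; then
    ps -eo state= | count_d_state
fi
```

Running it every few seconds shows whether blocked tasks are piling up or only appear briefly.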
Memory Performance Analysis
Check Memory Usage
# Memory summary
free -h
# Detailed memory statistics
cat /proc/meminfo
# Per-process memory usage
smem -r
# Check for OOM killer activity
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/messages   # /var/log/syslog on Debian/Ubuntu
Identify Memory Leaks
# Monitor specific process memory over time (quote the pattern so the shell does not expand it)
while true; do ps aux | grep '[p]rocess_name'; sleep 5; done
# Track memory usage with pidstat
pidstat -r 2 10
# Check swap usage
swapon -s
vmstat -s | grep -i swap
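To judge whether a process is leaking, it helps to log its resident memory with timestamps so growth is visible over time. The following is a sketch only: rss_kb and sample_rss are illustrative names, the VmRSS field is read from /proc/<pid>/status, and the optional last argument (an alternate proc root) is an assumption added here to make the functions easy to try against sample data.

```shell
# Print the resident set size (VmRSS, in kB) of a process.
# Arguments: pid, optional proc root (default /proc).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Print "<time> <rss_kb>" a fixed number of times.
# Arguments: pid, interval in seconds, sample count, optional proc root.
# The body runs in a subshell so its variables do not leak to the caller.
sample_rss() (
    pid=$1 interval=$2 count=$3 root=${4:-/proc} i=0
    while [ "$i" -lt "$count" ]; do
        [ "$i" -gt 0 ] && sleep "$interval"
        printf '%s %s\n' "$(date +%T)" "$(rss_kb "$pid" "$root")"
        i=$((i + 1))
    done
)

# Example: sample the current shell three times, one second apart.
if [ -r "/proc/$$/status" ]; then
    sample_rss "$$" 1 3
fi
```

A steadily rising second column across many samples is a hint, not proof, of a leak; caches and lazy allocation can produce the same pattern.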
Analyze Swap Activity
# Check current swap usage
free -h
# Monitor swap activity
vmstat 1 10
# Find processes using swap
for file in /proc/*/status; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | sort -k 2 -n -r | head
# Check swap configuration
cat /proc/sys/vm/swappiness
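The per-process swap listing can also be written as a small function that skips processes with no swap and tolerates processes that exit mid-scan. This is a sketch under those assumptions; swap_by_process is a hypothetical name, and the optional first argument (an alternate proc root) exists only so the function can be tried against sample files.

```shell
# List processes by swap usage in kB, largest first.
# Arguments: optional proc root (default /proc), optional row limit (default 10).
swap_by_process() (
    for f in "${1:-/proc}"/[0-9]*/status; do
        # Kernel threads have no VmSwap line, so swap stays 0 and they are skipped.
        awk '/^Name:/   { name = $2 }
             /^VmSwap:/ { swap = $2 }
             END { if (swap > 0) printf "%d %s\n", swap, name }' "$f" 2>/dev/null
    done | sort -rn | head -n "${2:-10}"
)

swap_by_process
```

Only the first word of each process name is printed, which is enough to identify most daemons.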
Disk I/O Performance Troubleshooting
Identify I/O Bottlenecks
# Disk I/O statistics
iostat -x 2 5
# Per-process I/O usage
iotop -o
# I/O wait time
vmstat 1 5 # Check 'wa' column
# Check disk latency
iostat -d -x 2 5 # Look at 'await' (reported as 'r_await' and 'w_await' on newer sysstat versions)
Find I/O Intensive Processes
# Per-process I/O statistics
pidstat -d 2 5
# Show which files are being accessed
lsof | grep -i process_name
# Monitor disk activity
iotop -oPa
Check Disk Health and Errors
# Check for disk errors in logs
dmesg | grep -i error
grep -i "i/o error" /var/log/messages
# SMART disk status
smartctl -a /dev/sda
# Check filesystem disk usage
df -h
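# Find large files and directories (-x stays on the root filesystem and skips /proc, /sys, and other mounts)
du -xah / 2>/dev/null | sort -rh | head -20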
Using SAR for Historical Analysis
SAR (System Activity Reporter) is crucial for analyzing historical performance data.
Install and Enable SAR
# Install sysstat package
yum install sysstat -y # RHEL/CentOS
apt install sysstat -y # Ubuntu/Debian
# Enable and start service
systemctl enable sysstat
systemctl start sysstat
SAR Analysis Commands
# CPU usage for today
sar -u
# CPU usage for specific date (sa10 = 10th of the month; Debian/Ubuntu keep these files in /var/log/sysstat/)
sar -u -f /var/log/sa/sa10
# Memory usage
sar -r
# Swap activity
sar -S
# I/O statistics
sar -b
# Disk-specific I/O
sar -d
# Network statistics
sar -n DEV
# Load average and tasks
sar -q
# All statistics for specific time range
sar -A -s 10:00:00 -e 14:00:00
Interpret SAR Output
# Check CPU utilization trends
sar -u 1 10
# Key metrics to watch:
# %user - User space CPU usage
# %system - Kernel space CPU usage
# %iowait - Waiting for I/O
# %idle - CPU idle time
# Memory analysis
sar -r 1 10
# Key metrics:
# %memused - Percentage of memory used
# %commit - Percentage of memory needed for current workload
Using NMON for Performance Monitoring
Install and Use NMON
# Install NMON
yum install nmon -y # RHEL/CentOS
apt install nmon -y # Ubuntu/Debian
# Run interactive NMON
nmon
# Record data to file
nmon -f -s 30 -c 120
# -f: output to file
# -s 30: sample every 30 seconds
# -c 120: collect 120 samples
NMON Interactive Mode Keys
- c – CPU usage
- m – Memory stats
- d – Disk I/O
- n – Network stats
- t – Top processes
- q – Quit
Application Health Check During Performance Issues
Web Application Checks
# Check Apache/httpd status
systemctl status httpd
apachectl -S
# Check active connections
netstat -an | grep :80 | wc -l
# Check error logs
tail -100 /var/log/httpd/error_log
# Monitor Apache processes
ps aux | grep httpd | wc -l
Database Application Checks
# MySQL/MariaDB connections
mysqladmin processlist
mysqladmin status
# Show currently running queries (long Time values point to slow queries)
mysql -e "SHOW FULL PROCESSLIST;"
# PostgreSQL connections
su - postgres -c "psql -c 'SELECT * FROM pg_stat_activity;'"
Java Application Checks
# Find Java processes
jps -v
# Thread dump for hung Java process (replace <pid> with the process ID from jps)
jstack <pid> > thread_dump.txt
# Heap dump for memory issues
jmap -dump:format=b,file=heap_dump.bin <pid>
# Java memory statistics (every 1000 ms, 10 samples)
jstat -gcutil <pid> 1000 10
System Hang Troubleshooting
When System Becomes Unresponsive
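# Enable all SysRq functions for keyboard use (Alt+SysRq+<key>);
# root can write to /proc/sysrq-trigger regardless of this setting
echo 1 > /proc/sys/kernel/sysrq
# Force sync filesystems
echo s > /proc/sysrq-trigger
# Dump memory usage to the kernel log (read it with dmesg)
echo m > /proc/sysrq-trigger
# Dump blocked (uninterruptible) tasks to the kernel log
echo w > /proc/sysrq-trigger
# Kill all processes except init (sends SIGKILL; last resort)
echo i > /proc/sysrq-trigger
# Immediate reboot WITHOUT syncing or unmounting; run 's' and then 'u' (remount read-only) first
echo b > /proc/sysrq-trigger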
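The value in /proc/sys/kernel/sysrq is 0 (disabled), 1 (everything enabled), or a bitmask of allowed function groups for keyboard-invoked SysRq. The sketch below tests one bit of that mask; sysrq_allows is a hypothetical helper, and the bit values used (16 for sync, 32 for remount read-only, 128 for reboot/poweroff) are taken from the kernel's SysRq documentation.

```shell
# Succeed if a sysrq mask permits the function group identified by a bit.
# Arguments: mask (as read from /proc/sys/kernel/sysrq), bit value to test.
sysrq_allows() {
    [ "$1" -eq 1 ] && return 0      # exactly 1 enables every function
    [ $(( $1 & $2 )) -ne 0 ]
}

if [ -r /proc/sys/kernel/sysrq ]; then
    mask=$(cat /proc/sys/kernel/sysrq)
    if sysrq_allows "$mask" 16; then
        echo "keyboard sync (s) is allowed (mask $mask)"
    else
        echo "keyboard sync (s) is NOT allowed (mask $mask)"
    fi
fi
```

Checking this on a healthy system beforehand avoids discovering during a hang that the keyboard combination is disabled.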
Analyze Kernel Messages
# Check kernel ring buffer
dmesg -T | tail -50
# Look for specific errors
dmesg | grep -i "bug\|error\|fail"
# Check for hardware issues
dmesg | grep -i "hardware"
Network Performance Impact
# Check network connections
netstat -tupln
# Network interface statistics
ifconfig -a
ip -s link
# Packet loss and errors
netstat -i
# Active connections by state (NR > 2 skips the two header lines)
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c
# Monitor network bandwidth
iftop
nethogs
Performance Tuning Quick Fixes
Immediate Actions for High Load
# Kill specific high-CPU process (try plain kill, which sends SIGTERM, before -9)
kill -9 <pid>
# Lower the priority of a CPU-intensive process
renice -n 10 -p <pid>
# Clear page cache, dentries, and inodes (use cautiously)
sync
echo 3 > /proc/sys/vm/drop_caches
# Reduce swap usage
sysctl vm.swappiness=10
Long-term Tuning
# Optimize swappiness permanently
echo "vm.swappiness = 10" >> /etc/sysctl.conf
sysctl -p
# Increase file descriptors (ulimit affects only the current shell; limits.conf applies at the next login)
ulimit -n 65536
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf
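Appending with >> adds a duplicate line every time the commands are re-run. One way to keep these files clean is to append only when the exact line is missing. This is a minimal sketch; append_once is a hypothetical helper name, not an existing utility.

```shell
# Append a line to a file only if that exact line is not already present.
# Arguments: line, file.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Example usage (run as root):
#   append_once "vm.swappiness = 10" /etc/sysctl.conf
#   append_once "* soft nofile 65536" /etc/security/limits.conf
```

The -x and -F options make grep compare whole lines literally, so the asterisk in the limits.conf entries is not treated as a pattern.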
Creating Performance Reports
Comprehensive System Report
#!/bin/bash
# Save as perf_report.sh
REPORT_FILE="perf_report_$(date +%Y%m%d_%H%M%S).txt"
echo "=== System Performance Report ===" > $REPORT_FILE
echo "Date: $(date)" >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "=== Uptime and Load ===" >> $REPORT_FILE
uptime >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "=== Top CPU Processes ===" >> $REPORT_FILE
ps aux --sort=-%cpu | head -10 >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "=== Memory Usage ===" >> $REPORT_FILE
free -h >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "=== Disk Usage ===" >> $REPORT_FILE
df -h >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "=== I/O Statistics ===" >> $REPORT_FILE
iostat -x >> $REPORT_FILE
echo "Report saved to: $REPORT_FILE"
Best Practices for Performance Monitoring
- Establish baselines – Know your normal system metrics
- Enable SAR collection – Always have historical data available
- Monitor proactively – Don’t wait for users to report issues
- Document findings – Keep records of performance incidents
- Regular health checks – Schedule weekly performance reviews
- Automate monitoring – Use tools like Nagios, Zabbix, or Prometheus
- Set up alerts – Get notified before issues become critical
Conclusion
Effective troubleshooting of system performance and hang issues requires a systematic approach using the right tools. By mastering commands like SAR, NMON, top, iostat, and vmstat, you can quickly identify bottlenecks and resolve performance problems before they impact production systems.
Remember that performance troubleshooting is both an art and a science. While these tools provide valuable data, interpreting that data in context with your specific application and workload is crucial for effective resolution.
Frequently Asked Questions
1. What is a safe load average for my system?
A general rule is that load average should be below the number of CPU cores. For example, a 4-core system should ideally have load averages below 4.0. However, this depends on workload type – CPU-intensive applications can safely run at higher loads than I/O-intensive ones.
2. How do I know if my system is swapping excessively?
Run vmstat 1 5 and check the ‘si’ (swap in) and ‘so’ (swap out) columns. Values consistently above 0 indicate swapping. Also check free -h. If swap usage is high and free memory is low, you have a memory pressure problem.
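Reading the columns by eye works for a few samples; for longer runs a short filter can average them. The sketch below is one possible approach: vmstat_avg is an illustrative name, it finds the column by its header label, and it skips the first data row because vmstat reports averages since boot there.

```shell
# Average a named vmstat column (for example si, so, or wa) read from stdin.
vmstat_avg() {
    awk -v col="$1" '
        idx == 0 {                        # still looking for the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            next
        }
        $idx !~ /^[0-9]+$/ { next }       # skip repeated header lines
        ++rows > 1 { sum += $idx; n++ }   # ignore the since-boot first row
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}

# Example usage:
#   vmstat 1 10 | vmstat_avg si
#   vmstat 1 10 | vmstat_avg so
```

An average that stays above zero for both si and so over several runs suggests sustained swapping rather than a one-off page-out.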
3. What does high I/O wait (%wa) indicate?
High I/O wait means the CPU is idle waiting for disk operations to complete. This usually indicates disk bottlenecks. Use iotop to find which processes are causing high I/O and iostat -x to identify which disks are slow.
4. How can I access SAR data from previous days?
SAR data is stored in the /var/log/sa/ directory on RHEL/CentOS (/var/log/sysstat/ on Debian/Ubuntu). Files named sa01, sa02, etc., correspond to days of the month. Use sar -f /var/log/sa/sa10 to view data from the 10th day of the current month.
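The file name can be built from a day number instead of typed by hand. This is a small sketch; sa_file_for_day is a hypothetical helper, and the default directory is the RHEL/CentOS location named above.

```shell
# Build the sar data file path for a day of the month.
# Arguments: day (1-31), optional data directory (default /var/log/sa).
sa_file_for_day() (
    day=${1#0}   # drop a leading zero so printf does not parse "08" as octal
    printf '%s/sa%02d\n' "${2:-/var/log/sa}" "$day"
)

# Example usage:
#   sar -u -f "$(sa_file_for_day 10)"
#   sar -r -f "$(sa_file_for_day 3 /var/log/sysstat)"
#   sar -q -f "$(sa_file_for_day "$(date -d yesterday +%d)")"   # GNU date assumed
```

Because the files are keyed by day of month, an older file is overwritten when the same day number comes around again, subject to the sysstat retention setting.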
5. What should I do if ‘top’ shows high CPU but no process uses much CPU?
This often indicates high system or kernel CPU usage. Check top and look at the %sy (system) value. High system CPU can indicate excessive context switching, system calls, or kernel operations. Use perf top to identify kernel functions consuming CPU.
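The user/system split can also be computed directly from two snapshots of the aggregate cpu line in /proc/stat. The sketch below assumes the Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); cpu_split is an illustrative name.

```shell
# Print the user, system, and iowait share of CPU time between two
# snapshots of the first line of /proc/stat.
# Arguments: earlier "cpu ..." line, later "cpu ..." line.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) a[i] = $i; next }
        {
            # Fields 2-9 are user nice system idle iowait irq softirq steal.
            # Guest time (fields 10-11) is already counted in user and nice.
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total <= 0) { print "no change"; exit }
            printf "user %.1f%% system %.1f%% iowait %.1f%%\n",
                100 * (d[2] + d[3]) / total,
                100 * d[4] / total,
                100 * d[6] / total
        }'
}

if [ -r /proc/stat ]; then
    s1=$(head -n 1 /proc/stat)
    sleep 1
    s2=$(head -n 1 /proc/stat)
    cpu_split "$s1" "$s2"
fi
```

A system share that rivals or exceeds the user share is the situation described above, and perf top is the natural next step.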