Linux System Performance and Hang Troubleshooting Guide: SAR, NMON, Top

Linux System Performance Troubleshooting Guide

System performance issues and hangs are among the most critical problems Linux administrators face. When applications slow down or systems become unresponsive, quick and systematic troubleshooting is essential. This guide covers comprehensive steps to diagnose and resolve system hang and performance issues using industry-standard tools.

Understanding System Performance Problems

Performance issues typically stem from four main resource bottlenecks:

  • CPU: High processor utilization or load average
  • Memory: RAM exhaustion, excessive swapping
  • Disk I/O: Slow read/write operations, I/O wait
  • Network: Bandwidth saturation, packet loss

Initial Health Check Commands

1. Quick System Overview

Start with these immediate diagnostic commands:

# Overall system status
uptime

# Current load and processes
top

# Enhanced process viewer
htop

# Quick resource summary
vmstat 1 5

# I/O statistics
iostat -x 2 5

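The uptime command shows load averages over the last 1, 5, and 15 minutes. A load average that stays above the number of CPU cores suggests that tasks are queuing for CPU time or, on Linux, are blocked in uninterruptible I/O wait, since both are counted.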
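
A minimal sketch of that comparison, assuming a Linux host with /proc/loadavg and the coreutils nproc command. The load_per_core helper is a name made up for this example, and the ratio is only a rough indicator.

```shell
#!/bin/sh
# Sketch: compare the 1-minute load average with the number of online CPUs.
# load_per_core LOAD CORES -> prints the ratio with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Read the live values only when the Linux interfaces are present.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _rest < /proc/loadavg      # first field is the 1-minute load
    cores=$(nproc)
    echo "1-min load: $load1, cores: $cores, load per core: $(load_per_core "$load1" "$cores")"
fi
```

A ratio that stays above 1.00 for several minutes is worth investigating with the commands in the following sections.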

2. Check Running Processes

# List all processes sorted by CPU usage
ps aux --sort=-%cpu | head -20

# List processes sorted by memory usage
ps aux --sort=-%mem | head -20

# Show process tree
pstree -p

# Detailed process information
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head

CPU Performance Troubleshooting

Identify CPU-Intensive Processes

# Real-time CPU monitoring
top -b -n 1 | head -20

# CPU usage per core
mpstat -P ALL 2 5

# Find processes using most CPU
pidstat -u 2 5

# Check CPU frequency and throttling
lscpu | grep -i mhz
grep -i mhz /proc/cpuinfo

Analyze High Load Average

# Check processes in uninterruptible sleep (usually I/O); STAT can carry suffixes such as D+ or Ds
ps aux | awk '$8 ~ /^D/'

# Count processes by state
ps -eo state | sort | uniq -c

# Show load average details
cat /proc/loadavg

Memory Performance Analysis

Check Memory Usage

# Memory summary
free -h

# Detailed memory statistics
cat /proc/meminfo

# Per-process memory usage
smem -r

# Check for OOM killer activity
dmesg -T | grep -i "out of memory"
grep -i "killed process" /var/log/messages    # RHEL/CentOS; Debian/Ubuntu log to /var/log/syslog
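
MemAvailable in /proc/meminfo (kernel 3.14 and later) is the kernel's own estimate of how much memory can be used without swapping, which is often more telling than the free column alone. The sketch below turns it into a percentage of total RAM; mem_available_pct is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: express MemAvailable as a percentage of MemTotal.
# Reads meminfo-formatted text on stdin so it can also be fed saved snapshots.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

if [ -r /proc/meminfo ]; then
    echo "MemAvailable: $(mem_available_pct < /proc/meminfo)% of RAM"
fi
```

What counts as too low depends on the workload, so compare the value with your own baseline rather than a fixed threshold.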

Identify Memory Leaks

# Monitor specific process memory over time (quote the pattern so the shell does not expand it)
while true; do ps aux | grep '[p]rocess_name'; sleep 5; done

# Track memory usage with pidstat
pidstat -r 2 10

# Check swap usage
swapon --show    # replaces the deprecated 'swapon -s'
vmstat -s | grep -i swap
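
For leak hunting it helps to log one number over time instead of scanning full ps output. A rough sketch, assuming Linux /proc; rss_kb and log_rss are illustrative helper names. A steadily growing VmRSS under a steady workload is a hint of a leak, not proof.

```shell
#!/bin/sh
# Sketch: sample a process's resident set size (VmRSS) from /proc.
# rss_kb PID -> prints VmRSS in kB; prints nothing if the process is gone.
rss_kb() {
    [ -r "/proc/$1/status" ] || return 0
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# log_rss PID SAMPLES INTERVAL -> one "HH:MM:SS rss_kb" line per sample.
log_rss() {
    i=0
    while [ "$i" -lt "$2" ]; do
        rss=$(rss_kb "$1")
        [ -n "$rss" ] || break               # process exited (or has no VmRSS)
        echo "$(date +%H:%M:%S) $rss"
        i=$((i + 1))
        if [ "$i" -lt "$2" ]; then sleep "$3"; fi
    done
}

# Example: two samples of the current shell, one second apart.
if [ -r "/proc/$$/status" ]; then
    log_rss "$$" 2 1
fi
```

Redirect the output to a file and run it for hours with a longer interval to see a trend.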

Analyze Swap Activity

# Check current swap usage
free -h

# Monitor swap activity
vmstat 1 10

# Find processes using swap (name and VmSwap in kB, largest first)
for file in /proc/[0-9]*/status; do awk '/^Name:/{name=$2} /^VmSwap:/{print name, $2, $3}' "$file" 2>/dev/null; done | sort -k 2 -n -r | head

# Check swap configuration
cat /proc/sys/vm/swappiness

Disk I/O Performance Troubleshooting

Identify I/O Bottlenecks

# Disk I/O statistics
iostat -x 2 5

# Per-process I/O usage
iotop -o

# I/O wait time
vmstat 1 5   # Check 'wa' column

# Check disk latency
iostat -d -x 2 5   # Look at 'await' column
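
The column layout of iostat -x differs between sysstat versions, so a script should find columns by name rather than by position. A small sketch that lists devices whose %util exceeds a limit; busy_devices is a hypothetical helper and the 80 percent limit is an arbitrary example value.

```shell
#!/bin/sh
# Sketch: print devices whose %util is above LIMIT.
# The %util column is located from the "Device" header line, not hard-coded.
busy_devices() {
    awk -v limit="$1" '
        NF == 0        { col = 0; next }      # a blank line ends the device table
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && $col + 0 > limit { print $1, $col }
    '
}

# LC_ALL=C keeps a "." as the decimal separator. The first report is the
# average since boot, so later reports reflect current activity.
if command -v iostat >/dev/null 2>&1; then
    LC_ALL=C iostat -dx 1 2 | busy_devices 80
fi
```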

Find I/O Intensive Processes

# Per-process I/O statistics
pidstat -d 2 5

# Show which files a command has open (matches the start of the command name)
lsof -c process_name

# Monitor disk activity
iotop -oPa

Check Disk Health and Errors

# Check for disk errors in logs
dmesg | grep -i error
grep -i "i/o error" /var/log/messages

# SMART disk status
smartctl -a /dev/sda

# Check filesystem disk usage
df -h

# Find large files and directories (-x stays on the root filesystem and skips /proc)
du -xah / 2>/dev/null | sort -rh | head -20
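
A full filesystem is a common cause of hangs, so it is worth flagging it automatically. A minimal sketch that filters df -P output; full_filesystems is a made-up helper name, the 90 percent limit is only an example, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# Sketch: list mount points whose usage is at or above LIMIT percent.
# Expects POSIX-format df output (df -P): capacity in column 5, mount in column 6.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                     # "95%" -> "95"
        if (use + 0 >= limit) print $6, $5
    }'
}

df -P | full_filesystems 90
```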

Using SAR for Historical Analysis

SAR (System Activity Reporter) is crucial for analyzing historical performance data.

Install and Enable SAR

# Install sysstat package
yum install sysstat -y    # RHEL/CentOS
apt install sysstat -y    # Ubuntu/Debian

# Enable and start service
systemctl enable sysstat
systemctl start sysstat

SAR Analysis Commands

# CPU usage for today
sar -u

# CPU usage for a specific day of the month (Debian/Ubuntu store files in /var/log/sysstat/)
sar -u -f /var/log/sa/sa10

# Memory usage
sar -r

# Swap activity
sar -S

# I/O statistics
sar -b

# Disk-specific I/O
sar -d

# Network statistics
sar -n DEV

# Load average and tasks
sar -q

# All statistics for specific time range
sar -A -s 10:00:00 -e 14:00:00

Interpret SAR Output

# Check CPU utilization trends
sar -u 1 10

# Key metrics to watch:
# %user  - User space CPU usage
# %system - Kernel space CPU usage
# %iowait - Waiting for I/O
# %idle  - CPU idle time

# Memory analysis
sar -r 1 10

# Key metrics:
# %memused - Percentage of memory used
# %commit - Memory committed to the current workload, as a percentage of RAM plus swap (can exceed 100)
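
When scanning a full day of sar -u output, it can help to filter for the intervals that matter. A sketch that prints the samples whose %iowait exceeds a limit; high_iowait is a hypothetical helper, the limit of 10 is arbitrary, and the column is found by its header name because the layout changes with locale and sysstat version.

```shell
#!/bin/sh
# Sketch: print sar -u samples where %iowait is above LIMIT.
high_iowait() {
    awk -v limit="$1" '
        /%iowait/ { for (i = 1; i <= NF; i++) if ($i == "%iowait") col = i; next }
        col && $1 != "Average:" && NF >= col && $col + 0 > limit { print }
    '
}

# Uses today's data file; prints nothing if sar has not collected any data yet.
if command -v sar >/dev/null 2>&1; then
    LC_ALL=C sar -u 2>/dev/null | high_iowait 10
fi
```

The same function works on older files, for example with sar -u -f pointing at a saved daily file.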

Using NMON for Performance Monitoring

Install and Use NMON

# Install NMON
yum install nmon -y       # RHEL/CentOS
apt install nmon -y       # Ubuntu/Debian

# Run interactive NMON
nmon

# Record data to file
nmon -f -s 30 -c 120
# -f: output to file
# -s 30: sample every 30 seconds
# -c 120: collect 120 samples

NMON Interactive Mode Keys

  • c – CPU usage
  • m – Memory stats
  • d – Disk I/O
  • n – Network stats
  • t – Top processes
  • q – Quit

Application Health Check During Performance Issues

Web Application Checks

# Check Apache/httpd status
systemctl status httpd
apachectl -S

# Count connections on port 80 (the trailing space avoids matching :8080)
netstat -ant | grep -c ':80 '

# Check error logs
tail -100 /var/log/httpd/error_log

# Count Apache processes (pgrep does not count itself, unlike ps | grep)
pgrep -c httpd

Database Application Checks

# MySQL/MariaDB connections
mysqladmin processlist
mysqladmin status

# Check running queries (long-running ones show a high Time value)
mysql -e "SHOW FULL PROCESSLIST;"

# PostgreSQL connections
su - postgres -c "psql -c 'SELECT * FROM pg_stat_activity;'"

Java Application Checks

# Find Java processes
jps -v

# Thread dump for hung Java process (replace <PID> with the process ID from jps)
jstack <PID> > thread_dump.txt

# Heap dump for memory issues
jmap -dump:format=b,file=heap_dump.bin <PID>

# Java memory statistics (every 1000 ms, 10 samples)
jstat -gcutil <PID> 1000 10

System Hang Troubleshooting

When System Becomes Unresponsive

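# Enable all SysRq functions for the keyboard (root can always write to /proc/sysrq-trigger)
echo 1 > /proc/sys/kernel/sysrq

# Force sync filesystems
echo s > /proc/sysrq-trigger

# Dump memory usage to the kernel log (read it with dmesg)
echo m > /proc/sysrq-trigger

# Dump blocked (uninterruptible) tasks to the kernel log
echo w > /proc/sysrq-trigger

# Kill all processes except init (destructive, last resort)
echo i > /proc/sysrq-trigger

# Immediate reboot without syncing or unmounting; trigger 's' and 'u' first
echo b > /proc/sysrq-trigger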
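
Values of /proc/sys/kernel/sysrq other than 0 and 1 are a bitmask that limits which functions the keyboard combination may invoke. The sketch below decodes it using the bit values from the kernel's SysRq documentation; sysrq_describe and the short function labels are names chosen for this example.

```shell
#!/bin/sh
# Sketch: decode the kernel.sysrq bitmask into the functions it permits.
sysrq_describe() {
    mask=$1
    if [ "$mask" -eq 0 ]; then echo "disabled"; return 0; fi
    if [ "$mask" -eq 1 ]; then echo "all functions enabled"; return 0; fi
    out=""
    for pair in 2:loglevel 4:keyboard 8:debug-dumps 16:sync \
                32:remount-ro 64:signal 128:reboot 256:rt-nice; do
        bit=${pair%%:*}                       # number before the colon
        name=${pair#*:}                       # label after the colon
        if [ $(( mask & bit )) -ne 0 ]; then out="$out $name"; fi
    done
    echo "${out# }"
}

if [ -r /proc/sys/kernel/sysrq ]; then
    echo "kernel.sysrq: $(sysrq_describe "$(cat /proc/sys/kernel/sysrq)")"
fi
```

For example, a value of 176 permits sync, read-only remount, and reboot, but not the process-killing functions.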

Analyze Kernel Messages

# Check kernel ring buffer
dmesg -T | tail -50

# Look for specific errors
dmesg | grep -i "bug\|error\|fail"

# Check for hardware issues
dmesg | grep -i "hardware"

Network Performance Impact

# Check network connections
netstat -tupln

# Network interface statistics
ifconfig -a
ip -s link

# Packet loss and errors
netstat -i

# Active connections by state
netstat -ant | awk '{print $6}' | sort | uniq -c

# Monitor network bandwidth
iftop
nethogs

Performance Tuning Quick Fixes

Immediate Actions for High Load

# Stop a specific high-CPU process: try SIGTERM first, use -9 only if it is ignored
kill <PID>
kill -9 <PID>

# Lower the priority of a CPU-intensive process
renice -n 10 -p <PID>

# Clear page cache (use cautiously)
sync
echo 3 > /proc/sys/vm/drop_caches

# Reduce swap usage
sysctl vm.swappiness=10
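
SIGKILL gives a process no chance to flush data or release locks, so the kill step above is safer as an escalation. A sketch of that pattern; graceful_kill is a hypothetical helper and the timeout values are arbitrary.

```shell
#!/bin/sh
# Sketch: send SIGTERM, wait up to TIMEOUT seconds, then fall back to SIGKILL.
# graceful_kill PID [TIMEOUT_SECONDS]; returns 1 if the signal cannot be sent.
graceful_kill() {
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 1     # no such process or no permission
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # process exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                 # still alive: force it
    return 0
}

# Demonstration on a throwaway background process.
sleep 60 &
if graceful_kill "$!" 5; then
    echo "process stopped"
fi
```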

Long-term Tuning

# Optimize swappiness permanently
echo "vm.swappiness = 10" >> /etc/sysctl.conf
sysctl -p

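# Increase file descriptors
ulimit -n 65536    # affects the current shell and its children only
# limits.conf applies at the next login; systemd services need LimitNOFILE= in the unit instead
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nofile 65536" >> /etc/security/limits.conf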
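
Appending with >> adds a duplicate line every time the command is rerun. A sketch of an idempotent alternative that replaces any existing assignment of a key; set_sysctl_line is a made-up helper, and the demonstration writes to a temporary file rather than a real file under /etc.

```shell
#!/bin/sh
# Sketch: set KEY = VALUE in a sysctl-style file exactly once.
# set_sysctl_line FILE KEY VALUE
set_sysctl_line() {
    file=$1; key=$2; value=$3
    touch "$file"
    tmp=$(mktemp) || return 1
    # Copy every line except existing assignments of KEY.
    awk -v k="$key" '{
        line = $0
        sub(/^[ \t]+/, "", line)
        split(line, kv, "=")
        sub(/[ \t]+$/, "", kv[1])
        if (kv[1] != k) print
    }' "$file" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"          # keeps the file's owner and mode
}

# Demonstration: running it twice still leaves a single vm.swappiness line.
conf=$(mktemp)
set_sysctl_line "$conf" vm.swappiness 10
set_sysctl_line "$conf" vm.swappiness 10
cat "$conf"
rm -f "$conf"
```

On a real system the target could be a drop-in file under /etc/sysctl.d/ (the file name is your choice), loaded with sysctl --system.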

Creating Performance Reports

Comprehensive System Report

#!/bin/bash
# Save as perf_report.sh

REPORT_FILE="perf_report_$(date +%Y%m%d_%H%M%S).txt"

# Group all sections so the report file is opened once;
# 2>&1 also records errors such as a missing iostat.
{
    echo "=== System Performance Report ==="
    echo "Date: $(date)"
    echo

    echo "=== Uptime and Load ==="
    uptime
    echo

    echo "=== Top CPU Processes ==="
    ps aux --sort=-%cpu | head -10
    echo

    echo "=== Memory Usage ==="
    free -h
    echo

    echo "=== Disk Usage ==="
    df -h
    echo

    echo "=== I/O Statistics ==="
    iostat -x
} > "$REPORT_FILE" 2>&1

echo "Report saved to: $REPORT_FILE"

Best Practices for Performance Monitoring

  1. Establish baselines – Know your normal system metrics
  2. Enable SAR collection – Always have historical data available
  3. Monitor proactively – Don’t wait for users to report issues
  4. Document findings – Keep records of performance incidents
  5. Regular health checks – Schedule weekly performance reviews
  6. Automate monitoring – Use tools like Nagios, Zabbix, or Prometheus
  7. Set up alerts – Get notified before issues become critical

Conclusion

Effective troubleshooting of system performance and hang issues requires a systematic approach using the right tools. By mastering commands like SAR, NMON, top, iostat, and vmstat, you can quickly identify bottlenecks and resolve performance problems before they impact production systems.

Remember that performance troubleshooting is both an art and a science. While these tools provide valuable data, interpreting that data in context with your specific application and workload is crucial for effective resolution.

Frequently Asked Questions

1. What is a safe load average for my system?

A general rule is that load average should be below the number of CPU cores. For example, a 4-core system should ideally have load averages below 4.0. However, this depends on workload type – CPU-intensive applications can safely run at higher loads than I/O-intensive ones.

2. How do I know if my system is swapping excessively?

Run vmstat 1 5 and check the ‘si’ (swap in) and ‘so’ (swap out) columns. Values consistently above 0 indicate swapping. Also check free -h – if swap usage is high and free memory is low, you have a memory pressure problem.
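
A small sketch that totals the si and so columns over a vmstat run, assuming the standard procps header line that starts with "r b". The swap_activity helper is a name invented here. The first data row is skipped because it reports the average since boot.

```shell
#!/bin/sh
# Sketch: sum swap-in (si) and swap-out (so) over the live vmstat samples.
swap_activity() {
    awk '
        $1 == "r" && $2 == "b" {                  # header: locate columns by name
            for (i = 1; i <= NF; i++) { if ($i == "si") si = i; if ($i == "so") so = i }
            next
        }
        si && so && $1 ~ /^[0-9]+$/ {
            if (seen++) { swapped_in += $si; swapped_out += $so }   # skip first row
        }
        END { printf "si=%d so=%d\n", swapped_in, swapped_out }
    '
}

if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | swap_activity
fi
```

Non-zero totals across repeated runs point to ongoing swapping rather than a one-off event.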

3. What does high I/O wait (%wa) indicate?

High I/O wait means the CPU is idle waiting for disk operations to complete. This usually indicates disk bottlenecks. Use iotop to find which processes are causing high I/O and iostat -x to identify which disks are slow.

4. How can I access SAR data from previous days?

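On RHEL/CentOS, SAR data is stored in the /var/log/sa/ directory (Debian and Ubuntu use /var/log/sysstat/). Files named sa01, sa02, etc., correspond to days of the month. Use sar -f /var/log/sa/sa10 to view data from the most recent 10th. If the 10th of the current month has not passed yet, that file holds last month's data, provided it is still within the configured retention period.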
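
A short sketch that builds the file name for a given date, assuming GNU date (for the -d option). The sa_file helper is hypothetical; its optional second argument is the data directory.

```shell
#!/bin/sh
# Sketch: map a date to the matching sar data file.
# sa_file DATE [DIR] -> prints DIR/saDD (DIR defaults to /var/log/sa)
sa_file() {
    day=$(date -d "$1" +%d) || return 1
    echo "${2:-/var/log/sa}/sa$day"
}

# Show yesterday's CPU data if the file exists on this host.
if f=$(sa_file yesterday) && [ -r "$f" ] && command -v sar >/dev/null 2>&1; then
    sar -u -f "$f"
else
    echo "no readable sar data file for yesterday"
fi
```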

5. What should I do if ‘top’ shows high CPU but no process uses much CPU?

This often indicates high system or kernel CPU usage. Check top and look at the %sy (system) value. High system CPU can indicate excessive context switching, system calls, or kernel operations. Use perf top to identify kernel functions consuming CPU.
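
To confirm kernel-side CPU use without top, the counters in /proc/stat can be sampled twice and compared. A rough sketch, assuming the usual Linux field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); kernel_cpu_pct is an illustrative name.

```shell
#!/bin/sh
# Sketch: percentage of CPU time spent in system (sy), hardware interrupts (hi)
# and software interrupts (si) between two "cpu" lines read from stdin.
kernel_cpu_pct() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i      # user .. steal
        if (n++) {
            dt = total - ptotal
            if (dt > 0)
                printf "sy=%.1f hi=%.1f si=%.1f\n",
                       ($4 - psys) * 100 / dt, ($7 - pirq) * 100 / dt, ($8 - psoft) * 100 / dt
        }
        ptotal = total; psys = $4; pirq = $7; psoft = $8
    }'
}

# Take two samples one second apart.
if [ -r /proc/stat ]; then
    { grep '^cpu ' /proc/stat; sleep 1; grep '^cpu ' /proc/stat; } | kernel_cpu_pct
fi
```

High hi or si values point toward interrupt load (often network or storage) rather than system calls.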

About the Author: Ramesh Sundararamaiah

Red Hat Certified Architect

Ramesh is a Red Hat Certified Architect with extensive experience in enterprise Linux environments. He specializes in system administration, DevOps automation, and cloud infrastructure. Ramesh has helped organizations implement robust Linux solutions and optimize their IT operations for performance and reliability.

Expertise: Red Hat Enterprise Linux, CentOS, Ubuntu, Docker, Ansible, System Administration, DevOps
