Solaris · Lesson 12

System Monitoring

Monitor CPU, memory and disk in Solaris: an overview of the main performance tools, how to identify bottlenecks, and how to analyse log files.

Why system monitoring matters

System monitoring is about answering a few key questions quickly: Is the system healthy? What is slow? Which component is the bottleneck? As a Solaris admin, you will use a combination of CPU, memory, disk and log tools to build this picture.

In this lesson we focus on the core monitoring tools: top, prstat, iostat, df, du, uptime and the system logs.

Key monitoring dimensions

Load

uptime and top give you a quick summary of load averages, logged-in users and overall system stress.

CPU & memory

prstat shows per-process and per-user CPU/memory usage and lets you sort or filter quickly.

Disk & I/O

iostat tells you if disks and controllers are saturated or error-prone, which often explains slow I/O.

Space & logs

df, du and logs show you where space is used and what errors the system is reporting.

Step-by-step monitoring commands

Walk through these flows in your lab. Open two terminals: one to run monitoring commands, another to start/stop test workloads and see their impact.

1. Overall system load: uptime, top, prstat -a

First get a quick feel for system load and the active processes using uptime, top and prstat.

terminal — monitoring
solaris-lab
[root@solaris ~]# uptime
11:15am up 10 day(s), 3:42, 4 users, load average: 0.24, 0.18, 0.15
 
[root@solaris ~]# top
last pid: 1234; load averages: 0.24, 0.18, 0.15 up 10+03:42:10 11:15:32
   PID USERNAME  LWP PRI NICE  SIZE   RES STATE     TIME   CPU COMMAND
   891 oracle      1  59    0  120M   40M sleep     0:02  1.0% bash
   950 root       35  59    0  150M   60M sleep     0:05  0.5% java
   123 root        1  59    0   60M   20M sleep     0:02  0.2% sshd
...
 
[root@solaris ~]# prstat -a 1 3
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   891 oracle    120M   40M sleep   59    0   0:00:02 1.0% bash/1
   950 root      150M   60M sleep   59    0   0:00:05 0.5% java/35
   123 root       60M   20M sleep   59    0   0:00:02 0.2% sshd/1
...
 
Total: 45 processes, 125 lwps, load averages: 0.24, 0.18, 0.15
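If you script health checks, the load averages can be pulled out of uptime output with awk. A minimal sketch, run here against the sample line above so the result is reproducible; on older Solaris releases prefer nawk or /usr/xpg4/bin/awk, since the classic /usr/bin/awk does not accept a multi-character field separator:

```shell
# Parse the three load averages out of an uptime line; replace the echo
# with `uptime` on a live system.
echo '11:15am up 10 day(s), 3:42, 4 users, load average: 0.24, 0.18, 0.15' |
  awk -F'load average: ' '{
    split($2, la, ", ")
    printf "1min=%s 5min=%s 15min=%s\n", la[1], la[2], la[3]
  }'
# → 1min=0.24 5min=0.18 15min=0.15
```

The 1-minute value tells you what is happening now; comparing it against the 15-minute value shows whether load is rising or falling.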

2. Memory usage: prstat sorted by RSS and ::memstat

Use prstat -a -s rss to see which processes consume most memory, and ::memstat in mdb -k for a kernel-level view.

[root@solaris ~]# prstat -a -s rss 1 3
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   950 root      800M  600M sleep   59    0   0:00:30 2.0% java/35
  1200 oracle    500M  300M sleep   59    0   0:00:20 1.5% oracle/50
   891 oracle    120M   40M sleep   59    0   0:00:02 0.5% bash/1
...
 
Total: 45 processes, 125 lwps, load averages: 0.30, 0.22, 0.18
 
[root@solaris ~]# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     150000              1171   35%
ZFS File Data              100000               781   23%
Anon                       120000               937   28%
Exec and libs               20000               156    5%
Page cache                  15000               117    4%
Free (unallocated)          15000               117    4%

Total                      420000              3280  100%
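The MB column in ::memstat is simply pages multiplied by the page size. A quick sanity check of the arithmetic, assuming the 8 KB pages of a SPARC system as in the output above (run pagesize(1) to confirm; x86 systems typically use 4 KB pages):

```shell
# 150000 pages * 8 KB each, divided by 1024 KB/MB; %d truncates the
# fraction, matching the whole-MB figures memstat prints.
awk 'BEGIN { pages = 150000; printf "%d MB\n", pages * 8 / 1024 }'
# → 1171 MB
```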

3. CPU usage by user: prstat -a -u user and prstat -t

Filter prstat output for a specific user, or aggregate usage per user with prstat -t.

[root@solaris ~]# prstat -a -u oracle 1 3
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  1200 oracle    500M  300M sleep   59    0   0:00:20 2.0% oracle/50
   891 oracle    120M   40M sleep   59    0   0:00:02 0.5% bash/1
...
 
Total: 10 processes, 60 lwps, load averages: 0.40, 0.30, 0.20
 
[root@solaris ~]# prstat -t 1 3
 NPROC USERNAME  SWAP   RSS MEMORY      TIME  CPU
    10 oracle    700M  340M    10%   0:00:25 2.5%
    30 root      600M  300M   9.1%   0:00:20 1.5%
     5 webuser   200M  100M   3.0%   0:00:05 0.5%
 
Total: 45 processes, 125 lwps, load averages: 0.45, 0.35, 0.25
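prstat -t does the per-user aggregation natively; the same roll-up can be approximated anywhere by summing ps output with awk. A sketch fed a captured sample so the numbers are reproducible — on a live system, pipe `ps -eo user,pcpu` into the awk stage instead:

```shell
# Sum the %CPU column per user, skipping the header line; sort makes
# the per-user output order deterministic.
printf 'USER %%CPU\noracle 2.0\noracle 0.5\nroot 1.5\n' |
  awk 'NR > 1 { cpu[$1] += $2 }
       END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
  sort
# → oracle 2.5
# → root 1.5
```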

4. Disk I/O: iostat -en and iostat -xnte

Use iostat to see device errors, throughput and extended statistics for disks and controllers.

[root@solaris ~]# iostat -en
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 cmdk0
    0   0   0   0 cmdk1
 
[root@solaris ~]# iostat -xnte 1 3
   tty                            extended device statistics                  ---- errors ----
 tin tout    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
   0   47    5.0    2.0   40.0   20.0  0.0  0.1    2.0    8.0   0  10   0   0   0   0 cmdk0
             1.0    4.0   10.0   30.0  0.0  0.1    3.0   12.0   0  12   0   0   0   0 cmdk1
...
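In scripts, the %b (percent busy) column is the usual trigger for an alert. A sketch fed a simplified captured sample so it is reproducible; the 80% threshold is an illustrative choice, and the column position depends on which iostat flags you use, so verify it against your own output:

```shell
# Print devices whose %b exceeds 80 in a simplified iostat-style table;
# the header line is skipped with NR > 1.
printf 'device r/s w/s %%b\ncmdk0 5.0 2.0 10\ncmdk1 1.0 4.0 85\n' |
  awk 'NR > 1 && $4 > 80 { print $1, "busy", $4 "%" }'
# → cmdk1 busy 85%
```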

5. Filesystem usage: du and df

Check filesystem and directory usage to find where space is being used.

[root@solaris ~]# df -h
Filesystem           Size  Used  Available  Capacity  Mounted on
rpool/ROOT/solaris    40G   10G        30G       25%  /
rpool/export/home     60G   25G        35G       42%  /export/home
rpool/data            80G   55G        25G       69%  /data
 
[root@solaris ~]# cd /data
[root@solaris /data]# du -sh
55G .
 
[root@solaris /data]# du -sh *
10G oracle
20G backups
5G logs
20G appdata
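The df check is easy to automate: strip the % sign from the Capacity column and compare it against a threshold. A sketch run against the sample table above — with live df -h output, column positions can shift when filesystem names are long, so check them on your system; the 60% threshold is an illustrative choice:

```shell
# Report filesystems above 60% capacity from df -h style output.
printf '%s\n' \
  'Filesystem Size Used Available Capacity Mounted on' \
  'rpool/ROOT/solaris 40G 10G 30G 25% /' \
  'rpool/data 80G 55G 25G 69% /data' |
  awk 'NR > 1 { sub("%", "", $5); if ($5 + 0 > 60) print $6, $5 "%" }'
# → /data 69%
```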

6. System logs: quick checks for issues

Use tail and grep to scan important logs when you notice high load or errors.

[root@solaris ~]# tail -50 /var/adm/messages
Jan 11 11:10:02 sol11 sshd[1234]: [ID 800047 auth.info] Accepted publickey for oracle from 192.168.1.10 port 53218 ssh2
Jan 11 11:11:45 sol11 nfs: [ID 702911 kern.warning] WARNING: NFS server not responding
Jan 11 11:11:50 sol11 nfs: [ID 702911 kern.notice] NOTICE: NFS server ok
 
[root@solaris ~]# grep -i "error" /var/adm/messages | tail -10
Jan 11 10:55:21 sol11 someapp[987]: [ID 702911 user.error] Failed to connect to database
 
[root@solaris ~]# tail -f /var/adm/messages
# Follow the log live while reproducing an issue...
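Quick triage often starts with counting how many warning or error lines a log contains. A reproducible sketch on sample lines — point grep at /var/adm/messages on a real system, and note that -E requires /usr/xpg4/bin/grep or egrep on older Solaris releases:

```shell
# Count lines matching "warning" or "error", case-insensitively.
printf '%s\n' \
  'Jan 11 11:11:45 sol11 nfs: WARNING: NFS server not responding' \
  'Jan 11 10:55:21 sol11 someapp[987]: user.error Failed to connect to database' \
  'Jan 11 11:11:50 sol11 nfs: NOTICE: NFS server ok' |
  grep -ciE 'warning|error'
# → 2
```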

Typical monitoring patterns

When users say “system is slow”

  • Check uptime to see load averages and how long the system has been up.
  • Use prstat -a to see which processes are consuming CPU.
  • Check memory pressure with prstat -a -s rss and ::memstat.
  • Run iostat -xnte to see if disks are saturated or busy.
  • Review recent errors in /var/adm/messages.

When disk space is filling up

  • Use df -h to find which filesystem is close to 100% usage.
  • cd into that filesystem and run du -sh and du -sh * to see which directories are heavy.
  • Look for large log or backup directories and discuss cleanup or rotation with application owners.
  • Consider ZFS snapshots, compression or moving rarely-used data to a different pool.

Practice task – build a quick health-check routine

  • Create your own 5–10 command sequence (using uptime, prstat, iostat, df, du and logs) that you will always run when investigating a performance issue.
  • Run a stress script (CPU or disk heavy) in the background and observe how the metrics change in real time.
  • Note down what “normal” looks like for your lab VM so you can quickly recognise unusual patterns later on bigger systems.
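One possible shape for such a routine, as a sketch only: the healthcheck.sh name and the exact command choices are placeholders to adapt, and the Solaris-only step is guarded so the script also degrades gracefully on other systems:

```shell
# healthcheck.sh - first-pass look at load, processes, disk and logs.
echo "== load =="
uptime
echo "== top processes =="
if command -v prstat >/dev/null 2>&1; then
  prstat -a 1 1                          # Solaris: one snapshot
else
  ps -eo user,pcpu,rss,comm | head -10   # fallback on other systems
fi
echo "== disk space =="
df -h
echo "== recent log errors =="
if [ -r /var/adm/messages ]; then
  grep -i error /var/adm/messages | tail -5
else
  echo "(no readable /var/adm/messages on this system)"
fi
```

Save it once, run it at the start of every investigation, and diff the output against a run captured when the system was healthy.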

In future lessons on ZFS and patching, these monitoring tools will help you verify the impact of your changes.