Utilization emails

Decoding utilization emails

The cluster systems monitor how well running jobs are utilizing nodes. They will send the job owners emails to report unusual utilization patterns, such as idle nodes in multi-node jobs, or using excessive amounts of memory.

Each such message has a header section that describes the basic problem, and a statistics section that reports information collected by the job manager. This document expands the somewhat terse messages into simpler explanations.

The variable names used are closely related to PBS options, such as ppn (processors per node), job (job number), and wall_time. More detailed information on each message type follows.

Underutilized: E124 - Exceeded Memory Allocation Message

When running in the single queue, the amount of memory allowed to a job is determined by the number of cores requested: the total node memory divided by the number of cores per node, times the number of cores requested. For example, on a 16-core node with 32 GB of memory, each core is allowed 2 GB. If a job needs 8 GB, it must request 4 cores even if only 1 is used. In the example below, 3 GB is allowed per core. The job requested 1 core (ppn=1) but its top process used about 7 GB, so it exceeded its limit and was terminated.

E124 - Exceeded memory allocation
This Job 2064 appears to be using more memory (GB) than allocated (7 > 3).
Please allocate the amount of memory that the job will use.
This Job has 1 core(s) allocated (ppn=1), you need at least (ppn=3).
Memory is allocated 3gb/core.
The top physical memory process is 7285mb.

Job deleted

Node statistics::
Number of nodes: 1
Number of cores: 1
Total physical memory per node: 24096mb
Average memory usage per node: 11678mb, 48%
Average memory usage per core: 11678mb
Average virtual memory usage per node: 21736mb
Average virtual memory usage per core: 2717mb
Average CPU percent per node: 200%
Average CPU percent per core: 25%
Average load per node: 2.02
Reverified average load per node: 2.00
Effective maximum load on a node: 2.02

PBS_job=2064.philip3 user=flast allocation=hpc_alloc017
    queue=single total_load=2.02 cpu_hours=2.98 wall_hours=1.40
unused_nodes=0 total_nodes=1 ppn=1 avg_load=2.02 avg_cpu=200%
avg_mem=11678mb avg_vmem=21736mb
top_proc=flast:a:philip001:10.4G:4.3G:1.5hr:100%
toppm=flast:a.out:philip001:10692M:7285M node_processes=6
avg_avail_mem=8736mb min_avail_mem=8736mb reverified_avg_load=2.00

Name:  First Last
Mail:  flast@lsu.edu
Affil: First Last
Category: validation:current:10/21/2015
Allocations:
hpc_alloc017,flast,2000.00,
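
The per-core memory rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the job manager's actual code. The function name required_ppn is hypothetical, and the 8 cores per node used in the second call is an assumption inferred from the 3gb/core figure and the roughly 24 GB node in the example.

```python
import math

def required_ppn(mem_needed_gb, node_mem_gb, cores_per_node):
    """Smallest core count whose combined memory share covers the job."""
    mem_per_core_gb = node_mem_gb / cores_per_node
    return max(1, math.ceil(mem_needed_gb / mem_per_core_gb))

# 16-core node with 32 GB: 2 GB per core, so an 8 GB job needs 4 cores.
print(required_ppn(8, 32, 16))

# Example email: about 7 GB used at 3 GB per core
# (assumed here to be an 8-core node with 24 GB).
print(required_ppn(7, 24, 8))
```

The second call reproduces the "you need at least (ppn=3)" hint in the message.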

Underutilized: E132 - Low Memory Usage Message

A low memory usage error indicates that much less memory is being used than is provided by the system. A typical case is running a job that uses a small amount of memory in one of the job queues that provides large memory nodes. The example below shows that a large memory node was requested (lines 2 and 22), but the average memory usage was only about 2 percent of the node's memory (reported as 1% in line 3). Given that only about 5 GB (5125mb) were required, the job could easily have fit on a 32 GB node provided by the workq or checkpt queues. In this case, the CPU percent of 1566% (line 4) and the load of about 16 (lines 17 and 23) indicate that the CPU cores were fully occupied. User account information is shown in lines 30 to 39.

 1)  E132 - Low memory usage
 2)  Total physical memory per node: 258272mb
 3)  Average memory usage: 5125mb, 1%
 4)  Average CPU percent is: 1566%
 5)  Please try to better use CPU resources.
 6) 
 7)  Node statistics::
 8)  Number of nodes: 1
 9)  Number of cores: 16
10)  Total physical memory per node: 258272mb
11)  Average memory usage per node: 5125mb, 1%
12)  Average memory usage per core: 320mb
13)  Average virtual memory usage per node: 5890mb
14)  Average virtual memory usage per core: 368mb
15)  Average CPU percent per node: 1566%
16)  Average CPU percent per core: 97%
17)  Average load per node: 16.09
18)  Reverified average load per node: 16.10
19)  Effective maximum load on a node: 16.10
20) 
21)  PBS_job=432161.mike3 user=flast allocation=hpc_alloc04
22)  queue=bigmem total_load=16.09 cpu_hours=5.38 wall_hours=3.96
23)  unused_nodes=0 total_nodes=1 ppn=16 avg_load=16.09 avg_cpu=1566%
24)  avg_mem=5125mb avg_vmem=5890mb
25)  top_proc=flast:mothur:mike438:355M:320M:0.1hr:100%
26)  toppm=flast:mothur:mike438:410M:359M node_processes=18
27)  avg_avail_mem=222946mb min_avail_mem=222946mb
28)  reverified_avg_load=16.10
29) 
30)  Name:  First Last
31)  Mail:  flast@somewhere.lsu.edu
32)  Affil: First Last
33)  Category: validation:current:01/27/2014
34)  Name:  First Last
35)  Mail:  flast@somewhere.lsu.edu
36)  Affil: First Last
37)  Category: validation:current:01/27/2014
38)  Allocations:
39)  hpc_alloc04,flast,50000.00,
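
The memory percentage in the example can be checked with a short Python sketch. The function name mem_percent is hypothetical. Truncating to a whole percent is an inference from the sample emails in this document (for instance 5125mb of 258272mb is reported as 1%), not a documented behavior of the monitoring system. The 32046mb figure for a regular node is taken from the E130 example.

```python
def mem_percent(avg_mem_mb, node_mem_mb):
    """Average memory usage as a whole-number percent of node memory.

    Truncation (rather than rounding) is an assumption inferred from
    the sample emails.
    """
    return 100 * avg_mem_mb // node_mem_mb

# Large memory node from the example: 5125mb of 258272mb.
print(mem_percent(5125, 258272))

# The same job on a regular node with 32046mb of memory.
print(mem_percent(5125, 32046))
```

Even on the smaller node the job would use only a modest share of the memory, which is why a regular queue is the better fit.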

Underutilized: E130 - Unused Nodes Message

An E130 error message indicates that a job has several nodes allocated which appear to be idle. This might not be a problem if the job was examined while it was setting up for a new round of processing. In general, though, it suggests a job specification issue, such as a mismatch between the number of MPI processes requested by mpirun and the number of nodes and cores available. If the majority of the nodes assigned are never used, you may get an E131 error message instead.

The following example shows a job that requested 4 nodes, 2 of which were found to be idle (lines 2 and 22).

 1)  E130 - Unused nodes
 2)  Job 432234 has 2 unused nodes.
 3)  Please correct this problem.
 4) 
 5)  Node statistics::
 6)  Number of nodes: 4
 7)  Number of cores: 64
 8)  Total physical memory per node: 32046mb
 9)  Average memory usage per node: 3392mb, 10%
10)  Average memory usage per core: 212mb
11)  Average virtual memory usage per node: 4914mb
12)  Average virtual memory usage per core: 307mb
13)  Average CPU percent per node: 616%
14)  Average CPU percent per core: 38%
15)  Average load per node: 6.28
16)  Reverified average load per node: 6.35
17)  Effective maximum load on a node: 16.23
18) 
19) 
20)  PBS_job=432234.mike3 user=flast allocation=hpc_alloc02
21)  queue=checkpt total_load=25.15 cpu_hours=190.05 wall_hours=8.90
22)  unused_nodes=2 total_nodes=4 ppn=16 avg_load=6.28 avg_cpu=616%
23)  avg_mem=3392mb avg_vmem=4914mb
24)  top_proc=flast:d_hydro:mike032:672M:433M:7.8hr:100%
25)  toppm=flast:wave.exe:mike032:2946M:2880M node_processes=0
26)  avg_avail_mem=26911mb min_avail_mem=20687mb
27)  reverified_avg_load=6.35
28) 
29)  Name:  First Last
30)  Mail:  flast@somewhere.lsu.edu
31)  Affil: First Last
32)  Category:
33)  Name:  First Last
34)  Mail:  flast@somewhere.lsu.edu
35)  Affil: First Last
36)  Category: validation:current:09/03/2013
37)  Allocations:
38)  hpc_alloc02,flast,383311.16,
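
One way to catch a mismatch between the mpirun process count and the allocated cores before a job runs is to count the slots in the node file. The Python sketch below assumes a PBS/Torque-style node file that lists one hostname per allocated core (usually named by the PBS_NODEFILE environment variable); that layout, the function names, and the hostnames are assumptions for illustration.

```python
from collections import Counter

def slots_per_host(nodefile_text):
    """Count how many process slots each host contributes."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)

def check_process_count(n_procs, nodefile_text):
    """Compare an mpirun process count with the slots the job was given."""
    slots = slots_per_host(nodefile_text)
    total = sum(slots.values())
    if n_procs < total:
        return f"only {n_procs} of {total} slots used; some cores or nodes may sit idle"
    if n_procs > total:
        return f"{n_procs} processes on {total} slots; nodes will be oversubscribed"
    return f"{n_procs} processes match {total} slots on {len(slots)} nodes"

# Hypothetical node file for 4 nodes with ppn=16 (hostnames are made up).
hosts = ["mike001", "mike002", "mike003", "mike004"]
nodefile = "\n".join(h for h in hosts for _ in range(16))

print(check_process_count(32, nodefile))
print(check_process_count(64, nodefile))
```

A matching count does not guarantee that processes are spread across all nodes; that also depends on how mpirun is told to place them.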

Underutilized: E131 - Too Many Unused Nodes Message

This error occurs if only 1 node assigned to a job is being used and all the others are idle. In this example, 31 nodes were found to be inactive. The load value represents the number of user processes running on the node. Ideally, there should be 1 user process per core. This job asked for 32 nodes, and there were 20 cores per node, so the total load should have been about 640. In fact, all the processes were started on one node, smic252 (see line 9), giving a load of 639.92. This caused resource starvation on the node. This is reflected in the fact that the job ran for almost 4 hours, but no CPU time was consumed (see line 14). Such a situation could destabilize the system, so the job was terminated. Information collected by the PBS job manager is reflected in lines 13 to 32. User account and allocation information is shown in lines 33 to 42.

 1)  E131 - Too many unused nodes
 2)  Job 74236 has 31 unused nodes.
 3)  Please correct this problem.
 4) 
 5)  Job deleted
 6) 
 7)  PBS job: 74236, nodes: 32
 8)  Hostname Days Load CPU U#(User:Process:VirtualMemory:Memory:Hours)
 9)  smic252     34    639.92   0       0
10)  smic253     34    0.16       0       0
11)  smic254     34    0.11       0       0
12)  . . . 29 similar lines removed . . .
13)  PBS_job=74236 user=flast allocation=hpc_alloc03
14)  queue=checkpt total_load=641.57 cpu_hours=0.00 wall_hours=3.90
15)  unused_nodes=31 total_nodes=32 ppn=20 avg_load=20.04
16)  avg_cpu=0% avg_mem=0mb avg_vmem=0mb top_proc=none:0.0hr:0%
17)  node_processes=0
18) 
19)  Node statistics::
20)  Number of nodes: 32
21)  Number of cores: 640
22)  Total physical memory per node: 64364mb
23)  Average memory usage per node: 0mb, 0%
24)  Average memory usage per core: 0mb
25)  Average virtual memory usage per node: 0mb
26)  Average virtual memory usage per core: 0mb
27)  Average CPU percent per node: 0%
28)  Average CPU percent per core: 0%
29)  Average load per node: 0.02
30)  Reverified average load per node: 19.89
31)  Effective maximum load on a node: 635.08
32) 
33)  Name:  First Last
34)  Mail:  flast@somewhere.lsu.edu
35)  Affil: First Last
36)  Category:
37)  Name:  First Last
38)  Mail:  flast@somewhere.lsu.edu
39)  Affil: First Last
40)  Category: validation:current:02/22/2011
41)  Allocations:
42)  hpc_alloc03,flast,1578202.88,default
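
The reasoning above (expected load of nodes times cores per node, with all of it landing on one host) can be sketched in Python. The function name classify_nodes and the 0.5 idle threshold are assumptions for illustration; the document does not state what cutoff the monitoring system uses.

```python
def classify_nodes(loads, ppn, idle_below=0.5):
    """Split hosts into idle and oversubscribed lists by per-node load.

    idle_below is an assumed threshold, not a documented one.
    """
    idle = sorted(h for h, load in loads.items() if load < idle_below)
    over = sorted(h for h, load in loads.items() if load > ppn)
    return idle, over

nodes, ppn = 32, 20
# With one user process per core, the total load should be about this.
print("expected total load:", nodes * ppn)

# The three hosts shown in the example (the other 29 lines are elided there).
loads = {"smic252": 639.92, "smic253": 0.16, "smic254": 0.11}
idle, over = classify_nodes(loads, ppn)
print("idle:", idle)
print("oversubscribed:", over)
```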

Underutilized: E133 - Low Load Message

A low load error indicates that the processing resources being used are much lower than what is available on a node. This might not be a problem if a job is using most of the memory but only 1 core on a multi-core node. A combination of low memory and low load suggests the job should be run in the single queue or set up to run several tasks at a time. The example shows an average load of 1.02 (line 2), whereas the nominal load on the 20-core node (line 10) should be 20 (line 3). User account information is shown in lines 31 to 41.

 1)  E133 - Low load
 2)  The average load per node is low: 1.02
 3)  The average load should be: 20
 4)  The reverified average load per node: 0.99
 5)  The average memory usage per node: 320mb, 0%
 6)  Try to use CPU and memory resources wisely.
 7) 
 8)  Node statistics::
 9)  Number of nodes: 1
10)  Number of cores: 20
11)  Total physical memory per node: 64364mb
12)  Average memory usage per node: 320mb, 0%
13)  Average memory usage per core: 16mb
14)  Average virtual memory usage per node: 991mb
15)  Average virtual memory usage per core: 49mb
16)  Average CPU percent per node: 99%
17)  Average CPU percent per core: 4%
18)  Average load per node: 1.02
19)  Reverified average load per node: 0.99
20)  Effective maximum load on a node: 1.02
21) 
22)  PBS_job=78197.smic3 user=flast allocation=hpc_alloc04
23)  queue=workq total_load=1.02 cpu_hours=1.48 wall_hours=1.46
24)  unused_nodes=0 total_nodes=1 ppn=20 avg_load=1.02 avg_cpu=99%
25)  avg_mem=320mb avg_vmem=991mb
26)  top_proc=flast:fluent:smic246:451M:270M:1.4hr:99%
27)  toppm=flast:fluent.6.3.26:smic246:451M:270M node_processes=7
28)  avg_avail_mem=62440mb min_avail_mem=62440mb
29)  reverified_avg_load=0.99
30) 
31)  Name:  First Last
32)  Mail:  flast@somewhere.lsu.edu
33)  Affil: First Last
34)  Category: validation:current:05/11/2015
35)  Name:  First Last
36)  Mail:  flast@somewhere.lsu.edu
37)  Affil: First Last
38)  Category: validation:current:06/17/2014
39)  Allocations:
40)  hpc_alloc05,flast,1999994.14,
41)  hpc_alloc04,flast,1931.31,

Underutilized: E135 - Low CPU Percent Message

Low CPU percent indicates that the amount of work being done on a node is much less than what is possible. In the example below, the average CPU percent per node is 0% (line 2) when it should be closer to 2000% (line 3), that is, 100% for each of the 20 cores on a node. The average load per node is also very low at 0.01 (line 4), and memory utilization is small (lines 12 and 13). User account information is summarized in lines 31 to 40.

 1)  E135 - Low CPU percent
 2)  The average CPU percent per node is low: 0%
 3)  The average CPU percent should be: 2000%
 4)  Average load per node: 0.01
 5)  The average memory usage per node: 1381mb, 2%
 6)  Try to use CPU and memory resources wisely.
 7) 
 8)  Node statistics::
 9)  Number of nodes: 64
10)  Number of cores: 1280
11)  Total physical memory per node: 64364mb
12)  Average memory usage per node: 1381mb, 2%
13)  Average memory usage per core: 69mb
14)  Average virtual memory usage per node: 2561mb
15)  Average virtual memory usage per core: 128mb
16)  Average CPU percent per node: 0%
17)  Average CPU percent per core: 0%
18)  Average load per node: 0.01
19)  Reverified average load per node: 12.19
20)  Effective maximum load on a node: 17.74
21) 
22)  PBS_job=78212.smic3 user=flast allocation=hpc_alloc03
23)  queue=workq total_load=0.72 cpu_hours=0.00 wall_hours=1.05
24)  unused_nodes=0 total_nodes=64 ppn=20 avg_load=0.01 avg_cpu=0%
25)  avg_mem=1381mb avg_vmem=2561mb
26)  top_proc=flast:ddt-debugger:smic256:221M:7M:0.0hr:2%
27)  toppm=flast:gdb:smic035:1456M:1339M node_processes=8
28)  avg_avail_mem=60882mb min_avail_mem=56724mb
29)  reverified_avg_load=12.19
30) 
31)  Name:  First Last
32)  Mail:  flast@somewhere.lsu.edu
33)  Affil: First Last
34)  Category:
35)  Name:  First Last
36)  Mail:  flast@somewhere.lsu.edu
37)  Affil: First Last
38)  Category: validation:current:03/14/2011
39)  Allocations:
40)  hpc_alloc03,flast,12813004.16,default
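
The "should be" values in the E133 and E135 messages follow directly from the number of cores per node. The Python sketch below shows that arithmetic using figures from the examples in this document. The function names are hypothetical, and the thresholds that actually trigger a message are not documented here, so none are assumed.

```python
def expected_cpu_percent(ppn):
    """Nominal CPU percent for a node: 100% for each of its cores."""
    return 100 * ppn

def utilization(observed, nominal):
    """Observed value as a fraction of the nominal value."""
    return observed / nominal

# E135 example: ppn=20, so the CPU percent should be close to this.
print(expected_cpu_percent(20))

# E133 example: an average load of 1.02 on a 20-core node.
print(f"{utilization(1.02, 20):.1%}")

# E132 example: 1566% CPU on a 16-core node, a well-used node.
print(f"{utilization(1566, expected_cpu_percent(16)):.1%}")
```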

Underutilized Error Message Terms

Average CPU percent per core

Average CPU usage per core. This is the percent of time a processing core spends in user processing rather than servicing I/O or system needs. 100% means fully committed to processing. 0% would mean something really bad is happening, such as I/O saturation due to virtual memory disk thrashing.

Average CPU percent per node

Average CPU percent per core times the number of cores per node. This can give an idea of how node resources are being used overall.

Average memory usage per core

The average amount of memory used by a core for its processing. It should normally be less than the total memory available per node, divided by the number of cores. This leaves some memory free for system needs. Some jobs may choose to use fewer cores to allow higher memory usage by active cores.

Average memory usage per node

Average memory usage per core times the number of cores per node.

Average virtual memory usage per core

The amount of virtual (disk-based) memory used per core.

Average virtual memory usage per node

Average virtual memory usage per core times the number of cores per node.

avg_cpu

The same as "Average CPU percent per node" above.

avg_load

The load summed from all assigned nodes divided by the number of nodes.

avg_mem

Total memory usage from all assigned nodes divided by the number of nodes.

avg_vmem

Total virtual memory usage from all assigned nodes divided by the number of nodes.

cpu_hours

Sum of CPU time consumed by user processes from all available cores.

gb

Data size in billions of bytes (10^9 bytes).

load

Number of running user processes.

mb

Data size in millions of bytes (10^6 bytes).

M

Data size in Megabytes (2^20 bytes, or 1,048,576).

ppn

Number of processors (cores) per node.

reverified_avg_load

A second sampling of avg_load, taken to double-check the first: the load summed from all assigned nodes divided by the number of nodes.

top_proc

The user process identified as the top CPU time consumer. The information is displayed in a ":" delimited format (but without spaces): "user name : command name : node (host) name : virtual memory used : memory used : wall clock hours : cpu%". The order of the two memory fields matches the column header in the E131 example (VirtualMemory:Memory).

toppm

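The user process identified as the top memory consumer. The information is displayed in a ":" delimited format (but without spaces): "user name : command name : node (host) name : virtual memory used : memory used". In the E124 example, the second size in toppm (7285M) is the one reported as the top physical memory process.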

total_load

Sum of the user processes running on all nodes assigned to the job.

total_nodes

Number of nodes assigned to the job.

wall_hours

Elapsed job time as measured by a wall clock.

unused_nodes

Number of assigned nodes measured as being idle.

user

User login name.

virtual memory

When system random access memory space is exhausted, the system will use space on disk to store data. Reading and writing to/from disk is vastly slower than to/from system memory, and can bring processing to a halt. Heavy use of virtual memory leads to a situation called disk thrashing where all the time is spent in disk I/O and nothing is available for CPU processing. On some systems, the OOM (out-of-memory) killer may start killing processes to free memory and wind up inadvertently stopping critical processes, crashing the node.
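
The key=value statistics and the top_proc field described in the terms above lend themselves to simple parsing. The Python sketch below is one possible approach, not an official tool. The function and field names are hypothetical. The field order for top_proc (virtual memory before physical memory) is an assumption based on the column header "User:Process:VirtualMemory:Memory:Hours" in the E131 example and on the E124 message, where the 7285mb top physical memory process matches the second size in toppm.

```python
from collections import namedtuple

# Assumed field order; see the note above.
TopProc = namedtuple("TopProc", "user command host vmem mem hours cpu")

def parse_stats(text):
    """Collect whitespace-separated key=value tokens into a dict of strings."""
    stats = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            stats[key] = value
    return stats

def parse_top_proc(value):
    """Split a top_proc value into named fields.

    Returns None for values such as "none:0.0hr:0%" that do not
    have all seven fields.
    """
    fields = value.split(":")
    return TopProc(*fields) if len(fields) == 7 else None

# Summary lines copied from the E130 example.
sample = """PBS_job=432234.mike3 user=flast allocation=hpc_alloc02
queue=checkpt total_load=25.15 cpu_hours=190.05 wall_hours=8.90
unused_nodes=2 total_nodes=4 ppn=16 avg_load=6.28 avg_cpu=616%
top_proc=flast:d_hydro:mike032:672M:433M:7.8hr:100%"""

stats = parse_stats(sample)
print(stats["queue"], stats["unused_nodes"], stats["total_nodes"])

# avg_load is total_load divided by total_nodes.
print(float(stats["total_load"]) / int(stats["total_nodes"]))

print(parse_top_proc(stats["top_proc"]))
```

All values are kept as strings so that suffixes such as mb, M, hr, and % are preserved; convert them only where a number is needed.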