Unusual system bog down


 
# 1  
Old 02-25-2014
Unusual system bog down

Solaris 10 10/09 s10s_u8wos_08a SPARC, 16 CPUs, 128 GB RAM, uptime 150+ days,
2 db zones (Oracle 9 & 10), 3 application zones.

This is from a system that was literally crawling: 60 seconds to execute a
single command. I had to reboot to clear it. The data below is from runs of
prstat, top, and iostat. The system is fine after the reboot.

Most of the waits were for oracle remote user processes in a
single db zone.

I ran dtrace and mdb to look for CPU issues and file locks, and found very few.
We lost a SAN controller (for a Windows fileserver SAN that is not attached
to this box in any way), and this slowdown occurred several hours
later.

Note: the CPUs are not actually busy, but the load averages
are absurd. Context switches were low, less than 100/sec, per dtrace.

iostat shows two disks with excessively high svc_t times, but not that
much transfer of data.

Low-priority processes are often in waits; this is normal.
I have historical sar data, sarcheck does not see any problems other than
ssd18 and ssd27 have excessive waits.

I had to reboot so this is what I now have to work with....

Any ideas? What would cause this:
Code:
PRSTAT
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 20125 oracle   3772M 3769M wait    59    0   0:00:32 0.1% oracle/1
 18435 oracle   3762M 3759M wait    59    0   0:13:35 0.1% oracle/1
 18430 appworx    50M   47M sleep   59    0   0:06:27 0.1% uzpplpl/1
  7264 oracle   3781M 3764M wait     1    0   0:07:45 0.1% oracle/11
 12839 oracle   2551M 2535M wait    47    0   0:03:52 0.1% oracle/11
 16458 root     7688K 4864K cpu10   59    0   0:00:00 0.0% prstat/1
 18337 oracle   3762M 3759M sleep    1    0   0:04:54 0.0% oracle/1
 25080 vssrt     170M  157M sleep   59    2   0:00:44 0.0% MrepApp/1
 13886 oracle   2566M 2535M wait    38    0   0:00:06 0.0% oracle/1
 25011 oracle   3772M 3769M wait     1    0   0:00:30 0.0% oracle/1
 18334 appworx    15M   12M sleep   59    0   0:03:44 0.0% uapsogn/1
  7480 oracle   2584M 2554M wait    59    0   0:07:52 0.0% oracle/11
  7470 oracle   2584M 2556M wait    59    0   0:07:51 0.0% oracle/11
  5488 oracle   3772M 3769M wait    55    0   0:00:22 0.0% oracle/1
  8591 oracle   3762M 3759M wait    59    0   0:00:00 0.0% oracle/1
 23924 vssrt     206M  193M wait     1    2   0:04:24 0.0% DrepApp/1
 25129 oracle   3768M 3765M wait    59    0   0:00:02 0.0% oracle/1
 12857 oracle   2551M 2534M wait     1    0   0:03:53 0.0% oracle/11
  3803 oracle   3777M 3773M wait     1    0   0:00:11 0.0% oracle/15
  3751 oracle   3772M 3769M wait     1    0   0:00:28 0.0% oracle/1
 26066 oracle   2550M 2534M wait    21    0   0:06:54 0.0% oracle/1
 20904 oracle   3768M 3765M wait     1    0   0:00:05 0.0% oracle/1
  7464 oracle   2549M 2532M wait     1    0   0:06:42 0.0% oracle/1
  7266 oracle   3781M 3764M wait     1    0   0:04:45 0.0% oracle/11
  7256 oracle   3769M 3752M wait     1    0   0:06:39 0.0% oracle/1
 23930 oracle   2554M 2538M wait    59    0   0:03:07 0.0% oracle/11
 19553 oracle   3772M 3769M wait    59    0   0:00:10 0.0% oracle/1
  4058 oracle   3768M 3765M wait    60    0   0:00:14 0.0% oracle/1
 14899 oracle   3768M 3765M wait    59    0   0:00:05 0.0% oracle/1
  8670 oracle   2554M 2537M wait    58    0   0:01:35 0.0% oracle/11
 25086 oracle   2553M 2537M wait    59    0   0:00:29 0.0% oracle/11
 15891 oracle   3762M 3758M wait    57    0   0:00:00 0.0% oracle/1
 17399 oracle   3772M 3769M wait    59    0   0:00:19 0.0% oracle/1
 18260 oracle   3772M 3769M wait    59    0   0:02:05 0.0% oracle/1
  4805 oracle   3772M 3769M wait    60    0   0:00:04 0.0% oracle/1
 23116 oracle   3772M 3769M wait     1    0   0:00:14 0.0% oracle/1
 15228 oracle   3765M 3749M cpu11   59    0   0:04:44 0.0% oracle/1
  4946 oracle   3772M 3769M sleep    1    0   0:00:34 0.0% oracle/1
 29429 oracle   3772M 3769M sleep   55    0   0:00:11 0.0% oracle/1
 12875 oracle   2551M 2534M sleep   59    0   0:04:21 0.0% oracle/11
 12632 oracle   2552M 2535M sleep    1    0   0:02:30 0.0% oracle/14
 12594 oracle   2549M 2532M sleep   59    0   0:02:11 0.0% oracle/1
 11515 vssrt     196M  180M wait     1    0   0:01:57 0.0% TbApp/1
 21481 vssrt      76M   62M wait     1    2   0:01:37 0.0% BmanApp/1
 24837 vssrt     178M  165M sleep   59    2   0:01:13 0.0% MrepApp/1
 20360 oracle   3772M 3769M wait     1    0   0:00:22 0.0% oracle/1
 21726 oracle   3777M 3773M wait    57    0   0:00:34 0.0% oracle/11
Total: 1425 processes, 8621 lwps, load averages: 142.80, 134.91, 144.84

top
last pid: 18794;  load avg: 144.64,  133.78,  144.80;  up 154+00:35:28 12:16:18
1425 processes: 601 waiting, 801 sleeping, 3 on cpu                                                                          
CPU states: 95.6% idle,  3.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
Memory: 128G phys mem, 78G free mem, 32G total swap, 32G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 25326 oracle     1  59    0 3768M 3765M wait     0:10  0.42% oracle
 12632 oracle    14  59    0 2552M 2535M wait     2:31  0.25% oracle
 18435 oracle     1  59    0 3762M 3759M wait     3:47  0.15% oracle
 23924 vssrt      1   1    2  206M  193M wait     4:26  0.12% DrepApp
 18260 oracle     1  59    0 3772M 3769M wait     2:06  0.12% oracle
  7264 oracle    11   1    0 3781M 3764M wait     7:47  0.11% oracle
 18337 oracle     1   1    0 3762M 3759M wait     4:56  0.10% oracle
  8670 oracle    11  58    0 2554M 2537M wait     1:35  0.09% oracle
 25011 oracle     1   1    0 3772M 3769M wait     0:31  0.08% oracle
 23930 oracle    11  59    0 2554M 2538M wait     3:09  0.08% oracle
  8674 oracle    11  51    0 3770M 3753M wait     1:16  0.08% oracle
 13886 oracle     1  38    0 2564M 2535M wait     0:08  0.08% oracle
 18783 oracle     1  59    0 3762M 3758M wait     0:00  0.08% oracle
  7262 oracle     1   1    0 3960M 3943M wait     4:23  0.08% oracle
 18430 appworx    1  59    0   50M   47M sleep    6:30  0.08% uzpplpl
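
To make the "who is waiting" question easier to answer from saved prstat output, here is a minimal sketch that tallies the STATE column per user and puts the load average in per-CPU terms. The sample rows are copied from the prstat listing above; the function names `tally_states` and `load_per_cpu` are hypothetical helpers, not part of any Solaris tool.

```python
from collections import Counter

# A few rows copied from the prstat output above
# (PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP).
PRSTAT_SAMPLE = """\
 20125 oracle   3772M 3769M wait    59    0   0:00:32 0.1% oracle/1
 18430 appworx    50M   47M sleep   59    0   0:06:27 0.1% uzpplpl/1
  7264 oracle   3781M 3764M wait     1    0   0:07:45 0.1% oracle/11
 16458 root     7688K 4864K cpu10   59    0   0:00:00 0.0% prstat/1
 23924 vssrt     206M  193M wait     1    2   0:04:24 0.0% DrepApp/1
 15228 oracle   3765M 3749M cpu11   59    0   0:04:44 0.0% oracle/1
"""


def tally_states(prstat_text):
    """Count (username, state) pairs; cpuN states are folded into 'oncpu'."""
    counts = Counter()
    for line in prstat_text.splitlines():
        fields = line.split()
        # Data rows have 10 fields and start with a numeric PID;
        # header and "Total:" lines are skipped.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        user, state = fields[1], fields[4]
        if state.startswith("cpu"):
            state = "oncpu"
        counts[(user, state)] += 1
    return counts


def load_per_cpu(load_avg, ncpus):
    """Load average divided by CPU count."""
    return load_avg / ncpus


counts = tally_states(PRSTAT_SAMPLE)
for (user, state), n in sorted(counts.items()):
    print(f"{user:8s} {state:6s} {n}")

# 142.80 is the 1-minute load average reported above; 16 is the CPU count.
print(f"1-minute load per CPU: {load_per_cpu(142.80, 16):.1f}")
```

Run against the full listing, a tally like this shows at a glance which users own the processes in the wait state.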

The ssdNN devices are SAN LUNs.
Code:
 iostat -xm
 device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
 sd0       0.4    0.5   19.1    2.0  0.0  0.0   25.2   0   0 
 sd1       0.4    0.7   19.1    2.1  0.0  0.0   24.0   0   1 
 sd2       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd0      0.9    0.3   36.5    1.5  0.0  0.0    3.8   0   0 
 ssd1      1.0    0.3   38.9    1.6  0.0  0.0    4.0   0   0 
 ssd2      1.3    0.3   44.6    5.8  0.0  0.0    3.4   0   0 
 ssd3      0.9    0.3   37.6    2.3  0.0  0.0    3.7   0   0 
 ssd5     88.0   27.0 3181.2  311.0  0.0  0.3    2.7   0   8 
 ssd7      0.0    0.0    0.0    0.0  0.0  0.0    0.9   0   0 
 ssd8      0.1    0.0    0.5    0.0  0.0  0.0    2.1   0   0 
 ssd9      0.1    0.0    0.6    0.0  0.0  0.0    2.1   0   0 
 ssd10     0.5    1.2   14.2   49.1  0.0  0.0    2.6   0   0 
 ssd11     0.1    0.0    0.8    0.0  0.0  0.0    2.0   0   0 
 ssd12     0.3    0.0    5.8    0.1  0.0  0.0    3.5   0   0 
 ssd13     5.1    2.5  395.8  270.8  0.0  0.1    8.7   0   1 
 ssd14     2.4   23.7   46.2  121.9  0.0  0.0    1.4   0   2 
 ssd15     0.0    0.0    0.0    0.0  0.0  0.0    0.6   0   0 
 ssd16     0.1    0.0    0.2    0.0  0.0  0.0    1.9   0   0 
 ssd17     0.0    0.0    0.0    0.0  0.0  0.0    1.1   0   0 
 ssd18    73.5   12.0 13469.7  132.1  0.0  1.5   17.1   0  10
 ssd19     2.0    1.7  133.5   18.9  0.0  0.0    4.7   0   0 
 ssd23     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd24     0.0    0.0    0.0    0.0  0.0  0.0    1.1   0   0 
 ssd25     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd26     0.0    0.0    0.0    0.0  0.0  0.0    0.8   0   0 
 ssd27   594.9   65.9 12204.8  669.7  0.0  4.3   86.6   0  74
 ssd28     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd29     0.1    0.2    3.1    0.4  0.0  0.0    2.5   0   0 
 ssd30     0.1    0.0    1.8    0.0  0.0  0.0    2.2   0   0 
 ssd31   140.6   25.2 11266.5  315.0  2.9  5.4   60.3   2  15
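
For picking out the slow LUNs from saved `iostat -x` output, a small sketch like the following may help. It parses the extended-statistics table and lists devices whose svc_t exceeds a cutoff. The sample rows are copied from the listing above; the 30 ms cutoff and the function name `slow_devices` are assumptions chosen for illustration, not recommended values.

```python
# Header plus four rows copied from the iostat listing above.
IOSTAT_SAMPLE = """\
 device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
 ssd5     88.0   27.0 3181.2  311.0  0.0  0.3    2.7   0   8
 ssd18    73.5   12.0 13469.7  132.1  0.0  1.5   17.1   0  10
 ssd27   594.9   65.9 12204.8  669.7  0.0  4.3   86.6   0  74
 ssd31   140.6   25.2 11266.5  315.0  2.9  5.4   60.3   2  15
"""


def slow_devices(iostat_text, svc_t_limit_ms=30.0):
    """Return (device, svc_t, total KB/s) for devices above the svc_t cutoff.

    The cutoff is an arbitrary illustration value, not a tuning guideline.
    """
    lines = [ln.split() for ln in iostat_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    col = {name: i for i, name in enumerate(header)}
    flagged = []
    for row in rows:
        if len(row) != len(header):
            continue  # skip anything that is not a complete data row
        svc_t = float(row[col["svc_t"]])
        if svc_t > svc_t_limit_ms:
            kb_per_s = float(row[col["kr/s"]]) + float(row[col["kw/s"]])
            flagged.append((row[col["device"]], svc_t, kb_per_s))
    # Slowest device first.
    return sorted(flagged, key=lambda d: d[1], reverse=True)


for device, svc_t, kb_per_s in slow_devices(IOSTAT_SAMPLE):
    print(f"{device}: svc_t={svc_t} ms, throughput={kb_per_s:.1f} KB/s")
```

With the 30 ms cutoff, the sample rows flag ssd27 and ssd31; ssd18 stays below it at 17.1 ms. Reporting throughput next to svc_t makes it easy to see whether a slow device is also a busy one.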

Thanks for any comments.

Last edited by jim mcnamara; 02-25-2014 at 04:57 PM..
# 2  
Old 02-25-2014
What does "prstat -Z" show from the global zone?
# 3  
Old 02-25-2014
Well in situations like this (reboot performed) one can only offer suggestions from experience.

With uptime at 150+ days, multiple zones, and multiple Oracle instances, I would be looking at two things.

1. Check the content of /tmp directories on all zones to see if one of them has five million files in it. If so, do we know why? Cleaning them up often clears the issue. If this is the problem (an O/S problem) then I would expect the problem to recur in the short term.

2. What is the setting of the parameter "pg_contig_disable" in the /etc/system file? With a long uptime and Oracle instances, memory can become very fragmented, and if the Oracle DB requests contiguous memory then the system virtually hangs whilst working sets are shuffled to give Oracle what it wants. The cure is either to increase memory size or to allow Oracle to use non-contiguous memory. If this is the problem (an Oracle problem) then I would expect the problem not to recur in the short term.
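
For the second check, a rough sketch of reading a tunable out of /etc/system-style text could look like this. It assumes the usual `set name=value` line form with `*` starting a comment line; the sample fragment and the function name `system_tunable` are hypothetical, and the code works on a string so it can be tried without touching a real system.

```python
import re

# Hypothetical /etc/system fragment, for illustration only.
SAMPLE_ETC_SYSTEM = """\
* comment lines in /etc/system start with an asterisk
set pg_contig_disable=1
"""

# Matches "set [module:]name = value" and captures name and value.
_SET_RE = re.compile(r"^\s*set\s+(?:\w+:)?(\w+)\s*=\s*(\S+)")


def system_tunable(text, name):
    """Return the value assigned to `name` as a string, or None if not set."""
    value = None
    for line in text.splitlines():
        if line.lstrip().startswith("*"):
            continue  # comment line
        match = _SET_RE.match(line)
        if match and match.group(1) == name:
            # Keep the last assignment seen (assumption: later lines
            # override earlier ones).
            value = match.group(2)
    return value


setting = system_tunable(SAMPLE_ETC_SYSTEM, "pg_contig_disable")
print("pg_contig_disable:", setting if setting is not None else "not set in file")
```

On a real host the text would come from reading /etc/system; a return value of None means the file does not set the parameter, so the kernel default applies.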

This really isn't very helpful I know, just thinking aloud.

Last edited by hicksd8; 02-26-2014 at 10:08 AM..
# 4  
Old 02-27-2014
Thanks!
@jlliagre - the system was rebooted and the problem cleared. Before the reboot, prstat -Z did not show any one zone using CPU resources. Nobody had the CPU. As you saw, sys % time was low, too, so the kernel was not thrashing AFAIK.

/tmp gets cleaned up monthly, so maybe 200 files were out there.
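
A file count like this can be confirmed per directory with a short sketch such as the one below, assuming a Python 3 interpreter is available on the host. The function names and the 100,000-entry limit are hypothetical; the directory list would be whatever /tmp paths apply to each zone.

```python
import os


def count_entries(directory):
    """Count directory entries without building a full listing in memory."""
    total = 0
    with os.scandir(directory) as entries:
        for _ in entries:
            total += 1
    return total


def crowded_dirs(directories, limit=100_000):
    """Return {directory: entry_count} for directories above `limit`.

    Directories that cannot be read are skipped. The default limit is an
    arbitrary illustration value.
    """
    result = {}
    for directory in directories:
        try:
            n = count_entries(directory)
        except OSError:
            continue
        if n > limit:
            result[directory] = n
    return result


if __name__ == "__main__":
    # "/tmp" is the local directory; other paths would be added per zone.
    print(crowded_dirs(["/tmp"]))
```

Iterating with `os.scandir` avoids holding millions of names in memory, which matters if a directory really has grown that large.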

@hicksd8 - pg_contig_disable = 0. I think this may have precipitated the problem. OTN has some similar information; we knew about it but decided against setting it. We rebooted, and it is now set to 1. We also forced management to acquiesce to periodic off-hours reboots, so we are now allowed reboots on the weekend. The whole thing is political: no technical person is allowed input in decisions like this until something goes south.