Entire server unresponsive


 
# 1  
Old 02-12-2015
Hi guys,

I have a Sun M5000 server running several whole-root Solaris zones, each hosting an SAP system. Recently one of the SAP systems hung; I suspect a memory issue. I could not log in to that zone at all, and I then found that I could not log in to the global zone either. I started halting the zones one by one, and at some point I was able to log in to the global zone again.
  1. Can one particular zone cause the entire server to hang? What can be done to avoid this?
  2. Which commands other than prstat -Z will help identify the issue and its symptoms?

Of course, I'm also looking at memory tuning on the SAP side to prevent this from happening again.


regards.

Last edited by rbatte1; 02-13-2015 at 08:15 AM.. Reason: Added LIST=1 tags and ICODE tags
# 2  
Old 02-12-2015
Is this a client-server setup? Between what and what?
You should have reacted earlier if a zone created such a situation, because after the fact we can only guess at a few reasons:
1) It can happen if the setup is badly designed...
2) I can't remember

More importantly, what did you find in your logs? What caused the hang on the system side, not the application side? Overload? Something else?

If I were asked for a reason at first glance, and this is a client-server box with many hundreds of concurrent connections from PCs, I would say look with netstat for connections stuck in FIN_WAIT and similar states: on a badly tuned system you can run out of sockets, which would explain why you cannot open new connections.
I'll let others give you a better explanation than I can at the moment.
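
A quick way to check the socket theory is to count TCP connections per state. The helper below is only a sketch: the function name is made up for illustration, and it assumes the state is the last column of the netstat output, which is how netstat -an prints TCP lines. On Solaris you may need nawk in place of awk.

```shell
# Hypothetical helper: read netstat output on stdin and print
# "<count> <STATE>" per TCP state, most frequent state first.
count_tcp_states() {
    # Keep only lines whose last field looks like a state name
    # (all caps, e.g. ESTABLISHED, FIN_WAIT_2); this skips headers.
    LC_ALL=C awk '$NF ~ /^[A-Z][A-Z_0-9]*$/ { n[$NF]++ }
                  END { for (s in n) print n[s], s }' | sort -rn
}

# Usage, from the global zone:
#   netstat -an -P tcp | count_tcp_states
```

A large and growing count in FIN_WAIT_2, CLOSE_WAIT or TIME_WAIT compared with ESTABLISHED would support the idea; a small one suggests looking elsewhere.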

Good Luck in your investigation
# 3  
Old 02-12-2015
You have to use zone resource management to prevent that problem. This is dummied-up output from prctl -i zone [zonename]

Code:
zone.max-swap
        system          16.0EB    max   deny                                 -
zone.max-locked-memory
        system          16.0EB    max   deny                                 -
zone.max-shm-memory
        system          20.0GB    max   deny                                 -
zone.max-shm-ids
        system            1.8M     max   deny                                 -
zone.max-sem-ids
        system          16.8M     max   deny                                 -
zone.max-msg-ids
        system          16.8M     max   deny                                 -
zone.max-lwps
        system            8.4K     max   deny                                 -
zone.cpu-cap
        privileged        200       -   deny                                 -
        system          4.29G     inf   deny                                 -
zone.cpu-shares
        privileged          1       -   none                                 -
        system          65.5K     max   none

You can control these settings persistently with zonecfg, or dynamically on a running zone with prctl.
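
As a rough sketch of the zonecfg side, something like the following caps one zone's memory. The zone name sapzone1 and the sizes are placeholders, not recommendations, and it assumes a Solaris release whose zonecfg supports the capped-memory resource. RUN defaults to echo, so the script only prints the commands for review; run it with RUN set to empty, as root in the global zone, to actually apply them.

```shell
# Dry run by default: commands are printed, not executed.
RUN=${RUN-echo}

# Hypothetical helper: cap physical memory and swap for one zone.
cap_zone_memory() {
    zone=$1; phys=$2; swap=$3
    # Persistent caps stored in the zone configuration.
    # If a capped-memory resource already exists, use
    # "select capped-memory" instead of "add capped-memory".
    $RUN zonecfg -z "$zone" "add capped-memory; set physical=$phys; set swap=$swap; end"
    # Show what is now configured.
    $RUN zonecfg -z "$zone" info capped-memory
    # Show the swap cap the running zone currently has.
    $RUN prctl -n zone.max-swap -i zone "$zone"
}

cap_zone_memory sapzone1 32g 48g
```

Note that the physical cap is enforced by the resource capping daemon (rcapd), so that service has to be enabled for it to have any effect, and the zonecfg values are picked up when the zone next boots.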

Examining the running system requires using iostat, prstat, fsstat, netstat -s, and
echo '::memstat' | mdb -k # from global zone

to get a BASIC idea. Advanced probing usually requires dtrace.
# 4  
Old 02-13-2015
Thanks Vbe and Jim for your replies.

I only detected the problem a bit late.

Jim, can you please briefly explain how to interpret the following output (what do I need to look for):
1)
Code:
prctl -i zone <zonename>

2) In my case, output of
Code:
echo '::memstat' | mdb -k # from global zone

is:
Code:
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1536666             12005    5%
ZFS File Data            18056512            141066   55%
Anon                     11687507             91308   35%
Exec and libs              559302              4369    2%
Page cache                  76990               601    0%
Free (cachelist)            47223               368    0%
Free (freelist)           1010477              7894    3%

Total                    32974677            257614
Physical                 32952795            257443

How should I interpret this?

regards.
# 5  
Old 02-13-2015
I would start by limiting the maximum size of the ZFS ARC in the global zone, as well as in any kernel zones, to some sane value that depends on the workload.

Depending on what you run in the zones, you might want to limit the ARC to a couple of GB at most, leaving the rest of the memory to the service in question.

This will, of course, limit the read performance of the host: a smaller cache means fewer cache hits and more physical reads.

Do you run everything (applications, databases) on ZFS filesystems, or some other combination?
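
For reference, the ::memstat output above shows ZFS File Data at 141066 MB (55%) against 7894 MB on the freelist, which is why the ARC is the first suspect. The usual way to cap it is the zfs_arc_max tunable in /etc/system, which takes effect after a reboot. The snippet below is just a sketch that works out the line to add; the helper name is made up, the 16 GiB figure is purely illustrative, and it needs a shell with 64-bit arithmetic such as bash.

```shell
# Hypothetical helper: print the /etc/system line that caps the
# ZFS ARC at the given number of GiB (value is in bytes, as hex).
arc_max_line() {
    gib=$1
    bytes=$((gib * 1024 * 1024 * 1024))
    printf 'set zfs:zfs_arc_max = 0x%x\n' "$bytes"
}

arc_max_line 16    # prints: set zfs:zfs_arc_max = 0x400000000
```

After the reboot you can compare kstat -p zfs:0:arcstats:c_max with kstat -p zfs:0:arcstats:size to confirm the ceiling and see how much the ARC is actually using.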
# 6  
Old 02-13-2015
Quote:
Originally Posted by frum
Is it possible due to one particular zone, the entire server gets 'hanged'? What can be done to avoid this?
Yes, it is possible for one zone to eat enough resources to grossly affect other zones and global.

The tools are there to cap the memory usage of this zone in the zone configuration (zonecfg), if its consumption of physical memory is definitely the problem.

Oracle document the options here:
http://docs.oracle.com/cd/E19253-01/...v-1/index.html

Of course, users of this zone may then run into the new limits. If that's a problem, consider increasing the overall RAM in the system (again, assuming your diagnosis that memory is the problem is correct).

---------- Post updated at 11:38 AM ---------- Previous update was at 11:36 AM ----------

Sorry - just realized jim_mcnamara has already said this (but I'll leave this post now anyway).

Last edited by hicksd8; 02-13-2015 at 07:46 AM..
# 7  
Old 02-13-2015
Peasant is spot on: the ZFS cache and databases do not play well together, so limit the ARC size.

The link hicksd8 gave you explains those resource limits shown by prctl, I believe.

Be sure FSS (the Fair Share Scheduler) is enabled. dispadmin does that, run from the global zone.
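
Roughly, enabling FSS and giving a zone a share looks like the sketch below. The function names, the zone name sapzone1 and the share value 20 are placeholders for illustration. As a precaution RUN defaults to echo, so the commands are only printed; run with RUN set to empty, as root in the global zone, to execute them.

```shell
# Dry run by default: commands are printed, not executed.
RUN=${RUN-echo}

# Hypothetical helper: make FSS the system default scheduler.
enable_fss() {
    # Default scheduling class from the next boot onward.
    $RUN dispadmin -d FSS
    # Move processes that are already running into FSS now.
    $RUN priocntl -s -c FSS -i all
    # Print the configured default class to confirm.
    $RUN dispadmin -d
}

# Hypothetical helper: assign CPU shares to one zone.
set_zone_shares() {
    zone=$1; shares=$2
    # Persistent setting in the zone configuration.
    $RUN zonecfg -z "$zone" "set cpu-shares=$shares"
    # Same value on the running zone, no zone reboot needed.
    $RUN prctl -n zone.cpu-shares -v "$shares" -r -i zone "$zone"
}

enable_fss
set_zone_shares sapzone1 20
```

Shares only matter relative to each other and only when CPUs are contended, so a zone with 20 shares next to zones with 10 each gets about twice their CPU time under load, and is not limited at all on an idle box.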