AIX Health Check


 
# 1  
Old 06-25-2015
AIX Health Check

Hi everyone, I am new to the Unix admin position and need some help. My management wants to report how their overall AIX servers / environment is doing so far. I've been researching and found multiple commands to run on each LPAR. I have a few questions and also wanted to share the commands I'm running. I'd like feedback on whether these commands are enough to show the environment is doing well, or whether I should do something different. Also, I'm not a scripter, but is there a way I can somehow generate a script to automate this task for me on each LPAR / every LPAR in the system? (Our system is very small, maybe 40 LPARs.)

Please let me know if I should provide more details to maybe get a better response.

Thanks in advance.

commands gathered so far:-
  1. topas
  2. svmon -G -O unit=GB
  3. nmon then press n
  4. df -g
  5. netstat -rn
  6. lsvg -p rootvg
  7. vmstat
  8. iostat

Last edited by rbatte1; 06-26-2015 at 08:11 AM.. Reason: Converted text to numbered list with LIST=1 tags
# 2  
Old 06-25-2015
Quote:
Originally Posted by Adnans2k
My management wants to report how their overall AIX servers / environment is doing so far.
That is a pretty wide field you are plowing there. Systems administration is not so much a question of doing something as of describing, painstakingly exactly, what has to be done. I'll be glad to provide commands for everything that needs to be done, but let us first discuss what you mean when you say "how ... is doing".

Many of the commands you quoted are related to performance issues. You might want to read a little introduction to this for a discussion of what "performance" is. But the question is: do you think performance issues need constant monitoring? Are your systems that mission-critical performance-wise?

Many systems are in fact not. They need to run and might need to finish certain tasks on time, but whether they finish such a task half an hour earlier or later won't even be noticed. Most performance issues are in fact driven by the (complaining) customer. You don't need to monitor the systems in this respect at all: once they are too slow, the customers will tell you. Further, to some extent you can trust the colleagues who set up the systems to have sized them more or less correctly for the respective purpose. (Now, this is not always the case, but in a well-cared-for shop it mostly is. If you work in one which is not: don't try to develop monitoring, get out of there while you can!)

After talking so much about things you don't have to care for (daily), here are a few things you do have to monitor: things which, in my experience, happen regularly and are showstoppers:

Full file systems: this happens with a certain regularity and the upshot ranges from annoying to fatal. Get a full root-fs and AIX starts to throw fits. Get a full /tmp-fs and ksh (at least ksh93) produces unusual hiccups. Have a full /var and printing, spooling, job scheduling and much more will mostly not work any more (it might even be impossible to log on to the system because /var/adm/wtmp cannot be written to - I had this once). Even more troublesome is a full FS holding the archive logs for the database. "Archiver stuck" makes the Oracle database stand more or less still, doing nothing while grabbing every ounce of processor and memory resources there is until the machine finally crashes.
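
A minimal sketch of such a file system check, assuming POSIX-format df output (df -Pk) and a threshold of 90%, which is only an illustrative default. The function name check_fs is made up for this example. It reads the df output on stdin, so it can be tried with canned input first.

```shell
# Print every file system whose usage is at or above a threshold.
# Reads "df -Pk" output on stdin.
# $1 = threshold in percent (assumed default: 90)
check_fs() {
    awk -v limit="${1:-90}" '
        NR == 1 { next }              # skip the header line
        {
            pct = $5                  # "Capacity" column, e.g. "93%"
            sub(/%/, "", pct)
            # pseudo file systems show "-" here, which counts as 0
            if (pct + 0 >= limit + 0)
                printf "%s is %s%% full\n", $6, pct
        }'
}
```

On a live system the call would be something like: df -Pk | check_fs 90. Mount points containing blanks would be cut short by this simple column split.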

Application not running: You might not like the idea but some application programs are just Serious-/Hardworking-/Ideal-/Thorough- -ly programmed, if you know what I mean. There are memory leaks which make it necessary to restart them regularly, there are processes that exit without so much as an error message, and all other sorts of nightmares you can imagine - and then some. Monitoring an application usually means checking whether certain processes are running (sometimes whether a certain number of them are running) and raising an alarm if this is not the case.
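
A process check along these lines could look as follows. This is only a sketch: the function names are invented, and it assumes that "ps -e -o comm=" prints one bare command name per line on your system.

```shell
# Count processes whose command name matches exactly.
# Reads one command name per line on stdin.
# $1 = process name to look for
count_procs() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Raise an alarm if fewer instances than expected are running.
# $1 = process name, $2 = minimum number of instances
check_app() {
    running=$(ps -e -o comm= | count_procs "$1")
    if [ "$running" -lt "$2" ]; then
        echo "ALERT: $1 has $running process(es), expected at least $2"
        return 1
    fi
    return 0
}
```

For example, check_app sshd 1 prints nothing as long as at least one sshd is running.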

Network-/Disk-errors: you might wonder why I mix up such seemingly different areas, but the differences between SAN services and LAN services are starting to blur and the two are beginning to grow together. In a shop your size you probably have no physical disks any more but some sort of SAN box providing the storage. Some fabrics are notorious for temporarily losing paths (I remember this being the case with AIX 5.3 and Hitachi storage - ultimately an AIX FC-driver problem). Depending on your precise setup it might be a good idea to test the network connection to some vital partners and to check the connectivity to the disks.
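
Both tests can be sketched in a few lines. The assumptions here: lspath prints its default three columns (status, device, parent adapter), and a partner counts as reachable if it answers three pings. The function names are made up.

```shell
# List every disk path that is not in state "Enabled".
# Reads "lspath" output on stdin: status, device, parent adapter.
bad_paths() {
    awk 'NF >= 3 && $1 != "Enabled" { print $2 " via " $3 " is " $1 }'
}

# Complain if a partner host does not answer a few pings.
# $1 = host name or address
check_host() {
    ping -c 3 "$1" >/dev/null 2>&1 || echo "ALERT: $1 does not answer ping"
}
```

On an LPAR this would be called as lspath | bad_paths, and check_host once per vital partner.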

Backup-errors: There is a joke: the thing you positively do NOT want to hear your systems administrator saying is: "ahem, you do have a backup, yes?" As funny as it sounds: at some time, for everyone, the excrement hits the air moving rotor and you are in deep kimchi. You need a backup in this case, and it is usually exactly at this moment that you find out that every backup you took in the last three years consists only of the message "couldn't continue, exiting now". Believe me, telling management about this very rarely gets you an immediate and substantial raise. Backups fail sometimes and this is no problem at all, but you need to know when it happens: one missing backup doesn't matter, but the same system complaining every day about an unsuccessful backup should ring every alarm bell there is.
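
How a failed backup shows up depends entirely on the backup software, so the following is only a sketch under a loud assumption: the backup job appends one line per run, "YYYY-MM-DD OK" or "YYYY-MM-DD FAILED", to a history file. That file format and the function names are hypothetical.

```shell
# Print how many runs in a row have failed, counted from the latest run.
# Reads the (hypothetical) history on stdin, oldest entry first.
failed_streak() {
    awk '$2 == "FAILED" { n++; next } { n = 0 } END { print n + 0 }'
}

# Raise an alarm once the streak reaches a limit.
# $1 = history file, $2 = number of failures in a row that is too many
backup_alarm() {
    streak=$(failed_streak < "$1")
    if [ "$streak" -ge "$2" ]; then
        echo "ALERT: the last $streak backups failed"
    fi
}
```

A single failure stays quiet with a limit of 2, while the same system failing day after day triggers the alarm.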

VIOS: These are the most important systems you have! If they are not working, no other LPAR is working (at least not in a way that could be noticed outside the managed system). In particular, things like SEAs, SEA takeovers and similar events might be a good idea to track.

Quote:
Originally Posted by Adnans2k
Also, I'm not a scripter, but is there a way I can somehow generate a script to automate this task for me on each LPAR / every LPAR in the system?
There is cron to run little scripts on a regular schedule. If you take systems administration seriously you will need to pick up at least some scripting skills, so the best time to start learning is right now. Don't be afraid, scripting is a lot of fun and you won't need big scripts to do what I talked about above. Some of the things will be one- or few-liners and you will pick that up in a moment. And, again: scripting is FUN! A creative and fulfilling process! On AIX you have the best shell there is for scripting at your command: the Korn Shell. I guarantee you that once we get you started you will never want to stop.
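
To give an idea of how small such a script can be, here is a sketch of a wrapper that runs a check and sends mail only when the check printed something. The function name, the MAILER override (there so the function can be tried without sending real mail) and the script path in the comment are all assumptions.

```shell
# Run a check command and mail its output only if there is any.
# $1 = subject, $2 = recipient, remaining arguments = the check command
mail_if_output() {
    subject=$1
    recipient=$2
    shift 2
    output=$("$@" 2>&1)
    if [ -n "$output" ]; then
        printf '%s\n' "$output" | ${MAILER:-mail} -s "$subject" "$recipient"
    fi
}

# Possible crontab entry, every 15 minutes, for a hypothetical script
# /usr/local/bin/healthcheck.ksh that calls the function above:
#   0,15,30,45 * * * * /usr/local/bin/healthcheck.ksh
```

The minutes are written out as a list because not every cron understands the */15 shorthand.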

I hope this helps.

bakunin

Last edited by bakunin; 06-25-2015 at 07:08 PM..
# 3  
Old 06-26-2015
I would suggest using something like ganglia or lpar2rrd - both tools generate "manager-friendly" charts, although the installation procedures are not that easy...
# 4  
Old 06-26-2015
I absolutely second what bakunin and agent.kgb wrote. It can't harm, though, to set up nmon in your crontab to write some performance data automatically to files, so that you have something at hand when those complaints about performance reach you. It makes investigation afterwards much easier. This data can also be fed to nmon2rrd which agent.kgb mentioned. Check the IBM Wiki for setting it up with cron:
Click me: NMON Documentation.
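
As a rough sketch of such a cron job: nmon can record to a file (-f) with a given interval in seconds (-s) and number of snapshots (-c), writing into a chosen directory (-m). The function names, the 5-minute interval and the directory are assumptions for illustration only.

```shell
# Number of snapshots needed to cover a period at a given interval.
# $1 = interval in seconds, $2 = period in seconds (default: one day)
nmon_count() {
    echo $(( ${2:-86400} / $1 ))
}

# Start one recording that covers the next 24 hours.
# $1 = interval in seconds, $2 = output directory (must already exist)
start_nmon() {
    nmon -f -s "$1" -c "$(nmon_count "$1")" -m "$2"
}

# Possible crontab entry, once a day at midnight, for a hypothetical
# script that calls: start_nmon 300 /var/perf/nmon
#   0 0 * * * /usr/local/bin/start_nmon.ksh
```

With a 300-second interval this gives 288 snapshots per day, one file per LPAR per day.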

If you pick up bakunin's advice to write some small scripts: the AIX Error Report (check man errpt) is the central place where problems of any kind are gathered in a list with timestamp, details, category etc.
You could write or acquire a filter script that checks this and, for instance, sends you a mail if anything bad occurs.
You can also add a stanza to the ODM that automatically triggers an action like a mail, a script etc. to inform you. A script might be the preferred action, since you will surely want to filter the entries in errpt and also want to prevent a message flood in case something produces, say, 100 entries per second. I had this from a connected jukebox once and was happy not to have a plain mail being sent as the action.
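
A filter of that kind can be sketched as below. It assumes the usual errpt summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) and treats only entries of type P (permanent) as worth reporting, which is a choice made for this example. Repeated entries are collapsed into one line with a count, so 100 entries per second still produce a single line.

```shell
# Summarize permanent errors from "errpt" output read on stdin.
# Prints one line per identifier/resource pair with the number of entries.
summarize_errpt() {
    awk '
        $1 == "IDENTIFIER" { next }            # skip the header line
        $3 == "P" {                            # T column: P = permanent
            key = $1 " " $5                    # identifier + resource name
            if (!(key in count)) order[++n] = key
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s: %d permanent error(s)\n", order[i], count[order[i]]
        }'
}
```

Called as errpt | summarize_errpt, it prints nothing when there are no permanent errors, which makes it easy to mail the output only when it is non-empty.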

This ODM entry is called "errnotify" and is documented in the official IBM documentation online. Here is a very short summary of the error handling capabilities and facilities on AIX:
AIX for System Administrators

A very good blog in every regard anyway.
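
For the errnotify route, a stanza can be generated and loaded roughly like this. Treat it as a sketch to be checked against the IBM documentation: the stanza name and script path are hypothetical, and the class/type values (H, PERM) restrict it to permanent hardware errors only as an example. The error daemon passes the sequence number of the log entry to the method as $1.

```shell
# Print an errnotify stanza on stdout.
# $1 = stanza name, $2 = full path of the notify script
errnotify_stanza() {
    cat <<EOF
errnotify:
        en_name = "$1"
        en_persistenceflg = 1
        en_class = "H"
        en_type = "PERM"
        en_method = "$2 \$1"
EOF
}

# Possible use, as root:
#   errnotify_stanza hc_notify /usr/local/bin/hc_notify.ksh > /tmp/hc_notify.add
#   odmadd /tmp/hc_notify.add
# Inside the notify script, the full entry can be shown with:
#   errpt -a -l "$1"
```

Pointing en_method at a script rather than a plain mail command leaves room for the filtering and flood protection discussed above.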

If your systems are connected to a HMC, you can additionally check there the events that come in for faults.

Last edited by zaxxon; 06-26-2015 at 08:55 AM.. Reason: changed lpar2rrd to nmon2rrd, typos etc.
# 5  
Old 06-29-2015
Thank you so much for your support, I appreciate it.

But since I'm new, can anyone recommend any preferred sites or YouTube channels where I can learn scripting to get these automated?
# 6  
Old 06-29-2015
A regular human review of the output from errpt and, if necessary, errpt -a would be useful.

You can also run the hardware diagnostics through the diag panels to get reports on allocated real hardware (virtual devices are skipped). I did know how to run this from the command line, but I've forgotten.



Robin
# 7  
Old 06-29-2015
I wouldn't recommend using bash on AIX, but I think this guide can help you get started with scripting:
http://www.tldp.org/LDP/abs/abs-guide.pdf