AIX Health Check

06-25-2015


Hi everyone, I am new to the Unix admin position and need some help. My management wants to report how their overall AIX servers / environment is doing so far. I've been researching and found multiple commands to run on each LPAR. I have a few questions and also wanted to share the commands I'm running, and I would like feedback on whether these commands are enough to show the environment is doing well, or whether I should do something different. Also, I'm not a scripter, but is there a way I can somehow generate a script to automate this task for me on each LPAR / every LPAR in the system? (Our system is very small, maybe 40 LPARs.)

Please let me know if I should provide more details to maybe get a better response.

Thanks in advance

Commands gathered so far:

topas
svmon -G -O unit=GB
nmon then press n
df -g
netstat -rn
lsvg -p rootvg
vmstat
iostat

Adnans2k

06-25-2015

Quote:

Originally Posted by Adnans2k

My management wants to report how their overall AIX servers / environment is doing so far.

That is a pretty wide field you are plowing there. Systems administration is not so much a question of doing something as of describing, painstakingly and exactly, what has to be done. I'll be glad to provide commands for everything that needs to be done, but let us first discuss what you mean when you say "how ... is doing".

Many of the commands you quoted are related to performance issues. You might want to read a short introduction to the topic for a discussion of what "performance" is. But the question is: do you think performance needs constant monitoring? Are your systems that mission-critical performance-wise?

Many systems are in fact not. They need to run and might need to finish certain tasks on time, but whether they finish a task half an hour earlier or later won't even be noticed. Most performance issues are in fact driven by the (complaining) customer. You don't need to monitor the systems in this respect at all: once they are too slow, the customers will tell you. Further, to some extent you can trust the colleagues who set up the systems to have sized them more or less correctly for the respective purpose. (Now, this is not always the case, but in a well-cared-for shop it mostly is. If you work in one which is not: don't try to develop monitoring, get out of there while you can!)

After talking so much about things you don't have to care for (daily), here are a few things you do have to monitor: things which (in my experience) happen regularly and are showstoppers:

Full file systems: this happens with a certain regularity and the upshot ranges from annoying to fatal. Get a full root-fs and AIX starts to throw fits. Get a full /tmp-fs and ksh (at least ksh93) produces unusual hiccups. Have a full /var and printing, spooling, job scheduling and much more will mostly stop working (it might not even be possible to log on to the system because /var/adm/wtmp cannot be written to - I had this once). Even more troublesome is a full FS holding the archive logs for the database. "Archiver stuck" makes the Oracle database stand more or less still, doing nothing while grabbing every ounce of processor and memory resources there is until the machine finally crashes.
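
A minimal sketch of such a full-file-system check, assuming the POSIX df -P output layout (capacity in column 5, mount point in column 6). The function name check_fs_usage and the 90 percent default are made up for illustration and are not part of any standard tool:

```shell
# Read "df -P" output on stdin and print every file system whose usage
# is at or above the limit (in percent) given as the first argument.
# check_fs_usage and the default of 90 are hypothetical choices.
check_fs_usage() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%/, "", pct)         # "95%" -> "95"; "-" stays "-" and counts as 0
            if (pct + 0 >= limit)
                printf "%s is %s%% full\n", $6, pct
        }'
}
```

Called as "df -P | check_fs_usage 90" it prints one line per file system at or above 90 percent and nothing if all is well, so a cron job only has to mail the output when it is non-empty. Mount points containing blanks are not handled by this sketch.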

Application not running: You might not like the idea but some application programs are just Serious-/Hardworking-/Ideal-/Thorough- -ly programmed, if you know what I mean. There are memory sinks which make it necessary to restart them regularly, there are processes that exit without so much as an error message, and all other sorts of nightmares you can imagine - and then some. Monitoring an application usually means checking whether certain processes are running (sometimes whether a certain number of them are running) and raising an alarm if this is not the case.
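
The process check described here can be sketched in a few lines, assuming the process list comes from "ps -e -o comm" (one command name per line). The function names count_procs and procs_ok are hypothetical:

```shell
# Count how often a command name appears in a process list read from stdin.
# Expects one command name in the first column of every line.
count_procs() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Succeed (exit status 0) if at least "min" instances of "name" are running.
procs_ok() {
    name=$1
    min=$2
    [ "$(count_procs "$name")" -ge "$min" ]
}
```

A monitoring script could then run something like "ps -e -o comm | procs_ok sshd 1 || echo 'sshd is not running'" and send that message wherever alarms are collected. The match is on the exact command name, so the header line of ps does no harm.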

Network-/Disk-errors: you might wonder why I mix up such seemingly different areas, but the differences between SAN services and LAN services are starting to blur and the two are growing together. In a shop your size you probably have no physical disks any more but some sort of SAN box providing the storage. Some fabrics are notorious for temporarily losing paths (I remember this being the case with AIX 5.3 and Hitachi storage - ultimately an AIX FC-driver problem). Depending on your precise setup it might be a good idea to test the network connection to some vital partners and to check the connectivity to the disks.

Backup-errors: There is a joke: the thing you positively do NOT want to hear your systems administrator say is: "ahem, you do have a backup, yes?" As funny as it sounds: at some point, for everyone, the excrement hits the air-moving rotor and you are in deep kimchi. You need a backup in this case, and it is usually exactly this moment when you find out that every backup you took in the last three years consists only of the message "couldn't continue, exiting now". Believe me, telling management about this very rarely gets you an immediate and substantial raise. Backups fail sometimes and this is no problem at all, but you need to know when it happens: one missing backup doesn't matter, but the same system complaining every day about an unsuccessful backup should ring every alarm bell there is.

VIOS: These are the most important systems you have! If they are not working, no other LPAR is working (at least not in a way that could be noticed outside the managed system). Particularly things like SEAs, SEA takeovers and similar events might be a good idea to track.

Quote:

Originally Posted by Adnans2k

Also, I'm not a scripter, but is there a way I can somehow generate a script to automate this task for me on each LPAR / every LPAR in the system?

There is cron to set up a regular schedule of little scripts to carry out. If you take systems administration seriously you will need to pick up at least some scripting skills, so the best time to start learning is right now. Don't be afraid, scripting is a lot of fun and you won't need big scripts to do what I talked about above. Some of the things will be one- or few-liners and you will pick that up in a moment. And, again: scripting is FUN! A creative and fulfilling process! On AIX you have the best shell there is for scripting at your command: the Korn Shell. I guarantee you once we get you started you will never want to stop.
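
To give an idea of how small such a script can be, here is a sketch that runs one command on every LPAR named in a text file. It assumes password-less ssh from one admin host to all LPARs, which the thread does not mention; the function name run_on_all, the REMOTE variable and the file layout are likewise made up for illustration:

```shell
# Command used to reach a host. ssh is an assumption; any command that
# accepts "host command args..." can be substituted.
REMOTE=${REMOTE:-ssh}

# Run one command on every host listed in a file (one name per line).
# Blank lines and lines starting with "#" are skipped.
run_on_all() {
    hostfile=$1
    shift
    while read -r host; do
        case $host in
            ''|'#'*) continue ;;
        esac
        echo "=== $host ==="
        # stdin comes from /dev/null so the remote command cannot
        # swallow the rest of the host list
        $REMOTE "$host" "$@" < /dev/null
    done < "$hostfile"
}
```

With a host list in, say, /usr/local/etc/lpars.txt (a hypothetical path), "run_on_all /usr/local/etc/lpars.txt df -g" collects the file system report of all LPARs in one go. Put into a script and called from a crontab line such as "0 7 * * * /usr/local/bin/healthcheck.sh" (again a hypothetical path), it would run every morning at 07:00.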

I hope this helps.

bakunin

06-26-2015

I would suggest using something like ganglia or lpar2rrd - both tools generate "manager-friendly" charts, although the installation procedures are not that easy...

agent.kgb

06-26-2015

I absolutely second what bakunin and agent.kgb wrote. It can't harm, though, to set up nmon in your crontab to write some performance data to files automatically, so that you have something in hand when those complaints about performance reach you. It makes investigation afterwards much easier. This data can also be fed to nmon2rrd, a charting tool in the same spirit as the ones agent.kgb mentioned. Check the IBM Wiki (NMON Documentation) for setting it up with cron.
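
A rough sketch of what a daily nmon recording could look like, assuming the commonly used nmon options -f (write to a file), -s (seconds between snapshots), -c (number of snapshots) and -m (output directory). The helper name nmon_day_args and the directory are made up; check the nmon documentation for your AIX level before relying on it:

```shell
# Print the nmon arguments that cover one full day (86400 seconds)
# at the given snapshot interval.
# $1 = seconds between snapshots, $2 = output directory (hypothetical).
nmon_day_args() {
    interval=$1
    outdir=$2
    count=$((86400 / interval))
    echo "-f -s $interval -c $count -m $outdir"
}
```

A small wrapper started by cron at midnight could then call "nmon $(nmon_day_args 300 /var/perf/nmon)", which records a snapshot every five minutes and ends just before the next day's recording starts.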

If you take up bakunin's advice to write some small scripts: the AIX Error Report (check man errpt) is the central place where problems of any kind are gathered in a list with timestamp, details, category etc.
You could write or acquire a filter script that checks this and, for instance, sends you a mail if anything bad occurs.
You can also add a stanza to the ODM that automatically triggers an action like a mail, a script etc. to inform you. A script might be the preferred action, since you will surely want to filter the entries in errpt and also prevent a message flood in case something produces entries at a rate of, say, 100 per second. I had this from a connected jukebox once and was happy not to have a plain mail being sent as the action.
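
Such a filter script could start out like the sketch below. It assumes the usual errpt summary layout (IDENTIFIER, TIMESTAMP, T for type, C for class, RESOURCE_NAME, DESCRIPTION) and keeps only entries of type P (permanent), collapsing repeats into one line with a count so that a flood of identical entries produces a single line. The function name summarize_errpt is made up:

```shell
# Read errpt summary output on stdin and print one line per
# identifier/resource pair of type "P" (permanent), with a repeat count.
summarize_errpt() {
    awk '
        NR > 1 && $3 == "P" {              # skip header, keep permanent errors
            key = $1 " " $5                # identifier plus resource name
            if (!(key in count))
                order[++n] = key           # remember first-seen order
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%d x %s\n", count[order[i]], order[i]
        }'
}
```

Run as "errpt | summarize_errpt" it prints nothing when there are no permanent errors, so the calling script only needs to send a mail when the output is non-empty.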

This ODM entry is called "errnotify" and is documented in the official IBM documentation online. Here, though, is a summary of the error handling capabilities and facilities on AIX:
AIX for System Administrators

A very good blog in every regard anyway.

If your systems are connected to an HMC, you can additionally check the events that come in there for faults.

zaxxon

06-29-2015

Thank you so much for your support, I appreciate it.

Since I'm new, can anyone recommend any preferred sites or YouTube channels where I can learn scripting to get these automated?

Adnans2k

06-29-2015

A regular human review of the output from errpt and, if necessary, errpt -a would be useful.

You can also run the hardware diagnostics to get reports on allocated real hardware (virtual devices are skipped) through the diag panels. I did know how to run this from the command line, but I've forgotten.

Robin

rbatte1

06-29-2015

I wouldn't recommend using bash on AIX, but I think this guide can help you get started with scripting:
http://www.tldp.org/LDP/abs/abs-guide.pdf

agent.kgb
