Quote:
Originally Posted by
Adnans2k
My management wants to report how their overall AIX servers / environment is doing so far.
That is a pretty wide field you are plowing there. Systems administration is not so much a question of doing something as of describing, painstakingly and exactly, what has to be done. I'll be glad to provide commands for everything that needs to be done, but let us first discuss what you mean when you say "how ... is doing".
Many of the commands you quoted are related to performance issues. You might want to read a little introduction to this for a discussion of what "performance" is. But the question is: do you think performance issues need constant monitoring? Are your systems that mission-critical performance-wise?
Many systems in fact are not. They need to run and might need to finish certain tasks on time, but whether they finish a task half an hour earlier or later won't even be noticed. Most performance issues are in fact driven by the (complaining) customer. You don't need to monitor the systems in this respect at all: once they are too slow, the customers will tell you. Further, to some extent you can trust the colleagues who set up the systems to have sized them more or less correctly for the respective purpose. (Now, this is not always the case, but in a well-cared-for shop it mostly is. If you work in one which is not: don't try to develop monitoring, get out of there while you can!)
After talking so much about things you don't have to care for (daily), here are a few things you do have to monitor, things which (in my experience) happen regularly and are showstoppers:
Full file systems: this happens with a certain regularity and the upshot ranges from annoying to fatal. Get a full root-fs and AIX starts to throw fits. Get a full /tmp-fs and ksh (at least ksh93) produces unusual hiccups. Have a full /var and printing, spooling, job scheduling and much more will mostly not work any more (it might not even be possible to log on to the system because /var/adm/wtmp cannot be written to - i had this once). Even more troublesome is when the FS with the archive logs for the database is full. "Archiver stuck" makes the Oracle database stand more or less still, doing nothing while grabbing every ounce of processor and memory resources there is until the machine finally crashes.
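A check for full file systems can be as small as the following sketch. It is plain POSIX shell, so ksh will run it. The function name fs_over_limit and the 90 percent default are only assumptions for illustration, and it expects the portable "df -Pk" format, where the fifth column is the usage and the sixth the mount point.

```shell
# Print every file system whose usage is at or above a limit (in percent).
# Reads "df -Pk" output on stdin, so it can also be fed from a saved file.
fs_over_limit() {
    limit=${1:-90}                  # assumed default threshold, adjust to taste
    awk -v limit="$limit" '
        NR > 1 {                    # skip the header line
            pct = $5                # 5th column of "df -P": usage, e.g. "95%"
            sub(/%/, "", pct)       # strip the percent sign
            if (pct + 0 >= limit) print $6, pct "%"
        }'
}

# Typical use on a live system:
#   df -Pk | fs_over_limit 90
```

No output means everything is below the limit, which makes the function easy to combine with any kind of alarm. Mount points containing blanks would need extra care.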
Application not running: You might not like the idea, but some application programs are just Serious-/Hardworking-/Ideal-/Thorough- -ly programmed, if you know what i mean. There are memory leaks which make it necessary to restart them regularly, there are processes that exit without so much as an error message, and all other sorts of nightmares you can imagine - and then some. Monitoring an application usually means checking whether certain processes are running (sometimes whether a certain number of them are running) and raising an alarm if this is not the case.
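Counting processes, as described above, could look roughly like this. It is only a sketch: the names count_procs and app_is_up are made up, "myappd" stands for whatever your application's process is called, and it assumes a ps that understands "-e -o comm=".

```shell
# Count how often a command name appears in a list of names on stdin.
count_procs() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Succeed if at least $2 (default 1) processes named $1 are running.
app_is_up() {
    running=$(ps -e -o comm= | count_procs "$1")
    [ "$running" -ge "${2:-1}" ]
}

# Example: raise an alarm if fewer than 3 "myappd" processes are running
#   app_is_up myappd 3 || echo "ALARM: myappd has too few processes"
```

Matching the whole command name instead of grepping through the ps listing avoids the classic mistake of counting the grep itself.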
Network-/Disk-errors: you might wonder why i mix up such seemingly different areas, but the difference between SAN services and LAN services is starting to blur and the two are beginning to grow together. In a shop your size you probably have no physical disks any more but some sort of SAN box providing the storage. Some fabrics are notorious for temporarily losing paths (i remember this being the case with AIX 5.3 and Hitachi storage - ultimately an AIX FC-driver problem). Depending on your precise setup it might be a good idea to test the network connection to some vital partners and to check the connectivity to the disks.
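Both checks fit into a few lines. The sketch below assumes the default output of lspath (status, disk, parent adapter on each line) and a ping that accepts "-c 1". The function names and the host "nimserver" are purely hypothetical.

```shell
# Print every disk path that is not in state "Enabled".
# Reads "lspath" output on stdin: status, disk name, parent adapter.
bad_paths() {
    awk '$1 != "Enabled" { print $2 " via " $3 " is " $1 }'
}

# Succeed if a host answers a single ping.
host_reachable() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# Example use on a live system:
#   lspath | bad_paths
#   host_reachable nimserver || echo "ALARM: nimserver does not answer"
```

A path that flaps only for a few seconds will slip through a check that runs every few minutes, so treat this as a complement to the error log, not a replacement.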
Backup-errors: There is a joke: the thing you positively do NOT want to hear your systems administrator say is: ahem, you do have a backup, yes? As funny as it sounds: at some time, for everyone, the excrement hits the air moving rotor and you are in deep kimchi. You need a backup in this case, and it is usually exactly at this moment that you find out that every backup you took in the last three years consists only of the message "couldn't continue, exiting now". Believe me, telling management about this very rarely gets you an immediate and substantial raise. Backups fail sometimes and this is no problem at all, but you need to know when it happens: a single missing backup doesn't matter, but the same system complaining every day about an unsuccessful backup should ring every alarm bell there is.
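How you find out whether a backup worked depends entirely on your backup software, so the following is a sketch under a loud assumption: that you keep (or can produce) a history file with one line per backup run, consisting of a date and the word OK or anything else for a failure. The file name and the function name failed_in_a_row are invented for the example.

```shell
# Count how many backup runs in a row have failed most recently.
# Reads the (assumed) history on stdin: "<date> <status>", oldest run first.
failed_in_a_row() {
    awk '
        $2 == "OK" { n = 0; next }   # a good run resets the counter
                   { n++ }           # everything else counts as a failure
        END        { print n + 0 }'
}

# Example: one failed run is tolerated, two in a row raise an alarm
#   if [ "$(failed_in_a_row < /var/adm/backup.history)" -ge 2 ]; then
#       echo "ALARM: backup failed repeatedly"
#   fi
```

This mirrors the point made above: a single miss is ignored, a repeating one is not.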
VIOS: These are the most important systems you have! If they are not working, no other LPAR is working (at least not in a way that could be noticed outside the managed system). In particular, things like SEAs, SEA takeovers and similar events are a good idea to track.
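One way to keep an eye on such events is to filter the error log for the adapter in question. This sketch assumes the usual summary layout of errpt (identifier, timestamp, type, class, resource name, description) and takes the adapter name as a parameter. "ent8" is only a placeholder for your SEA, and errors_for is a made-up name.

```shell
# Print the error log entries that belong to one resource.
# Reads "errpt" summary output on stdin; the 5th column is the resource name.
errors_for() {
    awk -v res="$1" 'NR > 1 && $5 == res'
}

# Example use on a live system:
#   errpt | errors_for ent8
```

Which entries actually indicate an SEA takeover depends on your VIOS level, so look at what a deliberate failover test writes to the log before you build an alarm on it.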
Quote:
Originally Posted by
Adnans2k
also I'm not a scripter, but is there a way I can somehow generate a script to automate this task for me on each LPAR / every LPAR in the system?
There is cron to set up a regular schedule of little scripts to carry out. If you take systems administration seriously you will need to pick up at least some scripting skills, so the best time to start learning is right now. Don't be afraid, scripting is a lot of fun and you won't need big scripts to do what i talked about above. Some of these things will be one-liners or a few lines long and you will pick that up in a moment. And, again: scripting is FUN! A creative and fulfilling process! On AIX you have the best shell there is for scripting at your command: the Korn Shell. I guarantee you that once we get you started you will never want to stop.
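To give you an idea of how little is needed: cron mails whatever a job prints to the owner of the crontab, so a check that stays silent when all is well and prints a line when something is wrong already is a simple alarm. A minimal sketch, in which the script path, the 15-minute schedule and the name report_if_any are only examples:

```shell
# Prefix every line from stdin with the host name.
# With no input nothing is printed, so cron has nothing to mail.
report_if_any() {
    awk -v host="$(uname -n)" '{ print host ": " $0 }'
}

# Inside a check script (hypothetical) you would use it like this:
#   some_check_command | report_if_any
#
# And the matching crontab entry (edit with "crontab -e"), every 15 minutes:
#   0,15,30,45 * * * * /usr/local/bin/check_fs.ksh
```

The minutes are spelled out as a list because not every cron understands a step notation.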
I hope this helps.
bakunin