Operating Systems AIX [Opinion] A Public Answer To Rob McNelly Post 302978540 by bakunin on Sunday 31st of July 2016 10:45:09 PM
It took some time to prove my point, but here it is. This is what happened last week:

On Tuesday both of my HMCs were no longer getting a connection to any of the managed systems although both were responding quite normally at their public interfaces both via ssh and the web GUI.

Our environment consists of about 20 p780 and p880, along with some smaller systems (p740s) thrown in for good measure. On these run some 350 LPARs of various sizes. Yes, we are a big shop and having no HMC to manage it is kind of a problem.

The first thing i did was to reboot one of the HMCs. As i can not really diagnose any problem, because all the tools necessary for that are not available, this was the best and fastest thing i could do to bring the system to a defined state. The reboot did take place, and i could still see the IPs of the service processors but no longer the given names of the managed systems. It changed, of course, nothing.

Since both HMCs lost all their connections at seemingly the same time i came up with the theory that maybe the network was responsible for that. So i got me a network admin and we traced the switch to which all the service processors and the HMCs were connected. This management network is closed and unrouted, but we were able to confirm it worked and all the correct ARP information was there.

Note: forget about finding out things like the MAC address of an interface on the HMC. Because the ifconfig command to do so is such a complicated thing, IBM made things very easy for me by not confusing me at all with such information, and made us dig into the logs of the switch to make sure the MAC addresses were what they should be. Thank you, IBM, for making my work so much easier.

At this point i opened a Prio-2-call at IBM. It was 2:00 pm and i expected to be called within the next 30 minutes. As it is, when i started to work with AIX more than 20 years ago this would have been a Prio-3-call and the phone would have rung within minutes. Times have changed.

It was Wednesday, somewhat past 14:30, when IBM finally deemed me worth an answer. The dispatcher first asked if i would agree to continue in English, which i allowed. (Big mistake. The English the technician spoke was barely understandable at all and i probably would have better understood his native Bulgarian even though i don't speak that at all.)

First i told him what exactly i had done up to this point, including the network trace and the reboot. He told me he would send me per mail some procedures i should carry out for him, but it would take him ten minutes or so to prepare the mail. No problem! Something happens, finally. After 30 minutes i was wondering, after 1 hour i was angry. After waiting for two hours i called the hotline again and asked who they thought i was. Within minutes the same technician called me and told me that "something went wrong with my email" because he had tried several times but it always came back with "address unresolvable" or so. OK, things can happen - but: couldn't he have called me and asked?? He obviously knew how to call me, no?

Well, after i sent a mail to him myself he was able to answer that. I got a mail about how to create the hscpe user and use that to create a dump. I did so, then uploaded the dumps from both HMCs to IBM's support site. (If you ever have to do that: a dump is some 2.5 GB in size, so it takes some time.)

On Thursday i got a mail from the guy, telling me that the good news was that nothing was amiss with my hardware. He advised me to check for loose network connections. I wrote back a rather acerbic comment that i had done that first and had already told him so, painstakingly describing the network traces we did. Anyways, i went to the datacenter and made sure all the network connections were there (and, what a surprise, it turned out that an interface whose MAC address i was able to determine from the switch's ARP cache was indeed connected to that switch). I was told i would be passed over to second-level support.

On Friday nothing was to be heard from IBM. I suppose they were searching for the person doing the second-level support for this planet. In the meantime my colleague had a breakthrough, though: it is not possible to do a simple df on an HMC because that would perhaps disrupt the intricate work IBM has done with the HMC's software, but by issuing

Code:
lshmcfs

he was able to detect that on both HMCs the /var filesystem was 100% full. Yes, there is a method to remedy that, namely the chhmcfs command, but - as usual - it didn't work. So the final solution was to break into the HMC, become root and do what UNIX-Admins have always done: clean up the filesystem by using rm. After several reboots and several rediscovery rounds we saw - kudos to my colleague - all our managed systems again.
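
For the record, here is a minimal sketch of the kind of check that would have caught this days earlier. It assumes a root shell where plain df -P is available, which the restricted shell of the HMC does not give you. The function name full_filesystems and the threshold of 95 are made up for illustration; this is not an HMC command and it does not parse lshmcfs output.

Code:
```shell
# Print every filesystem whose usage is at or above a threshold (percent).
# Reads POSIX "df -P" style output on stdin:
#   Filesystem 1024-blocks Used Available Capacity Mounted on
# full_filesystems is a hypothetical helper name, not an HMC command.
# Assumption: mount points contain no spaces (field 6 is the mount point).
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                        # skip the header line
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= limit + 0)
                print $6, pct "%"
        }'
}

# Typical use on a box where you do have a root shell:
#   df -P | full_filesystems 95
```

Run from cron with a mail attached to any output, something like this turns a full /var from a three-day support case into a five-minute cleanup.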

Conclusion:

Yes, it was my fault not to have had the idea with the /var FS earlier. I was tricked by both HMCs losing connection at about the same time and investigated in the completely wrong direction. On the other hand, this is not a UNIX system, it is an appliance. Why am i supposed to act as an admin checking for filesystems when i was first denied all the tools admins have?

Second, my life was made so much easier by being forced to rely on tricks like pulling MAC addresses out of the switch logs instead of simply issuing ifconfig. Find out how long a system is up: uptime. Find out how long an HMC is up: impossible. Check how many packets are being sent/received on a UNIX system: entstat or netstat. Find out the same on an HMC: impossible. This list goes on and on.

And finally: even if i had diagnosed the problem correctly it wouldn't have helped me any. We actually tried the "official" methods of cleaning up before, but they didn't work at all (which is usual: i have seen them fail more often than not). Only breaking in and using normal UNIX commands did what was expected. And why did IBM not see that full FS in the 2.6 GB dump they required me to upload? Do i really want to take the risk of my multi-million-dollar environment becoming completely unusable because i have a system at the center which i can neither diagnose nor administer and it takes support three days to fail?

Why do i pay six-figure amounts of money only to be pestered by questions which i have answered before they were even asked, just because the standard questionnaire says so? I can print that damned questionnaire out and read it to myself for free without having to wait a day just to be called back.

Now, please tell me again what this "appliance" is for and why it is making my life easier.

bakunin

Last edited by bakunin; 08-01-2016 at 04:14 AM..