

[Opinion] A Public Answer To Rob McNelly


 
# 8  
It's the monies.

If you have quality documentation of how things work under the hood, you get fewer calls to support (IBM, Oracle, HP or their bastard support firms), which equals less cash for them.

There is no love. Just plain cash hunting everywhere you turn.
In my short career (about 10 years, as opposed to you Unix masters) I've seen folks intentionally delivering broken stuff just to fix it afterwards and fill their hourly/monthly quota.

The other problem is a new generation of kids emerging from technical and other faculties who know nothing and, worse, don't even want to learn.

All they do is run scripts someone else wrote.
Click puppets/AI installers/PXE someone else configured.
Everything needs to be set up beforehand, so mindless automatons can do their jobs.

Looks like 90% of IT today is in 'human centipede' mode, just forwarding crap around.

Excuse me upfront if I'm dull to you :)

Regards
Peasant.
# 9  
It took some time to prove my point, but here it is. This is what happened last week:

On Tuesday both of my HMCs were no longer getting a connection to any of the managed systems although both were responding quite normally at their public interfaces both via ssh and the web GUI.

Our environment consists of about 20 p780 and p880 systems, with some smaller systems (p740s) thrown in for good measure. On these run some 350 LPARs of various sizes. Yes, we are a big shop, and having no HMC to manage it is kind of a problem.

The first thing I did was to try rebooting one of the HMCs. As I cannot really diagnose any problem, because all the tools necessary for that are not available, this was the best and fastest thing I could do to bring the system to a defined state. The reboot did take place; afterwards I did see the IPs of the service processors, but no longer the given names of the managed systems. It did, of course, change nothing.

Since both HMCs lost all their connections at seemingly the same time, I came up with the theory that maybe the network was responsible. So I got myself a network admin and we traced the switch to which all the service processors and the HMCs were connected. This management network is closed and unrouted, but we were able to confirm it worked and all the correct ARP information was there.

Note: forget about finding out things like the MAC address of an interface on the HMC. Because the ifconfig command to do so is such a complicated thing, IBM made things very easy for me by not confusing me at all with such information, and made us dig into the logs of the switch to make sure the MAC addresses were what they should be. Thank you, IBM, for making my work so much easier.

At this point I opened a Prio-2 call with IBM. It was 2:00 pm and I expected to be called within the next 30 minutes. When I started to work with AIX more than 20 years ago this would have been a Prio-3 call and the phone would have rung within minutes. Times have changed.

It was Wednesday, somewhat past 14:30, when IBM finally deemed me worth an answer. The dispatcher first asked if I would agree to continue in English, which I allowed. (Big mistake. The English the technician spoke was barely understandable and I probably would have understood his native Bulgarian better, even though I don't speak that at all.)

First I told him exactly what I had done up to this point, including the network trace and the reboot. He told me he would mail me some procedures I should carry out for him, but it would take him ten minutes or so to prepare the mail. No problem! Something happens, finally. After 30 minutes I was wondering, after 1 hour I was angry. After waiting for two hours I called the hotline again and asked who they thought I was. Within minutes the same technician called me and told me that something had gone wrong with my email: he had tried several times but it always came back with 'address unresolvable' or so. OK, things can happen - but couldn't he have called me and asked? He obviously knew how to call me, no?

Well, after I sent a mail to him myself he was able to answer that. I got a mail about how to create the hscpe user and use that to create a dump. I did so, then uploaded the dumps from both HMCs to IBM's support site. (If you ever have to do that: a dump is some 2.5 GB in size, so it takes some time.)

On Thursday I got a mail from the guy, telling me that the good news was that nothing was amiss with my hardware. He advised me to check for loose network connections. I wrote back a rather acerbic comment that I had done that first and had already told him so, painstakingly describing the network traces we did. Anyway, I went to the datacenter and made sure all the network connections were there (and, what a surprise, it turned out that an interface whose MAC address I had found in the switch's ARP cache was indeed connected to that switch). I was told I would be passed over to second-level support.

On Friday nothing was heard from IBM. I suppose they were searching for the person doing second-level support for this planet. In the meantime my colleague had a breakthrough, though: it is not possible to do a simple df on a HMC because that would perhaps disrupt the intricate work IBM has done with the HMC's software, but by issuing

Code:
lshmcfs

he was able to detect that on both HMCs the /var filesystem was 100% full. Yes, there is a method to remedy that, namely the chhmcfs command, but - as usual - it didn't work. So the final solution was to break into the HMC, become root and do what UNIX-Admins have always done: clean up the filesystem by using rm. After several reboots and several rediscovery rounds we saw - kudos to my colleague - all our managed systems again.
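For comparison, here is roughly what the same check looks like on a box where a normal shell is available. This is only a minimal sketch, assuming POSIX `df -P` output; the function name `full_filesystems` and the 95 percent threshold are made up for illustration, and it will not run in the restricted HMC shell, where df is not available.

```shell
# Print every filesystem whose usage is at or above a threshold (percent).
# Reads `df -P` style output on stdin, so it works on live or saved output.
# Assumption: mount points contain no spaces (field 6 is the mount point).
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "100%"
        sub(/%$/, "", use)       # strip the trailing percent sign
        if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Typical use on an ordinary UNIX system:
#   df -P | full_filesystems 95
```

Run from cron with a mail at the end, something like this would have flagged /var long before the connections to the service processors dropped.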

Conclusion:

Yes, it was my fault not to have had the idea with the /var FS earlier. I was tricked by both HMCs losing connection at about the same time and investigated in the completely wrong direction. On the other hand, this is not a UNIX system, it is an appliance. Why am I supposed to act as an admin checking filesystems when I was first denied all the tools admins have?

Second, my life was made so much easier by being forced to rely on tricks like pulling MAC addresses out of the switch's logs instead of simply issuing ifconfig. Find out how long a system is up: uptime. Find out how long a HMC is up: impossible. Check how many packets are being sent/received on a UNIX system: entstat or netstat. Find out the same on a HMC: impossible. This list goes on and on.

And finally: even if I had diagnosed the problem correctly it wouldn't have helped me any. We actually tried the "official" methods of cleaning up first, but they didn't work at all (which is usual: I have seen them fail more often than not). Only breaking in and using normal UNIX commands did what was expected. And why did IBM not see that full FS in the 2.6 GB dump they required me to upload? Do I really want to take the risk of my multi-million-dollar environment becoming completely unusable because I have a system at the center which I can neither diagnose nor administer, and it takes support three days to fail?

Why do I pay six-figure amounts of money only to be pestered by questions which I had answered before they were even asked, just because the standard questionnaire says so? I can print that damned questionnaire out and read it to myself for free without having to wait a day just to be called back.

Now, please tell me again what this "appliance" is for and why it is making my life easier.

bakunin

Last edited by bakunin; 08-01-2016 at 04:14 AM..
# 11  
I responded at Systems Magazine - in the hope that more people at IBM will see it. My concluding remark is:
Quote:
And to the case of the customer and /var full. Clearly a bug which I hope IBM addresses quickly. The way the PMR is reported does not sound like it is being properly addressed by IBM support -- as a HMC bug. - As an appliance the HMC should be able to do what needs to be done to ensure that communication between HMC and Service Processors is not interrupted. Period -- regardless of any policy re: root (in)access(ability)
There is actually, or perhaps was, an easy path to become root by opening a PMR. And, in a prior life - as an AIX instructor I taught customers (aka students) how to open a PMR (we did so during the class) - and I also showed how to reuse the password from the previous class (officially the passwords are only valid from midnight to midnight of the day issued - guess how to reuse it :P)

While I can understand the desire for root on the HMC, I decided long ago that I would not even 'desire' it, but take IBM at its word about it being an appliance and make sure - read: demand - that it works as an appliance.

I am quite capable of changing a pump in a car, washing machine or heating system. I am quite capable of administrating an HMC as root. However, all of these devices are sold and serviced by the seller as an appliance. If the pump is not working, I expect someone to replace the pump asap (per the terms of the SLA).

(Hope you like my metaphor!)
# 12  
Sorry, I have to laugh, but those guys in Bulgaria give me déjà vu: another big company seems to have found the same cost-friendly country to place its support in. Those poor chaps often acted the same way you described and sometimes didn't want to pass calls to the next level, which was also no big help.

So what Rob wrote in his answer, that you as the customer should escalate etc., is in my eyes not nice but maybe common business behaviour these days with some big companies: having the customer get involved to ensure the quality of the vendor's support.
We heard this from the other big company too, but having to involve an escalation manager etc. gets tedious after some time as well, and one asks oneself what is going wrong there that I have to put in so much effort to get some help, or sometimes at least someone who even understands what my problem is.
If you put people there who do not have enough experience to offer good support, then this is a problem of the vendor and must not become a problem for the customer. It feels a bit like the concept of green banana software being applied to support structures.

So why is the HMC so locked down...
Yes, in my long time as an AIX admin I did not like it at all, and I absolutely agree with you that people who are responsible for plenty of mission-critical servers with sensitive applications/users, and who already have the knowledge at hand to get along with the HMC, should be allowed to do so by default. They can still open a support call if they get stuck.

Because if an admin has no clue and screws up one or many LPARs, he will usually be in more serious trouble than one who screws up a HMC, which usually comes as a redundant pair, whereas not all important LPARs are always redundant.
And don't forget the VIOS - do something wrong there and you have a good chance that really lots of LPARs get problems. So what?
In the end, in a professional environment one will have a backup for the LPAR and the VIOS as well as for the HMC.
And severe LPAR damage most often has a direct impact on users, i.e. our customers, even if it is "just" a cluster switch that takes some minutes but gets maybe 10k users disconnected and maybe earns you some unpleasant attention from your boss/managers. Trouble with the HMC will usually go unnoticed by our users.

So in my eyes the HMC is locked down for at least 2 reasons:

a) The customer has to rely strongly on the support of the vendor. This is a dependency and some kind of "bonding" of the customer to the vendor. The vendor gets cash, the customer has a helping hand and feels good with it; simply put, they are good friends and will most likely do more business in the future :) So much for the possible theory.

b) It was said in the discussion that the admins often do not have enough skill/experience - true, but such admins have been in the business way back in time and there will be such in the future.
I have the impression that, in favour of cost-efficient support structures, they have tried to make the HMC easy to maintain for their own support, not because the customer side is so inexperienced.


cheers
zaxxon

Last edited by zaxxon; 09-30-2016 at 10:35 AM.. Reason: this and that