Gurus needed to diagnose severe performance degradation


 
# 1  
Old 09-01-2009

Hi everyone, newbie forum poster here. I'm an Oracle DBA and I require some guidance from the Unix gurus here about how to pinpoint where a problem is within a Solaris 9 system running on an 8 CPU Fujitsu server that acts as our Oracle database server. Our sysadmins are trying their best to resolve the issue but none of us are 100% sure where the issue resides - I'm hoping people here can help shed some light on things or help point us in a new/better direction.

Environment:
Server: Fujitsu P650 (7 CPUs in use, 48 GB RAM), Solaris 9 Generic_122300-22 sun4us sparc FJSV,GPUZC-M
Old Storage: EMC CLARiiON fibre-attached storage
New Storage: NetApp storage, 3040 controller, NFS-mounted volumes via multiple trunked 1 Gb Ethernet connections (not round robin)
Database: Oracle 9i

Problem: We are migrating our storage from fibre EMC to NFS NetApp and are encountering huge performance degradation. Pinpointing where the problem lies has been difficult. (As a DBA I seriously questioned this move, but that point is now moot: the money has been spent and we have to deal with it.)

Detail: We've been slowly migrating our databases off the fibre EMC to the NFS NetApp. Some of our high-performance databases have struggled mightily on the NetApp storage, and there has been a lot of finger-pointing as to why.

Symptoms: Over time (hours to days), database jobs and response times nosedive, with lots of hooting and hollering from the business. System response can be extremely slow: even simple commands such as “df -h” and “ls” are slow to respond. However, system load is typically minimal, almost non-existent (load averages in the low 1s and 2s), although at times we do see high kernel processing times.

Advice from our sysadmins: One admin's current view is that the Fujitsu server is older hardware that is not built for this kind of transaction processing. They have been monitoring “counters on the PCI bus (66MHz) and are seeing overflow issues” (forgive me if this isn't well articulated) and noticing that it “has problems keeping up”. Another sysadmin feels that the PCI bus has nothing to do with it and that the problem is network related: specifically, that while we have trunking in place to the NetApp filer, it is not round robin, and as a result the pipe from the server to the storage is too small for any given transaction (which from Oracle will necessarily be single threaded). Having conflicting reports from the sysadmins is not great.
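
The second admin's argument can be illustrated with a small sketch. It assumes, purely for illustration, a trunk of four links and a policy that hashes the source/destination address pair to choose a link; the post states neither the link count nor the actual distribution policy, and `pick_link` is a hypothetical stand-in rather than any switch's or driver's real algorithm.

```python
import zlib

LINK_GBPS = 1.0  # each trunk member is 1 Gb Ethernet (from the post)
NUM_LINKS = 4    # hypothetical; the post does not say how many links are trunked


def pick_link(src: str, dst: str, num_links: int = NUM_LINKS) -> int:
    """Illustrative address-pair hash; not any vendor's real algorithm."""
    return zlib.crc32(f"{src}->{dst}".encode()) % num_links


def usable_gbps(host_pairs, num_links: int = NUM_LINKS) -> float:
    """Aggregate ceiling for the given (src, dst) pairs under this policy."""
    links = {pick_link(src, dst, num_links) for src, dst in host_pairs}
    return len(links) * LINK_GBPS


# Hypothetical addresses for the database server and the filer.
one_pair = [("10.0.0.10", "10.0.0.20")]
print(usable_gbps(one_pair))  # one host pair never uses more than one link
```

Under that assumption, adding links to the trunk raises aggregate capacity across many host pairs but does nothing for traffic between a single server and a single filer address, which is the situation the second admin describes.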

Are there any recommendations on where the problem might lie? (Obviously this is very difficult to judge from a few paragraphs.) Or, perhaps more realistically: aside from looking at top/prstat to see low load, using iostat to see OK I/O processing times, and having sysadmins check counters on a PCI bus, are there any other tools, either available in Solaris or from third parties, that can be used to say definitively “AHA! That is definitely where the bottleneck is!”?

Many thanks in advance..
# 2  
Old 09-01-2009
There was a time you could buy HP Glance for Solaris (if that is still possible, try to get a free, time-limited evaluation copy). There is also sysload, which is quite good...
NFS can be tricky to configure and optimize properly...
You haven't said much about your network switches. I've seen switches go nuts and bring a big cluster down (with NFS, Oracle apps, etc.).
# 3  
Old 09-01-2009
Quote:
Originally Posted by DBA_guy
Environment:
Old Storage: EMC CLARiiON fibre-attached storage
New Storage: NetApp storage, 3040 controller, NFS-mounted volumes via multiple trunked 1 Gb Ethernet connections (not round robin)

What is the network protocol used for the EMC Fibre Channel connection, and what were the specs of that channel?

My first thought is that the EMC Fibre Channel link performs much better than NFS over Ethernet.

That is the first place I would check.

Also, I hope you have not completely shut down the original system. That is unwise, because you have no baseline to check against any more.
# 4  
Old 09-01-2009
I would tend to think your troubles are I/O related, since you state that the system is slow while processing load is low. The nosedive points to a severe bottleneck, perhaps when Oracle has trouble writing its archived redo logs or when there is reporting activity during the day.

I know that if you want to use Oracle on NetApp/NFS, there are some very specific instructions and settings for NFS on both the server (the NetApp) and the client (your Fujitsu server) that you have to follow really meticulously.

Also, I think it may be advisable to create a separate storage LAN or VLAN so that you are not competing with other traffic, or maybe to have separate segments and spread the storage over several NFS mount points on different segments. If possible, I would implement jumbo Ethernet frames to optimize sequential I/O a bit (full table scans, index scans).
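
To put a rough number on the jumbo frame suggestion, here is a minimal sketch of the frame rate needed to saturate one 1 Gb link at two MTU sizes. It assumes standard Ethernet per-frame overhead on the wire (38 bytes) and ignores IP, TCP/UDP and NFS headers, so treat the results as approximate upper bounds rather than measurements.

```python
# Per-frame wire overhead in bytes: preamble + SFD (8), Ethernet header (14),
# frame check sequence (4), minimum inter-frame gap (12).
WIRE_OVERHEAD = 8 + 14 + 4 + 12
LINK_BYTES_PER_SEC = 1_000_000_000 / 8  # one 1 Gb Ethernet link


def frames_per_second(mtu: int) -> float:
    """Frames per second needed to fill the link at the given MTU."""
    return LINK_BYTES_PER_SEC / (mtu + WIRE_OVERHEAD)


def payload_fraction(mtu: int) -> float:
    """Share of wire time that carries the MTU-sized payload."""
    return mtu / (mtu + WIRE_OVERHEAD)


for mtu in (1500, 9000):
    print(mtu, round(frames_per_second(mtu)), round(payload_fraction(mtu), 4))
```

The wire efficiency gain is small; the larger effect is that the host handles roughly one sixth as many frames per second at line rate, which may matter on a system that shows high kernel time.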

Your server appears to have room for many slow (dedicated) PCI slots, so I would tend to spread the load over the PCI-slots to reach the required bandwidth.

Just some thoughts.. good luck.

S.

Last edited by Scrutinizer; 09-01-2009 at 07:19 PM..
# 5  
Old 09-01-2009
You post that this is Gigabit Ethernet. The fibre connection from your old solution would be better.

First check that all ethernet cards on the server are set to NOT autonegotiate and that none of the ports on the hub/switch are set to autonegotiate and that no port involving your storage device is set to autonegotiate. If you have to change anything, schedule a cold start afterwards.
# 6  
Old 09-01-2009
A 66MHz, 64-bit PCI bus can handle roughly 4 Gbps at its theoretical peak, so unless it's a badly engineered bus, that shouldn't be your problem.
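
The arithmetic behind that figure, as a quick sketch. It computes the theoretical peak only (clock rate times bus width); real throughput is lower because of protocol overhead and because devices on the same bus share it.

```python
def pci_peak_gbps(clock_mhz: float, width_bits: int) -> float:
    """Theoretical peak of a parallel PCI bus: one transfer per clock cycle."""
    return clock_mhz * 1e6 * width_bits / 1e9


peak = pci_peak_gbps(66, 64)          # the bus described in this thread
gige_links_at_peak = peak / 1.0       # how many 1 Gb links that peak could feed
print(round(peak, 3), round(gige_links_at_peak, 1))
```

So a single 66MHz, 64-bit bus has headroom for a few gigabit links in theory, but not an unlimited number, which is one reason to check how the network cards are distributed across buses.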

Also, do NOT remove the autonegotiate feature from your network cards and switches if you're running gigE over copper.

1. If your links are not running at 1000 Mbps, full-duplex, there's a problem. Papering over that underlying problem by forcing the link to 1000 Mbps full-duplex doesn't fix it.

2. Disabling autonegotiation on copper gigE places you outside the specifications of IEEE 802.3: (http://www.sun.com/blueprints/0704/817-7526.pdf)

Quote:
The IEEE 802.3 standard states that you must support and test autonegotiation enabled to certify a product IEEE 802.3 compliant, and for multivendor interoperability (for example, testing at the UNH Interoperability Laboratory). There are no requirements in the standard to support locked down or forced configurations using autonegotiation disabled. As a result, there are no requirements for vendors to test multivendor interoperability between products with autonegotiation disabled.

The IEEE 802.3ab specification does not allow for forced mode 1000BASE-T with autonegotiation disabled running at 1000 Mbps. As a result, many switch vendors do not support forced mode.

I'd look to be sure you do all the tuning that Oracle advises for running over NFS. Especially make sure you're using jumbo frames.

And you might very well need to look into replacing your old hardware and software. There have been a lot of hardware and software advances in networking performance since Solaris 9 was current.
# 7  
Old 09-01-2009
achenlie is correct. If you have dropped to half-duplex it will be chronically slow. Just re-patching a cable can be enough to cause this issue when autonegotiation is in force. If you have the problem it may well be fixed by a total cold start where you bring up the network first, then the storage, then the servers.
IMHO, whether jumbo packets cause or cure a performance issue depends on the network hardware.

To get an idea of scale, can your DBA post the contents of the following Oracle 9i system table along with how long the Oracle server had been up at the time:

Code:
v$sysstat

This table includes I/O stats and other useful pointers, such as counts of performance killers like disk sorts.
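
Because the counters in v$sysstat are cumulative since instance startup, they only mean something relative to uptime. A minimal sketch of turning them into per-second rates follows; the numbers in `sample` and the 10-day uptime are hypothetical, not values from this system, and the helper names are illustrative.

```python
# Inputs would come from queries such as:
#   SELECT name, value FROM v$sysstat;
#   SELECT startup_time FROM v$instance;


def per_second(stats: dict, uptime_seconds: float) -> dict:
    """Convert cumulative counters into average rates since startup."""
    return {name: value / uptime_seconds for name, value in stats.items()}


def disk_sort_fraction(stats: dict) -> float:
    """Fraction of all sorts that spilled to disk."""
    disk = stats["sorts (disk)"]
    total = disk + stats["sorts (memory)"]
    return disk / total if total else 0.0


sample = {  # hypothetical counter values
    "physical reads": 86_400_000,
    "physical writes": 8_640_000,
    "sorts (memory)": 990_000,
    "sorts (disk)": 10_000,
}
uptime = 10 * 24 * 3600  # hypothetical: instance up for 10 days

rates = per_second(sample, uptime)
print(rates["physical reads"], rates["physical writes"], disk_sort_fraction(sample))
```

Averages since startup hide peaks, so comparing two snapshots taken around a slow period would be more telling than a single reading.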