Hi everyone, newbie forum poster here. I'm an Oracle DBA and I'm looking for guidance from the Unix gurus here on how to pinpoint a problem within a Solaris 9 system running on an 8-CPU Fujitsu server that acts as our Oracle database server. Our sysadmins are trying their best to resolve the issue, but none of us is 100% sure where it resides. I'm hoping people here can shed some light or point us in a better direction.
Environment:
Server: Fujitsu P650 (8 CPUs, 7 in use; 48 GB RAM), Solaris 9 Generic_122300-22 sun4us sparc FJSV,GPUZC-M
Old Storage: EMC CLARiiON, Fibre Channel attached
New Storage: NetApp storage, 3040 controller, NFS-mounted volumes over trunked 1 Gb Ethernet links (trunking is not round-robin); the mount options we're actually getting are captured below
Database: Oracle 9i
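For reference, here's how I've been capturing the actual NFS mount options on the new volumes (these are standard Solaris commands, nothing exotic):

    # Show the options each NetApp volume is actually mounted with
    nfsstat -m

    # Compare against what's configured to mount at boot
    grep nfs /etc/vfstab

From what I've read of NetApp's Oracle-on-Solaris guidance, the options to look for on datafile volumes are roughly hard,proto=tcp,vers=3,rsize=32768,wsize=32768,forcedirectio, though I'd welcome correction on whether forcedirectio matters as much as the docs imply (I understand it avoids double caching in the Solaris page cache).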
Problem: We are migrating our storage from Fibre Channel EMC to NFS NetApp and are encountering huge performance degradation, and pinpointing where the problem lies has been difficult. (As a DBA I seriously questioned this move, but the point is now moot: the money has been spent and we have to deal with it.)
Detail: We've been slowly migrating our databases off the Fibre Channel EMC onto the NFS NetApp. Some of our high-performance databases have struggled mightily on the NetApp storage, and there has been lots of finger pointing as to why.
Symptoms: Over time (hours to days), batch job run times and query response times nosedive, with lots of hooting and hollering from the business.
System response time can be extremely slow: simple commands like “df -h” and “ls” take a long time to return. Yet system load is typically minimal, almost non-existent (load averages in the low 1s and 2s), although at times we see high kernel (sys) processing times.
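When df or ls stalls, I've started trying to catch where the time goes. This is my own guess at the right flags, so corrections are welcome:

    # Per-CPU breakdown: high sys% with low usr% points at
    # kernel/NFS time rather than database load
    mpstat 5

    # Time each system call df makes; a slow statvfs() against an
    # NFS mount should show up as a large delta on that call
    truss -D -E df -h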
Advice from our sysadmins: One admin's current position is that the Fujitsu server is older hardware that isn't built for this kind of transaction processing. He has been monitoring counters on the 66 MHz PCI bus, is seeing “overflow issues” (forgive me if that isn't well articulated), and says the bus “has problems keeping up”. Another sysadmin feels the PCI bus has nothing to do with it and that the problem is network-related: specifically, that while we have trunking in place to the NetApp filer, it is not round-robin, so the pipe from the server to the storage is too small for any given transaction (which, coming from Oracle, will necessarily be single-threaded). Getting conflicting reports from the sysadmins is not helping.
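To try to arbitrate between the two theories, here's what I've been gathering on the network side (the ce driver name and instance are guesses on my part; ours may be ge/qfe/bge, so substitute whatever ifconfig -a shows):

    # Per-interval input/output errors and collisions on each link
    netstat -i 5

    # NIC driver counters; climbing nocanput/norcvbuf-style drop
    # counters would support the "pipe too small" theory
    # (driver name "ce" and instance 0 are assumptions)
    kstat -p ce:0 | egrep -i 'nocanput|norcvbuf|ierrors|oerrors'

    # Client-side RPC health: badcalls/retrans/timeouts climbing
    # over time would point at the network path, not the PCI bus
    nfsstat -c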
Are there any recommendations on where the problem might lie? (Obviously that's very difficult to judge from a few paragraphs.) Or, perhaps more realistically: aside from top/prstat showing low load, iostat showing acceptable I/O service times, and the sysadmins checking counters on a PCI bus, are there any other tools, either built into Solaris or third-party, that can be used to say definitively “AHA! That is definitely where the bottleneck is!”
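In case it helps anyone refine the suggestions, here's the fuller toolbox I was planning to work through next; the lockstat invocation is copied from a Solaris tuning book, so treat it as a sketch rather than gospel:

    # On Solaris, NFS mounts appear as devices in iostat output, so
    # service times to the filer can be watched directly
    iostat -xn 5

    # Kernel profiling: sample where the kernel spends its time for
    # 30 seconds; high counts in nfs/rpc/tcp routines vs. PCI/bus
    # routines would go a long way toward settling the argument
    lockstat -kIW -D 20 sleep 30

    # Raw packet capture to the filer if all else fails
    # ("netapp-filer" is a placeholder hostname, ce0 an assumption)
    snoop -d ce0 host netapp-filer

I'm aware DTrace would make this much easier, but that arrived with Solaris 10, so on Solaris 9 I believe the above is roughly what's available.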
Many thanks in advance.