Gurus needed to diagnose severe performance degradation

09-01-2009

Hi everyone, newbie forum poster here. I'm an Oracle DBA and I require some guidance from the Unix gurus here about how to pinpoint where a problem is within a Solaris 9 system running on an 8 CPU Fujitsu server that acts as our Oracle database server. Our sysadmins are trying their best to resolve the issue but none of us are 100% sure where the issue resides - I'm hoping people here can help shed some light on things or help point us in a new/better direction.

Environment:
Server: Fujitsu P650, (7 cpu in use, 48GB RAM) Solaris 9 Generic_122300-22 sun4us sparc FJSV,GPUZC-M
Old Storage: EMC CLARiiON fibre-attached storage
New Storage: NetApp storage, 3040 controller, NFS-mounted volumes via a multi-trunked 1Gb Ethernet connection (not round robin)
Database: Oracle 9i

Problem: We are migrating our storage from fibre EMC to NFS NetApp and are encountering huge performance degradation. Pinpointing where the problem is has been difficult. (As a DBA I seriously questioned this move, but this point is now moot as the money has been spent and we have to deal with it.)

Detail: We've been slowly migrating our databases off of the fibre EMC to NFS NetApp. Some of our high performance databases struggled mightily on the NetApp storage and there has been lots of finger pointing as to why.

Symptoms: Over time (hours to days) database jobs and response times nosedive - lots of hooting and hollering from the business

System response time can be extremely slow: simple commands such as "df -h" and "ls" are slow to respond. However, system load is typically minimal, almost nonexistent (load averages in the low 1's and 2's), although at times we see high kernel processing times.

Advice from our sysadmins: Current advice from one admin is that the Fujitsu server is older hardware that is not built for this kind of transaction processing. They have been monitoring "counters on the PCI bus (66MHz) and are seeing overflow issues" (forgive me if this isn't well articulated) and noticing that it "has problems keeping up". Another sysadmin feels that the PCI bus has nothing to do with it and that it is networking related: specifically, that while we have trunking in place to the NetApp filer, it is not round robin, and as a result the pipe from the server to the storage is too small for any given transaction (which from Oracle will necessarily be single threaded). Having conflicting reports from the sysadmins is not great.
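The second admin's argument can be illustrated with a minimal Python sketch. It assumes the trunk picks a link by hashing the source and destination addresses; that policy, the helper names and the IP addresses are hypothetical, and the real policy on the server and switches may differ.

```python
# Hypothetical illustration: a trunk that selects a link from a stable hash
# of the address pair sends every packet of one server-to-filer conversation
# down the same physical link.
import zlib

LINK_SPEED_MBPS = 1000  # each trunk member is gigabit Ethernet


def pick_link(src, dst, n_links):
    """Assumed policy: CRC32 of the (src, dst) pair modulo the link count."""
    return zlib.crc32(f"{src}->{dst}".encode()) % n_links


def aggregate_ceiling_mbps(n_links, link_speed_mbps=LINK_SPEED_MBPS):
    """Best case when many different flows spread evenly over all links."""
    return n_links * link_speed_mbps


def single_flow_ceiling_mbps(n_links, link_speed_mbps=LINK_SPEED_MBPS):
    """One flow is pinned to one link, so adding links does not help it."""
    return link_speed_mbps


if __name__ == "__main__":
    n_links = 4  # made-up trunk size
    used = {pick_link("10.0.0.5", "10.0.0.9", n_links) for _ in range(1000)}
    print("links used by one flow:", sorted(used))
    print("aggregate ceiling (Mbps):", aggregate_ceiling_mbps(n_links))
    print("single-flow ceiling (Mbps):", single_flow_ceiling_mbps(n_links))
```

Under that assumption a single server-to-filer conversation never gets more than one gigabit link, however many links are in the trunk.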

Are there any recommendations on where the problem possibly lies? (Obviously this is very difficult to do from a few paragraphs.) Or perhaps more realistically, aside from looking at top/prstat to see low load, iostat to see OK I/O processing times, and sysadmins checking counters on a PCI bus, are there any other tools, either available in Solaris or 3rd party, that can be used to definitively say "AHA! That is definitely where the bottleneck is!"

Many thanks in advance..

DBA_guy

09-01-2009

There was a time you could buy HP Glance for Solaris (if it is still available, try to get a free time-limited evaluation copy). There is also sysload, which is quite good...
NFS can be tricky to configure and optimize properly...
You haven't said much about your network switches. I've seen switches go nuts and drive a big cluster down (with NFS, Oracle apps etc...)

vbe

09-01-2009

Quote:

Originally Posted by DBA_guy

Environment:
Old Storage: EMC CLARiiON fibre-attached storage
New Storage: NetApp storage, 3040 controller, NFS-mounted volumes via a multi-trunked 1Gb Ethernet connection (not round robin)

What is the network protocol used for the EMC fiber channel and what were the specs of that channel?

My first thoughts are that the EMC fiber channel is much higher performing than NFS over Ethernet.

That is the first place I would check.

Also, I hope you have not completely shut down the original system. That is unwise, because you have no baseline to check against any more.

Neo

09-01-2009

I would tend to think your troubles are IO related as you state your system is slow while processing is low. The nose diving points to a severe bottleneck situation, perhaps when Oracle has trouble writing to its archived redo logs or there is reporting activity during the day.

I know that if you want to use Oracle on Netapp / NFS there are some very specific instructions/settings for NFS on both server (the Netapp) and client (your Fujitsu Server) that you have to follow really meticulously.

Also, I think it may be advisable to create a separate storage LAN or VLAN, so you are not interfering with other traffic, or maybe have separate segments and spread the storage over several NFS mountpoints on different segments. If possible I would implement jumbo Ethernet frames, to optimize sequential IO a bit (full table scans, index scans).
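As a rough illustration of what jumbo frames buy on a gigabit link, here is a small Python sketch of the framing arithmetic. It assumes TCP over IPv4 with no header options and no VLAN tag; the function names are made up for this example, and real NFS traffic adds RPC overhead on top.

```python
# Per-frame overhead for TCP/IPv4 on Ethernet (assumes no TCP/IP options
# and no VLAN tag).
ETH_OVERHEAD = 8 + 14 + 4 + 12  # preamble + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20        # IPv4 header + TCP header


def payload_efficiency(mtu):
    """Fraction of the bytes on the wire that are TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(mtu, link_bps=1e9):
    """Full-size frames per second needed to saturate the link."""
    return link_bps / 8 / (mtu + ETH_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu,
              round(payload_efficiency(mtu), 3),
              int(frames_per_second(mtu)))
```

The efficiency gain is modest (roughly 95% versus 99%), but the frame rate at line speed drops from about 81,000 to about 14,000 per second, which means far fewer packets for an older server to process.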

Your server appears to have room for many slow (dedicated) PCI slots, so I would tend to spread the load over the PCI-slots to reach the required bandwidth.

Just some thoughts.. good luck.

S.

Last edited by Scrutinizer; 09-01-2009 at 07:19 PM..

Scrutinizer

09-01-2009

You post that this is Gigabit Ethernet. The fibre connection from your old solution would be better.

First check that all ethernet cards on the server are set to NOT autonegotiate and that none of the ports on the hub/switch are set to autonegotiate and that no port involving your storage device is set to autonegotiate. If you have to change anything, schedule a cold start afterwards.

methyl

09-01-2009

A 66MHz, 64-bit PCI bus can handle roughly 4 Gbps at peak, so unless it's a badly engineered bus, that shouldn't be your problem.
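The figure above is just clock rate times bus width. A quick Python check follows; it gives theoretical peak only (sustained throughput on a shared bus will be lower), and the function name and the four-port example are made up for illustration.

```python
def pci_peak_gbps(clock_mhz, width_bits):
    """Theoretical peak: one transfer of width_bits per clock cycle."""
    return clock_mhz * 1e6 * width_bits / 1e9


def pci_peak_mbytes_per_sec(clock_mhz, width_bits):
    """Same peak expressed in megabytes per second."""
    return clock_mhz * width_bits / 8


if __name__ == "__main__":
    bus = pci_peak_gbps(66, 64)
    gige_ports = 4  # hypothetical number of gigabit ports on one bus
    print("bus peak (Gbps):", bus)
    print("bus peak (MB/s):", pci_peak_mbytes_per_sec(66, 64))
    print("ports at line rate (Gbps):", gige_ports * 1.0)
```

That works out to about 4.2 Gbps, or 528 MB/s, so a single gigabit port uses under a quarter of the theoretical bus capacity.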

Also, do NOT remove the autonegotiate feature from your network cards and switches if you're running gigE over copper.

1. If your links are not running at 1000 Mbps, full-duplex, there's a problem. Papering over that underlying problem by forcing the link to 1000 Mbps full-duplex doesn't fix the problem.

2. Disabling autonegotiation on copper gigE places you outside the specifications of IEEE 802.3: (http://www.sun.com/blueprints/0704/817-7526.pdf)

Quote:

The IEEE 802.3 standard states that you must support and test autonegotiation enabled to certify a product IEEE 802.3 compliant, and for multivendor interoperability (for example, testing at the UNH Interoperability Laboratory). There are no requirements in the standard to support locked down or forced configurations using autonegotiation disabled. As a result, there are no requirements for vendors to test multivendor interoperability between products with autonegotiation disabled.

The IEEE 802.3ab specification does not allow for forced mode 1000BASE-T with autonegotiation disabled running at 1000 Mbps. As a result, many switch vendors do not support forced mode.

I'd look to be sure you do all the tuning that Oracle advises for running over NFS. Especially make sure you're using jumbo frames.

And you might very well need to look into replacing your old hardware and software. There have been a lot of hardware and software advances in networking performance since Solaris 9 was current.

achenle

09-01-2009

achenle is correct. If you have dropped to half-duplex it will be chronically slow. Just re-patching a cable can be enough to cause this issue when autonegotiation is in force. If you have the problem it may well be fixed by a total cold start where you bring up the network first, then the storage, then the servers.
IMHO whether jumbo packets will cause or cure a performance issue depends on the network hardware.

To get an idea of scale, can your DBA post the contents of the following Oracle 9i system table along with how long the Oracle server had been up at the time:

Code:

v$sysstat

This table includes i/o stats and other useful pointers such as counts of performance killers like disc sorts.
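Since the counters are cumulative from instance startup, they only mean something relative to uptime. A minimal Python sketch of that normalisation is below; the statistic names are real v$sysstat names, but the values, the uptime and the helper functions are invented for illustration.

```python
def per_second_rates(stats, uptime_seconds):
    """Convert cumulative counters into average rates since startup."""
    return {name: value / uptime_seconds for name, value in stats.items()}


def disk_sort_ratio(stats):
    """Share of sorts that spilled to disk instead of staying in memory."""
    disk = stats.get("sorts (disk)", 0)
    memory = stats.get("sorts (memory)", 0)
    total = disk + memory
    return disk / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample values, as if copied from "SELECT name, value FROM v$sysstat"
    sample = {
        "physical reads": 86_400_000,
        "physical writes": 8_640_000,
        "sorts (disk)": 500,
        "sorts (memory)": 99_500,
    }
    uptime = 10 * 24 * 3600  # assumed: instance up for 10 days
    for name, rate in per_second_rates(sample, uptime).items():
        print(f"{name}: {rate:.2f}/s")
    print(f"disk sort ratio: {disk_sort_ratio(sample):.3%}")
```

These are averages over the whole uptime, so they will hide the hours-to-days nosedive described in the first post; comparing two snapshots taken some time apart would show the rate during the slow period.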

methyl
