IO queue maxtime over 1 minute


 
Thread Tools Search this Thread
Operating Systems AIX IO queue maxtime over 1 minute
# 1  
Old 12-11-2017
IO queue maxtime over 1 minute

We are experiencing around 1minute IO delays on Oracle DB including both read and write operations even for writing <1kB or reading single 8kB block. Below is the output of iostat -D -T 60 1000 for the problematic time periods
Code:
asmdata03       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         99.3    334.6K     1.4      331.1K       3.5K  20:17:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                          1.3      9.5      1.0     33.0           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.2      0.9      0.7      1.3           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                          8.9S     0.0      1.1M     0.0       11.0         0.0
asmdata08       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         99.5    518.4K     2.9      511.2K       7.2K  20:22:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                          1.9     13.7      0.9     49.2           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.9      0.7      0.9      1.6           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                          6.0S     0.0      1.1M     0.0       15.0         0.1
asmdata17       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         86.5      4.7M    18.7        4.7M       2.0K  22:25:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                         18.1     15.2      0.1     69.9           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.6      1.3      0.7      8.4           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                        696.0      0.0      1.1M     0.0       10.0         0.0
asmdata10       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         84.2     29.4M   113.1       29.4M     546.1   22:35:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                        113.0      1.5      0.1      7.7           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.1      0.9      0.0      0.0           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                         10.6      0.0      1.1M     0.0        1.0         0.0
asmdata15       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         92.2      6.4M    31.2        6.4M     136.5   22:40:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                         31.2     16.4      0.2     96.8           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.0      1.5      1.5      1.5           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                        530.0      0.0      1.1M     0.0       13.0         0.6
asmdata06       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         93.0      6.7M    26.4        6.7M       3.7K  22:45:10
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                         25.6     18.8      0.4     82.0           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.8      1.0      0.6      6.6           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                        632.2      0.0      1.1M     0.0       13.0         0.5
asmdata05       xfer:  %tm_act      bps      tps      bread      bwrtn      time
                         76.9     20.7M    79.0       20.7M     273.1   06:59:11
                read:      rps  avgserv  minserv  maxserv   timeouts      fails
                         79.0     14.9      0.5     65.9           0          0
               write:      wps  avgserv  minserv  maxserv   timeouts      fails
                          0.0      0.6      0.0      0.0           0          0
               queue:  avgtime  mintime  maxtime  avgwqsz    avgsqsz     sqfull
                        216.5      0.0      1.1M     0.0       11.0         1.5

Does anyone have an idea what could be wrong or what to check further?

thanks a lot
# 2  
Old 12-11-2017
It depends.

What other activity is on the server at the time? Sadly, this could be many things.

A few possibles:-
  • Missing/invalid index causing a full table scan and/or locks
  • If there is NFS as part of the process then that is a contender for contention
  • If the queries are remote, there could be DNS issues (i.e. server cannot resolve source IP address)
  • Disk access IO issues (check paths for SCSI disks or SAN provided disk)

There will be plenty more and I've probably not guessed right but we might be lucky. If I think of any more, I will update the list or post again.


Have you ruled anything out yet?


I hope that these get you started,
Robin
# 3  
Old 01-05-2018
errpt -a -s 0102000018

shows these kind of errors when those 60sec IO delays are occurring:
Code:
LABEL:          SC_DISK_ERR7
IDENTIFIER:     DE3B8540

Date/Time:       Tue Jan  2 18:11:57 CET 2018
Sequence Number: 18526
Machine Id:      00FB0E544C00
Node Id:         xicdbl01a
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   asmdata14       
Resource Class:  disk
Resource Type:   2145
Location:        U78C7.001.KIC4358-P1-C9-T1-W500507680C238F26-L17000000000000

VPD:             
        Manufacturer................IBM     
        Machine Type and Model......2145            
        ROS Level and ID............0000
        Device Specific.(Z0)........0000063268181002
        Device Specific.(Z1)........0203202
        Serial Number...............600507680C808470680000000000007D

Description
PATH HAS FAILED

Probable Causes
ADAPTER HARDWARE OR CABLE
DASD DEVICE

Failure Causes
UNDETERMINED

    Recommended Actions
    PERFORM PROBLEM DETERMINATION PROCEDURES
    CHECK PATH

Detail Data
PATH ID
           0
SENSE DATA
0A00 2800 0040 6000 0002 0004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR4
IDENTIFIER:     DCB47997

Date/Time:       Tue Jan  2 18:11:57 CET 2018
Sequence Number: 18529
Machine Id:      00FB0E544C00
Node Id:         xicdbl01a
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   asmdata14       
Resource Class:  disk
Resource Type:   2145
Location:        U78C7.001.KIC4358-P1-C9-T1-W500507680C238F26-L17000000000000

VPD:             
        Manufacturer................IBM     
        Machine Type and Model......2145            
        ROS Level and ID............0000
        Device Specific.(Z0)........0000063268181002
        Device Specific.(Z1)........0203202
        Serial Number...............600507680C808470680000000000007D

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

    Recommended Actions
    FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
    PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

    Recommended Actions
    FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
    PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0A00 2800 0040 6000 0002 0004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 06E0 F73A 0008 7D80 0000 0000 0000 0000 0000 0000 0000 0007 0000 
0002 0035 001D 
---------------------------------------------------------------------------
LABEL:          SDDPCM_PATH_RECOVER
IDENTIFIER:     67150733

Date/Time:       Tue Jan  2 18:13:57 CET 2018
Sequence Number: 18530
Machine Id:      00FB0E544C00
Node Id:         xicdbl01a
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   asmdata14       

Description
PATH HAS RECOVERED

Probable Causes
ADAPTER HARDWARE OR CABLE
DASD DEVICE

Failure Causes
EF00

    Recommended Actions
    PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
          18
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 8000 0014 0000 
001B 0000 0003 0D96 FB04 0000 001E 0000 0000 0000 0011 0000 1000 0000 0005 0000 
0003 0000 8000 0000 5C30 F100 0A00 278A 4A40

# 4  
Old 01-05-2018
It looks like there are occasional sector read errors (where the retry worked).
If the disk is old then it is going bad - replacement needed.
If the disk is new then there might be a bug in the disk firmware (or in the disk driver in the kernel).
# 5  
Old 01-05-2018
The disk array is brand new, not yet in production. Couldn't it be on SAN? Any ideas how to get more diagnostic info?
Code:
lspath -F "path_id status path_status parent connection"

output is reporting failed paths just after the delay has occurred > > >
Code:
Fri Jan 5 10:32:14 CET 2018
1  Failed    Available fscsi0 500507680c138f26,19000000000000
3  Failed    Available fscsi1 500507680c148f26,19000000000000
Fri Jan 5 11:47:16 CET 2018
8  Failed    Available fscsi2 500507680c238f26,e000000000000
1  Failed    Available fscsi3 500507680c248f26,e000000000000

thx
# 6  
Old 01-05-2018
Not an AIX export but you could try to...

1. Failover a path to working path only (is possible dunno in aix, probably is).
2. Change the cable for faulty path.

I've debuged HPUX boxes with 'bad' cables experiencing exactly the same issues, long IO waits.

As i failed over to healthy path via scsimgr the machine got back to life.
Initially, the boot time was 20/30 minutes, dropped to normal after path switch.
This was VPAR on FC disk.

At the end i went to system room, pulled the 20 M fibre cable across data center to replace/diagnose the faulty cable.
Everything worked well, so the bad cable was replaced properly afterward.

Probably a factory error or someone damaged the cable during installation.

Of course, check the storage side and FC port switch to which the server is connected.

Hope that helps
Regards
Peasant.
# 7  
Old 01-06-2018
you don't show all the errors. SC_DISK_ERRs, you see, are SCSI timeout. They can be caused by SAN bottlenecks (such as lack of buffers) as well as by AIX settings. Storage device can cause such problems too. The easiest (and I think the fastest) way to solve the problem is to open a PerfPMR at IBM. You have IBM server, IBM operating system and IBM storage. They're interested in solving the problem as fast as possible.

If you'd like to analyze the problem further, you must first provide the information about the operating system and the storage.

  • which version do you have? oslevel -s
  • which version of storage drivers?
  • settings of the FC adapters? (for i in $(lsdev -l fcs* -F name) ; do echo $i ; lsattr -El $i ; lscfg -l $i; done
  • settings of FSCSI devices? lsattr -El fscsiX
  • Output of fcstat command for each FC HBA
  • full errpt output, because you provided just a short excerpt


It is just a small amount of questions I can remember after celebrating russian Christmas. IBM will ask more questions for sure, and will ask to run some scripts to gather the necessary information.

Moderator's Comments:
Mod Comment edit by bakunin: surrounded commands with ICODE-tags. Please use them for code-snippets so that they stand out better. Thank you!

Last edited by bakunin; 01-08-2018 at 09:59 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Minute(4) issue in teradata

I have values below for which diff field is giving error like "invalid time interval" in teradata Might be it is not doing calculation anymore after exceeding minute(4) value END_TS 2/2/2018 08:50:49.000000 START_TS 1/5/2018 17:30:02.000000 SLA_TIME 23:59:59.000000 select... (0 Replies)
Discussion started by: himanshupant
0 Replies

2. Shell Programming and Scripting

Take minute per minute from a log awk

Hi, I've been trying to develop a script that performs the parsing of a log every 1 minute and then generating some statistics. I'm fairly new to programming and this is why I come to ask if I can lend a hand. this is my log: xxxx 16/04/2012 17:00:52 - xxxx714 - E234 - Time= 119 ms.... (8 Replies)
Discussion started by: jockx
8 Replies

3. Shell Programming and Scripting

Process file every minute

Hey guy's, How can I schedule this script using AWK, what I require is it runs every minute as new data will be added to the file? And I need to capture these data and save it to the output file. I tried using sleep 1 to allow the file to be processed but is not working. awk ' ... (35 Replies)
Discussion started by: James_Owen
35 Replies

4. What is on Your Mind?

The 5 Minute Management Course (Six Lessons)

Lesson 1: A man is getting into the shower just as his wife is finishing up her shower, when the doorbell rings. The wife quickly wraps herself in a towel and runs downstairs. When she opens the door, there stands Bob, the next-door neighbour. Before she says a word, Bob says, 'I'll... (2 Replies)
Discussion started by: Neo
2 Replies

5. Shell Programming and Scripting

Crontab for every minute or every hour

How to set crontab for every minute or every hour (1 Reply)
Discussion started by: kaushik02018
1 Replies

6. Shell Programming and Scripting

How do i get only last 5 minute worth of data

I have a text file called 'tomcat_temp_out'. I want to get only last 5 minute worth of data from this file and redirect those data into another fule. Could you please help to work on this? (2 Replies)
Discussion started by: shivanete
2 Replies

7. Shell Programming and Scripting

scripting running every minute

Hi Experts, below similar thread i posted earlier. Although i wish to know ur suggestion newly. I want to run a script every 1 minute. I tried with Crontab. But the problem is cron send every 1 mins invertal- mail to the user mailbox. meanwhile, some expert telling me it is not wise to... (15 Replies)
Discussion started by: thepurple
15 Replies

8. Programming

how can i run a process for a whole minute?

here's the problem, i have two processes that i need to run and both process should be run at a total of 1 minute each. how do i do that? and one more here's what the processes do: the 1st process show the '+" sign infinitely while the 2nd process displays the "-" infinitely. how could i count the... (1 Reply)
Discussion started by: kelogs1347
1 Replies

9. Solaris

syslog messages every minute

Hi If we have for example a disk that is experiencing problems, ie read errors etc, a message will get generated via syslog every one minute....I come in the morning to hundreds upon hundreds of the same message. Is there anyway to change this to say 5 minutes or maybe 10 mins etc so that it... (1 Reply)
Discussion started by: hcclnoodles
1 Replies

10. Programming

linux 7.1 goes off every other minute

i have loaded linux 7.1 on my PC along with win98 the problem is that every other minute th system goes off please help me (1 Reply)
Discussion started by: vidya_harnal
1 Replies
Login or Register to Ask a Question