How to know exactly which physical partition contains data?


 
# 8  
Old 12-15-2014
Let me restructure your questions a bit:

Quote:
Originally Posted by bobochacha29
1. Is it OK to use the command "migratelp" to do something like this? Is the data availability and the synchronization between the 1st mirror and the 2nd mirror OK? Are there any errors of the filesystem or of the data after using this command? If 1 of the 4 hard disks fails, is the data still OK?
It is perfectly OK; in fact, the command was made for exactly this purpose. Still, I think you may not need it at all, see below.

Quote:
Originally Posted by bobochacha29
In fact, I want to split the data of a filesystem on multiple hard disks equally.
[...]
The problem is when I checked the disk operation ( by command "topas" ), I found that the operation of hdisk1 and hdisk2 is very high, ~90%-100% while the operation of hdisk3 and hdisk4 is just ~ 10% ( most of the disk operation is writing - the application writes logs to the fs's mountpoint ).
This - the splitting you intend - makes sense only in a specific kind of situation, so please describe your hdisk-devices a bit better. What are they (single SCSI-disks, RAID-sets, LUNs from SAN, ...) and how do you access them?

First a little theory, so that you can understand the output better:
Quote:
Originally Posted by bobochacha29
This is what it looks like:
Code:
fslv01:/movelv_test
LP    PP1  PV1               PP2  PV2             
0001  0175 hdisk1            0111 hdisk2            
0002  0176 hdisk1            0112 hdisk2           
0003  0177 hdisk1            0113 hdisk2            
0004  0178 hdisk1            0114 hdisk2
[...]

Here you see several of the "layers" I talked about in a previous post in action: the LV consists of LPs (leftmost column) numbered 0001, 0002, 0003, ... and the space in this LV is continuous. That means that when byte nr. X is the last byte in LP 0001, then byte nr. X+1 is the first byte in LP 0002. Now, LP 0001 in fact consists of two PPs which hold identical copies: PP 0175 on hdisk1 and PP 0111 on hdisk2. Similarly for all the other LPs.
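
This mapping, by the way, is what "lslv -m" prints; a minimal sketch, assuming your LV is indeed named fslv01:

Code:
# lslv -m fslv01          # show the LP -> PP mapping for every copy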

The first question is whether you really need the LV copies. Writing two copies in parallel is a tad slower than writing a single LV (without copies), so it might help performance if you do away with the mirroring. You will have to decide whether the loss of redundancy outweighs the gain in performance, or the other way round.
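
Should you decide to drop the mirror, "rmlvcopy" is the tool; a sketch (the LV name is assumed, and be aware of the redundancy you give up):

Code:
# rmlvcopy fslv01 1              # reduce the LV to a single copy
# rmlvcopy fslv01 1 hdisk2       # the same, but drop specifically the copy on hdisk2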

Second, you can also place the LPs this way (schematically):

Code:
LP    PP1  PV1
0001  0001 hdisk1
0002  0001 hdisk2
0003  0001 hdisk3
0004  0001 hdisk4
0005  0002 hdisk1
0006  0002 hdisk2
0007  0002 hdisk3
0008  0002 hdisk4
[...]

The LVM of AIX has a special provision for doing that, called (somewhat counterintuitively) "Inter-Policy", look here:

Code:
# lslv -l mylv
mylv:/some/where
PV                COPIES        IN BAND       DISTRIBUTION  
hdiskpower2       1670:000:000  24%           270:410:409:409:172 
# lslv mylv
LOGICAL VOLUME:     mylv                   VOLUME GROUP:   myvg
LV IDENTIFIER:      00534b7a00004c000000011cd5e0067e.1 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            4096                   PP SIZE:        512 megabyte(s)
COPIES:             1                      SCHED POLICY:   parallel
LPs:                1670                   PPs:            1670
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    32
MOUNT POINT:        /some/where            LABEL:          /some/where
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no

At "minimum" the LVM places the LPs on the PPs in a way so that the minimum possible hdisks are involved. At "maximum" it will try to spread it over as many hdisks as possible, thus arriving at a placement similar to what i have sketched out above.

Note, btw, that the smaller the PP size (this is a property of the VG, so you might have to create it anew and start over from scratch), the better this effect works. You are ultimately trying to exploit the internal cache of the hdisks, and the PPs should be small enough to fit into this cache; 512 MB, as in my example, would be too big for that.
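
The PP size is set once, at VG creation, with the "-s" flag of "mkvg" (in MB, a power of two); schematically, for 64 MB PPs (VG and disk names are made up):

Code:
# mkvg -y myvg -s 64 hdisk1 hdisk2 hdisk3 hdisk4     # new VG with 64 MB PPs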

If you really have to create the VG anew, it might also be a good idea to create a RAID set (faster than single disks but slower than a stripe set) or even a stripe set (faster than the RAID but lacking its redundancy). See this little tutorial for details about RAIDs, striping, etc.

You can fine-tune the creation of the LV even further by using a so-called "map file" with the "-m" switch (see the man page of "mklv" for details). Basically, that way you can explicitly state which PP(s) should back any given LP of the LV.
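
Schematically, such a map file holds one "hdisk:PP" entry (or a PP range) per line, in LP order; a sketch with assumed names:

Code:
# cat mylv.map
hdisk1:1
hdisk2:1
hdisk3:1
hdisk4:1
# mklv -y mylv -m mylv.map myvg 4      # 4 LPs, placed exactly as the map dictates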

Further, please tell us what kind of data the FS holds. You already said "mostly writing log files", but a little more detail would help: many small files or a few very large files? Do the files change often, or are they mostly appended to? How often are files deleted and recreated (as in log rotation)? How many processes typically write concurrently to the FS? It might be that you can gain a lot with different OS tuning parameters without even having to change the disk layout.
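
Just to give you an idea of where such tuning would happen: the JFS2-related kernel tunables can be listed with "ioo" (only a way to look at them; changing any of them needs careful testing first):

Code:
# ioo -a | grep j2        # list the current JFS2 tuning parameters
# mount                   # check which options the FS is mounted with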

Finally, about "migratelp": if you really need it (which, as shown above, is not certain IMHO), you will want to remove any mirror copy from the LV before migrating its PPs around - it is simply half the work, because every PP has to be moved separately. Unmirror the LV you want to migrate, do the migration, and when that is done, create a new mirror. You can use a map file (see above, the "-m" switch) for this too.
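
The whole sequence, schematically (the LV name, target disks and LP/PP numbers here are made up):

Code:
# rmlvcopy fslv01 1                    # drop the mirror copy
# migratelp fslv01/1 hdisk3/25         # move LP 1 onto PP 25 of hdisk3
# migratelp fslv01/2 hdisk4/25         # move LP 2 onto PP 25 of hdisk4
# mklvcopy -m fslv01.map fslv01 2      # recreate the mirror, placement from a map file
# syncvg -l fslv01                     # synchronize the fresh copies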

I hope this helps.

bakunin
# 9  
Old 12-16-2014
Can I also raise a concern about your plan to spread LPs from both copies over the same disks? What would happen if you lost hdisk3, for instance? You might end up with two copies, both with some missing LPs, and a lot of hard work to extricate yourself.


I've not found any mention of reducing the FS size in the manual page for chfs on AIX 6. The details about the size= part are below:
Quote:
-a size=NewSize
Specifies the size of the Journaled File System. The size can be specified in units of 512-byte blocks, megabytes or gigabytes. If Value has the M suffix, it is interpreted to be in megabytes. If Value has a G suffix, it is interpreted to be in gigabytes. If Value begins with a +, it is interpreted as a request to increase the file system size by the specified amount. If the specified size is not evenly divisible by the physical partition size, it is rounded up to the closest number that is evenly divisible.
The process to reduce the FS does work, though:

Code:
# lslv robin_test_lv
LOGICAL VOLUME:     robin_test_lv          VOLUME GROUP:   rebsvg
LV IDENTIFIER:      00cfe9f500004c000000012bcecff0de.18 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                32                     PPs:            64
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       edge                   UPPER BOUND:    32
MOUNT POINT:        /robin_test_fs         LABEL:          /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no            

# chfs -a size=-32M /robin_test_fs
Filesystem size changed to 2031616

# lslv robin_test_lv                   
LOGICAL VOLUME:     robin_test_lv          VOLUME GROUP:   rebsvg
LV IDENTIFIER:      00cfe9f500004c000000012bcecff0de.18 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2                   WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                31                     PPs:            62
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       edge                   UPPER BOUND:    32
MOUNT POINT:        /robin_test_fs         LABEL:          /robin_test_fs
MIRROR WRITE CONSISTENCY: on/ACTIVE                              
EACH LP COPY ON A SEPARATE PV ?: yes                                    
Serialize IO ?:     NO                                     
INFINITE RETRY:     no

How delighted am I?

Given that you can add a PP by using "chfs -a size=+1", I tried "chfs -a size=-1 /test_robin_fs" and got an error, which I eventually deciphered: the reduction apparently has to be given in multiples of the PP size.
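
In other words: query the PP size first, then shrink in whole PPs; sketched with my test VG (your VG name will differ):

Code:
# lsvg rebsvg | grep "PP SIZE"           # here: 32 megabyte(s)
# chfs -a size=-64M /robin_test_fs       # shrink by exactly two PPs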



Robin
# 10  
Old 12-24-2014
Sorry for replying so late. It took me a few days to check the data I/O before doing the migration, and a few more days to check the result afterwards.

Quote:
This - the splitting you intend - makes sense only in a specific kind of situation, so please describe your hdisk-devices a bit better. What are they (single SCSI-disks, RAID-sets, LUNs from SAN, ...) and how do you access them?
They are single SCSI disk drives.

Quote:
Further, please tell us which kind of data the FS holds. You already said "mostly writing log files", but a little more detail would help: many small files, a few very large files, do the files change often or are they mostly appended? How often are files deleted and recreated (like in log rotation)? How many processes write (typically) concurrently to the FS. It might be that you can gain a lot with different OS tuning parameters without even having to change the disk layout.
Yes, many small files, and these small files are compressed into a tar file at the end of each day. Depending on the kind of log, these tar files are deleted after 1 week, 1 month or 1 year ...

This is what I've done for the last few days: I used "lvmstat" to collect I/O statistics for each LP/PP of fslv00.
Code:
Log_part  mirror#  iocnt   Kb_read   Kb_wrtn      Kbps
       1       1  993851   1208612   3869488      3.01
       1       2  892710    595072   3869440      2.65
       2       1  494569   1453116   2206252      2.17
       3       1  484349   1480348   2106752      2.13
      94       1  480105   1716412   2667692      2.60
       4       1  441866   1397696   2095044      2.07
       2       2  401524    746264   2206260      1.75
      94       2  395993    994520   2667688      2.17
      66       1  394574   1960732   2910004      2.89
       3       2  385884    783876   2106748      1.71
      66       2  378052   1193540   2909996      2.43
      93       1  363708   1538244   2186696      2.21
       4       2  360660    762820   2095048      1.69
      27       1  359079   1828708   2038160      2.29
      11       1  312996   1613992   2138716      2.22
      93       2  290093    866932   2186692      1.81
      27       2  283805    916232   2038156      1.75
      28       1  281631   1596708   1727532      1.97
       9       1  267060   1564664   1815164      2.00
      11       2  266783   1002556   2138700      1.86
       5       1  264425   1067524   1360380      1.44
      10       1  257942   1689472   1991040      2.18
      25       1  236913   1244748   1377272      1.55
      15       1  231909   1622516   1983764      2.14
      13       1  229836   1634964   2080424      2.20
      28       2  221740    777552   1727532      1.48
      12       1  219561   1570800   1824384      2.01
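
(In case anyone wants to reproduce this: the per-LP statistics have to be enabled per VG first, and single LPs can then be moved with "migratelp". A sketch; the VG name and the LP/PP numbers are made up:)

Code:
# lvmstat -v datavg -e            # enable statistics collection for the VG
# lvmstat -l fslv00               # per-LP I/O counters, as shown above
# migratelp fslv00/1/1 hdisk3/10  # move copy 1 of LP 1 to PP 10 on hdisk3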

Then I used this information to redistribute the LPs/PPs over the disks, trying to balance the load.

And this is the result:
Code:
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  40
hdisk5   51.5     1.4K   97.5     0.0     1.4K  PgspOut       0  % Client   40
hdisk4   54.0     1.4K   96.5     0.0     1.4K  PageIn        0
hdisk1   47.5   282.5    81.5     0.0   282.5   PageOut     378  PAGING SPACE
hdisk3   41.0   282.5    81.5     0.0   282.5   Sios        381  Size,MB   28672
cd0       0.0     0.0     0.0     0.0     0.0                    % Used      0
hdisk0    0.0     0.0     0.0     0.0     0.0   NFS (calls/sec)  % Free    100

It seems that the "Busy%" is balanced, but the "KB-Writ" is not as balanced as expected.


I've made the change to 5 servers. The last remaining server is also the most abnormal one:
Code:
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  52
hdisk4   18.5     1.6K   40.5     0.0     1.6K  PgspOut       0  % Client   52
hdisk5   22.0     1.6K   40.5     0.0     1.6K  PageIn        0
hdisk0   84.5   573.4   122.5     2.0   571.4   PageOut     489  PAGING SPACE
hdisk3   82.0   571.4   122.0     0.0   571.4   Sios        505  Size,MB   28672
hdisk1    0.0     0.0     0.0     0.0     0.0                    % Used      0
hdisk2    0.0     0.0     0.0     0.0     0.0   NFS (calls/sec)  % Free    100

You can see that although the "KB-Writ" of hdisk4 and hdisk5 is higher than that of hdisk0 and hdisk3, the "Busy%" of hdisk0 and hdisk3 is higher than that of hdisk4 and hdisk5.

It's so complicated.

# 11  
Old 12-26-2014
Quote:
Originally Posted by bobochacha29
It's so complicated.
Of course it is. If systems administration were simple, we wouldn't be the heroes of the whole IT business, would we? So welcome to the job with the biggest demands and the greatest rewards our industry has to offer.

Quote:
Originally Posted by bobochacha29
They are single SCSI disk drives.

Yes, many small files, and these small files are compressed into a tar file at the end of each day. Depending on the kind of log, these tar files are deleted after 1 week, 1 month or 1 year ...
Well, this is something we can build on. I can help you better when I return to the office (and to the documentation: there is a lot of detail I do not know off the top of my head).


Quote:
Originally Posted by bobochacha29
You can see that although the "KB-Writ" of hdisk4 and hdisk5 is higher than that of hdisk0 and hdisk3, the "Busy%" of hdisk0 and hdisk3 is higher than that of hdisk4 and hdisk5.
A little general information about the disk statistics and what they mean:

Every disk has a "command queue": read and write requests are buffered there and then worked on one after the other. If the queue is full, the disk will not accept more commands until there is room in the queue again. Keep this in mind for a moment.

The OS now asks every disk (where "disk" means anything with an hdisk device - it does not have to be a physical disk, it can also be a LUN, a RAID set, ...) in turn whether its queue has length 0 at the moment, which the disk answers with "yes" (length 0) or "no" (any length other than 0). From many of these answers the OS compiles a percentage, which is shown as "disk busy %".

This means that "disk busy" is not as meaningful as you may think, and it says little by itself: a queue that is "not length 0" can have length 1 or length 15. The value is still interesting because it gives you a measure of how often a disk is being accessed, but you cannot derive the throughput of a device from it alone. Disk operations come in different sizes, from 512 bytes (one disk block) up to several GB, and busy% will not tell you which of these is outstanding, just that something is outstanding at all.
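
If you want to look at the queue itself rather than at the derived busy%, "iostat -D" shows per-disk service-queue statistics (available from AIX 5.3 on):

Code:
# iostat -D hdisk0 hdisk3 60 1    # extended stats: avgserv, avgsqsz, sqfull, ...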

You might consider balancing for I/O rather than for busy%, but you might even do something else entirely. In the meantime, I suggest you read my little introduction to performance tuning, with emphasis on the I/O-tuning part. I will come back to this thread once I am back in the office (next Monday).

I hope this helps.

bakunin