The Fastest for copy huge data


 
# 8  
Old 09-16-2014
Quote:
Originally Posted by jim mcnamara
If you have an ssh connection and have set up ssh keys for an account that can write to /.
Where /parent is the parent directory component of /parent/path/to/files/

Code:
tar cf - ./path/to/files | ssh special_user@remoteserver ' cd /parent && tar xBf - '

This runs in about half the time of:
Code:
tar cf tarfile.tar ./path/to/files
scp tarfile.tar remoteserver:
ssh remoteserver ' tar xf tarfile.tar'

When copying data via ssh pipe, always add "-e none" to the command in case any characters in the stream match the ssh escape characters:
Code:
tar cf - ./path/to/files | ssh -e none special_user@remoteserver ' cd /parent && tar xBf - '

This is really moot, though, until we get more details from the original poster.
# 9  
Old 09-16-2014
@achenle: the OP specified 3 Mio (3 million) files that are on average 1 KiB in size, so that should be on the order of 3 GiB. With a SATA disk doing 120 (sequential, but small) IOPS at a 1 KiB IO size, that would theoretically be 3,000,000 IOs / 120 IOPS = 25,000 seconds, i.e. around 7 hours for the data alone, limited by either the reading or the writing system (probably the writing side is faster, since its IOs will be more sequential in nature). This excludes the IOPS required for the metadata. If the filesystem can do write combining / prefetching, then perhaps it may be a bit more efficient. If the filesystem has a larger minimum block size, that would not matter much for speed, since the block size would still be smallish.
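As a quick sanity check of that estimate (the 120 IOPS figure is only an assumption for a single SATA disk), the arithmetic can be done in the shell:
Code:
# Back-of-the-envelope only; 120 IOPS is an assumed figure for one SATA disk.
FILES=3000000        # 3 million files of ~1 KiB each
IOPS=120             # assumed small, mostly sequential IOs per second
echo "$(( FILES / IOPS )) seconds"        # 25000 seconds
echo "$(( FILES / IOPS / 3600 )) hours"   # 6 (integer hours), i.e. roughly 7 in practice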

When we take the disk out and put it in the other server, we need another stream of the same size to copy the data onto that server's disk, plus sneaker time.

If we used the network instead, we would probably not need much more time and could do it with a single stream, reading from one computer and writing onto the other (the network would not be a bottleneck here). So that should take on the order of half the time.

If we use any of the block copy methods in my post, there is no need to copy the files individually, nor to do all that metadata manipulation, and we can read large chunks of data with big IOs (for example 1 MiB per IO). That will be significantly faster, probably on the order of 100 MB/s, so it should theoretically take on the order of 30-60 seconds for the data alone, if the network is not a bottleneck.
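One way such a block copy could look, sketched with hypothetical raw-device names and reading 1 MiB per IO, pushed through the same kind of ssh pipe as above:
Code:
# Hypothetical device names; run this only with the filesystem unmounted or quiesced.
dd if=/dev/rdsk/c0t0d0s6 bs=1024k | \
    ssh -e none special_user@remoteserver 'dd of=/dev/rdsk/c1t0d0s6 bs=1024k'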

Of course, if the data is on a large filesystem, then that whole filesystem would need to be copied, unless the method is smart, like filesystem dump methods or ZFS send / receive, which only copy the parts that are in use.
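For a UFS filesystem, a dump-based transfer that copies only the blocks in use might look like this (device, user, and target path are placeholders):
Code:
# Dump level 0 from the raw device and restore it on the remote side.
ufsdump 0f - /dev/rdsk/c0t0d0s6 | \
    ssh -e none special_user@remoteserver 'cd /target && ufsrestore rf -'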

Last edited by Scrutinizer; 09-17-2014 at 01:07 AM..
# 10  
Old 09-16-2014
So that's what "mio" means...

120 IO operations per second from a SATA drive is quite optimistic. A single 7200 rpm SATA disk is realistically more likely to get about 60-70 IO operations per second, because the small reads in this case are not likely to be sequential - they'll effectively be random IO operations. If it's a 5400 rpm disk, the number would be even less.

And if atime modification isn't turned off, every read operation that reads a file will generate a write operation to update the inode data for that file.
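If those atime writes are a concern, access-time updates can be switched off; a sketch with placeholder dataset, device, and mount-point names:
Code:
# ZFS: stop access-time updates for the dataset holding the files (name is a placeholder).
zfs set atime=off tank/data

# UFS: mount (or list in /etc/vfstab) with the noatime option; device and path are placeholders.
mount -F ufs -o noatime /dev/dsk/c0t0d0s6 /data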

So that's probably somewhere between 6 and 9 million IO operations, because metadata has to be read just to find each file. Call it 6 million IO operations and assume the disk can do 60 IO operations per second. That's 100,000 seconds, or more like 28 hours. And that assumes the disk isn't servicing other IO operations.

Why not just share the file system via NFS and let other systems access the files that way?
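On Solaris, the share and the client mount could look like this (paths and hostname are placeholders):
Code:
# On the server: export the tree read-only.
share -F nfs -o ro /export/data

# On each client: mount it and read the files in place instead of copying them.
mount -F nfs fileserver:/export/data /mnt/data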
# 11  
Old 09-16-2014
Well, given our lack of information, there is no real answer.

IOPS are not knowable - our SAN does 12,000 IOPS continuously if required. The SATA disk on my desktop does maybe 70. And if the file systems were ZFS and were on a SAN, then the "copy" time is the time it takes to type four or five zfs commands.
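One reading of those "four or five zfs commands" is a snapshot plus send/receive; the pool, dataset, and host names below are placeholders:
Code:
# Snapshot the source dataset and stream it to the other host.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | ssh -e none special_user@remoteserver zfs receive tank2/data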

So maybe we are comparing apples to elephants. I do not know.

In any event, when an app (or a user) is allowed to clutter a filesystem as described, there is not a lot of hope for it. A simple find or ls command can take hours to complete on some systems. Copying it as-is does not seem like a best-practices idea to me.
# 12  
Old 09-17-2014
cpio in pass-through mode is generally regarded as much faster than tar.
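A typical pass-through invocation looks like this (source and target directories are placeholders):
Code:
# Walk the source tree and recreate it under /target, preserving directories and mtimes.
cd /source && find . -depth -print | cpio -pdm /target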