With AWK (if I'm not missing something): [use nawk or /usr/xpg4/bin/awk on Solaris] Code:
awk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' filename
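For anyone who finds the one-liner hard to read, here is a minimal sketch of what it appears to do: every tenth line goes to the file with suffix 4, and all other lines rotate over suffixes 1, 2 and 3. The temporary directory, the input name `sample` and its 20 numbered lines are assumptions made up for the demonstration, and the rotation uses a single counter instead of the two counters in the original.

```shell
# Hypothetical demo input: 20 numbered lines in a scratch directory.
tmp=$(mktemp -d)
awk 'BEGIN { for (n = 1; n <= 20; n++) print n }' > "$tmp/sample"

awk '
NR % 10 == 0 { print > (FILENAME "4"); next }    # lines 10, 20, ... go to <input>4
{ i = i % 3 + 1; print > (FILENAME i) }          # the rest rotate over <input>1..3
' "$tmp/sample"

# Show what ended up where.
for f in "$tmp"/sample[1-4]; do
    echo "== $f"
    cat "$f"
done
```

With this input, `sample4` should hold lines 10 and 20, while `sample1` starts 1, 4, 7, 11.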
For best performance use mawk if available: Code:
% repeat 1000000 print ${(l:100::x:)l=line}$((++i)) >> data
% wc data
1000000 1000000 106888896 data
% time gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data
gawk data 3.28s user 0.37s system 97% cpu 3.756 total
% time mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data
mawk data 1.44s user 0.42s system 95% cpu 1.939 total
% time nawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++c%4?++i:i?i:++i))}
i==3{i=c=0}' data
nawk data 8.07s user 3.61s system 93% cpu 12.516 total
Last edited by radoulov; 09-30-2008 at 04:31 PM.
Just for completeness, I should note that the modulo arithmetic in the Perl script I posted was a major brain fart. Here's a hopefully corrected version, with an explanation. Code:
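perl -MIO::File -ne 'BEGIN {
    # Output file names, one per handle
    @n = ("one.txt", "two.txt", "three.txt", "four.txt");
    # Open all four files for writing once, up front
    map { $file[$_] = IO::File->new(">$n[$_]") || die $! } 0..3;
    # Handle index for each value of (line number modulo 10)
    @m = (3, 0, 1, 2, 0, 1, 2, 0, 1, 2);
  }
  # Pass $_ explicitly: the print method does not default to it
  $file[$m[$. % 10]]->print($_) || die $!' filename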
I threw in the mapping of arbitrary file names in the array @n for show. The BEGIN block creates an array @file of four file handles (indexed 0 through 3 -- Perl arrays start at zero) and a mapping @m of which line number to print to which handle. Somewhat confusingly, the first entry in the mapping is for line numbers 10, 20, 30, ... (array index zero), while only the second is for line numbers 1, 11, 21, etc.

In the main loop (outside the BEGIN block) we simply calculate the remainder (modulo) of the line number $. divided by 10 (not 9!!) and use that as an index into @m to get the handle index, and then through another level of indexing print to the handle we are pointed to.

Also for the record, the shell version will have an issue if there is input with backslashes in it. Change read to read -r or, if your shell doesn't support that, see if you have the line command instead. Also, for maintainability I suppose it would be better to use higher-numbered file descriptors -- file descriptors 1 and 2 are reserved for standard output and standard error, as you probably know. (I wanted to keep them in sync to make the script easier to follow, but it sucks if you try to debug it and lose all your errors into a file someplace.)

As usual, Radoulov's solution is impressive, though a bit hard to follow. Apparently the names of the output files will be the input file name with a number suffix added. I speculate that mawk keeps the file handles open just in case, i.e. secretly does the file handle juggling that I did explicitly in the Perl script. (Incidentally, you don't really need IO::File for that, but it makes it a lot more readable -- the stuff you have to do to manipulate bare file handles in bare Perl is arcane even by Perl standards.)

Last edited by era; 10-01-2008 at 04:04 AM. Reason: Note on file descriptor numbering in sh implementation
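The shell advice above (use read -r, and keep descriptors 1 and 2 free) can be sketched roughly as follows. This is not the shell script from earlier in the thread, only an illustration under assumptions: the scratch directory, the sample input with a backslash on every line, and the numeric output suffixes are all made up for the example.

```shell
# Hypothetical demo input: 20 lines, each containing a backslash.
tmp=$(mktemp -d)
in="$tmp/filename"
awk 'BEGIN { for (n = 1; n <= 20; n++) print n "\\x" }' > "$in"

# Descriptors 3-6 are opened once for the whole loop; 1 and 2 stay
# attached to the terminal so normal output and errors remain visible.
n=0
while IFS= read -r line; do          # -r keeps backslashes intact
    n=$((n + 1))
    case $((n % 10)) in
        0)     printf '%s\n' "$line" >&6 ;;   # lines 10, 20, ...
        1|4|7) printf '%s\n' "$line" >&3 ;;
        2|5|8) printf '%s\n' "$line" >&4 ;;
        3|6|9) printf '%s\n' "$line" >&5 ;;
    esac
done < "$in" 3> "$in.1" 4> "$in.2" 5> "$in.3" 6> "$in.4"
```

Attaching the redirections to the loop itself means the shell closes the four descriptors when the loop ends, so no explicit cleanup is needed.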
Thanks era! As far as I know, [ngm]awk should keep the files open until the end of the program or an explicit close call (close(filename)): Code:
% strace -eopen mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
open("tls/i686/sse2/cmov/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
[snip]
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
open("data", O_RDONLY) = 3
open("data1", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
open("data2", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
open("data3", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
open("data4", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 7
Process 8618 detached
Code:
% strace -eopen gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
open("tls/i686/sse2/cmov/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
[snip]
open("/usr/lib/locale/en_US.utf8/LC_TIME", O_RDONLY) = 3
open("data", O_RDONLY|O_LARGEFILE) = 3
open("data1", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4
open("data2", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 5
open("data3", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 6
open("data4", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 7
Process 8641 detached
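The effect of keeping a file open versus calling close() can be seen with a small experiment. This is only a sketch: the scratch directory and the output names `kept`, `reopened` and `appended` are hypothetical, and the three-line input is made up.

```shell
tmp=$(mktemp -d)

# No close(): ">" truncates the file once, on the first print, and the
# file then stays open, so all three lines end up in it.
printf 'a\nb\nc\n' | awk -v out="$tmp/kept" '{ print > out }'

# close() after every line: the next ">" reopens and truncates the file,
# so only the last line survives.
printf 'a\nb\nc\n' | awk -v out="$tmp/reopened" '{ print > out; close(out) }'

# close() combined with ">>": each reopen appends, so nothing is lost.
printf 'a\nb\nc\n' | awk -v out="$tmp/appended" '{ print >> out; close(out) }'

wc -l "$tmp/kept" "$tmp/reopened" "$tmp/appended"
```

This is why the one-liners above open each output file only once in the strace logs: without close(), later prints reuse the descriptor that is already open.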
Reading the strace output, I notice some differences in the timings of the read/write calls. I'm quite sure that the output below does not show all time-consuming events. Code:
% strace -c mawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7865 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.48    0.003954           0     26097           write
 25.83    0.001390           0     26313           read
  0.69    0.000037           1        57        49 open
  0.00    0.000000           0        10           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           time
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         5         5 ioctl
  0.00    0.000000           0         5           munmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0        13           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0         7           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.005381                 52536        73 total
% rm data[1-4]
% sync;sync
% strace -c gawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7883 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.16    0.004391           0     26097           write
 27.21    0.001656           0     26102           read
  0.62    0.000038           0        89        72 open
  0.00    0.000000           0        17           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6         5 ioctl
  0.00    0.000000           0         6           munmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         4           _llseek
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0        22           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0        25           fstat64
  0.00    0.000000           0         2           getgroups32
  0.00    0.000000           0        13           fcntl64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.006085                 52416        97 total
% rm data[1-4]
% sync;sync
% strace -c newawk '!(NR%10){print>(FILENAME 4);next}
{print>(FILENAME (++i))}i==3{i=0}' data
Process 7943 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.90    0.123052           0   1000000           write
  1.10    0.001368           0     26101           read
  0.00    0.000000           0        64        52 open
  0.00    0.000000           0        15           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         7           munmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           rt_sigaction
  0.00    0.000000           0        18           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0        12           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.124420               1026246        71 total
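One way to read the write counts is to divide the size of data by the number of write calls. The figures below are copied from the wc and strace output in this thread; that the strace runs used the same data file as the timing runs is an assumption, and the variable names are made up for the sketch.

```shell
bytes=106888896          # size of data, as reported by wc
buffered_writes=26097    # write calls in the mawk and gawk runs
per_line_writes=1000000  # write calls in the third run

echo "mawk/gawk: about $((bytes / buffered_writes)) bytes per write"
echo "third run: about $((bytes / per_line_writes)) bytes per write"
```

This prints roughly 4095 and 106. Under that assumption, mawk and gawk look like they flush output in blocks of about 4 KiB, while the third awk issues one write per input line, which would fit the much larger system time measured for it.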