Removing lines if value exists in first file


 
# 15  
Old 08-30-2009
Quote:
Originally Posted by ripat
Check your ksh snippet as it throws an error with my ksh93 when evaluating your conditional expression:
Code:
if (( ${EXCLUDED[${fields[0]}]} != 1 )); then

error:
Code:
./ex.korn: line 8:   != 1 : arithmetic syntax error

Which is normal, as it tries to evaluate a string (an empty string) in an arithmetic expression.
Yep, you're right. Putting double quotes around the EXCLUDED expression should fix it, and since it's a numeric comparison that's probably a better solution than using [[ ]] as a string comparison. But I suppose they're about the same in this case.
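For the record, a minimal sketch of a safe form of that conditional; the :-0 default (so an absent key evaluates as 0) and the print line are my assumptions, not the original script:
Code:
# Default the lookup to 0 so an absent key doesn't leave the
# arithmetic expression empty; EXCLUDED and fields as above.
if (( ${EXCLUDED[${fields[0]}]:-0} != 1 )); then
    print -r -- "$line"   # hypothetical output step
fi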

Quote:
Talking about performance, I did a test on large sample files:
excluded (cardinality: 50000 lines)
infile (cardinality: 29000 lines)
That seems a pretty reasonable size, although the OP didn't say how many records of each he actually has.

While playing with my ksh script I was surprised to find that associative arrays in ksh don't have the size limit that indexed arrays have: an indexed array can hold at most 4096 elements, but I was able to put 500k elements into an associative array without problems...
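A throwaway ksh93 snippet to reproduce that observation (my sketch, not code from the thread):
Code:
#!/bin/ksh93
# Load 500k elements into an associative array and count them; an
# indexed array of this size would hit the 4096-element limit
# mentioned above.
typeset -A big
integer i
for (( i = 0; i < 500000; i++ )); do
    big[key$i]=1
done
print "stored ${#big[@]} elements"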

Quote:
Results:
Code:
jeanluc@ibm:~/scripts/test$ time ./ex.pl excluded infile > /tmp/out.pl
real    0m0.214s
user    0m0.176s
sys    0m0.032s

jeanluc@ibm:~/scripts/test$ time ./ex.korn excluded infile > /tmp/out.korn
real    0m1.154s
user    0m1.060s
sys    0m0.088s

jeanluc@ibm:~/scripts/test$ time ./ex.awk excluded infile > /tmp/out.awk
real    0m0.093s
user    0m0.072s
sys    0m0.016s

As is often the case in data-file crunching, awk is fast and terse.
I'm guessing that each of these test times is a "second run" test, so that the I/O of bringing the executable and libraries into memory has been factored out. If that's the case, I must admit that I'm quite surprised by the awk results. I've not seen awk produce faster runtimes than perl when large datasets are processed, yet your numbers show a 2:1 advantage for awk.
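For context, the usual awk idiom for this kind of exclusion filter looks something like the sketch below. The actual ex.awk isn't shown in this excerpt, so the key column and field separator are guesses:
Code:
# Plausible reconstruction of an exclude-list filter in awk (not
# the benchmarked ex.awk). Read the exclude file first (NR==FNR),
# then print only the records of infile whose first field was
# never seen in it.
awk 'NR == FNR { excluded[$1]; next }
     !($1 in excluded)' excluded infile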

I wonder if it would be faster to remove the chomp() in the first perl loop and simply do the lookup by adding a \n in the main loop? I would assume that concatenating a newline onto each input record has to be slower than stripping the newline from the exclude list, except I notice that your exclude list is over 70% larger than the input file, so maybe that's a factor?
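A minimal sketch of that idea (not the original ex.pl; the file layout and whitespace-delimited key column are assumptions):
Code:
#!/usr/bin/perl
# Sketch of the idea above: skip chomp() on the exclude list and
# append "\n" to the key at lookup time instead.
use strict;
use warnings;

my ($excl_file, $in_file) = @ARGV;

open my $ex, '<', $excl_file or die "$excl_file: $!";
my %excluded;
$excluded{$_} = 1 while <$ex>;    # keys keep their trailing newline
close $ex;

open my $in, '<', $in_file or die "$in_file: $!";
while (my $line = <$in>) {
    my ($key) = split ' ', $line;            # first field of the record
    print $line unless $excluded{"$key\n"};  # newline added at lookup
}
close $in;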

Very interesting. Thank you for posting those numbers, as I now have something to investigate between other projects!
# 16  
Old 08-31-2009
Thank you too, Azhrei, that was an interesting discussion.
I had run some tests as well against a large input file, BTW.

Code:
Seconds   terminal   file    /dev/null
---------------------------------------
ksh93      27,15     30,23     25,03
perl       12,89      3,54      3,29
awk         8,15      1,75      1,69

You were right: Perl is much faster with bigger inputs, although not 100 times faster. I thought the differences would be smaller, and they were small with terminal output, but not with output to a file. The gap between terminal output and file output is remarkable in Perl's case. I also tried shcomp, which compiles ksh93 scripts, but that did not make it any faster.
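For anyone curious about the shcomp route, the experiment looks like this (script and data file names from the thread; the compiled file name is my choice):
Code:
# Compile the ksh93 script to shell bytecode, then time the
# compiled form; ksh detects the bytecode format automatically.
# In this test it gave no measurable speedup.
shcomp ex.korn ex.korn.bin
time ksh ex.korn.bin excluded infile > /dev/null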

Last edited by Scrutinizer; 08-31-2009 at 06:33 PM..
# 17  
Old 08-31-2009
Interesting numbers. Thanks. But you didn't say what your input file sizes were for your tests...?

The terminal output will be constrained by the efficiency of the terminal driver and, assuming you're using a network connection, the efficiency of your network stack. The output to /dev/null is a great test since it eliminates output costs from the processing entirely, although the OP still has to perform output to save his data. It gets ugly to test I/O to a filesystem, however, since a fragmented fs will be slower. Ideally, multiple runs would be performed with the first set of results thrown away, and with successive executions overwriting the original data in the file (preventing new data block allocations from being required). Such testing isn't appropriate for the problem at hand, although it might make for some more interesting discussion.
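In shell terms, that procedure might look something like this sketch (script and file names from the benchmark above; the run count is arbitrary):
Code:
# Warm-up run to pull the interpreter and data into cache, then
# three timed runs that overwrite the same output file so no new
# data blocks need to be allocated.
./ex.awk excluded infile > /tmp/out.awk          # discarded
for i in 1 2 3; do
    time ./ex.awk excluded infile > /tmp/out.awk
done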

And none of the above factors in usability issues, such as the Perl script automatically creating .bak files from the input files.

I think it's time to look at the specifics of each execution platform: what version of Korn shell are you using (turn on vi editing mode and type <Esc>^V to see the version number), what version of Perl (just perl -v), and what version of awk (not sure how to check that one)? Then I'm going to look into this some more when I get some time later in the week.
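Collected in one place, those version checks look like this (the awk flag varies by implementation; see the next post):
Code:
# ksh93 also exposes its version in a special parameter, in
# addition to the <Esc>^V trick in vi editing mode:
print ${.sh.version}
# Perl:
perl -v
# awk: GNU awk takes --version, mawk takes -W version; the stock
# BSD awk has no version flag at all.
awk --version
mawk -W version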
# 18  
Old 09-01-2009
Interesting thread indeed.

Quote:
Originally Posted by Azhrei
... what version of Korn shell are you using (for example, turn on vi editing mode and type <Esc>^V to see the version number), what version of Perl (just perl -v), and what version of awk (not sure for this one?).
Simply awk --version, and while you're at it, give mawk a try if it's available on your system. The mawk interpreter is often much faster than stock awk.
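Switching the benchmark above to mawk is a one-liner, since the #! line of an awk script is just a comment to another awk interpreter (script and file names from the earlier posts):
Code:
time mawk -f ex.awk excluded infile > /tmp/out.mawk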
# 19  
Old 09-01-2009
Quote:
Originally Posted by ripat
Simply awk --version, and while you're at it, give mawk a try if it's available on your system. The mawk interpreter is often much faster than stock awk.
That doesn't work on Mac OS X 10.4.11, but strings found version 20040207; it's probably the BSD version on my machine while you've got the GNU version. I wonder if that's a significant factor? I'd be willing to bet that no one has tweaked the BSD version of awk since 4.4BSD-Lite was released.

I've got Korn shell "M 1993-12-28 p", and my Perl is 5.8.6 with some security patches. But no answer from anyone else on versions.
# 20  
Old 09-01-2009
I used 425,412 infile records against 215 excludes (on a Pentium III 850 MHz laptop).

Perl: v5.8.8 built for i486-linux-gnu-thread-multi
Ksh: 93s+20071105-1 The real, AT&T version of the Korn shell
awk: mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF 32767
sprintf buffer 1020

Last edited by Scrutinizer; 09-01-2009 at 05:51 PM..