Removing lines if a value exists in the first file

08-29-2009

One possibility:

filter.sh

Code:

#!/bin/bash
awk -F',' 'NR==FNR{_[$0]=1}NR!=FNR&&!_[$4]{print}' $1 $2  > $3

Code:

$ filter.sh exclude infile outfile

ripat

08-29-2009

Or, still with awk:

Code:

awk -F "," 'NR==FNR{a[$1]=$1;next} !($4 in a) {print $0}' file1 file2

---------- Post updated at 09:59 AM ---------- Previous update was at 09:43 AM ----------

A slightly shorter awk alternative (merely referencing a[$1] is enough to create the key):

Code:

awk -F "," 'NR==FNR{a[$1];next} !($4 in a) {print $0}' file1 file2

protocomm

08-29-2009

Indeed. Corrected version:

Code:

awk -F',' 'NR==FNR{_[$0]=1;next}!_[$4]{print}' exclude infile

Thanks.
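For readers new to the NR==FNR idiom, here is a small self-contained sketch of how it behaves. The function name drop_excluded, the temporary file names, and the sample records are made up for illustration; only the awk logic follows the one-liners above.

```shell
#!/bin/sh
# Illustrative sketch: the function name and the sample data are assumptions.
# drop_excluded EXCLUDE_FILE DATA_FILE...
# Prints every comma-separated record whose 4th field is NOT listed
# (one key per line) in EXCLUDE_FILE.
drop_excluded() {
    awk -F',' '
        NR == FNR { seen[$0] = 1; next }   # first file only: remember each key
        !($4 in seen)                      # later files: print unless field 4 is a key
    ' "$@"
}

# Demo with throwaway files.
tmpdir=$(mktemp -d) || exit 1
printf '%s\n' 1002 1004 > "$tmpdir/exclude"
printf '%s\n' 'a,b,c,1001,x' 'a,b,c,1002,y' 'a,b,c,1003,z' > "$tmpdir/infile"
drop_excluded "$tmpdir/exclude" "$tmpdir/infile"
rm -rf "$tmpdir"
```

NR counts records across all input files while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. One caveat: if the exclude file is empty, NR==FNR also holds for the first data file, and that file is then swallowed as if it were the key list.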

ripat

08-29-2009

Quote:

Originally Posted by svn

My question is: how can I make that account number position a variable, so that I can pass it at the same time I'm specifying the file names?

Now you're starting to get tricky.

How about a command line that looks like this:

Code:

filter excludeFile 4 file1 file2 file3...

Would that work? The exclude file comes first, followed by the numeric field number (starting at 1 for the first field), followed by a list of one or more files that use that particular field number. If you use a negative field number, it will count from the end of the line instead of the front, so a field of -2 would mean the second to the last field on every line, even if each line had a different number of fields.

In the code below I have told Perl to rename the original files so that they end in .bak and then write the changes to the original name. For the command above, you'd end up with file1 and file1.bak for example.

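If that works for you, try the following. Note the extra -i.bak option on the first line (lowercase -i is Perl's in-place edit switch) and the extra $field variable.

Code:

#!/usr/bin/perl -i.bak
use strict;
use warnings;

# Load the exclude list (one key per line) into a hash for fast lookups.
my $file = shift;
open(my $exclude_list, '<', $file) or die "Cannot open $file: $!\n";
chomp( my @a = <$exclude_list> );
close($exclude_list);
my %exclude;
@exclude{@a} = @a;

# Field number: 1-based from the front; a negative number counts from the end.
my $field = shift;
if ($field !~ /^-?\d+$/) {
    $field = 4;    # fall back to field 4 if the argument is not an integer
}
die "Field specifier may not be zero.\n" unless $field;
$field-- if $field > 0;

# With -i.bak, print writes to the original file name; the old copy gets .bak.
while (my $line = <>) {
    chomp( my $record = $line );    # so the last field carries no newline
    my $key = (split /,/, $record, -1)[$field];
    print $line unless defined $key && exists $exclude{$key};
}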

If there are other options you want to add (such as using a different delimiter between fields), then it's time to start using Getopt::Std and specifying options using the same techniques other commands use: a dash followed by a letter.
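For readers who would rather stay in awk, the same command-line convention (exclude file, then field number, then data files) might be sketched as follows. The function name filter_field is hypothetical and not from this thread, and unlike the Perl script it writes to standard output instead of editing the files in place.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread):
#   filter_field EXCLUDE_FILE FIELD FILE...
# FIELD is 1-based; a negative FIELD counts back from the end of each line.
filter_field() {
    exclude=$1
    field=$2
    shift 2
    case $field in
        ''|-|*[!0-9-]*|?*-*) field=0 ;;   # anything that is not an integer
    esac
    if [ "$field" -eq 0 ]; then
        echo "filter_field: field must be a non-zero integer" >&2
        return 2
    fi
    awk -F',' -v field="$field" '
        NR == FNR { seen[$0] = 1; next }    # load the exclude keys
        {
            f = (field > 0) ? field : NF + field + 1   # -1 is the last field
            key = (f >= 1) ? $f : ""
            if (f < 1 || !(key in seen)) print
        }
    ' "$exclude" "$@"
}
```

A call such as `filter_field excludeFile 4 file1 file2 > outfile` would then mirror the command line proposed above. Lines with too few fields for a negative field number are passed through unchanged in this sketch.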

@ripat: That's a cute trick with the NR==FNR for awk. I'm going to have to remember that one. Only useful for a single file, but still... (The file handling in awk is terrible!)

---------- Post updated at 04:00 PM ---------- Previous update was at 03:52 PM ----------

Quote:

Originally Posted by ripat

One possibility:

filter.sh

Code:

#!/bin/bash
awk -F',' 'NR==FNR{_[$0]=1}NR!=FNR&&!_[$4]{print}' $1 $2  > $3

It's a really bad idea to use variables without putting double quotes around them! I can screw up that awk command pretty badly by passing the script a filename with a space or wildcard character in it, especially as the third parameter.

Please put double quotes around ALL variable substitutions. Out of a thousand uses it will only be wrong 3-4 times, so you've got a 99.6% chance of getting it right. Those are pretty good odds.
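As a minimal sketch of that advice, here is the same filter with every expansion quoted. Wrapping it in a function named filter is an assumption made only so the snippet can be tried interactively; in filter.sh the awk line would simply use "$1" "$2" "$3" directly.

```shell
#!/bin/sh
# Sketch: same awk program as the corrected one-liner in this thread, with
# each positional parameter double-quoted so that names containing spaces
# or wildcard characters reach awk and the redirection intact.
# The function wrapper is hypothetical; arguments: exclude infile outfile.
filter() {
    awk -F',' 'NR==FNR{_[$0]=1;next}!_[$4]{print}' "$1" "$2" > "$3"
}
```

With the quotes in place, a call like `filter "exclude list" "in file.csv" "out *.csv"` treats each argument as a single file name instead of splitting it on spaces or expanding the asterisk.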

Azhrei

08-29-2009

Quote:

Originally Posted by Azhrei

No way would I use a shell for that job! The following Perl script is probably a hundred times faster and more efficient!

While I agree that Perl is usually well suited for this type of application, I do not think this generalization is accurate. The shell scripts above are fine but there is room for some significant speed optimizations. If we use ksh (ksh93s+) instead of bash and a method that resembles the one in your Perl script, I think there would not be a real big difference in speed.

filter.ksh93

Code:

#!/usr/bin/ksh
typeset -A EXCLUDED
EXCLUDE_LIST=$(< $1)
INFILE=$2
for excl in $EXCLUDE_LIST; do
  EXCLUDED[$excl]=1
done
IFS=","
while read a b c id d; do
  if [[ ${EXCLUDED[$id]} -ne 1 ]]; then
    echo "${a},${b},${c},${id},${d}"
  fi
done < $INFILE

Code:

./filter.ksh93 excludes infile > outfile

Scrutinizer

08-29-2009

Quote:

Originally Posted by Scrutinizer

Hmm. Let's take a look at your script and its efficiency/performance and compare that to the Perl script, shall we?

First, the perl script loses big time in terms of startup cost; initializing the interpreter and compiling the script are overhead that can never be reclaimed (although it can be amortized if the data files are large enough). The perl script also loses (slightly) in that it's less readable to people unfamiliar with the language (although the OP was able to correctly determine how to change the field used for his particular case). The final lossage comes from the wordiness of my perl example -- it could've been done more concisely but I was at least partially concerned about the OP being able to understand its overall operation.

(I'm modifying your Korn shell script to add some performance and usage benefits, but it remains essentially the same.) Your Korn shell script does not have the startup cost, but as a true interpreter it will have to constantly be reparsing the loop body every time through the loop, so if there are a significant number of iterations it will be a performance problem. There's also the problem of single and double quotes occurring in the input; the Korn shell's read will handle paired quotes correctly (as it interprets the quotes) while perl will need help from a regular expression to do the work (or the Text::Balanced module). The reason I mention this as a problem is that a single apostrophe will screw up the Korn script but have no impact on the perl script (as the perl script ignores the issue entirely!).

Quote:

filter.ksh93

Code:

#!/usr/bin/ksh
typeset -A EXCLUDED
while read excl; do
  EXCLUDED[$excl]=1
done < "$1"
IFS=","
while read -A fields; do
  if (( ${EXCLUDED[${fields[3]}]} != 1 )); then
    echo "${fields[*]}"
  fi
done < "$2"

In any case, there is no comparison between the two languages when processing more than a few hundred lines of data. I wrote a Korn script to do some text processing for a client (similar to this task) that took 28+ minutes to process 300k records. The same task in Perl took a little over 2 minutes. That's roughly 10k records per minute for the shell script versus roughly 150k records per minute for the Perl script.

I attribute the difference to the efficiencies of pseudo-compiling and the nature of the I/O between the two scripts (the perl script was in "paragraph" mode, reading 10-20 lines at a time while the shell script had to do one line at a time and maintain a FSM).

Azhrei

08-30-2009

@Azhrei

Check your ksh snippet as it throws an error with my ksh93 when evaluating your conditional expression:

Code:

if (( ${EXCLUDED[${fields[0]}]} != 1 )); then

error:

Code:

./ex.korn: line 8:   != 1 : arithmetic syntax error

That is expected: when the key is not in the array, the expansion is an empty string, and the shell ends up evaluating " != 1" as an arithmetic expression. Try:

Code:

if [[ ${EXCLUDED[${fields[0]}]} == "" ]]; then

which works well.
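Another option, if you prefer to keep a numeric comparison, is to give the expansion a default value so the arithmetic never sees an empty operand. The sketch below uses plain POSIX sh; the function name is_excluded and the variable flag are hypothetical stand-ins for the array lookup. The same `:-0` default should be usable on the array element in ksh93, but that is an assumption and was not tested here.

```shell
#!/bin/sh
# Sketch only: "flag" stands in for ${EXCLUDED[${fields[0]}]}, which expands
# to an empty string when the key is not in the array.
is_excluded() {
    flag=$1
    # ${flag:-0} substitutes 0 for an empty or unset value, so the arithmetic
    # expansion always has a number on the left-hand side.
    [ "$(( ${flag:-0} == 1 ))" -eq 1 ]
}
```

A lookup that found the key (value 1) makes is_excluded succeed, while an empty expansion makes it fail quietly instead of raising a syntax error.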

As for performance, I ran a test on large sample files:
excluded (cardinality: 50000 lines)
infile (cardinality: 29000 lines)

Results:

Code:

jeanluc@ibm:~/scripts/test$ time ./ex.pl excluded infile > /tmp/out.pl
real	0m0.214s
user	0m0.176s
sys	0m0.032s

jeanluc@ibm:~/scripts/test$ time ./ex.korn excluded infile > /tmp/out.korn
real	0m1.154s
user	0m1.060s
sys	0m0.088s

jeanluc@ibm:~/scripts/test$ time ./ex.awk excluded infile > /tmp/out.awk
real	0m0.093s
user	0m0.072s
sys	0m0.016s

As is often the case in data-file crunching, awk is fast and terse.

ripat
