The UNIX and Linux Forums  
#1
02-18-2008
gneen
Registered User
Join Date: Feb 2008
Posts: 5
omitting lines from file A that are in file B

I've got file A with (say) 1M lines in it ... ascii text, space delimited ...

I've got file B with (say) 10M lines in it ... same structure.

I want to remove any lines from A that appear (identically) in B and print the remaining (say) 900K lines. (And I want to do it in zero time of course!)

Best I've come up with so far is somehow marking the lines in A, then doing a sort and applying an awk script to the result so that the marked lines are only printed if the following (or previous) line isn't "identical" except for the mark.

But after 1000 years of shell programming I've GOT to believe I'm missing an easier/faster solution ... I'm using bash and cygwin tools - and compiling is not an option.

ADVthanksANCE for your help!
=Gneen
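
The mark-and-sort idea described in the post above might look roughly like the following sketch. The helper name `only_in_a`, the tab-separated 0/1 tags, and the assumption that no input line contains a tab character are illustrative choices, not part of the original post.

```shell
#!/usr/bin/env bash
# Sketch of "mark, sort, filter". Assumes no input line contains a tab.
# only_in_a is a hypothetical helper name: only_in_a FILE_A FILE_B

only_in_a() {
    {
        awk '{ print $0 "\t1" }' "$1"   # lines from A get tag 1
        awk '{ print $0 "\t0" }' "$2"   # lines from B get tag 0, so they sort first
    } |
    LC_ALL=C sort |
    awk -F'\t' '
        # A tagged line from B: remember its text (as a string) and move on.
        $2 == "0" { b = $1 ""; seen = 1; next }
        # A tagged line from A: print it unless the last B line was identical.
        !(seen && $1 == b) { print $1 }
    '
}

# Usage example with throwaway files.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$tmp/A"
printf '%s\n' 5 3 1 > "$tmp/B"
only_in_a "$tmp/A" "$tmp/B"
rm -rf "$tmp"
```

The result comes out in byte-sorted order rather than in the original order of file A, and the cost is dominated by sorting the combined 11M lines.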
#2
02-18-2008
earnstaf
Registered User
Join Date: May 2007
Posts: 113
Code:
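# Read fileA line by line; IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line
do
# -x matches the whole line, -F treats it as a fixed string rather than a regex.
if ! grep -qxF -- "$line" fileB; then
# Append (>>) so that earlier results are not overwritten.
printf '%s\n' "$line" >> fileC
fi
done < fileA
Not sure how fast that would be, but fileC will end up with all the lines from fileA that are not in fileB.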
#3
02-18-2008
gneen
Registered User
Join Date: Feb 2008
Posts: 5
but ...

Heh - the grep inside the read loop would "work" ... but I'd have to come back in a year to see the results!

For tiny files this would clearly be the way to go - but for files the size I'm dealing with this would mean one million greps into a file that was ten million lines long ... can you spell "Rip Van Winkle"?


=Gneen
#4
02-18-2008
joeyg (Forum Staff)
modérateur
Join Date: Dec 2007
Location: Home of 17-time world champion Boston Celtics
Posts: 1,311
not knowing what the real data looks like, but...

How about?
This would effectively break fileb up into 26 smaller files based on the first character of each line, assuming that character is a lowercase letter. (Or, depending on the format of your data, it could be ten numeric groups, etc...)


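# Split fileb into 26 smaller files keyed by the first (lowercase) character of each line.
for outch in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
grep "^$outch" fileb > "fileb_$outch"
done

while IFS= read -r zf
do
leadch=$(printf '%s' "$zf" | cut -c1)

# Look up the line only in the bucket for its first character ($leadch)
# and print it if no identical line is found there; adapt as you like.
grep -qxF -- "$zf" "fileb_$leadch" 2>/dev/null || printf '%s\n' "$zf"

done <filea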
#5
02-18-2008
otheus (Forum Staff)
Moderator ala Mode
Join Date: Feb 2007
Location: Innsbruck, Austria
Posts: 1,884
Use awk/perl hashes/assoc arrays

Assuming awk is fairly memory efficient and you have at least (1M lines) x (average line length) bytes of virtual memory to hold file A, this should work:

Code:
awk 'NR==FNR { A[$0]=1; next; } { if ($0 in A) { A[$0]=0; } } END { for (k in A) { if (A[k]==1) { print k; } } }'  A   B
#6
02-18-2008
gneen
Registered User
Join Date: Feb 2008
Posts: 5
Very promising awk script ...

Thanks otheus!
Nothing quite like a one-line cryptic awk script from a guru ... with a few minor typo corrections it shows excellent promise ... trying it with the giant files and the real data is going to need to wait for tomorrow. SWEET! (I'll post back here with some timing results.)

And thanks to the other folks who replied - this is indeed an incredible resource!

Quote:
# FNR is the record number within the current input file - it is reset
# when the next FILE is started, but NR is the total number of records
# processed so far and it is not reset ... so NR==FNR is true only while
# the first file is being read. The first line therefore creates
# an associative array out of the lines in the first input file and marks
# them with a value of "1". Then the second line examines
# the lines in the second file and sets the value to zero if the line is there.
# Thus - by the time it finishes, only those lines in file A but NOT in
# file B will have a value of "1". And then we print those lines (the array keys).

awk ' NR==FNR { A[$0]=1; next; }
{ if ($0 in A) { A[$0]=0; } }
END { for (k in A) { if (A[k]==1) { print k; } } } ' "$FILE1" "$FILE2"

-----------------------------------------------------------

The output from a test run follows:

FILE1:
1
2
3
4
5

FILE2:
5
3
1

AND THE OUTPUT IS:
4
2
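
One property of the run above: `for (k in A)` visits the array keys in an unspecified order, which is why 4 is printed before 2. If the original order of file A matters, one possible variation (a sketch, not something posted in the thread) is to read file A a second time instead of looping over the array in END. The names `only_in_a_ordered` and `pass` are hypothetical, and the sketch assumes file A is a regular file that can be read twice and that neither path contains an `=` character.

```shell
#!/usr/bin/env bash
# Sketch: same hash idea, but output keeps the line order of file A.
# only_in_a_ordered is a hypothetical helper name: only_in_a_ordered FILE_A FILE_B

only_in_a_ordered() {
    awk '
        pass == 1 { A[$0] = 1; next }                 # first read of A: mark every line
        pass == 2 { if ($0 in A) A[$0] = 0; next }    # read of B: unmark lines also in B
        A[$0] == 1                                    # second read of A: print still-marked lines
    ' pass=1 "$1" pass=2 "$2" pass=3 "$1"
}

# Usage example with the same test data as above.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$tmp/FILE1"
printf '%s\n' 5 3 1 > "$tmp/FILE2"
only_in_a_ordered "$tmp/FILE1" "$tmp/FILE2"
rm -rf "$tmp"
```

The `pass=N` operands between the file names are ordinary awk variable assignments, applied when awk reaches them in the argument list. Memory use stays the same as in the script above, since only the lines of file A are stored.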

#7
02-19-2008
vino (Forum Staff)
Supporter (in vino veritas)
Join Date: Feb 2005
Location: Bangalore, India
Posts: 2,796
Code:
# Patterns come from fileB; -x matches whole lines only, -F treats them as fixed strings.
grep -v -x -F -f fileB fileA > output.txt
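
For the task in this thread, the pattern file given to `-f` would be file B (the lines to exclude) and the input would be file A, with `-x` and `-F` added so that only exact whole-line matches count. The following is a small sketch of that; `only_in_a_grep` is a hypothetical helper name, and how well grep copes with a 10M-line pattern file was not measured in the thread.

```shell
#!/usr/bin/env bash
# Sketch: exact whole-line exclusion with grep.
# only_in_a_grep is a hypothetical helper name: only_in_a_grep FILE_A FILE_B

only_in_a_grep() {
    # -f "$2": read patterns (lines to exclude) from file B
    # -F: patterns are fixed strings, so "a.c" does not match "abc"
    # -x: a pattern must match the whole line, so "1" does not match "10"
    # -v: print the lines of file A that match none of the patterns
    grep -v -x -F -f "$2" "$1"
}

# Usage example: only the line "1" of A is identical to a line of B.
tmp=$(mktemp -d)
printf '%s\n' 1 10 2 abc > "$tmp/A"
printf '%s\n' 1 a.c > "$tmp/B"
only_in_a_grep "$tmp/A" "$tmp/B"
rm -rf "$tmp"
```

Note that grep exits with status 1 when no lines are selected, which matters in scripts run with `set -e`.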