Comparing two huge files


# 8  
Old 09-03-2008
You don't need to touch file B.

Here's how I'd do it... I think it should be very quick.

osscl1head01 1447>cat fileA
// 223 missing
223,Jan,ee,bla,bla

// data not found
254-11,Jan,ee,bla,bla

// data rejected
214-1,Jan,ee,bla,bla
osscl1head01 1448>cat fileB
aaaa,bbbb,ccc,dddd,20054-11,fff,ggg...
aaaa,bbbb,ccc,dddd,254-11,fff,ggg...
aaaa,bbbb,ccc,dddd,2545456-1,fff,ggg...
osscl1head01 1449>grep . fileA | grep -v / | awk -F, '{print $1}' > fileC
osscl1head01 1450>cat fileC
223
254-11
214-1
osscl1head01 1451>join -1 1 -2 5 -t, fileC fileB > fileD
osscl1head01 1452>cat fileD
254-11,aaaa,bbbb,ccc,dddd,fff,ggg...
osscl1head01 1453>


EDIT: You need to sort both input files on the join identifier, but that *should* be straightforward enough.
sort +4 -t, fileB > fileBsorted
sort fileC > fileCsorted
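
If your sort doesn't accept the old +4 syntax, the -k form (restricting the key to just the identifier field, which is field 5 of fileB in the sample above) should do the same job:

sort -t, -k5,5 fileB > fileBsorted
sort fileC > fileCsorted

It's probably also worth setting LC_ALL=C for both sorts and the join so they agree on the collating order.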

You can probably use awk to repair the structure of fileD if that is important.

Last edited by Digby; 09-03-2008 at 07:56 AM..
# 9  
Old 09-03-2008
Hey dude,
I appreciate your help. Once you've found that the ID '254-11' is common to both files, you need to grep file A to get the output below: the current line and the previous line of the match. You are joining the fields of the matched records from both files, and that is not what I need.

output:
// data not found ---- (previous line in file A)
254-11,Jan,ee,bla,bla ---- (current line in file A)

I am not touching file B after finding the match using comm.
# 10  
Old 09-03-2008
I realize that's what you want to do, but with these large files grepping every query against every line isn't feasible (at least for my files). Even with a modest number of queries it is painfully slow.

I am joining the matching lines in the two files, but since fileC only contains the field to be matched, the line that comes out is effectively the fileB line. Note that join only outputs lines that match, so it is what you (and I) need (I think).

The only problem is that the fields of the fileB line have been rearranged.
awk -F, -v OFS=, '{print $2,$3,$4,$5,$1,$6....}' could sort this out (without OFS set, the fields would come out space-separated). If you've got a very large number of fields in file B then I guess a more generic perl or sed command could come in handy, but I don't know exactly how to write it.
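
For what it's worth, one generic way to do it is an awk loop (an untested sketch rather than perl/sed; it assumes the identifier originally sat in column 5 of fileB as in the sample data, and fileE is just a name for the repaired output):

awk -F, -v OFS=, '{
    key = $1                              # join put the identifier first
    line = ""
    for (i = 2; i <= NF; i++) {
        line = line $i OFS
        if (i == 5) line = line key OFS   # splice the key back in as field 5
    }
    sub(/,$/, "", line)                   # drop the trailing separator
    print line
}' fileD > fileE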

If this wouldn't result in your required output, then I'm afraid I'm misunderstanding your problem.

Last edited by Digby; 09-03-2008 at 07:01 AM..
# 11  
Old 09-03-2008
Sorry dude, I just reread your post and realised what output you're looking for.
You want the final output to be from file A.

Could you convert it to a single line format and then use join?

daisy 1860>perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' | awk -F: '{print $2":"$1}' | sort > fileAA
daisy 1861>awk -F, '{print $5}' fileB | sort > fileBB
daisy 1862>join -1 1 -2 1 -t, fileAA fileBB | awk -F: '{print $2":"$1}' | perl -pe 's/:/\n/' | perl -pe 's/^\//\n\//'

// data not found
254-11,Jan,ee,bla,bla
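
In case it helps, here is roughly what each stage above does (the same commands, just annotated; the file layouts are assumed to match the samples earlier in the thread):

# 1. Flatten fileA so each comment/data pair becomes one line of the form
#    "data-line:comment-line", sorted so the identifier leads.
perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' | awk -F: '{print $2":"$1}' | sort > fileAA

# 2. Pull the identifier (5th comma-separated field) out of fileB and sort it.
awk -F, '{print $5}' fileB | sort > fileBB

# 3. join keeps only the identifiers present in both files, then the trailing
#    awk/perl stages put the comment line back in front of its data line.
join -1 1 -2 1 -t, fileAA fileBB | awk -F: '{print $2":"$1}' | perl -pe 's/:/\n/' | perl -pe 's/^\//\n\//'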

Last edited by Digby; 09-03-2008 at 08:27 AM..
# 12  
Old 09-03-2008
It worked great. Awesome, dude. You are really great, hats off to your weighty brain. I am your fan from today, trust me...