Comparing two huge files


# 1  
Old 09-03-2008
Comparing two huge files

Hi,

I have two files, File A and File B. File A is an error file and File B is the source file. In the error file, the first line is the actual error and the second line gives information about the record (client ID) that threw the error. I need to compare the first field of file A (taken from the lines that do not start with '//') with the fifth field of file B. If the field values in file A and file B match, I need to write the record to an output file as below.

File A
// 223 missing
223,Jan,ee,bla,bla

// data not found
254-11,Jan,ee,bla,bla

// data rejected
214-1,Jan,ee,bla,bla

File B
aaaa,bbbb,ccc,dddd,20054-11,fff,ggg...
aaaa,bbbb,ccc,dddd,254-11,fff,ggg...
aaaa,bbbb,ccc,dddd,2545456-1,fff,ggg...

output:
// data not found
254-11,Jan,ee,bla,bla


If the first field of File A matches the fifth field of File B (254-11 in this example), then I need to write the records from file A (the current line and the previous line) to an output file as above.

I could achieve this very easily using awk and grep inside a loop. The problem is that the files are huge: there are nearly 1 million records in each file, and the script runs for 3-4 hours. I would appreciate it if someone could help me with better logic or a better script that could complete the task in a few minutes.

Note: File A and File B are in exactly the format shown above. Watch out for the blank lines in file A and for the client ID format, which can be 000, 000-0 or 000-00.
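
A one-pass awk along these lines might do the whole job without any intermediate files (file names here are just placeholders, and this is an untested sketch): load field 5 of every File B record into an array, then read File A once, remembering the previous line so the '//' comment can be printed together with a matching record.

awk -F, '
    NR == FNR { ids[$5]; next }                   # first file (file B): remember every field-5 value
    /^\/\//   { prev = $0; next }                 # file A: hold the error/comment line
    NF && ($1 in ids) { print prev; print $0 }    # file A: data line whose field 1 matches
' fileB fileA > output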
# 2  
Old 09-03-2008
For comparison:
comm file1 file2

For differences:
diff file1 file2
# 3  
Old 09-03-2008
Hi Rahul,

Thank you for your reply. The comm command gives the wrong output with the above files because I need to compare field 1 of file A with field 5 of file B and output the current and previous line from file A.

comm compares the files line by line, and none of the whole lines will ever match; only field 1 of file A and field 5 of file B do.

Regards,
Mahesh k
# 4  
Old 09-03-2008
By chance, I came on here with exactly the same problem.

I think that join may come in useful here:
> join -1 1 -2 5 -t, $fileA $fileB > $requiredFile

The thing is, the files need to be sorted on the fields you plan to join... and I can't sort my files :-s.

Don't know if this is of any use.
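
If making sorted copies is an option, the usual pattern is something like the following (untested sketch; it assumes the '//' comment lines and blank lines have already been stripped from file A, and note that join on its own will not reproduce the previous-line output this thread needs):

grep -v '^//' fileA | grep -v '^$' | sort -t, -k1,1 > fileA.sorted
sort -t, -k5,5 fileB > fileB.sorted
join -1 1 -2 5 -t, fileA.sorted fileB.sorted > matched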
# 5  
Old 09-03-2008
Hi,

Thank you for your suggestion. In file A, the actual record is the second line, with fields separated by ','. The first line is the error message, where the words are separated by spaces, so I cannot simply say "first field" because the fields are not consistent between the two lines.

Regards,
Mahesh K
# 6  
Old 09-03-2008
Can you extract the required identifiers into a different file with awk and/or grep?
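
For example, something like this might do it (the file names are only placeholders; fileC and fileD echo what the next post describes):

awk -F, '!/^\/\// && NF { print $1 }' fileA > fileC     # field 1 of the data lines in file A
awk -F, '{ print $5 }' fileB > fileD                    # field 5 of every line in file B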
# 7  
Old 09-03-2008
Yes buddy, I can do that. I have extracted field 1 to file C and field 5 to file D, then used the comm command to compare them and find the exact matches. That part runs in just a few seconds.

The output file is also quite big. I count the number of lines in the output file, run a while loop, take the IDs from the output file one by one, grep file A for each, and generate the exact output.

This is my problem: the above task runs for 2-3 hours because of the big loop. I don't know how to overcome this and optimize my script.
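
One way around the per-ID grep, assuming the matched IDs end up one per line in a file (called matched_ids here purely for illustration), is to load them into an awk array and make a single pass over file A instead of looping:

awk -F, '
    NR == FNR { want[$1]; next }                  # first file: the matched IDs, one per line
    /^\/\//   { prev = $0; next }                 # file A: remember the comment line
    NF && ($1 in want) { print prev; print $0 }   # file A: print comment + matching record
' matched_ids fileA > final_output

This replaces the 2-3 hour loop with a single read of each file, which should finish in seconds even on a million records.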