Comparing two huge files


# 8  
Old 09-03-2008
You don't need to touch file B.

Here's how I'd do it... I think it should be very quick.

osscl1head01 1447>cat fileA
// 223 missing
223,Jan,ee,bla,bla

// data not found
254-11,Jan,ee,bla,bla

// data rejected
214-1,Jan,ee,bla,bla
osscl1head01 1448>cat fileB
aaaa,bbbb,ccc,dddd,20054-11,fff,ggg...
aaaa,bbbb,ccc,dddd,254-11,fff,ggg...
aaaa,bbbb,ccc,dddd,2545456-1,fff,ggg...
osscl1head01 1449>grep . fileA | grep -v / | awk -F, '{print $1}' > fileC
osscl1head01 1450>cat fileC
223
254-11
214-1
osscl1head01 1451>join -1 1 -2 5 -t, fileC fileB > fileD
osscl1head01 1452>cat fileD
254-11,aaaa,bbbb,ccc,dddd,fff,ggg...
osscl1head01 1453>


EDIT: You need to sort both input files on the join identifier, but that *should* be straightforward enough.
sort +4 -t, fileB > fileBsorted
sort fileC > fileCsorted
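
If your sort doesn't accept the old +4 syntax, the -k form (restricting the key to just the identifier field, which is field 5 of fileB in the sample above) should do the same job:

sort -t, -k5,5 fileB > fileBsorted
sort fileC > fileCsorted

It's probably also worth setting LC_ALL=C for both sorts and the join so they agree on the collating order.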

You can probably use awk to repair the structure of fileD if that is important.

Last edited by Digby; 09-03-2008 at 07:56 AM..
# 9  
Old 09-03-2008
Hey dude,
I appreciate your help. Once you've found that the ID '254-11' is common to both files, you need to grep file A to get the output below: the current line and the previous line of the match. You are joining the fields of the matched records from both files, and that is not what I need.

output:
// data not found ---- (previous line in file A)
254-11,Jan,ee,bla,bla ---- (current line in file A)

I am not touching file B after finding the match using comm.
# 10  
Old 09-03-2008
I realize that's what you want to do, but with these large files grepping every query against every line isn't feasible (at least for my files). Even with a modest number of queries it is painfully slow.

I am joining the matching lines in the two files, but since fileC only contains the field to be matched, the line that comes out is effectively the fileB line. Note that join only outputs lines that match, so it is what you (and I) need (I think).

The only problem is that the fields of the fileB line have been rearranged.
awk -F, -v OFS=, '{print $2,$3,$4,$5,$1,$6....}' could sort this out (without OFS set, the fields would come out space-separated). If you've got a very large number of fields in file B then I guess a more generic perl or sed command could come in handy, but I don't know exactly how to write it.
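
For what it's worth, one generic way to do it is an awk loop (an untested sketch rather than perl/sed; it assumes the identifier originally sat in column 5 of fileB as in the sample data, and fileE is just a name for the repaired output):

awk -F, -v OFS=, '{
    key = $1                              # join put the identifier first
    line = ""
    for (i = 2; i <= NF; i++) {
        line = line $i OFS
        if (i == 5) line = line key OFS   # splice the key back in as field 5
    }
    sub(/,$/, "", line)                   # drop the trailing separator
    print line
}' fileD > fileE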

If this wouldn't result in your required output, then I'm afraid I'm misunderstanding your problem.

Last edited by Digby; 09-03-2008 at 07:01 AM..
# 11  
Old 09-03-2008
Sorry dude, I just reread your post and realised what output you're looking for.
You want the final output to be from file A.

Could you convert it to a single line format and then use join?

daisy 1860>perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' | awk -F: '{print $2":"$1}' | sort > fileAA
daisy 1861>awk -F, '{print $5}' fileB | sort > fileBB
daisy 1862>join -1 1 -2 1 -t, fileAA fileBB | awk -F: '{print $2":"$1}' | perl -pe 's/:/\n/' | perl -pe 's/^\//\n\//'

// data not found
254-11,Jan,ee,bla,bla
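
In case it helps, here is roughly what each stage above does (the same commands, just annotated; the file layouts are assumed to match the samples earlier in the thread):

# 1. Flatten fileA so each comment/data pair becomes one line of the form
#    "data-line:comment-line", sorted so the identifier leads.
perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' | awk -F: '{print $2":"$1}' | sort > fileAA

# 2. Pull the identifier (5th comma-separated field) out of fileB and sort it.
awk -F, '{print $5}' fileB | sort > fileBB

# 3. join keeps only the identifiers present in both files, then the trailing
#    awk/perl stages put the comment line back in front of its data line.
join -1 1 -2 1 -t, fileAA fileBB | awk -F: '{print $2":"$1}' | perl -pe 's/:/\n/' | perl -pe 's/^\//\n\//'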

Last edited by Digby; 09-03-2008 at 08:27 AM..
# 12  
Old 09-03-2008
It worked great. Awesome, dude. You are really great, hats off to your weighty brain. I am your fan from today, trust me...