Problem comparing 2 files with lot of data


 
# 1  
Old 07-23-2007

Hello everyone, here's the scenario:

I have two files, each with around 1,300,000 lines, and each line holds a single column (a phone number). I need the phone numbers that are in file1 but not in file2. I could get them through Oracle, but my boss does not want that, so he gave me the files with the phone numbers instead (he said the query would take hours to finish and would eat into the server's resources, or something like that).

First I tried to solve the problem with some Perl scripting, but it took about 10 minutes just to read the files, and because of my poor programming skills I tried to do the search with a double foreach, something like this:

# SOME1, SOME2 and OUT_FILE are filehandles opened earlier (not shown)
chomp(@file1 = <SOME1>);
chomp(@file2 = <SOME2>);
$n = 0;
$flag = 1; # if $flag is 0 then the element is in file2

foreach $row1 (@file1)
{
    foreach $row2 (@file2)
    {
        if ($row1 eq $row2) # string comparison, so leading zeros are kept
        {
            $flag = 0;
            last; # no need to scan the rest of file2
        }
    }
    if ($flag)
    {
        $anArray[$n] = $row1;
        $n++;
    }
    $flag = 1;
}

if ($n > 0)
{
    foreach $row3 (@anArray)
    {
        print OUT_FILE "$row3\n";
    }
}



The data from the files is like this:


FILE1
----------------------------
1234567890
0987654321
2345678901
9012345678


FILE2
----------------------------
1234567890
0987654321
2345678901


OUT_FILE must be
----------------------------
9012345678



But this solution will take ages to finish, so now I am thinking of using awk or another language. I really don't know which one is better for this problem or what algorithm I should use (besides, I have never used awk or shell scripting; I'm new to UNIX). I was thinking of sorting the files and then doing a binary search, but I have some doubts about it, so I feel really lost now.

Thanks for your help
# 2  
Old 07-24-2007
1. sort both files using "sort".

2. then use "diff" to show the differences.
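
A minimal sketch of those two steps, assuming the sample data from the first post. The scratch directory and the variable name only_in_file1 are just illustrative. In diff's default output, lines prefixed with "< " exist only in the first file, so those are the ones to keep.

```shell
#!/bin/sh
# Sample data from the first post, written to a scratch directory (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 1234567890 0987654321 2345678901 9012345678 > "$tmp/file1"
printf '%s\n' 1234567890 0987654321 2345678901 > "$tmp/file2"

# Step 1: sort both files.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# Step 2: diff them and keep only the lines marked "< " (present only in file1).
# diff itself exits non-zero when the files differ, which is expected here.
only_in_file1=$(diff "$tmp/file1.sorted" "$tmp/file2.sorted" | sed -n 's/^< //p')
echo "$only_in_file1"

rm -rf "$tmp"
```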
# 3  
Old 07-24-2007
Hope this helps

You can try something like the below.

your_path is a directory with enough free space for the temporary files that sort creates while working on your huge files.

Code:
cat FILE1 FILE2 | sort -T your_path | uniq -u 

(or) to avoid the useless use of cat (UUOC)

sort -T your_path FILE1 FILE2 | uniq -u
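
One thing to keep in mind: uniq -u prints every line that occurs exactly once in the combined input, so it would also print numbers that are only in FILE2, and it assumes neither file repeats a number internally. A possible workaround, sketched below, is to list FILE2 twice so that none of its lines can ever be unique. The extra number 5555555555 is made up for the demonstration, and -T is left out because the sample is tiny.

```shell
#!/bin/sh
# Sample data from the first post, plus one made-up number that is only in FILE2.
tmp=$(mktemp -d)
printf '%s\n' 1234567890 0987654321 2345678901 9012345678 > "$tmp/FILE1"
printf '%s\n' 1234567890 0987654321 2345678901 5555555555 > "$tmp/FILE2"

# Plain uniq -u: numbers that are in exactly one of the two files.
symmetric=$(sort "$tmp/FILE1" "$tmp/FILE2" | uniq -u)

# FILE2 listed twice: every FILE2 line now appears at least twice,
# so only numbers found in FILE1 alone survive uniq -u.
only_in_file1=$(sort "$tmp/FILE1" "$tmp/FILE2" "$tmp/FILE2" | uniq -u)

echo "either file only:"
echo "$symmetric"
echo "FILE1 only:"
echo "$only_in_file1"

rm -rf "$tmp"
```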

# 4  
Old 07-24-2007
Try grep..
Code:
$ head file[12]
==> file1 <==
1234567890
0987654321
2345678901
9012345678

==> file2 <==
1234567890
0987654321
2345678901
$ grep -v -f file2 file1
9012345678

Not sure of the performance on large files. I think an Oracle SQL query would be better, e.g. select num from tab1 where num not in (select num from tab2)
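
A small variation that may be worth trying, shown with the same sample data: without extra options grep treats each line of file2 as a regular expression and matches it anywhere in a line, so a shorter number would also knock out longer numbers that contain it. Adding -F (fixed strings) and -x (whole-line match) avoids that, and fixed-string matching is often faster with a big pattern file, although I have not measured it on 1,300,000 lines. The scratch directory and the variable name result are illustrative.

```shell
#!/bin/sh
# Sample data from the first post, written to a scratch directory (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 1234567890 0987654321 2345678901 9012345678 > "$tmp/file1"
printf '%s\n' 1234567890 0987654321 2345678901 > "$tmp/file2"

# -f file2 : read the patterns from file2
# -F       : treat them as fixed strings, not regular expressions
# -x       : a pattern must match the whole line
# -v       : print the lines of file1 that match none of them
result=$(grep -v -x -F -f "$tmp/file2" "$tmp/file1")
echo "$result"

rm -rf "$tmp"
```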
# 5  
Old 07-25-2007
Solution

Thanks to porter, lorcan and ygor.

Porter: your solution was not feasible because it needs a lot of memory (diff uses 6 times the size of the file in memory), and bdiff didn't give exact results because of the fragmentation of the files.

Lorcan: your solution worked great; it takes a few minutes.

Ygor: after the problem with diff I was afraid to use grep, so I didn't try it.


The solution I found was this: first I had to strip all the blank spaces out with awk, then I sorted the files, and then I used comm to get the differences. comm worked great and got the results in a few seconds.
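
Roughly, the steps look like the sketch below. This is a reconstruction under assumptions rather than the exact commands that were run: the stray blanks in the sample data, the scratch directory and the .sorted file names are all illustrative. comm -23 hides the lines unique to the second file and the lines common to both, which leaves only the numbers found in file1 alone.

```shell
#!/bin/sh
# Use one fixed collation so sort and comm agree on the ordering.
LC_ALL=C
export LC_ALL

# Sample data from the first post, with some stray blanks added for the demo.
tmp=$(mktemp -d)
printf '%s\n' '1234567890 ' ' 0987654321' '2345678901' '9012345678  ' > "$tmp/file1"
printf '%s\n' '1234567890' '0987654321 ' '2345678901' > "$tmp/file2"

# Steps 1 and 2: strip spaces and tabs with awk, then sort each file.
awk '{ gsub(/[ \t]/, ""); print }' "$tmp/file1" | sort > "$tmp/file1.sorted"
awk '{ gsub(/[ \t]/, ""); print }' "$tmp/file2" | sort > "$tmp/file2.sorted"

# Step 3: comm needs sorted input; -2 and -3 suppress the "only in file2"
# and "in both" columns, leaving the lines that are only in file1.
result=$(comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
echo "$result"

rm -rf "$tmp"
```

Since comm only walks both sorted files once, nearly all of the time goes into the two sort runs.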

Thanks again for your help.