07-23-2007
Problem comparing 2 files with a lot of data
Hello everyone, here's the scenario:
I have two files, each with around 1,300,000 lines, and each line has a single column (a phone number). I have to get the phone numbers that are in file1 but not in file2. I could get them through Oracle, but my boss does not want that (he said the query would take hours to finish and would tie up server resources, or something like that), so he gave me the files with the phone numbers instead.
First I tried to solve the problem with some Perl scripting, but it took about 10 minutes just to read the files, and because of my poor programming skills I tried to do the search with a double foreach, something like this:
@file1 = <SOME1>;
@file2 = <SOME2>;
chomp(@file1); # strip newlines so rows compare cleanly
chomp(@file2);
$n = 0;
$flag = 1; # if $flag is 0 then the element is in file2
foreach $row1 (@file1)
{
    foreach $row2 (@file2)
    {
        if ($row1 eq $row2) # eq compares as strings, so leading zeros count
        {
            $flag = 0;
        }
    }
    if ($flag)
    {
        $anArray[$n] = $row1;
        $n++;
    }
    $flag = 1;
}
if ($n > 0)
{
    foreach $row3 (@anArray)
    {
        print OUT_FILE "$row3\n";
    }
}
The data from the files is like this:
FILE1
----------------------------
1234567890
0987654321
2345678901
9012345678
FILE2
----------------------------
1234567890
0987654321
2345678901
OUT_FILE must be
----------------------------
9012345678
But this solution will take ages to finish, so now I am thinking of using awk or another language, but I really don't know which one is better for this problem or what algorithm I should use (besides, I have never used awk or shell scripting; I'm new to UNIX). I was thinking of sorting the files and then doing a binary search, but I have some doubts about it, so I feel really lost now.
Thanks for your help
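One possible way to avoid the double loop, sketched here under assumptions rather than as a definitive answer: sort both files once and let comm(1) walk them in a single merge pass. The names file1, file2, file1.sorted, file2.sorted and out_file are placeholders, the data is the small sample from the post, and LC_ALL=C is set only so that sort and comm agree on the ordering.

```shell
#!/bin/sh
# Sketch: print the lines of file1 that do not appear in file2.
# File names and sample data are placeholders for illustration.
set -eu
export LC_ALL=C   # sort and comm must use the same collating sequence

workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' 1234567890 0987654321 2345678901 9012345678 > file1
printf '%s\n' 1234567890 0987654321 2345678901 > file2

# comm requires sorted input; -u also drops duplicate numbers.
sort -u file1 > file1.sorted
sort -u file2 > file2.sorted

# -2 hides lines found only in file2, -3 hides lines found in both,
# so only the lines unique to file1 remain.
comm -23 file1.sorted file2.sorted > out_file

cat out_file
```

With two files of about 1,300,000 lines each, the nested loop performs roughly 1.7 trillion comparisons, while this approach sorts each file once and then reads each sorted file a single time. The output comes out in sorted order, not in the original order of file1.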
comm(1) User Commands comm(1)
NAME
comm - select or reject lines common to two files
SYNOPSIS
comm [-123] file1 file2
DESCRIPTION
The comm utility reads file1 and file2, which must be ordered in the current collating sequence, and produces three text columns as output:
lines only in file1; lines only in file2; and lines in both files.
If the input files were ordered according to the collating sequence of the current locale, the lines written will be in the collating
sequence of the original lines. If not, the results are unspecified.
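As an illustration of the three-column layout described above, here is a minimal sketch on two tiny made-up inputs. The file names left and right are hypothetical, and LC_ALL=C is assumed so that the inputs are already in collating order.

```shell
#!/bin/sh
# Sketch: comm's default three-column output on tiny made-up inputs.
set -eu
export LC_ALL=C

workdir=$(mktemp -d)
printf '%s\n' a b c > "$workdir/left"
printf '%s\n' b c d > "$workdir/right"

# Column 1 (no indent):  lines only in left.
# Column 2 (one tab):    lines only in right.
# Column 3 (two tabs):   lines in both files.
result=$(comm "$workdir/left" "$workdir/right")
printf '%s\n' "$result"
```

Here a lands in the first column, d in the second (preceded by one tab), and b and c in the third (preceded by two tabs).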
OPTIONS
The following options are supported:
-1 Suppresses the output column of lines unique to file1.
-2 Suppresses the output column of lines unique to file2.
-3 Suppresses the output column of lines duplicated in file1 and file2.
OPERANDS
The following operands are supported:
file1 A path name of the first file to be compared. If file1 is -, the standard input is used.
file2 A path name of the second file to be compared. If file2 is -, the standard input is used.
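The - operand makes it possible to feed one side from a pipeline, for example when that input is not yet sorted. A minimal sketch, assuming hypothetical files named unsorted and sorted:

```shell
#!/bin/sh
# Sketch: supply comm's first operand from a pipe by passing "-".
set -eu
export LC_ALL=C

workdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$workdir/unsorted"
printf '%s\n' apple fig > "$workdir/sorted"

# sort writes to the pipe; "-" tells comm to read file1 from standard input.
result=$(sort "$workdir/unsorted" | comm -23 - "$workdir/sorted")
printf '%s\n' "$result"
```

Only one of the two operands can usefully be -, since there is a single standard input.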
USAGE
See largefile(5) for the description of the behavior of comm when encountering files greater than or equal to 2 Gbyte ( 2**31 bytes).
EXAMPLES
Example 1: Printing a list of utilities specified by files
If file1, file2, and file3 each contain a sorted list of utilities, the command
example% comm -23 file1 file2 | comm -23 - file3
prints a list of utilities in file1 not specified by either of the other files. The entry:
example% comm -12 file1 file2 | comm -12 - file3
prints a list of utilities specified by all three files. And the entry:
example% comm -12 file2 file3 | comm -23 - file1
prints a list of utilities specified by both file2 and file3, but not specified in file1.
ENVIRONMENT VARIABLES
See environ(5) for descriptions of the following environment variables that affect the execution of comm: LANG, LC_ALL, LC_COLLATE,
LC_CTYPE, LC_MESSAGES, and NLSPATH.
EXIT STATUS
The following exit values are returned:
0 All input files were successfully output as specified.
>0 An error occurred.
ATTRIBUTES
See attributes(5) for descriptions of the following attributes:
+-----------------------------+-----------------------------+
| ATTRIBUTE TYPE | ATTRIBUTE VALUE |
+-----------------------------+-----------------------------+
|Availability |SUNWesu |
+-----------------------------+-----------------------------+
|CSI |enabled |
+-----------------------------+-----------------------------+
|Interface Stability |Standard |
+-----------------------------+-----------------------------+
SEE ALSO
cmp(1), diff(1), sort(1), uniq(1), attributes(5), environ(5), largefile(5), standards(5)
SunOS 5.10 3 Mar 2004 comm(1)