Three Different Files: Huge Data Comparison Problem
I have three different files:
Part of File 1
Code:
ARTPHDFGAA
.
.
Part of File 2
Code:
ARTGHHYESA
.
.
Part of File 3
Code:
ARTPOLYWEA
.
.
Does anybody have an idea how to produce the results below and write each one to a separate file?
1) Data content shared among file 1, file 2 and file 3
Desired result file content
Code:
ART      A
.
.
2) Data content shared between file 1 and file 2
Desired result file content
Code:
ART H    A
.
.
3) Data content shared between file 1 and file 3
Desired result file content
Code:
ARTP     A
.
.
4) Data content shared between file 2 and file 3
Desired result file content
Code:
ART   Y  A
.
.
5) Data content found only in file 1, not in file 2 or file 3
Desired result file content
Code:
     DFGA
.
.
6) Data content found only in file 2, not in file 1 or file 3
Desired result file content
Code:
   G H ES
.
.
7) Data content found only in file 3, not in file 1 or file 2
Desired result file content
Code:
    OL WE
.
.
"." stands for the rest of the file content, a long list of letters (e.g. ASDASDASFJKJETET.....).
File 1, file 2 and file 3 are all exactly the same size: 110 MB, i.e. about 110 million letters in each file.
The three files differ only in some of their contents.
My aim is simply to compare the data content of the three files and find the content common to all three files, the content unique to each file, and so on.
Thanks a lot for any advice or comments on how to solve each of these conditions.
Last edited by patrick87; 10-22-2010 at 01:45 PM..
Please state which operating system and version you have.
Please state your preferred shell.
Please list the data processing tools you have available. We note that you have perl and awk. Do you also have a high-level programming language, or are you trying to write this system in unix shell?
Thirdly and most importantly:
Exactly how big are the files?
Are they fixed length records in standard unix text file format?
Does the full-stop appear in the data?
Do you have a larger sample (say 20 lines per file) of representative data?
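For what it's worth, the questions about file size and fixed-length records can be answered with a few lines of Perl. This is only a sketch: the script name `check_files.pl` and the helper `summarise` are made up for illustration, and it assumes the data is newline-terminated text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Summarise one file: size in bytes, number of lines, and the distinct
# line lengths seen (a single length means fixed-length records).
# "summarise" is a hypothetical helper name, not part of any library.
sub summarise {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $lines = 0;
    my %lengths;
    while (my $line = <$fh>) {
        chomp $line;
        $lines++;
        $lengths{ length $line }++;
    }
    close($fh) or die "Can't close $path: $!";
    return {
        bytes   => (-s $path),
        lines   => $lines,
        lengths => [ sort { $a <=> $b } keys %lengths ],
    };
}

# Report on every file named on the command line, e.g.
#   perl check_files.pl file1 file2 file3
for my $path (@ARGV) {
    my $s = summarise($path);
    printf "%s: %d bytes, %d lines, line lengths: %s\n",
        $path, $s->{bytes}, $s->{lines}, join(",", @{ $s->{lengths} });
}
```

If all three files report the same byte count, the same line count and a single line length, a position-by-position comparison is straightforward.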
$
$
$ # show the contents of file1, file2 and file3
$
$ cat -n file1
1 ARTPHDFGAA
2 DKDXCSIKER
3 QQELKRNIKJ
4 OJUUXGFBVP
$
$ cat -n file2
1 ARTGHHYESA
2 AGCBCHBCRB
3 CWWENBYITN
4 WMVXVPNANW
$
$ cat -n file3
1 ARTPOLYWEA
2 PXCSMTWUND
3 MNBLYALUUO
4 XPRAYHLPHT
$
$ # show the content of the Perl program
$
$ cat string_operations.pl
#!/usr/bin/perl -w
use strict;

sub print_legend {
    print "
LEGEND =>
(1) Characters common in file1, file2, file3.
(2) Characters common in file1 and file2.
(3) Characters common in file2 and file3.
(4) Characters common in file3 and file1.
(5) Characters in file1 that are absent from file2 and file3.
(6) Characters in file2 that are absent from file3 and file1.
(7) Characters in file3 that are absent from file1 and file2.
";
}

sub common_all {
    my ($n, $x1, $x2, $x3) = @_;                # load all arguments to work on
    print "\n", "=" x 20, " Line no. $n\n";     # print something nice
    print "(1) ";
    for (my $j = 0; $j < length($x1); $j++) {   # walk through the 1st string
        if (substr($x1, $j, 1) eq substr($x2, $j, 1) &&    # if 1st and 2nd string have identical character
            substr($x2, $j, 1) eq substr($x3, $j, 1)) {    # that is identical to that of 3rd string, then
            print substr($x1, $j, 1);                      # print it
        } else {
            print " ";                                     # otherwise, print a blank space
        }
    }
    print "\n";
}

sub common_xy {
    my ($n, $x1, $x2) = @_;
    print "($n) ";
    for (my $j = 0; $j < length($x1); $j++) {   # walk through the 1st string
        if (substr($x1, $j, 1) eq substr($x2, $j, 1)) {    # if 1st and 2nd string have identical character
            print substr($x1, $j, 1);                      # then print it
        } else {
            print " ";                                     # otherwise, print a blank space
        }
    }
    print "\n";
}

sub in_x_not_in_yz {
    my ($n, $x1, $x2, $x3) = @_;
    print "($n) ";
    for (my $j = 0; $j < length($x1); $j++) {   # walk through the 1st string
        if (substr($x1, $j, 1) ne substr($x2, $j, 1) &&    # if current character differs from the 2nd string here
            substr($x1, $j, 1) ne substr($x3, $j, 1)) {    # and from the 3rd string as well, then
            print substr($x1, $j, 1);                      # print it
        } else {
            print " ";                                     # otherwise, print a blank space
        }
    }
    print "\n";
}

## Main program starts here
print_legend();

# Open the 3 files and load data into 3 arrays
open(my $f1, "<", "file1") or die "Can't open file1: $!";
chomp(my @a1 = <$f1>);
close($f1) or die "Can't close file1: $!";
open(my $f2, "<", "file2") or die "Can't open file2: $!";
chomp(my @a2 = <$f2>);
close($f2) or die "Can't close file2: $!";
open(my $f3, "<", "file3") or die "Can't open file3: $!";
chomp(my @a3 = <$f3>);
close($f3) or die "Can't close file3: $!";

# Start processing the arrays now
for (my $i = 0; $i <= $#a1; $i++) {
    common_all($i + 1, $a1[$i], $a2[$i], $a3[$i]);      # Common in all three
    common_xy(2, $a1[$i], $a2[$i]);                     # Common in file1 and file2
    common_xy(3, $a2[$i], $a3[$i]);                     # Common in file2 and file3
    common_xy(4, $a3[$i], $a1[$i]);                     # Common in file3 and file1
    in_x_not_in_yz(5, $a1[$i], $a2[$i], $a3[$i]);       # In file1 but not in file2 and file3
    in_x_not_in_yz(6, $a2[$i], $a3[$i], $a1[$i]);       # In file2 but not in file3 and file1
    in_x_not_in_yz(7, $a3[$i], $a1[$i], $a2[$i]);       # In file3 but not in file1 and file2
}
print "\n";
$
$
$ # Now run the Perl program
$
$ perl string_operations.pl
LEGEND =>
(1) Characters common in file1, file2, file3.
(2) Characters common in file1 and file2.
(3) Characters common in file2 and file3.
(4) Characters common in file3 and file1.
(5) Characters in file1 that are absent from file2 and file3.
(6) Characters in file2 that are absent from file3 and file1.
(7) Characters in file3 that are absent from file1 and file2.
==================== Line no. 1
(1) ART      A
(2) ART H    A
(3) ART   Y  A
(4) ARTP     A
(5)      DFGA
(6)    G H ES
(7)     OL WE
==================== Line no. 2
(1)
(2)     C
(3)   C
(4)
(5) DKDX SIKER
(6) AG B HBCRB
(7) PX SMTWUND
==================== Line no. 3
(1)
(2)        I
(3)
(4)    L
(5) QQE KRN KJ
(6) CWWENBY TN
(7) MNB YALUUO
==================== Line no. 4
(1)
(2)
(3)
(4)
(5) OJUUXGFBVP
(6) WMVXVPNANW
(7) XPRAYHLPHT
$
$
Hopefully, the inline script comments are self-explanatory.
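The same per-position comparison can also be written as one function that returns all seven strings for a single set of records, which makes it easier to test or to write each result to its own file. This is just a sketch and not a replacement for the script above: `compare_records` is a made-up name, the numbering follows the script's legend, and it assumes the three records have the same length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compare three equal-length records position by position and return the
# seven result strings, indexed 1..7 as in the legend of the script above.
# "compare_records" is a hypothetical helper name.
sub compare_records {
    my ($r1, $r2, $r3) = @_;
    my @out = ('') x 8;                       # index 0 is unused
    for my $j (0 .. length($r1) - 1) {
        my ($c1, $c2, $c3) = map { substr($_, $j, 1) } ($r1, $r2, $r3);
        $out[1] .= ($c1 eq $c2 && $c2 eq $c3) ? $c1 : ' ';   # common to all three
        $out[2] .= ($c1 eq $c2) ? $c1 : ' ';                 # file1 and file2
        $out[3] .= ($c2 eq $c3) ? $c2 : ' ';                 # file2 and file3
        $out[4] .= ($c3 eq $c1) ? $c3 : ' ';                 # file3 and file1
        $out[5] .= ($c1 ne $c2 && $c1 ne $c3) ? $c1 : ' ';   # only in file1
        $out[6] .= ($c2 ne $c3 && $c2 ne $c1) ? $c2 : ' ';   # only in file2
        $out[7] .= ($c3 ne $c1 && $c3 ne $c2) ? $c3 : ' ';   # only in file3
    }
    return @out;
}

# First record of each sample file from this thread.
my @res = compare_records('ARTPHDFGAA', 'ARTGHHYESA', 'ARTPOLYWEA');
print "($_) $res[$_]\n" for 1 .. 7;
```

Each returned string has the same length as the input record, so the positions line up with the original data.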
file 1 , record 1
file 2 , record 1
file 3 , record 1
then
file 1 , record 2
file 2 , record 2
file 3 , record 2
then
file 1 , record 3
file 2 , record 3
file 3 , record 3
... etc.
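Reading the files in that order (record 1 of each file, then record 2 of each file, and so on) means only one record per file has to be in memory at a time, which matters with 110 MB inputs. A rough Perl sketch of that loop follows. `each_record_triple` is a hypothetical helper name, and the sketch assumes newline-terminated records of equal length; a file consisting of one very long line would need fixed-size chunks read with `read` instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read three files in lockstep, one record (line) at a time, and pass each
# set of records to a callback together with the record number.
# Stops at the end of the shortest file and returns the number of sets read.
sub each_record_triple {
    my ($paths, $callback) = @_;
    my @fh = map {
        open(my $fh, '<', $_) or die "Can't open $_: $!";
        $fh;
    } @$paths;
    my $n = 0;
    while (1) {
        my @rec = map { scalar readline($_) } @fh;
        last if grep { !defined $_ } @rec;    # at least one file has run out
        chomp @rec;
        $callback->(++$n, @rec);
    }
    close($_) for @fh;
    return $n;
}

# Example use: perl lockstep.pl file1 file2 file3
# Prints, per record, the characters on which all three files agree.
if (@ARGV == 3) {
    each_record_triple(\@ARGV, sub {
        my ($n, $r1, $r2, $r3) = @_;
        my $common = join '', map {
            my $c = substr($r1, $_, 1);
            ($c eq substr($r2, $_, 1) && $c eq substr($r3, $_, 1)) ? $c : ' ';
        } 0 .. length($r1) - 1;
        print "$n: $common\n";
    });
}
```

The callback is the place to print each of the seven results to its own output filehandle if separate result files are wanted.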
Ps. It would really help to know what Operating System and software you have available.
Applying lateral thought, we can deduce that some software wrote these 110 MB files. Is it software written in a high-level programming language? If so, which language?
Edit: Didn't see durden_tyler post while I was typing. Try that first.
I have 2 large files (.dat) of around 70 GB each, with 12 columns, but the data is not sorted in either file. I need your input on the best optimized method/command to compare them and redirect the non-matching lines to a third file (diff.dat)
File 1 - 15 columns
File 2 - 15 columns
Data is... (9 Replies)
Hi all,
I hope you are well. I am very happy to see your contribution. I am eager to become part of it.
I have the following question. I have two huge files to compare (almost 3GB each). The files are simulation outputs. The format of the files are as below
For clear picture, please see... (9 Replies)
I’m new to Linux script and not sure how to filter out bad records from huge flat files (over 1.3GB each). The delimiter is a semi colon “;”
Here is the sample of 5 lines in the file:
Name1;phone1;address1;city1;state1;zipcode1
Name2;phone2;address2;city2;state2;zipcode2;comment... (7 Replies)
Hello Everyone,
I have a perl script that reads two types of data files (txt and XML). These data files are huge and large in number. I am using something like this :
foreach my $t (@text)
{
open TEXT, $t or die "Cannot open $t for reading: $!\n";
while(my $line=<TEXT>){
... (4 Replies)
Hi, I need to compare two fixed-length files and write any differences to a separate file. I have to capture each and every difference, line by line. Ideally my files should not have any differences, but if there are any they should be captured without missing one. Also my files sizes are... (4 Replies)
Here is my problem. I have to find the differences in 2 XML files
This is my Old File contents - File1
<FILEHDR>
<Bag xsi:nil='true'></Bag>
</FILEHDR>
This is my New File contents - File2
<FILEHDR>
<Bag xsi:nil='true' ></Bag>
</FILEHDR>
When I do the following
diff -b File1 File2... (1 Reply)
I have a file with data extracted, and need to insert a header with a constant string, say: H|PayerDataExtract
if i use sed, i have to redirect the output to a seperate file like
sed ' sed commands' ExtractDataFile.dat > ExtractDataFileWithHeader.dat
the same is true for awk
and... (10 Replies)
Hi,
As per my requirement, I need to take difference between two big files(around 6.5 GB) and get the difference to a output file without any line numbers or '<' or '>' in front of each new line.
As DIFF command wont work for big files, i tried to use BDIFF instead.
I am getting incorrect... (13 Replies)
Hi,
I have a huge file of bibliographic records in some standard format.I need a script to do some repeatable task as follows:
1. Needs to create folders as the strings starts with "item_*" from the input file
2. Create a file "contents" in each folders having "license.txt(tab... (5 Replies)
folks,
In my working directory, there a multiple large files which only contain one line in the file. The line is too long to use "grep", so any help?
For example, if I want to find if these files contain a string like "93849", what command I should use?
Also, there is oder_id number... (1 Reply)