Three Different Files: Huge Data Comparison Problem.

10-22-2010

I have three different files:
Part of File 1

Code:

ARTPHDFGAA
.
.

Part of File 2

Code:

ARTGHHYESA
.
.

Part of File 3

Code:

ARTPOLYWEA
.
.

Does anybody have an idea how to work out each of the results below and write each one to a different file?

1) Data content shared by file 1, file 2 and file 3
Desired result file content

Code:

ART      A
.
.

2) Data content shared by file 1 and file 2
Desired result file content

Code:

ART H    A
.
.

3) Data content shared by file 1 and file 3
Desired result file content

Code:

ARTP     A
.
.

4) Data content shared by file 2 and file 3
Desired result file content

Code:

ART   Y  A
.
.

5) Data content only in file 1, not in file 2 or file 3
Desired result file content

Code:

     DFGA
.
.

6) Data content only in file 2, not in file 1 or file 3
Desired result file content

Code:

   G H ES
.
.

7) Data content only in file 3, not in file 1 or file 2
Desired result file content

Code:

    OL WE
.
.

"." stands for the rest of each file's content, a long run of letters (e.g. ASDASDASFJKJETET.....).
File 1, file 2 and file 3 are all exactly the same size: about 110 MB, i.e. roughly 110 million letters in each file.
The three files differ only in some of their content.
My goal is to compare the three files and find the content common to all three, the content unique to each file, and so on.

Thanks a lot for any advice or comments on how to solve each of these cases.

Last edited by patrick87; 10-22-2010 at 01:45 PM..

patrick87

10-22-2010

Firstly this seems remarkably close to one of your many threads on this subject except that it has three files not two files.
https://www.unix.com/shell-programmin...1-problem.html

Secondly to avoid too much guesswork:

Please state what Operating System and version you have.
Please state your preferred Shell.
Please list what data processing tools you have available. We note that you have perl and awk. Do you have a high-level programming language too, or are you trying to write this system in unix Shell?

Thirdly and most importantly:
Exactly how big are the files?
Are they fixed length records in standard unix text file format?
Does the full-stop appear in the data?
Do you have a larger sample (say 20 lines per file) of representative data?

methyl

10-22-2010

Thanks a lot for the reminder, methyl.
Sorry for confusing you.

I have just edited my question.
Hopefully it is easier to understand this time.

patrick87

10-22-2010

Here's a short Perl program for you to mull over.

Code:

$
$
$ # show the contents of file1, file2 and file3
$
$ cat -n file1
     1  ARTPHDFGAA
     2  DKDXCSIKER
     3  QQELKRNIKJ
     4  OJUUXGFBVP
$
$ cat -n file2
     1  ARTGHHYESA
     2  AGCBCHBCRB
     3  CWWENBYITN
     4  WMVXVPNANW
$
$ cat -n file3
     1  ARTPOLYWEA
     2  PXCSMTWUND
     3  MNBLYALUUO
     4  XPRAYHLPHT
$
$ # show the content of the Perl program
$
$ cat string_operations.pl
#!/usr/bin/perl
use strict;
use warnings;

# Note: the records at the same line number are assumed to have the same length.

sub print_legend {
  print "
LEGEND =>
(1) Characters common in file1, file2, file3.
(2) Characters common in file1 and file2.
(3) Characters common in file2 and file3.
(4) Characters common in file3 and file1.
(5) Characters in file1 that are absent from file2 and file3.
(6) Characters in file2 that are absent from file3 and file1.
(7) Characters in file3 that are absent from file1 and file2.
";
}

sub common_all {
  my ($n, $x1, $x2, $x3) = @_;                       # load all arguments to work on
  print "\n", "=" x 20, " Line no. $n\n";            # print a separator with the line number
  print "(1) ";
  for (my $j = 0; $j < length($x1); $j++) {          # walk through the 1st string
    if (substr($x1,$j,1) eq substr($x2,$j,1) &&      # if 1st and 2nd string have identical character
        substr($x2,$j,1) eq substr($x3,$j,1)) {      # that is identical to that of 3rd string, then
      print substr($x1,$j,1);                        # print it
    } else {
      print " ";                                     # otherwise, print a blank space
    }
  }
  print "\n";
}

sub common_xy {
  my ($n, $x1, $x2) = @_;
  print "($n) ";
  for (my $j = 0; $j < length($x1); $j++) {          # walk through the 1st string
    if (substr($x1,$j,1) eq substr($x2,$j,1)) {      # if 1st and 2nd string have identical character
      print substr($x1,$j,1);                        # then print it
    } else {
      print " ";                                     # otherwise, print a blank space
    }
  }
  print "\n";
}

sub in_x_not_in_yz {
  my ($n, $x1, $x2, $x3) = @_;
  print "($n) ";
  for (my $j = 0; $j < length($x1); $j++) {          # walk through the 1st string
    if (substr($x1,$j,1) ne substr($x2,$j,1) &&      # if current character differs from the 2nd string's
        substr($x1,$j,1) ne substr($x3,$j,1)) {      # and from the 3rd string's at this position, then
      print substr($x1,$j,1);                        # print it
    } else {
      print " ";                                     # otherwise, print a blank space
    }
  }
  print "\n";
}

## Main program starts here
print_legend();

# Open the 3 files and load data into 3 arrays
open(my $f1, "<", "file1") or die "Can't open file1: $!";
chomp(my @a1 = <$f1>);
close($f1) or die "Can't close file1: $!";
open(my $f2, "<", "file2") or die "Can't open file2: $!";
chomp(my @a2 = <$f2>);
close($f2) or die "Can't close file2: $!";
open(my $f3, "<", "file3") or die "Can't open file3: $!";
chomp(my @a3 = <$f3>);
close($f3) or die "Can't close file3: $!";

# Start processing the arrays now
for my $i (0 .. $#a1) {
  common_all ($i+1, $a1[$i], $a2[$i], $a3[$i]);     # Common in all three
  common_xy (2, $a1[$i], $a2[$i]);                  # Common in file1 and file2
  common_xy (3, $a2[$i], $a3[$i]);                  # Common in file2 and file3
  common_xy (4, $a3[$i], $a1[$i]);                  # Common in file3 and file1
  in_x_not_in_yz (5, $a1[$i], $a2[$i], $a3[$i]);    # In file1 but not in file2 and file3
  in_x_not_in_yz (6, $a2[$i], $a3[$i], $a1[$i]);    # In file2 but not in file3 and file1
  in_x_not_in_yz (7, $a3[$i], $a1[$i], $a2[$i]);    # In file3 but not in file1 and file2
}
print "\n";
$
$
$ # Now run the Perl program
$
$ perl string_operations.pl

LEGEND =>
(1) Characters common in file1, file2, file3.
(2) Characters common in file1 and file2.
(3) Characters common in file2 and file3.
(4) Characters common in file3 and file1.
(5) Characters in file1 that are absent from file2 and file3.
(6) Characters in file2 that are absent from file3 and file1.
(7) Characters in file3 that are absent from file1 and file2.

==================== Line no. 1
(1) ART      A
(2) ART H    A
(3) ART   Y  A
(4) ARTP     A
(5)      DFGA
(6)    G H ES
(7)     OL WE

==================== Line no. 2
(1)
(2)     C
(3)   C
(4)
(5) DKDX SIKER
(6) AG B HBCRB
(7) PX SMTWUND

==================== Line no. 3
(1)
(2)        I
(3)
(4)    L
(5) QQE KRN KJ
(6) CWWENBY TN
(7) MNB YALUUO

==================== Line no. 4
(1)
(2)
(3)
(4)
(5) OJUUXGFBVP
(6) WMVXVPNANW
(7) XPRAYHLPHT

$
$

Hopefully, the inline script comments are self-explanatory.
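
If the character-by-character loops turn out to be slow on 110 MB of data, one possible variation is to let Perl's string bitwise operators do the position-by-position comparison. The following is only a sketch under assumptions: the data is single-byte text with no NUL bytes, and the records being compared have the same length. The names common_chars and unique_chars are made up for this illustration and are not part of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the characters of $x that equal the character at the same
# position in every other string; all other positions become blanks.
sub common_chars {
  my ($x, @others) = @_;
  my $mask = "\xFF" x length($x);        # start with "every position matches"
  for my $y (@others) {
    my $diff = $x ^ $y;                  # string XOR: "\0" wherever $x and $y agree
    $diff =~ tr/\0\x01-\xFF/\xFF\0/;     # agree -> \xFF, differ -> \0
    $mask &= $diff;                      # a position survives only if all strings agree
  }
  (my $out = $x & $mask) =~ tr/\0/ /;    # blank out the masked positions
  return $out;
}

# Keep the characters of $x that differ from the character at the same
# position in every other string; all other positions become blanks.
sub unique_chars {
  my ($x, @others) = @_;
  my $mask = "\xFF" x length($x);
  for my $y (@others) {
    my $diff = $x ^ $y;                  # non-"\0" wherever $x and $y differ
    $diff =~ tr/\x01-\xFF/\xFF/;         # differ -> \xFF, agree stays \0
    $mask &= $diff;                      # a position survives only if it differs from all
  }
  (my $out = $x & $mask) =~ tr/\0/ /;
  return $out;
}

# First records of file1, file2 and file3 from the listing above
my ($r1, $r2, $r3) = ("ARTPHDFGAA", "ARTGHHYESA", "ARTPOLYWEA");
print "[", common_chars($r1, $r2, $r3), "]\n";   # prints [ART      A]
print "[", common_chars($r1, $r2), "]\n";        # prints [ART H    A]
print "[", unique_chars($r1, $r2, $r3), "]\n";   # prints [     DFGA ]
```

The brackets in the demo only make the trailing blank visible. Whether this is actually faster than the substr loops on your data is something you would have to measure.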

tyler_durden

durden_tyler

10-22-2010

Are you comparing:

file 1 , record 1
file 2 , record 1
file 3 , record 1

then

file 1 , record 2
file 2 , record 2
file 3 , record 2

then

file 1 , record 3
file 2 , record 3
file 3 , record 3

... etc.

Ps. It would really help to know what Operating System and software you have available.
Applying lateral thought we can deduce that some software wrote these 110 MB files. Is it software written in a high-level programming language? If so, which language?

Edit: Didn't see durden_tyler's post while I was typing. Try that first.
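
If the answer to the record-by-record question is yes, the comparison can read one record from each file at a time instead of loading all three 110 MB files into memory, and it can write each of the seven results to its own file as originally requested. The following is a rough sketch, not a tested solution for the real data. It assumes newline-terminated records of equal length at the same line number; the sub names, the hash keys and the result_*.txt output file names are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @KEYS = qw(all f1_f2 f1_f3 f2_f3 only1 only2 only3);

# Compare one record from each file position by position.
# Returns a hash with one result string per category in @KEYS.
sub compare_records {
  my ($r1, $r2, $r3) = @_;
  my %out = map { $_ => "" } @KEYS;
  for my $j (0 .. length($r1) - 1) {
    my ($c1, $c2, $c3) = map { substr($_, $j, 1) } ($r1, $r2, $r3);
    $out{all}   .= ($c1 eq $c2 && $c2 eq $c3) ? $c1 : " ";
    $out{f1_f2} .= ($c1 eq $c2) ? $c1 : " ";
    $out{f1_f3} .= ($c1 eq $c3) ? $c1 : " ";
    $out{f2_f3} .= ($c2 eq $c3) ? $c2 : " ";
    $out{only1} .= ($c1 ne $c2 && $c1 ne $c3) ? $c1 : " ";
    $out{only2} .= ($c2 ne $c1 && $c2 ne $c3) ? $c2 : " ";
    $out{only3} .= ($c3 ne $c1 && $c3 ne $c2) ? $c3 : " ";
  }
  return %out;
}

# Read the three input files in lockstep, one record at a time, and
# write each category to its own file (result_all.txt, result_f1_f2.txt, ...).
sub compare_files {
  my @names = @_;
  my (@in, %res);
  for my $name (@names) {
    open(my $fh, "<", $name) or die "Can't open $name: $!";
    push @in, $fh;
  }
  for my $key (@KEYS) {
    open($res{$key}, ">", "result_$key.txt") or die "Can't open result_$key.txt: $!";
  }
  while (defined(my $r1 = readline $in[0])) {
    my $r2 = readline $in[1];
    my $r3 = readline $in[2];
    last unless defined $r2 && defined $r3;      # stop when the shortest file ends
    chomp($r1, $r2, $r3);
    my %out = compare_records($r1, $r2, $r3);
    print { $res{$_} } "$out{$_}\n" for @KEYS;
  }
  close($_) or die "Can't close file: $!" for @in, values %res;
}

# Usage: perl compare3.pl file1 file2 file3   (the script name is hypothetical)
compare_files(@ARGV) if @ARGV == 3;
```

If the files are not line-oriented at all (for example one single 110 MB line), the same compare_records logic would need to be fed fixed-size chunks instead of lines.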

methyl
