Number of matches in 2 strings


 
# 1  
Old 06-18-2012

Hello all,

I have a file with a column header that looks like this:

Code:
C1 C2 C3 
A A G
T T A
G C C

I want to compare the columns pairwise, position by position, and count the number of matches.

So the number of matches between C1 and C2 comes from comparing ATG with ATC.
Here there are two matches, A and T.

Similarly, C1 and C3 have no matches, and C2 and C3 have 1 match, C.

The output should look like this:

Code:
C1 C2 2
C1 C3 0
C2 C3 1
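
To make the definition concrete, here is a minimal sketch that counts matches for a single pair of columns on the sample above. The file location and the helper name pair_matches are made up for illustration; it assumes whitespace-separated fields and one header row.

```shell
# Hypothetical sample file holding the table from the question.
sample="${TMPDIR:-/tmp}/sample.txt"
cat > "$sample" <<'EOF'
C1 C2 C3
A A G
T T A
G C C
EOF

# Hypothetical helper: count data rows where columns $2 and $3 of file $1 agree.
pair_matches() {
  awk -v a="$2" -v b="$3" 'NR > 1 && $a == $b { n++ } END { print n + 0 }' "$1"
}

pair_matches "$sample" 1 2   # prints 2
pair_matches "$sample" 1 3   # prints 0
pair_matches "$sample" 2 3   # prints 1
```

This only checks one pair per pass over the file, so it is useful for spot checks rather than for the full 320-column job.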

I have 10 million rows and 320 columns, so I am not able to handle such a huge file in Windows.

Can someone please help with code for this? I have access to UNIX bash on Red Hat Enterprise Linux 6.

Thanks
# 2  
Old 06-18-2012
I hope you have enough memory on this box. Try this Perl script:
Code:
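#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "Usage: $0 file\n";
open my $in, '<', $file or die "Cannot open $file: $!\n";

# Read the file once, building one long string per column.
my (@col_names, @col);
while (my $line = <$in>) {
  chomp $line;
  my @fields = split ' ', $line;
  if ($. == 1) {                  # header row holds the column names
    @col_names = @fields;
    next;
  }
  $col[$_] .= $fields[$_] // '' for 0 .. $#col_names;
}
close $in;

# Compare every pair of columns position by position.
for my $i (0 .. $#col) {
  my @chars1 = split //, $col[$i];
  for my $j ($i + 1 .. $#col) {
    my @chars2 = split //, $col[$j];
    # Positions beyond the shorter column can never match, so stop there.
    my $last = $#chars1 <= $#chars2 ? $#chars1 : $#chars2;
    my $matches = 0;
    for my $k (0 .. $last) {
      $matches++ if $chars1[$k] eq $chars2[$k];
    }
    print "$col_names[$i] $col_names[$j] $matches\n";
  }
}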

Run it like this:
Code:
./script.pl file > output

This User Gave Thanks to bartus11 For This Post:
# 3  
Old 06-18-2012
Hi.

If memory is an issue, perhaps something that converts the values to bits (3 values, 2 bits) and then takes a logical difference. Just an idle thought. Similarly, a data base ... cheers, drl
This User Gave Thanks to drl For This Post:
# 4  
Old 06-18-2012
Thanks a lot. It runs fine with the sample data, but it is taking forever with the actual file.
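
For a sense of scale, a back-of-the-envelope estimate. It assumes the 10 million rows and 320 columns from the first post, single-character fields, and single-space separators; the variable names are only for illustration.

```shell
rows=10000000   # data rows, from the first post
cols=320        # columns, from the first post

# Each row: one character per column, a space between columns, one newline.
bytes=$(( rows * (cols + (cols - 1) + 1) ))

# Every unordered pair of columns is compared once per row.
pairs=$(( cols * (cols - 1) / 2 ))
comparisons=$(( pairs * rows ))

echo "about $bytes bytes on disk"
echo "$pairs column pairs, $comparisons character comparisons"
```

Under those assumptions that is about 6.4 GB of input and roughly 5 x 10^11 comparisons, so a long run time is to be expected whichever tool does the counting.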
# 5  
Old 06-19-2012
That is because your actual data is several gigabytes in size. You should also check that no swapping is going on while the script runs, because then it really would take forever to complete. Check it with: vmstat 1 (look for nonzero values in the si and so columns).
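
A small sketch for automating that check. It assumes the usual Linux (procps) vmstat layout, with a row of column names that includes si and so; swap_activity is a hypothetical helper name.

```shell
# Hypothetical helper: read vmstat output on stdin and report swap activity.
swap_activity() {
  awk '
    # Until the column-name row is seen, look for "si" and "so" and record their positions.
    !si_col {
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si_col = i
        if ($i == "so") so_col = i
      }
      next
    }
    # Skip the first sample: vmstat reports averages since boot there.
    ++sample == 1 { next }
    # Later samples: flag any nonzero swap-in or swap-out value.
    ($si_col + 0) > 0 || ($so_col + 0) > 0 { swapping = 1 }
    END { print (swapping ? "swapping" : "no swapping") }
  '
}

# Typical use on Linux, with ten one-second samples:
# vmstat 1 10 | swap_activity
```

Run it in a second terminal while the script is working; a result of "swapping" suggests the box does not have enough RAM for the in-memory approach.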

---------- Post updated 06-19-12 at 03:02 AM ---------- Previous update was 06-18-12 at 02:24 PM ----------

As advised by drl, I modified the script a bit to show progress while loading the file:
Code:
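#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "Usage: $0 file\n";
open my $in, '<', $file or die "Cannot open $file: $!\n";

# Read the file once, building one long string per column.
print STDERR "Reading columns (one dot is 100000 lines)\n";
my (@col_names, @col);
while (my $line = <$in>) {
  chomp $line;
  print STDERR "." if $. % 100000 == 0;
  my @fields = split ' ', $line;
  if ($. == 1) {                  # header row holds the column names
    @col_names = @fields;
    next;
  }
  $col[$_] .= $fields[$_] // '' for 0 .. $#col_names;
}
close $in;
print STDERR "\n";

# Compare every pair of columns position by position.
for my $i (0 .. $#col) {
  my @chars1 = split //, $col[$i];
  for my $j ($i + 1 .. $#col) {
    my @chars2 = split //, $col[$j];
    # Positions beyond the shorter column can never match, so stop there.
    my $last = $#chars1 <= $#chars2 ? $#chars1 : $#chars2;
    my $matches = 0;
    for my $k (0 .. $last) {
      $matches++ if $chars1[$k] eq $chars2[$k];
    }
    print "$col_names[$i] $col_names[$j] $matches\n";
  }
}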

# 6  
Old 06-19-2012
awk should be ok

Code:
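awk '
NR == 1 {
        # Header row: remember the column names.
        ncol = NF
        for (i = 1; i <= NF; i++)
                name[i] = $i
        next
}
{
        # Data row: count a match for every pair of equal fields.
        for (i = 1; i < NF; i++)
                for (j = i + 1; j <= NF; j++)
                        if ($i == $j)
                                count[i, j]++
}
END {
        # Nested loops keep the pairs in column order;
        # the order of "for (k in array)" is unspecified.
        for (i = 1; i < ncol; i++)
                for (j = i + 1; j <= ncol; j++)
                        print name[i], name[j], count[i, j] + 0
}' yourfile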


Last edited by Franklin52; 06-19-2012 at 05:51 AM.. Reason: code tags