Joining files in a complex way


 
# 8  
Old 03-07-2010
Thanx Tyler it's working great.

# 9  
Old 03-12-2010
Nice ones

Last edited by ruby_sgp; 03-12-2010 at 11:47 AM..
# 10  
Old 03-12-2010
Small changes

Hey, a small alteration to how the hashes are defined:
Code:
# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 A/G 2 G/G 3);
my %numtochar = qw(1 A/A 2 A/G 3 G/G);

I'd like to change it like this:
1st set = AA or CC or GG or TT
2nd set = AC, AG, AT, CG, CT, GT etc.

The 2nd set MUST be 2.
The 1st set MUST be 1 OR 3: if an ID has both T/T and G/G, take one as 1 and the other as 3.
But the output should show which pair (A/A or one of the others) each 1, 2 or 3 came from.
For example:

Code:
# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 T/T 3 G/G 1 C/C 3 A/T 2 A/G 2 A/C 2 T/A 2 T/G 2 T/C 2 G/A 2 G/C 2 C/A 2 C/T 2 C/G 2);

This one is working, but it does not show where the values came from [A/A or G/G or others], like this:
(the second bold value shouldn't be T/T)
input1
Code:
"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"

input2

Code:
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

Code:
$ perl combine.pl
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20

# 11  
Old 03-13-2010
Quote:
Originally Posted by stateperl
...
the second bold is shouldn't be T/T
...
What should the output be for these files, then?

tyler_durden
# 12  
Old 03-13-2010
Here is an example

Conditions:
Code:
1. If both letters are the same, the value has to be 1 or 3 [ A/A or T/T or G/G or C/C ].
2. If the same ID has 2 different same-letter pairs, the first one has to be 1 and the other has to be 3 [ see ID S1: T/T (bold) is 1 and G/G is 3, shown in red bold ].
3. If the letters are different, the value has to be 2 [ A/G or T/A or T/C or others ].
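
The three conditions above can be sketched as a small Perl routine. This is only a minimal illustration, not the code from this thread: the helper name `assign_codes` and the two hashes it returns are made up for the example, and it assumes the pairs for one ID are processed in the order they appear in input1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the thread's scripts): number the pairs
# seen for ONE ID, in order of first appearance.
#   same letters      -> 1, or 3 if 1 is already taken
#   different letters -> 2
sub assign_codes {
    my @pairs = @_;
    my (%code_for, %pair_for);
    for my $pair (@pairs) {
        next if exists $code_for{$pair};            # already numbered
        my ($first, $second) = split m{/}, $pair;
        my @candidates = $first eq $second ? (1, 3) : (2);
        for my $code (@candidates) {
            next if exists $pair_for{$code};        # slot already taken
            $code_for{$pair} = $code;               # pair   -> number
            $pair_for{$code} = $pair;               # number -> pair
            last;
        }
    }
    return (\%code_for, \%pair_for);
}

# Column S1 of input1, top to bottom.
my @s1 = qw(T/T A/G T/T G/G A/G T/T T/T T/T T/T);
my ($code_for, $pair_for) = assign_codes(@s1);
print "$_ => $pair_for->{$_}\n" for sort keys %$pair_for;
# prints:
# 1 => T/T
# 2 => A/G
# 3 => G/G
```

Keeping the reverse map (number to pair) is what lets the output show where each 1, 2 or 3 came from. A pair that finds no free slot (for example a third same-letter pair under one ID) is simply left unnumbered in this sketch.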

modified-input1

Code:
"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"

same old-input2
Code:
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

newoutput
Code:
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20


Last edited by stateperl; 03-13-2010 at 12:29 PM..
# 13  
Old 03-16-2010
Quote:
Originally Posted by stateperl
Condition::
Code:
1. If both letters are the same, the value has to be 1 or 3 [ A/A or T/T or G/G or C/C ].
2. If the same ID has 2 different same-letter pairs, the first one has to be 1 and the other has to be 3 [ see ID S1: T/T (bold) is 1 and G/G is 3, shown in red bold ].
3. If the letters are different, the value has to be 2 [ A/G or T/A or T/C or others ].

...

Well, in this case, you'll have to generate the key-value pairs in the three hashes - %chartonum, %numtochar and %mainhash as you iterate through "input1", based on the 3 conditions mentioned.

Code:
$ 
$ 
$ cat input1
"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"
$ 
$ cat input2
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20
$ 
$ cat -n combine_2.pl
     1  #!/usr/bin/perl -w
     2
     3  my %chartonum;
     4  my %numtochar;
     5  my %mainhash;
     6
     7  # first process all "input1" files i.e. all elements of the array @infile1
     8  $file1 = "input1";
     9
    10  open(INFILE, $file1) or die "Can't open $file1: $!";
    11  while (<INFILE>) {
    12    chomp;
    13    s/"//g;
    14    s/[ ]+/ /g;
    15    if ($. == 1) {
    16      @x = split/ /;
    17    } else {
    18      @y = split/ /;
    19      foreach $i (1..$#y) {
    20        @t = split (/\//, $y[$i]);
    21        if ($t[0] eq $t[1]) {
    22          if (not defined $chartonum{$x[$i].",".$y[$i]} and
    23              not defined $numtochar{$x[$i].",1"}       and
    24              not defined $numtochar{$x[$i].",3"}) {
    25            $numtochar{$x[$i].",1"} = $y[$i];
    26            $chartonum{$x[$i].",".$y[$i]} = 1;
    27          }
    28          elsif (not defined $chartonum{$x[$i].",".$y[$i]} and
    29                 not defined $numtochar{$x[$i].",3"}) {
    30            $numtochar{$x[$i].",3"} = $y[$i];
    31            $chartonum{$x[$i].",".$y[$i]} = 3;
    32          }
    33        } else {
    34          if (not defined $chartonum{$x[$i].",".$y[$i]} and
    35              not defined $numtochar{$x[$i].",2"}) {
    36            $numtochar{$x[$i].",2"} = $y[$i];
    37            $chartonum{$x[$i].",".$y[$i]} = 2;
    38          } # end of if not defined
    39        } # end of else i.e. t[0] ne t[1]
    40        $mainhash{$x[$i].",".$chartonum{$x[$i].",".$y[$i]}}++;
    41      } # end of foreach
    42    } # end of $. > 1
    43  }
    44  close(INFILE) or die "Can't close $file1: $!";
    45
    46  # print the header
    47  printf("%-12s%-12s%-12s%-12s%-12s%-s\n","\"ID\"","\"Label\"","\"StYPE\"","\"Ntype\"","\"Stype_No\"","\"log\"");
    48  # now start processing the "input2" file
    49  $infile2 = "input2";
    50  open(INFILE, $infile2) or die "Can't open $infile2: $!";
    51  while (<INFILE>) {
    52    if ($. > 1) {
    53      chomp;
    54      s/"//g;
    55      s/[ ]+/ /g;
    56      # print $_,"\n";
    57      @z = split/ /;
    58      if (!defined $prev or $z[0] ne $prev) {$num = 1} else {$num++};
    59      $prev = $z[0];
    60      printf("%-12s%-12s%-12s%-12s%-12s%-s\n",
    61             "\"$z[0]\"",
    62             "\"$z[1]\"",
    63             defined $numtochar{$z[0].",".$num} ? "\"".$numtochar{$z[0].",".$num}."\"" : "\"NULL\"",
    64             exists $numtochar{$z[0].",".$num} ? $num : "\"null\"",
    65             defined $mainhash{$z[0].",".$num} ? $mainhash{$z[0].",".$num} : "\"null\"", 
    66             $z[2]
    67            );
    68    }
    69  }
    70  close(INFILE) or die "Can't close $infile2: $!";
    71
$ 
$ perl combine_2.pl
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20
$ 
$

Use the Data::Dumper module to check the values of the 3 hashes right after "input1" is done processing, at line 45.
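
For instance, a dump along these lines could be dropped in at that point. This is just a sketch: the two hashes are filled by hand here with the S1 values that appear in the output above, whereas in combine_2.pl they would already be populated by the first loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # sorted keys make successive dumps comparable
$Data::Dumper::Indent   = 1;   # compact indentation

# Hand-filled sample data in the "ID,number" key style used by combine_2.pl.
my %numtochar = ('S1,1' => 'T/T', 'S1,2' => 'A/G', 'S1,3' => 'G/G');
my %mainhash  = ('S1,1' => 6,     'S1,2' => 2,     'S1,3' => 1);

# Dump takes a list of references and a matching list of names; a leading
# "*" in a name makes the output read "%numtochar = (...)".
my $dump = Data::Dumper->Dump([\%numtochar, \%mainhash],
                              [qw(*numtochar *mainhash)]);
print STDERR $dump;            # STDERR keeps the dump out of the report
```

Writing the dump to STDERR is a choice made for this example so that the debugging text does not get mixed into the table printed on STDOUT.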

But frankly, given the complexity of calculations involved, I'd rather look for some Perl Bioinformatics modules that have subroutines to do this.
Or check BioPerl, or books like "Beginning/Mastering Perl for Bioinformatics" at amazon.com.

HTH,
tyler_durden

PS - I'm assuming those A, C, G, T are the nucleotide bases of a DNA strand, and these files are related to Bioinformatics.

Last edited by durden_tyler; 03-16-2010 at 02:50 PM..
# 14  
Old 03-16-2010
so helpful

Thank you very much for the follow-up and suggestions.
I think I should practice with more Perl examples. But thank you for your valuable time.

Last edited by stateperl; 03-16-2010 at 11:29 AM..