Add static text in perl

02-04-2016

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

Quote:

Originally Posted by cmccabe

@Aia, please feel free to make any improvements/suggestions to any code posted by me. I am a scientist learning programming and so this is still new to me. I learn from each post and try to improve each time. Thank you very much Smilie

. I will try again tomorrow and post back.

Unfortunately, I can tell you what the code is doing, but I can only guess and infer your intentions, and that's the hard part. If you were to post a few representative lines of $input_file and an example of the expected result to be saved in $output_file, I am sure many people would be happy to help. And we, both, might learn something along the way.

---------- Post updated at 09:45 PM ---------- Previous update was at 08:14 PM ----------

This is an example of how you might be able to insert an extra member at the header line and later output with tabs

Code:

#!/usr/bin/env perl
my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

# to be inserted after "Amino Acid Change" header member
my $nineteenth_member = "Heredity";

# insert into @header members
splice @header, 18, 0, $nineteenth_member;

$" = "\t"; # output separator for print when interpolating

# every element of @header will be separated by tab
print "@header\n";

splice works as well for inserting an array in the middle of another array

Code:

#!/usr/bin/env perl

my @numbers = qw( 1 2 3 4 );

my @fractions = qw( 3.1 3.2 3.3 3.4 );

splice @numbers, 3, 0, @fractions;

{
    local $" = ",";
    print "@numbers\n";
}

print "@numbers\n";

Code:

perl example2.pl
1,2,3,3.1,3.2,3.3,3.4,4
1 2 3 3.1 3.2 3.3 3.4 4

Last edited by Aia; 02-04-2016 at 01:32 AM.. Reason: Add example2.pl

Aia

View Public Profile for Aia

Find all posts by Aia

02-04-2016

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

The basic idea of my program is that it combines multiple steps/processes into one. A set of data is inputted that is, for lack of a better term, not useful so I use a SOAP API to connect to a python tool that verifies that the input data is found in a database and converts the data into something that is useful. This data which is a set of coordinates15 25653864 25653864 G C is an example. Those cordinates are saved as a text file that is piped into a perl program (not the one posted) to apply meaning to the coordinate. That data is then reformatted using the perl posted and the process is complete.
The science behind this makes alot of sense to me but I am learning more and more about the programming aspect. Science, especially in my field of molecular genetics and genomics is advancing very quickly and using more and more programming. Thank you for all your help

.

edit (perlupdate).

Code:

#!/bin/perl
   use strict;
   my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");  # match NM and inheritence with gene
    
	# Accept the input and output files as parameters
       my $input_file = $ARGV[0];
       my $output_file = $ARGV[1];
     
       # Set the header columns to be added to the left
       # and to the right of the header in the input file
      my @left =  (
                       "Index",
                       "Chromosome Position",
                       "Gene",
                       "Inheritance",
                       "RNA Accession",
                       "Chr",
                       "Coverage",
                       "Score",
                       "A(#F,#R)",
                       "C(#F,#R)",
                       "G(#F,#R)",
                       "T(#F,#R)",
                       "Ins(#F,#R)",
                       "Del(#F,#R)",
                       "SNP db_xref",
                       "Mutation Call",
                       "Mutant Allele Frequency",
                       "Amino Acid Change"
                  );
      my @right = (
                      "HP",
                      "SPLICE",
                      "Pseudogene",
                      "Classification",
                      "HGMD",
                      "Disease",
                      "Sanger",
                      "References"
                  );
    
      # open the input file, read the header line and sandwich it
      # between @left and @right arrays
      my $final_header;
      open (FH, "<", $input_file) or die "Can't open $input_file: $!";
	  chomp(my $hdr=<FH>);

      $final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

      # final header is set, print it to the output file
      open (OF, ">", $output_file) or die "Can't open $output_file: $!";
      print OF "$final_header";
     # close (FH) or die "Can't close $output_file: $!";
	  
	   my @colsleft = map "Null",(1..$#left);	  
	   my @colsright = map "Null",(0..$#right);	 
	  	  
	  while(<FH>)  {  # puts row of input file into $_ 
	  chomp;
	  my @vals = split/\t/; # this splits the line at tabs
	  my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
	  my ($gene,$transcript,$exon,$coding,$aa);
	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}
		# warn join ("\t",$gene,$transcript,$exon,$coding,$aa);
		my @out=($.,@colsleft,$_,@colsright);
		$out[2]=$gene;
		$out[3]=$nms{$transcript};
		$out[4]=$transcript;
		$out[15]=$coding;
		$out[17]=$aa;
		$out[45] = "VUS";
                    print "$out[45]\n";
				
		#print OF join("\t",$.,@colsleft,$_,@colsright),"\n";   # row data is set, print it to the output file		
		print OF join("\t",@out),"\n";   # row data is set, print it to the output file			
   	 }

Output

Code:

Classification
Null

all the other $out fields populate from the split. Not sure why the "VUS" isn't populating but that string is hardcoded and not from the split.

Last edited by cmccabe; 02-04-2016 at 11:45 AM.. Reason: added details and perl edit

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

02-05-2016

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

Hi cmccabe,

I wish you would had posted at least four lines of the input file and how do you expect those four lines to be reformatted as output. Due to the lack of that information the situation has not changed much.

If you are unable to do that, please take a look at these parts:

Code:

my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

What do you think is happening to @colsleft and @colsright at each iteration of the while loop?

Since those are outside the while loop, they never get refreshed for each line of the input line. Therefore, they will contain pieces of data from previous iterations, if they are not rewritten in the loop. I do not know if that's what you want.

Code:

	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}

Code:

$_ or next; # skip if AB empty

$_ will always have something through the loop, if not, the for loop stops. So, I do not know what's your intention even when you commented with skip if AB empty.

Code:

split/\:/;

remove the \, : is not especial in any way there.

Please, explain this part. What do you think that this part is doing for you?

Code:

grep {$transcript eq $_} keys %nms or next;

Last edited by Aia; 02-05-2016 at 01:19 AM..

Aia

View Public Profile for Aia

Find all posts by Aia

02-05-2016

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

The input that goes into the perl is:

Code:

Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

Code:

my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

provided the additional columns to the left and right. Basically, the file that is used is only 24 columns (represented by $_) and @colsleft put 18 columns before them with values of Null and @colsright put 8 additional columns after with values of Null. The total is now 50 and that is what is expected. So basically it is my @out=($.,@colsleft,$_,@colsright); unique entry, followed by 18 columns, then 24, then 8.

The split of column 10 [9] provides the arrays that are used in the $out. grep {$transcript eq $_} keys %nms or next; is a special case that matches the Stranscript with the nms in the beginning. This actually all seems to work.

The loop doesn't get refreshed because if there are multiple enteries then they are on separate rows.

I can not seem to figure out why VUS doesn't populate in $out[45] as that is the only part that doesn't seem to work.

I hope this helps a bit and thank you for all your help

.

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

02-05-2016

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

Quote:

Originally Posted by cmccabe

I can not seem to figure out why VUS doesn't populate in $out[45] as that is the only part that doesn't seem to work.

Earlier, in post #6, I mentioned the following:

Quote:

Originally Posted by Aia

$out[45] = "VUS"; pretty much assigns it and there is not way you have "NULL" after that unless it gets changed or your understanding of what you are looking at is not correct.

I still believe that what you expect to see is misleading you of what it really is.

Code:

$out[45] = "VUS";

At this point, "VUS" is still, the 46th element of array out.

Code:

print OF join("\t",@out),"\n";

At that point, array out has been flatten out as an string separated with tabs.
You have lost your ability to know what element it would be in a string that now contains new tab separated elements.
In fact, since $out[45] is the last part, it would, actually, become $out[64] if you were to split again by tab.
You have introduced new parts that have tabs as well.

You want to test it?
Here's your original code, with some prints to show some details. Run it with just the two lines of input you posted.

Code:

#!/bin/perl

use strict;

my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");
my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

my @left =  (
     "Index",
     "Chromosome Position",
     "Gene",
     "Inheritance",
     "RNA Accession",
     "Chr",
     "Coverage",
     "Score",
     "A(#F,#R)",
     "C(#F,#R)",
     "G(#F,#R)",
     "T(#F,#R)",
     "Ins(#F,#R)",
     "Del(#F,#R)",
     "SNP db_xref",
     "Mutation Call",
     "Mutant Allele Frequency",
     "Amino Acid Change"
 );
my @right = (
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References"
);

my $final_header;
open (FH, "<", $input_file) or die "Can't open $input_file: $!";
chomp(my $hdr=<FH>);
$final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

open (OF, ">", $output_file) or die "Can't open $output_file: $!";
print OF "$final_header";

my @colsleft = map "Null",(1..$#left);
print "\@colsleft = ", scalar @colsleft, ": @colsleft\n";

my @colsright = map "Null",(0..$#right);
print "\@colsright = ", scalar @colsright, ": @colsright\n";

while(<FH>)  {
chomp;
my @vals = split/\t/;
my @mutations=split/,/,$vals[9];
my ($gene,$transcript,$exon,$coding,$aa);
for (@mutations)
{
    $_ or next;
    ($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons

    grep {$transcript eq $_} keys %nms or next;
}
my @out=($.,@colsleft,$_,@colsright);
print "\n";
print 'After this evaluation: my @out=($.,@colsleft,$_,@colsright);', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

$out[2]=$gene;
$out[3]=$nms{$transcript};
$out[4]=$transcript;
$out[15]=$coding;
$out[17]=$aa;
$out[45] = "VUS";

print 'After this evaluation:  $out[45] = "VUS";', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

print OF join("\t",@out),"\n";

my @after = split( "\t", join("\t", @out) );
print "\@after = ", scalar @after, ": @after\n";
}

Run as:

Code:

perl debug.pl input.file /dev/null

Output:

Code:

@colsleft = 17: Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null
@colsright = 8: Null Null Null Null Null Null Null Null

After this evaluation: my @out=($.,@colsleft,$_,@colsright);
@out = 27: 2 Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null 4     41748130        41748130        G       C       exonic  PHOX2B      synonymous SNV   PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null

After this evaluation:  $out[45] = "VUS";
@out = 46: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4        41748130        41748130        G       C       exonic       PHOX2B          synonymous SNV  PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null                   VUS

@after = 65: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4 41748130 41748130 G C exonic PHOX2B  synonymous SNV PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G 0.0007 . . . . . 0.0005 0.0002 0.0007 . Null Null Null Null Null Null Null Null                   VUS

Please, scroll all the way to the right to see VUS highlighted.

Again, knowing what the code does... it is not a problem; knowing what you expect is the hard part.
If you were to post an example of how those 2 lines are supposed to look after the process, that might help.

Last edited by Aia; 02-05-2016 at 10:17 PM.. Reason: Run only with the first two lines of input

Aia

View Public Profile for Aia

Find all posts by Aia

02-06-2016

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

desired output: column [45] or classification is VUS

Code:

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs															Null	Null	Null	VUS	Null	Null	Null	Null

So if I am understanding (please correct me if I am wrong)

Code:

print OF join("\t",@out),"\n";

introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help

.

cmccabe

View Public Profile for cmccabe

Find all posts by cmccabe

02-06-2016

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

Quote:

Originally Posted by cmccabe

So if I am understanding (please correct me if I am wrong)

Code:

print OF join("\t",@out),"\n";

introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help Smilie

.

Yes, you have elements of the array that has tabs in itself and when you convert that array to a string it creates more fields if you were to separate again those fields by tab.

Here's the input you posted in #11

Code:

4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

Here's the desired output posted in #13

Code:

1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs

I have to assume you did not correlate them since it is not possible to produce this output from that input, unless other information is missing.
Please, explain where those fields highlighted in red come from, since they are not found anywhere in your posted input or code.
If this were a case of wrong output against input, please, provide a corrected set of input and output, to remove ambiguity.

Also, please, explain the extra tabs in your output file, every ^I identify a tab in the line.

Code:

cat -T desired_output

Code:

1^INull^IPHOX2B^IAD^INM_003924.3^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^I4^I41748071^I41748078^IGGCCCGGG^I-^Iexonic^IPHOX2B^I^Iframeshift deletion^IPHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs^I^I^I^I^I^I^I^I^I^I^I^I^I^I^INull^INull^INull^IVUS^INull^INull^INull^INull

Or if you substitute tab for bar

Code:

perl -pe 's/\t/\|/g' desired_output

Code:

1|Null|PHOX2B|AD|NM_003924.3|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|4|41748071|41748078|GGCCCGGG|-|exonic|PHOX2B||frameshift deletion|PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs|||||||||||||||Null|Null|Null|VUS|Null|Null|Null|Null

As it stands, your example output has 50 fields. Could you, please, confirm you want an output of 50 fields, all the time?
Here's a breakout of it:

Code:

perl -nalF"\t" -e 'for(@F){print $n, ": ", $F[$n++]}' desired_output

Code:

1: 1
2: Null
3: PHOX2B
4: AD
5: NM_003924.3
6: Null
7: Null
8: Null
9: Null
10: Null
11: Null
12: Null
13: Null
14: Null
15: Null
16: Null
17: Null
18: Null
19: 4
20: 41748071
21: 41748078
22: GGCCCGGG
23: -
24: exonic
25: PHOX2B
26:
27: frameshift deletion
28: PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43: Null
44: Null
45: Null
46: VUS
47: Null
48: Null
49: Null
50: Null

Please, explain what would produce fields number 22, 23 and 27 and 26, 29 to 42.
Would you like those tab-empty fields to be Null?

Another question, concerning your code:

Code:

my @mutations=split/,/,$vals[9];

$vals[9] contains PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G according to your input. It can not be split by commas.
Can you explain that? Are there any lines that would have something like:

Code:

PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G

And if so how would you like to handle them?

Thank you.

Last edited by Aia; 02-06-2016 at 07:48 PM.. Reason: Add more questions.

Aia

View Public Profile for Aia

Find all posts by Aia

Shell Programming and Scripting

Add static text in perl

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

How to add line breaks to perl command with large text in single quotes?

Discussion started by: kchinnam

2. Shell Programming and Scripting

awk to skip lines find text and add text based on number

Discussion started by: cmccabe

3. Programming

Perl find text and add line

Discussion started by: ab52

4. Programming

Even the Static cURL Library Isn't Static

Discussion started by: BrandonShw

5. UNIX for Advanced & Expert Users

Static code analysis for Perl

Discussion started by: figaro

6. Shell Programming and Scripting

Removing text between two static strings

Discussion started by: cg2

7. IP Networking

I need HELP to Set up Coyote Linux router with 1 static IP & 64 internal static IP

Discussion started by: dlwoaud

8. Shell Programming and Scripting

How to add static lines to short file?

Discussion started by: kevinmccallum

9. Red Hat

permanently add static route

Discussion started by: beaker457

10. Solaris

Add Static Routes to new physical address

Discussion started by: mehrdad68