Add static text in perl


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Add static text in perl
# 8  
Old 02-04-2016
Quote:
Originally Posted by cmccabe
@Aia, please feel free to make any improvements/suggestions to any code posted by me. I am a scientist learning programming and so this is still new to me. I learn from each post and try to improve each time. Thank you very much Smilie. I will try again tomorrow and post back.
Unfortunately, I can tell you what the code is doing, but I can only guess and infer your intentions, and that's the hard part. If you were to post a few representative lines of $input_file and an example of the expected result to be saved in $output_file, I am sure many people would be happy to help. And we, both, might learn something along the way.

---------- Post updated at 09:45 PM ---------- Previous update was at 08:14 PM ----------

This is an example of how you might be able to insert an extra member at the header line and later output with tabs

Code:
#!/usr/bin/env perl
my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

# to be inserted after "Amino Acid Change" header member
my $nineteenth_member = "Heredity";

# insert into @header members
splice @header, 18, 0, $nineteenth_member;

$" = "\t"; # output separator for print when interpolating

# every element of @header will be separated by tab
print "@header\n";


splice works as well for inserting an array in the middle of another array

Code:
#!/usr/bin/env perl

my @numbers = qw( 1 2 3 4 );

my @fractions = qw( 3.1 3.2 3.3 3.4 );

splice @numbers, 3, 0, @fractions;

{
    local $" = ",";
    print "@numbers\n";
}

print "@numbers\n";


Code:
perl example2.pl
1,2,3,3.1,3.2,3.3,3.4,4
1 2 3 3.1 3.2 3.3 3.4 4


Last edited by Aia; 02-04-2016 at 01:32 AM.. Reason: Add example2.pl
# 9  
Old 02-04-2016
The basic idea of my program is that it combines multiple steps/processes into one. A set of data is inputted that is, for lack of a better term, not useful so I use a SOAP API to connect to a python tool that verifies that the input data is found in a database and converts the data into something that is useful. This data which is a set of coordinates15 25653864 25653864 G C is an example. Those cordinates are saved as a text file that is piped into a perl program (not the one posted) to apply meaning to the coordinate. That data is then reformatted using the perl posted and the process is complete.
The science behind this makes alot of sense to me but I am learning more and more about the programming aspect. Science, especially in my field of molecular genetics and genomics is advancing very quickly and using more and more programming. Thank you for all your help Smilie.

edit (perlupdate).

Code:
#!/bin/perl
   use strict;
   my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");  # match NM and inheritence with gene
    
	# Accept the input and output files as parameters
       my $input_file = $ARGV[0];
       my $output_file = $ARGV[1];
     
       # Set the header columns to be added to the left
       # and to the right of the header in the input file
      my @left =  (
                       "Index",
                       "Chromosome Position",
                       "Gene",
                       "Inheritance",
                       "RNA Accession",
                       "Chr",
                       "Coverage",
                       "Score",
                       "A(#F,#R)",
                       "C(#F,#R)",
                       "G(#F,#R)",
                       "T(#F,#R)",
                       "Ins(#F,#R)",
                       "Del(#F,#R)",
                       "SNP db_xref",
                       "Mutation Call",
                       "Mutant Allele Frequency",
                       "Amino Acid Change"
                  );
      my @right = (
                      "HP",
                      "SPLICE",
                      "Pseudogene",
                      "Classification",
                      "HGMD",
                      "Disease",
                      "Sanger",
                      "References"
                  );
    
      # open the input file, read the header line and sandwich it
      # between @left and @right arrays
      my $final_header;
      open (FH, "<", $input_file) or die "Can't open $input_file: $!";
	  chomp(my $hdr=<FH>);

      $final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

      # final header is set, print it to the output file
      open (OF, ">", $output_file) or die "Can't open $output_file: $!";
      print OF "$final_header";
     # close (FH) or die "Can't close $output_file: $!";
	  
	   my @colsleft = map "Null",(1..$#left);	  
	   my @colsright = map "Null",(0..$#right);	 
	  	  
	  while(<FH>)  {  # puts row of input file into $_ 
	  chomp;
	  my @vals = split/\t/; # this splits the line at tabs
	  my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
	  my ($gene,$transcript,$exon,$coding,$aa);
	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}
		# warn join ("\t",$gene,$transcript,$exon,$coding,$aa);
		my @out=($.,@colsleft,$_,@colsright);
		$out[2]=$gene;
		$out[3]=$nms{$transcript};
		$out[4]=$transcript;
		$out[15]=$coding;
		$out[17]=$aa;
		$out[45] = "VUS";
                    print "$out[45]\n";
				
		#print OF join("\t",$.,@colsleft,$_,@colsright),"\n";   # row data is set, print it to the output file		
		print OF join("\t",@out),"\n";   # row data is set, print it to the output file			
   	 }

Output
Code:
Classification
Null

all the other $out fields populate from the split. Not sure why the "VUS" isn't populating but that string is hardcoded and not from the split.

Last edited by cmccabe; 02-04-2016 at 11:45 AM.. Reason: added details and perl edit
# 10  
Old 02-05-2016
Hi cmccabe,

I wish you would had posted at least four lines of the input file and how do you expect those four lines to be reformatted as output. Due to the lack of that information the situation has not changed much.

If you are unable to do that, please take a look at these parts:

Code:
my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

What do you think is happening to @colsleft and @colsright at each iteration of the while loop?

Since those are outside the while loop, they never get refreshed for each line of the input line. Therefore, they will contain pieces of data from previous iterations, if they are not rewritten in the loop. I do not know if that's what you want.

Code:
	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}

Code:
$_ or next; # skip if AB empty

$_ will always have something through the loop, if not, the for loop stops. So, I do not know what's your intention even when you commented with skip if AB empty.
Code:
split/\:/;

remove the \, : is not especial in any way there.

Please, explain this part. What do you think that this part is doing for you?
Code:
grep {$transcript eq $_} keys %nms or next;


Last edited by Aia; 02-05-2016 at 01:19 AM..
# 11  
Old 02-05-2016
The input that goes into the perl is:

Code:
Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

Code:
my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

provided the additional columns to the left and right. Basically, the file that is used is only 24 columns (represented by $_) and @colsleft put 18 columns before them with values of Null and @colsright put 8 additional columns after with values of Null. The total is now 50 and that is what is expected. So basically it is my @out=($.,@colsleft,$_,@colsright); unique entry, followed by 18 columns, then 24, then 8.

The split of column 10 [9] provides the arrays that are used in the $out. grep {$transcript eq $_} keys %nms or next; is a special case that matches the Stranscript with the nms in the beginning. This actually all seems to work.

The loop doesn't get refreshed because if there are multiple enteries then they are on separate rows.

I can not seem to figure out why VUS doesn't populate in $out[45] as that is the only part that doesn't seem to work.

I hope this helps a bit and thank you for all your help Smilie.
# 12  
Old 02-05-2016
Quote:
Originally Posted by cmccabe
I can not seem to figure out why VUS doesn't populate in $out[45] as that is the only part that doesn't seem to work.
Earlier, in post #6, I mentioned the following:
Quote:
Originally Posted by Aia
$out[45] = "VUS"; pretty much assigns it and there is not way you have "NULL" after that unless it gets changed or your understanding of what you are looking at is not correct.
I still believe that what you expect to see is misleading you of what it really is.

Code:
$out[45] = "VUS";

At this point, "VUS" is still, the 46th element of array out.

Code:
print OF join("\t",@out),"\n";

At that point, array out has been flatten out as an string separated with tabs.
You have lost your ability to know what element it would be in a string that now contains new tab separated elements.
In fact, since $out[45] is the last part, it would, actually, become $out[64] if you were to split again by tab.
You have introduced new parts that have tabs as well.


You want to test it?
Here's your original code, with some prints to show some details. Run it with just the two lines of input you posted.
Code:
#!/bin/perl

use strict;

my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");
my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

my @left =  (
     "Index",
     "Chromosome Position",
     "Gene",
     "Inheritance",
     "RNA Accession",
     "Chr",
     "Coverage",
     "Score",
     "A(#F,#R)",
     "C(#F,#R)",
     "G(#F,#R)",
     "T(#F,#R)",
     "Ins(#F,#R)",
     "Del(#F,#R)",
     "SNP db_xref",
     "Mutation Call",
     "Mutant Allele Frequency",
     "Amino Acid Change"
 );
my @right = (
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References"
);

my $final_header;
open (FH, "<", $input_file) or die "Can't open $input_file: $!";
chomp(my $hdr=<FH>);
$final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

open (OF, ">", $output_file) or die "Can't open $output_file: $!";
print OF "$final_header";

my @colsleft = map "Null",(1..$#left);
print "\@colsleft = ", scalar @colsleft, ": @colsleft\n";

my @colsright = map "Null",(0..$#right);
print "\@colsright = ", scalar @colsright, ": @colsright\n";

while(<FH>)  {
chomp;
my @vals = split/\t/;
my @mutations=split/,/,$vals[9];
my ($gene,$transcript,$exon,$coding,$aa);
for (@mutations)
{
    $_ or next;
    ($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons

    grep {$transcript eq $_} keys %nms or next;
}
my @out=($.,@colsleft,$_,@colsright);
print "\n";
print 'After this evaluation: my @out=($.,@colsleft,$_,@colsright);', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

$out[2]=$gene;
$out[3]=$nms{$transcript};
$out[4]=$transcript;
$out[15]=$coding;
$out[17]=$aa;
$out[45] = "VUS";

print 'After this evaluation:  $out[45] = "VUS";', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

print OF join("\t",@out),"\n";

my @after = split( "\t", join("\t", @out) );
print "\@after = ", scalar @after, ": @after\n";
}

Run as:
Code:
perl debug.pl input.file /dev/null


Output:

Code:
@colsleft = 17: Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null
@colsright = 8: Null Null Null Null Null Null Null Null

After this evaluation: my @out=($.,@colsleft,$_,@colsright);
@out = 27: 2 Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null 4     41748130        41748130        G       C       exonic  PHOX2B      synonymous SNV   PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null

After this evaluation:  $out[45] = "VUS";
@out = 46: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4        41748130        41748130        G       C       exonic       PHOX2B          synonymous SNV  PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null                   VUS

@after = 65: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4 41748130 41748130 G C exonic PHOX2B  synonymous SNV PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G 0.0007 . . . . . 0.0005 0.0002 0.0007 . Null Null Null Null Null Null Null Null                   VUS

Please, scroll all the way to the right to see VUS highlighted.

Again, knowing what the code does... it is not a problem; knowing what you expect is the hard part.
If you were to post an example of how those 2 lines are supposed to look after the process, that might help.

Last edited by Aia; 02-05-2016 at 10:17 PM.. Reason: Run only with the first two lines of input
# 13  
Old 02-06-2016
desired output: column [45] or classification is VUS
Code:
Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs															Null	Null	Null	VUS	Null	Null	Null	Null

So if I am understanding (please correct me if I am wrong)
Code:
print OF join("\t",@out),"\n";

introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help Smilie.
# 14  
Old 02-06-2016
Quote:
Originally Posted by cmccabe
So if I am understanding (please correct me if I am wrong)
Code:
print OF join("\t",@out),"\n";

introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help Smilie.
Yes, you have elements of the array that has tabs in itself and when you convert that array to a string it creates more fields if you were to separate again those fields by tab.

Here's the input you posted in #11
Code:
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

Here's the desired output posted in #13
Code:
1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs

I have to assume you did not correlate them since it is not possible to produce this output from that input, unless other information is missing.
Please, explain where those fields highlighted in red come from, since they are not found anywhere in your posted input or code.
If this were a case of wrong output against input, please, provide a corrected set of input and output, to remove ambiguity.

Also, please, explain the extra tabs in your output file, every ^I identify a tab in the line.

Code:
cat -T desired_output

Code:
1^INull^IPHOX2B^IAD^INM_003924.3^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^I4^I41748071^I41748078^IGGCCCGGG^I-^Iexonic^IPHOX2B^I^Iframeshift deletion^IPHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs^I^I^I^I^I^I^I^I^I^I^I^I^I^I^INull^INull^INull^IVUS^INull^INull^INull^INull

Or if you substitute tab for bar
Code:
perl -pe 's/\t/\|/g' desired_output

Code:
1|Null|PHOX2B|AD|NM_003924.3|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|4|41748071|41748078|GGCCCGGG|-|exonic|PHOX2B||frameshift deletion|PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs|||||||||||||||Null|Null|Null|VUS|Null|Null|Null|Null

As it stands, your example output has 50 fields. Could you, please, confirm you want an output of 50 fields, all the time?
Here's a breakout of it:
Code:
perl -nalF"\t" -e 'for(@F){print $n, ": ", $F[$n++]}' desired_output

Code:
1: 1
2: Null
3: PHOX2B
4: AD
5: NM_003924.3
6: Null
7: Null
8: Null
9: Null
10: Null
11: Null
12: Null
13: Null
14: Null
15: Null
16: Null
17: Null
18: Null
19: 4
20: 41748071
21: 41748078
22: GGCCCGGG
23: -
24: exonic
25: PHOX2B
26:
27: frameshift deletion
28: PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43: Null
44: Null
45: Null
46: VUS
47: Null
48: Null
49: Null
50: Null

Please, explain what would produce fields number 22, 23 and 27 and 26, 29 to 42.
Would you like those tab-empty fields to be Null?

Another question, concerning your code:
Code:
my @mutations=split/,/,$vals[9];

$vals[9] contains PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G according to your input. It can not be split by commas.
Can you explain that? Are there any lines that would have something like:
Code:
PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G

And if so how would you like to handle them?


Thank you.

Last edited by Aia; 02-06-2016 at 07:48 PM.. Reason: Add more questions.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

How to add line breaks to perl command with large text in single quotes?

Below code extracts multiple field values from XML into array and prints all in one line. perl -nle '@r=/(?: jndiName| authDataAlias| value| minConnections| maxConnections| connectionTimeout| name)="(+)/g and print join ",",$ENV{tIPnSCOPE},$ENV{pr ovider},$ENV{impClassName},@r' server.xml ... (4 Replies)
Discussion started by: kchinnam
4 Replies

2. Shell Programming and Scripting

awk to skip lines find text and add text based on number

I am trying to use awk skip each line with a ## or # and check each line after for STB= and if that value in greater than or = to 0.8, then at the end of line the text "STRAND BIAS" is written in else "GOOD". So in the file of 4 entries attached. awk tried: awk NR > "##"' "#" -F"STB="... (6 Replies)
Discussion started by: cmccabe
6 Replies

3. Programming

Perl find text and add line

Hi All I need to add a line to a file but after a certain block of text is found The block of text looks like this <RDF:Description RDF:about="urn:mimetype:video/quicktime" NC:value="video/quicktime" and i need to add this in the next line down ( note there is... (4 Replies)
Discussion started by: ab52
4 Replies

4. Programming

Even the Static cURL Library Isn't Static

I'm writing a program which uses curl to be run on Linux PCs which will be used by a number of different users. I cannot make the users all install curl on their individual machines, so I have tried to link curl in statically, rather than using libcurl.so. I downloaded the source and created a... (8 Replies)
Discussion started by: BrandonShw
8 Replies

5. UNIX for Advanced & Expert Users

Static code analysis for Perl

As an addition to our ongoing investigation into static code analysis tools for a Perl programming we are maintaining, can anyone recommend a certain tool that he/she is experienced with? We are already actively using perl::critic (Perl::Critic) and rats... (2 Replies)
Discussion started by: figaro
2 Replies

6. Shell Programming and Scripting

Removing text between two static strings

Hi everyone, I need to replace the text between two strings (html tags) and I'm having trouble figuring out how to do so. I can display the text with sed but I'm not having any luck deleting the text between the two strings. My file looks like this: <oths>test</oths><div class="text">1928... (2 Replies)
Discussion started by: cg2
2 Replies

7. IP Networking

I need HELP to Set up Coyote Linux router with 1 static IP & 64 internal static IP

hello, i need help on setting my coyote linux, i've working on this for last 5 days, can't get it to work. I've been posting this message to coyote forum, and other linux forum, but haven't get any answer yet. Hope someone here can help me...... please see my attached picture first. ... (0 Replies)
Discussion started by: dlwoaud
0 Replies

8. Shell Programming and Scripting

How to add static lines to short file?

I've got a simple log file that looks something like this: And I need to append it to look like this: So I just want to add a timestamp and a static (non-variable) word to each line in the file. Is there an easy scripted way to cat the file and append that data to each line....?? (4 Replies)
Discussion started by: kevinmccallum
4 Replies

9. Red Hat

permanently add static route

I have a machine with an interface that has two different addresses on CentOS 5 eth0: 10.20.21.77 eth0:1 141.218.1.221 If I issue this command I get the result I'm looking for. /sbin/route add -net 141.218.1.0 netmask 255.255.255.0 gw 10.20.21.77 ip route show dev eth0 141.218.1.0/24... (1 Reply)
Discussion started by: beaker457
1 Replies

10. Solaris

Add Static Routes to new physical address

Hi, I need help to add new route: 10.252.0.138, GW 10.252.0.129 to e1000g1 and 10.252.0.10, GW 10.252.0.1 to e1000g2 tnx (4 Replies)
Discussion started by: mehrdad68
4 Replies
Login or Register to Ask a Question