Add static text in perl

02-08-2016

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I apologize; I posted the wrong output file for the input shown previously.

input

Code:

Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

output

Code:

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
2	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.					Null	Null	Null	Null	Null	Null	Null	Null

Field headers and where each value comes from

Code:

1: 1                   (Index)
2: Null                (Chromosome Position)
3: PHOX2B              (Gene)
4: AD                  (Inheritance)
5: NM_003924.3         (RNA Accession)
6: Null                (Chr)
7: Null                (Coverage)
8: Null                (Score)
9: Null                (A(#F,#R))
10: Null               (C(#F,#R))
11: Null               (G(#F,#R))
12: Null               (T(#F,#R))
13: Null               (Ins(#F,#R))
14: Null               (Del(#F,#R))
15: Null               (SNP db_xref)
16: c.C639G            (Mutation Call)
17: Null               (Mutant Allele Frequency)
18: G213G              (Amino Acid Change)
19: 4                  (Chr)
20: 41748130           (Start)
21: 41748130           (End)
22: G                  (Ref)
23: C                  (Alt)
24: exonic             (Func.refGene)
25: PHOX2B             (Gene.refGene)
26:                    (GeneDetail.refGene)
27: synonymous SNV     (ExonicFunc.refGene)
28: PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G (AAChange.refGene) - split to get the values in 3, 4, 5, 16, and 18. This field can hold multiple records, so the split uses @nms to return only the record whose NM_ accession matches one in @nms.

29:    (PopFreqMax)
30:    (1000G2012APR_ALL)
31:    (1000G2012APR_AFR)
32:    (1000G2012APR_AMR)
33:    (1000G2012APR_ASN)
34:    (1000G2012APR_EUR)
35:    (ESP6500si_ALL)
36:    (ESP6500si_AA)
37:    (ESP6500si_EA)
38:    (CG46)
39:    (common)
40:    (clinvar)
41:    (clinvarsubmit)
42:    (clinvarreference)
43: Null   (HP)
44: Null   (Splice)
45: Null   (Pseudogene)
46: VUS   (Classification) - currently shows Null instead of VUS
47: Null   (HGMD)
48: Null   (Disease)
49: Null   (Sanger)
50: Null   (References)
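
The rule described for field 28 (keep only the record whose NM_ accession is a known one) could be sketched roughly as below. This is only an illustration, not the code actually used to build the file: the helper name pick_record is made up, the lookup is written as a hash %nms with the three accessions that appear in this thread, and the comma as record separator is an assumption.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Transcript => inheritance lookup (the three entries used in this thread).
my %nms = (
    "NM_004004.5" => "AR",
    "NM_004992.3" => "XLD",
    "NM_003924.3" => "AD",
);

# pick_record() is a hypothetical helper: given the raw AAChange.refGene
# value, return the parts of the first record whose transcript is in %nms.
# Records are assumed to be comma-separated, parts colon-separated.
sub pick_record {
    my ($aachange) = @_;
    return unless defined $aachange;
    for my $record (split /,/, $aachange) {
        my ($gene, $transcript, $exon, $coding, $aa) = split /:/, $record;
        next unless defined $transcript && exists $nms{$transcript};
        return {
            gene        => $gene,
            inheritance => $nms{$transcript},
            transcript  => $transcript,
            coding      => $coding,
            aa          => $aa,
        };
    }
    return;    # no record matched a known transcript
}

my $hit = pick_record("PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G");
print join("\t", @{$hit}{qw(gene inheritance transcript coding aa)}), "\n" if $hit;
```

When nothing matches, pick_record returns nothing, so the caller can skip the row instead of splitting an undefined value.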

Quote:

Also, please, explain the extra tabs in your output file, every ^I identify a tab in the line.

I did not intend the extra tabs, and I do not know why they are there.

Quote:

$vals[9] contains PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G according to your input. It can not be split by commas.
Can you explain that? Are there any lines that would have something like:

The Perl that populates this column only allows the format with : in it, so commas should not show up.

Thank you for all your help


cmccabe


02-08-2016

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

Please give it a try.
You can modify it to your liking.

Code:

#!/usr/bin/env perl
# reformat.pl
use strict;
use warnings;

my %nms = (
    "NM_004004.5" => "AR",
    "NM_004992.3" => "XLD",
    "NM_003924.3" => "AD",
);

# $! is only meaningful after a failed system call, so it is left out here
my $readf = shift || die "Missing input file\n";
my $writef = shift || die "Missing output file\n";

my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

open my $in, '<', $readf or die "Cannot open $readf: $!\n";
open my $out, '>', $writef or die "Cannot create $writef: $!\n";

my $add2header;
chomp( $add2header = <$in> );
splice @header, 18, 0, $add2header;
save(@header);
$.= 0; # reset lines count to remove header
while( <$in> ) {
    chomp;
    my @ruler = (("Null")x17, ("")x25, ("Null")x8);
    my @fields = split "\t";
    my $len = @fields;
    splice @ruler, 17, $len, @fields;
    my ($gene, $transcript, $exon, $coding, $aa) = split ":", $fields[9];
    $ruler[0] = $.;
    $ruler[2] = $gene;
    $ruler[3] = $nms{$transcript};
    $ruler[4] = $transcript;
    $ruler[15] = $coding;
    $ruler[17] = $aa;
    $ruler[45] = "VUS";
    save(@ruler);
}

sub save {
    local $" = "\t";
    print $out "@_\n";
}

close $in;
close $out;

Last edited by Aia; 02-08-2016 at 11:24 PM.. Reason: Add reset lines for index to one.

Aia


02-13-2016


I apologize for the delay; I just got to test the Perl using the input from post 15. Thank you for all your help.
The results look the same as before, with VUS appearing after the "Null" values:

Code:

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
2	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	c.C639G	Null	p.G213G	4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.					Null	Null	Null	Null	Null	Null	Null	Null																																						VUS

Last edited by cmccabe; 02-13-2016 at 12:46 PM.. Reason: added input location

cmccabe


02-13-2016


This is the result I get when I run the code I posted in #16 against the input you posted in #15.

Code:

Index   Chromosome Position Gene    Inheritance RNA Accession   Chr Coverage    Score   A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)  Del(#F,#R)  SNP db_xref Mutation Call   Mutant Allele Frequency Amino Acid Change   Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene    PopFreqMax  1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL   ESP6500si_AA    ESP6500si_EA    CG46    common  clinvar clinvarsubmit   clinvarreference    HP  SPLICE  Pseudogene  Classification  HGMD    Disease Sanger  References
1   Null    PHOX2B  AD  NM_003924.3 Null    Null    Null    Null    Null    Null    Null    Null    Null    Null    c.C639G Null    p.G213G 41748130    41748130    G   C   exonic  PHOX2B      synonymous SNV  PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G    0.0007  .   .   .   .   .   0.0005  0.0002  0.0007  .                       Null    Null    Null    VUS Null    Null    Null    Null

As you can see, VUS is there in the right place, which leads me to believe that there is a discrepancy between what you posted and what you used for input.
There is also a discrepancy in the first field of the second line: my code would output a 1 (meaning line 1), and you are showing a 2.
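
One quick way to see whether a reformatted row still lines up with its header is to count the tab-separated columns on every line. The following is just a sketch with a made-up helper name (count_columns) and a made-up script name in the usage comment; it assumes the first line of the file is the header.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# count_columns() is a hypothetical helper: number of tab-separated fields
# in a line. The limit of -1 keeps trailing empty fields, so stray tabs at
# the end of a row are counted too.
sub count_columns {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my @cols = split /\t/, $line, -1;
    return scalar @cols;
}

# Usage: perl count_columns.pl output_file
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $expected;
    while (my $line = <$fh>) {
        my $n = count_columns($line);
        $expected //= $n;    # the first line (the header) sets the expected width
        print "line $.: $n columns (header has $expected)\n" if $n != $expected;
    }
    close $fh;
}
```

A row that reports more columns than the header has extra tabs somewhere, which is what would push a value like VUS past the last header column.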

If you would like to continue troubleshooting it, all that I can offer you is the result of what your input looks like when reformatted to show tabs.

Code:

perl -pe 's/\t/\[TAB\]/g' new_cmccabe_input

Code:

Chr[TAB]Start[TAB]End[TAB]Ref[TAB]Alt[TAB]Func.refGene[TAB]Gene.refGene[TAB]GeneDetail.refGene[TAB]ExonicFunc.refGene[TAB]AAChange.refGene[TAB]PopFreqMax[TAB]1000G2012APR_ALL[TAB]1000G2012APR_AFR[TAB]1000G2012APR_AMR[TAB]1000G2012APR_ASN[TAB]1000G2012APR_EUR[TAB]ESP6500si_ALL[TAB]ESP6500si_AA[TAB]ESP6500si_EA[TAB]CG46[TAB]common[TAB]clinvar[TAB]clinvarsubmit[TAB]clinvarreference
4[TAB]41748130[TAB]41748130[TAB]G[TAB]C[TAB]exonic[TAB]PHOX2B[TAB][TAB]synonymous SNV[TAB]PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G[TAB]0.0007[TAB].[TAB].[TAB].[TAB].[TAB].[TAB]0.0005[TAB]0.0002[TAB]0.0007[TAB].

Please run the same command on the two input lines you used and compare against these. They should be the same, since I am using what you posted.

Note:
I am assuming you have made sure this input is a proper Unix-style file and not an MS-DOS one (with CR LF line endings).
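
If there is any doubt about the line endings, something along these lines can check for them. It is only a sketch; has_crlf, strip_eol and the script name in the usage comment are names made up for the example. The reason it matters: chomp only removes the trailing newline on Unix, so a DOS file would leave a carriage return glued to the last field of every row.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# has_crlf() is a hypothetical helper: 1 if the line ends in CR LF, else 0.
sub has_crlf {
    my ($line) = @_;
    return $line =~ /\r\n\z/ ? 1 : 0;
}

# strip_eol() removes either a Unix (LF) or a DOS (CR LF) line ending.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Usage: perl check_eol.pl input_file
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    binmode $fh;    # read the bytes as they are, no newline translation
    my $dos = 0;
    while (my $line = <$fh>) {
        $dos++ if has_crlf($line);
    }
    close $fh;
    my $msg = $dos
        ? "$dos line(s) end in CR LF (DOS style)\n"
        : "No CR LF line endings found\n";
    print $msg;
}
```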

Last edited by Aia; 02-14-2016 at 12:05 AM.. Reason: Add note

Aia


02-15-2016


The input is proper Unix style but is slightly different than what I posted: it is only 5 fields. I apologize for the oversight.

input

Code:

4	41748130	41748130	G	C

Code:

perl -pe 's/\t/\[TAB\]/g' input

Code:

4[TAB]41748130[TAB]41748130[TAB]G[TAB]C

The additional information is populated from those 5 fields most of the time. A small percentage of the time [9] will be Null and needs to be skipped; that is what the "$_ or next;" line was supposed to do in the original code. [45] is still "VUS", however. Thank you.

cmccabe


02-15-2016


Quote:

Originally Posted by cmccabe

The input is proper Unix style but is slightly different than what I posted: it is only 5 fields. I apologize for the oversight.

input

Code:

4	41748130	41748130	G	C

Code:

perl -pe 's/\t/\[TAB\]/g' input

Code:

4[TAB]41748130[TAB]41748130[TAB]G[TAB]C

The additional information is populated from those 5 fields most of the time. A small percentage of the time [9] will be Null and needs to be skipped; that is what the "$_ or next;" line was supposed to do in the original code. [45] is still "VUS", however. Thank you.

It would have complained with "Use of uninitialized value in split" if it encountered such a short input.
Here's the previous code with a modification to accommodate that small percentage of times when the input does not have a "PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G" string.

Code:

#!/usr/bin/env perl
# reformat.pl
use strict;
use warnings;

my %nms = (
    "NM_004004.5" => "AR",
    "NM_004992.3" => "XLD",
    "NM_003924.3" => "AD"
);

# $! is only meaningful after a failed system call, so it is left out here
my $readf = shift || die "Missing input file\n";
my $writef = shift || die "Missing output file\n";

my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

open my $in, '<', $readf or die "Cannot open $readf: $!\n";
open my $out, '>', $writef or die "Cannot create $writef: $!\n";

my $add2header;
chomp( $add2header = <$in> );
splice @header, 18, 0, $add2header;
save(@header);

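$. = 0;    # reset the line counter so the first data row gets index 1
while( <$in> ) {
    chomp;
    my @fields = split /\t/;
    # skip rows that have no AAChange.refGene value in the 10th column
    next unless $fields[9];
    # 18 leading columns (Index .. Amino Acid Change), 24 input columns and
    # 8 trailing columns (HP .. References): 50 in total, same as the header
    my @ruler = (("Null")x18, ("")x24, ("Null")x8);
    # the input columns start at index 18, right after "Amino Acid Change",
    # so the assignment to $ruler[17] below no longer overwrites Chr
    splice @ruler, 18, scalar @fields, @fields;
    my ($gene, $transcript, $exon, $coding, $aa) = split /:/, $fields[9];
    $ruler[0] = $.;                   # Index
    $ruler[2] = $gene;                # Gene
    $ruler[3] = $nms{$transcript};    # Inheritance
    $ruler[4] = $transcript;          # RNA Accession
    $ruler[15] = $coding;             # Mutation Call
    $ruler[17] = $aa;                 # Amino Acid Change
    $ruler[45] = "VUS";               # Classification
    save(@ruler);
}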

sub save {
    local $" = "\t";
    print $out "@_\n";
}

close $in;
close $out;

Nevertheless, that would not do anything to solve your input discrepancy.
Did you compare the input that produced the defective reformat output with the one you posted previously?
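
For that comparison, a small script along these lines would report the first line where two files differ, with the tabs made visible the same way as the one-liner above. It is only a sketch: first_difference, show_tabs, and the script and file names in the usage comment are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# show_tabs() makes tabs visible, like the perl -pe one-liner in this thread.
sub show_tabs {
    my ($line) = @_;
    (my $copy = $line) =~ s/\t/[TAB]/g;
    return $copy;
}

# first_difference() is a hypothetical helper: given two array refs of
# lines, return the 1-based number of the first line that differs (a line
# missing from one side counts as a difference), or 0 if they are identical.
sub first_difference {
    my ($a_lines, $b_lines) = @_;
    my $max = @$a_lines > @$b_lines ? scalar @$a_lines : scalar @$b_lines;
    for my $i (0 .. $max - 1) {
        my ($x, $y) = ($a_lines->[$i], $b_lines->[$i]);
        return $i + 1 if !defined $x || !defined $y || $x ne $y;
    }
    return 0;
}

# Usage: perl compare_inputs.pl posted_input actual_input
if (@ARGV == 2) {
    my @files;
    for my $name (@ARGV) {
        open my $fh, '<', $name or die "Cannot open $name: $!\n";
        chomp(my @lines = <$fh>);
        close $fh;
        push @files, \@lines;
    }
    my $n = first_difference(@files);
    if ($n) {
        print "First difference at line $n:\n";
        print "  $ARGV[0]: ", show_tabs($files[0][$n - 1] // ''), "\n";
        print "  $ARGV[1]: ", show_tabs($files[1][$n - 1] // ''), "\n";
    }
    else {
        print "Files are identical\n";
    }
}
```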

Last edited by Aia; 02-15-2016 at 08:17 PM..

Aia


02-16-2016


I got pulled away before I could, but I will try it first thing tomorrow. Thank you.

---------- Post updated 02-16-16 at 09:11 AM ---------- Previous update was 02-15-16 at 06:22 PM ----------

input

Code:

 Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene PopFreqMax 1000G2012APR_ALL 1000G2012APR_AFR 1000G2012APR_AMR 1000G2012APR_ASN 1000G2012APR_EUR ESP6500si_ALL ESP6500si_AA ESP6500si_EA CG46 common clinvar clinvarsubmit clinvarreference
4 41748130 41748130 G C exonic PHOX2B  synonymous SNV PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G 0.0007 . . . . . 0.0005 0.0002 0.0007 .

perl

Code:

perl -pe 's/\t/\[TAB\]/g' input

output

Code:

Chr[TAB]Start[TAB]End[TAB]Ref[TAB]Alt[TAB]Func.refGene[TAB]Gene.refGene[TAB]GeneDetail.refGene[TAB]ExonicFunc.refGene[TAB]AAChange.refGene[TAB]PopFreqMax[TAB]1000G2012APR_ALL[TAB]1000G2012APR_AFR[TAB]1000G2012APR_AMR[TAB]1000G2012APR_ASN[TAB]1000G2012APR_EUR[TAB]ESP6500si_ALL[TAB]ESP6500si_AA[TAB]ESP6500si_EA[TAB]CG46[TAB]common[TAB]clinvar[TAB]clinvarsubmit[TAB]clinvarreference
4[TAB]41748130[TAB]41748130[TAB]G[TAB]C[TAB]exonic[TAB]PHOX2B[TAB][TAB]synonymous SNV[TAB]PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G[TAB]0.0007[TAB].[TAB].[TAB].[TAB].[TAB].[TAB]0.0005[TAB]0.0002[TAB]0.0007[TAB].

Last edited by cmccabe; 02-16-2016 at 11:18 AM.. Reason: updated output

cmccabe

