Changing from FASTA to PHYLIP format | Unix Linux Forums | Shell Programming and Scripting

#1  02-20-2011  Xterra
Changing from FASTA to PHYLIP format

I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:
Quote:
>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I need to convert them to PHYLIP format so they look like this:
Quote:
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
The first number at the very top is the number of sequences followed by the length of the sequences.
The first column is the sequence ID, which must be exactly 8 characters long, followed by 2 blank spaces and then the actual sequence. If the sequence ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to pad it to 8. In my example I have used underscores to keep the sequences aligned and to show how the output file should look, but in the output file they should be blank spaces.
Any help will be greatly appreciated!
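
The ID rule described above (cut to 8 characters, pad with blanks to 8, then 2 blanks) fits in a single printf format. This is only a minimal sketch, assuming a POSIX awk; the function name phylip_id is made up for illustration, and the sed at the end just marks the end of each line so the trailing blanks are visible.

```shell
# phylip_id: hypothetical helper, not from the thread.
# Reads one ID per line and prints it as an 8-column name plus 2 blanks.
phylip_id() {
    # %-8.8s = left-justify, pad to 8 columns, truncate at 8 characters
    awk '{ printf "%-8.8s  \n", $0 }'
}

# Demo: the three IDs from the example; "|" marks the end of each field
printf '%s\n' 'SeqID1' 'Sequence 22' 'Seq-39' | phylip_id | sed 's/$/|/'
```

With the three sample IDs, "Sequence 22" should come out cut to "Sequence" and the two shorter IDs padded to the same width.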

#2  02-20-2011  drl
Hi.

Looks like Sequence Manipulator has a number of format conversion scripts, including

Code:
Fasta2Phylip.pl: convert sequence file in fasta format to sequential phylip format

Input: fasta sequence file.

Output: phylip sequence file.

Good luck ... cheers, drl

---------- Post updated at 12:35 ---------- Previous update was at 12:26 ----------

Hi.

Also Yu-Wei's Bioinformatics playground: FASTA to PHYLIP converter

I Googled for:

Code:
convert fasta to phylip format awk OR perl

and these were the first 2 hits of about 1500 ... cheers, drl

#3  02-20-2011  Xterra
Thanks for the info

Helpful website, but I still need an AWK script that I can modify and combine with the other steps in my bash script.
Any help will be greatly appreciated!

#4  02-20-2011  Scrutinizer
Try:

Code:
awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file



Code:
$ awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

SeqID1    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence  AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT


Last edited by Scrutinizer; 02-20-2011 at 06:20 PM..
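
The one-liner above is dense, so here is a rough unpacking of it. This is a sketch of an equivalent, not Scrutinizer's own code: it assumes a POSIX awk, it skips the empty record in front of the first ">" (which is where the blank first line in the output above comes from), and the function name fasta_to_rows is made up.

```shell
# fasta_to_rows: hypothetical name, not from the thread.
# Prints one row per FASTA record: 8-column ID, 2 blanks, joined sequence.
fasta_to_rows() {
    awk '
    BEGIN {
        RS  = ">"    # each ">" starts a new record
        FS  = "\n"   # field 1 = header line, fields 2..NF = wrapped sequence lines
        OFS = ""     # rebuilding the record glues the fields together
    }
    $1 == "" { next }   # skip the empty record in front of the first ">"
    {
        # pad with 8 blanks, keep the first 8 characters, then add 2 blanks
        $1 = substr($1 "        ", 1, 8) "  "
        print            # assigning to $1 rebuilt $0 with the empty OFS
    }' "$@"
}

# Demo on a tiny two-record input with wrapped sequence lines
printf '>SeqID1\nAACC\nTTGG\n>Sequence 22\nAACC\nTTGG\n' | fasta_to_rows
```

The important part is the assignment to $1: changing any field makes awk rebuild the whole record using OFS, and since OFS is empty the wrapped sequence lines end up joined into one.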

#5  02-20-2011  bartus11

Code:
awk -v RS=">" -v FS="\n" -v OFS="" '$0!=""{$1=substr($1,1,8);$1=sprintf("%-10s",$1)}$0!=""' file > file.tmp; awk 'NR==1{"wc -l < file.tmp"|getline a; print a+0,length($2)}1' file.tmp

Code ugly as hell, but working.

#6  02-20-2011  rdcwayx
For the other request, based on Scrutinizer's code

Code:
$  awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file |awk '{s=length($2)}END{print NR-1, s}'

3 100


#7  02-23-2011  Xterra
I need to combine both codes

Code:
awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file


Code:
awk '{s=length($2)}END{print NR-1, s}' file

So I can get the desired output
Quote:
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I have been trying, but I just cannot get the code to do what I want.
Can anyone explain how I can combine them?
Thanks!

Last edited by Xterra; 02-23-2011 at 05:59 PM..
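
One possible way to combine the two steps is to do everything in a single awk program: build each row while reading, keep the rows in an array, and print the "count length" header first from the END block. This is a minimal sketch and not code posted in the thread. It assumes a POSIX awk, that all sequences in a file have the same length (the header reports the length of the last one read), and that holding one file's rows in memory is acceptable. The function name fasta2phylip is made up.

```shell
# fasta2phylip: hypothetical name, not from the thread.
# Writes sequential PHYLIP to stdout: header line, then one row per sequence.
fasta2phylip() {
    awk '
    BEGIN { RS = ">"; FS = "\n" }
    $1 == "" { next }                            # empty record before the first ">"
    {
        seq = ""
        for (i = 2; i <= NF; i++) seq = seq $i   # join wrapped sequence lines
        gsub(/[ \t\r]/, "", seq)                 # drop stray blanks and CRs
        n++
        row[n] = sprintf("%-8.8s  %s", $1, seq)  # 8-column ID, 2 blanks, sequence
        len = length(seq)                        # length of the last sequence read
    }
    END {
        print n + 0, len + 0                     # header: sequence count, length
        for (i = 1; i <= n; i++) print row[i]
    }' "$@"
}

# Demo on a tiny three-record input
printf '>SeqID1\nAACC\nTG\n>Sequence 22\nAACC\nTG\n>Seq-39\nAACC\nTG\n' | fasta2phylip
```

For a bunch of files, something like `for f in *.fasta; do fasta2phylip "$f" > "${f%.fasta}.phy"; done` should give one PHYLIP file per input (the .fasta and .phy extensions are assumptions). Passing several files in one call would merge them into a single output, which is probably not what is wanted here.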