Changing from FASTA to PHYLIP format | Unix Linux Forums | Shell Programming and Scripting

    #1  
02-20-2011
Xterra
Changing from FASTA to PHYLIP format

I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:
Quote:
>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I need to convert them to PHYLIP format so that they look like this:
Quote:
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
The first number at the very top is the number of sequences followed by the length of the sequences.
The first column is the sequence ID, which must be exactly 8 characters long, followed by 2 blank spaces and then the sequence itself. If the ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to pad it to 8. In my example I have used underscores to keep the sequences aligned and to show how the output file should look, but in the output file they should be blank spaces.
Any help will be greatly appreciated!
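The ID rule described above (pad or cut to 8 characters, then add 2 spaces) maps onto a single printf format. A minimal sketch, assuming a POSIX awk; the function name phylip_id is made up for illustration, and the square brackets are only there to make the spaces visible:

```shell
# phylip_id: hypothetical helper, not part of any tool mentioned in this thread.
# "%-8.8s" left-justifies the ID, pads it to at least 8 characters,
# and prints at most 8 characters of it; two literal spaces follow.
phylip_id() {
    printf '%s\n' "$1" | awk '{ printf "[%-8.8s  ]\n", $0 }'
}

phylip_id "SeqID1"        # short ID: padded with spaces
phylip_id "Sequence 22"   # long ID: cut after 8 characters
```

The same format string can be reused inside a larger awk script once the brackets are removed.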
    #2  
02-20-2011
drl
Hi.

Looks like Sequence Manipulator has a number of format conversion codes, including

Code:
Fasta2Phylip.pl: convert sequence file in fasta format to sequential phylip format

Input: fasta sequence file.

Output: phylip sequence file.

Good luck ... cheers, drl

---------- Post updated at 12:35 ---------- Previous update was at 12:26 ----------

Hi.

Also Yu-Wei's Bioinformatics playground: FASTA to PHYLIP converter

I Googled for:

Code:
convert fasta to phylip format awk OR perl

and these were the first 2 hits of about 1500 ... cheers, drl
    #3  
02-20-2011
Xterra
Thanks for the info

Helpful website, but I still need an AWK script that I can modify and combine with the other steps in my bash script.
Any help will be greatly appreciated!
    #4  
02-20-2011
Scrutinizer
Try:

Code:
awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file



Code:
$ awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

SeqID1    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence  AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT


Last edited by Scrutinizer; 02-20-2011 at 07:20 PM..
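For anyone who wants to modify the one-liner above, here is the same idea written out step by step. This is only a sketch, assuming a POSIX awk and FASTA headers that contain no ">" other than the leading one; the function name fasta_records is made up. Unlike the one-liner, it skips the empty first record, so the output has no leading blank line:

```shell
# fasta_records: hypothetical wrapper name, used here for illustration only.
fasta_records() {
    awk '
        # One record per ">" entry, one field per line, fields joined with nothing.
        BEGIN { RS = ">"; FS = "\n"; OFS = "" }

        # Record 1 is the empty text in front of the first ">", so skip it.
        NR > 1 {
            # Pad the ID with 8 spaces, keep the first 8 characters, add 2 spaces.
            $1 = substr($1 "        ", 1, 8) "  "
            # Assigning to $1 rebuilds $0 with the empty OFS,
            # which glues the wrapped sequence lines together.
            print
        }
    ' "$1"
}
```

Calling fasta_records file prints one line per sequence: the 10-character ID column followed by the unwrapped sequence.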
    #5  
02-20-2011
bartus11

Code:
awk -v RS=">" -v FS="\n" -v OFS="" '$0!=""{$1=substr($1,1,8);$1=sprintf ("%-10s",$1)}$0!=""' file > file.tmp; awk 'NR==1{"wc -l < file.tmp"|getline a; print a+0,length($2)}1' file.tmp

Code ugly as hell, but working.
    #6  
02-20-2011
rdcwayx
For the other request, based on Scrutinizer's code

Code:
$  awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file |awk '{s=length($2)}END{print NR-1, s}'

3 100

    #7  
02-23-2011
Xterra
I need to combine both codes

Code:
awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file


Code:
awk '{s=length($2)}END{print NR-1, s}' file

So I can get the desired output
Quote:
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I have been trying, but I just cannot get the code to do what I want.
Can anyone explain how I can combine them?
Thanks!

Last edited by Xterra; 02-23-2011 at 06:59 PM..
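One possible way to get the header line and the reformatted sequences from a single awk call is sketched below. Note that it does not reuse the RS=">" trick from the earlier posts: it reads the file line by line, stores everything in memory, and prints at the end. It assumes a POSIX awk, Unix line endings, and sequences that are all the same length (only the first one is measured). The function name fasta2phylip is made up for illustration:

```shell
# fasta2phylip: hypothetical function name; reads one FASTA file and
# writes sequential PHYLIP (count and length first, then one line per sequence).
fasta2phylip() {
    awk '
        # A header line starts a new sequence: store its ID, padded or cut to 8 characters.
        /^>/ {
            n++
            id[n] = sprintf("%-8.8s", substr($0, 2))
            next
        }

        # Any other line is sequence data; append it to the current sequence.
        # Lines in front of the first header are ignored because n is still 0.
        n { seq[n] = seq[n] $0 }

        END {
            # First line: number of sequences and length of the first sequence.
            print n + 0, length(seq[1])
            for (i = 1; i <= n; i++)
                print id[i] "  " seq[i]
        }
    ' "$1"
}
```

Usage would be something like fasta2phylip file > file.phy inside the bash script. With the three sequences from the first post, the first output line is 3 100, followed by the three ID and sequence lines.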