Shell Programming and Scripting: Post 302498206 by Xterra on Sunday 20th of February 2011 12:57:12 PM
Changing from FASTA to PHYLIP format

I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:
Code:
>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

I need to convert them to PHYLIP format so that they look like this:
Code:
3 100 
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT 
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

The first line gives the number of sequences followed by the length of the sequences.
Each following line starts with the sequence ID, which must be exactly 8 characters long, followed by 2 blank spaces and then the actual sequence, with the wrapped FASTA lines of each record joined into a single line. If the sequence ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to pad it to 8. In my example I have used underscores so that the alignment is visible and the layout of the output file is clear, but in the actual output file they should be blank spaces.
Any help will be greatly appreciated!
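
For reference, the rules described above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the function names read_fasta and to_phylip and the SAMPLE string are made up for illustration, the ID is taken as the first 8 characters of everything after the ">", and all sequences are assumed to have the same length (the sketch raises an error otherwise). Note that cutting IDs to 8 characters can produce duplicate names, which this sketch does not check for.

```python
# Sample input copied from the post; SAMPLE is a hypothetical name used for the demo.
SAMPLE = """>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
"""


def read_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # New record: keep the header text without the leading ">"
            records.append([line[1:], []])
        elif records:
            # Sequence line: belongs to the most recent header
            records[-1][1].append(line)
    # Join the wrapped sequence lines of each record into one string
    return [(name, "".join(chunks)) for name, chunks in records]


def to_phylip(records, id_width=8, gap=2):
    """Format (header, sequence) pairs as described in the post."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length")
    seq_len = lengths.pop() if lengths else 0
    # Header line: number of sequences, then sequence length
    out = [f"{len(records)} {seq_len}"]
    for name, seq in records:
        # Truncate the ID to id_width characters, or pad it with spaces
        label = name[:id_width].ljust(id_width)
        out.append(label + " " * gap + seq)
    return "\n".join(out) + "\n"


print(to_phylip(read_fasta(SAMPLE.splitlines())), end="")
```

To convert a real file, the same two functions can be applied to an open file handle instead of SAMPLE.splitlines(), since read_fasta only needs an iterable of lines.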
 
