parse fasta file to tabular file


 
# 1  
Old 12-23-2011

Hello,
I have a problem that is usually handled with BioPerl, but I thought it could be done with awk: converting FASTA format to tabular format. (Note: the line length is not the same for every entry, because the entries were combined from different files.)
Code:
input.fasta:

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACC
ACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAAT
ACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGG
TAATTCTAGAGCTAATACATGCTGTTGTGCCCGACTCACGAAGGGCGGTATTTATTAGATATCAGCCAATA
AGCATCTGCTATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence
ATTACCCAATCCTGACTCAGGGAGGTAGTGACAAGAAATAATGGGTCGGGGTTCTGCCCCGGGACTGCA
GGGCACCACCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

I want to convert it to tabular format, like this:
Code:
output.tab:

>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACA......CACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAG......CATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence GAAACTGCGAATGGCTCA......ATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence ATTACCCAATCCTGACTC......CCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

That is, each row has two columns: the first is the header (the sequence name and description), and the second is the DNA sequence. This is a common daily task in bioinformatics.
I am aware that BioPerl is the right tool for the job, but I am trying to level up my awk after reading about the RS variable. I am not sure how to use the RS and FS variables in this situation.
Thanks a lot!
# 2  
Old 12-23-2011
Try this:
Code:
awk 'BEGIN{RS=">"}{gsub("\n","",$0); print ">"$0}' file
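
A note on why this works, as a small sketch: setting RS=">" makes awk treat everything between two ">" characters as one record, so each record holds a header line plus its sequence lines. The helper name show_records and the sample data are made up for illustration.

```shell
# show_records is a hypothetical helper: it prints each record number
# together with the first line of that record.
show_records() {
    awk 'BEGIN { RS = ">" }
         { split($0, lines, "\n"); printf "%d:[%s]\n", NR, lines[1] }' "$@"
}

# Made-up sample data with two entries.
printf '>id1 first\nACGT\nTTAA\n>id2 second\nGGCC\n' | show_records
# 1:[]
# 2:[id1 first]
# 3:[id2 second]
```

Record 1 is the empty text in front of the very first ">", so the one-liner above should print a bare ">" line before the real entries unless that record is skipped.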

# 3  
Old 12-23-2011
Thanks! It worked, except that the output field separator is missing: the header and the sequence are not delimited as needed. I added OFS="\t", but it did not work.
Code:
awk 'BEGIN{RS=">"; OFS="\t"}{gsub("\n","",$0); print ">"$0}' file
-----------------------------
output is:
>seq0FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>YAL069W-1.334 Putative promoter sequenceCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACAC

Any clue?
YF
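
A note on the OFS question: awk only inserts OFS between print arguments that are separated by commas, or when it rebuilds $0 after a field has been assigned. In print ">"$0 everything is one concatenated string, so OFS never appears. With the default FS, a record is also split on every blank and newline, so forcing a rebuild with $1=$1 would put tabs between the words of the header too. The following is only a sketch of one way to combine RS, FS and OFS; the function name fasta_to_tab_fs is made up, and it assumes Unix line endings.

```shell
# Hypothetical helper: one record per ">" entry, one field per line.
fasta_to_tab_fs() {
    awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "\t" }
         NR > 1 {                                     # skip the empty record before the first ">"
             seq = ""
             for (i = 2; i <= NF; i++) seq = seq $i   # join the sequence lines
             print ">" $1, seq                        # the comma is where OFS is inserted
         }' "$@"
}
```

It could be called as fasta_to_tab_fs input.fasta > output.tab.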
# 4  
Old 12-24-2011
Maybe something like this?
Code:
awk '/^>/ && NR>1{$0=RS $0}{printf "%s", $0}END{print ""}' file

# 5  
Old 12-24-2011
You could try using a tab, instead of replacing the new line with nothing:
Code:
awk 'BEGIN{RS=">"}{gsub("\n","\t",$0); print ">"$0}' file

# 6  
Old 12-24-2011
Thanks Kato!
Your second version is much better. Is it possible to remove the tabs within the sequence field, i.e. merge the sequence into a single field instead of having it split by tabs? The idea: replace the first "\n" with "\t", but replace the second and all later "\n" with nothing. That is one step from what I want.
Merry Christmas!!!

Last edited by yifangt; 12-24-2011 at 07:39 PM.. Reason: improve the algorithm
# 7  
Old 12-25-2011
Merry Christmas! With a few improvements after @Franklin52:
Code:
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
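
One caveat with RS=">": a ">" that appears inside a description line also starts a new record. A line-by-line variant avoids that, because it only treats ">" at the start of a line as a header. This is a sketch, not a replacement for the answer above; the function name fasta_to_tab is made up, and it assumes one header line per entry and Unix line endings.

```shell
# Hypothetical helper: keeps the default RS and works line by line.
fasta_to_tab() {
    awk '/^>/ {                          # header line: start a new output row
             if (seen) printf "\n"       # close the previous row first
             printf "%s\t", $0           # header, then the tab delimiter
             seen = 1
             next
         }
         seen { printf "%s", $0 }        # sequence line: append without a newline
         END  { if (seen) printf "\n" }  # close the last row
        ' "$@"
}
```

It could be called as fasta_to_tab input.fasta > output.tab. Lines that come before the first header are ignored.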
