parse fasta file to tabular file

12-23-2011


Hello,
This is a problem I would normally solve with BioPerl, but I think it can be done with awk: convert a FASTA file to tabular format. Note that the line length is not the same for each entry, because the entries were combined from different files.

Code:

input.fasta:

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACC
ACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAAT
ACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGG
TAATTCTAGAGCTAATACATGCTGTTGTGCCCGACTCACGAAGGGCGGTATTTATTAGATATCAGCCAATA
AGCATCTGCTATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence
ATTACCCAATCCTGACTCAGGGAGGTAGTGACAAGAAATAATGGGTCGGGGTTCTGCCCCGGGACTGCA
GGGCACCACCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

I want to convert it to the tabular format as:

Code:

output.tab:

>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACA......CACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAG......CATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence GAAACTGCGAATGGCTCA......ATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence ATTACCCAATCCTGACTC......CCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

i.e. each row has two columns: the first is the header (sequence name and description), and the second is the DNA sequence. This is a common day-to-day task in bioinformatics.
I am aware that BioPerl is the right tool for the job, but I am trying to improve my awk and have just read about the RS variable. I am not sure how to set RS and FS for this situation.
Thanks a lot!

yifangt
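
On the RS and FS question: one way to look at it is to let RS=">" split the file into one record per entry and FS="\n" split each record into lines, so that $1 is the header and fields 2..NF are the sequence lines. The sketch below assumes that ">" never appears inside a header description and that lines end in a plain newline; the function name fasta_fields and the sample data are made up for illustration.

```shell
# Hypothetical helper: reads FASTA on stdin, writes header<TAB>sequence rows.
fasta_fields() {
  awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "\t" }
       NR > 1 {                       # record 1 is the empty text before the first ">"
         seq = ""
         for (i = 2; i <= NF; i++)    # fields 2..NF are the sequence lines
           seq = seq $i               # join them with nothing in between
         print ">" $1, seq            # the comma makes awk write OFS (a tab)
       }'
}

# Toy input, not taken from the thread
printf '>seq1 first entry\nACGT\nTTGA\n>seq2 second entry\nGGCC\n' | fasta_fields
```

Each output row should hold the header, one tab, and the joined sequence. The trailing newline of every record becomes an empty last field, which is harmless because appending an empty string changes nothing.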


12-23-2011


try this:

Code:

awk 'BEGIN{RS=">"}{gsub("\n","",$0); print ">"$0}' file

kato


12-23-2011


Thanks! It worked, except that the separator is missing: the header and the sequence are not delimited as needed. I added OFS="\t", but it had no effect.

Code:

awk 'BEGIN{RS=">"; OFS="\t"}{gsub("\n","",$0); print ">"$0}' file
-----------------------------
output is:
>seq0FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>YAL069W-1.334 Putative promoter sequenceCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACAC

Any clue?
YF
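
On why OFS seemed to do nothing: awk writes OFS only between comma-separated arguments to print, or when it rebuilds $0 after a field has been assigned. In print ">"$0 the two pieces are concatenated into a single string, so OFS is never used; in addition, the gsub has already removed the newline that separated the header from the sequence. A minimal sketch of the difference, using a toy one-line input and hypothetical function names:

```shell
# Concatenation: print receives one string, so OFS is never inserted.
concat_demo()  { echo 'header ACGT' | awk 'BEGIN { OFS = "\t" } { print $1 $2 }'; }

# Comma: print receives two arguments and writes OFS (a tab) between them.
comma_demo()   { echo 'header ACGT' | awk 'BEGIN { OFS = "\t" } { print $1, $2 }'; }

# Assigning to a field makes awk rebuild $0 with OFS between the fields.
rebuild_demo() { echo 'header ACGT' | awk 'BEGIN { OFS = "\t" } { $1 = $1; print }'; }

concat_demo    # prints: headerACGT
comma_demo     # prints: header<TAB>ACGT
rebuild_demo   # prints: header<TAB>ACGT
```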

yifangt


12-24-2011


Maybe something like this?

Code:

awk '/^>/ && NR>1{$0=RS $0}{printf "%s", $0}END{print ""}' file

Franklin52


12-24-2011


You could try replacing each newline with a tab instead of with nothing:

Code:

awk 'BEGIN{RS=">"}{gsub("\n","\t",$0); print ">"$0}' file

kato


12-24-2011


Thanks Kato!
Your second version is much better. Is it possible to remove the tabs within the sequence field, i.e. merge the sequence lines into a single field instead of separating them with tabs? The idea would be to replace the first "\n" with "\t", and the second and later "\n" with nothing. That is one step away from what I want.
Merry Christmas!!!

Last edited by yifangt; 12-24-2011 at 07:39 PM.. Reason: improve the algorithm

yifangt


12-25-2011


Merry Christmas! With a few improvements after @Franklin52:

Code:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file

kato
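
For comparison, a line-by-line variant that leaves RS at its default might be worth keeping around. It is only a sketch, not a replacement for the one-liner above: it assumes every entry starts with a line beginning with ">", it tolerates ">" inside a description, and it strips a trailing carriage return in case some of the combined files came from Windows. The function name fasta_to_tab and the sample data are made up.

```shell
# Hypothetical helper: reads FASTA on stdin, writes header<TAB>sequence rows.
fasta_to_tab() {
  awk '{ sub(/\r$/, "") }                 # drop a trailing CR from CRLF files
       /^>/ {                             # a header line starts a new row
         if (seen++) printf "\n"          # close the previous row, if any
         printf "%s\t", $0                # header, then the column separator
         next
       }
       { printf "%s", $0 }                # sequence line: append, no newline
       END { if (seen) printf "\n" }'     # terminate the last row
}

# Toy input with a ">" inside the description and CRLF endings in the first entry
printf '>a one > two\r\nAC\r\nGT\r\n>b\nTT\n' | fasta_to_tab
```

Using printf "%s" rather than passing the line as the format string keeps a literal % in a header from being interpreted as a format specifier.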
