Post 302584570 by yifangt on Friday 23rd of December 2011 04:55:17 PM
parse fasta file to tabular file

Hello,
I have a task that is usually done with BioPerl but that I think could be done with awk: converting a FASTA file to tabular format. Note that the line length is not the same for each entry, because the entries were combined from different files.
Code:
input.fasta:

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACC
ACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAAT
ACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGG
TAATTCTAGAGCTAATACATGCTGTTGTGCCCGACTCACGAAGGGCGGTATTTATTAGATATCAGCCAATA
AGCATCTGCTATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence
ATTACCCAATCCTGACTCAGGGAGGTAGTGACAAGAAATAATGGGTCGGGGTTCTGCCCCGGGACTGCA
GGGCACCACCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

I want to convert it to tabular format like this:
Code:
output.tab:

>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACA......CACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAG......CATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence GAAACTGCGAATGGCTCA......ATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence ATTACCCAATCCTGACTC......CCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

That is, each row has two columns: the first is the header (the sequence name and description) and the second is the DNA sequence. This is a common task in day-to-day bioinformatics work.
I am aware that BioPerl is the right tool for the job, but I am trying to improve my awk, and I came across the RS variable while reading. I am not sure how to set RS and FS for this situation.
Thanks a lot!
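
One possible way to approach this with RS and FS, offered as a sketch rather than a definitive answer: set RS to ">" so that each FASTA entry becomes one awk record, and set FS to a newline so that the header is $1 and every sequence line is a separate field. The function name fasta_to_tab and the tab between the two columns are assumptions made for illustration; the sample output above does not say which separator is wanted.
Code:
```shell
# Sketch: join each FASTA record into one "header<TAB>sequence" line.
# fasta_to_tab is a hypothetical helper name; the tab separator is an assumption.
fasta_to_tab() {
  awk '
    BEGIN { RS = ">"; FS = "\n" }   # one record per entry, one field per line
    NR > 1 {                        # record 1 is whatever precedes the first ">"
      seq = ""
      for (i = 2; i <= NF; i++)     # fields 2..NF are the sequence lines
        seq = seq $i
      printf ">%s\t%s\n", $1, seq   # put back the ">" that RS consumed
    }
  ' "$@"
}
```
It could be used as: fasta_to_tab input.fasta > output.tab. Because ">" is the record separator here, this sketch would split a header in the wrong place if the description itself contained a ">" character, and it does not strip carriage returns from files with Windows line endings.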
 
