05-12-2011
Parse and Join in a text file
I want to parse a text file and join selected fields into a specific format. Please suggest how to get this done.
Quote:
ID US88811111-0005
OO giensis
OS giensis
SN US74811111
PT I-008, testing for the second phase
PA sandiego group, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
//
ID US74811111-0005
OO giensis
OS giensis
SN US74811111
PT I-003, a gene and methods for its use
PA NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//
The output should be in FASTA format: for each record, a header line built from the ID, PT, and PA lines, followed by the sequence. The two slashes ("//") are the dividing lines between two different sequences.
Quote:
>US88811111-0005 ; I-008, testing for the second phase ; sandiego group, NC
QAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDEL KFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYT
>US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRL EGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLP VGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
I have 50,000 records like this in a single file, all of which should be converted to FASTA format.
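One possible approach, as a minimal Python sketch rather than a definitive solution. It assumes every record ends with "//", that tagged lines start with one of the two-character tags seen in the sample (ID, OO, OS, SN, PT, PA, PI, PR, PE, P1) followed by a space, and that any other non-empty line is sequence data. The names `to_fasta`, `TAGS`, and `HEADER_TAGS` are made up for this illustration. Sequence lines are passed through unchanged, so the output matches the layout in the example above.

```python
# Two-character tags seen in the sample records. Any other non-empty line is
# treated as sequence data (an assumption based on the two records shown).
TAGS = {"ID", "OO", "OS", "SN", "PT", "PA", "PI", "PR", "PE", "P1"}

# Fields joined with " ; " to build the FASTA header line.
HEADER_TAGS = ("ID", "PT", "PA")


def _emit(fields, sequence):
    """Yield the FASTA lines for one record; skip records without an ID."""
    if "ID" in fields:
        yield ">" + " ; ".join(fields.get(tag, "") for tag in HEADER_TAGS)
        yield from sequence


def to_fasta(lines):
    """Yield FASTA lines for each '//'-terminated record in `lines`."""
    fields, sequence = {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # End of record: write it out and start a fresh one.
            yield from _emit(fields, sequence)
            fields, sequence = {}, []
            continue
        tag, _, value = line.partition(" ")
        if tag in TAGS:
            fields[tag] = value.strip()
        else:
            sequence.append(line)
    # Also flush a final record that is missing its closing "//".
    yield from _emit(fields, sequence)
```

To use it, pass any iterable of lines, for example `for line in to_fasta(open("input.txt")): print(line)`. The file is read line by line, so 50,000 records should not need to fit in memory at once.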
Bio::SeqIO::tab(3pm) User Contributed Perl Documentation Bio::SeqIO::tab(3pm)
NAME
Bio::SeqIO::tab - nearly raw sequence file input/output stream. Reads/writes id"\t"sequence"\n"
SYNOPSIS
Do not use this module directly. Use it via the Bio::SeqIO class.
DESCRIPTION
This object can transform Bio::Seq objects to and from tabbed flat file databases.
It is very useful when doing large scale stuff using the Unix command line utilities (grep, sort, awk, sed, split, you name it). Imagine
that you have a format converter 'seqconvert' along the following lines:
my $in = Bio::SeqIO->newFh(-fh => *STDIN , '-format' => $from);
my $out = Bio::SeqIO->newFh(-fh=> *STDOUT, '-format' => $to);
print $out $_ while <$in>;
then you can very easily filter sequence files for duplicates as:
$ seqconvert < foo.fa -from fasta -to tab | sort -u |
seqconvert -from tab -to fasta > foo-unique.fa
Or grep [-v] for certain sequences with:
$ seqconvert < foo.fa -from fasta -to tab | grep -v '^S[a-z]*control' |
seqconvert -from tab -to fasta > foo-without-controls.fa
Or chop up a huge file with sequences into smaller chunks with:
$ seqconvert < all.fa -from fasta -to tab | split -l 10 - chunk-
$ for i in chunk-*; do seqconvert -from tab -to fasta < $i > $i.fa; done
# (this creates files chunk-aa.fa, chunk-ab.fa, ..., each containing 10
# sequences)
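For readers without Bioperl at hand, the tab round trip can be approximated in plain Python. This is only a sketch of the idea, not the module's implementation: the function names `fasta_to_tab` and `tab_to_fasta` are hypothetical, the whole header text after ">" is used as the id, and the header is assumed to contain no tab character.

```python
def fasta_to_tab(lines):
    """Yield one 'id<TAB>sequence' string per FASTA record."""
    seq_id, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id + "\t" + "".join(parts)
            seq_id, parts = line[1:].strip(), []
        elif seq_id is not None:
            parts.append(line)
    if seq_id is not None:
        yield seq_id + "\t" + "".join(parts)


def tab_to_fasta(rows):
    """Yield FASTA lines (header, then sequence) for each tabbed row."""
    for row in rows:
        seq_id, _, sequence = row.rstrip("\n").partition("\t")
        yield ">" + seq_id
        yield sequence
```

With these, `tab_to_fasta(sorted(set(fasta_to_tab(lines))))` plays roughly the role of the `sort -u` pipeline above, although Python's string ordering may differ from the locale-dependent ordering of `sort`.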
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Support
Please direct usage questions or support issues to the mailing list:
bioperl-l@bioperl.org
rather than to the module maintainer directly. Many experienced and responsive experts will be able to look at the problem and quickly address
it. Please include a thorough description of the problem with code and data examples if at all possible.
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the
web:
https://redmine.open-bio.org/projects/bioperl/
AUTHORS
Philip Lijnzaad, p.lijnzaad@med.uu.nl
APPENDIX
The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _
next_seq
Title : next_seq
Usage : $seq = $stream->next_seq()
Function: returns the next sequence in the stream
Returns : Bio::Seq object
Args :
write_seq
Title : write_seq
Usage : $stream->write_seq($seq)
Function: writes the $seq object into the stream
Returns : 1 for success and 0 for error
Args : Bio::Seq object
perl v5.14.2 2012-03-02 Bio::SeqIO::tab(3pm)