Post 302970739 by GDC on Monday 11th of April 2016 12:59:57 PM
Reformat Header of Variable Length

Dear Forum,

I am struggling with reformatting headers in protein sequence files. In the input file, each header (a line starting with @) contains a unique ID followed by barcode (bc) information (a, b, c, d, f, g). The header is of variable length, and some records have missing or extra barcodes.

I would like to reformat the barcode list by removing fields c1 and d1 if present. I would also like to shorten records with missing barcodes (e.g. S006) if consecutive barcodes are missing.

I tried something in awk, but it got rather complicated in order to deal with all possible cases.

Thanks for considering my question!


Code:
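awk -F "," '{
    # Full header (a,b,c,c1,d,d1,f,g): drop field 4 (c1) and field 6 (d1)
    if (NF == 8 && $0 ~ "c1:" && $0 ~ "d1:")
        print $1","$2","$3","$5","$7","$8;
    # ... (further cases elided)
    else
        print $0;    # sequence lines and unhandled headers pass through unchanged
}
'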
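
Removing c1 and d1 on its own does not seem to need one case per field count. A minimal sketch, assuming a sed that accepts -E (GNU and BSD sed do), that only header lines start with @, and that c1/d1 never appear as the very first barcode after bc=. The function name strip_extra_barcodes is made up for illustration, and this handles only the removal, not the shortening of records with missing barcodes:

```shell
# Hypothetical helper: delete every ",c1:XXX" or ",d1:XXX" field from header lines.
# Lines that do not start with "@" (the sequences) are left untouched.
strip_extra_barcodes() {
    sed -E '/^@/ s/,(c1|d1):[^,;]*//g' "$@"
}
```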

Input:
Code:
@S001;bc=a:GGT,b:GGT,c:TTG,c1:TTT,d:ACA,d1:AAA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,c1:TTT,d:ACA,d1:AAA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,d1:AAA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,c1:TTT,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG,c1:TTT,d1:AAA,f:TCC;
AWTVM...
@S006;bc=a:GGT,b:TGT,c1:TTT,d:ACA,d1:AAA,f:TCC;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...

Output:
Code:
@S001;bc=a:GGT,b:GGT,c:TTG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG;
AWTVM...
@S006;bc=a:GGT,b:TGT;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...
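
One way to get from the input to the output above without enumerating field counts is to walk the barcode list once. This is only a sketch under assumptions read off the sample data: the expected barcode order is a,b,c,d,f,g; c1 and d1 are always dropped; and a header is cut at the first barcode that does not match the next expected name (which is what S005 and S006 show), while barcodes missing only at the end (S007) leave the header as is. The function name reformat_headers is hypothetical.

```shell
# Hypothetical helper: reads files given as arguments, or stdin if none.
reformat_headers() {
    awk '
    BEGIN { n = split("a b c d f g", order, " ") }   # assumed expected barcode order
    /^@/ {
        eq = index($0, "=")
        if (eq == 0) { print; next }        # no barcode list: leave header alone
        prefix = substr($0, 1, eq)          # e.g. "@S001;bc="
        list = substr($0, eq + 1)
        sub(/;[ \t]*$/, "", list)           # drop the trailing ";"
        m = split(list, bc, ",")
        out = ""
        k = 1                               # index of the next expected barcode
        for (i = 1; i <= m; i++) {
            name = bc[i]
            sub(/:.*/, "", name)            # "c1:TTT" -> "c1"
            if (name == "c1" || name == "d1") continue
            if (k > n || name != order[k]) break   # gap found: truncate here
            out = out (out == "" ? "" : ",") bc[i]
            k++
        }
        print prefix out ";"
        next
    }
    { print }                               # sequence lines pass through
    ' "$@"
}
```

It could be called as `reformat_headers input.txt > output.txt`, where both file names are placeholders. If the truncation rule is meant to be something other than "stop at the first gap", only the `break` condition would need to change.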

 
