I have files with hundreds of sequences, each with a frequency value reported as "Freq X", and with missing characters represented by a dash ("-"), something like this:
I need to scan each sequence and, if a dash is found, replace it with the character that is present in >80% of the sequences at that particular position, counting each sequence by its Freq value. In the example above, the dash in the second position of the second sequence should be replaced by a "T", since "T" is found in 700 sequences according to the Freq values. Likewise, the dash in the last sequence should be replaced by an "A", since the frequency of "A" is 700, which represents >80% of the total sequences.
The expected output should be something like this:
The problem is that this code replaces each dash with the majority character regardless of the 80% rule.
Any help will be greatly appreciated.
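For reference, here is a minimal Python sketch of the 80% rule as described above. It is not the original code, and several things in it are assumptions: the input is taken as already parsed into hypothetical `(sequence, freq)` pairs, the sample data is invented for illustration, the denominator is the sum of all Freq values (including sequences that have a dash at that position), and a dash is left untouched when no character exceeds the threshold. The function name `fill_gaps` and its parameters are made up for this example.

```python
from collections import defaultdict


def fill_gaps(records, threshold=0.80, gap="-"):
    """Replace gaps with the column character whose weighted share exceeds threshold.

    records: list of (sequence, freq) pairs; all sequences have the same length.
    Assumption: the share is measured against the sum of ALL Freq values.
    """
    total = sum(freq for _, freq in records)
    length = len(records[0][0])

    # Frequency-weighted character counts for every column (gaps are not counted).
    counts = [defaultdict(int) for _ in range(length)]
    for seq, freq in records:
        for i, ch in enumerate(seq):
            if ch != gap:
                counts[i][ch] += freq

    # Per column: the character above the threshold, or None if there is none.
    consensus = []
    for col in counts:
        best = max(col, key=col.get, default=None)
        if best is not None and col[best] > threshold * total:
            consensus.append(best)
        else:
            consensus.append(None)

    # Fill a gap only when its column has a qualifying character.
    filled = []
    for seq, freq in records:
        chars = [
            consensus[i] if ch == gap and consensus[i] is not None else ch
            for i, ch in enumerate(seq)
        ]
        filled.append(("".join(chars), freq))
    return filled


# Invented sample data (total Freq = 1000), not the data from the post.
records = [
    ("ATGCA", 600),
    ("A-GCT", 100),
    ("ATG-T", 50),
    ("CTGC-", 250),
]

for seq, freq in fill_gaps(records):
    print(seq, "Freq", freq)
```

With this sample data, the dashes in the second and third sequences are filled ("T" covers 900 of 1000 and "C" covers 950 of 1000), while the dash in the last sequence stays as it is: "A" is the majority at that position with 600 of 1000, but that is below 80%. This is exactly the case where a plain majority rule and the 80% rule give different results.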
Moderator's Comments:
Please use CODE tags (not QUOTE, FONT, and COLOR tags), to display sample input, sample output, and code segments.
Last edited by Xterra; 06-20-2015 at 11:35 AM.
Reason: Fix tags.