07-01-2010
Filling positions based on consensus character
I have files with hundreds of sequences with missing characters represented by a dash ("-"), something like this
Quote:
>GHL8OVD01BNNCA Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 15
T-GATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 50
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>GHL8OVD01CMQVW Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A
I need to go sequence by sequence and if a dash is found, it should be replaced with the most common character in that particular position. Thus, in my example the dash in the second sequence will be replaced by a "T". The dash in the last sequence should be replaced by an "A" -since the 4th sequence (">GHL8OVD01CMQVW") contains an A in that position and its frequency is 700, therefore is the most common character in that position in the alignment. I do not anticipate seeing a 50% of one character, let say "A" and 50% of other character for instance "T" in the same position, but if that was the case either character "A" or "T" will be ok. So, the resulting file from my example should be something like this
Quote:
>GHL8OVD01BNNCA Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 50
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>GHL8OVD01CMQVW Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGCAA
Any help will be greatly appreciated
8 More Discussions You Might Find Interesting
1. Shell Programming and Scripting
I am attempting to replace positions 44-46 with YYY if positions 48-50 = XXX.
awk -F "" '{if (substr($0,48,3)=="XXX") $44="YYY"}1' OFS="" $filename > $tempfile
But this is not working, 44-46 is still spaces in my tempfile instead of YYY. Any suggestions would be greatly appreciated. (9 Replies)
Discussion started by: halplessProblem
9 Replies
2. Shell Programming and Scripting
in my file data is like this
1,2,3
3,4,5,6,7,8
10,11,23,24
i want to make as
1,2,3,?,?,?
3,4,5,6,7,8
10,11,23,24,?,?
here max no of words(separated by comma) in a line is 6.so every line contains 6 words.Line which have less than 6 words replaced with '?' as a word
i have... (3 Replies)
Discussion started by: new2ubuntulinux
3 Replies
3. Shell Programming and Scripting
Hi
My file has a series of rows up to 160 characters in length.
There are 7 columns for each row.
In each row, column 1 starts at position 4
column 2 starts at position 12
column 3 starts at position 43
column 4 starts at position 82
column 5 starts at... (7 Replies)
Discussion started by: malts18
7 Replies
4. Shell Programming and Scripting
Greetings.
I need to extract text between two character positions, e.g: all text between character 4921 and 6534.
The text blocks are FASTA-format sequence of whole chromosomes, so basically a million A, T, G, C, combinations. E.g:
>Chr_1
ACCTGTTCAACTCTCAGGACTCTCAGGTCAACTCTCAG... (3 Replies)
Discussion started by: Twinklefingers
3 Replies
5. Shell Programming and Scripting
Hello,
For example:
12........6789101112..............20212223242526..................50 ( Positions)
LName FName DOB (Lastname starts from 1 to 6 , FName from 8 to 15 and date of birth from 21 to29)
CURTIS KENNETH ... (5 Replies)
Discussion started by: duplicate
5 Replies
6. Shell Programming and Scripting
I have two text files as shown below
cat file1.txt
Id leng sal mon
25671 34343 56565 5565
44888 56565 45554 6868
23343 23423 26226 6224
77765 88688 87464 6848
66776 23343 63463 4534
cat file2.txt
Id number
25671 34343
76767 34234
23343 23423
66776 23343 (4 Replies)
Discussion started by: halfafringe
4 Replies
7. UNIX for Dummies Questions & Answers
I have files with hundreds of sequences with frequency values reported as "Freq X" and missing characters represented by a dash ("-"), something like this
>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
T-GATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA... (12 Replies)
Discussion started by: Xterra
12 Replies
8. Shell Programming and Scripting
hi.
I have a Fixed Length text file as input where the character positions 4-5(two character positions starting from 4th position) indicates the LOB indicator. The file structure is something like below:
10126Apple DrinkOmaha
10231Milkshake New Jersey
103 Billabong Illinois
... (6 Replies)
Discussion started by: kumarjt
6 Replies