Filling positions based on consensus character


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Filling positions based on consensus character
# 1  
Old 07-01-2010
Filling positions based on consensus character

I have files with hundreds of sequences with missing characters represented by a dash ("-"), something like this
Quote:
>GHL8OVD01BNNCA Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 15
T-GATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 50
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>GHL8OVD01CMQVW Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A
I need to go sequence by sequence and if a dash is found, it should be replaced with the most common character in that particular position. Thus, in my example the dash in the second sequence will be replaced by a "T". The dash in the last sequence should be replaced by an "A" -since the 4th sequence (">GHL8OVD01CMQVW") contains an A in that position and its frequency is 700, therefore is the most common character in that position in the alignment. I do not anticipate seeing a 50% of one character, let say "A" and 50% of other character for instance "T" in the same position, but if that was the case either character "A" or "T" will be ok. So, the resulting file from my example should be something like this
Quote:
>GHL8OVD01BNNCA Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>GHL8OVD01CMQVT Freq 50
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>GHL8OVD01CMQVW Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGCAA
Any help will be greatly appreciated
# 2  
Old 07-01-2010
For now AWK.. If it happens to be too slow, then I can come up with some Perl, but not until tomorow (note that this is one huge command, you might consider breaking it into awk scripts):
Code:
awk '/>/{fr=$3;getline;n=split ($0,a,""); for (i=1;i<=n;i++) b[i"-"a[i]]+=fr}\
END{for (i in b) {split (i,c,"-"); if (d[c[1]]<=b[i]){e[c[1]]=c[2];d[c[1]]=b[i]}}\
for (i in e) print i" "e[i]}' file | awk 'NR==FNR{a[$1]=$2;next}\
{n=split($0,b,"");for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i]; for (i=1;i<=n;i++) printf b[i];\
printf "\n"}' - file > outfile

# 3  
Old 07-01-2010
Thanks once again for your help. This is what I am getting
Code:
awk: cmd. line:1: {n=split($0,b,"");for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i]; for (i=1;i<=n;i++) printf b[i];\
awk: cmd. line:1: ^ syntax error

As a result, the Outputfile is empty.
I used the code in its entire and by the 2 individual AWKs.
Code:
$ awk '/>/{fr=$3;getline;n=split ($0,a,""); for (i=1;i<=n;i++) b[i"-"a[i]]+=fr}\
END{for (i in b) {split (i,c,"-"); if (d[c[1]]<=b[i]){e[c[1]]=c[2];d[c[1]]=b[i]}}\
for (i in e) print i" "e[i]}' Input.txt | awk 'NR==FNR{a[$1]=$2;next}\
{n=split($0,b,"");for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i]; for (i=1;i<=n;i++) printf b[i];\
printf "\n"}' - Input.txt > Output.txt

Perhaps Perl?
# 4  
Old 07-02-2010
Try AWK again, but this time, 1.awk:
Code:
/>/{fr=$3;
getline;
n=split ($0,a,""); 
for (i=1;i<=n;i++) b[i"-"a[i]]+=fr}
END{for (i in b) {split (i,c,"-"); 
if (d[c[1]]<=b[i]) {e[c[1]]=c[2];
d[c[1]]=b[i]}}
for (i in e) print i" "e[i]}

2.awk:
Code:
NR==FNR{a[$1]=$2;next}
{n=split($0,b,"");
for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i]; 
for (i=1;i<=n;i++) printf b[i];
printf "\n"}

command:
Code:
awk -f 1.awk Input.txt | awk -f 2.awk - Input.txt > Output.txt

# 5  
Old 07-02-2010
I am still getting an error

This is what I get
Code:
$ awk -f 1.awk Input.txt | awk -f 2.awk - Input.txt > Output.txt
awk: 2.awk:3: for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i];
awk: 2.awk:3:                                      ^ syntax error

So I ran it in two steps the first awk
Code:
$ awk -f 1.awk Input.txt > Intermediate.txt

then the second awk
Code:
$ awk -f 2.awk - Intermediate.txt > Output.txt

And I get the same error
Code:
awk: 2.awk:3: for (i=1;i<=n;i++) if (b[i]=="-" b[i]=a[i];
awk: 2.awk:3:                                      ^ syntax error

I am also attaching the input and intermediate files -the out outfile is empty
Thanks!

Last edited by Xterra; 07-02-2010 at 12:46 PM..
# 6  
Old 07-02-2010
Sorry, I missed parenthes in 2.awk. It should be: 2.awk
Code:
NR==FNR{a[$1]=$2;next}
{n=split($0,b,"");
for (i=1;i<=n;i++) if (b[i]=="-") b[i]=a[i]; 
for (i=1;i<=n;i++) printf b[i];
printf "\n"}

This User Gave Thanks to bartus11 For This Post:
# 7  
Old 07-02-2010
Fantastic!

bartus11,

Thanks one more time! It works great!
Cheers,

X
Login or Register to Ask a Question

Previous Thread | Next Thread

8 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Filter lines based on values at specific positions

hi. I have a Fixed Length text file as input where the character positions 4-5(two character positions starting from 4th position) indicates the LOB indicator. The file structure is something like below: 10126Apple DrinkOmaha 10231Milkshake New Jersey 103 Billabong Illinois ... (6 Replies)
Discussion started by: kumarjt
6 Replies

2. UNIX for Dummies Questions & Answers

Filling positions based on frequency

I have files with hundreds of sequences with frequency values reported as "Freq X" and missing characters represented by a dash ("-"), something like this >39sample Freq 4 TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA >22sample Freq 15 T-GATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA... (12 Replies)
Discussion started by: Xterra
12 Replies

3. Shell Programming and Scripting

Join based on positions

I have two text files as shown below cat file1.txt Id leng sal mon 25671 34343 56565 5565 44888 56565 45554 6868 23343 23423 26226 6224 77765 88688 87464 6848 66776 23343 63463 4534 cat file2.txt Id number 25671 34343 76767 34234 23343 23423 66776 23343 (4 Replies)
Discussion started by: halfafringe
4 Replies

4. Shell Programming and Scripting

Sort based on positions in flat file

Hello, For example: 12........6789101112..............20212223242526..................50 ( Positions) LName FName DOB (Lastname starts from 1 to 6 , FName from 8 to 15 and date of birth from 21 to29) CURTIS KENNETH ... (5 Replies)
Discussion started by: duplicate
5 Replies

5. Shell Programming and Scripting

Extract text between two character positions

Greetings. I need to extract text between two character positions, e.g: all text between character 4921 and 6534. The text blocks are FASTA-format sequence of whole chromosomes, so basically a million A, T, G, C, combinations. E.g: >Chr_1 ACCTGTTCAACTCTCAGGACTCTCAGGTCAACTCTCAG... (3 Replies)
Discussion started by: Twinklefingers
3 Replies

6. Shell Programming and Scripting

seds to extract fields based on positions

Hi My file has a series of rows up to 160 characters in length. There are 7 columns for each row. In each row, column 1 starts at position 4 column 2 starts at position 12 column 3 starts at position 43 column 4 starts at position 82 column 5 starts at... (7 Replies)
Discussion started by: malts18
7 Replies

7. Shell Programming and Scripting

filling a character

in my file data is like this 1,2,3 3,4,5,6,7,8 10,11,23,24 i want to make as 1,2,3,?,?,? 3,4,5,6,7,8 10,11,23,24,?,? here max no of words(separated by comma) in a line is 6.so every line contains 6 words.Line which have less than 6 words replaced with '?' as a word i have... (3 Replies)
Discussion started by: new2ubuntulinux
3 Replies

8. Shell Programming and Scripting

awk script replace positions if certain positions equal prescribed value

I am attempting to replace positions 44-46 with YYY if positions 48-50 = XXX. awk -F "" '{if (substr($0,48,3)=="XXX") $44="YYY"}1' OFS="" $filename > $tempfile But this is not working, 44-46 is still spaces in my tempfile instead of YYY. Any suggestions would be greatly appreciated. (9 Replies)
Discussion started by: halplessProblem
9 Replies
Login or Register to Ask a Question