Filling positions based on frequency

06-19-2015

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

I have files with hundreds of sequences, with frequency values reported as "Freq X" and missing characters represented by a dash ("-"), something like this:

Code:

>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
T-GATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>sample2 Freq 50
TCGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>1-2sample Freq 9
TCGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>HL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A

I need to scan each sequence and, if a dash is found, replace it with the character that is present at >80% in that particular position, with each sequence weighted by its Freq value. In the example above, the dash at the second position of the second sequence should be replaced by a "T", since "T" is found at that position in 717 of the 795 total sequences (700 + 17 according to the Freq values). Likewise, the dash in the last sequence should be replaced by an "A", since the frequency of "A" at that position is 700, representing >80% of the total sequences.
The expected output should be something like this:

Code:

>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>sample2 Freq 50
TCGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>1-2sample Freq 9
TCGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>HL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGCAA

Code:

$ awk '/>/{fr=$3;getline;n=split ($0,a,""); for (i=1;i<=n;i++) b[i"-"a[i]]+=fr}\
END{for (i in b) {split (i,c,"-"); if (d[c[1]]<=b[i]){e[c[1]]=c[2];d[c[1]]=b[i]}}\
for (i in e) print i" "e[i]}' Input | awk 'NR==FNR{a[$1]=$2;next}\
{n=split($0,b,"");for (i=1;i<=n;i++) if (b[i]=="-") b[i]=a[i]; for (i=1;i<=n;i++) printf b[i];\
printf "\n"}' - Input

The problem is that this code replaces every dash with the majority character at that position, regardless of the 80% rule.
Any help will be greatly appreciated.

Moderator's Comments:

Please use CODE tags (not QUOTE, FONT, and COLOR tags), to display sample input, sample output, and code segments.

Last edited by Xterra; 06-20-2015 at 11:35 AM.. Reason: Fix tags.

Xterra

06-20-2015

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

What is supposed to happen if there is a dash in an input line and no other character in that position occurs more than 90% of the time? For example, if the input file had these two lines:

Code:

>sample1 Freq 700
-TGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC

instead of:

Code:

>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC

what should the output file contain for these two input lines?

Don Cragun

06-20-2015

Then nothing is done for that particular position. The dash must be retained and the search for the next dash continues.
Thanks!

Xterra

06-20-2015

And what have you tried?

With well over 200 posts we would hope that you have learned something from all of the sample code we have provided. The UNIX & Linux Forums is here to help you learn how to use the tools supplied on UNIX, Linux, and other similar systems; not to act as your personal, unpaid programming staff.

Moderator's Comments:

And, please, use CODE tags for sample input, sample output, and sample code segments; not QUOTE tags, not FONT tags, not SIZE tags, and only use COLOR tags to highlight changing or unusual data or code.

Don Cragun

06-20-2015

Quote:

not to act as your personal, unpaid programming staff.

That is uncalled for. Please read my posts and then comment. I have not asked/posted anything for >2 years.

Last edited by Xterra; 06-20-2015 at 01:31 AM..

Xterra

06-20-2015

Quote:

Originally Posted by Xterra

That is uncalled for. Please read my posts and then comment. I have not asked/posted anything for >2 years.

I just reread all of the last 25 posts you have submitted (going back to Oct 27th, 2011) and in those 25 posts there is not one line of code that you have shown us that you have written on your own.

It is true that your last post originally contained some code that you had written, but you edited your post and removed that code; so I can't count that. (I admit that I didn't look at the rest of your posts to try to determine if you had posted and removed other code.)

If you are unwilling to show us that you are making any attempt to write code for yourself and just ask us to write code for you; what about my statement:

Quote:

With well over 200 posts we would hope that you have learned something from all of the sample code we have provided. The UNIX & Linux Forums is here to help you learn how to use the tools supplied on UNIX, Linux, and other similar systems; not to act as your personal, unpaid programming staff.

was uncalled for? If you were in the habit of showing us your work and asking us to help fix parts that didn't work in your earlier posts, why did you stop? We are happy to help you learn why your code doesn't work. And, if you do try to write code on your own, you'll learn how to do it yourself MUCH faster.

Now, moving on to the problem at hand. With your sample input, there are 795 samples. So, to get more than 90% (that is, more than 715.5), you need at least 716 samples with a single non-dash value. For character position 47 (where there is a dash in the last input line), the counts are:

Code:

A: 700
T: 78

and, therefore, by your rules, the dash in that line should not be changed.
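The tally for a single position can be reproduced with a small helper. This is only a sketch: `count_at` is a made-up name, and it assumes the same layout as the sample input (a header line whose second field is `Freq`, followed by one sequence line).

```shell
# Hypothetical helper: print the Freq-weighted count of each non-dash
# character at 1-based position $1 in file $2, plus the total weight.
count_at() {
	awk -v pos="$1" '
	$2 == "Freq" { cnt = $3; tot += cnt; next }	# header: remember its weight
	{
		c = substr($0, pos, 1)
		if (c != "" && c != "-") cc[c] += cnt	# dashes are not counted
	}
	END {
		for (c in cc) printf("%s %d\n", c, cc[c])
		printf("total %d\n", tot)
	}' "$2" | LC_ALL=C sort
}
```

With the sample input from the first post, `count_at 47 Input.txt` prints `A 700`, `T 78`, and `total 795`; 700 is below the 715.5 needed to exceed 90%.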

If you agree that the last line should not change, the following code seems to do what you want:

Code:

awk '
FNR == 1 && cnt {
	p90 = tot * .9
#	printf("p90=%f, tot=%d\n", p90, tot)
#	for(i in cc) printf("cc[%s]=%d\n", i, cc[i])
	for(i in cc) {
		if(cc[i] > p90) {
			off = index(i, SUBSEP)
			rep[substr(i, 1, off - 1)] = substr(i, off + 1)
		}
		delete cc[i]
	}
#	for(i in rep) printf("rep[%s]=%s\n", i, rep[i])
}
FNR == NR {
	if($2 == "Freq") {
		cnt = $3
		tot += cnt
	} else	for(i = length($0); i > 0; i--) {
			if((c = substr($0, i, 1)) == "-") continue
			cc[i, c] += cnt
		}
	next
}
NF == 1 {
	for(i = length($0); i > 0; i--)
		if((substr($0, i, 1) == "-") && (i in rep))
			$0 = (i > 1 ? substr($0, 1, i - 1) : "") rep[i] \
				substr($0, i + 1)
				
}
1' Input.txt Input.txt

producing the output:

Code:

>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>sample2 Freq 50
TCGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>1-2sample Freq 9
TCGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>HL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

With large input files (especially when there are very few dashes in the input), the last for loop in this script could be made much more efficient. For files with a few hundred input lines with line lengths around 50 characters, it might not be worth the time needed to improve and test it.

If you think it needs to run faster and think fixing this loop would help, feel free to look for dashes instead of evaluating each character position. If you try enhancing this script to do that and run into problems, we'll be happy to help you debug your code.
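As a starting point, here is one way the "look for dashes" idea might be sketched. It is not a tested replacement for the script above: `fill_dashes` is a hypothetical stand-alone helper, and it takes the replacement table as a comma-separated list of `pos=char` pairs purely for illustration (in the full script that table is the `rep` array). It uses `index()` to jump from one dash to the next instead of testing every character position.

```shell
# Hypothetical helper: fill dashes in sequence $2 using the replacement
# table $1, given as comma-separated "pos=char" pairs (illustration only).
fill_dashes() {
	awk -v spec="$1" -v seq="$2" 'BEGIN {
		n = split(spec, pair, ",")
		for (k = 1; k <= n; k++) {
			split(pair[k], kv, "=")
			rep[kv[1]] = kv[2]
		}
		out = ""; base = 0; rest = seq
		# Jump from dash to dash; lines without dashes skip the loop entirely.
		while ((p = index(rest, "-")) > 0) {
			pos = base + p		# position of this dash in the original line
			out = out substr(rest, 1, p - 1) ((pos in rep) ? rep[pos] : "-")
			rest = substr(rest, p + 1)
			base = pos
		}
		print out rest
	}'
}

fill_dashes "2=T" "T-GAT-G"	# prints TTGAT-G
```

The dash at position 2 has an entry in the table and is replaced; the dash at position 6 has none and is retained.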

Last edited by Don Cragun; 06-20-2015 at 05:04 AM.. Reason: Fix bug in code noted in next post.

Don Cragun

06-20-2015

I apologize. Further testing showed that I have some code backwards. Testing with the sample data didn't catch it. Testing it with more data looking for corner cases made it immediately obvious that the two lines close to the end of the script:

Code:

                        $0 = (i > 1 ? substr($0, i - 1, 1) : "") rep[i] \
                                substr($0, i + 1)

should be changed to:

Code:

                        $0 = (i > 1 ? substr($0, 1, i - 1) : "") rep[i] \
                                substr($0, i + 1)

Note: This problem has now been fixed in the previous post, so if you download the entire script now, you won't have to patch it.
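To see the difference between the two forms in isolation, a small demo may help. `splice_demo` is a hypothetical name; it applies both expressions to a string `s`, a dash position `i`, and a replacement character `r` supplied on the command line.

```shell
# Hypothetical demo: splice replacement r over position i of line s,
# once with the reversed substr() arguments and once with the corrected ones.
splice_demo() {
	awk -v s="$1" -v i="$2" -v r="$3" 'BEGIN {
		wrong = (i > 1 ? substr(s, i - 1, 1) : "") r substr(s, i + 1)
		right = (i > 1 ? substr(s, 1, i - 1) : "") r substr(s, i + 1)
		print "wrong: " wrong
		print "right: " right
	}'
}

splice_demo "TTGATGTGC-A" 10 A
```

This prints `wrong: CAA` and `right: TTGATGTGCAA`: the reversed arguments keep only the single character before the dash instead of everything before it. For a dash at position 2 the two expressions give the same result, which is consistent with the sample data not exposing the bug.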

Last edited by Don Cragun; 06-20-2015 at 05:05 AM.. Reason: Add note.

Don Cragun
