Filling positions based on frequency
Post 302947664 by Don Cragun on Saturday 20th of June 2015 02:54:19 AM
Quote:
Originally Posted by Xterra
That's is uncalled for. Please read my posts and then comment. I have not asked/posted anything for >2 years.
I just reread all of the last 25 posts you have submitted (going back to Oct 27th, 2011) and in those 25 posts there is not one line of code that you have shown us that you have written on your own.

It is true that your last post originally contained some code that you had written, but you edited your post and removed that code, so I can't count that. (I admit that I didn't look at the rest of your posts to try to determine if you had posted and removed other code.)

If you are unwilling to show us that you are making any attempt to write code for yourself and just ask us to write code for you, what about my statement:
Quote:
With well over 200 posts we would hope that you have learned something from all of the sample code we have provided. The UNIX & Linux Forums is here to help you learn how to use the tools supplied on UNIX, Linux, and other similar systems; not to act as your personal, unpaid programming staff.
was uncalled for? If you were in the habit of showing us your work and asking us to help fix parts that didn't work in your earlier posts, why did you stop? We are happy to help you learn why your code doesn't work. And, if you do try to write code on your own, you'll learn how to do it yourself MUCH faster.

Now, moving on to the problem at hand. With your sample input, there are 795 samples. So, to get more than 90%, you need at least 716 samples (90% of 795 is 715.5) with a single non-dash value. For character position 47 (where there is a dash in the last input line), the counts are:
Code:
-  17
A 700
C   0
G   0
T  78

and, therefore, by your rules, the dash in that line should not be changed: the most common value, A, has only 700 samples, which is less than the required 90%.
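
If you want to check counts like these for any character position, something along the following lines should work. Treat it as a sketch only: the function name count_at and the variable sample are made up for this illustration, and the six records are copied from the output listed later in this post, which may differ from your original Input.txt at positions where dashes were filled in.

```shell
# Hypothetical helper: print the Freq-weighted count of each character at
# one position.  Usage: count_at POSITION < file
count_at() {
	awk -v pos="$1" '
	$2 == "Freq" { cnt = $3; tot += cnt; next }	# header: remember weight
	{ cc[substr($0, pos, 1)] += cnt }		# sequence: add weight
	END {
		printf("total=%d threshold=%.1f\n", tot, tot * .9)
		n = split("- A C G T", ch, " ")
		for(k = 1; k <= n; k++) printf("%s %3d\n", ch[k], cc[ch[k]])
	}'
}

# Sample records (assumption: taken from the output shown in this post).
sample='>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>sample2 Freq 50
TCGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>1-2sample Freq 9
TCGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>HL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A'

printf '%s\n' "$sample" | count_at 47
```

With that data, the first line reports a total of 795 and a threshold of 715.5, followed by the same five counts listed above.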

If you agree that the last line should not change, the following code seems to do what you want:
Code:
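awk '
# Start of the second pass (first line of the second copy of the input):
# turn the weighted counts gathered in the first pass into replacements.
FNR == 1 && cnt {
	p90 = tot * .9		# a value must exceed 90% of all samples
#	printf("p90=%f, tot=%d\n", p90, tot)
#	for(i in cc) printf("cc[%s]=%d\n", i, cc[i])
	for(i in cc) {
		if(cc[i] > p90) {
			# The index is position SUBSEP character; split it.
			off = index(i, SUBSEP)
			rep[substr(i, 1, off - 1)] = substr(i, off + 1)
		}
		delete cc[i]
	}
#	for(i in rep) printf("rep[%s]=%s\n", i, rep[i])
}
# First pass: add up the Freq-weighted count of each non-dash character
# at each position.
FNR == NR {
	if($2 == "Freq") {
		cnt = $3	# weight for the sequence line that follows
		tot += cnt
	} else	for(i = length($0); i > 0; i--) {
			if((c = substr($0, i, 1)) == "-") continue
			cc[i, c] += cnt
		}
	next
}
# Second pass: on sequence lines, fill each dash that has a replacement.
NF == 1 {
	for(i = length($0); i > 0; i--)
		if((substr($0, i, 1) == "-") && (i in rep))
			$0 = (i > 1 ? substr($0, 1, i - 1) : "") rep[i] \
				substr($0, i + 1)
}
1' Input.txt Input.txt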

producing the output:
Code:
>39sample Freq 4
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTAGCAGCACTA
>22sample Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTA
>sample2 Freq 50
TCGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCTA
>sample1 Freq 700
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGAC
>1-2sample Freq 9
TCGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTC
>HL8OVD01AV2U9 Freq 17
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGC-A

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

With large input files (especially when there are very few dashes in the input), the last for loop in this script could be made much more efficient. For files with a few hundred input lines with line lengths around 50 characters, it might not be worth the time needed to improve and test it.

If you think it needs to run faster and think fixing this loop would help, feel free to look for dashes instead of evaluating each character position. If you try enhancing this script to do that and run into problems, we'll be happy to help you debug your code.
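
As a starting point, one possible shape for that change is sketched below. It is an illustration, not a drop-in replacement: the function name fill, the wrapper fill_demo, and the hard-coded rep[3] = "G" are all invented for this example; in the real script, rep would be built from the counts as in the code above. The idea is that index() jumps straight to the next dash, so a line without dashes costs a single call instead of one substr() per character.

```shell
# Hypothetical demo wrapper: reads lines on standard input and fills dashes
# using a made-up replacement table.
fill_demo() {
	awk '
	# Return s with each dash replaced by rep[position] when one exists.
	# out, p, and pos are local variables.
	function fill(s, rep,    out, p, pos) {
		out = ""
		while((p = index(s, "-")) > 0) {
			pos = length(out) + p	# position in the original line
			out = out substr(s, 1, p - 1) ((pos in rep) ? rep[pos] : "-")
			s = substr(s, p + 1)	# keep searching after this dash
		}
		return out s
	}
	BEGIN { rep[3] = "G" }	# assumption: stand-in for the computed table
	{ print fill($0, rep) }'
}

printf 'AC-T-A\n' | fill_demo
```

Here the dash at position 3 is filled from rep, while the dash at position 5 has no entry in rep and is left alone.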

Last edited by Don Cragun; 06-20-2015 at 05:04 AM. Reason: Fix bug in code noted in next post.