Gaps and frequencies

07-15-2015

I have this infile:

Code:

>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGT-GCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGT-GCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGC-TA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCA-TACCAG-AC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTA-AGGACC-TC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTCGCAGCGTTA 
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCTTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCA-TACCAGTAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTA-AGGACCTTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTCGCAGCGTTA

I need to remove the columns where the majority of characters (let's say >50%) are dashes.
However, I must take the frequency of each entry (Freq XX) into account when computing that percentage.
If dashes do not make up >50% of a given column, the column should not be changed, and the alignment should be scanned until the next column with dashes is found.
Thus, I should end up with this outfile:

Code:

>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTGCAGCATA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTGCAGCATA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTA

I have modified the following code:

Code:

 
perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5) {print $i}}}' infile | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}FNR!=NR' - infile

However, I cannot get it to take the "Freq" value into account, so I get the following output instead:

Code:

>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTGCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTGCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGC-TA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAG-AC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACC-TC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTTA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGTAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTTA
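
The two outputs differ only in column 46, and the arithmetic behind that can be checked by hand. Below is a minimal sketch in Python (not part of the original one-liner) that compares the unweighted and the Freq-weighted share of dashes in the two gapped columns, 39 and 46. The entry numbers were read off the infile above; the helper name `gap_share` is made up for illustration.

```python
from fractions import Fraction

# Freq values of the ten entries, in infile order.
freqs = [10, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# 1-based entry numbers that have a dash in each column (read off the infile).
dash_entries = {39: [1, 2, 4, 5, 8, 9], 46: [1, 2, 3, 4, 5]}


def gap_share(entries, weights):
    """Share of dashes in a column when entry e counts weights[e - 1] times."""
    return Fraction(sum(weights[e - 1] for e in entries), sum(weights))


for col, entries in dash_entries.items():
    unweighted = gap_share(entries, [1] * len(freqs))  # every entry counts once
    weighted = gap_share(entries, freqs)               # entries count Freq times
    print(col, unweighted, weighted)
```

Counting every entry once, column 46 is exactly 5/10 dashes, which is not more than 50%, so it survives. Weighted by Freq it is 14/19, so it should be removed, as column 39 (15/19) is.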

Any help will be greatly appreciated.

Last edited by Xterra; 07-15-2015 at 09:17 PM.. Reason: Clarifying

Xterra

07-16-2015

I'm better with awk than perl. This seems to do what you want:

Code:

awk '
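# First pass over infile: accumulate the total frequency and, for each
# column, the frequency-weighted number of dashes.
FNR == NR {
	if($0 ~ /^>/)
		t += f = $NF	# f = Freq of this entry, t = sum of all Freq values
	else	for(i = length($0); i > 0; i--)
			if(substr($0, i, 1) == "-")
				cc[i] += f
	next
}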
/^>/ {	# Print header lines unchanged.
	print
	next
}
FNR == 2 {
	# First sequence line of the second pass: decide once which columns to keep.
	# Filter out column counts with frequency <= 50%...
	for(i in cc)
		if((cc[i] / t) <= .5)
			delete cc[i]
	# Create arrays for low end and counts for substrings to be printed...
	for(i = 1; i <= length($0); i++) {
		if(low == 0) {
			# Find low end of range to keep.
			if(!(i in cc)) {
				low = i
				count = 1
			}
		} else {# Look for end of range to keep.
			if(!(i in cc)) {
				# Keep this column.
				count++
			} else {# Save range and setup to look for next range.
				sf[++subc] = low
				sl[subc] = count
				low = count = 0
			}
		}
	}
	if(low) {
		# Set up entry to print last substring.
		sf[++subc] = low
		sl[subc] = count
	}
}
{	# Print selected substrings for non-header lines.
	out=""
	for(i = 1; i <= subc; i++)
		out = out substr($0, sf[i], sl[i])
	print out
}' infile infile
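
For anyone who wants to cross-check the result outside awk, here is a rough Python sketch of the same two-pass idea. It is not a translation of the script above: the function name `drop_gap_columns` and the `threshold` parameter are made up for illustration, it assumes each header ends in the Freq number and is followed by a single sequence line, and (unlike the awk script) it strips trailing whitespace from every line.

```python
def drop_gap_columns(lines, threshold=0.5):
    """Drop columns whose Freq-weighted share of dashes exceeds threshold."""
    lines = [line.rstrip() for line in lines]

    # Pass 1: total frequency and frequency-weighted dash count per column.
    total = 0
    gap_weight = {}  # 0-based column index -> summed Freq of entries with "-"
    freq = 0
    for line in lines:
        if line.startswith(">"):
            freq = int(line.split()[-1])  # assumes the header ends in "Freq N"
            total += freq
        else:
            for col, ch in enumerate(line):
                if ch == "-":
                    gap_weight[col] = gap_weight.get(col, 0) + freq

    # Columns to remove: strictly more than threshold of the total weight.
    drop = {col for col, weight in gap_weight.items()
            if weight > threshold * total}

    # Pass 2: headers unchanged, sequences without the dropped columns.
    return [line if line.startswith(">")
            else "".join(ch for col, ch in enumerate(line) if col not in drop)
            for line in lines]


if __name__ == "__main__":
    sample = [">a Freq 3", "AC-T-", ">b Freq 1", "ACGTA", ">c Freq 1", "A-GT-"]
    print("\n".join(drop_gap_columns(sample)))
```

In the small sample, the third and fifth columns carry 3/5 and 4/5 of the weight as dashes and are removed, while the dash in the second column of entry c (1/5) is left alone.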

Don Cragun
