Gaps and frequencies


 
# 1  
Old 07-15-2015

I have this infile:
Code:
>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGT-GCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGT-GCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGC-TA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCA-TACCAG-AC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTA-AGGACC-TC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTCGCAGCGTTA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCTTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCA-TACCAGTAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTA-AGGACCTTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTCGCAGCGTTA

I need to remove the columns where the majority of characters, say >50%, are dashes.
However, I must take the frequency of each entry (Freq XX) into account, so an entry with Freq 10 counts ten times.
If dashes do not exceed 50% in a given column, that column should be left unchanged, and the scan should continue to the next column that contains dashes.
Thus, I will end up with this outfile:
Code:
>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTGCAGCATA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTGCAGCATA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTA

I have modified the following code:
Code:
 
perl -nla -F"" -e 'if (!/^>/){$n++;for ($i=0;$i<=$#F;$i++){$a{$i}{$F[$i]}++}}END{for ($i=0;$i<=$#F;$i++){if ($a{$i}{"-"}/$n>0.5) {print $i}}}' infile | awk -vFS="" -vOFS="" 'NR==FNR{a[$0+1]++}{for (i=1;i<=NF;i++) if (i in a) $i=""}FNR!=NR' - infile

However, I cannot get it to take the Freq value into account, and instead I get the following output:
Code:
>GHL8OVD01BNNCA Freq 10
TAGATGTGCCCGTGGGTTTCCCGTCAACACCGGATAGTGCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTGCAGCA-TA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGC-TA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAG-AC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACC-TC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTTA
>GHL8OVD01CMQVT Freq 1
TTGATGTGCCAGTTTCCCGTCTAGCAGCACTACCAGGACTTCGCTTA
>GHL8OVD01CMQVW Freq 1
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCATACCAGTAC
>GHL8OVD01A45V3 Freq 1
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTAAGGACCTTC
>GHL8OVD01AV2U9 Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTGCAGCGTTA
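
To see why only the first dash column disappears, it may help to print the dash share of each column both ways: counting every sequence line once (which is what the perl stage above does with $n) and weighting every line by its Freq value. The following is only a sketch; the function name compare_dash_fractions is made up for illustration, and it assumes that every header ends in its frequency and that each sequence sits on the single line after its header.

```shell
# Sketch: for each column that contains a dash, show its dash share counted
# per line and weighted by Freq. compare_dash_fractions is a hypothetical name.
compare_dash_fractions() {
	awk '
	/^>/ { f = $NF; total += f; next }	# weight of the entry that follows
	{
		rows++
		n = length($0)
		if (n > width) width = n
		for (i = 1; i <= n; i++)
			if (substr($0, i, 1) == "-") {
				by_line[i]++		# each line counts once
				by_freq[i] += f		# each line counts Freq times
			}
	}
	END {
		for (i = 1; i <= width; i++)
			if (i in by_line)
				printf "column %d: lines %d/%d = %.2f, Freq %d/%d = %.2f\n",
					i, by_line[i], rows, by_line[i] / rows,
					by_freq[i], total, by_freq[i] / total
	}' "$@"
}
```

Run as `compare_dash_fractions infile`. On the infile above, column 39 comes out as 6/10 by lines and 15/19 by Freq, while column 46 comes out as 5/10 by lines and 14/19 by Freq. Since 5/10 is not strictly greater than 0.5, the unweighted test keeps column 46, whereas both columns exceed 50% once Freq is applied.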

Any help will be greatly appreciated.

Last edited by Xterra; 07-15-2015 at 09:17 PM.. Reason: Clarifying
# 2  
Old 07-16-2015
I'm better with awk than perl. This seems to do what you want:
Code:
awk '
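FNR == NR {	# First pass: gather frequency-weighted dash counts.
	if($0 ~ /^>/)
		t += f = $NF	# f = this entry's frequency; t = total frequency.
	else	for(i = length($0); i > 0; i--)
			if(substr($0, i, 1) == "-")
				cc[i] += f	# Weighted dash count for column i.
	next
}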
/^>/ {	# Print header lines unchanged.
	print
	next
}
FNR == 2 {	# First sequence line of the second pass: build the keep-ranges once.
	# Filter out column counts with frequency <= 50%...
	for(i in cc)
		if((cc[i] / t) <= .5)
			delete cc[i]
	# Create arrays for low end and counts for substrings to be printed...
	for(i = 1; i <= length($0); i++) {
		if(low == 0) {
			# Find low end of range to keep.
			if(!(i in cc)) {
				low = i
				count = 1
			}
		} else {# Look for end of range to keep.
			if(!(i in cc)) {
				# Keep this column.
				count++
			} else {# Save range and setup to look for next range.
				sf[++subc] = low
				sl[subc] = count
				low = count = 0
			}
		}
	}
	if(low) {
		# Set up entry to print last substring.
		sf[++subc] = low
		sl[subc] = count
	}
}
{	# Print selected substrings for non-header lines.
	out=""
	for(i = 1; i <= subc; i++)
		out = out substr($0, sf[i], sl[i])
	print out
}' infile infile
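
The script above reads infile twice (hence `infile infile`), so it cannot take its input from a pipe. If that matters, a single-pass variant is possible, assuming the whole alignment fits in memory. The sketch below is one way to do it, not a replacement for the script above; the function name strip_gap_columns and the GAP_THRESHOLD variable are made up for illustration. It tests each column directly instead of precomputing ranges, which is simpler but does more work per line.

```shell
# Hypothetical helper: drop columns whose frequency-weighted dash share
# exceeds a threshold (default 0.5). The alignment is read once and buffered,
# so the input may also come from a pipe.
strip_gap_columns() {
	awk -v thr="${GAP_THRESHOLD:-0.5}" '
	/^>/ { f = $NF; total += f }		# weight of the entry that follows
	!/^>/ {
		n = length($0)
		for (i = 1; i <= n; i++)
			if (substr($0, i, 1) == "-")
				dashes[i] += f	# weighted dash count for column i
	}
	{ line[NR] = $0 }			# buffer every line for the output pass
	END {
		for (r = 1; r <= NR; r++) {
			if (line[r] ~ /^>/) {	# headers are printed unchanged
				print line[r]
				continue
			}
			out = ""
			n = length(line[r])
			for (i = 1; i <= n; i++)
				if (!(i in dashes) || total == 0 || dashes[i] / total <= thr)
					out = out substr(line[r], i, 1)
			print out
		}
	}' "$@"
}
```

Usage would be `strip_gap_columns infile > outfile`, or `GAP_THRESHOLD=0.8 strip_gap_columns infile` to try a stricter cutoff.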
