

Recalculating frequencies


 
# 1  
Old 06-29-2010
Recalculating sequence frequency

My file looks like this:
Quote:
>GHL8OVD01BNNCF Freq 5
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCL Freq 10
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCCCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01A45V9 Freq 4
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
The first two sequences are identical, although their IDs and frequencies differ. The same is true of the last two. I need to compare all sequences in the file and, where sequences are identical, collapse them into a single entry whose frequency is the sum of the individual frequencies. The result should be the following file:
Quote:
>GHL8OVD01BNNCF Freq 15
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCCCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 13
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
Any help will be greatly appreciated.

Last edited by Xterra; 06-29-2010 at 07:18 PM..
# 2  
Old 06-29-2010
Try this:
Code:
awk -v RS=">" 'length($0)>0{a[$4]+=$3;b[$4]=$1}END{for (i in a) printf ">%s Freq %d\n%s\n", b[i], a[i], i}' file

# 3  
Old 06-29-2010
It is partially working

The last two sequences were not combined into one.
This is what I get:
Quote:
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCCCAGGACCTTCGCCCA
>GHL8OVD01A45V9 Freq 4
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01BNNCL Freq 15
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
Note that the sequences for GHL8OVD01A45V9 and GHL8OVD01A45V3 (highlighted in the original post) are identical, character by character and not only in length, yet they were not compressed into one entry with a frequency of 13.
# 4  
Old 06-29-2010
That is weird, because I just tried it on your test data and it did combine those lines. Keep in mind that this command outputs the records in no particular order, since awk does not define the order of a "for (i in a)" loop. Also double-check that you copied the code properly.
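If the output order matters, the same idea can be sketched so that records come out in input order and each merged entry keeps the first ID seen, as in the expected output from post #1. This is only a sketch: the function name merge_freq is made up for illustration, and it assumes every record is exactly two lines (a ">ID Freq N" header followed by the sequence on a single line).

```shell
# Hypothetical helper: merge identical sequences, keep the first ID seen,
# and print the merged records in the order they first appeared.
# Assumes two-line records: ">ID Freq N" then the sequence.
merge_freq() {
    awk '
        NF == 0 { next }                                # skip blank lines
        /^>/ { id = substr($1, 2); freq = $3; next }    # header: remember ID and frequency
        {
            if (!($0 in total)) {                       # first time this sequence appears
                order[++n] = $0
                first_id[$0] = id
            }
            total[$0] += freq                           # add this header frequency to the sum
        }
        END {
            for (k = 1; k <= n; k++)
                printf ">%s Freq %d\n%s\n", first_id[order[k]], total[order[k]], order[k]
        }
    ' "$1"
}
```

Calling merge_freq file > merged.txt would then write the collapsed records to a new file. Like the one-liner above, it compares sequence lines byte for byte, so any invisible difference between two lines still keeps them apart.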
# 5  
Old 06-29-2010
It is not working on my end

I tried one more time and it still did not combine the last two. The order is random, but I can still see both sequences. Instead of ending up with 5 different sequences, my file contains 6. I also modified the test data, and it is definitely not working: I added one more sequence (Freq 10), identical to the first two, at the very end of the file, and it was not combined with the other two.

Last edited by Xterra; 06-29-2010 at 07:34 PM..
# 6  
Old 06-29-2010
Try this to check if 5 or 6 sequences are printed:
Code:
awk '!/^>/{a[$0]++}END{for (i in a) print i}' file
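When two lines look identical but awk treats them as different keys, one possible cause is an invisible character, for example a carriage return left by Windows line endings or a trailing space. That is only a guess; nothing in this thread confirms it. A quick way to test the guess is to flag every line that ends in such a character. The function name check_hidden is made up for this sketch.

```shell
# Hypothetical helper: report lines that end in a space, tab, or carriage return.
check_hidden() {
    awk '
        /[ \t\r]+$/ { printf "line %d: trailing whitespace or CR\n", NR; bad++ }
        END { printf "%d suspicious line(s)\n", bad + 0 }
    ' "$1"
}
```

If check_hidden file reports any lines, dumping the file with od -c will show exactly which bytes are there (a carriage return appears as \r).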

# 7  
Old 06-29-2010
I got 6 sequences

The output file contains 6 sequences (the 2nd and 3rd are identical).
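If the extra entry does come from carriage returns (an assumption, not something established in this thread), removing them before the comparison should let the identical sequences merge. A minimal sketch, with merge_freq_clean as a made-up name, that wraps the command from post #2:

```shell
# Hypothetical helper: delete carriage returns, then merge as in post #2.
# With RS=">" each record is "ID Freq N sequence", so $1 is the ID,
# $3 the frequency, and $4 the sequence. Output order is unspecified.
merge_freq_clean() {
    tr -d '\r' < "$1" |
    awk -v RS='>' '
        NF >= 4 { total[$4] += $3; id[$4] = $1 }
        END { for (s in total) printf ">%s Freq %d\n%s\n", id[s], total[s], s }
    '
}
```

Trailing spaces and tabs are already ignored here, because awk's default field splitting discards them. Only the carriage return needs explicit removal.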
