My file looks like this:
Quote:
>GHL8OVD01BNNCF Freq 5
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCL Freq 10
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCCCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01A45V9 Freq 4
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
The first two sequences are identical (only their IDs and frequencies differ), and the same is true of the last two. I need to compare all sequences in the file and, wherever sequences are identical, collapse them into a single entry that keeps the ID of the first occurrence and whose frequency is the sum of the individual frequencies. For the input above, I would end up with the following file:
Quote:
>GHL8OVD01BNNCF Freq 15
TTGATGTGCCCGTGGGTTTCCCTTCGCCCA
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCCCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 13
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
Any help will be greatly appreciated.
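
For reference, the collapsing described above could be sketched roughly as follows in Python. This is only a minimal sketch under some assumptions: every header has the form `>ID Freq N`, every sequence sits on a single line directly after its header, and the merged entry keeps the ID of the first occurrence (as in the desired output). The names `collapse` and `format_records` are made up for illustration.

```python
import re

# Assumed header layout: ">ID Freq N" (taken from the sample above).
HEADER = re.compile(r"^>(\S+)\s+Freq\s+(\d+)\s*$")


def collapse(lines):
    """Merge identical sequences, summing their frequencies.

    Returns a list of (id, total_freq, sequence) tuples in order of
    first appearance. Assumes one sequence line per header.
    """
    merged = {}  # sequence -> [first_id, total_freq]; dicts keep insertion order
    seq_id = freq = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError("unexpected header: " + line)
            seq_id, freq = match.group(1), int(match.group(2))
        elif seq_id is not None:
            if line in merged:
                merged[line][1] += freq          # duplicate: add its frequency
            else:
                merged[line] = [seq_id, freq]    # first time this sequence is seen
            seq_id = None  # the header has been consumed
    return [(sid, total, seq) for seq, (sid, total) in merged.items()]


def format_records(records):
    """Turn (id, freq, sequence) tuples back into the two-line layout."""
    out = []
    for sid, total, seq in records:
        out.append(">%s Freq %d" % (sid, total))
        out.append(seq)
    return "\n".join(out)
```

To use it on a real file, the lines could be read with something like `open(path)` and passed to `collapse`, then `format_records` would give the text to write out. Since the sequence itself is the dictionary key, only exact matches are merged; sequences wrapped over several lines would need to be joined first.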