This is what I would like to accomplish. I have an input file (File A) that consists of thousands of sequence elements, all with the same number of characters (length). Each sequence is headed by a free-text header starting with the chevron '>' character, followed by an ID (all IDs are different, with different lengths) and a number (the prevalence). Something like this:
Quote:
> ID 1 Prev 1
C-TGCTAGCTACGTCGTACGT
> ID 2 Prev 31
A-TGCTAGCTACGTCGTACGT
> ID 3 Prev 30
A-TGCTAGCTACGTCGTTCGT
> ID 4 Prev 30
A-TGCTAGCTACGTCGTACGA
> ID 5 Prev 2
A-TGCTAGCTACGTCG-----
> ID 6 Prev 2
A-TGCTAGCTACNTCGTACGT
> ID 7 Prev 2
A-CGCTAGCTACGTCGTACGT
> ID 8 Prev 2
A-TGCTAGCTA-GTCGTACGT
> ID 9 Prev 1
AGTGCTAGCTACGTCGTACGT
First, I need to calculate the frequency of A, G, C, T, - and N at each position, with each sequence weighted by its prevalence. Then I need to evaluate the first position of the first sequence: if the frequency of the character at that position does not reach 5%, the entire sequence, along with its ID, should be removed. If the frequency is higher, I check the frequency of the character at the second position, and so on until the entire sequence has been scanned. The same check should then be applied to each and every sequence in the file.
Thus, in my example above, Seq ID 1 should be removed, since "C" accounts for only 1% of that column (Prev = 1 out of a total prevalence of 101; the ID itself is not considered in the analysis). As a result of this process, the output file (File B) should contain only sequences 2, 3 and 4, since for those three entries the frequency of each character at every position along the entire sequence is higher than 5%. The output file should have the same format (FASTA) as the input file:
Quote:
> ID 2 Prev 31
A-TGCTAGCTACGTCGTACGT
> ID 3 Prev 30
A-TGCTAGCTACGTCGTTCGT
> ID 4 Prev 30
A-TGCTAGCTACGTCGTACGA
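To make the intent concrete, here is a rough Python sketch of the filtering I have in mind (the function names and the inlined example data are just my own placeholders; I'm assuming the prevalence is always the last whitespace-separated token of the header, and that frequencies are taken over the total prevalence, 101 in my example):

```python
from collections import defaultdict

THRESHOLD = 0.05  # the 5% cut-off described above

def parse_fasta(text):
    """Return (header, prevalence, sequence) triples; the prevalence is
    assumed to be the last whitespace-separated token of the '>' line."""
    records, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
        else:
            records.append((header, int(header.split()[-1]), line))
    return records

def filter_by_column_frequency(records, threshold=THRESHOLD):
    total = sum(prev for _, prev, _ in records)  # total prevalence (101 here)
    length = len(records[0][2])                  # alignment width (all equal)
    # prevalence-weighted count of each character in each column
    counts = [defaultdict(int) for _ in range(length)]
    for _, prev, seq in records:
        for j, ch in enumerate(seq):
            counts[j][ch] += prev
    # keep a sequence only if every one of its characters clears the cut-off
    return [(header, seq) for header, prev, seq in records
            if all(counts[j][ch] / total >= threshold
                   for j, ch in enumerate(seq))]

example = """\
> ID 1 Prev 1
C-TGCTAGCTACGTCGTACGT
> ID 2 Prev 31
A-TGCTAGCTACGTCGTACGT
> ID 3 Prev 30
A-TGCTAGCTACGTCGTTCGT
> ID 4 Prev 30
A-TGCTAGCTACGTCGTACGA
> ID 5 Prev 2
A-TGCTAGCTACGTCG-----
> ID 6 Prev 2
A-TGCTAGCTACNTCGTACGT
> ID 7 Prev 2
A-CGCTAGCTACGTCGTACGT
> ID 8 Prev 2
A-TGCTAGCTA-GTCGTACGT
> ID 9 Prev 1
AGTGCTAGCTACGTCGTACGT
"""

for header, seq in filter_by_column_frequency(parse_fasta(example)):
    print(header)   # writes File B (sequences 2, 3 and 4) to stdout
    print(seq)
```

On my example this keeps exactly IDs 2, 3 and 4: every rare character ("C" at position 1 of seq 1, the trailing gaps of seq 5, the "N" of seq 6, etc.) has a weighted frequency of 1/101 or 2/101, well below 5%, so those whole sequences are dropped.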
Any help will be greatly appreciated.