Removing low frequency sequences


 
# 1  
Old 06-24-2010

I have a file with the following contents:
Quote:
>GHL8OVD01BNNCA Freq 2
TTGATGTGCCCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01BNNCF Freq 2
TTGATGTGCCAGCTGCACTTCCCCCGGTGACGTGGGTTTCCCGTCTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01B9PRR Freq 1
TTGATGTGCCAGCTTTCGCGTCGACACCGGCAAATAGTAGCAGCGCTACCAGGACCTTCGCCCA
>GHL8OVD01BL8BD Freq 4
TTGATGAGTACTTCCCCCGGTGACGTGGGTCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAACTAGCAAGACTGCGCGTGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01BJX6A Freq 3
TTGATGTGCCAGCTGCCGTTGTCCCCCGGTGACGTGGGTCTCCCGTCGAGGACCTTCGCCCA
>GHL8OVD01A9D5T Freq 1
TGATGTGCCAGCCCCGGTGACGTGGGTTTCCGGTCGACATTCGCCCA
I would like to remove every sequence whose Freq is less than 3, so that I end up with the following file:
Quote:
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01BL8BD Freq 4
TTGATGAGTACTTCCCCCGGTGACGTGGGTCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAACTAGCAAGACTGCGCGTGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01BJX6A Freq 3
TTGATGTGCCAGCTGCCGTTGTCCCCCGGTGACGTGGGTCTCCCGTCGAGGACCTTCGCCCA
I am currently trying to do this with awk, but I am not getting the results I want.
Any help will be greatly appreciated.

Last edited by Xterra; 06-24-2010 at 04:31 PM..
# 2  
Old 06-24-2010
Code:
awk '/^>/{if ($3<=2){getline;next}}1' file
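
Spelled out, the one-liner drops each header whose third field is 2 or less, together with the single line that follows it. Below is a minimal sketch of the same idea with the cutoff passed in as a variable. It assumes the header layout `>ID Freq N` and exactly one line per sequence, as in the posted sample; the function name `drop_low_freq`, the variable `min`, and the file `sample.fa` are made up for illustration.

```shell
# Hypothetical sample with the same layout as the posted file.
cat > sample.fa <<'EOF'
>GHL8OVD01BNNCA Freq 2
TTGATGTGCC
>GHL8OVD01CMQVT Freq 15
TTGATGTCGT
>GHL8OVD01BJX6A Freq 3
TTGATGTGCA
EOF

# drop_low_freq MIN FILE (hypothetical helper): skip each header whose
# Freq (field 3) is below MIN, plus the one sequence line after it.
# Assumes every sequence occupies exactly one line.
drop_low_freq() {
    awk -v min="$1" '
        /^>/ && $3 < min { getline; next }   # read the sequence line, then skip both
        { print }                            # everything else passes through
    ' "$2"
}

drop_low_freq 3 sample.fa
```

With a cutoff of 3 this keeps the Freq 15 and Freq 3 records. If a sequence were wrapped over several lines, only its first line would be skipped, so this form is only safe for single-line sequences.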

# 3  
Old 06-24-2010
Hi.

I think you mean less than 3, judging by your expected output?

Code:
$ awk '$2 == "Freq" { ($3<3)?P=0:P=1}P' file1
>GHL8OVD01CMQVT Freq 15
TTGATGTCGTGGGTTTCCCGTCAACACCGGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01CMQVW Freq 11
TTGATGTGTCCCGTCGACACCGGCAAATAGCAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01A45V3 Freq 9
TTGATTCCCGTCGACACCGGCAAATAGCAGCAGCACTACAGGACCTTCGCCCA
>GHL8OVD01BL8BD Freq 4
TTGATGAGTACTTCCCCCGGTGACGTGGGTCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCCAACTAGCAAGACTGCGCGTGCAAATAGTAGCAGCACTACCAGGACCTTCGCCCA
>GHL8OVD01BJX6A Freq 3
TTGATGTGCCAGCTGCCGTTGTCCCCCGGTGACGTGGGTCTCCCGTCGAGGACCTTCGCCCA

This User Gave Thanks to Scott For This Post:
# 4  
Old 06-24-2010
Your example drops the entries with Freq less than or equal to two, which is what this does:
Code:
 awk ' /^>/ && $NF>2 {ok=1}
       /^>/ && $NF<3 {ok=0}
       ok {print $0} ' filename
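
The flag-based form can be condensed a little, and it should also cope with sequences wrapped over several lines, because the flag keeps its value until the next header. Here is a sketch under those assumptions; the function name `keep_min_freq`, the variable `min`, and the file `wrapped.fa` are hypothetical, and the Freq value is assumed to be the last field of each header line.

```shell
# Hypothetical sample in which sequences are wrapped over two lines.
cat > wrapped.fa <<'EOF'
>GHL8OVD01B9PRR Freq 1
TTGATGTGCC
AGCTTTCGCG
>GHL8OVD01AV2U9 Freq 17
TTGATGTGCC
AACTAGCAAG
>GHL8OVD01BJX6A Freq 3
TTGATGTGCC
EOF

# keep_min_freq MIN FILE (hypothetical helper): print every record whose
# header ends in a Freq value of at least MIN.
keep_min_freq() {
    awk -v min="$1" '
        /^>/ { keep = ($NF + 0 >= min + 0) }   # decide once per header, numerically
        keep                                   # print all lines while the flag is set
    ' "$2"
}

keep_min_freq 3 wrapped.fa
```

Adding 0 to both sides forces a numeric comparison, so 15 is not compared to 3 as a string. Run with a cutoff of 3, the sketch prints the Freq 17 and Freq 3 records in full and drops the Freq 1 record, including both of its sequence lines.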
