Selecting sequences based on scores


 
# 1  
Old 04-04-2013

I have two files with thousands of sequences of different lengths. infile1 contains the actual sequences and infile2 the scores for each A, T, G and C in infile1. Something like this:
infile1:
Code:
>HZVJKYI01ECH5R
TTGATGTGCCAGCTGCCGTTGGTGTGCCAA
>HZVJKYI01AQWJ8
GGATATGATGATGAACTGGTTTGGCACACC
>HZVJKYI01C8OAV
GGATATGATGATGAACTGGTTTGGCACACC
>HZVJKYI01AXR15
TTGATGTGCCAGCTGCCGTTGGTGT
>HZVJKYI01EDZM4
TGATGTGCCAGCTGCCGTTGGTGTACCAGT

infile2:
Code:
>HZVJKYI01ECH5R
18 20 30 30 38 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 30 30 20 20 
>HZVJKYI01AQWJ8
26 26 26 38 40 40 40 40 40 40 39 35 32 31 31 32 32 17 17 16 16 16 27 27 34 34 40 34 23 26 
>HZVJKYI01C8OAV
29 29 29 34 39 39 40 38 38 40 40 40 36 36 36 36 34 33 33 36 33 33 37 30 32 32 30 30 30 30 
>HZVJKYI01AXR15
21 21 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
>HZVJKYI01EDZM4
26 30 36 40 40 40 40 40 40 40 40 40 40 40 40 30 21 21 21 34 34 40 40 40 40 40 40 40 40 40

I need to output the sequences where the average score is >35, the length of the sequence is >28, and the minimal score is >20. In this example, the first sequence will not be included in the outfile because its minimal score is 18. The second sequence will also be excluded because its minimal score is 16 and its average is 30.5. The third sequence is eliminated because its average score is <35. The fourth sequence will also be excluded because its length is less than 28. Thus, the outfile will contain only one sequence:

outfile:
Code:
>HZVJKYI01EDZM4
TGATGTGCCAGCTGCCGTTGGTGTACCAGT
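
The length, minimum, and average quoted above can be checked from infile2 alone. The following is a minimal awk sketch, assuming every record is exactly one header line followed by one line of space-separated scores. The function name score_stats is only illustrative.

```shell
# score_stats: hypothetical helper (not from the thread).
# Prints ID, number of scores, minimum and mean for each record
# of a scores file laid out as: header line, then one score line.
score_stats() {
  awk '
    /^>/ { id = substr($1, 2); next }   # header line: remember the ID
    NF {                                # score line: one number per base
      min = $1; sum = 0
      for (i = 1; i <= NF; i++) {
        sum += $i
        if ($i + 0 < min + 0) min = $i  # force a numeric comparison
      }
      printf "%s len=%d min=%d avg=%.2f\n", id, NF, min, sum / NF
    }
  ' "$1"
}
```

Calling score_stats infile2 prints one summary line per record, which makes it easy to see which of the three criteria a record fails.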

I would like to use an awk script so I can include it in a bash file. However, I do not know how to match the records in the two infiles. Any help will be greatly appreciated.
# 2  
Old 04-04-2013
Build an array containing only the IDs of those records in file 2 that meet the criteria. Then use that array to print the corresponding records from file 1.
# 3  
Old 04-04-2013
Scrutinizer

I would be very grateful if you could be more specific. I am really trying to understand the whole thing, but I just cannot get it to work.
# 4  
Old 04-04-2013
Try this
Code:
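# Pass 1 (infile2) records the IDs that meet all three criteria; pass 2 (infile1) prints those records.
awk 'NR==FNR{q=$1;getline;s=d=0;for(i=1;i<=NF;i++){if($i<=20)d=1;s+=$i}if(NF>28&&!d&&s/NF>35)z[q]=1;next}
($1 in z){print;getline;print}' infile2 infile1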
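
The same two-pass idea can be written out in a longer, commented form that is easier to adapt. This is only a sketch, assuming the header lines are identical in both files and that the number of scores equals the sequence length. The function name filter_by_score and the variable names are illustrative. The three criteria are applied as strict comparisons (>35, >28, >20), as stated in the first post.

```shell
# filter_by_score: hypothetical wrapper (name chosen for illustration).
# Usage: filter_by_score SCORES_FILE SEQUENCE_FILE
filter_by_score() {
  awk -v min_avg=35 -v min_len=28 -v min_score=20 '
    # Pass 1 (scores file): decide which IDs meet all three criteria.
    NR == FNR {
      if (/^>/) { id = $1; next }          # header line: remember the ID
      if (NF == 0) next                    # skip blank lines
      sum = 0; low = 0
      for (i = 1; i <= NF; i++) {
        sum += $i
        if ($i + 0 <= min_score) low = 1   # a score at or below the floor
      }
      if (NF > min_len && !low && sum / NF > min_avg) keep[id] = 1
      next
    }
    # Pass 2 (sequence file): print a kept header and the lines under it.
    /^>/ { show = ($1 in keep) }
    show
  ' "$1" "$2"
}
```

It would be called as filter_by_score infile2 infile1 > outfile. Because pass 2 sets a flag on each header rather than reading a single following line, it also copes with sequences wrapped over several lines. The NR == FNR test assumes the scores file is not empty.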

--ahamed
# 5  
Old 04-04-2013
ahamed

Thanks! That works great!