Eliminating entries based on relative values


 
# 1  
Old 06-05-2013

I posted this before but did not get much feedback, so I will try again. I need to remove sequences from one file based on values listed in a second file.
The sequences file looks like this:
Sequences.txt
Code:
>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 Freq 38
ggatatgatgatgaactggt
>Sample2 run13 reversed 7 3a Freq 32
ggatttgatgatgaactggt
>Sample2 run13 reversed 8 3a 30
ggatttgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct
>Sample2 reversed 10 3a 27
ggatatgatgatgaactggt

The file containing the distance values looks like this (I am also attaching an Excel file for clarity):
Code:
Species 1,Species 2,Dist
Sample1 Freq 39,Sample2 29,3.000
Sample1 Freq 3a 39,Sample2 29,2.000
Sample1 Freq 44,Sample2 29,3.000
Sample1 Freq 54,Sample2 29,3.000
Sample1 Freq 59,Sample2 29,2.000
Sample1 Freq 39,Sample2 Freq 38,1.000
Sample1 Freq 3a 39,Sample2 Freq 38,0.000
Sample1 Freq 44,Sample2 Freq 38,1.000
Sample1 Freq 54,Sample2 Freq 38,1.000
Sample1 Freq 59,Sample2 Freq 38,0.000
Sample1 Freq 39,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 3a 39,Sample2 reversed 10 3a 27,0.000
Sample1 Freq 44,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 54,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 59,Sample2 reversed 10 3a 27,0.000
Sample1 Freq 39,Sample2 run13 reversed 7 3a Freq 32,0.000
Sample1 Freq 3a 39,Sample2 run13 reversed 7 3a Freq 32,1.000
Sample1 Freq 44,Sample2 run13 reversed 7 3a Freq 32,2.000
Sample1 Freq 54,Sample2 run13 reversed 7 3a Freq 32,2.000
Sample1 Freq 59,Sample2 run13 reversed 7 3a Freq 32,1.000
Sample1 Freq 39,Sample2 run13 reversed 8 3a 30,0.000
Sample1 Freq 3a 39,Sample2 run13 reversed 8 3a 30,1.000
Sample1 Freq 44,Sample2 run13 reversed 8 3a 30,2.000
Sample1 Freq 54,Sample2 run13 reversed 8 3a 30,2.000
Sample1 Freq 59,Sample2 run13 reversed 8 3a 30,1.000

Now, if the distance (Dist) is below 1, I have to remove the sequence listed in column 2 (Species 2) from the file Sequences.txt. Sometimes the same sequence will be listed more than once; in those cases, since the sequence has already been removed at its first occurrence, the script can ignore the later ones. Thus, the output file should look something like this:
Code:
>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct

I would prefer to use AWK, since I am more familiar with it. However, Perl would also work.
Any help will be very much appreciated!
# 2  
Old 06-05-2013
Assuming that each DNA sequence is stored on a single line:
Code:
nawk -F, 'NR==FNR&&$3<1{a[">"$2]=1}NR!=FNR&&!($0 in a)&&/^>/{print;getline;print}' dist.txt seq.txt
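
For readers who want to see what the one-liner does, here is a rough, commented sketch of the same logic. It is not a replacement for the command above. The temporary directory, the variable names (`tmp`, `result`, `drop`, `line`) and the three-row sample data are assumptions made for illustration, and plain `awk` is used in place of `nawk`. As above, it assumes each sequence sits on a single line and that headers match column 2 of the distance file exactly.

```shell
# Illustrative sketch: the one-liner's logic, spelled out.
# The sample data is a small subset of the files in post #1.
tmp=$(mktemp -d)
cat > "$tmp/dist.txt" <<'EOF'
Species 1,Species 2,Dist
Sample1 Freq 59,Sample2 29,2.000
Sample1 Freq 59,Sample2 Freq 38,0.000
Sample1 Freq 3a 39,Sample2 Freq 38,0.000
EOF
cat > "$tmp/seq.txt" <<'EOF'
>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample2 Freq 38
ggatatgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct
EOF

result=$(awk -F, '
    # First file (distances): remember every column-2 name whose
    # distance is below 1. Repeated names just set the same key again.
    NR == FNR { if ($3 < 1) drop[">" $2] = 1; next }
    # Second file (sequences): for a header that is not marked for
    # removal, print it and the sequence line that follows it.
    /^>/ && !($0 in drop) { print; if ((getline line) > 0) print line }
' "$tmp/dist.txt" "$tmp/seq.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample, the `Sample2 Freq 38` record is dropped and the other two records are kept.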

# 3  
Old 06-05-2013
Hey bartus!

Good to hear from you! Thanks for your prompt reply. I tried your script but I do not get the expected output. This is what I am getting:
Code:
$ awk -F, 'NR==FNR&&$3<1{a[">"$2]=1}NR!=FNR&&!($0 in a)&&/^>/{print;getline;print}' dista.txt seq.txt
>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 Freq 38
ggatatgatgatgaactggt
>Sample2 run13 reversed 7 3a Freq 32
ggatttgatgatgaactggt
>Sample2 run13 reversed 8 3a 30
ggatttgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct
>Sample2 reversed 10 3a 27
ggatatgatgatgaactggt

Am I missing something here?
Once again, thank you very much!
PS. I have uploaded the corresponding files so you can take a look at them.

Last edited by Don Cragun; 06-20-2015 at 12:36 AM.. Reason: Get rid of extraneous SIZE tags.
# 4  
Old 06-05-2013
That is because there are spaces at the end of each header in your sample seq.txt. You can remove them using:
Code:
perl -i -pe 's/ *$//' seq.txt
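
If editing seq.txt in place is not an option, the comparison itself can be made to ignore trailing blanks. The following is only a sketch under assumptions: the names `tmp`, `cleaned`, `key`, `keep` and `drop` and the two-record sample are made up for illustration, and stripping a trailing carriage return as well is an extra precaution that was not needed for the uploaded files. Unlike the one-liner, it also keeps or drops every line up to the next header, so sequences spanning several lines would be handled too.

```shell
# Illustrative sketch: match headers while ignoring trailing whitespace.
tmp=$(mktemp -d)
printf 'Species 1,Species 2,Dist\nSample1 Freq 59,Sample2 Freq 38,0.000\n' > "$tmp/dist.txt"
# The headers below deliberately end in spaces, like the sample seq.txt.
printf '>Sample1 Freq 59  \nggatatgatgatgaactggt\n>Sample2 Freq 38 \nggatatgatgatgaactggt\n' > "$tmp/seq.txt"

cleaned=$(awk -F, '
    # Distances file: collect column-2 names with a distance below 1.
    NR == FNR { if ($3 < 1) drop[">" $2] = 1; next }
    # Sequences file: at each header, decide whether to keep the record.
    /^>/ {
        key = $0
        sub(/[ \t\r]+$/, "", key)   # compare on a copy without trailing blanks
        keep = !(key in drop)
    }
    # Print the header and every following line while keep is true.
    keep
' "$tmp/dist.txt" "$tmp/seq.txt")

printf '%s\n' "$cleaned"
rm -rf "$tmp"
```

The headers are printed unchanged, trailing spaces included; only the lookup key is trimmed.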

# 5  
Old 06-05-2013
bartus

Once again, thank you so very much!

Last edited by Xterra; 06-05-2013 at 10:56 PM..
 