Eliminating entries based on relative values

06-05-2013


I have posted this before but did not get much feedback, so I will try again. I need to remove sequences from one file based on values listed in a second file.
The sequences file looks like this:
Sequences.txt

Code:

>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 Freq 38
ggatatgatgatgaactggt
>Sample2 run13 reversed 7 3a Freq 32
ggatttgatgatgaactggt
>Sample2 run13 reversed 8 3a 30
ggatttgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct
>Sample2 reversed 10 3a 27
ggatatgatgatgaactggt

and the file containing the distance values looks like this (I am also attaching an Excel file for clarity):

Code:

Species 1,Species 2,Dist
Sample1 Freq 39,Sample2 29,3.000
Sample1 Freq 3a 39,Sample2 29,2.000
Sample1 Freq 44,Sample2 29,3.000
Sample1 Freq 54,Sample2 29,3.000
Sample1 Freq 59,Sample2 29,2.000
Sample1 Freq 39,Sample2 Freq 38,1.000
Sample1 Freq 3a 39,Sample2 Freq 38,0.000
Sample1 Freq 44,Sample2 Freq 38,1.000
Sample1 Freq 54,Sample2 Freq 38,1.000
Sample1 Freq 59,Sample2 Freq 38,0.000
Sample1 Freq 39,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 3a 39,Sample2 reversed 10 3a 27,0.000
Sample1 Freq 44,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 54,Sample2 reversed 10 3a 27,1.000
Sample1 Freq 59,Sample2 reversed 10 3a 27,0.000
Sample1 Freq 39,Sample2 run13 reversed 7 3a Freq 32,0.000
Sample1 Freq 3a 39,Sample2 run13 reversed 7 3a Freq 32,1.000
Sample1 Freq 44,Sample2 run13 reversed 7 3a Freq 32,2.000
Sample1 Freq 54,Sample2 run13 reversed 7 3a Freq 32,2.000
Sample1 Freq 59,Sample2 run13 reversed 7 3a Freq 32,1.000
Sample1 Freq 39,Sample2 run13 reversed 8 3a 30,0.000
Sample1 Freq 3a 39,Sample2 run13 reversed 8 3a 30,1.000
Sample1 Freq 44,Sample2 run13 reversed 8 3a 30,2.000
Sample1 Freq 54,Sample2 run13 reversed 8 3a 30,2.000
Sample1 Freq 59,Sample2 run13 reversed 8 3a 30,1.000

Now, if the distance (Dist) is below 1, I have to remove the sequence listed in column 2 (Species 2) from the file Sequences.txt. Sometimes the same sequence will be listed more than once; in those cases, since it has already been removed at the first instance, the script can ignore the repeats. Thus, the output file should look something like this:

Code:

>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct

I would like to use AWK, preferably, since I am more familiar with it. However, Perl would also work.
Any help will be very much appreciated!

Dist.xls (25.5 KB)

Xterra


06-05-2013


Assuming that each DNA sequence is stored on a single line:

Code:

nawk -F, 'NR==FNR&&$3<1{a[">"$2]=1}NR!=FNR&&!($0 in a)&&/^>/{print;getline;print}' dist.txt seq.txt
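For anyone who finds the one-liner hard to read, here is a longer sketch of the same idea. The function name `filter_by_dist` and the variable names `drop` and `keep` are made up for illustration and are not from the post above. It differs in two small ways: it skips the `Species 1,Species 2,Dist` header row explicitly, and it prints every line of a kept record, so a sequence wrapped over several lines would also be handled. Like the one-liner, it needs the header in the sequences file to match the name in column 2 exactly, character for character.

```shell
# Hypothetical helper (name is illustrative); usage: filter_by_dist dist.txt seq.txt
filter_by_dist() {
    awk -F, '
        # First file (distances): remember each "Species 2" name whose Dist is below 1.
        # FNR > 1 skips the column-title row.
        NR == FNR {
            if (FNR > 1 && $3 + 0 < 1) drop[">" $2] = 1
            next
        }
        # Second file (sequences): a header line decides the fate of the whole record.
        /^>/ { keep = !($0 in drop) }
        # Print header and sequence lines only while inside a kept record.
        keep
    ' "$1" "$2"
}
```

It could be called as `filter_by_dist dist.txt seq.txt > filtered.txt`. Because removal is driven by array lookup, a name that appears several times with Dist below 1 is simply stored once and dropped once.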

bartus11


06-05-2013


Hey bartus!

Good to hear from you! Thanks for your prompt reply. I tried your script but I do not get the expected output. This is what I am getting:

Code:

$ awk -F, 'NR==FNR&&$3<1{a[">"$2]=1}NR!=FNR&&!($0 in a)&&/^>/{print;getline;print}' dista.txt seq.txt
>Sample1 Freq 59
ggatatgatgatgaactggt
>Sample1 Freq 54
ggatatgatgttgaactggt
>Sample1 Freq 44
ggatatgttgatgaactggt
>Sample1 Freq 39
ggatttgatgatgaactggt
>Sample1 Freq 3a 39
ggatatgatgatgaactggt
>Sample2 Freq 38
ggatatgatgatgaactggt
>Sample2 run13 reversed 7 3a Freq 32
ggatttgatgatgaactggt
>Sample2 run13 reversed 8 3a 30
ggatttgatgatgaactggt
>Sample2 29
ggatatgatgatgaactcct
>Sample2 reversed 10 3a 27
ggatatgatgatgaactggt

Am I missing something here?
Once again, thank you very much!
PS. I have uploaded the corresponding files so you can take a look at them.

Dista.txt (1.2 KB)

Seq.txt (464 Bytes)

Last edited by Don Cragun; 06-20-2015 at 12:36 AM.. Reason: Get rid of extraneous SIZE tags.

Xterra


06-05-2013


That is because there are spaces at the end of each header in your sample seq.txt, so the header lines never match the names read from the distance file exactly.

You can remove them using:

Code:

perl -i -pe 's/ *$//' seq.txt
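If you would rather check first and leave seq.txt untouched, something along these lines might help. This is only a sketch: the helper names `show_trailing_ws` and `strip_trailing_ws` are invented here, and unlike the Perl command it also treats tabs and carriage returns as trailing whitespace, which is an assumption about what else might be hiding at the end of a line.

```shell
# Hypothetical helpers (names are illustrative, not from the thread).

# List header lines that end in a space, tab or carriage return,
# wrapped in brackets so the invisible characters become visible.
show_trailing_ws() {
    awk '/^>/ && /[ \t\r]$/ { printf "%d: [%s]\n", FNR, $0 }' "$1"
}

# Write a cleaned copy to standard output instead of editing in place.
strip_trailing_ws() {
    awk '{ sub(/[ \t\r]+$/, ""); print }' "$1"
}
```

For example, `show_trailing_ws seq.txt` lists the affected headers with their line numbers, and `strip_trailing_ws seq.txt > seq.clean.txt` produces a cleaned copy that can be passed to the awk command in place of seq.txt.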


bartus11


06-05-2013


bartus

Once again, thank you so very much!

Last edited by Xterra; 06-05-2013 at 10:56 PM..

Xterra
