Remove duplicate entries based on the range

01-07-2014


I have a file like this:

Code:

chr	start	end	
chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

What I want to do: for the same chromosome (column 1), if the start position falls within 1000 bp of the next entries and columns 4, 5, 6 and 7 are the same, I want to remove those entries and keep only the first one.
For example, here:

Code:

chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1

the start position (second column) varies by only a few bp and columns 4, 5, 6 and 7 are the same, so I want to retain only:

Code:

chr15   99874874         99875874       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

raj_k


01-07-2014


Try this (not tested):

Code:

$ awk 'p && $2-p<=1000 && !x[$4$5$6$7]++{print last}{p=$2;last=$0}' file

Akshay Hegde


01-07-2014


Hi,
it's giving output something like this:

Code:

chr15   99874874         99875874       chr15   99875173        99876173        aa1
chr15   99874962         99875962       chr15   99875173        99876173        aa1

but that is not the desired output I described.

raj_k


01-07-2014


This one only compares adjacent lines:

Code:

awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}' file
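Spelled out with comments, the same adjacent-line check might look like the sketch below. It assumes whitespace-separated input whose first line is a header (so line 1 is never printed); `dedup_adjacent` is only an illustrative wrapper name.

```shell
# Sketch: drop a line when it repeats the key of the line directly before it
# (columns 1, 4, 5, 6, 7) and its start is within 1000 of that line.
dedup_adjacent() {
  awk '
    # Build the comparison key; FS keeps the fields from running together.
    { key = $1 FS $4 FS $5 FS $6 FS $7 }

    # Line 1 is assumed to be a header and is skipped.
    # Print unless this line duplicates the line directly before it.
    NR > 1 && !($2 - prev_start <= 1000 && key == prev_key) { print }

    # Remember this line for the comparison with the next one.
    { prev_key = key; prev_start = $2 }
  ' "$@"
}
```

Call it as `dedup_adjacent file`. Note that `$2 - prev_start` is not an absolute value, so a start smaller than the previous one also counts as "within range"; the sketch therefore assumes the entries of one group are adjacent and in ascending start order.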

MadeInGermany


01-07-2014


Code:

awk '      NR==1{
                 next
                }
  function out(){
                   if(p && $2-p<=1000 && c==0)
                   print last
                }
                {
                 out()
                }
                {
                 last=$0
                 c=x[$4$5$6$7]++
                 p=$2
                }
             END{
                 out()
                }
    ' file

Code:

chr15   99874874         99875874       chr15   99875173        99876173    aa1        
chr1   10834962    10835962    chr3    5674767    5675545             ahc1

Akshay Hegde


01-07-2014


Akshay, I understood the requirement to be that columns $1 and $4, $5, $6, $7 are all equal.
At the very least, the comparison string should be field-separated, x[$4 FS $5 FS $6 FS $7],
so that e.g. ab cd ef gh does not match a bc de fg h
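To illustrate the separator point, here is a small sketch with two made-up four-field lines (`ab c d e` and `a bc d e`, chosen only for this demonstration). Concatenating the fields directly gives the same string for both, while joining them with FS keeps them distinct. `key_demo` is a hypothetical helper name.

```shell
# Sketch: compare a separator-less key with an FS-separated key.
key_demo() {
  printf '%s\n' 'ab c d e' 'a bc d e' |
  awk '{
    joined = $1 $2 $3 $4              # plain concatenation, no separator
    sep    = $1 FS $2 FS $3 FS $4     # fields joined with FS (a space by default)
    printf "joined=%s separated=%s\n", joined, sep
  }'
}
key_demo
```

Both lines print `joined=abcde`, but the separated keys differ (`ab c d e` versus `a bc d e`), so only the FS-separated form tells the two records apart.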

MadeInGermany


01-09-2014


Hi Akshay,
if I use your code on this data set:

Code:

chr11   87578121         87579121       chr11   87578115        87579115	ID1        
chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   87578208         87579208       chr11   87578115        87579115	ID1        
chr11   75966214         75967214       chr11   75966112        75967112	ID2        
chr11   75966257         75967257       chr11   75966112        75967112	ID2       
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr7    122067156        122068156      chr7    122067871       122068871	Id3     
chr15   66968646         66969646       chr15   67413704        67414704	ID4        
chr15   66968646         66969646       chr15   67413872        67414872	ID4

the output is as follows:

Code:

chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2       
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

It is supposed to be:

Code:

chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2 
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

I don't know why it is skipping those lines, which also fit the condition.

---------- Post updated at 11:44 AM ---------- Previous update was at 09:21 AM ----------

@MadeInGermany
I have changed

Code:

NR>1

to

Code:

 NR>=1

because with NR>1 the output never takes the first 3 lines of my example into account.
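Since NR>=1 is true for every line, the guard can also be dropped entirely for input without a header. A minimal sketch under that assumption (headerless, whitespace-separated input); `dedup_headerless` is a hypothetical wrapper name. On the first line the previous key is still unset, so the key comparison fails and the line is printed.

```shell
# Sketch: adjacent-line check for input that has no header line.
dedup_headerless() {
  awk '
    # Key: chromosome plus columns 4-7, separated by FS.
    { key = $1 FS $4 FS $5 FS $6 FS $7 }

    # On line 1 prev_key is empty, so the key comparison is false
    # and the first line is always printed.
    !($2 - prev_start <= 1000 && key == prev_key) { print }

    # Remember this line for the comparison with the next one.
    { prev_key = key; prev_start = $2 }
  ' "$@"
}
```

Used as `dedup_headerless file`, this keeps the first line of each run of adjacent lines that share a key and lie within the range, and drops the rest of the run.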

raj_k
