Remove duplicate entries based on the range


 
# 1  
Old 01-07-2014

I have a file like this:
Code:
chr	start	end	
chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

What I want to do is this: for the same chromosome (column 1), if the start position (column 2) falls within 1000 bp of the following entries, and columns 4, 5, 6 and 7 are the same, I want to remove those entries and keep only the first one.
For example, here
Code:
chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1

the start position (second column) varies by a few bp and columns 4, 5, 6 and 7 are the same, so I want to retain only
Code:
chr15   99874874         99875874       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

# 2  
Old 01-07-2014
Try this (not tested):

Code:
$ awk 'p && $2-p<=1000 && !x[$4$5$6$7]++{print last}{p=$2;last=$0}' file

# 3  
Old 01-07-2014
Hi,
It's giving output something like this:

Code:
chr15   99874874         99875874       chr15   99875173        99876173        aa1
chr15   99874962         99875962       chr15   99875173        99876173        aa1

but this is not the desired output that I mentioned.
# 4  
Old 01-07-2014
This one only compares adjacent lines:
Code:
awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}' file

# 5  
Old 01-07-2014
Code:
awk '
NR == 1 { next }

function out() {
    if (p && $2 - p <= 1000 && c == 0)
        print last
}

{ out() }

{
    last = $0
    c = x[$4$5$6$7]++
    p = $2
}

END { out() }
' file

Code:
chr15   99874874         99875874       chr15   99875173        99876173    aa1        
chr1   10834962    10835962    chr3    5674767    5675545             ahc1

# 6  
Old 01-07-2014
Akshay, as I understood it, the requirement was equal columns $1 and $4, $5, $6, $7?
At the very least, the comparison string should be field-separated, x[$4 FS $5 FS $6 FS $7],
so that e.g. ab cd ef gh does not match a bc de fg h.
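
A minimal sketch of why the separator matters. The two input lines are made up for illustration (they are not from the poster's data), and `count_keys` is a hypothetical helper name. Both lines have four fields that concatenate to the same string, so an array keyed on `$1 $2 $3 $4` treats them as one entry, while a key built with FS keeps them apart.

```shell
# count_keys: hypothetical helper, counts distinct keys under both schemes
count_keys() {
    printf '%s\n' 'ab cd ef gh' 'a bcd e fgh' |
    awk '{
        plain[$1 $2 $3 $4]++            # fields run together: "abcdefgh" for both lines
        safe[$1 FS $2 FS $3 FS $4]++    # field boundaries preserved in the key
    }
    END {
        np = 0; for (k in plain) np++
        ns = 0; for (k in safe) ns++
        print "concatenated keys:", np
        print "FS-separated keys:", ns
    }'
}

count_keys
# prints:
# concatenated keys: 1
# FS-separated keys: 2
```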
# 7  
Old 01-09-2014
Hi Akshay,
If I use your code on this data set
Code:
chr11   87578121         87579121       chr11   87578115        87579115	ID1        
chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   87578208         87579208       chr11   87578115        87579115	ID1        
chr11   75966214         75967214       chr11   75966112        75967112	ID2        
chr11   75966257         75967257       chr11   75966112        75967112	ID2       
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr7    122067156        122068156      chr7    122067871       122068871	Id3     
chr15   66968646         66969646       chr15   67413704        67414704	ID4        
chr15   66968646         66969646       chr15   67413872        67414872	ID4

the output is as follows:
Code:
chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2       
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

It is supposed to be
Code:
chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2 
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

I don't know why it is skipping those lines, which also fit the condition.

---------- Post updated at 11:44 AM ---------- Previous update was at 09:21 AM ----------

@madeingermany
I have changed
Code:
NR>1

to
Code:
 NR>=1

because whenever it produces output, it leaves out the first 3 lines of my example.
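
The `NR>1` test was presumably there to skip the header line of the first sample; on header-less data it drops the first record instead. Below is a hedged sketch of one way to avoid depending on the line number at all. It is not any poster's exact code: `dedup_range`, `win`, `kept_key` and `kept_start` are names made up here, and it assumes "keep only the first entry" means comparing each line against the last line that was kept, using columns 1, 4, 5, 6 and 7 as an FS-separated key.

```shell
# dedup_range: hypothetical helper; reads the named files, or stdin if none given
dedup_range() {
    awk -v win=1000 '
    {
        # FS-separated key: chromosome plus columns 4-7
        key = $1 FS $4 FS $5 FS $6 FS $7

        # distance of this start position from the last kept start
        d = $2 - kept_start
        if (d < 0) d = -d

        # same key as the last kept line and within win bp: treat as duplicate
        if (seen && key == kept_key && d <= win)
            next

        print
        seen = 1
        kept_key = key
        kept_start = $2
    }' "$@"
}
```

Called as `dedup_range file`, this should print the two lines requested in post #1 for the first sample, and a header line, if present, simply passes through because its key matches nothing else. On the second data set it would keep the first line of each run (87578121 rather than 87578193) and would also keep the `Id3` line, since the string comparison is case-sensitive, so it does not reproduce the "supposed to be" listing exactly.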