remove consecutive duplicate rows


 
# 1  
Old 06-08-2011

I have some data that looks like,
Code:
1    3300665.mol   3300665    5177008   102.093
2    3300665.mol   3300665    5177008   102.093
3    3294015.mol   3294015    5131552   102.114
4    3294015.mol   3294015    5131552   102.114
5    3293734.mol   3293734    5129625   104.152
6    3293734.mol   3293734    5129625   104.152
7    3347497.mol   3347497    5510897   109.15
8    3294070.mol   3294070    5132258   113.096
9    3295423.mol   3295423    5141084   114.11
10   3295423.mol   3295423    5141084   114.11
11   3347551.mol   3347551    5511243   114.165
12   3347551.mol   3347551    5511243   114.165
13   3290635.mol   3290635    5108661   116.16
14   3290635.mol   3290635    5108661   116.16
15   3347550.mol   3347550    5511242   117.107
16   3347550.mol   3347550    5511242   117.107
17   3293127.mol   3293127    5119773   118.193
18   3293127.mol   3293127    5119773   118.193
19   3382893.mol   3382893    5728430   119.181
20   3382893.mol   3382893    5728430   119.181

with ~150,000 rows. I need to look for instances where the value in col $1 is the same in consecutive rows. When a duplicate row is found, I need to remove it and write it to a second file. I guess the way to do it is to create two new files: one with a single copy of each line (whether the row occurs once or more than once in the input file), and a second file listing the rows that were found to be duplicates.

Minimal set of all rows (dups have been removed):
Code:
1    3300665.mol   3300665    5177008   102.093
3    3294015.mol   3294015    5131552   102.114
5    3293734.mol   3293734    5129625   104.152
7    3347497.mol   3347497    5510897   109.15
8    3294070.mol   3294070    5132258   113.096
9    3295423.mol   3295423    5141084   114.11
11   3347551.mol   3347551    5511243   114.165
13   3290635.mol   3290635    5108661   116.16
15   3347550.mol   3347550    5511242   117.107
17   3293127.mol   3293127    5119773   118.193
19   3382893.mol   3382893    5728430   119.181


Duplicates file:
Code:
 2    3300665.mol   3300665    5177008   102.093
 4    3294015.mol   3294015    5131552   102.114
 6    3293734.mol   3293734    5129625   104.152
 10   3295423.mol   3295423    5141084   114.11
 12   3347551.mol   3347551    5511243   114.165
 14   3290635.mol   3290635    5108661   116.16
 16   3347550.mol   3347550    5511242   117.107
 18   3293127.mol   3293127    5119773   118.193
 20   3382893.mol   3382893    5728430   119.181

I'm not sure how to go about this, and I can't do it in Excel, so some assistance would be appreciated. There could be runs of 3 or more identical rows, I'm not sure. I guess it makes sense to just keep the first instance of each run.

LMHmedchem
# 2  
Old 06-08-2011
Based on your sample file, your key is NOT in the first field, but rather in the SECOND.
This will create 2 files: myInput_dup and myInput_uniq
Code:
nawk '{print $0 >> (FILENAME (($2 in dup)?"_dup":"_uniq"));dup[$2]}' myInput
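
The same logic spread out for readability (an untested sketch, same behavior):
Code:
nawk '{
    # pick the output file: key ($2) already seen -> _dup, otherwise _uniq
    out = FILENAME (($2 in dup) ? "_dup" : "_uniq")
    print $0 >> out
    dup[$2]    # merely referencing the index records the key in the array
}' myInput

Note that this tracks every key seen so far, not just the previous line's, so it also catches duplicates that are not consecutive.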

# 3  
Old 06-08-2011
Make all lines unique (so that duplicated consecutive lines appear only once):
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq
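
The sub() strips everything up to and including the old line number, so the differing first fields no longer defeat uniq. Using the first row of the sample:
Code:
# before: 1    3300665.mol   3300665    5177008   102.093
# after:  3300665.mol   3300665    5177008   102.093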

Log duplicate lines to a file (only lines that appear twice or more consecutively will be logged):
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq -d >duplicate.txt

Log lines that are not consecutively repeated to a file:
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq -u >single.txt

Log all lines prefixed by the number of times they appear consecutively:
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq -c >count.txt

You can renumber the lines if necessary with ... | cat -b or ... | cat -n
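
For example, to get a de-duplicated and renumbered file in one go (a sketch; the output name is made up):
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq | cat -n > yourfile_uniq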


By the way, do you really care about the first field (line number) or can we get rid of it?

# 4  
Old 06-08-2011
Quote:
Originally Posted by vgersh99
Based on your sample file, your key is NOT in the first field, but rather in the SECOND.
This will create 2 files: myInput_dup and myInput_uniq
Code:
nawk '{print $0 >> (FILENAME (($2 in dup)?"_dup":"_uniq"));dup[$2]}' myInput

I was numbering the columns with $0 as the first column; is that not right? Now that I think about it, $0 is the whole line, if I remember correctly.

Will this work with awk, or do I need nawk?

Quote:
Originally Posted by ctsgnb
Make all lines unique (so that duplicated consecutive lines appear only once):
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq

By the way, do you really care about the first field (line number) or can we get rid of it?
I probably need an index field, but I probably don't need to preserve the values from the input file. I could just add another awk stage to generate a new index.
Code:
awk 'BEGIN{OFS="\t"} {print (NR>1?NR-1:"id"),$0}'
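
For instance, tacked onto the end of one of the pipelines above (a sketch; with no header line, a plain NR index would do):
Code:
awk '{sub(".*"$2,$2)}1' yourfile | uniq | awk 'BEGIN{OFS="\t"} {print NR,$0}'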

LMHmedchem
# 5  
Old 06-08-2011
The sample you posted is line-numbered; is it the output of an awk command of yours?
If so, show it to us and provide an example of the very initial input file you have, before any formatting.


When you state:

"I need to remove the duplicate row"

do you mean that 2 identical consecutive lines should:
a) appear only once?
or
b) not appear at all?
# 6  
Old 06-08-2011
Quote:
Originally Posted by LMHmedchem
I was numbering the columns with $0 as the first column; is that not right? Now that I think about it, $0 is the whole line, if I remember correctly.
awk's field references are one-based: $1 is the first field, $2 the second, and so on. $0 is the entire record.
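For example:
Code:
echo "one two" | awk '{print "$0=" $0, "| $1=" $1, "| $2=" $2}'
# prints: $0=one two | $1=one | $2=two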
Quote:
Originally Posted by LMHmedchem
Will this work with awk, or do I need nawk?
If you're on Solaris, use either nawk or /usr/xpg4/bin/awk.
Anywhere else, plain awk should do (most likely).
# 7  
Old 06-08-2011
Quote:
Originally Posted by ctsgnb
The sample you posted is line-numbered; is it the output of an awk command of yours?
If so, show it to us and provide an example of the very initial input file you have, before any formatting.

When you state:

"I need to remove the duplicate row"

do you mean that 2 identical consecutive lines should:
a) appear only once?
or
b) not appear at all?
The answer is a): each line needs to appear only once in the output. The formatting of the input file is pretty far back in the tool chain and I don't see much value in redoing that part. It is just as easy to add a new column. The string in $1 is the index anyway.
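
Putting the pieces together, I'm thinking of something like this (untested sketch; "uniq.txt" and "dups.txt" are just placeholder names):
Code:
awk 'BEGIN{OFS="\t"}
     {sub(".*"$2,$2)}                        # strip the old index; the key is now $1
     $1==prev {print $0 > "dups.txt"; next}  # consecutive repeat of the previous key
     {prev=$1; print ++n,$0 > "uniq.txt"}    # first instance, written with a fresh index
' inputfile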

LMHmedchem