Deleting all occurrences of a duplicate row


 
# 1  
Old 07-10-2008

Hi,

I need to delete all occurrences of repeated lines from a file and keep only the lines that are not repeated anywhere else in the file. As shown below, the first two lines are identical except for the strings "From BaseLine" and "From SMS". These strings should be ignored when checking for repeated lines. I want to retain only the third line.

From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019

My output should be the third line alone.

The file sizes range from 100 MB to 900 MB, so performance should also be considered. Can you please help me out?

Regards,

Ragav.
# 2  
Old 07-10-2008
Use nawk or /usr/xpg4/bin/awk on Solaris:

Code:
awk -F- 'END {
  for (p in r)
    if (u[p] == 1)
      print r[p]
}
!u[$2] ++ {
  r[$2] = $0
}' input

# 3  
Old 07-10-2008

Thanks. Can you please explain?

Regards,

Ragav.
# 4  
Old 07-10-2008
Which part of the code is not obvious?
# 5  
Old 07-10-2008
Can you please explain the entire code???

Regards
Ragav
# 6  
Old 07-10-2008
Code:
uniq -u -f 3 file
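
One caveat with this approach: uniq only compares adjacent lines, so the one-liner above relies on the repeated rows sitting next to each other, as they do in the sample. A minimal sketch for the case where they do not, assuming output order does not matter and that the key always starts at the fourth whitespace-separated field (the temporary file and the variable names are made up for illustration):

```shell
# Hypothetical sample: the duplicate pair is deliberately NOT adjacent.
sample=$(mktemp)
cat > "$sample" <<'EOF'
From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27
EOF

# Sort on field 4 onward so equal keys become adjacent, then keep only
# the lines whose key (everything after the first 3 fields) occurs once.
unique_rows=$(LC_ALL=C sort -k4 "$sample" | LC_ALL=C uniq -u -f 3)
printf '%s\n' "$unique_rows"

rm -f "$sample"
```

Sorting a file of several hundred MB has its own cost, so whether this beats the awk approach would need to be measured on the real data.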

# 7  
Old 07-10-2008
OK.

Code:
awk -F- ...

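Use '-' as the field separator, so $1 is the "From ..." label and $2 is the rest of the line (the sample data contains no other '-').

The following pattern/action pair is executed first, once for every input line: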

Code:
!u[$2] ++ {
  r[$2] = $0
}

When the string in the second field is seen for the first time, the value of u[$2] is 0 (false in AWK) because of implicit variable initialization. The pattern follows a common AWK idiom:

Code:
!array[key] ++

Which actually means:

Code:
array[key] ++ == 0

So, when !array[key]++ is true (the count was 0 before the increment: 0 -> false, !0 -> true), the action runs: it builds another associative array r (r for record, because it holds the entire record), with $2 as the key and $0 as the value. This way we store one copy (the first one) of each distinct $2, while the pattern part, u[$2] ++, counts how many times each value of $2 occurs.
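
To see the idiom in isolation, here is a small sketch on made-up input (the letters and the array name seen are only for illustration). The pattern is true only the first time a line appears, so each distinct line is printed once:

```shell
# !seen[$0]++ is true only while the count for this line is still 0,
# i.e. on its first occurrence; the post-increment then bumps the count.
first_seen=$(printf '%s\n' a b a c b | awk '!seen[$0]++')
printf '%s\n' "$first_seen"
```

Note that this alone keeps one copy of every repeated line, which is why the script in this thread also needs the counts in the END block to drop those lines entirely.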

Code:
END {
  for (p in r)
    if (u[p] == 1)
      print r[p]
}

After all the input has been read the END block is executed.
For every key (p) in the r array, check whether the value in the u array under the same key equals 1 (the key occurs only once in the entire input); if so, print the corresponding value of the r (record) array.

That's all.
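
As a quick check, the script can be run against the three sample lines from the first post. This is only a sketch: the temporary file and the variable names are made up, and on Solaris awk would be replaced by nawk or /usr/xpg4/bin/awk as noted earlier in the thread.

```shell
# Hypothetical sample file holding the three lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
From BaseLine - 0T001 000 999999999 00101 20080411000000T1023.27
From SMS - 0T001 000 999999999 00101 20080411000000T1023.27
From BaseLine - 0T001 000 999999999 00101 20080411000000T109.019
EOF

# u[] counts every occurrence of each key ($2); r[] keeps the first line
# seen for each key. At END, only keys that occurred once are printed.
result=$(awk -F- '
  !u[$2]++ { r[$2] = $0 }
  END { for (p in r) if (u[p] == 1) print r[p] }
' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Only the third line survives here. With more than one surviving line, the order in which for (p in r) visits the keys is unspecified in AWK, so the output is not guaranteed to follow the input order.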