Optimizing query

08-03-2007

Hi All,

This is my first thread in this sub-forum, and also the first thread of this sub-forum.

Here it is:

I am trying to delete duplicates from a table, retaining just one row from each set of duplicate records.

For example: if x of the n records in a table are duplicates of each other, I want to remove x - 1 of them and retain one.

This is the query I am using, but I think it could still be optimized, because it takes a long time on huge dumps.

DELETE FROM tableA
WHERE rowid NOT IN
(SELECT MIN(rowid) FROM tableA GROUP BY column1)

Oracle 9i

matrixmadhan

08-03-2007

rowid is not an indexed column; it is a "pseudocolumn". The IN () subselect will read through the entire result set of the SELECT statement each time. When I get back in a while, I'll write something that is faster. You may need to add an index.

jim mcnamara
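
A minimal sketch of the index suggested above, assuming the duplicate key is column1 as in the original query. The index name tablea_column1_idx is made up for illustration, and whether the optimizer actually uses the index for this delete depends on the data and the statistics.

```sql
-- Hypothetical index name; column1 is the key used in the original GROUP BY.
CREATE INDEX tablea_column1_idx ON tableA (column1);

-- Refresh optimizer statistics so the new index can be considered.
ANALYZE TABLE tableA COMPUTE STATISTICS;
```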

08-03-2007

Matrix,
Your solution is simple and easy to understand.

It works fine for small to medium tables, or for tables with low access/update activity.
For very large tables, or tables with heavy access/update activity, it may:
1) Generate an abnormally long transaction.
2) Fill the logs.
3) Deadlock with another process.
4) Run for a very long time.

It is also important to note that, for large tables, the internals of your query will be
very inefficient, as the system will store a rowid for each unique key and loop through
all of them for every row.

A better solution would be to write a program that loops through the rows
of the table and commits after every fixed number of deleted rows; batches of
one to five thousand are usually quick, safe, and easy on the database.

Good luck!

Shell_Life
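
The batching idea above could be sketched in PL/SQL roughly as follows. This is one possible reading of the suggestion, not the poster's own program: it deletes in chunks rather than row by row, batch_size is a hypothetical tuning value, and the table and column names come from the original query. Two caveats: each pass re-evaluates the correlated subquery, so an index on column1 matters, and rows where column1 is NULL are left alone here, whereas the original GROUP BY treats all NULLs as one group.

```sql
-- Sketch only: delete duplicates in batches, committing after each batch.
DECLARE
  batch_size CONSTANT PLS_INTEGER := 5000;  -- hypothetical; the post suggests 1,000 to 5,000
BEGIN
  LOOP
    -- Delete up to batch_size rows that are not the lowest rowid for their column1 value.
    DELETE FROM tableA a1
    WHERE a1.rowid <> (SELECT MIN(a2.rowid)
                       FROM tableA a2
                       WHERE a2.column1 = a1.column1)
      AND ROWNUM <= batch_size;

    EXIT WHEN SQL%ROWCOUNT = 0;  -- no duplicates left
    COMMIT;                      -- keep each transaction short
  END LOOP;
END;
/
```

Because no cursor stays open across the commits, the loop can be stopped and restarted at any point; it simply continues with whatever duplicates remain.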

08-03-2007

I don't know if this is more efficient, but a positive approach, one that first selects only the keys that actually have duplicates, might be better when there are only a few duplicates.

DELETE FROM tableA A1
WHERE column1 IN (SELECT column1 FROM tableA GROUP BY column1 HAVING COUNT(*) > 1)
AND rowid != (SELECT MIN(rowid) FROM tableA A2 WHERE A1.column1 = A2.column1)

kahuna
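
Since this approach is expected to help mainly when duplicates are few, a quick count beforehand can check that assumption. The query below is a sketch that reuses the GROUP BY ... HAVING subquery from the statement above; the column aliases are made up for illustration.

```sql
-- How many column1 values are duplicated, and how many rows a dedupe would remove.
-- rows_to_delete is NULL when there are no duplicates at all.
SELECT COUNT(*)     AS duplicated_keys,
       SUM(cnt - 1) AS rows_to_delete
FROM (SELECT column1, COUNT(*) AS cnt
      FROM tableA
      GROUP BY column1
      HAVING COUNT(*) > 1);
```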

08-03-2007

Quote:

Originally Posted by jim mcnamara

rowid is not an indexed column; it is a "pseudocolumn". The IN () subselect will read through the entire result set of the SELECT statement each time. When I get back in a while, I'll write something that is faster. You may need to add an index.

Does that mean it is executing in this fashion?

rowid <1> - evaluate sub query
rowid <2> - evaluate sub query
.
.
.
rowid <n> - evaluate sub query

matrixmadhan
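
One way to check how the statement is actually executed, rather than guessing, is to look at its execution plan. The sketch below assumes Oracle 9i Release 2 (where DBMS_XPLAN is available) and an existing PLAN_TABLE, which is normally created by the utlxplan.sql script shipped with the database.

```sql
-- Ask the optimizer for the plan without running the delete.
EXPLAIN PLAN FOR
DELETE FROM tableA
WHERE rowid NOT IN
  (SELECT MIN(rowid) FROM tableA GROUP BY column1);

-- Show the plan that was just stored in PLAN_TABLE.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Roughly speaking, a FILTER step with the subquery underneath it suggests the row-by-row evaluation described above, while an anti-join step suggests the subquery result is built once and then matched against the table.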

08-03-2007

Shell_Life,

Of the four potential hazards that you listed, I encounter only the fourth, the very long run time, since the query is executed on a table with only 0.25 million records.

When it initially took such a long time, I thought I might receive 'Long transaction aborted', but I didn't.

Deleting programmatically is a fine alternative, since it avoids filling the logs.

matrixmadhan

08-03-2007

Quote:

rowid is not an indexed column; it is a "pseudocolumn".

By definition, a rowid is the physical address of a row, so access by rowid is direct and effectively serves as an index.

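It is also important to note that some database servers (Informix, for example) do not assign rowids
to rows in fragmented tables by default. Oracle, the database in question here, does assign rowids
to rows in partitioned tables.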

Shell_Life
