The UNIX and Linux Forums  

Go Back   The UNIX and Linux Forums > Special Forums > UNIX and Linux Applications
Google UNIX.COM


UNIX and Linux Applications Questions involving software not covered by other forum go here. This includes Databases and Middleware.

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Tips for Optimizing Rails on Oracle iBot Oracle Updates (RSS) 0 04-06-2008 02:10 AM
Optimizing for a Speed-up switch Shell Programming and Scripting 3 04-07-2006 05:52 PM
optimizing disk performance J.P Filesystems, Disks and Memory 4 03-01-2005 08:32 AM
Optimizing Apache Server Performance Neo News, Links, Events and Announcements 0 02-14-2005 06:30 PM
Optimizing the system reliability Deepa Filesystems, Disks and Memory 2 12-11-2002 07:09 PM

Reply
 
Submit Tools LinkBack Thread Tools Display Modes
  #1 (permalink)  
Old 08-02-2007
Technorati Master
 

Join Date: Mar 2005
Location: Large scale systems...
Posts: 2,411
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Optimizing query

Hi All,

My first thread to this sub-forum and first thread of this sub-forum

Here it is,

Am trying to delete duplicates from a table retaining just 1 duplicate value out of the duplicate records

for example : from n records of a table out of which x are duplicates, I want to remove x - 1 records retaining 1 record from the x duplicates

This is the query am using,

but I think this could still be optimized for this query really takes time for huge dumps.

DELETE FROM tableA
WHERE rowid not in
(SELECT MIN(rowid) FROM tableA GROUP BY column1)

Oracle 9i
Reply With Quote
Forum Sponsor
  #2 (permalink)  
Old 08-03-2007
...@...
 

Join Date: Feb 2004
Location: NM
Posts: 3,289
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
rowid is not an indexed column - it is a "pseudocolumn'. the in () subselect will read thru the entire select statement's result set each time. When I get back in a while I'll write something that is faster. You may need to add an index.
Reply With Quote
  #3 (permalink)  
Old 08-03-2007
Shell_Life's Avatar
Unix/Informix/4GL/SQL
 

Join Date: Mar 2007
Location: Bahia, Brazil
Posts: 694
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Matrix,
Your solution is very simple and easy to understand.

Although it works fine for small to medium tables, or for tables with low access/update,
for very large tables, or tables with heavy access/update, it may:
1) Generate an abnormally long transaction.
2) Fill the logs.
3) Be involved with another process in a deadlock.
4) Run for a very long time.

It is also important to note that for large tables, the internals of your query will be
very ineficient as the system will store a rowid for each unique key and loop thru
each one for every row.

The best and optimized solution would be to write a program to loop thru each row
in the table, begin a transaction and commit every number of deleted rows -- usually
one to five thousand is very quick, safe and easy on the database.

Good luck!
Reply With Quote
  #4 (permalink)  
Old 08-03-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
I don't know if this is more efficient, but it seems like a positive approach might be better where there are only a few duplicates.

DELETE FROM tableA A1
WHERE column1 in (SELECT column1 FROM tableA GROUP BY column1 having count(*) > 1)
and rowid != (select min(rowid) from tableA A2 where A1.column1 = A2.column1)
Reply With Quote
  #5 (permalink)  
Old 08-03-2007
Technorati Master
 

Join Date: Mar 2005
Location: Large scale systems...
Posts: 2,411
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Quote:
Originally Posted by jim mcnamara View Post
rowid is not an indexed column - it is a "pseudocolumn'. the in () subselect will read thru the entire select statement's result set each time. When I get back in a while I'll write something that is faster. You may need to add an index.
Does that mean its executing in this fashion.

rowid <1> - evaluate sub query
rowid <2> - evaluate sub query
.
.
.
rowid <n> - evaluate sub query

Last edited by reborg; 08-03-2007 at 11:29 AM. Reason: touch post to fix quotes
Reply With Quote
  #6 (permalink)  
Old 08-03-2007
Technorati Master
 

Join Date: Mar 2005
Location: Large scale systems...
Posts: 2,411
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Shell_Life

out of the 4 potential hazards that you have listed

since the query is executed only on a table with 0.25 million records, I just encounter the 4th hazard which is taking real long time.

When it initially took such a long time, I though I might be receiving ' Long transaction aborted '. But didn't.

Considering the alternative of programmatically deleting is a fine idea without filling the logs.
Reply With Quote
  #7 (permalink)  
Old 08-03-2007
Shell_Life's Avatar
Unix/Informix/4GL/SQL
 

Join Date: Mar 2007
Location: Bahia, Brazil
Posts: 694
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Quote:
rowid is not an indexed column - it is a "pseudocolumn'.
By definition, rowids are the physical address of each row, thus it is also an index.

It is also important to note that the database server does not assign rowids to rows
in fragmented tables.

Last edited by reborg; 08-03-2007 at 11:29 AM. Reason: touch post to fix quotes
Reply With Quote
  #8 (permalink)  
Old 08-03-2007
Technorati Master
 

Join Date: Mar 2005
Location: Large scale systems...
Posts: 2,411
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Quote:
Originally Posted by kahuna View Post
I don't know if this is more efficient, but it seems like a positive approach might be better where there are only a few duplicates.

DELETE FROM tableA A1
WHERE column1 in (SELECT column1 FROM tableA GROUP BY column1 having count(*) > 1)
and rowid != (select min(rowid) from tableA A2 where A1.column1 = A2.column1)

Correct me if am wrong !

Am not able to understand the optimization that you have made to make the query run faster.

Basically, with combination of using selected columns ( column1 ) and the specific rowids ( rowid ), what is the need to specify an extra condition with a separate subquery to extract column1.

Isnt that redundant ? Or how does it make the query optimized and more efficient.

This is really equivalent to for each record with a rowid ' x ', necessarily two subqueries should be executed for all the records.

Last edited by reborg; 08-03-2007 at 11:30 AM.
Reply With Quote
  #9 (permalink)  
Old 08-03-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Quote:
Originally Posted by matrixmadhan View Post
Basically, with combination of using selected columns ( column1 ) and the specific rowids ( rowid ), what is the need to specify an extra condition with a separate subquery to extract column1.
Yes you are right that it makes 2 sub-selects. If you have many duplicates, then your original post may be better. But suppose you only have a single duplicate. My subquery1 returns only a single value of column1 (the duplicate), where your subquery returns many values (the non duplicates). So now I have a single value to compare against tableA where you have many values. Yes, I still have to make subquery2, but it is against a much smaller set.

Subquery2 makes sure that, for a given value of column1, we don't delete the row with the smallest rowid.

Last edited by kahuna; 08-03-2007 at 12:03 PM. Reason: Clarification
Reply With Quote
  #10 (permalink)  
Old 08-04-2007
...@...
 

Join Date: Feb 2004
Location: NM
Posts: 3,289
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Look - the query has to search the ENTIRE output of the IN (SELECT ..)
for each set of rows it deletes. I know that rowid's are what you get from an index.

Instead of me going on about this - look in Tom Kyte's book 'EXpoert One on One' and look in the analytical function chapter - there is an example of how to optimize a query very like this. Or try the asktom oracle site:
From there:
Code:
from 
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1224636375004

delete from tableA a
     where rowid <> ( select max(rowid)
                        from tableA b 
                       where b.column1 = a.column1)
    /
Reply With Quote
  #11 (permalink)  
Old 08-04-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
I know that it may not be possible, but it would be interesting to see a comparison timing of the different queries.
Reply With Quote
  #12 (permalink)  
Old 08-06-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
I made a test table with a single varchar field and no index. I loaded it with 30,000 records and an additional 30 duplicate records. I ran the following queries.

Code:
select count(*) FROM tableA
WHERE rowid not in
(SELECT MIN(rowid) FROM tableA GROUP BY column1);

1 hour 12 minutes 58 seconds
Code:
select count(*) FROM tableA A1
WHERE column1 in (SELECT column1 FROM tableA GROUP BY column1 having count(*)
> 1)
and rowid != (select min(rowid) from tableA A2 where A1.column1 = A2.column1);

1 second
Code:
select count(*) from tableA a
     where rowid <> ( select max(rowid)
                        from tableA b
                       where b.column1 = a.column1);

6 min 29 sec
Your mileage may vary.

Last edited by kahuna; 08-06-2007 at 07:04 AM.
Reply With Quote
  #13 (permalink)  
Old 08-16-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Matrixmadhan,

Have you decided if anything works better than your original query? If so, can you let us know what worked and some idea of the time difference? Thanks.
Reply With Quote
  #14 (permalink)  
Old 08-16-2007
Technorati Master
 

Join Date: Mar 2005
Location: Large scale systems...
Posts: 2,411
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Thanks for the follow up.

I had one more option of doing that.

Just tried with few records and that seems to be better.
Actually I didnt compare with bulk number of records.

Will do that and post the results positively by end of tomorrow !
Reply With Quote
  #15 (permalink)  
Old 08-23-2007
kahuna's Avatar
Registered User
 

Join Date: Apr 2007
Posts: 147
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!Reddit! Stumble this Post!Spurl this Post!
Quote:
Originally Posted by matrixmadhan View Post
Will do that and post the results positively by end of tomorrow !
Perhaps we should use FedEx -"Absolutely Positively Overnight"
Just kidding. Seriously, I am curious to know the results.
Reply With Quote
Google UNIX.COM
Reply

Thread Tools
Display Modes


The 50 most popular UNIX and Linux searches.
Google Search Cloud for The UNIX and Linux Forums
"inappropriate ioctl for device" 421 service not available, remote server has closed connection ^m autosys awk trim bash eval bash exec bash for loop boot: cannot open kernel/sparcv9/unix close_wait command copy/move folder in unix curses.h cut command in unix dead.letter find grep find null character in a unix file grep multiple lines grep or grep recursive grep unique inaddr_any inappropriate ioctl for device logrotate.conf lynx javascript mailx attachment mget mtime ping port remove first character from string in k shell replace space by comma , perl script scp recursive segmentation fault(coredump) sftp batch sftp script