Oracle Performance Tuning is a big subject, as anyone will appreciate who has read any of a number of books that set out to help you to understand what problem you are trying to solve, what factors may affect the performance you are seeing, what strategies are available to the query optimizer, and so on.
While Cary Millsap's Optimizing Oracle Performance focusses on finding, tracing and prioritising specific problems in the face of vague reports that the system seems a bit slow this week, and Jonathan Lewis' Cost-Based Oracle Fundamentals takes us on a tour of the CBO to help answer such questions as Why isn't my EXISTS query using an index? (and why isn't it faster than the IN version?) a poster on OraFAQ has an approach we've not seen before:
The following is a problem I need help with and I am willing to pay for help if necessary. Any info would be greatly appreciated.
Two tables in a database:
Table1 contains a list of phone numbers
Table2 contains a list of phone numbers as well
I would like to create a Table3, in which Table3 contains all numbers from Table1 that is not in Table2. I am looking for the shortest runtime possible, keeping in mind that you can use whatever method(s) you deem necessary.
Table1 contains 30 Million rows,
Table2 contains 2000 rows.
given a regular SQL expression, it will yield Big O(m*n)
Where m = rowcount of Table1
and n = rowcount of Table2
Generate for me, a method in which, runtime will yield Big O (m log2 n).
I don't need code, I want to hear your logic. Table1 is customers, Table2 is a list of prepaid phone numbers. Table3 is list of people to bill.
As usual no database version is given. The first suggestion, as you might expect, is the quite reasonable:
select phone_no from table1 minus select phone_no from table2;
accompanied by a comment that it doesn't seem like a great piece of schema design to have two tables in the first place. However, the reply comes back:
The guy who wrote this problem just told me that this answer is incorrect.
His response to me was:
it's not too simplistic, but it is incorrect. This will still yield a big O(m*n).
Give it one more try, you are thinking too much in terms of DB.
Ask yourself, what are the only structures that would yield BigO(nlog2n)? Answer that, and you will get your answer.
Resident mathematician Ross Leishman tried to explain to me what Big O (m log2n) means, and I can confirm that it is not after all a When Harry Met Sally reference as most of us would probably assume. Apparently the version with an EXISTS subquery was what was wanted, which seemed odd to me on a number of levels, not least that an IN subquery would probably produce the same plan, especially in 10g where the new HASH JOIN RIGHT ANTI allows the database to build its hash table from the 2,000-row table2 rather than the 30 million-row table1. But of course we don't know the database version, do we? Or whether the columns are nullable, unique or indexed, or how values are distributed, or really anything about the actual environment that would help in solving a real-world performance problem. I can see where I'm going wrong though. I am thinking too much in terms of DB.