Wednesday, July 27, 2011

Cost Is Time: Next Generation

It looks like Oracle has introduced a new "cost is time" model for the time estimate of the Cost-Based Optimizer (CBO) with the Oracle 11.2.0.2 patch set.

In order to understand the implications let me summarize the evolution of the CBO in terms of cost / time estimate so far:

1. Oracle 7 and 8

The cost estimate generated by the Cost-Based Optimizer (CBO) has always been a time estimate, although expressed in a slightly obscure unit, which is number of single block reads.

The traditional I/O based costing introduced with Oracle 7 in principle counted the number of required single- and multi-block reads to arrive at the final cost. A potential drawback of this approach was the missing differentiation between multi- and single-block reads - one multi-block read generated the same cost as one single-block read. Although the model used an "adjusted" multi-block read count to make full table scans costlier than indicated by larger "db_file_multiblock_read_count" settings (and hence accounted for smaller extents and for blocks already cached in the buffer cache making multi-block reads smaller than requested), it still potentially favoured full table scans over index access paths.

From Oracle 8 on one could play with the OPTIMIZER_INDEX_COST_ADJ / OPTIMIZER_INDEX_CACHING parameters to compensate for this shortcoming of the costing model, in particular for OLTP-biased applications.
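
For illustration only - a minimal sketch of how these parameters could be adjusted for an OLTP-biased application (the values shown are arbitrary examples, not recommendations):


-- Cost index access paths at 25% of their nominal I/O cost
alter session set optimizer_index_cost_adj = 25;

-- Assume 90% of index blocks to be cached when costing index access in nested loops
alter session set optimizer_index_caching = 90;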

2. Oracle 9i

Oracle introduced with Oracle 9i the System Statistics along with a more sophisticated cost calculation model.

In short, System Statistics offer the following features:

- Different treatment of single-block and multi-block operations
- Time-based optimization using average timings for single- and multi-block reads
- Cost calculation includes a CPU cost component
- Gather actual hardware capabilities to base the calculations on actual system capabilities and workload pattern

More details can be found for example in my "Understanding System Statistics" blog series.

So with System Statistics the CBO actually calculates an estimated execution time - you can see this in the EXPLAIN PLAN output: With System Statistics enabled it includes a TIME column.

Put simply, the time estimate is the average time for a single-block read (SREADTIM) times the number of single-block reads, plus the average time for a multi-block read (MREADTIM) times the number of multi-block reads, plus the estimated number of CPU operations divided by the CPU operations per second (CPUSPEED / CPUSPEEDNW). So the cost with System Statistics is actually based on a time estimate.

For consistency reasons it has been decided to use the same unit as before, so the estimated time is simply divided by SREADTIM to arrive at the same cost unit as with traditional I/O based costing, which is the number of single-block reads (although plans involving full segment scan operations usually arrive at different costs than with the traditional costing, so consistency is hardly given anyway).
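
Expressed a bit more formally - a simplified sketch that ignores some refinements of the actual implementation:

\[
\text{time} \approx \#\text{SRds} \cdot \text{SREADTIM} + \#\text{MRds} \cdot \text{MREADTIM} + \frac{\#\text{CPUCycles}}{\text{CPUSPEED}}
\qquad
\text{cost} \approx \frac{\text{time}}{\text{SREADTIM}}
\]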

Right from the beginning in Oracle 9i the System Statistics could be gathered in WORKLOAD mode, which means that Oracle takes two snapshots of certain performance statistics and calculates the System Statistics parameters like SREADTIM, MREADTIM, MBRC etc. from the delta values.
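
A minimal sketch of that two-snapshot approach - the workload in between is of course whatever is representative for your system:


exec dbms_stats.gather_system_stats(gathering_mode => 'START')

-- ... run a representative workload for a while ...

exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')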

3. Oracle 10g

Starting with Oracle 10g, System Statistics are enabled by default with so-called default NOWORKLOAD settings. It is even possible to generate an artificial load on the system by gathering NOWORKLOAD System Statistics, which simply uses a predefined I/O pattern to measure the disk transfer speed (IOTFRSPEED) and the disk seek time (IOSEEKTIM) - these values are then used to derive the SREADTIM and MREADTIM values, the two most important ingredients of the enhanced cost/time calculation.

So since Oracle 9i there is a built-in functionality to measure the capabilities of the underlying hardware - from 10g on either based on a particular workload pattern or by submitting an artificial predefined load.

Furthermore Oracle provides a well-defined API as part of the DBMS_STATS package for dealing with System Statistics: they can be gathered, deleted, exported, imported, manually defined and even gathered directly into a separate statistics table to build a history of gathered System Statistics.
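
For illustration, a minimal sketch of that API - these are the standard DBMS_STATS calls, but the statistics table name, the statids and the SREADTIM value are made-up examples:


-- Gather WORKLOAD System Statistics over a 30 minute representative interval
exec dbms_stats.gather_system_stats(gathering_mode => 'INTERVAL', interval => 30)

-- Or gather directly into a separate statistics table to build a history
exec dbms_stats.create_stat_table(null, 'MY_STAT_TAB')
exec dbms_stats.gather_system_stats(gathering_mode => 'NOWORKLOAD', stattab => 'MY_STAT_TAB', statid => 'RUN_1')

-- Export, import, manually override and delete
exec dbms_stats.export_system_stats(stattab => 'MY_STAT_TAB', statid => 'BACKUP')
exec dbms_stats.import_system_stats(stattab => 'MY_STAT_TAB', statid => 'RUN_1')
exec dbms_stats.set_system_stats(pname => 'SREADTIM', pvalue => 5)
exec dbms_stats.delete_system_stats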

4. Oracle 11g

In 11g Oracle introduced the I/O calibration routine as part of the Resource Manager. Note that so far this feature didn't have a direct relationship to the Cost-Based Optimizer - it could however be used to control the maximum parallel degree via the PARALLEL_IO_CAP_ENABLED parameter.

The first thing that puzzled me when dealing with that new functionality was the lack of a well-defined API to maintain the gathered information. There is a single call in the resource manager package (DBMS_RESOURCE_MANAGER.CALIBRATE_IO) to run the I/O calibration, but apart from that there is no additional functionality for maintenance: no way to delete the calibration results, export or import them, or manually override them.
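
For reference, a minimal sketch of that single call - the num_physical_disks and max_latency values are just placeholders describing the storage, not recommendations:


declare
  l_max_iops       pls_integer;
  l_max_mbps       pls_integer;
  l_actual_latency pls_integer;
begin
  -- Runs the calibration; the results are externalized in DBA_RSRC_IO_CALIBRATE
  dbms_resource_manager.calibrate_io(
    num_physical_disks => 8
  , max_latency        => 20
  , max_iops           => l_max_iops
  , max_mbps           => l_max_mbps
  , actual_latency     => l_actual_latency
  );
  dbms_output.put_line('max_iops=' || l_max_iops || ' max_mbps=' || l_max_mbps || ' actual_latency=' || l_actual_latency);
end;
/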

If you want to get an idea what this means, have a look at the MOS document "Automatic Degree of Parallelism in 11.2.0.2 [ID 1269321.1]" which, besides stating that the actual I/O calibration can run into problems like gathering unreasonable values or not running to completion, shows you how to manipulate an internal SYS table to override the gathered values - a change that also requires bouncing the instance in order to become effective.

I find it hard to understand why Oracle hasn't addressed these handling shortcomings in the meantime, particularly given that with Oracle 11.2.0.2 the I/O resource calibration becomes mandatory if you want to make use of the new Auto-DOP feature introduced with 11.2.0.1. Fiddling with a SYS-owned table doesn't sound like a well-designed feature to me, and the calibration functionality is not exactly "brand-new".

5. Oracle 11.2.0.2

So starting with 11.2.0.2 the new "cost is time" calculation comes into the picture. If you have values in the corresponding SYS.RESOURCE_IO_CALIBRATE$ table (that is simply externalized by the DBA_RSRC_IO_CALIBRATE view) then something really odd happens:

The cost that has been calculated according to the System Statistics model - which is already a time estimate based on three different components: the time for single-block and multi-block reads as well as the estimated CPU time - is now converted into a data volume by simply multiplying the resulting cost by the default block size. Dividing this data volume by the throughput indicated by the I/O calibration results (it looks like the MAX_PMBPS value is the relevant one) arrives at a new estimated execution time.

Let's have a look at a working example: With default NOWORKLOAD System Statistics, an 8KB default block size and an unset db_file_multiblock_read_count - which results in a MultiBlockReadCount (MBRC) of 8 being used internally for the calculation of a full table scan (FTS) - the time estimate for a FTS of 10,000 blocks (80MB) will be based on 1,250 multi-block reads, which are estimated to take 26ms each. This gives us a time estimate of 32.5 seconds. The CPU time associated with that full table scan operation will be added on top, so that the final result will be somewhere between 32.5 and 33 seconds. Let's stick to the 32.5 seconds: this time estimate corresponds to approx. 2,710 single-block reads, obtained by simply dividing the time by 12ms, which happens to be the SREADTIM value for default NOWORKLOAD System Statistics with the above configuration - this value will be close to the cost shown (minor variations depend on the CPU speed determined).
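
These figures can be verified with a bit of arithmetic - here is a minimal sketch using the default NOWORKLOAD formulas, where the literals 10000, 8, 8192, 10 and 4096 represent the segment size in blocks, the MBRC, the block size, IOSEEKTIM and IOTFRSPEED of this example:


select
        round(10000 / 8)                                   as multi_block_reads
      , 10 + (8 * 8192) / 4096                             as mreadtim_ms
      , 10 + 8192 / 4096                                   as sreadtim_ms
      , round(10000 / 8) * (10 + (8 * 8192) / 4096) / 1000 as est_time_sec
      , round(round(10000 / 8) * (10 + (8 * 8192) / 4096)
              / (10 + 8192 / 4096))                        as approx_io_cost
from
        dual;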

Cost / time estimate for a FTS of a 10,000 block segment with 8KB block size, default NOWORKLOAD System Statistics and default MBRC of 8 used for cost calculation (_db_file_optimizer_read_count = 8):

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2716 (1)| 00:00:33 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 10000 | 2716 (1)| 00:00:33 |
-------------------------------------------------------------------


Now if you happen to have a MAX_PMBPS value of 4MB/sec as the I/O Resource Calibration result (I deliberately chose this very conservative value because it happens to be the same transfer rate that the default NOWORKLOAD System Statistics assume - 4096 bytes per millisecond), the following new time calculation will happen instead:

2,710 will be multiplied by the 8KB default block size to arrive at a data volume, in this case approx. 21 MB.

This approx. 21 MB is now divided by the 4MB/sec to arrive at a new time estimate of approx. 5.3 seconds, rounded up to 6 seconds in the plan output. Note that the original time estimate was 32.5 seconds.
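
As a rough sketch of this conversion, using the cost of 2,716 shown in the plans and the trace excerpt below:


select
        round(2716 * 8192 / 1024 / 1024, 1)     as tot_io_size_mb
      , round(2716 * 8192 / 1024 / 1024 / 4, 1) as new_time_sec
from
        dual;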

Cost / time estimate for a FTS of a 10,000 block segment with 8KB block size, default NOWORKLOAD System Statistics and default MBRC of 8 used for cost calculation (_db_file_optimizer_read_count = 8) but MAX_PMBPS set to 4MB/sec:

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2716 (1)| 00:00:06 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 10000 | 2716 (1)| 00:00:06 |
-------------------------------------------------------------------


You can see this also happening in the 10053 CBO trace file:


kkeCostToTime: using io calibrate stats
maxmbps=0(MB/s) maxpmbps=4(MB/s)
block_size=8192 mb_io_count=1 mb_io_size=8192 (bytes)
tot_io_size=21(MB) time=5304(ms)


Now this approach strikes me as odd for several reasons:

- A sophisticated time estimate (remember that it even includes a CPU time component that has nothing to do with an I/O volume) is turned into a data volume to arrive at a new time estimate using a rather simplistic approach

- As you can see from the above example, the "data volume" calculated does not correspond to the actual I/O volume that we know from the System Statistics cost/time calculation - remember that the actual segment size in this case was 80MB, not 20MB. This is of course caused by the original time estimate being based on multi-block reads. Why we would turn the cost/time into some data volume that has nothing to do with the actual data volume used for the original cost/time calculation is beyond me

- There is already an I/O calibration routine available as part of the System Statistics functionality that can be used to arrive at more realistic time estimates based on the gathered System Statistics information - so why has a second one been introduced? Furthermore this raises the question: if I'm required to run the I/O calibration to enable Auto-DOP, shouldn't I then also "calibrate" my System Statistics to arrive at a "calibrated" cost estimate? After all the new "Cost Is Time" approach uses the cost estimate for the new time estimate.

- As already outlined there is no officially documented way to properly deal with the I/O calibration results - manually poking into SYS-owned tables doesn't really count

Implications

So you probably think: why bother? The cost estimate is left untouched; only the TIME column is affected. So execution plans shouldn't change, since they are still chosen based on the lowest cost estimate - and the lower the cost, the lower the new time estimate.

You'll appreciate, however, that the previous simple correlation between the cost and the time estimate no longer holds in 11.2.0.2 when resource calibration results are available. So far you could simply divide the time estimate by the SREADTIM value to arrive at the cost - or, the other way around, multiply the cost by the SREADTIM value to arrive at the time estimate - or use both values to derive SREADTIM, since the time divided by the cost should give you the approximate value of SREADTIM.

The point with 11.2.0.2 and the I/O resource calibration is that the new time estimate is obviously used for the Auto-DOP feature to drive two crucial decisions:

- Is the statement a candidate for parallel execution? This is controlled via the parameter PARALLEL_MIN_TIME_THRESHOLD that defaults to 10 seconds in 11.2.0.2

- If it is a candidate for parallel execution, what is the optimal DOP? This of course depends on a lot of different inputs but also seems to be based on the new time estimate - which, as just explained, is derived from a (wrong) data volume estimate in a questionable way

As a side note, Oracle at present recommends setting the value of MAX_PMBPS to 200 for Exadata environments rather than relying on the results of the actual I/O calibration - another indication that the I/O calibration results as of now are questionable.

Summary

Oracle introduced a new model for the estimated "Execution Time" with the 11.2.0.2 patch set that applies if I/O resource calibration results are available. As outlined above the new approach seems questionable (at least), but it will be used for crucial decisions regarding the new Auto-DOP feature. It will be interesting to see the further development in this area - whether, for example, the new time algorithm will be changed in upcoming releases or the influence of the I/O calibration on the CBO calculations will be extended.

If you want to make use of the new Auto-DOP feature in 11.2.0.2 you should be aware of this relationship - the MAX_PMBPS parameter drives both the new time estimation and the Auto-DOP calculations.

Monday, July 25, 2011

Logical I/O - Evolution: Part 2 - 9i, 10g Prefetching

In the initial part of this series I've explained some details regarding logical I/O using a Nested Loop Join as example.

To recap I've shown in particular:

- Oracle can re-visit pinned buffers without performing logical I/O

- There are different variants of consistent gets - a "normal" one involving buffer pin/unpin cycles requiring two latch acquisitions and a short-cut variant that visits the buffer while holding the corresponding "cache buffers chains" child latch ("examination") and therefore only requiring a single latch acquisition

- Although two statements use a similar execution plan and produce the same number of logical I/Os one is significantly faster and scales better than the other one

The initial part used the "classic" shape of the Nested Loop Join, but Oracle introduced various enhancements in that area in recent releases - in particular "Table Prefetching" in 9i and Nested Loop Join Batching using "Vector/Batched I/O" in 11g.

Although these enhancements have been introduced primarily to optimize the physical I/O patterns, they could also have an influence on logical I/O.

The intention of Prefetching and Batching seems to be the same - they both are targeted towards the usually most expensive part of the Nested Loop Join: The random table lookup as part of the inner row source. By trying to "prefetch" or "batch" physical I/O operations caused by this random block access Oracle attempts to minimize the I/O waits.

I might cover the effect on physical I/O of both "Prefetching" and "Batching" in separate posts, here I'll only mention that you might see "db file scattered read" or "db file parallel read" multi-block I/O operations instead of single block "db file sequential read" operations for the random table access with those optimizations (Index prefetching is also possible, by the way). Note also that if you see the Prefetching or Batching plan shape it does not necessarily mean that it is actually going to happen at execution time - Oracle monitors the effectiveness of the Prefetching and can dynamically decide whether it will be used or not.

10.2.0.4 Table Prefetching - Random order

Let's enable table prefetching in 10.2 and re-run the original test case. The first run will use the different order variant of T1 and T2:

Inner row source Unique Index - T1 different order than T2


---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.12 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.90 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.71 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.20 | 200K|
---------------------------------------------------------------------------------------------------------------


Inner row source Non-Unique Index - T1 different order than T2


--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:05.03 | 311K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:04.40 | 311K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:03.02 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.49 | 201K|
--------------------------------------------------------------------------------------------------------------


As you'll see, in 10g even with table prefetching enabled the unique index variant looks the same and performs similarly to the original post.

This changes in 11g by the way, where the unique index variant also supports the table prefetching plan shape.

For the non-unique variant you'll see a different shape of the execution plan where the inner row source random table lookup is actually a parent operation to the Nested Loop Join (and hence will only be started once and consumes the information generated by the child Nested Loop operation).

Note that in case of nested Nested Loop Joins only the inner-most row source will make use of the Table Prefetching shape. The same applies to the 11g Nested Loop Join Batching. If you happen to have several Nested Loop Joins that are not directly nested, then each of the inner-most row sources might use the Table Prefetching/Batching shape - which means that it can be used more than once as part of a single execution plan.

If you compare the runtime profile of the non-unique index variant with the original runtime profile without Table Prefetching you won't see any difference in terms of logical I/O; however, it becomes obvious that the overall execution is actually slightly faster (more significantly so with row source sampling overhead enabled). In particular the random table access requires significantly less time than in the original runtime profile, so it seems to be more efficient, although it is still slower than the unique index variant.

Begin Update

Having focused on the logical I/O I completely forgot to mention the inconsistency in the A-Rows column (thanks to Flado who pointed this out in his comment below), which shows 200K rows for the Nested Loop operation although only 100K rows have been identified in the inner index lookup. I believe this is an inconsistency that also shows up when performing an SQL trace so it seems to be a problem with the row source statistics. In principle with this plan shape the Nested Loop Join operation seems to account for the sum of both the rows identified in the driving row source and the inner index lookup, rather than the expected number of rows identified in the inner index lookup only.

However, as mentioned below in the "Statistics" section there is another anomaly - a consistent get and "buffer is pinned count" for every row looked up in the inner table, so this might not be just coincidence but another indicator that there is really some excess work happening with Table Prefetching.

By the way - both anomalies are still present in 11.1 / 11.2 when using Table Prefetching there.

End Update

Let's have a look at the session statistics.


Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
STAT..consistent gets 310,012 311,101 1,089
STAT..consistent gets from cache 310,012 311,101 1,089
STAT..session logical reads 310,012 311,101 1,089
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..buffer is not pinned count 200,001 99,997 -100,004
STAT..buffer is pinned count 189,998 290,006 100,008
STAT..consistent gets - examination 300,001 100,007 -199,994
STAT..no work - consistent read gets 10,001 211,084 201,083
LATCH.cache buffers chains 320,031 522,195 202,164


So the only significant difference in this case is the increased "buffer is pinned count" / decreased "buffer is not pinned count" statistics, although the number of logical I/Os stays the same. I don't know if this really means excess work with Table Prefetching enabled or whether this is an instrumentation problem. Nevertheless, with Table Prefetching enabled you'll end up in this case with both a "buffer is pinned count" and a "consistent get" for each row looked up in the inner row source table operation. The number of logical I/Os and latch acquisitions stays the same, so it's not obvious from the statistics why this performs better than the non-Table Prefetching case - according to the statistics it even performs more work, but maybe the table random access as parent operation to the Nested Loop allows more efficient processing requiring fewer CPU cycles.

10.2.0.4 Table Prefetching - Same (Random) order

Let's change the data order and use either the same "Pseudo-Random" order (by uncommenting the second "dbms_random.seed(0)" call) or order by ID - it doesn't matter with Table Prefetching in 10g.

Inner row source Unique Index - T1 and T2 same order


---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:03.91 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.70 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.54 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.14 | 200K|
---------------------------------------------------------------------------------------------------------------


Inner row source Non-Unique Index - T1 and T2 same order


--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.54 | 221K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:03.90 | 221K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:02.82 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.40 | 201K|
--------------------------------------------------------------------------------------------------------------


Now we really see a difference: The unique index variant still shows the same results, but the non-unique variant saves logical I/O on the random table access - and is faster than with random order - coming closer to the unique index variant performance.

Whereas the index range scan still requires approx. 200,000 logical I/Os the random table access only requires 10,000 logical I/Os instead of 100,000.

The session statistics:


Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
LATCH.cache buffers chains 320,023 342,213 22,190
STAT..consistent gets 310,012 221,110 -88,902
STAT..consistent gets from cache 310,012 221,110 -88,902
STAT..session logical reads 310,012 221,110 -88,902
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..no work - consistent read gets 10,001 121,093 111,092
STAT..buffer is not pinned count 200,001 10,006 -189,995
STAT..buffer is pinned count 189,998 379,997 189,999
STAT..consistent gets - examination 300,001 100,007 -199,994


The session statistics confirm this: The "buffer is pinned count" increases by another 90,000 for the non-unique index variant which corresponds to the 90,000 logical I/Os performed less as part of the random table access operation.

The number of latch acquisitions decreases accordingly so that we end up with a comparable number as with the unique index variant.

Scalability

If you run the non-unique index Table Prefetching variant with the concurrent execution test harness you'll see correspondingly slightly better scalability, although it still does not scale as well as the unique index variant.

Summary

Table Prefetching has been introduced in Oracle 9i in order to optimize the random physical access in Nested Loop Joins; however, it also seems to have a positive effect on logical I/O. The effectiveness of this optimization depends on the data order - if the data from the driving row source is in the same order as the inner row source table, buffers can be kept pinned. Note that the same doesn't apply to the index lookup - even if the data is ordered by ID and consequently the same index branch and leaf blocks are accessed repeatedly with each iteration, a buffer pinning optimization could not be observed.

In the next part we'll see what happens with this example in Oracle 11g and its new features.

Thursday, July 7, 2011

Logical I/O - Evolution: Part 1 - Baseline

Forward to Part 2

This is the first part in a series of blog posts that shed some light on the enhancements Oracle has introduced in recent releases regarding the optimization of logical I/O.

Before we can appreciate the enhancements, though, we need to understand the baseline. This is what this blog post is about.

The example used throughout this post is based on a simple Nested Loop Join which is one area where Oracle has introduced significant enhancements.

It started its life as a comparison of using unique vs. non-unique indexes as part of a Nested Loop Join and their influence on performance and scalability.

This comparison on its own is very educating and also allows to demonstrate and explain some of the little details regarding logical I/O.

Here is the basic script that gets used. It creates two tables with a primary key defined, one table using a unique index, the other one a non-unique index.

The tables are specifically crafted to have exactly 100,000 rows with 10 rows per block resulting in 10,000 blocks (using the MINIMIZE RECORDS_PER_BLOCK option). These "obvious" numbers hopefully allow for nice pattern recognition in the resulting figures. Using the default 8K block size the resulting indexes will have slightly more than 1,000 blocks.

It will then run a Nested Loop Join from one table to the other and then the other way around, along with a snapshot of the session statistics using Adrian Billington's RUNSTATS package, which is based on Tom Kyte's well known package of the same name. You can get it from here.

If you run this against 9i to 10.2 you'll need to disable table prefetching to get the results explained here. This can only be done by setting the static parameter "_table_lookup_prefetch_size" to 0, which requires restarting the instance.

11g allows controlling the behaviour via various hints and parameters, see the script for more details.

In order to be in line with the baseline explanations presented here this should be executed against pre-11g since 11g introduces some significant changes that will be covered in upcoming posts.


--------------------------------------------------------------------------------
--
-- File name: unique_non_unique_index_difference.sql
--
-- Purpose: Compare the efficiency of NESTED LOOP joins via index lookup
-- between unique and non-unique indexes
--
-- Author: Randolf Geist http://oracle-randolf.blogspot.com
--
-- Prereqs: RUNSTATS_PKG by Adrian Billington / Tom Kyte
--
-- Last tested: June 2011
--
-- Versions: 10.2.0.4
-- 10.2.0.5
-- 11.1.0.7
-- 11.2.0.1
-- 11.2.0.2
--------------------------------------------------------------------------------

set echo on timing on linesize 130 pagesize 999 trimspool on tab off serveroutput off doc on

doc
From 9i to 10.2 you need to disable table prefetching
to get the "original" version of NL joins

-- Disable table prefetchting
alter system set "_table_lookup_prefetch_size" = 0 scope = spfile;

-- Back to defaults
alter system reset "_table_lookup_prefetch_size" scope = spfile sid = '*';

From 11g on this can be handled via the nlj_prefetch and nlj_batching hints

But they work a bit counterintuitive when combined therefore

opt_param('_nlj_batching_enabled', 0)

is also required to get exactly the NL join optimization requested

Since this is about logical I/O, not physical I/O you need sufficient cache
defined (256M should be fine) otherwise the results will differ
when physical I/O happens
#

spool unique_non_unique_index_difference.log

drop table t1;

purge table t1;

exec dbms_random.seed(0)

-- Random order
-- Create 10 rows in a single block
create table t1
--pctfree 0
as
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 10
order by
-- id
dbms_random.value
;

-- Tell Oracle to store at most 10 rows per block
alter table t1 minimize records_per_block;

truncate table t1;

-- Populate the table, resulting in exactly 10,000 blocks with MSSM
insert /*+ append */ into t1
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 100000
order by
-- id
dbms_random.value
;

-- Force BLEVEL 2 for UNIQUE index (with 8K blocks, root->branch->leaf)
create unique index t1_idx on t1 (id) pctfree 80;

-- Avoid any side effects of dynamic sampling
-- (and perform delayed block cleanout when not using direct-path load)
exec dbms_stats.gather_table_stats(null, 't1', estimate_percent => null)

-- Add PK constraint
alter table t1 add constraint t1_pk primary key (id);

drop table t2;

purge table t2;

-- exec dbms_random.seed(0)

-- Random order (but different from T1 order)
-- Create 10 rows in a single block
create table t2
--pctfree 0
as
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 10
order by
-- id
dbms_random.value
;

-- Tell Oracle to store at most 10 rows per block
alter table t2 minimize records_per_block;

truncate table t2;

-- Populate the table, resulting in exactly 10,000 blocks with MSSM
insert /*+ append */ into t2
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 100000
order by
-- id
dbms_random.value
;

-- Force BLEVEL 2 for NON-UNIQUE index (with 8K blocks, root->branch->leaf)
create index t2_idx on t2 (id) pctfree 80;

-- Avoid any side effects of dynamic sampling
-- (and perform delayed block cleanout when not using direct-path load)
exec dbms_stats.gather_table_stats(null, 't2', estimate_percent => null)

-- Add PK constraint based on non-unique index
alter table t2 add constraint t2_pk primary key (id);

alter session set statistics_level = all;

-- Run the commands once to cache the blocks and get a runtime profile
select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t2 a
, t1 b
where
a.id = b.id
);

select * from table(dbms_xplan.display_cursor(null, null, '+COST +OUTLINE ALLSTATS LAST'));

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t1 a
, t2 b
where
a.id = b.id
);

select * from table(dbms_xplan.display_cursor(null, null, '+COST +OUTLINE ALLSTATS LAST'));

-- Eliminate row source statistics overhead
-- for the "real" test
alter session set statistics_level = typical;

exec runstats_pkg.rs_start

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t2 a
, t1 b
where
a.id = b.id
);

exec runstats_pkg.rs_middle

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t1 a
, t2 b
where
a.id = b.id
);


set serveroutput on

exec runstats_pkg.rs_stop(-1)

spool off


Expectations

What are the expected results - explained here based on 10.2? This is a Nested Loop Join of a 100,000 rows table to another 100,000 rows table, each allocating 10,000 blocks. The inner row source will use the available index and perform a table lookup by ROWID 100,000 times.

The script specifically crafts the indexes to have a height of 3 (BLEVEL = 2) when using a default block size of 8K which means that they have a root block with a number of branch blocks on the second level and finally the leaf blocks on the third level. Note that different block sizes can lead to different index heights and therefore different results.
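
The resulting index height can be verified quickly, for example like this (assuming the T1_IDX / T2_IDX indexes created by the script above):


select index_name, blevel, leaf_blocks from user_indexes where index_name in ('T1_IDX', 'T2_IDX');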

In terms of "buffer visits" required to complete the statement we could think of the following:

- 100,000 block visits for the outer row source running a simple full table scan. For every iteration of the loop we need to visit the buffer and read the next row that will be used for the lookup into the inner row source

- 300,000 block visits for the inner row source index lookup, since for every index lookup we need to traverse the index from root to branch to leaf

- 100,000 block visits for the inner row source table lookup by ROWID

So according to this model in total we need to "explain" 500,000 block visits for this example.

Let's have a look at the various results from the script.

1. The runtime profile

a) Running the Nested Loop Join using the "Unique Index" inner row source


Plan hash value: 3952364803

---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:03.95 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.70 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.59 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.14 | 200K|
---------------------------------------------------------------------------------------------------------------


If everything went as expected you'll see here the "classic" shape of a Nested Loop Join using an index lookup for the inner row source. The loop is driven by the full table scan of the T2 table, and for every row produced by that row source the inner row source will be examined, starting with an index unique scan in this case, followed by a table access by ROWID for those rows found in the index.

Comparing the runtime profile to the model described above, one significant difference becomes immediately obvious: the profile only shows 310,000 logical I/Os, not 500,000. So either the above model is incorrect or Oracle has introduced some "short-cuts" that avoid approx. 190,000 out of 500,000 logical I/Os. The difference of 190,000 seems to come from the index unique scan, which only reports 200,000 logical I/Os instead of the expected 300,000, and from the full table scan of T2 driving the nested loop, which reports only 10,000 logical I/Os instead of the 100,000 pictured above. More on those differences in a moment.

b) Running the Nested Loop Join using the "Non-Unique Index" inner row source


Plan hash value: 537985513

---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:06.31 | 311K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:06.10 | 311K|
| 3 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T2 | 100K| 1 | 2 (0)| 100K|00:00:04.91 | 301K|
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.77 | 201K|
---------------------------------------------------------------------------------------------------------------


This is very interesting: First of all the cost calculation is the same, so in terms of costs estimates of the optimizer there is no difference between the unique and non-unique case.

However the runtime is significantly different: The non-unique variant is consistently slower than the unique variant.

Furthermore, another minor difference is a slightly increased number of logical I/Os that seems to be caused by the INDEX RANGE SCAN operation (201K vs. 200K).

Why is that? Although we have defined a non-deferrable primary key constraint that guarantees uniqueness, in case of an index range scan Oracle still searches for the next index entry that does not satisfy the access predicate - which means that for every iteration of the loop Oracle looks at the next index entry to check whether it still satisfies the predicate or not. For the last index entry in each leaf block it actually has to check the first entry of the next leaf block for this comparison, hence in this case we end up with approximately as many additional logical I/Os as there are index leaf blocks. This is also the first part of the explanation why Oracle has to perform more work for the non-unique variant. From the runtime profile, however, we can tell that although we lose time at the index range scan vs. index unique scan operation, we lose even more time at the table access by ROWID operation.

Remember for a better understanding that the A-TIME and Buffer columns are cumulative - every parent operation includes the child operation runtime/logical I/Os, so in order to understand the runtime/logical I/Os of an operation itself you need to subtract the values taken from the direct descendant operation(s).

2. Session Statistics

Let's have a look at the relevant session statistics:


Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..buffer is pinned count 189,998 189,998 0
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
STAT..buffer is not pinned count 200,001 200,005 4
STAT..consistent gets 310,012 311,110 1,098
STAT..consistent gets from cache 310,012 311,110 1,098
STAT..session logical reads 310,012 311,110 1,098
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..consistent gets - examination 300,001 100,007 -199,994
STAT..no work - consistent read gets 10,001 211,093 201,092
LATCH.cache buffers chains 320,023 522,213 202,190


a) Pinned Buffers

Now isn't that an interesting coincidence? Maybe not. The "buffer is pinned count" statistic quite nicely matches the missing 190,000 buffer visits. So Oracle managed 190,000 times to keep a buffer pinned instead of re-locating it in the buffer cache by hashing the database block address to find the corresponding hash bucket, grabbing the corresponding "cache buffers chains" child latch and so on.

Which buffers does Oracle keep pinned? Further modifications of the test case and investigating logical I/O details using events 10200/10202 allow the conclusion that Oracle keeps the buffers of the driving table T2 pinned, as well as the root block of the index. Pinning the root block of the index is a good idea in particular, since it saves one logical I/O per loop iteration and the index root block is also quite unlikely to change frequently.
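
If you want to reproduce this, the consistent gets can be traced for example like this - a sketch only, since these are unsupported diagnostic events and their output format differs between versions:


-- 10200: trace consistent reads, 10202: trace consistent reads - examination
alter session set events '10200 trace name context forever, level 1';
alter session set events '10202 trace name context forever, level 1';

-- ... run the statement to analyze ...

alter session set events '10200 trace name context off';
alter session set events '10202 trace name context off';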

Why does Oracle not simply keep all of the buffers pinned rather than going through the hash/latch/pin exercise again and again? Very likely for various scalability/concurrency reasons, for example:

- A pinned buffer can not be removed/replaced even if it was eligible according to the LRU logic, hence it potentially prevents other buffers from being cached

- A pinned buffer can not be accessed by other sessions that want to pin it in incompatible mode (exclusive vs. shared), although multiple sessions can pin it concurrently in compatible mode (shared). Either those sessions have to queue behind (that is what a "buffer busy wait" is about) or they may be able to create a clone copy of the block and continue their work on the clone copy. Although the "clone copy" trick is a nice one, it is undesirable for several reasons:

* The "clone" copies require each a buffer from the cache effectively reducing the number of different blocks that can be held in the buffer cache. They are also the reason why an object might require much more cache than its original segment size in order to stay completely in the cache.

* They increase the "length" of the "cache buffers chains" leading to longer search times for blocks when locating the buffer in the cache and holding the "cache buffers chains" latch while doing so, hence increasing the potential for latch contention

So here is an important point: If you want to understand the work Oracle has performed in terms of buffer visits, you need to consider both the number of logical I/Os and the number of buffers visited without involving logical I/O - the latter is represented by the "buffer is pinned count" statistic.

Quite often this fact is overlooked and people focus only on the logical I/Os - which is not unreasonable - but this misses the point about pinned buffers re-visited without doing logical I/O.

Note that buffer pinning is not possible across fetch calls - if control is returned to the client the buffers will no longer be kept pinned. This is the explanation why the "fetchsize" or "arraysize" for bulk fetches can influence the number of logical I/Os required to process a result set.
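
In SQL*Plus this can be demonstrated simply by varying the array size - a quick sketch based on the T1 table created by the script above (the exact figures will of course depend on the data and the block layout):


set autotrace traceonly statistics

-- Small fetch arrays: control returns to the client more often, buffers get
-- unpinned more often => more consistent gets for the same statement
set arraysize 15
select * from t1;

-- Large fetch arrays: fewer fetch calls => fewer consistent gets
set arraysize 500
select * from t1;

set autotrace off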

b) "consistent gets - examination"

There is another significant difference between the two runs that explains most of the remaining runtime difference between the unique and non-unique index variant: The unique index variant performs approx. 310,000 logical I/Os quite similar to the non-unique index variant, however it grabs the corresponding "cache buffers chains" child latch only 320,000 times vs. 520,000 times for the non-unique index.

How is that possible? The explanation can be found in the statistics: The Nested Loop Join when dealing with a unique index performs all logical I/Os as part of the inner row source as "short-cut" consistent gets, which are called "consistent gets - examination". Oracle uses this shortcut whenever it knows that the block visit will be of very short duration. Oracle knows that in this particular case because the unique index guarantees that there will be at most one matching row in the index structure as well as when doing the subsequent table row lookup. So there is no need to perform a "range" scan on the index, and it is guaranteed that only one single row per iteration can be returned from the index unique scan for the table lookup by ROWID.

Hence Oracle makes use of this knowledge and works on the buffer contents while holding the latch, this is what the "consistent gets - examination" statistics is about. A "normal" consistent get grabs the latch initially and releases it after having the buffer pinned. It works then on the buffer and afterwards "unpins" the buffer which requires another latch acquisition. Therefore a "non-shortcut" consistent get requires two latch acquisitions per logical I/O. This explains why we have 10,000 non-shortcut consistent gets for the driving full table scan (that are accompanied by 90,000 buffer visits avoiding logical I/O by keeping the buffer pinned) resulting in 20,000 latch acquisitions and 300,000 latch acquisitions for the remaining 300,000 "short-cut" consistent gets which makes in total 320,000 latch acquisitions for the unique index variant.

The non-unique index variant performs 200,000 "non-shortcut" logical I/Os on the inner index and the table lookup, responsible for 400,000 latch acquisitions, another 10,000 for the driving table full table scan (this part is not different from the unique index variant) good for another 20,000 latch acquisitions. But it also performs 100,000 "short-cut" consistent gets, resulting in the remaining 100,000 latch acquisitions. Modifying the test case by creating the index with a height of 2 (BLEVEL = 1) shows that Oracle uses the "short-cut" consistent gets on the branch blocks of the index, so this is another area where Oracle makes use of the "optimized" version of logical I/O even with the non-unique index variant.

Scalability

What do these subtle differences mean in terms of scalability? Well, you can download a bunch of scripts that allow you to run the same test case with as many concurrent sessions as desired. They will show that there is a significant difference between the two cases: the unique index variant is not only faster in "single-user" mode but also scales much better than the non-unique index variant when performing the test concurrently (and completely cached, i.e. purely logical I/O based). The provided test suite could be modified to use a more realistic scenario that runs the statement multiple times in each concurrent session with a random sleep in between, but that is left as an exercise for the interested reader.

Summary

The baseline results show that Oracle uses many built-in features to optimize logical I/O, by either avoiding the logical I/O altogether or using "short-cut" versions of logical I/O where applicable.

These optimizations allow the "unique index" variant to perform significantly better than the "non-unique index" variant in this particular test case. Note that the difference is only that significant when dealing with the purely logical I/O variant - introducing physical I/O makes the difference far less impressive, since the majority of the time is then spent on physical I/O, not logical I/O.

In the upcoming parts of this series I'll focus on further enhancements introduced in recent releases, like table prefetching, Nested Loop Join batching aka I/O batching, and an obviously significantly enhanced buffer pinning mechanism introduced in Oracle 11g.

Footnote

If you study the script carefully, then you'll notice that it allows for different ordering of the data - it could be randomly ordered, randomly ordered but the same (pseudo-random) order for T1 and T2 and it could be ordered by ID.

If you run the test with this different ordering of data you'll notice no difference in the results with 10g (and table prefetching disabled), but it might give a clue as to where this will be heading in the upcoming posts.

Forward to Part 2