Wednesday, December 30, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 3: The Optimizer And Distribution Methods

As mentioned in the first and second part of this instalment, the different distribution methods available for the new parallel FILTER are selected automatically by the optimizer - in this last post of the series I want to focus on that optimizer behaviour.

It looks like there are two new optimizer-related parameters that control the behaviour of the new feature: "_px_filter_parallelized" is the overall switch to enable or disable the new parallel FILTER capability and defaults to "true" in 12c, while "_px_filter_skew_handling" influences how the optimizer determines the distribution method. The parameter name suggests that it has something to do with some kind of "skew" - note that the internal parameter controlling the new automatic join skew handling is called "_px_join_skew_handling", so the two names are rather similar.
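
If you want to check the current settings of these hidden parameters, the usual (unsupported) way is to query the X$ tables as SYS - a minimal sketch, just for reference:

-- Requires SYS access - inspect the two hidden parameters mentioned above
select a.ksppinm  as name
     , b.ksppstvl as value
     , b.ksppstdf as is_default
     , a.ksppdesc as description
from   x$ksppi  a
     , x$ksppcv b
where  a.indx = b.indx
and    a.ksppinm in ('_px_filter_parallelized', '_px_filter_skew_handling')
order by a.ksppinm;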

But even after playing around with the feature for quite a while I couldn't come up with a good test case where the optimizer chose a different distribution method based on the typical data distribution skew patterns - that is, where the expression used for the FILTER lookup has some values that are much more popular than others. So I got in touch with Yasin Baskan, product manager for Parallel Execution at Oracle, and asked what kind of skew has to be present to see a difference in behaviour.

As it turns out, "skew" means something different in this context. When the mentioned parameter "_px_filter_skew_handling" is set to "true" (the default value in 12c) the optimizer chooses a different distribution method based on the size of the object driving the filter. According to my tests this effectively means: if the object is so small that only one granule (usually 13 blocks) per PX slave can be assigned, the optimizer automatically uses a HASH distribution; otherwise - if the object is larger than this threshold - no re-distribution is selected. I wasn't able to come up with an example where the optimizer automatically picks the other available distribution method, which is RANDOM / ROUND-ROBIN (see the previous post). To demonstrate the point, here is a small example:
create table t2 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't2')

create table t3 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Default plan, no redistribution before parallel FILTER
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------

exec dbms_stats.set_table_stats(null, 't3', numblks => 52)

-- Setting stats of T3 to 52 (13 * DOP) blocks or smaller - HASH distribution will be used, 53 blocks or greater => no redistribution
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |     1 |     5 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |     5 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER               |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE          |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH       | :TQ10000 | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         PX BLOCK ITERATOR |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| T3       | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWP |            |
|* 10 |       TABLE ACCESS FULL   | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------
So this example shows that the HASH distribution will be used by the optimizer if the object T3 driving the FILTER operation is 52 blocks or smaller, which corresponds to 13 blocks per PX slave at a degree of 4.
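
As a side note, relating a table's statistics to this threshold simply means comparing the BLOCKS value the optimizer sees with 13 * DOP - here is a hypothetical helper query (not part of the test case above), assuming the tables and the DOP of 4 from the example:

-- Hypothetical check, assuming DOP 4 as in the example (threshold 13 * 4 = 52 blocks)
select table_name
     , blocks
     , case
         when blocks <= 13 * 4 then 'HASH distribution expected before FILTER'
         else 'no re-distribution expected'
       end as expected_behaviour
from   user_tables
where  table_name in ('T2', 'T3');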

Now I find this behaviour pretty hard to explain - usually you wouldn't want to use Parallel Execution on such a small object anyway. But things get even worse: not only is the "skew" handling based on the object size questionable in my opinion, the behaviour can also become a real problem if the row source driving the FILTER operator is no longer a plain table but the result of a more complex operation, which can be as simple as a join or a non-mergeable view:
-- Resetting stats to true size of table - this would mean no redistribution at a DOP of 4, see above
exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        (select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- But simply using a NO_MERGE hint on the select from the simple T3 row source results in an unnecessary HASH re-distribution
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |     1 |    13 |  9755K  (1)| 00:06:22 |        |      |            |
|   1 |  SORT AGGREGATE            |          |     1 |    13 |            |          |        |      |            |
|   2 |   PX COORDINATOR           |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)     | :TQ10001 |     1 |    13 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE         |          |     1 |    13 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER                |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE           |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH        | :TQ10000 | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         VIEW               |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |          PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|* 11 |       TABLE ACCESS FULL    | T2       |     1 |     5 |   114   (0)| 00:00:01 |        |      |            |
--------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- If we use a simple join as driving row source again a HASH re-distribution before the FILTER gets added
-- As a result the dreaded HASH JOIN BUFFERED will be used instead of the plain HASH JOIN
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE                 |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)          | :TQ10003 |     1 |    10 |            |          |  Q1,03 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE              |          |     1 |    10 |            |          |  Q1,03 | PCWP |            |
|*  5 |      FILTER                     |          |       |       |            |          |  Q1,03 | PCWP |            |
|   6 |       PX RECEIVE                |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        PX SEND HASH             | :TQ10002 | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  8 |         HASH JOIN BUFFERED      |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |           PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  11 |            STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  12 |             PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  13 |              TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  14 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  15 |           PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  16 |            PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  17 |             TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 18 |       TABLE ACCESS FULL         | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Disabling the FILTER skew handling behaviour means no re-distribution before the FILTER, and hence no HASH JOIN BUFFERED
-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10002 |     1 |    10 |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |          |     1 |    10 |            |          |  Q1,02 | PCWP |            |
|*  5 |      FILTER                   |          |       |       |            |          |  Q1,02 | PCWC |            |
|*  6 |       HASH JOIN               |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|   9 |          STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  10 |           PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  12 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  13 |         PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  14 |          PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  15 |           TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 16 |       TABLE ACCESS FULL       | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------------
So it looks like if the row source driving the parallel FILTER operator is complex (by complex I simply mean not a plain table) the optimizer always adds a HASH distribution before the FILTER unconditionally. It is obvious that such a re-distribution adds overhead - it requires resources to perform. What is even worse: in general the rule is that the more redistributions there are, the more likely the dreaded buffering gets added to the execution plan, as can be seen from the example above, where the HASH JOIN turns into a HASH JOIN BUFFERED due to the HASH distribution the optimizer adds by default after the join and before the FILTER. With the filter "skew" handling disabled this - in my opinion unnecessary - redistribution doesn't show up, and hence the HASH JOIN without buffering can be used in this example.

Summary


The new parallel FILTER operator comes with different distribution methods available to the optimizer. However, at present the way the optimizer automatically determines if and how to re-distribute the data seems questionable to me.

The skew handling is based on the size of the driving object - for very small objects a re-distribution gets added before the FILTER. For row sources driving the filter that are not simple tables, the skew handling seems to add a re-distribution unconditionally.

For the reasons outlined, at present I would recommend considering disabling the filter skew handling by setting the parameter "_px_filter_skew_handling" to "false" - of course not without getting the blessing from Oracle Support before doing so. This should minimise the number of re-distributions added to an execution plan, and losing the capability of handling the "skew" caused by very small objects is in my opinion negligible in most cases.
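
For reference, here is a minimal sketch of how this could be done - at session level for testing, or per statement via the OPT_PARAM hint already used in the examples above; treat this as an illustration to be confirmed with Oracle Support rather than a recommendation to change it system-wide:

-- Session level (underscore parameters need to be enclosed in double quotes)
alter session set "_px_filter_skew_handling" = false;

-- Or per statement, as shown in the examples above
select /*+ parallel(4) opt_param('_px_filter_skew_handling' 'false') */ count(*)
from   t3 t
where  exists (select /*+ no_unnest */ 1 from t2 where t.object_id = t2.object_id);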

Monday, December 28, 2015

New Version Of XPLAN_ASH Utility

A new version 4.22 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version primarily addresses an issue with 12c: if the HIST mode is used to pull ASH information from AWR in 12c, it turns out that Oracle forgot to add the new "DELTA_READ_MEM_BYTES" column to DBA_HIST_ACTIVE_SESS_HISTORY, although it was officially added to V$ACTIVE_SESSION_HISTORY in 12c. So I had to implement several additional if/then/else constructs in the script to handle this inconsistency. It's the first time that the HIST view doesn't reflect all columns of the V$ view - very likely an oversight rather than by design, I assume.
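
If you want to verify this on your own system, a simple dictionary query (just an illustration, not part of XPLAN_ASH itself) shows whether the column made it into the AWR view:

-- Returns no rows if the column is missing from the AWR view, as described above
select owner, table_name, column_name
from   dba_tab_columns
where  table_name = 'DBA_HIST_ACTIVE_SESS_HISTORY'
and    column_name = 'DELTA_READ_MEM_BYTES';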

Apart from that, the I/O figures (Read Bytes / Write Bytes etc.) in the "Activity Timeline" now make more sense for those cases where a process hasn't been sampled for several sample points (see below for more details).
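
To illustrate the underlying idea - this is a simplified sketch, not the actual XPLAN_ASH code: the ASH delta columns can be turned into per-second rates by relating them to DELTA_TIME, which covers the time elapsed since the previous sample of that session:

-- Simplified sketch: per-second I/O rates derived from the ASH delta columns
-- (DELTA_TIME is in microseconds and spans all samples since the session was last sampled)
select sample_time
     , session_id
     , round(delta_read_io_bytes  * 1e6 / nullif(delta_time, 0)) as read_bytes_per_sec
     , round(delta_write_io_bytes * 1e6 / nullif(delta_time, 0)) as write_bytes_per_sec
from   v$active_session_history
where  sql_id = '&sql_id'
order by sample_time;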

Also, in case an execution plan could not be found, a corresponding message now makes it more obvious that you might be able to pull the execution plan from AWR by using different ASH modes (MIXED / HIST).

Here are the notes from the change log:

- Fixed a funny bug: in 12c Oracle has forgotten to add the DELTA_READ_MEM_BYTES column to DBA_HIST_ACTIVE_SESS_HISTORY, so in HIST mode with 12c prior XPLAN_ASH versions could error out with an invalid column name

- Changed the way the I/O figures are treated in the "Activity Timeline based on ASH". The I/O per second is now spread over the (previous) samples covered by DELTA_TIME. This should give a smoother representation of the I/O performed, much closer to what you see in Real-Time SQL Monitoring reports. The difference to prior versions is only visible in cases where a session wasn't sampled for quite a while and hence has a DELTA_TIME spanning multiple previous sample points. This also means that the I/O related columns in the "Activity Timeline based on ASH" now show only the PER SECOND values, no longer the totals as in prior versions

- Added a SET NULL "" to the configuration and initialization section for SQL*Plus environments that use a non-default SET NULL setting. Such a setting screwed up some internal switches so that XPLAN_ASH, for example, thought it was running against a S-ASH repository (see the short illustration after this list)

- Added a note at the end of the output if no execution plan could be found and the script falls back to retrieving plan operation details from ASH. Also added a note to use the MIXED or HIST ASH source option if no execution plan could be found in CURR mode, in case the execution plan has been purged from the Shared Pool in the meantime

- Cloned the "cleanup" section from the end to the beginning of the script to ensure that no current SQL*Plus environment settings influence the script execution. This is particularly relevant if the script execution gets cancelled before the final cleanup section is reached, or if some other, previously run script left a mess behind
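
To illustrate the SET NULL issue mentioned above, here is a minimal, hypothetical SQL*Plus session (not the XPLAN_ASH code itself):

-- A non-default NULL text changes how NULL values are rendered in query output
SET NULL "<NULL>"
select null as val from dual;   -- displays <NULL> instead of an empty string

-- XPLAN_ASH now resets this in its initialization section
SET NULL ""
select null as val from dual;   -- displays an empty string again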

Sunday, December 20, 2015

IT Tage 2015 - "Analysing and troubleshooting Parallel Execution" presentation material

Thanks to all attending my presentation "Analysing and troubleshooting Parallel Execution" at the IT Tage conference 2015 in Frankfurt, Germany. You can download the presentation material here in Powerpoint or PDF format, as well as check the Slideshare upload.

Note that the Powerpoint format adds value in the sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.