It’s been a while since I posted the first part of this series.
Considerations
I created a little tool, snap, which is more or less a combination of Tanel Poder’s snapper and a script from Tom Kyte. I used it in the first part to get all the details of the database process I was using. The next step is invoking Oracle’s Parallel Query feature. Parallel Query means multiple processes work together to execute a query.
When measuring the session with snap, I only get the statistics of “my” database session, not the statistics of all the Parallel Query slaves working for me. This means I get too few details when measuring only my own session. The only way I could think of to measure all relevant statistics in an easy way is to gather the same statistics as my snap tool, but globally (global meaning all processes, in all instances of the database). The result is gsnap: global snap. In my case, I am the only user in the database, so the global statistics cover my session, the parallel query slaves I have invoked and the background processes. If multiple sessions are working in the database, you cannot use gsnap statistics to determine what your session has done, because it is unknown which session contributed to the statistics.
Another interesting thing came up during the investigation: when looking in the SQL area (v$sqlarea, for example), the statistics stored with the SQL are the statistics of all the sessions involved (parallel query slaves) accumulated for that instance. While this sounds logical, it means that the elapsed time stored with the SQL is not the same as the elapsed time the requesting session experienced. To get the average(!) ‘real’ elapsed time, you need to divide the elapsed time by the number of PX slaves used. The number of slaves can be found in the ‘PX_SERVERS_EXECUTIONS’ column. To get the number of slaves used per execution (it is accumulated, so this is an average), divide by ‘EXECUTIONS’. In case of RAC, the number of executions is only available in the instance in which the SQL was executed; the other instances report ‘0’.
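The division above can be sketched as follows; a minimal illustration in Python, with the figures (a single execution, 16 slaves, 207 seconds of accumulated elapsed time) chosen as hypothetical inputs:

```python
def real_elapsed_per_execution(elapsed, executions, px_servers_executions):
    """Approximate the wall-clock elapsed time the requesting session
    experienced, from accumulated v$sqlarea figures.

    elapsed accumulates over all PX slave executions, so divide the
    per-execution elapsed time by the average number of slaves per
    execution (which simplifies to elapsed / px_servers_executions)."""
    avg_slaves_per_execution = px_servers_executions / executions
    return (elapsed / executions) / avg_slaves_per_execution

# One execution that used 16 PX slaves, 207 s accumulated elapsed time:
print(real_elapsed_per_execution(207.0, 1, 16))  # → 12.9375
```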
In this part I investigate the execution of a parallel query on a half rack Exadata, which means 4 database instances. When measuring the statistics of 4 instances, there is a lot of information. A real lot. For the sake of readability, I do not include all statistics, only the ones I deem important or interesting.
Parallel query slave usage ‘spills’ to other instances in a RAC database at a certain point. With the degree of parallelism set between 2 and 16 the slaves remain local; when set to 17 or higher, all instances are used. There are no other sessions in any of the instances. The parameter ‘PARALLEL_DEGREE_LIMIT’ is set to CPU; 16 is the number of CPU threads on an Exadata V2 database node (2 sockets, 8 cores, 16 threads per database node).
In the first part, I got a response time of 37 seconds. In fact, the table has changed a bit since that post, but consider it roughly the same (the same query now takes 33 seconds).
Parallel 16 – local node
As Kevin Closson points out in his comment, the query in the first part was CPU bound, and could be made faster if I could find extra CPU cycles or a faster CPU. Well, I am not going to change the hardware. But wait! I have 16 CPU threads, and the query only uses one. So the obvious next step is to use these threads. To do this, I am going to use Oracle’s Parallel Query functionality. This can be done in several ways; I chose to use the parallel hint in the SQL:
select /*+parallel(16)*/ count(*) from cg_var where sample_id=1;
Wow, the response time now is 13.32 seconds!
What is the result of this parallelisation?
------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 307K(100)| | | | |
| 1 | SORT AGGREGATE | | 1 | 3 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 3 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 3 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 399M| 1141M| 307K (1)| 01:01:26 | Q1,00 | PCWC | |
|* 6 | TABLE ACCESS STORAGE FULL| CG_VAR | 399M| 1141M| 307K (1)| 01:01:26 | Q1,00 | PCWP | |
------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - storage(:Z>=:Z AND :Z<=:Z)
filter("SAMPLE_ID"=1)
Note
-----
- Degree of Parallelism is 16 because of hint
What do the statistics look like?
i Stat Name Diff Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests 127,557 7,645.497
1,STAT physical read total multi block requests 127,445 7,638.784
1,STAT physical read requests optimized 14 .839
1,STAT physical read total bytes 133,276,368,896 7,988,304,014.076
1,STAT physical write total IO requests 24 1.439
1,STAT physical write total bytes 372,736 22,341.008
1,STAT cell physical IO interconnect bytes 4,678,771,904 280,435,704.328
1,STAT cell physical IO bytes eligible for pred 133,275,516,928 7,988,252,948.914
1,STAT cell physical IO bytes saved by storage 14,680,064 879,892.025
1,STAT cell smart IO session cache lookups 192 11.508
1,STAT cell smart IO session cache hits 192 11.508
1,STAT cell physical IO interconnect bytes retu 4,676,846,784 280,320,316.702
1,STAT cell session smart scan efficiency 447 26.792
1,STAT table scans (short tables) 1 .060
1,STAT table scans (long tables) 208 12.467
1,STAT table scans (rowid ranges) 208 12.467
1,STAT table scans (direct read) 208 12.467
1,STAT table scan rows gotten 391,553,915 23,468,914.533
1,STAT table scan blocks gotten 3,995,742 239,496.335
1,STAT cell scans 208 12.467
1,STAT cell blocks processed by cache layer 16,270,709 975,231.927
1,STAT cell blocks processed by txn layer 16,270,709 975,231.927
1,STAT cell blocks processed by data layer 16,267,256 975,024.961
1,STAT cell blocks helped by minscn optimizatio 16,270,709 975,231.927
1,STAT cell simulated session smart scan effici 133,261,361,152 7,987,404,481.604
1,STAT cell IO uncompressed bytes 133,261,361,152 7,987,404,481.604
1,WAIT cell smart table scan 185.188 11.100
1,WAIT PX Idle Wait 426.043 25.536
1,WAIT PX Deq: Join ACK .007 .000
1,WAIT PX Deq Credit: send blkd .003 .000
1,WAIT PX Deq: Parse Reply .006 .000
1,WAIT PX Deq: Execute Reply 13.104 .785
1,WAIT enq: CF - contention .000 .000
1,WAIT PX Deq: Signal ACK RSG .000 .000
1,WAIT PX Deq: Signal ACK EXT .011 .001
1,WAIT PX Deq: reap credit .003 .000
1,WAIT PX Nsq: PQ load info query .200 .012
1,WAIT PX Deq: Slave Session Stats .053 .003
1,WAIT enq: PS - contention .009 .001
1,WAIT enq: TO - contention .000 .000
1,TIME repeated bind elapsed time .000 .000
1,TIME parse time elapsed .009 .001
1,TIME background cpu time .037 .002
1,TIME PL/SQL execution elapsed time .001 .000
1,TIME DB CPU 22.180 1.329
1,TIME sql execute elapsed time 207.183 12.418
1,TIME DB time 207.355 12.428
1,TIME background elapsed time .305 .018
As I’ve stated above, the query response time was 13.32 seconds. Simplistic profile:
elapsed time                 : 207
CPU                          : 22
wait / cell smart table scan : 185
Apparently, I spent 207 seconds of total elapsed time (foreground and PX slaves). 207/16 = 12.94, a little less than the response time (13.32).
The profile shows that by enlarging the CPU capacity with PX, the bottleneck shifts, probably towards storage. Are we pushing the storage? Let’s see: total amount read: 133,276,368,896 bytes; IO eliminated by storage indexes: 14,680,064 bytes; that leaves 133,261,688,832 bytes.
We have 7 storage servers, so divide by 7: 133,261,688,832/7=19,037,384,119, the average amount of IO done by every cell for this query. If we take 19,037,384,119 and divide that by the total time the query took (13.32 seconds): 19,037,384,119/13.32=1,429,233,042. So approximately 1.4GB per second per cell is physically read. I managed to get 1.5GB/s out of a cell during other tests, so we are near the maximum.
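The arithmetic above can be verified with a short script; a minimal sketch in Python, using the byte counts from the statistics listing:

```python
# Per-cell physical read throughput for the parallel-16 run.
total_read = 133_276_368_896   # physical read total bytes
si_saved   = 14_680_064        # cell physical IO bytes saved by storage index
cells      = 7                 # storage servers in a half rack
elapsed    = 13.32             # query response time in seconds

physical_io  = total_read - si_saved   # bytes actually read from disk
per_cell     = physical_io / cells     # average IO per cell for this query
per_cell_bps = per_cell / elapsed      # average read rate per cell

print(f"{per_cell_bps / 1e9:.2f} GB/s per cell")  # ~1.43 GB/s
```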
This would mean that if I increase the degree of parallelism further, which means multiple instances are used, the response time will not decrease significantly. Well, let’s see:
Parallel 64 – the entire cluster
select /*+parallel(64)*/ count(*) from cg_var where sample_id=1;
The response time is 12.88 seconds.
------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 76776 (100)| | | | |
| 1 | SORT AGGREGATE | | 1 | 3 | | | | | |
| 2 | PX COORDINATOR | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 3 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 3 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 399M| 1141M| 76776 (1)| 00:15:22 | Q1,00 | PCWC | |
|* 6 | TABLE ACCESS STORAGE FULL| CG_VAR | 399M| 1141M| 76776 (1)| 00:15:22 | Q1,00 | PCWP | |
------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - storage(:Z>=:Z AND :Z<=:Z)
filter("SAMPLE_ID"=1)
Note
-----
- Degree of Parallelism is 64 because of hint
Statistics:
---------------------------------------------------------------------------------------------------------
i Stat Name Diff Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests 33,306 1,977.687
1,STAT physical read total multi block requests 33,197 1,971.214
1,STAT physical read requests optimized 3 .178
1,STAT physical read total bytes 34,602,745,856 2,054,686,537.670
1,STAT physical write total IO requests 14 .831
1,STAT physical write total bytes 240,640 14,289.033
1,STAT cell physical IO interconnect bytes 1,208,841,744 71,780,166.462
1,STAT cell physical IO bytes eligible for pred 34,602,024,960 2,054,643,731.376
1,STAT cell physical IO bytes saved by storage 3,145,728 186,791.100
1,STAT cell smart IO session cache lookups 200 11.876
1,STAT cell smart IO session cache hits 200 11.876
1,STAT cell physical IO interconnect bytes retu 1,207,442,960 71,697,107.658
1,STAT cell session smart scan efficiency 450 26.721
1,STAT table scans (short tables) 1 .059
1,STAT table scans (long tables) 216 12.826
1,STAT table scans (rowid ranges) 216 12.826
1,STAT table scans (direct read) 216 12.826
1,STAT table scan rows gotten 101,106,534 6,003,634.369
1,STAT table scan blocks gotten 1,031,921 61,274.738
1,STAT cell scans 216 12.826
1,STAT cell blocks processed by cache layer 4,223,803 250,806.430
1,STAT cell blocks processed by txn layer 4,223,803 250,806.430
1,STAT cell blocks processed by data layer 4,223,503 250,788.616
1,STAT cell blocks helped by minscn optimizatio 4,223,803 250,806.430
1,STAT cell simulated session smart scan effici 34,598,936,576 2,054,460,345.321
1,STAT cell IO uncompressed bytes 34,598,936,576 2,054,460,345.321
1,WAIT cell smart table scan 192.931 11.456
1,WAIT library cache pin .006 .000
1,WAIT library cache lock .002 .000
1,WAIT library cache: mutex X .000 .000
1,WAIT PX Idle Wait 448.150 26.611
1,WAIT PX Deq: Join ACK .007 .000
1,WAIT PX Deq Credit: send blkd .003 .000
1,WAIT PX Deq: Parse Reply .006 .000
1,WAIT PX Deq: Execute Reply 12.591 .748
1,WAIT PX Deq: Execution Msg 3.475 .206
1,WAIT PX Deq: Signal ACK RSG .000 .000
1,WAIT PX Deq: Signal ACK EXT .014 .001
1,WAIT PX Deq: reap credit .016 .001
1,WAIT PX Nsq: PQ load info query .200 .012
1,WAIT PX Deq: Slave Session Stats .087 .005
1,WAIT enq: PS - contention .024 .001
1,WAIT enq: TO - contention .000 .000
1,TIME repeated bind elapsed time .000 .000
1,TIME parse time elapsed .013 .001
1,TIME background cpu time .039 .002
1,TIME PL/SQL execution elapsed time .001 .000
1,TIME DB CPU 6.539 .388
1,TIME sql execute elapsed time 199.272 11.833
1,TIME DB time 199.499 11.846
1,TIME background elapsed time .582 .035
---------------------------------------------------------------------------------------------------------
i Stat Name Diff Diff/Sec
---------------------------------------------------------------------------------------------------------
2,STAT physical read total IO requests 31,750 1,885.293
2,STAT physical read total multi block requests 31,666 1,880.305
2,STAT physical read requests optimized 10 .594
2,STAT physical read total bytes 32,994,410,496 1,959,184,723.276
2,STAT physical write total IO requests 12 .713
2,STAT physical write total bytes 196,608 11,674.444
2,STAT cell physical IO interconnect bytes 1,153,927,168 68,519,377.838
2,STAT cell physical IO bytes eligible for pred 32,993,722,368 1,959,143,862.723
2,STAT cell physical IO bytes saved by storage 10,485,760 622,637.001
2,STAT cell smart IO session cache lookups 190 11.282
2,STAT cell smart IO session cache hits 190 11.282
2,STAT cell physical IO interconnect bytes retu 1,152,649,216 68,443,493.954
2,STAT cell session smart scan efficiency 451 26.780
2,STAT cell scans 206 12.232
2,STAT cell blocks processed by cache layer 4,026,750 239,105.563
2,STAT cell blocks processed by txn layer 4,026,750 239,105.563
2,STAT cell blocks processed by data layer 4,026,283 239,077.832
2,STAT cell blocks helped by minscn optimizatio 4,026,750 239,105.563
2,STAT cell simulated session smart scan effici 32,983,310,336 1,958,525,603.638
2,STAT cell IO uncompressed bytes 32,983,310,336 1,958,525,603.638
2,WAIT cell smart table scan 193.646 11.499
2,WAIT library cache pin .006 .000
2,WAIT library cache lock .006 .000
2,WAIT library cache: mutex X .000 .000
2,WAIT PX Idle Wait 460.706 27.356
2,WAIT PX Deq Credit: send blkd .006 .000
2,WAIT PX Deq: Execution Msg 2.922 .173
2,WAIT enq: CO - master slave det .001 .000
2,WAIT PX Deq: reap credit .007 .000
2,WAIT PX Deq: Slave Session Stats .114 .007
2,WAIT enq: PS - contention .012 .001
2,TIME repeated bind elapsed time .000 .000
2,TIME parse time elapsed .020 .001
2,TIME background cpu time .062 .004
2,TIME DB CPU 6.462 .384
2,TIME sql execute elapsed time 199.424 11.842
2,TIME DB time 199.637 11.854
2,TIME background elapsed time .282 .017
---------------------------------------------------------------------------------------------------------
i Stat Name Diff Diff/Sec
---------------------------------------------------------------------------------------------------------
3,STAT physical read total IO requests 33,001 1,959.576
3,STAT physical read total multi block requests 32,902 1,953.697
3,STAT physical read requests optimized 1 .059
3,STAT physical read total bytes 34,282,340,352 2,035,661,085.805
3,STAT physical write total IO requests 12 .713
3,STAT physical write total bytes 196,608 11,674.444
3,STAT cell physical IO interconnect bytes 1,201,407,024 71,338,698.054
3,STAT cell physical IO bytes eligible for pred 34,281,635,840 2,035,619,252.381
3,STAT cell physical IO bytes saved by storage 1,048,576 62,263.700
3,STAT cell smart IO session cache lookups 198 11.757
3,STAT cell smart IO session cache hits 198 11.757
3,STAT cell physical IO interconnect bytes retu 1,200,112,688 71,261,841.300
3,STAT cell session smart scan efficiency 451 26.780
3,STAT cell scans 214 12.707
3,STAT cell blocks processed by cache layer 4,185,106 248,508.630
3,STAT cell blocks processed by txn layer 4,185,106 248,508.630
3,STAT cell blocks processed by data layer 4,184,655 248,481.850
3,STAT cell blocks helped by minscn optimizatio 4,185,106 248,508.630
3,STAT cell simulated session smart scan effici 34,280,693,760 2,035,563,312.338
3,STAT cell IO uncompressed bytes 34,280,693,760 2,035,563,312.338
3,WAIT cell smart table scan 193.330 11.480
3,WAIT library cache pin .002 .000
3,WAIT library cache lock .003 .000
3,WAIT library cache: mutex X .000 .000
3,WAIT PX Idle Wait 456.274 27.093
3,WAIT PX Deq Credit: send blkd .005 .000
3,WAIT PX Deq: Execution Msg 3.006 .179
3,WAIT enq: CO - master slave det .004 .000
3,WAIT PX Deq: reap credit .007 .000
3,WAIT PX Deq: Slave Session Stats .086 .005
3,WAIT enq: PS - contention .009 .001
3,TIME repeated bind elapsed time .000 .000
3,TIME parse time elapsed .011 .001
3,TIME background cpu time .038 .002
3,TIME DB CPU 7.078 .420
3,TIME sql execute elapsed time 199.347 11.837
3,TIME DB time 199.570 11.850
3,TIME background elapsed time .449 .027
---------------------------------------------------------------------------------------------------------
i Stat Name Diff Diff/Sec
---------------------------------------------------------------------------------------------------------
4,STAT physical read total IO requests 30,213 1,794.027
4,STAT physical read total multi block requests 30,129 1,789.039
4,STAT physical read total bytes 31,398,838,272 1,864,440,774.857
4,STAT physical write total IO requests 12 .713
4,STAT physical write total bytes 196,608 11,674.444
4,STAT cell physical IO interconnect bytes 1,118,129,280 66,393,724.606
4,STAT cell physical IO bytes eligible for pred 31,398,133,760 1,864,398,941.433
4,STAT cell smart IO session cache lookups 180 10.688
4,STAT cell smart IO session cache hits 180 10.688
4,STAT cell physical IO interconnect bytes retu 1,116,834,944 66,316,867.852
4,STAT cell session smart scan efficiency 442 26.246
4,STAT cell scans 196 11.638
4,STAT cell blocks processed by cache layer 3,833,202 227,612.819
4,STAT cell blocks processed by txn layer 3,833,202 227,612.819
4,STAT cell blocks processed by data layer 3,832,789 227,588.296
4,STAT cell blocks helped by minscn optimizatio 3,833,202 227,612.819
4,STAT cell simulated session smart scan effici 31,398,207,488 1,864,403,319.350
4,STAT cell IO uncompressed bytes 31,398,207,488 1,864,403,319.350
4,WAIT cell smart table scan 192.808 11.449
4,WAIT library cache pin .007 .000
4,WAIT library cache lock .002 .000
4,WAIT library cache: mutex X .000 .000
4,WAIT PX Idle Wait 444.140 26.373
4,WAIT PX Deq Credit: send blkd .005 .000
4,WAIT PX Deq: Execution Msg 4.306 .256
4,WAIT enq: CO - master slave det .004 .000
4,WAIT PX Deq: reap credit .005 .000
4,WAIT PX Deq: Slave Session Stats .082 .005
4,WAIT enq: PS - contention .008 .000
4,TIME repeated bind elapsed time .000 .000
4,TIME parse time elapsed .019 .001
4,TIME background cpu time .042 .002
4,TIME DB CPU 6.086 .361
4,TIME sql execute elapsed time 198.045 11.760
4,TIME DB time 198.275 11.773
4,TIME background elapsed time .538 .032
Here we see the requests are spread over the available instances. Also, the response time is only marginally lower (from 13.32 to 12.88 seconds), because the current limitation (physical disk access) hasn’t changed. The average amount of IO becomes:
Total amount of IO: 34,602,745,856+32,994,410,496+34,282,340,352+31,398,838,272=133,278,334,976
Storage index saved: 3,145,728+10,485,760+1,048,576+0=14,680,064
Amount of physical IO per cell per second: 133,263,654,912 (total amount with the storage index savings subtracted)/7/12.88=1,478,079,579 ≈ 1.4GB/s, which is approximately the same as the previous SQL.
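The same check in Python, summing the per-instance figures from the four statistics listings above:

```python
# Per-cell throughput for the parallel-64 run, summed over the 4 instances.
reads = [34_602_745_856, 32_994_410_496, 34_282_340_352, 31_398_838_272]
saved = [3_145_728, 10_485_760, 1_048_576, 0]  # storage index savings per instance
cells, elapsed = 7, 12.88                      # storage servers, response time (s)

physical_io  = sum(reads) - sum(saved)         # 133,263,654,912 bytes
per_cell_bps = physical_io / cells / elapsed

print(f"{per_cell_bps / 1e9:.2f} GB/s per cell")  # ~1.48 GB/s
```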
Let’s build the same profile (four times):
1- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
2- sql execute elapsed: 199, cpu time: 6, cell smart table scan: 194
3- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
4- sql execute elapsed: 198, cpu time: 6, cell smart table scan: 193
As expected: we are totally IO bound.
Can we push this further? Yes! Stay tuned for part 3!