An investigation into exadata, part 2

It’s been a while since I’ve post the first part of this series.

Considerations

I created a little tool, snap, which is more or less a combination of Tanel Poder’s snapper and a script from Tom Kyte. I used this for the first part to give me all the details of database process I was using. The next step is invoking Oracle’s Parallel Query feature. Parallel Query means multiple processing are working together to execute a query.

When measuring the session with snap, I only get the statistics of “my” database session, not the statistics of all the Parallel Query slaves which are working for me. This means I get too little details measuring only my own session. The only way I could think of to measure all relevant statistics in an easy way, is to gather the same statistics as my snap tool, but globally. (global means all processes and in all instances of the database) The result is gsnap; global snap. In my case, I am the only user in the database when doing anything, so the global stats are about my session, the parallel query slaves I have invoked and the background processes. If multiple sessions are working in the database, you cannot use gsnap statistics to determine what your session has done, because it’s unknown which session contributed to the statistics.

Another interesting thing came up during the investigation: when looking in the SQL area (v$sqlarea for example), the statistics with the SQL are the statistics of all the session involved (parallel query slaves) accumulated for that instance. While this sounds logical, it means that the elapsed time with the SQL is not the same as the elapsed time the requesting session experienced. To get the average(!) ‘real’ elapsed time, you need to divide the elapsed time by the number of PX slaves used. The number of slaves can be found using the ‘PX_SERVERS_EXECUTIONS’ column. To get the number slaves used per execution (on average, it’s accumulated), divide by ‘EXECUTIONS’. In case of RAC, the number of executions is only available in the instance in which the SQL is executed, the other instances report ’0′.

In this part I investigate the execution of parallel query on a half rack exadata. This means 4 instances. When measuring the statistics with 4 instances, there is a lot of information. A real lot. For the sake of readability, I do not include all statistics, only the statistics I deem important or interesting.

Parallel query slave usage ‘spills’ to other instances in a RAC database at a certain point. When parallel is set to 2-16 the slaves remain local, when set to 17 all the instances are used. There are no other sessions in any of the instances. The parameter ‘PARALLEL_DEGREE_LIMIT’ is set to CPU. 16 is the number CPU threads on an Exadata V2 database node. (2 sockets, 8 core’s, 16 threads per database node)

In the first part, I have gotten a response time of 37 seconds. In fact, the table has changed a bit so it is different from the first post, but consider it roughly the same. (the same query now takes 33 seconds)

Parallel 16 – local node

As Kevin Closson points out in his comment, the query in the first part was CPU bound, and could be made faster if I could find extra CPU cycles or a faster CPU. Well, I am not going to change the hardware. But wait! I have 16 CPU threads, and the query only uses one. So the obvious next step is to use these threads. To do this, I am going to use Oracle’s Parallel Query functionality. This can be done in several ways, I choose to use the parallel hint in the SQL:

select /*+parallel(16)*/ count(*) from cg_var where sample_id=1;

Wow, the response time now is 13.32 seconds!

What is the result of this parallellisation?

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |   307K(100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M|   307K  (1)| 01:01:26 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 16 because of hint

How do the statistics look like?


i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			   127,557		    7,645.497
1,STAT physical read total multi block requests 	   127,445		    7,638.784
1,STAT physical read requests optimized 			14			 .839
1,STAT physical read total bytes		   133,276,368,896	    7,988,304,014.076
1,STAT physical write total IO requests 			24			1.439
1,STAT physical write total bytes			   372,736		   22,341.008
1,STAT cell physical IO interconnect bytes	     4,678,771,904	      280,435,704.328
1,STAT cell physical IO bytes eligible for pred    133,275,516,928	    7,988,252,948.914
1,STAT cell physical IO bytes saved by storage		14,680,064		  879,892.025
1,STAT cell smart IO session cache lookups		       192		       11.508
1,STAT cell smart IO session cache hits 		       192		       11.508
1,STAT cell physical IO interconnect bytes retu      4,676,846,784	      280,320,316.702
1,STAT cell session smart scan efficiency		       447		       26.792
1,STAT table scans (short tables)				 1			 .060
1,STAT table scans (long tables)			       208		       12.467
1,STAT table scans (rowid ranges)			       208		       12.467
1,STAT table scans (direct read)			       208		       12.467
1,STAT table scan rows gotten			       391,553,915	       23,468,914.533
1,STAT table scan blocks gotten 			 3,995,742		  239,496.335
1,STAT cell scans					       208		       12.467
1,STAT cell blocks processed by cache layer		16,270,709		  975,231.927
1,STAT cell blocks processed by txn layer		16,270,709		  975,231.927
1,STAT cell blocks processed by data layer		16,267,256		  975,024.961
1,STAT cell blocks helped by minscn optimizatio 	16,270,709		  975,231.927
1,STAT cell simulated session smart scan effici    133,261,361,152	    7,987,404,481.604
1,STAT cell IO uncompressed bytes		   133,261,361,152	    7,987,404,481.604
1,WAIT cell smart table scan				       185.188		       11.100
1,WAIT PX Idle Wait					       426.043		       25.536
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					13.104			 .785
1,WAIT enq: CF - contention					  .000			 .000
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .011			 .001
1,WAIT PX Deq: reap credit					  .003			 .000
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .053			 .003
1,WAIT enq: PS - contention					  .009			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .009			 .001
1,TIME background cpu time					  .037			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							22.180			1.329
1,TIME sql execute elapsed time 			       207.183		       12.418
1,TIME DB time						       207.355		       12.428
1,TIME background elapsed time					  .305			 .018

As I’ve stated above, the query response time was 13.32 seconds. Simplistic profile:

 

elapsed time                        : 207
CPU                                 :  22
wait / cell smart table scan        : 185

 

Apparently, I’ve spend 207 seconds of total elapsed time (foreground and PX slaves). 207/16=12.94 a little less than the response time. (13.32)
The profile shows that by enlarging the CPU capacity with PX, the bottleneck shifts, probably towards storage. Are we pushing the storage? Lets see: total amount processed 133,276,368,896, IO eliminated by storage indexes: 14,680,064, that leaves 133,261,688,832.
We got 7 storage servers, so divide by 7: 133,261,688,832/7=19,037,384,119 = average amount of IO done by every cell for this query. If we take 19,037,384,119 and divide that by the total time the query took (13.32); 19,037,384,119/13.32=1,429,233,042. So approximately 1.4GB per second per cell is physically read. I managed to get 1.5GB/s out of a cell during other tests, so we are near the maximum.

This would mean that if I enlarge the parallelism number further, which means multiple instances are used, the response time will not significantly decrease further. Well, let’s see:

Parallel 64 – the entire cluster

select /*+parallel(64)*/ count(*) from cg_var where sample_id=1;

The response time is 12.88 seconds.

 

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       | 76776 (100)|          |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |     3 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000 |     1 |     3 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |     3 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |          |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| CG_VAR   |   399M|  1141M| 76776   (1)| 00:15:22 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                                                                                                                          

   6 - storage(:Z>=:Z AND :Z<=:Z)
       filter("SAMPLE_ID"=1)                                                                                                                                                                                                                                                                                 

Note
-----
   - Degree of Parallelism is 64 because of hint

 

Statistics:


---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
1,STAT physical read total IO requests			    33,306		    1,977.687
1,STAT physical read total multi block requests 	    33,197		    1,971.214
1,STAT physical read requests optimized 			 3			 .178
1,STAT physical read total bytes		    34,602,745,856	    2,054,686,537.670
1,STAT physical write total IO requests 			14			 .831
1,STAT physical write total bytes			   240,640		   14,289.033
1,STAT cell physical IO interconnect bytes	     1,208,841,744	       71,780,166.462
1,STAT cell physical IO bytes eligible for pred     34,602,024,960	    2,054,643,731.376
1,STAT cell physical IO bytes saved by storage		 3,145,728		  186,791.100
1,STAT cell smart IO session cache lookups		       200		       11.876
1,STAT cell smart IO session cache hits 		       200		       11.876
1,STAT cell physical IO interconnect bytes retu      1,207,442,960	       71,697,107.658
1,STAT cell session smart scan efficiency		       450		       26.721
1,STAT table scans (short tables)				 1			 .059
1,STAT table scans (long tables)			       216		       12.826
1,STAT table scans (rowid ranges)			       216		       12.826
1,STAT table scans (direct read)			       216		       12.826
1,STAT table scan rows gotten			       101,106,534		6,003,634.369
1,STAT table scan blocks gotten 			 1,031,921		   61,274.738
1,STAT cell scans					       216		       12.826
1,STAT cell blocks processed by cache layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by txn layer		 4,223,803		  250,806.430
1,STAT cell blocks processed by data layer		 4,223,503		  250,788.616
1,STAT cell blocks helped by minscn optimizatio 	 4,223,803		  250,806.430
1,STAT cell simulated session smart scan effici     34,598,936,576	    2,054,460,345.321
1,STAT cell IO uncompressed bytes		    34,598,936,576	    2,054,460,345.321
1,WAIT cell smart table scan				       192.931		       11.456
1,WAIT library cache pin					  .006			 .000
1,WAIT library cache lock					  .002			 .000
1,WAIT library cache: mutex X					  .000			 .000
1,WAIT PX Idle Wait					       448.150		       26.611
1,WAIT PX Deq: Join ACK 					  .007			 .000
1,WAIT PX Deq Credit: send blkd 				  .003			 .000
1,WAIT PX Deq: Parse Reply					  .006			 .000
1,WAIT PX Deq: Execute Reply					12.591			 .748
1,WAIT PX Deq: Execution Msg					 3.475			 .206
1,WAIT PX Deq: Signal ACK RSG					  .000			 .000
1,WAIT PX Deq: Signal ACK EXT					  .014			 .001
1,WAIT PX Deq: reap credit					  .016			 .001
1,WAIT PX Nsq: PQ load info query				  .200			 .012
1,WAIT PX Deq: Slave Session Stats				  .087			 .005
1,WAIT enq: PS - contention					  .024			 .001
1,WAIT enq: TO - contention					  .000			 .000
1,TIME repeated bind elapsed time				  .000			 .000
1,TIME parse time elapsed					  .013			 .001
1,TIME background cpu time					  .039			 .002
1,TIME PL/SQL execution elapsed time				  .001			 .000
1,TIME DB CPU							 6.539			 .388
1,TIME sql execute elapsed time 			       199.272		       11.833
1,TIME DB time						       199.499		       11.846
1,TIME background elapsed time					  .582			 .035
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
2,STAT physical read total IO requests			    31,750		    1,885.293
2,STAT physical read total multi block requests 	    31,666		    1,880.305
2,STAT physical read requests optimized 			10			 .594
2,STAT physical read total bytes		    32,994,410,496	    1,959,184,723.276
2,STAT physical write total IO requests 			12			 .713
2,STAT physical write total bytes			   196,608		   11,674.444
2,STAT cell physical IO interconnect bytes	     1,153,927,168	       68,519,377.838
2,STAT cell physical IO bytes eligible for pred     32,993,722,368	    1,959,143,862.723
2,STAT cell physical IO bytes saved by storage		10,485,760		  622,637.001
2,STAT cell smart IO session cache lookups		       190		       11.282
2,STAT cell smart IO session cache hits 		       190		       11.282
2,STAT cell physical IO interconnect bytes retu      1,152,649,216	       68,443,493.954
2,STAT cell session smart scan efficiency		       451		       26.780
2,STAT cell scans					       206		       12.232
2,STAT cell blocks processed by cache layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by txn layer		 4,026,750		  239,105.563
2,STAT cell blocks processed by data layer		 4,026,283		  239,077.832
2,STAT cell blocks helped by minscn optimizatio 	 4,026,750		  239,105.563
2,STAT cell simulated session smart scan effici     32,983,310,336	    1,958,525,603.638
2,STAT cell IO uncompressed bytes		    32,983,310,336	    1,958,525,603.638
2,WAIT cell smart table scan				       193.646		       11.499
2,WAIT library cache pin					  .006			 .000
2,WAIT library cache lock					  .006			 .000
2,WAIT library cache: mutex X					  .000			 .000
2,WAIT PX Idle Wait					       460.706		       27.356
2,WAIT PX Deq Credit: send blkd 				  .006			 .000
2,WAIT PX Deq: Execution Msg					 2.922			 .173
2,WAIT enq: CO - master slave det				  .001			 .000
2,WAIT PX Deq: reap credit					  .007			 .000
2,WAIT PX Deq: Slave Session Stats				  .114			 .007
2,WAIT enq: PS - contention					  .012			 .001
2,TIME repeated bind elapsed time				  .000			 .000
2,TIME parse time elapsed					  .020			 .001
2,TIME background cpu time					  .062			 .004
2,TIME DB CPU							 6.462			 .384
2,TIME sql execute elapsed time 			       199.424		       11.842
2,TIME DB time						       199.637		       11.854
2,TIME background elapsed time					  .282			 .017
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
3,STAT physical read total IO requests			    33,001		    1,959.576
3,STAT physical read total multi block requests 	    32,902		    1,953.697
3,STAT physical read requests optimized 			 1			 .059
3,STAT physical read total bytes		    34,282,340,352	    2,035,661,085.805
3,STAT physical write total IO requests 			12			 .713
3,STAT physical write total bytes			   196,608		   11,674.444
3,STAT cell physical IO interconnect bytes	     1,201,407,024	       71,338,698.054
3,STAT cell physical IO bytes eligible for pred     34,281,635,840	    2,035,619,252.381
3,STAT cell physical IO bytes saved by storage		 1,048,576		   62,263.700
3,STAT cell smart IO session cache lookups		       198		       11.757
3,STAT cell smart IO session cache hits 		       198		       11.757
3,STAT cell physical IO interconnect bytes retu      1,200,112,688	       71,261,841.300
3,STAT cell session smart scan efficiency		       451		       26.780
3,STAT cell scans					       214		       12.707
3,STAT cell blocks processed by cache layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by txn layer		 4,185,106		  248,508.630
3,STAT cell blocks processed by data layer		 4,184,655		  248,481.850
3,STAT cell blocks helped by minscn optimizatio 	 4,185,106		  248,508.630
3,STAT cell simulated session smart scan effici     34,280,693,760	    2,035,563,312.338
3,STAT cell IO uncompressed bytes		    34,280,693,760	    2,035,563,312.338
3,WAIT cell smart table scan				       193.330		       11.480
3,WAIT library cache pin					  .002			 .000
3,WAIT library cache lock					  .003			 .000
3,WAIT library cache: mutex X					  .000			 .000
3,WAIT PX Idle Wait					       456.274		       27.093
3,WAIT PX Deq Credit: send blkd 				  .005			 .000
3,WAIT PX Deq: Execution Msg					 3.006			 .179
3,WAIT enq: CO - master slave det				  .004			 .000
3,WAIT PX Deq: reap credit					  .007			 .000
3,WAIT PX Deq: Slave Session Stats				  .086			 .005
3,WAIT enq: PS - contention					  .009			 .001
3,TIME repeated bind elapsed time				  .000			 .000
3,TIME parse time elapsed					  .011			 .001
3,TIME background cpu time					  .038			 .002
3,TIME DB CPU							 7.078			 .420
3,TIME sql execute elapsed time 			       199.347		       11.837
3,TIME DB time						       199.570		       11.850
3,TIME background elapsed time					  .449			 .027
---------------------------------------------------------------------------------------------------------
i Stat Name				     Diff		     Diff/Sec
---------------------------------------------------------------------------------------------------------
4,STAT physical read total IO requests			    30,213		    1,794.027
4,STAT physical read total multi block requests 	    30,129		    1,789.039
4,STAT physical read total bytes		    31,398,838,272	    1,864,440,774.857
4,STAT physical write total IO requests 			12			 .713
4,STAT physical write total bytes			   196,608		   11,674.444
4,STAT cell physical IO interconnect bytes	     1,118,129,280	       66,393,724.606
4,STAT cell physical IO bytes eligible for pred     31,398,133,760	    1,864,398,941.433
4,STAT cell smart IO session cache lookups		       180		       10.688
4,STAT cell smart IO session cache hits 		       180		       10.688
4,STAT cell physical IO interconnect bytes retu      1,116,834,944	       66,316,867.852
4,STAT cell session smart scan efficiency		       442		       26.246
4,STAT cell scans					       196		       11.638
4,STAT cell blocks processed by cache layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by txn layer		 3,833,202		  227,612.819
4,STAT cell blocks processed by data layer		 3,832,789		  227,588.296
4,STAT cell blocks helped by minscn optimizatio 	 3,833,202		  227,612.819
4,STAT cell simulated session smart scan effici     31,398,207,488	    1,864,403,319.350
4,STAT cell IO uncompressed bytes		    31,398,207,488	    1,864,403,319.350
4,WAIT cell smart table scan				       192.808		       11.449
4,WAIT library cache pin					  .007			 .000
4,WAIT library cache lock					  .002			 .000
4,WAIT library cache: mutex X					  .000			 .000
4,WAIT PX Idle Wait					       444.140		       26.373
4,WAIT PX Deq Credit: send blkd 				  .005			 .000
4,WAIT PX Deq: Execution Msg					 4.306			 .256
4,WAIT enq: CO - master slave det				  .004			 .000
4,WAIT PX Deq: reap credit					  .005			 .000
4,WAIT PX Deq: Slave Session Stats				  .082			 .005
4,WAIT enq: PS - contention					  .008			 .000
4,TIME repeated bind elapsed time				  .000			 .000
4,TIME parse time elapsed					  .019			 .001
4,TIME background cpu time					  .042			 .002
4,TIME DB CPU							 6.086			 .361
4,TIME sql execute elapsed time 			       198.045		       11.760
4,TIME DB time						       198.275		       11.773
4,TIME background elapsed time					  .538			 .032

Here we see the requests are spread over the available instances. Also, response time is only marginally lower (from 13.32 to 12.88), because the current limitation (physical disk access) hasn’t changed. The average amount of IO becomes:
Total amount of IO: 34,602,745,856+32,994,410,496+34,282,340,352+31,398,838,272=133,278,334,976
Storage index saved: 3,145,728+10,485,760+1,048,576+0=14,680,064
Amount of physical IO per cell: 133,263,654,912 (total amount with storage index IO saved substracted)/7/12.88=1,478,079,579 = 1.4GB/s, which is approximately the same as the previous SQL.

Let’s build the same profile (four times):

 

1- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
2- sql execute elapsed: 199, cpu time: 6, cell smart table scan: 194
3- sql execute elapsed: 199, cpu time: 7, cell smart table scan: 193
4- sql execute elapsed: 198, cpu time: 6, cell smart table scan: 193

 

As expected: we are totally IO bound.

Can we push this further? Yes! Stay tuned for part 3!

About these ads
4 comments
  1. >So approximately 1.4GB per second per cell is physically read. I managed to get 1.5GB/s out of a cell, so we are near the maximum.

    …max…without Flash perhaps?

    good post, Frits!

    • Yep, it’s on the HDD’s.

      I see I forgot to mention the diskgroup used is on HDD.

      Flash is the logical next step for the next post.

  2. Great post Frits! If only more folks would realize that “more” does not mean faster once you pass the “sweet spot” for DOP… :)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,886 other followers

%d bloggers like this: