This is a small announcement that the slides of all four of my presentations for IOUG Collaborate 2014 are online in the ‘whitepapers and presentations’ section of this blog.
In my blogpost When the oracle wait interface isn’t enough I showed how a simple asynchronous direct path scan of a table was spending more than 99% of its time on CPU, and that perf showed me that 68% (of the total elapsed time) was spent on a spinlock unlock in the Linux kernel, which was called by io_submit().
This led to some very helpful comments from Tanel Poder. This blogpost is a materialisation of his comments, and tests to show the difference.
First, take a look at what I gathered from ‘perf’ in the first article:
# Samples: 501  of event 'cpu-clock'
# Event count (approx.): 501
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ...................................
#
    52.50%  oracle   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
            |
            --- _raw_spin_unlock_irqrestore
                mptspi_qcmd
                scsi_dispatch_cmd
                scsi_request_fn
                __blk_run_queue
                queue_unplugged
                blk_flush_plug_list
                blk_finish_plug
               |
               |--99.24%-- do_io_submit
               |           sys_io_submit
               |           system_call_fastpath
               |           io_submit
               |           skgfqio
               |           ksfd_skgfqio
               |           ksfdgo
This shows that the server process executing ‘select count(*) from t2’ spent 52.5% of the profiled time on unlocking a spinlock.
This was in the previous blogpost, and Tanel commented the following:
I would be suspicious of any of the “irqrestore” functions shown as the main CPU cycle consumer – as right after enabling interrupts again on a CPU may be just the first chance for the profiler interrupt to kick in and do the RIP and stack backtrace read. This is highly dependent on the hardware (how new CPUs) and OS version + VM version + whether the VM allows the guest OS to use hardware performance counters directly.
Let’s reiterate what I was doing: I was profiling the execution using Linux’ in-kernel perf functionality. However, because access to the CPU’s performance registers was not available (I was running on VMware Fusion, desktop virtualisation), I was using perf in the following way: perf record -e cpu-clock.
This is a partial list of perf’s triggering events:
List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                             [Hardware event]
  instructions                                     [Hardware event]
  cache-references                                 [Hardware event]
  cache-misses                                     [Hardware event]
  branch-instructions OR branches                  [Hardware event]
  branch-misses                                    [Hardware event]
  bus-cycles                                       [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend  [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend    [Hardware event]
  ref-cycles                                       [Hardware event]
  cpu-clock                                        [Software event]
  ...etc...
If no specific event is specified, perf tries to use ‘cpu-cycles’, which has the indication [Hardware event], which means the kernel’s performance registers are used to gather information. If this is not possible (because virtualisation disables access to the performance registers), the software event ‘cpu-clock’ can be used. This is what I used in the previous article.
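To make that fallback concrete, here is a small Python sketch (my own illustration, not part of perf itself) that picks an event based on what ‘perf list’ advertises:

```python
# Hypothetical helper: choose a perf event from the output of `perf list`.
# Falls back to the 'cpu-clock' software event when no hardware
# 'cpu-cycles' event is advertised (as under some hypervisors).
def choose_perf_event(perf_list_output: str) -> str:
    for line in perf_list_output.splitlines():
        # A usable hardware counter shows up as 'cpu-cycles ... [Hardware event]'
        if line.strip().startswith("cpu-cycles") and "[Hardware event]" in line:
            return "cycles"
    return "cpu-clock"

hw_list = "cpu-cycles OR cycles    [Hardware event]\ncpu-clock    [Software event]"
sw_list = "cpu-clock    [Software event]"
print(choose_perf_event(hw_list))  # cycles
print(choose_perf_event(sw_list))  # cpu-clock
```

The chosen event would then be passed as ‘perf record -e <event>’.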
However, cpu-clock is a software event, and this event depends on the timer interrupt. And the function we see we spent most time on (_raw_spin_unlock_irqrestore) is the re-enabling of IRQs for this process when this spinlock is unlocked. Because the profiling interrupt cannot fire while interrupts are disabled, the samples pile up at the point where interrupts are re-enabled. So this _could_ mean we did not actually spend our time on this function, but we cannot tell, because the timing source was disabled.
However, there was another helpful comment from Tanel:
VMWare Fusion 5.x should already allow some CPU perf counters to be accessed directly in the VM guest. It requires a new enough CPU though (it works in my late 2011 MBP, but not in the 2009 MBP). There’s a setting under “advanced options” under “processors & memory” -> “You can use code profiling applications such as VTune or OProfile to optimize or debug software that runs inside a virtual machine.”
Indeed, there is such an option; let’s enable it and try again in EXACTLY the same way, but now using the ‘cpu-cycles’ event (which is the default).
# Samples: 669  of event 'cycles'
# Event count (approx.): 288603593
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  .......................................
#
    11.31%  oracle   oracle         [.] sxorchk
            |
            --- sxorchk
               |
               |--98.50%-- kcbhxoro
               |           kcbhvbo
               |           kcbzvb
               |           kcbldrget
               |           kcbgtcr
               |           ktrget3
               |           ktrget2
               |           kdst_fetch
               |           kdstf00000010000kmP
               |           kdsttgr
               |           qertbFetch
               |           qergsFetch
               |           opifch2
               |           kpoal8
               |           opiodr
               |           ttcpip
               |           opitsk
               |           opiino
               |           opiodr
               |           opidrv
               |           sou2o
               |           opimai_real
               |           ssthrdmain
               |           main
               |           __libc_start_main
               |
                --1.50%-- kcbhvbo
                          kcbzvb
This is radically different! All of a sudden the top function is not a spinlock in the kernel any more, but an Oracle function!
Let’s look at the top 5 locations where time is spent in exactly the same case, but with -e cycles (the default) and -e cpu-clock (non-default/software timer):
# perf record -g -p 2527
^C
# perf report -n -g none
...
# Samples: 580  of event 'cycles'
# Event count (approx.): 256237297
#
# Overhead      Samples  Command  Shared Object      Symbol
# ........  ...........  .......  .................  .......................................
#
    17.47%          100  oracle   oracle             [.] sxorchk
     7.99%           47  oracle   oracle             [.] kdstf00000010000kmP
     6.01%           35  oracle   oracle             [.] kcbhvbo
     3.25%           19  oracle   oracle             [.] kdst_fetch
     3.01%           17  oracle   [kernel.kallsyms]  [k] __wake_up_bit
And now the same execution, but with the software timer:
# perf record -g -p 2527 -e cpu-clock
^C
# perf report -n -g none
...
# Samples: 422  of event 'cpu-clock'
# Event count (approx.): 422
#
# Overhead      Samples  Command  Shared Object      Symbol
# ........  ...........  .......  .................  ................................
#
    78.67%          332  oracle   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     4.03%           17  oracle   oracle             [.] sxorchk
     2.13%            9  oracle   oracle             [.] kcbhvbo
     1.90%            8  oracle   oracle             [.] kdstf00000010000kmP
     0.95%            4  oracle   oracle             [.] qeaeCn1Serial
This reveals some information: it seems that when profiling with the software timer, the “_raw_spin_unlock_irqrestore” function “eats” a lot of samples, which are “stolen” from the functions where the time is actually spent:
sxorchk has 100 samples with the hardware event, and 17 with the software timer.
kcbhvbo has 35 samples with the hardware event, and 9 with the software timer.
kdstf00000010000kmP has 47 samples with the hardware event, and 8 with the software timer.
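Recomputing the percentages from the raw sample counts above makes the shift concrete. Note that for the hardware ‘cycles’ event perf weights the overhead by cycle counts, so the quoted 17.47% differs slightly from the raw sample fraction:

```python
# Raw sample counts quoted above, for the same execution profiled twice.
sw_samples = {"_raw_spin_unlock_irqrestore": 332, "sxorchk": 17,
              "kcbhvbo": 9, "kdstf00000010000kmP": 8}
sw_total = 422   # cpu-clock (software timer) samples
hw_total = 580   # cycles (hardware event) samples

# With the software timer, the unlock function dominates:
unlock_share = sw_samples["_raw_spin_unlock_irqrestore"] / sw_total
print(f"unlock share (software timer): {unlock_share:.2%}")    # 78.67%

# With the hardware event, sxorchk leads with 100 of 580 samples:
print(f"sxorchk share (hardware event): {100 / hw_total:.2%}")  # 17.24%
```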
So, the general conclusion is that it is important to understand what you are measuring, and whether the measuring method has implications for what you are measuring.
Conclusion specific to perf: do not use cpu-clock if you can use the hardware event.
Oracle has done a great job with the wait interface. It has given us the opportunity to profile the time spent in Oracle processes, by keeping track of CPU time and waits (which is time spent not running on CPU). With every new version, Oracle has enhanced the wait interface by making the waits more detailed. Tuning typically means trying to get rid of waits as much as possible.
But what if your execution is optimised to the point that there are (almost) no waits left? Before you think this is theoretical: it is possible, especially with Oracle adaptive direct path reads (which are IOs not cached by Oracle), visible via the wait “direct path read”. Of course I am talking about the omission of waits, which happens with adaptive direct path reads if your system is able to provide the request results fast enough. There isn’t a wait, because if the IO request result is returned fast enough, the process doesn’t have to wait. Whilst this sounds very obvious, the “traditional” Oracle IO requests (visible with the waits “db file sequential read” and “db file scattered read”) always generate a wait, no matter how fast the IO requests were.
Here is a trace excerpt from a full table scan where the IO was fast enough to generate only a few waits:
PARSE #140145843472584:c=0,e=28,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947766
EXEC #140145843472584:c=0,e=31,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3321871023,tim=1385308947947823
WAIT #140145843472584: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308947947871
WAIT #140145843472584: nam='asynch descriptor resize' ela= 1 outstanding #aio=0 current aio limit=1562 new aio limit=1592 obj#=75579 tim=1385308947947969
WAIT #140145843472584: nam='direct path read' ela= 428 file number=5 first dba=28418 block cnt=126 obj#=75579 tim=1385308947989097
FETCH #140145843472584:c=161976,e=174323,p=20941,cr=20944,cu=0,mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1385308948122218
WAIT #140145843472584: nam='SQL*Net message from client' ela= 249 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122600
FETCH #140145843472584:c=0,e=2,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=3321871023,tim=1385308948122689
WAIT #140145843472584: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122709
WAIT #140145843472584: nam='SQL*Net message from client' ela= 210 driver id=1413697536 #bytes=1 p3=0 obj#=75579 tim=1385308948122938
CLOSE #140145843472584:c=0,e=15,dep=0,type=3,tim=1385308948555460
The most interesting part of the raw trace file is between the EXEC line and the first FETCH line. There is first a ‘SQL*Net message to client’ wait, then an ‘asynch descriptor resize’ wait, and then a single ‘direct path read’ wait. That is a single wait line for doing IO, while the FETCH line shows that 20941 blocks were read by doing physical IO. The FETCH line also shows that most of the elapsed time (e) was spent running on CPU (c). This means that details about how those 20941 blocks were read are (except for the single ‘direct path read’ wait) not available.
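The c= (CPU) and e= (elapsed) fields, both in microseconds in this version, can be extracted from such a line to quantify this. A quick sketch, using the FETCH line from the excerpt above:

```python
import re

# The first FETCH line from the trace excerpt above.
line = ("FETCH #140145843472584:c=161976,e=174323,p=20941,cr=20944,cu=0,"
        "mis=0,r=1,dep=0,og=1,plh=3321871023,tim=1385308948122218")

# Collect the key=value pairs into a dict of integers.
stats = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", line)}

# c (CPU) and e (elapsed) are microseconds; p is physical block reads.
cpu_share = stats["c"] / stats["e"]
print(f"{stats['p']} blocks read via physical IO, "
      f"CPU time is {cpu_share:.0%} of elapsed time")
# 20941 blocks read via physical IO, CPU time is 93% of elapsed time
```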
But what if you want to understand more about what the process is doing here? Except for a few wait lines, all the processing details that waits give are gone. It’s more or less only the PARSE/EXEC/FETCH lines, where the first fetch line contains more than 99% of all the time.
The answer to that on Linux is perf. Perf is a profiler that is embedded in the Linux kernel (since 2.6.32). I’ve written more about perf before; use the search field on this blog to find articles on how to set up and use it. Now let’s see what is happening in this situation: what is Oracle doing to execute the above mentioned SQL (select count(*) from t2)?
I ran perf on the session above with ‘perf record -g -e cpu-clock -p PID’, and the result (with ‘perf report’) is shown below:
    67.58%  oracle  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
            |
            --- _raw_spin_unlock_irqrestore
               |
               |--99.19%-- mptspi_qcmd
               |           scsi_dispatch_cmd
               |           scsi_request_fn
               |           __blk_run_queue
               |           queue_unplugged
               |           blk_flush_plug_list
               |           blk_finish_plug
               |           generic_file_read_iter
               |           generic_file_aio_read
               |           aio_rw_vect_retry
               |           aio_run_iocb
               |           io_submit_one
               |           do_io_submit
               |           sys_io_submit
               |           system_call_fastpath
               |           io_submit
               |
                --0.81%-- __wake_up
                          fsnotify_add_notify_event
                          inotify_handle_event
                          send_to_group
                          fsnotify
                          vfs_write
                          sys_write
                          system_call_fastpath
                          __write_nocancel

     4.40%  oracle  oracle             [.] sxorchk
What is shown here is that during 68% of the time the process ran on CPU, it was spending its time in kernel mode ([k]), in a function called _raw_spin_unlock_irqrestore. This function was called in two different ways, but 99% of the time the call came from mptspi_qcmd, which is the device specific kernel driver. What is even more remarkable is that when we follow the backtrace up (by reading down), the process was in fact issuing IOs (the io_submit system call)!
This means that instead of spending time waiting for IOs to finish, this system is spending time spinning on a spinlock (similar to what a latch is in Oracle) for issuing commands to a SCSI device.
The next function in which the Oracle process spent time is an Oracle function (visible by [.], which means a user land function) called sxorchk. This function is an XOR check (governed by the db_block_checking parameter).
As a summary: does this mean the Oracle wait interface is useless? Of course not. But if the wait interface simply does not provide enough information, like when 99% of the time is only visible as CPU time, you need to step to another layer and investigate there. Perf opens up the CPU time, and is able to tell you how the CPU time is composed.
A substantial part of the people I encounter present using OS X on a MacBook. I am not sure how many of them use Apple’s Keynote for presenting, but I like Keynote very much for various reasons, like its cleaner interface. This blogpost is about some handy tips and tricks I learned during a few years of presenting around the world. If you don’t use OS X, this blogpost is probably not for you.
1. Setup a private network with your iPhone/clicker
This first step has two important reasons. The first reason is extremely obvious: in order to communicate with your iPhone/clicker, you need a connection. The second reason is a little less obvious: if the conference you are attending as a speaker has wireless access, you probably joined that wireless network. In order to make your computer NOT respond to any kind of signal from the internet (Growl, notification center, updates, etc.), you really should disconnect first. When you set up a private network with your iPhone/clicker, you are not connected to the internet anymore. (Obviously you need to disconnect any wired internet connections too!)
This is done on the MacBook using the Wi-Fi signal strength indicator on the upper right side: create network. Choose 40-bit WEP (this isn’t the safest encryption in the world, but you are going to use it for a relatively short time), and choose a 5 character password.
Next go to the Settings on your iPhone, choose ‘Wi-Fi’, and select the network you just set up on your MacBook. The default name of the local network is the name of the computer. If it’s the first time, or you’ve changed the password, enter the 5 character password you chose when setting up the local network.
What is less known is that you DO NOT HAVE A CONNECTION AT THIS MOMENT. The simple reason is that there is no DHCP server to give both your MacBook and your iPhone an IP address. You need to wait a little while, then both your MacBook and your iPhone will self-assign an IP address. On your MacBook, go to System Preferences>Network, and click on “Wi-Fi”. It has an orange colour, not a green colour as you might expect. If you have clicked on “Wi-Fi”, the description will say something like:
Wi-Fi has the self-assigned IP address 169.254.111.111 and will not be able to connect to the Internet.
Your IP address will be different. Now go to your iPhone, go to Settings>Wi-Fi, and look which network is selected. It should be the network with the name of your MacBook. If your iPhone went into power save, it has probably connected to the wireless of the conference again; more on that later. (Re)select the network with the name of your MacBook, and tap on the blue circle with the greater-than sign in it on the right side of the network. It shows you an IP address and the subnet mask. If you just re-selected the MacBook network, you probably must wait a little while before the iPhone assigns an IP address to itself.
In order to test whether a connection is possible, open a terminal on your MacBook, and ping the (self-assigned) IP address of the iPhone. If the network connection can be used, ping will show responses from the iPhone.
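As an aside, the 169.254.x.x addresses both machines assign themselves come from the IPv4 link-local range, which can be verified with Python’s standard ipaddress module:

```python
import ipaddress

# The self-assigned address from the dialog above (yours will differ).
addr = ipaddress.ip_address("169.254.111.111")

# True: the address is in the 169.254.0.0/16 link-local range,
# which hosts use when no DHCP server hands out addresses.
print(addr.is_link_local)  # True
```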
2. Disable powersave on your Macbook
You do not want your MacBook to go into power save while you are setting it up, talking to people, presenting, or when you step out of the presentation to show something and are discussing that. There is an extremely simple way to prevent that: Caffeine. Search for this little piece of software on the internet, or, even simpler: go to the OS X App Store and search for Caffeine. It’s a free application. If you fire it up, it shows an empty cup in the top bar on the right side. When you do not want your computer to go into power save at any time, click on the cup: it will show a full cup of coffee. That’s simple, right?
3. Disable powersave on your iPhone
You probably have set up your iPhone to power save too. This is done in Settings>General>Auto-Lock; set it to ‘Never’. As you probably know or have learned, once your iPhone goes into power save, it turns off wireless. So when you wake your iPhone, wireless will turn on again and just search for any network it can autoconnect to. This is the reason it will connect to the conference wireless again: the local network is not saved by default, but the conference wireless is.
4. Use your iPhone as a clicker
There are two ways I’ve used an iPhone as a clicker: the ‘Remotepad’ app (which needs an OS X application of the same name, and turns your iPhone into a mouse), and the Keynote Remote app. If you are serious about presenting and want to use your iPhone as a remote, my opinion is to buy the Keynote Remote app. Its strong point is its simplicity: swipe right to left to go forward (‘click’), or swipe left to right to go backward. The only other two functions it’s got are go to beginning and go to end. That’s all.
If you didn’t have the Keynote Remote app and have installed it on your iPhone, and you’ve set up the network, there’s one additional thing you should do: link Keynote with the app. Start up or select Keynote, select Keynote>Preferences, and go to the ‘Remote’ tab/icon. Now select ‘Enable iPhone and iPod touch remotes’, and link the two together.
Keynote has to be started on your Macbook, and the presentation you want to use needs to be loaded, but does not have to be set in presenting mode already; if you start the Keynote Remote app on your iPhone, it will put Keynote in presentation mode with the current slide.
In my previous post about perf I showed how it can be used to determine the function which is used the most systemwide.
One thing it helped me with is determining the impact of swapping on a system. The testcase I built for that is kind of a no-brainer (setting up a 3GB SGA with 2GB of physical memory in a test VM), but think about a few things: I regularly encounter systems that have an amount of swap allocated. This can mean memory is simply plain over-allocated, but it can also be that older pages are just swapped out by Linux “just in case”. This probably will happen more with the new Oracle recommendation for “swappiness” (the eagerness of Linux to swap) to set it to 100 (it used to be 60). The amount of swap transfers can be measured, but the regular tools cannot measure the impact. With perf, this is possible.
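The current swappiness setting can be read from /proc on a Linux system; here is a minimal sketch (the parse helper is my own, and the /proc path is Linux-only):

```python
from pathlib import Path

def read_swappiness(text: str) -> int:
    """Parse the contents of /proc/sys/vm/swappiness (a single integer)."""
    return int(text.strip())

# Linux-only: this virtual file holds the kernel's swap eagerness (0-100+).
path = Path("/proc/sys/vm/swappiness")
if path.exists():
    print("vm.swappiness =", read_swappiness(path.read_text()))
```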
I ran another SLOB run (10 readers) against the 3GB SGA/2GB physical memory system, and collected perf data with ‘perf record -e cpu-clock -g -a’. After the run, I printed a summary of the collected data using ‘perf report -g none’:
# Events: 33K cpu-clock
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...............................
#
    24.65%  swapper          [kernel.kallsyms]  [k] native_safe_halt
    17.99%  oracle           [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
    10.82%  oracle           oracle             [.] kcbgtcr
     2.94%  oracle           oracle             [.] qetlbr
     2.66%  oracle           oracle             [.] kcbz_fp_buf
     2.22%  oracle           oracle             [.] kdsgrp
     1.98%  oracle           oracle             [.] kafger
I guess this speaks for itself. The highest number of measurements of a function (24.65%) was a kernel-mode function ([k]), ‘native_safe_halt’, which was issued by ‘swapper’. The next most active function (17.99%) was a kernel-mode function ([k]), ‘_raw_spin_unlock_irqrestore’, which was issued by oracle. This function represents waiting for I/O (read on if you want to be able to determine that). The next functions are user land functions ([.]), issued by oracle, with the executable oracle. Because we are using SLOB, which is made to issue 8KB block reads, it makes sense that the user land function with the most impact is the ‘kcbgtcr’ function (kernel cache buffer get consistent read).
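Flat summaries like the one above are also easy to post-process. As an illustration (my own sketch, tied to this column layout), the kernel-mode share can be summed from the [k] lines:

```python
# The flat summary lines from the perf report above.
summary = """\
24.65%  swapper  [kernel.kallsyms]  [k] native_safe_halt
17.99%  oracle   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
10.82%  oracle   oracle             [.] kcbgtcr
 2.94%  oracle   oracle             [.] qetlbr
 2.66%  oracle   oracle             [.] kcbz_fp_buf
 2.22%  oracle   oracle             [.] kdsgrp
 1.98%  oracle   oracle             [.] kafger
"""

rows = []
for line in summary.splitlines():
    # Columns: overhead, command, shared object, mode ([k]/[.]), symbol.
    pct, command, shared_object, mode, symbol = line.split()
    rows.append((float(pct.rstrip("%")), command, mode, symbol))

# Sum the overhead of the kernel-mode ([k]) symbols.
kernel_time = sum(p for p, _, mode, _ in rows if mode == "[k]")
print(f"kernel-mode share of samples: {kernel_time:.2f}%")  # 42.64%
```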
The above example shows the summary of the perf recording. The “normal” report shows the complete stack from the active function down to main. Here is an example with the data from which the summary above was made:
# Events: 33K cpu-clock
#
    24.65%  swapper  [kernel.kallsyms]  [k] native_safe_halt
            |
            --- native_safe_halt
                cpu_idle
                rest_init
                start_kernel
                x86_64_start_reservations
                x86_64_start_kernel

    17.99%  oracle   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
            |
            --- _raw_spin_unlock_irqrestore
               |
               |--97.51%-- scsi_dispatch_cmd
               |           scsi_request_fn
               |           __blk_run_queue
               |           |
               |           |--99.69%-- queue_unplugged
               |           |           blk_flush_plug_list
               |           |           |
               |           |           |--94.22%-- blk_finish_plug
               |           |           |           |
               |           |           |           |--99.85%-- generic_file_read_iter
               |           |           |           |           generic_file_aio_read
               |           |           |           |           aio_rw_vect_retry
               |           |           |           |           aio_run_iocb
               |           |           |           |           io_submit_one
               |           |           |           |           do_io_submit
               |           |           |           |           sys_io_submit
               |           |           |           |           system_call_fastpath
               |           |           |           |           io_submit
               |           |           |           |           |
               |           |           |           |            --100.00%-- skgfqio
               |           |           |           |                        ksfd_skgfqio
Here you see the stack of the swapper, and the different stacks, as a tree, which lead to _raw_spin_unlock_irqrestore, with percentages showing which fraction of the calls took each path to the same executable/function combination.
Using the percentages, the function calls that happen on the system can be investigated. The above sample shows the last two Oracle database functions prior to the io_submit call (io_submit is the function to issue an asynchronous IO call), ksfd_skgfqio and skgfqio. Next we see the io_submit call, and everything which is called by the kernel.
The report contains all the stacks, not a partial one like my example.
I’ve entered two abstracts into the ‘suggest a session’ database track on Oracle Mix. If you want to see these sessions at Oracle OpenWorld this year, please vote for them:
Oracle 11gR2 clusterware in depth
Does the Sun Oracle database machine meet expectations in real life?
These sessions are technical, and try to paint a picture of how it works. I am looking forward to meeting you in San Francisco!
On the 31st of March 2010 I attended Julian Dyke’s ‘Oracle 11gR2 for RAC users’ seminar.
The seminar was held in Utrecht (a city in the Netherlands) at the La Place meeting centre in ‘Hoog Catharijne’. The meeting centre has good coffee.
Julian described the differences between currently used RAC versions (mostly 10.2) and the newer 11 versions, together with an impressive amount of experience from a broad range of customers. Also some specific 11.2 database features were touched upon.
The seminar was enjoyable, especially the experience in specific situations.
Thank you Julian, and Anjo and Annette of http://www.miraclebenelux.nl!
In order to ease blog administration, I decided to move my blog to the wordpress.com site. Traffic from the ‘old’ frits.homelinux.com site is redirected here; I hope the links are changed soon.