
I am excited to announce that I have accepted a role at Yugabyte as ‘developer advocate’. It’s not easy to leave the talented group of people that is Enkitec, many of whom I call friends. In fact, the change is even bigger, since Yugabyte is a database outside of the Oracle ecosystem, an ecosystem in which there also are a lot of people that I have gotten to know and call friends.

So why the change? In essence, my reasons are the same as the ones Bryn Llewellyn mentioned in his blogpost about his move, although I come from a different background being a consultant and not working for Oracle.

Open source

You could argue that having spent 20 or so years investigating the inner workings of the Oracle database means I am discarding all the knowledge I built up. I don’t believe that’s true. Yes, I have built up a lot of knowledge about how the Oracle database works, and parts of that knowledge might only be known (usable?) to a small group. However, I am looking forward to working with software for which I can actually see how it’s made, so I don’t have to perform extensive research to figure out how it’s “probably” implemented. This is an important reason for me personally. In general I think there are many more reasons for companies to want to use open source software instead of proprietary, closed source software.

Distributed architecture & distributed SQL

The area I have spent a significant amount of time in is assessing and investigating performance of the Oracle database. Oracle, like many traditional monolithic databases, uses an architecture where the database itself is a single set of files containing the data, served by processes that validate requests to use that data and provide performance features like caching, so data can be served from memory at memory latency instead of being re-read from the files. This architecture imposes limits on the scale of use, and the common solutions to overcome these limits with monolithic architectures are:
– Sharing the (still singular) datafiles with multiple machines (RAC).
– Make a copy of the data files, and spool the changes to the copy (DG/OGG/replication).
– (Manually) partition the data over multiple independent databases (sharding).
– Storing data redundantly over the available (still local) disks (RAID).
I am not saying these solutions are wrong. In fact, many of them are very cleverly architected and optimized over the years with the different architectures of databases and work really well.
However, there are consequences that are inherently imposed by a monolithic architecture, in areas such as scalability, flexibility and failure resistance.

The Yugabyte database provides a cloud ready, cloud vendor agnostic, failure resistant, linearly scaling database (sorry for the abundance of hype words). Failure resistance is provided by keeping multiple copies of the data, configurable as the ‘replication factor’. By keeping copies of the data, the failure of a single Yugabyte node (called a ‘tablet server’) can be overcome. By spreading these copies of data over groups of tablet servers in cloud availability zones, the outage of an entire availability zone can be survived. When such a disaster strikes, the cluster will survive and provide normal read and write function as long as a majority of the copies survives (2 out of 3 in case of a replication factor of 3), without the need for any management to accommodate failover, like instantiating a replica/standby.

Each Yugabyte tablet server also serves as an API for data retrieval and manipulation, so with the addition of tablet servers, the processing power increases. A tablet server should use local, non-protected disks to take advantage of low latency, and by adding more tablet servers the work can be spread out further, increasing the amount of work that can be done. Aside from increasing API-side processing power, Yugabyte provides functionality which pushes work down to the storage layer, which Exadata specialists will recognize as ‘smart scans’, alongside things like predicate pushdown and bloom filters.

Yugabyte has not invented yet another SQL and/or NoSQL dialect. Another point that I think is really strong is that Yugabyte reuses PostgreSQL as the SQL API. This API is “wire compatible” with Postgres, which means that any product that can talk to a Postgres database can talk to Yugabyte. In fact, instead of writing a Postgres-compatible layer, Yugabyte has re-used the Postgres source code for the query layer, and provides the storage underneath it using Yugabyte’s own storage layer. This also means a lot of the server side functionality of Postgres, such as PL/pgSQL, can be used. Needless to say there are limitations inherent to connecting Postgres to the Yugabyte storage layer, but most common Postgres database functionality is available.
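To give an idea of what that wire compatibility means in practice: any Postgres client can simply connect to a tablet server. A minimal sketch, assuming a YugabyteDB node running locally with YSQL on its default port 5433 and the default ‘yugabyte’ user (host, port and user may differ in your setup):

$ psql -h 127.0.0.1 -p 5433 -U yugabyte -c "select version();"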

But that is not all! Aside from and in parallel to Yugabyte’s Postgres API, it also provides “NoSQL” APIs, which are compatible with Apache Cassandra and with Redis. Oh, and data storage is done via an LSM (log-structured merge-tree), etc.

Conclusion

I hope at this point you can see that Yugabyte provides a fresh, modern approach to a database, with a lot of advantages over more traditional, monolithic databases, from which I think a lot of the cases that I witnessed over the years could significantly benefit (performance-wise and availability-wise). I am really thrilled to be part of building that further.

Also, a lot of technical reasons that I described are really just a summary, if you have gotten interested or want to learn more, I would urge you (or challenge you :-D) to have a look at https://www.yugabyte.com or read more at https://docs.yugabyte.com. Or if you want to see the code, head over to https://github.com/yugabyte/yugabyte-db!

This post is about a fully documented but easy to miss feature of vagrant, which is the evaluation of the Vagrantfiles when a vagrant ‘box’ is started. As you can see, I wrote ‘Vagrantfiles’, plural. I (very naively) thought that the Vagrantfile in the directory where you start the ‘box’ determines the specifics of the vagrant ‘box’.

This begins with me trying to create a vagrant virtualbox VM using packer, where I specify a Vagrantfile during the packer build. I normally don’t spend a lot of thought on that, and just put in a very simple Vagrantfile, essentially just defining the box and setting memory and cpu amounts.

In this case, I decided to put in the Vagrantfile that I wanted to use for this special purpose ‘box’ as the Vagrantfile for the packer build, which has some ruby in it which adds a disk:

# this snippet sits inside the virtualbox provider block: config.vm.provider "virtualbox" do |vb| ... end
data1_disk = "data1.vdi"
if !File.exist?(data1_disk)
  vb.customize [ 'createhd', '--filename', data1_disk, '--size', 20480 ]
end
vb.customize [ 'storageattach', :id, '--storagectl', 'SATA Controller', '--port', 2, '--device', 0, '--type', 'hdd', '--medium', data1_disk ]

My (wrong) assumption was that this Vagrantfile was kept as a validation or template.

After the box was created, I tested running it, and for that I copied the Vagrantfile, including the above addition of checking for a file, creating it if it doesn’t exist, and then attaching the disk to the VM. However, this failed with an error in the provisioning stage saying that a disk was already added, and therefore could not be added again.

I didn’t understand the error when I first encountered it, so I investigated the issue by removing the box (vagrant destroy), commenting out the disk addition code (the above lines) from the Vagrantfile, and then running the box again (vagrant up). Much to my surprise, it did add the disk, even though I had explicitly removed the code that provided that functionality.

This severely puzzled me.

After going over the vagrant documentation and searching for the error messages, I found out that the Vagrantfile embedded in the ‘box’ is also used and parsed, along with the user specified Vagrantfile, and the contents of both are merged before a vagrant ‘box’ is started. This embedded Vagrantfile can be seen in the directory which holds the boxes after downloading from app.vagrantup.com. On my Mac this is ‘~/.vagrant.d/boxes/{box name}/{version}/{provider}/Vagrantfile’.
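As a quick sketch of how to look at it yourself (the exact path obviously depends on the box name, version and provider on your machine):

$ vagrant box list
$ cat ~/.vagrant.d/boxes/{box name}/{version}/virtualbox/Vagrantfile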

This perfectly explained what I witnessed: because the disk creation steps are in the ’embedded’ Vagrantfile, they are simply executed, even if they are not in the normal Vagrantfile. And of course it threw an error when the disk creation steps were added to the normal Vagrantfile as well, because then these steps were executed twice, which showed at the vb.customize ‘storageattach’ execution, because that one is not protected by a check, so the second occurrence was tried and failed.

This was really my ignorance, and maybe a bit the documentation not being overly verbose about the existence of more than one Vagrantfile. It also gives great opportunities: lots of steps or logic that would normally be in the Vagrantfile can be put in the embedded Vagrantfile, so the regular Vagrantfile can be kept really simple.

Conclusion is that if you create vagrant boxes yourself, when you want to perform provisioning steps against a vagrant box that simply need to be done, you might as well put them in the embedded Vagrantfile.

Hopefully the weird name of this blogpost got your interest. This blogpost is about sensible usage of an Oracle database. Probably there are a lot of blog posts like this, but I need to get this off my chest.

A quote any Star Wars fan would recognise is ‘I sense a disturbance in the force’. I do, and I have felt it for a long time. This disturbance is the use of the number of connections to a database. Because of my profession, this is the Oracle database, but this really applies to the server side of any client/server setup running on at least (but probably not limited to) Intel Xeon processors.

The disturbance is the oversubscription, or sometimes even excessive oversubscription, of database connections from application servers, or any other database processes that act as clients. This does not exclude parallel query processes; in other words, this applies to parallel query processes too.

What is oversubscription of database connections? Any good consultant would be able to tell you this: it depends. In principle, oversubscription means more subscribers than a system can handle. This means ‘oversubscription’ is a multidimensional beast, which can apply to CPU, memory, disk IO, network IO. That makes it hard.

This blogpost is about CPU oversubscription. The way a modern CPU, an Intel Xeon in this case, works is not simple, and thus this will not and cannot be an exhaustive description. What I want to do is provide a basic, simplistic description to provide an idea and give guidance.

An Intel Xeon CPU provides a number of processing units, called ‘processor’ in /proc/cpuinfo, called ‘cpu’ in the top utility, etc., which are the execution contexts for processes. These execution contexts can be hyperthreads, which for Intel Xeon means two threads per core, or a single execution context per core. Not all Xeon CPUs provide hyperthreading, and hyperthreading, if available in the CPU, can be disabled in a system’s BIOS. A hyperthread cannot and does not do any processing; that is what a core does.

So why do hyperthreads exist then? I think two prominent reasons are:
1. Not all processes are active all the time. By increasing the number of execution contexts, the switching of processes between execution contexts is reduced, which means a reduction of time spent on context switches.
2. Cores are incredibly powerful, and given that probably not all processes are active all the time, combining two processes on the same core will be “reasonably unnoticeable”.

The next question which then comes to mind is: is hyperthreading a good thing for a database? I cannot give an exhaustive answer to this, partially because there is a massive difference in system usage between different types of database usage. I would say that with reasonable usage, hyperthreading in modern Intel Xeon CPUs provides more benefits, like reduced context switching, than downsides, like variance in the latency of CPU usage.

This means that when looking at the CPU processing power of a database server, the ‘actual’ processing power sits logically between the number of cores and the number of threads. But wait! Didn’t I just say that threads don’t process, only cores do? Yes, but let me explain how I look at this: if your processes do ACTUAL, ON CPU processing, it means they depend on the core to be able to handle this, versus processes that run into waiting really quickly, like doing disk IO, network IO or waiting for users to make them active, which might still appear as running all the time, but in reality are only processing on the core occasionally.

For the first type, doing actual, on CPU processing, you should calculate CPU power more towards the core count; for the second type, doing lots of stalls, you should calculate CPU power more towards the thread count. Overall, this is quite simply about using a core as efficiently as possible.

Now that we have gone through CPUs and their cores and threads, and oversubscription in general, the next question is: how many processes should be allocated on a database server?

The answer is simple: if you want a high performing database for your application servers, the number of processes IN TOTAL should not exceed a number sitting somewhere between CPU core count and CPU thread count.
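To put a number on that for a given server, a quick sketch of how to find the core and thread counts on linux (assuming the lscpu utility is installed, which it normally is):

$ grep -c ^processor /proc/cpuinfo                                      # number of execution contexts
$ lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
# core count = sockets * cores per socket; thread count = core count * threads per core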

“Isn’t that incredibly low?”

Yes, for most of the deployments that I see this would be shockingly low. But just because the number of processes is set very high somewhere doesn’t make it right. It just means it’s set that way.

“But why is it set too high everywhere?”

I don’t know. I don’t understand why lots and lots of people allocate high, sometimes ASTRONOMICAL, numbers of database processes, and then expect that to be the best tuned way, while there is NO LOGICAL EXPLANATION that I can see for it to make sense. In fact: the explanation of why this doesn’t make sense is this blogpost.

To make the comparison with a supermarket and the number of tills: if you go shopping in a supermarket and want to pay and leave as soon as possible, there should be a ready, idle till available, or else you have to wait. For Intel Xeon hyperthreading, you could make the comparison with a till that serves two lanes with persons that want to pay at the same time, because it takes time to put all the items from the shopping basket onto the desk, and the more time that takes, the more efficient a till serving two lanes would be (an Intel Xeon CPU can actually serve two threads at the same time, optimising runtime on the single core).

“Okay, but the majority of the processes is not actually doing anything.”

Well, if the processes are actually not doing anything, why have them in the first place? If that is really true, it doesn’t make sense to have them. And don’t forget: what looks like an idle connection from both an application and a database perspective still is an actual live, running operating system process, and a running database process that has memory allocated, occupies a network socket, has breakable parse locks out, etc, and requires CPU time to maintain the connection.

In fact, by having huge numbers of database connections, you have set up the application to be able to cause “the perfect storm” on the database!

With this, I mean that what I normally see is that indeed the majority of the database connections are not used. However… if things get worse, and the database gets active and starts lacking CPU processing power, more database connections become active. That is logical, right? The database connections that normally would be active take longer because of the increased activity, so with a constant amount of work, new work cannot use an existing connection because that is still active, and thus takes another connection that would normally sit idle. However, serving more connections increases the amount of CPU required even further, which was already lacking, so the waiting time increases further. Now because the waiting time gets higher, more connections are needed, and so on.

And then I didn’t talk about dynamically increasing connection pools!

What I mean by that is that until now I talked about STATIC connection pools. Static means the minimal number of connections is the maximal number of connections in the pool. A dynamic connection pool will have a certain number of connections, and when there is a need for more, which means all the connections are busy, it adds more connections.

Especially with Oracle, this is a really bad idea. Let me explain. Besides too many connections being a bad idea in the first place, having an expanding connection pool means not only that idle connections are put to work, but that the database is given EVEN MORE work by initialising new connections. An Oracle database connection is not lightweight: it requires initialising memory, which is an expensive operation. And the whole reason the connection is created is because the connection pool established that all connections were busy, which almost certainly is because the database was busy (!!!!).

I hope a lot of people will make it to the end, and then realise that high numbers of connections do not make any sense. If you do have an explanation that makes sense, please comment. Please mind that a tuned setup requires the application server to be reasonably set up too; you cannot have one part set up for ultimate processing power, and another part be just a shipwreck.

This blogpost is about how the oracle database executable is created or changed during installation and patching. I take linux for the examples, because that is the platform I am almost exclusively working with, and I think linux is the operating system the vast majority of oracle installations run on, so an explanation using linux is helpful to most people.

The first thing to understand is the oracle executable is a dynamically linked executable. This is easy to see when you execute the ‘ldd’ utility against the oracle executable:

$ ldd oracle
	linux-vdso.so.1 (0x00007ffd3f5b0000)
	libodm19.so => /u01/app/oracle/product/19/dbhome_1/lib/libodm19.so (0x00007fa693084000)
	libofs.so => /u01/app/oracle/product/19/dbhome_1/lib/libofs.so (0x00007fa692e82000)
	libcell19.so => /u01/app/oracle/product/19/dbhome_1/lib/libcell19.so (0x00007fa692b69000)
	libskgxp19.so => /u01/app/oracle/product/19/dbhome_1/lib/libskgxp19.so (0x00007fa69284d000)
	libskjcx19.so => /u01/app/oracle/product/19/dbhome_1/lib/libskjcx19.so (0x00007fa692604000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa6923fb000)
	libclsra19.so => /u01/app/oracle/product/19/dbhome_1/lib/libclsra19.so (0x00007fa6921c0000)
	libdbcfg19.so => /u01/app/oracle/product/19/dbhome_1/lib/libdbcfg19.so (0x00007fa691f93000)
	libhasgen19.so => /u01/app/oracle/product/19/dbhome_1/lib/libhasgen19.so (0x00007fa691270000)
	libskgxn2.so => /u01/app/oracle/product/19/dbhome_1/lib/libskgxn2.so (0x00007fa69106d000)
	libocr19.so => /u01/app/oracle/product/19/dbhome_1/lib/libocr19.so (0x00007fa690d49000)
	libocrb19.so => /u01/app/oracle/product/19/dbhome_1/lib/libocrb19.so (0x00007fa690a49000)
	libocrutl19.so => /u01/app/oracle/product/19/dbhome_1/lib/libocrutl19.so (0x00007fa690828000)
	libaio.so.1 => /lib64/libaio.so.1 (0x00007fa690625000)
	libons.so => /u01/app/oracle/product/19/dbhome_1/lib/libons.so (0x00007fa6903d1000)
	libmql1.so => /u01/app/oracle/product/19/dbhome_1/lib/libmql1.so (0x00007fa69016f000)
	libipc1.so => /u01/app/oracle/product/19/dbhome_1/lib/libipc1.so (0x00007fa68fcf5000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fa68faf1000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fa68f76f000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa68f54f000)
	libnsl.so.1 => /lib64/libnsl.so.1 (0x00007fa68f336000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fa68f11f000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa68ed5d000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa693287000)

The way this works is that each needed library is recorded in the dynamic section of the oracle executable (ELF, the executable format of linux executables, defines this). This can be seen using the ‘readelf’ utility:

$ readelf -d oracle

Dynamic section at offset 0x16f03640 contains 45 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libodm19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libofs.so]
 0x0000000000000001 (NEEDED)             Shared library: [libcell19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libskgxp19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libskjcx19.so]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libclsra19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libdbcfg19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libhasgen19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libskgxn2.so]
 0x0000000000000001 (NEEDED)             Shared library: [libocr19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libocrb19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libocrutl19.so]
 0x0000000000000001 (NEEDED)             Shared library: [libaio.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libons.so]
 0x0000000000000001 (NEEDED)             Shared library: [libmql1.so]
 0x0000000000000001 (NEEDED)             Shared library: [libipc1.so]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libnsl.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libresolv.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [/u01/app/oracle/product/19/dbhome_1/lib]
 0x000000000000000c (INIT)               0xdb0d80
 0x000000000000000d (FINI)               0x12b54b90
 0x0000000000000019 (INIT_ARRAY)         0x17398c20
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x4002d0
 0x0000000000000005 (STRTAB)             0x9ddb10
 0x0000000000000006 (SYMTAB)             0x528128
 0x000000000000000a (STRSZ)              3567352 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x17504000
 0x0000000000000002 (PLTRELSZ)           26136 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xdaa768
 0x0000000000000007 (RELA)               0xda9328
 0x0000000000000008 (RELASZ)             5184 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0xda9188
 0x000000006fffffff (VERNEEDNUM)         7
 0x000000006ffffff0 (VERSYM)             0xd44a08
 0x0000000000000000 (NULL)               0x0

This shows the names of the needed shared libraries (‘NEEDED’). Some of the needed shared libraries are oracle shared libraries, such as libodm19.so, libofs.so, libcell19.so, libskgxp19.so and so on. Other libraries are operating system libraries, such as libc.so.6, libresolv.so.2, libnsl.so.1, libpthread.so.0, libm.so.6 and so on. The oracle libraries are found because an RPATH (runpath) is included in the dynamic section, in my case /u01/app/oracle/product/19/dbhome_1/lib. The operating system libraries are not included with the oracle installation; they are resolved dynamically from the operating system, and the selection of which library is used lies with the operating system.

So, we got the oracle executable, and we found out it’s a dynamically linked executable, which means that it’s using shared libraries for some of its functionality.

Now let’s take one step further. Whenever the oracle database software is installed or patched, it must be linked in order to build the executable with the current state of the software.
You might wonder what I mean with the phrase ‘is installed’: you probably don’t execute a relink all. And that is sensible, because the installer does that for you; you can validate it in $ORACLE_HOME/install/make.log.
I’ll get to the manual linking in a bit.

The oracle database executable and compilation

The first thing to discuss now is compilation. Compilation is the process of turning text based code into a compiled form, for which a lot of compilers do not create an executable form, but an intermediary form, which is called an object. Turning an object or objects into an executable form is called linking. Compiling on linux is done using a compiler, and the default C compiler with Oracle and RedHat linux is gcc. Since Oracle 12.2, the compiler is not a requirement for installation anymore. It is documented, but I believe many may have missed this.

But isn’t there the $ORACLE_HOME/rdbms/lib/config.c file, which is still there, and still used, and isn’t there the make target config.o (make -f ins_rdbms.mk config.o)? Yes, both of them are still there. And still gcc is not a requirement anymore. If you have a pressing need to change the config.c file (which lists the dba, oper, asm, backup, dataguard, keymanagement and RAC group names), you can still change it, and when you remove the config.o file, which USED to be generated with gcc, it will now be generated by the ‘as’ executable (the portable GNU assembler). This is visible in the oracle database executable make target (ioracle):

$ mv config.o config.O
$ make --dry-run -f ins_rdbms.mk ioracle
chmod 755 /u01/app/oracle/product/19/dbhome_1/bin
cd /u01/app/oracle/product/19/dbhome_1/rdbms/lib/; \
/usr/bin/as -o config.o `[ -f config.c ] && echo config.c || echo config.s`; \
/usr/bin/ar r /u01/app/oracle/product/19/dbhome_1/lib/libserver19.a /u01/app/oracle/product/19/dbhome_1/rdbms/lib/config.o
echo
echo " - Linking Oracle "
rm -f /u01/app/oracle/product/19/dbhome_1/rdbms/lib/oracle
/u01/app/oracle/product/19/dbhome_1/bin/orald  -o /u01/app/oracle/product/19/dbhome_1/rdbms/lib/oracle -m64 -z noexecstack -Wl,--disable-new-dtags -L/u01/app/oracle/product/19/dbhome_1/rdbms/lib/ -L/u01/app/oracle/product/19/dbhome_1/lib/ -L/u01/app/oracle/product/19/dbhome_1/lib/stubs/   -Wl,-E /u01/app/oracle/product/19/dbhome_1/rdbms/lib/opimai.o /u01/app/oracle/product/19/dbhome_1/rdbms/lib/ssoraed.o /u01/app/oracle/product/19/dbhome_1/rdbms/lib/ttcsoi.o -Wl,--whole-archive -lperfsrv19 -Wl,--no-whole-archive /u01/app/oracle/product/19/dbhome_1/lib/nautab.o /u01/app/oracle/product/19/dbhome_1/lib/naeet.o /u01/app/oracle/product/19/dbhome_1/lib/naect.o /u01/app/oracle/product/19/dbhome_1/lib/naedhs.o /u01/app/oracle/product/19/dbhome_1/rdbms/lib/config.o  -ldmext -lserver19 -lodm19 -lofs -lcell19 -lnnet19 -lskgxp19 -lsnls19 -lnls19  -lcore19 -lsnls19 -lnls19 -lcore19 -lsnls19 -lnls19 -lxml19 -lcore19 -lunls19 -lsnls19 -lnls19 -lcore19 -lnls19 -lclient19  -lvsnst19 -lcommon19 -lgeneric19 -lknlopt -loraolap19 -lskjcx19 -lslax19 -lpls19  -lrt -lplp19 -ldmext -lserver19 -lclient19  -lvsnst19 -lcommon19 -lgeneric19 `if [ -f /u01/app/oracle/product/19/dbhome_1/lib/libavserver19.a ] ; then echo "-lavserver19" ; else echo "-lavstub19"; fi` `if [ -f /u01/app/oracle/product/19/dbhome_1/lib/libavclient19.a ] ; then echo "-lavclient19" ; fi` -lknlopt -lslax19 -lpls19  -lrt -lplp19 -ljavavm19 -lserver19  -lwwg  `cat /u01/app/oracle/product/19/dbhome_1/lib/ldflags`    -lncrypt19 -lnsgr19 -lnzjs19 -ln19 -lnl19 -lngsmshd19 -lnro19 `cat /u01/app/oracle/product/19/dbhome_1/lib/ldflags`    -lncrypt19 -lnsgr19 -lnzjs19 -ln19 -lnl19 -lngsmshd19 -lnnzst19 -lzt19 -lztkg19 -lmm -lsnls19 -lnls19  -lcore19 -lsnls19 -lnls19 -lcore19 -lsnls19 -lnls19 -lxml19 -lcore19 -lunls19 -lsnls19 -lnls19 -lcore19 -lnls19 -lztkg19 `cat /u01/app/oracle/product/19/dbhome_1/lib/ldflags`    -lncrypt19 -lnsgr19 -lnzjs19 -ln19 -lnl19 -lngsmshd19 -lnro19 `cat /u01/app/oracle/product/19/dbhome_1/lib/ldflags`    -lncrypt19 -lnsgr19 -lnzjs19 -ln19 -lnl19 -lngsmshd19 -lnnzst19 -lzt19 -lztkg19   -lsnls19 -lnls19  -lcore19 -lsnls19 -lnls19 -lcore19 -lsnls19 -lnls19 -lxml19 -lcore19 -lunls19 -lsnls19 -lnls19 -lcore19 -lnls19 `if /usr/bin/ar tv /u01/app/oracle/product/19/dbhome_1/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo19 -lserver19"; fi` -L/u01/app/oracle/product/19/dbhome_1/ctx/lib/ -lctxc19 -lctx19 -lzx19 -lgx19 -lctx19 -lzx19 -lgx19 -lclscest19 -loevm -lclsra19 -ldbcfg19 -lhasgen19 -lskgxn2 -lnnzst19 -lzt19 -lxml19 -lgeneric19 -locr19 -locrb19 -locrutl19 -lhasgen19 -lskgxn2 -lnnzst19 -lzt19 -lxml19 -lgeneric19  -lgeneric19 -lorazip -loraz -llzopro5 -lorabz2 -lorazstd -loralz4 -lipp_z -lipp_bz2 -lippdc -lipps -lippcore  -lippcp -lsnls19 -lnls19  -lcore19 -lsnls19 -lnls19 -lcore19 -lsnls19 -lnls19 -lxml19 -lcore19 -lunls19 -lsnls19 -lnls19 -lcore19 -lnls19 -lsnls19 -lunls19  -lsnls19 -lnls19  -lcore19 -lsnls19 -lnls19 -lcore19 -lsnls19 -lnls19 -lxml19 -lcore19 -lunls19 -lsnls19 -lnls19 -lcore19 -lnls19 -lasmclnt19 -lcommon19 -lcore19  -ledtn19 -laio -lons  -lmql1 -lipc1 -lfthread19    `cat /u01/app/oracle/product/19/dbhome_1/lib/sysliblist` -Wl,-rpath,/u01/app/oracle/product/19/dbhome_1/lib -lm    `cat /u01/app/oracle/product/19/dbhome_1/lib/sysliblist` -ldl -lm   -L/u01/app/oracle/product/19/dbhome_1/lib `test -x /usr/bin/hugeedit -a -r /usr/lib64/libhugetlbfs.so && test -r /u01/app/oracle/product/19/dbhome_1/rdbms/lib/shugetlbfs.o && echo -Wl,-zcommon-page-size=2097152 
-Wl,-zmax-page-size=2097152 -lhugetlbfs`
rm -f /u01/app/oracle/product/19/dbhome_1/bin/oracle
mv /u01/app/oracle/product/19/dbhome_1/rdbms/lib/oracle /u01/app/oracle/product/19/dbhome_1/bin/oracle
chmod 6751 /u01/app/oracle/product/19/dbhome_1/bin/oracle
(if [ ! -f /u01/app/oracle/product/19/dbhome_1/bin/crsd.bin ]; then \
    getcrshome="/u01/app/oracle/product/19/dbhome_1/srvm/admin/getcrshome" ; \
    if [ -f "$getcrshome" ]; then \
        crshome="`$getcrshome`"; \
        if [ -n "$crshome" ]; then \
            if [ $crshome != /u01/app/oracle/product/19/dbhome_1 ]; then \
                oracle="/u01/app/oracle/product/19/dbhome_1/bin/oracle"; \
                $crshome/bin/setasmgidwrap oracle_binary_path=$oracle; \
            fi \
        fi \
    fi \
fi\
);
$ mv config.O config.o

First of all, I am in the $ORACLE_HOME/rdbms/lib directory already. I moved the config.o file to a different name, config.O (uppercase O). This triggers the config.o file to be generated during linking via the makefile, because the make macro for generating the oracle executable checks for the existence of config.o in $ORACLE_HOME/rdbms/lib, and the generation of config.o is triggered by it not existing.
I used make with the ‘--dry-run’ option, which means it lists what it WOULD do, without actually doing it.
Now that the make macro doesn’t find the $ORACLE_HOME/rdbms/lib/config.o file, it generates it, using ‘as’, the GNU assembler.
After the run, I move the config.O file back to config.o.
Please mind the make target config.o (make -f ins_rdbms.mk config.o) still exists, and that one follows the traditional way, using gcc, to create the object file config.o.

For anything other than config.c, Oracle provides the objects (the compiled, intermediate form of C) in object files. This has several advantages. First of all, the server to install oracle on doesn’t require a compiler. That also means that there is no discussion about compiler versions: Oracle knows for a fact which version of a compiler is used. Second, Oracle can use a different compiler than the GNU compiler, as long as it provides objects in Linux X86_64 (ELF) format. In fact, that is what oracle does: for Oracle 19.9, Oracle used a compiler from Intel: Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.2.174 Build 20170213.
You can obtain that information from the oracle executable using:

$ readelf -p .comment oracle | egrep 'Intel.*Build\ [0-9]*'

I hope that at this point I made it clear that no compiler is needed anymore for oracle installation and making changes to the oracle installation, like patching.

The oracle database executable, objects and object files

It’s probably a good idea to show how the oracle executable is built. This happens via the make target ‘ioracle’, which is visible above (make -f ins_rdbms.mk ioracle).
The macro calls ‘orald’, which actually is a script in $ORACLE_HOME/bin, which calls the operating system ‘ld’ executable, the GNU linker.
The arguments to ‘orald’ are mostly passed through to ‘ld’. ‘-o’ is the output flag, which shows the executable to be built by the linker, and besides options being set, what you mainly see are -L (library path) and -l (library) switches adding libraries (object files and archive files) to build the oracle executable.
There’s a couple of places that are used to get objects to build the oracle executable:
– $ORACLE_HOME/rdbms/lib — oracle database rdbms specific libraries
– $ORACLE_HOME/lib — oracle database general libraries (objects are used by multiple “products” in the $ORACLE_HOME)
– $ORACLE_HOME/lib/stubs — this is a directory with ‘stub objects’, which are versions of operating system libraries that contain no code, but allow the oracle executable to be build, even if a operating system library is missing (see: https://docs.oracle.com/cd/E23824_01/html/819-0690/chapter2-22.html)
– $ORACLE_HOME/ctx/lib — oracle text (not sure why oracle text requires an explicit lookup to $ORACLE_HOME/ctx/lib, while other options are all in $ORACLE_HOME/lib)
– /lib64 — operating system libraries

At this point it’s important to realise that the object files for linking oracle come in two forms: as plain object files (.o) and as archive files (.a). Archive files are exactly what the name suggests: archives of object files. You can look at and manipulate an archive file using ‘ar’ (the archiver utility), which works much like tar and jar: t=list, x=extract, c=create.
If you take a look at one of the main archives, libserver19.a, you see that it contains 2852 object files:

$ ar -t $ORACLE_HOME/lib/libserver19.a | wc -l
2852

If you do wonder what’s inside, ‘ar -tv’ would be a good way to have an idea:

$ ar -tv $ORACLE_HOME/lib/libserver19.a
rw-r--r-- 54321/54321  11136 Oct 19 21:12 2020 kdr.o
rw-rw-r-- 94110/42424   8376 Apr 17 04:58 2019 upd.o
rw-rw-r-- 94110/42424  41968 Apr 17 04:58 2019 kd.o
...
rw-r--r-- 54321/54321  13248 Oct 19 21:11 2020 qjsntrans.o
rw-r--r-- 54321/54321  20296 Oct 19 21:11 2020 kubsd.o
rw-r--r-- 54321/54321  16720 Oct 19 21:12 2020 kqro.o

The conclusion here is that archive files are logical and sensible, otherwise the library directories would have been swamped with huge numbers of object files.

When object files are linked into an executable, the linker requires the object files, which come as ‘.o’ files, or grouped in ‘.a’ files. A third type of file is needed for linking an executable that is going to be a dynamically linked executable: the libraries (the ‘.so’ files) the executable is going to use dynamically. The linker will validate the libraries, which means it inspects them to find the symbols that the objects that form the executable are calling. A library (‘.so’ file) is an already compiled form; in fact it’s pretty much similar to an executable, only the way it’s invoked is by being called by a dynamically linked executable that uses it, instead of directly.

The object files themselves

The above text pretty much describes how executables, libraries, object files and archives fit together, and how linking creates the oracle executable via the makefile. This description shows how this is configured by oracle for creating the oracle executable. However, this is really flexible, and can be done differently, so this is not how it always is or should be; this is how oracle has chosen to do it.

We can look one level deeper into how this works. An object file is in fact already a collection in itself, containing one or more compiled functions:

$ nm -A opimai.o
opimai.o:                 U dbkc_free_bs_context
opimai.o:                 U dbkc_init
opimai.o:                 U dbktFlush
opimai.o:                 U __intel_new_feature_proc_init
opimai.o:                 U kgeasnmierr
opimai.o:                 U kge_pop_guard_fr
opimai.o:                 U kge_push_guard_fr
opimai.o:                 U kge_report_17099
opimai.o:                 U kgeresl
opimai.o:                 U kge_reuse_guard_fr
opimai.o:                 U ksdwrf
opimai.o:                 U kseini
opimai.o:                 U ksmdsg
opimai.o:                 U ksmgpg_
opimai.o:                 U ksmlsge_phaseone
opimai.o:                 U ksmsgaattached_
opimai.o:                 U kso_save_arg
opimai.o:                 U kso_spawn_ts_save
opimai.o:                 U ksosp_parse
opimai.o:                 U ksuginpr
opimai.o:                 U lfvinit
opimai.o:0000000000000010 T main
opimai.o:                 U opiinit
opimai.o:0000000000000300 t opimai_init
opimai.o:0000000000000140 T opimai_real
opimai.o:                 U opiterm
opimai.o:                 U sdterm
opimai.o:                 U _setjmp
opimai.o:                 U skge_sign_fr
opimai.o:                 U skgmstack
opimai.o:                 U skgp_done_args
opimai.o:                 U skgp_retrieve_args
opimai.o:                 U slgtds
opimai.o:                 U slgts
opimai.o:                 U slkbpi
opimai.o:                 U slkfpi
opimai.o:                 U sou2o
opimai.o:                 U spargs
opimai.o:                 U ssthrdmain

This example takes the object file $ORACLE_HOME/rdbms/lib/opimai.o, and this object file contains 3 actual functions (shown by an address and the symbol type ‘T’ or ‘t’), and a whole bunch of functions without an address and with symbol type ‘U’. The functions with symbol type ‘U’ are undefined functions, which means that these functions are not in this object file, but defined somewhere else.
The important thing to consider is that a single object file can contain multiple functions.

I chose this object file because it is in fact the object file that contains the main function, the starting function, of the oracle executable. If you obtain a stack trace of an oracle database process, the first function (the ‘first frame’), at least on recent linux versions (some other operating systems or versions might show earlier functions), will be main. This symbol information is also what the linker uses to build the executable: it follows the symbol information together with the command line switches to resolve and obtain all the functions. The linker will generate an error and not build the executable if it can’t find or resolve the symbols and get all the information it needs.
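A simple way to see this resolution at work yourself is to search the archives for the definition of one of the undefined symbols from the nm listing above, for example ksdwrf. This is just a sketch; which archive actually defines it may differ per version:

$ cd $ORACLE_HOME/lib
$ nm -A lib*.a 2>/dev/null | grep ' T ksdwrf$'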

Patching

At this point you should have an understanding of what a dynamically linked executable, libraries, object files and archives are, and that the oracle executable is built from these using the linker, driven by a makefile.

It might be handy and interesting to look at patching. This information, the information about archives and objects, should give you more background about the specifics of patching. Oracle patching has many forms, and can actually do and change a lot of things in a lot of ways. You can retrace what a patch does by looking at the contents of the patch. But that is not what this post is about.

Especially with one-off patches, in the case of a patch to fix or change one or more functions in the oracle executable, what the patch provides is the fixed and thus changed versions of these functions. However, oracle does not provide source code. In general what oracle provides is the object or objects containing the changed functions. In order to get the changed function or functions into the oracle executable, what generally happens is that the current/old version of the object file is removed from the archive it is in and saved in $ORACLE_HOME/.patch_storage, and the patched version of the object file is inserted into the archive.
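Purely to illustrate the mechanics, this is roughly what that amounts to (a sketch, not something to do on a real system; the object name kcb.o and the patch path are made-up for the example, and this is exactly the work opatch automates):

$ cd $ORACLE_HOME/lib
$ ar -x libserver19.a kcb.o && mv kcb.o kcb.o.orig     # save the current version of the object
$ cp /tmp/patch/files/lib/libserver19.a/kcb.o .        # the patched object shipped by the patch
$ ar -r libserver19.a kcb.o                            # replace the object in the archive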

But, as we saw, an object file generally contains more, or many more, functions. Well, this is why patches can be incompatible with other patches: if multiple patches change different (or the same) functions in the same objects, the patch applied last will undo the changes of the previous patch(es). This is also why you must request merge patches if patches are incompatible.

Dealing with individual object files, extracting them from an archive and saving them in order to be able to restore them into the archive, is tedious. Also, the archive itself doesn’t mind whatever you remove from it or insert into it, even if it will break linking the oracle executable. Therefore, oracle has created opatch to perform a great deal of validations and checks, and to take the work of dependency checking from you and fully automate it. In fact, in general, you can take a (one-off) patch and try to apply it; if it applies, it will allow oracle to be relinked, and if there is a conflicting patch it will tell you. Also, if you want to revert an applied patch, you can simply roll back and have opatch load the previous version into the archive. This is way better than letting us humans deal with it directly.
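A minimal sketch of that workflow with a hypothetical one-off patch (the patch number 12345678 is made up; opatch lives in $ORACLE_HOME/OPatch):

$ cd /tmp/12345678                                                        # the unzipped one-off patch directory
$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
$ $ORACLE_HOME/OPatch/opatch apply
$ $ORACLE_HOME/OPatch/opatch rollback -id 12345678                        # to revert the patch again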

After the patching has changed the archives to contain the updated versions of the objects, which contain the updated functions, these must make it into the oracle executable. This is done by relinking the executable, which takes the objects, including the changed objects, from all the object files and archives, and creates a new executable. The oracle executable itself is never directly patched on linux with recent versions, to my knowledge.

I hope this explanation made sense and made a lot of these things which we are dealing with as oracle DBAs more understandable. Any comments or updates are welcome!

This post is about how to get your log files aggregated in a single place and easily searchable via a convenient web interface.

You might think: wait a minute; doesn’t this exist already? Well, yes and no. Let me explain.
a) traditional log management: traditionally, logs are/were searched for certain strings, typically error messages or parts of error messages, by a shell script run by cron, and when a match was found, a system that was set up to be able to email would send an indication that the aforementioned string was found.
b) log management via a monitoring tool: the next step was that the individual script on each server was exchanged for a monitoring tool, which performed the same task as the shell script. In reality quite often a default monitoring set/template was enabled, instead of the specific strings that were searched for with the shell script. Sometimes this was an improvement, sometimes this meant the specific issues (and thus messages) became invisible. This is still the way the monitoring works in 90% of the cases, including a completely standard monitoring template. At least in my experience.
c) other log gathering and indexing: there are many more products that perform this function. The first one that comes to my mind is splunk, and all the options for doing this cloud based (many there!), and a lot of tools based on elasticsearch, like the “ELK stack”.

I think it’s clear “a)” and “b)” are not dynamic and in fact very static. My major concern is that they don’t allow exploratory investigation: it simply is a warning that is raised, and any investigation means you have to log on and start browsing the available information locally. Everybody who worked with HP Openview or Oracle Enterprise Manager will recognise this. Yes, it’s probably all possible with these tools, but it never (ever(!)) is implemented.

For the last category, the likes of splunk, the ELK stack and the cloud based log tools: I feel the first two definitely serve a function, but they are aimed at aggregating the logs of multiple servers, and are simply too much to set up on a server alongside the processes they are meant to monitor. For the cloud based tools: it might be my conservatism, but getting a subscription and uploading logging feels awkward, especially if it’s for my own test purposes.

This is where Loki comes in. Finally, I would say, there is a tool that can function at small scale (!) and perform log aggregation and provide a searchable database without a huge setup. This is ideal for your own test or development box, to be able to discover what is going on, and have all the log files at your fingertips without endlessly going through the filesystem performing cd, tail, grep, wc, sort, uniq, et cetera. I think lots of people recognise travelling from log file to log file.

Loki gives you a database that orders log entries based on time, and Grafana provides a web based UI to view and query the loki database. This is what it looks like:

(screenshot: Grafana Explore showing the two Loki queries described below)
This is an example: what it shows is my test machine, where I decided to see when linux started, as well as when the Oracle database instance was started.
* The first query uses the label “rdbms_alert” (indicating it came from the text-based oracle alert.log file; this is a label I added to it), and within the log lines with the label job with value “rdbms_alert”, I added a filter for the string “Starting ORACLE”, which indicates an Oracle database instance start.
* The second query uses the label “messages” (indicating it came from the linux /var/log/messages file; this label is mine too), and within the log lines with the label job with the value “messages”, I added a filter for the string “kernel: Command line”, which indicates linux startup. I additionally added a negative filter for “loki”, because loki logs the queries to the messages file, which I don’t want to see. (Both queries are sketched below.)
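In case you want to run these same queries outside of the Grafana UI: loki also comes with a command line client, logcli, which takes the same query language. This is a sketch, assuming logcli is installed and pointed at the local loki instance:

$ logcli query '{job="rdbms_alert"} |= "Starting ORACLE"'
$ logcli query '{job="messages"} |= "kernel: Command line" != "loki"'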

I hope you can see the power of having all the logs in a single place, and completely searchable.

This is just a start; this is a very simple proof-of-concept setup. For example, the date/time in the log lines themselves is not used: the date/time of a log line is when it was ingested into loki. It is possible to have loki interpret the timestamps in the log lines.

If you are interested, but are uncertain if this is for you, and would like to test this: I got a couple of Ansible scripts that can set up the combination of:
* promtail (the default loki log streaming tool)
* loki (the database)
* Grafana (the web UI)
The scripts are created on Oracle Linux 7.8.

Install git, and clone the https://gitlab.com/FritsHoogland/loki-setup.git repository using a user that has sudo rights:

git clone https://gitlab.com/FritsHoogland/loki-setup.git

Install ansible (you might have to install the EPEL repository).

In the loki-setup repo you find the setup scripts for loki, promtail and grafana.
You can execute the scripts in that order (loki, promtail and Grafana) by executing the setup_loki.yml, setup_promtail.yml and setup_grafana.yml scripts.
IMPORTANT: do proofread the scripts, and validate the variables for your situation. Don’t worry: the scripts are easy to read.
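Running them then is nothing more than the following sketch (depending on how sudo is set up for your user, you might need to add -K so ansible asks for the sudo password):

$ cd loki-setup
$ ansible-playbook setup_loki.yml
$ ansible-playbook setup_promtail.yml
$ ansible-playbook setup_grafana.yml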

After installing, you can go to http://yourmachine:3000, log in with admin and the password you set in the setup_grafana.yml script, click on the compass (explore) option, and you can perform your own log browsing.

If you decide you want to remove it, I got a remove script for each of the components, which will remove the specific component entirely. Same applies here too: validate the script.

This is a blog not related to Oracle products in any way.

Remote logging.
This post is specific to the apple Airport Extreme and Express wifi routers. However, in general: if you have multiple (unix/linux) servers, it makes sense to centralise the (sys)logging of these servers, in order to get a better overview of what is happening on them. I would go as far as saying that if you don’t, you are simply not doing it right.

The central logging can be another syslog daemon receiving the logging, but there are many more products which are able to receive logging, like splunk, graylog, logstash and so on. Because this blogpost is about my home wifi routers, I use the simple and limited Synology “Log Center” daemon.

What this blog post is about: enabling remote logging on an Apple Airport device.
In ancient versions of apple’s Airport Utility, you could simply set the logging server. Apparently this version is still around, but it feels like a nuisance to me to install an older version, there is a chance it does not work with the current version of OSX, and it might break something on the Airport side.

However, it’s really simple actually to set the logging server, and even to see if the logging server option is a supported option. In order to do this, go into the Airport utility, click an Airport device and click edit. Now go to ‘File’, and select ‘Export Configuration File…’. Select a name in ‘Save As:’, and save it.

This saves the configuration of the Airport device in XML format in a file that ends with “.baseconfig”. To understand what elements in the XML file mean, you can look at this link.

In order to set the logserver, open the “.baseconfig” file you just created, and search for “slCl”. The row you will find is:

<key>slCl</key>

Actually when there is no syslog server configured already, it will look like this:

<key>slCl</key>
<string>0.0.0.0</string>

Setting the syslog server is as simple as putting the ip address of your log server in the place of 0.0.0.0! Once this configuration is made active (that is the final step, which we will do next), the Airport device will send BSD style syslog information to port 514/UDP of the ip address just set. This is the default setting for a receiving syslog server. Make sure to save the file after you have replaced the 0.0.0.0 ip address with the ip address of your log server.

The final step is to go into the Airport Utility again, click the Airport device, click edit, “File”, and choose “Import Configuration File…”. Select the “.baseconfig” file you just edited, and open it. This will get you back to the edit/configuration dialogue. To effectuate the setting, click “Update”. This will reboot your Airport device and enable remote syslogging.
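If you want to verify that the syslog messages actually arrive, a simple sketch is to watch for them on the log server (assuming tcpdump is installed there and eth0 is the receiving interface; adjust the interface name to your system):

$ sudo tcpdump -n -i eth0 udp port 514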

Extra setting: below “slCl”, you will find “slvl”; this is the log level threshold for sending. By default it is 5 (notice); if you want less information, set it to 4 (warning), or 3 (error), etc.

(warning: this is a rather detailed technical post on the internal working of the Oracle database’s commit interactions between the committing foreground processes and the log writer)

After the Trivadis Performance days I was chatting with Jonathan Lewis. I presented my Profiling the log writer and database writer presentation, in which I state that the foreground (user/server) process looks at the commit SCN in order to determine if its logbuffer contents have been written to disk by the logwriter(s). Jonathan suggested looking deeper into this matter, because looking at the commit SCN might not be the way it truly works.

The reasoning is that the foreground process flushes its log information from its private strand into the public logbuffer, and as such only needs to keep track of the log blocks it allocated in the public logbuffer to see if these are written. Different processes can allocate different blocks in the public log buffer, which potentially do not all have to be written in SCN order. In other words: it could be hard to figure out at write time what the exact on-disk SCN of the redo is.

In order to verify this, let’s first recap what I tell about how a foreground process behaves on commit:

This is a schematic drawing of a foreground process committing.
There are two timelines, one of the foreground process, and one of the logwriter.
The foreground process commits. The commit copies the changes (redo vectors) from the process’ private strand into the public logbuffer (not shown, all indicated with “commit;”), after which it signals (“posts”) the logwriter via the semctl() system call.

When idle, the logwriter sleeps in a system call called semtimedop() for 3 seconds, after which it performs some “household” tasks (measuring time and resource usage for example) and then calls semtimedop() again. When signalled, or when it finds contents in the public log buffer, the log writer writes the public log buffer contents to the online redo log file(s), via AIO if available. This is indicated by the io_submit() and io_getevents() calls.

In the meantime, after executing semctl(), the foreground process enters the function kcrf_commit_force(), or the kcrf_commit_force_int() function with Oracle 12. Inside this function, it calls the function kcscur3() three times. If the log writer was able to write the public log buffer contents before kcscur3() has been executed 3 times, the foreground will not enter the wait event ‘log file sync’, and will not sleep on semtimedop() in order to wait for being signalled by the log writer, which is why I put it in a gray bar (indicating “optional”). Also, in that case, because the foreground process is not waiting to be signalled by the logwriter, the log writer does not execute semctl(). The log writer knows which process(es) it needs to signal via a list that administers posters and waiters, which is called the “post-wait queue”, which is not externalised. If the log writer did not write the public log buffer contents fast enough, the foreground process registers that it started waiting (in the aforementioned post-wait queue), and starts sleeping in semtimedop() for 100ms, after which it executes kcscur3() two times, does some household actions like measuring time and resource usage, and then calls semtimedop() again. This repeats until the process receives a message from the log writer via semctl().

Starting from Oracle version 11.2.0.3, the underscore parameter “_use_adaptive_log_file_sync” is set to TRUE by default, which enables a feature called “adaptive log file sync”. This setting means the database can and will (!) switch between post-wait mode, which has just been described, and a mode called ‘polling’. Polling means the processes that requested the log writer to write do not wait until they are posted by the log writer, but look at the log writer’s write progress, and continue once the log writer has written their specific log blocks from the public log buffer.
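If you want to check this parameter on your own database, a sketch of the classic way to look at underscore parameters (as SYS, since it uses x$ tables) is:

$ sqlplus -S / as sysdba <<'EOF'
select i.ksppinm, v.ksppstvl
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_use_adaptive_log_file_sync';
EOF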

This is how polling mode looks schematically:

As you can see, it is quite like the post-wait mode, only the sleeping is done using nanosleep() instead of semtimedop(), and there is no semctl() on the log writer timeline. The foreground process issues the same kcscur3() functions, but because these measure SCN progress, the foreground can determine if its public log buffer contents have been written to disk, and stop waiting and continue processing.

I made the conclusion that the kcscur3() function is used to determine the commit SCN based on profiling the function call sequence and logical reasoning. Encouraged by Jonathan’s reasoning, let’s try to see if we can dig deeper.

The first thing we need to do is see if we can obtain more information on the kcscur3() function. One way of doing that is investigating the function’s arguments. This can be done without source code or debug symbols, because the arguments of a function are passed via CPU registers on X86_64 Linux: the first to fourth arguments are in the registers RDI, RSI, RDX and RCX.

A way to do this is to insert data into a table, then attach to this process using gdb, and execute the following gdb macro:

break kcscur3
commands
silent
printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
continue
end

After continuing the foreground via gdb, gdb shows kcscur3 is executed 6 times:

kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Interesting! But what are these numbers? Values? Addresses?
In order to understand if these arguments mean anything, let’s first get the addresses of the general shared memory areas. This can be done using the ipc command with oradebug:

SYS@v12102 AS SYSDBA> oradebug setmypid
Statement processed.
SYS@v12102 AS SYSDBA> oradebug ipc
IPC information written to the trace file
SYS@v12102 AS SYSDBA> @lt

@lt is a script to look at the current active trace file. The relevant contents of this trace file are the shared memory areas:

Handle:            0x113c2940 `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102'
 Dump of unix-generic realm handle `/u01/app/oracle/product/12.1.0.2/dbhome_1v12102', flags = 00000000
  key 604726764 actual_key 604726764 num_areas 4 num_subareas 4
  primary shmid: 112590854 primary sanum 3 version 3
  deferred alloc: FALSE (0) def_post_create: FALSE (0) exp_memlock: 2050M
 Area #0 `Fixed Size' containing Subareas 2-2
  Total size 00000000002ca788 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      0        2 112492547 0x00000060000000 0x00000060000000 0x00000060000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          00000000002cb000 0000000000400000 default       readwrite
 Area #1 `Variable Size' containing Subareas 0-0
  Total size 0000000077000000 Minimum Subarea size 01000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      1        0 112525316 0x00000061000000 0x00000061000000 0x00000061000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000077000000 0000000077000000 default       readwrite
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000008d35000 Minimum Subarea size 00001000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      2        1 112558085 0x000000d8000000 0x000000d8000000 0x000000d8000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000008d35000 0000000008e00000 default       readwrite
 Area #3 `skgm overhead' containing Subareas 3-3
  Total size 0000000000003000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Segment Addr    Stable Addr    Actual Addr
      3        3 112590854 0x000000e1000000 0x000000e1000000 0x000000e1000000
               Subarea size     Segment size   Req_Protect  Cur_protect
                          0000000000003000 0000000000003000 default       readwrite

We see the 4 shared areas and their memory addresses:
– Fixed size, start address 0x60000000, size 0x2cb000
– Variable size, start address 0x61000000, size 0x77000000
– Redo buffers, start address 0xd8000000, size 0x8d35000
– Skgm overhead, start address 0xe1000000, size 0x3000

If we combine this information with the kcscur3() arguments, we see that the first argument points to the 'fixed size' area, in other words: the fixed SGA. The metadata of the fixed SGA variables is listed in x$ksmfsv. Please mind this view does not list all the contents of the fixed SGA; for example, the latches are in the fixed SGA too.

We got 3 addresses which are inside the fixed SGA on calling commit: 0x6001fbb0, 0x60027c68 and 0x60027c98. Let’s see if we can find them in x$ksmfsv:

SYS@v12102 AS SYSDBA> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
2  where to_number('6001fbb0','XXXXXXXX')
3  between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX') and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcsgscn_                                                         000000006001FBB0         48

SYS@v12102 AS SYSDBA> c/6001fbb0/60027c68/
  2* where to_number('60027c68','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

SYS@v12102 AS SYSDBA> c/60027c68/60027c98/
  2* where to_number('60027c98','XXXXXXXX')
SYS@v12102 AS SYSDBA> /

KSMFSNAM                                                         KSMFSADR           KSMFSSIZ
---------------------------------------------------------------- ---------------- ----------
kcrfsg_                                                          0000000060027C30       1608

The first fixed SGA address, 0x6001fbb0, points to a variable called kcsgscn_. Some sources on the internet report this is the current SCN. (Kernel Cache System Global SCN?)
Let’s test this! First I set a watchpoint on 0x6001fbb0 in gdb, then query v$database.current_scn:

(gdb) awatch *0x6001fbb0
SYS@v12102 AS SYSDBA> select current_scn from v$database;

This triggers the watchpoint!

Hardware access (read/write) watchpoint 1: *0x6001fbb0

Old value = 11278094
New value = 11278163
0x000000000cc77983 in kcscur3 ()

Ah! So when querying the current_scn, it uses the kcscur3() function too, and apparently, kcscur3() can change a value as well (we see an old value and a new value). I press 'c' and enter to let the debugger continue the program it debugs. In fact, I have to do this multiple times, in the functions kcscur3(), kcsgssn(), kcsgcsn() and kcsgbsn(). The watchpoint shows memory address 0x6001fbb0 is left with a number exactly one higher than is shown by the query 'select current_scn from v$database'.

The second and third fixed SGA addresses, 0x60027c68 and 0x60027c98, both point to a variable called kcrfsg_. Actually, they do not point to the starting address of the variable, but rather to “somewhere” in this variable. First let’s see what value is stored at these addresses in the variable kcrfsg_ using gdb:

(gdb) x/dw 0x60027c68
0x60027c68:	11278915
(gdb) x/dw 0x60027c98
0x60027c98:	11278917

These are large numbers, which are highly likely to be SCNs of some sort.

The variable kcrfsg_, which is quite probably a C "struct" (a variable composed of multiple variables, like a record in a table), is linked with the x$ view x$kcrfws:

SYS@v12102 AS SYSDBA> select addr from x$kcrfws;

ADDR
----------------
0000000060027C38

The address reported is 8 bytes into kcrfsg_. The x$kcrfws view is only used by the v$ view v$xstream_capture, and because of that it is reported to have something to do with replication. That is incorrect. My current assumption is x$kcrfws means Kernel Cache Redo Write Status.

In order to figure out which field in x$kcrfws is linked to which memory address (0x60027c68 and 0x60027c98), I use gdb once again, and use a watchpoint on a memory address. Oradebug also provides this functionality, but it doesn’t work in my setup (mprotect error 22). I attach gdb to a SYSDBA session, and execute:

(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 1: *0x60027c68

Then query the fields in x$kcrfws one by one. It turns out, memory address 0x60027c68 is accessed for the fields lwn_scn_bas and lwn_scn_wrp, and memory address 0x60027c98 is accessed for the fields on_disk_scn_bas and on_disk_scn_wrp.

So, what do we know now? It becomes apparent Oracle uses the kcscur3() function for reading SCN values. The function seems to be generic and can be used for multiple locations holding different types of SCNs. We witnessed it reading the instance current SCN, the on-disk SCN and the LWN SCN.

The on-disk SCN (x$kcrfws.on_disk_scn_(bas|wrp)) seems to be a registration of up to which SCN has been written by the log writer, and the LWN SCN (x$kcrfws.lwn_scn_(bas|wrp)) is a registration of up to which SCN is in the current log write number (LWN). The log write number (LWN) seems to be a number assigned to a group of redo blocks for writing them in batch.
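
As a side note, each of these SCNs is exposed as a base and wrap pair (the _bas and _wrp columns). The common convention, which I have not independently verified here, is that the full SCN is wrap times 2^32 plus base; a tiny helper makes that explicit:

#include <stdint.h>

/* combine an SCN base/wrap pair (e.g. x$kcrfws.on_disk_scn_bas and
   on_disk_scn_wrp, or lwn_scn_bas and lwn_scn_wrp) into one value,
   assuming the usual layout: scn = wrap * 2^32 + base */
static inline uint64_t scn_from_bas_wrp(uint32_t bas, uint16_t wrp)
{
    return ((uint64_t) wrp << 32) | bas;
}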

This information is needed to make more sense of how my foreground process works. In order to make the tracing of the foreground more meaningful, we need to add a break on semctl(), so we can understand when all redo vectors have been copied into the public log buffer and the foreground actually starts waiting on the log writer and peeking at its progress. It is also handy to add breaks on semtimedop() and nanosleep(), so we know which log writer mode is in use:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000cc77970 <kcscur3>
	breakpoint already hit 6 times
        silent
        printf "kcscur3, %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
        continue
(gdb) break semctl
Breakpoint 2 at 0x3bfdaeb030
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>silent
>printf "semctl\n"
>continue
>end
(gdb) break semtimedop
Breakpoint 3 at 0x3bfdaeb060
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>printf "semtimedop\n"
>continue
>end
(gdb) break nanosleep
Breakpoint 4 at 0x3bfde0ef90
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>silent
>printf "nanosleep\n"
>continue
>end
(gdb) disable
(gdb)

I disabled all breakpoints (the disable command) and did the insert. After the insert, I enabled all breakpoints again (the enable command) and watched the function call sequence of the commit:

(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

So, after a foreground process has written its change vectors into the public log buffer and requests the log writer to write using semctl(), it doesn't poll a single location for an SCN, but reads the on-disk SCN and the current SCN! It also seems my log writer was speedy enough to write, because there is no nanosleep() nor semtimedop() call.

It seems we need to manipulate the log writer too, in order to get the call sequences I showed in my slides. In order to do that, I opened another terminal session and attached a debugger session to the lgwr process too. At this point I should point out that I made sure my Oracle (12.1.0.2) database was using the single lgwr process and not the log writer workers, by setting the "_use_single_log_writer" undocumented parameter to true. (Do NOT do this on a production database; neither set an undocumented parameter without the blessing of Oracle support, nor attach gdb to a log writer process!)

I added a break on the io_getevents_0_4 function, which is the function for reaping asynchronously submitted IO, and then disabled the breakpoint:

break io_getevents_0_4
disable
continue

I also disabled the breakpoints in the other debugger session, attached to the foreground session, and inserted another record. After the insert, I enabled the breakpoints in both gdb sessions and entered commit. Now the log writer debugger session will break on io_getevents_0_4, and the foreground debugger session should show a repeating pattern:

(gdb) enable
(gdb) c
Continuing.
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070
kcscur3, 60027c68, f6aac6c0, 1, 5
kcscur3, 60027c98, f6aac2f0, 1, 634
kcscur3, 60027c68, f6aac258, 0, 0
semctl
kcscur3, 60027c98, f6ab01b0, 1, 634
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
nanosleep
kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 60027c68, f6ab0118, 0, 0
...

The nanosleep() and kcscur3() calls for addresses 0x60027c98 and 0x60027c68 continue to be executed. After continuing the log writer debugger session, the foreground debugger session shows:

kcscur3, 60027c98, f6ab01b0, 1, d629cbe8
kcscur3, 6001fbb0, f6ab0708, 1, fe92f070

Bingo! We see the nanosleep() call, so we are in polling mode, not in post-wait mode. And because we artificially stopped the log writer from progressing, the foreground process reads the on-disk SCN and the LWN SCN, then calls nanosleep(), then scans the on-disk and LWN SCNs again, etcetera, until the on-disk SCN gets higher than the foreground process' commit SCN. Interestingly, it seems that once kcscur3() on address 0x60027c98 was able to identify that the log writer progressed beyond its commit SCN, the next kcscur3() call does not read address 0x60027c68, but instead reads address 0x6001fbb0, a.k.a. the current SCN.
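
For comparison with the post-wait sketch earlier, this is an equally illustrative sketch of the polling loop as it shows up in the trace: nothing but reads of the on-disk and LWN SCNs interleaved with nanosleep(), and no semaphore involved. The parameter names are mine; in reality the addresses are the fixed SGA locations traced above, and the sleep time is a parameter here purely for illustration.

#define _GNU_SOURCE
#include <time.h>

/* illustrative polling loop: keep reading the on-disk SCN (and the LWN SCN)
   and nap in between, until the log writer has progressed the on-disk SCN
   beyond this process' commit SCN */
void poll_for_commit(volatile unsigned long *on_disk_scn,
                     volatile unsigned long *lwn_scn,
                     unsigned long commit_scn,
                     long sleep_ns)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = sleep_ns };

    while (*on_disk_scn < commit_scn) {  /* kcscur3() on 0x60027c98 */
        (void) *lwn_scn;                 /* kcscur3() on 0x60027c68 */
        nanosleep(&nap, NULL);           /* instead of semtimedop() */
    }
}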

Okay, so this points to the on-disk SCN being responsible for commit progress. If this is actually true, we should now take a look at the log writer and see if the log writer indeed changes the on-disk SCN at memory location 0x60027c98 after the write. For this I can use the memory watch facility of the debugger again:

(gdb) awatch *0x60027c98
Hardware access (read/write) watchpoint 2: *0x60027c98
(gdb) awatch *0x60027c68
Hardware access (read/write) watchpoint 3: *0x60027c68
(gdb) c
Continuing.

This will quite probably immediately break on access of these memory areas; the log writer is looking at these memory areas a lot. Here's a snippet of the log writer approximately during the commit of the foreground:

(gdb) c
Continuing.
Hardware access (read/write) watchpoint 3: *0x60027c68

Old value = 11174890
New value = 11174892
0x000000000cc77ac4 in kcsnew3 ()
(gdb) c
Continuing.

Breakpoint 1, io_getevents_0_4 (ctx=0x7f165de70000, min_nr=2, nr=128, events=0x7ffc696611b8, timeout=0x7ffc696621b8)
    at io_getevents.c:46
46		if (ring==NULL || ring->magic != AIO_RING_MAGIC)
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x000000000cc77983 in kcscur3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Value = 11174890
0x0000000002d70bb9 in kcsadj3 ()
(gdb) c
Continuing.
Hardware access (read/write) watchpoint 2: *0x60027c98

Old value = 11174890
New value = 11174892
0x0000000002d70c13 in kcsadj3 ()

First we see a function called kcsnew3() change the value at 0x60027c68. Then we see the breakpoint at io_getevents() indicating it looks for submitted IOs to return; in other words: the log writer was asked to write something or found something to write. After the write we see the log writer reading the memory location 0x60027c98 using the kcscur3() function, just as our foreground process does. Then we find a function called kcsadj3(), which first reads the on-disk SCN location 0x60027c98 and then changes it. This indicates that after the write, the log writer increases the value of the on-disk SCN in 0x60027c98 using the function kcsadj3() (a safe bet is Kernel Cache Service Adjust), which a foreground process in polling mode reads to understand whether the log writer has written the contents the process put in the public log buffer.
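
Putting the watchpoint hits in order, the log writer side of one write cycle could be sketched as follows. This is only my reading of the observed sequence (kcsnew3(), the asynchronous write, kcscur3()/kcsadj3()); the helper functions are hypothetical stand-ins, not real Oracle or library calls.

#include <stdint.h>

/* hypothetical helpers standing in for the real redo write machinery */
uint64_t take_next_lwn_scn(void);              /* kcsnew3(): advances the LWN SCN */
void     write_lwn_to_disk(void);              /* io_submit()/io_getevents() */
void     post_waiters_if_post_wait_mode(void); /* semctl() in post-wait mode */

/* one write cycle of the log writer, as inferred from the watchpoints */
void log_writer_write_cycle(volatile uint64_t *on_disk_scn)
{
    uint64_t lwn_scn = take_next_lwn_scn();  /* new value seen at 0x60027c68 */

    write_lwn_to_disk();                     /* breakpoint on io_getevents_0_4 */

    /* kcscur3() then kcsadj3(): read and advance the on-disk SCN at
       0x60027c98, which is what polling foregrounds are watching */
    if (*on_disk_scn < lwn_scn)
        *on_disk_scn = lwn_scn;

    post_waiters_if_post_wait_mode();
}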

Of course this still isn't the ultimate proof: I break at a few arbitrary functions, and there can be all kinds of other things going on. Because I am looking at only a small subset of the functions the foreground is executing, it could, for example, still keep a list of log buffer blocks and really use that, and only use the SCN values for verification.

However, there is a way to test this!
=> Another word of caution, the next manipulation can render your database into a smoking pile of bits and bytes! <=
The log writer mode needs to be polling. Also, for simplicity, in Oracle 12 “_use_single_log_writer” must be set to true, to be sure the log writer is performing all the tasks, and not log writer slaves.

In order to prove that a foreground session is truly looking at 0x60027c98 for the on-disk SCN to understand if the log writer has written its log buffer contents, we let the log writer stop before it has adjusted the on-disk SCN. That can be accomplished by attaching to the log writer with gdb and breaking on io_getevents_0_4:

break io_getevents_0_4
disable
continue

Now the breakpoint is set, disabled, and execution is resumed.

The next thing to do is go to the sqlplus foreground session and insert a row of data. Once the insert is done, go to the debugger session, stop execution, enter "enable", and continue. The debugger resumes execution again, but will now break once it encounters io_getevents. Now go back to the foreground sqlplus session and enter commit.

This will trigger the log writer to write, and break execution because of the breakpoint. Because the log writer is stopped, the foreground session will appear to be "hanging". In fact what the foreground session is doing is sleeping in nanosleep(), then executing kcscur3() to look at the on-disk and LWN SCNs, then sleeping in nanosleep() again, etc.

This next part is experimental!
Because the log writer is stopped at the io_getevents() call, it didn't execute kcsadj3() yet to update the on-disk SCN in 0x60027c98. What we will do is read the current value in 0x60027c98 and increase the value using gdb. If the foreground is truly only waiting for the value at 0x60027c98, increasing the value at this memory address should give back the prompt of the sqlplus session, because it will think the commit fully succeeded, as if the log writer had performed all its functions.
First query the on-disk SCN:

(gdb) x/dw 0x60027c98
0x60027c98:	11175042

So, the on-disk SCN value is 11175042. Mind you, the sqlplus session appears to be hanging, but really is "polling" the on-disk SCN; the log writer has stopped executing because of the breakpoint.

Let's see if increasing the on-disk SCN makes the sqlplus session think the commit succeeded. Please mind I rather arbitrarily increased the value above by changing the 7 into an 8 (the fifth digit counting from the right).

(gdb) set {int}0x60027c98 = 11185042

Voila! The foreground process continues and returns the prompt, indicating it thinks the log writer has finished!

This means that Oracle uses the log writer on-disk SCN for processes to determine if their log buffer contents have been written to disk.

When the log writer is in post-wait mode, this mechanism is in place too, but the foreground process, once waiting, needs to wait for the log writer to post it via semctl(). There is one exception to this, which is true for both the polling and the post-wait mechanisms: right after a foreground process has signalled the log writer to write, it calls kcscur3() looking for the on-disk SCN, and if it finds the on-disk SCN beyond its commit SCN, no wait will be triggered. In all other cases the process will register a wait for 'log file sync'.

Conclusion
The kcscur3() function is a function to read and update SCNs. There are multiple SCNs, like the current SCN, the on-disk SCN and the LWN SCN, which can all be read with kcscur3(). Also, kcscur3() is not exclusive to commit: when querying the current_scn field in v$database, this function is used too.

When a process commits, right after it signalled the log writer, it checks for the log writer progress via the on-disk SCN. If the log writer was fast enough to have the contents flushed to disk and the on-disk SCN updated, the process will not wait for ‘log file sync’ at all.

Both log writer modes, post-wait and polling, look for the log writer write progress via the on-disk SCN. Both modes will show the wait 'log file sync' if the above "fast log writer exception" didn't happen.

When the log writer is in post-wait mode, a process without the "fast log writer exception" will register itself in the post-wait queue and sleep on a semaphore waiting to be posted by the log writer, and register that it is waiting for the event 'log file sync'. Still, it does keep track of the on-disk SCN (twice), with a check of the LWN SCN in between, via calls to kcscur3(). After being signalled by the log writer, it again verifies the on-disk SCN and looks at the current SCN, after which it continues.

When the log writer is in polling mode, a process without the "fast log writer exception" will register that it is waiting for the event 'log file sync'. It then sleeps in nanosleep() and only executes two checks: the on-disk SCN and the LWN SCN, via calls to kcscur3(). Once the log writer has progressed the on-disk SCN beyond the process' commit SCN, the process continues.

In my previous post, I looked at non-shared latches and how the latching is done by Oracle. This post is a description of how the latching works for shared latches.

The information is quite internal; if you landed on this page, it might be a good idea to start with my first post on this topic: first post.

A famous example for shared latches is the ‘cache buffers chains’ latch.

For the sake of the test I quite randomly scanned a test table, and had a little gdb script to look at the function call ksl_get_shared_latch:

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end

After getting the results, I looked up the latch addresses (laddr) in V$LATCH.ADDR, and searched for a cache buffers chains latch.

Next, I set up the same construction as in the previous investigation into non-shared latches: I started two sqlplus / as sysdba sessions. In the first session, I simply took a shared latch in exclusive mode (mode 16 as the fifth argument).

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

This takes latch 0x94af8768 in mode 16, exclusive mode.

I essentially did the same as in the previous investigation: I ran perf record on the session trying to get the latch, and looked with perf report at the functions that were used during the spinning for the shared latch. If the spinning is not clearly visible, set "_spin_count" to a higher value.

After the investigation of perf report, I worked out the relevant functions. Here is the gdb script with the relevant functions that I discovered:

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end
break kslgess
  commands
    silent
    printf "kslgess %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslskgs
  commands
    silent
    printf "kslskgs %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break *0xc291a9
  commands
    silent
    printf " kslskgs loop: %d\n", $r15d
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

If you attach to the second process and source the breakpoint script, this is what you will see:

ksl_get_shared_latch laddr:94af8768, willing:1, where:0, why:2303, mode:16
kslgess 94af8768, 16, 1, 0
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 2000
 kslskgs loop: 1999
...
 kslskgs loop: 2
 kslskgs loop: 1
kslwlmod 464092480, -1780205088, -1800435864, 1
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 1
skgpwwait 464091920, 1628424256, -1780204168, 0
sskgpwwait 464091920, 1628424256, -1780204168, 0
semop 1409027, 464091720, 1, -1

1-A shared latch is gotten using the distinct ksl_get_shared_latch function call. In the ksl_get_shared_latch function an attempt to get the latch is made, and the function returns if it succeeds.
2-Next is kslgess (kernel service latch get spin shared (latch)). It does not appear that a latch get attempt is made in this function.
3-kslskgs: kernel service latch spin kernel get shared (this is a guess). Here the latch is tried up to _spin_count-1 times.
4-8: this is the kslskgs function spinning over the latch get.
9-kslwlmod: kernel service latch wait list modification. Here the process is (almost?) done spinning, and registers itself in the post-wait interface to get posted when the latch is freed.
10/11-here the latch is tried one last time.
12-skgpwwait: system kernel generic post wait wait: the function to set up the waiting to get posted.
13-sskgpwwait: system system kernel generic post wait wait: probably the port-specific code for the post-wait setup.
14-semop: semaphore operation: the session starts sleeping on a semaphore, waiting to get posted. As indicated previously, this keeps the process off the CPU unless it is posted. A condensed sketch of this flow follows below.
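
Here is that condensed sketch in C. try_latch() and register_in_wait_list() are hypothetical stand-ins for the immediate latch get and for kslwlmod(), and the semaphore arguments stand in for the post-wait interface, so this is an illustration of the observed sequence rather than Oracle's implementation.

#include <stdbool.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

bool try_latch(void *laddr, int mode);    /* hypothetical immediate latch get */
void register_in_wait_list(void *laddr);  /* stands in for kslwlmod() */

void get_shared_latch_sketch(void *laddr, int mode, int spin_count,
                             int semid, struct sembuf *sop)
{
    /* 1: immediate attempt inside ksl_get_shared_latch() */
    if (try_latch(laddr, mode))
        return;

    /* 3-8: kslskgs() spins up to _spin_count-1 times */
    for (int i = spin_count - 1; i > 0; i--)
        if (try_latch(laddr, mode))
            return;

    for (;;) {
        /* 9: kslwlmod() registers the process in the wait list */
        register_in_wait_list(laddr);

        /* 10/11: one last attempt before going to sleep */
        if (try_latch(laddr, mode))
            return;

        /* 12-14: skgpwwait()/sskgpwwait()/semop(): sleep until posted */
        (void) semop(semid, sop, 1);
    }
}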

This also means that instead of the (very) CPU heavy spinning on latches that was done in the old days, this way is very light on the CPU, and seems to be very efficient, or at least more efficient than in the old days.

A word of caution: it seems the shared latch taken in exclusive mode cannot be freed; the session that has gotten the latch will hang when calling kslfre, and the session that was waiting for the latch gets:

ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

This seems to be expected behaviour, as indicated by Stefan Koehler and Andrey Nikolaev, and it has not been fixed in Oracle version 12.1.0.2.

Last week I got a question about how storage indexes (SI) behave when the table for which the SI is holding data is changed. Based on logical reasoning, it can be one of two things: the SI is invalidated because the data it is holding has changed, or the SI is updated to reflect the change. Think about this for yourself, and pick a choice. I would love to hear whether you chose the correct one.

First, let's take a step back and lay some groundwork. The tests done in this blogpost are done on an actual Exadata (V2 hardware), with Oracle version 11.2.0.4.6 (meaning bundle patch 6). The Exadata "cellos" (Cell O/S) version is 11.2.3.3.1.140529.1 on both the compute nodes and the storage nodes.

A storage index is a memory structure used by the cell daemon, which is the storage server process on the storage layer of Exadata. By default (although I've never seen it different yet) a SI contains minimum and maximum values for up to eight columns of the table it is describing. The memory structure is transient: it describes a region of one megabyte of a table. A storage index is populated during the fetching of data for a smart scan, based on the filter predicates of the query causing the smart scan. In essence the whole SI management is done automatically by the cell daemon ("in essence" means you can play around with some undocumented settings on the database level, and there are some undocumented settings and events you can set on the cell server level).
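
To make the mechanism concrete, here is a small illustrative sketch of the region elimination idea. The struct layout is mine, not the cell daemon's; the point is simply that a minimum, a maximum and a null indicator per column per 1MB region are enough to decide whether that region can be skipped for a given filter predicate.

#include <stdbool.h>
#include <stdint.h>

/* illustrative only: one storage index entry describing a 1MB region */
struct region_index {
    bool    has_nulls;  /* cf. the HASNONNULLVALUES flag in the dump further down */
    int64_t lo;         /* minimum column value seen in the region */
    int64_t hi;         /* maximum column value seen in the region */
};

/* true when the region cannot contain rows for the predicate "col = value"
   and can therefore be skipped; the bytes of skipped regions end up in the
   statistic 'cell physical IO bytes saved by storage index' */
bool region_can_be_skipped(const struct region_index *ri, int64_t value)
{
    return value < ri->lo || value > ri->hi;
}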

Okay, back to the original question. Let's test this. First we need a table large enough to be smart scanned. I've got a table called 'BIGTAB_NOHCC' (as you probably guessed, there is also a table 'BIGTAB' in the schema I am using, which is hybrid columnar compressed). This table consists of 2671736 blocks of 8KB, which means it has a total size of roughly 21G. This is big enough for my database instance ("big enough" is relative to the buffer cache size) to get a smart scan.

For this test, I use the session statistics. In v$sesstat there is a statistic called ‘cell physical IO bytes saved by storage index’, which tells us how many bytes we saved from being scanned because of the use of storage indexes. I also show some output of a (single) storage server, although a normal Exadata rack typically will have 3, 7 or 14 storage servers in use.

First of all, in order to get an idea and be pretty sure my query will populate storage indexes, I stop and start the cell daemon on the storage servers. I use ‘service celld restart’ for that.

Next, in order to get storage index information for the table I use, I need some metadata. The metadata I need is:
a) data object id

select data_object_id from dba_objects where owner = 'MARTIN' and object_name = 'BIGTAB_NOHCC';

b) tablespace number

select ts# from v$tablespace t, dba_segments s, dba_objects o where t.name=s.tablespace_name and s.segment_name=o.object_name and o.data_object_id = 18716;

c) database unique id from the x$ksqdn

select ksqdngunid from x$ksqdn;

(thanks to the progeeking website for this information)

Next, I run the query on my BIGTAB_NOHCC table. Please mind it's fundamental to have a filter predicate, and that the execution plan shows we are doing a full table scan. Only with a filter predicate does the storage server have a reason to build a storage index:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

Now let’s see if the storage layer built a storage index:

CellCLI> alter cell events = "immediate cellsrv.cellsrv_storidx('dumpridx','all',18716,5,1126366144)";
Dump sequence #1 has been written to /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Cell enkcel01 successfully altered

This is why it’s handy to have the aforementioned data on the table: you can dump the specific storage indexes for a single table. The three numbers after ‘all’ are data_object_id, tablespace number and the ksqdngunid from the x$ksqdn table.

Let’s look in the file which the cell event produced:

# less /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
Trace file /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_58.trc
ORACLE_HOME = /opt/oracle/cell
System name:    Linux
Node name:      enkcel01.enkitec.com
Release:        2.6.39-400.128.17.el5uek
Version:        #1 SMP Tue May 27 13:20:24 PDT 2014
Machine:        x86_64
CELL SW Version:        OSS_11.2.3.3.1_LINUX.X64_140529.1

*** 2014-07-12 13:16:17.030

*** 2014-07-12 13:16:17.030
UserThread: LWPID: 27455 userId: 58 kernelId: 58 pthreadID: 0x7f2863fe7940
2014-07-12 13:21:36.499123 :0005C9DE: $$$ Dumping storage idx summary for griddisk DATA_CD_08_enkcel01:
2014-07-12 13:21:36.499217 :0005C9E0: Dump sequence #1:

*** 2014-07-12 13:21:36.499
2014-07-12 13:21:36.499212 :0005C9DF: 
***************************
2014-07-12 13:21:36.499249 :0005C9E1: Dumping RIDX summary for objd 18716, tsn 5, dbid 1126366144

2014-07-12 13:21:36.499249*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-12 13:21:36.499249*: RIDX: strt 0 end 2048 offset 6215958528 size 1048576 rgnIdx 5928 RgnOffset 0 scn: 0x099a.5cc294b8 hist: 2
2014-07-12 13:21:36.499249*: RIDX validation history: 
2014-07-12 13:21:36.499249*: 0:FullRead 1:Undef 2:Undef 3:Undef 4:Undef 5:Undef 6:Undef 7:Undef 8:Undef 9:Undef
2014-07-12 13:21:36.499249*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-12 13:21:36.499249*: lo: c3 c 34 11 0 0 0 0
2014-07-12 13:21:36.499249*: hi: c3 5a 5 18 0 0 0 0

Here you see the typical heading of an Oracle trace file, next the announcement of the dump (“Dumping storage idx summary”,”Dumping RIDX summary”). The real storage index information starts with “RIDX(0x7f27b4ddab64)”. Starting with that and until the “lo” and “hi” values, you are looking at an actual storage index which holds data for a single column. You can see which column by looking at the “Col id” in the square brackets: 1. It’s interesting to note that there is a scn (system change number) included. The storage index shows if there are NULL values in the column (in this case it says HASNONNULLVALUES, so we don’t have any NULL values in the 1MB chunk in the column this storage index describes), and, of course, the low and high values in the Oracle internal data format.
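
The "lo" and "hi" byte strings are in Oracle's internal NUMBER format. As a side note, for positive integers this format is commonly described as an exponent byte (0xC1 plus the base-100 exponent) followed by base-100 digits each stored with 1 added; under that assumption, a small decoder can turn the dumped bytes into readable values. This helper is mine, not part of the trace or cellsrv, and it only handles positive integers:

#include <stdint.h>
#include <stddef.h>

/* Decode a positive integer Oracle NUMBER from its internal byte format,
   e.g. the "lo: c3 c 34 11 0 0 0 0" line of the storage index dump.
   Trailing 0x00 bytes are padding and terminate the mantissa. */
uint64_t decode_positive_oracle_number(const uint8_t *b, size_t maxlen)
{
    if (maxlen == 0 || b[0] < 0xc1)
        return 0;                       /* zero/negative not handled here */

    int exponent = b[0] - 0xc1;         /* power of 100 of the first digit */
    uint64_t value = 0;
    int ndigits = 0;

    for (size_t i = 1; i < maxlen && b[i] != 0; i++) {
        value = value * 100 + (uint64_t)(b[i] - 1);  /* digits stored +1 */
        ndigits++;
    }

    /* scale up when there are fewer mantissa digits than the exponent implies */
    for (int i = ndigits; i <= exponent; i++)
        value *= 100;

    return value;
}

Decoded this way, the "lo" bytes above (c3 c 34 11) would give 115116 and the "hi" bytes (c3 5a 5 18) would give 890423, i.e. the lowest and highest id values in that 1MB region, if the layout assumption holds.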

So, despite the absence of any indication on the database layer, the query built storage indexes! That should mean that executing the same query again will result in actually using the storage indexes which were just built:

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906259;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

Yes! This is using the storage indexes we caused to be built previously!

With this being established, we can try to get an answer to our question: how does the storage index react to DML on the table it is describing?

We built the storage index and used it. Now let's update the 'id' field for which the storage index was built, and redo our query test:

MARTIN:dbm011> update bigtab_nohcc set id = id + 1;

16000000 rows updated.

MARTIN:dbm011> commit;

Commit complete.

Okay, now let’s redo the select again, and take stats before and after on storage index usage!

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

What we see here with the storage index session statistic is that the update statement didn't use the storage index (which is obvious: the update happened on the database layer, so no smart scan was used), because this number is the same as the last time we looked at it, before the update statement.
When we executed the select query with the filter on the id column again, there was no storage index usage either, because the storage index session statistic didn't increase.

Actually, this query built new storage indexes. When this query is executed again, we can use these:

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    5746368512

MARTIN:dbm011> select count(*) from bigtab_nohcc where id=906260;

   COUNT(*)
-----------
	 16

MARTIN:dbm011> @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				   14534426624

(actually, it can sometimes take more than one execution for the storage indexes to be created again; my guess would be that some heuristics are used to try to come up with the best candidates for storage indexes)

Let's try another case: update a column other than the one for which the storage indexes were created.
First make sure storage indexes are built:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index					     0

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

For completeness' sake, I verified the storage indexes for this table by using the dump shown previously. A simple grep on the dump shows this query created storage indexes for only column 1, the id column:

grep -e RIDX_VALID -e 'Col id' /opt/oracle/cell/log/diag/asm/cell/enkcel01/trace/svtrc_26565_4.trc
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddab64) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddac40) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 4 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814235*: RIDX(0x7f27b4ddad1c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814235*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddadf8) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddaed4) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
2014-07-13 03:34:46.814587*: RIDX(0x7f27b4ddb08c) : st 2(RIDX_VALID) validBitMap 0 tabn 0 id {18716 5 1126366144}
2014-07-13 03:34:46.814587*: Col id [1] numFilt 5 flg 2 (HASNONNULLVALUES): 
...etc.

Now let’s update another field:

MARTIN@dbm011 > update bigtab_nohcc set spcol = spcol + 1;

16000000 rows updated.

MARTIN@dbm011 > commit;

Commit complete.

And query the storage index use, and do our query with filter predicate again:

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    8790155264

So… this strongly suggests the update invalidated our storage index, even though the updated column was not in the storage index.
Just for completeness' sake, let's issue the select statement again to see if the storage index is built up again:

MARTIN@dbm011 > select count(*) from bigtab_nohcc where id=906260;

  COUNT(*)
----------
	16

MARTIN@dbm011 > @mystats
Enter value for name: %storage%
old   4: and name like nvl('%&name%',name)
new   4: and name like nvl('%%storage%%',name)

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO bytes saved by storage index				    9808936960

Yes, there it is again: the previous statement didn't benefit from the storage indexes, but built them, and now we take advantage of them again.

Conclusion.
I ran these tests on a quarter rack without any other usage. The method I used for measuring how storage indexes behave was to execute statements on the database layer and see the result on the storage layer and in the database layer session statistics. It's important to realise that despite this being quite strong evidence, it isn't definite proof of my conclusion. Also, the behaviour described can change in future storage server versions.

My conclusion is that the storage server uses a system change number to validate whether the storage indexes are valid. If not, the storage indexes are not used. If blocks covered by a storage index progress their system change number (in other words: are updated), the storage index is invalidated right away (during my tests, once the update started, no or few storage indexes remained).

So, the ones that picked storage indexes being invalidated were right. As far as I could detect, there is no updating of storage indexes.
