Oracle database processes and waiting in long lines at the supermarket
Hopefully the weird title of this blogpost caught your interest. This blogpost is about sensible usage of an Oracle database. There are probably a lot of blog posts like this, but I need to get this off my chest.
A quote any Star Wars fan would recognise is ‘I sense a disturbance in the Force’. I do, and I have felt it for a long time. This disturbance is the way the number of connections to a database is used. Because of my profession, that database is the Oracle database, but this really applies to the server side of any client/server architecture running on (but probably not limited to) Intel Xeon processors.
The disturbance is the oversubscription, sometimes even excessive oversubscription, of database connections by application servers, or by any other client that results in database processes. This does not exclude parallel query processes; in other words, this applies to parallel query processes too.
What is oversubscription of database connections? Any good consultant would be able to tell you: it depends. In principle, oversubscription means more subscribers than a system can handle. That makes ‘oversubscription’ a multidimensional beast, which can apply to CPU, memory, disk IO and network IO. That is what makes it hard.
This blogpost is about CPU oversubscription. The way a modern CPU, an Intel Xeon in this case, works is not simple, and thus this will not, and cannot, be an exhaustive description. What I want to do is provide a basic, simplified description that gives an idea and some guidance.
An Intel Xeon CPU provides a number of processing units, called ‘processor’ in /proc/cpuinfo, called ‘cpu’ in the top utility, and so on, which are the execution contexts for processes. These execution contexts can be hyperthreads, which for Intel Xeon means two threads per core, or a single execution context per core. Not all Xeon CPUs provide hyperthreading, and hyperthreading, if available in the CPU, can be disabled in a system’s BIOS. A hyperthread cannot, and does not, do any processing; that is what a core does.
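As a quick check of what a system actually provides, here is a minimal sketch (Python on Linux, reading /proc/cpuinfo; the output labels are just illustrative) that counts logical processors versus physical cores:

```python
# Minimal sketch for Linux: count logical processors ("processor" entries in
# /proc/cpuinfo, what top calls "cpu") versus physical cores.
with open("/proc/cpuinfo") as f:
    lines = f.read().splitlines()

logical = sum(1 for line in lines if line.startswith("processor"))

# A physical core is identified by the (physical id, core id) pair;
# in /proc/cpuinfo the "physical id" line precedes the "core id" line per entry.
cores = set()
physical_id = None
for line in lines:
    if line.startswith("physical id"):
        physical_id = line.split(":")[1].strip()
    elif line.startswith("core id"):
        cores.add((physical_id, line.split(":")[1].strip()))

print(f"logical processors (threads): {logical}")
print(f"physical cores              : {len(cores)}")
print(f"hyperthreading active       : {logical > len(cores)}")
```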
So why do hyperthreads exist then? I think two prominent reasons are:
1. Not all processes are active all the time. By increasing the number of execution contexts, the switching of processes between execution contexts is reduced, which means a reduction of the time spent on context switches.
2. Cores are incredibly powerful, and given that probably not all processes are active all the time, combining two processes on the same core will be “reasonably unnoticeable”.
The next question that comes to mind is: is hyperthreading a good thing for a database? I cannot give an exhaustive answer, partially because there is a massive difference in system usage between different types of database usage. I would say that with reasonable usage, hyperthreading in modern Intel Xeon CPUs provides more benefit (reduced context switching) than downside (such as variance in the latency of CPU usage).
This means that when looking at the CPU processing power of a database server, the ‘actual’ processing power sits logically between the number of cores and the number of threads. But wait! Didn’t I just say that threads don’t process, only cores do? Yes, but let me explain how I look at this: if your processes do ACTUAL, ON CPU processing, they depend on a core being able to handle it. The other type of process runs into waiting really quickly, by doing disk IO, network IO or waiting for users to make it active; such processes might appear to be running all the time, but in reality are only occasionally processing on a core.
For the first type, doing actual, on-CPU processing, you should calculate CPU power more towards the core count; for the second type, doing lots of stalls, you should calculate more towards the thread count. Overall, this is quite simply about using a core as efficiently as possible.
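To make this concrete, here is a rough sketch with made-up numbers; this is my rule of thumb expressed in code, not a formula from Oracle. It takes the core count for CPU-heavy work, the thread count for work that stalls a lot, and lands somewhere in between depending on how CPU-bound the workload is.

```python
# Rough sizing sketch: interpolate between core count and thread count based on
# how CPU-bound the workload is. The server numbers and the 0..1 "cpu_bound"
# factor are assumptions for illustration only.
cores = 36     # physical cores in the server (example)
threads = 72   # logical processors with hyperthreading (example)

def max_processes(cpu_bound_fraction: float) -> int:
    """1.0 -> almost always on CPU, size towards cores;
       0.0 -> mostly stalling on IO or idle, size towards threads."""
    return round(threads - cpu_bound_fraction * (threads - cores))

print(max_processes(1.0))   # 36: batch / parallel query style work
print(max_processes(0.2))   # 65: OLTP-ish work that stalls often
```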
Now that we have gone through CPUs with their cores and threads, and oversubscription in general, the next question is: how many processes should be allocated on a database server?
The answer is simple: if you want a high performing database for your application servers, the number of processes IN TOTAL should not exceed a number sitting somewhere between CPU core count and CPU thread count.
“Isn’t that incredibly low?”
Yes, for most of the deployments that I see this would be shockingly low. But just because the number of processes is set very high somewhere doesn’t make it right. It just means it’s set that way.
“But why is it set too high everywhere?”
I don’t know. I don’t understand why lots and lots of people allocate high, sometimes ASTRONOMICAL, numbers of database processes and then expect that to be the best tuned setup, while there is NO LOGICAL EXPLANATION that I can see for it to make sense. In fact, the explanation of why it doesn’t make sense is this blogpost.
To make the comparison with a supermarket and the number of tills: if you go shopping in a supermarket and want to pay and leave as soon as possible, there should be an idle till ready, or else you have to wait. For Intel Xeon hyperthreading, you could make the comparison with a till that serves two lanes of people who want to pay at the same time: it takes time to put all the items from the shopping basket onto the counter, and the more time that takes, the more efficient a till serving two lanes becomes (an Intel Xeon core can actually serve two threads at the same time, optimising runtime on the single core).
“Okay, but the majority of the processes is not actually doing anything.”
Well, if the processes are actually not doing anything, why have them in the first place? If that is really true, it doesn’t make sense to have them. And don’t forget: what looks like an idle connection from both the application and the database perspective is still a live, running operating system process and a running database process that has memory allocated, occupies a network socket, holds breakable parse locks, and so on, and requires CPU time to maintain the connection.
In fact, by having huge numbers of database connections, you have set up the application to be able to cause “the perfect storm” on the database!
By this I mean that what I normally see is that indeed the majority of the database connections are not used. However… if things get worse, and the database gets active and starts lacking CPU processing power, more database connections become active. That is logical, right? The database connections that would normally be active take longer because of the increased activity, so with a constant amount of work, new work cannot use an existing connection because that connection is still busy, and thus takes another connection that would normally sit idle. However, serving more connections increases the amount of CPU required even further, which was already lacking, so the waiting time increases further. And because the waiting time gets higher, even more connections are needed, and so on.
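To show why this spiral happens, here is a toy calculation; the arrival rates and CPU time per request are made up, and the queueing formula is a crude single-queue approximation, not a benchmark. It uses Little's law: the number of busy connections is roughly the arrival rate times the response time, so once demand approaches CPU capacity, the response time, and with it the number of busy connections, explodes.

```python
# Toy illustration of the "perfect storm": as offered load approaches CPU
# capacity, response time stretches, and by Little's law (busy = rate x response
# time) more and more connections are simultaneously active.
cores = 16
cpu_per_request = 0.05                  # seconds of CPU per request (assumed)

for rate in (100, 200, 280, 310, 330):  # requests per second (assumed)
    utilisation = rate * cpu_per_request / cores
    if utilisation < 1.0:
        response = cpu_per_request / (1.0 - utilisation)  # crude stretch factor
        busy = rate * response                            # Little's law
        print(f"{rate:4d} req/s  util {utilisation:5.2f}  "
              f"resp {response * 1000:7.1f} ms  busy connections ~{busy:6.1f}")
    else:
        print(f"{rate:4d} req/s  util {utilisation:5.2f}  "
              f"demand exceeds capacity: queue and connection count grow without bound")
```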
And I haven’t even talked about dynamically growing connection pools yet!
What I mean is that up to now I have talked about STATIC connection pools. Static means the minimum number of connections in the pool equals the maximum. A dynamic connection pool starts with a certain number of connections and, when there is a need for more, which means all the connections are busy, adds more connections.
Especially with Oracle, this is a really bad idea. Let me explain. Apart from having too many connections in the first place, which is already a bad idea, an expanding connection pool means that not only are idle connections put to work, but the database is given EVEN MORE work by having to initialise new connections. An Oracle database connection is not lightweight: it requires initialising memory, which is an expensive operation. And the whole reason the connection is created is that the connection pool established that all its connections were busy, which almost certainly means the database was busy (!!!!).
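To illustrate the difference, here is a sketch using the python-oracledb driver (the credentials and connect string are placeholders): a static pool sets min equal to max, so every connection is created up front; a dynamic pool has min below max plus an increment, so it creates new, expensive connections exactly when the database is already busy.

```python
# Sketch using the python-oracledb driver; user/password/dsn are placeholders.
import oracledb

# Static pool: min == max, so every connection is created at startup and the
# pool never has to initialise new server processes while the database is busy.
static_pool = oracledb.create_pool(
    user="app", password="secret", dsn="dbhost/service",
    min=20, max=20,
)

# Dynamic pool: starts small and grows under load -- which means the most
# expensive operation (creating and initialising a new database connection)
# happens exactly when the database is already at its busiest.
dynamic_pool = oracledb.create_pool(
    user="app", password="secret", dsn="dbhost/service",
    min=5, max=200, increment=5,
)

conn = static_pool.acquire()   # borrow a connection
# ... run SQL ...
static_pool.release(conn)      # return it to the pool promptly
```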
I hope a lot of people will make it to the end and then realise that high numbers of connections do not make any sense. If you do have an explanation that makes sense, please comment. Please mind that a tuned setup requires the application server to be reasonably set up too; you cannot have one part set up for ultimate processing power and another part be a shipwreck.
There are some messages about performance (including yours here) that need to be restated forever.
Tom Kyte used to say that for many OLTP systems, the quickest way to improve performance was to cut the size of the connection pools in half. The problem he mentioned was contention: too many processes at once needing access to the same resource. Imagine in the supermarket two cashiers fighting over one till.
The scope of this subject is wider than the database. Many applications have big connection pools because they use each connection inefficiently. I always strongly recommend testing applications with a pool of ONE connection, to make sure it is being well used, and above all released quickly to the pool for reuse. Then add connections one by one and see whether use of the database increases linearly.
We have to remember that application developers write code to “open” and “close” connections, but in reality that code is intercepted by the connection pool manager. Some connection pool managers can be configured *not to release the connection back to the pool* upon “close”, just in case the application might need it again.
Bottom line, performance tuning around connections requires appropriate configuration of the database, the application and whatever software the application uses to access the database.
Best regards, Stew
“An oracle database connection is not lightweight, it requires initializing memory, which is an expensive operation”
This is absolutely correct. I have experienced a real-life case where a switch from WebLogic to a JBoss server increased the number of connected sessions (and processes), which ended up crashing the database. The database had a big SGA and was not using huge pages. The extra incoming connections forced the Linux kernel to manage a very large number of 4K memory page table entries; the system started swapping, burning a lot of CPU, and froze. We had to switch to 2MB huge pages in order to reduce the number of page table entries mapping virtual to physical memory when there are many connections to handle.
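As a rough back-of-the-envelope illustration (the SGA size and session count below are made-up numbers):

```python
# Back-of-the-envelope page table maths; SGA size and session count are made up.
# With 4K pages, each dedicated server process maps the SGA it touches with its
# own page table entries (~8 bytes each on x86_64); with 2MB huge pages there
# are 512x fewer entries, and on Linux the hugetlb page tables can be shared.
GB = 1024 ** 3
sga = 100 * GB
sessions = 500
pte_size = 8                                        # bytes per page table entry

pt_4k_per_process = sga // 4096 * pte_size          # worst case per process
pt_2m_total = sga // (2 * 1024 * 1024) * pte_size   # shared across processes

print(f"4K pages : {pt_4k_per_process / GB:.2f} GB of page tables per process, "
      f"up to {pt_4k_per_process * sessions / GB:.0f} GB for {sessions} sessions")
print(f"2MB pages: {pt_2m_total / 1024:.0f} KB of page tables in total")
```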
Best regards
Mohamed Houri
Frits, it makes total sense! I used to work for a fintech here in Brazil; they had hundreds of sessions per application server while we were running the database on 2 nodes of a full rack Exadata X5. Also, the services were configured to run the OLTP application on one node only, meaning we had a maximum processing capacity of 36 cores (72 threads). So my question to the “architects” was always what you explained here, and I showed them, proved to them, that it was not efficient and did not make any sense. Every time we had higher than normal concurrency in the database, that application behaviour made things worse. But in the end I was the one who was wrong, and the problem was the database not being able to handle the workload. Solution: migrate to X7. A few years later they are now on X8M with the same architecture. Thank you for sharing these thoughts.
It’s all about understanding, which is the whole intention of this blogpost. You cannot solve an issue if you don’t understand it. I am afraid lots of architects don’t understand what they are doing, and thus make choices based on… what? Probably “somebody told me”, “I read this somewhere”, “we’ve always done it this way”, or some other hollow, bogus reason.