Recently I was asked to go to a client who had gone to production with an Oracle 11 database.
For the first few days the application worked, but was very slow. Those happened to be the first few days of the year, which meant very few of the client's customers were using it. Once the normal number of customers tried to use it, the application broke down.
The application was written using the .NET framework: a frontend for visualisation, an application backend which served the frontend through web services, and a database for the data. The application-layer servers had a low load (CPU busy around 30-40%), while the database server was quite busy (CPU busy 70-100%, load ~10). The .NET programmers, led by their lead programmer, were the driving 'tuning' force.
Much of the performance investigation had been done by the .NET programming team throughout the development process. The .NET suite apparently has some way of doing this fully automated and integrated within the development environment (so I was told by the .NET lead developer; I don't know it myself). In fact, it's all so nicely integrated that Oracle doesn't fit in well, because it's not object oriented (?) and not integrated in the Microsoft tools. Better to use the Microsoft database, then. I listened and learned…
But, nonetheless, it had all been tested and tuned. And the database was always the slowest part.
And, to be absolutely sure, an external party was hired to do load testing prior to going to production with the shiny new .NET application. During the load testing it was the database which turned out to be the limiting factor. And of course it had been tested with a higher number of users than would actually use the application.
So they went to production with full confidence. And still it broke down.
That is where I came in. My task was to tune that $@#^$@# database so it worked well with the application, bringing the application back to life so the client could fulfil its core business.
My first question was whether somebody could explain and draw the application's architecture. Nobody could; everyone pointed to the lead developer. The lead developer talked a lot, but did not provide any kind of architecture explanation or drawing. The three layers the database administrators had described to me (application frontend, application backend and database) were as far as it went.
And, because the database is always slow, nobody was interested in explaining anything. It was very clear to them that the database was causing all the problems, so I should stop asking and just tune. That was what I was hired for.
I reluctantly asked if there was any kind of instrumentation in the application, to be absolutely sure it really was the database causing all the grief. The answer was what I had come to expect in this environment: NO. And it wasn't necessary; it had all been tested and tuned. Could I please just tune that database, please?
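By instrumentation I mean nothing fancy: just the application recording how long it spends in each layer. A minimal sketch of the idea, in Python for brevity (the application itself was .NET, and the layer names and sleeps here are purely illustrative):

```python
import time
from collections import defaultdict

# Accumulated wall-clock time per named layer ("app", "db", ...).
timings = defaultdict(float)

def instrumented(layer):
    """Decorator that adds each call's elapsed time to its layer's total."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[layer] += time.perf_counter() - start
        return inner
    return wrap

@instrumented("db")
def run_query():
    time.sleep(0.05)  # stand-in for a database round trip

@instrumented("app")
def handle_request():
    run_query()
    time.sleep(0.01)  # stand-in for application-side work

handle_request()
# Note "app" includes the nested "db" time; app-only time is the difference.
print({layer: round(secs, 3) for layer, secs in sorted(timings.items())})
```

With even this crude kind of timing in place, "it's the database" becomes a measurement instead of an article of faith.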
Luckily, the project leader saw and understood that I was trying to break the application stack up into logical components and measure each of them, in order to narrow down the scope of the performance tuning. The lead developer was doing the opposite: he kept including more components in his investigation, because nothing he had done so far had led to any improvement.
Through the project leader I was able to get the entire application stack in a private environment; only the database was shared with production. In this setup I could run the application stack with the connection pool set to 1, so I was able to measure how much time was spent in the application and how much in the database.
This led to an interesting observation, even before looking at the measurements: the private application stack performed reasonably, whilst the production stack was still unworkable.
To most trained and untrained performance tuners, this undeniably proves that at least some of the performance problems lie outside the database, which means in the beloved .NET application… Of course there are always individuals who interpret things differently. I guess you know who chose to ignore this; the load of 10 on the database is undeniably too high, right? (Which is a myth too. Load is implemented slightly differently on different operating systems, but is essentially the number of system-wide processes which are running or waiting on a resource (yes, this is a very rough description). It can be used to get an idea of how busy a server is, but should be divided by the number of processors (this too is very rough).)
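To illustrate the rough per-processor rule of thumb (Python sketch; os.getloadavg() is Unix-only, and the numbers obviously depend on the machine you run it on):

```python
import os

# Load average is a system-wide figure; dividing by the number of CPUs
# gives a rough per-processor number. A "load of 10" on a 16-CPU server
# is a very different situation from a load of 10 on a 2-CPU server.
load_1m, load_5m, load_15m = os.getloadavg()
cpus = os.cpu_count()
print(f"1-min load {load_1m:.2f} over {cpus} CPUs "
      f"-> {load_1m / cpus:.2f} per CPU")
```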
We later got the application back to life using several modifications to the environment, and I still don't know which of them contributed most to the performance, because of the lack of instrumentation. My guess (sorry Alex!) would be the increase of the upper bound of the ephemeral port range and the reduction of the client TCP/IP socket connection timeout (http://msdn.microsoft.com/en-us/library/aa560610.aspx), because web services are used between the application frontend and backend.
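For reference, these are the two Windows registry knobs that article is about, as cmd commands. The values below are illustrative, not the exact ones we ended up using:

```shell
:: Hypothetical values -- tune for your own environment.
:: Raise the upper bound of the ephemeral (dynamic) port range,
:: so more outbound web-service connections can be open at once:
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters ^
    /v MaxUserPort /t REG_DWORD /d 65534 /f
:: Shorten how long closed sockets linger in TIME_WAIT (seconds),
:: so ports are recycled faster under a high connection rate:
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters ^
    /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
```

Both changes require a reboot to take effect.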