Recently I was looking in the trace directory in the diag dest of my (12.2.0.1) instance, and found some trace files which were different from the others:

$ ls | grep dia0
test_dia0_9711_base_1.trc
test_dia0_9711_base_1.trm
test_dia0_9711_lws_1.trc
test_dia0_9711_lws_1.trm
test_dia0_9711.trc
test_dia0_9711.trm

The dia0 ‘test_dia0_9711.trc’ file is fully expected. But what are these ‘..lws_1.trc’ and ‘..base_1.trc’ files? And ‘base’ is something that I understand, but what would ‘lws’ mean? Lunatics Without Skateboards?

First, let’s look into the normally named trace file of the dia0 process:

$ cat test_dia0_9711.trc
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:25.849759+00:00
HM: Moving the default trace file to: test_dia0_9711_base_1.trc


*** 2017-11-26T16:45:33.193729+00:00
HM: Tracing moved from base trace file: test_dia0_9711_base_1.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

HM: Tracing is now being performed to this file.
2017-11-26 16:45:33.193 :kjzdshn(): Shutting down DIAG

Ah! So LWS means ‘Long Waiting Sessions’. This seems to be functionality that produces trace files for executions that do not perform optimally.

Let’s look into the ‘base’ trace file:

*** 2017-11-26T12:18:25.849773+00:00
HM: Tracing moved from default trace file: test_dia0_9711.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

*** 2017-11-26T12:18:25.849903+00:00
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:30.870494+00:00
HM: Detected incompatible setting 'INSTANCE' for Hang Resolution Scope
    - disabling hang resolution.

I get the feeling this was originally made for RAC, but now also works for single instance databases. Some RAC related features seem to be turned off. It seems that the actual session information is stored in the ‘lws’ trace files.

This is a sample of what is in an lws trace file:

*** 2017-11-26T13:14:25.053171+00:00
HM: Early Warning - Session ID 58 serial# 53774 OS PID 30760 (FG) is
     'not in a wait' for 33 seconds
     last waited on 'resmgr:cpu quantum'
     p1: 'location'=0x3, p2: 'consumer group id'=0x4a2f, p3: ' '=0x0

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     1      0      0      0      0      0      0      0 not in wait


HM: Dumping Short Stack of pid[71.30760] (sid:58, ser#:53774)
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-sslnxadd()+38<-pfrinstr_ADDN()+103<-pfrrun_no_tool()+60<-pfrrun()+919<-plsql_run()+756<-peicnt()+288<-kkxexe()+717<-opiexe()+22531<-kpoal8()+2679<-opiodr()+1229<-ttcpip()+1257<-opitsk()+1940<-opiino()+941<-opiodr()+1229<-opidrv()+1021<-sou2o()+145<-opimai_real()+455<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

HM: Current SQL: declare
        x number := 0;
begin
        for c in 1..1000000000 loop
                x := x + c;
        end loop;
end;

I had a simple PLSQL loop running to generate load (which is admittedly silly), and running it for 33 seconds apparently generates an ‘Early Warning’. It shows what the session was doing from a wait based analysis perspective (‘not in a wait’, ‘resmgr:cpu quantum’), it shows some statistics for the hang analyser, and it dumps a short stack, which is fully expected. ksedsts() to __sighandler() are the routines for generating the stack dump; the actual function that was executing was sslnxadd(), which is a system specific linux add function, called by pfrinstr_ADDN(), the PLSQL add function, which is called by pfrrun_no_tool(), the PLSQL fast interpretation loop function, and so on. The last thing it dumps is the current SQL, which is the silly PLSQL code.

Is there anything inside the Oracle database that shows anything related to this? Yes, there are a few views:

V$HANG_INFO
V$HANG_SESSION_INFO
V$HANG_STATISTICS
DBA_HANG_MANAGER_PARAMETERS

Doing some simple tests, I could not get V$HANG_INFO and V$HANG_SESSION_INFO populated. V$HANG_STATISTICS contains 87 statistics and displays current data, most of which are 0 in my single instance test database.
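
These can simply be queried; a trivial sketch (I just select everything, because the exact column layout differs between versions):

SQL> select * from v$hang_info;
SQL> select * from v$hang_session_info;
SQL> select count(*) from v$hang_statistics;
SQL> select * from dba_hang_manager_parameters;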

One thing that confused me at first was the abbreviation ‘HM’ for hang monitor, because the very same abbreviation is used in the diagnostic repository for ‘health monitor’.

It seems the hang manager is quite an extensive framework for detecting hangs and hang related issues. I peeked at the documented and undocumented parameters containing “hang”, of which there is an impressive number, most of them quite likely belonging to the hang manager framework:

SQL> @parms
Enter value for parameter: hang
Enter value for isset:
Enter value for show_hidden: Y
NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_global_hang_analysis_interval_secs                10                                                                     TRUE     FALSE      FALSE
_hang_allow_resolution_on_single_nonrac            FALSE                                                                  TRUE     FALSE      FALSE
_hang_analysis_num_call_stacks                     3                                                                      TRUE     FALSE      FALSE
_hang_base_file_count                              5                                                                      TRUE     FALSE      FALSE
_hang_base_file_space_limit                        100000000                                                              TRUE     FALSE      FALSE
_hang_blocked_session_percent_threshold            10                                                                     TRUE     FALSE      FALSE
_hang_bool_spare1                                  TRUE                                                                   TRUE     FALSE      FALSE

_hang_bool_spare2                                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_boundary_hang_detection_enabled        TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_cluster_hang_detection_enabled         TRUE                                                                   TRUE     FALSE      FALSE
_hang_delay_resolution_for_libcache                TRUE                                                                   TRUE     FALSE      FALSE
_hang_detection_enabled                            TRUE                                                                   TRUE     FALSE      FALSE
_hang_enable_processstate                          TRUE                                                                   TRUE     FALSE      FALSE
_hang_hang_analyze_output_hang_chains              TRUE                                                                   TRUE     FALSE      FALSE
_hang_hiload_promoted_ignored_hang_count           2                                                                      TRUE     FALSE      FALSE
_hang_hiprior_session_attribute_list                                                                                      TRUE     FALSE      FALSE
_hang_hung_session_ewarn_percent                   34                                                                     TRUE     FALSE      FALSE
_hang_ignored_hang_count                           1                                                                      TRUE     FALSE      FALSE

_hang_ignored_hangs_interval                       300                                                                    TRUE     FALSE      FALSE
_hang_ignore_hngmtrc_count                         1                                                                      TRUE     FALSE      FALSE
_hang_int_spare1                                   0                                                                      TRUE     FALSE      FALSE
_hang_int_spare2                                   0                                                                      TRUE     FALSE      FALSE
_hang_log_io_hung_sessions_to_alert                FALSE                                                                  TRUE     FALSE      FALSE
_hang_log_verified_hangs_to_alert                  FALSE                                                                  TRUE     FALSE      FALSE
_hang_long_wait_time_threshold                     0                                                                      TRUE     FALSE      FALSE
_hang_lws_file_count                               5                                                                      TRUE     FALSE      FALSE
_hang_lws_file_space_limit                         100000000                                                              TRUE     FALSE      FALSE
_hang_max_session_hang_time                        96                                                                     TRUE     FALSE      FALSE
_hang_monitor_archiving_related_hang_interval      300                                                                    TRUE     FALSE      FALSE

_hang_msg_checksum_enabled                         TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_allow_archiving_issue_termination TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_confidence_promotion              FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_global_hang_confidence_promotion  FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_percent_hung_sessions_threshold   300                                                                    TRUE     FALSE      FALSE
_hang_resolution_policy                            HIGH                                                                   TRUE     FALSE      FALSE
_hang_resolution_promote_process_termination       FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_scope                             OFF                                                                    TRUE     FALSE      FALSE
_hang_running_in_lrg                               FALSE                                                                  TRUE     FALSE      FALSE
_hang_short_stacks_output_enabled                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_signature_list_match_output_frequency        10                                                                     TRUE     FALSE      FALSE

_hang_statistics_collection_interval               15                                                                     TRUE     FALSE      FALSE
_hang_statistics_collection_ma_alpha               30                                                                     TRUE     FALSE      FALSE
_hang_statistics_high_io_percentage_threshold      25                                                                     TRUE     FALSE      FALSE
_hang_terminate_session_replay_enabled             TRUE                                                                   TRUE     FALSE      FALSE
_hang_trace_interval                               32                                                                     TRUE     FALSE      FALSE
_hang_verification_interval                        46                                                                     TRUE     FALSE      FALSE

For the sake of experimenting with the hang manager, I locked a table in one session using ‘lock table t2 in exclusive mode’ and had another session perform an insert into the same table. The ‘base’ trace file showed the following information:

*** 2017-12-11T06:25:10.364037+00:00
HM: Hung Sessions (local detect) - output local chains
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7526
              process id: 38, oracle@memory-presentation.local (TNS V1-V3)
              session id: 27
        session serial #: 32599
    }
    is waiting for 'enq: TM - contention' with wait info:
    {
                      p1: 'name|mode'=0x544d0003
                      p2: 'object #'=0x57e6
                      p3: 'table/partition'=0x0
            time in wait: 1 min 35 sec
           timeout after: never
                 wait id: 74
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000123 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 26.493425 sec
                     wait id: 73               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000025 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000002 sec
                     wait id: 72               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000005 sec
              3.       event: 'SQL*Net break/reset to client'
                 time waited: 0.000055 sec
                     wait id: 71               p1: 'driver id'=0x62657100
                                               p2: 'break?'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7481
              process id: 24, oracle@memory-presentation.local (TNS V1-V3)
              session id: 55
        session serial #: 9654
    }
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 1 min 39 sec
           timeout after: never
                 wait id: 6714
                blocking: 1 session
            wait history:
              * time between current wait and wait #1: 0.000006 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6713             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000203 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 17.115734 sec
                     wait id: 6712             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000007 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6711             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
    }

Chain 1 Signature: 'SQL*Net message from client'<='enq: TM - contention'
Chain 1 Signature Hash: 0x163c4cba
-------------------------------------------------------------------------------

===============================================================================

If you wait a little longer, another type of dia0 trace file appears: ‘..vfy_1.trc’, a file containing ‘verified hangs’. A verified hang appears once for a given hang situation, and looks like this:

One or more possible hangs have been detected on your system.
These could be genuine hangs in which no further progress will
be made without intervention, or it may be very slow progress
in the system due to high load.

Previously detected and output hangs are not displayed again.
Instead, the 'Victim Information' section will indicate that
the victim is from an 'Existing Hang' under the 'Existing or'
column.

'Verified Hangs' below indicate one or more hangs that were found
and identify the final blocking session and instance on which
they occurred. Since the current hang resolution state is 'OFF',
there will be no attempt to resolve any of these hangs unless
hang resolution is enabled.  Terminating this session may well
free up the hang. Be aware that if the column named 'Fatal BG'
is marked with 'Y', terminating the process will cause the
instance to crash.

Any hangs with a 'Hang Resolution Action' of 'Unresolvable'
will be ignored. These types of hangs will either be resolved
by another layer in the RDBMS or cannot be resolved as they may
require external intervention.

Deadlocks (also named cycles) are currently NOT resolved even if
hang resolution is enabled.  The 'Hang Type' of DLCK in the
'Verified Hangs' output identifies these hangs.

Below that are the complete hang chains from the time the hang
was detected.

The following information will assist Oracle Support Services
in further analysis of the root cause of the hang.


*** 2017-12-11T07:05:07.989085+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

                     number of hangs detected 1
                        number of local hangs 1
  hangs ignored due to application contention 1
      hangs:total number of impacted sessions 2
    hangs:minimum number of impacted sessions 2
    hangs:maximum number of impacted sessions 2
                      current number of hangs 1
            current number of active sessions 1
              current number of hung sessions 2

*** 2017-12-11T07:05:07.989122+00:00
Verified Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
     1 HANG VICSELTD    1    58     2     2    LOW  LOCAL IGNRD:HngRslnScpOFF
 Hang Ignored Reason: Hang resolution is disabled.

  Inst  Sess   Ser             Proc  Wait    Wait
   Num    ID    Num      OSPID  Name Time(s) Event
  ----- ------ ----- --------- ----- ------- -----
      1     45  5781     27709    FG     137 enq: TM - contention
      1     58 15676     27700    FG     146 SQL*Net message from client

Wait event statistics for event 'SQL*Net message from client'

Self-Resolved Hang Count                        : 0
Self-Resolved Hang Total Hang Time in seconds   : 0
Self-Resolved Hang Minimum Hang Time in seconds : 0
Self-Resolved Hang Maximum Hang Time in seconds : 0
Resolved Hang Count                             : 0
Total Hung Session Count                        : 1
Total Wait Time in micro-seconds                : 59799468
Total Wait Count                                : 67
Moving Average of Wait Time in micro-seconds    : 5220981

Victim Information
                                                         Existing or  Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG       New Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       1      1      58    15676      27700     N           New Hang        1

It appears Oracle adds more and more useful information to trace files, information that can only be fetched from those trace files. Yet it doesn’t seem that any monitoring actually picks this up, which, to my knowledge, includes Oracle’s own Enterprise Manager.

The hang manager tracing talks about outliers; upon grepping through the dia0 trace files I found an explanation of what dia0 considers an outlier:

*** 2017-11-26T16:45:33.193479+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

            current number of active sessions 1
  instance health (in terms of hung sessions) 100.00%

Dumping Wait Events with non-zero hung session counts
                       or non-zero outliers wait counts
(NOTE: IO outliers have wait times greater than 32 ms.
       non-IO outliers have wait times greater than 64 s.
       IDLE wait events are not output.)

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     0      0      0   1366      2      0      0      1 log file parallel write
     1      0      0   2206 4294966594      1 4294966272      0 resmgr:cpu quantum
     0      0      0      1      1      0      0      1 external table read
     1      0      0      0      0      0      0      0 not in wait


*** 2017-11-26T16:45:33.193686+00:00
TotalCPUTm:4230 ms TotalRunTm:16027340 ms %CPU: 0.03%

Recently I was triggered to look into the ‘automatic big table caching’ feature introduced in Oracle version 12.1.0.2 by Roger Macnicol’s blogposts about Oracle database IO and caching or not caching (https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-1 and https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-2). If you want to read something about the feature in general, search for the feature name; you’ll find several blogposts about it.

If you are not familiar with automatic big table caching, it’s a feature which allows you to dedicate a percentage of the buffer cache to scanning big tables. Big tables are tables that are not small :-), which technically means tables bigger (a higher number of blocks) than the “_small_table_threshold” parameter, which defaults to 2% of the buffer cache or 20 blocks, whichever is bigger. The choice to store blocks in the big table cache area depends on the segment “temperature”, which essentially is a kind of touch count. There is one other caveat of using this feature, which I currently find weird: once a database is operating in clustered mode (RAC), serial scans can no longer use the automatic big table caching; its use is then exclusively for caching blocks scanned by parallel query workers.
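
As an aside, this is how the feature can be switched on and inspected; a minimal sketch, assuming a 12.1.0.2 or higher instance (the 40% target is just an example value, and I simply select everything from the views because the exact columns are version dependent):

SQL> alter system set db_big_table_cache_percent_target = 40;
SQL> select * from v$bt_scan_cache;
SQL> select * from v$bt_scan_obj_temps;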

The purpose of this blogpost is to investigate the IO implementation of using the big table caching feature. The available general read code paths that I am aware of in Oracle are:
– Single block reads, visible via the wait event ‘db file sequential read’, which always go to the buffer cache.
– Multiple single block reads, visible via the wait event ‘db file parallel read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘db file scattered read’, which always go to the buffer cache.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the process’ PGA.
– Multi block reads, visible via the wait event ‘direct path read’, which go to the in-memory area.

For full scans of a segment, essentially the choices are to go buffered, or go direct. The question that I asked myself is: what code path is used when the automatic big table caching feature is used?

The reason for asking myself this is that the difference between a cached multi block read and a direct path read is significant. What this decision determines is whether multiple IO requests can be submitted for multiple extents, leveraging asynchronous IO, which is possible in the direct path read code path, or whether IO requests can only be submitted for a single extent at a time, which is the case for the db file scattered read code path. To summarise the essence: can I only execute a single IO request at a time (scattered read) or multiple IO requests concurrently (direct path read)?

The answer is very simple to find: enable the big table caching feature and then read a table that is considered big while the big table cache is empty. A table that is read into the big table area shows the ‘db file scattered read’ wait event. What is interesting is what happens when a table is scanned that is not read into the big table area: the wait event switches to ‘direct path read’. Or actually I should say: the general rules for choosing between buffered or direct path reads are applied, potentially resulting in a direct path read.
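
If you want to see the difference at the operating system level too, one way (a sketch, not part of the original test; the process id is an example) is to attach strace to the server process doing the scan and watch whether it issues synchronous pread64 calls or asynchronous io_submit/io_getevents calls:

$ strace -f -e trace=pread64,io_submit,io_getevents -p 12345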

I want to emphasise again the importance of the IO code path decision. If you have an Oracle database server that uses modern high performance hardware for its disk subsystem, not choosing the correct multi block read code path can severely limit throughput.

For example, if your IO subsystem has an average IO time of 1 ms, the maximal theoretical throughput for buffered IO is 1/0.001*1MB = 1 GBPS (this will be way lower in reality). Direct path reads start off with 2 IO requests in flight, which already doubles the theoretical bandwidth of buffered IO, and this can, depending on an Oracle internal resource calculation, grow to 32 IO requests in flight at the same time, which under the same assumptions is 32 times the buffered throughput.

Warning! This is a post about Oracle database internals for internals lovers and researchers. For normal, functional administration, this post serves no purpose. The post shows a little tool I created, which consists of a small database I compiled with Oracle database C function names and a script to query it. The reason that keeping such a database makes sense in the first place is that the Oracle C functions of the Oracle database are set up in a hierarchy based on the function name. This means you can deduce what part of the execution you are in by looking at the function name; for example ‘kslgetl’ means kernel service lock layer, get latch.

To use this, clone the git repository at https://gitlab.com/FritsHoogland/ora_functions.git

Use the ‘sf.sh’ (search function) script to query the functions. The script uses sqlite3; install the sqlite rpm package to add this (on Linux), or use the packaging method of your platform.
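
To give an idea of the setup, the steps condensed (a sketch, assuming a yum based distribution where the sqlite package provides the sqlite3 executable):

$ sudo yum -y install sqlite git
$ git clone https://gitlab.com/FritsHoogland/ora_functions.git
$ cd ora_functions
$ ./sf.sh kslgetl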

This is how to use sf.sh:

$ ./sf.sh kglGetMutex
kglGetMutex -- kernel generic lock management
---

This shows that the sf.sh script could find the first 3 letters (underlined), which probably mean kernel generic lock management. It couldn’t find ‘GetMutex’; however, that is so self-explanatory that it doesn’t need annotation.

This is what a full function annotation looks like:

$ ./sf.sh qercoFetch
qercoFetch -- query execute rowsource count fetch
----------

Here the full function is underlined, which means the entire function is found.

If you found explanations for Oracle database code locations, or found an error in the annotation (I made assumptions here and there), please send them to frits.hoogland@gmail.com, or react to this post, so I can add or update it.

There are many situations where you want to use a very specific configuration of the Oracle database, for example when a client has an issue and is still on EL5, or gets disk errors on a filesystem that is ext3, or is using ASM and gets weird IO patterns. Other examples are: you want to test the newest PSU to see if it responds differently to an issue you are working on, or you want to test a combination of Oracle database version 11.2.0.3 and grid infrastructure 12.1.0.2.

Of course you can just go and install a virtual machine and install all the different bits and pieces manually. Doing so kills vast amounts of time. You will also end up with a lot of virtual machines, for which at a certain point in time you have to decide which of them to remove.

Also, a lot of people use a (virtual) machine with a couple of database versions installed, and test on these. In that case you sometimes have to ignore details like filesystem versus ASM, or the specific PSU level; it’s hard to keep that updated, and when a client case is on a lower version, in general you don’t go back in PSU level (although it’s not impossible). One thing I ran into frequently is that it’s easy to get caught by side effects of changes and settings made for earlier test cases (often underscore parameters).

This blogpost introduces my project ‘vagrant-builder’, which allows you to build a virtual machine with Oracle and optionally clusterware installed in any version you specify. The provisioning will download all software and patches (except for the 12.2.0.1 media, which needs to be provided in the ‘files’ directory) fully automatically for you. These are the options:

Linux version:
Oracle Linux version 5, 6 or 7 (limited by the boxes built by the box-cutter project).
The actual versions currently existing are ol5.11, ol6.6/7/8, ol7.0/1/2/3. I am waiting for the box-cutter project to produce ol6.9 and ol7.4.

Filesystems:
Filesystem types for u01 and for oradata (when no ASM is used): xfs, ext4, ext3.

Kernel:
Oracle linux 5: latest redhat kernel, latest UEK2 kernel.
Oracle linux 6: any exadata kernel version (if made available on public-yum), latest redhat/UEK2/UEK3/UEK4 kernel.
Oracle linux 7: latest redhat kernel, latest UEK3 or UEK 4 kernel.

ASM:
No ASM install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814

Database:
No database install.
12.2.0.1 no patches, PSU: 170620, 170718, 170814
12.1.0.2 no patches, PSU: 1, 2, 3, 4, 5, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.4 no patches, PSU: 1, 2, 3, 4, 5, 6, 7, 8, 160119, 160419, 160719, 161019, 170117, 170418, 170718, 170814
11.2.0.3 PSU 15 only.
11.2.0.2 PSU 12 only.

Database creation:
By specifying a database name, a database will be created with that name. Of course the dictionary part of the patching will be applied to the database!

How does this work? This works using the combination of the following pieces of software:
– Virtualbox
– Vagrant
– Ansible
Plus the vagrant-builder repository: https://gitlab.com/FritsHoogland/vagrant-builder

If you don’t have Virtualbox, Vagrant or Ansible installed, follow the installation procedure in this blog article; it’s a bit older, so the current versions of the software components will be higher than the ones mentioned there, you should simply install the latest versions. There is quite an important caveat (sadly): Ansible in principle does not run on Windows. You can make it work on Windows by using Cygwin, but officially it doesn’t support Windows. If you can get the provisioning using Ansible to fully work on Windows, please share how you did that.

Once you have all the software components installed, another thing you might want to do first is to move your default Virtualbox directory to a place where you have enough space to hold virtual machines.

Then, clone the vagrant-builder repository into a directory (git clone https://gitlab.com/FritsHoogland/vagrant-builder.git myvm, for example), go into that directory and edit the Vagrantfile to set:
– hostonly_network_ip_address
– mos username & password
– database_name (if you want a database)
– linux (choose one by removing the hash sign in front of it)
– kernel
– asm_version (set a version if you want clusterware “siha” and ASM; if a database_version is set and asm_version is empty, you get a filesystem based database)
– database_version (set a version if you want the database software to be installed)
– vm_cpus (number of CPUs visible/made available to the VM)
– vm_memory (amount of memory made available to the VM)
– vm_hostname (if you want multiple VMs, you need multiple vm_hostnames set!)
– perl_l4cache_workaround (if you got a newer CPU with a level 4 cache, set this to Y (yes), otherwise set this to N (no))

Save the changes, and start up the virtual machine: ‘vagrant up’. This will pull the operating system image, add a disk for the database, start up Linux, set up and configure Linux, download the database and grid software version (except for version 12.2.0.1, for which the installation media needs to be staged in the files directory), install it, download the patches, install these and create a database, all without manual intervention.
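
Condensed, the whole workflow looks like this (the directory name ‘myvm’ is just an example):

$ git clone https://gitlab.com/FritsHoogland/vagrant-builder.git myvm
$ cd myvm
$ vi Vagrantfile    # set the variables described above
$ vagrant up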

This post is about the decision the Oracle database engine makes when it is using a full segment scan approach. The choices the engine has are to store the blocks that are physically read in the buffercache, or to read the blocks into the process’ PGA. The first choice is what I refer to as a ‘buffered read’, which places the block in the database buffercache, so the process itself and other processes can bypass the physical read and use the block from the cache, until the block is evicted from the cache. The second choice is what is commonly referred to as ‘direct path read’, which places the blocks physically read into the process’ PGA, which means the read blocks are stored for only a short duration and are not shared with other processes.

There are some inherent performance aspects that differ between a buffered and a direct path read. A buffered read can only execute a single physical read request for a single range of blocks, wait for that request to finish, and fetch and process the result of the physical read request, after which it can execute the next physical read request. So there is a maximum of one outstanding IO for multiple (adjacent) Oracle blocks. A direct path read works differently: it submits two physical IO requests, each for a distinct range of Oracle blocks, asynchronously, after which it waits for one or more IOs to finish. If an IO is returned, it is processed, and an IO for another range of Oracle blocks is submitted to restore the number of IOs in flight to two. If the database engine determines (based upon a non-disclosed mechanism) that enough resources are available, it can increase the number of physical IO requests in flight up to 32. Other differences include a maximum for the total size of the IO request, which is 1MB for buffered requests, and 32MB for direct path requests (which is achieved by setting db_file_multiblock_read_count to 4096).

At this point it should be clear that there are differences between buffered and direct path reads, and that when full segment scans switch from direct path reads to buffered reads it could mean a significant performance difference. On top of this, if your database is using Exadata storage, the decision between buffered reads and direct path reads is even more important: only once the decision for direct path reads has been made can an Exadata smartscan be executed. I have actually witnessed cases where a mix of partitioning and HCC led to partitions so small that a direct path read was not chosen, which meant a smartscan was not considered anymore, meaning that instead of the cells decompressing the compressed blocks all in parallel, the process now had to fetch them and do the decompression on the database layer.

There have been some posts on the circumstances of the decision. However, I have seen none that summarise the differences for the different versions. In order to investigate the differences between the different Oracle versions, I created a git repository at gitlab: https://gitlab.com/FritsHoogland/table_scan_decision. You can easily use the repository by cloning it: git clone https://gitlab.com/FritsHoogland/table_scan_decision.git, which will create a table_scan_decision directory in the current working directory.

Oracle version 11.2.0.2.12
Please mind this version is very old, and SHOULD NOT BE USED ANYMORE because it’s not an actively supported version. However, I do use this version, because this version has different behaviour than the versions that follow.

First determine the small table threshold of the database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531
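
The small_table_threshold script is part of the repository; as an illustration, a minimal sketch of such a script (my assumption, the actual script may differ) queries the X$ views that hold the hidden parameters, which requires SYSDBA:

select n.ksppinm, v.ksppstvl
from   x$ksppi n, x$ksppcv v
where  n.indx = v.indx
and    n.ksppinm = '_small_table_threshold';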

Let’s create tables just below and just over 1531 blocks/small table threshold:

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1531 1531
...
    BLOCKS
----------
      1664
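
The create_table script is also part of the repository. Purely as an illustration (a hypothetical sketch assuming an 8 kB block size; the real script may be implemented differently), such a script could create a table of roughly the requested number of blocks and report the allocated size:

-- hypothetical create_table sketch: &1 = table name, &2 = requested number of blocks
create table &1 as
select rownum id, rpad('x', 4000, 'x') padding
from   dual
connect by level <= &2 * 2;   -- roughly two ~4 kB rows per 8 kB block
select blocks from user_segments where segment_name = upper('&1');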

So the small table threshold is 1531. This means that an internal statistic that is used for determining whether to use the direct path mechanism, the medium table threshold, will be approximately 1531*5=7655. Let’s create tables just below and just over that number of blocks:

TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

For the other versions, the trace event ‘nsmtio’ can be used to learn how the decision is made. However, this trace event does not exist in Oracle version 11.2.0.2. The workaround is to just execute a SQL trace and interpret the wait events. For a full table scan, the wait event ‘db file scattered read’ means a buffered read was done, and the wait event ‘direct path read’ means a direct path read was done (obviously).

TS@test > alter session set events 'sql_trace level 8';
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1531';
TS@test > select count(*) from table_1531;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that in my case, with Oracle version 11.2.0.2, the switching point is at 5 times _small_table_threshold.

Oracle 11.2.0.3.15
This version too should NOT BE USED ANYMORE because it is not in active support. This too is for reference.
Small table threshold for this database:

SYS@test AS SYSDBA> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1531

With the small table threshold being 1531, the medium table threshold should be approximately 1531*5=7655.

TS@test > @create_table table_1350 1350
...
    BLOCKS
----------
      1408
TS@test > @create_table table_1440 1440
...
    BLOCKS
----------
      1536
TS@test > @create_table table_7000 7000
...
    BLOCKS
----------
      7168
TS@test > @create_table table_7655 7655
...
    BLOCKS
----------
      7808

Flush the buffer cache, set the trace events, and test the scans. By doing that I ran into something peculiar with the ‘nsmtio’ event in this version (11.2.0.3 with the latest PSU). This event does exist for this version (which you can validate by running ‘oradebug doc component’), however, it does not yield any output. This means I have to revert to the previous method of running sql_trace at level 8 and interpreting the wait events.

TS@test > alter session set events 'trace[nsmtio]:sql_trace level 8'; -- no NSMTIO lines, only sql_trace!
TS@test > select count(*) from table_1350;
-- main event: db file scattered read
TS@test > alter session set tracefile_identifier = 'table_1440';
TS@test > select count(*) from table_1440;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7000';
TS@test > select count(*) from table_7000;
-- main event: direct path read
TS@test > alter session set tracefile_identifier = 'table_7655';
TS@test > select count(*) from table_7655;
-- main event: direct path read

This shows that with Oracle version 11.2.0.3, the direct path read switching point seems to have moved from 5 times small table threshold to small table threshold itself.

Oracle 11.2.0.4.170718
This version is in active support!
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1538

With the small table threshold being 1538, the medium table threshold should be approximately 1538*5=7690.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1538 1538
...
    BLOCKS
----------
      1664
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7690 7690
...
    BLOCKS
----------
      7808

Flush the buffer cache, set the trace events, and test the scans.

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7693 blocks),
-- main event: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1538';
SQL> select count(*) from table_1538;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14422, objn: 14422
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14423, objn: 14423
-- main event: direct path read
SQL> alter session set tracefile_identifier = 'table_7690';
SQL> select count(*) from table_7690;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 14424, objn: 14424
-- main event: direct path read

This shows that with Oracle version 11.2.0.4, the direct path read switching point is at the small table threshold, a behaviour that changed starting from 11.2.0.3.

Oracle version 12.1.0.2.170718
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1440

SQL>

With small table threshold being 1440, the medium table threshold is approximately 1440*5=7200.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7203 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20489, objn: 20489
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20490, objn: 20490
-- main events: direct path read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 20491, objn: 20491
-- main events: direct path read

This is in line with the switch in version 11.2.0.3 to small table threshold as the switching point between buffered reads and direct path reads.

Oracle 12.2.0.1.170814
Small table threshold for this database:

SQL> @small_table_threshold

KSPPINM 		       KSPPSTVL
------------------------------ ------------------------------
_small_table_threshold	       1444

SQL>

With small table threshold being 1444, the medium table threshold is approximately 1444*5=7220.

SQL> @create_table table_1350 1350
...
    BLOCKS
----------
      1408
SQL> @create_table table_1440 1440
...
    BLOCKS
----------
      1536
SQL> @create_table table_7000 7000
...
    BLOCKS
----------
      7168
SQL> @create_table table_7200 7200
...
    BLOCKS
----------
      7424

Now flush the buffer cache, and use the ‘nsmtio’ trace event together with ‘sql_trace’ to validate the read method used:

SQL> alter session set events 'trace[nsmtio]:sql_trace level 8';
SQL> select count(*) from table_1350;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1378 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_1440';
SQL> select count(*) from table_1440;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 1504 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7000';
SQL> select count(*) from table_7000;
-- nsmtio lines:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 7048 (blocks), Threshold: MTT(7222 blocks),
-- main events: db file scattered read
SQL> alter session set tracefile_identifier = 'table_7200';
SQL> select count(*) from table_7200;
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 22502, objn: 22502
-- main events: direct path read

Hey! With 12.2.0.1 the direct path read switching point reverted back to pre-11.2.0.3 behaviour of switching on 5 times small table threshold instead of small table threshold itself.

Update!
Re-running my tests shows differences in the outcome between buffered and direct path reads. My current diagnosis is that the scan type determination uses a step based approach:

– The first determination of size is done with ‘NSMTIO: kcbism’ (kcb is medium). If islarge is set to 1, it means the segment is bigger than STT. If islarge is set to 0 it means the segment is smaller than STT, and the segment will be read buffered, and the line ‘qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown in the NSMTIO output.

– The next line is 'NSMTIO: kcbimd' (kcb is medium determination?). It shows the size of the segment (nblks), STT (kcbstt), MTT (kcbpnb) and is_large, which in my tests always is set to 0. Here, there are 4 options that I could find:

1) Segment size between STT and MTT and a buffered read is executed.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, which probably is externalised in the line 'NSMTIO: kcbcmt1'. The outcome can be a buffered read, for which the line 'qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]' is shown.

2) Segment size between STT and MTT and the direct path code path is chosen.
If the segment is between STT and MTT, the Oracle engine uses a non-disclosed costing mechanism, probably externalised in the line 'NSMTIO: kcbcmt1'. If the costing determines it would be beneficial to use a direct path mechanism, it seems it switches to the direct path with cache determination code, which is also used for any table scan that is smaller than VLOT. Because of switching to that code, it will determine if the segment is bigger than VLOT: 'NSMTIO: kcbivlo', which of course in this case isn't true. Then, it will show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]'

3) Segment size bigger than MTT but smaller than VLOT.
If the segment is between MTT and VLOT, the Oracle engine does not apply the costing mechanism (which means the kcbcmt1 line is not shown). It will determine if the segment is bigger than VLOT ('NSMTIO: kcbivlo'), which in this case it isn't, and then show the line 'NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]', after which the kcbdpc check described below determines whether a buffered or a direct path read is done.

4) Segment size bigger than VLOT.
If the segment is bigger than VLOT, the Oracle engine executes the functions kcbimd and kcbivlo, and the NSMTIO line for kcbivlo will show is_large 1 to indicate it’s a very large object (VLOT by default is ‘500’, which means 5 times the total number of buffers in the buffer cache). The qertbFetch line will say ‘NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]’, and there is no kcbdpc step to choose between a buffered or a direct path read.

In the cases where ‘NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]' is shown, which is either a segment between STT and MTT which switched to this code path, or between MTT and VLOT, the code will apply a second determination and potential switching point from buffered to direct path or vice versa, which is shown with the line 'kcbdpc' (kcb direct path check). The outcome can be:

– NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ] to indicate it will use a buffered read.
– NSMTIO: kcbdpc:DirectRead to indicate it will use a direct path read.

I have verified the above 'decision tree' in 11.2.0.2, 11.2.0.3, 11.2.0.4, 12.1.0.2 and 12.2.0.1. It all seems to work this way consistently. I derived this by looking at the NSMTIO tracing of 12.2, and then went back in versions. You will see that going lower in versions, there is less (nsmtio) tracing output; 11.2.0.4 shows way less information, for example it does not show the kcbcmt1 line, and of course 11.2.0.3 and 11.2.0.2 do not show NSMTIO lines at all. In order to verify the workings in the versions where this information was missing in the trace, I used gdb and quite simply set breakpoints on the kcbism, kcbimd, kcbcmt1, kcbivlo and kcbdpc functions.
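
For reference, this is roughly what such a gdb session looks like (a sketch; the process id is an example, and these are just plain breakpoints, you would normally add commands to print the arguments and continue):

$ gdb -p 12345
(gdb) break kcbism
(gdb) break kcbimd
(gdb) break kcbcmt1
(gdb) break kcbivlo
(gdb) break kcbdpc
(gdb) continue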

Still, at the kcbcmt1 point:
– 11.2.0.2 seems to quite consistently take MTT as the direct path switching point.
– 11.2.0.3-12.1.0.2 seem to quite consistently take STT as the direct path switching point.
– 12.2.0.1 varies.

Conclusion.
This article first explained the differences between buffered and direct path reads, and why this is important, and that it is even more important with Exadata for smartscans.

The next part shows how to measure the switching point, and shows measurements of it. The most important message from this blog article is that from 11.2.0.3 up to 12.1.0.2 the direct path read switching point is the small table threshold, and that with Oracle database version 12.2.0.1 the direct path switching point is changed back to pre-11.2.0.3 behaviour, which means 5 times the small table threshold of the instance.
The addition (update) shows that between STT and MTT there is a cost based decision to go direct path or buffered. Once the direct path is chosen, it still can go buffered if the majority of the blocks are in the cache.

If you look closely at the output of the nsmtio lines for versions 11.2.0.3-12.1.0.1 for tables that had a size between the small table threshold and the medium table threshold, it seemed a bit weird, because the nsmtio trace said ‘[MTT < OBJECT_SIZE < VLOT]’, which to me means that Oracle detected the object size to be between the medium table threshold and the very large object threshold, which was not true. I can’t tell; it might be a bug, or the text might simply be misleading.
The text description in the NSMTIO qertbFetch line is bogus; it simply indicates a code path. ‘[- STT < OBJECT_SIZE < MTT]’ means it’s a buffered read, which could be chosen when the size is below STT or in between STT and MTT; ‘[MTT < OBJECT_SIZE < VLOT]’ means it’s a direct path read, which could be chosen when the size is in between STT and MTT or between MTT and VLOT.

I added the scripts and examples of the tracing events so you can measure this yourself in your environment.

This post shows you how to install Rundeck with the Ansible plugin on Centos 7. The installation is done with nginx as the web server and using SSL with a self signed certificate. Please read the Ansible installation script, and modify anything that should be different for your situation. You will be amazed how well readable the installation script is!

Rundeck is a web based user interface that allows you to run commands against a group of hosts. Rundeck has an ansible plugin; using that plugin, it can provide functionality similar to Ansible Tower (a commercial product) or Semaphore (open source).

After a fresh installation of Centos 7, do the following as root:

yum -y localinstall https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible git
git clone https://gitlab.com/FritsHoogland/install_rundeck.git

This adds the EPEL (Extra Packages for Enterprise Linux) repository to Centos, which provides ansible (git comes from the base repository). The next yum command installs ansible (the orchestration software) and git (distributed version control). The git clone command pulls the ansible orchestration script from my install_rundeck git repository at gitlab.

It is important to open the install_rundeck/install_rundeck.yml script and modify anything that is different in your situation. The public hostname will most likely be different than the 192.168.66.16 in the script. You might want to change the locality information with the certificate generation (unless you live in Lutjebroek like me :-). If you do a corporate installation, you might want to skip this step altogether and get a certificate pair signed by your company’s CA.

Please mind I ran into issues with ansible inventories:
– The hosts in the inventory need to have ansible run against them to pick up their properties and become visible in rundeck in the nodes tab. For ansible to be able to run against the hosts in the inventory, the hosts need to exist in the ssh known hosts file of the rundeck user, otherwise ansible can’t run and the host or hosts will not be visible in rundeck. The solution is to log in as the rundeck user, log on to the hosts in the inventory once manually and accept the host keys. From a security perspective it’s understandable that you need to carefully assess the hosts to trust, but from an automation standpoint this is annoying. Outside of essentially filling out the known hosts file as I described, there are several other workarounds; one way to fill the known hosts file up front is sketched after this list.
– I created an ansible inventory file in the rundeck project directory and entered the hosts in it. Rundeck picked up the hosts (after solving the above point they became visible in the nodes tab), however when executing something using ansible via rundeck it would say ‘[WARNING]: provided hosts list is empty, only localhost is available’. This means ansible was not pointed explicitly to an inventory, so it used the default one. In order to solve this, I symlinked my (rundeck) project inventory to /etc/ansible/hosts to make it centrally available. Apparently, using a central inventory for ansible with the plugin is by design. I would rather have rundeck generate an inventory per execution, and point to it when the plugin executes ansible.
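
One way to fill the known hosts file of the rundeck user up front, rather than logging on to every host manually, is ssh-keyscan; a sketch as root (the host names and the rundeck home directory /var/lib/rundeck are assumptions that may differ in your setup):

# mkdir -p /var/lib/rundeck/.ssh
# ssh-keyscan host1 host2 >> /var/lib/rundeck/.ssh/known_hosts
# chown -R rundeck: /var/lib/rundeck/.ssh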

Now install rundeck:

ansible-playbook install_rundeck/install_rundeck.yml

Done!
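To get a quick indication that everything came up, you can check the rundeck service and the nginx front end. This assumes the playbook installs the standard rundeck package (which provides a service called rundeckd) and that nginx listens on the hostname you configured in the script; -k is needed because of the self-signed certificate:

# systemctl status rundeckd
# curl -k -I https://192.168.66.16/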

Prometheus is an open source systems monitoring and alerting toolkit originally built at SoundCloud. This blogpost shows how to install the components needed to visualise Linux system statistics via Grafana.

The setup consists of 3 components:
node_exporter, an exporter of system and hardware metrics.
prometheus, a metric collection and persistence layer.
grafana, the visualisation layer.

1. Preparation
The needed components are installed in the home directory of the user ‘prometheus’. In order for that user to exist, it must obviously first be created:

# useradd prometheus
# su - prometheus
$

This installation guide uses Oracle Linux 7.3, but should work for RHEL or Centos too.

2. Node exporter
The next thing to do is install the node exporter. Please mind new versions do come out, so you might want to check the node_exporter releases page on github for the latest release:

$ curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz"
$ mkdir -p Prometheus/node_exporter
$ cd $_
$ tar xzf ../../node_exporter-0.14.0.linux-amd64.tar.gz

Now become root and create a unit file to automatically startup the node exporter using systemd:

# echo "[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

[Install]
WantedBy=default.target" > /etc/systemd/system/node_exporter.service

And make systemd start the node exporter:

# systemctl daemon-reload
# systemctl enable node_exporter.service
# systemctl start node_exporter.service

Next you can verify if the node exporter is running by using ‘systemctl status node_exporter.service’:

# systemctl status node_exporter.service
● node_exporter.service - Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:20:54 UTC; 7s ago
 Main PID: 3017 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─3017 /home/prometheus/Prometheus/node_exporter/node_exporter-0.14.0.linux-amd64/node_exporter

Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - hwmon" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - infiniband" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - textfile" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - conntrack" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - diskstats" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - entropy" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - loadavg" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - sockstat" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg=" - wifi" source="node_exporter.go:162"
Jul 31 15:20:54 test.local node_exporter[3017]: time="2017-07-31T15:20:54Z" level=info msg="Listening on :9100" source="node_exporter.go:186"

Additionally, you can go to hostname:9100 and check that the page says ‘Node Exporter’ and has a link called ‘Metrics’, which shows all the metrics.
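You can also query the metrics endpoint directly from the command line; a quick check like this should return a long list of plain-text metrics (the exact metric names depend on the node_exporter version):

$ curl -s http://localhost:9100/metrics | head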

3. Prometheus
After we installed node_exporter to provide measurements, we must install the software that fetches that information and stores it. That is what prometheus does. First become the prometheus user again and install prometheus. Here too it is important to realise that newer versions will have come out since this article was written:

# su - prometheus
$ curl -LO "https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz"
$ cd Prometheus
$ tar xzf ../prometheus-1.7.1.linux-amd64.tar.gz
$ cd prometheus-1.7.1.linux-amd64
$ echo "scrape_configs:

  - job_name: 'prometheus'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 1s
    static_configs:
      - targets: ['localhost:9100']"> prometheus.yml

This downloaded and unzipped prometheus, and created a prometheus scrape configuration that fetches data from prometheus itself and from the node exporter every second.
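Optionally, the scrape configuration can be checked with the promtool utility that ships in the same tarball. Please mind the exact subcommand differs between versions (‘check-config’ in the 1.x releases, ‘check config’ in later ones), so treat this as a sketch:

$ ./promtool check-config prometheus.yml

Now become root, and install the systemd unit file for prometheus: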

# echo "[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Restart=on-failure
ExecStart=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus.yml -storage.local.path=/home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/data

[Install]
WantedBy=multi-user.target" > /etc/systemd/system/prometheus.service

And make systemd start prometheus:

# systemctl daemon-reload
# systemctl enable prometheus.service
# systemctl start prometheus.service

And verify prometheus is running:

# systemctl status prometheus.service
● prometheus.service - Prometheus Server
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:36:55 UTC; 9s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 22656 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─22656 /home/prometheus/Prometheus/prometheus-1.7.1.linux-amd64/prometheus -config.file=/home/prometheus/Prometheus/prometheus-1.7.1....

Jul 31 15:36:55 test.local systemd[1]: Started Prometheus Server.
Jul 31 15:36:55 test.local systemd[1]: Starting Prometheus Server...
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting prometheus (version=1.7.1, branch=mast...n.go:88"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Build context (go=go1.8.3, user=root@0aa1b7fc43...n.go:89"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Host details (Linux 3.10.0-514.26.2.el7.x86_64 ...n.go:90"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading configuration file /home/prometheus/Pro....go:252"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Loading series map and head chunks..." source="....go:428"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="0 series loaded." source="storage.go:439"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Starting target manager..." source="targetmanager.go:63"
Jul 31 15:36:55 test.local prometheus[22656]: time="2017-07-31T15:36:55Z" level=info msg="Listening on :9090" source="web.go:259"
Hint: Some lines were ellipsized, use -l to show in full.

Additionally, you can go to hostname:9090/targets and verify that both node_exporter and prometheus report state=UP.
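The same check can be done from the command line using the prometheus HTTP API; the ‘up’ metric is 1 for every target that was scraped successfully:

$ curl -s 'http://localhost:9090/api/v1/query?query=up'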

At this point, system metrics are fetched and stored. All we need to do is visualise them. An excellent tool for doing so is grafana.

4. Grafana
The grafana website shows installation instructions and a link to the latest version. At the time of writing of this blogpost, the latest version was 4.4.1. This is how grafana is installed (please mind installation and systemd require root privileges):

# yum install https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.4.1-1.x86_64.rpm

Next up make systemd handle grafana and start it:

# systemctl daemon-reload
# systemctl enable grafana-server.service
# systemctl start grafana-server.service

And check if grafana is running:

# systemctl status grafana-server.service
● grafana-server.service - Grafana instance
   Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 15:43:11 UTC; 1min 58s ago
     Docs: http://docs.grafana.org
 Main PID: 22788 (grafana-server)
   CGroup: /system.slice/grafana-server.service
           └─22788 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile= cfg:default.paths.logs=/var/log/grafana cfg:default.path...

Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Starting plugin search" logger=plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/var/lib...plugins
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing Stream Manager"
Jul 31 15:43:12 test.local grafana-server[22788]: t=2017-07-31T15:43:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server ad...socket=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Jul 31 15:44:34 test.local grafana-server[22788]: t=2017-07-31T15:44:34+0000 lvl=info msg="Request Completed" logger=context userId=0 orgI...eferer=
Hint: Some lines were ellipsized, use -l to show in full.

5. Grafana configuration
Next, we need to hook up grafana with prometheus. First, go to hostname:3000.
– Login with admin/admin
– Click ‘add datasource’
– Name: prometheus, Type: Prometheus
– Http settings: http://localhost:9090, select Access: ‘proxy’.
– Click ‘save and test’. This should result in ‘success’ and ‘datasource updated’. (A scripted alternative using the grafana HTTP API is sketched below this list.)
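If you prefer to script this step instead of clicking through the UI, the same datasource can be created via the grafana HTTP API. This is a sketch using the default admin/admin credentials and the URLs from this setup:

# curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
    -H 'Content-Type: application/json' \
    -d '{"name":"prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy"}'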

Now click on the grafana symbol in the left upper corner, dashboards, import. Enter ‘2747’ at ‘grafana.com dashboard’. This will say ‘Linux memory’, select the prometheus datasource which you just defined, and click import.

This should result in a dashboard that shows you the Linux memory areas:
