Recently I was applying the data dictionary part from an (exadata bundle) patch and ran into the following errors:
ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1002 (dba)
This was very weird: I had just started the instance using sqlplus. This is a database that is normally started by Oracle clusterware, but for the sake of quickly patching the database, I started it manually. Another reason I had done it that way is that the (infamous) OJVM patch was part of the patching, which for the version I applied needs the database in upgrade mode, which means I had to set the cluster_database parameter to false temporarily.
Back to the story: I ran datapatch on my manually (sqlplus) started instance, and got the above mentioned errors. The quick solution was to stop the instance (killing the pmon process will do that), and start the instance again. A thorough look through the alert.log file and the trace files generated during startup did not show any errors anymore. Since this issue had shown up after an earlier successful startup, I kept a close eye on the instance for some time, but it didn’t appear again. Problem solved, but what happened?
When looking at the error messages, there are two lines mentioning ‘egid’, which means effective group id. Also, two values for groups are mentioned: 1001, which is the oinstall group, and 1002, which is the dba group. The last line is actually even clearer: it says the startup group id is oinstall, but the current group id is dba.
The line ‘ORA-27140: attach to post/wait facility failed’ is actually the root cause. The line ‘ORA-27302: failure occurred at: skgpwinit6’ tells the exact function, and skgpwinit6 can probably be deciphered as ‘System Kernel Generic PostWait INITialisation’. But what does ‘post/wait’ mean? My Oracle Support has a nice description in note ‘TECH: Unix Semaphores and Shared Memory Explained (Doc ID 15566.1)’. Essentially, post/wait is the Oracle side of using the operating system’s System V semaphore facility. The issue here is that the group id set for the semaphores does not match the effective group id of this Oracle database server process, and the attach is rejected by the operating system (line ‘ORA-27300: OS system dependent operation:invalid_egid failed with status: 1’); Linux error 1 is EPERM in errno.h (‘Operation not permitted’), so it is declined because of permissions.
The next question obviously is: how did this happen? The instance was started by me, in the same Linux session, and then something else that connects to the instance (datapatch) suddenly errors out.
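To make the mismatch easy to spot, the two group ids can be pulled out of the ORA-27303 line. This is only a minimal sketch: ora27303_egids is a hypothetical helper name of mine, not an Oracle tool, and it assumes the message layout shown in the error above.

```shell
# ora27303_egids: hypothetical helper (not part of Oracle).
# Reads text on stdin and prints the startup and current effective
# group ids from any ORA-27303 line in the layout shown above.
ora27303_egids() {
    sed -n 's/.*startup egid = \([0-9][0-9]*\).*current egid = \([0-9][0-9]*\).*/startup=\1 current=\2/p'
}

# Example using the message from this post.
echo 'ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1002 (dba)' | ora27303_egids
# prints: startup=1001 current=1002
```

If the two numbers differ, the process that tries to attach is running with a different effective group than the one the instance was started with.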
This has to do with Oracle clusterware, ASM disk devices and (potential) role separation. In order for the Oracle database server processes to be able to access and use the local ASM disk devices, they must have the group id of the ASM disk devices. When role separation is set up, which means the clusterware uses a different user id than the database software, the disk devices are owned by the clusterware user id, and the group id will be the group set as OSASM during installation, which tends to be ‘asmadmin’ in such cases.
Because the disk devices need to have permissions set to 660 (read and write for both the owner and the group), the group set on the devices can be used to access them. With role separation, the user id of the database processes is different from that of the clusterware processes, so for the database processes to be able to use the disk devices, they need to run with the OSASM group.
Now, the actual root cause of this issue is that, for the clusterware to make sure the databases can start up using the disk devices, it will set the OSASM group as the group of the database oracle executable (in the database home!) whenever clusterware is invoked to start up an instance. Because the oracle executable has the SUID and SGID bits set (rwsr-s--x), this could mean the group of newly created processes suddenly changes from the previous group id to the OSASM group, if clusterware is used to start up an instance from the same home from which a database instance was started earlier without clusterware.
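Since everything hinges on the SUID and SGID bits of the oracle executable, it can help to check them explicitly. The following is a small sketch under my own assumptions: has_suid_sgid is a hypothetical helper, and it expects an octal mode string such as the one GNU stat prints with -c %a.

```shell
# has_suid_sgid: hypothetical helper. Succeeds when the octal mode in $1
# has both the SUID (4000) and SGID (2000) bits set.
has_suid_sgid() {
    # The leading 0 makes the shell read the value as octal.
    [ $(( 0$1 & 06000 )) -eq $(( 06000 )) ]
}

# 6751 corresponds to rwsr-s--x, 751 to rwxr-x--x.
for mode in 6751 751; do
    if has_suid_sgid "$mode"; then
        echo "$mode: SUID and SGID set"
    else
        echo "$mode: SUID/SGID missing"
    fi
done
```

On Linux with GNU coreutils this could be used as has_suid_sgid "$(stat -c %a "$ORACLE_HOME/bin/oracle")"; the group itself can be shown with stat -c %G on the same file.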
How is the group set on the oracle executable of the database home?
Some simple testing shows the current primary group of the user performing linking of the oracle executable is used as the group of the oracle executable (# means executed as root, $ means executed as oracle):
# groupadd test_group
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba)
# usermod -a -G test_group oracle
# id oracle
uid=54321(oracle) gid=54321(oinstall) groups=54321(oinstall),54322(dba),54323(test_group)
# su - oracle
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ioracle
...
$ ls -ls /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
316476 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 11:27 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
===
$ exit
# usermod -g test_group -G dba,oinstall oracle
# su - oracle
$ id
uid=54321(oracle) gid=54323(test_group) groups=54323(test_group),54321(oinstall),54322(dba)
$ . oraenv
ORACLE_SID = [oracle] ? o12102
The Oracle base has been set to /u01/app/oracle
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk ioracle
...
$ ls -ls /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 /u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle
I first used my current settings, which show oinstall as the primary group; when I relink the oracle executable, the group set on the executable is oinstall. This is shown in the part of the listing before the === separator. Next, I switched the primary group of the oracle user to test_group and relinked. Now the group of the oracle executable is test_group.
If I start up a database with the current settings, it will create semaphores with test_group as the current/effective group:
$ sqlplus / as sysdba
SQL> startup
...
Database opened.
SQL> exit
$ sysresv -l o12102
IPC Resources for ORACLE_SID "o12102" :
...
Semaphores:
ID KEY
557059 0x017b8a6c
Oracle Instance alive for sid "o12102"
$ ipcs -si 557059
Semaphore Array semid=557059
uid=54321 gid=54321 cuid=54321 cgid=54323
mode=0640, access_perms=0640
nsems = 104
otime = Wed Mar 15 12:50:51 2017
ctime = Wed Mar 15 12:50:51 2017
semnum value ncount zcount pid
0 0 0 0 17097
1 4869 0 0 17097
2 10236 0 0 17097
3 32760 0 0 17097
...
The sqlplus session: start up the instance o12102.
The sysresv command: use the sysresv utility to find the semaphore array that the instance o12102 is using. The semaphore array id is 557059.
The ipcs -si command: the creator group id (cgid) of the semaphore array of the instance o12102 is 54323, which is test_group.
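The cgid lookup can be scripted so it is easy to compare against the group of the oracle executable. This is a rough sketch: sem_cgid is a hypothetical helper name, and it assumes the cgid=NUMBER layout that ipcs -si prints in the listing above.

```shell
# sem_cgid: hypothetical helper. Reads `ipcs -si <semid>` output on stdin
# and prints the creator group id (cgid) of the semaphore array.
sem_cgid() {
    sed -n 's/.*cgid=\([0-9][0-9]*\).*/\1/p'
}

# Example with the ownership line from the listing above.
printf 'uid=54321\t gid=54321\t cuid=54321\t cgid=54323\n' | sem_cgid
# prints: 54323
```

On a live system this would be ipcs -si 557059 | sem_cgid, and the result can be compared with the numeric group of the executable, for example from stat -c %g "$ORACLE_HOME/bin/oracle" on Linux.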
Before we go on and involve clusterware, let’s replay the scenario (changing the group of the oracle executable) manually outside of clusterware to see if we can get the same behaviour:
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ chgrp oinstall oracle
$ chmod 6751 oracle
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle
Please mind the database currently is running (otherwise there would be no semaphore array above).
In order to change the group correctly, not only does the group need to be reset (the chgrp command), but the SUID and SGID bits must also be set again, because these are lost when the group is changed. Setting the SUID and SGID bits is done by the chmod 6751 command.
Now try to log on again as sysdba:
$ sqlplus / as sysdba
SQL*Plus: Release 12.1.0.2.0 Production on Wed Mar 15 13:02:25 2017
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Connected.
That actually succeeds! This is because principal access is arranged by membership of the OSDBA group (which I have set to dba, of which the oracle user is a member). However, if I try to shut down the instance, I get the messages regarding post/wait:
SQL> shutdown immediate
ERROR:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 54323 (test_group), current egid = 54321 (oinstall)
Can I solve this issue? Yes, quite simply by changing the group of the oracle executable back to the group the database was started with (plus the SUID and SGID bits, obviously):
$ chgrp test_group oracle
$ chmod 6751 oracle
$ sqlplus / as sysdba
...
SQL> select * from dual;
D
-
X
Voila! It’s corrected again.
Now let’s keep our database group setting in mind (test_group), and involve clusterware. Since the o12102 instance is already running, and the OSASM group of clusterware is set to oinstall, let’s see at which point the group of the database oracle executable changes:
$ srvctl status database -d o12102
Database is running.
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl stop database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle test_group 324067184 Mar 15 12:06 oracle
$ srvctl start database -d o12102
$ ls -ls oracle
316472 -rwsr-s--x 1 oracle oinstall 324067184 Mar 15 12:06 oracle
So it is really only when an instance is started using clusterware that the group of the oracle executable in the database home is changed to the OSASM group; checking the status or stopping the database leaves it alone.
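The same before/after comparison can be wrapped around any command. This is just a sketch with names I made up for illustration (file_gid and watch_group are hypothetical helpers); it relies only on ls -ldn, which prints the numeric group id in the fourth column.

```shell
# file_gid: hypothetical helper, prints the numeric group id of a file.
file_gid() {
    ls -ldn "$1" | awk '{print $4}'
}

# watch_group: hypothetical helper. Runs a command and reports whether
# the group of the given file changed while it ran.
# Usage: watch_group FILE COMMAND [ARGS...]
watch_group() {
    wg_file=$1
    shift
    wg_before=$(file_gid "$wg_file")
    "$@"
    wg_after=$(file_gid "$wg_file")
    if [ "$wg_before" != "$wg_after" ]; then
        echo "group changed: $wg_before -> $wg_after"
    else
        echo "group unchanged: $wg_before"
    fi
}

# Harmless demonstration: `true` does not touch the file.
watch_group / true
```

In the scenario above, this could be run as watch_group "$ORACLE_HOME/bin/oracle" srvctl start database -d o12102 to see the moment the group flips.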
The group as set by clusterware can be changed by:
– unlocking the clusterware home ($ORACLE_HOME/crs/install/rootcrs.sh -unlock; roothas.sh for SIHA) as root.
– changing the $ORACLE_HOME/rdbms/lib/config.c entries for .Lasm_string: .string "GROUP HERE" and #define SS_ASM_GRP "GROUP HERE" in the clusterware home, and then relinking (make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/config.o; relink all) as the owner of clusterware.
– locking the clusterware home again ($ORACLE_HOME/crs/install/rootcrs.sh -patch; roothas.sh for SIHA) as root.
==> Please mind that the group of the ASM devices needs to be changed accordingly if you change the ASM group.