Saturday, November 28, 2009

Example how to save deadlock exceptions into table (beside alert log)!

Deadlock is a normal situation in a multi user environments that causes two or more sessions to wait indefinitely for a locked resource. First session needs a resource locked by the second one and the second session needs a resource locked by the first session. You may accept deadlock as a easiest solution from Oracle side to ensure normal functioning in unsolvable situations.



After deadlock occur, beside exception (ORA-00060), alert log entry is entered with message, one trace file in background_dump_dest directory is generated. In that trace file, between all others information about deadlock, it is very clearly written:
...
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock
...

Because alert log and trace files are on the server side, where most of developers cannot reach, this is why all such problems fall in DBA hands, regardless most of us do not do design (!?) or program the database which we administer.

9i deadlock monitoring solution

While I was working in 2005 as DBA in one company with Oracle 9i, where deadlock was a serious and constant problem, I was tired from transferring all the time files to developers. This is why I created a standalone solution which capture deadlock and save all important data to two native Oracle tables. This 9i solution is available to download .

Solution is implemented in separate schema (tools) which should have enough privileges for that. So this is one additional step is needed to be done for such a scenario. Because non privileged users can query "%$" views, I have to create replica with similar name and grant select to tools user:
create or replace SYS.VT$OBJ as select * from SYS.OBJ$;
GRANT SELECT ON SYS.VT$OBJ TO TOOLS;
Of course, other way was to implement the whole logic in SYS schema, but then in some future exports...this could be lost (exp doesn't export sys schema).

Deadlock information are recorded in mentioned two tables (DEADLOCK_LOG_LOCKERS, DEADLOCK_LOG_WAITERS). The whole logic is automated through database trigger, which is fired when ORA-00060 occur.
CREATE OR REPLACE TRIGGER db_aserr_tgb
AFTER SERVERERROR
ON DATABASE
BEGIN
  IF (IS_SERVERERROR (60)) THEN
    deadlock_log_prc;
  END IF;
EXCEPTION
  WHEN OTHERS THEN
    null;
END db_aserr_tgb;
/

Finally, referenced deadlock_log_prc procedure in trigger, do the whole job-fill tables with information about deadlocks. If you perform code from picture at the beginning (attached source for this exampel is in RAR files as well) then on 9i database you get something like:

Notice that beside other information from session that caused deadlock, rightmost column contains SQL from session that survived.


Second table describes more with global data what was in the moment of deadlock.

Using ID and DEADLOG_DATE fields you can join tables and analyze the deadlock situation. Because I granted public select on both of tables developers were able to help them self with no intervention from DBA, what makes my work easier.

As you can see those two tables has almost all important data which you can find in trace file.

Non automatic usage

Because deadlock is in core classical exception, following, usual "developers like" code, will prevent deadlock logging:
BEGIN
  UPDATE....
EXCEPTION
  WHEN OTHERS THEN
    null;
END;
/
because exception is silently "crushed" in general exception handler without proper processing interesting part of it.

So it is much, much better next code which will fix previous problem with missed logging, leaving the whole program functionality as it was:
DECLARE
  my_deadlock_exc EXCEPTION;
  PRAGMA EXCEPTION_INIT (my_deadloc_exc, -60);
BEGIN
  UPDATE....
EXCEPTION
  WHEN my_deadlock_exc THEN
    deadlock_log_prc;
  WHEN OTHERS THEN
    null;
END;
As you can see, because trigger cannot be called implicitly, we are calling explicitly deadlock_log_prc which is in the case exception handler for deadlock exception.

Whenever I had a chance, I always tried to learn developers to make some exception handlers at least for some harmless events (resource bussy, record not found etc)... to prevent hiding real problems in WHEN OTHERS part.

11g deadlock changes!

After that I was working as APPS 11i DBA, locking story passed away from my mind.  The reason is very easy: in EBS there is no ref constraints-so deadlocks are very, very rare...frankly if they happened it is usually from the customization part.

Recently I jump into one interesting post 11g is more deadlock sensitive than 10g?, which mentioned something about expanded logic in deadlock. As I have at home my 11gR1, I decided to test that immediately.

Very soon I realize that 11g-9i difference in this part are really huge. As you can see, my solution, which was perfectly working on 9i, on 11g, beside important data which must be captured, was capturing many others unnecessary data, which were not produced in 9i.

11g deadlock monitoring workaround

After some small investigation, the quickest solution can be done by adding additional filters before posting captured sqls. For that i have re changed a little bit logic in deadlock_log_prc procedure. Here is how it now it looks (change code is especially colored):
CREATE OR REPLACE PROCEDURE deadlock_log_prc IS
/*------------------------------------------------------------------------------
 Filename: 3_deadlock_log_prc.sql
 Purpose : Procedure which fill deadlock monitoring tables with data
           when deadlock occur

 Date    : 18.10.2005
 Author  : Damir Vadas
 CR      : 1500

 Remarks : run as priviledged user
Should work on 9i-11g Oracle versions

 Changes (DD.MM.YYYY):
          27.11.2009. Damir Vadas
                      Rechanged logic for 11g databases (not too much tested-looks good)
 ----------------------------------------------------------------------------*/
PRAGMA AUTONOMOUS_TRANSACTION;
  v_id INTEGER;
  v_date DATE;
BEGIN
  SELECT tools_seq.nextVal, sysdate
    INTO v_id, v_date
    FROM dual
  ;

  INSERT INTO tools.deadlock_log_lockers (
      id
    , deadlog_date
    , os_locker
    , locker_schema
    , locker_pid
    , os_waiter
    , waiter_schema
    , waiter_pid
    , sql_text_waiter
  )
    /* TX row locks  */ 
    WITH Q AS (
      SELECT /*+ rule */
          S_LOCKER.OSUSER OS_LOCKER
        , S_LOCKER.USERNAME LOCKER_SCHEMA
        , S_LOCKER.PROCESS LOCK_PID
        , S_WAITER.OSUSER OS_WAITER
        , S_WAITER.USERNAME WAITER_SCHEMA
        , S_WAITER.PROCESS WAITER_PID
        , (SELECT sql_fulltext from v$sql where sql_id = O.SQL_ID AND ROWNUM<=1)
        SQL_TEXT
      FROM
          V$LOCK L_WAITER
        , V$LOCK L_LOCKER
        , V$SESSION S_WAITER
        , V$SESSION S_LOCKER
        , V$_LOCK L1_WAITER
        , V$OPEN_CURSOR O
      WHERE S_WAITER.SID = L_WAITER.SID
         AND L_WAITER.TYPE IN ('TX')
         AND S_LOCKER.sid = L_LOCKER.sid
         AND L_LOCKER.ID1 = L_WAITER.ID1
         AND L_WAITER.REQUEST > 0
         AND L_LOCKER.LMODE > 0
         AND L_WAITER.ADDR != L_LOCKER.ADDR
         AND L1_WAITER.LADDR = L_WAITER.ADDR
         AND L1_WAITER.KADDR = L_WAITER.KADDR
         AND L1_WAITER.SADDR = O.SADDR
         AND O.USER_NAME != 'SYS'
    )
    SELECT
            v_id
          , v_date
          , OS_LOCKER
          , LOCKER_SCHEMA
          , LOCK_PID
          , OS_WAITER
          , WAITER_SCHEMA
          , WAITER_PID
          , CAST (DBMS_LOB.SUBSTR (SQL_TEXT, 4000) AS VARCHAR (4000)) 
      FROM Q
     WHERE 
            SUBSTR(UPPER(
                           CAST (DBMS_LOB.SUBSTR (SQL_TEXT, 4000) AS VARCHAR (4000))
                          )
                          ,1,7)!='SELECT '                   -- 11g
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'INSERT INTO SYS.AUD$')<=0  -- auditing
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'OBJ#')<=0                  -- 11g
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'AUDIT$')<=0                -- 11
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'TABLE_1_FF')<=0            -- 11
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'DECLARE  TYPE ATTRS_CUR IS REF CURSOR;  M_CUR')<=0            -- 11
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'SDO_GEOR_DDL__TABLE$$')<=0 -- 11
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;')<=0 -- 11
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'DEADLOCK_LOG_LOCKERS')<=0  -- log tables
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'DEADLOCK_LOG_WAITERS')<=0  -- log tables
       AND INSTR(UPPER(RTRIM(Q.SQL_TEXT)), 'IS_SERVERERROR')<=0        -- log trigger which monitor deadlocks
   UNION
    /* table-level locks (TM) and mixed TM/TX TX/TM */
    SELECT  /*+ rule */
        v_id
      , v_date
      , S_LOCKER.OSUSER OS_LOCKER
      , S_LOCKER.USERNAME LOCKER_SCHEMA
      , S_LOCKER.PROCESS LOCKER_PID
      , S_WAITER.OSUSER OS_WAITER
      , S_WAITER.USERNAME WAITER_SCHEMA
      , S_WAITER.PROCESS WAITER_PID
      , 'Table lock (TM): ' || U.NAME || '.' || O.NAME || ' - Mode held: '||
          DECODE(L_LOCKER.LMODE,
            0, 'None', /* same as Monitor */
            1, 'Null', /* N */
            2, 'Row-S (SS)', /* L */
            3, 'Row-X (SX)', /* R */
            4, 'Share', /* S */
            5, 'S/Row-X (SSX)', /* C */
            6, 'Exclusive', /* X */
            '???: ' || to_char(L_LOCKER.LMODE)
          ) || ' / Mode requested: '||
          DECODE(L_WAITER.REQUEST,
            0, 'None', /* same as Monitor */
            1, 'Null', /* N */
            2, 'Row-S (SS)', /* L */
            3, 'Row-X (SX)', /* R */
            4, 'Share', /* S */
            5, 'S/Row-X (SSX)', /* C */
            6, 'Exclusive', /* X */
            '???: '||to_char(RTRIM(L_WAITER.REQUEST))
          )
        SQL_TEXT_WAITER
      FROM
          SYS.VT$OBJ O
        , SYS.USER$ U
        , V$LOCK L_WAITER
        , V$LOCK L_LOCKER
        , V$SESSION S_WAITER
        , V$SESSION S_LOCKER
     WHERE S_WAITER.SID = L_WAITER.SID
       AND L_WAITER.TYPE IN ('TM')
       AND S_LOCKER.sid = L_LOCKER.sid
       AND L_LOCKER.ID1 = L_WAITER.ID1
       AND L_WAITER.REQUEST > 0
       AND L_LOCKER.LMODE > 0
       AND L_WAITER.ADDR != L_LOCKER.ADDR
       AND L_WAITER.ID1 = O.OBJ#
       AND U.USER# = O.OWNER#
   ;

  INSERT INTO tools.deadlock_log_waiters (
      id
    , deadlog_date
    , wait
    , os_user
    , process
    , locker
    , object_owner
    , object_name
    , program)
    SELECT /*+ ordered */
        v_id
      , v_date
      , DECODE(L.REQUEST,0,'NO','YES') WAIT
      , S.OSUSER
      , S.PROCESS
      , S.USERNAME LOCKER
      , U.NAME T_OWNER
      , O.NAME OBJECT_NAME
      , ' '||S.PROGRAM PROGRAM
    FROM
        V$LOCK L
      , SYS.USER$ U
      , SYS.VT$OBJ O
      , V$SESSION S
    WHERE U.USER# = O.OWNER#
       AND S.SID = L.SID
       AND L.ID1 = O.OBJ#
       AND L.TYPE = 'TM'
       AND U.NAME != 'SYS'
       AND O.NAME != 'DEADLOCK_LOG_LOCKERS'
       AND O.NAME != 'DEADLOCK_LOG_WAITERS'
   UNION
    SELECT /*+ ordered */
       v_id
     , v_date
     , DECODE(L.REQUEST,0,'NO','YES') WAIT
     , S.OSUSER
     , S.PROCESS
     , S.USERNAME LOCKER
     , '-'
     , 'Record(s)'
     , ' '||S.PROGRAM PROGRAM
    FROM
        V$LOCK L
      , V$SESSION S
    WHERE S.SID = L.SID
       AND L.TYPE = 'TX'
  ;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    dbms_output.put_line(SQLERRM);
END;
/

The whole 11g changed solution is packed in additional RAR file

What next

I'm perfectly aware that this approach is far from perfect, but for me and a frend of mine, who needed this code ASAP, this is acceptable working solution for a while. When I find some additional not necessary rows are shown in table, I'll just add new filter and it will be quickly fixed.

When (and if!) I get time to re change that code I'll post it. In the meantime I'd like to get some enhancements from someone of you...if you have experience the same problems.
On 12th December I have added a small amount of code according vt$obj which was not clearly described. Thank you for that!

Cheers!

Monday, November 23, 2009

Treat lists as table

For many times I have a situation to use a variable list of varchar2 values as parameter in some procedure or multiple action. It was OK when action was able to be pulled in one script and execute at once, but many times this was not a case, so I have to copy/paste these entries to several places by hand with no or little chance to automate that. And even more, chance to make a mistake was much bigger!
Many of you probably know that Apex has brought PL/SQL solution, which introduced similar functionality through string_to_table function.

Here is how it looks like in Oracle’s case:
declare
  list_arr2 htmldb_application_global.vc_arr2;
begin
  list_arr2 := htmldb_util.string_to_table('a:2:III');
  for i in 1..list_arr2.count loop
    dbms_output.put_line(list_arr2(i));
  end loop;
end;
/

And the result is:
SQL> /
a
2
III

PL/SQL procedure successfully completed.

SQL>

You see that a variable items list is transformed in some kind of table like structure, what satisfied needs for serving in Apex application itself. However, main problem in this case is that you deal with arrays and in a case you want to use them in SQL you must transform them somehow to table.

From this point of view, now I realize that mine solution, which exists several years before Apex has published it’s method, did resolve mine needs completely and may resolve some of yours...in a very similar way like Apex did. I'm not saying that Apex code cannot be fixed to behave like mine, but for that you need source which is not available.

Solution is compound from one type and one small function, which I called list2table:
create or replace type g_tbl_varchar2 is table of varchar2(1000);
/

CREATE OR REPLACE FUNCTION list2table(
                                      p_list IN VARCHAR2,
                                      p_delimiter IN VARCHAR2 default ','
                                       ) RETURN g_tbl_varchar2

/*--------------------------------------------------------------------------------------
NAME    : list2table
PURPOSE : Function returns one column table records from variable items list (like classic Oracle table).

Date    : 05.03.2005.
Author  : Damir Vadas

Remarks : p_delimiter could be any CHAR with length 1

Changes (DD.MM.YYYY, Name, CR/TR#):
------------------------------------------------------------------------------------ */
IS
  v_str VARCHAR2(4000) := p_list || p_delimiter;
  v_cnt NUMBER := length(v_str) - length(REPLACE(v_str, p_delimiter));
  v_chartab g_tbl_varchar2 := g_tbl_varchar2();
BEGIN
  FOR i IN 1 .. v_cnt LOOP
    v_chartab.EXTEND;
    v_chartab(i) := substr(v_str, 1, instr(v_str, p_delimiter) - 1);
    v_str := substr(v_str, instr(v_str, p_delimiter) + 1);
  END LOOP;
  RETURN v_chartab;
EXCEPTION
  WHEN OTHERS THEN
    dbms_output.put_line(SQLERRM);
    raise;
END;
/

Be aware that this code has a parameter p_delimiter but in my case only 1 char length is allowed. Usage is very simple:
SELECT * FROM TABLE(list2table('a,s,d,12,3,4,5'));

Here is a little more complex example which shows how to include where and column name in such a list result. This is condittio sinne qua non for full manipulation as a table:
with q as (
   SELECT my_name FROM TABLE(list2table('a,s,d,12,3,4,5,122,145'))
)
select q.my_name
from q
where q.my_name='d';

MY_NAME
-------
d

With such a functionality I was able to sore values from variable list to one column table and then use them as normal data parameter all around.

Practical usage of this will be shown soon in topic “Disable APPS users dynamically” when I get time to write it.

Monday, November 16, 2009

Some thoughts on Quest’s Toad for Oracle

Last weeks I have been discussed with several guys from Quest, on subject "Toad for Oracle". Initially I have started to chat with Steve Hilker, product management director for Oracle solutions at Quest Software, about some new enhancements in Quest's Toad for Oracle.

First I have suggested Steve to modify way how user can add "new tab" in SQL Editor. In current version only "right click" option from menu is available so I have suggested a little bit more easy way-double click on toolbar. Beside that, if user want to close tab, again action is started with right click and popup menu... So my suggestion was to implement "close button" feature (small "x" on each tab) to close tab with left click on it. I find those two enhancements very easy in implementation and very useful for end users in the same time.

Very soon, I got forwarded answer from  John Pocknell, product manager at Quest Software based out of the European HQ in Maidenhead UK, who has on my surprise, promised to include both features in next Toad release (first one that will come after initial 10.0.0.41)!
Mine next idea was to give better support on one big Toad’s problem–handling Oracle connections. Every Toad user has faced in past problem with “dead connection” … i.e. when some connection was not used for a longer time and then user tried to open some query/table/view ... application in most of the cases crashes with known TOAD error dialog:

All previous versions until recent 10, were really suffering big problems on this part and it was not rare that one broken connection would pull all others opened and the result will be complete crash of Toad. In 10g, stability is, generally speaking, greatly enhanced (I haven’t notice this serial crashes), but with still visible problems in connection part.
For this problem I have one hand made solution. Before starting any action on such a sessions, I would always choose menu option Session|Test Connections (Reconnect). In this way I would always force to use disco/conn method implicitly. Because I do not have Toad’s source code I’m just guessing that this option works in described way…without any official confirmation from Quest. Another restriction on this solution is that it would work only if When closing connections option is set to Rollback, what cause implicit rollback without asking about previous session when reestablishing a new one, what is not so unacceptable to some users. This solution was far from from perfect but has satisfied my and mine needs because I never leave unfinished transaction without immediately proper ending (and long actions I always run through sqlplus not any third party tools).

Mine idea for improvements was to create two kinds of connections "week" which should never be tried to reestablished (i.e. some on wireless or known "time limited connections") and "strong", which should always tried to reestablish current connection (like it is now). In this way user can help Toad to behave more easy in situation when communication is broken what will directly lead to more stable application.
Unfortunately , Steve Hilker, informed me, according information from his development team, that core problem is in Oracle Data Access Components (ODAC) component, Delphi component which Toad use for accessing Oracle, so there is not too much to be done in implementation of mine idea. Steve informed that this problem is analyzed for a longer time with little success and progress. However I’m pretty sure that future Delphi to Microsoft .NET transition (which is already in progress from version 9.x) would improve this part in incoming releases of Toad for Oracle.

HTML Schema Doc Generator

Whenever user need to create documentation based on Oracle database objects, classic problem arise-how to do that efficiently in few steps with acceptable result? Solution comes as out of the box module, named HTML Schema Doc Generator, nice and very useful feature inside Toad for Oracle. Module which is available through Database|Report|HTML Schema Doc Generator Toad’s menu option. It is not so known and famous to many users even experienced one. In core it generates HTML documentation with inner reference and links all around chosen objects area (mostly schema but can be highly customizable) in just a few clicks. This module should use not use only DBAs but any kind of system or any other IT specialist all around the world that deal with Oracle.
From a time when I started to be DBA until today and especially time when I was APPS DBA, I have used that feature quite a lot in next situations:
  • as reference HTML documentation … like some kind of dirty manual (distributing to other parties on project)
  • as reference code snapshot … to have backup for quick reference "what was changed in between"
  • when I want to understand more deeply some source code (this was a case in APPS schema where exists a huge and interesting repository of Oracle most sophisticated programming code to learn).
One standard advice is to use HTML Schema Doc Generator with "one file per object" option in any situation you will use:


Problem

Regardless previously defined option (which really reduce overhead and create proper sized files), when generating huge number of objects (like in APPS schema where you are dealing with cca. 21,000 packages!), because Toad create inner reference of all objects, it use huge amount of memory for that because of quicker reference and editing new coming links. Toad starts to work and when cca 1.9 GB of memory is filled it crashed (out of memory problem) with never finishing any part of the report. However, in even recent 10 versions this operation will never finished because Toad will crash with next error:

This is happening regardless there is still left 1GB on Win32 (like in my case) or even more memory on Win64 desktops/servers with /3GB and /PAE switch in boot.ini in all kind of combination. The problem is not solved even in a case of creation of huge Windows swap file (cca. 4 GB on 4GB real RAM server)! Obviously problem was not a memory but the way this memory is used ... what could be fixed somehow.

This is really a bug!?

When I mentioned this bug to Steve Hilker, he has addressed me to John Dorlon, developer who wrote the Toad module-HTML Schema Doc Generator.
Very soon he has faced the same because bug is reproducible (ha has tested on one of available APPS environment) with system modifications which clearly pointed out that "out of memory" is really not a problem! The problem seems a little bit more complex when John realize that even exception log file is not generated ... what would lead to some quick fixes.

At the end of mail conversation, John said that this fix would rely on recoding part of generation of HTML files (there must be a leak in logic) and will be included as soon he finishes other more important obligations that he has in this moment. Honest and fair enough! This is what I appreciates most-be honest, nobody is perfect!

Workaround

For those who has faced with previously described error, I have workaround in a way to run report against huge schema in several steps. In my case I have decided to split whole process in three runs:
  1. all "non package" objects in APPS schema
  2. all packages A-M
  3. all packages N-Z
After that I merge generated HTML pages by hand in one scope file.

Toad on 64 bit Windows

If you have to use Toad on 64 bit Windows (desktop/server) then the easiest solution is to install in separate $ORACLE_HOME, 32 bit Oracle client (any version). In this way Toad will functioning with no problems.

However installing any other 32 bit software (for any other purpose then mentioned) is highly not recommended because Oracle strongly advice to use 64 bit Oracle software on 64 bit OS (Metalink note: 334528.1-Running 32-bit Oracle Software on 64-bit Windows OS).

For the end

For the end let me express very positive thoughts about all guys from Quest software with whom I was exchanging posts. Regardless their position, they are all open minded and not afraid of discussion about bugs and enhancements on their products in all means. Pretty cool property from my point of view indeed. This assure me to believe in even more bright future of this application.

Cheers!

Friday, November 13, 2009

Blocking session ... detect, unblock, notify


Blocking sessions are very unpleasant event on any database. With bigger number of users (sessions) possibility that one session blocks other is significantly higher. Every DBA is facing problems which could be briefly express in next three points:
  1. How to detect that blocking has come?
  2. How to unblock situation?
  3. How to act proactively (be notified)?
If blocking session occure, DBA has two choice: to ask that user to cancel his work or to manually killed blocking session. In a cases when first rule cannot be applied (long running job or not reachable user) kill session will occurs. Generally speaking killing session can be perform with ALTER SYSTEM DISCONNECT  SESSION …  IMMEDIATE (preferable method) or ALTER SYSTEM KILL SESSION… IMMEDIATE statement. There are some situations when first statement would not be successful so second one can be used. DISCONNECT SESSION is more gentle and cleaner method because it removes killed session entries all around Oracle. KILL SESSION may leave some entries in Oracle instance so you may notice that some killed sessions still have MARKED FOR KILL label. Regardless this session is really gone mentioned label will remain until next restart of instance.

Detecting blocking sessions

One and most common way in recent time, is use Oracle Enterprise Manager or other visual tool (i.e. Quest’s Toad), where all information is served to you without knowing what is behind the scene or where these data are placed. Better (but harder) approach for any DBA is to use sql script that can retrieve data from Oracle database in a classic way and where you have to understand more deeply the problem and Oracle internal functionality. There are many postings which shows variety way of detecting blocking locks with sql code. Me as young DBA has one script which was functioning very well until 3 years ago where I face with Oracle RAC and problems of multi instance world. Then I have to rearrange the logic to support RAC as well non RAC database. This was the moment where following script has been reborn.

/* ---------------------------------------------------------------------------
Filename: sb.sql
CR/TR#  :
Purpose : Find blocking sessions and create release statements for RAC and non RAC environment

Date    : 08.07.2005.
Author  : Damir Vadas

Remarks : Tested on 10g only (should work on 11g as well) 
          run as privileged  user

Changes (DD.MM.YYYY, Name, CR/TR#):
          08.08.2006, Damir Vadas
                      Change the whole logic to accept RAC databases
          11.07.2009, Damir Vadas 
                      tuned a litle bit some queries
--------------------------------------------------------------------------- */

set serveroutput on
set linesize 266
set pagesize 600
set trimout on

declare
  cursor c_blockers IS
    SELECT DISTINCT
       username blocker_user,
       gvb.sid,
       gvs.serial# serial,
       gvb.inst_id,
       gvb.ctime ,
       module,
       decode(gvb.type, 'MR', 'Media_recovery',
                        'RT', 'Redo_thread',
                        'UN', 'User_name',
                        'TX', 'Transaction',
                        'TM', 'Dml',
                        'UL', 'PLSQL User_lock',
                        'DX', 'Distrted_Transaxion',
                        'CF', 'Control_file',
                        'IS', 'Instance_state',
                        'FS', 'File_set',
                        'IR', 'Instance_recovery',
                        'ST', 'Diskspace Transaction',
                        'IV', 'Libcache_invalidation',
                        'LS', 'LogStaartORswitch',
                        'RW', 'Row_wait',
                        'SQ', 'Sequence_no',
                        'TE', 'Extend_table',
                        'TT', 'Temp_table',
                              'Nothing-'
       ) lock_type,
       action
     FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvb.sid=gvs.sid
      AND gvb.inst_id=gvs.inst_id
  ORDER BY CTIME desc
  ;
 
  CURSOR c_w IS
    SELECT  gvw.sid waiter_sid,
            gvw.inst_id waiter_inst_id
    FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvb.sid=gvs.sid
      AND gvb.inst_id=gvs.inst_id
  ORDER BY gvb.CTIME desc
  ;
 
  CURSOR c_waiters (p_blocker_waiter_sid gv$lock.sid%TYPE, p_blockers_waiter_inst_id gv$lock.inst_id%TYPE) IS
    SELECT /* ordered */
      username waiter_user,
      gvw.sid,
      gvs.serial# serial,
      gvw.inst_id,
      gvw.ctime ,
      module,
      decode(gvw.request, 0, 'None',
                         1, 'NoLock',
                         2, 'Row-Share',
                         3, 'Row-Exclusive',
                         4, 'Share-Table',
                         5, 'Share-Row-Exclusive',
                         6, 'Exclusive',
                            'Nothing-'
      ) lock_req,
      action
    FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvw.sid=gvs.sid
      AND gvw.inst_id=gvs.inst_id
      AND gvw.sid=p_blocker_waiter_sid
      AND gvw.inst_id=p_blockers_waiter_inst_id
    ORDER BY CTIME desc
  ;

  i NUMBER := 0;
  bHasAny  boolean;
  bOnTheSameNode boolean;
  l_inst_id BINARY_INTEGER := 0;
  v_sid number := 0;
  v_serial NUMBER := 0;
  v_instance BINARY_INTEGER := 0;
  db_ver                 VARCHAR2(128);
  db_ver2                VARCHAR2(128); 
  s_job_command VARCHAR2 (256);
  s VARCHAR2(1024);
 
  PROCEDURE kill_by_hand is
  BEGIN
    DBMS_OUTPUT.PUT_LINE (chr(9)||'ALTER SYSTEM DISCONNECT  SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');
    DBMS_OUTPUT.PUT_LINE (chr(9)||'ALTER SYSTEM KILL        SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');   
  END; 
 
begin
  DBMS_OUTPUT.ENABLE(200000);
  dbms_utility.db_version(db_ver,db_ver2);
  DBMS_OUTPUT.PUT_LINE('Oracle version: '||db_ver|| ' ('||db_ver2||')');
  DBMS_OUTPUT.PUT_LINE (chr(9));
  bHasAny := FALSE;
  bOnTheSameNode := FALSE;
  FOR v_blockers in c_blockers loop
    bHasAny := TRUE;    
    IF (i=0) THEN                 
      select instance_number into l_inst_id from v$instance;   
      bOnTheSameNode := (v_blockers.inst_id=l_inst_id);
      DBMS_OUTPUT.PUT_LINE (rpad('Blocker',12,' ')||'  '||'  Inst  '||rpad('SID',6,' ')||'  '||rpad('Serial',6,' ')||'  '||'  '||rpad('[sec]',6,' ')||' '||
                            lpad('Lock Type',10,' ')||' '|| lpad('Module',13,' ')
                           );
      dbms_output.put_line(rpad('-',120,'-'));
    END IF;
    v_sid := v_blockers.sid;
    v_serial := v_blockers.serial;
    v_instance := v_blockers.inst_id ;
    DBMS_OUTPUT.PUT_LINE(LPAD(to_char(i+1),3,' ')||'. '||rpad(v_blockers.blocker_user,12,' ')||'  '||rpad(v_blockers.inst_id,4,' ')||'  '||rpad(v_blockers.SID,8,' ' )||'  '||lpad(v_blockers.serial,5,' ')||'  '||lpad(v_blockers.CTIME,7,' ')||' '||
               lpad(v_blockers.lock_type,13,' ')||lpad(v_blockers.module,15,' ')
               );
    FOR v_w in c_w LOOP
      FOR v_waiters IN c_waiters (v_w.waiter_sid,v_w.waiter_inst_id) LOOP
        DBMS_OUTPUT.PUT_LINE (lpad(chr(9),1)||rpad(v_waiters.waiter_user,10,' ')||'  '||rpad(to_char(v_waiters.inst_id),4,' ')||rpad(to_char(v_waiters.sid),4,' ')|| ' '||lpad(to_char(v_waiters.serial),12,' ')||' ' ||' '||
                    ' ' ||lpad(to_char(v_waiters.ctime),6,' ')||' '|| lpad(to_char(v_waiters.lock_req),13,' ') || ' '|| lpad(to_char(v_waiters.module),15,' ')
                    );     
      END LOOP;
    END LOOP;
    IF bHasAny THEN
      i:=i+1;
    END IF;
  END LOOP;
  IF bHasAny THEN
    DBMS_OUTPUT.PUT_LINE (chr(9));
    DBMS_OUTPUT.PUT_LINE ('To kill first from the list, perform: ');
    DBMS_OUTPUT.PUT_LINE (chr(9));
    IF bOnTheSameNode THEN
      DBMS_OUTPUT.PUT_LINE (chr(9)||'exec dba_tools.safe_kill('||v_sid||','||v_serial||');');
      DBMS_OUTPUT.PUT_LINE (chr(9));
      kill_by_hand;
    ELSE
      IF db_ver like '10%' THEN
        s := '''begin execute immediate ''''ALTER SYSTEM KILL SESSION '''''''''||v_sid||','||v_serial||''''''''' IMMEDIATE''''; end; ''';
        dbms_output.put_line(s);
        DBMS_OUTPUT.PUT_LINE (chr(9));
        s_job_command := 'declare'|| chr(10)||
                         'v_job binary_integer;'|| chr(10)||
                         'begin'|| chr(10)||
                         'DBMS_JOB.submit (job=>v_job'||chr(10)||
                         ',what=>'||s||chr(10)||
                         ',instance=>'||v_instance||chr(10)||
                         '); '||chr(10)||
                         'DBMS_OUTPUT.PUT_LINE(to_char(v_job));'||chr(10)||
                         'commit;'||chr(10)||
                         'end;'|| chr(10) ||
                         '/'
        ;
        dbms_output.put_line(s_job_command);
      ELSIF db_ver like '11%' THEN
        DBMS_OUTPUT.PUT_LINE (chr(9)||'exec dba_tools.safe_kill('||v_sid||','||v_serial||','||v_instance||');');
        DBMS_OUTPUT.PUT_LINE (chr(9));
        DBMS_OUTPUT.PUT_LINE (chr(9)||'ALTER SYSTEM DISCONNECT  SESSION '''||v_sid||','||v_serial||',@'||v_instance||''' IMMEDIATE;');
        DBMS_OUTPUT.PUT_LINE (chr(9)||'ALTER SYSTEM KILL        SESSION '''||v_sid||','||v_serial||',@'||v_instance||''' IMMEDIATE;');   
      ELSE
        dbms_output.put_line('Try to logon as priveledged user on node ('||v_instance||') and perform:');
        kill_by_hand;
      END IF;
    END IF;
  ELSE
    dbms_output.put_line('No blocking users or process!');
  END IF;
end;
/

Remove blocking session

Output of this script not only detects blocking sessions but also shows how to remove them on RAC and non RAC environment. The other benefit of my approach is sb.sql script detects Oracle versions and for 10g on RAC create script which kill blocking session independently of the node where DBA connects. Here is an example of on RAC database (worst case scenario):
15:54:05 DAMIRV@test3> @sb
Oracle version: 10.1.0.4.0 (10.1.0.4.0)
      
Blocker         Inst  SID     Serial    [sec]   Lock Type        Module
-----------------------------------------------------------------------------------
1. SCOTT        1     153        7087       93   Transaction       SQL*Plus
        JOHN        3   133          2781        5     Exclusive        SQL*Plus 
        MARRY       2   154         58928       75     Exclusive        SQL*Plus
      
To kill first from the list, perform:
      
ALTER SYSTEM DISCONNECT  SESSION '153,7087' IMMEDIATE;
ALTER SYSTEM KILL        SESSION '153,7087' IMMEDIATE;
      
declare
v_job binary_integer;
begin
DBMS_JOB.submit (job=>v_job
,what=>'begin execute immediate ''ALTER SYSTEM KILL SESSION ''''153,7087'''' IMMEDIATE''; end; '
,instance=>1
);
DBMS_OUTPUT.PUT_LINE(to_char(v_job));
commit;
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.64
15:54:15 DAMIRV@test3>

As you can see user SCOTT (instance number 1)  is blocking two sessions, JOHN (connected on instance 3) and MARRY (on instance 2) for 5 and 75 seconds. DBA has connected with sqlplus tool on instance 3 (look at the sql prompt-DAMIRV@test3).
To use ordinary DISCONNECT SESSION command DBA must be on the same node as blocker, what sometimes is not easy to produce if RAC is driven with some load balancing system which connect to node with most free resources. For such a cases, beside classic command, script generate additional output where DBA kill session with job. To perform that solution copy/paste code that starts from “declare” . 
For the same situation Oracle 11g script will produce output like:
ALTER SYSTEM KILL        SESSION '153,7087,@1' IMMEDIATE;
what is enhancement in 11g because session can be killed without submitting a job but only declaring instance number, regardless on what node you are logged on.

Act proactively (be notified)!

Proactively here is not meant to avoid blocking sessions (this is subject which couldn’t be covered in several posts) but to find out when this occur before users start to complaint. To be fully informed I have used Oracle mail capability. In Oracle 10g UTL_MAIL is not installed by default. So from $ORACLE_HOME/rdbms/admin ($ORACLE_HOME\RDBMS\admin on Windows) directory, install as sysdba @utlmail.sql and @prvtmail.plb packages.
Examples here require a working SMTP mail server. Customize scripts to point the IP Address of your SMTP server. The SMTP port is usually 25. Configuring an SMTP server is not an Oracle function and thus out of scope for this site. In our case this is mail.iskon.hr which I use in Croatia. Then create check_bloking_session procedure, which is based on previous sb.sql code with adition and modification to comply mail feature.
CREATE OR REPLACE procedure TOOLS.check_blocking_session (
/* ---------------------------------------------------------------------------
 Filename: check_bloking_session.prc
 CR/TR#  :
 Purpose : Find blocking sessions and create release statements for RAC and non RAC environement 

 Date    : 18.07.2009.
 Author  : Damir Vadas

 Remarks : Tested on 10g only (should work on 11g as well and lower versions)
           To be able to send mail functionality "utl_mail" must be installed and granted to user
           Put in Oracle job with window of 5 minutes (default)

 Changes (DD.MM.YYYY, Name, CR/TR#):
--------------------------------------------------------------------------- */

                                        p_sender          VARCHAR2 DEFAULT 'dba@noreply.com',
                                        p_recipients      VARCHAR2 DEFAULT 'damir.vadas@gmail.com',
                                        p_subject         VARCHAR2 DEFAULT 'Blocking session occured!',
                                        p_smtp_out_server VARCHAR2 DEFAULT 'mail.iskon.hr')

AS
  cursor c_blockers IS
    SELECT DISTINCT
       username blocker_user,
       gvb.sid,
       gvs.serial# serial,
       gvb.inst_id,
       gvb.ctime ,
       module,
       decode(gvb.type, 'MR', 'Media_recovery',
                        'RT', 'Redo_thread',
                        'UN', 'User_name',
                        'TX', 'Transaction',
                        'TM', 'Dml',
                        'UL', 'PLSQL User_lock',
                        'DX', 'Distrted_Transaxion',
                        'CF', 'Control_file',
                        'IS', 'Instance_state',
                        'FS', 'File_set',
                        'IR', 'Instance_recovery',
                        'ST', 'Diskspace Transaction',
                        'IV', 'Libcache_invalidation',
                        'LS', 'LogStaartORswitch',
                        'RW', 'Row_wait',
                        'SQ', 'Sequence_no',
                        'TE', 'Extend_table',
                        'TT', 'Temp_table',
                              'Nothing-'
       ) lock_type,
       action
     FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvb.sid=gvs.sid
      AND gvb.inst_id=gvs.inst_id
  ORDER BY CTIME desc
  ;
 
  CURSOR c_w IS
    SELECT  gvw.sid waiter_sid,
            gvw.inst_id waiter_inst_id
    FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvb.sid=gvs.sid
      AND gvb.inst_id=gvs.inst_id
  ORDER BY gvb.CTIME desc
  ;
 
  CURSOR c_waiters (p_blocker_waiter_sid gv$lock.sid%TYPE, p_blockers_waiter_inst_id gv$lock.inst_id%TYPE) IS
    SELECT /* ordered */
      username waiter_user,
      gvw.sid,
      gvs.serial# serial,
      gvw.inst_id,
      gvw.ctime ,
      module,
      decode(gvw.request, 0, 'None',
                         1, 'NoLock',
                         2, 'Row-Share',
                         3, 'Row-Exclusive',
                         4, 'Share-Table',
                         5, 'Share-Row-Exclusive',
                         6, 'Exclusive',
                            'Nothing-'
      ) lock_req,
      action
    FROM gv$lock gvb, gv$lock gvw, gv$session gvs
    WHERE (gvb.id1, gvb.id2) in (
                                 SELECT id1, id2 FROM gv$lock WHERE request=0
                                 INTERSECT
                                 SELECT id1, id2 FROM gv$lock WHERE lmode=0
                                )
      AND gvb.id1=gvw.id1
      AND gvb.id2=gvw.id2
      AND gvb.request=0
      AND gvw.lmode=0
      AND gvw.sid=gvs.sid
      AND gvw.inst_id=gvs.inst_id
      AND gvw.sid=p_blocker_waiter_sid
      AND gvw.inst_id=p_blockers_waiter_inst_id
    ORDER BY CTIME desc
  ;

  bHasAny  boolean;
  i        NUMBER := 0;
  v_sid    NUMBER := 0;
  v_serial NUMBER := 0;
  v_instance BINARY_INTEGER := 0;
  db_ver            VARCHAR2(128);
  db_ver2           VARCHAR2(128); 
  s_job_command     VARCHAR2 (4000);
  s_mail_text       VARCHAR2 (4000);
  s_subject         VARCHAR2 (256);
  s_smtp_out_server VARCHAR2 (256);
  s_host_name       v$instance.host_name%TYPE;
  s_db_name         v$database.name%TYPE;
  s_instance_name   v$instance.instance_name%TYPE;
  s                 VARCHAR2(1024);
 
  PROCEDURE add_message_line (p_text VARCHAR2) is
  BEGIN
    s_mail_text := s_mail_text || p_text || chr(10);    
  END;
begin
  dbms_utility.db_version(db_ver,db_ver2);
  add_message_line ('Oracle version: '||db_ver|| ' ('||db_ver2||')');
  bHasAny := FALSE; 
  FOR v_blockers in c_blockers loop
    bHasAny := TRUE;    
    IF (i=0) THEN                 
      add_message_line (rpad('Blocker',12,' ')||'  '||'  Inst  '||rpad('SID',6,' ')||'  '||rpad('Serial',6,' ')||'  '||'  '||rpad('[sec]',6,' ')||' '||
                        lpad('Lock Type',10,' ')||' '|| lpad('Module',13,' ')
                        );
      add_message_line (rpad('-',120,'-'));
    END IF;
    v_sid := v_blockers.sid;
    v_serial := v_blockers.serial;
    v_instance := v_blockers.inst_id ;
    add_message_line (LPAD(to_char(i+1),3,' ')||'. '||rpad(v_blockers.blocker_user,12,' ')||'  '||rpad(v_blockers.inst_id,4,' ')||'  '||rpad(v_blockers.SID,8,' ' )||'  '||lpad(v_blockers.serial,5,' ')||'  '||lpad(v_blockers.CTIME,7,' ')||' '||
                      lpad(v_blockers.lock_type,13,' ')||lpad(v_blockers.module,15,' ')
                     );
    FOR v_w in c_w LOOP
      FOR v_waiters IN c_waiters (v_w.waiter_sid,v_w.waiter_inst_id) LOOP
        add_message_line (lpad(' ',5)||rpad(v_waiters.waiter_user,10,' ')||'  '||rpad(to_char(v_waiters.inst_id),4,' ')||rpad(to_char(v_waiters.sid),4,' ')|| ' '||lpad(to_char(v_waiters.serial),12,' ')||' ' ||' '||
                    ' ' ||lpad(to_char(v_waiters.ctime),6,' ')||' '|| lpad(to_char(v_waiters.lock_req),13,' ') || ' '|| lpad(to_char(v_waiters.module),15,' ')
                    );     
      END LOOP;
    END LOOP;
    IF bHasAny THEN
      i:=i+1;
    END IF;
  END LOOP;
  IF bHasAny THEN
    SELECT instance_name, host_name INTO s_instance_name, s_host_name FROM v$instance;
    SELECT name INTO s_db_name FROM v$database;
    s_subject := p_subject || '  '|| s_db_name || '('|| s_instance_name || ') on ' || s_host_name ||'';
    add_message_line ('');
    add_message_line ('To kill first from the list, perform: ');
    add_message_line ('');
    IF db_ver like '10%' THEN
      add_message_line ('NON RAC (or RAC logged on that node):');
      add_message_line ('---------------------------------');
      add_message_line ('ALTER SYSTEM DISCONNECT  SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');
      add_message_line ('ALTER SYSTEM KILL        SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');
      add_message_line ('');
      add_message_line ('');
      s := '''begin execute immediate ''''ALTER SYSTEM KILL SESSION '''''''''||v_sid||','||v_serial||''''''''' IMMEDIATE''''; end; ''';    
      add_message_line ('RAC (logged on any node) :');
      add_message_line ('--------------------------');
      s_job_command := 'declare'|| chr(10)||
                       '  v_job binary_integer;'|| chr(10)||
                       'begin'|| chr(10)||
                       '  DBMS_JOB.submit (job     =>v_job'||chr(10)||
                       '                  ,what    =>'||s||chr(10)||
                       '                  ,instance=>'||v_instance||chr(10)||
                       '  ); '||chr(10)||
                       '  commit;'||chr(10)||
                       'end;'|| chr(10) ||
                       '/'
      ;
      add_message_line (s_job_command);
    ELSIF db_ver like '11%' THEN
      add_message_line ('');
      add_message_line ('ALTER SYSTEM DISCONNECT  SESSION '''||v_sid||','||v_serial||',@'||v_instance||''' IMMEDIATE;');
      add_message_line ('ALTER SYSTEM KILL        SESSION '''||v_sid||','||v_serial||',@'||v_instance||''' IMMEDIATE;');   
    ELSE
      add_message_line ('');
      add_message_line ('Untested version... Try to logon as priveledged user on node ('||s_host_name||') and perform:');
      add_message_line ('ALTER SYSTEM DISCONNECT  SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');
      add_message_line ('ALTER SYSTEM KILL        SESSION '''||v_sid||','||v_serial||''' IMMEDIATE;');   
    END IF;
    IF p_smtp_out_server != 'mail.iskon.hr' THEN
      EXECUTE IMMEDIATE 'ALTER SESSION SET smtp_out_server = '''||p_smtp_out_server||'''';
    END IF;
    add_message_line ('');
    add_message_line ('This is automated message so, do not reply to this mail.');
    add_message_line ('');
    select to_char(sysdate, 'dd.mm.yyyy hh24:mi:ss') into s from dual;
    UTL_MAIL.send(sender => p_sender,
              recipients => p_recipients,
                 subject => s_subject,
                 message => s_mail_text,
               mime_type => 'text/plain; charset=us-ascii');    
  END IF;
end check_blocking_session;
/

Then you have to create Oracle job, which will repeat checking periodically (in our case 5 minutes):
DECLARE
  X NUMBER;
BEGIN
  SYS.DBMS_JOB.SUBMIT (
    job => X
   ,what => 'check_blocking_session;'
   ,next_date => to_date('13/11/2009 13:09:02','dd/mm/yyyy hh24:mi:ss')
   ,interval => 'SYSDATE+5/1440'
   ,no_parse => FALSE
  );
  SYS.DBMS_OUTPUT.PUT_LINE('Job Number is: ' || to_char(x));
   COMMIT;
END;
/

Customization

With procedure parameters you might use this functionality in different scenarios. In our case, mail will be sent whenever single blocking occurs. Here are some covered situations:
  • You may have some actions with partially acceptable blocking sessions (for some time or by time when they occur, for example especially at night) so you do not want to be informed about that.
  • You may check how long blocking is happening-is it longer then some period you define as threshold.
  • Is blocking caused by some higher priority user, what is perfectly acceptable to your situation.
For all these situations you may reconsider to change when mail should be sent. However, please understand that this is not my advice. I'll rather recheck code that is causing blocks or reschedule problematic events.

Monday, November 9, 2009

Check database instance/listener is running (old fashion way)

There are many implementation of checking status of database instance and listeners. However there are more or less successful works, I had make mine script which has suite mine needs 100% on many database projects. Mine solution has following characteristics:
  • RAC and non RAC environment are covered (ASM instance are covered as well as "ordinary" Oracle database instances)
  • Support for unlimited number of instances on the same server.
  • In /home/oracle directory, $1.env file must exist. This is pure environment initialization file so that Oracle can works with wondered instance. In my case I have several env files - for each instance separate one. Example:
    /home/oracle/ASM.env
    /home/oracle/PROD.env
    /home/oracle/TEST.env
    This initialization is on the start of the script so change it to apply your need.
  • Script is run with only one parameter database name (i.e. ASM, PROD, DEV)
  • Script is looping in endless loop so checking is permanent regardless instance/listener is up or down
  • CHECK_PERIOD is variable where you define number of seconds before next loop of checking is started. Default is 200 seconds (adjust it as you need).
  • For the first time when start mail is sent to user defined with MAIL_TO and MAIL_CC variable with current status. This is very interesting to me to see when someone has started the server. To use mail feature mail command must exists on Linux (what is enabled by default installation).
  • When state of instance/listener is change to "DOWN", mail is sent immediately. Here is an example of mail for database:


    and for listener:

    where real names are deleted because of security reasons.
  • After that, when SEND_AGAIN_PERIOD is reached, (default 5*CHECK_PERIOD seconds) mail will be resend again as another warning. It will stop sending mail only when state change to UP! With easy calculation we come to at least 2 mail per instance/listener per hour in worst case scenario. With ability to place indefinite number of people in warning mail (MAIL_CC), this is fair enough to make a notice official and seen by everyone! Adjust timing and frequency to your need.
  • When listener/instance change status from DOWN to UP only one mail is sent, saying that instance/listener is started again. This mail will happened only once:



    After that, next mail will be sent only when instance/listener state change to DOWN. This make checking proactive to send mail only when things went bad!
  • With LAST_DB_ERROR_IN_SECONDS and LAST_LSNR_ERROR_IN_SECONDS variables script control differently listener and instance status.
  • This intelligent mail sending (according UP/DOWN state and vice verse) is a feature that I haven't seen in other script.
  • Script can be, of course, automatically started if you put in /etc/rc.d/rc.local file. Here is an example:
    "su - oracle -c "/oracle/export/rman_backup/scripts/check_instance ASM &"
    "su - oracle -c "/oracle/export/rman_backup/scripts/check_instance TEST &"
    "su - oracle -c "/oracle/export/rman_backup/scripts/check_instance PROD1 &"
    "su - oracle -c "/oracle/export/rman_backup/scripts/check_instance DEV &"
  •  You can start manually checking scripts as well, using screen Linux capability:
    screen
    check_instance ASM &
    check_instance PROD1 &
    check_instance TEST &
    and then Ctrl+A+D (inside screen) to exit screen and leave scripts running.
  • Each script can be manually terminated and leave other instance/listener checking active with no influence.
  • This is Linux bash script and it will not run on Windows
Here is the code of script:
############################################################################
# NAME    : check_instance ver 1.1
#
# PURPOSE : check_instance and listener status on RAC and non RAC configuration
#
# Date    : 08.09.2008
# Author  : Damir Vadas
#
# Remarks : Should be run on each node in RAC
#           mail support shouold be avilable on OS for oracle user
#           example of ussage:
#           screen
#                /u01/rman_backup/scripts/check_instance ASM &
#                /u01/rman_backup/scripts/check_instance DB1 &
#                /u01/rman_backup/scripts/check_instance DB2 &
#                /u01/rman_backup/scripts/check_instance DB3 &
#
#           and then Ctrl+A+D (inside screen) to exit screen
#
#           or (permanent solution-boot activation)
#           in /etc/rc.d/rc.local
#                su - oracle -c "/oracle/export/rman_backup/scripts/check_instance ASM &"
#                ...
#                su - oracle -c "/oracle/export/rman_backup/scripts/check_instance DB3 &"
#        
# Changes (DD.MM.YYYY, Name, CR/TR#):
#          09.09.2008, Damir Vadas
#                      Changed wrong message in mail
#          10.09.2008, Damir Vadas
#                      Changed logic to send positive mail when instance was previously down
#                      All instances are checked by opened status!
#                      Send_mail is now in function
#          09.11.2009, Damir Vadas
#                      Added listener check and second script parameter as listener port
#############################################################################

# mail definition
MAIL_TO="damir.vadas@gmail.hr"
MAIL_CC="-c user1_cc@gmail.com user2_cc@gmail.com"

function bad_usage ()
{
  echo "Usage:"
  echo "$0 '$1' , where parameter is name of database on this node (i.e. DB1, ASM, DB2 etc.)"
  echo "Error raised (retvalue=$retvalue) for passed parameter '$1' !" | mail -s "`date`  check_instance start script error!"  ${MAIL_TO} ${MAIL_CC}
  exit 1
}

if [ "$1" = "" ]; then
  bad_usage
else
  DB_NAME=$1
fi

. /home/oracle/$1.env
retvalue=$?
if [ $retvalue -ne 0 ]; then
  bad_usage
fi

# initial variable settings for listener as well for database
LAST_DB_ERROR_IN_SECONDS=0
LAST_LSNR_ERROR_IN_SECONDS=0

# Default time interval of checking (seconds)
CHECK_PERIOD=200

# Every 1000 seconds check again when last instance/listener state has changed
let SEND_AGAIN_PERIOD=5*${CHECK_PERIOD}

function f_send_mail ()
{ if [ "$2" = "DB" ]; then
    if [ "$1" = "DOWN" ]; then
      if [ ${LAST_DB_ERROR_IN_SECONDS} -eq 0 ]; then
        # this is firt time for warning mail...
        echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} is DOWN!" ${MAIL_TO} ${MAIL_CC}
      else
        # n-th warning mail is going after multiple SEND_AGAIN_PERIOD seconds
        let SEND_AGAIN=${LAST_DB_ERROR_IN_SECONDS}%${SEND_AGAIN_PERIOD}
        if [ ${SEND_AGAIN} -eq 0 ]; then
          echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} is DOWN at least ${LAST_DB_ERROR_IN_SECONDS} seconds!"  ${MAIL_TO}  ${MAIL_CC}
        fi
      fi
      let LAST_DB_ERROR_IN_SECONDS=${LAST_DB_ERROR_IN_SECONDS}+${CHECK_PERIOD}
    else
      if [ "$1" = "UP" ]; then 
        echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} is UP!" ${MAIL_TO}  ${MAIL_CC}
        LAST_DB_ERROR_IN_SECONDS=0
      else
        echo "Bad first parameter passed to 'f_send_mail' function ($1 - DB). Exiting..."
        exit 1
      fi
    fi
  else
    if [ "$2" = "LSNR" ]; then 
      if [ "$1" = "DOWN" ]; then
        if [ ${LAST_LSNR_ERROR_IN_SECONDS} -eq 0 ]; then
          # this is firt time for warning mail...
          echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} LISTENER is DOWN!" ${MAIL_TO} ${MAIL_CC}
        else
          # n-th warning mail is going after multiple SEND_AGAIN_PERIOD seconds
          let SEND_AGAIN=${LAST_LSNR_ERROR_IN_SECONDS}%${SEND_AGAIN_PERIOD}
          if [ ${SEND_AGAIN} -eq 0 ]; then
            echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} LISTENER is DOWN at least ${LAST_LSNR_ERROR_IN_SECONDS} seconds!"  ${MAIL_TO}  ${MAIL_CC}
          fi
        fi
        let LAST_LSNR_ERROR_IN_SECONDS=${LAST_LSNR_ERROR_IN_SECONDS}+${CHECK_PERIOD}
      else
        if [ "$1" = "UP" ]; then 
          echo "This is automated message so, please do not reply!" | mail -s "`date`  ${ORACLE_SID} LISTENER is UP!" ${MAIL_TO}  ${MAIL_CC}
          LAST_LSNR_ERROR_IN_SECONDS=0
        else
          echo "Bad first parameter passed to 'f_send_mail' function ($1 - LSNR). Exiting..."
          exit 1
        fi
      fi
    else
      echo "Bad second parameter passed to 'f_send_mail' function ("$1" , "$2"). Exiting..."
      exit 1     
    fi
  fi
}

while true; do
  pslist="`ps -ef | grep pmon`"
  echo  ${pslist} | grep  "pmon_${ORACLE_SID}"  >> /dev/null 2>&1
  retvalue=$?
  # echo "retval=$retvalue"
  if [ $retvalue -eq 0  ] ; then
    INSTANCE_STATUS=`echo "select 'xX '||status from v\\$instance;" | sqlplus -s "/ as sysdba" | grep "xX" | awk '{print $2}'`
    # echo $INSTANCE_STATUS
    if [ "$DB_NAME" = "ASM" ]; then
      if [ "$INSTANCE_STATUS" = "STARTED" ]; then
        if [ ${LAST_DB_ERROR_IN_SECONDS} -gt 0 ]; then
          # previous LAST_DB_ERROR_IN_SECONDS > 0 => instance was down
          # let us send mail about good news ...
          f_send_mail "UP" "DB"
        fi
      else
        f_send_mail "DOWN" "DB"
      fi
    else 
      # non ASM instances-real Oracle instances!
      if [ "$INSTANCE_STATUS" = "OPEN" ]; then
        if [ ${LAST_DB_ERROR_IN_SECONDS} -gt 0 ]; then
          # previous LAST_DB_ERROR_IN_SECONDS > 0 => instance was down
          # let us send mail about good news ...
          f_send_mail "UP" "DB"
        fi
      else
        f_send_mail "DOWN" "DB"
      fi
    fi
  else
    f_send_mail "DOWN" "DB"
  fi
  # check listener part
  ps -ef | grep LISTENER | grep -v grep |grep _${DB_NAME}_
  retvalue=$?
  if [ $retvalue -eq 0  ] ; then
    if [ ${LAST_LSNR_ERROR_IN_SECONDS} -gt 0 ]; then
      f_send_mail "UP" "LSNR"
    fi
  else
    # no listener
    f_send_mail "DOWN" "LSNR"
  fi
  sleep ${CHECK_PERIOD}
done

In mine case listener has always ${DB_NAME} in it's name and has capital letters. Here are some examples of listener on my environment:
ora.node2.LISTENER_PROD_NODE2.lsnr
ora.node3.LISTENER_PROD_NODE3.lsnr
ora.node1.LISTENER_ASM_NODE1.lsnr
ora.node4.LISTENER_DEV_NODE4.lsnr

If you have different situation with listener names, apply it to your need and change line.
ps -ef | grep LISTENER | grep -v grep |grep _${DB_NAME}_

Listener checking may be done in a different way. With dummy user which connect to real ${ORACLE_SID}. Implementation should be like following:
create user dummy identified by dummy;
grant create session to dummy;
It is important to grant only create session privilege and nothing else because of security reasons. The rest is to check status of sqlplus command with such an user:
sqlplus dummy/dummy@${ORACLE_SID}

But this approach need aditional Oracle user (sometimes it is not possible to do that), additional logic in script (ASM instance has no users and placing sys pasword in script is not acceptable)...so previously describe solution looks to me more easy and enough correct in the same time.

Hope this will help someone to build even more sophisticated chack_instance script base on my "old fashioned way"-manual codding!

Here I call someone to show windows version of such an script...

Welcome!