Monday, February 28, 2011

How to ... redo logs generation?

All around Internet there are many articles which says how to narrow redo log generation in Oracle. However I still found out my self in position to ask what is the real cause of redo log generation. In this post I'll try to answer all aspects of the cause redo log generation. This subject is more interesting for those who are running stand by databases which redo log transport on remote location may caused many problems.

Who is generating redo logs now?

One of the first question, which cover firefighter is: "Who is generating redo logs in this moment". Idea is to determine action on ad hoc basis. For that I use following script, which I call top_redo.sql:
/* -----------------------------------------------------------------------------
 Filename: top_redo.sql
 
 CR/TR#  : 

 Purpose : Shows current redo logs generation info (RAC-non RAC environment)
           
 Date    : 12.08.2008.
 Author  : Damir Vadas, damir.vadas@gmail.com
 
 Remarks : run as privileged user

 Changes (DD.MM.YYYY, Name, CR/TR#):
--------------------------------------------------------------------------- */
col machine for a15
col username for a10
col redo_MB for 999G990 heading "Redo |Size MB"
column sid_serial for a13;

select b.inst_id, 
       lpad((b.SID || ',' || lpad(b.serial#,5)),11) sid_serial, 
       b.username, 
       machine, 
       b.osuser, 
       b.status, 
       a.redo_mb  
from (select n.inst_id, sid, 
             round(value/1024/1024) redo_mb
        from gv$statname n, gv$sesstat s
        where n.inst_id=s.inst_id
              and n.name = 'redo size'
              and s.statistic# = n.statistic#
        order by value desc
     ) a,
     gv$session b
where b.inst_id=a.inst_id
  and a.sid = b.sid
and   rownum <= 30
;

PROMPT Top 30 from gv$sesstat view according generated redo logs
Result is something like:
SQL> @top_redo
                                                                                               Redo
   INST_ID SID_SERIAL    USERNAME   MACHINE         OSUSER                         STATUS    Size MB
---------- ------------- ---------- --------------- ------------------------------ -------- --------
         1   788,    1              iis1            oracle                         ACTIVE      2,073
         4   788,    1              iis4            oracle                         ACTIVE      1,928
         1   792,    1              iis1            oracle                         ACTIVE      1,168
         1   791,    1              iis1            oracle                         ACTIVE      1,149
         3   788,    1              iis3            oracle                         ACTIVE      1,111
         4   792,    1              iis4            oracle                         ACTIVE      1,092
         1   785,    1              iis1            oracle                         ACTIVE      1,064
         4   791,    1              iis4            oracle                         ACTIVE      1,064
         3   792,    1              iis3            oracle                         ACTIVE        757
         3   791,    1              iis3            oracle                         ACTIVE        738
         3   785,    1              iis3            oracle                         ACTIVE        436
         4   785,    1              iis4            oracle                         ACTIVE        411
         1   764,    4   SYS        iis1            oracle                         ACTIVE        340
         1   737,61477   DBSNMP     iis1            oracle                         ACTIVE        117
         3   703,33361   DBSNMP     iis3            oracle                         ACTIVE        113
         4   677,30159   DBSNMP     iis4            oracle                         ACTIVE         86
         4   795,    1              iis4            oracle                         ACTIVE         81
         1   795,    1              iis1            oracle                         ACTIVE         77
         4   794,    1              iis4            oracle                         ACTIVE         76
         3   795,    1              iis3            oracle                         ACTIVE         75
         1   794,    1              iis1            oracle                         ACTIVE         74
         3   794,    1              iis3            oracle                         ACTIVE         70
         1   645, 5393   ANPI       USR\APINTARIC   apintaric                      INACTIVE       50
         3   758,    5              iis3            oracle                         ACTIVE         39
         3   755,    2              iis3            oracle                         ACTIVE         24
         3   754,    2              iis3            oracle                         ACTIVE         22
         3   756,    2              iis3            oracle                         ACTIVE         21
         3   757,    3              iis3            oracle                         ACTIVE         21
         4   774, 5835   JAGO       CLT\JGOLUZA     jgoluza                        INACTIVE       10
         1   619,61303   LIMI       NIO\LMIHALIC    lmihalic                       INACTIVE        9

30 rows selected.

Top 30 from gv$sesstat view according generated redo logs
SQL> 
If you want to concentrate on real oracle users (avoid core Oracle processes in result) place next condition in outer where clause:
and b.username is not null

When and how many redo logs generation occurred?

Beside current analyze in many times wider analyze/compare is even more interesting. So questions like:
  • When do we have most of redo log generation?
  • Where was the peak of log generation?
  • Did we have any "strange" redo log generation?
need a little different approach-query v$log_history view. It holds historic data which retention period is initially controlled with MAXLOGHISTORY, defined while creating database (fixed not changeable without recreation of control file) and CONTROL_FILE_RECORD_KEEP_TIME which is changeable. In mine case it was set to 31 days (exact number of days for longest month):
SQL> show parameter CONTROL_FILE_RECORD_KEEP_TIME

NAME_COL_PLUS_SHOW_PARAM       TYPE        VALUE_COL_PLUS_SHOW_PARAM
------------------------------ ----------- ---------------------------------------------
control_file_record_keep_time  integer     31
Script to gather data through mentioned period looks like.
/* -----------------------------------------------------------------------------
 Filename: rl.sql
 
 CR/TR#  : 

 Purpose : redo logs distribution per hours on each day ...

 Date    : 07.08.2007.
 Author  : Damir Vadas, damir.vadas@hypo-alpe-adria.com
 
 Remarks : run as privileged user

 Changes (DD.MM.YYYY, Name, CR/TR#):
--------------------------------------------------------------------------- */
set pagesize 120;
set linesize 200;
col day for a8;
spool rl.txt
PROMPT Archive log distribution per hours on each day ...

select 
  to_char(first_time,'YY-MM-DD') day,
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'00',1,0)),'999') "00",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'01',1,0)),'999') "01",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'02',1,0)),'999') "02",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'03',1,0)),'999') "03",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'04',1,0)),'999') "04",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'05',1,0)),'999') "05",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'06',1,0)),'999') "06",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'07',1,0)),'999') "07",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'08',1,0)),'999') "08",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'09',1,0)),'999') "09",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'10',1,0)),'999') "10",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'11',1,0)),'999') "11",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'12',1,0)),'999') "12",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'13',1,0)),'999') "13",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'14',1,0)),'999') "14",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'15',1,0)),'999') "15",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'16',1,0)),'999') "16",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'17',1,0)),'999') "17",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'18',1,0)),'999') "18",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'19',1,0)),'999') "19",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'20',1,0)),'999') "20",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'21',1,0)),'999') "21",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'22',1,0)),'999') "22",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'23',1,0)),'999') "23",
  COUNT(*) TOT
from v$log_history
group by to_char(first_time,'YY-MM-DD')
order by day
;
Result looks like:
SQL>@rl
Archive log distribution per hours on each day ...

DAY      00   01   02   03   04   05   06   07   08   09   10   11   12   13   14   15   16   17   18   19   20   21   22   23     TOT
-------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- -----
11-01-14    0    0   23   16   17   16   16   16   22   39   23   18   22   18   18   18   22   18   19   16   19   16   16   17   425
11-01-15   24   23   17   18   21   16   16   16   22   18   20   19   19   18   18   21   20   18   20   16   18   16   16   17   447
11-01-16   40   39   43   24   17   16   16   16   22   18   18   21   21   18   19   19   22   18   19   17   18   16   16   16   509
11-01-17   25   23   18   16   17   16   16   16   22   18   21   18   20   18   18   18   24   18   18   17   20   16   16   17   446
11-01-18   25   23   21   16   18   17   16   17   21   43   18   19   26   18   20   40   21   18   20   16   18   16   16   17   500
11-01-19   24   24   22   16   17   16   16   16   23   18   19   18   23   19   18   19   19   18   22   16   21   16   16   17   453
11-01-20   24   24   19   16   17   16   16   16   22   18   19   18   24   18   18   19   70   18   19   18   17   16   16   17   495
11-01-21   24   23   20   16   17   16   16   16   22   18   18   20   48   25   46   57   22   18   18   17   20   16   16   17   546
11-01-22   24   27   22   16   17   16   16   16   22   18   18   19   21   18   18   21   20   18   18   17   20   16   16   17   451
11-01-23   24   19   19   16   17   16   16   16   22   18   18   18   21   18   20   18   21   18   20   17   18   16   16   17   439
11-01-24   24   21   19   16   17   16   16   16   22   18   20   18   21   18   21   18   21   18   19   17   17   16   16   17   442
11-01-25   24   23   18   16   17   16   16   16   43   18   18   18   25   18  115   41   23   18   19   17   20   16   16   18   589
11-01-26   23   22   19   16   17   16   16   16   22   18   21   18   30   18   43   18   20   18   22   16   21   16   16   16   478
11-01-27   21   23   23   16   17   16   16   16   22   18   18   20   22   20   40   97  155  145  155   93  109  116  164  214  1556
11-01-28   93   24   23   16   17   16   16   16   29   18   19   19   30  101   19   18   75   30   19   16   20   16   22   16   688
11-01-29   21   16   16   16   17   16   27   20   22   18   18   18   31   18   18   19   23   18   19   18   18   16   16   17   456
11-01-30   24   22   17   16   17   16   16   16   22   18   18   19   23   18   19   18   23   18   18   17   20   16   16   17   444
11-01-31   24   19   21   16   17   16   16   16   23   18   19   18   22   19   18   20   20   18   18   16   21   16   16   17   444
11-02-01   24   25   22   16   17   16   16   16   23   18   18   20   24   18   18   19   21   18   18   17   19   16   16   17   452
11-02-02   24   22   20   16   17   16   16   16   23   18   18   40   23   18   39   18   21   18   22   16   21   16   16   17   491
11-02-03   24   23   22   16   18   16   16   16   22   18   19   18   24   18   20   18   20   18   19   18   19   16   16   17   451
11-02-04   24   23   23   16   17   16   16   16   22   18   19   19   48   18   20   20   23   18   18   17   19   16   16   17   479
11-02-05   24   23   17   17   17   16   16   16   22   18   20   19   23   18   19   20   22   18   18   17   20   16   16   17   449
11-02-06   24   23   17   16   18   16   16   16   22   18   19   19   21   18   19   18   24   18   20   16   21   16   16   17   448
11-02-07   24   22   18   16   18   16   16   16   22   18   52   18   44   25   29   24   21   18   18   16   19   16   16   17   519
11-02-08   24   23   19   16   17   16   16   16   22   18   20   19   28   18   19   18   22   18   19   17   34   32   32   31   514
11-02-09   36   39   35   31   37   31   32   32   32   34   34   34   39   34   34   34   36   34   38   32   37   31   32   33   821
11-02-10   37   36   38   31   33   32   31   32   37   34   34   34   41   34   35   35   37   34   34   31   34   32   31   33   820
11-02-11   39   35   38   31   33   32   31   32   37   34   34   34   38   34   34   34   35   34   34   32   34   32   31   32   814
11-02-12   40   34   34   33   35   32   31   32   37   34   34   34   36   34   35   34   36   34   34   31   36   32   31   32   815
11-02-13   40   34   35   31   34   32   31   32   37   34   34   34   38   34   34   34   37   34   34   32   36   32   31   32   816
11-02-14   40   34   33   31   35   32   32   32   37   66   34   35   38   34   34   34   37   34   34   31   36   32   31   33   849
11-02-15   48   50   50   44   48   32   32   32   35   34   34   34   38   34   37   34   39   34   34   32   36   31   32   33   887
11-02-16   39   34   33   32   33   31   32   32   37   53   37   34   38   38   34    2    0    0    0    0    0    0    0    0   539

34 rows selected.

SQL>
Redo logs generation is grouped by hours where last column (TOT) is sum of all redo logs in one day. According this it is more then obvious where redo log generation was highest, so our interest may be focused on presented point in time.

How much is that in Mb?

Total redo logs size (and according that, archived log size) cannot be computed from previous query because not all redo log switches occur when redo log was full. For that you might want to use this very easy query:
SQL> select sum(value)/1048576 redo_MB from sys.gv_$sysstat where name = 'redo size';

   REDO_MB
----------
1074623.75

SQL>
If you want to calculate on instance grouping, then use this:
SQL> select inst_id, sum(value)/1048576 redo_MB from sys.gv_$sysstat where name = 'redo size'
  2  group by inst_id;

   INST_ID    REDO_MB
---------- ----------
         1 370325.298
         2   4712.567
         4 405129.283
         3 294457.100

SQL>
Both queries works on single instances as well.

Which segments are generating redo logs?

After we found out our point of interest, in mine case where were most of the redo logs generation, it is very useful to find out which segments (not tables only) are causing redo log generation. For that we need to use "dba_hist" based tables, part of "Oracle AWR (Automated Workload Repository)", which usage I have described in topic Automated AWR reports in Oracle 10g/11g. For this example I'll focus on data based on time period: 11-01-28 13:00-11-01-28 14:00. Query for such a task should be:
SELECT to_char(begin_interval_time,'YY-MM-DD HH24') snap_time,
        dhso.object_name,
        sum(db_block_changes_delta) BLOCK_CHANGED
  FROM dba_hist_seg_stat dhss,
       dba_hist_seg_stat_obj dhso,
       dba_hist_snapshot dhs
  WHERE dhs.snap_id = dhss.snap_id
    AND dhs.instance_number = dhss.instance_number
    AND dhss.obj# = dhso.obj#
    AND dhss.dataobj# = dhso.dataobj#
    AND begin_interval_time BETWEEN to_date('11-01-28 13:00','YY-MM-DD HH24:MI') 
                                AND to_date('11-01-28 14:00','YY-MM-DD HH24:MI')
  GROUP BY to_char(begin_interval_time,'YY-MM-DD HH24'),
           dhso.object_name
  HAVING sum(db_block_changes_delta) > 0
ORDER BY sum(db_block_changes_delta) desc ;
Reduced result from previously shown query would be:
SNAP_TIME   OBJECT_NAME                    BLOCK_CHANGED
----------- ------------------------------ -------------
11-01-28 13 USR_RACUNI_MV                        1410112
11-01-28 13 TROK_TAB_RESEAU_I                     734592
11-01-28 13 TROK_VOIE_I                           638496
11-01-28 13 TROK_DATUM_ULAZA_I                    434688
11-01-28 13 TROK_PAIEMENT_I                       428544
11-01-28 13 D_DPX_VP_RAD                          351760
11-01-28 13 TROK_SVE_OK_I                         161472
11-01-28 13 I_DATPBZ_S002                         135296
11-01-28 13 IDS2_DATUM_I                          129904
11-01-28 13 IDS2_PZNBR                            129632
11-01-28 13 IDS2_IDS1_FK_I                        128848
11-01-28 13 IDS2_DATTRAN_I                        127440
11-01-28 13 IDS2_DATSOC_I                         127152
11-01-28 13 IDS2_VRSTA_PROD_I                     122816
...
Let us focus on first segment "USR_RACUNI_MV", segment with highest number of changed blocks (what mean directly highest redo log generation). Just for information, this is MATERIALIZED VIEW.

What SQL was causing redo log generation

Now when we know when, how much and what, time is to find out how redo logs are generated. In next query "USR_RACUNI_MV" and mentioned period are hard codded, because we are focused on them. Just to point that SQL that start with "SELECT" are not point of our interest because they do not make any changes.
SELECT to_char(begin_interval_time,'YYYY_MM_DD HH24') WHEN,
       dbms_lob.substr(sql_text,4000,1) SQL,
       dhss.instance_number INST_ID,
       dhss.sql_id,
       executions_delta exec_delta,
       rows_processed_delta rows_proc_delta
  FROM dba_hist_sqlstat dhss,
       dba_hist_snapshot dhs,
       dba_hist_sqltext dhst
  WHERE upper(dhst.sql_text) LIKE '%USR_RACUNI_MV%'
    AND ltrim(upper(dhst.sql_text)) NOT LIKE 'SELECT%'
    AND dhss.snap_id=dhs.snap_id
    AND dhss.instance_number=dhs.instance_number
    AND dhss.sql_id=dhst.sql_id 
    AND begin_interval_time BETWEEN to_date('11-01-28 13:00','YY-MM-DD HH24:MI') 
                                AND to_date('11-01-28 14:00','YY-MM-DD HH24:MI')
;
Result is like:
WHEN          SQL                                               inst_id       sql_id  exec_delta rows_proc_delta
------------- ------------------------------------------------- ------- ------------- ---------- ---------------
2011_01_28 13 DECLARE                                                 1 duwxbg5d1dw0q          0                0
                job BINARY_INTEGER := :job; 
                next_date DATE := :mydate;  
                broken BOOLEAN := FALSE; 
              BEGIN 
                dbms_refresh.refresh('"TAB"."USR_RACUNI_MV"'); 
                :mydate := next_date; 
                IF broken THEN :b := 1; 
                ELSE :b := 0; 
                END IF; 
              END;                                                    
2011_01_28 13 delete from "TAB"."USR_RACUNI_MV"                       1 5n375fxu0uv89          0                0
For both of examples it was impossible to find out number of rows changed according operation that was performed. Let us see output of another example (NC_TRANSACTION_OK_T table) where we can meet with DDL that generate redo logs!
WHEN          SQL                                               inst_id       sql_id  exec_delta rows_proc_delta
------------- ------------------------------------------------- ------- ------------- ---------- ---------------
2011_01_28 13 alter table TAB.NC_TRANSACTION_OK_T                     4 g5gvacc8ngnb8          0               0 
              shrink space cascade                                
If you are focused on pure number of changes, then you might to perform query where inst_id and sql_id are irrelevant (excluded from query). Here is a little modified previous example, for "Z_PLACENO" segment (pure oracle table):
SELECT when, sql, SUM(sx) executions, sum (sd) rows_processed
FROM (
      SELECT to_char(begin_interval_time,'YYYY_MM_DD HH24') when,
             dbms_lob.substr(sql_text,4000,1) sql,
             dhss.instance_number inst_id,
             dhss.sql_id,
             sum(executions_delta) exec_delta,
             sum(rows_processed_delta) rows_proc_delta
        FROM dba_hist_sqlstat dhss,
             dba_hist_snapshot dhs,
             dba_hist_sqltext dhst
        WHERE upper(dhst.sql_text) LIKE '%Z_PLACENO%' 
          AND ltrim(upper(dhst.sql_text)) NOT LIKE 'SELECT%'
          AND dhss.snap_id=dhs.snap_id
          AND dhss.instance_Number=dhs.instance_number
          AND dhss.sql_id = dhst.sql_id 
          AND begin_interval_time BETWEEN to_date('11-01-25 14:00','YY-MM-DD HH24:MI') 
                                      AND to_date('11-01-25 15:00','YY-MM-DD HH24:MI')
        GROUP BY to_char(begin_interval_time,'YYYY_MM_DD HH24'),
             dbms_lob.substr(sql_text,4000,1),
             dhss.instance_number,
             dhss.sql_id
)
group by when, sql;
Result is like:
WHEN          SQL                                                                    exec_delta rows_proc_delta
------------- ---------------------------------------------------------------------- ---------- ---------------
2011_01_25 14 DELETE FROM Z_PLACENO                                                           4         7250031
2011_01_25 14 INSERT INTO Z_PLACENO(OBP_ID,MT_SIFRA,A_TOT)                                    4         7250830
              SELECT P.OBP_ID,P.MT_SIFRA,SUM(P.OSNOVICA)   
                FROM (SELECT OPI.OBP_ID,
                              OPO.MT_SIFRA,
                              SUM(OPO.IZNKN) OSNOVICA
                        WHERE OPI.OBP_ID = OPO.OPI_OBP_ID  
                          AND OPI.RBR = OPO.OPI_RBR  
                          AND NVL(OPI.S_PRETPOREZA,'O') IN ( 'O','N','A','Z','S') 
                        GROUP BY OPI.OBP_ID,OPO.MT_SIFRA 
                      ) 
Here you can see directly number executions and number of involved rows.

Query based on segment directly

Sometimes you do not want to focus on period, so your investigation may start with segment as starting point. For such a tasks I use next query. This is small variation of previous example where "USR_RACUNI_MV" segment is hard codded.
SELECT to_char(begin_interval_time,'YY-MM-DD HH24') snap_time,
       sum(db_block_changes_delta)
  FROM dba_hist_seg_stat dhss,
       dba_hist_seg_stat_obj dhso,
       dba_hist_snapshot dhs
  WHERE dhs.snap_id = dhss.snap_id
    AND dhs.instance_number = dhss.instance_number
    AND dhss.obj# = dhso.obj#
    AND dhss.dataobj# = dhso.dataobj#
    AND dhso.object_name = 'USR_RACUNI_MV'
  GROUP BY to_char(begin_interval_time,'YY-MM-DD HH24')
  ORDER BY to_char(begin_interval_time,'YY-MM-DD HH24');
Reduced result is:
   SNAP_TIME   SUM(DB_BLOCK_CHANGES_DELTA)
   ----------- ---------------------------
   ...
   11-01-28 11                     1224240
   11-01-28 12                      702880
 11-01-28 13                     1410112
   11-01-28 14                      806416
   11-01-28 15                     2008912
   11-01-28 16                     1103648
   ...
As you can see in accented row, the numbers are the same as at the begging of topic.

The End

Regardless which approach you use, redo logs generation has important effects in proper management of Oracle RMAN backups as well as stand by databases.

Cheers!

Monday, February 21, 2011

Trust ... but verify!

Today topic comes from real life story ... a story that took 3 days of work/investigation and involved even Oracle SR with result that was far from expected. Why ? Because I left mine main saying:
Trust but verify!
So let me start .... from the beginning as it should ...

Allocate one RAC node to other rack

A customer, where 4 node Oracle 10gR2 RAC on RHEL 5 Linux x64 was perfectly working for a year, come with more then welcome idea to reallocate two nodes (blade servers) from one rack to another rack, to reduce risk from power failure. Idea was immediately accepted and meeting with other involved parties was pretty short.

One outsource partner will reconfigure HW rack (console, network etc), second, which represent OS admins, will reconfigure NIC interfaces in new rack, third, Oracle DBA should be here only as checking instance that all is working as it should.

On mine suggestion, we agreed to reallocate just one node in the first place and when all went OK, we'll repeat the same procedure with second node. On mine question: "Would there be any HW/OS changes?", I was clearly answered: "No, there will not be any changes that might be notices. We will only change MAC address on all NICs, all other will remain the same!". This sound to me like situation when all node NICs are dead so their replacement is necessary. Sounds easy!

According this, let me show the picture how the hardware was configured before and how it should be configure after reallocation, according OS and hardware partners work. Here is short picture of RAC and switch where VIP connections are placed:

The problem

Reallocation went pretty quick in planned time (cca. 22:30 in the evening). When all was ready, node was started in "normal" (non admin mode). But Oracle RAC was not able to start on that node! And here the problem arises.
oifcfg, Oracle network for RAC tool, mostly hanged showing no result in several minutes, what I found very bad sign and all problems related to network.

After half an hour ad hoc analyze of the RAC logs, and two unsuccessful node reboots (when OS admins has some ideas), I suggested to leave everything as is until morning, when we might be more wise. Frankly, whenever I find myself in situation that in problem situation people have "ideas" I suggest to leave the problem immediately (if possible), knowing that things come from bad to worse very soon! Suggestion was accepted and we went home leaving problematic node up and running.

Next morning, while I was driving to work, the same client called me with panic in his voice that no one cannot connect to any of the nodes-RAC was unusable. After one minute I have suggested to shutdown problematic node to see if anything will be better. Shutting down node was performed with halt command because shutdown was not enough. After shut down of node, other three nodes in RAC started immediately to work normally.

This information was really scary to me because Linux doesn't behave in that way really and this may be the sign of some bigger problems that might arise in the rest of the day. Fortunately that was all and nothing other went wrong.

The analyze part

Initially, when I come to work, I check once again if Oracle documentation or Google mention any problem with "MAC address change" in RAC environment, but except one case on specific platform, there was nothing! So I left the case to one who initially made the problem-HW and OS admins.

After a week, customer come to us saying that we should meet with OS admins to analyze the problem and find the solution to repair problematic node. Customer reaction was normal-he wanted to get all nodes up and running as it was before, but I was a little surprised why DBA has to help when there is nothing changed in Oracle configuration and we were not involved in this failure in any mean! I felt they do not trust us so raising any other frictions were not acceptable nor professional. Even thought, I have opened SR on Metalink, just to prove myself that Oracle is not the guilty part.

Because server was down, I asked to start node in "no network mode" to collect all logs that might help us in finding the cause. While I was getting log files on USB stick, I asked OS admin several questions:
  • Have you checked SSH equivalence on all nodes? Answer was Yes-all OK.
  • Have you check Jumbo frame setting (it was implemented as MTU=9000). Answer was Yes-all OK.
  • On mine question if all the cables were the same as before, answer was nothing but strange look from his eyes. This was a point when I should react and insist on question ... but I was not 100% sure what is the real problem and pass over that.
  • You haven't changed anything in Linux start up files? Answer was No-it is untouched.
Shortly, on all mine questions answer was pretty the same: nothing was touched or changed and everything is the same as before. I do not know if anyone have ever been in such a situation, but I was so confused so much that some of mine last questions were pretty stupid, leaving impression that I do not understand anything. So I shut up and left customer to see logs in peace at mine desk. Parallel, I sent logs to Oracle support who react in very quick manner, proving that every SR depend on person who took it.

/var/tmp/.oracle solution?

Initially I find some crsd logs that took me to solution which says to clear /var/tmp/.oracle directory content, where Linux create socket's files. In the same time, from Metalink comes recommendation like:
I think the issue is with ocfs it can not communicate with other nodes. Is this used for ocr or voting files ?

If this used please execute RDA on working node and this failing node according to instructions from this article: Remote Diagnostic Agent (RDA) 4 - RAC Cluster Guide (Doc ID 359395.1)

If the ocfs is not used for ocr or voting files please then execute following steps
1. enable network public and interconnect if they are disabled
2. crsctl stop crs
3. cleanup old socket's "/var/tmp/.oracle"
4. crsctl start crs
Our RAC configation has OCFS, but it has nothing with ocr or voting disks (RAC was based on raw services and ASM) and this OCFS was used only for some exp and file based actions ... shortly nothing with any necessary RAC operation. Another thing was a small problem to start node because of possibility that it will be hang RAC operation on all other nodes (as first time). So second SR advice seems to be logical and promising to us. This was a plan to do.
  • Start node in admin mode
  • Delete all content of /var/tmp/.oracle directory
  • Start up node normally
All in short I was really confused knowing that if this plan fail we will be in real trouble because, we have no idea what was wrong and Metalink suggested to run RDA, what is also very clear sign that they also do not know really what is the problem.

Solution

On this second intervention on customer site, I was absent, so mine colleague took this ungrateful role to present DBA side. After performing previously explained plan, RAC was still unable to start CRS core services. They were stuck as we were first night.
But there was one difference, mine colleague didn't believe anybody and he wanted to get prove about every question/idea he asked. When they come to question about configuration and how it looks now, he got a picture:
After he realized that 2 new switches has been added and connected totally different then it was initially set up (compare this pic with one on the beginning is this the same configuration to you!!??), he was now focused on errors that might arise in bad network connection. Soon he found errors like:
2011-02-19 17:06:37.378: [  OCRMSG][233026112]prom_rpc: CLSC send failure..ret code 6
2011-02-19 17:06:37.378: [  OCRMSG][233026112]prom_rpc: possible OCR retry scenario
2011-02-19 17:06:37.378: [  OCRSRV][233026112]proas_forward_request: PROM_TIME_OUT or Master Fail
2011-02-19 17:06:37.429: [ COMMCRS][522433088]clscsendx: (0x600000000065bcd0) Connection not active
Google gave him just one result where was a problem with jumbo frames. When he wanted prove that all node are pingable with correct MTU , all packets to reallocated node was rejected, because MTU was not set on middle switch!! After setting correct MTU values on that switch, RAC started to work like a charm! Customer was happy and this said story finish happy.

But lessons should be learned!

The End

What I wanted to point is that DBA (or any admin staff) should never trust anything that cannot be proved or tested. Mine error was just in that - I trusted but haven't verified and this cost me a lot.

So regardless how big "faces" are working with you, if you are asked to help them, you are equal amount them in any mean! You might ask something stupid but if HW and OS people (in this case) were not afraid to claim that configuration has not been changed, in a situation when 2 new switches has been added, and network was totally different, then you are entitled to get answer on your every questions in no excuse. Without any inconvenience!

That was mine small and cincere story that I wanted to share with you.

Cheers!

Thursday, February 17, 2011

master.sh script for Oracle RAC

For a long time, in RAC environment, I use famous Bernhard de Cock Buning's script which shows current master node in Oracle RAC. Script is originally downloaded from RACHelp site where you can find a lot of nice RAC scripts.

But execution of script had a little bug, when one of the RAC nodes is down-script hang! This is why I have added ssh ConnectTimeout parameter which solve this problem fully.

Here is the whole source of changed script:
#!/bin/sh
#
#
# Requirements:
# - Make sure the ORA_CRS_HOME / ORACLE_HOME is set to the CRS environment.
# - Make sure all the nodes in the cluster can be accessed trough ssh.
# Version:
# - version 1.0
# - Bernhard de Cock Buning
# -> master.sh
# -> 1.0 Initial creation of masternode information script.
# - www.rachelp.nl / GRID-IT
#
# -> 1.1 added "-o ConnectTimeout=5" when node is down -script hang otherwise!
# - Damir Vadas
# - http://damir-vadas.blogspot.com/

#set -x -v
#exec >> /tmp/master_node.log 2>&1

ORA_CRS_HOME=$ORACLE_HOME
SSH=/usr/bin/ssh
NODES=`$ORA_CRS_HOME/bin/olsnodes`
HIGHEST_HOSTNAME=""
HIGHEST_VALUE=0

for I in $NODES
do
CURRENT_HOSTNAME=$I
CURRENT_VALUE=`$SSH -o ConnectTimeout=5 $I grep clssgmEstablishMasterNode $ORA_CRS_HOME/log/$I/cssd/ocssd.l* | tail -1 | tr -d CSSD] | awk '{print $2 $3}' | tr -d ':.-'`
if [[ $(( $(echo $CURRENT_VALUE ) - $(echo $HIGHEST_VALUE ) )) > 0 ]] ;
then
HIGHEST_VALUE=$CURRENT_VALUE
HIGHEST_HOSTNAME=$CURRENT_HOSTNAME
fi
done
MASTERNODE=`$SSH $HIGHEST_HOSTNAME grep clssgmEstablishMasterNode $ORA_CRS_HOME/log/$HIGHEST_HOSTNAME/cssd/ocssd.l* | tail -1 | awk '{print $11}' | tr -d 'node()'`
echo "Masternode is node with number: $MASTERNODE"
$ORA_CRS_HOME/bin/olsnodes -n 

Modified script output in a case when iis2 node is down looks like:
[oracle CRS@iis4 ~]$ master
ssh: connect to host iis2 port 22: No route to host
Masternode is node with number: 1
iis1      1
iis2      2
iis3      3
iis4      4
[oracle CRS@iis4 ~]$

Hope this helps someone. Cheers!

Tuesday, February 1, 2011

The macro cannot be found or has been disabled (BI Publisher and Office 2010)

Recently I have upgraded MS Office from 2007 to 2010. As I work on some BI projects, part of mine set of oracle programs is oracle BI Publisher Desktop, which use Word as report template designer. When I run BI Publisher macros in Word 2010 I got following error:
When I google term it seems that all advices were not useful. Here are two most common errors and how should Word be setup correctly:
Templates were correctly enabled:
But in Windows event viewer, curious error was more then obvious:

The solution

The solution shown here really works (tested on several computers), so I'm writing this down to be more like documentation for future reference to anyone stack on this case. Idea is to remove old *.EXD files from any location in %USERPROFILE% folders that might exists.
Because mine praxis is to have separate %USERPROFILE% folder location then where are OS files (different then partition then C:), exd file was found as shown. In mine case it was only in one folder:
D:\Users\DamirV\Application Data\Microsoft\Forms>dir
 Volume in drive D is Data
 Volume Serial Number is 505E-794F

 Directory of D:\Users\DamirV\Application Data\Microsoft\Forms

01.02.2011  11:45    <DIR>          .
01.02.2011  11:45    <DIR>          ..
12.03.2009  15:17           165.808 MSComctlLib.exd
31.08.2010  10:34            11.776 WINPROJ.box
01.02.2011  11:19             7.168 WINWORD.box
               3 File(s)        184.752 bytes
               2 Dir(s)  51.699.859.456 bytes free

D:\Users\DamirV\Application Data\Microsoft\Forms>
So all you have to do is:
  1. Rename existing "MSComctlLib.exd" file to "MSComctlLib.exd_bak"
  2. Open Word 2010
  3. Open menu Add-Ins|Oracle BI Publisher|Log On
Last action will create new MSComctlLib.exd file in background:
D:\Users\DamirV\Application Data\Microsoft\Forms>dir
 Volume in drive D is Data
 Volume Serial Number is 505E-794F

 Directory of D:\Users\DamirV\Application Data\Microsoft\Forms

01.02.2011  11:45    <DIR>          .
01.02.2011  11:45    <DIR>          ..
01.02.2011  13:51           167.456 MSComctlLib.exd
12.03.2009  15:17           165.808 MSComctlLib.exd_old
31.08.2010  10:34            11.776 WINPROJ.box
01.02.2011  11:19             7.168 WINWORD.box
               3 File(s)        184.752 bytes
               2 Dir(s)  51.699.859.456 bytes free

D:\Users\DamirV\Application Data\Microsoft\Forms>
Notice different file size and file date of "MSComctlLib.exd". This is now Office 2010 version of EXD file which was not able to be created because of previous version. And that is all.

The End

If you have upgraded several Office versions (i.e. 2000->2003->2007->2010), you might find ".EXD" files as well in following folders:
%USERPROFILE%\Application Data\Microsoft\Forms
%USERPROFILE%\Local Settings\Temp\Excel8.0
%USERPROFILE%\Local Settings\Temp\VBE
%USERPROFILE%\Local Settings\Temp\Word8.0
Like in previous example rename them and later (when you are sure that all works) you may delete.

Hope this helps!

Cheers!

Post festum part-IMPORTANT!!!

I have received many asks that people complaint about this issue was resolved and later after some Windows updates, all is back to same problem.
For them I have wrote this part what should help them.
  1. Again first step is to rename all .exd files.
  2. For 64-bit operating systems, type the following:
    Regsvr32 "C:\Windows\SysWOW64\MSCOMCTL.OCX"
    
    For 32-bit operating systems, type the following:
    Regsvr32 "C:\Windows\System32\MSCOMCTL.OCX"
    
  3. Restart your machine.