Tuesday, October 20, 2009

Enterprise backup strategy leaks


Every company has some kind of backup strategy, regardless of the underlying technology. In the Oracle world most of them are based on an enterprise archiving solution (Legato, Veritas etc.), which at its core is a powerful and robust backup platform. But none of them will, by itself, create backups exactly as you need them.

Oracle is a hard-to-beat platform for backing up data in every sense. Its features are well known, but here I'd like to point out some important and commonly made mistakes.

Any backup strategy without a DRS (Disaster Recovery Site), which is very common in small and mid-range companies, starts with one crucial question: how much data can the company afford to lose? Some data loss is always possible in such situations, and the answer determines many subsequent steps and procedures that are crucial when defining the company's backup strategy from the very start.

Let us first clear up the term "data loss problem". At its core it means how often data has to be backed up to ensure that an unexpected loss of ANY production data will not exceed an agreed limit (measured in time). Whatever a software vendor promises (including Oracle), the claim that data are safe is at best 99.99% true: most experienced users can easily demonstrate some kind of data loss, regardless of the underlying technology! The real questions are how likely that is and how to quantify the risk.

In Oracle practice, preventing "loss of data" comes down to how often archived logs are backed up to another location, so that in the case of a major disaster on the production data the DBA can perform a recovery whose data loss never exceeds the policy mentioned above. The most common solution is to send archived logs to some tape/disk storage on a time-based schedule; for many companies around the world an interval of 2 hours is very reasonable. In our case, where the example script is based on a Legato backup, here is how to implement that concept:
connect target backup_admin/backup_admin_pass@PROD_DB
connect catalog rman/rman_pass@RMAN_DB
run {
allocate channel t1 type 'SBT_TAPE';
allocate channel t2 type 'SBT_TAPE';
send 'NSR_ENV=(NSR_SERVER=backup_server.com,NSR_CLIENT=production_server.com)';
# 0.0840278 days = 121 minutes!
backup archivelog FROM TIME = 'sysdate-0.0840278' filesperset 50 format '/arch1_%d_%u/';
release channel t1;
release channel t2;
}
If we put that script into a Legato schedule (a special kind of crontab for the Legato solution) with a restart policy of 120 minutes, 24x7, the DBA ensures that all archived logs are backed up every 2 hours. Notice that the 121-minute backup window is one minute longer than the schedule interval, so consecutive runs overlap slightly and no archived log can fall between two windows. A quick way to verify that the schedule really keeps up is shown below.
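As a sanity check, the target database itself can tell you how far the archived log backups are lagging. The following is a minimal sketch (it assumes a version of V$ARCHIVED_LOG with the BACKUP_COUNT column, available in 10g and later); if the result is much older than two hours, the schedule is slipping:
-- latest point in time covered by at least one archived log backup
select max(next_time) as backed_up_until
from   v$archived_log
where  backup_count > 0;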

Even an experienced user will not immediately notice two problematic points in this seemingly bullet-proof concept:
  1. Would that schedule fire every two hours?
  2. What do we backup every two hours?

1) Would that schedule fire every two hours?
The immediate answer is: probably NOT!
When the tape/disk library is shared among other backups, this approach suffers from what I'll call the "busy backup resource problem": the Oracle backup schedule can be postponed because some other backup has already started at that time, so we face a lack of free backup resources. Keeping in mind that our solution is time sensitive (it has to run every two hours!), it is easy to see that we have a serious problem. Solving it leads us to two kinds of solution:
  • The easiest implementation (and the most frequent in practice) is to create another, daily schedule which backs up all archived logs that have not been backed up yet. Backing up more often than once a day may lead to overlapping archived log backups (a backup storage space problem) or, in the case of a selective backup, to a time-consuming operation (the RMAN catalog database may not be that fast), which brings us back to the busy backup resource problem.
    However, the problem with this implementation is that we still risk having archived logs that stay unprotected for up to a day; as far as I know, this is an acceptable risk given how short such a gap lasts.
    Here is the code for that:
connect target backup_admin/backup_admin_pass@PROD_DB
connect catalog rman/rman_pass@RMAN_DB
run {
  allocate channel t1 type 'SBT_TAPE';
  send 'NSR_ENV=(NSR_SERVER=backup_server.com,NSR_CLIENT=production_server.com)';
  # back up only those archived logs that have not been backed up yet
  backup NOT BACKED UP 1 times archivelog all filesperset 50 format '/arch1_%d_%u/';
  release channel t1;
}
  • The enterprise solution would be to buy a dedicated backup device for this database alone, or more "head devices" (some kind of parallelism). With this, the busy backup resource problem should never happen!
    But implementing this solution brings the next problem: continuity. If just one scheduled job is skipped (a problem in any IT resource during the day), you will be left without some archived log backups. So this brings you back to implementing the previous method, because you have to be 100% sure that all archived log files are backed up, and that cannot be done without periodically checking the RMAN repository for logs that were missed - see the sketch after this list.
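Such a check does not even require the catalog; the target database can list the archived logs that RMAN has never backed up. A minimal sketch (again assuming the BACKUP_COUNT column in V$ARCHIVED_LOG):
-- archived logs never backed up by RMAN that still exist on disk
select sequence#, thread#, name, first_time
from   v$archived_log
where  backup_count = 0
  and  deleted = 'NO'
order by first_time;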
2) What do we backup every two hours?
This is an interesting question, so let me show an example of what I mean. What happens when (usually at night or during weekends) there is no activity on the database, so no archive log switch occurs at all?
This is the main danger lurking in the strategy: no archived log files are produced, so our data loss policy is endangered regardless of whether we have a dedicated storage solution or not - RMAN simply has no archived logs to back up!
To solve this, Oracle has an out-of-the-box solution in the form of one init.ora parameter, ARCHIVE_LAG_TARGET, which forces a log switch on a time basis. From my past experience, setting this value to 899 seconds (slightly less than 15 minutes) will satisfy many companies. With this implementation you ensure that the risk of data loss is smaller and more acceptable in every sense.
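Setting the parameter is a one-liner; 899 is just my suggested value, so adjust it to your own policy:
-- force a log switch at least every 899 seconds (just under 15 minutes)
-- SCOPE=BOTH assumes the instance uses an spfile
alter system set archive_lag_target = 899 scope = both;
For the end, here is a script that shows archived redo log distribution per hour on each day: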
/* -----------------------------------------------------------------------------
 Filename: rl.sql
 CR/TR#  :
 Purpose : Archive log distribution per hours on each day ...
 Date    : 07.08.2007.
 Author  : Damir Vadas, funky_disco_freak@yahoo.com
 Remarks : run as privileged user
 Changes (DD.MM.YYYY, Name, CR/TR#):
--------------------------------------------------------------------------- */

set pagesize 120;
set linesize 200;
col day for a8;

spool rl.txt

PROMPT Archive log distribution per hours on each day ...

select
  to_char(first_time,'YY-MM-DD') day,
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'00',1,0)),'999') "00",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'01',1,0)),'999') "01",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'02',1,0)),'999') "02",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'03',1,0)),'999') "03",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'04',1,0)),'999') "04",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'05',1,0)),'999') "05",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'06',1,0)),'999') "06",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'07',1,0)),'999') "07",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'08',1,0)),'999') "08",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'09',1,0)),'999') "09",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'10',1,0)),'999') "10",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'11',1,0)),'999') "11",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'12',1,0)),'999') "12",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'13',1,0)),'999') "13",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'14',1,0)),'999') "14",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'15',1,0)),'999') "15",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'16',1,0)),'999') "16",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'17',1,0)),'999') "17",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'18',1,0)),'999') "18",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'19',1,0)),'999') "19",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'20',1,0)),'999') "20",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'21',1,0)),'999') "21",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'22',1,0)),'999') "22",
  to_char(sum(decode(substr(to_char(first_time,'HH24'),1,2),'23',1,0)),'999') "23",
  COUNT(*) TOT
from v$log_history
group by to_char(first_time,'YY-MM-DD')
order by day
;
spool off;
This script produces an rl.txt file in the directory where SQL*Plus was started. Here is an example of it, with the note that this company had ARCHIVE_LAG_TARGET=1799 (30 minutes). The result is intentionally reduced to be more readable (some hour columns are removed from the original spool file):
DAY      00   01   02   03   04   05   06   07   20   21   22   23          TOT
-------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----------
09-09-27    5    2   64    2    2    2    2    2    2    2    2    2        113
09-09-28    4    2   63    2    2    2   16   29    2    2    3    2        294
09-09-29    5    2   65    2    2    2   23   44   43   33   34   37        895
09-09-30   41   27    3    2    2    2  142   44    2    2    2    2        373
09-10-01    4    3    3    2    2    2    2    2    2    2    2    2         57
09-10-02    4    3    3    2    2    2    2    2    2    2    2    2         52
09-10-03    4    2    3    2    2    2    2    2    2    2    2    2         51
09-10-04    4    2    2    2    2    2    2    2    2    2    2    2         50
09-10-05    4    2    2    2    2    2    2   15    2    2    2    2        546
09-10-06    4    2   83   12   33   37    2    2    2    2    2    2        207
09-10-07    4    2   93    3    2    2    2    2    2    2    2    2        142
09-10-08    4    2    3    2    2    2    2    2    2    2    2    2         51
09-10-09    4    2    3    2    2   54   41    2    2    2    2    2        142
09-10-10    4    2   80   15    2    2    2    2    2    2    2    2        141
09-10-11    5    2   94    2    2    2    2    2    2    2    2    2        143
09-10-12    4    2   93    2    2    2    2    2    2    2    2    2        142
09-10-13    4    2   82   13    2    2    2    7   14    3    2    2        257
09-10-14    4    2   95    3    2    2    2    2    2    2    2    2        144
09-10-15    4    2   94    4    2    2    2    2    2    2    2    2        144
09-10-16    5    2   95    3    2    2    2    2    2    2    3    2        146
09-10-17    5    3   96    2    2    2    2    2    2    2    2    2        146
09-10-18    4    2   95    2    2    2    2    2    2    2    2    2        143
09-10-19    4    2   96    2    2    2    2    2    2    2    4    2        148
As you can see, at least two archived logs are created every hour, so we can be sure there is always something to back up!

Applying the methods from 1) and 2) in a company backup strategy really reduces the risk to its minimum.

Finally, I'd like to stress that there is no backup policy without testing it - so test your backups from time to time!!
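A cheap way to exercise at least the restore path is RMAN's RESTORE ... VALIDATE, which reads the backup pieces from the media manager without writing anything to disk. The following is a minimal sketch reusing the connections from the scripts above; a full test restore on separate hardware remains the only complete proof:
connect target backup_admin/backup_admin_pass@PROD_DB
connect catalog rman/rman_pass@RMAN_DB
run {
  allocate channel t1 type 'SBT_TAPE';
  send 'NSR_ENV=(NSR_SERVER=backup_server.com,NSR_CLIENT=production_server.com)';
  # read and verify the backups without restoring any file
  restore database validate;
  restore archivelog from time 'sysdate-1' validate;
  release channel t1;
}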
