Database horror: reanimating a fool backup

Sometimes life isn’t pretty. Last week I’ve been asked to reanimate/save a database of which only a backup exists which was done without the database being in backup mode. Yes, the infamous ‘fool backup’ (the opposite of the ‘full backup’, an online backup done with the database in backup mode). For the record: Windows 2003, Database oracle EE

Can it be done? Yes.
Do you want to do it? In almost all cases: NO.
Do you loose data? Yes, depending on activity at that time.

For the record & other interested readers: oracle offers an undocumented parameter which especially is designed to bring up databases of which not all datafiles are consistent, or of which the oracle engine knows are ‘fuzzy’ (the fuzzy bit is a number in the fileheader of a datafile which oracle uses to put the state of a datafile in. Because the fuzzy bit gets set in ‘online backup mode’, oracle knows it is put in backup mode, and can be recovered). This parameter is “_allow_resetlog_corruption”. Oracle’s (and my) advice is: only use this parameter under direct guidance of oracle support.

Originally, this was about a few datafiles which went missing. This led to the first action: only restoring the (wrongly backuped) datafiles, and try to force open the database. (the database was three days older than the restored datafiles)
This action led to a ORA-1555 on opening of the database (which also blew up my instance). Investigation turned out this is due to the locally managed datafiles. (the tracefile showed a query of the database itself on uet$ (used extents data dictionary view), which originally was a (clustered) table in the SYSTEM tablespace, but with locally managed datafiles this data is actually stored inside the datafile in the bitmap area). So, an unfindable block of a space bitmap image of the datafile’s space managed broke the instance. Bummer.

Next action: restore the total fool backup.
This action led to a running instance! Interesting!
Following up (of course), was the action to export the entire database. (to save the data. this database is -absolutely- inconsistent and NOT USABLE!)
Also we are aware we can have inconsistencies and corrupt blocks. For this, we also set an undocumented event at startup, so oracle doesn’t break, but skips corrupt blocks (of tables, an export doesn’t read indexes, but gathers the information to create the an index out of the data dictionary). This event is 10231, and can be set in the init.ora:

event = '10231 trace name context forever, level 10'

We succeeded to export most of the tables, only 4 gave an ORA-1555.
The ORA-1555 means the before image of a transaction in a block is not present at the designated UBA (undo block address).
Every index and data block has a variable list of transactions in the variable transaction header of each block. This list is called ‘ITL’, which means Interested Transaction List. Once a transaction starts, this list arranges consistent reads for other transactions. If a read actions needs to obey the ITL entry (most likely because a transaction is pending on a row), it searches at the UBA for the data consistent with its read. If that data cannot be found at the UBA, the ORA-1555 is triggered.

So, we got ourselves a nice ‘logical corruption’. Oracle perfectly understands the data in the block (so it’s not physically corrupt), but can also not fulfill my wish because it cannot get the old data needed to give me my data consistently.

We choose to ‘get what we could’, by marking the block corrupt (so the 10231 event skips the block). Rough? Yes, but it allowed us to get an export of (most of) the data (and an export of the table). We did this by using the BBED.

The BBED is the ‘block browser and editor’. It is a commandline tool, which is present (as executable) in windows, and can be linked to an executable on linux/unixes. It is password protected.
The BBED can be ‘set’ at a block, and by using the command ‘corrupt’ it can corrupt a block (meaning it is marked as ‘media corrupt’, which means the 10231 event will skip that block).

We additionally set the following event to get the DBA (data block address):

event='1555 trace name errorstack level 10'

so we know which block should be marked.

Well, this led to a situation where we were able to get most of the data out. Additionaly I’ve did some research on the rows inside the blocks (you’ll need the format of the table which is stored in the block), but the BBED is made for browsing and editting a block, not for unloading data. (tools like oracle’s DUL, ORA600’s DUDE/jDUL are made for this purpose).

I know this could be done several other ways. But only in this situation we have both as much data as possible, and as little risk as possible (with a potentially corrupted database:-) for inconsistent rows (because those blocks we skipped).

  1. Pierre M said:

    This reminds me another one: In 2003, I visited a customer who had a corrupted online redo log. His application was unavailable (Database (around 7.3) wouldn’t start). I asked for a backup of its database. He hadn’t heard of any database on his system… In fact, my company installed his mailing application in 1999 and the guy hadn’t backed up his DB since then. Some people are lucky. In fact, the system went throught Y2K, rebooted only a few times since 1999 and was there without them knowing about it!
    Here also, a hidden statup param. allows to restart the db with corrupted online redo, I started it in read-only, exported it, created a new db and imported the data. They were happy with just a end-afternoon outage. The morning after, I came and installed a proper backup procedure. For free.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: