Journaling filesystems. What's consistent?

Recently I had a discussion about file corruption, and corruption at the Oracle (database) level, and what journaling filesystems do to protect against it.

It seems that the general belief is that journaling filesystems work the following way:
1. the write is recorded in the filesystem’s “intent log”
2. the write is done to the actual location in the filesystem
3. the corresponding entry in the “intent log” is flagged as written

This way, when a crash happens, the system only has to redo all writes not flagged as written in order to be consistent again. In fact, that’s what I was taught in HP-UX system administration classes.
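
To make that believed mechanism concrete, here is a minimal Python sketch of the three-step scheme. The `journal` and `disk` structures and the crash/recovery flow are purely illustrative (my own toy model), not how any real filesystem implements its log:

```python
# Toy model of the "journal everything" belief: every write first goes to an
# intent log, then to its real location, then the log entry is flagged as done.
# Recovery replays every logged entry that was never flagged.

journal = []        # list of {"block": ..., "data": ..., "done": bool}
disk = {}           # block number -> content, the "real" location

def write(block, data, crash_before_inplace=False):
    entry = {"block": block, "data": data, "done": False}
    journal.append(entry)              # 1. record the intent in the log
    if crash_before_inplace:
        raise RuntimeError("crash!")   # power loss before the real write
    disk[block] = data                 # 2. write to the actual location
    entry["done"] = True               # 3. flag the log entry as written

def recover():
    # redo every write that was logged but never flagged as written
    for entry in journal:
        if not entry["done"]:
            disk[entry["block"]] = entry["data"]
            entry["done"] = True

write(1, "hello")
try:
    write(2, "world", crash_before_inplace=True)
except RuntimeError:
    pass
recover()
print(disk)   # {1: 'hello', 2: 'world'} -- nothing that reached the log is lost
```

In this model both the filesystem structures and the file contents come back intact after a crash.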

As you’ve probably guessed by now, that is not the case. In fact, a thesis by Vijayan Prabhakaran describes how journaling works for Ext3, ReiserFS, JFS, XFS and Windows NTFS.

The thesis investigated how well the above-mentioned filesystems can deal with failures, and to do so examined the way their journaling works. This investigation shows that, in their default setup, all of these filesystems only journal the filesystem metadata, NOT the data (!!!). Ext3 and ReiserFS can be configured to journal data too (on Ext3, for example, via the data=journal mount option), but that requires reconfiguration.

This means that after a crash, your filesystem itself is in a consistent state after online recovery, but the data inside your files might not be…
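
A second, equally simplified Python sketch shows what metadata-only journaling can mean for your data. Again, the dictionaries and the crash/recovery flow are my own toy model, not real filesystem code:

```python
# Toy model of metadata-only journaling: the inode (file size) change is
# journaled, but the file's data block is written in place and is NOT
# protected by the journal.

journal = []                 # journaled metadata changes
inodes = {"f": {"size": 5}}  # filesystem metadata
blocks = {"f": "hello"}      # file contents ("data blocks")

def append(name, data, crash_before_data=False):
    # journal the metadata change (the new file size) first
    journal.append({"inode": name, "size": inodes[name]["size"] + len(data)})
    if crash_before_data:
        raise RuntimeError("crash!")      # the data block never reaches disk
    blocks[name] = blocks[name] + data    # in-place data write, not journaled

def recover():
    # replay the journaled metadata; there is nothing to replay for the data
    for entry in journal:
        inodes[entry["inode"]]["size"] = entry["size"]

try:
    append("f", " world", crash_before_data=True)
except RuntimeError:
    pass
recover()
print(inodes["f"])   # {'size': 11} -- the metadata is consistent
print(blocks["f"])   # 'hello'      -- but the appended data never made it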
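```

Journal replay brings the metadata back to a consistent state, but the journal has no record of the data itself. Ordered journaling modes reduce this exposure by flushing data blocks before the metadata commit, yet the data is still not journaled.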

Does anyone know how journaling is done in the filesystems on AIX, Solaris, HP-UX and Tru64, and in 3rd-party filesystems like Veritas?

3 comments
  1. Hi Frits,

    I’ve been looking a little at this in relation to ZFS. In that case, there is no need for a journal, as the copy-on-write mechanism ensures the filesystem is always in a consistent state (see the copy-on-write sketch at the end of the comments).

    The way I’ve heard dirty region logging described in Sun ZFS talks is that all it does is narrow down the area on disk that requires an fsck.

    Don’t know about you, but I know I’ve definitely had to run fsck on ext3, and I’ve run it on VxFS as well. I would definitely say journaling does not prevent the data from being in an inconsistent state.

    Now for ASM….

  2. For Oracle databases, we use pre-sized datafiles, so the metadata for the
    datafile is “static” most of the time (it only changes when the datafile is
    resized manually or by autoextend). Generally, you would not have problems
    with Oracle data, as database blocks within the file are also protected by
    redo roll-forward during crash recovery.

  3. Hi Hermant! Thanks for your response.

    The intention of this blog post is to raise awareness of how recovery of filesystems is handled. For a large number of people, and (as I said) in some courses, the understanding is wrong (that journaling protects all writes).

    The second point I am trying to make (which is perhaps a bit buried) is that using a filesystem means there are two points in the stack which each have their own state of consistency, and these can get out of sync.

    Of course:
    - this has always been the case
    - the recovery mechanisms have matured over a long time

    Based upon my experience, I think outages and crashes can be handled by the Oracle database these days, as long as the filesystem metadata has not changed (the situation you’ve described: static filesystem metadata through pre-sized datafiles). On the other hand, I have heard of situations where database controlfiles got out of sync, but I do not know the exact details.

    If the “truth” as seen by the database has changed (the consistency of the stack is altered by filesystem recovery/fsck), it could lead to unrecoverable situations. I cannot predict how likely that is, however.
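
Regarding the copy-on-write mechanism mentioned in the first comment, here is one last toy sketch (my own simplification, not ZFS internals) of why copy-on-write can keep the on-disk state consistent without a journal: changes are written to fresh blocks and become visible only when a single root pointer is switched, so a crash leaves either the complete old state or the complete new one.

```python
# Toy copy-on-write model: blocks are never overwritten in place.
# A change writes new blocks and then atomically switches the root pointer;
# a crash before the switch simply leaves the old, still-consistent state.

blocks = {0: {}}   # block id -> a tiny "tree" of file contents
root = 0           # the single pointer that defines the current state
next_id = 1

def update(name, data, crash_before_switch=False):
    global root, next_id
    new_tree = dict(blocks[root])        # copy the current tree
    new_tree[name] = data                # apply the change to the copy
    blocks[next_id] = new_tree           # write the copy to a fresh block
    if crash_before_switch:
        raise RuntimeError("crash!")     # old root still valid, nothing torn
    root = next_id                       # atomic switch to the new state
    next_id += 1

update("f", "hello world")
try:
    update("f", "goodbye", crash_before_switch=True)
except RuntimeError:
    pass
print(blocks[root])   # {'f': 'hello world'} -- the last committed state, consistent
```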
