22 May, 2010

1 commit

  • Quota must be initialized if a size or uid/gid change is requested.
    But initialization is performed in two different places:
    in the case of i_size the file system is responsible for dquot init,
    but in the case of uid/gid the init is called internally in
    dquot_transfer().
    This ambiguity makes the code harder to understand.
    Let's move this logic to one common helper function.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara

    Dmitry Monakhov
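
The "one common helper" idea can be sketched in userspace C. This is only an illustration under stated assumptions: the `sk_` structs, the `SK_ATTR_*` bits and the `quota_ready` flag are hypothetical stand-ins, not the kernel's types or the helper the commit actually added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the attribute bits of a setattr request. */
enum { SK_ATTR_UID = 1 << 0, SK_ATTR_GID = 1 << 1, SK_ATTR_SIZE = 1 << 2 };

struct sk_inode { unsigned uid, gid; long long size; bool quota_ready; };
struct sk_attr  { unsigned valid; unsigned uid, gid; long long size; };

/* True if the request changes anything that quota accounts for. */
static bool sk_is_quota_modification(const struct sk_inode *ino,
                                     const struct sk_attr *a)
{
    return ((a->valid & SK_ATTR_SIZE) && a->size != ino->size) ||
           ((a->valid & SK_ATTR_UID)  && a->uid  != ino->uid)  ||
           ((a->valid & SK_ATTR_GID)  && a->gid  != ino->gid);
}

/* One place decides about quota init, for size and ownership changes alike. */
static void sk_setattr(struct sk_inode *ino, const struct sk_attr *a)
{
    assert(ino && a);
    if (sk_is_quota_modification(ino, a))
        ino->quota_ready = true;    /* stands in for the quota init call */
    if (a->valid & SK_ATTR_SIZE)
        ino->size = a->size;
    if (a->valid & SK_ATTR_UID)
        ino->uid = a->uid;
    if (a->valid & SK_ATTR_GID)
        ino->gid = a->gid;
}
```

With this shape, a request that sets the size to its current value leaves quota untouched, while a chown to a different uid triggers initialization before the ownership is updated.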
     

30 Mar, 2010

1 commit

  • In commit 9df93939b735 ("ext3: Use bitops to read/modify
    EXT3_I(inode)->i_state") ext3 changed its internal 'i_state' variable to
    use bitops for its state handling. However, unlike the same ext4
    change, it didn't actually change the name of the field when it changed
    the semantics of it.

    As a result, an old use of 'i_state' remained in fs/ext3/ialloc.c that
    initialized the field to EXT3_STATE_NEW. And that does not work
    _at_all_ when we're now working with individually named bits rather than
    values that get masked. So the code tried to mark the state to be new,
    but in actual fact set the field to EXT3_STATE_JDATA. Which makes no
    sense at all, and screws up all the code that checks whether the inode
    was newly allocated.

    In particular, it made the xattr code unhappy, and caused various random
    behavior, like apparently

    https://bugzilla.redhat.com/show_bug.cgi?id=577911

    So fix the initialization, and rename the field to match ext4 so that we
    don't have this happen again.

    Cc: James Morris
    Cc: Stephen Smalley
    Cc: Daniel J Walsh
    Cc: Eric Paris
    Cc: Jan Kara
    Signed-off-by: Linus Torvalds

    Linus Torvalds
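
The failure mode described in the commit above (a bit number stored as if it were a mask value) can be reproduced in a few lines. This is a minimal sketch: the `SK_STATE_*` names and their numbering (JDATA as bit 0, NEW as bit 1) are assumptions chosen to match the behavior the message describes, not a copy of the ext3 headers.

```c
#include <assert.h>

/* Hypothetical state bits, numbered as bit indexes rather than masks. */
enum sk_state { SK_STATE_JDATA = 0, SK_STATE_NEW = 1 };

static void sk_set_state(unsigned long *flags, enum sk_state bit)
{
    assert((unsigned)bit < 8 * sizeof *flags);
    *flags |= 1UL << bit;
}

static int sk_test_state(unsigned long flags, enum sk_state bit)
{
    return (int)((flags >> bit) & 1UL);
}

/* Buggy pattern: the bit *number* is stored directly into the field. */
static unsigned long sk_buggy_init(void)
{
    return SK_STATE_NEW;    /* stores the value 1, i.e. bit 0 == JDATA */
}

/* Fixed pattern: start from zero and set the named bit. */
static unsigned long sk_fixed_init(void)
{
    unsigned long flags = 0;

    sk_set_state(&flags, SK_STATE_NEW);
    return flags;
}
```

Testing the result of `sk_buggy_init()` reports JDATA as set and NEW as clear, which is the confusion the commit fixes. Renaming the field, as the commit does, makes any leftover direct assignment fail to compile instead of silently misbehaving.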
     

06 Mar, 2010

2 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     
  • This gives the filesystem more information about the writeback that
    is happening. Trond requested this for the NFS unstable write handling,
    and other filesystems might benefit from this too by being able to
    distinguish between the different callers in more detail.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

05 Mar, 2010

7 commits

  • Get rid of the initialize dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_initialize helper to __dquot_initialize
    and vfs_dq_init to dquot_initialize to have a consistent namespace.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently various places in the VFS call vfs_dq_init directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the initialization. For most metadata operations
    this is a straightforward move into the methods, but for truncate and
    open it's a bit more complicated.

    For truncate we currently only call vfs_dq_init for the sys_truncate case
    because open already takes care of it for ftruncate and open(O_TRUNC) - the
    new code causes an additional vfs_dq_init for those which is harmless.

    For open the initialization is moved from do_filp_open into the open method,
    which means it happens slightly earlier now, and only for regular files.
    The latter is fine because we don't need to initialize it for operations
    on special files, and we already do it as part of the namespace operations
    for directories.

    Add a dquot_file_open helper that filesystems that support generic quotas
    can use to fill in ->open.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the transfer dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_transfer helper to __dquot_transfer
    and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
    and make the new dquot_transfer return a normal negative errno value
    which all callers expect.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the alloc_space, free_space, reserve_space, claim_space and
    release_rsv dquot operations - they are always called from the filesystem
    and if a filesystem really needs its own (which none currently does)
    it can just call into its own routine directly.

    Move shared logic into the common __dquot_alloc_space,
    dquot_claim_space_nodirty and __dquot_free_space low-level methods,
    and rationalize the wrappers around them to move as much code as
    possible into the common block for CONFIG_QUOTA vs not. Also rename
    all these helpers to be named dquot_* instead of vfs_dq_*.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • - There is a theoretical possibility of performing writepage on
    an RO superblock. Add an explicit check for that case.
    - The page must be locked before writepage.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara

    Dmitry Monakhov
     
  • We have to truncate blocks allocated to file during direct IO when we
    fail to update i_size properly.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • At several places we modify EXT3_I(inode)->i_state without holding i_mutex
    (ext3_release_file, ext3_bmap, ext3_journalled_writepage, ext3_do_update_inode,
    ...). These modifications are racy and we can lose updates to i_state. So
    convert handling of i_state to use bitops which are atomic.

    Signed-off-by: Jan Kara

    Jan Kara
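
A plain `flags &= ~mask` is a read-modify-write sequence, so two CPUs doing it concurrently can overwrite each other's update. The sketch below shows the atomic alternative using C11 `<stdatomic.h>` in userspace. It is an analogy only: the kernel uses its own bitops rather than C11 atomics, and the `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-inode state word, updated only through atomic operations. */
struct sk_inode_info { atomic_ulong state_flags; };

static void sk_atomic_set_state(struct sk_inode_info *ei, unsigned bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    atomic_fetch_or(&ei->state_flags, 1UL << bit);
}

static void sk_atomic_clear_state(struct sk_inode_info *ei, unsigned bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    atomic_fetch_and(&ei->state_flags, ~(1UL << bit));
}

static bool sk_atomic_test_state(struct sk_inode_info *ei, unsigned bit)
{
    return (atomic_load(&ei->state_flags) >> bit) & 1UL;
}
```

Because each helper is a single atomic operation on the whole word, clearing one bit cannot discard a concurrent update to another bit, and no lock such as i_mutex is needed around the update.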
     

23 Dec, 2009

1 commit

  • Currently all quota block reservation macros contain a hardcoded "2",
    aka the MAXQUOTAS value. This is no good because in some places it is not
    obvious what this digit represents. Let's introduce a
    new macro with a self-descriptive name.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara

    Dmitry Monakhov
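
As a rough illustration of the cleanup, the sketch below contrasts a reservation macro that hides the quota-type count in a literal with one that names it. Every `SK_` macro, and the per-type cost of 2 blocks, is a hypothetical placeholder and not one of the ext3 definitions. Only the value 2 for MAXQUOTAS comes from the commit message.

```c
#include <assert.h>

#define SK_MAXQUOTAS 2    /* number of quota types, per the commit message */

/* Hypothetical per-quota-type journal cost: 2 blocks if quota is on, else 0. */
#define SK_QUOTA_TRANS_BLOCKS(quota_on) ((quota_on) ? 2 : 0)

/* Before: the reader has to guess what the leading 2 means. */
#define SK_OLD_TRANS_BLOCKS(quota_on) (2 * SK_QUOTA_TRANS_BLOCKS(quota_on))

/* After: the multiplier is named, so the intent is visible at the call site. */
#define SK_MAXQUOTAS_TRANS_BLOCKS(quota_on) \
    (SK_MAXQUOTAS * SK_QUOTA_TRANS_BLOCKS(quota_on))

/* Hypothetical caller: credits for an operation touching every quota type. */
static int sk_chown_credits(int base_credits, int quota_on)
{
    assert(base_credits >= 0);
    return base_credits + SK_MAXQUOTAS_TRANS_BLOCKS(quota_on);
}
```

Both macros evaluate to the same number. The gain is purely in readability and in having one place to change if the number of quota types ever grows.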
     

10 Dec, 2009

1 commit

  • When ext3_write_begin fails after allocating some blocks or
    generic_perform_write fails to copy data to write, we truncate blocks already
    instantiated beyond i_size. Although these blocks were never inside i_size, we
    have to truncate pagecache of these blocks so that corresponding buffers get
    unmapped. Otherwise subsequent __block_prepare_write (called because we are
    retrying the write) will find the buffers mapped, not call ->get_block, and
    thus the page will be backed by already freed blocks leading to filesystem and
    data corruption.

    Reported-by: James Y Knight
    Signed-off-by: Jan Kara

    Jan Kara
     

08 Dec, 2009

1 commit


04 Dec, 2009

1 commit


11 Nov, 2009

2 commits

  • We cannot rely on buffer dirty bits during fsync because pdflush can come
    before fsync is called and clear dirty bits without forcing a transaction
    commit. What we do is that we track which transaction has last changed
    the inode and which transaction last changed allocation and force it to
    disk on fsync.

    Signed-off-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V

    Jan Kara
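
The tracking scheme from the fsync commit above can be modeled as follows. This is a sketch under assumptions: the `sk_` structs and functions are hypothetical, the journal is reduced to two counters, and mapping "allocation changed" to the datasync case is my reading of the message, not something it states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical journal model: IDs of the running and last committed transaction. */
struct sk_journal { unsigned running_tid; unsigned committed_tid; };

/* Per-inode record of which transaction last touched it. */
struct sk_tracked_inode { unsigned sync_tid; unsigned datasync_tid; };

/* Wraparound-safe "a is newer than b" for transaction IDs. */
static bool sk_tid_gt(unsigned a, unsigned b)
{
    return (int)(a - b) > 0;
}

/* Called whenever the inode is modified inside the running transaction. */
static void sk_note_change(const struct sk_journal *j,
                           struct sk_tracked_inode *ino, bool alloc_changed)
{
    assert(j && ino);
    ino->sync_tid = j->running_tid;
    if (alloc_changed)
        ino->datasync_tid = j->running_tid;
}

/* Forces a commit only if the recorded transaction is not on disk yet. */
static bool sk_fsync(struct sk_journal *j, const struct sk_tracked_inode *ino,
                     bool datasync)
{
    unsigned tid = datasync ? ino->datasync_tid : ino->sync_tid;

    if (!sk_tid_gt(tid, j->committed_tid))
        return false;                 /* already committed, nothing to do */
    j->committed_tid = tid;           /* stands in for commit and wait */
    if (!sk_tid_gt(j->running_tid, tid))
        j->running_tid = tid + 1;     /* a new transaction starts */
    return true;
}
```

The decision depends only on transaction IDs, so it stays correct even if a background flusher has already cleared the buffer dirty bits.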
     
  • On a 256M 4k block filesystem, doing this in a loop:

    dd if=/dev/zero of=test oflag=direct bs=1M count=64
    rm -f test

    eventually leads to spurious ENOSPC:

    dd: writing `test': No space left on device

    As with other block allocation callers, it looks like we need to
    potentially retry the allocations on the initial ENOSPC.

    A similar patch went into ext4 (commit
    fbbf69456619de5d251cb9f1df609069178c62d5)

    Signed-off-by: Eric Sandeen
    Signed-off-by: Jan Kara

    Eric Sandeen
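
The retry-on-ENOSPC pattern mentioned above can be sketched with a toy allocator. All names here are hypothetical, and the model rests on an assumption: that space freed by the preceding `rm` becomes allocatable only once pending transactions commit, so a bounded number of retries can succeed where the first attempt fails.

```c
#include <assert.h>
#include <errno.h>

#define SK_MAX_RETRIES 3    /* assumed bound on retry attempts */

/* Toy filesystem: allocation fails while freed blocks await a commit. */
struct sk_fs { int pending_commits; };

static int sk_try_alloc(const struct sk_fs *fs)
{
    return fs->pending_commits > 0 ? -ENOSPC : 0;
}

/* Returns 1 if another attempt is worthwhile, after "committing" one batch. */
static int sk_should_retry_alloc(struct sk_fs *fs, int *retries)
{
    if (fs->pending_commits == 0 || (*retries)++ >= SK_MAX_RETRIES)
        return 0;
    fs->pending_commits--;    /* stands in for forcing a journal commit */
    return 1;
}

/* Write path: retry the allocation instead of failing on the first ENOSPC. */
static int sk_direct_write(struct sk_fs *fs)
{
    int retries = 0;
    int ret;

    assert(fs);
retry:
    ret = sk_try_alloc(fs);
    if (ret == -ENOSPC && sk_should_retry_alloc(fs, &retries))
        goto retry;
    return ret;
}
```

The bound matters: a filesystem that is genuinely full still returns -ENOSPC after a few attempts rather than looping forever.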
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

16 Sep, 2009

3 commits

  • I've been struggling with this off and on while I've been testing the
    data=guarded work. The symptom is corrupted orphan lists and inodes
    with the wrong i_size stored on disk. I was convinced the
    data=guarded code was just missing a call to ext3_mark_inode_dirty, but
    tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
    wasn't actually making it to the drive.

    ext3_mark_inode_dirty can be called without locks held (atime updates
    and a few others), so the data=guarded code uses locks while updating
    the in-memory inode, and then calls ext3_mark_inode_dirty
    without any locks held.

    But, ext3_mark_inode_dirty has no internal locking to make sure that
    only one CPU is updating the buffer head at a time. Generally this
    works out ok because everyone that changes the inode then calls
    ext3_mark_inode_dirty themselves. Even though it races, eventually
    someone updates the buffer heads and things move on.

    But there is still a risk of the wrong values getting in, and the
    data=guarded code seems to hit the race very often.

    Since everyone that changes the inode also logs it, it should be
    possible to fix this with some memory barriers. I'll leave that as an
    exercise to the reader and lock the buffer head instead.

    It is probably a good idea to have a different patch series for lockless
    bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears
    EXT3_STATE_NEW without any locks held.

    Signed-off-by: Chris Mason
    Signed-off-by: Jan Kara

    Chris Mason
     
  • During truncate we are sometimes forced to start a new transaction as the
    amount of blocks to be journaled is both quite large and hard to predict. So
    far we restarted a transaction while holding truncate_mutex and that violates
    lock ordering because truncate_mutex ranks below transaction start (and it
    can lead to a real deadlock with ext3_get_blocks() allocating new blocks
    from ext3_writepage()).

    Luckily, the problem is easy to fix: We just drop the truncate_mutex before
    restarting the transaction and acquire it afterwards. We are safe to do this as
    by the time ext3_truncate() is called, all the page cache for the truncated
    part of the file is dropped and so writepage() cannot come and allocate new
    blocks in the part of the file we are truncating. The remaining writers are
    stopped by us holding i_mutex.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Enable removing of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
    These should cover most server needs.

    I chose the set of migration aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
     

16 Jul, 2009

2 commits

  • Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
    be a relic from some old days and setting disksize in this function does not
    make much sense. Currently it was set only by ext3_getblk(). Since the
    parameter has some effect only if create == 1, it is easy to check that the
    three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
    ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Contents of long symlinks are written via standard write methods. So when the
    write fails, we add inode to orphan list. But symlinks don't have .truncate
    method defined so nobody properly removes them from the orphan list (both on
    disk and in memory).

    Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
    (which is saner anyway since we don't need anything vmtruncate() does except
    for calling .truncate in these paths). We also add inode to orphan list only
    if ext3_can_truncate() is true (currently, it can be false for symlinks when
    there are no blocks allocated) - otherwise orphan list processing will complain
    and ext3_truncate() will not remove inode from on-disk orphan list.

    Signed-off-by: Jan Kara

    Jan Kara
     

24 Jun, 2009

1 commit


19 Jun, 2009

2 commits

  • As Ted pointed out, it can happen that ext3_truncate() returns without
    removing inode from orphan list. This way we could in some rare cases
    (like when we get ENOMEM from an allocation in ext3_truncate called
    because of failed ext3_write_begin) leave the inode on orphan list and
    that triggers assertion failure on umount.

    So make ext3_truncate() always remove inode from in-memory orphan list.

    Cc: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Chain verification in ext3_get_blocks() has been hosed since it called
    verify_chain(chain, NULL) which always returns success. As a result
    readers could in theory race with truncate. On the other hand the race
    probably cannot happen with the current locking scheme, since by the
    time ext3_truncate() is called all the pages are already removed and
    hence get_block() shouldn't be called on such pages...

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

12 Jun, 2009

1 commit


09 Apr, 2009

1 commit


04 Apr, 2009

1 commit


03 Apr, 2009

2 commits

  • In data=writeback mode, start an asynchronous flush when closing a
    file which had been previously truncated down to zero. This lowers
    the probability of data loss in the case of applications that attempt
    to replace a file using truncate.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Sometimes block_write_begin() can map buffers in a page but later we
    fail to copy data into those buffers (because the source page has been
    paged out in the mean time). We then end up with !uptodate mapped
    buffers. To add a bit more to the confusion, block_write_end() does
    not commit any data (and thus does not mark any buffers as uptodate) if
    we didn't succeed with copying all the data.

    Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new
    aops) missed these cases and thus we were inserting non-uptodate
    buffers to transaction's list which confuses JBD code and it reports IO
    errors, aborts a transaction and generally makes users afraid about
    their data ;-P.

    This patch fixes the problem by reorganizing ext3_..._write_end() code
    to first call block_write_end() to mark buffers with valid data
    uptodate and after that we file only uptodate buffers to transaction's
    lists.

    We also fix a problem where we could leave blocks allocated beyond i_size
    (i_disksize in fact) because of failed write. We now add inode to orphan
    list when write fails (to be safe in case we crash) and then truncate blocks
    beyond i_size in a separate transaction.

    Signed-off-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V
    Cc: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

28 Mar, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
    ext2: Zero our b_size in ext2_quota_read()
    trivial: fix typos/grammar errors in fs/Kconfig
    quota: Coding style fixes
    quota: Remove superfluous inlines
    quota: Remove uppercase aliases for quota functions.
    nfsd: Use lowercase names of quota functions
    jfs: Use lowercase names of quota functions
    udf: Use lowercase names of quota functions
    ufs: Use lowercase names of quota functions
    reiserfs: Use lowercase names of quota functions
    ext4: Use lowercase names of quota functions
    ext3: Use lowercase names of quota functions
    ext2: Use lowercase names of quota functions
    ramfs: Remove quota call
    vfs: Use lowercase names of quota functions
    quota: Remove dqbuf_t and other cleanups
    quota: Remove NODQUOT macro
    quota: Make global quota locks cacheline aligned
    quota: Move quota files into separate directory
    ext4: quota reservation for delayed allocation
    ...

    Linus Torvalds
     

27 Mar, 2009

1 commit

  • We don't have to start a transaction in writepage() when all the blocks
    are properly allocated. Even in ordered mode either the data has been
    written via write() and they are thus already added to transaction's list
    or the data was written via mmap and then it's random in which transaction
    they get written anyway.

    This should help VM to pageout dirty memory without blocking on transaction
    commits.

    Signed-off-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Jan Kara
     

26 Mar, 2009

1 commit


05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in its write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
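
How a write_begin implementation might act on the new flag can be sketched as a pure function from AOP flags to an allocation mask. The numeric values and the `SK_` names below are hypothetical placeholders, not the kernel's definitions. The only thing taken from the commit is the idea that the NOFS flag removes the "may enter the filesystem" permission from the allocation.

```c
#include <assert.h>

/* Hypothetical flag and mask values, for illustration only. */
#define SK_AOP_FLAG_NOFS 0x4u
#define SK_GFP_IO        0x40u
#define SK_GFP_FS        0x80u
#define SK_GFP_KERNEL    (SK_GFP_IO | SK_GFP_FS)

/* Derive the allocation mask for the pagecache page from the AOP flags. */
static unsigned sk_write_begin_gfp(unsigned aop_flags)
{
    unsigned gfp = SK_GFP_KERNEL;

    if (aop_flags & SK_AOP_FLAG_NOFS)
        gfp &= ~SK_GFP_FS;    /* reclaim must not re-enter the filesystem */
    assert(gfp & SK_GFP_IO);  /* only the FS permission is ever dropped */
    return gfp;
}
```

A caller such as the symlink path would pass the NOFS flag, while ordinary writes pass 0 and keep the full mask.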
     

01 Jan, 2009

1 commit


20 Oct, 2008

1 commit

  • For blocksize < pagesize we need to remove blocks that got allocated in
    block_write_begin() if we fail with ENOSPC for later blocks.
    block_write_begin() internally does this if it allocated page locally.
    This makes sure we don't have blocks outside inode.i_size during ENOSPC.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

04 Oct, 2008

1 commit

  • Any block based fs (this patch includes ext3) just has to declare its own
    fiemap() function and then call this generic function with its own
    get_block_t. This works well for block based filesystems that will map
    multiple contiguous blocks at one time, but will also work for filesystems that
    only map one block at a time; you will just end up with an "extent" for each
    block. One gotcha is this will not play nicely where there is hole+data
    after the EOF. This function will assume it has hit the end of the data as soon
    as it hits a hole after the EOF, so if there is any data past that it will
    not pick that up. AFAIK no block based fs does this anyway, but it's in the
    comments of the function anyway just in case.

    Signed-off-by: Josef Bacik
    Signed-off-by: Mark Fasheh
    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org

    Josef Bacik
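
The way a generic helper can build extents from a per-block mapping callback is sketched below. It is a simplified userspace model with hypothetical `sk_` names: it works in block units rather than bytes, treats physical block 0 as a hole, and skips holes instead of implementing the end-of-file rule described in the message.

```c
#include <assert.h>
#include <stddef.h>

#define SK_HOLE 0UL    /* assumed: physical block 0 means "not mapped" */

struct sk_extent { unsigned long logical, physical, length; };

/* Callback type: map one logical block to a physical block, or SK_HOLE. */
typedef unsigned long (*sk_get_block_t)(unsigned long logical);

/*
 * Walk logical blocks [0, nblocks) and merge runs that are contiguous both
 * logically and physically. Returns the number of extents stored in out.
 */
static size_t sk_block_fiemap(sk_get_block_t get_block, unsigned long nblocks,
                              struct sk_extent *out, size_t max)
{
    size_t n = 0;

    assert(get_block && out);
    for (unsigned long lblk = 0; lblk < nblocks; lblk++) {
        unsigned long pblk = get_block(lblk);

        if (pblk == SK_HOLE)
            continue;
        if (n > 0 &&
            out[n - 1].logical + out[n - 1].length == lblk &&
            out[n - 1].physical + out[n - 1].length == pblk) {
            out[n - 1].length++;    /* extend the current extent */
            continue;
        }
        if (n == max)
            break;                  /* caller's array is full */
        out[n++] = (struct sk_extent){ lblk, pblk, 1 };
    }
    return n;
}

/* Demo mapping: two contiguous runs, a hole, then two scattered blocks. */
static unsigned long sk_demo_get_block(unsigned long lblk)
{
    static const unsigned long map[] = {
        100, 101, 102, SK_HOLE, 200, 201, 300, 302
    };

    return lblk < sizeof map / sizeof map[0] ? map[lblk] : SK_HOLE;
}
```

With the demo mapping, the first six logical blocks collapse into two extents, while the last two blocks, which are not physically adjacent, each become an extent of their own, matching the "one extent per block" case the message mentions.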
     

29 Jul, 2008

1 commit

  • When we read some part of a file through pagecache, if there is a
    pagecache of corresponding index but this page is not uptodate, read IO
    is issued and this page will be uptodate.

    I think this is good for pagesize == blocksize environment but there is
    room for improvement on pagesize != blocksize environment. Because in
    this case a page can have multiple buffers and even if a page is not
    uptodate, some buffers can be uptodate.

    So I suggest that when all buffers which correspond to a part of a file
    that we want to read are uptodate, use this pagecache and copy data from
    this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    I wrote a benchmark program and measured the effect of the patch with it.

    This benchmark does:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close a file and umount.

    4: mount and again open a test file.

    5: pwrite randomly 300000 times on a test file. The offset is aligned
    to the IO size (1024 bytes).

    6: measure time of preading randomly 100000 times on a test file.

    The result was:
    2.6.26
    330 sec

    2.6.26-patched
    226 sec

    Arch:i386
    Filesystem:ext3
    Blocksize:1024 bytes
    Memory: 1GB

    On ext3/4, a file is written through buffer/block. So random read/write
    mixed workloads or random read after random write workloads are optimized
    with this patch under pagesize != blocksize environment. This test result
    showed this.

    The benchmark program is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mount.h>

    #define LEN 1024
    #define LOOP (1024*512) /* 512MB */

    int main(void)
    {
        unsigned long i, offset, filesize;
        int fd;
        char buf[LEN];
        time_t t1, t2;

        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        memset(buf, 0, LEN);
        /* O_CREAT requires an explicit mode argument */
        fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }
        /* create the 512MB test file */
        for (i = 0; i < LOOP; i++)
            write(fd, buf, LEN);
        close(fd);
        /* remount to drop the page cache for the file */
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        fd = open("/root/test1/testfile", O_RDWR);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }

        filesize = (unsigned long)LEN * LOOP;
        /* random block-aligned writes leave pages partially uptodate */
        for (i = 0; i < 300000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pwrite(fd, buf, LEN, offset);
        }
        printf("start test\n");
        time(&t1);
        /* timed phase: random block-aligned reads */
        for (i = 0; i < 100000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pread(fd, buf, LEN, offset);
        }
        time(&t2);
        printf("%ld sec\n", (long)(t2 - t1));
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }
        return 0;
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
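
The core check proposed in the commit above (serve a read from the page cache when every buffer covering the requested range is uptodate, even if the page as a whole is not) can be sketched like this. The `sk_` names are hypothetical, the 4096-byte page size is an assumption, and the 1024-byte block size is taken from the benchmark setup. This is a model of the idea, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE       4096u    /* assumed page size */
#define SK_BLOCK_SIZE      1024u    /* block size used in the benchmark */
#define SK_BLOCKS_PER_PAGE (SK_PAGE_SIZE / SK_BLOCK_SIZE)

/* Hypothetical page model: one uptodate flag per block-sized buffer. */
struct sk_page {
    bool page_uptodate;
    bool buffer_uptodate[SK_BLOCKS_PER_PAGE];
};

/* True if bytes [from, from + count) can be copied without issuing read IO. */
static bool sk_range_is_uptodate(const struct sk_page *pg, size_t from,
                                 size_t count)
{
    assert(count > 0 && from + count <= SK_PAGE_SIZE);
    if (pg->page_uptodate)
        return true;

    size_t first = from / SK_BLOCK_SIZE;
    size_t last = (from + count - 1) / SK_BLOCK_SIZE;

    for (size_t i = first; i <= last; i++)
        if (!pg->buffer_uptodate[i])
            return false;
    return true;
}
```

When pagesize equals blocksize there is a single buffer per page and the check degenerates to the page flag, which is consistent with the message's point that the gain appears only when pagesize != blocksize.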
     

26 Jul, 2008

1 commit

  • While freeing indirect blocks we attach a journal head to the parent
    buffer head, free the blocks, then journal the parent. If the indirect
    block list is corrupted and points to the parent the journal head will be
    detached when the block is cleared, causing an OOPS.

    Check for that explicitly and handle it gracefully.

    This patch fixes the third case (image hdb.20000057.nullderef.gz)
    reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

    Immediately above the change, in the ext3_free_data function, we call
    ext3_clear_blocks to clear the indirect blocks in this parent block. If
    one of those blocks happens to actually be the parent block it will clear
    b_private / BH_JBD.

    I did the check at the end rather than earlier as it seemed more elegant.
    I don't think there should be much practical difference, although it is
    possible the FS may not be quite so badly corrupted if we did it the other
    way (and didn't clear the block at all). To be honest, I'm not convinced
    there aren't other similar failure modes lurking in this code, although I
    couldn't find any with a quick review.

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Duane Griffin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin