10 Jun, 2009

2 commits

  • In commit code, we scan buffers attached to a transaction. During this
    scan, we sometimes have to drop j_list_lock and then we recheck whether
    the journal buffer head didn't get freed by journal_try_to_free_buffers().
    But checking for buffer_jbd(bh) isn't enough because a new journal head
    could get attached to our buffer head. So add a check whether the journal
    head remained the same and whether it's still at the same transaction and
    list.

    This is a nasty bug and can cause problems like memory corruption (use after
    free) or trigger various assertions in JBD code (observed).

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
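
    The recheck described in this commit can be sketched in user-space C.
    This is only an illustration under stated assumptions: the structures
    and the journal_head_unchanged() helper below are hypothetical,
    simplified stand-ins, not the kernel's definitions or the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct transaction { int tid; };

struct journal_head {
    struct transaction *transaction; /* transaction the buffer belongs to */
    int jlist;                       /* which per-transaction list it is on */
};

struct buffer_head {
    struct journal_head *jh;         /* NULL when no journal head is attached */
};

/*
 * After re-taking the list lock, decide whether the buffer is still in the
 * state we saw before the lock was dropped. Checking only "is a journal
 * head attached?" is not enough: a different one may have been attached.
 */
static bool journal_head_unchanged(const struct buffer_head *bh,
                                   const struct journal_head *seen_jh,
                                   const struct transaction *seen_tx,
                                   int seen_list)
{
    if (bh->jh == NULL)        /* journal head was freed */
        return false;
    if (bh->jh != seen_jh)     /* a new journal head got attached */
        return false;
    /* same journal head: is it still on the same transaction and list? */
    return bh->jh->transaction == seen_tx && bh->jh->jlist == seen_list;
}
```

    A caller would record the journal head, transaction and list before
    dropping the lock, and restart its scan whenever the helper returns
    false.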
     
  • The recent ->lookup() deadlock correction required the directory inode
    mutex to be dropped while waiting for expire completion. We were
    concerned about side effects from this change and one has been identified.

    I saw several error messages.

    They cause autofs to become quite confused and don't really point to the
    actual problem.

    Things like:

    handle_packet_missing_direct:1376: can't find map entry for (43,1827932)

    which is usually totally fatal (although in this case it wouldn't be
    except that I treat it as such because it normally is).

    do_mount_direct: direct trigger not valid or already mounted
    /test/nested/g3c/s1/ss1

    which is recoverable, however if this problem is at play it can cause
    autofs to become quite confused as to the dependencies in the mount tree
    because mount triggers end up mounted multiple times. It's hard to
    accurately check for this over mounting case and automount shouldn't need
    to if the kernel module is doing its job.

    There was one other message, similar in consequence to this last one, but I
    can't locate a log example just now.

    When checking if a mount has already completed prior to adding a new mount
    request to the wait queue we check if the dentry is hashed and, if so, if
    it is a mount point. But, if a mount successfully completed while we
    slept on the wait queue mutex the dentry must exist for the mount to have
    completed so the test is not really needed.

    Mounts can also be done on top of a global root dentry, so for the above
    case, where a mount request completes and the wait queue entry has already
    been removed, the hashed test returning false can cause an incorrect
    callback to the daemon. Also, d_mountpoint() is not sufficient to check
    if a mount has completed for the multi-mount case when we don't have a
    real mount at the base of the tree.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     

07 Jun, 2009

1 commit

  • CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
    until the system runs out of memory. Nowhere is calling ima_inode_free()
    a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Jun, 2009

3 commits

  • OK, that's probably the easiest way to do that, as much as I don't like it...
    Since iget() et al. will not accept I_FREEING (they will wait for it to go
    away and restart), and since we'd better have serialization between new/free
    on fs data structures anyway, we can afford simply skipping I_FREEING
    et al. in insert_inode_locked().

    We do that from new_inode, so it won't race with free_inode in any interesting
    ways and it won't race with iget (of any origin; nfsd or in case of fs
    corruption a lookup) since both still will wait for I_LOCK.

    Reviewed-by: "Theodore Ts'o"
    Acked-by: Jan Kara
    Tested-by: David Watson
    Signed-off-by: Al Viro

    Al Viro
     
  • The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of
    these three, only the get_block() functions of ext2 and jfs pay attention
    to bh->b_size --- which is normally always the filesystem blocksize
    except when the get_block() function is called by either
    mpage_readpage(), mpage_readpages(), or the direct I/O routines in
    fs/direct-io.c.

    Unfortunately, nobh_truncate_page() does not initialize map_bh before
    calling the filesystem-supplied get_block() function. So ext2 and jfs
    will try to calculate the number of blocks to map by taking stack
    garbage and shifting it right by inode->i_blkbits. This should be
    *mostly* harmless (except the filesystem will do some unneeded work)
    unless the stack garbage is less than the filesystem's blocksize, in which
    case maxblocks will be zero, and the attempt to find out whether or
    not the filesystem has a hole at a given logical block will fail, and
    the page cache entry might not get zero'ed out.

    Also if the stack garbage in map_bh->state happens to have the
    BH_Mapped bit set, there could be an attempt to call readpage() on a
    non-existent page, which could cause nobh_truncate_page() to return an
    error when it should not.

    Fix this by initializing map_bh->state and map_bh->size.

    Fortunately, it's probably fairly unlikely that ext2 and jfs users
    mount with nobh these days.

    Signed-off-by: "Theodore Ts'o"
    Cc: Dave Kleikamp
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    Theodore Ts'o
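
    The zero-maxblocks case in this commit comes down to simple arithmetic.
    The sketch below assumes the block count is derived by shifting b_size
    right by the block-size bits (which is what makes any value smaller
    than one block yield zero); max_blocks() is a hypothetical helper, not
    a kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of whole blocks that fit in b_size when the block size is
 * (1 << blkbits) bytes. Any b_size below one block gives 0.
 */
static uint64_t max_blocks(size_t b_size, unsigned int blkbits)
{
    return (uint64_t)b_size >> blkbits;
}
```

    With 4096-byte blocks (blkbits = 12), a garbage b_size of 1000 maps
    zero blocks, while a properly initialized b_size of 4096 maps one.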
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: Fix oops and use after free during space balancing
    Btrfs: set device->total_disk_bytes when adding new device

    Linus Torvalds
     

05 Jun, 2009

1 commit

  • The btrfs allocator uses list_for_each to walk the available block
    groups when searching for free blocks. It starts off with a hint
    to help find the best block group for a given allocation.

    The hint is resolved into a block group, but we don't properly check
    to make sure the block group we find isn't in the middle of being
    freed due to filesystem shrinking or balancing. If it is being
    freed, the list pointers in it are bogus and can't be trusted. But,
    the code happily goes along and uses them in the list_for_each loop,
    leading to all kinds of fun.

    The fix used here is to check to make sure the block group we find really
    is on the list before we use it. list_del_init is used when removing
    it from the list, so we can do a proper check.

    The allocation clustering code has a similar bug where it will trust
    the block group in the current free space cluster. If our allocation
    flags have changed (going from single spindle dup to raid1 for example)
    because the drives in the FS have changed, we're not allowed to use
    the old block group any more.

    The fix used here is to check the current cluster against the
    current allocation flags.

    Signed-off-by: Chris Mason

    Chris Mason
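
    The "is it really on the list" check works because list_del_init()
    leaves a removed entry pointing at itself. Below is a minimal
    user-space sketch: the list helpers are small reimplementations that
    mirror the kernel names, and struct block_group and
    block_group_on_list() are hypothetical names for illustration only.

```c
#include <stdbool.h>

struct list_head { struct list_head *next, *prev; };

/* An initialized, unlinked entry points at itself. */
static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink the entry and reinitialize it so it points at itself again. */
static void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    INIT_LIST_HEAD(entry);
}

static bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Hypothetical stand-in for a block group linked into an allocator list. */
struct block_group { struct list_head list; };

/* True only while the group is still linked into some list. */
static bool block_group_on_list(const struct block_group *bg)
{
    return !list_empty(&bg->list);
}
```

    A search loop would call block_group_on_list() on the hinted group
    before starting to walk from it, and fall back to the list head
    otherwise.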
     

04 Jun, 2009

1 commit


03 Jun, 2009

1 commit


02 Jun, 2009

3 commits

  • It's possible to recurse into the filesystem from memory
    allocation, which deadlocks in xfs_qm_shake(). Add a check
    for __GFP_FS, and bail out if it is not set.

    Signed-off-by: Felix Blyakher
    Signed-off-by: Hedi Berriche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Andi Kleen
    Signed-off-by: Felix Blyakher

    Felix Blyakher
     
  • In the case where growing a filesystem would leave the last AG
    too small, the fixup code has an overflow in the calculation
    of the new size with one fewer ag, because "nagcount" is a 32
    bit number. If the new filesystem has > 2^32 blocks in it
    this causes a problem resulting in an EINVAL return from growfs:

    # xfs_io -f -c "truncate 19998630180864" fsfile
    # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
    # mount -o loop fsfile /mnt
    # xfs_growfs /mnt

    meta-data=/dev/loop0 isize=256 agcount=52,
    agsize=76288719 blks
    = sectsz=512 attr=2
    data = bsize=4096 blocks=3905982455, imaxpct=5
    = sunit=0 swidth=0 blks
    naming =version 2 bsize=4096 ascii-ci=0
    log =internal bsize=4096 blocks=32768, version=2
    = sectsz=512 sunit=0 blks, lazy-count=0
    realtime =none extsz=4096 blocks=0, rtextents=0
    xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

    Reported-by: richard.ems@cape-horn-eng.com
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Eric Sandeen
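
    The overflow can be worked through with the numbers from the
    reproducer. The backing file holds 19998630180864 / 4096 = 4882478071
    blocks, which is 64 AGs of 76288719 blocks plus 55 left over. Assuming
    55 blocks is below the minimum AG size, the fixup recomputes the size
    as 64 * 76288719. The helpers below are hypothetical and only model
    the arithmetic, not the actual growfs code.

```c
#include <stdint.h>

/* Buggy shape: both operands are 32 bits wide, so the product wraps mod 2^32. */
static uint64_t new_blocks_32bit(uint32_t nagcount, uint32_t agblocks)
{
    uint32_t product = nagcount * agblocks;
    return product;
}

/* Fixed shape: widen one operand first so the multiply is done in 64 bits. */
static uint64_t new_blocks_64bit(uint32_t nagcount, uint32_t agblocks)
{
    return (uint64_t)nagcount * agblocks;
}
```

    The 32-bit product wraps to 587510720, which is smaller than the
    current 3905982455 blocks; that would make the grow request look like
    a shrink, consistent with the EINVAL shown above. The 64-bit product
    is the intended 4882478016.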
     
  • Regression from commit ef8f7fc, which rearranged the code in
    xfs_swap_extents(), leading to a double unlock of the xfs inode ilock.
    That resulted in xfs_fsr deadlocking itself on platforms that
    don't handle a double unlock of an rw_semaphore nicely. It caused the
    count to go negative, which represents a write holder, without
    really having one. ia64 is one of the platforms where the deadlock
    was easily reproduced and the fix was tested.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Eric Sandeen
    Signed-off-by: Felix Blyakher

    Felix Blyakher
     

30 May, 2009

2 commits


29 May, 2009

7 commits

  • * git://git.infradead.org/~dwmw2/mtd-2.6.30:
    jffs2: Fix corruption when flash erase/write failure
    mtd: MXC NAND driver fixes (v5)

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
    Driver Core: do not oops when driver_unregister() is called for unregistered drivers
    sysfs: file.c: use create_singlethread_workqueue()

    Linus Torvalds
     
  • * 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
    svcrdma: dma unmap the correct length for the RPCRDMA header page.
    nfsd: Revert "svcrpc: take advantage of tcp autotuning"
    nfsd: fix hung up of nfs client while sync write data to nfs server

    Linus Torvalds
     
  • The flat loader uses an architecture's flat_stack_align() to align the
    stack but assumes word-alignment is enough for the data sections.

    However, on the Xtensa S6000 we have registers up to 128 bits wide
    which can be used from userspace and therefore need userspace stack and
    data-section alignment of at least this size.

    This patch drops flat_stack_align() and uses the same alignment that
    is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's
    not defined by the architecture.

    It also fixes m32r which was obviously kaput, aligning an
    uninitialized stack entry instead of the stack pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oskar Schirmer
    Cc: David Howells
    Cc: Russell King
    Cc: Bryan Wu
    Cc: Geert Uytterhoeven
    Acked-by: Paul Mundt
    Cc: Greg Ungerer
    Signed-off-by: Johannes Weiner
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oskar Schirmer
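
    The alignment requirement can be illustrated with the usual
    power-of-two rounding idiom. This is a generic sketch, not the flat
    loader's code: align_up() and align_down() are hypothetical helpers,
    and 16 bytes stands in for the 128-bit register width mentioned above.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to a multiple of align, as for a downward-growing stack. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1);
}
```

    An address such as 0x1004 is already word-aligned (4 bytes) but must
    move to 0x1010 to satisfy a 16-byte requirement.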
     
  • proc_pident_instantiate() has the following call flow.

        proc_pident_lookup()
          proc_pident_instantiate()
            proc_pid_make_inode()

    And proc_pident_lookup() has the following error handling.

        const struct pid_entry *p, *last;
        error = ERR_PTR(-ENOENT);
        if (!task)
                goto out_no_task;

    So proc_pident_instantiate() should return ENOENT too when a race against
    exit(2) occurs.

    EINVAL is a bad choice for two reasons.
    - It implies the caller is wrong, but the race isn't the caller's mistake.
    - man 2 open doesn't explain EINVAL, so users often don't handle it.

    Note: Other proc_pid_make_inode() callers already use ENOENT properly.

    Acked-by: Eric W. Biederman
    Cc: Alexey Dobriyan
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Erase errors such as:
    "Newly-erased block contained word 0xa4ef223e at offset 0x0296a014"
    and failure to write the clean marker
    move the offending erase block to the erasing list before calling
    jffs2_erase_failed(). This is bad as jffs2_erase_failed() will
    also move the block to the bad_list, but is now moving the
    wrong block, causing FS corruption.

    Signed-off-by: Joakim Tjernlund
    Signed-off-by: David Woodhouse

    Joakim Tjernlund
     
  • We don't need a kernel thread per CPU for this application.

    Acked-by: Alex Chiang
    Cc: Lai Jiangshan
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     

28 May, 2009

3 commits

  • Commit 'Short write in nfsd becomes a full write to the client'
    (31dec2538e45e9fff2007ea1f4c6bae9f78db724) broke sync writes.
    The following commands reproduce the problem:

    $ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
    $ cd /mnt
    $ echo aaaa > temp.txt

    The NFS client then hangs.

    In SYNC mode the server always returns a write count of 0 to the
    client. This is because the value of host_err in nfsd_vfs_write()
    is overwritten in SYNC mode by 'host_err=nfsd_sync(file);',
    and then we return host_err (which is now 0) as the write count.

    This patch fixes the problem.

    Signed-off-by: Wei Yongjun
    Signed-off-by: J. Bruce Fields

    Wei Yongjun
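
    The shape of this bug can be modelled in a few lines. The sketch is an
    assumption-laden illustration, not the nfsd code: fake_sync() is a
    hypothetical stand-in for a sync call that returns 0 on success, and
    the two functions only show how one variable ends up serving as both
    byte count and error code.

```c
/* Hypothetical stand-in for a sync call that returns 0 on success. */
static int fake_sync(void)
{
    return 0;
}

/* Buggy shape: the byte count in host_err is lost when sync overwrites it. */
static int write_count_buggy(int bytes_written, int sync_mode)
{
    int host_err = bytes_written;

    if (host_err >= 0 && sync_mode)
        host_err = fake_sync();
    return host_err;            /* 0 in sync mode, whatever was written */
}

/* Fixed shape: save the count before host_err is reused for the sync result. */
static int write_count_fixed(int bytes_written, int sync_mode)
{
    int host_err = bytes_written;
    int count = host_err;

    if (host_err >= 0 && sync_mode)
        host_err = fake_sync();
    return host_err < 0 ? host_err : count;
}
```

    In the buggy variant a 5-byte sync write reports 0 bytes, so a client
    that retries until everything is written would never make progress.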
     
  • Fix up renamed filenames in comments in fs/cachefiles/internal.h.

    Originally, the files were all called cf-xxx.c, but they got renamed to
    just xxx.c.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix up renamed filenames in comments in fs/fscache/internal.h.

    Originally, the files were all called fsc-xxx.c, but they got renamed to
    just xxx.c.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

27 May, 2009

2 commits

  • If the asynchronous lease renewal fails (usually due to a soft timeout),
    then we _must_ schedule state recovery in order to ensure that we don't
    lose the lease unnecessarily or, if the lease is already lost, that we
    recover the locking state promptly...

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • fix build error with latest kbuild adjustments to initconst.

    The commit a447c0932445f92ce6f4c1bd020f62c5097a7842 ("vfs: Use
    const for kernel parser table") changed:

    static match_table_t __initdata tokens = {
    to
    static match_table_t __initconst tokens = {

    But the missing const causes powerpc to fail with the latest
    updates to __initconst like this:

    fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
    fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict

    The bug is only present with kbuild-next.
    The following patch has been build-tested.

    Signed-off-by: Sam Ravnborg
    Cc: Steven Whitehouse
    Cc: Stephen Rothwell
    Acked-by: Jan Beulich
    Signed-off-by: Trond Myklebust

    Sam Ravnborg
     

24 May, 2009

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
    [CIFS] Avoid open on possible directories since Samba now rejects them

    Linus Torvalds
     
  • Small change (mostly formatting) to limit lookup based open calls to
    file create only.

    After discussion yesterday on samba-technical about the posix lookup
    regression, and looking at a problem with cifs posix open to one
    particular Samba version, Jeff and JRA realized that Samba server's
    behavior changed in this area (posix open behavior on files vs.
    directories). To make this behavior consistent, JRA just made a
    fix to Samba server to alter how it handles open of directories (now
    returning the equivalent of EISDIR instead of success). Since we don't
    know at lookup time whether the inode is a directory or file (and
    thus whether posix open will succeed with most current Samba server),
    this change avoids the posix open code on lookup open (just issues
    posix open on creates). This gets the semantic benefits we want
    (atomicity, posix byte range locks, improved write semantics on newly
    created files) and file create still is fast, and we avoid the problem
    that Jeff noticed yesterday with "openat" (and some open directory
    calls) of non-cached directories to one version of Samba server, and
    will work with future Samba versions (which include the fix jra just
    pushed into Samba server). I confirmed this approach with jra
    yesterday and with Shirish today.

    Posix open is only called (at lookup time) for file create now.
    For opens (rather than creates), because we do not know if it
    is a file or directory yet, and current Samba no longer allows
    us to do posix open on dirs, we could end up wasting an open call
    on what turns out to be a dir. For file opens, we wait to call posix
    open till cifs_open. It could be added here (lookup) in the future
    but the performance tradeoff of the extra network request when EISDIR
    or EACCES is returned would have to be weighed against the 50%
    reduction in network traffic in the other paths.

    Reviewed-by: Shirish Pargaonkar
    Tested-by: Jeff Layton
    CC: Jeremy Allison
    Signed-off-by: Steve French

    Steve French
     

22 May, 2009

3 commits


20 May, 2009

1 commit


19 May, 2009

2 commits

  • This is the third respin of the patch posted yesterday to fix the error
    handling in cifs_follow_symlink. It also includes a fix for a bogus NULL
    pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted.

    It's possible for CIFSSMBQueryUnixSymLink to return without setting
    target_path to a valid pointer. If that happens then the current value
    to which we're initializing this pointer could cause an oops when it's
    kfree'd.

    This patch is a little more comprehensive than the last patches. It
    reorganizes cifs_follow_link a bit for (hopefully) better readability.
    It should also eliminate the unneeded allocation of full_path on servers
    without unix extensions (assuming they can get to this point anyway, of
    which I'm not convinced).

    On a side note, I'm not sure I agree with the logic of enabling this
    query even when unix extensions are disabled on the client. It seems
    like that should disable this as well. But, changing that is outside the
    scope of this fix, so I've left it alone for now.

    Reported-by: Jeff Moyer
    Signed-off-by: Jeff Layton
    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Steve French

    Jeff Layton
     
  • The problem is that permission checking is skipped if atomic open is
    possible, but when exec opens a file, it just opens it O_RDONLY, which
    means EXEC permission will not be checked at that time.

    This problem is observed by the following sequence (executed as root):

    mount -t nfs4 server:/ /mnt4
    echo "ls" >/mnt4/foo
    chmod 744 /mnt4/foo
    su guest -c "mnt4/foo"

    Signed-off-by: Frank Filz
    Signed-off-by: Trond Myklebust
    Cc: stable@kernel.org
    Tested-by: Eugene Teo
    Signed-off-by: Linus Torvalds

    Frank Filz
     

18 May, 2009

3 commits


15 May, 2009

3 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: Fix race in ext4_inode_info.i_cached_extent
    ext4: Clear the unwritten buffer_head flag after the extent is initialized
    ext4: Use a fake block number for delayed new buffer_head
    ext4: Fix sub-block zeroing for writes into preallocated extents

    Linus Torvalds
     
  • devpts_get_sb() calls memset(0) to clear mount options and calls
    parse_mount_options() if user specified any mount options.

    The memset(0) is bogus since the 'mode' and 'ptmxmode' options are
    non-zero by default. parse_mount_options() restores options to default
    anyway and can properly deal with NULL mount options.

    So in devpts_get_sb() remove memset(0) and call parse_mount_options() even
    for NULL mount options.

    Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448.

    Signed-off-by: Sukadev Bhattiprolu
    Tested-by: Marc Dionne
    Reported-by: Eric Paris
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: "Rafael J. Wysocki"
    Reviewed-by: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
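
    Why zeroing the options is wrong when the defaults are non-zero can be
    shown with a small model. Everything here is hypothetical: the struct,
    the EXAMPLE_ default values and parse_opts() are illustrative only,
    and the parser understands just a single leading "mode=<octal>" option.

```c
#include <stdio.h>
#include <string.h>

/* Assumed default values, chosen only for illustration. */
#define EXAMPLE_DEFAULT_MODE     0600
#define EXAMPLE_DEFAULT_PTMXMODE 0666

struct pts_opts {
    unsigned int mode;
    unsigned int ptmxmode;
};

/*
 * Always restore the defaults first, then apply what the user supplied.
 * A NULL option string is valid and simply leaves the defaults in place.
 */
static void parse_opts(const char *data, struct pts_opts *opts)
{
    unsigned int value;

    opts->mode = EXAMPLE_DEFAULT_MODE;
    opts->ptmxmode = EXAMPLE_DEFAULT_PTMXMODE;

    if (data == NULL)
        return;
    if (sscanf(data, "mode=%o", &value) == 1)
        opts->mode = value & 0777;
}
```

    Calling parse_opts(NULL, &opts) yields the defaults, whereas a bare
    memset() to zero would leave both modes at 0, which is not the
    intended default.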
     
  • If two CPUs call ext4_ext_get_blocks() at the same time, there is
    nothing protecting the i_cached_extent structure from
    being used and updated at the same time. This could potentially cause
    the wrong location on disk to be read or written to, including
    potentially causing the corruption of the block group descriptors
    and/or inode table.

    This bug has been in the ext4 code since almost the very beginning of
    ext4's development. Fortunately once the data is stored in the page
    cache, ext4_get_blocks() doesn't need to be called, so trying to
    replicate this problem to the point where we could identify its root
    cause was *extremely* difficult. Many thanks to Kevin Shanahan for
    working over several months to be able to reproduce this easily so we
    could finally nail down the cause of the corruption.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: "Aneesh Kumar K.V"

    Theodore Ts'o