15 Mar, 2011

2 commits

  • * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    NFS: NFSROOT should default to "proto=udp"
    nfs4: remove duplicated #include
    NFSv4: nfs4_state_mark_reclaim_nograce() should be static
    NFSv4: Fix the setlk error handler
    NFSv4.1: Fix the handling of the SEQUENCE status bits
    NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
    NFSv4.1 reclaim complete must wait for completion
    NFSv4: remove duplicate clientid in struct nfs_client
    NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY
    sunrpc: Propagate errors from xs_bind() through xs_create_sock()
    (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid
    nfs: fix compilation warning
    nfs: add kmalloc return value check in decode_and_add_ds
    SUNRPC: Remove resource leak in svc_rdma_send_error()
    nfs: close NFSv4 COMMIT vs. CLOSE race
    SUNRPC: Close a race in __rpc_wait_for_completion_task()

    Linus Torvalds
     
  • The kernel automatically evaluates partition tables of storage devices.
    The code for evaluating OSF partitions contains a bug that leaks data
    from kernel heap memory to userspace for certain corrupted OSF
    partitions.

    In more detail:

    for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {

    iterates from 0 to d_npartitions - 1, where d_npartitions is read from
    the partition table without validation and partition is a pointer to an
    array of at most 8 d_partitions.

    Add the proper and obvious validation.

    Signed-off-by: Timo Warns
    Cc: stable@kernel.org
    [ Changed the patch trivially to not repeat the whole le16_to_cpu()
    thing, and to use an explicit constant for the magic value '8' ]
    Signed-off-by: Linus Torvalds

    Timo Warns
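
The validation described in this commit can be sketched in user-space C roughly as follows. This is a minimal sketch and not the patch itself: `MAX_OSF_PARTITIONS`, the `*_sketch` structures and `osf_count_partitions()` are hypothetical names, and the real kernel code converts the on-disk count with le16_to_cpu().

```c
#include <stdint.h>

/* Hypothetical name for the explicit constant that replaces the magic '8'. */
#define MAX_OSF_PARTITIONS 8

struct osf_partition_sketch {
    uint32_t p_size;
    uint32_t p_offset;
};

struct osf_label_sketch {
    uint16_t d_npartitions;  /* untrusted: read from the on-disk label */
    struct osf_partition_sketch d_partitions[MAX_OSF_PARTITIONS];
};

/*
 * Count the partitions with a non-zero size. Labels whose d_npartitions
 * would walk past the end of d_partitions[] are rejected with -1 instead
 * of being iterated.
 */
static int osf_count_partitions(const struct osf_label_sketch *label)
{
    unsigned int npartitions = label->d_npartitions; /* kernel: le16_to_cpu() */
    unsigned int i;
    int used = 0;

    if (npartitions > MAX_OSF_PARTITIONS)
        return -1;

    for (i = 0; i < npartitions; i++)
        if (label->d_partitions[i].p_size != 0)
            used++;
    return used;
}
```

Reading the count once into a local variable also avoids repeating the conversion in the loop condition, which matches the bracketed note in the commit message.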
     

14 Mar, 2011

2 commits

  • Fix for a dumb preadv()/pwritev() compat bug - unlike the native
    variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so
    e.g. on a pipe the native preadv() will fail with -ESPIPE while the
    compat one will act as readv() and succeed.

    Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
    -final.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
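
The check that the compat variants were missing can be pictured with the sketch below. It is only an illustration under stated assumptions: the `SK_FMODE_*` bits are local stand-ins for the kernel's FMODE_READ and FMODE_PREAD flags (the numeric values here are illustrative), and `sketch_preadv_check()` is a hypothetical helper, not a kernel function.

```c
#include <errno.h>

/* Local stand-ins for the kernel's file-mode bits; values are illustrative. */
#define SK_FMODE_READ  0x1u
#define SK_FMODE_PREAD 0x8u

/*
 * Decide whether a positional read may proceed on a file with the given
 * mode bits. A pipe can be opened for reading but does not support
 * positional reads, so it must be rejected with -ESPIPE.
 */
static int sketch_preadv_check(unsigned int f_mode)
{
    if (!(f_mode & SK_FMODE_READ))
        return -EBADF;
    if (!(f_mode & SK_FMODE_PREAD))
        return -ESPIPE;
    return 0;
}
```

With this check in both the native and the compat paths, a pipe (readable, but without the positional-read bit) gets the same -ESPIPE from either entry point.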
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: break out of shrink_delalloc earlier
    btrfs: fix not enough reserved space
    btrfs: fix dip leak
    Btrfs: make sure not to return overlapping extents to fiemap
    Btrfs: deal with short returns from copy_from_user
    Btrfs: fix regressions in copy_from_user handling

    Linus Torvalds
     

12 Mar, 2011

7 commits

  • Josef had changed shrink_delalloc to exit after three shrink
    attempts, which wasn't quite enough because new writers could
    race in and steal free space.

    But it also fixed deadlocks and stalls as we tried to recover
    delalloc reservations. The code was tweaked to loop 1024
    times, and would reset the counter any time a small amount
    of progress was made. This was too drastic, and with a
    lot of writers we can end up stuck in shrink_delalloc forever.

    The shrink_delalloc loop is fairly complex because the caller is looping
    too, and the caller will go ahead and force a transaction commit to make
    sure we reclaim space.

    This reworks things to exit shrink_delalloc when we've forced some
    writeback and the delalloc reservations have gone down. This means
    the writeback has not just started but has also finished at
    least some of the metadata changes required to reclaim delalloc
    space.

    If we've got this wrong, we're returning ENOSPC too early, which
    is a big improvement over the current behavior of hanging the machine.

    Test 224 in xfstests hammers on this nicely, and with 1000 writers
    trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
    other writers are able to continue until we get 100%.

    This is a worst case test for btrfs because the 1000 writers are doing
    small IO, and the small FS size means we don't have a lot of room
    for metadata chunks.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • There have been a number of recent reports that NFSROOT is no longer
    working with default mount options, but fails only with certain NICs.

    Brian Downing bisected to commit 56463e50 "NFS:
    Use super.c for NFSROOT mount option parsing". Among other things,
    this commit changes the default mount options for NFSROOT to use TCP
    instead of UDP as the underlying transport.

    TCP seems less able to deal with NICs that are slow to initialize.
    The system logs that have accompanied reports of problems all show
    that NFSROOT attempts to establish a TCP connection before the NIC is
    fully initialized, and thus the TCP connection attempt fails.

    When a TCP connection attempt fails during a mount operation, the
    NFS stack needs to fail the operation. Usually user space knows how
    and when to retry it. The network layer does not report a distinct
    error code for this particular failure mode. Thus, there isn't a
    clean way for the RPC client to see that it needs to retry in this
    case, but not in others.

    Because NFSROOT is used in some environments where it is not possible
    to update the kernel command line to specify "udp", the proper thing
    to do is change NFSROOT to use UDP by default, as it did before commit
    56463e50.

    To make it easier to see how to change default mount options for
    NFSROOT and to distinguish default settings from mandatory settings,
    I've adjusted a couple of areas to document the specifics.

    root_nfs_cat() is also modified to deal with commas properly when
    concatenating strings containing mount option lists. This keeps
    root_nfs_cat() call sites simpler, now that we may be concatenating
    multiple mount option strings.

    Tested-by: Brian Downing
    Tested-by: Mark Brown
    Signed-off-by: Chuck Lever
    Cc: # 2.6.37
    Signed-off-by: Trond Myklebust

    Chuck Lever
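
The comma handling described for root_nfs_cat() can be sketched as below. This is a user-space approximation under assumptions, not the kernel function: `opts_cat()` is a hypothetical name, and the option strings in the usage note are only examples.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append the option list `src` to the option list already in `dest`,
 * inserting a ',' separator only when `dest` is non-empty.
 * `destlen` is the total size of the `dest` buffer.
 * Returns 0 on success, -1 if the result would not fit (dest is untouched).
 */
static int opts_cat(char *dest, const char *src, size_t destlen)
{
    size_t len = strlen(dest);
    size_t need = strlen(src);

    if (need == 0)
        return 0;               /* nothing to append, no stray comma */
    if (len > 0)
        need++;                 /* room for the separator */
    if (len + need >= destlen)
        return -1;              /* must also leave room for the NUL */

    if (len > 0)
        dest[len++] = ',';
    strcpy(dest + len, src);
    return 0;
}
```

Starting from an empty buffer, appending "udp" and then "nolock" would yield "udp,nolock", so call sites do not have to track whether a separator is needed.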
     
  • Remove duplicated #include('s) in
    fs/nfs/nfs4proc.c

    Signed-off-by: Huang Weiyi
    Signed-off-by: Trond Myklebust

    Huang Weiyi
     
  • There are no more external users of nfs4_state_mark_reclaim_nograce() or
    nfs4_state_mark_reclaim_reboot(), so mark them as static.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We want SEQUENCE status bits to be handled by the state manager in order
    to avoid threading issues.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • nfs4_schedule_state_recovery() should only be used when we need to force
    the state manager to check the lease. If we just want to start the
    state manager in order to handle a state recovery situation, we should be
    using nfs4_schedule_state_manager().

    This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
    its use with a set of helper functions that do the right thing.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

11 Mar, 2011

10 commits

  • Signed-off-by: Andy Adamson
    [Trond: fix whitespace errors]
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • Fix a bug where we currently retry the EXCHANGEID call again, even though
    we already have a valid clientid. Instead, delay and retry the CREATE_SESSION
    call.

    Signed-off-by: Ricardo Labiaga
    Signed-off-by: Trond Myklebust

    Ricardo Labiaga
     
  • …r 63 are set in fileid

    The problem was use of an int32, which when converted to a uint64
    is sign extended resulting in a fileid that doesn't fit in 32 bits
    even though the intent of the function is to fit the fileid into
    32 bits.

    Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    [Trond: Added an include for compat.h]
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

    Frank Filz
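
The sign-extension problem described here can be reproduced in a few lines of user-space C. This is a sketch under assumptions: both `fold_fileid_*` functions are hypothetical illustrations of folding a 64-bit fileid into 32 bits, not the kernel's nfs_compat_user_ino64(), and the signed variant relies on the usual two's-complement conversion behavior of common compilers.

```c
#include <stdint.h>

/* Buggy pattern: the folded value is held in a signed 32-bit integer. */
static uint64_t fold_fileid_signed(uint64_t fileid)
{
    int32_t ino = (int32_t)(uint32_t)fileid;

    ino ^= (int32_t)(uint32_t)(fileid >> 32);
    /* Widening a negative int32_t sign-extends into the upper 32 bits. */
    return (uint64_t)(int64_t)ino;
}

/* Fixed pattern: an unsigned 32-bit intermediate is zero-extended. */
static uint64_t fold_fileid_unsigned(uint64_t fileid)
{
    uint32_t ino = (uint32_t)fileid;

    ino ^= (uint32_t)(fileid >> 32);
    return ino;
}
```

For a fileid with bit 31 set, such as 0x80000000, the signed variant returns a value with all upper bits set, while the unsigned variant stays within 32 bits as intended.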
     
  • This commit fixes the following compilation warning:
    linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast

    Signed-off-by: Jovi Zhang
    Signed-off-by: Trond Myklebust

    Jovi Zhang
     
  • add kmalloc return value check in decode_and_add_ds

    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Trond Myklebust

    Stanislav Fomichev
     
  • I've been adding in more artificial delays in the NFSv4 commit and close
    codepaths to uncover races. The kernel I'm testing has the patch to
    close the race in __rpc_wait_for_completion_task that's in Trond's
    cthon2011 branch. The reproducer I've been using does this in a loop:

    mkdir("DIR");
    fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
    write(fd, "abcdefg", 7);
    close(fd);
    unlink("DIR/FILE");
    rmdir("DIR");

    The above reproducer shouldn't result in any silly-renaming. However,
    when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
    nfs_commit_release, I can almost always force one to occur. If I can
    force it to occur with that, then it can happen without that delay
    given the right timing.

    nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
    with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
    for the task to complete before putting its reference to it, so the last
    reference gets put in rpc_release task and gets queued to a workqueue.

    In this situation, the last open context reference may be put by the
    COMMIT release instead of the close() syscall. The close() syscall
    returns too quickly and the unlink runs while the d_count is still
    high since the COMMIT release hasn't put its dentry reference yet.

    Fix this by having nfs_commit_rpcsetup wait for the RPC call to complete
    before putting the task reference when FLUSH_SYNC is set. With this, the
    last reference is put by the process that's initiating the FLUSH_SYNC
    commit and the race is closed.

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
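
The reproducer quoted in this commit message can be fleshed out into a compilable form roughly as follows. This is a sketch under assumptions: the function names, the directory mode 0755 and the error handling are additions for illustration, and it only exercises the race when the target directory lives on an NFSv4 mount.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * One iteration of the sequence from the commit message, rooted at `dir`.
 * Returns 0 if every step succeeded, -1 otherwise. On an affected NFS
 * client a silly-renamed file left behind would presumably make the
 * final rmdir() fail.
 */
static int reproduce_once(const char *dir)
{
    char path[4096];
    int n, fd;

    n = snprintf(path, sizeof(path), "%s/FILE", dir);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;
    if (mkdir(dir, 0755) != 0)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        rmdir(dir);
        return -1;
    }
    if (write(fd, "abcdefg", 7) != 7) {
        close(fd);
        unlink(path);
        rmdir(dir);
        return -1;
    }
    if (close(fd) != 0 || unlink(path) != 0)
        return -1;
    return rmdir(dir) == 0 ? 0 : -1;
}

/* Run the sequence repeatedly; returns the number of failed iterations. */
static int reproduce_loop(const char *dir, int iterations)
{
    int i, failures = 0;

    for (i = 0; i < iterations; i++)
        if (reproduce_once(dir) != 0)
            failures++;
    return failures;
}
```

On a local filesystem every iteration is expected to succeed; the sketch is only interesting when `dir` points into an NFS mount served to a client kernel with the race.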
     
  • Although they run as rpciod background tasks, under normal operation
    (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
    and nfs4_do_close() want to be fully synchronous. This means that when we
    exit, we want all references to the rpc_task to be gone, and we want
    any dentry references etc. held by that task to be released.

    For this reason these functions call __rpc_wait_for_completion_task(),
    followed by rpc_put_task() in the expectation that the latter will be
    releasing the last reference to the rpc_task, and thus ensuring that the
    callback_ops->rpc_release() has been called synchronously.

    This patch fixes a race which exists due to the fact that
    rpciod calls rpc_complete_task() (in order to wake up the callers of
    __rpc_wait_for_completion_task()) and then subsequently calls
    rpc_put_task() without ensuring that these two steps are done atomically.

    In order to avoid adding new spin locks, the patch uses the existing
    waitqueue spin lock to order the rpc_task reference count releases between
    the waiting process and rpciod.
    The common case where nobody is waiting for completion is optimised for by
    checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
    reference count is 1: in those cases we drop trying to grab the spin lock,
    and immediately free up the rpc_task.

    Those few processes that need to put the rpc_task from inside an
    asynchronous context and that do not care about ordering are given a new
    helper: rpc_put_task_async().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • btrfs_link() will insert 3 items (inode ref, dir name item and dir index item)
    into the b+ tree and update 2 items (its inode, and parent's inode) in the b+
    tree. So we should reserve space for these 5 items, not 3 items.

    Reported-by: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The btrfs DIO code leaks dip structs when dip->csums allocation
    fails; bio->bi_end_io isn't set at the point where the free_ordered
    branch is consequently taken, thus bio_endio doesn't call the function
    which would free it in the normal case. Fix.

    Signed-off-by: Daniel J Blueman
    Acked-by: Miao Xie
    Signed-off-by: Chris Mason

    Daniel J Blueman
     

10 Mar, 2011

12 commits


09 Mar, 2011

3 commits


08 Mar, 2011

3 commits

  • a) struct inode is not going to be freed under ->d_compare();
    however, the thing PROC_I(inode)->sysctl points to just might.
    Fortunately, it's enough to make freeing that sucker delayed,
    provided that we don't step on its ->unregistering, clear
    the pointer to it in PROC_I(inode) before dropping the reference
    and check if it's NULL in ->d_compare().

    b) I'm not sure that we *can* walk into NULL inode here (we recheck
    dentry->seq between verifying that it's still hashed / fetching
    dentry->d_inode and passing it to ->d_compare() and there's no
    negative hashed dentries in /proc/sys/*), but if we can walk into
    that, we really should not have ->d_compare() return 0 on it!
    That said, I really suspect that this check can be simply killed.
    Nick?

    Signed-off-by: Al Viro

    Al Viro
     
  • In case of a nonempty list, the return on error here is obviously bogus;
    it ends up being a pointer to the list head instead of to any valid
    delegation on the list.

    In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
    then renew_client may oops, and it may take an embarrassingly long time to
    figure out why. Facepalm.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
    IP: [] nfsd4_delegreturn+0x125/0x200
    ...

    Cc: stable@kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
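
The class of bug described here, returning the loop cursor after a list walk finds nothing, can be illustrated with the user-space sketch below. All names (`list_node`, `deleg_sketch`, `find_deleg_sketch()` and the helpers) are hypothetical stand-ins for the kernel's list and delegation types, not the nfsd code.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, modeled loosely on kernel lists. */
struct list_node {
    struct list_node *next, *prev;
};

struct deleg_sketch {
    int id;
    struct list_node link;
};

/* Recover the containing entry from a pointer to its embedded link. */
#define ENTRY_OF(ptr) \
    ((struct deleg_sketch *)((char *)(ptr) - offsetof(struct deleg_sketch, link)))

static void list_init_sketch(struct list_node *head)
{
    head->next = head->prev = head;
}

static void list_add_tail_sketch(struct list_node *head, struct list_node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Find the entry with the given id. When the walk completes without a
 * match, `pos` has wrapped around to the list head; converting that to an
 * entry pointer would yield a bogus object, so return NULL explicitly.
 */
static struct deleg_sketch *find_deleg_sketch(struct list_node *head, int id)
{
    struct list_node *pos;

    for (pos = head->next; pos != head; pos = pos->next) {
        struct deleg_sketch *dp = ENTRY_OF(pos);

        if (dp->id == id)
            return dp;
    }
    return NULL;
}
```

A caller can then test the result against NULL, rather than dereferencing a pointer that was computed from the list head.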
     
  • When copy_from_user is only able to copy some of the bytes we requested,
    we may end up creating a partially up to date page. To avoid garbage in
    the page, we need to treat a partial copy as a zero length copy.

    This makes the rest of the file_write code drop the page and
    retry the whole copy instead of marking the partially up to
    date page as dirty.

    Signed-off-by: Chris Mason
    cc: stable@kernel.org

    Chris Mason
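
The rule stated in this commit, treating a partial copy as a zero-length copy, amounts to a decision like the one sketched below. `usable_copy_len()` is a hypothetical helper for illustration; the real code applies this inside the btrfs write path.

```c
#include <stddef.h>

/*
 * Given how many bytes were requested and how many were actually copied,
 * return the length the caller may treat as valid. A short copy is
 * discarded entirely, so the caller drops the page and retries the whole
 * copy instead of dirtying a partially up to date page.
 */
static size_t usable_copy_len(size_t requested, size_t copied)
{
    return copied < requested ? 0 : copied;
}
```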
     

07 Mar, 2011

1 commit

  • Commit 914ee295af418e936ec20a08c1663eaabe4cd07a fixed deadlocks in
    btrfs_file_write where we would catch page faults on pages we had
    locked.

    But, there were a few problems:

    1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
    data when the amount to copy is more than 4K and the offset to start
    copying from is not page aligned. The result was btrfs_file_write
    looping forever retrying the iov_iter_copy_from_user_atomic.

    We deal with this by changing btrfs_file_write to drop down to single
    page copies when iov_iter_copy_from_user_atomic starts returning failure.

    2) The btrfs_file_write code was leaking delalloc reservations when
    iov_iter_copy_from_user_atomic returned zero. The looping above would
    result in the entire filesystem running out of delalloc reservations and
    constantly trying to flush things to disk.

    3) btrfs_file_write will lock down page cache pages, make sure
    any writeback is finished, do the copy_from_user and then release them.
    Before the loop runs we check the first and last pages in the write to
    see if they are only being partially modified. If the start or end of
    the write isn't aligned, we make sure the corresponding pages are
    up to date so that we don't introduce garbage into the file.

    With the copy_from_user changes, we're allowing the VM to reclaim the
    pages after a partial update from copy_from_user, but we're not
    making sure the page cache page is up to date when we loop around to
    resume the write.

    We deal with this by pushing the up to date checks down into the page
    prep code. This fits better with how the rest of file_write works.

    Signed-off-by: Chris Mason
    Reported-by: Mitch Harder
    cc: stable@kernel.org

    Chris Mason
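
The fallback described in point 1, dropping down to single page copies once the multi-page copy starts failing, can be sketched in user space as follows. Everything here is an assumption made for illustration: `flaky_copy()` merely imitates the failure mode described in the commit message, and `copy_with_fallback()` is a hypothetical loop, not btrfs_file_write.

```c
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u

/*
 * Stand-in for a copy routine with the quirk from the commit message:
 * it copies nothing (returns 0) when asked for more than one page
 * starting at an offset that is not page aligned.
 */
static size_t flaky_copy(char *dst, const char *src, size_t offset, size_t len)
{
    if (len > SK_PAGE_SIZE && (offset % SK_PAGE_SIZE) != 0)
        return 0;
    memcpy(dst + offset, src + offset, len);
    return len;
}

/*
 * Copy `total` bytes starting at `start`. The first zero-length result
 * switches the loop to single page copies, which never cross a page
 * boundary; a second failure in that mode ends the loop instead of
 * retrying forever. Returns the number of bytes copied and counts the
 * copy attempts in *attempts.
 */
static size_t copy_with_fallback(char *dst, const char *src, size_t start,
                                 size_t total, unsigned int *attempts)
{
    size_t done = 0;
    int single_page = 0;

    while (done < total) {
        size_t off = start + done;
        size_t want = total - done;
        size_t copied;

        if (single_page) {
            size_t room = SK_PAGE_SIZE - (off % SK_PAGE_SIZE);

            if (want > room)
                want = room;
        }

        copied = flaky_copy(dst, src, off, want);
        (*attempts)++;
        if (copied == 0) {
            if (single_page)
                break;          /* give up rather than loop forever */
            single_page = 1;    /* retry this range one page at a time */
            continue;
        }
        done += copied;
    }
    return done;
}
```

For an 8000-byte write starting at offset 100, the first attempt fails, and the remaining two attempts copy up to the page boundary and then the rest.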