06 Apr, 2013

2 commits

  • Pull GFS2 fixes from Steven Whitehouse:
    "There are two patches which fix up a couple of minor issues in the DLM
    interface code, a missing error path in gfs2_rs_alloc(), one patch
    which fixes a problem during "withdraw" and a fix for discards/FITRIM
    when using 4k sector sized devices."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
    GFS2: Issue discards in 512b sectors
    GFS2: Fix unlock of fcntl locks during withdrawn state
    GFS2: return error if malloc failed in gfs2_rs_alloc()
    GFS2: use memchr_inv
    GFS2: use kmalloc for lvb bitmap

    Linus Torvalds
     
  • This patch changes GFS2's discard issuing code so that it calls
    function sb_issue_discard rather than blkdev_issue_discard. The
    code was calling blkdev_issue_discard and specifying the correct
    sector offset and sector size, but blkdev_issue_discard expects
    these values to be in terms of 512 byte sectors, even if the native
    sector size for the device is different. Calling sb_issue_discard
    with the BLOCK size instead ensures the correct block-to-512b-sector
    translation. I verified that "minlen" is specified in blocks, so
    comparing it to a number of blocks is correct.
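
    A minimal sketch of the distinction, assuming a 4096-byte filesystem block
    size; the variable names are illustrative, not the actual GFS2 call sites:

    /* blkdev_issue_discard() always interprets its offset and length
     * arguments as 512-byte sectors, whatever the device's native
     * sector size is: */
    blkdev_issue_discard(bdev, fs_block * (4096 >> 9),
                         nr_blocks * (4096 >> 9), GFP_NOFS, 0);

    /* sb_issue_discard() takes filesystem blocks and performs the
     * block-to-512b-sector conversion itself from sb->s_blocksize: */
    sb_issue_discard(sb, fs_block, nr_blocks, GFP_NOFS, 0);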

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

04 Apr, 2013

8 commits


02 Apr, 2013

2 commits

  • Pull nfsd bugfix from J Bruce Fields:
    "An xdr decoding error--thanks, Toralf Förster, and Trinity!"

    * 'for-3.9' of git://linux-nfs.org/~bfields/linux:
    nfsd4: reject "negative" acl lengths

    Linus Torvalds
     
  • The struct block_device lifecycle is defined by its inode (see fs/block_dev.c) -
    the block_device is allocated the first time we access /dev/loopXX and
    deallocated in bdev_destroy_inode. When we create the device with
    "losetup /dev/loopXX afile" we want that block_device to stay alive until we
    destroy the loop device with "losetup -d".

    But because we do not hold the /dev/loopXX inode, its reference count drops
    to 0 and the inode/bdev can be destroyed at any moment. Usually this happens
    under memory pressure or when the user drops the inode cache (as in the test
    below). When we later want to use the bdev in loop_clr_fd() we hit a
    use-after-free with the following stack:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
    bd_set_size+0x10/0xa0
    loop_clr_fd+0x1f8/0x420 [loop]
    lo_ioctl+0x200/0x7e0 [loop]
    lo_compat_ioctl+0x47/0xe0 [loop]
    compat_blkdev_ioctl+0x341/0x1290
    do_filp_open+0x42/0xa0
    compat_sys_ioctl+0xc1/0xf20
    do_sys_open+0x16e/0x1d0
    sysenter_dispatch+0x7/0x1a

    To prevent use-after-free we need to grab the device in loop_set_fd()
    and put it later in loop_clr_fd().

    The issue is reproducible on the current Linus head and on v3.3. Here is the test:

    dd if=/dev/zero of=loop.file bs=1M count=1
    while true; do
        losetup /dev/loop0 loop.file
        echo 2 > /proc/sys/vm/drop_caches
        losetup -d /dev/loop0
    done

    [ Doing bdgrab/bdput in loop_set_fd/loop_clr_fd is safe, because every
    time we call loop_set_fd() we check that loop_device->lo_state is
    Lo_unbound and set it to Lo_bound. If somebody tries to set_fd again
    it will get EBUSY. And if we try to loop_clr_fd() on an unbound loop
    device we'll get ENXIO.

    loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
    loop_device->lo_ctl_mutex. ]
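
    A hedged sketch of the shape of the fix, assuming the generic bdgrab()/bdput()
    helpers from fs/block_dev.c; the exact call sites in drivers/block/loop.c may
    differ:

    /* in loop_set_fd(): pin the block_device's inode so it cannot be
     * reclaimed (e.g. via drop_caches) while the loop device is bound */
    bdgrab(bdev);

    /* in loop_clr_fd(): drop the reference taken in loop_set_fd() once
     * the device is unbound again */
    bdput(bdev);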

    Signed-off-by: Anatol Pomozov
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Anatol Pomozov
     

30 Mar, 2013

2 commits

  • Pull btrfs fixes from Chris Mason:
    "We've had a busy two weeks of bug fixing. The biggest patches in here
    are some long standing early-enospc problems (Josef) and a very old
    race where compression and mmap combine forces to lose writes (me).
    I'm fairly sure the mmap bug goes all the way back to the introduction
    of the compression code, which is proof that fsx doesn't trigger every
    possible mmap corner after all.

    I'm sure you'll notice one of these is from this morning, it's a small
    and isolated use-after-free fix in our scrub error reporting. I
    double checked it here."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: don't drop path when printing out tree errors in scrub
    Btrfs: fix wrong return value of btrfs_lookup_csum()
    Btrfs: fix wrong reservation of csums
    Btrfs: fix double free in the btrfs_qgroup_account_ref()
    Btrfs: limit the global reserve to 512mb
    Btrfs: hold the ordered operations mutex when waiting on ordered extents
    Btrfs: fix space accounting for unlink and rename
    Btrfs: fix space leak when we fail to reserve metadata space
    Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
    Btrfs: fix race between mmap writes and compression
    Btrfs: fix memory leak in btrfs_create_tree()
    Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
    Btrfs: fix missing qgroup reservation before fallocating
    Btrfs: handle a bogus chunk tree nicely
    Btrfs: update to use fs_state bit

    Linus Torvalds
     
  • After commit 21d8a15a (lookup_one_len: don't accept . and ..) reiserfs
    started failing to delete xattrs from inodes. This was due to a buggy
    test for '.' and '..' in fill_with_dentries() which resulted in passing
    '.' and '..' entries to lookup_one_len() in some cases. That returned an
    error and so we failed to iterate over all xattrs of an inode.

    Fix the test in fill_with_dentries() along the lines of the one in
    lookup_one_len().
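
    A hedged sketch of such a test, assuming the filldir-style name/namelen
    arguments; the actual hunk in fs/reiserfs/xattr.c may look different:

    /* skip "." and ".." before the name is handed to lookup_one_len(),
     * which now rejects them with an error */
    if (name[0] == '.' && (namelen == 1 ||
                           (namelen == 2 && name[1] == '.')))
            return 0;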

    Reported-by: Pawel Zawora
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     

29 Mar, 2013

3 commits

  • A user reported a panic where we were panicking somewhere in
    tree_backref_for_extent from scrub_print_warning. He only captured the trace,
    but looking at scrub_print_warning we drop the path right before we mess with
    the extent buffer to print out a bunch of stuff, which isn't right. So fix this
    by dropping the path after we use the eb, if we need to. Thanks,

    Cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Pull sysfs fixes from Greg Kroah-Hartman:
    "Here are two fixes for sysfs that resolve issues that have been found
    by the Trinity fuzz tool, causing oopses in sysfs. They both have
    been in linux-next for a while to ensure that they do not cause any
    other problems."

    * tag 'driver-core-3.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    sysfs: handle failure path correctly for readdir()
    sysfs: fix race between readdir and lseek

    Linus Torvalds
     
  • Pull userns fixes from Eric W Biederman:
    "The bulk of the changes are fixing the worst consequences of the user
    namespace design oversight in not considering what happens when one
    namespace starts off as a clone of another namespace, as happens with
    the mount namespace.

    The rest of the changes are just plain bug fixes.

    Many thanks to Andy Lutomirski for pointing out many of these issues."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Restrict when proc and sysfs can be mounted
    ipc: Restrict mounting the mqueue filesystem
    vfs: Carefully propogate mounts across user namespaces
    vfs: Add a mount flag to lock read only bind mounts
    userns: Don't allow creation if the user is chrooted
    yama: Better permission check for ptraceme
    pid: Handle the exit of a multi-threaded init.
    scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.

    Linus Torvalds
     

28 Mar, 2013

9 commits

  • If we don't find the expected csum item but do find a csum item which is
    adjacent to the specified extent, we should return -EFBIG; otherwise we
    should return -ENOENT. But btrfs_lookup_csum() returned -EFBIG even when the
    csum item was not adjacent to the specified extent. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • We reserve the space for csums only when we write data into a file; in the
    other cases, such as tree log and log replay, we don't do that reservation.
    So we can use the reservation of the transaction handle just for the former,
    and for the latter we should use the tree's own reservation. But
    btrfs_csum_file_blocks() didn't differentiate between these two types of
    cases. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • The function btrfs_find_all_roots() is responsible for allocating memory
    for 'roots' and for freeing it if an error happens, so the caller should not
    free it again since the work has already been done.

    Besides, 'tmp' is allocated after the call to btrfs_find_all_roots(), so we
    can return directly if btrfs_find_all_roots() fails.
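
    A hedged sketch of the corrected error handling with illustrative local
    names; the real hunk in btrfs_qgroup_account_ref() may differ:

    ret = btrfs_find_all_roots(trans, fs_info, bytenr, seq, &roots);
    if (ret < 0)
            return ret;          /* 'roots' has already been freed by the callee */

    tmp = ulist_alloc(GFP_NOFS); /* allocated only after the call above */
    if (!tmp) {
            ulist_free(roots);   /* from here on, freeing 'roots' is our job */
            return -ENOMEM;
    }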

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Reviewed-by: Jan Schmidt
    Signed-off-by: Josef Bacik

    Wang Shilong
     
  • A user reported a problem where he was getting early ENOSPC with hundreds of
    gigs of free data space and 6 gigs of free metadata space. This is because the
    global block reserve was taking up the entire free metadata space. This is
    ridiculous; we have infrastructure in place to throttle if we start using too
    much of the global reserve, so instead of letting it get this huge, just limit
    it to 512MB so that users can still get work done. This allowed the user to
    complete his rsync without issues. Thanks
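
    A minimal sketch of the kind of cap being applied (the surrounding sizing
    logic is omitted):

    /* never let the global block reserve grow beyond 512MB */
    block_rsv->size = min_t(u64, num_bytes, 512 * 1024 * 1024);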

    Cc: stable@vger.kernel.org
    Reported-and-tested-by: Stefan Priebe
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We need to hold the ordered_operations mutex while waiting on ordered extents
    since we splice and run the ordered extents list. We need to make sure anybody
    else who wants to wait on ordered extents does actually wait for them to be
    completed. This will keep us from bailing out of flushing in case somebody is
    already waiting on ordered extents to complete. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We are way over-reserving for unlink and rename. Rename is just some random
    huge number and unlink accounts for tree log operations that don't actually
    happen during unlink, not to mention the tree log doesn't take from the trans
    block rsv anyway so it's completely useless. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Dave reported a warning when running xfstest 275. We have been leaking delalloc
    metadata space when our reservations fail. This is because we were improperly
    calculating how much space to free for our checksum reservations. The problem
    is we would sometimes free up space that had already been freed in another
    thread and we would end up with negative usage for the delalloc space. This
    patch fixes the problem by calculating how much space the other threads have
    already freed and how much space we would need to free had we not done the
    reservation at all, and then freeing any excess space. With this, xfstest 275
    no longer leaks space. Thanks

    Cc: stable@vger.kernel.org
    Reported-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When you take a snapshot, punch a hole where there has been data, then take
    another snapshot and try to send an incremental stream, btrfs send would
    give you EIO. That is because is_extent_unchanged had no support for holes
    being punched. With this patch, instead of returning EIO we just return
    0 (== the extent is not unchanged) and we're good.
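
    A hedged sketch of the idea with illustrative names; the literal
    is_extent_unchanged() hunk differs:

    /* a punched hole in the parent snapshot shows up as a file extent
     * with a zero disk bytenr; treat it as "changed" instead of -EIO */
    if (right_disk_byte == 0) {
            ret = 0;   /* extent is not unchanged */
            goto out;
    }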

    Signed-off-by: Jan Schmidt
    Cc: Alexander Block
    Signed-off-by: Josef Bacik

    Jan Schmidt
     
  • Commit 06ae43f34bcc ("Don't bother with redoing rw_verify_area() from
    default_file_splice_from()") lost the checks to test existence of the
    write/aio_write methods. My apologies ;-/

    Eventually, we want that in fs/splice.c side of things (no point
    repeating it for every buffer, after all), but for now this is the
    obvious minimal fix.
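
    A hedged sketch of the kind of check being restored; the exact placement and
    error code in the real patch may differ:

    /* refuse the splice if the target file provides neither a ->write
     * nor an ->aio_write method */
    if (!out->f_op || (!out->f_op->write && !out->f_op->aio_write))
            return -EINVAL;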

    Reported-by: Dave Jones
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

27 Mar, 2013

9 commits

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, or proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • As a matter of policy MNT_READONLY should not be changeable if the
    original mounter had more privileges than the creator of the mount
    namespace.

    Add the flag CL_UNPRIVILEGED to note when we are copying a mount from
    a mount namespace that requires more privileges to a mount namespace
    that requires fewer privileges.

    When the CL_UNPRIVILEGED flag is set, have clone_mnt() set MNT_NO_REMOUNT
    if any of the mnt flags that should never be changed are set.

    This protects both mount propagation and the initial creation of a less
    privileged mount namespace.

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • When a read-only bind mount is copied from a mount namespace in a higher
    privileged user namespace to a mount namespace in a lesser privileged
    user namespace, it should not be possible to remove the read-only
    restriction.

    Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
    remain read-only.
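
    A hedged sketch of how such a lock would be enforced on a remount attempt;
    the variable names and the exact location in fs/namespace.c are illustrative:

    /* a mount that is locked read-only must not be remounted read-write */
    if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
        !(mnt_flags & MNT_READONLY))
            return -EPERM;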

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Guarantee that the policy of which files may be accessed that is
    established by setting the root directory will not be violated
    by user namespaces, by verifying that the root directory points
    to the root of the mount namespace at the time of user namespace
    creation.

    Changing the root is a privileged operation, and as a matter of policy
    it serves to limit unprivileged processes to files below the current
    root directory.

    For reasons of simplicity and comprehensibility the privilege to
    change the root directory is gated solely on the CAP_SYS_CHROOT
    capability in the user namespace. Therefore when creating a user
    namespace we must ensure that the policy of which files may be
    accessed cannot be violated by changing the root directory.

    Anyone who runs a process in a chroot and would like to use user
    namespaces can set up the same view of the filesystem with a mount
    namespace instead. As a result, this is not a practical limitation
    for using user namespaces.

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull vfs fixes from Al Viro:
    "stable fodder; assorted deadlock fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vt: synchronize_rcu() under spinlock is not nice...
    Nest rename_lock inside vfsmount_lock
    Don't bother with redoing rw_verify_area() from default_file_splice_from()

    Linus Torvalds
     
  • ... lest we get livelocks between path_is_under() and d_path() and friends.

    The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
    it is possible to have thread B spin on attempt to take lock shared while thread
    A is already holding it shared, if B is on lower-numbered CPU than A and there's
    a thread C spinning on attempt to take the same lock exclusive.

    As a result, we need consistent ordering between vfsmount_lock (an lglock) and
    rename_lock (a seqlock), even though everything that takes both is going to take
    vfsmount_lock only shared.

    Spotted-by: Brad Spengler
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • Pull NFS client bugfixes from Trond Myklebust:
    - Fix an NFSv4 idmapper regression
    - Fix an Oops in the pNFS blocks client
    - Fix up various issues with pNFS layoutcommit
    - Ensure correct read ordering of variables in
    rpc_wake_up_task_queue_locked

    * tag 'nfs-for-3.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    SUNRPC: Add barriers to ensure read ordering in rpc_wake_up_task_queue_locked
    NFSv4.1: Add a helper pnfs_commit_and_return_layout
    NFSv4.1: Always clear the NFS_INO_LAYOUTCOMMIT in layoutreturn
    NFSv4.1: Fix a race in pNFS layoutcommit
    pnfs-block: removing DM device maybe cause oops when call dev_remove
    NFSv4: Fix the string length returned by the idmapper

    Linus Torvalds
     
  • Since we only enforce an upper bound, not a lower bound, a "negative"
    length can get through here.

    The symptom seen was a warning when we attempted a kmalloc with an
    excessive size.
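
    A hedged sketch of the kind of check implied here, with illustrative names
    (MAX_ACES stands in for whatever upper bound the decoder already enforces):

    /* read the ACE count into an unsigned type and bound it, so a
     * "negative" (i.e. huge) value can never reach kmalloc() */
    u32 nace = be32_to_cpup(p++);
    if (nace > MAX_ACES)
            return nfserr_fbig;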

    Reported-by: Toralf Förster
    Cc: stable@kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Btrfs uses page_mkwrite to ensure stable pages during
    crc calculations and mmap workloads. We call clear_page_dirty_for_io
    before we do any crcs, and this forces any application with the file
    mapped to wait for the crc to finish before it is allowed to change
    the file.

    With compression on, the clear_page_dirty_for_io step is happening after
    we've compressed the pages. This means the applications might be
    changing the pages while we are compressing them, and some of those
    modifications might not hit the disk.

    This commit adds the clear_page_dirty_for_io before compression starts
    and makes sure to redirty the page if we have to fall back to
    uncompressed IO as well.
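
    A hedged sketch of the ordering, shown per page with generic mm helpers; the
    patch itself operates on whole page ranges in the btrfs compression path:

    /* make the page stable before it is handed to the compressor ... */
    clear_page_dirty_for_io(page);

    /* ... and if we end up falling back to uncompressed IO, give the
     * dirty bit back so the modification is not lost */
    if (fallback_to_uncompressed)          /* illustrative condition */
            redirty_page_for_writepage(wbc, page);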

    Signed-off-by: Chris Mason
    Reported-by: Alexandre Oliva
    cc: stable@vger.kernel.org

    Chris Mason
     

26 Mar, 2013

1 commit

  • Pull nfsd bugfixes from J Bruce Fields:
    "Fixes for a couple mistakes in the new DRC code. And thanks to Kent
    Overstreet for noticing we've been sync'ing the wrong range on stable
    writes since 3.8."

    * 'for-3.9' of git://linux-nfs.org/~bfields/linux:
    nfsd: fix bad offset use
    nfsd: fix startup order in nfsd_reply_cache_init
    nfsd: only unhash DRC entries that are in the hashtable

    Linus Torvalds
     

23 Mar, 2013

2 commits

  • vfs_writev() updates the offset argument - but the code then passes the
    offset to vfs_fsync_range(). Since offset now points to the offset after
    what was just written, this is probably not what was intended.

    Introduced by face15025ffdf664de95e86ae831544154d26c9c "nfsd: use
    vfs_fsync_range(), not O_SYNC, for stable writes".
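
    A hedged sketch of the fix, using illustrative variable names rather than the
    exact ones in nfsd_vfs_write(); the shape of the real change may differ:

    /* let vfs_writev() advance a copy of the position, and fsync the
     * range that was actually written, starting at the original offset */
    loff_t pos = offset;
    host_err = vfs_writev(file, vec, vlen, &pos);
    if (host_err >= 0 && stable)
            host_err = vfs_fsync_range(file, offset,
                                       offset + host_err - 1, 0);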

    Signed-off-by: Kent Overstreet
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: stable@vger.kernel.org
    Reviewed-by: Zach Brown
    Signed-off-by: J. Bruce Fields

    Kent Overstreet
     
  • Dave Jones found another /proc issue with his Trinity tool: thanks to
    the namespace model, we can have multiple /proc dentries that point to
    the same inode, aliasing directories in /proc/<pid>/net/ for example.

    This ends up being a total disaster, because it acts like hardlinked
    directories, and causes locking problems. We rely on the topological
    sort of the inodes pointed to by dentries, and if we have aliased
    directories, that ordering becomes unreliable.

    In short: don't do this. Multiple dentries with the same (directory)
    inode is just a bad idea, and the namespace code should never have
    exposed things this way. But we're kind of stuck with it.

    This solves things by just always allocating a new inode during /proc
    dentry lookup, instead of using "iget_locked()" to look up existing
    inodes by superblock and number. That actually simplifies the code a bit,
    at the cost of potentially doing more inode [de]allocations.

    That said, the inode lookup wasn't free either (and did a lot of locking
    of inodes), so it is probably not that noticeable. We could easily keep
    the old lookup model for non-directory entries, but rather than try to
    be excessively clever this just implements the minimal and simplest
    workaround for the problem.
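
    A hedged sketch of the shape of the change in proc inode creation; the helper
    choice and the surrounding setup are simplified:

    /* was: inode = iget_locked(sb, de->low_ino);
     * now: always create a fresh inode, so directory dentries can
     * never end up sharing (aliasing) one inode */
    inode = new_inode(sb);
    if (inode) {
            inode->i_ino = de->low_ino;
            inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
    }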

    Reported-and-tested-by: Dave Jones
    Analyzed-by: Al Viro
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Mar, 2013

2 commits

  • Pull CIFS fixes from Steve French:
    "Three small CIFS Fixes (the most important of the three fixes a recent
    problem authenticating to Windows 8 using cifs rather than SMB2)"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: ignore everything in SPNEGO blob after mechTypes
    cifs: delay super block destruction until all cifsFileInfo objects are gone
    cifs: map NT_STATUS_SHARING_VIOLATION to EBUSY instead of ETXTBSY

    Linus Torvalds
     
  • Pull ext4 fixes from Ted Ts'o:
    "Fix a number of regression and other bugs in ext4, most of which were
    relatively obscure cornercases or races that were found using
    regression tests."

    * tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
    ext4: fix data=journal fast mount/umount hang
    ext4: fix ext4_evict_inode() racing against workqueue processing code
    ext4: fix memory leakage in mext_check_coverage
    ext4: use s_extent_max_zeroout_kb value as number of kb
    ext4: use atomic64_t for the per-flexbg free_clusters count
    jbd2: fix use after free in jbd2_journal_dirty_metadata()
    ext4: reserve metadata block for every delayed write
    ext4: update reserved space after the 'correction'
    ext4: do not use yield()
    ext4: remove unused variable in ext4_free_blocks()
    ext4: fix WARN_ON from ext4_releasepage()
    ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
    ext4: update extent status tree after an extent is zeroed out
    ext4: fix wrong m_len value after unwritten extent conversion
    ext4: add self-testing infrastructure to do a sanity check
    ext4: avoid a potential overflow in ext4_es_can_be_merged()
    ext4: invalidate extent status tree during extent migration
    ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
    ext4: add warning to ext4_convert_unwritten_extents_endio
    ext4: disable merging of uninitialized extents
    ...

    Linus Torvalds