06 Feb, 2015

3 commits

  • commit 3175e1dcec40fab1a444c010087f2068b6b04732 upstream.

    If we start state recovery on a client that failed to initialise correctly,
    then we are very likely to Oops.

    Reported-by: "Mkrtchyan, Tigran"
    Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit ee8a1a8b160a87dc3a9c81a86796aa4db85ea815 upstream.

    We only support swap files calling nfs_direct_IO. However, an application
    might be able to reach nfs_direct_IO if it toggles the O_DIRECT flag
    during IO, and it can deadlock because we grab inode->i_mutex in
    nfs_file_direct_write(). So return 0 in that case; the generic
    layer will then fall back to buffered IO.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit b07ef35244424cbeda9844198607c7077099c82c upstream.

    Commit 6fb1ca92a640 "udf: Fix race between write(2) and close(2)"
    changed the condition when preallocation is released. The idea was that
    we don't want to release the preallocation for an inode on close when
    there are other writeable file descriptors for the inode. However the
    condition was written in the opposite way so we released preallocation
    only if there were other writeable file descriptors. Fix the problem by
    changing the condition properly.

    Fixes: 6fb1ca92a6409a9d5b0696447cd4997bc9aaf5a2
    Reported-by: Fabian Frederick
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

30 Jan, 2015

1 commit

  • commit 378ff1a53b5724f3ac97b0aba3c9ecac072f6fcd upstream.

    It really needs to check that src is non-directory *and* use
    {un,}lock_two_nodirectories(). As it is, it's trivial to cause
    double-lock (ioctl(fd, CIFS_IOC_COPYCHUNK_FILE, fd)) and if the
    last argument is an fd of directory, we are asking for trouble
    by violating the locking order - all directories go before all
    non-directories. If the last argument is an fd of parent
    directory, it has 50% odds of locking child before parent,
    which will cause AB-BA deadlock if we race with unlink().

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

28 Jan, 2015

4 commits

  • commit 06bed7d18c2c07b3e3eeadf4bd357f6e806618cc upstream.

    This commit fixes a race whereby nlmclnt_init() first starts the lockd
    daemon, and then calls nlm_bind_host() with the expectation that
    nlmsvc_timeout has already been initialised. Unfortunately, there is no
    synchronisation between lockd() and lockd_up() to guarantee that this
    is the case.

    The fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc().

    Fixes: 9a1b6bf818e74 ("LOCKD: Don't call utsname()->nodename...")
    Cc: Bruce Fields
    Cc: stable@vger.kernel.org # 3.10.x
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 1fc0703af3143914a389bfa081c7acb09502ed5d upstream.

    Currently, our trunking code will check for session trunking, but will
    fail to detect client id trunking. This is a problem, because it means
    that the client will fail to recognise that the two connections represent
    shared state, even if they do not permit a shared session.
    By removing the check for the server minor id, and only checking the
    major id, we will end up doing the right thing in both cases: we close
    down the new nfs_client and fall back to using the existing one.

    Fixes: 05f4c350ee02e ("NFS: Discover NFSv4 server trunking when mounting")
    Cc: Chuck Lever
    Tested-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 52d304eb4eaced9ad04b64ba7cd6ceb5153bbf18 upstream.

    commit 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e
    locks: generic_delete_lease doesn't need a file_lock at all

    moves the call to fl->fl_lmops->lm_change() to a place in the
    code where fl might be a non-lease lock.
    When that happens, fl_lmops is NULL and an Oops ensues.

    So add an extra test to restore correct functioning.
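
The shape of such a guard can be sketched in user-space C. This is only an illustration under assumed names: `struct lock_ops`, `struct lock_entry`, `lease_ops` and `change_if_lease()` are hypothetical stand-ins, not the kernel's structures or the upstream patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-lock operations table. */
struct lock_ops {
    void (*lm_change)(int *state, int arg);
};

/* Hypothetical lock record; ops is NULL for non-lease locks. */
struct lock_entry {
    int is_lease;
    int state;
    const struct lock_ops *ops;
};

void lease_change(int *state, int arg)
{
    *state = arg;
}

const struct lock_ops lease_ops = { lease_change };

/* Invoke the callback only for leases that actually carry one.
 * Returns 1 if the callback ran, 0 if the entry was skipped. */
int change_if_lease(struct lock_entry *e, int arg)
{
    if (!e->is_lease || e->ops == NULL || e->ops->lm_change == NULL)
        return 0;
    e->ops->lm_change(&e->state, arg);
    return 1;
}
```

Without the first test, a non-lease entry with a NULL `ops` pointer would be dereferenced, which is the user-space analogue of the Oops described above.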

    Reported-by: Linda Walsh
    Link: https://bugzilla.suse.com/show_bug.cgi?id=912569
    Fixes: 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e
    Signed-off-by: NeilBrown
    Signed-off-by: Jeff Layton
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream.

    Since the rework of the sparse interrupt code to actually free the
    unused interrupt descriptors there exists a race between the /proc
    interfaces to the irq subsystem and the code which frees the interrupt
    descriptor.

    CPU0                            CPU1
                                    show_interrupts()
                                      desc = irq_to_desc(X);
    free_desc(desc)
      remove_from_radix_tree();
      kfree(desc);
                                      raw_spinlock_irq(&desc->lock);

    /proc/interrupts is the only interface which can actively corrupt
    kernel memory via the lock access. /proc/stat can only read from freed
    memory. Extremely hard to trigger, but possible.

    The interfaces in /proc/irq/N/ are not affected by this because the
    removal of the proc file is serialized in procfs against concurrent
    readers/writers. The removal happens before the descriptor is freed.

    For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
    as the descriptor is never freed. It's merely cleared out with the irq
    descriptor lock held. So any concurrent proc access will either see
    the old correct value or the cleared out ones.

    Protect the lookup and access to the irq descriptor in
    show_interrupts() with the sparse_irq_lock.

    Provide kstat_irqs_usr() which is protecting the lookup and access
    with sparse_irq_lock and switch /proc/stat to use it.

    Document the existing kstat_irqs interfaces so it's clear that the
    caller needs to take care about protection. The users of these
    interfaces are either not affected due to SPARSE_IRQ=n or already
    protected against removal.

    Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

16 Jan, 2015

13 commits

  • commit 6f8960541b1eb6054a642da48daae2320fddba93 upstream.

    Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a
    check to skip delayed inode updates during log replay because it
    confuses the enospc code. But the delayed processing will end up
    ignoring delayed refs from log replay because the inode itself wasn't
    put through the delayed code.

    This can end up triggering a warning at commit time:

    WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()

    Which is repeated for each commit because we never process the delayed
    inode ref update.

    The fix used here is to change btrfs_delayed_delete_inode_ref to return
    an error if we're currently in log replay. The caller will do the ref
    deletion immediately and everything will work properly.

    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Chris Mason
     
  • commit 705304a863cc41585508c0f476f6d3ec28cf7e00 upstream.

    Same story as in commit 41080b5a2401 ("nfsd race fixes: ext2") (similar
    ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
    of insert_inode_locked() and a bug of a check for dead inodes needs to
    be fixed.

    If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
    insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
    the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.

    If nilfs_iget() is called before nilfs_new_inode() calls
    insert_inode_locked4(), it will create an in-core inode and read its
    data from the on-disk inode. But, nilfs_iget() will find i_nlink equals
    zero and fail at nilfs_read_inode_common(), which will lead it to call
    iget_failed() and cleanly fail.

    However, this sanity check doesn't work as expected for reused on-disk
    inodes because they leave a non-zero value in i_mode field and it
    hinders the test of i_nlink. This patch also fixes the issue by
    removing the test on i_mode that nilfs2 doesn't need.

    Signed-off-by: Ryusuke Konishi
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ryusuke Konishi
     
  • commit 021b77bee210843bed1ea91b5cad58235ff9c8e5 upstream.

    Probably this code was syncing a lot more often than intended because
    the do_sync variable wasn't set to zero.

    Fixes: c62988ec0910 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.')
    Signed-off-by: Dan Carpenter
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit 94ae1db226a5bcbb48372d81161f084c9e283fd8 upstream.

    Currently, nfs4_set_delegation takes a reference to an existing
    delegation and then checks to see if there is a conflict. If there is
    one, then it doesn't release that reference.

    Change the code to take the reference after the check and only if there
    is no conflict.
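
A minimal sketch of the two orderings, using a hypothetical `struct deleg` with a plain integer reference count (the real code uses the kernel's own reference counting and different names):

```c
/* Hypothetical stand-in for an existing delegation. */
struct deleg {
    int refcount;
    int conflicting;
};

/* Leaky ordering: the reference is taken before the conflict check and
 * is never dropped on the error path. */
int set_deleg_leaky(struct deleg *existing)
{
    existing->refcount++;
    if (existing->conflicting)
        return -1;              /* reference leaked here */
    return 0;
}

/* Fixed ordering: check first, take the reference only on success. */
int set_deleg_fixed(struct deleg *existing)
{
    if (existing->conflicting)
        return -1;
    existing->refcount++;
    return 0;
}
```

With a conflicting delegation, the leaky variant returns an error but still leaves the count raised, while the fixed variant leaves it untouched.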

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit bf7491f1be5e125eece2ec67e0f79d513caa6c7e upstream.

    Fix a bug where nfsd4_encode_components_esc() incorrectly calculates the
    length of server array in fs_location4--note that it is a count of the
    number of array elements, not a length in bytes.

    Signed-off-by: Benjamin Coddington
    Fixes: 082d4bd72a45 (nfsd4: "backfill" using write_bytes_to_xdr_buf)
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Coddington
     
  • commit 5a64e56976f1ba98743e1678c0029a98e9034c81 upstream.

    Fix a bug where nfsd4_encode_components_esc() includes the esc_end char as
    an additional string encoding.

    Signed-off-by: Benjamin Coddington
    Fixes: e7a0444aef4a "nfsd: add IPv6 addr escaping to fs_location hosts"
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Coddington
     
  • commit ef17af2a817db97d42dd2ec0a425231748e23dbc upstream.

    Bugs similar to the one in acbbe6fbb240 (kcmp: fix standard comparison
    bug) are in rich supply.

    In this variant, the problem is that struct xdr_netobj::len has type
    unsigned int, so the expression o1->len - o2->len _also_ has type
    unsigned int; it has completely well-defined semantics, and the result
    is some non-negative integer, which is always representable in a long
    long. But this means that if the conditional triggers, we are
    guaranteed to return a positive value from compare_blob.

    In this case it could be fixed by

    - res = o1->len - o2->len;
    + res = (long long)o1->len - (long long)o2->len;

    but I'd rather eliminate the usually broken 'return a - b;' idiom.
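
The pitfall can be reproduced in a few lines of user-space C. The sketch below assumes a 32-bit `unsigned int` and a 64-bit `long long`; `struct blob` is a hypothetical stand-in for `struct xdr_netobj`, and `compare_blob_sketch()` only illustrates the explicit-comparison style, it is not the upstream patch.

```c
#include <string.h>

/* Hypothetical stand-in for struct xdr_netobj (only the fields used here). */
struct blob {
    unsigned int len;
    const unsigned char *data;
};

/* Broken idiom: the subtraction happens in unsigned int, so the result
 * is never negative, even after it is widened to long long. */
long long len_diff_broken(const struct blob *o1, const struct blob *o2)
{
    long long res = o1->len - o2->len;
    return res;
}

/* Explicit comparisons avoid the subtraction altogether. */
int compare_blob_sketch(const struct blob *o1, const struct blob *o2)
{
    if (o1->len < o2->len)
        return -1;
    if (o1->len > o2->len)
        return 1;
    return memcmp(o1->data, o2->data, o1->len);
}
```

For a one-byte blob compared against a two-byte blob, the broken helper yields a large positive number, whereas the explicit version reports "less than".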

    Reviewed-by: Jeff Layton
    Signed-off-by: Rasmus Villemoes
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Rasmus Villemoes
     
  • commit fa0c5540739320258c3e3a45aaae9dae467b2504 upstream.

    When reiserfs is trying to mount a partition, it creates a commit
    workqueue (sbi->commit_wq). But when mount fails later, the workqueue
    is not freed.

    Signed-off-by: Jiri Slaby
    Reported-by: auxsvr@gmail.com
    Reported-by: Benoît Monin
    Cc: Jan Kara
    Cc: reiserfs-devel@vger.kernel.org
    Fixes: 797d9016ceca69879bb273218810fa0beef46aac
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jiri Slaby
     
  • commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.

    After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
    tests inode->i_state locklessly to see whether it already has all the
    necessary I_DIRTY bits set. The comment above the barrier doesn't
    contain any useful information - memory barriers can't ensure "changes
    are seen by all cpus" by itself.

    And it sure enough was broken. Please consider the following
    scenario.

    CPU 0                                   CPU 1
    -------------------------------------------------------------------------------

                                            enters __writeback_single_inode()
                                            grabs inode->i_lock
                                            tests PAGECACHE_TAG_DIRTY which is clear
    enters __set_page_dirty()
    grabs mapping->tree_lock
    sets PAGECACHE_TAG_DIRTY
    releases mapping->tree_lock
    leaves __set_page_dirty()

    enters __mark_inode_dirty()
    smp_mb()
    sees I_DIRTY_PAGES set
    leaves __mark_inode_dirty()
                                            clears I_DIRTY_PAGES
                                            releases inode->i_lock

    Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
    to lead to an immediately critical problem because requeue_inode()
    later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
    deciding whether the inode needs to be requeued for IO and there are
    enough unintentional memory barriers in between, so while the inode
    ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
    IO list.

    The lack of explicit barrier may also theoretically affect the other
    I_DIRTY bits which deal with metadata dirtiness. There is no
    guarantee that a strong enough barrier exists between
    I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
    inode. Filesystem inode writeout path likely has enough stuff which
    can behave as full barrier but it's theoretically possible that the
    writeout may not see all the updates from ->dirty_inode().

    Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
    that I_DIRTY_PAGES needs a special treatment as it always needs to be
    cleared to be interlocked with the lockless test on
    __mark_inode_dirty() side. It's cleared unconditionally and
    reinstated after smp_mb() if the mapping still has dirty pages.
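
The pairing described here follows the classic store-buffering pattern: each side writes its own variable, issues a full barrier, then reads the other side's variable. A rough user-space analogue with C11 atomics is sketched below. `struct toy_inode`, `DIRTY_PAGES` and both functions are hypothetical stand-ins; the kernel updates i_state under inode->i_lock and uses smp_mb() rather than C11 fences.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DIRTY_PAGES 0x1u        /* stand-in for I_DIRTY_PAGES */

struct toy_inode {
    atomic_uint state;          /* stand-in for inode->i_state */
    atomic_bool tag_dirty;      /* stand-in for PAGECACHE_TAG_DIRTY */
};

/* Dirtying side: publish the dirty tag, full barrier, then test the
 * flag locklessly and set it only if it is not already set. */
void toy_mark_dirty(struct toy_inode *in)
{
    atomic_store_explicit(&in->tag_dirty, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (!(atomic_load_explicit(&in->state, memory_order_relaxed) & DIRTY_PAGES))
        atomic_fetch_or_explicit(&in->state, DIRTY_PAGES, memory_order_relaxed);
}

/* Writeback side: clear the flag unconditionally, full barrier, then
 * reinstate it if the mapping still carries the dirty tag. */
void toy_writeback_clear(struct toy_inode *in)
{
    atomic_fetch_and_explicit(&in->state, ~DIRTY_PAGES, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&in->tag_dirty, memory_order_relaxed))
        atomic_fetch_or_explicit(&in->state, DIRTY_PAGES, memory_order_relaxed);
}
```

With a full fence on both sides, at least one side must observe the other's store, so the toy inode cannot end up with the tag set and the flag clear. The sketch leaves out the actual writeout, which is what clears the tag in the first place.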

    Also add comments explaining how and why the barriers are paired.

    Lightly tested.

    Signed-off-by: Tejun Heo
    Cc: Jan Kara
    Cc: Mikulas Patocka
    Cc: Jens Axboe
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 027bc8b08242c59e19356b4b2c189f2d849ab660 upstream.

    On some ARMs the memory can be mapped pgprot_noncached() and still
    be working for atomic operations. As pointed out by Colin Cross,
    in some cases you do want to use
    pgprot_noncached() if the SoC supports it to see a debug printk
    just before a write hanging the system.

    On ARMs, the atomic operations on strongly ordered memory are
    implementation defined. So let's provide an optional kernel parameter
    for configuring pgprot_noncached(), and use pgprot_writecombine() by
    default.

    Cc: Arnd Bergmann
    Cc: Rob Herring
    Cc: Randy Dunlap
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Olof Johansson
    Cc: Russell King
    Acked-by: Kees Cook
    Signed-off-by: Tony Lindgren
    Signed-off-by: Tony Luck
    Signed-off-by: Greg Kroah-Hartman

    Tony Lindgren
     
  • commit 7ae9cb81933515dc7db1aa3c47ef7653717e3090 upstream.

    Currently trying to use pstore on at least ARMs can hang as we're
    mapping the persistent RAM with pgprot_noncached().

    On ARMs, pgprot_noncached() will actually make the memory strongly
    ordered, and as the atomic operations pstore uses are implementation
    defined for strongly ordered memory, they may not work. So basically
    atomic operations have undefined behavior on ARM for device or strongly
    ordered memory types.

    Let's fix the issue by using write-combine variants for mappings. This
    corresponds to normal, non-cacheable memory on ARM. For many other
    architectures, this change does not change the mapping type as by
    default we have:

    #define pgprot_writecombine pgprot_noncached

    The reason why pgprot_noncached() was originally used for pstore
    is because Colin Cross had observed lost
    debug prints right before a device hanging write operation on some
    systems. For the platforms supporting pgprot_noncached(), we can
    add an optional configuration option to support that. But let's
    get pstore working first before adding new features.

    Cc: Arnd Bergmann
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Olof Johansson
    Cc: linux-kernel@vger.kernel.org
    Acked-by: Kees Cook
    Signed-off-by: Rob Herring
    [tony@atomide.com: updated description]
    Signed-off-by: Tony Lindgren
    Signed-off-by: Tony Luck
    Signed-off-by: Greg Kroah-Hartman

    Rob Herring
     
  • commit 53dc20b9a3d928b0744dad5aee65b610de1cc85d upstream.

    In ocfs2_link(), the parent directory inode passed to function
    ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of
    new_dentry not old_dentry. We should get old_dir from old_dentry and
    look up old_dentry in old_dir in case another node removes the old dentry.

    With this change, hard linking works again, when paths are relative with
    at least one subdirectory. This is how the problem was reproducible:

    # mkdir a
    # mkdir b
    # touch a/test
    # ln a/test b/test
    ln: failed to create hard link `b/test' => `a/test': No such file or directory

    However when creating links in the same dir, it worked well.

    Now the link gets created.

    Fixes: 0e048316ff57 ("ocfs2: check existence of old dentry in ocfs2_link()")
    Signed-off-by: joyce.xue
    Reported-by: Szabo Aron - UBIT
    Cc: Mark Fasheh
    Cc: Joel Becker
    Tested-by: Aron Szabo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Xue jiufei
     
  • commit 136f49b9171074872f2a14ad0ab10486d1ba13ca upstream.

    For a buffered write, the page lock is taken in write_begin and released
    in write_end. In ocfs2_write_end_nolock(), before the page is unlocked in
    ocfs2_free_write_ctxt(), ocfs2_run_deallocs() is called, which asks for
    the read lock of journal->j_trans_barrier. Holding the page lock while
    asking for journal->j_trans_barrier breaks the locking order.

    This causes a deadlock with the journal commit threads: ocfs2cmt takes
    the write lock of journal->j_trans_barrier first, then wakes up
    kjournald2 to do the commit work, and finally waits until it is done. To
    commit the journal, kjournald2 needs to flush data first, which requires
    the cache page lock.

    Since some ocfs2 cluster locks are held by the writing process, this
    deadlock may hang the whole cluster.

    Unlocking the pages before ocfs2_run_deallocs() fixes the locking order.
    Also move the unlock before ocfs2_commit_trans() so that the page lock is
    released before j_trans_barrier, preserving the unlocking order.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Wengang Wang
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     

09 Jan, 2015

19 commits

  • commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.

    When we abort a transaction we iterate over all the ranges marked as dirty
    in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
    from those trees, add them back (unpin) to the free space caches and, if
    the fs was mounted with "-o discard", perform a discard on those regions.
    Also, after adding the regions to the free space caches, a fitrim ioctl call
    can see those ranges in a block group's free space cache and perform a discard
    on the ranges, so the same issue can happen without "-o discard" as well.

    This causes corruption, affecting one or multiple btree nodes (in the worst
    case leaving the fs unmountable) because some of those ranges (the ones in
    the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
    referred by the last committed super block - breaking the rule that anything
    that was committed by a transaction is untouched until the next transaction
    commits successfully.

    I ran into this while running in a loop (for several hours) the fstest that
    I recently submitted:

    [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

    The corruption always happened when a transaction aborted and then fsck complained
    like this:

    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
    *** fsck.btrfs output ***
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    read block failed check_tree_block
    Couldn't open file system

    In this case 94945280 corresponded to the root of a tree.
    Using ftrace, I observed the following sequence of steps:

    1) transaction N started, fs_info->pinned_extents pointed to
    fs_info->freed_extents[0];

    2) node/eb 94945280 is created;

    3) eb is persisted to disk;

    4) transaction N commit starts, fs_info->pinned_extents now points to
    fs_info->freed_extents[1], and transaction N completes;

    5) transaction N + 1 starts;

    6) eb is COWed, and btrfs_free_tree_block() called for this eb;

    7) eb range (94945280 to 94945280 + 16Kb) is added to
    fs_info->pinned_extents (fs_info->freed_extents[1]);

    8) Something goes wrong in transaction N + 1, like hitting ENOSPC
    for example, and the transaction is aborted, turning the fs into
    readonly mode. The stack trace I got for example:

    [112065.253935] [] dump_stack+0x4d/0x66
    [112065.254271] [] warn_slowpath_common+0x7f/0x98
    [112065.254567] [] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
    [112065.261674] [] warn_slowpath_fmt+0x48/0x50
    [112065.261922] [] ? btrfs_free_path+0x26/0x29 [btrfs]
    [112065.262211] [] __btrfs_abort_transaction+0x50/0x10b [btrfs]
    [112065.262545] [] btrfs_remove_chunk+0x537/0x58b [btrfs]
    [112065.262771] [] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
    [112065.263105] [] cleaner_kthread+0x100/0x12f [btrfs]
    (...)
    [112065.264493] ---[ end trace dd7903a975a31a08 ]---
    [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
    [112065.264997] BTRFS info (device sdc): forced readonly

    9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
    fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
    turn calls btrfs_destroy_pinned_extent();

    10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
    marked as dirty in fs_info->freed_extents[], and for each one
    it calls discard, if the fs was mounted with "-o discard", and
    adds the range to the free space cache of the respective block
    group;

    11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
    sees the free space entries and performs a discard;

    12) After an umount and mount (or fsck), our eb's location on disk was full
    of zeroes, and it should have been untouched, because it was marked as
    dirty in the fs_info->pinned_extents tree, and therefore used by the
    trees that the last committed superblock points to.

    Fix this by not performing a discard and not adding the ranges to the free space
    caches - it's useless from this point since the fs is now in readonly mode and
    we won't write free space caches to disk anymore (otherwise we would leak space)
    nor any new superblock. By not adding the ranges to the free space caches, it
    prevents other code paths from allocating that space and writing to it as well,
    therefore being safer and simpler.

    This isn't a new problem, as it's been present since 2011 (git commit
    acce952b0263825da32cf10489413dec78053347).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f upstream.

    Liu Bo pointed out that my previous fix would lose the generation update in the
    scenario I described. It is actually much worse than that, we could lose the
    entire extent if we lose power right after the transaction commits. Consider
    the following

    write extent 0-4k
    log extent in log tree
    commit transaction
    < power fail happens here
    ordered extent completes

    We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
    the transaction commit will reset the log so it isn't replayed. If we lose
    power before the transaction commit we are safe, otherwise we are not.

    Fix this by keeping track of all extents we logged in this transaction. Then
    when we go to commit the transaction make sure we wait for all of those ordered
    extents to complete before proceeding. This will make sure that if we lose
    power after the transaction commit we still have our data. This also fixes the
    problem of the improperly updated extent generation. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit a28046956c71985046474283fa3bcd256915fb72 upstream.

    We use the modified list to keep track of which extents have been modified so we
    know which ones are candidates for logging at fsync() time. Newly modified
    extents are added to the list at modification time, around the same time the
    ordered extent is created. We do this so that we don't have to wait for ordered
    extents to complete before we know what we need to log. The problem is when
    something like this happens

    log extent 0-4k on inode 1
    copy csum for 0-4k from ordered extent into log
    sync log
    commit transaction
    log some other extent on inode 1
    ordered extent for 0-4k completes and adds itself onto modified list again
    log changed extents
    see ordered extent for 0-4k has already been logged
    at this point we assume the csum has been copied
    sync log
    crash

    On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
    which is the same one that we are replaying which also drops the csum, and then
    we won't find the csum in the log for that bytenr. This of course causes us to
    have errors about not having csums for certain ranges of our inode. So remove
    the modified list manipulation in unpin_extent_cache, any modified extents
    should have been added well before now, and we don't want them re-logged. This
    fixes my test that I could reliably reproduce this problem with. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 0d95c1bec906dd1ad951c9c001e798ca52baeb0f upstream.

    The sizes that are obtained from space infos are in raw units and have
    to be adjusted according to the raid factor. This was missing for
    f_bavail and df reported doubled size for raid1.
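
The adjustment amounts to dividing a raw byte count by the number of copies the profile stores. A minimal sketch of that arithmetic, assuming a hypothetical helper `usable_bytes()` and a factor of 2 for raid1 (the real statfs code handles several space infos and profiles):

```c
#include <stdint.h>

/* Convert a raw byte count into usable bytes for a profile that stores
 * `copies` copies of every byte (2 for raid1, 1 for single). */
uint64_t usable_bytes(uint64_t raw_bytes, unsigned int copies)
{
    if (copies == 0)
        return 0;               /* defensive: avoid division by zero */
    return raw_bytes / copies;
}
```

For example, 200 GiB of raw free space on raid1 corresponds to 100 GiB that df should report as available.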

    Reported-by: Martin Steigerwald
    Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles")
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • commit 9dba8cf128ef98257ca719722280c9634e7e9dc7 upstream.

    If we have two fsync()'s race on different subvols one will do all of its work
    to get into the log_tree, wait on its outstanding IO, and then allow the
    log_tree to finish its commit. The problem is we were just freeing that
    subvol's logged extents instead of waiting on them, so whoever lost the race
    wouldn't really have their data on disk. Fix this by waiting properly instead
    of freeing the logged extents. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 942080643bce061c3dd9d5718d3b745dcb39a8bc upstream.

    Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
    end of the allocated buffer during encrypted filename decoding. This
    fix corrects the issue by getting rid of the unnecessary 0 write when
    the current bit offset is 2.

    Signed-off-by: Michael Halcrow
    Reported-by: Dmitry Chernenkov
    Suggested-by: Kees Cook
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Michael Halcrow
     
  • commit 332b122d39c9cbff8b799007a825d94b2e7c12f2 upstream.

    The ecryptfs_encrypted_view mount option greatly changes the
    functionality of an eCryptfs mount. Instead of encrypting and decrypting
    lower files, it provides a unified view of the encrypted files in the
    lower filesystem. The presence of the ecryptfs_encrypted_view mount
    option is intended to force a read-only mount and modifying files is not
    supported when the feature is in use. See the following commit for more
    information:

    e77a56d [PATCH] eCryptfs: Encrypted passthrough

    This patch forces the mount to be read-only when the
    ecryptfs_encrypted_view mount option is specified by setting the
    MS_RDONLY flag on the superblock. Additionally, this patch removes some
    broken logic in ecryptfs_open() that attempted to prevent modifications
    of files when the encrypted view feature was in use. The check in
    ecryptfs_open() was not sufficient to prevent file modifications using
    system calls that do not operate on a file descriptor.

    Signed-off-by: Tyler Hicks
    Reported-by: Priya Bansal
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit e237ec37ec154564f8690c5bd1795339955eeef9 upstream.

    Check that length specified in a component of a symlink fits in the
    input buffer we are reading. Also properly ignore component length for
    component types that do not use it. Otherwise we read memory after end
    of buffer for corrupted udf image.
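
The bounds check can be sketched as a walk over length-prefixed records. The layout below (a 4-byte header whose second byte is the identifier length) and the function `count_components()` are assumptions made for illustration. The sketch covers only the "does the component fit in the buffer" part, not the handling of component types that ignore the length.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_SIZE 4              /* assumed: type, length, 2 more header bytes */

/* Walk the components in buf. Returns the number of components, or -1
 * if a header or identifier would run past the end of the buffer. */
int count_components(const uint8_t *buf, size_t buf_len)
{
    size_t pos = 0;
    int count = 0;

    while (pos < buf_len) {
        size_t ident_len;

        if (buf_len - pos < HDR_SIZE)
            return -1;          /* truncated header */
        ident_len = buf[pos + 1];
        if (ident_len > buf_len - pos - HDR_SIZE)
            return -1;          /* identifier overruns the buffer */
        pos += HDR_SIZE + ident_len;
        count++;
    }
    return count;
}
```

The comparisons are written as subtractions from `buf_len` so that a hostile length cannot make `pos + HDR_SIZE + ident_len` wrap around.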

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit a1d47b262952a45aae62bd49cfaf33dd76c11a2c upstream.

    UDF specification allows arbitrarily large symlinks. However we support
    only symlinks at most one block large. Check the length of the symlink
    so that we don't access memory beyond end of the symlink block.
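
In its simplest form the check compares the recorded symlink size with the block size before anything is read. The helper name `check_symlink_size()` and the choice of `-ENAMETOOLONG` as the error are assumptions for this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Reject symlinks whose recorded size exceeds one block.
 * Returns 0 if the size is acceptable, a negative errno otherwise. */
int check_symlink_size(size_t symlink_size, size_t block_size)
{
    if (symlink_size > block_size)
        return -ENAMETOOLONG;
    return 0;
}
```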

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit e159332b9af4b04d882dbcfe1bb0117f0a6d4b58 upstream.

    Verify that inode size is sane when loading inode with data stored in
    ICB. Otherwise we may get confused later when working with the inode and
    inode size is too big.

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 0e5cc9a40ada6046e6bc3bdfcd0c0d7e4b706b14 upstream.

    Symlink reading code does not check whether the resulting path fits into
    the page provided by the generic code. This isn't as easy as just
    checking the symlink size because of various encoding conversions we
    perform on path. So we have to check whether there is still enough space
    in the buffer on the fly.

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit a682e9c28cac152e6e54c39efcf046e0c8cfcf63 upstream.

    If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
    return value is then (in most cases) just overwritten before we return.
    This can result in reporting success to userspace although error happened.

    This bug was introduced by commit 2e54eb96e2c8 ("BKL: Remove BKL from
    ncpfs"). Propagate the errors correctly.
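
The general shape of this kind of bug, where a later assignment hides an earlier error, can be shown with two hypothetical helpers. `do_step()`, `run_broken()` and `run_fixed()` are illustrative names and do not correspond to the ncpfs code.

```c
#include <errno.h>

/* Hypothetical operation that either succeeds or fails with -EINVAL. */
int do_step(int fail)
{
    return fail ? -EINVAL : 0;
}

/* Broken shape: the second assignment overwrites an earlier error. */
int run_broken(int fail_first, int fail_second)
{
    int result;

    result = do_step(fail_first);
    result = do_step(fail_second);      /* earlier error is lost */
    return result;
}

/* Fixed shape: return as soon as a step fails. */
int run_fixed(int fail_first, int fail_second)
{
    int result;

    result = do_step(fail_first);
    if (result)
        return result;
    return do_step(fail_second);
}
```

When only the first step fails, the broken variant reports success to its caller and the fixed variant returns the error.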

    Coverity id: 1226925.

    Fixes: 2e54eb96e2c80 ("BKL: Remove BKL from ncpfs")
    Signed-off-by: Jan Kara
    Cc: Petr Vandrovec
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
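
The class of bug described in the commit above (an error code that is assigned and then overwritten before the return) can be shown with a minimal sketch. This is not the ncpfs code: `struct fake_root`, `lookup_volume`, `set_root_buggy` and `set_root_fixed` are hypothetical names chosen only to contrast the two shapes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the state an ioctl would update. */
struct fake_root {
    int volume;
};

/* Hypothetical lookup: negative volume numbers do not exist. */
int lookup_volume(int vol)
{
    return vol >= 0 ? 0 : -ENOENT;
}

/* Buggy shape: the error from the lookup is lost before returning. */
int set_root_buggy(struct fake_root *r, int vol)
{
    int ret;

    assert(r != NULL);
    ret = lookup_volume(vol);
    if (ret == 0)
        r->volume = vol;
    ret = 0;            /* unconditionally overwrites any error */
    return ret;
}

/* Fixed shape: return the error as soon as it is known. */
int set_root_fixed(struct fake_root *r, int vol)
{
    int ret;

    assert(r != NULL);
    ret = lookup_volume(vol);
    if (ret)
        return ret;     /* propagate the failure to the caller */
    r->volume = vol;
    return 0;
}
```

With an invalid volume, the buggy variant reports success while leaving the state untouched; the fixed variant returns `-ENOENT`.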
     
  • commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

    - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and cannot be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already cannot call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
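
From user space, the knob introduced by the commit above is driven by writing a string to the proc file. The following is a minimal sketch, assuming the caller has already entered a new user namespace (for example with `unshare(CLONE_NEWUSER)`) and has not yet written its gid_map; `write_file` and `deny_setgroups_then_map_gid` are hypothetical helper names, and the single-line gid mapping is only an illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a string to an existing file.
 * Returns 0 on success or a negative errno value. */
int write_file(const char *path, const char *value)
{
    size_t len;
    ssize_t n;
    int fd, err;

    assert(path != NULL && value != NULL);
    len = strlen(value);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    n = write(fd, value, len);
    if (n == (ssize_t)len)
        err = 0;
    else if (n < 0)
        err = -errno;
    else
        err = -EIO;     /* short write */
    close(fd);
    return err;
}

/*
 * Hypothetical sequence for a process already inside a new user
 * namespace: disable setgroups first, then install a one-line gid
 * mapping. The order matters because the proc file can only be
 * written before gid_map is set.
 */
int deny_setgroups_then_map_gid(gid_t outer_gid)
{
    char map[64];
    int err;

    err = write_file("/proc/self/setgroups", "deny");
    if (err)
        return err;
    snprintf(map, sizeof(map), "0 %ld 1", (long)outer_gid);
    return write_file("/proc/self/gid_map", map);
}
```

On kernels without this change the proc file does not exist, so `write_file` would be expected to fail with `-ENOENT`.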
     
  • commit b2f5d4dc38e034eecb7987e513255265ff9aa1cf upstream.

    Forced unmount affects not just the mount namespace but the underlying
    superblock as well. Restrict forced unmount to the global root user
    for now. Otherwise it becomes possible for a user in a less privileged
    mount namespace to force the shutdown of a superblock of a filesystem
    in a more privileged mount namespace, allowing a DoS attack on root.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 3e1866410f11356a9fd869beb3e95983dc79c067 upstream.

    Now that remount is properly enforcing the rule that you can't remove
    nodev, at least sandstorm.io is breaking when performing a remount.

    It turns out that there is an easy, intuitive solution: implicitly
    add nodev on remount when nodev was implicitly added on mount.

    Tested-by: Cedric Bosdonnat
    Tested-by: Richard Weinberger
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit c297abfdf15b4480704d6b566ca5ca9438b12456 upstream.

    While reviewing the code of umount_tree I realized that when we append
    to a preexisting unmounted list we do not change pprev of the former
    first item in the list.

    Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
    the former first item of the list will stomp unmounted.first leaving
    it set to some random mount point which we are likely to free soon.

    This isn't likely to hit, but if it does I don't know how anyone could
    track it down.

    [ This happened because we don't have all the same operations for
    hlist's as we do for normal doubly-linked lists. In particular,
    list_splice() is easy on our standard doubly-linked lists, while
    hlist_splice() doesn't exist and needs both start/end entries of the
    hlist. And commit 38129a13e6e7 incorrectly open-coded that missing
    hlist_splice().

    We should think about making these kinds of "mindless" conversions
    easier to get right by adding the missing hlist helpers - Linus ]

    Fixes: 38129a13e6e71f666e0468e99fdd932a687b4d7e switch mnt_hash to hlist
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
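
The pprev problem described in the commit above can be reproduced with a userspace imitation of the kernel's hlist. This sketch is not the code from fs/namespace.c: the node and head types mirror the kernel layout, but `hlist_splice_front` is a hypothetical helper (the commit notes that no `hlist_splice()` exists), written here to show the one assignment whose absence causes the stomp.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace imitation of the kernel's hlist types. */
struct hlist_node {
    struct hlist_node *next, **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n by writing through pprev, then mark it unhashed. */
void hlist_del_init(struct hlist_node *n)
{
    if (!n->pprev)
        return;
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;
}

/*
 * Hypothetical helper: move every node of src in front of the nodes
 * already on dst. 'last' must be the final node of src, because an
 * hlist head keeps no tail pointer.
 */
void hlist_splice_front(struct hlist_head *src, struct hlist_node *last,
                        struct hlist_head *dst)
{
    if (!src->first)
        return;
    assert(last != NULL);
    last->next = dst->first;
    if (dst->first)
        dst->first->pprev = &last->next; /* the step that was missing */
    dst->first = src->first;
    src->first->pprev = &dst->first;
    src->first = NULL;
}
```

Without the marked assignment, the former first node of `dst` would still have `pprev` pointing at `dst->first`, so a later `hlist_del_init()` on it would overwrite the list head instead of the preceding node's `next` pointer.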
     
  • commit 4e2024624e678f0ebb916e6192bd23c1f9fdf696 upstream.

    We didn't check the length of Rock Ridge ER records before printing them.
    Thus a corrupted isofs image can cause us to access and print memory
    beyond the end of the buffer, with obvious consequences.

    Reported-and-tested-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 08d4f7722268755ee34ed1c9e8afee7dfff022bb upstream.

    This patch fixes kmemcheck warning in switch_names. The function
    switch_names swaps inline names of two dentries. It swaps full arrays
    d_iname, no matter how many bytes are really used by the strings. Reading
    data beyond string ends results in kmemcheck warning.

    We fix the bug by marking both arrays as fully initialized.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 4bd5a980de87d2b5af417485bde97b8eb3d6cf6a upstream.

    nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
    early so that it is safe to call .release in case nfs4_alloc_pages
    fails.

    Signed-off-by: Peng Tao
    Fixes: a47970ff78147 ("NFSv4.1: Hold reference to layout hdr in layoutget")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao