09 Jan, 2015

13 commits

  • commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.

    When we abort a transaction we iterate over all the ranges marked as dirty
    in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
    from those trees, add them back (unpin) to the free space caches and, if
    the fs was mounted with "-o discard", perform a discard on those regions.
    Also, after adding the regions to the free space caches, a fitrim ioctl call
    can see those ranges in a block group's free space cache and perform a discard
    on the ranges, so the same issue can happen without "-o discard" as well.

    This causes corruption, affecting one or multiple btree nodes (in the worst
    case leaving the fs unmountable) because some of those ranges (the ones in
    the fs_info->pinned_extents tree) correspond to btree nodes/leaves that are
    referred to by the last committed super block - breaking the rule that anything
    that was committed by a transaction is untouched until the next transaction
    commits successfully.

    I ran into this while running in a loop (for several hours) the fstest that
    I recently submitted:

    [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

    The corruption always happened when a transaction aborted and then fsck complained
    like this:

    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
    *** fsck.btrfs output ***
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    Check tree block failed, want=94945280, have=0
    read block failed check_tree_block
    Couldn't open file system

    In this case 94945280 corresponded to the root of a tree.
    Using ftrace, I observed that the following sequence of steps happened:

    1) transaction N started, fs_info->pinned_extents pointed to
    fs_info->freed_extents[0];

    2) node/eb 94945280 is created;

    3) eb is persisted to disk;

    4) transaction N commit starts, fs_info->pinned_extents now points to
    fs_info->freed_extents[1], and transaction N completes;

    5) transaction N + 1 starts;

    6) eb is COWed, and btrfs_free_tree_block() called for this eb;

    7) eb range (94945280 to 94945280 + 16Kb) is added to
    fs_info->pinned_extents (fs_info->freed_extents[1]);

    8) Something goes wrong in transaction N + 1, like hitting ENOSPC
    for example, and the transaction is aborted, turning the fs into
    readonly mode. The stack trace I got for example:

    [112065.253935] [] dump_stack+0x4d/0x66
    [112065.254271] [] warn_slowpath_common+0x7f/0x98
    [112065.254567] [] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
    [112065.261674] [] warn_slowpath_fmt+0x48/0x50
    [112065.261922] [] ? btrfs_free_path+0x26/0x29 [btrfs]
    [112065.262211] [] __btrfs_abort_transaction+0x50/0x10b [btrfs]
    [112065.262545] [] btrfs_remove_chunk+0x537/0x58b [btrfs]
    [112065.262771] [] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
    [112065.263105] [] cleaner_kthread+0x100/0x12f [btrfs]
    (...)
    [112065.264493] ---[ end trace dd7903a975a31a08 ]---
    [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
    [112065.264997] BTRFS info (device sdc): forced readonly

    9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
    fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
    turn calls btrfs_destroy_pinned_extent();

    10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
    marked as dirty in fs_info->freed_extents[], and for each one
    it calls discard, if the fs was mounted with "-o discard", and
    adds the range to the free space cache of the respective block
    group;

    11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
    sees the free space entries and performs a discard;

    12) After an umount and mount (or fsck), our eb's location on disk was full
    of zeroes, and it should have been untouched, because it was marked as
    dirty in the fs_info->pinned_extents tree, and therefore used by the
    trees that the last committed superblock points to.

    Fix this by not performing a discard and not adding the ranges to the free space
    caches - it's useless from this point on, since the fs is now in readonly mode and
    we won't write free space caches to disk anymore (otherwise we would leak space),
    nor any new superblock. Not adding the ranges to the free space caches also
    prevents other code paths from allocating that space and writing to it,
    which is safer and simpler.
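
    A minimal sketch of the resulting cleanup behaviour - the helper names
    (find_first_pinned_range, clear_pinned_range) are hypothetical, not the
    actual btrfs API:

        /* Sketch: cleanup of pinned extents after a transaction abort. */
        static void destroy_pinned_extents_after_abort(struct btrfs_fs_info *fs_info)
        {
                u64 start, end;

                /* Walk every range marked dirty in freed_extents[0]/[1]. */
                while (!find_first_pinned_range(fs_info, &start, &end)) {
                        /*
                         * Only clear the range marking: no discard, no unpin
                         * into the free space caches. The fs is read-only now
                         * and some of these ranges back btree nodes that the
                         * last committed superblock still references.
                         */
                        clear_pinned_range(fs_info, start, end);
                }
        }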

    This isn't a new problem, as it's been present since 2011 (git commit
    acce952b0263825da32cf10489413dec78053347).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit a28046956c71985046474283fa3bcd256915fb72 upstream.

    We use the modified list to keep track of which extents have been modified so we
    know which ones are candidates for logging at fsync() time. Newly modified
    extents are added to the list at modification time, around the same time the
    ordered extent is created. We do this so that we don't have to wait for ordered
    extents to complete before we know what we need to log. The problem is when
    something like this happens:

    log extent 0-4k on inode 1
    copy csum for 0-4k from ordered extent into log
    sync log
    commit transaction
    log some other extent on inode 1
    ordered extent for 0-4k completes and adds itself onto modified list again
    log changed extents
    see ordered extent for 0-4k has already been logged
    at this point we assume the csum has been copied
    sync log
    crash

    On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
    which is the same one that we are replaying which also drops the csum, and then
    we won't find the csum in the log for that bytenr. This of course causes us to
    have errors about not having csums for certain ranges of our inode. So remove
    the modified list manipulation in unpin_extent_cache, any modified extents
    should have been added well before now, and we don't want them re-logged. This
    fixes my test that I could reliably reproduce this problem with. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 942080643bce061c3dd9d5718d3b745dcb39a8bc upstream.

    Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
    end of the allocated buffer during encrypted filename decoding. This
    fix corrects the issue by getting rid of the unnecessary 0 write when
    the current bit offset is 2.

    Signed-off-by: Michael Halcrow
    Reported-by: Dmitry Chernenkov
    Suggested-by: Kees Cook
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Michael Halcrow
     
  • commit 332b122d39c9cbff8b799007a825d94b2e7c12f2 upstream.

    The ecryptfs_encrypted_view mount option greatly changes the
    functionality of an eCryptfs mount. Instead of encrypting and decrypting
    lower files, it provides a unified view of the encrypted files in the
    lower filesystem. The presence of the ecryptfs_encrypted_view mount
    option is intended to force a read-only mount and modifying files is not
    supported when the feature is in use. See the following commit for more
    information:

    e77a56d [PATCH] eCryptfs: Encrypted passthrough

    This patch forces the mount to be read-only when the
    ecryptfs_encrypted_view mount option is specified by setting the
    MS_RDONLY flag on the superblock. Additionally, this patch removes some
    broken logic in ecryptfs_open() that attempted to prevent modifications
    of files when the encrypted view feature was in use. The check in
    ecryptfs_open() was not sufficient to prevent file modifications using
    system calls that do not operate on a file descriptor.
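
    A minimal sketch of the superblock-level enforcement (the surrounding
    mount code is approximated; the flag name matches the feature):

        /* Sketch: while parsing mount options, force a read-only mount
         * when the encrypted view feature is enabled. */
        if (mount_crypt_stat->flags & ECRYPTFS_ENCRYPTED_VIEW_ENABLED)
                sb->s_flags |= MS_RDONLY;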

    Signed-off-by: Tyler Hicks
    Reported-by: Priya Bansal
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit a1d47b262952a45aae62bd49cfaf33dd76c11a2c upstream.

    UDF specification allows arbitrarily large symlinks. However we support
    only symlinks at most one block large. Check the length of the symlink
    so that we don't access memory beyond end of the symlink block.
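
    A minimal sketch of the bounds check, with illustrative names:

        /* Sketch: refuse to parse a symlink whose recorded length exceeds
         * one block, so we never read past the end of the symlink block. */
        if (symlink_len > sb->s_blocksize)
                return -EIO;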

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit a682e9c28cac152e6e54c39efcf046e0c8cfcf63 upstream.

    If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
    return value is then (in most cases) just overwritten before we return.
    This can result in reporting success to userspace although an error happened.

    This bug was introduced by commit 2e54eb96e2c8 ("BKL: Remove BKL from
    ncpfs"). Propagate the errors correctly.

    Coverity id: 1226925.

    Fixes: 2e54eb96e2c80 ("BKL: Remove BKL from ncpfs")
    Signed-off-by: Jan Kara
    Cc: Petr Vandrovec
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

    - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.
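
    A small userspace sketch of the resulting interface; the gid values
    written to gid_map are just an example:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sched.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static int write_file(const char *path, const char *buf)
        {
                int fd = open(path, O_WRONLY);

                if (fd < 0)
                        return -1;
                if (write(fd, buf, strlen(buf)) < 0) {
                        close(fd);
                        return -1;
                }
                return close(fd);
        }

        int main(void)
        {
                if (unshare(CLONE_NEWUSER) < 0) {
                        perror("unshare");
                        return 1;
                }
                /* Must happen before gid_map is written. */
                if (write_file("/proc/self/setgroups", "deny") < 0)
                        perror("setgroups");
                /* Map gid 0 inside the namespace to gid 1000 outside. */
                if (write_file("/proc/self/gid_map", "0 1000 1") < 0)
                        perror("gid_map");
                return 0;
        }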

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit b2f5d4dc38e034eecb7987e513255265ff9aa1cf upstream.

    Forced unmount affects not just the mount namespace but the underlying
    superblock as well. Restrict forced unmount to the global root user
    for now. Otherwise it becomes possible for a user in a less privileged
    mount namespace to force the shutdown of a superblock of a filesystem
    in a more privileged mount namespace, allowing a DOS attack on root.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 3e1866410f11356a9fd869beb3e95983dc79c067 upstream.

    Now that remount is properly enforcing the rule that you can't remove
    nodev at least sandstorm.io is breaking when performing a remount.

    It turns out that there is an easy, intuitive solution: implicitly
    add nodev on remount when nodev was implicitly added on mount.

    Tested-by: Cedric Bosdonnat
    Tested-by: Richard Weinberger
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit c297abfdf15b4480704d6b566ca5ca9438b12456 upstream.

    While reviewing the code of umount_tree I realized that when we append
    to a preexisting unmounted list we do not change pprev of the former
    first item in the list.

    This means that later, in namespace_unlock, hlist_del_init(&mnt->mnt_hash)
    on the former first item of the list will stomp on unmounted.first, leaving
    it set to some random mount point which we are likely to free soon.

    This isn't likely to hit, but if it does I don't know how anyone could
    track it down.

    [ This happened because we don't have all the same operations for
    hlist's as we do for normal doubly-linked lists. In particular,
    list_splice() is easy on our standard doubly-linked lists, while
    hlist_splice() doesn't exist and needs both start/end entries of the
    hlist. And commit 38129a13e6e7 incorrectly open-coded that missing
    hlist_splice().

    We should think about making these kinds of "mindless" conversions
    easier to get right by adding the missing hlist helpers - Linus ]
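
    A sketch of what a correct open-coded hlist splice has to do; the
    forgotten step in the buggy version is the pprev fix-up in the middle:

        /* Sketch: splice the chain first..last onto the front of head. */
        static void hlist_splice_front(struct hlist_node *first,
                                       struct hlist_node *last,
                                       struct hlist_head *head)
        {
                last->next = head->first;
                if (head->first)
                        head->first->pprev = &last->next; /* the missing step */
                head->first = first;
                first->pprev = &head->first;
        }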

    Fixes: 38129a13e6e7 ("switch mnt_hash to hlist")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 4e2024624e678f0ebb916e6192bd23c1f9fdf696 upstream.

    We didn't check the length of rock ridge ER records before printing them.
    Thus a corrupted isofs image can cause us to access and print some memory
    beyond the buffer, with obvious consequences.

    Reported-and-tested-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 4bd5a980de87d2b5af417485bde97b8eb3d6cf6a upstream.

    nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
    early so that it is safe to call .release in case nfs4_alloc_pages
    fails.

    Signed-off-by: Peng Tao
    Fixes: a47970ff78147 ("NFSv4.1: Hold reference to layout hdr in layoutget")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit f54e18f1b831c92f6512d2eedb224cd63d607d3d upstream.

    Rock Ridge extensions define so called Continuation Entries (CE) which
    describe where further Rock Ridge data is located. A corrupted isofs
    image can contain an arbitrarily long chain of these, including one
    containing a loop, thus causing the kernel to end up in an infinite
    loop when traversing these entries.

    Limit the traversal to 32 entries which should be more than enough space
    to store all the Rock Ridge data.
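
    A sketch of the bounded walk; the helpers are illustrative and the
    limit is the one described above:

        #define RR_MAX_CE_ENTRIES 32    /* ample for legitimate images */

        int entries = 0;

        while (have_continuation_entry(rs)) {
                if (++entries > RR_MAX_CE_ENTRIES)
                        return -EIO;    /* corrupted image: CE loop */
                follow_continuation_entry(rs);
        }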

    Reported-by: P J P
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

07 Dec, 2014

6 commits

  • commit 92a56555bd576c61b27a5cab9f38a33a1e9a1df5 upstream.

    If a SIGKILL is sent to a task waiting in __nfs_iocounter_wait,
    it will busy-wait or soft lockup in its while loop.
    nfs_wait_bit_killable won't sleep, and the loop won't exit on
    the error return.

    Stop the busy-wait by breaking out of the loop when
    nfs_wait_bit_killable returns an error.
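
    A sketch of the corrected loop shape (not the literal NFS code):

        for (;;) {
                if (atomic_read(&c->io_count) == 0)
                        break;
                ret = nfs_wait_bit_killable(&q.key);
                if (ret)
                        break;  /* fatal signal: stop, don't busy-wait */
        }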

    Signed-off-by: David Jeffery
    Signed-off-by: Trond Myklebust
    [ kamal: backport to 3.13-stable: context ]
    Cc: Moritz Mühlenhoff
    Signed-off-by: Kamal Mostafa
    Signed-off-by: Greg Kroah-Hartman

    David Jeffery
     
  • commit 8c3cac5e6a85f03602ffe09c44f14418699e31ec upstream.

    A leftover lock on the list is surely a sign of a problem of some sort,
    but it's not necessarily a reason to panic the box. Instead, just log a
    warning with some info about the lock, and then delete it like we would
    any other lock.

    In the event that the filesystem declares a ->lock f_op, we may end up
    leaking something, but that's generally preferable to an immediate
    panic.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton
    Cc: Markus Blank-Burian
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 1b19453d1c6abcfa7c312ba6c9f11a277568fc94 upstream.

    Currently, the DRC cache pruner will stop scanning the list when it
    hits an entry that is RC_INPROG. It's possible however for a call to
    take a *very* long time. In that case, we don't want it to block other
    entries from being pruned if they are expired or we need to trim the
    cache to get back under the limit.

    Fix the DRC cache pruner to just ignore RC_INPROG entries.
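
    A sketch of the pruner loop after the change; the expiry/limit helpers
    are illustrative:

        list_for_each_entry_safe(rp, tmp, &lru_head, c_lru) {
                /* Skip in-progress entries instead of stopping the scan. */
                if (rp->c_state == RC_INPROG)
                        continue;       /* was: break */
                if (!entry_is_expired(rp) && !cache_over_limit())
                        break;          /* LRU order: the rest is newer */
                nfsd_reply_cache_free_locked(rp);
        }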

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Cc: Joseph Salisbury
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit c6c15e1ed303ffc47e696ea1c9a9df1761c1f603 upstream.

    The current code for nfsd41_cb_get_slot() and nfsd4_cb_done() has no
    locking in order to guarantee atomicity, and so allows for races of
    the following form:

    Task 1                                  Task 2
    ======                                  ======
    if (test_and_set_bit(0) != 0) {
                                            clear_bit(0)
                                            rpc_wake_up_next(queue)
        rpc_sleep_on(queue)
        return false;
    }

    This patch breaks the race condition by adding a retest of the bit
    after the call to rpc_sleep_on().
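
    A sketch of the race breaker; the field and queue names follow nfsd's
    callback code, but treat the exact shape as approximate:

        if (test_and_set_bit(0, &clp->cl_cb_slot_busy) != 0) {
                rpc_sleep_on(&clp->cl_cb_waitq, task, NULL);

                /* Race breaker: the holder may have cleared the bit and
                 * woken the queue between our test and the sleep. */
                if (test_and_set_bit(0, &clp->cl_cb_slot_busy) != 0)
                        return false;
                rpc_wake_up_queued_task(&clp->cl_cb_waitq, task);
        }
        return true;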

    Signed-off-by: Trond Myklebust
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 6d0ba0432a5e10bc714ba9c5adc460e726e5fbb4 upstream.

    Even when security labels are disabled we support at least the same
    attributes as v4.1.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 835f252c6debd204fcd607c79975089b1ecd3472 upstream.

    https://bugzilla.kernel.org/show_bug.cgi?id=86831

    Markus reported that shutting down mysqld (with AIO support, on an
    ext3-formatted hard drive) leads to a negative number of dirty pages
    (an underrun of the counter). The negative number results in a drastic
    reduction of write performance because the page cache is not used: the
    kernel thinks there are still 2^32 dirty pages outstanding.

    Adding a warning in __dec_zone_state catches this easily:

    static inline void __dec_zone_state(struct zone *zone,
                                        enum zone_stat_item item)
    {
            atomic_long_dec(&zone->vm_stat[item]);
    +       WARN_ON_ONCE(item == NR_FILE_DIRTY &&
                         atomic_long_read(&zone->vm_stat[item]) < 0);
            atomic_long_dec(&vm_stat[item]);
    }

    [ 21.341632] ------------[ cut here ]------------
    [ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
    [ 21.355296] Modules linked in: wutbox_cp sata_mv
    [ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
    [ 21.366793] Workqueue: events free_ioctx
    [ 21.370760] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
    [ 21.378562] [] (show_stack) from [] (dump_stack+0x24/0x28)
    [ 21.385840] [] (dump_stack) from [] (warn_slowpath_common+0x84/0x9c)
    [ 21.393976] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x2c/0x34)
    [ 21.402800] [] (warn_slowpath_null) from [] (cancel_dirty_page+0x164/0x224)
    [ 21.411524] [] (cancel_dirty_page) from [] (truncate_inode_page+0x8c/0x158)
    [ 21.420272] [] (truncate_inode_page) from [] (truncate_inode_pages_range+0x11c/0x53c)
    [ 21.429890] [] (truncate_inode_pages_range) from [] (truncate_pagecache+0x88/0xac)
    [ 21.439252] [] (truncate_pagecache) from [] (truncate_setsize+0x5c/0x74)
    [ 21.447731] [] (truncate_setsize) from [] (put_aio_ring_file.isra.14+0x34/0x90)
    [ 21.456826] [] (put_aio_ring_file.isra.14) from [] (aio_free_ring+0x20/0xcc)
    [ 21.465660] [] (aio_free_ring) from [] (free_ioctx+0x24/0x44)
    [ 21.473190] [] (free_ioctx) from [] (process_one_work+0x134/0x47c)
    [ 21.481132] [] (process_one_work) from [] (worker_thread+0x130/0x414)
    [ 21.489350] [] (worker_thread) from [] (kthread+0xd4/0xec)
    [ 21.496621] [] (kthread) from [] (ret_from_fork+0x14/0x20)
    [ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

    The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
    (which bypasses the VFS dirty pages increment) at init time, and the aio fs
    uses *default_backing_dev_info* as the backing dev, which does not disable
    the dirty pages accounting capability.
    So truncating the aio ring file contributes to the dirty page accounting
    (a VFS dirty pages decrement), and the underrun occurs.

    The original goal was to keep these pages in memory (not reclaimable or
    swappable) for their lifetime by marking them dirty. But thinking about it
    more, we have already pinned the pages by elevating their refcount, which
    achieves that goal, so the SetPageDirty is unnecessary.

    To fix the issue, use __set_page_dirty_no_writeback instead of the nop
    .set_page_dirty, and drop the SetPageDirty (don't manually set the dirty
    flag, don't disable set_page_dirty(); rely on the default behaviour).

    With the above change, the dirty page accounting works correctly. But as
    we know, the aio fs is an anonymous one which should never cause any real
    writeback, so we can skip the dirty page (writeback) accounting by
    disabling that capability. We therefore introduce an aio-private backing
    dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to
    replace the default one.
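
    A sketch of such a private backing_dev_info, assuming the BDI capability
    flags of that era:

        static struct backing_dev_info aio_fs_backing_dev_info = {
                .name           = "aiofs",
                .state          = 0,
                /* No dirty accounting, no writeback, no writeback
                 * accounting for this anonymous fs. */
                .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
        };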

    Reported-by: Markus Königshaus
    Signed-off-by: Gu Zheng
    Acked-by: Andrew Morton
    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Greg Kroah-Hartman

    Gu Zheng
     

22 Nov, 2014

14 commits

  • commit d23da150a37c9fe3cc83dbaf71b3e37fd434ed52 upstream.

    We remove the call to grab_super_passive() in super_cache_count(). It had
    become a scalability bottleneck as multiple threads try to do memory
    reclamation, e.g. when we are doing large amounts of file reads and the
    page cache is under pressure. The cached objects quickly get reclaimed
    down to 0 and we abort the cache_scan() reclaim, but the counting creates
    a logjam acquiring the sb_lock.

    We are holding the shrinker_rwsem which ensures the safety of call to
    list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
    unregistered now before ->kill_sb() so the operation is safe when we are
    doing unmount.

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla    shrinker-v1r1
    Ops/sec Transactions          21.00 (  0.00%)    24.00 ( 14.29%)
    Ops/sec FilesCreate           39.00 (  0.00%)    44.00 ( 12.82%)
    Ops/sec CreateTransact        10.00 (  0.00%)    12.00 ( 20.00%)
    Ops/sec FilesDeleted        6202.00 (  0.00%)  6202.00 (  0.00%)
    Ops/sec DeleteTransact        11.00 (  0.00%)    12.00 (  9.09%)
    Ops/sec DataRead/MB           25.97 (  0.00%)    29.10 ( 12.05%)
    Ops/sec DataWrite/MB          49.99 (  0.00%)    56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla    shrinker-v1r1
    Ops/sec readall            9402.63 (  0.00%)  9567.97 (  1.76%)
    Ops/sec create             4695.45 (  0.00%)  4735.00 (  0.84%)
    Ops/sec delete              173.72 (  0.00%)   179.83 (  3.52%)
    Ops/sec Transactions      14271.80 (  0.00%) 14482.81 (  1.48%)
    Ops/sec Read                 37.00 (  0.00%)    37.60 (  1.62%)
    Ops/sec Write                18.20 (  0.00%)    18.30 (  0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Tim Chen
     
  • commit 28f2cd4f6da24a1aa06c226618ed5ad69e13df64 upstream.

    This series is aimed at regressions noticed during reclaim activity. The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me. I'm posting them again to see
    if there was a reason they were dropped or if they just got lost. Dave?
    Tim? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine? Hugh, does this
    work for you on the memcg test cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla  proportion-v1r4
    Ops/sec Transactions          21.00 (  0.00%)    25.00 ( 19.05%)
    Ops/sec FilesCreate           39.00 (  0.00%)    45.00 ( 15.38%)
    Ops/sec CreateTransact        10.00 (  0.00%)    12.00 ( 20.00%)
    Ops/sec FilesDeleted        6202.00 (  0.00%)  6202.00 (  0.00%)
    Ops/sec DeleteTransact        11.00 (  0.00%)    12.00 (  9.09%)
    Ops/sec DataRead/MB           25.97 (  0.00%)    30.02 ( 15.59%)
    Ops/sec DataWrite/MB          49.99 (  0.00%)    57.78 ( 15.58%)

    ffsb (mail server simulator)
                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla  proportion-v1r4
    Ops/sec readall            9402.63 (  0.00%)  9805.74 (  4.29%)
    Ops/sec create             4695.45 (  0.00%)  4781.39 (  1.83%)
    Ops/sec delete              173.72 (  0.00%)   177.23 (  2.02%)
    Ops/sec Transactions      14271.80 (  0.00%) 14764.37 (  3.45%)
    Ops/sec Read                 37.00 (  0.00%)    38.50 (  4.05%)
    Ops/sec Write                18.20 (  0.00%)    18.50 (  1.65%)

    dd of a large file
                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla  proportion-v1r4
    WallTime DownloadTar          75.00 (  0.00%)    61.00 ( 18.67%)
    WallTime DD                  423.00 (  0.00%)   401.00 (  5.20%)
    WallTime Delete                2.00 (  0.00%)     5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

                                       3.15.0-rc5       3.15.0-rc5
                                          vanilla  proportion-v1r4
    Unit >5ms Delays       80252.0000 (  0.00%) 81523.0000 ( -1.58%)
    Unit Mmap min              8.2118 (  0.00%)     8.3206 ( -1.33%)
    Unit Mmap mean            17.4614 (  0.00%)    17.2868 (  1.00%)
    Unit Mmap stddev          24.9059 (  0.00%)    34.6771 (-39.23%)
    Unit Mmap max           2811.6433 (  0.00%)  2645.1398 (  5.92%)
    Unit Mmap 90%             20.5098 (  0.00%)    18.3105 ( 10.72%)
    Unit Mmap 93%             22.9180 (  0.00%)    20.1751 ( 11.97%)
    Unit Mmap 95%             25.2114 (  0.00%)    22.4988 ( 10.76%)
    Unit Mmap 99%             46.1430 (  0.00%)    43.5952 (  5.52%)
    Unit Ideal Tput           85.2623 (  0.00%)    78.8906 (  7.47%)
    Unit Tput min             44.0666 (  0.00%)    43.9609 (  0.24%)
    Unit Tput mean            45.5646 (  0.00%)    45.2009 (  0.80%)
    Unit Tput stddev           0.9318 (  0.00%)     1.1084 (-18.95%)
    Unit Tput max             46.7375 (  0.00%)    46.7539 ( -0.04%)

    This patch (of 3):

    We would like to unregister the sb shrinker before ->kill_sb(). This
    allows cached objects to be counted without calling grab_super_passive()
    to update the ref count on the sb. We want to avoid locking during memory
    reclamation, especially when we are skipping the memory reclaim because
    we are out of cached objects.

    This is safe because grab_super_passive does a try-lock on the
    sb->s_umount now, and so if we are in the unmount process, it won't ever
    block. That means what used to be a deadlock and races we were avoiding
    by using grab_super_passive() is now:

    shrinker                                umount

    down_read(shrinker_rwsem)
                                            down_write(sb->s_umount)
                                            shrinker_unregister
                                              down_write(shrinker_rwsem)

    grab_super_passive(sb)
      down_read_trylock(sb->s_umount)

    ....

    up_read(shrinker_rwsem)

                                            up_write(shrinker_rwsem)
                                            ->kill_sb()
                                            ....

    So it is safe to deregister the shrinker before ->kill_sb().

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream.

    ... it does that itself (via kmap_atomic())

    Signed-off-by: Al Viro
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream.

    This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.

    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for it to complete. This happens when the
    appropriate callback 'filler' doesn't complete its read operation and
    releases the page lock immediately, and instead queues a different
    completion routine to do that. This never actually happened anywhere in
    the code.

    read_cache_page_async() had 3 different callers:

    - read_cache_page() which is the sync version, it would just wait for
    the requested read to complete using wait_on_page_read().

    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
    function it supplied doesn't do any async reads, and would complete
    before the filler function returns - making it actually a sync read.

    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
    a similar story to JFFS2 - the filler function doesn't do anything
    resembling async reads and would always complete before the filler
    function returns.

    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async(). While there are filler callbacks that do async
    reads (such as the block one), we always called them through
    read_cache_page().

    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
     
  • commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.

    shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream.

    The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 1b2ad41214c9bf6e8befa000f0522629194bf540 upstream.

    Now that rgrps use the address space which is part of the super
    block, we need to update gfs2_mapping2sbd() to take account of
    that. The only way to do that easily is to use a different set
    of address_space_operations for rgrps.

    Reported-by: Abhi Das
    Tested-by: Abhi Das
    Signed-off-by: Steven Whitehouse
    Signed-off-by: Greg Kroah-Hartman

    Steven Whitehouse
     
  • commit 0c116cadd94b16b30b1dd90d38b2784d9b39b01a upstream.

    This patch removes the assumption made previously, that we only need to
    check the delegation stateid when it matches the stateid on a cached
    open.

    If we believe that we hold a delegation for this file, then we must assume
    that its stateid may have been revoked or expired too. If we don't test it
    then our state recovery process may end up caching open/lock state in a
    situation where it should not.
    We therefore rename the function nfs41_clear_delegation_stateid as
    nfs41_check_delegation_stateid, and change it to always run through the
    delegation stateid test and recovery process as outlined in RFC5661.

    http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 869f9dfa4d6d57b79e0afc3af14772c2a023eeb1 upstream.

    Any attempt to call nfs_remove_bad_delegation() while a delegation is being
    returned is currently a no-op. This means that we can end up looping
    forever in nfs_end_delegation_return() if something causes the delegation
    to be revoked.
    This patch adds a mechanism whereby the state recovery code can communicate
    to the delegation return code that the delegation is no longer valid and
    that it should not be used when reclaiming state.
    It also changes the return value for nfs4_handle_delegation_recall_error()
    to ensure that nfs_end_delegation_return() does not reattempt the lock
    reclaim before state recovery is done.

    http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 16caf5b6101d03335b386e77e9e14136f989be87 upstream.

    Variable 'err' may be used uninitialized when nfs_getattr() uses it to
    check whether it should call generic_fillattr() or not. That can result
    in spurious error returns. Initialize 'err' properly.

    Signed-off-by: Jan Kara
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit f8ebf7a8ca35dde321f0cd385fee6f1950609367 upstream.

    If state recovery failed, then we should not attempt to reclaim delegated
    state.

    http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4dfd4f7af0afd201706ad186352ca423b0f17d4b upstream.

    NFSv4.0 does not have TEST_STATEID/FREE_STATEID functionality, so
    unlike NFSv4.1, the recovery procedure when stateids have expired or
    have been revoked requires us to just forget the delegation.

    http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit ece9c72accdc45c3a9484dacb1125ce572647288 upstream.

    Priority of a merged request is computed by ioprio_best(). If one of the
    requests has an undefined priority (IOPRIO_CLASS_NONE) and the other
    request has a priority from IOPRIO_CLASS_BE, the function will return the
    undefined priority, which is wrong. Fix the function to properly return
    the priority of the request with the defined priority.
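
    A sketch matching the described fix: map an undefined priority to the
    best-effort default before taking the stronger (numerically lower) one:

        static unsigned short ioprio_best(unsigned short aprio,
                                          unsigned short bprio)
        {
                if (!ioprio_valid(aprio))
                        aprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
                if (!ioprio_valid(bprio))
                        bprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);

                return min(aprio, bprio);
        }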

    Fixes: d58cdfb89ce0c6bd5f81ae931a984ef298dbda20
    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 8c393f9a721c30a030049a680e1bf896669bb279 upstream.

    For pNFS direct writes, the layout driver may dynamically allocate
    ds_cinfo.buckets, so we need to take care to free them when freeing the dreq.

    Ideally this needs to be done inside layout driver where ds_cinfo.buckets
    are allocated. But buckets are attached to dreq and reused across LD IO iterations.
    So I feel it's OK to free them in the generic layer.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     

15 Nov, 2014

7 commits

  • commit 6e5aafb27419f32575b27ef9d6a31e5d54661aca upstream.

    If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
    the csums we allocated and free them. But the code was using list_entry
    incorrectly, and ended up trying to free the on-stack list_head instead.

    This bug came from commit 0678b6185

    btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()
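
    A sketch of the corrected error path: walk the real entries rather than
    treating the on-stack list_head itself as an entry:

        while (!list_empty(&tmplist)) {
                sums = list_first_entry(&tmplist,
                                        struct btrfs_ordered_sum, list);
                list_del(&sums->list);
                kfree(sums);
        }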

    Signed-off-by: Chris Mason
    Reported-by: Erik Berg
    Signed-off-by: Greg Kroah-Hartman

    Chris Mason
     
  • commit 5ef828c4152726f56751c78ea844f08d2b2a4fa3 upstream.

    The commit

    83e782e xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD

    added a new function xfs_sb_quota_from_disk() which swaps
    on-disk XFS_OQUOTA_* flags for in-core XFS_GQUOTA_* and XFS_PQUOTA_*
    flags after the superblock is read.

    However, if log recovery is required, the superblock is read again,
    and the modified in-core flags are re-read from disk, so we have
    XFS_OQUOTA_* flags in memory again. This causes the
    XFS_QM_NEED_QUOTACHECK() test to be true, because the XFS_OQUOTA_CHKD
    is still set, and not XFS_GQUOTA_CHKD or XFS_PQUOTA_CHKD.

    Change xfs_sb_from_disk to call xfs_sb_quota_from_disk and always
    convert the disk flags to in-memory flags.

    Add a lower-level function which can be called with "false" to
    not convert the flags, so that the sb verifier can verify
    exactly what was on disk, per Brian Foster's suggestion.

    Reported-by: Cyril B.
    Signed-off-by: Eric Sandeen
    Cc: Arkadiusz Miśkiewicz
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit 474d2605d119479e5aa050f738632e63589d4bb5 upstream.

    Due to a switched left and right side of an assignment,
    dquot_writeback_dquots() never returned an error. This could result in
    errors during quota writeback not being reported to userspace properly.
    Fix it.
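
    A sketch of the bug class, where the saved error gets overwritten
    with 0:

        err = sb->dq_op->write_info(sb, cnt);
        if (!ret && err)
                ret = err;      /* the buggy version did: err = ret; */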

    Coverity-id: 1226884
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 7938db449bbc55bbeb164bec7af406212e7e98f1 upstream.

    The check whether quota format is set even though there are no
    quota files with journalled quota is pointless and it actually
    makes it impossible to turn off journalled quotas (as there's
    no way to unset journalled quota format). Just remove the check.

    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 51904b08072a8bf2b9ed74d1bd7a5300a614471d upstream.

    Unknown operation numbers are caught in nfsd4_decode_compound() which
    sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The
    error causes the main loop in nfsd4_proc_compound() to skip most
    processing. But nfsd4_proc_compound also peeks ahead at the next
    operation in one case and doesn't take similar precautions there.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    J. Bruce Fields
     
  • commit 599a9b77ab289d85c2d5c8607624efbe1f552b0f upstream.

    When we fail to load block bitmap in __ext4_new_inode() we will
    dereference NULL pointer in ext4_journal_get_write_access(). So check
    for error from ext4_read_block_bitmap().

    Coverity-id: 989065
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 98c1a7593fa355fda7f5a5940c8bf5326ca964ba upstream.

    If metadata checksumming is turned on for the FS, we need to tell the
    journal to use checksumming too.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong