30 Dec, 2014

1 commit

  • Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") added a separate parameter for
    specifying the gfp mask for radix tree allocations.

    Not only is this less than optimal from the API point of view, because
    it is error prone; it is also currently buggy: grab_cache_page_write_begin
    uses GFP_KERNEL for the radix tree, and if fgp_flags doesn't contain
    FGP_NOFS (mostly controlled by the filesystem via the AOP_FLAG_NOFS
    flag) but the mapping_gfp_mask has __GFP_FS cleared, then the radix tree
    allocation won't obey the restriction and might recurse into the
    filesystem and cause deadlocks. Unfortunately this is the case for most
    filesystems, because only ext4 and gfs2 use AOP_FLAG_NOFS.

    Let's simply remove the radix_gfp_mask parameter, because the allocation
    context is the same for both the page cache and the radix tree. Just make
    sure that the radix tree gets only the sane subset of the mask (e.g. do
    not pass __GFP_WRITE).
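
    Roughly, the allocation path then collapses to a single caller-supplied
    mask (a minimal sketch, not the merged diff; treating __GFP_WRITE as
    the bit to strip is illustrative):

        /* one gfp mask drives both allocations; the radix tree only
         * ever sees a sanitized subset of it */
        struct page *page = __page_cache_alloc(gfp_mask);
        if (page) {
                int err = add_to_page_cache_lru(page, mapping, offset,
                                                gfp_mask & ~__GFP_WRITE);
                if (err) {
                        page_cache_release(page);
                        page = NULL;
                }
        }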

    Long term it would be preferable to convert the remaining users of
    AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
    interface even further.

    Reported-by: Dave Chinner
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Dec, 2014

1 commit

  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data: one for file-backed pages and the other for anon memory. As
    such, this lock can also be an rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.
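
    A minimal sketch of the post-conversion shape (helper names as used
    around this series; the bodies shown are illustrative):

        static inline void i_mmap_lock_write(struct address_space *mapping)
        {
                /* i_mmap_mutex becomes an rwsem; every user currently
                 * takes it for writing, read-side sharing comes later */
                down_write(&mapping->i_mmap_rwsem);
        }

        static inline void i_mmap_unlock_write(struct address_space *mapping)
        {
                up_write(&mapping->i_mmap_rwsem);
        }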

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

09 Oct, 2014

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - fix an NFSv4.1 state renewal regression
    - fix open/lock state recovery error handling
    - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    - fix statd when reconnection fails
    - don't wake tasks during connection abort
    - don't start reboot recovery if lease check fails
    - fix duplicate proc entries

    Features:
    - pNFS block driver fixes and clean ups from Christoph
    - More code cleanups from Anna
    - Improve mmap() writeback performance
    - Replace use of PF_TRANS with a more generic mechanism for avoiding
    deadlocks in nfs_release_page"

    * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
    NFSv4.1: Fix an NFSv4.1 state renewal regression
    NFSv4: fix open/lock state recovery error handling
    NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    NFS: Fabricate fscache server index key correctly
    SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
    NFSv3: Fix missing includes of nfs3_fs.h
    NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
    NFS: avoid waiting at all in nfs_release_page when congested.
    NFS: avoid deadlocks with loop-back mounted NFS filesystems.
    MM: export page_wakeup functions
    SCHED: add some "wait..on_bit...timeout()" interfaces.
    NFS: don't use STABLE writes during writeback.
    NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
    rpc: Add -EPERM processing for xs_udp_send_request()
    rpc: return sent and err from xs_sendpages()
    lockd: Try to reconnect if statd has moved
    SUNRPC: Don't wake tasks during connection abort
    Fixing lease renewal
    nfs: fix duplicate proc entries
    pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
    ...

    Linus Torvalds
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

25 Sep, 2014

2 commits

  • This will allow NFS to wait for PG_private to be cleared and,
    particularly, to send a wake-up when it is.
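
    A hedged sketch of the intended pairing, assuming the page wake-up
    helpers are made visible outside mm/ (the call sites shown are
    illustrative):

        /* releasing side, e.g. NFS dropping its private state */
        ClearPagePrivate(page);
        wake_up_page(page, PG_private);         /* wake the waiter below */

        /* waiting side */
        wait_on_page_bit(page, PG_private);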

    Signed-off-by: NeilBrown
    Acked-by: Andrew Morton
    Signed-off-by: Trond Myklebust

    NeilBrown
     
  • In commit c1221321b7c2 ("sched: Allow wait_on_bit_action() functions to
    support a timeout") I suggested that a "wait_on_bit_timeout()" interface
    would not meet my need. This isn't true - I was just over-engineering.

    Including a 'private' field in wait_bit_key instead of a focused
    "timeout" field was just premature generalization. If some other
    use is ever found, it can be generalized or added later.

    So this patch renames "private" to "timeout", with the meaning "stop
    waiting when jiffies reaches or passes timeout", and adds two of the
    many possible wait..bit..timeout() interfaces:

    wait_on_page_bit_killable_timeout(), which is the one I want to use,
    and out_of_line_wait_on_bit_timeout(), which is a reasonably general
    example. Others can be added as needed.
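
    A hedged usage sketch of the page-level variant (the timeout value and
    the error handling are illustrative; a nonzero return means the wait
    stopped before the bit cleared, whether by signal or by timeout):

        int ret = wait_on_page_bit_killable_timeout(page, PG_private,
                                                    msecs_to_jiffies(100));
        if (ret != 0)
                return ret;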

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: NeilBrown
    Acked-by: Ingo Molnar
    Signed-off-by: Trond Myklebust

    NeilBrown
     

12 Aug, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and a general rework of the mnt_pin mechanism. That allows
    us to go for the delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     
  • If DIO results in a short write and the sync write fails, we want to
    bugger off whether the DIO part has written anything or not; the logic
    on the return path will take care of the right return value.

    Cc: stable@vger.kernel.org [3.16]
    Reported-by: Anton Altaparmakov
    Signed-off-by: Al Viro

    Al Viro
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.
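
    Condensed, the new call points look like this (a sketch using the names
    from the changelog; exact signatures and call sites may differ):

        /* page release: only once nobody can see the page any more */
        mem_cgroup_uncharge(page);      /* after the final put_page() */

        /* swap-out: move the memory+swap charge to the swap entry */
        mem_cgroup_swapout(page, entry);

        /* migration: transfer the charge once the new page is rmapped */
        mem_cgroup_migrate(oldpage, newpage, lrucare);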

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().
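
    A hedged sketch of the transaction pattern (signatures modelled on the
    changelog; 'insertion_failed' is a hypothetical placeholder for the
    caller's own failure condition):

        struct mem_cgroup *memcg;
        int error;

        error = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
        if (error)
                return error;

        /* ... set page->mapping, establish rmap ... */

        if (insertion_failed) {
                mem_cgroup_cancel_charge(page, memcg);  /* abort */
                return -ENOMEM;
        }
        mem_cgroup_commit_charge(page, memcg, false);   /* not on LRU yet */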

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

2 commits

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()), noting the impact on VM_FAULT_RETRY returns.

    Add comments up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.
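
    The documented contract, sketched from a caller's point of view (the
    comment wording here is editorial, not the merged text):

        if (!lock_page_or_retry(page, mm, flags)) {
                /*
                 * 0 return: the page was not locked, and mmap_sem may or
                 * may not have been released depending on 'flags'; a
                 * VM_FAULT_RETRY return has to account for both cases.
                 */
                return VM_FAULT_RETRY;
        }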

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     
  • Do we really need an exported alias for __SetPageReferenced()? Its
    callers had better know what they're doing, in which case the page would
    not already be marked referenced. Kill init_page_accessed() and just use
    __SetPageReferenced() inline.
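
    That is, callers simply do (sketch):

        /* the page is freshly allocated and not yet visible, so the
         * plain non-atomic flag op is safe - no wrapper needed */
        __SetPageReferenced(page);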

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Aug, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds
     

31 Jul, 2014

1 commit

  • Fix kernel-doc warnings in mm/filemap.c: pagecache_get_page():

    Warning(..//mm/filemap.c:1054): No description found for parameter 'cache_gfp_mask'
    Warning(..//mm/filemap.c:1054): No description found for parameter 'radix_gfp_mask'
    Warning(..//mm/filemap.c:1054): Excess function parameter 'gfp_mask' description in 'pagecache_get_page'

    Fixes: 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible")

    [mgorman@suse.de: change everything]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Randy Dunlap
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and in /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.
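
    A hedged before/after sketch of a typical caller ('my_wait_action' is a
    hypothetical stand-in for one of the discarded action functions):

        /* before: every wait supplied an action function */
        err = wait_on_bit(&inode->i_state, __I_NEW,
                          my_wait_action, TASK_KILLABLE);

        /* after: the mode argument alone decides signal handling;
         * wait_on_bit_io() is used where io_schedule() is wanted */
        err = wait_on_bit(&inode->i_state, __I_NEW, TASK_KILLABLE);
        if (err)
                return -ERESTARTSYS;    /* caller picks its own errno */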

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

05 Jun, 2014

3 commits

  • If a page is marked for immediate reclaim then it is moved to the tail of
    the LRU list. This occurs when the system is under enough memory pressure
    for pages under writeback to reach the end of the LRU, but we test for
    this using atomic operations on every writeback completion. This patch
    uses an optimistic non-atomic test first. It'll miss some pages in rare
    cases, but the consequences are not severe enough to warrant such a
    penalty.

    While the function does not dominate profiles during a simple dd test,
    its cost is reduced.

    73048  0.7428  vmlinux-3.15.0-rc5-mmotm-20140513  end_page_writeback
    23740  0.2409  vmlinux-3.15.0-rc5-lessatomic      end_page_writeback
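
    The shape of the change, sketched:

        /* before: an atomic read-modify-write on every completion */
        if (TestClearPageReclaim(page))
                rotate_reclaimable_page(page);

        /* after: a cheap plain test first; atomics only when it matters */
        if (PageReclaim(page)) {
                ClearPageReclaim(page);
                rotate_reclaimable_page(page);
        }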

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • aops->write_begin may allocate a new page and make it visible only to
    have mark_page_accessed called almost immediately after. Once the page
    is visible the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs, but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before
    the page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old API is preserved but is basically a thin wrapper around this core
    function (see the sketch below).
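
    For instance, find_lock_page() can collapse into a one-line wrapper
    (a sketch; the argument list shown matches this series, including the
    separate radix gfp mask that the 30 Dec commit above later removes):

        static inline struct page *find_lock_page(struct address_space *mapping,
                                                  pgoff_t offset)
        {
                return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
        }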

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare, but it's worth
    the filesystem people thinking about it in case they see a problem with
    the timing change. It is also the case that some filesystems may be
    marking pages accessed that previously were not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or
    NUMA artifacts. Throughput and wall times are presented for sync IO;
    only wall times are shown for async, as the granularity reported by dd
    and the variability make it unsuitable for comparison. As async results
    were variable due to writeback timings, I'm only reporting the maximum
    figures. The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                            3.15.0-rc3          3.15.0-rc3
                            vanilla             accessed-v2
    ext3   Max elapsed   13.9900 (  0.00%)   11.5900 ( 17.16%)
    tmpfs  Max elapsed    0.5100 (  0.00%)    0.4900 (  3.92%)
    btrfs  Max elapsed   12.8100 (  0.00%)   12.7800 (  0.23%)
    ext4   Max elapsed   18.6000 (  0.00%)   13.3400 ( 28.28%)
    xfs    Max elapsed   12.5600 (  0.00%)    2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

            samples  percentage
    ext3      86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3      23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3       5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4      64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4       5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4       2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs       62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs        1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs         103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs     10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs      2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs       587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs     59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs      1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs        94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • page_endio() takes care of updating all the appropriate page flags once
    I/O has finished to a page. Switch to using mapping_set_error() instead
    of setting AS_EIO directly; this will handle thin-provisioned devices
    correctly.
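
    The helper being switched to looked essentially like this at the time
    (a sketch), which is why thin-provisioned devices reporting -ENOSPC end
    up with the right flag instead of a blanket AS_EIO:

        static inline void mapping_set_error(struct address_space *mapping,
                                             int error)
        {
                if (unlikely(error)) {
                        if (error == -ENOSPC)
                                set_bit(AS_ENOSPC, &mapping->flags);
                        else
                                set_bit(AS_EIO, &mapping->flags);
                }
        }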

    Signed-off-by: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

04 Jun, 2014

1 commit

  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     

24 May, 2014

1 commit

  • In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

    blkdev_aio_read()
      generic_file_aio_read()
        filemap_write_and_wait_range()
          if (!mapping->nr_pages)
            filemap_check_errors()

    and filemap_check_errors() always attempts two test_and_clear_bit()
    operations on the mapping flags, thus dirtying the cacheline on every
    single invocation. The patch below tests each of these bits before
    clearing them, avoiding this issue. In my test case (4-socket box),
    performance went from 1.7M IOPS to 4.0M IOPS.
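
    The fix, sketched (essentially the shape of the merged patch):

        static int filemap_check_errors(struct address_space *mapping)
        {
                int ret = 0;

                /* peek first: the common no-error path stays read-only
                 * and never dirties the cacheline */
                if (test_bit(AS_ENOSPC, &mapping->flags) &&
                    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                        ret = -ENOSPC;
                if (test_bit(AS_EIO, &mapping->flags) &&
                    test_and_clear_bit(AS_EIO, &mapping->flags))
                        ret = -EIO;
                return ret;
        }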

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

07 May, 2014

12 commits

  • no callers left

    Signed-off-by: Al Viro

    Al Viro
     
  • all users converted to __generic_file_write_iter() now

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Now It Can Be Done(tm) - we don't need to do iov_shorten() in
    generic_file_direct_write() anymore, now that all ->direct_IO()
    instances are converted to proper iov_iter methods and honour
    iter->count and iter->iov_offset properly.

    Get rid of count/ocount arguments of generic_file_direct_write(),
    while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     
  • For now, just use the same thing we pass to ->direct_IO() - it's all
    iovec-based at the moment. Pass it explicitly to iov_iter_init() and
    account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
    uses.

    Signed-off-by: Al Viro

    Al Viro
     
  • iov_iter-using variant of generic_file_aio_read(). Some callers
    converted. Note that it's still not quite there for use as ->read_iter() -
    we depend on having zero iter->iov_offset in the O_DIRECT case. Fortunately,
    that's true for all converted callers (and for generic_file_aio_read() itself).

    Signed-off-by: Al Viro

    Al Viro
     
  • the thing is, we want to advance what's given to ->direct_IO() as we
    are forming the request; however, the callers care about the amount
    of data actually transferred, not the amount we tried to transfer.
    It's more convenient to allow ->direct_IO() instances to use
    iov_iter_advance() on a copy of the iov_iter, leaving the actual
    advancing of the original to the caller (see the sketch below).
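
    Sketched, the convention is (variable names illustrative):

        /* inside a ->direct_IO() instance: work on a private copy */
        struct iov_iter i = *iter;
        iov_iter_advance(&i, mapped_bytes);     /* the copy may move freely */

        /* back in the caller: only actual progress moves the original */
        if (written > 0)
                iov_iter_advance(iter, written);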

    Signed-off-by: Al Viro

    Al Viro
     
  • all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

    Signed-off-by: Al Viro

    Al Viro
     
  • unmodified, for now

    Signed-off-by: Al Viro

    Al Viro
     
  • all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
    checked - generic_segment_checks() done after that is just an odd way
    to spell iov_length().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Dave Jones reports the following crash when find_get_pages_tag() runs
    into an exceptional entry:

    kernel BUG at mm/filemap.c:1347!
    RIP: find_get_pages_tag+0x1cb/0x220
    Call Trace:
      find_get_pages_tag+0x36/0x220
      pagevec_lookup_tag+0x21/0x30
      filemap_fdatawait_range+0xbe/0x1e0
      filemap_fdatawait+0x27/0x30
      sync_inodes_sb+0x204/0x2a0
      sync_inodes_one_sb+0x19/0x20
      iterate_supers+0xb2/0x110
      sys_sync+0x44/0xb0
      ia32_do_call+0x13/0x13

    1343 /*
    1344 * This function is never used on a shmem/tmpfs
    1345 * mapping, so a swap entry won't be found here.
    1346 */
    1347 BUG();

    After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
    page cache radix trees") this comment and BUG() are out of date because
    exceptional entries can now appear in all mappings - as shadows of
    recently evicted pages.

    However, as Hugh Dickins notes,

    "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
    any other PAGECACHE_TAG_*) to appear on an exceptional entry.

    I expect it comes down to an occasional race in RCU lookup of the
    radix_tree: lacking absolute synchronization, we might sometimes
    catch an exceptional entry, with the tag which really belongs with
    the unexceptional entry which was there an instant before."

    And indeed, not only is the tree walk lockless, the tags are also read
    in chunks, one radix tree node at a time. There is plenty of time for
    page reclaim to swoop in and replace a page that was already looked up
    as tagged with a shadow entry.

    Remove the BUG() and update the comment. While reviewing all other
    lookup sites for whether they properly deal with shadow entries of
    evicted pages, update all the comments and fix memcg file charge moving
    to not miss shmem/tmpfs swapcache pages.
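
    The lookup-side handling, sketched (shape as in the merged code):

        page = radix_tree_deref_slot(slot);
        if (unlikely(!page))
                continue;
        if (radix_tree_exception(page)) {
                if (radix_tree_deref_retry(page))
                        goto restart;
                /* a shadow entry of a recently evicted page, or a swap
                 * entry from shmem/tmpfs - just skip over it */
                continue;
        }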

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write, and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline, and for some of it I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds