04 Jun, 2016

8 commits

  • The optimistic fast path may use cpuset_current_mems_allowed
    instead of a NULL nodemask supplied by the caller for cpuset
    allocations. The preferred zone is calculated on this basis for
    statistics purposes and as a starting point in the zonelist
    iterator.

    However, if the context can ignore memory policies due to being atomic
    or being able to ignore watermarks then the starting point in the
    zonelist iterator is no longer correct. This patch resets the zonelist
    iterator in the allocator slowpath if the context can ignore memory
    policies. This will alter the zone used for statistics but only after
    it is known that it makes sense for that context. Resetting it before
    entering the slowpath would potentially allow an ALLOC_CPUSET allocation
    to be accounted for against the wrong zone. Note that while nodemask
    is not explicitly set back to the original nodemask, it would only
    have been overwritten if cpuset_enabled() is true, and it is reset
    before the slowpath is entered.

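    A sketch of that reset, using the allocator-internal names from the
    cited commit (exact placement inside __alloc_pages_slowpath() is
    assumed):

        /*
         * Reset the zonelist iterators if memory policies can be
         * ignored. These allocations are high priority and system
         * rather than user oriented.
         */
        if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
                ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
        }
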
    Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
    Fixes: c33d6c06f60f710 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Signed-off-by: Mel Gorman
    Reported-by: Geert Uytterhoeven
    Tested-by: Geert Uytterhoeven
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Geert Uytterhoeven reported the following problem, bisected to
    commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
    zone in a zonelist twice"), on m68k/ARAnyM:

    BUG: scheduling while atomic: cron/668/0x10c9a0c0
    Modules linked in:
    CPU: 0 PID: 668 Comm: cron Not tainted 4.6.0-atari-05133-gc33d6c06f60f710f #364
    Call Trace: [] __schedule_bug+0x40/0x54
    __schedule+0x312/0x388
    __schedule+0x0/0x388
    prepare_to_wait+0x0/0x52
    schedule+0x64/0x82
    schedule_timeout+0xda/0x104
    set_next_entity+0x18/0x40
    pick_next_task_fair+0x78/0xda
    io_schedule_timeout+0x36/0x4a
    bit_wait_io+0x0/0x40
    bit_wait_io+0x12/0x40
    __wait_on_bit+0x46/0x76
    wait_on_page_bit_killable+0x64/0x6c
    bit_wait_io+0x0/0x40
    wake_bit_function+0x0/0x4e
    __lock_page_or_retry+0xde/0x124
    do_scan_async+0x114/0x17c
    lookup_swap_cache+0x24/0x4e
    handle_mm_fault+0x626/0x7de
    find_vma+0x0/0x66
    down_read+0x0/0xe
    wait_on_page_bit_killable_timeout+0x77/0x7c
    find_vma+0x16/0x66
    do_page_fault+0xe6/0x23a
    res_func+0xa3c/0x141a
    buserr_c+0x190/0x6d4
    res_func+0xa3c/0x141a
    buserr+0x20/0x28
    res_func+0xa3c/0x141a
    buserr+0x20/0x28

    The relationship is not obvious but it's due to a failure to rescan the
    full zonelist after the fair zone allocation policy exhausts the batch
    count. While this is a functional problem, it's also a performance
    issue. A page allocator microbenchmark showed the following:

                          4.7.0-rc1 vanilla    4.7.0-rc1 reset-v1r2
    Min alloc-odr0-1 327.00 ( 0.00%) 326.00 ( 0.31%)
    Min alloc-odr0-2 235.00 ( 0.00%) 235.00 ( 0.00%)
    Min alloc-odr0-4 198.00 ( 0.00%) 198.00 ( 0.00%)
    Min alloc-odr0-8 170.00 ( 0.00%) 170.00 ( 0.00%)
    Min alloc-odr0-16 156.00 ( 0.00%) 156.00 ( 0.00%)
    Min alloc-odr0-32 150.00 ( 0.00%) 150.00 ( 0.00%)
    Min alloc-odr0-64 146.00 ( 0.00%) 146.00 ( 0.00%)
    Min alloc-odr0-128 145.00 ( 0.00%) 145.00 ( 0.00%)
    Min alloc-odr0-256 155.00 ( 0.00%) 155.00 ( 0.00%)
    Min alloc-odr0-512 168.00 ( 0.00%) 165.00 ( 1.79%)
    Min alloc-odr0-1024 175.00 ( 0.00%) 174.00 ( 0.57%)
    Min alloc-odr0-2048 180.00 ( 0.00%) 180.00 ( 0.00%)
    Min alloc-odr0-4096 187.00 ( 0.00%) 186.00 ( 0.53%)
    Min alloc-odr0-8192 190.00 ( 0.00%) 190.00 ( 0.00%)
    Min alloc-odr0-16384 191.00 ( 0.00%) 191.00 ( 0.00%)
    Min alloc-odr1-1 736.00 ( 0.00%) 445.00 ( 39.54%)
    Min alloc-odr1-2 343.00 ( 0.00%) 335.00 ( 2.33%)
    Min alloc-odr1-4 277.00 ( 0.00%) 270.00 ( 2.53%)
    Min alloc-odr1-8 238.00 ( 0.00%) 233.00 ( 2.10%)
    Min alloc-odr1-16 224.00 ( 0.00%) 218.00 ( 2.68%)
    Min alloc-odr1-32 210.00 ( 0.00%) 208.00 ( 0.95%)
    Min alloc-odr1-64 207.00 ( 0.00%) 203.00 ( 1.93%)
    Min alloc-odr1-128 276.00 ( 0.00%) 202.00 ( 26.81%)
    Min alloc-odr1-256 206.00 ( 0.00%) 202.00 ( 1.94%)
    Min alloc-odr1-512 207.00 ( 0.00%) 202.00 ( 2.42%)
    Min alloc-odr1-1024 208.00 ( 0.00%) 205.00 ( 1.44%)
    Min alloc-odr1-2048 213.00 ( 0.00%) 212.00 ( 0.47%)
    Min alloc-odr1-4096 218.00 ( 0.00%) 216.00 ( 0.92%)
    Min alloc-odr1-8192 341.00 ( 0.00%) 219.00 ( 35.78%)

    Note that order-0 allocations are unaffected, but higher orders get
    a small boost from this patch, and there is a large reduction in
    system CPU usage overall, as can be seen here:

                          4.7.0-rc1 vanilla    4.7.0-rc1 reset-v1r2
    User 85.32 86.31
    System 2221.39 2053.36
    Elapsed 2368.89 2202.47

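    The fix itself is small; a sketch, assuming the zonelist_scan retry
    label in that era's get_page_from_freelist() (variable names
    approximate):

        if (fair_skipped) {
                apply_fair = false;
                fair_skipped = false;
                reset_alloc_batches(ac->preferred_zoneref->zone);
                /* the missing piece: restart from the first zone */
                z = ac->preferred_zoneref;
                goto zonelist_scan;
        }
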
    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Geert Uytterhoeven
    Tested-by: Geert Uytterhoeven
    Tested-by: Mikulas Patocka
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Oleg has noted that the siglock usage in try_oom_reaper is both
    pointless and dangerous: signal_group_exit can be checked
    locklessly, and sighand becomes NULL in __exit_signal, so we can
    crash.

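    Roughly, the change looks like this (a sketch; try_oom_reaper()
    internals assumed):

        /* before: pointless, and p->sighand may already have been
         * cleared by __exit_signal(), so this can crash */
        spin_lock_irq(&p->sighand->siglock);
        exiting = signal_group_exit(p->signal);
        spin_unlock_irq(&p->sighand->siglock);

        /* after: signal_group_exit() is safe to check locklessly */
        exiting = signal_group_exit(p->signal);
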
    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In a DEBUG_VM kernel, we can hit an infinite loop for order == 0 in
    buffered_rmqueue() when check_new_pcp() returns 1, because the bad
    page is never removed from the pcp list. Fix this by removing the
    page before retrying. Also, we don't need to check whether the page
    is non-NULL, because we simply grab it from a list which was just
    tested for being non-empty.

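    A sketch of the corrected order-0 path in buffered_rmqueue(), per
    the description above (surrounding context assumed):

        do {
                page = list_first_entry(list, struct page, lru);
                /* take the page off the pcp list before the check, so a
                 * bad page cannot be picked up again on the next pass */
                list_del(&page->lru);
                pcp->count--;
        } while (check_new_pcp(page));
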
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Link: http://lkml.kernel.org/r/20160530090154.GM2527@techsingularity.net
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mel Gorman
    Reported-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Fix an erroneous z3fold header access for a HEADLESS page in the
    reclaim function, and change the one remaining direct
    handle-to-buddy conversion to use the appropriate helper.

    Link: http://lkml.kernel.org/r/5748706F.9020208@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • memcg_offline_kmem() may be called from memcg_free_kmem() after a css
    init failure. memcg_free_kmem() is a ->css_free callback which is
    called without cgroup_mutex and memcg_offline_kmem() ends up using
    css_for_each_descendant_pre() without any locking. Fix it by adding rcu
    read locking around it.

    mkdir: cannot create directory `65530': No space left on device
    ===============================
    [ INFO: suspicious RCU usage. ]
    4.6.0-work+ #321 Not tainted
    -------------------------------
    kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
    [ 527.243970] other info that might help us debug this:
    [ 527.244715] rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by kworker/0:5/1664:
    #0: ("cgroup_destroy"){.+.+..}, at: [] process_one_work+0x165/0x4a0
    #1: ((&css->destroy_work)#3){+.+...}, at: [] process_one_work+0x165/0x4a0
    [ 527.248098] stack backtrace:
    CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    Workqueue: cgroup_destroy css_free_work_fn
    Call Trace:
    dump_stack+0x68/0xa1
    lockdep_rcu_suspicious+0xd7/0x110
    css_next_descendant_pre+0x7d/0xb0
    memcg_offline_kmem.part.44+0x4a/0xc0
    mem_cgroup_css_free+0x1ec/0x200
    css_free_work_fn+0x49/0x5e0
    process_one_work+0x1c5/0x4a0
    worker_thread+0x49/0x490
    kthread+0xea/0x100
    ret_from_fork+0x1f/0x40

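    The fix wraps the descendant walk in an RCU read-side critical
    section; a sketch (loop body elided):

        struct cgroup_subsys_state *css;

        rcu_read_lock();
        css_for_each_descendant_pre(css, &memcg->css) {
                /* ... re-parent each child's kmemcg_id ... */
        }
        rcu_read_unlock();
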
    Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
    Signed-off-by: Tejun Heo
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Per the discussion with Joonsoo Kim [1], we need to check the
    return value of lookup_page_ext() at all call sites, since it might
    return NULL in some cases, however unlikely, e.g. during memory
    hotplug.

    Tested with ltp with "page_owner=0".

    [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

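    The pattern applied at each call site looks roughly like this (a
    sketch; the page_owner case is illustrative):

        struct page_ext *page_ext = lookup_page_ext(page);

        if (unlikely(!page_ext))
                return;
        __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
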
    [akpm@linux-foundation.org: fix build-breaking typos]
    [arnd@arndb.de: fix build problems from lookup_page_ext]
    Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Signed-off-by: Arnd Bergmann
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When remapping pages covering 4G or more of memory, the expression
    'count << PAGE_SHIFT' overflows because it is evaluated in integer
    arithmetic. Solution: cast before doing the bit shift.

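    A minimal illustration of the bug and the fix, assuming count is a
    plain int as in the affected vmap()/vm_unmap_ram() paths:

        unsigned long size;

        size = count << PAGE_SHIFT;                /* overflows in int arithmetic */
        size = (unsigned long)count << PAGE_SHIFT; /* fixed: widen before shifting */
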
    [akpm@linux-foundation.org: fix vm_unmap_ram() also]
    [akpm@linux-foundation.org: fix vmap() as well, per Guillermo]
    Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es
    Signed-off-by: Guillermo Julián Moreno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillermo Julián Moreno
     

28 May, 2016

13 commits

  • Pull vfs fixes from Al Viro:
    "Followups to the parallel lookup work:

    - update docs

    - restore killability of the places that used to take ->i_mutex
    killably now that we have down_write_killable() merged

    - Additionally, it turns out that I missed a prerequisite for
    security_d_instantiate() stuff - ->getxattr() wasn't the only thing
    that could be called before dentry is attached to inode; with smack
    we needed the same treatment applied to ->setxattr() as well"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->setxattr() to passing dentry and inode separately
    switch xattr_handler->set() to passing dentry and inode separately
    restore killability of old mutex_lock_killable(&inode->i_mutex) users
    add down_write_killable_nested()
    update D/f/directory-locking

    Linus Torvalds
     
  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().

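    For callers the change looks roughly like this (a sketch):

        /* old convention: starting address on success, error value on failure */
        unsigned long addr = vm_brk(start, len);
        if (IS_ERR_VALUE(addr))
                goto err;

        /* new convention: 0 on success, negative error number on failure */
        int error = vm_brk(start, len);
        if (error)
                goto err;
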
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid() (which is __meminit).

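    The change is a one-word annotation; a sketch of the resulting
    signature (per mm/memory_hotplug.c of that era):

        void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
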
    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • When we have !NO_BOOTMEM, the deferred page struct initialization
    doesn't work well, because the pages reserved in bootmem are
    released to the page allocator unconditionally. This causes memory
    corruption and, eventually, a system crash.

    As Mel suggested, bootmem is slowly being retired. We fix the issue
    by simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

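    A sketch of the Kconfig guard, assuming it is expressed as a direct
    dependency on NO_BOOTMEM:

        config DEFERRED_STRUCT_PAGE_INIT
                bool "Defer initialisation of struct pages to kthreads"
                depends on NO_BOOTMEM
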
    Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Move the comments for get_mctgt_type() to be before get_mctgt_type()
    implementation.

    Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • mem_cgroup_margin() might return (memory.limit - memory_count) when
    the memsw.limit is in excess. This doesn't happen usually, because
    we do not allow excess on hard limits and (memory.limit <=
    memsw.limit), but __GFP_NOFAIL charges can force the charge and
    cause the excess when no memory is really swappable (swap is full
    or no anonymous memory is left).

    Signed-off-by: Li RongQing
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • pageblock_order can be (at least) an unsigned int or an unsigned long
    depending on the kernel config and architecture, so use max_t(unsigned
    long, ...) when comparing it.

    It fixes these warnings:

    In file included from include/asm-generic/bug.h:13:0,
    from arch/powerpc/include/asm/bug.h:127,
    from include/linux/bug.h:4,
    from include/linux/mmdebug.h:4,
    from include/linux/mm.h:8,
    from include/linux/memblock.h:18,
    from mm/cma.c:28:
    mm/cma.c: In function 'cma_init_reserved_mem':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
       (void) (&_max1 == &_max2);
                       ^
    mm/cma.c:186:27: note: in expansion of macro 'max'
       alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
                                ^
    mm/cma.c: In function 'cma_declare_contiguous':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
       (void) (&_max1 == &_max2);
                       ^
    include/linux/kernel.h:747:9: note: in definition of macro 'max'
       typeof(y) _max2 = (y);
                 ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
       (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                 ^
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
       (void) (&_max1 == &_max2);
                       ^
    include/linux/kernel.h:747:21: note: in definition of macro 'max'
       typeof(y) _max2 = (y);
                         ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
       (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                                 ^

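    The warnings disappear once both operands are forced to a common
    type, e.g.:

        alignment = PAGE_SIZE << max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
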
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • If page_move_anon_rmap() is refiling a pmd-split THP mapped in a
    tail page from a pte, the "address" must be THP-aligned in order
    for the page->index bugcheck to pass in CONFIG_DEBUG_VM=y builds.

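    Conceptually, the callers mask the address down to the huge page
    boundary before the call; a sketch (call-site details assumed, not
    the literal diff):

        if (PageTransCompound(page))
                address &= HPAGE_PMD_MASK;
        page_move_anon_rmap(page, vma, address);
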
    Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Mika Westerberg
    Tested-by: Mika Westerberg
    Reviewed-by: Andrea Arcangeli
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tetsuo has reported:
    Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
    Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
    sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
    sh cpuset=/ mems_allowed=0
    CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    dump_stack+0x85/0xc8
    dump_header+0x5b/0x394
    oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    In other words:

    __oom_reap_task                     exit_mm
      atomic_inc_not_zero
                                          tsk->mm = NULL
                                          mmput
                                            atomic_dec_and_test # > 0
                                          exit_oom_victim # New victim will be
                                                          # selected

                                        # no TIF_MEMDIE task so we can
                                        # select a new one
      unmap_page_range # to release the memory

    The race exists even without the oom_reaper, because anybody who
    pins the address space and gets preempted might race with exit_mm,
    but the oom_reaper made this race more probable.

    We can address the oom_reaper part by using oom_lock for
    __oom_reap_task, because this would guarantee that a new oom victim
    will not be selected if the oom reaper might race with the exit
    path. This doesn't solve the original issue, though, because
    somebody else still might be pinning mm_users and so __mmput won't
    be called to release the memory. But that is not really reliably
    solvable, because the task will get out of the OOM killer's sight
    as soon as it is unhashed from the task list, and so we cannot
    guarantee a new victim won't be selected.

    [akpm@linux-foundation.org: fix use of unused `mm', per Stephen]
    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • register_page_bootmem_info_node() is invoked in mem_init(), so it
    will be called before page_alloc_init_late() if
    DEFERRED_STRUCT_PAGE_INIT is enabled. But pfn_to_nid() depends on
    memmap, which won't be fully set up until page_alloc_init_late() is
    done, so replace pfn_to_nid() with early_pfn_to_nid().

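    The substitution in register_page_bootmem_info_node()'s section
    scan looks roughly like this (a sketch):

        /* memmap may not be fully initialized yet; avoid pfn_to_nid() */
        if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
                register_page_bootmem_info_section(pfn);
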
    Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • page_ext_init() checks suitable pages with pfn_to_nid(), but
    pfn_to_nid() depends on memmap, which will not be set up fully
    until page_alloc_init_late() is done. Use early_pfn_to_nid()
    instead of pfn_to_nid() so that page extension can still be used
    early, even when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, and
    catch early page allocation call sites.

    Suggested by Joonsoo Kim [1], this fix basically undoes the change
    introduced by commit b8f1a75d61d840 ("mm: call page_ext_init() after all
    struct pages are initialized") and fixes the same problem with a better
    approach.

    [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

    Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • If the current process is exiting, we don't invoke the oom killer;
    instead we give it access to memory reserves and try to reap its mm
    in case nobody is going to use it. There's a mistake in the code
    performing this check: we skip any process of the same thread
    group, no matter whether it is exiting or not - see
    try_oom_reaper. Fix it.

    Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Preparation for a similar switch in ->setxattr() (see the next
    commit for the rationale).

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2016

6 commits

  • Merge fixes from Andrew Morton:
    "10 fixes"

    * emailed patches from Andrew Morton :
    drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
    update "mm/zsmalloc: don't fail if can't create debugfs info"
    dma-debug: avoid spinlock recursion when disabling dma-debug
    mm: oom_reaper: remove some bloat
    memcg: fix mem_cgroup_out_of_memory() return value.
    ocfs2: fix improper handling of return errno
    mm: slub: remove unused virt_to_obj()
    mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
    mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
    seqlock: fix raw_read_seqcount_latch()

    Linus Torvalds
     
  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Some updates to commit d34f615720d1 ("mm/zsmalloc: don't fail if can't
    create debugfs info"):

    - add pr_warn to all stat failure cases
    - do not prevent module loading on stat failure

    Link: http://lkml.kernel.org/r/1463671123-5479-1-git-send-email-ddstreet@ieee.org
    Signed-off-by: Dan Streetman
    Reviewed-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • mem_cgroup_out_of_memory() returns "true" if it finds a TIF_MEMDIE
    task after an eligible task was found, and "false" if it finds a
    TIF_MEMDIE task before an eligible task is found.

    This difference confuses memory_max_write() which checks the return
    value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to
    continue looping, mem_cgroup_out_of_memory() should return "true" in
    this case.

    This patch sets a dummy pointer in order to return "true".

    Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable
    stackdepot for SLAB") added 'reserved' field, but never used it.

    Link: http://lkml.kernel.org/r/1464021054-2307-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
    requires some ordering wrt other initialization operations, e.g.
    page_ext_init has to happen after the whole memmap is initialized
    properly.

    For SPARSEMEM this requires waiting for page_alloc_init_late. Other
    memory models (e.g. flatmem) might have different initialization
    layouts (page_ext_init_flatmem). Currently
    DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG, which in turn

    depends on SPARSEMEM || X86_64_ACPI_NUMA
    depends on ARCH_ENABLE_MEMORY_HOTPLUG

    and X86_64_ACPI_NUMA depends on NUMA which in turn disable FLATMEM
    memory model:

    config ARCH_FLATMEM_ENABLE
            def_bool y
            depends on X86_32 && !NUMA

    so FLATMEM is ruled out via the dependency maze. Be explicit and
    disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not
    reintroduce subtle initialization bugs.

    [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz

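    The explicit guard, per the commit title (a sketch of the added
    dependency; the rest of the entry is abbreviated):

        config DEFERRED_STRUCT_PAGE_INIT
                bool "Defer initialisation of struct pages to kthreads"
                depends on MEMORY_HOTPLUG
                depends on !FLATMEM
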
    Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

24 May, 2016

8 commits

  • Merge yet more updates from Andrew Morton:

    - Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a
    kernel-based workaround for existing userspace issues.

    - A few hotfixes

    - befs cleanups

    - nilfs2 updates

    - sys_wait() changes

    - kexec updates

    - kdump

    - scripts/gdb updates

    - the last of the MM queue

    - a few other misc things

    * emailed patches from Andrew Morton : (84 commits)
    kgdb: depends on VT
    drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
    drm/radeon: make radeon_mn_get wait for mmap_sem killable
    drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
    uprobes: wait for mmap_sem for write killable
    prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
    exec: make exec path waiting for mmap_sem killable
    aio: make aio_setup_ring killable
    coredump: make coredump_wait wait for mmap_sem for write killable
    vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
    ipc, shm: make shmem attach/detach wait for mmap_sem killable
    mm, fork: make dup_mmap wait for mmap_sem for write killable
    mm, proc: make clear_refs killable
    mm: make vm_brk killable
    mm, elf: handle vm_brk error
    mm, aout: handle vm_brk failures
    mm: make vm_munmap killable
    mm: make vm_mmap killable
    mm: make mmap_sem for write waits killable for mm syscalls
    MAINTAINERS: add co-maintainer for scripts/gdb
    ...

    Linus Torvalds
     
  • Now that all the callers handle vm_brk failure, we can change it to
    wait for mmap_sem in a killable fashion, to help the oom_reaper
    avoid being blocked just because vm_brk is stuck behind mmap_sem
    readers.

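    The killable-wait pattern used across this series looks like this
    (a sketch of vm_brk's locking; surrounding details assumed):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;  /* the waiter was killed; bail out */
        ret = do_brk(addr, len);
        up_write(&mm->mmap_sem);
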
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Almost all current users of vm_munmap ignore the return value, so
    they do not handle potential errors, which means that some VMAs
    might stay behind. This patch doesn't try to solve those potential
    problems. Quite the contrary, it adds a new failure mode by using
    down_write_killable in vm_munmap. This should be safer than other
    failure modes, though, because the process is guaranteed to die as
    soon as it leaves the kernel, and exit_mmap will clean the whole
    address space.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Alexander Viro
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • All the callers of vm_mmap seem to check for failure already and
    bail out in one way or another on error, which means that we can
    change it to use the killable version of vm_mmap_pgoff and return
    -EINTR if the current task gets killed while waiting for mmap_sem.
    This also means that vm_mmap_pgoff can be killable by default and
    the additional parameter can be dropped.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error in the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem, because the address is not used anywhere and we never
    return to userspace if we get killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up work to oom_reaper [1]. As the async OOM
    killing depends on mmap_sem for read, we would really appreciate it
    if a holder for write didn't stand in the way. This patchset
    changes many of the down_write calls to be killable, to help those
    cases where the writer is blocked waiting for readers to release
    the lock, and so to help __oom_reap_task process the oom victim.

    Most of the patches are really trivial, because the lock is held in
    shallow syscall paths where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never reach
    userspace, as the task has a fatal signal pending). Others seem
    easy as well, as the callers are already handling fatal errors and
    bail out to userspace, which should be sufficient to handle the
    failure gracefully. I am not familiar with all those code paths,
    so a deeper review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here:

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable.
    It focuses on the trivial ones, which take the lock early after
    entering the syscall and do not change any state before doing so.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has
    many call sites via vm_mmap. To reduce the risk, keep vm_mmap with
    the original non-killable semantics for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_oom may be invoked multiple times while a process is
    handling a page fault, in which case current->memcg_in_oom will be
    overwritten, leaking the previously taken css reference.

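    A sketch of the guard, assuming the entry check in mem_cgroup_oom()
    (exact placement assumed):

        static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
        {
                /* Bail if an OOM context is already recorded for this
                 * fault; overwriting current->memcg_in_oom would leak
                 * the css reference taken for the earlier one. */
                if (!current->memcg_may_oom || current->memcg_in_oom)
                        return;
                /* ... record memcg, mask and order in current ... */
        }
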
    Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull drm updates from Dave Airlie:
    "Here's the main drm pull request for 4.7, it's been a busy one, and
    I've been a bit more distracted in real life this merge window. Lots
    more ARM drivers, not sure if it'll ever end. I think I've at least
    one more coming the next merge window.

    But changes are all over the place, support for AMD Polaris GPUs is in
    here, some missing GM108 support for nouveau (found in some Lenovos),
    a bunch of MST and skylake fixes.

    I've also noticed a few fixes from Arnd in my inbox, that I'll try and
    get in asap, but I didn't think they should hold this up.

    New drivers:
    - Hisilicon kirin display driver
    - Mediatek MT8173 display driver
    - ARC PGU - bitstreamer on Synopsys ARC SDP boards
    - Allwinner A13 initial RGB output driver
    - Analogix driver for DisplayPort IP found in exynos and rockchip

    DRM Core:
    - UAPI headers fixes and C++ safety
    - DRM connector reference counting
    - DisplayID mode parsing for Dell 5K monitors
    - Removal of struct_mutex from drivers
    - Connector registration cleanups
    - MST robustness fixes
    - MAINTAINERS updates
    - Lockless GEM object freeing
    - Generic fbdev deferred IO support

    panel:
    - Support for a bunch of new panels

    i915:
    - VBT refactoring
    - PLL computation cleanups
    - DSI support for BXT
    - Color manager support
    - More atomic patches
    - GEM improvements
    - GuC fw loading fixes
    - DP detection fixes
    - SKL GPU hang fixes
    - Lots of BXT fixes

    radeon/amdgpu:
    - Initial Polaris support
    - GPUVM/Scheduler/Clock/Power improvements
    - ASYNC pageflip support
    - New mesa feature support

    nouveau:
    - GM108 support
    - Power sensor support improvements
    - GR init + ucode fixes.
    - Use GPU provided topology information

    vmwgfx:
    - Add host messaging support

    gma500:
    - Some cleanups and fixes

    atmel:
    - Bridge support
    - Async atomic commit support

    fsl-dcu:
    - Timing controller for LCD support
    - Pixel clock polarity support

    rcar-du:
    - Misc fixes

    exynos:
    - Pipeline clock support
    - Exynos5433 SoC support
    - HW trigger mode support
    - export HDMI_PHY clock
    - DECON5433 fixes
    - Use generic prime functions
    - use DMA mapping APIs

    rockchip:
    - Lots of little fixes

    vc4:
    - Render node support
    - Gamma ramp support
    - DPI output support

    msm:
    - Mostly cleanups and fixes
    - Conversion to generic struct fence

    etnaviv:
    - Fix for prime buffer handling
    - Allow hangcheck to be coalesced with other wakeups

    tegra:
    - Gamma table size fix"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits)
    drm/edid: add displayid detailed 1 timings to the modelist. (v1.1)
    drm/edid: move displayid validation to it's own function.
    drm/displayid: Iterate over all DisplayID blocks
    drm/edid: move displayid tiled block parsing into separate function.
    drm: Nuke ->vblank_disable_allowed
    drm/vmwgfx: Report vmwgfx version to vmware.log
    drm/vmwgfx: Add VMWare host messaging capability
    drm/vmwgfx: Kill some lockdep warnings
    drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode
    drm/nouveau/core: recognise GM108 chipsets
    drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup
    drm/nouveau/gr/gk104-: share implementation of ppc exception init
    drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx
    drm/nouveau/bios/pll: check BIT table version before trying to parse it
    drm/nouveau/bios/pll: prevent oops when limits table can't be parsed
    drm/nouveau/volt/gk104: round up in gk104_volt_set
    drm/nouveau/fb/gm200: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gf100-: allocate mmu debug buffers
    drm/nouveau/fb: allow chipset-specific actions for oneinit()
    ...

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this update was stabilized before the merge window and
    appeared in -next. The "device dax" implementation was revised this
    week in response to review feedback, and to address failures detected
    by the recently expanded ndctl unit test suite.

    Not included in this pull request are two dax topic branches (dax
    error handling, and dax radix-tree locking). These topics were
    deferred to get a few more days of -next integration testing, and to
    coordinate a branch baseline with Ted and the ext4 tree. Vishal and
    Ross will send the error handling and locking topics respectively in
    the next few days.

    This branch has received a positive build result from the kbuild robot
    across 226 configs.

    Summary:

    - Device DAX for persistent memory: Device DAX is the device-centric
    analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
    ranges to be allocated and mapped without need of an intervening
    file system. Device DAX is strict, precise and predictable.
    Specifically this interface:

    a) Guarantees fault granularity with respect to a given page size
    (pte, pmd, or pud) set at configuration time.

    b) Enforces deterministic behavior by being strict about what
    fault scenarios are supported.

    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature
    differentiated memory ranges.

    - Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.

    - Further ACPI 6.1 compliance with support for the common dimm
    identifier format.

    - Various fixes and cleanups across the subsystem"

    * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
    libnvdimm, dax: fix deletion
    libnvdimm, dax: fix alignment validation
    libnvdimm, dax: autodetect support
    libnvdimm: release ida resources
    Revert "block: enable dax for raw block devices"
    /dev/dax, core: file operations and dax-mmap
    /dev/dax, pmem: direct access to persistent memory
    libnvdimm: stop requiring a driver ->remove() method
    libnvdimm, dax: record the specified alignment of a dax-device instance
    libnvdimm, dax: reserve space to store labels for device-dax
    libnvdimm, dax: introduce device-dax infrastructure
    nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
    tools/testing/nvdimm: ND_CMD_CALL support
    nfit: disable vendor specific commands
    nfit: export subsystem ids as attributes
    nfit: fix format interface code byte order per ACPI6.1
    nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
    nfit, libnvdimm: clarify "commands" vs "_DSMs"
    libnvdimm: increase max envelope size for ioctl
    acpi/nfit: Add sysfs "id" for NVDIMM ID
    ...

    Linus Torvalds
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly the special cases for the constant size, and it's actually almost
    never relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.

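    For example, a 4-byte copy that used to go through the
    constant-size special case reads better as (a sketch; uptr is a
    hypothetical user pointer):

        u32 val;

        /* rather than __copy_from_user_inatomic(&val, uptr, sizeof(val)) */
        if (__get_user(val, (u32 __user *)uptr))
                return -EFAULT;
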
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 May, 2016

4 commits

  • The "Device DAX" core enables dax mappings of performance / feature
    differentiated memory. An open mapping or file handle keeps the backing
    struct device live, but new mappings are only possible while the device
    is enabled. Faults are handled under rcu_read_lock to synchronize
    with the enabled state of the device.

    Similar to the filesystem-dax case the backing memory may optionally
    have struct page entries. However, unlike fs-dax there is no support
    for private mappings, or mappings that are not backed by media (see
    use of zero-page in fs-dax).

    Mappings are always guaranteed to match the alignment of the dax_region.
    If the dax_region is configured to have a 2MB alignment, all mappings
    are guaranteed to be backed by a pmd entry. Contrast this determinism
    with the fs-dax case where pmd mappings are opportunistic. If userspace
    attempts to force a misaligned mapping, the driver will fail the mmap
    attempt. See dax_dev_check_vma() for other scenarios that are rejected,
    like MAP_PRIVATE mappings.

    Cc: Hannes Reinecke
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Ross Zwisler
    Acked-by: "Paul E. McKenney"
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In addition to replacing the entry, we also clear all associated tags.
    This is really a one-off special for page_cache_tree_delete() which had
    far too much detailed knowledge about how the radix tree works.

    For efficiency, factor node_tag_clear() out of
    radix_tree_tag_clear(). It can be used by radix_tree_delete_item()
    as well as radix_tree_replace_clear_tags().

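    The new helper's shape, as described above (a sketch of the
    signature; details assumed):

        void *radix_tree_replace_clear_tags(struct radix_tree_root *root,
                                            unsigned long index, void *entry);
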
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • I've been receiving increasingly concerned notes from 0day about
    how much my recent changes have been bloating the radix tree. Make
    it happier by only including multiorder support if
    CONFIG_TRANSPARENT_HUGEPAGE is set.

    This is an independent Kconfig option, so other radix tree users can
    also set it if they have a need.

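    A sketch of the Kconfig wiring (option naming per the description;
    exact spelling assumed):

        config RADIX_TREE_MULTIORDER
                bool

        # users that need multiorder entries select it, e.g.:
        config TRANSPARENT_HUGEPAGE
                select RADIX_TREE_MULTIORDER
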
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Change the return type of zs_pool_stat_create() to void, and remove the
    logic to abort pool creation if the stat debugfs dir/file could not be
    created.

    The debugfs stat file is for debugging/information only, and doesn't
    affect operation of zsmalloc; there is no reason to abort creating the
    pool if the stat file can't be created. This was seen with zswap, which
    used the same name for all pool creations, which caused zsmalloc to fail
    to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled.

    Signed-off-by: Dan Streetman
    Reviewed-by: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman