16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

15 Jul, 2017

1 commit

  • Jörn Engel noticed that the expand_upwards() function might not return
    -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
    the architecture doesn't define TASK_SIZE as a multiple of PAGE_SIZE.

    Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
    which all define TASK_SIZE as 0xffffffff, but since none of those have
    an upwards-growing stack we currently have no actual issue.

    Nevertheless let's fix this just in case any of the architectures with
    an upward-growing stack (currently parisc, metag and partly ia64) defines
    TASK_SIZE similarly.
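
    Schematically, the hazard is plain unsigned wraparound; a hedged
    userspace illustration (not the kernel code itself):

    /* build: cc -o wrap wrap.c && ./wrap */
    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define TASK_SIZE 0xffffffffUL          /* not a multiple of PAGE_SIZE */

    int main(void)
    {
            unsigned long address = (unsigned long)-PAGE_SIZE;

            /* Unsigned addition wraps around to 0 ... */
            unsigned long next = address + PAGE_SIZE;

            /* ... so a "next >= TASK_SIZE" style limit check never fires. */
            printf("address=%#lx next=%#lx beyond TASK_SIZE? %d\n",
                   address, next, next >= TASK_SIZE);
            return 0;
    }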

    Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
    Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
    Signed-off-by: Helge Deller
    Reported-by: Jörn Engel
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     

13 Jul, 2017

5 commits

  • Currently the writeback statistics code uses percpu counters to hold
    various statistics. Furthermore, we have two families of functions: those
    which disable local irqs and those which don't, whose names begin
    with a double underscore. However, they both end up calling
    __add_wb_stat, which in turn calls percpu_counter_add_batch, which is
    already irq-safe.

    Exploiting this fact allows us to eliminate the __wb_* functions, since
    they don't add any further protection beyond what we already have.
    Furthermore, refactor the wb_* functions to call __add_wb_stat directly,
    without the irq-disabling dance. This will likely result in better
    runtime for code which modifies the stat counters.

    While at it, also document why percpu_counter_add_batch is in fact
    preempt- and irq-safe, since at least three people got confused.
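
    Schematically, the resulting shape is roughly as follows (helper names
    mirror the existing ones in include/linux/backing-dev.h; treat this as
    an approximation of the patch, not its exact diff):

    /* __add_wb_stat() already bottoms out in an irq-safe percpu primitive. */
    static inline void __add_wb_stat(struct bdi_writeback *wb,
                                     enum wb_stat_item item, s64 amount)
    {
            percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
    }

    /* ... so the wb_* wrappers can call it directly, without the
     * local_irq_save()/local_irq_restore() dance they used to do. */
    static inline void inc_wb_stat(struct bdi_writeback *wb,
                                   enum wb_stat_item item)
    {
            __add_wb_stat(wb, item, 1);
    }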

    Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Josef Bacik
    Cc: Mel Gorman
    Cc: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     
  • Page migration (for memory hotplug, soft_offline_page or mbind) needs to
    allocate new memory. This can trigger the OOM killer if the target
    memory is depleted. That is quite unlikely, but still possible,
    especially for memory hotplug (offlining of memory).

    Up to now we didn't really have reasonable means to back off.
    __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a
    single node, which is not suitable for all callers.

    But now that we have __GFP_RETRY_MAYFAIL we should use it. It is
    preferable to fail the migration rather than disrupt the system by
    killing some processes.

    Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
    request size we can drop the hackish implementation for !costly orders.
    __GFP_RETRY_MAYFAIL retries as long as the reclaim makes forward
    progress and backs off when we are out of memory for the requested size.
    Therefore we do not need to enforce __GFP_NORETRY for !costly orders just
    to silence the OOM killer anymore.

    Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT was designed to add a retry-but-eventually-fail semantic to
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests; they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to __GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misguided
    usage of the __GFP_REPEAT flag has been removed for !costly requests, we
    can give the original flag a better name and, more importantly, a more
    useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
    user that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the default
    allocator behavior. Page allocator users have several levels of guarantee
    vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick background reclaim. Should be used carefully because it might
    deplete the memory and the next user might hit the more aggressive
    reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non-sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already relied on this semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user-defined fallback
    behavior is more sensible than continuing to retry in the page allocator.
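
    From a caller's point of view the new flag looks roughly like this
    (kvmalloc() and the fallback here are only an illustration, not part of
    the patch):

    #include <linux/mm.h>

    static void *alloc_big_table(size_t size)
    {
            /*
             * Try hard, but let the request fail instead of invoking the
             * OOM killer; this caller has a cheaper fallback.
             */
            void *table = kvmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

            if (!table)
                    table = kvmalloc(PAGE_SIZE, GFP_KERNEL);
            return table;
    }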

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • With gcc 4.1.2:

    mm/memory.o: In function `create_huge_pmd':
    memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page'

    Interestingly, create_huge_pmd() is emitted in the assembler output, but
    never called.

    Converting transparent_hugepage_enabled() from a macro to a static
    inline function reduced the ability of the compiler to remove unused
    code.

    Fix this by marking create_huge_pmd() inline.

    Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()")
    Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

11 Jul, 2017

33 commits

  • The helper function get_wild_bug_type() does not need to be in global
    scope, so make it static.

    Cleans up sparse warning:

    "symbol 'get_wild_bug_type' was not declared. Should it be static?"

    Link: http://lkml.kernel.org/r/20170622090049.10658-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • They return a positive value, that is, true, if a non-zero value is
    found. Rename them to reduce confusion.

    Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop
    Signed-off-by: Joonsoo Kim
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • KASAN doesn't work with memory hotplug, because hotplugged memory
    doesn't have any shadow memory. So any access to hotplugged memory
    would cause a crash on the shadow check.

    Use a memory hotplug notifier to allocate and map shadow memory when the
    hotplugged memory is going online, and free the shadow after the memory
    is offlined.
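
    A rough skeleton of such a notifier is sketched below; the two shadow
    helpers are hypothetical placeholders for whatever KASAN uses to map
    and unmap shadow for the range:

    #include <linux/init.h>
    #include <linux/memory.h>
    #include <linux/memory_hotplug.h>
    #include <linux/notifier.h>

    /* Hypothetical stand-ins for the real shadow mapping code. */
    static int kasan_shadow_map(unsigned long start_pfn, unsigned long nr_pages)
    {
            /* placeholder: vmalloc and map shadow pages for the range */
            return 0;
    }

    static void kasan_shadow_unmap(unsigned long start_pfn, unsigned long nr_pages)
    {
            /* placeholder: free the shadow mapping for the range */
    }

    static int kasan_mem_notifier(struct notifier_block *nb,
                                  unsigned long action, void *data)
    {
            struct memory_notify *mem = data;

            switch (action) {
            case MEM_GOING_ONLINE:
                    /* Map shadow before the memory becomes usable. */
                    if (kasan_shadow_map(mem->start_pfn, mem->nr_pages))
                            return notifier_from_errno(-ENOMEM);
                    break;
            case MEM_OFFLINE:
            case MEM_CANCEL_ONLINE:
                    /* Tear the shadow down once the range goes away again. */
                    kasan_shadow_unmap(mem->start_pfn, mem->nr_pages);
                    break;
            }
            return NOTIFY_OK;
    }

    static int __init kasan_memhotplug_init(void)
    {
            hotplug_memory_notifier(kasan_mem_notifier, 0);
            return 0;
    }
    core_initcall(kasan_memhotplug_init);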

    Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: "H. Peter Anvin"
    Cc: Alexander Potapenko
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • For some unaligned memory accesses we have to check an additional byte
    of the shadow memory. Currently we load that byte speculatively to have
    only a single load + branch on the optimistic fast path.

    However, this approach has some downsides:

    - It's an unaligned access, so this prevents porting KASAN to
    architectures which don't support unaligned accesses.

    - We have to map additional shadow page to prevent crash if speculative
    load happens near the end of the mapped memory. This would
    significantly complicate upcoming memory hotplug support.

    I wasn't able to notice any performance degradation with this patch. So
    these speculative loads are just pain with no gain; let's remove them.

    Link: http://lkml.kernel.org/r/20170601162338.23540-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Mark Rutland
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • There is a missing optimization in zero_p4d_populate() that can save
    some memory when mapping the zero shadow. Implement it like the others.

    Link: http://lkml.kernel.org/r/1494829255-23946-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Andrey Ryabinin
    Cc: "Kirill A . Shutemov"
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
    ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
    zsmalloc from allocating an object of the maximal size
    (ZS_MAX_ALLOC_SIZE). I think, however, that the fix is needlessly
    complicated.

    This patch replaces the dynamic calculation of zs_size_classes at init
    time by a compile time calculation that uses the DIV_ROUND_UP() macro
    already used in get_size_class_index().

    [akpm@linux-foundation.org: use min_t]
    Link: http://lkml.kernel.org/r/20170630114859.1979-1-jmarchan@redhat.com
    Signed-off-by: Jerome Marchand
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Mahendran Ganesh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Andrey reported a potential deadlock with the memory hotplug lock and
    the cpu hotplug lock.

    The reason is that memory hotplug takes the memory hotplug lock and then
    calls stop_machine(), which calls get_online_cpus(). That's the reverse
    lock order to get_online_cpus(); get_online_mems(); in mm/slab_common.c.

    The problem has been there forever. The reason why this was never
    reported is that the cpu hotplug locking had this homebrewed recursive
    reader-writer semaphore construct which, due to the recursion, evaded
    full lockdep coverage. The memory hotplug code copied that construct
    verbatim and therefore has similar issues.

    Three steps to fix this:

    1) Convert the memory hotplug locking to a per-cpu rwsem so the
    potential issues get reported properly by lockdep.

    2) Lock the online cpus in mem_hotplug_begin() before taking the memory
    hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
    code to avoid recursive locking.

    3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
    hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
    by invoking lru_add_drain_all_cpuslocked() instead.
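
    Schematically, the begin/end helpers end up taking the two locks in a
    fixed order (a sketch; the percpu rwsem name is assumed here):

    #include <linux/cpu.h>
    #include <linux/percpu-rwsem.h>

    DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

    void mem_hotplug_begin(void)
    {
            get_online_cpus();                      /* cpu hotplug lock first */
            percpu_down_write(&mem_hotplug_lock);   /* then memory hotplug    */
    }

    void mem_hotplug_done(void)
    {
            percpu_up_write(&mem_hotplug_lock);
            put_online_cpus();
    }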

    Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
    Reported-by: Andrey Ryabinin
    Signed-off-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • The rework of the cpu hotplug locking unearthed potential deadlocks with
    the memory hotplug locking code.

    The solution for these is to rework the memory hotplug locking code as
    well and take the cpu hotplug lock before the memory hotplug lock in
    mem_hotplug_begin(), but this will cause a recursive locking of the cpu
    hotplug lock when the memory hotplug code calls lru_add_drain_all().

    Split out the inner workings of lru_add_drain_all() into
    lru_add_drain_all_cpuslocked() so this function can be invoked from the
    memory hotplug code with the cpu hotplug lock held.
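
    The split follows the usual *_cpuslocked pattern, roughly:

    #include <linux/cpu.h>

    /* Callers such as the memory hotplug path already hold the cpu hotplug
     * lock and call this variant directly. */
    void lru_add_drain_all_cpuslocked(void)
    {
            /* the original body of lru_add_drain_all() lives here */
    }

    void lru_add_drain_all(void)
    {
            get_online_cpus();
            lru_add_drain_all_cpuslocked();
            put_online_cpus();
    }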

    Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
    Signed-off-by: Thomas Gleixner
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Vladimir Davydov
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Use the rlimit() helper instead of manually writing the whole chain from
    the current task to rlim_cur.
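
    In other words (RLIMIT_NOFILE is chosen purely for illustration):

    #include <linux/sched/signal.h>         /* rlimit() */

    static unsigned long max_files(void)
    {
            /* open-coded variant this replaces:
             *   return current->signal->rlim[RLIMIT_NOFILE].rlim_cur;
             */
            return rlimit(RLIMIT_NOFILE);
    }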

    Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com
    Signed-off-by: Krzysztof Opasiak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Opasiak
     
  • list_lru_count_node() iterates over all memcgs to get the total number
    of entries on the node, but it can race with memcg_drain_all_list_lrus(),
    which migrates the entries from a dead cgroup to another. This can make
    list_lru_count_node() return an incorrect number of entries.

    Fix this by keeping track of entries per node and simply returning it in
    list_lru_count_node().
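
    Conceptually (a sketch of the patched helper, not the exact diff):

    /* struct list_lru_node in include/linux/list_lru.h gains a counter:
     *
     *         long    nr_items;    (per-node total, updated on add/del)
     */
    unsigned long list_lru_count_node(struct list_lru *lru, int nid)
    {
            struct list_lru_node *nlru = &lru->node[nid];

            /* No walk over memcgs, so no race with memcg_drain_all_list_lrus(). */
            return nlru->nr_items;
    }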

    Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Acked-by: Vladimir Davydov
    Cc: Jan Kara
    Cc: Alexander Polakov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sahitya Tummala
     
  • expand_stack(vma) fails if address < stack_guard_gap even if there is no
    vma->vm_prev. I don't think this makes sense, and we didn't do this
    before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
    between vmas").

    We do not need a gap in this case, any address is fine as long as
    security_mmap_addr() doesn't object.

    This also simplifies the code, we know that address >= prev->vm_end and
    thus underflow is not possible.

    Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
    introduced a regression in some Rust and Java environments which are
    trying to implement their own stack guard page. They are punching a new
    MAP_FIXED mapping inside the existing stack vma.

    This will confuse expand_{downwards,upwards} into thinking that the
    stack expansion would in fact get us too close to an existing non-stack
    vma which is a correct behavior wrt safety. It is a real regression on
    the other hand.

    Let's work around the problem by considering a PROT_NONE mapping as a
    part of the stack. This is a gross hack, but overflowing into such a
    mapping would trap anyway, and we can only hope that userspace knows
    what it is doing and handles it properly.

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Presently, pages in the balloon device have random values, and these
    pages will be scanned by ksmd on the host. They usually cannot be merged.
    Enqueueing zero pages will resolve this problem.

    Link: http://lkml.kernel.org/r/1498698637-26389-1-git-send-email-zhenwei.pi@youruncloud.com
    Signed-off-by: zhenwei.pi
    Cc: Gioh Kim
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhenwei.pi
     
  • The align_offset parameter is used by bitmap_find_next_zero_area_off()
    to represent the offset of map's base from the previous alignment
    boundary; the function ensures that the returned index, plus the
    align_offset, honors the specified align_mask.

    The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical
    address, not CMA region position") has the cma driver calculate the
    offset to the *next* alignment boundary. In most cases, the base
    alignment is greater than that specified when making allocations,
    resulting in a zero offset whether we align up or down. In the example
    given with the commit, the base alignment (8MB) was half the requested
    alignment (16MB) so the math also happened to work since the offset is
    8MB in both directions. However, when requesting allocations with an
    alignment greater than twice that of the base, the returned index would
    not be correctly aligned.

    Also, the align_order arguments of cma_bitmap_aligned_mask() and
    cma_bitmap_aligned_offset() should not be negative so the argument type
    was made unsigned.

    Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position")
    Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
    Signed-off-by: Angus Clark
    Signed-off-by: Doug Berger
    Acked-by: Gregory Fong
    Cc: Doug Berger
    Cc: Angus Clark
    Cc: Laura Abbott
    Cc: Vlastimil Babka
    Cc: Greg Kroah-Hartman
    Cc: Lucas Stach
    Cc: Catalin Marinas
    Cc: Shiraz Hashim
    Cc: Jaewon Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Berger
     
  • __remove_zone() sets up zone_type, but never uses it for anything.
    This does not cause a warning, due to the (necessary) use of
    -Wno-unused-but-set-variable. However, it's noise, so just delete it.

    Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
    Signed-off-by: John Hubbard
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • get_cpu_var() disables preemption and returns the per-CPU version of the
    variable. Disabling preemption is useful to ensure atomic access to the
    variable within the critical section.

    In this case, however, after the per-CPU version of the variable is
    obtained, the ->free_lock is acquired. For that reason it seems the raw
    accessor could be used. The only difference is that ->slots_ret should
    be retested (with preemption disabled it could not otherwise be set to
    NULL).

    This popped up during PREEMPT-RT testing because it tries to take
    spinlocks in a preempt disabled section. In RT, spinlocks can sleep.
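
    The shape of the change, roughly (function names are made up, the
    swp_slots per-CPU variable is assumed from mm/swap_slots.c, and the
    ->free_lock and ->slots_ret fields are the ones mentioned above):

    /* before: get_cpu_var() disables preemption around the whole section,
     * which hurts on RT where the spinlock below may sleep */
    static void cache_free_slot_old(swp_entry_t entry)
    {
            struct swap_slots_cache *cache = &get_cpu_var(swp_slots);

            spin_lock_irq(&cache->free_lock);
            /* ... stash entry in cache->slots_ret ... */
            spin_unlock_irq(&cache->free_lock);
            put_cpu_var(swp_slots);
    }

    /* after: raw per-CPU access; the lock alone serializes, but ->slots_ret
     * must be rechecked under the lock since we may have migrated CPUs */
    static void cache_free_slot_new(swp_entry_t entry)
    {
            struct swap_slots_cache *cache = raw_cpu_ptr(&swp_slots);

            spin_lock_irq(&cache->free_lock);
            if (cache->slots_ret) {
                    /* ... stash entry in cache->slots_ret ... */
            }
            spin_unlock_irq(&cache->free_lock);
    }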

    Link: http://lkml.kernel.org/r/20170623114755.2ebxdysacvgxzott@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: Tim Chen
    Cc: Thomas Gleixner
    Cc: Ying Huang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Since current_order starts as MAX_ORDER-1 and is then only decremented,
    the second half of the loop condition seems superfluous. However, if
    order is 0, we may decrement current_order past 0, making it UINT_MAX.
    This is obviously too subtle ([1], [2]).

    Since we need to add some comment anyway, change the two variables to
    signed, making the counting-down for loop look more familiar, and
    apparently also making gcc generate slightly smaller code.

    [1] https://lkml.org/lkml/2016/6/20/493
    [2] https://lkml.org/lkml/2017/6/19/345
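
    I.e., roughly (a sketch; the real loop lives in mm/page_alloc.c):

    #include <linux/mmzone.h>               /* MAX_ORDER */

    /* before: with an unsigned counter and order == 0, the final decrement
     * wraps to UINT_MAX, hence the extra clause:
     *
     *   for (current_order = MAX_ORDER - 1;
     *        current_order >= order && current_order <= MAX_ORDER - 1;
     *        --current_order)
     */

    /* after: both the requested order and the counter are signed, giving a
     * familiar counting-down loop with no wraparound to reason about */
    static void scan_free_orders(int order)
    {
            int current_order;

            for (current_order = MAX_ORDER - 1; current_order >= order;
                 --current_order) {
                    /* ... try to grab a free block of this order ... */
            }
    }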

    [akpm@linux-foundation.org: fix up reject fixupping]
    Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reported-by: Hao Lee
    Acked-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • pagetypeinfo_showmixedcount_print is found to take a lot of time to
    complete, and it does this holding the zone lock and disabling
    interrupts. In some cases it is found to take more than a second (on a
    2.4GHz, 8GB RAM, arm64 CPU).

    Avoid taking the zone lock, similarly to what is done by read_page_owner,
    which means results may be inaccurate.

    Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: zhongjiang
    Cc: Sergey Senozhatsky
    Cc: Sudip Mukherjee
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sebastian Andrzej Siewior
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • new_page is yet another duplication of the migration callback which has
    to handle hugetlb migration specially. We can safely use the generic
    new_page_nodemask for the same purpose.

    Please note that gigantic hugetlb pages do not need any special handling
    because alloc_huge_page_nodemask will make sure to check pages in all
    per-node pools. The reason this was done previously was that
    alloc_huge_page_node treated NUMA_NO_NODE and a specific node
    differently, and so alloc_huge_page_node(nid) would check only that
    specific node.

    Link: http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reported-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • alloc_huge_page_nodemask tries to allocate from any numa node in the
    allowed node mask starting from lower numa nodes. This might lead to
    filling up those low NUMA nodes while others are not used. We can
    reduce this risk by introducing a concept of the preferred node similar
    to what we have in the regular page allocator. We will start allocating
    from the preferred nid and then iterate over all allowed nodes in the
    zonelist order until we try them all.

    This mimics the page allocator logic except that it operates on per-node
    mempools. dequeue_huge_page_vma already does this, so distill the
    zonelist logic into a more generic dequeue_huge_page_nodemask and use it
    in alloc_huge_page_nodemask.

    This will allow us to use a proper per-numa-distance fallback also for
    alloc_huge_page_node, which can use alloc_huge_page_nodemask now, and we
    can get rid of the alloc_huge_page_node helper, which doesn't have any
    users anymore.

    Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "mm, hugetlb: allow proper node fallback dequeue".

    While working on a hugetlb migration issue addressed in a separate
    patchset[1] I have noticed that the hugetlb allocations from the
    preallocated pool are quite suboptimal.

    [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org

    There is no fallback mechanism implemented and no notion of a preferred
    node. I have tried to work around it, but Vlastimil was right to push
    back for a more robust solution. It seems that such a solution is to
    reuse the zonelist approach we use for the page allocator.

    This series has 3 patches. The first one tries to make hugetlb
    allocation layers more clear. The second one implements the zonelist
    hugetlb pool allocation and introduces a preferred node semantic which
    is used by the migration callbacks. The last patch is a clean up.

    This patch (of 3):

    Hugetlb allocation path for fresh huge pages is unnecessarily complex
    and it mixes different interfaces between layers.

    __alloc_buddy_huge_page is the central place to perform a new
    allocation. It checks for the hugetlb overcommit and then relies on
    __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is
    all good except that __alloc_buddy_huge_page pushes vma and address down
    the callchain, and so __hugetlb_alloc_buddy_huge_page has to deal with
    two different allocation modes - one for memory-policy requests and
    another for node-specific (or, to make it more obscure, node-non-specific)
    requests.

    This just screams for a reorganization.

    This patch pulls all the vma-specific handling up to
    __alloc_buddy_huge_page_with_mpol where it belongs.
    __alloc_buddy_huge_page will get a nodemask argument and
    __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
    page allocator.

    In short:
    __alloc_buddy_huge_page_with_mpol - memory policy handling
    __alloc_buddy_huge_page - overcommit handling and accounting
    __hugetlb_alloc_buddy_huge_page - page allocator layer

    Also note that the cpuset retry loop in __hugetlb_alloc_buddy_huge_page
    is not really needed because the page allocator already handles cpuset
    updates.

    Finally, __hugetlb_alloc_buddy_huge_page had a special case for
    node-specific allocations (when no policy is applied and there is a node
    given). This relied on __GFP_THISNODE to not fall back to a different
    node. alloc_huge_page_node is the only caller which relies on this
    behavior, so move the __GFP_THISNODE there.

    Not only does this remove quite a bit of code, it should also make those
    layers easier to follow and clearer with respect to responsibilities.

    Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • During the debugging of the problem described in
    https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
    https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
    output is not really useful to understand issues related to the oom
    reaper.

    So I assume that adding some tracepoints might help with the debugging
    of similar issues.

    Trace the following events:
    1) a process is marked as an oom victim,
    2) a process is added to the oom reaper list,
    3) the oom reaper starts reaping process's mm,
    4) the oom reaper finished reaping,
    5) the oom reaper skips reaping.

    How does it work in practice? Below is an example which shows how the
    problem mentioned above can be found: one process is added twice to the
    oom_reaper list:

    $ cd /sys/kernel/debug/tracing
    $ echo "oom:mark_victim" > set_event
    $ echo "oom:wake_reaper" >> set_event
    $ echo "oom:skip_task_reaping" >> set_event
    $ echo "oom:start_task_reaping" >> set_event
    $ echo "oom:finish_task_reaping" >> set_event
    $ cat trace_pipe
    allocate-502 [001] .... 91.836405: mark_victim: pid=502
    allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
    allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
    oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
    oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
    oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502

    Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
    monitor. The monitor has to stop handling #PF events in the range being
    freed. We are reusing userfaultfd_remove callback along with the logic
    required to re-get and re-validate the VMA which may change or disappear
    because userfaultfd_remove releases mmap_sem.

    Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The condition checking for a THP straddling the end of the invalidated
    range is wrong - it checks 'index' against 'end' but 'index' has already
    been advanced to point to the end of the THP, and thus the condition can
    never be true. As a result a THP straddling 'end' has been fully
    invalidated. Given the nature of invalidate_mapping_pages(), this could
    only be a performance issue. In fact, we are lucky the condition is
    wrong, because if it was ever true we'd leave a locked page behind.

    Fix the condition checking for THP straddling 'end' and also properly
    unlock the page. Also update the comment before the condition to
    explain why we decide not to invalidate the page as it was not clear to
    me and I had to ask Kirill.

    Link: http://lkml.kernel.org/r/20170619124723.21656-1-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The hugetlb code has its own function to report human-readable sizes.
    Convert it to use the shared string_get_size() function. This will lead
    to a minor difference in user visible output (MiB/GiB instead of MB/GB),
    but some would argue that's desirable anyway.
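
    For reference, the shared helper's calling convention looks roughly like
    this (the surrounding function and message are only an illustration):

    #include <linux/hugetlb.h>
    #include <linux/string_helpers.h>

    static void report_hstate(struct hstate *h)
    {
            char buf[32];

            /* base-2 units: "2.00 MiB", "1.00 GiB", ... */
            string_get_size(huge_page_size(h), 1, STRING_UNITS_2,
                            buf, sizeof(buf));
            pr_info("HugeTLB registered %s page size\n", buf);
    }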

    Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: Liam R. Howlett
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Cc: Gerald Schaefer
    Cc: zhong jiang
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Alice has reported the following UBSAN splat:

    UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
    signed integer overflow:
    -2147483644 - 2147483525 cannot be represented in type 'long int'
    CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P O 4.9.25-gentoo #4
    Hardware name: XXXXXX, BIOS YYYYYY
    Call Trace:
    dump_stack+0x59/0x87
    ubsan_epilogue+0xe/0x40
    handle_overflow+0xbb/0xf0
    __ubsan_handle_sub_overflow+0x12/0x20
    memcg_check_events.isra.36+0x223/0x360
    mem_cgroup_commit_charge+0x55/0x140
    wp_page_copy+0x34e/0xb80
    do_wp_page+0x1e6/0x1300
    handle_mm_fault+0x88b/0x1990
    __do_page_fault+0x2de/0x8a0
    do_page_fault+0x1a/0x20
    error_code+0x67/0x6c

    The reason is that we subtract two signed types. Let's fix this by
    truly mimicking time_after() and casting the result of the subtraction.
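
    For reference, the time_after() trick is to do the subtraction on the
    unsigned values (where wraparound is well defined) and only then cast to
    signed to recover the sign of the distance; value_after() below is a
    made-up name, with both arguments expected to be unsigned long:

    /* true if a is "after" b, even across wraparound; the unsigned
     * subtraction cannot overflow, unlike a direct signed subtraction */
    #define value_after(a, b)       ((long)((b) - (a)) < 0)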

    Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Alice Ferrazzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • A few hugetlb allocators loop while calling the page allocator and can
    potentially prevent rescheduling if the page allocator slowpath is not
    utilized.

    Conditionally schedule when large numbers of hugepages can be allocated.

    Anshuman:
    "Fixes a task which was getting hung while writing like 10000 hugepages
    (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
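
    Conditionally rescheduling here means something along these lines inside
    the mm/hugetlb.c allocation loop (schematic; preallocate_pool() is an
    illustrative name and alloc_fresh_huge_page() is the existing internal
    helper):

    static void preallocate_pool(struct hstate *h)
    {
            unsigned long i;

            for (i = 0; i < h->max_huge_pages; ++i) {
                    if (!alloc_fresh_huge_page(h, &node_states[N_MEMORY]))
                            break;
                    /* yield between potentially long-running allocations */
                    cond_resched();
            }
    }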

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Mike Kravetz
    Tested-by: Anshuman Khandual
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") has duplicated a large part of
    alloc_migrate_target with some hotplug-specific special casing.

    To be more precise, it tried to enforce the allocation from a different
    node than that of the original page. As a result the two functions
    diverged in their shared logic, e.g. the hugetlb allocation strategy.

    Let's unify the two and express different NUMA requirements by the given
    nodemask. new_node_page will simply exclude the node it doesn't care
    about and alloc_migrate_target will use all the available nodes.
    alloc_migrate_target will then learn to migrate hugetlb pages more
    sanely and use the preallocated pool when possible.

    Please note that alloc_migrate_target used to call alloc_page (resp.
    alloc_pages_current), and thus used the memory policy of the current
    context, which is quite strange when we consider that it is used in the
    context of alloc_contig_range, which just tries to migrate pages which
    stand in the way.

    Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • new_node_page will try to use the origin's next NUMA node as the
    migration destination for hugetlb pages. If such a node doesn't have
    any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
    to allocate a surplus page instead. This is quite suboptimal for any
    configuration where hugetlb pages are not distributed to all NUMA nodes
    evenly. Say we have a hotpluggable node 4 and the spare hugetlb pages
    are on node 0:

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
    /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

    Now we consume the whole pool on node 4 and try to offline this node.
    All the allocated pages should be moved to node0 which has enough
    preallocated pages to hold them. With the current implementation
    offlining very likely fails because hugetlb allocations during runtime
    are much less reliable.

    Fix this by reusing the nodemask which excludes the migration source,
    trying to find a node which has a page in the preallocated pool first,
    and falling back to __alloc_buddy_huge_page_no_mpol only when the whole
    pool is consumed.

    [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
    Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • new_node_page tries to allocate the target page on a different NUMA node
    than the source page. This makes sense in most cases during hotplug
    because we are likely to offline the whole numa node. But there are
    cases where there is no other node to fall back to (e.g. when offlining
    parts of the only existing node) and we have to fall back to allocating
    from the source node. The current code does that, but it can be
    simplified by checking the nmask and updating it before we even try to
    allocate, rather than special-casing it.

    This patch shouldn't introduce any functional change.
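
    Roughly, the callback can build its nodemask up front (a sketch; the
    tail call uses the new_page_nodemask() helper that the rest of this
    series introduces):

    #include <linux/migrate.h>
    #include <linux/mm.h>
    #include <linux/nodemask.h>

    static struct page *new_node_page(struct page *page, unsigned long private,
                                      int **result)
    {
            int nid = page_to_nid(page);
            nodemask_t nmask = node_states[N_MEMORY];

            /* Prefer any node but the one being offlined ... */
            node_clear(nid, nmask);

            /* ... unless it is the only node with memory. */
            if (nodes_empty(nmask))
                    node_set(nid, nmask);

            return new_page_nodemask(page, nid, &nmask);
    }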

    Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: zhong jiang
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The movable_node kernel parameter makes hotpluggable NUMA nodes put all
    their hotpluggable memory into the movable zone, which allows more or
    less reliable memory hotremove. At least this is the case for the NUMA
    nodes present during boot (see find_zone_movable_pfns_for_nodes).

    This is not the case for memory hotplug, though.

    echo online > /sys/devices/system/memory/memoryXYZ/state

    will default to a kernel zone (usually ZONE_NORMAL) unless the
    particular memblock is already in the movable zone range, which is
    normally not the case when onlining the memory from the udev rule
    context for a freshly hotadded NUMA node. The only option currently is
    to have a special udev rule to echo online_movable to all memblocks
    belonging to such a node, which is rather clumsy. Not to mention that
    this is also inconsistent, because what ended up in the movable zone
    during boot will end up in a kernel zone after hotremove & hotadd
    without special care.

    It would be nice to reuse memblock_is_hotpluggable, but the runtime
    hotplug doesn't have that information available, because the boot and
    hotplug paths are not shared and it would be really non-trivial to make
    them use the same code path, since the runtime hotplug doesn't play
    with the memblock allocator at all.

    Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
    movable_node is enabled and the range doesn't overlap with the existing
    normal zone. This should provide a reasonable default onlining
    strategy.

    Strictly speaking the semantic is not identical to the boot time
    initialization, because find_zone_movable_pfns_for_nodes covers only the
    hotpluggable range as described by the BIOS/FW. From my experience this
    is usually a full node though (except for Node0, which is special and
    never goes away completely). If this turns out to be a problem in real
    life we can tweak the code to store the hotplug flag into memblocks, but
    let's keep this simple for now.

    Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc:
    Cc: Kani Toshimitsu
    Cc:
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When migrating a transparent hugepage, migrate_misplaced_transhuge_page
    guards itself against a concurrent fastgup of the page by checking that
    the page count is equal to 2 before and after installing the new pmd.

    If the page count changes, then the pmd is reverted back to the original
    entry, however there is a small window where the new (possibly writable)
    pmd is installed and the underlying page could be written by userspace.
    Restoring the old pmd could therefore result in loss of data.

    This patch fixes the problem by freezing the page count whilst updating
    the page tables, which protects against a concurrent fastgup without the
    need to restore the old pmd in the failure case (since the page count
    can no longer change under our feet).

    Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Will Deacon
    Acked-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Mark Rutland
    Cc: Steve Capper
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • When the user specifies too many hugepages or an invalid
    default_hugepagesz the communication to the user is implicit in the
    allocation message. This patch adds a warning when the desired page
    count is not allocated and prints an error when the default_hugepagesz
    is invalid on boot.

    During boot, hugepages will be allocated until there is a fraction of
    the hugepage size left. That is, we allocate until either the request is
    satisfied or memory for the pages is exhausted. When memory for the
    pages is exhausted, it will most likely lead to the system failing with
    the OOM manager not finding enough (or anything) to kill (unless you're
    using really big hugepages in the order of 100s of MB or in the GBs).
    The user will most likely see the OOM messages much later in the boot
    sequence than the implicitly stated message. Worse yet, you may even
    get an OOM for each processor which causes many pages of OOMs on modern
    systems. Although these messages will be printed earlier than the OOM
    messages, at least giving the user errors and warnings will highlight
    the configuration as an issue. I'm trying to point the user in the
    right direction by providing a more robust statement of what is failing.

    During the sysctl or echo command, the user can check the results much
    more easily than if the system hangs during boot, and the scenario of
    having nothing to OOM for kernel memory is highly unlikely.

    Mike said:
    "Before sending out this patch, I asked Liam off list why he was doing
    it. Was it something he just thought would be useful? Or, was there
    some type of user situation/need. He said that he had been called in
    to assist on several occasions when a system OOMed during boot. In
    almost all of these situations, the user had grossly misconfigured
    huge pages.

    DB users want to pre-allocate just the right amount of huge pages, but
    sometimes they can be really off. In such situations, the huge page
    init code just allocates as many huge pages as it can and reports the
    number allocated. There is no indication that it quit allocating
    because it ran out of memory. Of course, a user could compare the
    number in the message to what they requested on the command line to
    determine if they got all the huge pages they requested. The thought
    was that it would be useful to at least flag this situation. That way,
    the user might be able to better relate the huge page allocation
    failure to the OOM.

    I'm not sure if the e-mail discussion made it obvious that this is
    something he has seen on several occasions.

    I see Michal's point that this will only flag the situation where
    someone configures huge pages very badly. And, a more extensive look
    at the situation of misconfiguring huge pages might be in order. But,
    this has happened on several occasions which led to the creation of
    this patch"

    [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
    Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
    Signed-off-by: Liam R. Howlett
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Cc: Gerald Schaefer
    Cc: zhongjiang
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liam R. Howlett