13 May, 2017

8 commits

  • Although there were plenty of free swap and anonymous LRU pages in
    eligible zones, OOM happened.

    balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
    CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x21d/0x3f0
    out_of_memory+0xd8/0x390
    __alloc_pages_slowpath+0xbc1/0xc50
    __alloc_pages_nodemask+0x1a5/0x1c0
    pte_alloc_one+0x20/0x50
    __pte_alloc+0x1e/0x110
    __handle_mm_fault+0x919/0x960
    handle_mm_fault+0x77/0x120
    __do_page_fault+0x27a/0x550
    trace_do_page_fault+0x43/0x150
    do_async_page_fault+0x2c/0x90
    async_page_fault+0x28/0x30
    Mem-Info:
    active_anon:424716 inactive_anon:65314 isolated_anon:0
    active_file:52 inactive_file:46 isolated_file:0
    unevictable:0 dirty:27 writeback:0 unstable:0
    slab_reclaimable:3967 slab_unreclaimable:4125
    mapped:133 shmem:43 pagetables:1674 bounce:0
    free:4637 free_pcp:225 free_cma:0
    Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 992 992 1952
    DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 959
    Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
    DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
    Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
    378 total pagecache pages
    17 pages in swap cache
    Swap cache stats: add 17325, delete 17302, find 0/27
    Free swap = 978940kB
    Total swap = 1048572kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    12629 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned
    [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
    [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
    [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd

    Investigation showed that the page skipping in isolate_lru_pages makes
    reclaim futile: it easily returns a zero nr_taken, so LRU shrinking
    accomplishes effectively nothing and just escalates the reclaim priority
    aggressively. Finally, OOM happens.

    The problem is that get_scan_count determines nr_to_scan based on
    eligible zones only, so even when the priority drops to zero, reclaim
    cannot make progress if the LRU consists mostly of ineligible pages.

    get_scan_count:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    size = size >> sc->priority;

    Assume sc->priority is 0 and the LRU list looks as follows.

    N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

    (i.e., a few eligible pages sit at the head of the LRU while the rest
    are mostly ineligible pages)

    In that case, size becomes 4, so the VM wants to scan 4 pages, but the 4
    pages taken from the tail of the LRU are not eligible. If those skipped
    pages consume the scan budget, nothing is reclaimed after scanning 4
    pages, and the system ends up hitting OOM.

    This patch makes isolate_lru_pages keep scanning until it encounters
    pages from eligible zones.
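
    As a rough illustration (a simplified sketch, not the exact kernel diff),
    the difference is whether skipped ineligible pages consume the scan
    budget:

    /*
     * Simplified sketch of the isolate_lru_pages() scan loop; the names
     * mirror the kernel but the body is illustrative only.
     */
    for (total_scan = 0, scan = 0; scan < nr_to_scan && !list_empty(src);
         total_scan++) {
            struct page *page = lru_to_page(src);

            if (page_zonenum(page) > sc->reclaim_idx) {
                    /* Ineligible: park it, don't burn the scan budget. */
                    list_move(&page->lru, &pages_skipped);
                    continue;
            }

            /* Only pages from eligible zones count as "scanned". */
            scan++;
            if (__isolate_lru_page(page, mode) == 0) {
                    nr_taken += hpage_nr_pages(page);
                    list_move(&page->lru, dst);
            }
    }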

    [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
    Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
    Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have encountered need_resched warnings in __collapse_huge_page_copy()
    while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

    mm->mmap_sem is held for write, but the iteration is well bounded.

    Reschedule as needed.
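
    A minimal sketch of the idea (the per-page helper below is a hypothetical
    stand-in, not the actual kernel code):

    /* __collapse_huge_page_copy() iterates over HPAGE_PMD_NR source pages. */
    for (i = 0; i < HPAGE_PMD_NR; i++) {
            copy_or_clear_one_subpage(i);   /* stands in for the
                                             * {clear,copy}_user_highpage() calls */
            cond_resched();                 /* mmap_sem is held for write, but
                                             * the iteration is well bounded */
    }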

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Currently we don't invalidate page tables during invalidate_inode_pages2()
    for DAX. That can result in, e.g., a 2MiB zero page staying mapped in
    the page tables while underlying blocks have already been allocated, so
    data seen through mmap differs from data seen by read(2). The following
    sequence reproduces the problem (a minimal userspace sketch follows the
    list):

    - open an mmap over a 2MiB hole

    - read from a 2MiB hole, faulting in a 2MiB zero page

    - write to the hole with write(3p). The write succeeds but we
    incorrectly leave the 2MiB zero page mapping intact.

    - via the mmap, read the data that was just written. Since the zero
    page mapping is still intact we read back zeroes instead of the new
    data.
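
    A minimal userspace sketch of that sequence (hypothetical mount point and
    file name, error handling omitted; this is not Ross's t_mmap_stale test):

    /* dax_stale_read.c - build with: cc -o dax_stale_read dax_stale_read.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SZ (2UL << 20)                          /* 2MiB */

    int main(void)
    {
            int fd = open("/mnt/dax/testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
            char buf[8] = "newdata";
            char *p;

            ftruncate(fd, SZ);                      /* file is one 2MiB hole */
            p = mmap(NULL, SZ, PROT_READ, MAP_SHARED, fd, 0);
            printf("before write: %d\n", p[0]);     /* faults in the zero page */
            pwrite(fd, buf, sizeof(buf), 0);        /* allocates real blocks */
            printf("after write:  %c\n", p[0]);     /* buggy kernel still reads 0 */
            munmap(p, SZ);
            close(fd);
            return 0;
    }

    On a fixed kernel the second read through the mapping sees the newly
    written data.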

    Fix the problem by unconditionally calling invalidate_inode_pages2_range()
    in dax_iomap_actor() for new block allocations and by properly
    invalidating page tables in invalidate_inode_pages2_range() for DAX
    mappings.

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
    v4.

    This series fixes data corruption that can happen for DAX mounts when
    page faults race with write(2) and as a result page tables get out of
    sync with block mappings in the filesystem and thus data seen through
    mmap is different from data seen through read(2).

    The series passes testing with t_mmap_stale test program from Ross and
    also other mmap related tests on DAX filesystem.

    This patch (of 4):

    dax_invalidate_mapping_entry() currently removes DAX exceptional entries
    only if they are clean and unlocked. This is done via:

    invalidate_mapping_pages()
    invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

    However, for page cache pages removed in invalidate_mapping_pages()
    there is an additional criterion: the page must not be mapped. This is
    noted in the comments above invalidate_mapping_pages()
    and is checked in invalidate_inode_page().

    For DAX entries this means that we can end up in a situation where a
    DAX exceptional entry, either a huge zero page or a regular DAX entry,
    is mapped but has no associated radix tree entry. This is
    inconsistent with the rest of the DAX code and with what happens in the
    page cache case.

    We aren't able to unmap the DAX exceptional entry because according to
    its comments invalidate_mapping_pages() isn't allowed to block, and
    unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

    Since we essentially never have unmapped DAX entries to evict from the
    radix tree, just remove dax_invalidate_mapping_entry().

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Reported-by: Jan Kara
    Cc: Dan Williams
    Cc: [4.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Commit 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
    pulled an asm/pgtable.h include dependency into linux/vmalloc.h, and that
    turned out to be a bad idea for some architectures. E.g. m68k fails
    with

    In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
    from arch/m68k/include/asm/pgtable.h:4,
    from include/linux/vmalloc.h:9,
    from arch/m68k/kernel/module.c:9:
    arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
    >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
    #define pgd_offset_k(address) pgd_offset(&init_mm, address)

    as spotted by the kernel build bot. nios2 fails for a different reason:

    In file included from include/asm-generic/io.h:767:0,
    from arch/nios2/include/asm/io.h:61,
    from include/linux/io.h:25,
    from arch/nios2/include/asm/pgtable.h:18,
    from include/linux/mm.h:70,
    from include/linux/pid_namespace.h:6,
    from include/linux/ptrace.h:9,
    from arch/nios2/include/uapi/asm/elf.h:23,
    from arch/nios2/include/asm/elf.h:22,
    from include/linux/elf.h:4,
    from include/linux/module.h:15,
    from init/main.c:16:
    include/linux/vmalloc.h: In function '__vmalloc_node_flags':
    include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?

    which is due to the newly added #include <asm/pgtable.h>, which on nios2
    includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h>, which
    again includes <linux/vmalloc.h>.

    Tweaking that around just turns out to be a bigger headache than necessary.
    This patch reverts 1f5307b1e094 and reimplements the original fix in a
    different way. __vmalloc_node_flags can stay static inline, which
    covers the vmalloc* functions. We only have one external user
    (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
    provide the caller directly. This is much simpler and it doesn't really
    need any games with header files.
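
    Roughly, the shape of the fix (a sketch, not the verbatim patch) is that
    vmalloc.c keeps the inline helper that records its own caller and gains
    an exported variant taking the caller explicitly:

    /* mm/vmalloc.c (sketch) */
    static inline void *__vmalloc_node_flags(unsigned long size, int node,
                                             gfp_t flags)
    {
            return __vmalloc_node(size, 1, flags, PAGE_KERNEL, node,
                                  __builtin_return_address(0));
    }

    void *__vmalloc_node_flags_caller(unsigned long size, int node,
                                      gfp_t flags, void *caller)
    {
            return __vmalloc_node(size, 1, flags, PAGE_KERNEL, node, caller);
    }

    /* mm/util.c (sketch): kvmalloc_node() then passes its own caller. */
    return __vmalloc_node_flags_caller(size, node, flags,
                                       __builtin_return_address(0));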

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: revert old comment]
    Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
    Fixes: 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
    Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Tobias Klauser
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • One return path of `__collapse_huge_page_swapin()` does not invoke the
    tracepoint while every other return path does. This commit adds a
    tracepoint invocation for that case.

    Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
    Signed-off-by: SeongJae Park
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
    zoneinfo"), /proc/zoneinfo will show unpopulated zones.

    A memoryless node, having no populated zones at all, was previously
    ignored, but will now trigger the WARN() in is_zone_first_populated().

    Remove this warning, as its only purpose was to warn of a situation that
    has since been enabled.

    Aside: The "per-node stats" are still printed under the first populated
    zone, but that's not necessarily the first stanza any more. I'm not
    sure which criterion is more important with regard to not breaking
    parsers, but it looks a little weird to the eye.

    Fixes: e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
    Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpoison pages
    and returning them to the page allocator makes little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

12 May, 2017

1 commit

  • Pull more arm64 updates from Catalin Marinas:

    - Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is
    enabled. This requires a check for __GFP_NOWARN in alloc_vmap_area()

    - Improve/sanitise user tagged pointers handling in the kernel

    - Inline asm fixes/cleanups

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: Silence first allocation with CONFIG_ARM64_MODULE_PLTS=y
    ARM: Silence first allocation with CONFIG_ARM_MODULE_PLTS=y
    mm: Silence vmap() allocation failures based on caller gfp_flags
    arm64: uaccess: suppress spurious clang warning
    arm64: atomic_lse: match asm register sizes
    arm64: armv8_deprecated: ensure extension of addr
    arm64: uaccess: ensure extension of access_ok() addr
    arm64: ensure extension of smp_store_release value
    arm64: xchg: hazard against entire exchange variable
    arm64: documentation: document tagged pointer stack constraints
    arm64: entry: improve data abort handling of tagged pointers
    arm64: hw_breakpoint: fix watchpoint matching for tagged pointers
    arm64: traps: fix userspace cache maintenance emulation on a tagged pointer

    Linus Torvalds
     

11 May, 2017

2 commits

  • If the caller has set __GFP_NOWARN, don't print the following message:

    vmap allocation for size 15736832 failed: use vmalloc=<size> to
    increase size.

    This can happen with the ARM/Linux or ARM64/Linux module loader built
    with CONFIG_ARM{,64}_MODULE_PLTS=y which does a first attempt at loading
    a large module from module space, then falls back to vmalloc space.
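
    The mm side boils down to honouring the flag before warning; roughly (a
    sketch, not the verbatim diff):

    /* alloc_vmap_area(): respect __GFP_NOWARN from the caller. */
    if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
            pr_warn("vmap allocation for size %lu failed: "
                    "use vmalloc=<size> to increase size\n", size);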

    Acked-by: Michal Hocko
    Signed-off-by: Florian Fainelli
    Signed-off-by: Catalin Marinas

    Florian Fainelli
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

10 May, 2017

1 commit


09 May, 2017

22 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - various misc things

    - procfs updates

    - lib/ updates

    - checkpatch updates

    - kdump/kexec updates

    - add kvmalloc helpers, use them

    - time helper updates for Y2038 issues. We're almost ready to remove
    current_fs_time() but that awaits a btrfs merge.

    - add tracepoints to DAX

    * emailed patches from Andrew Morton : (114 commits)
    drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
    selftests/vm: add a test for virtual address range mapping
    dax: add tracepoint to dax_insert_mapping()
    dax: add tracepoint to dax_writeback_one()
    dax: add tracepoints to dax_writeback_mapping_range()
    dax: add tracepoints to dax_load_hole()
    dax: add tracepoints to dax_pfn_mkwrite()
    dax: add tracepoints to dax_iomap_pte_fault()
    mtd: nand: nandsim: convert to memalloc_noreclaim_*()
    treewide: convert PF_MEMALLOC manipulations to new helpers
    mm: introduce memalloc_noreclaim_{save,restore}
    mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
    mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
    mm/huge_memory.c: use zap_deposited_table() more
    time: delete CURRENT_TIME_SEC and CURRENT_TIME
    gfs2: replace CURRENT_TIME with current_time
    apparmorfs: replace CURRENT_TIME with current_time()
    lustre: replace CURRENT_TIME macro
    fs: ubifs: replace CURRENT_TIME_SEC with current_time
    fs: ufs: use ktime_get_real_ts64() for birthtime
    ...

    Linus Torvalds
     
  • The previous patch ("mm: prevent potential recursive reclaim due to
    clearing PF_MEMALLOC") has shown that simply setting and clearing
    PF_MEMALLOC in current->flags can result in wrongly clearing a
    pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
    Let's introduce helpers that support proper nesting by saving the
    previous state of the flag, similar to the existing memalloc_noio_* and
    memalloc_nofs_* helpers. Convert existing setting/clearing of
    PF_MEMALLOC within mm to the new helpers.

    There are no known issues with the converted code, but the change makes
    it more robust.
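
    The helpers follow the memalloc_noio_*/memalloc_nofs_* pattern; roughly
    (a sketch, not necessarily the verbatim kernel source):

    static inline unsigned int memalloc_noreclaim_save(void)
    {
            unsigned int flags = current->flags & PF_MEMALLOC;

            current->flags |= PF_MEMALLOC;
            return flags;
    }

    static inline void memalloc_noreclaim_restore(unsigned int flags)
    {
            current->flags = (current->flags & ~PF_MEMALLOC) | flags;
    }

    /* Typical usage in a reclaim-sensitive path: */
    noreclaim_flag = memalloc_noreclaim_save();
    /* ... work that must not recurse into reclaim ... */
    memalloc_noreclaim_restore(noreclaim_flag);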

    Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrey Ryabinin
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "more robust PF_MEMALLOC handling"

    This series aims to unify the setting and clearing of PF_MEMALLOC, which
    prevents recursive reclaim. There are some places that clear the flag
    unconditionally from current->flags, which may result in clearing a
    pre-existing flag. This already resulted in a bug report that Patch 1
    fixes (without the new helpers, to make backporting easier). Patch 2
    introduces the new helpers, modelled after existing memalloc_noio_* and
    memalloc_nofs_* helpers, and converts mm core to use them. Patches 3
    and 4 convert non-mm code.

    This patch (of 4):

    __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
    during page migration by lock_page() (see the comment in
    __unmap_and_move()). Then it unconditionally clears the flag, which can
    clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
    This was not a problem until commit a8161d1ed609 ("mm, page_alloc:
    restructure direct compaction handling in slowpath"), because direct
    compaction was called only after direct reclaim, which was skipped when the
    PF_MEMALLOC flag was set.

    Even now it's only a theoretical issue, as the new callsite of
    __alloc_pages_direct_compact() is reached only for costly orders and
    when gfp_pfmemalloc_allowed() is true, which means either
    __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true. There is no
    such known context, but let's play it safe and make
    __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
    already set.
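
    In other words, the fragile pattern being hardened looks roughly like
    this (sketch):

    current->flags |= PF_MEMALLOC;
    /* ... compaction/migration work ... */
    current->flags &= ~PF_MEMALLOC;   /* wrong if the caller had already set
                                       * PF_MEMALLOC: the flag is lost */

    while the fix saves and restores the pre-existing state instead:

    unsigned int saved = current->flags & PF_MEMALLOC;
    current->flags |= PF_MEMALLOC;
    /* ... compaction/migration work ... */
    current->flags = (current->flags & ~PF_MEMALLOC) | saved;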

    Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
    Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Although all architectures use a deposited page table for THP on
    anonymous VMAs, some architectures (s390 and powerpc) require the
    deposited storage even for file backed VMAs due to quirks of their MMUs.

    This patch adds support for depositing a table in DAX PMD fault handling
    path for archs that require it. Other architectures should see no
    functional changes.

    Link: http://lkml.kernel.org/r/20170411174233.21902-3-oohall@gmail.com
    Signed-off-by: Oliver O'Halloran
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • Depending on the flags of the PMD being zapped there may or may not be a
    deposited pgtable to be freed. In two of the three cases this is open
    coded while the third uses the zap_deposited_table() helper. This patch
    converts the others to use the helper to clean things up a bit.

    Link: http://lkml.kernel.org/r/20170411174233.21902-2-oohall@gmail.com
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, the AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove it. This patch makes no functional changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular:

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to pass __GFP_HIGHMEM along with the other flags, because there is really
    no reason to consume precious low memory on CONFIG_HIGHMEM systems for
    pages that are mapped into the kernel vmalloc space anyway. About half of
    the users don't pass this flag, though, which signals that the API is
    unnecessarily complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.
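
    For example, with the three-argument __vmalloc() of this era, a caller
    goes from

    buf = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);

    to simply

    buf = __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);

    since __GFP_HIGHMEM is now added internally when the backing pages are
    allocated.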

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Cristopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently vzalloc() is used in the swap code to allocate various data
    structures, such as the swap cache, the swap slots cache, cluster info,
    etc., because the size may be too large on some systems for a normal
    kzalloc() to succeed. But kzalloc() has some advantages, for example
    less memory fragmentation and less TLB pressure. So change the data
    structure allocation in the swap code to use kvzalloc(), which tries
    kzalloc() first and falls back to vzalloc() if kzalloc() fails.

    In general, although kmalloc() will reduce the number of high-order
    pages in short term, vmalloc() will cause more pain for memory
    fragmentation in the long term. And the swap data structure allocation
    that is changed in this patch is expected to be long term allocation.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocations changed in this patch occur at swapon time, which is
    usually during system boot, so we usually have a good chance of
    allocating the contiguous pages successfully.

    The allocation for swap_map[] in struct swap_info_struct is not changed,
    because that is usually quite large and vmalloc_to_page() is used for
    it. That makes it a little harder to change.
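
    The conversion pattern itself is mechanical; a hypothetical hunk (not
    taken from the actual patch) looks like:

    -       cluster_info = vzalloc(nr_clusters * sizeof(*cluster_info));
    +       cluster_info = kvzalloc(nr_clusters * sizeof(*cluster_info),
    +                               GFP_KERNEL);

    -       vfree(cluster_info);
    +       kvfree(cluster_info);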

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • There are many code paths open-coding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The vhost code uses __GFP_REPEAT when allocating vhost_virtqueue and
    vhost_vsock respectively, because it would really like to prefer kmalloc
    to the
    vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
    allocation to vmalloc") for more context. Michael Tsirkin has also
    noted:

    "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
    all accesses are slowed down. Allocation is not on data path, accesses
    are."

    The similar applies to other vhost_kvzalloc users.

    Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are
    two things to be careful about. First, we should prevent the OOM killer
    from being invoked, and so have to add __GFP_NORETRY by default; second,
    we should override __GFP_REPEAT for !costly order requests, as
    __GFP_REPEAT is ignored for !costly orders anyway.

    Supporting a __GFP_REPEAT-like semantic for !costly requests would be
    possible, but it would require changes in the page allocator. That is out
    of scope of this patch.

    This patch shouldn't introduce any functional change.
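
    The resulting flag handling in kvmalloc_node() looks roughly like this
    (a simplified sketch, not the verbatim source):

    void *kvmalloc_node(size_t size, gfp_t flags, int node)
    {
            gfp_t kmalloc_flags = flags;
            void *ret;

            /* vmalloc only supports GFP_KERNEL-compatible contexts. */
            WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);

            if (size > PAGE_SIZE) {
                    kmalloc_flags |= __GFP_NOWARN;

                    /*
                     * Keep __GFP_NORETRY (no OOM killer) unless the caller
                     * asked for __GFP_REPEAT on a costly order, where the
                     * flag actually has an effect.
                     */
                    if (!(kmalloc_flags & __GFP_REPEAT) ||
                        (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
                            kmalloc_flags |= __GFP_NORETRY;
            }

            ret = kmalloc_node(size, kmalloc_flags, node);
            if (ret || size <= PAGE_SIZE)
                    return ret;

            return __vmalloc_node_flags_caller(size, node, flags,
                                               __builtin_return_address(0));
    }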

    Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Michael S. Tsirkin
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __vmalloc_node_flags used to be static inline, but this was changed by
    "mm: introduce kv[mz]alloc helpers" because kvmalloc_node needs to use
    it as well and that code lives outside of vmalloc proper. I hadn't
    realized that this change would lead to a subtle bug, though. The
    function is also responsible for tracking its caller, which is
    then printed by /proc/vmallocinfo. If __vmalloc_node_flags is not
    inline, then we only get direct users of __vmalloc_node_flags recorded
    as callers (e.g. v[mz]alloc), which reduces the usefulness of this
    debugging feature considerably. It simply doesn't help to see that a
    given range belongs to vmalloc as a caller:

    0xffffc90002c79000-0xffffc90002c7d000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N0=3
    0xffffc90002c81000-0xffffc90002c85000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
    0xffffc90002c8d000-0xffffc90002c91000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
    0xffffc90002c95000-0xffffc90002c99000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3

    We really want to catch the _caller_ of the vmalloc function. Fix this
    issue by making __vmalloc_node_flags static inline again.

    Link: http://lkml.kernel.org/r/20170502134657.12381-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "kvmalloc", v5.

    There are many open coded kmalloc with vmalloc fallback instances in the
    tree. Most of them are not careful enough or simply do not care about
    the underlying semantic of the kmalloc/page allocator which means that
    a) some vmalloc fallbacks are basically unreachable because the kmalloc
    part will keep retrying until it succeeds b) the page allocator can
    invoke a really disruptive steps like the OOM killer to move forward
    which doesn't sound appropriate when we consider that the vmalloc
    fallback is available.

    As can be seen, implementing kvmalloc requires quite intimate knowledge
    of the page allocator and the memory reclaim internals, which strongly
    suggests that such a helper should be implemented in the memory
    subsystem proper.

    Most callers I could find have been converted to use the helper
    instead. This is patch 6. There are some more callers relying on
    __GFP_REPEAT in the networking stack, which I have converted as well;
    Eric Dumazet was not opposed [2] to converting them.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with the vmalloc fallback for larger allocations is a
    common pattern in the kernel code. Yet we do not have any common helper
    for that and so users have invented their own helpers. Some of them are
    really creative when doing so. Let's just add kv[mz]alloc and make sure
    it is implemented properly. This implementation makes sure not to create
    large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
    not to warn about allocation failures. This also rules out the OOM
    killer, as the vmalloc fallback is more appropriate than a disruptive,
    user-visible action.

    This patch also changes some existing users and removes helpers that are
    specific to them. In some cases this is not possible (e.g.
    ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
    require a GFP_NO{FS,IO} context, which is not vmalloc compatible in
    general (note that the page table allocation is GFP_KERNEL). Those need
    to be fixed separately.

    While we are at it, document in __vmalloc{_node} which gfp masks are
    unsupported, because there seems to be a lot of confusion out there.
    kvmalloc_node will warn about flags incompatible with GFP_KERNEL (i.e.
    not a superset of it) to catch new abusers. Existing ones would have to
    die slowly.
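
    For callers, the helper is mostly a drop-in replacement for the usual
    open-coded pattern; a hypothetical before/after:

    /* Before: hand-rolled kmalloc with vmalloc fallback. */
    table = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
    if (!table)
            table = vmalloc(size);

    /* After: kvmalloc() picks the strategy, kvfree() handles either case. */
    table = kvmalloc(size, GFP_KERNEL);
    /* ... use table ... */
    kvfree(table);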

    [sfr@canb.auug.org.au: f2fs fixup]
    Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Reviewed-by: Andreas Dilger [ext4 part]
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished even
    though the high-order page already exists. Note that it might be
    possible that the high-order page formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblock of the same migratetype, where it cannot result in
    a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal of
    migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The migrate scanner in async compaction is currently limited to
    MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce
    latency, based on the assumption that non-MOVABLE pageblocks are
    unlikely to contain movable pages.

    However, with the exception of THPs, most high-order allocations are
    not movable. Should the async compaction succeed, this increases the
    chance that the non-MOVABLE allocations will fall back to a MOVABLE
    pageblock, making the long-term fragmentation worse.

    This patch attempts to help the situation by changing async direct
    compaction so that the migrate scanner only scans the pageblocks of the
    requested migratetype. If it's a non-MOVABLE type and there are such
    pageblocks that do contain movable pages, chances are that the
    allocation can succeed within one of such pageblocks, removing the need
    for a fallback. If that fails, the subsequent sync attempt will ignore
    this restriction.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 30%. The number of movable allocations falling back is reduced by
    12%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation patch. We are going to need migratetype at lower layers
    than compact_zone() and compact_finished().

    Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation for making the decisions more complex and depending on
    compact_control flags. No functional change.

    Link: http://lkml.kernel.org/r/20170307131545.28577-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When stealing pages from a pageblock of a different migratetype, we count
    how many free pages were stolen, and change the pageblock's migratetype
    if more than half of the pageblock was free. This might be too
    conservative, as there might be other pages that are not free, but were
    allocated with the same migratetype as our allocation requested.

    While we cannot determine the migratetype of allocated pages precisely
    (at least without the page_owner functionality enabled), we can count
    pages that compaction would try to isolate for migration - those are
    either on LRU or __PageMovable(). The rest can be assumed to be
    MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
    distinguish. This counting can be done as part of free page stealing
    with little additional overhead.

    The page stealing code is changed so that it considers free pages plus
    pages of the "good" migratetype for the decision whether to change
    pageblock's migratetype.

    The result should be more accurate migratetype of pageblocks wrt the
    actual pages in the pageblocks, when stealing from semi-occupied
    pageblocks. This should help the efficiency of page grouping by
    mobility.
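
    A simplified sketch of the new decision (kernel-like names, not the exact
    code): pages that already match the requested migratetype are counted
    alongside the free pages.

    free_pages = move_freepages_block(zone, page, start_type, &movable_pages);

    if (start_type == MIGRATE_MOVABLE)
            alike_pages = movable_pages;
    else
            /* LRU/__PageMovable pages were counted above; the remainder is
             * assumed unmovable/reclaimable, i.e. "alike" for this request. */
            alike_pages = pageblock_nr_pages - (free_pages + movable_pages);

    /* Claim the pageblock if free + alike pages cover at least half of it. */
    if (free_pages + alike_pages >= (1 << (pageblock_order - 1)))
            set_pageblock_migratetype(page, start_type);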

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 47%. The number of movable allocations falling back to other
    pageblocks is increased by 55%, but these events don't cause permanent
    fragmentation, so the tradeoff should be positive. Later patches also
    offset the movable fallback increase to some extent.

    [akpm@linux-foundation.org: merge fix]
    Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __rmqueue_fallback() function is called when there's no free page of
    the requested migratetype, and we need to steal from a different one.

    There are various heuristics to make this event infrequent and reduce
    permanent fragmentation. The main one is to try stealing from a
    pageblock that has the most free pages, and possibly steal them all at
    once and convert the whole pageblock. Precise searching for such
    pageblock would be expensive, so instead the heuristics walks the free
    lists from MAX_ORDER down to requested order and assumes that the block
    with highest-order free page is likely to also have the most free pages
    in total.

    Chances are that together with the highest-order page, we steal also
    pages of lower orders from the same block. But then we still split the
    highest order page. This is wasteful and can contribute to
    fragmentation instead of avoiding it.

    This patch thus changes __rmqueue_fallback() to just steal the page(s)
    and put them on the freelist of the requested migratetype, and only
    report whether it was successful. Then we pick (and eventually split)
    the smallest page with __rmqueue_smallest(). This all happens under
    zone lock, so nobody can steal it from us in the process. This should
    reduce fragmentation due to fallbacks. At worst we are only stealing a
    single highest-order page and waste some cycles by moving it between
    lists and then removing it, but fallback is not exactly hot path so that
    should not be a concern. As a side benefit the patch removes some
    duplicate code by reusing __rmqueue_smallest().
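
    Schematically, the allocation path becomes (a simplified sketch, not the
    exact kernel code):

    static struct page *__rmqueue(struct zone *zone, unsigned int order,
                                  int migratetype)
    {
            struct page *page;

    retry:
            page = __rmqueue_smallest(zone, order, migratetype);
            if (unlikely(!page)) {
                    /*
                     * __rmqueue_fallback() now only moves stolen pages onto
                     * our own freelists and reports success; the page itself
                     * is then picked up by __rmqueue_smallest() as usual.
                     */
                    if (__rmqueue_fallback(zone, order, migratetype))
                            goto retry;
            }
            return page;
    }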

    [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
    Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
    Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When detecting whether compaction has succeeded in forming a high-order
    page, __compact_finished() employs a watermark check, followed by an own
    search for a suitable page in the freelists. This is not ideal for two
    reasons:

    - The watermark check also searches high-order freelists, but has a
    less strict criterion wrt fallback. It's therefore redundant and a
    waste of cycles. This was different in the past, when the high-order
    watermark check attempted to apply reserves to high-order pages.

    - The watermark check might actually fail due to a lack of order-0 pages.
    Compaction can't help with that, so there's no point in continuing
    because of it; it's possible that the high-order page already exists,
    and in that case compaction terminates.

    This patch therefore removes the watermark check. This should save some
    cycles and terminate compaction sooner in some cases.

    Link: http://lkml.kernel.org/r/20170307131545.28577-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "try to reduce fragmenting fallbacks", v3.

    Last year, Johannes Weiner reported a regression in page mobility
    grouping [1], and while the exact cause was not found, I've come up with
    some ways to improve the situation by reducing the number of allocations
    falling back to a different migratetype and causing permanent
    fragmentation.

    The series was tested with mmtests stress-highalloc modified to do
    GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
    balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
    wasn't woken up) on UMA machine with 4GB memory. There were 5 repeats
    of each run, as the extfrag stats are quite volatile (note the stats
    below are sums, not averages, as it was less perl hacking for me).

    Success rate are the same, already high due to the low allocation order
    used, so I'm not including them.

    Compaction stats:
    (the patches are stacked, and I haven't measured the non-functional-changes
    patches separately)

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Compaction stalls 22449 24680 24846 19765 22059 17480
    Compaction success 12971 14836 14608 10475 11632 8757
    Compaction failures 9477 9843 10238 9290 10426 8722
    Page migrate success 3109022 3370438 3312164 1695105 1608435 2111379
    Page migrate failure 911588 1149065 1028264 1112675 1077251 1026367
    Compaction pages isolated 7242983 8015530 7782467 4629063 4402787 5377665
    Compaction migrate scanned 980838938 987367943 957690188 917647238 947155598 1018922197
    Compaction free scanned 557926893 598946443 602236894 594024490 541169699 763651731
    Compaction cost 10243 10578 10304 8286 8398 9440

    Compaction stats are mostly within noise until patch 4, which decreases
    the number of compactions, and migrations. Part of that could be due to
    more pageblocks marked as unmovable, and async compaction skipping
    those. This changes a bit with patch 7, but not so much. Patch 8
    increases free scanner stats and migrations, which comes from the
    changed termination criteria. Interestingly, the number of compactions
    decreases - probably the fully compacted pageblock satisfies multiple
    subsequent allocations, so it amortizes.

    Next comes the extfrag tracepoint, where "fragmenting" means that an
    allocation had to fallback to a pageblock of another migratetype which
    wasn't fully free (which is almost all of the fallbacks). I have
    locally added another tracepoint for "Page steal" into
    steal_suitable_fallback() which triggers in situations where we are
    allowed to do move_freepages_block(). If we decide to also do
    set_pageblock_migratetype(), it's "Pages steal with pageblock" with
    break down for which allocation migratetype we are stealing and from
    which fallback migratetype. The last part "due to counting" comes from
    patch 4 and counts the events where the counting of movable pages
    allowed us to change pageblock's migratetype, while the number of free
    pages alone wouldn't be enough to cross the threshold.

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Page alloc extfrag event 10155066 8522968 10164959 15622080 13727068 13140319
    Extfrag fragmenting 10149231 8517025 10159040 15616925 13721391 13134792
    Extfrag fragmenting for unmovable 159504 168500 184177 97835 70625 56948
    Extfrag fragmenting unmovable placed with movable 153613 163549 172693 91740 64099 50917
    Extfrag fragmenting unmovable placed with reclaim. 5891 4951 11484 6095 6526 6031
    Extfrag fragmenting for reclaimable 4738 4829 6345 4822 5640 5378
    Extfrag fragmenting reclaimable placed with movable 1836 1902 1851 1579 1739 1760
    Extfrag fragmenting reclaimable placed with unmov. 2902 2927 4494 3243 3901 3618
    Extfrag fragmenting for movable 9984989 8343696 9968518 15514268 13645126 13072466
    Pages steal 179954 192291 210880 123254 94545 81486
    Pages steal with pageblock 22153 18943 20154 33562 29969 33444
    Pages steal with pageblock for unmovable 14350 12858 13256 20660 19003 20852
    Pages steal with pageblock for unmovable from mov. 12812 11402 11683 19072 17467 19298
    Pages steal with pageblock for unmovable from recl. 1538 1456 1573 1588 1536 1554
    Pages steal with pageblock for movable 7114 5489 5965 11787 10012 11493
    Pages steal with pageblock for movable from unmov. 6885 5291 5541 11179 9525 10885
    Pages steal with pageblock for movable from recl. 229 198 424 608 487 608
    Pages steal with pageblock for reclaimable 689 596 933 1115 954 1099
    Pages steal with pageblock for reclaimable from unmov. 273 219 537 658 547 667
    Pages steal with pageblock for reclaimable from mov. 416 377 396 457 407 432
    Pages steal with pageblock due to counting 11834 10075 7530
    ... for unmovable 8993 7381 4616
    ... for movable 2792 2653 2851
    ... for reclaimable 49 41 63

    What we can see is that "Extfrag fragmenting for unmovable" and "...
    placed with movable" drops with almost each patch, which is good as we
    are polluting less movable pageblocks with unmovable pages.

    The most significant change is patch 4 with movable page counting. On
    the other hand it increases "Extfrag fragmenting for movable" by 50%.
    "Pages steal" drops though, so these movable allocation fallbacks find
    only small free pages and are not allowed to steal whole pageblocks
    back. "Pages steal with pageblock" raises, because the patch increases
    the chances of pageblock migratetype changes to happen. This affects
    all migratetypes.

    The summary is that patch 4 is not a clear win wrt these stats, but I
    believe that the tradeoff it makes is a good one. There's less
    pollution of movable pageblocks by unmovable allocations. There's less
    stealing between pageblocks, and the steals that remain have a higher
    chance of also changing the migratetype of the pageblock itself, so it
    should more faithfully reflect the migratetype of the pages within the
    pageblock.
    The increase of movable allocations falling back to unmovable pageblock
    might look dramatic, but those allocations can be migrated by compaction
    when needed, and other patches in the series (7-9) improve that aspect.

    Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
    also reduce the impact on movable fallbacks from patch 4.

    [1] https://www.spinics.net/lists/linux-mm/msg114237.html

    This patch (of 8):

    Currently there are (mostly by accident) no holes in struct
    compact_control on x86_64, but we are going to add more bool flags, so
    place them all together at the end of the structure. While at it, just
    order all fields from largest to smallest.

    Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Pull ext4 updates from Ted Ts'o:

    - add GETFSMAP support

    - some performance improvements for very large file systems and for
    random write workloads into a preallocated file

    - bug fixes and cleanups.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: cleanup write flags handling from jbd2_write_superblock()
    ext4: mark superblock writes synchronous for nobarrier mounts
    ext4: inherit encryption xattr before other xattrs
    ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
    ext4: avoid unnecessary transaction stalls during writeback
    ext4: preload block group descriptors
    ext4: make ext4_shutdown() static
    ext4: support GETFSMAP ioctls
    vfs: add common GETFSMAP ioctl definitions
    ext4: evict inline data when writing to memory map
    ext4: remove ext4_xattr_check_entry()
    ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
    ext4: merge ext4_xattr_list() into ext4_listxattr()
    ext4: constify static data that is never modified
    ext4: trim return value and 'dir' argument from ext4_insert_dentry()
    jbd2: fix dbench4 performance regression for 'nobarrier' mounts
    jbd2: Fix lockdep splat with generic/270 test
    mm: retry writepages() on ENOMEM when doing an data integrity writeback

    Linus Torvalds
     
  • Wrong sign of the iov_iter_revert() argument. Unfortunately, it slipped
    through testing, since most of the time we don't do anything with the
    iterator afterwards, and a potential oops from walking iter->iov too far
    backwards is too infrequent to be easily triggered.

    Add a sanity check in iov_iter_revert() to catch bugs like this one;
    fortunately, the same braino hadn't happened in other callers, but we'd
    better have a warning if such thing crops up.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2017

2 commits

  • Pull staging/IIO updates from Greg KH:
    "Here is the big staging tree update for 4.12-rc1.

    It's a big one, adding about 350k new lines of crap^Wcode, mostly all
    in a big dump of media drivers from Intel. But there's other new
    drivers in here as well, yet-another-wifi driver, new IIO drivers, and
    a new crypto accelerator.

    We also deleted a bunch of stuff, mostly in patch cleanups, but also
    the Android ION code has shrunk a lot, and the Android low memory
    killer driver was finally deleted, much to the celebration of the -mm
    developers.

    All of these have been in linux-next with a few build issues that will
    show up when you merge to your tree"

    Merge conflicts in the new rtl8723bs driver (due to the wifi changes
    this merge window) handled as per linux-next, courtesy of Stephen
    Rothwell.

    * tag 'staging-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1182 commits)
    staging: fsl-mc/dpio: add cpu LE conversion for dpaa2_fd
    staging: ks7010: remove line continuations in quoted strings
    staging: vt6656: use tabs instead of spaces
    staging: android: ion: Fix unnecessary initialization of static variable
    staging: media: atomisp: fix range checking on clk_num
    staging: media: atomisp: fix misspelled word in comment
    staging: media: atomisp: kmap() can't fail
    staging: atomisp: remove #ifdef for runtime PM functions
    staging: atomisp: satm include directory is gone
    atomisp: remove some more unused files
    atomisp: remove hmm_load/store/clear indirections
    atomisp: kill off mmgr_free
    atomisp: clean up the hmm init/cleanup indirections
    atomisp: handle allocation calls before init in the hmm layer
    staging: fsl-dpaa2/eth: Add maintainer for Ethernet driver
    staging: fsl-dpaa2/eth: Add TODO file
    staging: fsl-dpaa2/eth: Add trace points
    staging: fsl-dpaa2/eth: Add driver specific stats
    staging: fsl-dpaa2/eth: Add ethtool support
    staging: fsl-dpaa2/eth: Add Freescale DPAA2 Ethernet driver
    ...

    Linus Torvalds
     
  • Pull arm64 updates from Catalin Marinas:

    - kdump support, including two necessary memblock additions:
    memblock_clear_nomap() and memblock_cap_memory_range()

    - ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex
    numbers and weaker release consistency

    - arm64 ACPI platform MSI support

    - arm perf updates: ACPI PMU support, L3 cache PMU in some Qualcomm
    SoCs, Cortex-A53 L2 cache events and DTLB refills, MAINTAINERS update
    for DT perf bindings

    - architected timer errata framework (the arch/arm64 changes only)

    - support for DMA_ATTR_FORCE_CONTIGUOUS in the arm64 iommu DMA API

    - arm64 KVM refactoring to use common system register definitions

    - remove support for ASID-tagged VIVT I-cache (no ARMv8 implementation
    using it and deprecated in the architecture) together with some
    I-cache handling clean-up

    - PE/COFF EFI header clean-up/hardening

    - define BUG() instruction without CONFIG_BUG

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
    arm64: Fix the DMA mmap and get_sgtable API with DMA_ATTR_FORCE_CONTIGUOUS
    arm64: Print DT machine model in setup_machine_fdt()
    arm64: pmu: Wire-up Cortex A53 L2 cache events and DTLB refills
    arm64: module: split core and init PLT sections
    arm64: pmuv3: handle pmuv3+
    arm64: Add CNTFRQ_EL0 trap handler
    arm64: Silence spurious kbuild warning on menuconfig
    arm64: pmuv3: use arm_pmu ACPI framework
    arm64: pmuv3: handle !PMUv3 when probing
    drivers/perf: arm_pmu: add ACPI framework
    arm64: add function to get a cpu's MADT GICC table
    drivers/perf: arm_pmu: split out platform device probe logic
    drivers/perf: arm_pmu: move irq request/free into probe
    drivers/perf: arm_pmu: split cpu-local irq request/free
    drivers/perf: arm_pmu: rename irq request/free functions
    drivers/perf: arm_pmu: handle no platform_device
    drivers/perf: arm_pmu: simplify cpu_pmu_request_irqs()
    drivers/perf: arm_pmu: factor out pmu registration
    drivers/perf: arm_pmu: fold init into alloc
    drivers/perf: arm_pmu: define armpmu_init_fn
    ...

    Linus Torvalds
     

04 May, 2017

4 commits

  • Pull tracing updates from Steven Rostedt:
    "New features for this release:

    - Pretty much a full rewrite of the processing of function plugins.
    i.e. echo do_IRQ:stacktrace > set_ftrace_filter

    - The rewrite was needed to add plugins to be unique to tracing
    instances. i.e. mkdir instance/foo; cd instances/foo; echo
    do_IRQ:stacktrace > set_ftrace_filter The old way was written very
    hacky. This removes a lot of those hacks.

    - New "function-fork" tracing option. When set, pids in the
    set_ftrace_pid will have their children added when the processes
    with their pids listed in the set_ftrace_pid file forks.

    - Exposure of "maxactive" for kretprobe in kprobe_events

    - Allow for builtin init functions to be traced by the function
    tracer (via the kernel command line). Module init function tracing
    will come in the next release.

    - Added more selftests, and have selftests also test in an instance"

    * tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits)
    ring-buffer: Return reader page back into existing ring buffer
    selftests: ftrace: Allow some event trigger tests to run in an instance
    selftests: ftrace: Have some basic tests run in a tracing instance too
    selftests: ftrace: Have event tests also run in an tracing instance
    selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances
    selftests: ftrace: Allow some tests to be run in a tracing instance
    tracing/ftrace: Allow for instances to trigger their own stacktrace probes
    tracing/ftrace: Allow for the traceonoff probe be unique to instances
    tracing/ftrace: Enable snapshot function trigger to work with instances
    tracing/ftrace: Allow instances to have their own function probes
    tracing/ftrace: Add a better way to pass data via the probe functions
    ftrace: Dynamically create the probe ftrace_ops for the trace_array
    tracing: Pass the trace_array into ftrace_probe_ops functions
    tracing: Have the trace_array hold the list of registered func probes
    ftrace: If the hash for a probe fails to update then free what was initialized
    ftrace: Have the function probes call their own function
    ftrace: Have each function probe use its own ftrace_ops
    ftrace: Have unregister_ftrace_function_probe_func() return a value
    ftrace: Add helper function ftrace_hash_move_and_update_ops()
    ftrace: Remove data field from ftrace_func_probe structure
    ...

    Linus Torvalds
     
  • Makes the report easier to read.

    Link: http://lkml.kernel.org/r/20170302134851.101218-10-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Changes double-free report header from

    BUG: Double free or freeing an invalid pointer
    Unexpected shadow byte: 0xFB

    to

    BUG: KASAN: double-free or invalid-free in kmalloc_oob_left+0xe5/0xef

    This makes a bug uniquely identifiable by the first report line. To
    account for the removal of the unexpected shadow value, print the shadow
    bytes at the end of the report, as in reports for other kinds of bugs.

    Link: http://lkml.kernel.org/r/20170302134851.101218-9-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Moves the page description after the stacks, since it's less important.

    Link: http://lkml.kernel.org/r/20170302134851.101218-8-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov