08 Aug, 2016

2 commits


06 Aug, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "Here's the second round of block updates for this merge window.

    It's a mix of fixes for changes that went in previously in this round,
    and fixes in general. This pull request contains:

    - Fixes for loop from Christoph

    - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

    - A blk-mq timeout fix when on frozen queues, from Gabriel.

    - Writeback fix from Jan, ensuring that __writeback_single_inode()
    does the right thing.

    - Fix for bio->bi_rw usage in f2fs from me.

    - Error path deadlock fix in blk-mq sysfs registration from me.

    - Floppy O_ACCMODE fix from Jiri.

    - Fix to the new bio op methods from Mike.

    One more followup will be coming here, ensuring that we don't
    propagate the block types outside of block. That, and a rename of
    bio->bi_rw is coming right after -rc1 is cut.

    - Various little fixes"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mm/block: convert rw_page users to bio op use
    loop: make do_req_filebacked more robust
    loop: don't try to use AIO for discards
    blk-mq: fix deadlock in blk_mq_register_disk() error path
    Include: blkdev: Removed duplicate 'struct request;' declaration.
    Fixup direct bi_rw modifiers
    block: fix bdi vs gendisk lifetime mismatch
    blk-mq: Allow timeouts to run while queue is freezing
    nbd: fix race in ioctl
    block: fix use-after-free in seq file
    f2fs: drop bio->bi_rw manual assignment
    block: add missing group association in bio-cloning functions
    blkcg: kill unused field nr_undestroyed_grps
    writeback: Write dirty times for WB_SYNC_ALL writeback
    floppy: fix open(O_ACCMODE) for ioctl-only open

    Linus Torvalds
     

05 Aug, 2016

8 commits

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     
  • It causes a NULL dereference error and a failure to get type_a->regions[0]
    info if parameter type_b of __next_mem_range_rev() == NULL.

    Fix this by checking before dereferencing, and by initializing idx_b to 0.

    The approach was tested by dumping all types of region via
    __memblock_dump_all(), and separately via the fixed __next_mem_range_rev(),
    to the UART; the result is okay after checking the logs.
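
    A minimal sketch of the idea, assuming the usual memblock iterator layout
    (not the exact diff):

    /* __next_mem_range_rev(): type_b may legitimately be NULL */
    if (*idx == (u64)ULLONG_MAX) {
            idx_a = type_a->cnt - 1;
            if (type_b != NULL)
                    idx_b = type_b->cnt;
            else
                    idx_b = 0;      /* don't dereference a NULL type_b */
    }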

    Link: http://lkml.kernel.org/r/57A0320D.6070102@zoho.com
    Signed-off-by: zijun_hu
    Tested-by: zijun_hu
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • With m68k-linux-gnu-gcc-4.1:

    include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called
    include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here

    Commit c146a2b98eb5 ("mm, kasan: account for object redzone in SLUB's
    nearest_obj()") made fixup_red_left() global, but forgot to remove the
    inline keyword.
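
    The fix is essentially dropping the stale specifier from the declaration in
    slub_def.h, roughly (sketch, not the exact diff):

    -inline void *fixup_red_left(struct kmem_cache *s, void *p);
    +void *fixup_red_left(struct kmem_cache *s, void *p);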

    Fixes: c146a2b98eb5898e ("mm, kasan: account for object redzone in SLUB's nearest_obj()")
    Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Cc: Alexander Potapenko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Paul Mackerras and Reza Arbab reported that machines with memoryless
    nodes fail when vmstats are refreshed. Paul reported an oops as follows

    Unable to handle kernel paging request for data at address 0xff7a10000
    Faulting instruction address: 0xc000000000270cd0
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
    task: c000000ff0680010 task.stack: c000000ff0704000
    NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
    REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
    MSR: 9000000102009033 CR: 846b6824 XER: 20000000
    CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
    NIP refresh_zone_stat_thresholds+0x80/0x240
    LR refresh_zone_stat_thresholds+0x98/0x240
    Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

    Both supplied potential fixes but one potentially misses checks and
    another had redundant initialisations. This version initialises
    per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.
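
    A hedged sketch of the direction (helper and type names taken from the
    node-lru series, not the exact diff): make sure every online pgdat gets its
    per-cpu node stats, including memoryless nodes:

    for_each_online_pgdat(pgdat)
            pgdat->per_cpu_nodestats =
                    alloc_percpu(struct per_cpu_nodestat);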

    Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Paul Mackerras
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/accomodate/accommodate/

    Link: http://lkml.kernel.org/r/20160804121824.18100-1-kuleshovmail@gmail.com
    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • At present, memory online and offline will fail when KASAN is enabled. Add
    a condition to restrict memory hotplug when KASAN is enabled.
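
    A hedged sketch of the guard in the hotplug entry points (simplified, not
    the exact diff):

    if (IS_ENABLED(CONFIG_KASAN)) {
            pr_info("hotplug is not supported with KASAN\n");
            return -EINVAL;
    }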

    Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     
  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
     

04 Aug, 2016

1 commit

  • If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
    BUILD_BUG_ON(), and may cause (e.g. with gcc 4.1.2):

    mm/built-in.o: In function `shmem_alloc_hugepage':
    shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'

    To fix this, move the assignment to hindex after the check for huge
    pages support.
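
    Roughly, in shmem_alloc_hugepage() the rounding now happens only after the
    support check (sketch, not the exact diff):

    if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
            return NULL;
    hindex = round_down(index, HPAGE_PMD_NR);  /* HPAGE_PMD_NR is usable here */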

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

03 Aug, 2016

17 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of ocfs2

    - various hotfixes, mainly MM

    - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

    - printk updates

    - firmware

    - checkpatch

    - nilfs2

    - more kexec stuff than usual

    - rapidio updates

    - w1 things

    * emailed patches from Andrew Morton : (111 commits)
    ipc: delete "nr_ipc_ns"
    kcov: allow more fine-grained coverage instrumentation
    init/Kconfig: add clarification for out-of-tree modules
    config: add android config fragments
    init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
    relay: add global mode support for buffer-only channels
    init: allow blacklisting of module_init functions
    w1:omap_hdq: fix regression
    w1: add helper macro module_w1_family
    w1: remove need for ida and use PLATFORM_DEVID_AUTO
    rapidio/switches: add driver for IDT gen3 switches
    powerpc/fsl_rio: apply changes for RIO spec rev 3
    rapidio: modify for rev.3 specification changes
    rapidio: change inbound window size type to u64
    rapidio/idt_gen2: fix locking warning
    rapidio: fix error handling in mbox request/release functions
    rapidio/tsi721_dma: advance queue processing from transfer submit call
    rapidio/tsi721: add messaging mbox selector parameter
    rapidio/tsi721: add PCIe MRRS override parameter
    rapidio/tsi721_dma: add channel mask and queue size parameters
    ...

    Linus Torvalds
     
  • The vm_brk() alignment calculations should refuse to overflow. The ELF
    loader was depending on this, but it has been fixed now. No other unsafe
    callers have been found.
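
    A minimal sketch of the overflow check (simplified; exact placement in
    mm/mmap.c may differ):

    len = PAGE_ALIGN(request);
    if (len < request)              /* the alignment wrapped past 0: refuse */
            return -ENOMEM;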

    Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reported-by: Hector Marco-Gisbert
    Cc: Ismael Ripoll Ripoll
    Cc: Alexander Viro
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Chen Gang
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We must call shrink_slab() for each memory cgroup on both global and
    memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
    changed that so that now shrink_slab() is only called with memcg != NULL
    on memcg reclaim. As a result, memcg-aware shrinkers (including
    dentry/inode) are never invoked on global reclaim. Fix that.

    Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
    Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If the total amount of memory assigned to quarantine is less than the
    amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
    may overflow. Instead, set it to zero.
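
    Roughly, the resulting logic (sketch following the WARN_ONCE cleanup noted
    below; constant names from mm/kasan/quarantine.c):

    new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) /
                          QUARANTINE_FRACTION;
    percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus();
    if (WARN_ONCE(new_quarantine_size < percpu_quarantines,
                  "Too little memory, disabling global KASAN quarantine.\n"))
            new_quarantine_size = 0;
    else
            new_quarantine_size -= percpu_quarantines;
    WRITE_ONCE(quarantine_size, new_quarantine_size);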

    [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
    Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Signed-off-by: Alexander Potapenko
    Reported-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently we just dump the stack in case of a double-free bug.
    Let's dump all the info about the object that we have.

    [aryabinin@virtuozzo.com: change double free message per Alexander]
    Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The state of an object is currently tracked in two places - shadow memory,
    and the ->state field in struct kasan_alloc_meta. We can get rid of the
    latter, which will save us a little bit of memory. It also allows us to
    move the free stack into struct kasan_alloc_meta without increasing memory
    consumption. So now we should always know when the object was last freed.
    This may be useful for long-delayed use-after-free bugs.

    As a side effect this fixes the following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The size of a slab object is already stored in cache->object_size.

    Note that kmalloc() internally rounds up the size of an allocation, so
    object_size may not be equal to alloc_size, but usually we don't need to
    know the exact size of the allocated object. If we do need that
    information, we can still figure it out from the report. The dump of
    shadow memory allows us to identify the end of the allocated memory, and
    thereby the exact allocation size.

    Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • SLUB doesn't require disabled interrupts to call ___cache_free().

    Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied
    by __GFP_RECLAIM) allocations, so basically we call it on almost every
    allocation. quarantine_reduce() is sometimes a heavy operation, and
    calling it with interrupts disabled may trigger a hard LOCKUP:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258
    Call Trace:
    dump_stack+0x68/0x96
    watchdog_overflow_callback+0x15b/0x190
    __perf_event_overflow+0x1b1/0x540
    perf_event_overflow+0x14/0x20
    intel_pmu_handle_irq+0x36a/0xad0
    perf_event_nmi_handler+0x2c/0x50
    nmi_handle+0x128/0x480
    default_do_nmi+0xb2/0x210
    do_nmi+0x1aa/0x220
    end_repeat_nmi+0x1a/0x1e
    <> __kernel_text_address+0x86/0xb0
    print_context_stack+0x7b/0x100
    dump_trace+0x12b/0x350
    save_stack_trace+0x2b/0x50
    set_track+0x83/0x140
    free_debug_processing+0x1aa/0x420
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100

    Call quarantine_reduce() only if direct reclaim is allowed.
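
    In code the guard amounts to (sketch, in kasan_kmalloc() and friends;
    gfpflags_allow_blocking() tests __GFP_DIRECT_RECLAIM):

    if (gfpflags_allow_blocking(flags))
            quarantine_reduce();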

    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Once an object is put into the quarantine, we no longer own it, i.e. the
    object could leave the quarantine and be reallocated. So having the
    set_track() call after the quarantine_put() may corrupt slab objects.

    BUG kmalloc-4096 (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------
    Disabling lock debugging due to kernel taint
    INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b
    ...
    INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc+0x109/0x3e0
    mmap_region+0x53e/0xe40
    do_mmap+0x70f/0xa50
    vm_mmap_pgoff+0x147/0x1b0
    SyS_mmap_pgoff+0x2c7/0x5b0
    SyS_mmap+0x1b/0x30
    do_syscall_64+0x1a0/0x4e0
    return_from_SYSCALL_64+0x0/0x7a
    INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x (null) flags=0x8000000000004080
    INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588
    Redzone ffff8804540de840: bb bb bb bb bb bb bb bb ........
    Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc kkkkkkkk.R....`.

    Similarly, poisoning after the quarantine_put() leads to false positive
    use-after-free reports:

    BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0
    Read of size 8 by task trinity-c0/3036
    CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
    Call Trace:
    dump_stack+0x68/0x96
    kasan_report_error+0x222/0x600
    __asan_report_load8_noabort+0x61/0x70
    anon_vma_interval_tree_insert+0x304/0x430
    anon_vma_chain_link+0x91/0xd0
    anon_vma_clone+0x136/0x3f0
    anon_vma_fork+0x81/0x4c0
    copy_process.part.47+0x2c43/0x5b20
    _do_fork+0x16d/0xbd0
    SyS_clone+0x19/0x20
    do_syscall_64+0x1a0/0x4e0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by putting an object in the quarantine after all other
    operations.
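
    The resulting order in the KASAN free path, roughly (sketch, not the exact
    diff; free_info stands for the object's KASAN metadata): poison the object
    and record the free stack first, then hand it to the quarantine:

    kasan_poison_slab_free(cache, object);  /* 1) poison while we still own it */
    set_track(&free_info->free_track, GFP_NOWAIT);  /* 2) record the free stack */
    quarantine_put(free_info, cache);       /* 3) give up ownership last */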

    Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
    Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Reported-by: Vegard Nossum
    Reported-by: Sasha Levin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is just a pointless waste of cycles and
    cache line bouncing during heavy parallel reclaim on large machines.
    The particular machine wasn't very healthy and was most probably
    suffering from a memory leak, which just caused the memory reclaim to
    thrash heavily. But bouncing on the lock certainly didn't help...

    Fix this with an optimistic lockless check and bail out early if the tree
    is empty. This is theoretically racy but that shouldn't matter all that
    much. First of all, the soft limit is a best-effort feature, it is slowly
    getting deprecated and its usage should be really scarce. Bouncing on a
    lock without a good reason is surely a much bigger problem, especially on
    large CPU machines.
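
    The check itself is a couple of lines in mem_cgroup_soft_limit_reclaim(),
    roughly (sketch, not the exact diff):

    mctz = soft_limit_tree_node(pgdat->node_id);
    /*
     * Do not even bother taking the lock if there is nothing to reclaim;
     * a racy miss is fine, the next reclaim pass will see the insertion.
     */
    if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
            return 0;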

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
    runs his database load with memory online and offline running in
    parallel. The reason is that huge_pmd_share might detect a shared pmd
    which is currently migrated and so it has migration pte which is
    !pte_huge.

    There doesn't seem to be any easy way to prevent the race, and in fact
    seeing the migration swap entry is not harmful. Both callers of
    huge_pte_alloc are prepared to handle it. copy_hugetlb_page_range will
    copy the swap entry and make it COW if needed. hugetlb_fault will back
    off, so the page fault is retried if the page is still under migration,
    and it waits for the migration to complete.

    That means that the BUG_ON is wrong and we should update it. Let's
    simply check that all present ptes are pte_huge instead.
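
    The updated assertion, roughly (sketch of the new check in
    huge_pte_alloc()):

    BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));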

    Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • On powerpc servers with large memory (32TB), we observed several soft
    lockups for hugepages under stress tests.

    The call traces are as follows:
    1.
    get_page_from_freelist+0x2d8/0xd50
    __alloc_pages_nodemask+0x180/0xc20
    alloc_fresh_huge_page+0xb0/0x190
    set_max_huge_pages+0x164/0x3b0

    2.
    prep_new_huge_page+0x5c/0x100
    alloc_fresh_huge_page+0xc8/0x190
    set_max_huge_pages+0x164/0x3b0

    This patch fixes such soft lockups. It is safe to call cond_resched()
    there because it is outside of the spin_lock/unlock section.
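
    The fix amounts to yielding the CPU between page allocations, outside the
    hugetlb lock; roughly (sketch of the loop in set_max_huge_pages()):

    while (count > persistent_huge_pages(h)) {
            spin_unlock(&hugetlb_lock);
            cond_resched();         /* yield the cpu to avoid a soft lockup */
            ret = alloc_fresh_huge_page(h, nodes_allowed);
            spin_lock(&hugetlb_lock);
            if (!ret)
                    goto out;
    }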

    Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia He
     
  • Every swap-in anonymous page starts from the inactive lru list's head. It
    should be activated unconditionally when the VM decides to reclaim it,
    because the page table entry for the page usually has the accessed bit
    set. Thus, its window for getting a new reference is 2 * NR_inactive +
    NR_active, while for other pages it is NR_inactive + NR_active.

    It's not fair that it has more chance to be referenced compared to other
    newly allocated pages, which start from the active lru list's head.

    Johannes:

    : The page can still have a valid copy on the swap device, so prefering to
    : reclaim that page over a fresh one could make sense. But as you point
    : out, having it start inactive instead of active actually ends up giving it
    : *more* LRU time, and that seems to be without justification.

    Rik:

    : The reason newly read in swap cache pages start on the inactive list is
    : that we do some amount of read-around, and do not know which pages will
    : get used.
    :
    : However, immediately activating the ones that DO get used, like your patch
    : does, is the right thing to do.

    Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Nadav Amit
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • I ran into this:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
    in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
    2 locks held by trinity-c1/1434:
    #0: (&mm->mmap_sem){......}, at: [] __do_page_fault+0x1ce/0x8f0
    #1: (rcu_read_lock){......}, at: [] filemap_map_pages+0xd6/0xdd0

    CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x65/0x84
    panic+0x185/0x2dd
    ___might_sleep+0x51c/0x600
    __might_sleep+0x90/0x1a0
    __alloc_pages_nodemask+0x5b1/0x2160
    alloc_pages_current+0xcc/0x370
    pte_alloc_one+0x12/0x90
    __pte_alloc+0x1d/0x200
    alloc_set_pte+0xe3e/0x14a0
    filemap_map_pages+0x42b/0xdd0
    handle_mm_fault+0x17d5/0x28b0
    __do_page_fault+0x310/0x8f0
    trace_do_page_fault+0x18d/0x310
    do_async_page_fault+0x27/0xa0
    async_page_fault+0x28/0x30

    The important bits from the above are that filemap_map_pages() is calling
    into the page allocator while holding rcu_read_lock (sleeping is not
    allowed inside RCU read-side critical sections).

    According to Kirill Shutemov, the prefaulting code in do_fault_around()
    is supposed to take care of this, but missing error handling means that
    the allocation failure can go unnoticed.

    We don't need to return VM_FAULT_OOM (or any other error) here, since we
    can just let the normal fault path try again.

    Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Kirill A. Shutemov
    Cc: "Hillf Danton"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • Pull KVM updates from Paolo Bonzini:

    - ARM: GICv3 ITS emulation and various fixes. Removal of the
    old VGIC implementation.

    - s390: support for trapping software breakpoints, nested
    virtualization (vSIE), the STHYI opcode, initial extensions
    for CPU model support.

    - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
    of cleanups, preliminary to this and the upcoming support for
    hardware virtualization extensions.

    - x86: support for execute-only mappings in nested EPT; reduced
    vmexit latency for TSC deadline timer (by about 30%) on Intel
    hosts; support for more than 255 vCPUs.

    - PPC: bugfixes.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
    KVM: PPC: Introduce KVM_CAP_PPC_HTM
    MIPS: Select HAVE_KVM for MIPS64_R{2,6}
    MIPS: KVM: Reset CP0_PageMask during host TLB flush
    MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
    MIPS: KVM: Sign extend MFC0/RDHWR results
    MIPS: KVM: Fix 64-bit big endian dynamic translation
    MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
    MIPS: KVM: Use 64-bit CP0_EBase when appropriate
    MIPS: KVM: Set CP0_Status.KX on MIPS64
    MIPS: KVM: Make entry code MIPS64 friendly
    MIPS: KVM: Use kmap instead of CKSEG0ADDR()
    MIPS: KVM: Use virt_to_phys() to get commpage PFN
    MIPS: Fix definition of KSEGX() for 64-bit
    KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
    kvm: x86: nVMX: maintain internal copy of current VMCS
    KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
    KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
    KVM: arm64: vgic-its: Simplify MAPI error handling
    KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
    KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
    ...

    Linus Torvalds
     

01 Aug, 2016

1 commit


30 Jul, 2016

1 commit

  • Pull fuse updates from Miklos Szeredi:
    "This fixes error propagation from writeback to fsync/close for
    writeback cache mode as well as adding a missing capability flag to
    the INIT message. The rest are cleanups.

    (The commits are recent but all the code actually sat in -next for a
    while now. The recommits are due to conflict avoidance and the
    addition of Cc: stable@...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use filemap_check_errors()
    mm: export filemap_check_errors() to modules
    fuse: fix wrong assignment of ->flags in fuse_send_init()
    fuse: fuse_flush must check mapping->flags for errors
    fuse: fsync() did not return IO errors
    fuse: don't mess with blocking signals
    new helper: wait_event_killable_exclusive()
    fuse: improve aio directIO write performance for size extending writes

    Linus Torvalds
     

29 Jul, 2016

9 commits

  • Can be used by fuse, btrfs and f2fs to replace opencoded variants.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
    all, unless the system admin has overridden the default, or application has
    indicated via madvise that it can benefit from THP's. In both cases, it
    means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
    wouldn't, the second compaction attempt is still async and will detect the
    contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
    in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that the !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
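
    Roughly, the new priority levels and their mapping back to migration modes
    look like this (sketch; compact_mode() is a hypothetical helper used here
    only for illustration):

    enum compact_priority {
            COMPACT_PRIO_SYNC_LIGHT,        /* try harder: sync-light migration */
            COMPACT_PRIO_ASYNC,             /* lightest: async migration only */
    };

    static enum migrate_mode compact_mode(enum compact_priority prio)
    {
            return prio == COMPACT_PRIO_ASYNC ? MIGRATE_ASYNC
                                              : MIGRATE_SYNC_LIGHT;
    }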

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THP's and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).
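
    Roughly, the two masks introduced above end up defined as (sketch; see
    include/linux/gfp.h in the series for the exact form):

    #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
    #define GFP_TRANSHUGE       (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)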

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since THP allocations during page faults can be costly, extra decisions
    are employed for them to avoid excessive reclaim and compaction, if the
    initial compaction doesn't look promising. The detection has never been
    perfect as there is no gfp flag specific to THP allocations. At this
    moment it checks the whole combination of flags that makes up
    GFP_TRANSHUGE, and hopes that no other users of such combination exist,
    or would mind being treated the same way. Extra care is also taken to
    separate allocations from khugepaged, where latency doesn't matter that
    much.

    It is however possible to distinguish these allocations in a simpler and
    more reliable way. The key observation is that after the initial
    compaction followed by the first iteration of "standard"
    reclaim/compaction, both __GFP_NORETRY allocations and costly
    allocations without __GFP_REPEAT are declared as failures:

    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
    goto nopage;

    /*
    * Do not retry costly high order allocations unless they are
    * __GFP_REPEAT
    */
    if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
    goto nopage;

    This means we can further distinguish allocations that are costly order
    *and* additionally include the __GFP_NORETRY flag. As it happens,
    GFP_TRANSHUGE allocations do already fall into this category. This will
    also allow other costly allocations with similar high-order benefit vs
    latency considerations to use this semantic. Furthermore, we can
    distinguish THP allocations that should try a bit harder (such as from
    khugepageed) by removing __GFP_NORETRY, as will be done in the next
    patch.
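
    The slowpath can then treat these allocations generically, roughly (sketch,
    simplified from the series):

    if (order > PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_NORETRY)) {
            /*
             * A costly allocation that asked not to retry (e.g. a THP page
             * fault): give up if even the initial compaction was deferred,
             * and never escalate beyond async compaction.
             */
            if (compact_result == COMPACT_DEFERRED)
                    goto nopage;
            migration_mode = MIGRATE_ASYNC;
    }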

    Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The retry loop in __alloc_pages_slowpath is supposed to keep trying
    reclaim and compaction (and OOM), until either the allocation succeeds,
    or returns with failure. Success here is more probable when reclaim
    precedes compaction, as certain watermarks have to be met for compaction
    to even try, and more free pages increase the probability of compaction
    success. On the other hand, starting with light async compaction (if
    the watermarks allow it), can be more efficient, especially for smaller
    orders, if there's enough free memory which is just fragmented.

    Thus, the current code starts with compaction before reclaim, and to
    make sure that the last reclaim is always followed by a final
    compaction, there's another direct compaction call at the end of the
    loop. This makes the code hard to follow and adds some duplicated
    handling of migration_mode decisions. It's also somewhat inefficient
    that even if reclaim or compaction decides not to retry, the final
    compaction is still attempted. Some gfp flag combinations also shortcut
    these retry decisions with "goto noretry;", making it even harder to
    follow.

    This patch attempts to restructure the code with only minimal functional
    changes. The call to the first compaction and THP-specific checks are
    now placed above the retry loop, and the "noretry" direct compaction is
    removed.

    The initial compaction is additionally restricted only to costly orders,
    as we can expect smaller orders to be held back by watermarks, and only
    larger orders to suffer primarily from fragmentation. This better
    matches the checks in reclaim's shrink_zones().

    There are two other smaller functional changes. One is that the upgrade
    from async migration to light sync migration will always occur after the
    initial compaction. This is how it has been until recent patch "mm,
    oom: protect !costly allocations some more", which introduced upgrading
    the mode based on COMPACT_COMPLETE result, but kept the final compaction
    always upgraded, which made it even more special. It's better to return
    to the simpler handling for now, as migration modes will be further
    modified later in the series.

    The second change is that once both reclaim and compaction declare it's
    not worth to retry the reclaim/compact loop, there is no final
    compaction attempt. As argued above, this is intentional. If that
    final compaction were to succeed, it would be due to a wrong retry
    decision, or simply a race with somebody else freeing memory for us.

    The main outcome of this patch should be simpler code. Logically, the
    initial compaction without reclaim is the exceptional case to the
    reclaim/compaction scheme, but prior to the patch, it was the last loop
    iteration that was exceptional. Now the code matches the logic better.
    The change also enables the following patches.

    Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
    kswapd, it first tries get_page_from_freelist() with the new
    alloc_flags, as it may succeed e.g. due to using min watermark instead
    of low watermark. It makes sense to do this attempt before adjusting the
    zonelist based on alloc_flags/gfp_mask, as it's still a relatively fast
    path if we just wake up kswapd and successfully allocate.

    This patch therefore moves the initial attempt above the retry label and
    reorganizes a bit the part below the retry label. We still have to
    attempt get_page_from_freelist() on each retry, as some allocations
    cannot do that as part of direct reclaim or compaction, and yet are not
    allowed to fail (even though they do a WARN_ON_ONCE() and thus should
    not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
    and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
    it. As a side-effect, the attempts from direct reclaim/compaction will
    also no longer obey watermarks once this is set, but there's little harm
    in that.

    Kswapd wakeups are also done on each retry to be safe from potential
    races resulting in kswapd going to sleep while a process (that may not
    be able to reclaim by itself) is still looping.

    Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
    initialized, so move the initialization above the retry: label. Also
    make the comment above the initialization more descriptive.

    The only exception in the alloc_flags being constant is
    ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
    allocating thread. We can fix this, and make the code simpler and a bit
    more effective at the same time, by moving the part that determines
    ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

    This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
    places in __alloc_pages_slowpath() anymore. The only two tests for the
    flag can instead call gfp_pfmemalloc_allowed().
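
    After the move, both call sites reduce to a single test, roughly (sketch):

    if (gfp_pfmemalloc_allowed(gfp_mask))
            alloc_flags = ALLOC_NO_WATERMARKS;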

    Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka