15 Aug, 2020

17 commits

  • Merge more updates from Andrew Morton:
    "Subsystems affected by this patch series: mm/hotfixes, lz4, exec,
    mailmap, mm/thp, autofs, sysctl, mm/kmemleak, mm/misc and lib"

    * emailed patches from Andrew Morton: (35 commits)
    virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
    ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
    rtl818x: constify ioreadX() iomem argument (as in generic implementation)
    iomap: constify ioreadX() iomem argument (as in generic implementation)
    sh: use generic strncpy()
    sh: clkfwk: remove r8/r16/r32
    include/asm-generic/vmlinux.lds.h: align ro_after_init
    mm: annotate a data race in page_zonenum()
    mm/swap.c: annotate data races for lru_rotate_pvecs
    mm/rmap: annotate a data race at tlb_flush_batched
    mm/mempool: fix a data race in mempool_free()
    mm/list_lru: fix a data race in list_lru_count_one
    mm/memcontrol: fix a data race in scan count
    mm/page_counter: fix various data races at memsw
    mm/swapfile: fix and annotate various data races
    mm/filemap.c: fix a data race in filemap_fault()
    mm/swap_state: mark various intentional data races
    mm/page_io: mark various intentional data races
    mm/frontswap: mark various intentional data races
    mm/kmemleak: silence KCSAN splats in checksum
    ...

    Linus Torvalds
     
  • A read of lru_add_pvec->nr can be interrupted and then a write done to the
    same variable. The write runs with local interrupts disabled, but the
    plain reads still result in data races. However, it is unlikely the
    compilers could do much damage here given that lru_add_pvec->nr is an
    "unsigned char" and there is an existing compiler barrier. Thus, annotate
    the reads using the data_race() macro (a sketch of this annotation pattern
    follows this entry). The data races were reported by KCSAN,

    BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page

    write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
    rotate_reclaimable_page+0x2df/0x490
    pagevec_add at include/linux/pagevec.h:81
    (inlined by) rotate_reclaimable_page at mm/swap.c:259
    end_page_writeback+0x1b5/0x2b0
    end_swap_bio_write+0x1d0/0x280
    bio_endio+0x297/0x560
    dec_pending+0x218/0x430 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x297/0x560
    blk_update_request+0x201/0x920
    scsi_end_request+0x6b/0x4a0
    scsi_io_completion+0xb7/0x7e0
    scsi_finish_command+0x1ed/0x2a0
    scsi_softirq_done+0x1c9/0x1d0
    blk_done_softirq+0x181/0x1d0
    __do_softirq+0xd9/0x57c
    irq_exit+0xa2/0xc0
    do_IRQ+0x8b/0x190
    ret_from_intr+0x0/0x42
    delay_tsc+0x46/0x80
    __const_udelay+0x3c/0x40
    __udelay+0x10/0x20
    kcsan_setup_watchpoint+0x202/0x3a0
    __tsan_read1+0xc2/0x100
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain_cpu at mm/swap.c:602
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    2 locks held by oom02/37761:
    #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
    #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
    irq event stamp: 1949217
    trace_hardirqs_on_thunk+0x1a/0x1c
    __do_softirq+0x2e7/0x57c
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
    Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Marco Elver
    Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
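
    A minimal sketch of the data_race() annotation pattern used above, assuming
    the data_race() macro from <linux/compiler.h>; the struct and helper names
    are simplified stand-ins, not the actual mm/swap.c code:

        #include <linux/compiler.h>     /* data_race() */

        struct pvec_like {
                unsigned char nr;       /* written with local IRQs disabled */
        };

        static bool pvec_has_pages(const struct pvec_like *pvec)
        {
                /*
                 * The plain read may race with the interrupt-side write;
                 * the race is benign, so tell KCSAN it is intentional.
                 */
                return data_race(pvec->nr) != 0;
        }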
     
  • mm->tlb_flush_batched could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

    write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
    try_to_unmap_one+0x59a/0x1ab0
    set_tlb_ubc_flush_pending at mm/rmap.c:635
    (inlined by) try_to_unmap_one at mm/rmap.c:1538
    rmap_walk_anon+0x296/0x650
    rmap_walk+0xdf/0x100
    try_to_unmap+0x18a/0x2f0
    shrink_page_list+0xef6/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
    flush_tlb_batched_pending+0x29/0x90
    flush_tlb_batched_pending at mm/rmap.c:682
    change_p4d_range+0x5dd/0x1030
    change_pte_range at mm/mprotect.c:44
    (inlined by) change_pmd_range at mm/mprotect.c:212
    (inlined by) change_pud_range at mm/mprotect.c:240
    (inlined by) change_p4d_range at mm/mprotect.c:260
    change_protection+0x222/0x310
    change_prot_numa+0x3e/0x60
    task_numa_work+0x219/0x350
    task_work_run+0xed/0x140
    prepare_exit_to_usermode+0x2cc/0x2e0
    ret_from_intr+0x32/0x42

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    flush_tlb_batched_pending() is called under the PTL but the write is not;
    however, mm->tlb_flush_batched is only a bool type, so the value is
    unlikely to be shattered. Thus, mark it as an intentional data race by
    using the data_race() macro.

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • mempool_t pool.curr_nr could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in mempool_free / remove_element

    write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
    remove_element+0x4a/0x1c0
    remove_element at mm/mempool.c:132
    mempool_alloc+0x102/0x210
    (inlined by) mempool_alloc at mm/mempool.c:399
    bio_alloc_bioset+0x106/0x2c0
    get_swap_bio+0x49/0x230
    __swap_writepage+0x680/0xc30
    swap_writepage+0x9c/0xf0
    pageout+0x33e/0xae0
    shrink_page_list+0x1f57/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
    mempool_free+0x3e/0x150
    mempool_free at mm/mempool.c:492
    bio_free+0x192/0x280
    bio_put+0x91/0xd0
    end_swap_bio_write+0x1d8/0x280
    bio_endio+0x2c2/0x5b0
    dec_pending+0x22b/0x440 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x2c2/0x5b0
    blk_update_request+0x217/0x940
    scsi_end_request+0x6b/0x4d0
    scsi_io_completion+0xb7/0x7e0
    scsi_finish_command+0x223/0x310
    scsi_softirq_done+0x1d5/0x210
    blk_mq_complete_request+0x224/0x250
    scsi_mq_done+0xc2/0x250
    pqi_raid_io_complete+0x5a/0x70 [smartpqi]
    pqi_irq_handler+0x150/0x1410 [smartpqi]
    __handle_irq_event_percpu+0x90/0x540
    handle_irq_event_percpu+0x49/0xd0
    handle_irq_event+0x85/0xca
    handle_edge_irq+0x13f/0x3e0
    do_IRQ+0x86/0x190

    The write is under pool->lock but the read is done locklessly. Even though
    commit 5b990546e334 ("mempool: fix and document synchronization and memory
    barrier usage") introduced the smp_wmb() and smp_rmb() pair to improve the
    situation, it is not adequate to protect against data races which could
    lead to a logic bug, so fix it by adding READ_ONCE() for the read.

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • struct list_lru_one l.nr_items could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

    write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
    list_lru_isolate_move+0xf9/0x130
    list_lru_isolate_move at mm/list_lru.c:180
    inode_lru_isolate+0x12b/0x2a0
    __list_lru_walk_one+0x122/0x3d0
    list_lru_walk_one+0x75/0xa0
    prune_icache_sb+0x8b/0xc0
    super_cache_scan+0x1b8/0x250
    do_shrink_slab+0x256/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
    list_lru_count_one+0x116/0x2f0
    list_lru_count_one at mm/list_lru.c:193
    super_cache_count+0xe8/0x170
    do_shrink_slab+0x95/0x6d0
    shrink_slab+0x41b/0x4a0
    shrink_node+0x35c/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    A shattered l.nr_items could affect the shrinker behaviour due to a data
    race. Fix it by adding READ_ONCE() for the read. Since the writes are
    aligned and up to word-size, assume those are safe from data races to
    avoid readability issues of writing WRITE_ONCE(var, var + val).

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
    memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
    concurrently, as reported by KCSAN,

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
    page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x58/0x140
    __memcg_kmem_charge+0xcc/0x280
    __alloc_pages_nodemask+0x1e1/0x450
    alloc_pages_current+0xa6/0x120
    pte_alloc_one+0x17/0xd0
    __pte_alloc+0x3a/0x1f0
    copy_p4d_range+0xc36/0x1990
    copy_page_range+0x21d/0x360
    dup_mmap+0x5f5/0x7a0
    dup_mm+0xa2/0x240
    copy_process+0x1b3f/0x3460
    _do_fork+0xaa/0xa20
    __x64_sys_clone+0x13b/0x170
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
    page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
    try_charge+0x131/0xd50 mm/memcontrol.c:2405
    mem_cgroup_try_charge+0x159/0x460
    mem_cgroup_try_charge_delay+0x3d/0xa0
    wp_page_copy+0x14d/0x930
    do_wp_page+0x107/0x7b0
    __handle_mm_fault+0xce6/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

    write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
    page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
    page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
    try_charge+0x185/0xbf0 mm/memcontrol.c:2405
    __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
    __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
    __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

    Since the watermark could be compared against or set to garbage due to a
    data race, which would change the code logic, fix it by adding a
    READ_ONCE()/WRITE_ONCE() pair in those places.

    The "failcnt" counter tolerates some degree of inaccuracy and is only used
    to report stats, so a data race is not harmful; mark it as an intentional
    data race using the data_race() macro. A sketch of both annotations
    follows this entry.

    Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
    Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Tetsuo Handa
    Cc: Marco Elver
    Cc: Dmitry Vyukov
    Cc: Johannes Weiner
    Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
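
    A sketch of the two annotations described above, assuming READ_ONCE(),
    WRITE_ONCE() and data_race() from <linux/compiler.h>; the struct is a
    simplified stand-in for struct page_counter:

        #include <linux/compiler.h>

        struct counter_like {
                unsigned long watermark;        /* high-water mark */
                unsigned long failcnt;          /* statistics only */
        };

        static void counter_record_usage(struct counter_like *c,
                                         unsigned long new_usage)
        {
                /*
                 * Paired READ_ONCE()/WRITE_ONCE() keep the compare-and-update
                 * coherent even though it races with other CPUs.
                 */
                if (new_usage > READ_ONCE(c->watermark))
                        WRITE_ONCE(c->watermark, new_usage);
        }

        static void counter_record_failure(struct counter_like *c)
        {
                /* A lost increment is tolerable for a stats-only counter. */
                data_race(c->failcnt++);
        }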
     
  • swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
    be accessed concurrently separately as noticed by KCSAN,

    === si.highest_bit ===

    write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
    swap_range_alloc+0x81/0x130
    swap_range_alloc at mm/swapfile.c:681
    scan_swap_map_slots+0x371/0xb90
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
    scan_swap_map_slots+0x4a6/0xb90
    scan_swap_map_slots at mm/swapfile.c:892
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0xf2/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    === si.swap_map[offset] ===

    write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
    __swap_entry_free_locked+0x8c/0x100
    __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
    __swap_entry_free.constprop.20+0x69/0xb0
    free_swap_and_cache+0x53/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
    _swap_info_get+0x81/0xa0
    _swap_info_get at mm/swapfile.c:1140
    free_swap_and_cache+0x40/0xa0
    unmap_page_range+0x7f8/0x1d70
    unmap_single_vma+0xcd/0x170
    unmap_vmas+0x18b/0x220
    exit_mmap+0xee/0x220
    mmput+0x10e/0x270
    do_exit+0x59b/0xf40
    do_group_exit+0x8b/0x180

    === si.flags ===

    write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1795/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
    _swap_info_get+0x41/0xa0
    __swap_info_get at mm/swapfile.c:1114
    put_swap_page+0x84/0x490
    __remove_mapping+0x384/0x5f0
    shrink_page_list+0xff1/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290

    The writes are under si->lock but the reads are not. For si.highest_bit
    and si.swap_map[offset], a data race could trigger logic bugs, so fix them
    by using WRITE_ONCE() for the writes and READ_ONCE() for the reads, except
    for those isolated reads that only compare against zero, where a data race
    causes no harm; annotate those as intentional data races using the
    data_race() macro.

    For si.flags, the readers are only interested in a single bit, so a data
    race there causes no issue.

    [cai@lca.pw: add a missing annotation for si->flags in memory.c]
    Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • struct file_ra_state ra.mmap_miss could be accessed concurrently during
    page faults as noticed by KCSAN,

    BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

    write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
    filemap_fault+0x920/0xfc0
    do_sync_mmap_readahead at mm/filemap.c:2384
    (inlined by) filemap_fault at mm/filemap.c:2486
    __xfs_filemap_fault+0x112/0x3e0 [xfs]
    xfs_filemap_fault+0x74/0x90 [xfs]
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
    filemap_map_pages+0xc2e/0xd80
    filemap_map_pages at mm/filemap.c:2625
    do_fault+0x3da/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    ra.mmap_miss feeds into the readahead decisions, so a data race could be
    undesirable. Both the read and the write are done only under the
    non-exclusive mmap_sem, and two concurrent writers could even underflow
    the counter. Fix the underflow by working on a local variable before
    committing a single final store to ra.mmap_miss, since a small inaccuracy
    in the counter is acceptable (a sketch of this approach follows this
    entry).

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Qian Cai
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
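
    A sketch of the underflow fix described above: work on a local copy and
    commit one final store instead of decrementing the shared field in place.
    struct file_ra_state and its mmap_miss field are real; the helper name and
    the decrement-on-hit policy here are illustrative:

        #include <linux/compiler.h>
        #include <linux/fs.h>           /* struct file_ra_state */

        static void ra_record_hit(struct file_ra_state *ra)
        {
                unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

                /*
                 * Concurrent callers may lose an update, but they can no
                 * longer drive the counter below zero.
                 */
                if (mmap_miss)
                        WRITE_ONCE(ra->mmap_miss, mmap_miss - 1);
        }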
     
  • swap_cache_info.* could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache

    write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
    lookup_swap_cache+0x12e/0x460
    lookup_swap_cache at mm/swap_state.c:322
    do_swap_page+0x112/0xeb0
    __handle_mm_fault+0xc7a/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
    lookup_swap_cache+0x117/0x460
    lookup_swap_cache at mm/swap_state.c:322
    shmem_swapin_page+0xc7/0x9e0
    shmem_getpage_gfp+0x2ca/0x16c0
    shmem_fault+0xef/0x3c0
    __do_fault+0x9e/0x220
    do_fault+0x4a0/0x920
    __handle_mm_fault+0xc69/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
    __delete_from_swap_cache+0x681/0x8b0
    __delete_from_swap_cache at mm/swap_state.c:178

    read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
    __delete_from_swap_cache+0x66e/0x8b0
    __delete_from_swap_cache at mm/swap_state.c:178

    Both the read and the write are done locklessly. Since swap_cache_info.*
    are only used to print out counter information, missing a few increments
    due to data races is harmless, so just mark them as intentional data races
    using the data_race() macro (see the sketch after this entry).

    While at it, fix a checkpatch.pl warning,

    WARNING: Single statement macros should not use a do {} while (0) loop

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
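
    A sketch of both changes described above, using a stand-in counter struct
    and macro name rather than the actual mm/swap_state.c code:

        #include <linux/compiler.h>     /* data_race() */

        static struct {
                unsigned long add_total;
                unsigned long find_success;
        } cache_stats;

        /* Before: #define INC_STAT(x) do { cache_stats.x++; } while (0) */
        /*
         * After: the racy increment is annotated, and the single-statement
         * macro no longer needs the do { } while (0) wrapper.
         */
        #define INC_STAT(x)     data_race(cache_stats.x++)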
     
  • struct swap_info_struct si.flags could be accessed concurrently as noticed
    by KCSAN,

    BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage

    write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
    scan_swap_map_slots+0x6fe/0xb50
    scan_swap_map_slots at mm/swapfile.c:887
    get_swap_pages+0x39d/0x5c0
    get_swap_page+0x377/0x524
    add_to_swap+0xe4/0x1c0
    shrink_page_list+0x1740/0x2820
    shrink_inactive_list+0x316/0x8b0
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
    swap_readpage+0x204/0x6a0
    swap_readpage at mm/page_io.c:380
    read_swap_cache_async+0xa2/0xb0
    swapin_readahead+0x6a0/0x890
    do_swap_page+0x465/0xeb0
    __handle_mm_fault+0xc7a/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    Other reads,

    read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
    __swap_writepage+0x140/0xc20
    __swap_writepage at mm/page_io.c:289

    read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
    swap_set_page_dirty+0x44/0x1f4
    swap_set_page_dirty at mm/page_io.c:442

    The write is under &si->lock, but the reads are done locklessly. Since the
    reads only check for a specific bit in the flags, they are harmless even
    if load tearing happens. Thus, just mark them as intentional data races
    using the data_race() macro.

    [cai@lca.pw: add a missing annotation]
    Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • There are a few information counters that are intentionally not protected
    against increment races, so just annotate them using the data_race()
    macro.

    BUG: KCSAN: data-race in __frontswap_store / __frontswap_store

    write to 0xffffffff8b7174d8 of 8 bytes by task 6396 on cpu 103:
    __frontswap_store+0x2d0/0x344
    inc_frontswap_failed_stores at mm/frontswap.c:70
    (inlined by) __frontswap_store at mm/frontswap.c:280
    swap_writepage+0x83/0xf0
    pageout+0x33e/0xae0
    shrink_page_list+0x1f57/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffffffff8b7174d8 of 8 bytes by task 6405 on cpu 47:
    __frontswap_store+0x2b9/0x344
    inc_frontswap_failed_stores at mm/frontswap.c:70
    (inlined by) __frontswap_store at mm/frontswap.c:280
    swap_writepage+0x83/0xf0
    pageout+0x33e/0xae0
    shrink_page_list+0x1f57/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    do_try_to_free_pages+0x1f7/0xa10
    try_to_free_pages+0x26c/0x5e0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1581114499-5042-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Even if KCSAN is disabled for kmemleak, update_checksum() could still call
    crc32() (which is outside of kmemleak.c) to dereference object->pointer.
    Thus, the value of object->pointer could be accessed concurrently as
    noticed by KCSAN,

    BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock

    write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
    do_raw_spin_lock+0x114/0x200
    debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
    (inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
    _raw_spin_lock+0x40/0x50
    __handle_mm_fault+0xa9e/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
    crc32_le_base+0x67/0x350
    crc32_le_base+0x67/0x350:
    crc32_body at lib/crc32.c:106
    (inlined by) crc32_le_generic at lib/crc32.c:179
    (inlined by) crc32_le at lib/crc32.c:197
    kmemleak_scan+0x528/0xd90
    update_checksum at mm/kmemleak.c:1172
    (inlined by) kmemleak_scan at mm/kmemleak.c:1497
    kmemleak_scan_thread+0xcc/0xfa
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    If a shattered value is returned due to a data race, it will be corrected
    in the next scan. Thus, let KCSAN ignore all reads in the region to
    silence the splats in case the write side is non-atomic (a sketch of this
    follows this entry).

    Suggested-by: Marco Elver
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Marco Elver
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/20200317182754.2180-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
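
    A sketch of one way to tell KCSAN to ignore the reads in such a region,
    assuming the kcsan_disable_current()/kcsan_enable_current() helpers from
    <linux/kcsan-checks.h>; the object layout is a simplified stand-in for the
    kmemleak object:

        #include <linux/crc32.h>
        #include <linux/kcsan-checks.h>

        struct tracked_object {
                void *pointer;
                size_t size;
                u32 checksum;
        };

        static bool object_checksum_changed(struct tracked_object *object)
        {
                u32 old_csum = object->checksum;

                kcsan_disable_current();        /* racy reads self-heal on the next scan */
                object->checksum = crc32(0, object->pointer, object->size);
                kcsan_enable_current();

                return object->checksum != old_csum;
        }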
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This function returns the number of bytes in a THP. It is like
    page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
    is disabled.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
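
    A sketch showing why such a helper collapses to PAGE_SIZE; thp_size() in
    <linux/huge_mm.h> is the real interface, and the stand-in below only
    illustrates the behaviour described above:

        #include <linux/mm.h>

        static inline unsigned long thp_size_sketch(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                return page_size(page);         /* PAGE_SIZE << compound order */
        #else
                return PAGE_SIZE;               /* constant-folds away */
        #endif
        }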
     
  • Patch series "THP prep patches".

    These are some generic cleanups and improvements, which I would like
    merged into mmotm soon. The first one should be a performance improvement
    for all users of compound pages, and the others are aimed at getting code
    to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (ie small
    systems). Also better documented / less confusing than the current prefix
    mixture of compound, hpage and thp.

    This patch (of 7):

    This removes a few instructions from functions which need to know how many
    pages are in a compound page. The storage used is either page->mapping on
    64-bit or page->index on 32-bit. Both of these are fine to overlay on
    tail pages.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org
    Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This reverts commit 26e7deadaae175.

    Sonny reported that one of their tests started failing on the latest
    kernel on their Chrome OS platform. The root cause is that the above
    commit removed the protection line of empty zone, while the parser used in
    the test relies on the protection line to mark the end of each zone.

    Let's revert it to avoid breaking userspace testing or applications.

    Fixes: 26e7deadaae175 ("mm/vmstat.c: do not show lowmem reserve protection information of empty zone")
    Reported-by: Sonny Rao
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: David Rientjes
    Cc: [5.8.x]
    Link: http://lkml.kernel.org/r/20200811075412.12872-1-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • This removes the code in the COW path that calls debug_dma_assert_idle(),
    which was added many years ago.

    Google shows that it hasn't caught anything in the 6+ years we've had it
    apart from a false positive, and Hugh just noticed how it had a very
    unfortunate spinlock serialization in the COW path.

    He fixed that issue in the previous commit (a85ffd59bd36: "dma-debug: fix
    debug_dma_assert_idle(), use rcu_read_lock()"), but let's see if anybody
    even notices when we remove this function entirely.

    NOTE! We keep the dma tracking infrastructure that was added by the
    commit that introduced it. Partly to make it easier to resurrect this
    debug code if we ever decide to, and partly because that tracking by pfn
    and offset looks quite reasonable.

    The problem with this debug code was simply that it was expensive and
    didn't seem worth it, not that it was wrong per se.

    Acked-by: Dan Williams
    Acked-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Aug, 2020

1 commit

  • Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
    parent cgroup") adds memory tracking to the memcg kernel structures
    themselves to make cgroups liable for the memory they are consuming
    through the allocation of child groups (which can be significant).

    This code is a bit awkward as it's spread out through several functions:
    The outermost function does memalloc_use_memcg(parent) to set up
    current->active_memcg, which designates which cgroup to charge, and the
    inner functions pass GFP_ACCOUNT to request charging for specific
    allocations. To make sure this dependency is satisfied at all times -
    to make sure we don't randomly charge whoever is calling the functions -
    the inner functions warn on !current->active_memcg.

    However, this triggers a false warning when the root memcg itself is
    allocated. No parent exists in this case, and so current->active_memcg
    is rightfully NULL. It's a false positive, not indicative of a bug.

    Delete the warnings for now, we can revisit this later.

    Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
    Signed-off-by: Johannes Weiner
    Reported-by: Stephen Rothwell
    Acked-by: Roman Gushchin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
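
    A sketch of the charging pattern described above: the outer function
    selects the cgroup to charge via memalloc_use_memcg(), and the inner
    allocation opts in with __GFP_ACCOUNT. The helper name and the kzalloc()
    call are illustrative, not the actual memcg-creation code:

        #include <linux/sched/mm.h>     /* memalloc_use_memcg() */
        #include <linux/slab.h>

        static void *alloc_child_state(struct mem_cgroup *parent, size_t size)
        {
                void *ptr;

                memalloc_use_memcg(parent);     /* NULL when creating the root memcg */
                ptr = kzalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
                memalloc_unuse_memcg();

                return ptr;
        }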
     

13 Aug, 2020

22 commits

  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Here're the last pieces of page fault accounting that were still done
    outside handle_mm_fault() where we still have regs==NULL when calling
    handle_mm_fault():

    arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault
    arch/sparc/mm/fault_32.c: force_user_fault
    arch/um/kernel/trap.c: handle_page_fault
    mm/gup.c: faultin_page
    fixup_user_fault
    mm/hmm.c: hmm_vma_fault
    mm/ksm.c: break_ksm

    Some of them have the issue of duplicated accounting for page fault
    retries. Some of them didn't do the accounting at all.

    This patch cleans all these up by letting handle_mm_fault() do per-task
    page fault accounting even if regs==NULL (though we will still skip the
    perf event accounting). With that, we can safely remove all the outliers
    now.

    There's another functional change in that we now account the page faults
    to the caller of gup, rather than to the task_struct that was passed into
    the gup code. More information on this can be found at [1].

    After this patch, the things below should never be touched again outside
    handle_mm_fault():

    - task_struct.[maj|min]_flt
    - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]

    [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm: Page fault accounting cleanups", v5.

    This is v5 of the pf accounting cleanup series. It originates from Gerald
    Schaefer's report a week ago regarding incorrect page fault accounting for
    retried page faults after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY
    for multiple times"):

    https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

    What this series did:

    - Correct page fault accounting: we do accounting for a page fault
    (no matter whether it's from #PF handling, or gup, or anything else)
    only with the one that completed the fault. For example, page fault
    retries should not be counted in the page fault counters. The same applies
    to the perf events.

    - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an adhoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g. erroneous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there are still quite a few archs that have not enabled
    this perf event.

    Since this series touches nearly all the archs, we unify this
    perf event to always follow case (1), which is the one that makes the most
    sense. And since we moved the accounting into handle_mm_fault(), the
    other two MAJ/MIN perf events are taken care of naturally.

    - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR). More information in patch 1.

    - Always account the page fault onto the one that triggered the page
    fault. This does not matter much for #PF handling, but matters mostly for
    gup. More information on this is in patch 25.

    Patchset layout:

    Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
    Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
    Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
    Patch 25: Cleanup GUP task_struct pointer since it's not needed any more

    This patch (of 25):

    This is a preparation patch to move page fault accounting into the
    general code in handle_mm_fault(). This includes both the per-task
    maj_flt/min_flt counters and the major/minor page fault perf events. To
    do this, the pt_regs pointer is passed into handle_mm_fault().

    PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
    handlers.

    So far, every pt_regs pointer passed into handle_mm_fault() is NULL, which
    means this patch should have no intended functional change (a sketch of
    the accounting step follows this entry).

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Albert Ou
    Cc: Alexander Gordeev
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: John Hubbard
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: "Luck, Tony"
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
    Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
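
    A sketch of the accounting step this series moves into handle_mm_fault();
    the helper name is illustrative and the body is a simplified stand-in for
    the real mm/memory.c code:

        #include <linux/mm.h>
        #include <linux/perf_event.h>
        #include <linux/sched.h>

        static void account_fault(struct pt_regs *regs, unsigned long address,
                                  bool major)
        {
                /*
                 * Per-task counters are always updated by the task that
                 * completed the fault.
                 */
                if (major)
                        current->maj_flt++;
                else
                        current->min_flt++;

                /*
                 * Perf events need register state; gup-style callers pass
                 * regs == NULL and skip them.
                 */
                if (!regs)
                        return;

                perf_sw_event(major ? PERF_COUNT_SW_PAGE_FAULTS_MAJ :
                                      PERF_COUNT_SW_PAGE_FAULTS_MIN,
                              1, regs, address);
        }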
     
  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • new_non_cma_page() in gup.c needs to allocate a new page that is not in
    the CMA area. new_non_cma_page() implements this by using the allocation
    scope APIs.

    However, there is a work-around for hugetlb. The normal hugetlb page
    allocation API for migration is alloc_huge_page_nodemask(). It consists of
    two steps: first, dequeuing from the pool; second, if there is no
    available page on the queue, allocating from the page allocator.

    new_non_cma_page() can't use this API since the first step (dequeue) isn't
    aware of the scope API for excluding the CMA area. So, new_non_cma_page()
    exports the hugetlb-internal function for the second step,
    alloc_migrate_huge_page(), to global scope and uses it directly. This is
    suboptimal since hugetlb pages on the queue cannot be utilized.

    This patch fixes this situation by making the hugetlb dequeue function
    CMA-aware: in the dequeue function, CMA memory is skipped if the
    PF_MEMALLOC_NOCMA flag is set (a sketch of this check follows this entry).

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
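
    A sketch of the dequeue-side check described above; the free-list walk is
    a simplified stand-in for the hugetlb pool, while PF_MEMALLOC_NOCMA and
    is_migrate_cma_page() are the real kernel interfaces:

        #include <linux/mm.h>
        #include <linux/mmzone.h>
        #include <linux/sched.h>

        static struct page *dequeue_first_usable(struct list_head *freelist)
        {
                bool skip_cma = current->flags & PF_MEMALLOC_NOCMA;
                struct page *page;

                list_for_each_entry(page, freelist, lru) {
                        /*
                         * Callers inside a NOCMA scope must not be handed
                         * CMA-backed pages.
                         */
                        if (skip_cma && is_migrate_cma_page(page))
                                continue;
                        return page;
                }
                return NULL;
        }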
     
  • We have a well-defined scope API to exclude the CMA region. Use it rather
    than manipulating gfp_mask manually. With this change, we can now restore
    __GFP_MOVABLE in gfp_mask as for a usual migration target allocation,
    which results in ZONE_MOVABLE also being searched by the page allocator.
    For hugetlb, gfp_mask is redefined since it has a regular allocation mask
    filter for migration targets. __GFP_NOWARN is added to the hugetlb
    gfp_mask filter since a new user of the filter, gup, wants to be silent
    when allocation fails.

    Note that this can be considered as a fix for the commit 9a4e9f3b2d73
    ("mm: update get_user_pages_longterm to migrate pages allocated from CMA
    region"). However, "Fixes" tag isn't added here since it is just
    suboptimal but it doesn't cause any problem.

    Suggested-by: Michal Hocko
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined standard migration target callback. Use it
    directly.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There are some similar functions for migration target allocation. Since
    there is no fundamental difference, it's better to keep just one rather
    than keeping all the variants. This patch implements a base migration
    target allocation function. In the following patches, the variants will be
    converted to use this function.

    The changes should be mechanical, but, unfortunately, there are some
    differences. First, some callers' nodemask is assigned NULL, since a NULL
    nodemask will be considered as all available nodes, that is,
    &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
    redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE
    if the user-provided gfp_mask has it. This is because a future caller of
    this function requires setting this node constraint. Lastly, if the
    provided nodeid is NUMA_NO_NODE, the nodeid is set to the node where the
    migration source lives; this helps remove simple wrappers for setting up
    the nodeid (a usage sketch follows this entry).

    Note that the PageHighmem() call in the previous function is changed to an
    open-coded "is_highmem_idx()" since that provides more readability.

    [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
    [akpm@linux-foundation.org: fix typo in comment]

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
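
    A sketch of how a caller would use the unified callback described above.
    The control-struct fields follow the description (target node, nodemask,
    gfp_mask), but treat the exact layout, and the fact that
    alloc_migration_target()/struct migration_target_control live in
    mm-internal headers, as assumptions:

        #include <linux/gfp.h>
        #include <linux/migrate.h>

        static int migrate_list_to_node(struct list_head *pages, int target_nid)
        {
                struct migration_target_control mtc = {
                        .nid = target_nid,      /* NUMA_NO_NODE: stay on the source node */
                        .nmask = NULL,          /* NULL means all of N_MEMORY */
                        .gfp_mask = GFP_USER | __GFP_MOVABLE,
                };

                /* The same callback handles base pages, THPs and hugetlb pages. */
                return migrate_pages(pages, alloc_migration_target, NULL,
                                     (unsigned long)&mtc, MIGRATE_SYNC,
                                     MR_MEMORY_HOTPLUG);
        }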
     
  • …egular THP allocations

    new_page_nodemask is a migration callback and it tries to use a common set
    of gfp flags for the target page allocation whether it is a base page or a
    THP. The latter only adds GFP_TRANSHUGE to the given mask. This results in
    the allocation being slightly more aggressive than necessary because the
    resulting gfp mask will also contain __GFP_RECLAIM_KSWAPD. THP
    allocations usually exclude this flag to reduce over-eager background
    reclaim during a high THP allocation load which has been seen during large
    mmaps initialization. There is no indication that this is a problem for
    migration as well but theoretically the same might happen when migrating
    large mappings to a different node. Make the migration callback
    consistent with regular THP allocations.

    [akpm@linux-foundation.org: fix comment typo, per Vlastimil]

    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Hellwig <hch@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Roman Gushchin <guro@fb.com>
    Link: http://lkml.kernel.org/r/1594622517-20681-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Joonsoo Kim
     
  • There is no difference between the two migration callback functions,
    alloc_huge_page_node() and alloc_huge_page_nodemask(), except for the
    __GFP_THISNODE handling. It's redundant to have two almost identical
    functions just to handle this flag. So, this patch removes one of them by
    introducing a new argument, gfp_mask, to alloc_huge_page_nodemask().

    After introducing the gfp_mask argument, it's the caller's job to provide
    the correct gfp_mask, so every call site of alloc_huge_page_nodemask() is
    changed to provide one (a call-site sketch follows this entry).

    Note that it's safe to remove the node id check in alloc_huge_page_node()
    since there is no caller passing NUMA_NO_NODE as a node id.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
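
    A sketch of a call site after this change: the caller now derives the gfp
    mask itself and passes it in. Treat the visibility of htlb_alloc_mask() at
    the call site and the helper name below as assumptions:

        #include <linux/gfp.h>
        #include <linux/hugetlb.h>

        static struct page *hugetlb_migration_target(struct hstate *h, int nid,
                                                     nodemask_t *nmask)
        {
                /* Regular hugetlb allocation mask, pinned to one node. */
                gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

                return alloc_huge_page_nodemask(h, nid, nmask, gfp_mask);
        }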
     
  • It's not a performance-sensitive function. Move it to a .c file. This is a
    preparation step for a future change.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Acked-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "clean-up the migration target allocation functions", v5.

    This patch (of 9):

    For locality, it's better to migrate the page to the same node rather than
    the node of the current caller's cpu.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1594622517-20681-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Add helpers to wrap the get_fs/set_fs magic for undoing any damage done by
    set_fs(KERNEL_DS). There is no real functional benefit, but this documents
    the intent of these calls better, and will allow stubbing the functions
    out easily for kernel builds that do not allow address space overrides in
    the future (a usage sketch follows this entry).

    [hch@lst.de: drop two incorrect hunks, fix a commit log typo]
    Link: http://lkml.kernel.org/r/20200714105505.935079-6-hch@lst.de

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Acked-by: Mark Rutland
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Cc: Nick Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Link: http://lkml.kernel.org/r/20200710135706.537715-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
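
    A sketch of how such helpers are used; the force_uaccess_begin()/
    force_uaccess_end() names are taken from this series, but treat them, and
    the example itself, as assumptions rather than the exact API:

        #include <linux/uaccess.h>

        static int read_user_word(unsigned long __user *uptr, unsigned long *val)
        {
                /*
                 * Force the user address-space limit, undoing any lingering
                 * set_fs(KERNEL_DS), and remember the old one.
                 */
                mm_segment_t old_fs = force_uaccess_begin();
                int ret = get_user(*val, uptr);

                force_uaccess_end(old_fs);
                return ret;
        }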
     
  • Change "as as" to "as a".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-16-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "if".
    Fix subject/verb agreement.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-15-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "marked".
    Change "time time" to "same time".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-13-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "and".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-12-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "them" and "that".
    Change "the the" to "to the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "that" in two places.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-9-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap