04 May, 2017

40 commits

  • Moves the page description after the stacks, since it's less important.

    Link: http://lkml.kernel.org/r/20170302134851.101218-8-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Changes slab object description from:

    Object at ffff880068388540, in cache kmalloc-128 size: 128

    to:

    The buggy address belongs to the object at ffff880068388540
    which belongs to the cache kmalloc-128 of size 128
    The buggy address is located 123 bytes inside of
    128-byte region [ffff880068388540, ffff8800683885c0)

    Makes it more explanatory and adds information about the relative
    offset of the accessed address from the start of the object.
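
    For instance, the 123 in the sample comes from plain pointer
    arithmetic; a hedged user-space sketch (the helper name is made up):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the new report text: the "located N bytes
 * inside of" figure is just the difference between the buggy address and
 * the object's start; the size only feeds the printed
 * [start, start + size) region. */
static long offset_inside_object(uintptr_t obj_start, size_t obj_size,
                                 uintptr_t access_addr)
{
    (void)obj_size;
    return (long)(access_addr - obj_start);
}
```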

    Link: http://lkml.kernel.org/r/20170302134851.101218-7-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Change report header format from:

    BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0 at addr ffff880069437950
    Read of size 8 by task insmod/3925

    to:

    BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0
    Read of size 8 at addr ffff880069437950 by task insmod/3925

    The exact access address is not usually important, so move it to the
    second line. This also makes the header look visually balanced.

    Link: http://lkml.kernel.org/r/20170302134851.101218-6-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Simplify logic for describing a memory address. Add addr_to_page()
    helper function.

    Makes the code easier to follow.

    Link: http://lkml.kernel.org/r/20170302134851.101218-5-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Change stack traces headers from:

    Allocated:
    PID = 42

    to:

    Allocated by task 42:

    Makes the report one line shorter and look better.

    Link: http://lkml.kernel.org/r/20170302134851.101218-4-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Unify KASAN report header format for different kinds of bad memory
    accesses. Makes the code simpler.

    Link: http://lkml.kernel.org/r/20170302134851.101218-3-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Patch series "kasan: improve error reports", v2.

    This patchset improves KASAN reports by making them easier to read
    and a little more detailed. It also improves the readability of
    mm/kasan/report.c.

    Effectively changes a use-after-free report to:

    ==================================================================
    BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan]
    Write of size 1 at addr ffff88006aa59da8 by task insmod/3951

    CPU: 1 PID: 3951 Comm: insmod Tainted: G B 4.10.0+ #84
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x292/0x398
    print_address_description+0x73/0x280
    kasan_report.part.2+0x207/0x2f0
    __asan_report_store1_noabort+0x2c/0x30
    kmalloc_uaf+0xaa/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: 0033:0x7f22cfd0b9da
    RSP: 002b:00007ffe69118a78 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
    RAX: ffffffffffffffda RBX: 0000555671242090 RCX: 00007f22cfd0b9da
    RDX: 00007f22cffcaf88 RSI: 000000000004df7e RDI: 00007f22d0399000
    RBP: 00007f22cffcaf88 R08: 0000000000000003 R09: 0000000000000000
    R10: 00007f22cfd07d0a R11: 0000000000000206 R12: 0000555671243190
    R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004

    Allocated by task 3951:
    save_stack_trace+0x16/0x20
    save_stack+0x43/0xd0
    kasan_kmalloc+0xad/0xe0
    kmem_cache_alloc_trace+0x82/0x270
    kmalloc_uaf+0x56/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Freed by task 3951:
    save_stack_trace+0x16/0x20
    save_stack+0x43/0xd0
    kasan_slab_free+0x72/0xc0
    kfree+0xe8/0x2b0
    kmalloc_uaf+0x85/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc

    The buggy address belongs to the object at ffff88006aa59da0
    which belongs to the cache kmalloc-16 of size 16
    The buggy address is located 8 bytes inside of
    16-byte region [ffff88006aa59da0, ffff88006aa59db0)
    The buggy address belongs to the page:
    page:ffffea0001aa9640 count:1 mapcount:0 mapping: (null) index:0x0
    flags: 0x100000000000100(slab)
    raw: 0100000000000100 0000000000000000 0000000000000000 0000000180800080
    raw: ffffea0001abe380 0000000700000007 ffff88006c401b40 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88006aa59c80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
    ffff88006aa59d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
    >ffff88006aa59d80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    ^
    ffff88006aa59e00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    ffff88006aa59e80: fb fb fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
    ==================================================================

    from:

    ==================================================================
    BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan] at addr ffff88006c4dcb28
    Write of size 1 by task insmod/3984
    CPU: 1 PID: 3984 Comm: insmod Tainted: G B 4.10.0+ #83
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x292/0x398
    kasan_object_err+0x1c/0x70
    kasan_report.part.1+0x20e/0x4e0
    __asan_report_store1_noabort+0x2c/0x30
    kmalloc_uaf+0xaa/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: 0033:0x7feca0f779da
    RSP: 002b:00007ffdfeae5218 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
    RAX: ffffffffffffffda RBX: 000055a064c13090 RCX: 00007feca0f779da
    RDX: 00007feca1236f88 RSI: 000000000004df7e RDI: 00007feca1605000
    RBP: 00007feca1236f88 R08: 0000000000000003 R09: 0000000000000000
    R10: 00007feca0f73d0a R11: 0000000000000206 R12: 000055a064c14190
    R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004
    Object at ffff88006c4dcb20, in cache kmalloc-16 size: 16
    Allocated:
    PID = 3984
    save_stack_trace+0x16/0x20
    save_stack+0x43/0xd0
    kasan_kmalloc+0xad/0xe0
    kmem_cache_alloc_trace+0x82/0x270
    kmalloc_uaf+0x56/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    Freed:
    PID = 3984
    save_stack_trace+0x16/0x20
    save_stack+0x43/0xd0
    kasan_slab_free+0x73/0xc0
    kfree+0xe8/0x2b0
    kmalloc_uaf+0x85/0xb6 [test_kasan]
    kmalloc_tests_init+0x4f/0xa48 [test_kasan]
    do_one_initcall+0xf3/0x390
    do_init_module+0x215/0x5d0
    load_module+0x54de/0x82b0
    SYSC_init_module+0x3be/0x430
    SyS_init_module+0x9/0x10
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    Memory state around the buggy address:
    ffff88006c4dca00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    ffff88006c4dca80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    >ffff88006c4dcb00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    ^
    ffff88006c4dcb80: fb fb fc fc 00 00 fc fc fb fb fc fc fb fb fc fc
    ffff88006c4dcc00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
    ==================================================================

    This patch (of 9):

    Introduce get_shadow_bug_type() function, which determines bug type
    based on the shadow value for a particular kernel address. Introduce
    get_wild_bug_type() function, which determines bug type for addresses
    which don't have a corresponding shadow value.
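
    The mapping amounts to a small lookup on the shadow byte. A hedged
    sketch (the poison values follow the shadow encodings visible in the
    sample reports, where fb marks freed kmalloc memory and fc marks a
    kmalloc redzone; treat the exact set as illustrative):

```c
/* Illustrative shadow encodings: fb/fc appear in the sample reports as
 * freed-object and kmalloc-redzone bytes respectively. */
#define KASAN_KMALLOC_FREE    0xFB
#define KASAN_KMALLOC_REDZONE 0xFC
#define KASAN_PAGE_REDZONE    0xFE
#define KASAN_FREE_PAGE       0xFF

/* Simplified analogue of get_shadow_bug_type(): shadow bytes 0..7 mean a
 * partially accessible 8-byte granule, i.e. an access past the valid
 * prefix; poison values name the specific bug type. */
static const char *shadow_to_bug_type(unsigned char shadow)
{
    if (shadow < 8)
        return "out-of-bounds";
    switch (shadow) {
    case KASAN_KMALLOC_REDZONE:
    case KASAN_PAGE_REDZONE:
        return "slab-out-of-bounds";
    case KASAN_KMALLOC_FREE:
    case KASAN_FREE_PAGE:
        return "use-after-free";
    default:
        return "unknown-crash";
    }
}
```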

    Link: http://lkml.kernel.org/r/20170302134851.101218-2-andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • The memory error handler calls try_to_unmap() for error pages in
    various states. If the error page is a mlocked page, error handling
    can fail with a "still referenced by 1 users" message. This is because
    the page gets linked to and stays in the lru cache after the following
    call chain:

    try_to_unmap_one
    page_remove_rmap
    clear_page_mlock
    putback_lru_page
    lru_cache_add

    memory_failure() calls shake_page() to handle a similar issue, but
    the current code doesn't cover this case because shake_page() is
    called only before try_to_unmap(). So this patch adds another
    shake_page() call.

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • shake_page() is called before going into the core error handling code
    in order to ensure that the error page is flushed from the lru_cache
    lists, where pages stay while being transferred among LRU lists.

    But currently it's not fully functional: when the page is linked to
    the lru_cache by activate_page(), its PageLRU flag is set and
    shake_page() is skipped. The result is that error handling fails with
    the "still referenced by 1 users" message.

    When the page is linked to lru_cache by isolate_lru_page(), its PageLRU
    is clear, so that's fine.

    This patch makes shake_page() unconditionally called to avoid the
    failure.

    Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
    Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
    Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: kernel test robot
    Cc: Xiaolong Ye
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • In swapcache_free_entries(), if swap_info_get_cont() returns NULL,
    something is wrong with the swap entry. But we should still continue
    to free the following swap entries in the array instead of skipping
    them, to avoid a swap space leak. This is only a problem in an error
    path, where the system may already be in an inconsistent state, but it
    is still good to fix it.
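
    The control-flow fix boils down to continue-ing past the bad entry
    instead of abandoning the rest of the array. A toy sketch under
    invented names (not the real swap structures):

```c
#include <stddef.h>

/* Toy model of the fix: when one entry in the array cannot be resolved,
 * skip only that entry and keep freeing the rest, rather than bailing
 * out and leaking the remaining swap slots. All names are invented. */
struct toy_entry { int valid; int freed; };

/* stand-in for swap_info_get_cont(): NULL means "something is wrong" */
static struct toy_entry *toy_info_get(struct toy_entry *e)
{
    return e->valid ? e : NULL;
}

static int toy_free_entries(struct toy_entry *entries, size_t n)
{
    int freed = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_entry *e = toy_info_get(&entries[i]);

        if (!e)
            continue; /* the fix: skip this entry only, don't stop */
        e->freed = 1;
        freed++;
    }
    return freed;
}
```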

    Link: http://lkml.kernel.org/r/20170421124739.24534-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Tim Chen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • MIPS just got changed to only accept a pointer argument for access_ok(),
    causing one warning in drivers/scsi/pmcraid.c. I tried changing x86 the
    same way and found the same warning in __get_user_pages_fast() and
    nowhere else in the kernel during randconfig testing:

    mm/gup.c: In function '__get_user_pages_fast':
    mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion]

    It would probably be a good idea to enforce type-safety in general, so
    let's change this file to not cause a warning if we do that.

    I don't know why the warning did not appear on MIPS.

    Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
    Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: Alexander Viro
    Acked-by: Ingo Molnar
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Lorenzo Stoakes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • truncate_inode_pages_range() and invalidate_inode_pages2_range()
    called cleancache_invalidate_inode() twice - on entry and on exit.
    That's pointless and a waste of time; it's enough to call it once, at
    exit.

    Link: http://lkml.kernel.org/r/20170424164135.22350-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • If the mapping is empty (both ->nrpages and ->nrexceptional are
    zero) we can avoid pointless lookups in the empty radix tree and bail
    out immediately after the cleancache invalidation.

    Link: http://lkml.kernel.org/r/20170424164135.22350-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • invalidate_bdev() calls cleancache_invalidate_inode() iff
    ->nrpages != 0, which doesn't make any sense.

    Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
    regardless of mapping->nrpages value.

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Patch series "Properly invalidate data in the cleancache", v2.

    We've noticed that after a direct IO write, a buffered read sometimes
    gets stale data coming from the cleancache. The reason for this is
    that some direct write hooks call invalidate_inode_pages2[_range]()
    conditionally, iff mapping->nrpages is not zero, so we may not
    invalidate data in the cleancache.

    Another odd thing is that we check only ->nrpages and don't check
    ->nrexceptional, but invalidate_inode_pages2[_range]() invalidates
    exceptional entries as well. So we invalidate exceptional entries only
    if ->nrpages != 0? This doesn't feel right.

    - Patch 1 fixes direct IO writes by removing ->nrpages check.
    - Patch 2 fixes similar case in invalidate_bdev().
    Note: I only fixed the conditional cleancache_invalidate_inode() here.
    Do we also need to add a ->nrexceptional check into invalidate_bdev()?

    - Patches 3-4: some optimizations.

    This patch (of 4):

    Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
    conditionally, iff mapping->nrpages is not zero. This can't be right,
    because invalidate_inode_pages2[_range]() also invalidates data in the
    cleancache via the cleancache_invalidate_inode() call. So if the page
    cache is empty but there is some data in the cleancache, a buffered
    read after a direct IO write would get stale data from the
    cleancache.

    Also it doesn't feel right to check only for ->nrpages because
    invalidate_inode_pages2[_range] invalidates exceptional entries as well.

    Fix this by calling invalidate_inode_pages2[_range]() regardless of
    nrpages state.

    Note: nfs, cifs and 9p don't need a similar fix because they never
    call cleancache_get_page() (neither directly nor via
    mpage_readpage[s]()), so they are not affected by this bug.
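
    The failure mode can be modeled in a few lines. A toy sketch with
    made-up structures (nothing here is the real page cache API): skipping
    invalidation when nrpages == 0 leaves a stale cleancache copy for the
    next buffered read.

```c
#include <string.h>

/* Toy model: a mapping with an nrpages counter and a cleancache
 * side-copy. If direct-IO invalidation is skipped when nrpages == 0, a
 * later buffered read can return the stale cleancache copy instead of
 * what the direct write put on disk. */
struct toy_mapping {
    int nrpages;
    int cleancache_valid;
    char cleancache[16];
};

static void toy_invalidate(struct toy_mapping *m)
{
    m->nrpages = 0;
    m->cleancache_valid = 0; /* cleancache_invalidate_inode() analogue */
}

/* direct IO write path; 'fixed' drops the bogus ->nrpages check */
static void toy_dio_write(struct toy_mapping *m, int fixed)
{
    if (fixed || m->nrpages)
        toy_invalidate(m);
    /* the data itself goes straight to "disk", not modeled here */
}

static const char *toy_buffered_read(struct toy_mapping *m, const char *disk)
{
    if (m->cleancache_valid)
        return m->cleancache; /* stale hit when invalidation was skipped */
    return disk;
}
```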

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • In the page_same_filled() function, every element in the page is
    compared with the value at the next index: the current comparison
    routine compares the (i)th and (i+1)th values of the page.

    In this case, two load operations occur for each comparison. But if
    we store the first value of the page in a 'val' variable and use it
    to compare with the others, only one load per comparison remains.
    This reduces the load operations per page by up to 64 times.
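
    The idea can be sketched in plain C (illustrative, not zram's actual
    code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the optimization: the old loop compared page[i] with
 * page[i + 1], two loads per step; the new one loads the first word once
 * into 'val' (likely kept in a register) and compares everything against
 * it, one load per step. */
static bool page_same_filled(const unsigned long *page, size_t nwords)
{
    unsigned long val = page[0];

    for (size_t i = 1; i < nwords; i++)
        if (page[i] != val)
            return false;
    return true;
}
```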

    Link: http://lkml.kernel.org/r/1488428104-7257-1-git-send-email-sangwoo2.park@lge.com
    Signed-off-by: Sangwoo Park
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sangwoo Park
     
  • zram_free_page() already handles the NULL handle case and the
    same-page case, so use it to reduce the chance of errors. (Actually,
    I made a mistake when I implemented the same-page feature.)

    Link: http://lkml.kernel.org/r/1492052365-16169-7-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • With the element field, I sometimes confused handle access and
    element access. It might just be me, but I think it's time to
    introduce an accessor to prevent such mistakes in the future. This
    patch is just a clean-up, so it shouldn't change any behavior.

    Link: http://lkml.kernel.org/r/1492052365-16169-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • It's redundant now. Instead, remove it and use the zram structure
    directly.

    Link: http://lkml.kernel.org/r/1492052365-16169-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • With this clean-up phase, I want to use zram's wrapper function to
    lock table access, which is more consistent with zram's other
    functions.

    Link: http://lkml.kernel.org/r/1492052365-16169-4-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • For architectures with PAGE_SIZE > 4K, zram has supported partial
    IO. However, the mixed code for handling normal and partial IO is too
    messy and error-prone to extend with upcoming features, so this patch
    aims to clean up zram's IO handling functions.
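
    Partial IO here means a request smaller than zram's page granularity,
    handled as read-modify-write. A toy sketch (invented names, tiny page
    size for illustration):

```c
#include <string.h>

#define TOY_PAGE 8 /* toy granularity; the real case is PAGE_SIZE > 4K */

/* Sketch of partial IO: a write that covers only part of a page is
 * handled as read-modify-write, so the backing store always sees whole
 * pages. */
static void toy_partial_write(char *backing, const char *src,
                              size_t offset, size_t len)
{
    char page[TOY_PAGE];

    memcpy(page, backing, TOY_PAGE);  /* read the whole page */
    memcpy(page + offset, src, len);  /* patch in the written bytes */
    memcpy(backing, page, TOY_PAGE);  /* write the whole page back */
}
```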

    Link: http://lkml.kernel.org/r/1492052365-16169-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "zram clean up", v2.

    This patchset aims to clean up zram.

    [1] clean up multiple pages's bvec handling.
    [2] clean up partial IO handling
    [3-6] clean up zram via using accessor and removing pointless structure.

    With [2-6] applied, we save a few hundred bytes and greatly improve
    readability.

    x86: 708 byte save

    add/remove: 1/1 grow/shrink: 0/11 up/down: 478/-1186 (-708)
    function old new delta
    zram_special_page_read - 478 +478
    zram_reset_device 317 314 -3
    mem_used_max_store 131 128 -3
    compact_store 96 93 -3
    mm_stat_show 203 197 -6
    zram_add 719 712 -7
    zram_slot_free_notify 229 214 -15
    zram_make_request 819 803 -16
    zram_meta_free 128 111 -17
    zram_free_page 180 151 -29
    disksize_store 432 361 -71
    zram_decompress_page.isra 504 - -504
    zram_bvec_rw 2592 2080 -512
    Total: Before=25350773, After=25350065, chg -0.00%

    ppc64: 231 byte save

    add/remove: 2/0 grow/shrink: 1/9 up/down: 681/-912 (-231)
    function old new delta
    zram_special_page_read - 480 +480
    zram_slot_lock - 200 +200
    vermagic 39 40 +1
    mm_stat_show 256 248 -8
    zram_meta_free 200 184 -16
    zram_add 944 912 -32
    zram_free_page 348 308 -40
    disksize_store 572 492 -80
    zram_decompress_page 664 564 -100
    zram_slot_free_notify 292 160 -132
    zram_make_request 1132 1000 -132
    zram_bvec_rw 2768 2396 -372
    Total: Before=17565825, After=17565594, chg -0.00%

    This patch (of 6):

    Johannes Thumshirn reported that the system panics when using an NVMe
    over Fabrics loopback target with zram.

    The reason is that zram expects each bvec in a bio to contain a single
    page, but nvme can attach a huge bulk of pages to the bio's bvec, so
    zram's index arithmetic can go wrong and the resulting out-of-bounds
    access panics the system.

    [1] in mainline solved the problem by limiting max_sectors with
    SECTORS_PER_PAGE, but that makes zram slow because the bio must be
    split into single pages. This patch instead makes zram aware of
    multiple pages in a bvec, solving the problem without any regression
    (i.e., no bio split).

    [1] 0bc315381fe9, zram: set physical queue limits to avoid array out of
    bounds accesses
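
    The per-page walk over a multi-page bvec can be sketched like this
    (toy types, not the kernel's struct bio_vec):

```c
#define TOY_PAGE_SIZE 4096UL

/* A bvec may now span several pages, so the handler walks it in chunks
 * that never cross a page boundary instead of assuming one page per
 * bvec. Returns the number of per-page segments processed. */
struct toy_bvec { unsigned long offset; unsigned long len; };

static int toy_for_each_page_segment(const struct toy_bvec *bv)
{
    unsigned long off = bv->offset, left = bv->len;
    int segments = 0;

    while (left) {
        /* bytes until the end of the current page */
        unsigned long chunk = TOY_PAGE_SIZE - (off % TOY_PAGE_SIZE);

        if (chunk > left)
            chunk = left;
        /* ...per-page read/write would use (off, chunk) here... */
        off += chunk;
        left -= chunk;
        segments++;
    }
    return segments;
}
```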

    Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox
    Signed-off-by: Minchan Kim
    Reported-by: Johannes Thumshirn
    Tested-by: Johannes Thumshirn
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sergey Senozhatsky
    Cc: Hannes Reinecke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit c0a32fc5a2e4 ("mm: more intensive memory corruption debugging")
    changed to check debug_guardpage_minorder() > 0 when reporting
    allocation failures. The reasoning was

    When we use guard page to debug memory corruption, it shrinks
    available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
    value. In such case memory allocation failures can be common and
    printing errors can flood dmesg. If somebody is debugging corruption,
    allocation failures are not the things he/she is interested in.

    but this is misguided.

    Allocation requests with __GFP_NOWARN flag by definition do not cause
    flooding of allocation failure messages. Allocation requests with
    __GFP_NORETRY flag likely also have __GFP_NOWARN flag. Costly
    allocation requests likely also have __GFP_NOWARN flag.

    Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
    __GFP_NOWARN flag or __GFP_HIGH flag. Non-costly allocation requests
    with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
    small to fail" memory-allocation rule.

    Therefore, as a whole, shrinking available pages via the
    debug_guardpage_minorder= kernel boot parameter might cause flooding
    of OOM killer messages, but is unlikely to cause flooding of
    allocation failure messages. Let's remove the
    debug_guardpage_minorder() > 0 check, which is likely pointless.

    Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J . Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • It helps to provide the page flag description along with the raw
    value in error paths during the soft offline process. From sample
    experiments:

    Before the patch:

    soft offline: 0x6100: migration failed 1, type 3ffff800008018
    soft offline: 0x7400: migration failed 1, type 3ffff800008018

    After the patch:

    soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
    soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
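
    Decoding the raw type value into that suffix is a straightforward
    bit-to-name walk. A hedged sketch (bit positions are invented for
    illustration; the kernel uses its real enum pageflags):

```c
#include <string.h>

/* Invented bit positions; the point is only the bit-to-name walk that
 * produces the "(uptodate|dirty|head)" suffix. */
#define TOY_PG_UPTODATE (1UL << 0)
#define TOY_PG_DIRTY    (1UL << 1)
#define TOY_PG_HEAD     (1UL << 2)

static void toy_flags_to_str(unsigned long flags, char *buf, size_t len)
{
    static const struct { unsigned long bit; const char *name; } tab[] = {
        { TOY_PG_UPTODATE, "uptodate" },
        { TOY_PG_DIRTY,    "dirty" },
        { TOY_PG_HEAD,     "head" },
    };

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(tab) / sizeof(tab[0]); i++) {
        if (!(flags & tab[i].bit))
            continue;
        if (buf[0])
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, tab[i].name, len - strlen(buf) - 1);
    }
}
```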

    Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • madvise_behavior_valid() should be called before acting upon the
    behavior parameter, hence move the function up. This also includes
    the MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior
    parameters for the madvise() system call.
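
    The shape of the fix, a validity check before any work plus the two
    injection behaviors accepted, can be sketched as (constants are
    stand-ins, not the real MADV_* values):

```c
#include <stdbool.h>

/* Stand-in constants: the point is the ordering, the behavior argument
 * is validated before anything else, and the error-injection behaviors
 * are now accepted as valid. */
enum { TOY_MADV_NORMAL, TOY_MADV_HWPOISON, TOY_MADV_SOFT_OFFLINE };
#define TOY_EINVAL 22

static bool toy_behavior_valid(int behavior)
{
    switch (behavior) {
    case TOY_MADV_NORMAL:
    case TOY_MADV_HWPOISON:      /* now treated as valid */
    case TOY_MADV_SOFT_OFFLINE:  /* now treated as valid */
        return true;
    default:
        return false;
    }
}

static int toy_madvise(int behavior)
{
    if (!toy_behavior_valid(behavior))
        return -TOY_EINVAL; /* reject up front, before acting on it */
    /* ...dispatch to the actual handlers here... */
    return 0;
}
```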

    Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: David Rientjes
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called
    through madvise() system call.

    * madvise_memory_failure() was a misleading name for a function that
    accommodates both memory_failure() and soft_offline_page(). Basically
    it handles memory error injection from user space, which can go
    either way, as a memory failure or a soft offline. Renamed it
    madvise_inject_error() instead.

    * Renamed the struct page pointer 'p' to 'page'.

    * pr_info() was essentially printing a PFN value but said 'page',
    which was misleading. Made the process virtual address explicit.

    Before the patch:

    Soft offlining page 0x15e3e at 0x3fff8c230000
    Soft offlining page 0x1f3 at 0x3fffa0da0000
    Soft offlining page 0x744 at 0x3fff7d200000
    Soft offlining page 0x1634d at 0x3fff95e20000
    Soft offlining page 0x16349 at 0x3fff95e30000
    Soft offlining page 0x1d6 at 0x3fff9e8b0000
    Soft offlining page 0x5f3 at 0x3fff91bd0000

    Injecting memory failure for page 0x15c8b at 0x3fff83280000
    Injecting memory failure for page 0x16190 at 0x3fff83290000
    Injecting memory failure for page 0x740 at 0x3fff9a2e0000
    Injecting memory failure for page 0x741 at 0x3fff9a2f0000

    After the patch:

    Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
    Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
    Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
    Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
    Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
    Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
    Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
    Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000

    Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
    Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
    Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
    Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
    Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
    Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
    Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000

    Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Add a brief overview of hugetlbfs reservation design and
    implementation as an aid to those making code modifications in this
    area.

    Link: http://lkml.kernel.org/r/1491586995-13085-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Hillf Danton
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This is a code cleanup patch with no functionality changes. There
    are two unused function prototypes in swap.h; they are removed.

    Link: http://lkml.kernel.org/r/20170405071017.23677-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items.

    This increases the size of the event array, but we'll eventually want
    most of the VM events tracked on a per-cgroup basis anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, so drop the @nr parameter and rename
    the function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.
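
    The balancing decision can be sketched as follows (a simplified
    userspace model; the real logic lives in inactive_list_is_low() in
    mm/vmscan.c, and the field names and the fixed 2x ratio here are
    illustrative):

```c
#include <stdbool.h>

/* Sketch of refault-driven inactive:active balancing.  The struct
 * and field names are illustrative only. */
struct lruvec_sample {
	unsigned long refaults_then;	/* refaults at last snapshot */
	unsigned long refaults_now;	/* current refault counter   */
	unsigned long inactive;
	unsigned long active;
};

static bool inactive_list_is_low(const struct lruvec_sample *l)
{
	/* Refaults observed since the last snapshot: the workingset is
	 * changing, so stop protecting the active list and let the
	 * inactive list grow until the new set has established itself. */
	if (l->refaults_now != l->refaults_then)
		return true;

	/* Stable period: keep the inactive list small relative to the
	 * active list, so streaming IO cannot push out the workingset. */
	return l->inactive * 2 < l->active;
}
```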

    The workloads that were measurably affected for us were hit pretty badly
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 091d0d55b286 ("shm: fix null pointer deref when userspace
    specifies invalid hugepage size") had replaced MAP_HUGE_MASK with
    SHM_HUGE_MASK. Though both contain the same numeric value of 0x3f,
    the MAP_HUGE_MASK flag is the more appropriate one in this context.
    Hence change it back.
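
    For reference, the mask extracts the huge page size, encoded as
    log2 of the size, from the high bits of the flags. The constants
    below match the uapi mman headers; the helper function is a
    hypothetical illustration of how the mask is applied:

```c
/* Constants as in the uapi mman headers. */
#define MAP_HUGE_SHIFT	26
#define MAP_HUGE_MASK	0x3f

/* Hypothetical helper: extract log2(hugepagesize) from the flags
 * passed to mmap()/shmget(). */
static unsigned int huge_page_shift_of(unsigned long flags)
{
	return (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;
}
```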

    Link: http://lkml.kernel.org/r/20170404045635.616-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Matthew Wilcox
    Acked-by: Balbir Singh
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Tetsuo has reported that the sysrq-triggered OOM killer will print
    misleading information when no tasks are selected:

    sysrq: SysRq : Manual OOM execution
    Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child
    Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB
    sysrq: SysRq : Manual OOM execution
    Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child
    Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled

    The real reason is that there are no eligible tasks for the OOM killer
    to select, but since commit 7c5f64f84483 ("mm: oom: deduplicate victim
    selection code for memcg and global oom") the semantics of out_of_memory
    have changed without moom_callback being updated.

    This patch updates moom_callback to report that no task was eligible,
    which is the case both when the OOM killer is disabled and when there
    are no eligible tasks. To help distinguish the first case from the
    second, add a printk to both oom_killer_{enable,disable}. This
    information is useful on its own, because it might help with debugging
    potential memory allocation failures.
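
    A toy userspace model of the resulting control flow (the function
    bodies are heavily simplified; only the overall shape and the
    "ignored" message follow the patch):

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified model: out_of_memory() returns false both when the OOM
 * killer is disabled and when no task is eligible.  The enable and
 * disable transitions now log, so the two cases can be told apart
 * from the surrounding messages. */
static bool oom_killer_disabled;
static bool have_eligible_task;

static void oom_killer_disable(void)
{
	oom_killer_disabled = true;
	printf("oom: OOM killer disabled\n");
}

static void oom_killer_enable(void)
{
	oom_killer_disabled = false;
	printf("oom: OOM killer enabled\n");
}

static bool out_of_memory(void)
{
	if (oom_killer_disabled)
		return false;
	return have_eligible_task;	/* false: nothing to kill */
}

/* sysrq handler: one message covers both "killer disabled" and
 * "nothing eligible"; the log lines above disambiguate. */
static const char *moom_callback(void)
{
	if (!out_of_memory())
		return "OOM request ignored. No task eligible";
	return "task killed";
}
```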

    Fixes: 7c5f64f84483 ("mm: oom: deduplicate victim selection code for memcg and global oom")
    Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently, selftest for userfaultfd is compiled three times: for
    anonymous, shared and hugetlb memory. Let's combine all the cases into
    a single executable which will have a command line option for selection
    of the test type.
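
    The selection can be sketched like this (the type names below are
    illustrative; see the selftest's usage output for the actual
    options):

```c
#include <string.h>

/* Sketch of dispatching one selftest binary on a command line
 * argument instead of building three separate executables. */
enum test_type { TEST_ANON, TEST_SHMEM, TEST_HUGETLB, TEST_INVALID };

static enum test_type parse_test_type(const char *arg)
{
	if (strcmp(arg, "anon") == 0)
		return TEST_ANON;
	if (strcmp(arg, "shmem") == 0)
		return TEST_SHMEM;
	if (strcmp(arg, "hugetlb") == 0)
		return TEST_HUGETLB;
	return TEST_INVALID;	/* caller prints usage and exits */
}
```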

    Link: http://lkml.kernel.org/r/1490869741-5913-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Fix variable name error in comments. No code changes.

    Link: http://lkml.kernel.org/r/20170403161655.5081-1-haolee.swjtu@gmail.com
    Signed-off-by: Hao Lee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hao Lee
     
  • Add a warning diagnostic for the user if we failed to allocate the
    swap slots cache and use it.

    [akpm@linux-foundation.org: use WARN_ONCE return value, fix grammar in message]
    Link: http://lkml.kernel.org/r/20170328234827.GA10107@linux.intel.com
    Signed-off-by: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • It is preferred, and the rest of migrate.h gets it right.

    Link: http://lkml.kernel.org/r/1490336009-8024-1-git-send-email-pushkar.iit@gmail.com
    Signed-off-by: Pushkar Jambhlekar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pushkar Jambhlekar
     
  • On SPARSEMEM systems page poisoning is enabled after buddy is up,
    because of the dependency on page extension init. This causes the pages
    released by free_all_bootmem not to be poisoned. This either delays or
    misses the identification of some issues because the pages have to
    undergo another cycle of alloc-free-alloc for any corruption to be
    detected.

    Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
    flag. Since all the free pages will now be poisoned, the flag need not
    be verified before checking the poison during an alloc.
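
    The check becomes unconditional, along these lines (a userspace
    model; the 0xaa pattern matches the kernel's PAGE_POISON, while the
    page size is shrunk for illustration):

```c
#include <string.h>
#include <stdbool.h>

#define PAGE_POISON	0xaa	/* free-page poison pattern */
#define MODEL_PAGE_SIZE	64	/* tiny "page" for illustration */

/* With early poisoning every free page is poisoned, so no
 * per-page PAGE_EXT_DEBUG_POISON flag is needed: allocation can
 * check the pattern unconditionally. */
static void poison_page(unsigned char *page)
{
	memset(page, PAGE_POISON, MODEL_PAGE_SIZE);
}

static bool check_poison(const unsigned char *page)
{
	for (size_t i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page[i] != PAGE_POISON)
			return false;	/* corrupted while free */
	return true;
}
```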

    [vinmenon@codeaurora.org: fix Kconfig]
    Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Laura Abbott
    Tested-by: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon