22 Aug, 2015

2 commits

  • Introduce a generic kasan_populate_zero_shadow(shadow_start,
    shadow_end). This function maps kasan_zero_page to the
    [shadow_start, shadow_end] address range.

    This replaces the x86_64-specific populate_zero_shadow() and will
    be used for ARM64 in follow-on patches.

    The main changes from the original version are:

    * Use p?d_populate*() instead of set_p?d()
    * Use the memblock allocator directly instead of vmemmap_alloc_block()
    * Use __pa() instead of __pa_nodebug(). __pa() causes trouble only if
    we use it before kasan_early_init(); kasan_populate_zero_shadow()
    will be used later, so we are fine with __pa() here.
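
    For reference, the generic interface is a single function; a minimal
    sketch of its declaration and intended use (the call site below is a
    hypothetical illustration of how an architecture would use it):

        void __init kasan_populate_zero_shadow(const void *shadow_start,
                                               const void *shadow_end);

        /* e.g. map the zero shadow over the shadow of a memory region: */
        kasan_populate_zero_shadow(kasan_mem_to_shadow(start),
                                   kasan_mem_to_shadow(end));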

    Signed-off-by: Andrey Ryabinin
    Acked-by: Catalin Marinas
    Cc: Alexander Potapenko
    Cc: Alexey Klimov
    Cc: Andrew Morton
    Cc: Aneesh Kumar K.V
    Cc: Arnd Bergmann
    Cc: David Keitel
    Cc: Dmitry Vyukov
    Cc: Linus Torvalds
    Cc: Linus Walleij
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yury
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1439444244-26057-3-git-send-email-ryabinin.a.a@gmail.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
            skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping
    to NULL and leave the page->index value alone. Because the two fields
    share a union, a non-zero page->index will then be interpreted as a
    true page->pfmemalloc.

    So the assumption is invalid if the networking code can see such a page,
    and it seems it can. We have encountered this with an NFS-over-loopback
    setup when such a page is attached to a new skbuff. There is no copying
    going on in this case, so the page confuses __skb_fill_page_desc, which
    interprets the index as the pfmemalloc flag, and the network stack drops
    packets that have been allocated using the reserves unless they are to
    be queued on sockets handling the swapping, which is the case here. That
    leads to hangs when the NFS client waits for a response from the server
    which has been dropped and thus never arrives.

    The struct page is already heavily packed, so rather than finding another
    hole to put it in, let's do a trick instead: we can reuse the index
    field but set it to an impossible value (-1UL). This is the page
    index, so it should never hold a value that large. Replace all direct
    users of page->pfmemalloc with page_is_pfmemalloc(), which hides this
    nastiness from unspoiled eyes.

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).
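
    A sketch of the helpers this describes, assuming the index/pfmemalloc
    union sharing noted above:

        static inline void set_page_pfmemalloc(struct page *page)
        {
                page->index = -1UL;     /* impossible page index = the flag */
        }

        static inline void clear_page_pfmemalloc(struct page *page)
        {
                page->index = 0;
        }

        static inline bool page_is_pfmemalloc(struct page *page)
        {
                /* a real page index can never be this large */
                return page->index == -1UL;
        }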

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Cc: [3.6+]
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2015

6 commits

  • cma_bitmap_maxno() was marked as static and not static inline, which can
    cause warnings about this function not being used if this file is included
    in a file that does not call that function, and violates the conventions
    used elsewhere. The two options are to move the function implementation
    back to mm/cma.c or make it inline here, and it's simple enough for the
    latter to make sense.
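
    The resulting helper is small enough to live in the header; a sketch,
    assuming the cma fields used by the function:

        static inline unsigned long cma_bitmap_maxno(struct cma *cma)
        {
                /* one bitmap bit covers 1 << order_per_bit pages */
                return cma->count >> cma->order_per_bit;
        }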

    Signed-off-by: Gregory Fong
    Cc: Joonsoo Kim
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gregory Fong
     
  • When we add a new node, the edge of memory may be wrong.

    E.g. a system has 4 nodes, node3 is movable and covers mem [24G-32G]:

    1. hot-remove node3,
    2. then hot-add node3 with only part of its memory, mem [26G-30G],
    3. hotadd_new_pgdat() calls
       free_area_init_node()
         get_pfn_range_for_nid()
    4. which returns the wrong start_pfn and end_pfn, because the
       memblock has not been updated yet.

    This patch also fixes a BUG_ON during hot-addition, please see
    http://marc.info/?l=linux-kernel&m=142961156129456&w=2

    Signed-off-by: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Cc: Kamezawa Hiroyuki
    Cc: Taku Izumi
    Cc: Tang Chen
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Update my email address.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Bug:

    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:1957!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
    CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
    RIP: split_huge_page_to_list+0xdb/0x120
    Call Trace:
    memory_failure+0x32e/0x7c0
    madvise_hwpoison+0x8b/0x160
    SyS_madvise+0x40/0x240
    ? do_page_fault+0x37/0x90
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
    RIP split_huge_page_to_list+0xdb/0x120
    RSP
    ---[ end trace aee7ce0df8e44076 ]---

    Testcase:

    #define _GNU_SOURCE
    #include <stdlib.h>     /* posix_memalign(), free() */
    #include <sys/mman.h>   /* madvise(), MADV_HWPOISON */

    #define MB (1024 * 1024)

    int main(void)
    {
            char *mem;

            posix_memalign((void **)&mem, 2 * MB, 200 * MB);

            madvise(mem, 200 * MB, MADV_HWPOISON);

            free(mem);

            return 0;
    }

    The huge zero page is allocated on a page fault without the
    FAULT_FLAG_WRITE flag. get_user_pages_fast(), which is called in
    madvise_hwpoison(), will get the huge zero page if the page has not
    been allocated before. The huge zero page is a transparent huge page;
    however, it is not an anonymous page. memory_failure will split the
    huge zero page and trigger

    BUG_ON(is_huge_zero_page(page));

    After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling
    for non-tail-refcounted thp"), memory_failure does not catch non-anon
    thp from the madvise_hwpoison path and this bug occurs.

    Fix it by catching non-anon thp in memory_failure in order not to
    split the huge zero page in the madvise_hwpoison path.

    After this patch:

    Injecting memory failure for page 0x202800 at 0x7fd8ae800000
    MCE: 0x202800: non anonymous thp
    [...]
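
    A condensed sketch of the new check in memory_failure(); the exact
    error path around it is simplified here:

        if (!PageHuge(p) && PageTransHuge(hpage)) {
                if (!PageAnon(hpage)) {
                        /* e.g. the huge zero page: never try to split it */
                        pr_err("MCE: %#lx: non anonymous thp\n", pfn);
                        put_page(p);
                        return -EBUSY;
                }
                /* anonymous thp can still be split as before */
        }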

    [akpm@linux-foundation.org: remove second split, per Wanpeng]
    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Hugetlbfs pages get a refcount in get_any_page() or
    madvise_hwpoison() when soft offlining through madvise. The refcount
    held by the soft-offline path should be released if we fail to isolate
    the hugetlbfs page.

    Fix it by dropping the refcount on both isolation success and failure.
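
    A sketch of the idea in the hugepage soft-offline path, with the
    surrounding code condensed:

        ret = isolate_huge_page(hpage, &pagelist);
        /*
         * Drop the refcount taken by get_any_page()/madvise_hwpoison()
         * whether or not isolation succeeded.
         */
        put_page(hpage);
        if (!ret) {
                pr_info("soft offline: %#lx hugepage failed to isolate\n",
                        pfn);
                return -EBUSY;
        }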

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • After trying to drain pages from the pagevec/pageset, we try to get the
    reference count of the page again; however, the reference which was
    taken by __get_any_page() is not dropped if the page is still not on
    the LRU list.

    Fix it by adding a put_page() to drop the page reference taken by
    __get_any_page().

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

07 Aug, 2015

12 commits

  • The initial value of global_wb_domain.dirty_limit set by
    writeback_set_ratelimit() is zeroed out by the memset in
    wb_domain_init().
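
    A sketch of the ordering problem and the natural fix in
    page_writeback_init(); details are condensed:

        void __init page_writeback_init(void)
        {
                BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));

                /*
                 * Must run after wb_domain_init(): its memset() would
                 * otherwise wipe out the dirty_limit set here.
                 */
                writeback_set_ratelimit();
                register_cpu_notifier(&ratelimit_nb);
        }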

    Signed-off-by: Rabin Vincent
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
     
  • Now that the page freeing code doesn't consider PageHWPoison a bad
    page, by setting it before completing the page containment we can
    prevent the error page from being reused just after successful page
    migration.

    I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
    page table entry is transformed into a migration entry, not into a
    hwpoison entry.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because it can happen not only for soft-offline but also
    for hard-offline. Consider that a slab page is about to be freed into
    the buddy pool, and then an uncorrected memory error hits the page just
    after it enters __free_one_page(); then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
    necessary, because the data on the affected page is not consumed.

    To solve it, this patch drops __PG_HWPOISON from the page flag checks at
    allocation/free time. I think it's justified because the __PG_HWPOISON
    flag is defined to prevent the page from being reused, and setting it
    outside the page's alloc-free cycle is designed behavior (not a bug).

    In recent months, I was annoyed by a BUG_ON when a soft-offlined page
    remained on the LRU cache list for a while, which is avoided by calling
    put_page() instead of putback_lru_page() in page migration's success
    path. This means that this patch reverts a major change from commit
    add05cecef80 about the new refcounting rule of soft-offlined pages, so
    the "reuse window" revives. This will be closed by a subsequent patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • "non anonymous thp" case is still racy with freeing thp, which causes
    panic due to put_page() for refcount-0 page. It seems that closing up
    this race might be hard (and/or not worth doing,) so let's give up the
    error handling for this case.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When memory_failure() is called on a page which was just freed after
    page migration from soft offlining, the counter num_poisoned_pages is
    raised twice. So let's fix it by using TestSetPageHWPoison.
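
    The counting then keys off the atomic test-and-set, roughly:

        /* only the first path to poison the page bumps the counter */
        if (!TestSetPageHWPoison(page))
                atomic_long_inc(&num_poisoned_pages);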

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Recently I addressed a few hwpoison race problems and the patches were
    merged in v4.2-rc1. It made progress, but unfortunately some problems
    still remain due to the limited coverage of my testing. So I'm trying
    to fix or avoid them in this series.

    One point I'm expecting to discuss is that patch 4/5 changes the page
    flag set to be checked at free time. In the current behavior,
    __PG_HWPOISON is not supposed to be set when the page is freed. I think
    that there is no strong reason for this behavior, and it causes a
    problem that is hard to fix only on the error handler side (because
    __PG_HWPOISON can be set at arbitrary timing), so I suggest changing it.

    With this patchset, hwpoison stress testing in official mce-test
    testsuite (which previously failed) passes.

    This patch (of 5):

    In "just unpoisoned" path, we do put_page and then unlock_page, which is
    a wrong order and causes "freeing locked page" bug. So let's fix it.
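
    The corrected ordering, sketched:

        /* was: put_page(page); unlock_page(page); -- frees a locked page */
        unlock_page(page);
        put_page(page);         /* may now safely drop the final reference */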

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.
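
    A minimal illustration of the change; the actual hook points are the
    shmem/hugetlbfs inode setup paths used by shm:

        /* internal shm inode: never exposed to userspace, access is
         * mediated by the shm operations, so skip LSM inode init and
         * permission checks */
        inode->i_flags |= S_PRIVATE;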

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Commit 92923ca3aace ("mm: meminit: only set page reserved in the
    memblock region") broke memory hotplug which expects the memmap for
    newly added sections to be reserved until onlined by
    online_pages_range(). This patch marks hotplugged pages as reserved
    when adding new zones.
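
    A hedged sketch of what marking the hotplugged range reserved looks
    like; the real change sits in the zone-growing path and this fragment
    only shows the shape of it:

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                if (pfn_valid(pfn))
                        SetPageReserved(pfn_to_page(pfn));
        }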

    Signed-off-by: Mel Gorman
    Reported-by: David Vrabel
    Tested-by: David Vrabel
    Cc: Nathan Zimmer
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch fixes creation of new kmem-caches after enabling
    sanity_checks for existing mergeable kmem-caches in runtime: before that
    patch creation fails because unique name in sysfs already taken by
    existing kmem-cache.

    Unlike other debug options this doesn't change object layout and could
    be enabled and disabled at any time.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    There are small differences between 4.1 and the patched kernel, but not
    enough to matter.
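
    The shape of the fix, sketched from the description above (ordering in
    start_kernel(); files_maxfiles_init() stands for the recalculation
    helper the patch adds):

        page_alloc_init_late();   /* deferred struct page init completes */
        files_maxfiles_init();    /* recompute file-max with real memory */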

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
    before basic setup") introduced a rwsem to signal completion of the
    initialization workers.

    Lockdep complains about possible recursive locking:
    =============================================
    [ INFO: possible recursive locking detected ]
    4.1.0-12802-g1dc51b8 #3 Not tainted
    ---------------------------------------------
    swapper/0/1 is trying to acquire lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0xc7/0xe6

    but task is already holding lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0x3e/0xe6

    Replace the rwsem with a completion together with an atomic
    "outstanding work counter".

    [peterz@infradead.org: Barrier removal on the grounds of being pointless]
    [mgorman@suse.de: Applied review feedback]
    Signed-off-by: Nicolai Stange
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
  • early_pfn_to_nid() was historically not SMP-safe, but it was only used
    during boot, which is inherently single-threaded, or during hotplug,
    which is protected by a giant mutex.

    With deferred memory initialisation, a thread-safe version was
    introduced, and early_pfn_to_nid() would trigger a BUG_ON if used
    unsafely. Memory hotplug hit that check. This patch introduces a lock
    in early_pfn_to_nid() to make it safe to use during hotplug.
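
    A sketch of the locked variant; the fallback handling is condensed:

        static DEFINE_SPINLOCK(early_pfn_lock);

        int __meminit early_pfn_to_nid(unsigned long pfn)
        {
                static struct mminit_pfnnid_cache early_pfnnid_cache
                        __meminitdata;
                int nid;

                spin_lock(&early_pfn_lock);
                nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
                if (nid < 0)
                        nid = first_online_node;    /* pfn in no known node */
                spin_unlock(&early_pfn_lock);

                return nid;
        }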

    Signed-off-by: Mel Gorman
    Reported-by: Alex Ng
    Tested-by: Alex Ng
    Acked-by: Peter Zijlstra (Intel)
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Aug, 2015

1 commit

  • Nikolay has reported a hang when a memcg reclaim got stuck with the
    following backtrace:

    PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
    #0 __schedule at ffffffff815ab152
    #1 schedule at ffffffff815ab76e
    #2 schedule_timeout at ffffffff815ae5e5
    #3 io_schedule_timeout at ffffffff815aad6a
    #4 bit_wait_io at ffffffff815abfc6
    #5 __wait_on_bit at ffffffff815abda5
    #6 wait_on_page_bit at ffffffff8111fd4f
    #7 shrink_page_list at ffffffff81135445
    #8 shrink_inactive_list at ffffffff81135845
    #9 shrink_lruvec at ffffffff81135ead
    #10 shrink_zone at ffffffff811360c3
    #11 shrink_zones at ffffffff81136eff
    #12 do_try_to_free_pages at ffffffff8113712f
    #13 try_to_free_mem_cgroup_pages at ffffffff811372be
    #14 try_charge at ffffffff81189423
    #15 mem_cgroup_try_charge at ffffffff8118c6f5
    #16 __add_to_page_cache_locked at ffffffff8112137d
    #17 add_to_page_cache_lru at ffffffff81121618
    #18 pagecache_get_page at ffffffff8112170b
    #19 grow_dev_page at ffffffff811c8297
    #20 __getblk_slow at ffffffff811c91d6
    #21 __getblk_gfp at ffffffff811c92c1
    #22 ext4_ext_grow_indepth at ffffffff8124565c
    #23 ext4_ext_create_new_leaf at ffffffff81246ca8
    #24 ext4_ext_insert_extent at ffffffff81246f09
    #25 ext4_ext_map_blocks at ffffffff8124a848
    #26 ext4_map_blocks at ffffffff8121a5b7
    #27 mpage_map_one_extent at ffffffff8121b1fa
    #28 mpage_map_and_submit_extent at ffffffff8121f07b
    #29 ext4_writepages at ffffffff8121f6d5
    #30 do_writepages at ffffffff8112c490
    #31 __filemap_fdatawrite_range at ffffffff81120199
    #32 filemap_flush at ffffffff8112041c
    #33 ext4_alloc_da_blocks at ffffffff81219da1
    #34 ext4_rename at ffffffff81229b91
    #35 ext4_rename2 at ffffffff81229e32
    #36 vfs_rename at ffffffff811a08a5
    #37 SYSC_renameat2 at ffffffff811a3ffc
    #38 sys_renameat2 at ffffffff811a408e
    #39 sys_rename at ffffffff8119e51e
    #40 system_call_fastpath at ffffffff815afa89

    Dave Chinner has properly pointed out that this is a deadlock in the
    reclaim code because ext4 doesn't submit pages which are marked by
    PG_writeback right away.

    The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
    with too many dirty pages") and it was applied only when may_enter_fs
    was specified. The code was later changed by c3b94f44fcb0 ("memcg:
    further prevent OOM with too many dirty pages"), which removed the
    __GFP_FS restriction with the reasoning that we do not get into the fs
    code. But this is apparently not sufficient, because the fs doesn't
    necessarily submit pages marked PG_writeback for IO right away.

    ext4_bio_write_page calls io_submit_add_bh, but that doesn't necessarily
    submit the bio. Instead it tries to map more pages into the bio, and
    mpage_map_one_extent might trigger a memcg charge which might end up
    waiting on a page which is marked PG_writeback but hasn't been submitted
    yet, so we would end up waiting for something that never finishes.

    Fix this issue by replacing the __GFP_IO check with a may_enter_fs
    check (for case 2) before we go to wait on the writeback. The page
    fault path, which is the only path that triggers the memcg oom killer
    since 3.12, shouldn't require GFP_NOFS, so we shouldn't reintroduce the
    premature OOM killer issue which was originally addressed by the
    heuristic.
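
    A greatly simplified sketch of the case-2 logic in shrink_page_list()
    after the change (labels and surrounding conditions condensed):

        if (PageWriteback(page)) {
                if (may_enter_fs)       /* was: sc->gfp_mask & __GFP_IO */
                        wait_on_page_writeback(page);
                else
                        goto keep_locked; /* GFP_NOFS: waiting can deadlock */
        }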

    As per Dave Chinner, xfs has been doing a similar thing since 2.6.15
    already, so ext4 is not the only affected filesystem. Moreover, he notes:

    : For example: IO completion might require unwritten extent conversion
    : which executes filesystem transactions and GFP_NOFS allocations. The
    : writeback flag on the pages can not be cleared until unwritten
    : extent conversion completes. Hence memory reclaim cannot wait on
    : page writeback to complete in GFP_NOFS context because it is not
    : safe to do so, memcg reclaim or otherwise.

    Cc: stable@vger.kernel.org # 3.9+
    [tytso@mit.edu: corrected the control flow]
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Reported-by: Nikolay Borisov
    Signed-off-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Jul, 2015

5 commits

  • In CMA, 1 bit in the bitmap represents 1 << order_per_bit pages, so the
    size of the bitmap is cma->count >> order_per_bit rather than just
    cma->count. This patch fixes it.
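
    Correspondingly, the bitmap allocation sizes itself from that value;
    roughly (assuming a cma_bitmap_maxno() helper as described above):

        int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) *
                          sizeof(long);

        cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);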

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Sasha Levin
    Cc: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA has an alloc/free interface for debugging. It is intended that
    alloc/free occurs in a specific CMA region but, currently, the
    alloc/free interface is on the root dir due to a bug, so we can't
    select the CMA region where alloc/free happens.

    This patch fixes this problem by making the alloc/free interface
    per CMA region.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Sasha Levin
    Cc: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we set the wrong gfp_mask in the page_owner info in the case
    of an isolated freepage from compaction and split page. It causes an
    incorrect mixed pageblock report from '/proc/pagetypeinfo'. This metric
    is really useful for measuring the fragmentation effect, so it should
    be accurate. This patch fixes it by setting the correct information.

    Without this patch, after a kernel build workload finishes, the number
    of mixed pageblocks is 112 among roughly 210 movable pageblocks.

    But, with this fix, the output shows that just 57 pageblocks are mixed.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When I tested my new patches, I found that the page pointer which is
    used for setting page_owner information gets changed. This is because
    the page pointer is used to set the new migratetype in a loop, and
    after this work the page pointer can be out of bounds. If this wrong
    pointer is used for page_owner, an access violation happens. Below is
    the error message that I got.

    BUG: unable to handle kernel paging request at 0000000000b00018
    IP: [] save_stack_address+0x30/0x40
    PGD 1af2d067 PUD 166e0067 PMD 0
    Oops: 0002 [#1] SMP
    ...snip...
    Call Trace:
    print_context_stack+0xcf/0x100
    dump_trace+0x15f/0x320
    save_stack_trace+0x2f/0x50
    __set_page_owner+0x46/0x70
    __isolate_free_page+0x1f7/0x210
    split_free_page+0x21/0xb0
    isolate_freepages_block+0x1e2/0x410
    compaction_alloc+0x22d/0x2d0
    migrate_pages+0x289/0x8b0
    compact_zone+0x409/0x880
    compact_zone_order+0x6d/0x90
    try_to_compact_pages+0x110/0x210
    __alloc_pages_direct_compact+0x3d/0xe6
    __alloc_pages_nodemask+0x6cd/0x9a0
    alloc_pages_current+0x91/0x100
    runtest_store+0x296/0xa50
    simple_attr_write+0xbd/0xe0
    __vfs_write+0x28/0xf0
    vfs_write+0xa9/0x1b0
    SyS_write+0x46/0xb0
    system_call_fastpath+0x16/0x75

    This patch fixes this error by moving the set_page_owner() call up,
    before the pointer is advanced.
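
    A condensed sketch of __isolate_free_page() after the move; the
    condition guarding the migratetype conversion is elided:

        /* record page_owner before 'page' is advanced by the loop below */
        set_page_owner(page, order, __GFP_MOVABLE);

        for (; page < endpage; page += pageblock_nr_pages)
                set_pageblock_migratetype(page, MIGRATE_MOVABLE);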

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The kbuild test robot reported the following

    tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
    head: 14a6f1989dae9445d4532941bdd6bbad84f4c8da
    commit: 3b242c66ccbd60cf47ab0e8992119d9617548c23 x86: mm: enable deferred struct page initialisation on x86-64
    date: 3 days ago
    config: x86_64-randconfig-x006-201527 (attached as .config)
    reproduce:
    git checkout 3b242c66ccbd60cf47ab0e8992119d9617548c23
    # save the attached .config to linux build tree
    make ARCH=x86_64

    All warnings (new ones prefixed by >>):

    mm/page_alloc.c: In function 'early_page_uninitialised':
    >> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
    int nid = early_pfn_to_nid(pfn);

    It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
    configurations. This patch avoids the warning by not declaring nid.
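
    A sketch of one warning-free form consistent with the description,
    folding the lookup into its single use:

        static inline bool early_page_uninitialised(unsigned long pfn)
        {
                /* no local 'nid': NODE_DATA() ignores its arg on !NUMA */
                if (pfn >= NODE_DATA(early_pfn_to_nid(pfn))->
                                first_deferred_pfn)
                        return true;

                return false;
        }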

    Signed-off-by: Mel Gorman
    Reported-by: Wu Fengguang
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Jul, 2015

1 commit

  • Reading the page fault handler code, I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings:
    if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully
    populated on ->mmap(), the kernel would handle a page fault on a
    not-populated pte with do_anonymous_page().

    Let's change the page fault handler to use do_anonymous_page() only on
    anonymous VMAs (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault(), or a shared VMA without
    vm_ops, a page fault on a pte_none() entry will lead to SIGBUS.
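
    A simplified sketch of the new dispatch in the fault handler (argument
    lists follow the 4.1-era signatures and the helpers are condensed):

        if (pte_none(entry)) {
                if (!vma->vm_ops && !(vma->vm_flags & VM_SHARED))
                        return do_anonymous_page(mm, vma, address,
                                                 pte, pmd, flags);
                if (vma->vm_ops && vma->vm_ops->fault)
                        return do_fault(mm, vma, address,
                                        pte, pmd, flags, entry);
                /* no ->fault, or shared VMA without vm_ops */
                return VM_FAULT_SIGBUS;
        }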

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "Mainly sending this off now for the writeback fixes, since they fix a
    real regression introduced with the cgroup writeback changes. The
    NVMe fix could wait for next pull for this series, but it's simple
    enough that we might as well include it.

    This contains:

    - two cgroup writeback fixes from Tejun, fixing a user reported issue
    with luks crypt devices hanging when being closed.

    - NVMe error cleanup fix from Jon Derrick, fixing a case where we'd
    attempt to free an unregistered IRQ"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    NVMe: Fix irq freeing when queue_request_irq fails
    writeback: don't drain bdi_writeback_congested on bdi destruction
    writeback: don't embed root bdi_writeback_congested in bdi_writeback

    Linus Torvalds
     

03 Jul, 2015

1 commit

  • …scm/linux/kernel/git/paulg/linux

    Pull module_init replacement part two from Paul Gortmaker:
    "Replace module_init with appropriate alternate initcall in non
    modules.

    This series converts non-modular code that is using the module_init()
    call to hook itself into the system to instead use one of our
    alternate priority initcalls.

    Unlike the previous series that used device_initcall and hence was a
    runtime no-op, these commits change to one of the alternate initcalls,
    because (a) we have them and (b) it seems like the right thing to do.

    For example, it would seem logical to use arch_initcall for arch
    specific setup code and fs_initcall for filesystem setup code.

    This does mean however, that changes in the init ordering will be
    taking place, and so there is a small risk that some kind of implicit
    init ordering issue may lie uncovered. But I think it is still better
    to give these ones sensible priorities than to just assign them all to
    device_initcall in order to exactly preserve the old ordering.

    That said, we have already made similar changes in core kernel code in
    commit c96d6660dc65 ("kernel: audit/fix non-modular users of
    module_init in core code") without any regressions reported, so this
    type of change isn't without precedent. It has also got the same
    local testing and linux-next coverage as all the other pull requests
    that I'm sending for this merge window have got.

    Once again, there is an unused module_exit function removal that shows
    up as an outlier upon casual inspection of the diffstat"

    * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
    x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
    mm/page_owner.c: use late_initcall to hook in enabling
    lib/list_sort: use late_initcall to hook in self tests
    arm: use subsys_initcall in non-modular pl320 IPC code
    powerpc: don't use module_init for non-modular core hugetlb code
    powerpc: use subsys_initcall for Freescale Local Bus
    x86: don't use module_init for non-modular core bootflag code
    netfilter: don't use module_init/exit in core IPV4 code
    fs/notify: don't use module_init for non-modular inotify_user code
    mm: replace module_init usages with subsys_initcall in nommu.c

    Linus Torvalds
     

02 Jul, 2015

4 commits

  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because wb's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    This patch fixes the bug by updating bdi destruction to not wait for
    cong's to drain. A cong is unlinked from bdi->cgwb_congested_tree on
    bdi destruction regardless of its reference count, as the bdi may go
    away at any point after destruction. wb_congested_put() checks whether
    the cong is already unlinked on release.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
    Tested-by: Jon Christopherson
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because wb's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    To fix the bug, bdi destruction will be updated to not wait for cong's
    to drain, which naturally means that cong's may outlive the associated
    bdi. This is fine for non-root cong's but is problematic for the root
    cong's which are embedded in their bdi's as they may end up getting
    dereferenced after the containing bdi's are freed.

    This patch makes root cong's behave the same as non-root cong's. They
    are no longer embedded in their bdi's but allocated separately during
    bdi initialization, indexed and reference counted the same way.

    * As cong handling is the same for all wb's, wb->congested
    initialization is moved into wb_init().

    * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
    bdi->wb_congested is now a pointer pointing to the root cong
    allocated during bdi init and minimal refcnting operations are
    implemented.

    * The above makes root wb init paths diverge depending on
    CONFIG_CGROUP_WRITEBACK. root wb init is moved to cgwb_bdi_init().

    This patch in itself shouldn't cause any consequential behavior
    differences but prepares for the actual fix.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Tested-by: Jon Christopherson

    Added include to backing-dev.h for kfree() definition.

    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Merge third patchbomb from Andrew Morton:

    - the rest of MM

    - scripts/gdb updates

    - ipc/ updates

    - lib/ updates

    - MAINTAINERS updates

    - various other misc things

    * emailed patches from Andrew Morton : (67 commits)
    genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
    genalloc: rename dev_get_gen_pool() to gen_pool_get()
    x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
    MAINTAINERS: add zpool
    MAINTAINERS: BCACHE: Kent Overstreet has changed email address
    MAINTAINERS: move Jens Osterkamp to CREDITS
    MAINTAINERS: remove unused nbd.h pattern
    MAINTAINERS: update brcm gpio filename pattern
    MAINTAINERS: update brcm dts pattern
    MAINTAINERS: update sound soc intel patterns
    MAINTAINERS: remove website for paride
    MAINTAINERS: update Emulex ocrdma email addresses
    bcache: use kvfree() in various places
    libcxgbi: use kvfree() in cxgbi_free_big_mem()
    target: use kvfree() in session alloc and free
    IB/ehca: use kvfree() in ipz_queue_{cd}tor()
    drm/nouveau/gem: use kvfree() in u_free()
    drm: use kvfree() in drm_free_large()
    cxgb4: use kvfree() in t4_free_mem()
    cxgb3: use kvfree() in cxgb_free_mem()
    ...

    Linus Torvalds
     
  • Avoid the warning:

    WARNING: mm/built-in.o(.text.unlikely+0xc22): Section mismatch in reference from the function .new_kmalloc_cache() to the variable .init.rodata:kmalloc_info
    The function .new_kmalloc_cache() references
    the variable __initconst kmalloc_info.

    Signed-off-by: Christoph Lameter
    Reported-by: Stephen Rothwell
    Tested-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2015

6 commits

  • Waiman Long reported that 24TB machines hit OOM during basic setup when
    struct page initialisation was deferred. One approach is to initialise
    memory on demand, but it interferes with page allocator paths. This
    patch creates dedicated threads to initialise memory before basic
    setup. It then blocks on a rw_semaphore until completion, as a
    wait_queue and counter would be overkill. This may be slower to boot,
    but it's simpler overall and also gets rid of the section mangling
    which existed so that kswapd could do the initialisation.
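
    A sketch of the scheme as merged here (names follow the description;
    a later patch reworks the synchronisation, see the lockdep fix above):

        void __init page_alloc_init_late(void)
        {
                int nid;

                for_each_node_state(nid, N_MEMORY) {
                        /* read lock is released by the worker when done */
                        down_read(&pgdat_init_rwsem);
                        kthread_run(deferred_init_memmap, NODE_DATA(nid),
                                    "pgdatinit%d", nid);
                }

                /* block until every worker has done its up_read() */
                down_write(&pgdat_init_rwsem);
                up_write(&pgdat_init_rwsem);
        }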

    [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
    Signed-off-by: Mel Gorman
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Scott Norton
    Tested-by: Daniel J Blueman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • mminit_verify_page_links() is an extremely paranoid check that was
    introduced when memory initialisation was being heavily reworked.
    Profiles indicated that up to 10% of parallel memory initialisation was
    spent on checking this for every page. The cost could be reduced but in
    practice this check only found problems very early during the
    initialisation rewrite and has found nothing since. This patch removes an
    expensive unnecessary check.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During parallel struct page initialisation, ranges are checked for
    every PFN unnecessarily, which increases boot times. This patch alters
    when the ranges are checked.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Parallel struct page initialisation frees pages one at a time. Try to
    free pages as single large pages where possible.
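
    A hedged sketch of the batched free in the deferred-init path:

        /* free a naturally aligned MAX_ORDER-sized run in one call */
        if (nr_pages == MAX_ORDER_NR_PAGES &&
            (pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
                __free_pages_boot_core(page, pfn, MAX_ORDER - 1);
                return;
        }

        /* otherwise fall back to freeing page by page */
        for (i = 0; i < nr_pages; i++, page++, pfn++)
                __free_pages_boot_core(page, pfn, 0);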

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Deferred struct page initialisation is using pfn_to_page() on every PFN
    unnecessarily. This patch minimises the number of lookups and scheduler
    checks.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Only a subset of struct pages are initialised at the moment. When this
    patch is applied, kswapd initialises the remaining struct pages in
    parallel.

    This should boot faster by spreading the work to multiple CPUs and
    initialising data that is local to the CPU. The user-visible effect on
    large machines is that free memory will appear to rapidly increase
    early in the lifetime of the system until kswapd reports that all
    memory is initialised in the kernel log. Once initialised there should
    be no other user-visible effects.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman