15 Feb, 2017

1 commit

  • commit a810007afe239d59c1115fcaa06eb5b480f876e9 upstream.

    Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub
    initialisation as described in

    https://bugzilla.kernel.org/show_bug.cgi?id=177551.

    Bail out early from init_cache_random_seq if s->random_seq is already
    initialised. This prevents destroying the previously computed
    random_seq offsets later in the function.

    If the offsets are destroyed, then shuffle_freelist will truncate
    page->freelist to just the first object (orphaning the rest).
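
    Below is a minimal user-space model of the guard (not the kernel
    source; the struct and the offset values are illustrative): a second
    call returns early instead of recomputing and destroying the existing
    sequence.

    #include <stdio.h>
    #include <stdlib.h>

    struct cache {
            unsigned int *random_seq;       /* NULL until initialised */
            unsigned int count;
    };

    static int init_cache_random_seq(struct cache *s)
    {
            /* The fix: bail out if the sequence already exists. */
            if (s->random_seq)
                    return 0;

            s->random_seq = calloc(s->count, sizeof(*s->random_seq));
            if (!s->random_seq)
                    return -1;
            for (unsigned int i = 0; i < s->count; i++)
                    s->random_seq[i] = i * 64;      /* pretend offsets */
            return 0;
    }

    int main(void)
    {
            struct cache s = { .count = 8 };

            init_cache_random_seq(&s);
            init_cache_random_seq(&s);      /* second call is now a no-op */
            printf("offset[1] = %u\n", s.random_seq[1]);
            free(s.random_seq);
            return 0;
    }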

    Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
    Link: http://lkml.kernel.org/r/20170207140707.20824-1-sean@erifax.org
    Signed-off-by: Sean Rees
    Reported-by:
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Thomas Garnier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sean Rees
     

09 Feb, 2017

4 commits

  • commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream.

    do_generic_file_read() can be told to perform a large request from
    userspace. If the system is under OOM and the reading task is the OOM
    victim, then it has access to memory reserves and finishing the full
    request can lead to full memory depletion, which is dangerous. Make
    sure we rather go with a short read and allow the killed task to
    terminate.
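
    A small stand-alone model of the idea (not the filemap code; the
    'killed' flag stands in for fatal_signal_pending()): the chunked copy
    loop bails out with a short count once the task is known to be killed
    instead of completing the whole request.

    #include <stdio.h>
    #include <string.h>

    static int killed;      /* stands in for fatal_signal_pending() */

    static long chunked_read(const char *src, char *dst, long len)
    {
            long copied = 0;

            while (copied < len) {
                    long chunk = len - copied > 16 ? 16 : len - copied;

                    if (killed)             /* the task was OOM-killed */
                            break;          /* stop with a short read  */
                    memcpy(dst + copied, src + copied, chunk);
                    copied += chunk;
                    if (copied >= 48)       /* simulate the kill mid-request */
                            killed = 1;
            }
            return copied;                  /* may be shorter than len */
    }

    int main(void)
    {
            char src[128] = { 0 }, dst[128];
            long n = chunked_read(src, dst, sizeof(src));

            printf("requested %zu bytes, copied %ld\n", sizeof(src), n);
            return 0;
    }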

    Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Christoph Hellwig
    Cc: Tetsuo Handa
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.

    Reading a sysfs "memoryN/valid_zones" file leads to the following oops
    when the first page of a range is not backed by struct page.
    show_valid_zones() assumes that 'start_pfn' is always valid for
    page_zone().

    BUG: unable to handle kernel paging request at ffffea017a000000
    IP: show_valid_zones+0x6f/0x160

    This issue may happen on x86-64 systems with 64GiB or more memory since
    their memory block size is bumped up to 2GiB. [1] An example of such
    systems is described below. 0x3240000000 is only aligned by 1GiB and
    this memory block starts from 0x3200000000, which is not backed by
    struct page.

    BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

    Since test_pages_in_a_zone() already checks holes, fix this issue by
    extending this function to return 'valid_start' and 'valid_end' for a
    given range. show_valid_zones() then proceeds with the valid range.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Greg Kroah-Hartman
    Cc: Zhang Zhen
    Cc: Reza Arbab
    Cc: David Rientjes
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     
  • commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream.

    Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

    A sysfs memory file is created for each 2GiB memory block on x86-64 when
    the system has 64GiB or more memory. [1] When the start address of a
    memory block is not backed by struct page, i.e. a memory range is not
    aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
    kernel oops. This issue was observed on multiple x86-64 systems with
    more than 64GiB of memory. This patch-set fixes this issue.

    Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
    test the start section.

    Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
    to return valid [start, end).

    Note for stable kernels: The memory block size change was made by commit
    bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
    systems"), which was accepted to 3.9. However, this patch-set depends
    on (and fixes) the change to test_pages_in_a_zone() made by commit
    5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
    test_pages_in_a_zone()"), which was accepted to 4.4.

    So, I recommend that we backport it up to 4.4.

    [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

    This patch (of 2):

    test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
    section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
    is called for testing the range of a sysfs memory file, 'start_pfn' is
    always aligned by section.

    Fix it by properly setting 'sec_end_pfn' to the next section pfn.

    Also make sure that this function returns 1 only when the range belongs
    to a zone.

    Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Andrew Banman
    Cc: Reza Arbab
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     
  • commit d7b028f56a971a2e4d8d7887540a144eeefcd4ab upstream.

    Add zswap_init_failed bool that prevents changing any of the module
    params, if init_zswap() fails, and set zswap_enabled to false. Change
    'enabled' param to a callback, and check zswap_init_failed before
    allowing any change to 'enabled', 'zpool', or 'compressor' params.

    Any driver that is built-in to the kernel will not be unloaded if its
    init function returns error, and its module params remain accessible for
    users to change via sysfs. Since zswap uses param callbacks, which
    assume that zswap has been initialized, changing the zswap params after
    a failed initialization will result in a WARNING because the param
    callbacks expect a pool to already exist. This patch prevents that by
    exiting any of the param callbacks immediately if initialization failed.
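
    A stand-alone sketch of the guard pattern (the module-param plumbing is
    omitted and the return value is only an assumption): every param setter
    refuses to do anything once initialization has failed.

    #include <stdio.h>
    #include <stdbool.h>

    static bool zswap_init_failed;          /* set if init_zswap() fails   */
    static bool zswap_enabled;

    /* Model of the 'enabled' param set callback. */
    static int zswap_enabled_param_set(bool val)
    {
            if (zswap_init_failed) {
                    printf("can't enable, zswap initialization failed\n");
                    return -1;              /* reject the change outright */
            }
            zswap_enabled = val;            /* normal path: a pool exists */
            return 0;
    }

    int main(void)
    {
            zswap_init_failed = true;       /* pretend init_zswap() failed */

            if (zswap_enabled_param_set(true) != 0)
                    printf("enabled stays %d, no pool code is reached\n",
                           zswap_enabled);
            return 0;
    }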

    This was reported here:
    https://marc.info/?l=linux-mm&m=147004228125528&w=4

    And fixes this WARNING:
    [ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60

    The warning is just noise, and not serious. However, when init fails,
    zswap frees all its percpu dstmem pages and its kmem cache. The kmem
    cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
    the percpu dstmem pages are definitely a problem, as they're used as
    temporary buffer for compressed pages before copying into place in the
    zpool.

    If the user does get zswap enabled after an init failure, then zswap
    will likely Oops on the first page it tries to compress (or worse, start
    corrupting memory).

    Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
    Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
    Signed-off-by: Dan Streetman
    Reported-by: Marcin Miroslaw
    Cc: Seth Jennings
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     

01 Feb, 2017

8 commits

  • commit 3674534b775354516e5c148ea48f51d4d1909a78 upstream.

    When memory.move_charge_at_immigrate is enabled and precharges are
    depleted during move, mem_cgroup_move_charge_pte_range() will attempt to
    increase the size of the precharge.

    Prevent precharges from ever looping by setting __GFP_NORETRY. This was
    probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is
    pointless as written.

    Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit 8a1f780e7f28c7c1d640118242cf68d528c456cd upstream.

    online_{kernel|movable} is used to change the memory zone to
    ZONE_{NORMAL|MOVABLE} and online the memory.

    To check that memory zone can be changed, zone_can_shift() is used.
    Currently the function returns a negative integer, a positive integer,
    or 0. When the function returns a negative or positive value, it means
    that the memory zone can be changed to ZONE_{NORMAL|MOVABLE}.

    But when the function returns 0, there are two meanings.

    One of the meanings is that the memory zone does not need to be changed.
    For example, when memory is in ZONE_NORMAL and onlined by online_kernel
    the memory zone does not need to be changed.

    Another meaning is that the memory zone cannot be changed. When memory
    is in ZONE_NORMAL and onlined by online_movable, the memory zone may
    not be changed to ZONE_MOVABLE due to the memory online limitation (see
    Documentation/memory-hotplug.txt). In this case, memory must not be
    onlined.

    The patch changes the return type of zone_can_shift() so that memory
    online operation fails when memory zone cannot be changed as follows:

    Before applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 8388608
    managed 8388608

    online_movable operation succeeded. But memory is onlined as
    ZONE_NORMAL, not ZONE_MOVABLE.

    After applying patch:
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320
    # echo online_movable > memory4097/state
    bash: echo: write error: Invalid argument
    # grep -A 35 "Node 2" /proc/zoneinfo
    Node 2, zone Normal

    node_scanned 0
    spanned 8388608
    present 7864320
    managed 7864320

    The online_movable operation failed because the memory zone could not
    be changed from ZONE_NORMAL to ZONE_MOVABLE.
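
    A stand-alone sketch of the interface change (zone internals are
    replaced by plain enums and the allow/deny rule is a placeholder):
    returning success/failure separately from the shift removes the
    ambiguity of a 0 return, so "no change needed" and "change not allowed"
    become distinguishable.

    #include <stdio.h>
    #include <stdbool.h>

    enum zone { ZONE_NORMAL, ZONE_MOVABLE };

    static bool zone_can_shift(enum zone cur, enum zone target, int *shift,
                               bool online_limit_hit)
    {
            if (cur == target) {
                    *shift = 0;             /* nothing to change: success */
                    return true;
            }
            if (online_limit_hit)
                    return false;           /* zone change not allowed    */
            *shift = (int)target - (int)cur;
            return true;
    }

    int main(void)
    {
            int shift;

            /* The failing case: the caller must refuse to online the block. */
            if (!zone_can_shift(ZONE_NORMAL, ZONE_MOVABLE, &shift, true))
                    printf("echo online_movable > .../state fails with EINVAL\n");
            return 0;
    }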

    Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
    Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     
  • commit e47483bca2cc59a4593b37a270b16ee42b1d9f08 upstream.

    Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
    triggers the OOM killer in a few seconds, despite lots of free memory. The
    test attempts to repeatedly fault in memory in one process in a cpuset,
    while changing allowed nodes of the cpuset between 0 and 1 in another
    process.

    The problem comes from insufficient protection against cpuset changes,
    which can cause get_page_from_freelist() to consider all zones as
    non-eligible due to nodemask and/or current->mems_allowed. This was
    masked in the past by sufficient retries, but since commit 682a3385e773
    ("mm, page_alloc: inline the fast path of the zonelist iterator") we fix
    the preferred_zoneref once, and don't iterate over the whole zonelist in
    further attempts, thus the only eligible zones might be placed in the
    zonelist before our starting point and we always miss them.

    A previous patch fixed this problem for current->mems_allowed. However,
    cpuset changes also update the task's mempolicy nodemask. The fix has
    two parts: we have to repeat the preferred_zoneref search when we
    detect a cpuset update by way of the seqcount, and we have to check the
    seqcount before considering OOM.
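
    A stand-alone sketch of the retry pattern (single-threaded; a plain
    counter stands in for the cpuset sequence count and the helper names are
    only modelled here): the allocator re-reads the cookie, redoes the zone
    search when it changed, and only considers OOM if the cookie is stable.

    #include <stdio.h>

    static unsigned int cpuset_seq;         /* stand-in for the cpuset seqcount */

    static unsigned int read_mems_allowed_begin(void) { return cpuset_seq; }
    static int read_mems_allowed_retry(unsigned int cookie)
    {
            return cookie != cpuset_seq;
    }

    static int try_alloc(int attempt)
    {
            if (attempt == 0) {
                    cpuset_seq++;           /* a concurrent cpuset update lands */
                    return -1;              /* ... and the zonelist walk fails  */
            }
            return 0;                       /* a fresh zone search succeeds     */
    }

    int main(void)
    {
            int attempt = 0;

            for (;;) {
                    unsigned int cookie = read_mems_allowed_begin();

                    if (try_alloc(attempt++) == 0) {
                            printf("allocation succeeded on attempt %d\n", attempt);
                            break;
                    }
                    if (read_mems_allowed_retry(cookie)) {
                            /* cpuset changed under us: redo the search, no OOM */
                            continue;
                    }
                    printf("no progress and no cpuset race: consider OOM\n");
                    break;
            }
            return 0;
    }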

    [akpm@linux-foundation.org: fix typo in comment]
    Link: http://lkml.kernel.org/r/20170120103843.24587-5-vbabka@suse.cz
    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Signed-off-by: Vlastimil Babka
    Reported-by: Ganapatrao Kulkarni
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 5ce9bfef1d27944c119a397a9d827bef795487ce upstream.

    This is a preparation for the following patch to make review simpler.
    While the primary motivation is a bug fix, this also simplifies the fast
    path, although the moved code is only enabled when cpusets are in use.

    Link: http://lkml.kernel.org/r/20170120103843.24587-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Ganapatrao Kulkarni
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 16096c25bf0ca5d87e4fa6ec6108ba53feead212 upstream.

    Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
    triggers the OOM killer in a few seconds, despite lots of free memory. The
    test attempts to repeatedly fault in memory in one process in a cpuset,
    while changing allowed nodes of the cpuset between 0 and 1 in another
    process.

    One possible cause is that in the fast path we find the preferred
    zoneref according to current mems_allowed, so that it points to the
    middle of the zonelist, skipping e.g. zones of node 1 completely. If
    the mems_allowed is updated to contain only node 1, we never reach it in
    the zonelist, and trigger OOM before checking the cpuset_mems_cookie.

    This patch fixes the particular case by redoing the preferred zoneref
    search if we switch back to the original nodemask. The condition is
    also slightly changed so that when the last non-root cpuset is removed,
    we don't miss it.

    Note that this is not a full fix, and more patches will follow.

    Link: http://lkml.kernel.org/r/20170120103843.24587-3-vbabka@suse.cz
    Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
    Signed-off-by: Vlastimil Babka
    Reported-by: Ganapatrao Kulkarni
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.

    Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

    This is v2 of my attempt to fix the recent report based on LTP cpuset
    stress test [1]. The intention is to go to stable 4.9 LTSS with this,
    as triggering repeated OOMs is not nice. That's why the patches try
    not to be too intrusive.

    Unfortunately, while investigating I found that modifying the testcase
    to use per-VMA policies instead of per-task policies brings the OOMs
    back, but that seems to be a much older and harder-to-fix problem. I
    have posted an RFC [2], but I believe that fixing the recent regressions
    has a higher priority.

    Longer-term we might try to think how to fix the cpuset mess in a better
    and less error-prone way. I was for example very surprised to learn
    that cpuset updates change not only task->mems_allowed, but also the
    nodemask of mempolicies. Until now I expected the parameter to
    alloc_pages_nodemask() to be stable. I wonder why we then treat
    cpusets specially in get_page_from_freelist() and distinguish HARDWALL
    etc., when there's an unconditional intersection between mempolicy and
    cpuset. I would expect the nodemask adjustment to save overhead in
    g_p_f(), but that clearly doesn't happen in the current form. So we
    have both crazy complexity and overhead, AFAICS.

    [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
    [2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

    This patch (of 4):

    Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
    zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
    which can theoretically happen due to concurrent cpuset modification. We
    check the zoneref pointer, which is never NULL, when we should check the
    zone pointer. Also document this in the first_zones_zonelist() comment,
    per Michal Hocko.

    Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
    Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Cc: Ganapatrao Kulkarni
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit d51e9894d27492783fc6d1b489070b4ba66ce969 upstream.

    Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to
    alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by
    mpol_cond_put() before accessing the embedded nodemask by
    __alloc_pages_nodemask(). The commit log says it's so "we can use a
    single exit path within the function" but that's clearly wrong. We can
    still do that when doing mpol_cond_put() after the allocation attempt.

    Make sure the mempolicy is not freed prematurely, otherwise
    __alloc_pages_nodemask() can end up using a bogus nodemask, which could
    lead e.g. to premature OOM.
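
    A stand-alone model of the ordering fix (a plain refcount stands in for
    mpol_cond_put and the policy contents are illustrative): the nodemask is
    read from the policy before the reference that keeps it alive is dropped.

    #include <stdio.h>
    #include <stdlib.h>

    struct mempolicy {
            int refcount;
            unsigned long nodemask;
    };

    /* Model of mpol_cond_put(): drops the reference; the policy may be freed. */
    static void mpol_cond_put(struct mempolicy *pol)
    {
            if (--pol->refcount == 0)
                    free(pol);
    }

    static void alloc_pages_vma(struct mempolicy *pol)
    {
            /* Fixed ordering: read what we need from the policy first ... */
            unsigned long nodemask = pol->nodemask;

            printf("allocating with nodemask %#lx\n", nodemask);

            /* ... and only drop the reference after the allocation attempt. */
            mpol_cond_put(pol);
    }

    int main(void)
    {
            struct mempolicy *pol = malloc(sizeof(*pol));

            pol->refcount = 1;
            pol->nodemask = 0x3;
            alloc_pages_vma(pol);
            return 0;
    }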

    Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma")
    Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream.

    In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
    __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
    after a COW was resolved to setting the (newly introduced) FOLL_COW
    instead. Simultaneously, the check in gup.c was updated to still allow
    writes with FOLL_FORCE set if FOLL_COW had also been set.

    However, a similar check in huge_memory.c was forgotten. As a result,
    remote memory writes to ro regions of memory backed by transparent huge
    pages cause an infinite loop in the kernel (handle_mm_fault sets
    FOLL_COW and returns 0, causing a retry, but follow_trans_huge_pmd bails
    out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
    true).

    While in this state the process is still SIGKILLable, but little else
    works (e.g. no ptrace attach, no other signals). This is easily
    reproduced with the following code (assuming THP is set to always):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define TEST_SIZE 5 * 1024 * 1024

    int main(void) {
            int status;
            pid_t child;
            int fd = open("/proc/self/mem", O_RDWR);
            void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                              MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
            assert(addr != MAP_FAILED);
            pid_t parent_pid = getpid();
            if ((child = fork()) == 0) {
                    void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
                    assert(addr2 != MAP_FAILED);
                    memset(addr2, 'a', TEST_SIZE);
                    pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
                    return 0;
            }
            assert(child == waitpid(child, &status, 0));
            assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
            return 0;
    }

    Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
    to the update in gup.c in the original commit. The same pattern exists
    in follow_devmap_pmd. However, we should not be able to reach that
    check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
    ever do.
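
    A stand-alone model of the corrected predicate (the flag values and the
    dirty check are illustrative, not the kernel definitions), mirroring the
    analogous check in gup.c: a FOLL_FORCE write through a read-only huge
    pmd is only allowed once the COW has been broken, i.e. FOLL_COW is set.

    #include <stdio.h>
    #include <stdbool.h>

    #define FOLL_WRITE 0x01
    #define FOLL_FORCE 0x02
    #define FOLL_COW   0x04

    /* Model of the check added to follow_trans_huge_pmd(). */
    static bool can_follow_write_pmd(bool pmd_writable, bool pmd_dirty,
                                     unsigned int flags)
    {
            return pmd_writable ||
                   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty);
    }

    int main(void)
    {
            unsigned int flags = FOLL_WRITE | FOLL_FORCE;

            /* First try: COW not broken yet -> caller must fault and retry. */
            printf("%d\n", can_follow_write_pmd(false, false, flags));

            /* After the fault handler resolves the COW it sets FOLL_COW. */
            flags |= FOLL_COW;
            printf("%d\n", can_follow_write_pmd(false, true, flags));
            return 0;
    }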

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
    Signed-off-by: Keno Fischer
    Acked-by: Kirill A. Shutemov
    Cc: Greg Thelen
    Cc: Nicholas Piggin
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Keno Fischer
     

20 Jan, 2017

6 commits

  • commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream.

    return_unused_surplus_pages() decrements the global reservation count,
    and frees any unused surplus pages that were backing the reservation.

    Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in
    return_unused_surplus_pages()") added a call to cond_resched_lock in the
    loop freeing the pages.

    As a result, the hugetlb_lock could be dropped, and someone else could
    use the pages that will be freed in subsequent iterations of the loop.
    This could result in inconsistent global hugetlb page state, application
    api failures (such as mmap) or application crashes.

    When dropping the lock in return_unused_surplus_pages, make sure that
    the global reservation count (resv_huge_pages) remains sufficiently
    large to prevent someone else from claiming pages about to be freed.

    Analyzed by Paul Cassella.

    Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
    Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Paul Cassella
    Suggested-by: Michal Hocko
    Cc: Masayoshi Mizuma
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit c4e490cf148e85ead0d1b1c2caaba833f1d5b29f upstream.

    This patch fixes a bug in the freelist randomization code. When a high
    random number is used, the freelist will contain duplicate entries. It
    will result in different allocations sharing the same chunk.

    It will result in odd behaviours and crashes. It should be uncommon but
    it depends on the machines. We saw it happening more often on some
    machines (every few hours of running tests).

    Fixes: c7ce4f60ac19 ("mm: SLAB freelist randomization")
    Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
    Signed-off-by: John Sperbeck
    Signed-off-by: Thomas Garnier
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Sperbeck
     
  • commit f05714293a591038304ddae7cb0dd747bb3786cc upstream.

    During development of zram-swap asynchronous writeback, I found strange
    corruption of a compressed page, resulting in:

    Modules linked in: zram(E)
    CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff88007620b840 task.stack: ffff880078090000
    RIP: set_freeobj.part.43+0x1c/0x1f
    RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
    RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
    RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
    RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
    R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
    FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
    Call Trace:
    obj_malloc+0x22b/0x260
    zs_malloc+0x1e4/0x580
    zram_bvec_rw+0x4cd/0x830 [zram]
    page_requests_rw+0x9c/0x130 [zram]
    zram_thread+0xe6/0x173 [zram]
    kthread+0xca/0xe0
    ret_from_fork+0x25/0x30

    Investigation revealed that stable pages currently don't cover anonymous
    pages. IOW, reuse_swap_page can reuse the page without waiting for
    writeback completion, so it can overwrite a page zram is still compressing.

    Unfortunately, zram has used the per-cpu stream feature since v4.7.
    It aims to increase the cache hit ratio of the scratch buffer used for
    compression. The downside of that approach is that zram must ask for
    memory space for the compressed page in per-cpu context, which requires
    a strict gfp flag and can fail. If so, it retries the allocation
    outside of per-cpu context, where it can get memory this time,
    compresses the data again, and copies it into that memory space.

    In this scenario, zram assumes the data never changes, but that is not
    true without stable-page support. So, if the data is changed under us,
    zram can overrun its buffer because the second compression size could
    be bigger than the one from the previous trial, and it blindly copies
    the bigger object into the smaller buffer. The overrun breaks the
    zsmalloc free-object chaining, so the system crashes as above.

    I think the report below is the same problem.
    https://bugzilla.suse.com/show_bug.cgi?id=997574

    Unfortunately, reuse_swap_page must be atomic, so we cannot wait on
    writeback there; the approach in this patch is to simply return false if
    we find the page needs stable-page treatment. Although this increases
    the memory footprint temporarily, that happens rarely and the memory
    should be reclaimed easily afterwards. It is also better than waiting
    for IO completion, which is on the critical path for application latency.
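
    A stand-alone sketch of that decision (the structures are illustrative;
    one boolean stands in for a "device requires stable pages" flag and
    another for the writeback state): if the backing swap device needs
    stable pages and writeback of this page is still in flight,
    reuse_swap_page simply refuses instead of waiting.

    #include <stdio.h>
    #include <stdbool.h>

    struct swap_device { bool stable_writes; };     /* needs stable pages? */
    struct page { bool writeback; struct swap_device *swap; };

    /* Model of the early exit added to reuse_swap_page(). */
    static bool reuse_swap_page(struct page *page)
    {
            if (page->writeback && page->swap->stable_writes)
                    return false;   /* don't reuse: the device may still read it */
            return true;            /* other conditions omitted in this model    */
    }

    int main(void)
    {
            struct swap_device zram = { .stable_writes = true };
            struct page page = { .writeback = true, .swap = &zram };

            printf("reuse allowed: %d\n", reuse_swap_page(&page));
            return 0;
    }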

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
    Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Darrick J. Wong
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations on 32-bit kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot of
    page cache in the Normal zone which should satisfy this lowmem request.
    Further debugging has shown that the reclaim cannot make any forward
    progress because the page cache is hidden in the active list, which
    doesn't get rotated because inactive_list_is_low is not memcg aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes, which doesn't make any sense. We can simply end up
    always seeing the resulting active and inactive counts as 0 and return
    false. This issue is not limited to 32-bit kernels, but in practice the
    effect on systems without CONFIG_HIGHMEM would be much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
    and subtract per-memcg highmem counts when memcg is enabled. Introduce
    helper lruvec_zone_lru_size which redirects to either zone counters or
    mem_cgroup_get_zone_lru_size when appropriate.

    We lose the "empty LRU but non-zero lru_size" detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 20f664aabeb88d582b623a625f83b0454fa34f07 upstream.

    Andreas reported [1] that a test in jemalloc hangs in THP mode on arm64:

    http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

    The problem is that the page fault handler doesn't currently support
    dirty-bit emulation of the pmd on architectures without a hardware
    dirty bit, so the application gets stuck until the VM marks the pmd dirty.

    How the emulation works depends on the architecture. In the case of
    arm64, when a pte is first set up, PTE_RDONLY is set so that the pte
    gets a chance to be marked dirty via a page fault triggered when a store
    access happens. Once the page fault occurs, the VM marks the pmd dirty
    and the arch code for setting the pmd clears PTE_RDONLY so the
    application can proceed.

    IOW, if the VM doesn't mark the pmd dirty, the application hangs forever
    in repeated faults (i.e., store op, but the pmd stays PTE_RDONLY).

    This patch enables pmd dirty-bit emulation for those architectures.

    [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: Andreas Schwab
    Tested-by: Andreas Schwab
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Jason Evans
    Cc: Will Deacon
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 965d004af54088d138f806d04d803fb60d441986 upstream.

    Currently in DAX if we have three read faults on the same hole address we
    can end up with the following:

    Thread 0                 Thread 1                 Thread 2
    --------                 --------                 --------
    dax_iomap_fault
    grab_mapping_entry
      lock_slot

                             dax_iomap_fault
                             grab_mapping_entry
                             get_unlocked_mapping_entry

                                                      dax_iomap_fault
                                                      grab_mapping_entry
                                                      get_unlocked_mapping_entry

    dax_load_hole
      find_or_create_page
      ...
      page_cache_tree_insert
        dax_wake_mapping_entry_waiter
        __radix_tree_replace

                             get_page
                             lock_page
                             ...
                             put_locked_mapping_entry
                             unlock_page
                             put_page

    The crux of the problem is that once we insert a 4k zero page, all
    locking from then on is done in terms of that 4k zero page and any
    additional threads sleeping on the empty DAX entry will never be woken.

    Fix this by waking all sleepers when we replace the DAX radix tree entry
    with a 4k zero page. This will allow all sleeping threads to
    successfully transition from locking based on the DAX empty entry to
    locking on the 4k zero page.

    With the test case reported by Xiong this happens very regularly in my
    test setup, with some runs resulting in 9+ threads in this deadlocked
    state. With this fix I've been able to run that same test dozens of
    times in a loop without issue.

    Fixes: ac401cc78242 ("dax: New fault locking")
    Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Xiong Zhou
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
     

12 Jan, 2017

4 commits

  • commit 6afcf8ef0ca0a69d014f8edb613d94821f0ae700 upstream.

    Since commit bda807d44454 ("mm: migrate: support non-lru movable page
    migration") isolate_migratepages_block) can isolate !PageLRU pages which
    would acct_isolated account as NR_ISOLATED_*. Accounting these non-lru
    pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide
    heuristics based on those counters such as pgdat_reclaimable_pages resp.
    too_many_isolated which would lead to unexpected stalls during the
    direct reclaim without any good reason. Note that
    __alloc_contig_migrate_range can isolate a lot of pages at once.

    On mobile devices such as a 512MB RAM Android phone, a big zram swap
    may be in use. In some cases zram (zsmalloc) uses too many non-lru but
    migratable pages, such as:

    MemTotal: 468148 kB
    Normal free:5620kB
    Free swap:4736kB
    Total swap:409596kB
    ZRAM: 164616kB(zsmalloc non-lru pages)
    active_anon:60700kB
    inactive_anon:60744kB
    active_file:34420kB
    inactive_file:37532kB

    Fix this by only accounting lru pages to NR_ISOLATED_* in
    isolate_migratepages_block right after they were isolated, while we
    still know they were on the LRU. Drop acct_isolated because it is
    called after the fact and we've lost that information. Batching the
    per-cpu counter doesn't bring much improvement anyway. Also make sure
    that we uncharge only LRU pages when putting them back on the LRU in
    putback_movable_pages resp. when unmap_and_move migrates the page.
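
    A stand-alone model of the accounting rule (the counter is a plain
    integer): only pages that actually came off an LRU list bump
    NR_ISOLATED_*, and only those are uncharged when they are put back.

    #include <stdio.h>
    #include <stdbool.h>

    static long nr_isolated_file;

    static void isolate_page(bool was_on_lru)
    {
            if (was_on_lru)                 /* count right at isolation time */
                    nr_isolated_file++;
    }

    static void putback_page(bool was_on_lru)
    {
            if (was_on_lru)                 /* uncharge only what was charged */
                    nr_isolated_file--;
    }

    int main(void)
    {
            isolate_page(true);             /* ordinary page cache page        */
            isolate_page(false);            /* non-lru movable page (zsmalloc) */
            putback_page(true);
            putback_page(false);
            printf("NR_ISOLATED_FILE = %ld\n", nr_isolated_file);  /* stays 0 */
            return 0;
    }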

    [mhocko@suse.com: replace acct_isolated() with direct counting]
    Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
    Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
    Signed-off-by: Ming Ling
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ming Ling
     
  • commit 59749e6ce53735d8b696763742225f126e94603f upstream.

    The radix tree counts valid entries in each tree node. Entries stored
    in the tree cannot be removed by simply storing NULL in the slot, or
    the internal counters will be off and the node never gets freed again.

    When collapsing a shmem page fails, restore the holes that were filled
    with radix_tree_insert() with a proper radix tree deletion.

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 91a45f71078a6569ec3ca5bef74e1ab58121d80e upstream.

    Patch series "mm: workingset: radix tree subtleties & single-page file
    refaults", v3.

    This is another revision of the radix tree / workingset patches based on
    feedback from Jan and Kirill.

    This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
    entries without radix tree node"). That patch fixed an issue that was
    caused mainly by the page cache sneaking special shadow page entries
    into the radix tree and relying on subtleties in the radix tree code to
    make that work. The fix also had to stop tracking refaults for
    single-page files because shadow pages stored as direct pointers in
    radix_tree_root->rnode weren't properly handled during tree extension.

    These patches make the radix tree code explicitly support and track
    such special entries, to eliminate the subtleties and to restore the
    thrash detection for single-page files.

    This patch (of 9):

    When a radix tree iteration drops the tree lock, another thread might
    swoop in and free the node holding the current slot. The iteration
    needs to do another tree lookup from the current index to continue.

    [kirill.shutemov@linux.intel.com: re-lookup for replacement]
    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 3999f52e3198e76607446ab1a4610c1ddc406c56 upstream.

    We cannot use the pte value passed to set_pte_at for the pte_same
    comparison, because archs like ppc64 filter/add new pte flags in
    set_pte_at. Instead, fetch the pte value inside hugetlb_cow. We
    compare pte values to make sure the pte didn't change since we dropped
    the page table lock. hugetlb_cow gets called with the page table lock
    held, so we can take a copy of the pte value before we drop the lock.

    With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
    previous mapping (huge_pte_none entries) by forcing a cow in the fault
    path. This avoids taking an additional fault to convert a read-only
    mapping to read/write. Here we were comparing a recently instantiated
    pte (via set_pte_at) to the pte value from the Linux page table. As
    explained above, on ppc64 such a pte_same check returns the wrong
    result, resulting in us taking an additional fault on ppc64.
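
    A stand-alone model of the fix (a flag and a plain integer stand in for
    the page table lock and the pte): the value used for the later
    pte_same-style comparison is captured while the lock is still held, not
    re-derived from what was handed to set_pte_at.

    #include <stdio.h>
    #include <stdbool.h>

    static bool ptl_held;                   /* stands in for the page table lock */
    static unsigned long pte = 0x1000;      /* the pte in the page table         */

    /* Model: set_pte_at() may filter/add arch flags, so the value passed to it
     * is not safe to use for a later pte_same() comparison. */
    static void set_pte_at(unsigned long val) { pte = val | 0x8; }

    static void hugetlb_cow(void)
    {
            /* Entered with the lock held: snapshot the pte value now. */
            unsigned long orig_pte = pte;

            ptl_held = false;               /* drop the lock, allocate/copy page */
            ptl_held = true;                /* retake the lock                   */

            if (pte == orig_pte)            /* pte_same() against the snapshot   */
                    set_pte_at(0x2000);
            else
                    printf("pte changed while unlocked, back out\n");
    }

    int main(void)
    {
            ptl_held = true;
            hugetlb_cow();
            ptl_held = false;
            printf("pte = %#lx\n", pte);
            return 0;
    }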

    Fixes: 6a119eae942c ("powerpc/mm: Add a _PAGE_PTE bit")
    Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Jan Stancek
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Scott Wood
    Cc: Michael Ellerman
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

06 Jan, 2017

5 commits

  • commit a6de734bc002fe2027ccc074fbbd87d72957b7a4 upstream.

    Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
    defer debugging checks of pages allocated from the PCP") will allow the
    per-cpu list counter to be out of sync with the per-cpu list contents if
    a struct page is corrupted.

    The consequence is an infinite loop if the per-cpu lists get fully
    drained by free_pcppages_bulk because all the lists are empty but the
    count is positive. The infinite loop occurs here

    do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                    migratetype = 0;
            list = &pcp->lists[migratetype];
    } while (list_empty(list));

    What the user sees is a bad page warning followed by a soft lockup with
    interrupts disabled in free_pcppages_bulk().

    This patch keeps the accounting in sync.
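
    A stand-alone model of the invariant (per-migratetype counters stand in
    for the per-cpu lists): the count is decremented as each page is
    actually taken off a list, so the counter can never claim pages that no
    list contains.

    #include <stdio.h>

    struct pcp {
            int count;              /* total pages on all pcp lists       */
            int lists[3];           /* pages per migratetype (model only) */
    };

    static void free_pcppages_bulk(struct pcp *pcp, int to_free)
    {
            int migratetype = 0;

            while (to_free > 0 && pcp->count > 0) {
                    while (pcp->lists[migratetype] == 0)
                            migratetype = (migratetype + 1) % 3;
                    pcp->lists[migratetype]--;
                    pcp->count--;   /* keep the counter in sync per page */
                    to_free--;
            }
    }

    int main(void)
    {
            struct pcp pcp = { .count = 5, .lists = { 2, 3, 0 } };

            free_pcppages_bulk(&pcp, 5);
            printf("count = %d\n", pcp.count);
            return 0;
    }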

    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Jesper Dangaard Brouer
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 5f33a0803bbd781de916f5c7448cbbbbc763d911 upstream.

    Our system uses significantly more slab memory with memcg enabled with
    the latest kernel. With a 3.10 kernel, slab uses 2G of memory, while
    with a 4.6 kernel, 6G of memory is used. The shrinker has a problem.
    Say we have two memcgs for one shrinker. In do_shrink_slab:

    1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size
    is 1024, then no memory is freed. nr_deferred = 700

    2. Check cg2. nr_deferred = 700. Assume freeable = 20, then
    total_scan = 10 or 40. Let's assume it's 10. No memory is freed.
    nr_deferred = 10.

    The deferred share of cg1 is lost in this case. kswapd will free no
    memory even run above steps again and again.

    The fix makes sure one memcg's deferred share isn't lost.

    Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 upstream.

    It is the reasonable expectation that if an executable file is not
    readable, there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach, but if ptrace
    is already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm,
    spin a variant called ptrace_access_vm that will fail if the
    target process is not being ptraced by the current process, or
    the current process did not have sufficient privileges when ptracing
    began to read the target process's mm.

    In the ptrace implementations, replace access_process_vm with
    ptrace_access_vm. There remain several ptrace sites that still use
    access_process_vm as they are reading the target executable's
    instructions (for kernel consumption) or register stacks. As such it
    does not appear necessary to add a permission check to those calls.

    This bug has always existed in Linux.

    Fixes: v1.0
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit d05c5f7ba164aed3db02fb188c26d0dd94f5455b upstream.

    We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where an md5sum of a maximally sized file fails because the final read is
    exactly at s_maxbytes.
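
    A stand-alone model of the corrected end condition (s_maxbytes and the
    file are plain variables): a read starting at or beyond EOF returns 0
    rather than an error; only an offset beyond the filesystem limit is
    rejected.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    #define S_MAXBYTES 64           /* filesystem limit in this model */

    static long do_read(const char *file, long file_size, long pos,
                        char *buf, long len)
    {
            if (pos > S_MAXBYTES)
                    return -EINVAL;         /* truly out of range */
            if (pos >= file_size)
                    return 0;               /* EOF, not an error  */
            if (len > file_size - pos)
                    len = file_size - pos;
            memcpy(buf, file + pos, len);
            return len;
    }

    int main(void)
    {
            char data[S_MAXBYTES] = "maximally sized file";
            char buf[16];

            /* A read starting exactly at the end must report EOF (0). */
            printf("%ld\n", do_read(data, S_MAXBYTES, S_MAXBYTES, buf, sizeof(buf)));
            return 0;
    }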

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

    During exec dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only necessary userspace breakage by adding
    a user namespace owner to mm_struct, captured at the time of exec, so
    it is clear in which user namespace CAP_SYS_PTRACE must be present in
    to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes its cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag to note that whatever permission changes
    the task might go through the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same
    as or a descendant of mm->user_ns, which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns is the worst case for the task's
    credentials.

    To prevent regressions, mm->dumpable and mm->user_ns are not considered
    when a task has no mm, as simply failing ptrace_may_attach causes
    regressions in privileged applications attempting to read things
    such as /proc//stat.

    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

07 Dec, 2016

1 commit

  • The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
    want to race with generating new pages by faulting them in.

    However, the wait-queue used to delay the page faulting has a serious
    problem: the wait queue head (in shmem_fallocate()) is allocated on the
    stack, and the code expects that "wake_up_all()" will make sure that all
    the queue entries are gone before the stack frame is de-allocated.

    And that is not at all necessarily the case.

    Yes, a normal wake-up sequence will remove the wait-queue entry that
    caused the wakeup (see "autoremove_wake_function()"), but the key
    wording there is "that caused the wakeup". When there are multiple
    possible wakeup sources, the wait queue entry may well stay around.

    And _particularly_ in a page fault path, we may be faulting in new pages
    from user space while we also have other things going on, and there may
    well be other pending wakeups.

    So despite the "wake_up_all()", it's not at all guaranteed that all list
    entries are removed from the wait queue head on the stack.

    Fix this by introducing a new wakeup function that removes the list
    entry unconditionally, even if the target process had already woken up
    for other reasons. Use that "synchronous" function to set up the
    waiters in shmem_fault().

    This problem has never been seen in the wild afaik, but Dave Jones has
    reported it on and off while running trinity. We thought we fixed the
    stack corruption with the blk-mq rq_list locking fix (commit
    7fe311302f7d: "blk-mq: update hardware and software queues for sleeping
    alloc"), but it turns out there was _another_ stack corruptor hiding
    in the trinity runs.

    Vegard Nossum (also running trinity) was able to trigger this one fairly
    consistently, and made us look once again at the shmem code due to the
    faults often being in that area.

    Reported-and-tested-by: Vegard Nossum
    Reported-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Dec, 2016

2 commits

  • Boris Zhmurov has reported RCU stalls during the kswapd reclaim:

    INFO: rcu_sched detected stalls on CPUs/tasks:
    23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
    (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
    Task dump for CPU 23:
    kswapd1 R running task 0 148 2 0x00000008
    Call Trace:
    shrink_node+0xd2/0x2f0
    kswapd+0x2cb/0x6a0
    mem_cgroup_shrink_node+0x160/0x160
    kthread+0xbd/0xe0
    __switch_to+0x1fa/0x5c0
    ret_from_fork+0x1f/0x40
    kthread_create_on_node+0x180/0x180

    a closer code inspection has shown that we might indeed miss all the
    scheduling points in the reclaim path if no pages can be isolated from
    the LRU list. This is a pathological case but other reports from Donald
    Buczek have shown that we might indeed hit such a path:

    clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
    kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
    [...]
    kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

    This is a minute-long snapshot which didn't take a single page from the
    LRU. It is not entirely clear why only 1303 pages have been scanned
    during that time (maybe heavy IRQ activity was interfering).

    In any case it looks like we can really hit long periods without
    scheduling on non-preemptive kernels, so an explicit cond_resched() in
    shrink_node_memcg, which is independent of the reclaim operation, is due.

    Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Boris Zhmurov
    Tested-by: Boris Zhmurov
    Reported-by: Donald Buczek
    Reported-by: "Christopher S. Aker"
    Reported-by: Paul Menzel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    implementation is not correct though, because memcg_kmem_enabled() might
    become true while we are doing a global reclaim, when sc->memcg might
    be NULL, which is exactly what Marek has seen:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
    IP: [] mem_cgroup_node_nr_lru_pages+0x20/0x40
    PGD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1
    task: ffff880011863b00 task.stack: ffff880011868000
    RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP: e02b:ffff88001186bc70 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
    RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
    R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
    R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
    Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
    Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
    RIP mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP
    CR2: 0000000000000400
    ---[ end trace 100494b9edbdfc4d ]---

    This patch fixes the issue by checking sc->memcg rather than
    memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
    that only memcg aware shrinkers will get non-NULL memcgs and only if
    memcg_kmem_enabled is true.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Marek Marczykowski-Górecki
    Tested-by: Marek Marczykowski-Górecki
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Dec, 2016

5 commits

    Hugetlb pages have ->index in units of the huge page size (PMD_SIZE or
    PUD_SIZE), not in PAGE_SIZE as other types of pages do. This means we
    cannot use page_to_pgoff() to check whether we've got the right page
    for the radix-tree index.

    Let's introduce page_to_index(), which returns the radix-tree index for
    a given page.

    We will be able to get rid of this once hugetlb will be switched to
    multi-order entries.

    Fixes: fc127da085c2 ("truncate: handle file thp")
    Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Doug Nelson
    Tested-by: Doug Nelson
    Reviewed-by: Naoya Horiguchi
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Gcc revision 241896 implements use-after-scope detection. It will be
    available in gcc 7. Support it in KASAN.

    Gcc emits 2 new callbacks to poison/unpoison large stack objects when
    they go in/out of scope. Implement the callbacks and add a test.

    [dvyukov@google.com: v3]
    Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com
    Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • kasan_global struct is part of compiler/runtime ABI. gcc revision
    241983 has added a new field to kasan_global struct. Update kernel
    definition of kasan_global struct to include the new field.

    Without this patch KASAN is broken with gcc 7.

    Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • The following program triggers BUG() in munlock_vma_pages_range():

    // autogenerated by syzkaller (http://github.com/google/syzkaller)
    #define _GNU_SOURCE         /* for the five-argument mremap() */
    #include <sys/mman.h>

    int main()
    {
            mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
            mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul);
            return 0;
    }

    The test case constructs a situation in which munlock_vma_pages_range()
    finds a PTE-mapped THP head in the middle of a page table and, by
    mistake, skips HPAGE_PMD_NR pages after it.

    As a result, on the next iteration it hits the middle of a PMD-mapped
    THP and gets upset seeing a mlocked tail page.

    The solution is to skip HPAGE_PMD_NR pages only if the THP was mlocked
    during munlock_vma_page(). That guarantees the page is PMD-mapped, as
    we never mlock PTE-mapped THPs (a simplified sketch of the skip logic
    follows at the end of this entry).

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: syzkaller
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
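
    A simplified sketch of the corrected skip logic inside the
    while (start < end) loop of munlock_vma_pages_range(); for this sketch,
    assume munlock_vma_page() reports how many pages it actually munlocked
    (0 when the page was not mlocked):

            unsigned int page_mask = 0;
            struct page *page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);

            if (page && !IS_ERR(page)) {
                    unsigned int nr = munlock_vma_page(page);

                    /*
                     * Skip the HPAGE_PMD_NR - 1 tail pages only when a whole
                     * compound page was munlocked, i.e. the THP is PMD-mapped.
                     * A PTE-mapped THP head found in the middle of a page
                     * table must not cause a skip.
                     */
                    if (nr > 1)
                            page_mask = nr - 1;
                    put_page(page);
            }
            start += (page_mask + 1) * PAGE_SIZE;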
     
  • Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c")
    moved code from huge_memory.c to khugepaged.c. Some of this code should
    be compiled only when CONFIG_SYSFS is enabled but the condition around
    this code was not moved into khugepaged.c.

    The result is a build failure when CONFIG_SYSFS is disabled (undefined
    references at link time):

    mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
    mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'

    This commit adds #ifdef CONFIG_SYSFS guards around the sysfs-related
    code (a schematic sketch follows at the end of this entry).

    Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérémy Lefaure
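
    A schematic sketch of the guard; the handler bodies are stubbed out,
    since the point is only which code moves inside #ifdef CONFIG_SYSFS:

    #ifdef CONFIG_SYSFS

    /* Schematic only: in the real code these delegate to
     * single_hugepage_flag_show()/single_hugepage_flag_store(), which are
     * themselves built only when CONFIG_SYSFS is enabled -- hence the
     * undefined references without this guard. */
    static ssize_t khugepaged_defrag_show(struct kobject *kobj,
                                          struct kobj_attribute *attr,
                                          char *buf)
    {
            return 0;       /* real body: single_hugepage_flag_show(...) */
    }

    static ssize_t khugepaged_defrag_store(struct kobject *kobj,
                                           struct kobj_attribute *attr,
                                           const char *buf, size_t count)
    {
            return count;   /* real body: single_hugepage_flag_store(...) */
    }

    /* ... the remaining khugepaged sysfs attributes and their
     * kobj_attribute definitions stay inside the same guard ... */

    #endif /* CONFIG_SYSFS */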
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the dirty check to after the PTE has been removed from
    the page table (a fragment sketch follows at the end of this entry).

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
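
    A fragment sketch of the reordering inside the move_ptes() loop
    (simplified; locals as in the kernel function):

            /*
             * Clear the old PTE first; only then look at the dirty bit of
             * the value we got back.  A hardware walker can no longer dirty
             * a PTE that is gone from the page table, so the check can no
             * longer race with a late write.
             */
            pte = ptep_get_and_clear(mm, old_addr, old_pte);
            if (pte_present(pte) && pte_dirty(pte))
                    force_flush = true;     /* flush before dropping the PTL */
            pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
            set_pte_at(mm, new_addr, new_pte, pte);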
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window
    with a msleep(), but have not been able to hit it without modifying the
    code. So we think it's a real issue, but one that is difficult or
    impossible to hit in practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). This
    patch fixes the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED; it has two threads, bound to two
    different CPUs, e.g. CPU1 and CPU2. mmap returned X, and thread 1 then
    wrote to addr X so that CPU1 now has a writable TLB entry for addr X.
    Thread 2 starts mremapping from addr X to Y while thread 1 cleans the
    page and then writes to the old addr X again. The second write from
    thread 1 could succeed, but the value will be lost.

    thread 1 (bound to CPU1) and thread 2 (bound to CPU2) interleave as
    follows:

    1: write 1 to addr X to get a
    writeable TLB on this CPU

    2: mremap starts

    3: move_ptes emptied PTE for addr X
    and setup new PTE for addr Y and
    then dropped PTL for X and Y

    4: page laundering for N by doing
    fadvise FADV_DONTNEED. When done,
    pageframe N is deemed clean.

    5: *write 2 to addr X

    6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
    page unmapped

    8: fadvise with FADV_DONTNEED again
    to kick the page off the pagecache

    9: pread the page from file to verify
    the value. If 1 is there, it means
    we have lost the written 2.

    *the write may or may not cause a segmentation fault; it depends on
    whether the writable TLB entry is still present on the CPU.

    Please note that this is only one specific way the race can occur; it
    does not mean the race can only occur in exactly the above
    configuration. For example, more than two threads could be involved,
    fadvise() could be done in another thread, etc.

    Anonymous pages can race between mremap() and page reclaim as well. For
    THP, a huge PMD is moved by mremap to a new huge PMD, then the new huge
    PMD gets unmapped/split/paged out before the TLB flush for the old huge
    PMD has happened in move_page_tables(), and we could still write data
    to it. Normal anonymous pages are in a similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd()
    and, if one is found, do the flush before dropping the PTL (a fragment
    sketch follows at the end of this entry). If we did the flush for every
    move_ptes()/move_huge_pmd() call, then we would not need to flush the
    whole range in move_page_tables(); but if we did not, we still need the
    whole-range flush there.

    Alternatively, we could track which parts of the range were flushed in
    move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
    whole range in move_page_tables(). But that would require multiple TLB
    flushes for the different sub-ranges and should be less efficient than
    a single whole-range flush.

    KBuild test on my Sandybridge desktop doesn't show any noticeable change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
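
    A fragment sketch of the flush policy at the end of move_ptes(),
    simplified; need_flush is assumed to be the flag that move_page_tables()
    consults for its whole-range flush, and len the size of the range this
    call handled:

            /*
             * If any of the PTEs we moved was dirty, flush the old range
             * before the page-table lock is dropped so page_mkclean() and
             * writeback cannot see a clean PTE while a writable TLB entry
             * for it is still live on another CPU.  Otherwise defer to the
             * single whole-range flush in move_page_tables().
             */
            if (force_flush)
                    flush_tlb_range(vma, old_end - len, old_end);
            else
                    *need_flush = true;
            pte_unmap_unlock(old_pte - 1, old_ptl);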
     

12 Nov, 2016

2 commits

  • Limit the number of kmemleak false positives by including
    .data.ro_after_init in memory scanning. To achieve this we need to add
    symbols for the start and end of the section to the linker scripts (an
    illustrative sketch follows at the end of this entry).

    The problem was uncovered by commit 56989f6d8568 ("genetlink: mark
    families as __ro_after_init").

    Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com
    Reviewed-by: Catalin Marinas
    Signed-off-by: Jakub Kicinski
    Cc: Arnd Bergmann
    Cc: Cong Wang
    Cc: Johannes Berg
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Kicinski
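
    As an illustration of the mechanism only (the identifiers below are
    placeholders, not the kernel's actual symbol names): once the linker
    script publishes start/end symbols for the section, C code can treat
    them as an ordinary address range to hand to a scanner:

    /* Set by the linker script at the start and end of the section;
     * placeholder names for illustration only. */
    extern const char __my_section_start[], __my_section_end[];

    /* Hand the [start, end) byte range to whatever scanner wants it. */
    static void scan_my_section(void (*scan_range)(const void *start,
                                                   const void *end))
    {
            scan_range(__my_section_start, __my_section_end);
    }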
     
  • While testing OBJFREELIST_SLAB integration with pagealloc, we found a
    bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
    CFLGS_OBJFREELIST_SLAB set. When that happened, critical allocations
    needed for loading drivers or creating new caches would fail.

    The original kmem_cache is created early, which makes OFF_SLAB not
    possible. When kmem_cache(sys) is created, OFF_SLAB is possible, and if
    pagealloc is enabled the allocator will try to enable it first under
    certain conditions. Because kmem_cache(sys) reuses the original flags,
    you can end up with both flags set at the same time, resulting in
    allocation failures and odd behaviors.

    This fix discards allocator-specific flags from the flags used for the
    memcg cache before calling create_cache() (a sketch follows at the end
    of this entry).

    The bug has existed since 4.6-rc1 and affects debug pagealloc test
    configurations.

    Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
    Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: Thomas Garnier
    Tested-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
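
    A fragment sketch of the shape of the fix inside
    memcg_create_kmem_cache(), assuming the 4.9-era internal helper
    create_cache() and the CACHE_CREATE_MASK definition from mm/slab.h:

            /*
             * Pass only the flags a kmem_cache_create() caller could
             * legitimately request; allocator-internal bits such as
             * CFLGS_OFF_SLAB and CFLGS_OBJFREELIST_SLAB are recomputed for
             * the new cache instead of being inherited from the root cache.
             */
            s = create_cache(cache_name, root_cache->object_size,
                             root_cache->size, root_cache->align,
                             root_cache->flags & CACHE_CREATE_MASK,
                             root_cache->ctor, memcg, root_cache);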