10 Nov, 2015

2 commits

  • commit 47aee4d8e314384807e98b67ade07f6da476aa75 upstream.

    Use is_zero_pfn() on pteval only after pte_present() check on pteval
    (It might be a better idea to introduce is_zero_pte(), which checks
    pte_present() first.)

    Otherwise, when working on a swap or migration entry, if pte_pfn's
    result happens to equal zero_pfn, we lose the user's data in
    __collapse_huge_page_copy(). So if you're unlucky, the application
    segfaults and you finally see the message below on exit:

    BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3

    Fixes: ca0984caa823 ("mm: incorporate zero pages into transparent huge pages")
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrea Arcangeli
    Acked-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 296291cdd1629c308114504b850dc343eabc2782 upstream.

    Currently a simple program below issues a sendfile(2) system call which
    takes about 62 days to complete in my test KVM instance.

    /* needs <fcntl.h>, <unistd.h> and <sys/sendfile.h> */
    int main(void)
    {
            int fd;
            off_t off = 0;

            fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
            ftruncate(fd, 2);
            lseek(fd, 0, SEEK_END);
            sendfile(fd, fd, &off, 0xfffffff);
            return 0;
    }

    Now, you should not ask the kernel to do something as silly as copying
    256MB in 2-byte chunks and calling fsync(2) after each chunk, but if you
    do, the sysadmin should have a way to stop you.

    We actually do have a check for fatal_signal_pending() in
    generic_perform_write() which triggers in this path. However, because we
    always succeed in writing something before the check is done, we return a
    value > 0 from generic_perform_write() and thus the information about the
    signal gets lost.

    Fix the problem by doing the signal check before writing anything. That
    way generic_perform_write() returns -EINTR, the error gets propagated up
    and the sendfile loop terminates early.
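
    A minimal user-space sketch of that logic (illustrative only, not the
    kernel's code; the helper and numbers are made up for the demo): when the
    pending-signal check runs only after a chunk has been copied, the positive
    byte count hides the interruption, while checking first lets an
    -EINTR-style error propagate to the caller.

    #include <stdio.h>
    #include <stdbool.h>

    #define EINTR_ERR (-4)    /* stand-in for -EINTR */

    /* one generic_perform_write()-like iteration: 2 bytes per chunk */
    static long perform_write(bool signal_pending, bool check_before_copy)
    {
            long copied = 0;

            if (check_before_copy && signal_pending)
                    return EINTR_ERR;  /* nothing copied yet, the error propagates */

            copied += 2;               /* copying a chunk itself always succeeds */

            /* checking only here is too late: copied is already > 0, so the
             * caller never learns that a fatal signal was pending */
            return copied;
    }

    int main(void)
    {
            printf("check after copying : %ld\n", perform_write(true, false)); /* 2  */
            printf("check before copying: %ld\n", perform_write(true, true));  /* -4 */
            return 0;
    }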

    Signed-off-by: Jan Kara
    Reported-by: Dmitry Vyukov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

27 Oct, 2015

1 commit

  • commit 424cdc14138088ada1b0e407a2195b2783c6e5ef upstream.

    page_counter_memparse() returns pages for the threshold, while
    mem_cgroup_usage() returns bytes for memory usage. Convert the
    threshold to bytes.
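
    A small stand-alone illustration of the unit mismatch (a PAGE_SHIFT of 12,
    i.e. 4 KiB pages, is only an assumption for this demo):

    #include <stdio.h>

    #define PAGE_SHIFT 12    /* assumed 4 KiB pages, illustration only */

    int main(void)
    {
            unsigned long threshold_pages = 100;           /* parsed in pages   */
            unsigned long usage_bytes     = 50 * 4096UL;   /* reported in bytes */

            /* comparing bytes against pages falsely reports the threshold as crossed */
            printf("raw compare  : %d\n", usage_bytes > threshold_pages);
            /* converting the threshold to bytes first gives the intended answer */
            printf("after convert: %d\n",
                   usage_bytes > (threshold_pages << PAGE_SHIFT));
            return 0;
    }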

    Fixes: 3e32cb2e0a12b6915 ("memcg: rename cgroup_event to mem_cgroup_event").
    Signed-off-by: Shaohua Li
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

23 Oct, 2015

3 commits

  • commit 03a2d2a3eafe4015412cf4e9675ca0e2d9204074 upstream.

    Commit description is copied from the original post of this bug:

    http://comments.gmane.org/gmane.linux.kernel.mm/135349

    Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
    larger cache size than the size index INDEX_NODE mapping. In kernels
    3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.

    However, sometimes we can't get the right output we expected via
    kmalloc_size(INDEX_NODE + 1), causing a BUG().

    The mapping table in the latest kernel is like:
    index = { 0,  1,   2,  3,  4,  5,  6, n}
    size  = { 0, 96, 192,  8, 16, 32, 64, 2^n}

    The mapping table before 3.10 is like this:
    index = {  0,  1,  2,   3,   4,   5,   6, n}
    size  = { 32, 64, 96, 128, 192, 256, 512, 2^(n+3)}

    The problem on my mips64 machine is as follows:

    (1) When DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC &&
    DEBUG_SPINLOCK are configured, sizeof(struct kmem_cache_node) will be 150,
    and the macro INDEX_NODE turns out to be 2:
    #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))

    (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.

    (3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size
    = PAGE_SIZE".

    (4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and
    "flags |= CFLGS_OFF_SLAB" will be covered.

    (5) The "if (flags & CFLGS_OFF_SLAB)" test will be satisfied and we will
    go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result
    here may be NULL during kernel bootup.

    (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
    BUG info as the following shows (may be only mips64 has this problem):

    This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
    the BUG by adding a 'size >= 256' check to guarantee that all necessary
    small-sized slabs are initialized regardless of the sequence of slab sizes
    in the mapping table.
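
    A tiny stand-alone illustration of step (2), using the post-3.10 mapping
    quoted above (the array below is just that table written down, not the
    kernel's kmalloc_caches):

    #include <stdio.h>

    /* index -> size mapping from the table above (post-3.10 kernels) */
    static const int kmalloc_size_of[] = { 0, 96, 192, 8, 16, 32, 64 };

    int main(void)
    {
            int index_node = 2;   /* the INDEX_NODE value seen on the mips64 box */

            /* prints 8 - not the "next larger cache size" the caller expected */
            printf("kmalloc_size(INDEX_NODE + 1) = %d\n",
                   kmalloc_size_of[index_node + 1]);
            return 0;
    }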

    Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
    Signed-off-by: Joonsoo Kim
    Reported-by: Liuhailong
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream.

    SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I find a linux bug, I have the test cases is constructed. I
    can stable recurring problems in fedora22(4.0.4) kernel version,
    arch for x86_64. I construct transparent huge page, when the parent
    and child process with MAP_SHARE, MAP_PRIVATE way to access the same
    huge page area, it has the opportunity to lead to huge page copy on
    write failure, and then it will munmap the child corresponding mmap
    area, but then the child mmap area with VM_MAYSHARE attributes, child
    process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
    functions (vma->vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 3aaa76e125c1dd58c9b599baa8c6021896874c12 upstream.

    Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    each hugetlb page maintains its active flag to avoid a race condition
    between multiple calls of isolate_huge_page(), but the current kernel
    doesn't set the flag on a hugepage allocated by migration because the
    proper putback routine isn't called. This means that users could still
    encounter the race referred to by bcc54222309c in this special case, so
    this patch fixes it.

    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

30 Sep, 2015

2 commits

  • commit c54839a722a02818677bcabe57e957f0ce4f841d upstream.

    reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
    number of pages removed from the candidate list. But shrink_page_list()
    puts back mlocked pages without passing it to caller and without
    counting as nr_reclaimed. This increases nr_isolated.

    To fix this, this patch changes shrink_page_list() to pass unevictable
    pages back to the caller. The caller will take care of those pages.

    Minchan said:

    It fixes two issues.

    1. With unevictable page, cma_alloc will be successful.

    Strictly speaking, cma_alloc in the current kernel will fail due to
    unevictable pages.

    2. fix leaking of NR_ISOLATED counter of vmstat

    With it, too_many_isolated works. Otherwise, it could hang until the
    process gets SIGKILL.

    Signed-off-by: Jaewon Kim
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jaewon Kim
     
  • commit 2f064f3485cd29633ad1b3cfb00cc519509a3d72 upstream.

    Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
            skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping to
    NULL and leave the page->index value alone. Due to being in a union, a
    non-zero page->index will be interpreted as a true page->pfmemalloc.

    So the assumption is invalid if the networking code can see such a page.
    And it seems it can. We have encountered this with an NFS-over-loopback
    setup when such a page is attached to a new skbuff. There is no copying
    going on in this case, so the page confuses __skb_fill_page_desc(), which
    interprets the index as the pfmemalloc flag, and the network stack drops
    packets that have been allocated using the reserves unless they are to be
    queued on sockets handling the swapping (which is the case here), and
    that leads to hangs when the nfs client waits for a response from the
    server which has been dropped and thus never arrives.

    The struct page is already heavily packed so rather than finding another
    hole to put it in, let's do a trick instead. We can reuse the index
    again but define it to an impossible value (-1UL). This is the page
    index, so it should never see a value that large. Replace all direct
    users of page->pfmemalloc with page_is_pfmemalloc(), which will hide this
    nastiness from unspoiled eyes.

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).
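
    A toy analogue of that trick (illustrative only, not the real struct page
    or the kernel helpers): reuse a union'd field and treat an "impossible"
    value as the flag.

    #include <stdio.h>
    #include <stdbool.h>

    struct toy_page {
            union {
                    unsigned long index;   /* page offset within its mapping */
                    void *freelist;        /* overlapping use of the same slot */
            };
    };

    static void set_toy_pfmemalloc(struct toy_page *p) { p->index = -1UL; }
    static bool toy_is_pfmemalloc(struct toy_page *p)  { return p->index == -1UL; }

    int main(void)
    {
            struct toy_page p;

            p.index = 42;                           /* any real page index */
            printf("%d\n", toy_is_pfmemalloc(&p));  /* 0 */
            set_toy_pfmemalloc(&p);
            printf("%d\n", toy_is_pfmemalloc(&p));  /* 1: -1UL is never a valid index */
            return 0;
    }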

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

14 Sep, 2015

2 commits

  • commit 036138080a4376e5f3e5d0cca8ac99084c5cf06e upstream.

    Hugetlbfs pages will get a refcount in get_any_page() or
    madvise_hwpoison() if soft offlining is requested through madvise. The
    refcount which is held by the soft offline path should be released if we
    fail to isolate hugetlbfs pages.

    Fix it by reducing the refcount for both isolation success and failure.

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     
  • commit 4f32be677b124a49459e2603321c7a5605ceb9f8 upstream.

    After trying to drain pages from the pagevec/pageset, we try to get the
    reference count of the page again; however, the reference count of the
    page is not reduced if the page is still not on the LRU list.

    Fix it by adding a put_page() to drop the page reference which was taken
    in __get_any_page().

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     

17 Aug, 2015

1 commit

  • commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.

    Nikolay has reported a hang when a memcg reclaim got stuck with the
    following backtrace:

    PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
    #0 __schedule at ffffffff815ab152
    #1 schedule at ffffffff815ab76e
    #2 schedule_timeout at ffffffff815ae5e5
    #3 io_schedule_timeout at ffffffff815aad6a
    #4 bit_wait_io at ffffffff815abfc6
    #5 __wait_on_bit at ffffffff815abda5
    #6 wait_on_page_bit at ffffffff8111fd4f
    #7 shrink_page_list at ffffffff81135445
    #8 shrink_inactive_list at ffffffff81135845
    #9 shrink_lruvec at ffffffff81135ead
    #10 shrink_zone at ffffffff811360c3
    #11 shrink_zones at ffffffff81136eff
    #12 do_try_to_free_pages at ffffffff8113712f
    #13 try_to_free_mem_cgroup_pages at ffffffff811372be
    #14 try_charge at ffffffff81189423
    #15 mem_cgroup_try_charge at ffffffff8118c6f5
    #16 __add_to_page_cache_locked at ffffffff8112137d
    #17 add_to_page_cache_lru at ffffffff81121618
    #18 pagecache_get_page at ffffffff8112170b
    #19 grow_dev_page at ffffffff811c8297
    #20 __getblk_slow at ffffffff811c91d6
    #21 __getblk_gfp at ffffffff811c92c1
    #22 ext4_ext_grow_indepth at ffffffff8124565c
    #23 ext4_ext_create_new_leaf at ffffffff81246ca8
    #24 ext4_ext_insert_extent at ffffffff81246f09
    #25 ext4_ext_map_blocks at ffffffff8124a848
    #26 ext4_map_blocks at ffffffff8121a5b7
    #27 mpage_map_one_extent at ffffffff8121b1fa
    #28 mpage_map_and_submit_extent at ffffffff8121f07b
    #29 ext4_writepages at ffffffff8121f6d5
    #30 do_writepages at ffffffff8112c490
    #31 __filemap_fdatawrite_range at ffffffff81120199
    #32 filemap_flush at ffffffff8112041c
    #33 ext4_alloc_da_blocks at ffffffff81219da1
    #34 ext4_rename at ffffffff81229b91
    #35 ext4_rename2 at ffffffff81229e32
    #36 vfs_rename at ffffffff811a08a5
    #37 SYSC_renameat2 at ffffffff811a3ffc
    #38 sys_renameat2 at ffffffff811a408e
    #39 sys_rename at ffffffff8119e51e
    #40 system_call_fastpath at ffffffff815afa89

    Dave Chinner has properly pointed out that this is a deadlock in the
    reclaim code because ext4 doesn't submit pages which are marked by
    PG_writeback right away.

    The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
    with too many dirty pages") and it was applied only when may_enter_fs
    was specified. The code has been changed by c3b94f44fcb0 ("memcg:
    further prevent OOM with too many dirty pages") which has removed the
    __GFP_FS restriction with a reasoning that we do not get into the fs
    code. But this is not sufficient apparently because the fs doesn't
    necessarily submit pages marked PG_writeback for IO right away.

    ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
    submit the bio. Instead it tries to map more pages into the bio and
    mpage_map_one_extent might trigger memcg charge which might end up
    waiting on a page which is marked PG_writeback but hasn't been submitted
    yet so we would end up waiting for something that never finishes.

    Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
    before we go to wait on the writeback. The page fault path, which is
    the only path that triggers memcg oom killer since 3.12, shouldn't
    require GFP_NOFS and so we shouldn't reintroduce the premature OOM
    killer issue which was originally addressed by the heuristic.

    As per David Chinner, xfs has been doing a similar thing since 2.6.15
    already, so ext4 is not the only affected filesystem. Moreover he notes:

    : For example: IO completion might require unwritten extent conversion
    : which executes filesystem transactions and GFP_NOFS allocations. The
    : writeback flag on the pages can not be cleared until unwritten
    : extent conversion completes. Hence memory reclaim cannot wait on
    : page writeback to complete in GFP_NOFS context because it is not
    : safe to do so, memcg reclaim or otherwise.

    Cc: stable@vger.kernel.org # 3.9+
    [tytso@mit.edu: corrected the control flow]
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Reported-by: Nikolay Borisov
    Signed-off-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

04 Aug, 2015

2 commits

  • commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream.

    Reading the page fault handler code I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings: if
    the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated
    on ->mmap(), the kernel would handle a page fault to a not-populated pte
    with do_anonymous_page().

    Let's change page fault handler to use do_anonymous_page() only on
    anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault() or shared VMAs without vm_ops,
    a page fault on a pte_none() entry would lead to SIGBUS.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 641844f5616d7c6597309f560838f996466d7aac upstream.

    Currently the initial value of order in dissolve_free_huge_page is 64 or
    32, which leads to the following warning from the static checker:

    mm/hugetlb.c:1203 dissolve_free_huge_pages()
    warn: potential right shift more than type allows '9,18,64'

    This is a potential risk of an infinite loop, because 1 << order (== 0)
    is used in a for-loop like this:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
            ...

    So this patch fixes it by using global minimum_order calculated at boot time.

     text  data    bss     dec    hex  filename
    28313   469  84236  113018  1b97a  mm/hugetlb.o
    28256   473  84236  112965  1b945  mm/hugetlb.o (patched)

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Reported-by: Dan Carpenter
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

22 Jul, 2015

3 commits

  • commit 0867a57c4f80a566dda1bac975b42fcd857cb489 upstream.

    Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
    local node"), we handle THP allocations on page fault in a special way -
    for non-interleave memory policies, the allocation is only attempted on
    the node local to the current CPU, if the policy's nodemask allows the
    node.

    This is motivated by the assumption that THP benefits cannot offset the
    cost of remote accesses, so it's better to fall back to base pages on the
    local node (which might still be available, while huge pages are not, due
    to fragmentation) than to allocate huge pages on a remote node.

    The nodemask check prevents us from violating e.g. MPOL_BIND policies
    where the local node is not among the allowed nodes. However, the
    current implementation can still give surprising results for the
    MPOL_PREFERRED policy when the preferred node is different than the
    current CPU's local node.

    In such case we should honor the preferred node and not use the local
    node, which is what this patch does. If hugepage allocation on the
    preferred node fails, we fall back to base pages and don't try other
    nodes, with the same motivation as is done for the local node hugepage
    allocations. The patch also moves the MPOL_INTERLEAVE check around to
    simplify the hugepage specific test.

    The difference can be demonstrated using in-tree transhuge-stress test
    on the following 2-node machine where half memory on one node was
    occupied to show the difference.

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
    node 0 size: 7878 MB
    node 0 free: 3623 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 8045 MB
    node 1 free: 7818 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

    Before the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages

    The number of successful THP allocations corresponds to free memory on
    node 0 in the first case and node 1 in the second case, i.e. the -p
    parameter is ignored and cpu binding "wins".

    After the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -p1 -C0 ./transhuge-stress
    transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages

    The -p parameter is respected regardless of cpu binding.

    > numactl -C0 ./transhuge-stress
    transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -C12 ./transhuge-stress
    transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages

    Without the -p parameter, hugepage restriction to the CPU-local node
    works as before.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2 upstream.

    Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
    following INFO splat is logged:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.1.0-rc7-next-20150612 #1 Not tainted
    -------------------------------
    kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
    other info that might help us debug this:
    rcu_scheduler_active = 1, debug_locks = 0
    3 locks held by systemd/1:
    #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1f/0x40
    #1: (rcu_read_lock_bh){......}, at: [] ipv6_add_addr+0x62/0x540
    #2: (addrconf_hash_lock){+...+.}, at: [] ipv6_add_addr+0x184/0x540
    stack backtrace:
    CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
    Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
    Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

    Additional backtrace lines are truncated. In addition, the above splat
    is followed by several "BUG: sleeping function called from invalid
    context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
    these are the clue to the fix. Routine kmemleak_alloc_percpu() always
    uses GFP_KERNEL for its allocations, whereas it should follow the gfp
    from its callers.

    Reviewed-by: Catalin Marinas
    Reviewed-by: Kamalesh Babulal
    Acked-by: Martin KaFai Lau
    Signed-off-by: Larry Finger
    Cc: Martin KaFai Lau
    Cc: Catalin Marinas
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Larry Finger
     
  • commit c5f3b1a51a591c18c8b33983908e7fdda6ae417e upstream.

    The kmemleak scanning thread can run for minutes. Callbacks like
    kmemleak_free() are allowed during this time, the race being taken care
    of by the object->lock spinlock. Such lock also prevents a memory block
    from being freed or unmapped while it is being scanned by blocking the
    kmemleak_free() -> ... -> __delete_object() function until the lock is
    released in scan_object().

    When a kmemleak error occurs (e.g. it fails to allocate its metadata),
    kmemleak_enabled is cleared and __delete_object() is no longer called on
    freed objects. If kmemleak_scan is running at the same time,
    kmemleak_free() no longer waits for the object scanning to complete,
    allowing the corresponding memory block to be freed or unmapped (in the
    case of vfree()). This leads to kmemleak_scan potentially triggering a
    page fault.

    This patch separates the kmemleak_free() enabling/disabling from the
    overall kmemleak_enabled knob so that we can defer disabling the object
    freeing tracking until the scanning thread has completed.
    kmemleak_free_part() is deliberately ignored by this patch since it is
    only called during boot, before the scanning thread has started.

    Signed-off-by: Catalin Marinas
    Reported-by: Vignesh Radhakrishnan
    Tested-by: Vignesh Radhakrishnan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned an inode in mmap(), but if so, I cannot locate them now.

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Jun, 2015

4 commits

  • If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
    pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
    will dereference the NULL pool->handle_cachep.

    Modify destroy_handle_cache() to avoid this.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
    swapout path because the spin_lock_irq(&mapping->tree_lock) in the
    caller doesn't actually disable the hardware interrupts - which is fine,
    because on -rt the tophalves run in process context and so we are still
    safe from preemption while updating the statistics.

    Remove the VM_BUG_ON() but keep the comment of what we rely on.

    Signed-off-by: Johannes Weiner
    Reported-by: Clark Williams
    Cc: Fernando Lopez-Lezcano
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When trimming memcg consumption excess (see memory.high), we call
    try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
    in the current context, which can result in a deadlock. Fix this.

    Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[] [] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
    unlock_page+0x6d/0x70
    generic_write_end+0x53/0xb0
    xfs_vm_write_end+0x29/0x80 [xfs]
    generic_perform_write+0x10a/0x1e0
    xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
    xfs_file_write_iter+0x79/0x120 [xfs]
    __vfs_write+0xd4/0x110
    vfs_write+0xac/0x1c0
    SyS_write+0x58/0xd0
    system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP [] __wake_up_bit+0x20/0x70
    RSP
    CR2: ffffc90008963690

    Reproduce method (re-add a node):
    Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

    This seems to be a use-after-free problem, and the root cause is that
    zone->wait_table was not set to *NULL* after freeing it in
    try_offline_node().

    When hot re-adding a node, we will reuse its pgdat, and the zone structs
    as well; when adding pages to the target zone, it will init the zone
    first (including the wait_table) if the zone is not yet initialized. The
    judgement of whether a zone is initialized is based on zone->wait_table:

    static inline bool zone_is_initialized(struct zone *zone)
    {
            return !!zone->wait_table;
    }

    so if we do not set zone->wait_table to *NULL* after freeing it, the
    memory hotplug routine will skip the init of the new zone when hot
    re-adding the node, and the wait_table still points to the freed memory;
    then we will access the invalid address when trying to wake up the
    waiters after the i/o operation on the page is done, as mentioned above.
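
    A toy version of that "pointer doubles as the initialized flag" pattern
    (illustrative only; names mimic the ones above, but this is not kernel
    code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    struct toy_zone { int *wait_table; };

    static bool zone_is_initialized(struct toy_zone *z)
    {
            return z->wait_table != NULL;
    }

    int main(void)
    {
            struct toy_zone z = { .wait_table = malloc(16 * sizeof(int)) };

            free(z.wait_table);       /* node offlined: the table is freed ...   */
            /* z.wait_table = NULL;      ... but forgetting this reset means the */
            /* re-added node looks "initialized" and keeps the dangling pointer  */
            printf("%d\n", zone_is_initialized(&z));    /* prints 1 */
            return 0;
    }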

    Signed-off-by: Gu Zheng
    Reported-by: Taku Izumi
    Reviewed-by: Yasuaki Ishimatsu
    Cc: KAMEZAWA Hiroyuki
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
    real consequence as bdi->dev isn't needed by anything else in the function,
    and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister, which happens since
    Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved to
    bdi_destroy(), as can the writeback_bdi_unregister event, which is
    already unused.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

15 May, 2015

3 commits

  • NUMA balancing is meant to be disabled by default on UMA machines but
    the check is using nr_node_ids (highest node) instead of
    num_online_nodes (online nodes).
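
    A toy model of the difference (names mirror the kernel's, but this is
    just an illustration, not kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long online_mask = 1UL << 1;    /* only node 1 is online */
            int nr_node_ids      = 2;                /* highest node id + 1   */
            /* GCC/Clang builtin: count the bits set in the online mask */
            int num_online_nodes = __builtin_popcountl(online_mask);

            printf("nr_node_ids > 1      : %d  (wrong check: balancing enabled)\n",
                   nr_node_ids > 1);
            printf("num_online_nodes > 1 : %d  (right check: stays disabled)\n",
                   num_online_nodes > 1);
            return 0;
    }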

    The consequence is that a UMA machine with a node ID of 1 or higher will
    enable NUMA balancing. This will incur useless overhead due to minor
    faults, with the impact depending on the workload. This is the impact on
    the stats when running a kernel build on a single-node machine whose node
    ID happened to be 1:

                                 vanilla  patched
    NUMA base PTE updates        5113158        0
    NUMA huge PMD updates            643        0
    NUMA page range updates      5442374        0
    NUMA hint faults             2109622        0
    NUMA hint local faults       2109622        0
    NUMA hint local percent          100      100
    NUMA pages migrated                0        0

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • I had an issue:

    Unable to handle kernel NULL pointer dereference at virtual address 0000082a
    pgd = cc970000
    [0000082a] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    PC is at get_pageblock_flags_group+0x5c/0xb0
    LR is at unset_migratetype_isolate+0x148/0x1b0
    pc : [] lr : [] psr: 80000093
    sp : c7029d00 ip : 00000105 fp : c7029d1c
    r10: 00000001 r9 : 0000000a r8 : 00000004
    r7 : 60000013 r6 : 000000a4 r5 : c0a357e4 r4 : 00000000
    r3 : 00000826 r2 : 00000002 r1 : 00000000 r0 : 0000003f
    Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
    Control: 10c5387d Table: 2cb7006a DAC: 00000015
    Backtrace:
    get_pageblock_flags_group+0x0/0xb0
    unset_migratetype_isolate+0x0/0x1b0
    undo_isolate_page_range+0x0/0xdc
    __alloc_contig_range+0x0/0x34c
    alloc_contig_range+0x0/0x18

    This issue is because, when calling unset_migratetype_isolate() to unset
    a part of CMA memory, it tries to access the buddy page to get its
    status:

    if (order >= pageblock_order) {
            page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
            buddy_idx = __find_buddy_index(page_idx, order);
            buddy = page + (buddy_idx - page_idx);

            if (!is_migrate_isolate_page(buddy)) {

    But the beginning address of this part of CMA memory is very close to a
    part of memory that is reserved at boot time (not in the buddy system).
    So add a check before accessing it.

    [akpm@linux-foundation.org: use conventional code layout]
    Signed-off-by: Hui Zhu
    Suggested-by: Laura Abbott
    Suggested-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Not all kmem allocations should be accounted to memcg. The following
    patch gives an example when accounting of a certain type of allocations to
    memcg can effectively result in a memory leak. This patch adds the
    __GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
    allocation to go through the root cgroup. It will be used by the next
    patch.

    Note, since in case of kmemleak enabled each kmalloc implies yet another
    allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
    gfp_kmemleak_mask.

    Alternatively, we could introduce a per kmem cache flag disabling
    accounting for all allocations of a particular kind, but (a) we would not
    be able to bypass accounting for kmalloc then and (b) a kmem cache with
    this flag set could not be merged with a kmem cache without this flag,
    which would increase the number of global caches and therefore
    fragmentation even if the memory cgroup controller is not used.

    Despite its generic name, __GFP_NOACCOUNT currently disables accounting
    only for kmem allocations, while user page allocations are always
    charged. To catch abuse of this flag, a warning is issued on an attempt
    to pass it to mem_cgroup_try_charge.

    Signed-off-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 May, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes since the merge window;

    - fix for a double elevator module release, from Chao Yu. Ancient bug.

    - the splice() MORE flag fix from Christophe Leroy.

    - a fix for NVMe, fixing a patch that went in in the merge window.
    From Keith.

    - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

    - bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

    - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
    bad merge issue with FUA writes.

    - division-by-zero fix for writeback from Tejun.

    - a block bounce page accounting fix, making sure we inc/dec after
    bouncing so that pre/post IO pages match up. From Wang YanQing"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    splice: sendfile() at once fails for big files
    blk-mq: don't lose requests if a stopped queue restarts
    blk-mq: fix FUA request hang
    block: destroy bdi before blockdev is unregistered.
    block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
    elevator: fix double release of elevator module
    writeback: use |1 instead of +1 to protect against div by zero
    blk-mq: fix CPU hotplug handling
    blk-mq: fix race between timeout and CPU hotplug
    NVMe: Fix VPD B0 max sectors translation

    Linus Torvalds
     

06 May, 2015

4 commits

  • Hwpoison injector checks PageLRU of the raw target page to find out
    whether the page is an appropriate target, but current code now filters
    out thp tail pages, which prevents us from testing for such cases via this
    interface. So let's check hpage instead of p.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Hwpoison injection via debugfs:hwpoison/corrupt-pfn takes a refcount of
    the target page. But the current code doesn't release it if the target
    page is not supposed to be injected, which results in a memory leak.
    This patch simply adds the refcount-releasing code.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    If multiple soft offline events hit one free page/hugepage concurrently,
    soft_offline_page() can handle the free page/hugepage multiple times,
    which makes the num_poisoned_pages counter increase more than once. This
    patch fixes this wrong counting by checking TestSetPageHWPoison for
    normal pages and by checking the return value of
    dequeue_hwpoisoned_huge_page() for hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is a 4kB LRU page or a thp head page.
    But we should do this for a thp tail page too.

    Consider that a memory error hits a thp tail page whose head page is on
    a pcplist when memory_failure() runs. Then, the current kernel skips
    shake_page() part, so hwpoison_user_mappings() returns without calling
    split_huge_page() or try_to_unmap(), because PageLRU of the thp head is
    still cleared due to the skip of shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. And another is to fail to isolate the
    memory error, so later access to the error address causes another MCE,
    which kills the processes which used the thp.

    This patch fixes this problem by calling shake_page() for thp tail case.

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

24 Apr, 2015

1 commit

  • mm/page-writeback.c has several places where 1 is added to the divisor
    to prevent division by zero exceptions; however, if the original
    divisor is equivalent to -1, adding 1 leads to division by zero.

    There are three places where +1 is used for this purpose - one in
    pos_ratio_polynom() and two in bdi_position_ratio(). The second one in
    bdi_position_ratio() actually triggered a div-by-zero oops on a machine
    running a 3.10 kernel. The divisor is

    x_intercept - bdi_setpoint + 1 == span + 1

    span is confirmed to be (u32)-1. It isn't clear how it ended up that way,
    but it could be from the write bandwidth calculation underflow fixed by
    c72efb658f7c ("writeback: fix possible underflow in write bandwidth
    calculation").

    At any rate, +1 isn't a proper protection against div-by-zero. This
    patch converts all +1 protections to |1. Note that
    bdi_update_dirty_ratelimit() was already using |1 before this patch.
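
    A quick stand-alone check of that arithmetic:

    #include <stdio.h>

    int main(void)
    {
            unsigned int span = (unsigned int)-1;   /* the underflowed (u32)-1 */

            printf("span + 1 = %u\n", span + 1);  /* wraps to 0: still divides by zero */
            printf("span | 1 = %u\n", span | 1);  /* 4294967295: a safe, nonzero divisor */
            return 0;
    }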

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Apr, 2015

1 commit

  • Pull third hunk of vfs changes from Al Viro:
    "This contains the ->direct_IO() changes from Omar + saner
    generic_write_checks() + dealing with fcntl()/{read,write}() races
    (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
    repeatedly looking at ->f_flags, which can be changed by fcntl(2),
    check ->ki_flags - which cannot) + infrastructure bits for dhowells'
    d_inode annotations + Christoph's switch of /dev/loop to
    vfs_iter_write()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
    block: loop: switch to VFS ITER_BVEC
    configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
    VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
    VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
    VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
    NFS: Don't use d_inode as a variable name
    VFS: Impose ordering on accesses of d_inode and d_flags
    VFS: Add owner-filesystem positive/negative dentry checks
    nfs: generic_write_checks() shouldn't be done on swapout...
    ocfs2: use __generic_file_write_iter()
    mirror O_APPEND and O_DIRECT into iocb->ki_flags
    switch generic_write_checks() to iocb and iter
    ocfs2: move generic_write_checks() before the alignment checks
    ocfs2_file_write_iter: stop messing with ppos
    udf_file_write_iter: reorder and simplify
    fuse: ->direct_IO() doesn't need generic_write_checks()
    ext4_file_write_iter: move generic_write_checks() up
    xfs_file_aio_write_checks: switch to iocb/iov_iter
    generic_write_checks(): drop isblk argument
    blkdev_write_iter: expand generic_file_checks() call in there
    ...

    Linus Torvalds
     

16 Apr, 2015

7 commits

  • Merge second patchbomb from Andrew Morton:

    - the rest of MM

    - various misc bits

    - add ability to run /sbin/reboot at reboot time

    - printk/vsprintf changes

    - fiddle with seq_printf() return value

    * akpm: (114 commits)
    parisc: remove use of seq_printf return value
    lru_cache: remove use of seq_printf return value
    tracing: remove use of seq_printf return value
    cgroup: remove use of seq_printf return value
    proc: remove use of seq_printf return value
    s390: remove use of seq_printf return value
    cris fasttimer: remove use of seq_printf return value
    cris: remove use of seq_printf return value
    openrisc: remove use of seq_printf return value
    ARM: plat-pxa: remove use of seq_printf return value
    nios2: cpuinfo: remove use of seq_printf return value
    microblaze: mb: remove use of seq_printf return value
    ipc: remove use of seq_printf return value
    rtc: remove use of seq_printf return value
    power: wakeup: remove use of seq_printf return value
    x86: mtrr: if: remove use of seq_printf return value
    linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
    MAINTAINERS: CREDITS: remove Stefano Brivio from B43
    .mailmap: add Ricardo Ribalda
    CREDITS: add Ricardo Ribalda Delgado
    ...

    Linus Torvalds
     
  • Do not perform cond_resched() before the busy compaction loop in
    __zs_compact(), because this loop does it when needed.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no point in overriding the size class below. It causes fatal
    corruption on the next chunk on the 3264-bytes size class, which is the
    last size class that is not huge.

    For example, if the requested size was exactly 3264 bytes, current
    zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
    not 4096. User access to this chunk may overwrite the head of the next
    adjacent chunk.

    Here is the panic log captured when the freelist was corrupted due to
    this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [] lr : [] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

    Reported-by: Sooyong Suk
    Signed-off-by: Heesub Shin
    Tested-by: Sooyong Suk
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
    In putback_zspage(), we don't need to insert a zspage into the zspage
    list of its size_class again just to fix the fullness group. We could do
    it directly without reinsertion, so we can save some instructions.

    Reported-by: Heesub Shin
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Ganesh Mahendran
    Cc: Luigi Semenzato
    Cc: Gunho Lee
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    A micro-optimization. Avoid additional branching and reduce (a bit)
    register pressure (e.g. s_off += size; d_off += size; may be calculated
    twice: first for the >= PAGE_SIZE check and later for the offset update
    in the "else" clause).

    scripts/bloat-o-meter shows some improvement

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
    function            old     new   delta
    zs_object_copy      550     540     -10

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Do not synchronize RCU in zs_compact(). Neither zsmalloc nor zram uses
    RCU.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Yinghao Xie
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghao Xie