18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but it is difficult or impossible to
    hit in practice.

    The zap_pte_range() issue was fixed by commit 1cf35d47712d ("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). This
    patch fixes the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, and it has two threads bound to two
    different CPUs, e.g. CPU1 and CPU2. mmap returned X; thread 1 then did
    a write to addr X so that CPU1 now has a writable TLB entry for addr X.
    Thread 2 starts mremapping from addr X to Y while thread 1 cleans the
    page and then does another write to the old addr X. The second write
    from thread 1 could succeed, but the value will get lost.

    thread 1                                thread 2
    (bound to CPU1)                         (bound to CPU2)

    1: write 1 to addr X to get a
       writeable TLB on this CPU

                                            2: mremap starts

                                            3: move_ptes emptied PTE for addr X
                                               and setup new PTE for addr Y and
                                               then dropped PTL for X and Y

    4: page laundering for N by doing
       fadvise FADV_DONTNEED. When done,
       pageframe N is deemed clean.

    5: *write 2 to addr X

    6: tlb flush for addr X

                                            7: munmap (Y, pagesize) to make the
                                               page unmapped

                                            8: fadvise with FADV_DONTNEED again
                                               to kick the page off the pagecache

                                            9: pread the page from file to verify
                                               the value. If 1 is there, it means
                                               we have lost the written 2.

    *the write may or may not cause a segmentation fault; it depends on
    whether the TLB entry is still on the CPU.

    Please note that this is only one specific way the race could occur; it
    does not mean the race can only occur in exactly the above
    configuration. For example, more than two threads could be involved,
    fadvise() could be done in another thread, and so on.
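
    For concreteness, here is a minimal, single-threaded userspace sketch of
    the syscall sequence in steps 1-9 (the file name, sizes and the fixed
    remap target are made up, and error handling is omitted; as noted above,
    the real race needs two CPU-bound threads plus an artificially enlarged
    window in the kernel, so this sketch alone will not reproduce the bug):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            int fd = open("testfile", O_RDWR | O_CREAT, 0600); /* made-up file */
            ftruncate(fd, psz);

            /* mmap returned X; step 1: write 1 to addr X */
            char *x = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            x[0] = 1;

            /* steps 2-3: in the real scenario thread 2 runs the mremap;
             * MREMAP_FIXED is only used here to force the mapping to move to Y */
            char *target = mmap(NULL, psz, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *y = mremap(x, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED, target);

            /* step 4: launder the page frame */
            posix_fadvise(fd, 0, psz, POSIX_FADV_DONTNEED);

            /* step 5 is the racy write of 2 through the stale writable TLB
             * entry for X; it cannot be expressed from a single thread because
             * the old mapping at X is already gone at this point */

            munmap(y, psz);                                  /* step 7 */
            posix_fadvise(fd, 0, psz, POSIX_FADV_DONTNEED);  /* step 8 */

            char v = 0;
            pread(fd, &v, 1, 0);                             /* step 9: verify */
            printf("value in file: %d\n", v);
            close(fd);
            return 0;
    }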

    Anonymous pages can race between mremap() and page reclaim as well.
    For THP, a huge PMD is moved by mremap to a new huge PMD, and then the
    new huge PMD gets unmapped/split/paged out before the TLB flush for the
    old huge PMD happens in move_page_tables(), so we could still write
    data through the stale mapping. Normal anonymous pages have a similar
    situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd()
    and, if any is found, do the flush before dropping the PTL. If we did
    the flush for every move_ptes()/move_huge_pmd() call, then we do not
    need to do the flush in move_page_tables() for the whole range. But if
    we didn't, we still need to do the whole-range flush.

    Alternatively, we could track which parts of the range were flushed in
    move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
    whole range in move_page_tables(). But that would require multiple TLB
    flushes for the different sub-ranges and should be less efficient than
    the single whole-range flush.

    A kbuild test on my Sandybridge desktop doesn't show any noticeable
    change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

12 Nov, 2016

9 commits

  • Limit the number of kmemleak false positives by including
    .data.ro_after_init in memory scanning. To achieve this we need to add
    symbols for start and end of the section to the linker scripts.

    The problem was uncovered by commit 56989f6d8568 ("genetlink: mark
    families as __ro_after_init").

    Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com
    Reviewed-by: Catalin Marinas
    Signed-off-by: Jakub Kicinski
    Cc: Arnd Bergmann
    Cc: Cong Wang
    Cc: Johannes Berg
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Kicinski
     
    While testing OBJFREELIST_SLAB integration with pagealloc, we found a
    bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
    CFLGS_OBJFREELIST_SLAB set. When that happened, critical allocations
    needed for loading drivers or creating new caches would fail.

    The original kmem_cache is created early, making OFF_SLAB impossible.
    When kmem_cache(sys) is created, OFF_SLAB is possible, and if pagealloc
    is enabled the code will try to enable it first under certain
    conditions. Since kmem_cache(sys) reuses the original flags, you can
    end up with both flags set at the same time, resulting in allocation
    failures and odd behaviors.

    This fix discards allocator-specific flags from memcg before calling
    create_cache.
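
    For illustration only, a standalone sketch of that remedy with invented
    flag names (not the kernel's CFLGS_* handling): mask out
    allocator-internal bits before deriving a new cache's flags from an
    existing one, so stale combinations cannot be inherited:

    #include <stdio.h>

    #define FLAG_OFF_SLAB        0x1u
    #define FLAG_OBJFREELIST     0x2u
    #define FLAG_USER_VISIBLE    0x4u

    /* internal flags the allocator recomputes itself and must never inherit */
    #define INTERNAL_FLAGS (FLAG_OFF_SLAB | FLAG_OBJFREELIST)

    int main(void)
    {
            unsigned int parent_flags = FLAG_USER_VISIBLE | FLAG_OBJFREELIST;
            unsigned int child_flags  = parent_flags & ~INTERNAL_FLAGS;

            printf("parent: %#x\n", parent_flags);
            printf("child:  %#x (internal bits dropped)\n", child_flags);
            return 0;
    }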

    The bug exists since 4.6-rc1 and affects testing debug pagealloc
    configurations.

    Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
    Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: Thomas Garnier
    Tested-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
    Starting with the 4.9-rc1 kernel, I noticed some test failures of
    sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
    testing on sub-page block size filesystems (tested both XFS and ext4);
    these syscalls started to return EIO in the tests. e.g.

    sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
    sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
    sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
    sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

    This is because in sub-page block size cases we don't need the whole
    page to be uptodate; it is OK if only the part we care about is
    uptodate (if the fs has ->is_partially_uptodate defined).

    But page_cache_pipe_buf_confirm() doesn't have the ability to check the
    partially-uptodate case; it needs the whole page to be uptodate, so it
    returns EIO in this case.

    This is a regression introduced by commit 82c156f85384 ("switch
    generic_file_splice_read() to use of ->read_iter()"). Prior to that
    change, generic_file_splice_read() didn't allow partially-uptodate
    pages either, so it worked fine.

    Fix it by skipping the partially-uptodate check if we're working on a
    pipe in do_generic_file_read(), so we read the whole page from disk as
    long as the page is not uptodate.

    I think the other way to fix it would be to add the ability to check &
    allow partially-uptodate pages to page_cache_pipe_buf_confirm(), but
    that is much harder to do and seems to gain little.
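
    For reference, a minimal userspace sketch in the spirit of the LTP tests
    (file names and the 26-byte length are made up): sendfile(2) internally
    splices the page cache into a pipe, which is the path that returned EIO
    on an affected kernel with a sub-page block size filesystem:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    int main(void)
    {
            int in = open("infile", O_RDONLY);
            int out = open("outfile", O_WRONLY | O_CREAT | O_TRUNC, 0600);
            off_t off = 0;

            /* copy the first 26 bytes in-kernel; this goes through an internal
             * pipe and thus through the page_cache_pipe_buf_confirm() check */
            ssize_t n = sendfile(out, in, &off, 26);
            if (n < 0)
                    perror("sendfile");      /* EIO was the regression */
            else
                    printf("sendfile copied %zd bytes\n", n);

            close(in);
            close(out);
            return 0;
    }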

    Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com
    Signed-off-by: Eryu Guan
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eryu Guan
     
  • Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
    allocated huge page.

    If a reservation was associated with the huge page, alloc_huge_page()
    consumed the reservation while allocating. When the newly allocated
    page is freed in free_huge_page(), it will increment the global
    reservation count. However, the reservation entry in the reserve map
    will remain.

    This is not an issue for shared mappings as the entry in the reserve map
    indicates a reservation exists. But, an entry in a private mapping
    reserve map indicates the reservation was consumed and no longer exists.
    This results in an inconsistency between the reserve map and the global
    reservation count. This 'leaks' a reserved huge page.

    Create a new routine restore_reserve_on_error() to restore the reserve
    entry in these specific error paths. This routine makes use of a new
    function vma_add_reservation() which will add a reserve entry for a
    specific address/page.

    In general, these error paths were rarely (if ever) taken on most
    architectures. However, powerpc contained arch-specific code that
    resulted in an extra fault and execution of these error paths on all
    private mappings.

    Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings")
    Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Tested-by: Jan Stancek
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Kirill A . Shutemov
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When memory_failure() runs on a thp tail page after pmd is split, we
    trigger the following VM_BUG_ON_PAGE():

    page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1
    flags: 0x1fffc000400000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(!page_count(p))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/memory-failure.c:1132!

    memory_failure() passed refcount and page lock from tail page to head
    page, which is not needed because we can pass any subpage to
    split_huge_page().

    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When root activates a swap partition whose header has the wrong
    endianness, nr_badpages elements of badpages are swabbed before
    nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

    This normally is not a security issue because it can only be exploited
    by root (more specifically, a process with CAP_SYS_ADMIN or the ability
    to modify a swap file/partition), and such a process can already e.g.
    modify swapped-out memory of any other userspace process on the system.
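
    For illustration, here is a standalone sketch of the ordering issue with
    made-up names and sizes (nothing here is the kernel's actual code): the
    element count taken from an untrusted header must be validated against
    the array bound before the loop that byte-swaps the entries runs, not
    after:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_BADPAGES 8          /* bound of the array storage */

    struct header {
            uint32_t nr_badpages;
            uint32_t badpages[MAX_BADPAGES];
    };

    static int process(struct header *h)
    {
            if (h->nr_badpages > MAX_BADPAGES)      /* the check must come first */
                    return -1;
            for (uint32_t i = 0; i < h->nr_badpages; i++)
                    h->badpages[i] = __builtin_bswap32(h->badpages[i]);
            return 0;
    }

    int main(void)
    {
            struct header h = { .nr_badpages = 1u << 31 };  /* hostile count */
            printf("process() = %d\n", process(&h));        /* rejected, no overrun */
            return 0;
    }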

    Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Acked-by: Jerome Marchand
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
    The CMA allocation request size is represented by a size_t, which gets
    truncated when it is passed as an int to
    bitmap_find_next_zero_area_off().

    We observed during fuzz testing that when the CMA allocation request is
    too large, bitmap_find_next_zero_area_off() still returns success due
    to the truncation. This leads to a kernel crash, as subsequent code
    assumes that the requested memory is available.

    Fail the CMA allocation if the request exceeds the corresponding CMA
    region size.
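
    A standalone sketch of the truncation itself, with made-up numbers: on a
    typical LP64 system, a 4 GiB-sized size_t request passed through an int
    parameter becomes 0, which then looks like a trivially satisfiable
    request:

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
            size_t request = (size_t)1 << 32;   /* a 4 GiB-sized request      */

            /* converting an out-of-range value to int is implementation-
             * defined; common compilers keep the low 32 bits, i.e. 0 here    */
            int truncated = (int)request;

            printf("requested as size_t: %zu\n", request);
            printf("seen as int:         %d\n", truncated);
            return 0;
    }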

    Link: http://lkml.kernel.org/r/1478189211-3467-1-git-send-email-shashim@codeaurora.org
    Signed-off-by: Shiraz Hashim
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shiraz Hashim
     
    If shmem_alloc_page() does not set PageLocked and PageSwapBacked, then
    shmem_replace_page() needs to do so for itself. Without this, it puts
    newpage on the wrong lru, re-unlocks the unlocked newpage, and the
    system descends into "Bad page" reports and freezes; or, if
    CONFIG_DEBUG_VM=y, it hits an earlier VM_BUG_ON_PAGE(!PageLocked),
    depending on config.

    But shmem_replace_page() is not a common path: it's only called when
    swapin (or swapoff) finds the page was already read into an unsuitable
    zone: usually all zones are suitable, but gem objects for a few drm
    devices (gma500, omapdrm, crestline, broadwater) require zone DMA32 if
    there's more than 4GB of RAM.

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1611062003510.11253@eggly.anvils
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: [4.8.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") mistakenly embedded a "\n" in the format string, resulting in
    strange output.

    [ 722.876655] kworker/0:1: page alloction stalls for 160001ms, order:0
    [ 722.876656] , mode:0x2400000(GFP_NOIO)
    [ 722.876657] CPU: 0 PID: 6966 Comm: kworker/0:1 Not tainted 4.8.0+ #69

    Link: http://lkml.kernel.org/r/1476026219-7974-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

01 Nov, 2016

1 commit

    The stack frame size could grow too large when the plugin used long long
    on 32-bit architectures and the given function had too many basic
    blocks.

    The gcc warning was:

    drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
    drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]

    This switches latent_entropy from u64 to unsigned long.

    Thanks to PaX Team and Emese Revfy for the patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

28 Oct, 2016

10 commits

  • Merge misc fixes from Andrew Morton:
    "20 fixes"

    * emailed patches from Andrew Morton :
    drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
    cris/arch-v32: cryptocop: print a hex number after a 0x prefix
    ipack: print a hex number after a 0x prefix
    block: DAC960: print a hex number after a 0x prefix
    fs: exofs: print a hex number after a 0x prefix
    lib/genalloc.c: start search from start of chunk
    mm: memcontrol: do not recurse in direct reclaim
    CREDITS: update credit information for Martin Kepplinger
    proc: fix NULL dereference when reading /proc/<pid>/auxv
    mm: kmemleak: ensure that the task stack is not freed during scanning
    lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
    latent_entropy: raise CONFIG_FRAME_WARN by default
    kconfig.h: remove config_enabled() macro
    ipc: account for kmem usage on mqueue and msg
    mm/slab: improve performance of gathering slabinfo stats
    mm: page_alloc: use KERN_CONT where appropriate
    mm/list_lru.c: avoid error-path NULL pointer deref
    h8300: fix syscall restarting
    kcov: properly check if we are in an interrupt
    mm/slab: fix kmemcg cache creation delayed issue

    Linus Torvalds
     
  • On 4.0, we saw a stack corruption from a page fault entering direct
    memory cgroup reclaim, calling into btrfs_releasepage(), which then
    tried to allocate an extent and recursed back into a kmem charge ad
    nauseam:

    [...]
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    memcg_charge_kmem+0x40/0x80
    new_slab+0x2d9/0x5a0
    __slab_alloc+0x2fd/0x44f
    kmem_cache_alloc+0x193/0x1e0
    alloc_extent_state+0x21/0xc0
    __clear_extent_bit+0x2b5/0x400
    try_release_extent_mapping+0x1a3/0x220
    __btrfs_releasepage+0x31/0x70
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    mem_cgroup_try_charge+0x65/0x1c0
    handle_mm_fault+0x117f/0x1510
    __do_page_fault+0x177/0x420
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30

    On later kernels, kmem charging is opt-in rather than opt-out, and that
    particular kmem allocation in btrfs_releasepage() is no longer being
    charged and won't recurse and overrun the stack anymore.

    But it's not impossible for an accounted allocation to happen from the
    memcg direct reclaim context, and we needed to reproduce this crash many
    times before we even got a useful stack trace out of it.

    Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
    avoid recursing into any other form of direct reclaim. Then let
    recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

    Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 68f24b08ee89 ("sched/core: Free the stack early if
    CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
    during kmemleak_scan() execution, leading to either a NULL pointer fault
    (if task->stack is NULL) or kmemleak accessing already freed memory.

    This patch uses the new try_get_task_stack() API to ensure that the task
    stack is not freed during kmemleak stack scanning.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=173901.

    Fixes: 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")
    Link: http://lkml.kernel.org/r/1476266223-14325-1-git-send-email-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Reported-by: CAI Qian
    Tested-by: CAI Qian
    Acked-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: CAI Qian
    Cc: Hillf Danton
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.
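
    A toy, self-contained illustration of the bookkeeping idea with generic
    names (nothing from the actual slab code): keep a counter in sync at
    insert time so reporting no longer needs to walk the longest list:

    #include <stdio.h>
    #include <stdlib.h>

    struct slab { struct slab *next; };

    struct node_stats {
            struct slab *slabs_full;        /* potentially huge list         */
            unsigned long total_slabs;      /* maintained at create/destroy  */
    };

    static void add_slab(struct node_stats *n)
    {
            struct slab *s = malloc(sizeof(*s));
            s->next = n->slabs_full;
            n->slabs_full = s;
            n->total_slabs++;               /* O(1) bookkeeping              */
    }

    /* the old approach: walk the whole list just to count it */
    static unsigned long count_by_walking(const struct node_stats *n)
    {
            unsigned long count = 0;
            for (const struct slab *s = n->slabs_full; s; s = s->next)
                    count++;
            return count;
    }

    int main(void)
    {
            struct node_stats n = { 0 };
            for (int i = 0; i < 1000000; i++)
                    add_slab(&n);
            printf("walked:  %lu\n", count_by_walking(&n));  /* O(n) */
            printf("counter: %lu\n", n.total_slabs);         /* O(1) */
            return 0;
    }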

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     
  • Recent changes to printk require KERN_CONT uses to continue logging
    messages. So add KERN_CONT where necessary.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
    Link: http://lkml.kernel.org/r/c7df37c8665134654a17aaeb8b9f6ace1d6db58b.1476239034.git.joe@perches.com
    Reported-by: Mark Rutland
    Signed-off-by: Joe Perches
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

    After some analysis it seems that the problem is in alloc_super(). In
    case list_lru_init_memcg() fails, it goes into destroy_super(), which
    calls list_lru_destroy().

    And in list_lru_init() we see that in case memcg_init_list_lru() fails,
    lru->node is freed, but not set to NULL, which then leads
    list_lru_destroy() to believe it is initialized and call
    memcg_destroy_list_lru(). memcg_destroy_list_lru() in turn can access
    lru->node[i].memcg_lrus, which is NULL.
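
    A standalone sketch of the failure pattern with generic names (this
    shows the shape of the problem and one obvious remedy, not necessarily
    the exact upstream fix): the init error path must leave the object in a
    state the destroy path can recognise, here by NULLing the freed pointer
    and checking it:

    #include <stdio.h>
    #include <stdlib.h>

    struct lru {
            int *node;      /* stand-in for lru->node */
    };

    /* simulate: the allocation succeeds, but a later init step fails */
    static int lru_init(struct lru *l, int second_step_fails)
    {
            l->node = calloc(4, sizeof(*l->node));
            if (!l->node)
                    return -1;
            if (second_step_fails) {
                    free(l->node);
                    l->node = NULL;   /* without this, destroy derefs freed memory */
                    return -1;
            }
            return 0;
    }

    static void lru_destroy(struct lru *l)
    {
            if (!l->node)             /* init never completed, nothing to do */
                    return;
            free(l->node);
            l->node = NULL;
    }

    int main(void)
    {
            struct lru l;
            if (lru_init(&l, 1) != 0)
                    fprintf(stderr, "init failed\n");
            lru_destroy(&l);          /* safe even after the failed init */
            return 0;
    }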

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Alexander Polakov
    Acked-by: Vladimir Davydov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Polakov
     
    There is a bug report that SLAB causes an extreme load average due to
    over 2000 kworker threads.

    https://bugzilla.kernel.org/show_bug.cgi?id=172981

    This issue is caused by the kmemcg feature, which tries to create a new
    set of kmem_caches for each memcg. Recently, kmem_cache creation has
    been slowed by synchronize_sched(), and further kmem_cache creation is
    also delayed since kmem_cache creation is serialized by the global
    slab_mutex lock. So, the number of kworkers trying to create
    kmem_caches increases quietly.

    synchronize_sched() is for lockless access to the node's shared array,
    but it's not needed when a new kmem_cache is created. So, this patch
    rules out that case.

    Fixes: 801faf0db894 ("mm/slab: lockless decision to grow cache")
    Link: http://lkml.kernel.org/r/1475734855-4837-1-git-send-email-iamjoonsoo.kim@lge.com
    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
    but for build testing there is no reason not to allow them together.

    This hopefully means better build coverage and fewer embarrassing silly
    problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
    variable in memory hotplug") in the future.

    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • When I removed the per-zone bitlock hashed waitqueues in commit
    9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
    removed all the magic hotplug memory initialization of said waitqueues
    too.

    But when I actually _tested_ the resulting build, I stupidly assumed
    that "allmodconfig" would enable memory hotplug. And it doesn't,
    because it enables KASAN instead, which then disables hotplug memory
    support.

    As a result, my build test of the per-zone waitqueues was totally
    broken, and I didn't notice that the compiler warns about the now unused
    iterator variable 'i'.

    I guess I should be happy that that seems to be the worst breakage from
    my clearly horribly failed test coverage.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Oct, 2016

1 commit

    This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to be
    passed through get_user_pages(), which eliminates the need for its one
    user, kvm, to access this function directly.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
      get_user_pages(start, 1, flags, page, NULL)
        __get_user_pages_locked(current, current->mm, start, 1, page, NULL,
                                NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
                           flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
      get_user_pages(addr, 1, flags, NULL, NULL)
        __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL,
                                NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH,
                           NULL, NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

13 commits

  • Merge the gup_flags cleanups from Lorenzo Stoakes:
    "This patch series adjusts functions in the get_user_pages* family such
    that desired FOLL_* flags are passed as an argument rather than
    implied by flags.

    The purpose of this change is to make the use of FOLL_FORCE explicit
    so it is easier to grep for and clearer to callers that this flag is
    being used. The use of FOLL_FORCE is an issue as it overrides missing
    VM_READ/VM_WRITE flags for the VMA whose pages we are reading
    from/writing to, which can result in surprising behaviour.

    The patch series came out of the discussion around commit 38e088546522
    ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
    which addressed a BUG_ON() being triggered when a page was faulted in
    with PROT_NONE set but having been overridden by FOLL_FORCE.
    do_numa_page() was run on the assumption the page _must_ be one marked
    for NUMA node migration as an actual PROT_NONE page would have been
    dealt with prior to this code path, however FOLL_FORCE introduced a
    situation where this assumption did not hold.

    See

    https://marc.info/?l=linux-mm&m=147585445805166

    for the patch proposal"

    Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
    FOLL_WRITE by me.

    [ This branch was rebased recently to add a few more acked-by's and
    reviewed-by's ]

    * gup_flag-cleanups:
    mm: replace access_process_vm() write parameter with gup_flags
    mm: replace access_remote_vm() write parameter with gup_flags
    mm: replace __access_remote_vm() write parameter with gup_flags
    mm: replace get_user_pages_remote() write/force parameters with gup_flags
    mm: replace get_user_pages() write/force parameters with gup_flags
    mm: replace get_vaddr_frames() write/force parameters with gup_flags
    mm: replace get_user_pages_locked() write/force parameters with gup_flags
    mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
    mm: remove write/force parameters from __get_user_pages_unlocked()
    mm: remove write/force parameters from __get_user_pages_locked()
    mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

    Linus Torvalds
     
  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Signed-off-by: Wei Yongjun
    Cc: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/1476719259-6214-1-git-send-email-weiyj.lk@gmail.com
    Signed-off-by: Thomas Gleixner

    Wei Yongjun
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
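
    One user-visible path through access_remote_vm() is /proc/<pid>/mem,
    which passes FOLL_FORCE. The standalone sketch below (Linux only, error
    handling omitted) shows the kind of surprising behaviour meant here: a
    write through /proc/self/mem succeeds on a page the process itself
    mapped read-only, while a direct store to the same address would fault:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);

            /* a private page we fill and then make read-only */
            char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            strcpy(p, "original");
            mprotect(p, psz, PROT_READ);

            /* writing through /proc/self/mem goes through access_remote_vm()
             * with FOLL_FORCE, so it succeeds despite the missing PROT_WRITE;
             * a direct "*p = 'X';" here would raise SIGSEGV instead */
            int fd = open("/proc/self/mem", O_RDWR);
            ssize_t n = pwrite(fd, "patched", 8, (off_t)(uintptr_t)p);

            printf("pwrite returned %zd, page now contains \"%s\"\n", n, p);
            close(fd);
            return 0;
    }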

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages_remote() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_vaddr_frames() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_locked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This is an ancient bug that was actually attempted to be fixed once
    (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
    get_user_pages() race for write access") but that was then undone due to
    problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

    In the meantime, the s390 situation has long been fixed, and we can now
    fix it by checking the pte_dirty() bit properly (and do it better). The
    s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
    software dirty bits") which made it into v3.9. Earlier kernels will
    have to look at the page state itself.

    Also, the VM has become more scalable, and what used to be a purely
    theoretical race back then has become easier to trigger.

    To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
    we already did a COW" case rather than play racy games with FOLL_WRITE,
    which is very fundamental, and then use the pte dirty flag to validate
    that the FOLL_COW flag is still valid.

    Reported-and-tested-by: Phil "not Paul" Oester
    Acked-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Oleg Nesterov
    Cc: Willy Tarreau
    Cc: Nick Piggin
    Cc: Greg Thelen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Oct, 2016

2 commits

    I observed KASAN false positives in the sctp code when sctp uses
    jprobe_return() in jsctp_sf_eat_sack().

    The stray 0xf4 bytes in shadow memory are stack redzones:

    [ ] ==================================================================
    [ ] BUG: KASAN: stack-out-of-bounds in memcmp+0xe9/0x150 at addr ffff88005e48f480
    [ ] Read of size 1 by task syz-executor/18535
    [ ] page:ffffea00017923c0 count:0 mapcount:0 mapping: (null) index:0x0
    [ ] flags: 0x1fffc0000000000()
    [ ] page dumped because: kasan: bad access detected
    [ ] CPU: 1 PID: 18535 Comm: syz-executor Not tainted 4.8.0+ #28
    [ ] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ ] ffff88005e48f2d0 ffffffff82d2b849 ffffffff0bc91e90 fffffbfff10971e8
    [ ] ffffed000bc91e90 ffffed000bc91e90 0000000000000001 0000000000000000
    [ ] ffff88005e48f480 ffff88005e48f350 ffffffff817d3169 ffff88005e48f370
    [ ] Call Trace:
    [ ] [] dump_stack+0x12e/0x185
    [ ] [] kasan_report+0x489/0x4b0
    [ ] [] __asan_report_load1_noabort+0x19/0x20
    [ ] [] memcmp+0xe9/0x150
    [ ] [] depot_save_stack+0x176/0x5c0
    [ ] [] save_stack+0xb1/0xd0
    [ ] [] kasan_slab_free+0x72/0xc0
    [ ] [] kfree+0xc8/0x2a0
    [ ] [] skb_free_head+0x79/0xb0
    [ ] [] skb_release_data+0x37a/0x420
    [ ] [] skb_release_all+0x4f/0x60
    [ ] [] consume_skb+0x138/0x370
    [ ] [] sctp_chunk_put+0xcb/0x180
    [ ] [] sctp_chunk_free+0x58/0x70
    [ ] [] sctp_inq_pop+0x68f/0xef0
    [ ] [] sctp_assoc_bh_rcv+0xd6/0x4b0
    [ ] [] sctp_inq_push+0x131/0x190
    [ ] [] sctp_backlog_rcv+0xe9/0xa20
    [ ... ]
    [ ] Memory state around the buggy address:
    [ ] ffff88005e48f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] >ffff88005e48f480: f4 f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ^
    [ ] ffff88005e48f500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ==================================================================

    KASAN stack instrumentation poisons stack redzones on function entry
    and unpoisons them on function exit. If a function exits abnormally
    (e.g. with a longjmp like jprobe_return()), stack redzones are left
    poisoned. Later this leads to random KASAN false reports.

    Unpoison stack redzones in the frames we are going to jump over
    before doing the actual longjmp in jprobe_return().

    Signed-off-by: Dmitry Vyukov
    Acked-by: Masami Hiramatsu
    Reviewed-by: Mark Rutland
    Cc: Mark Rutland
    Cc: Catalin Marinas
    Cc: Andrey Ryabinin
    Cc: Lorenzo Pieralisi
    Cc: Alexander Potapenko
    Cc: Will Deacon
    Cc: Andrew Morton
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Masami Hiramatsu
    Cc: kasan-dev@googlegroups.com
    Cc: surovegin@google.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1476454043-101898-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     
    Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty as possible from a running system at boot
    time, hoping to capitalize on any possible variation in CPU operation
    (due to runtime data differences, hardware differences, SMP ordering,
    thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds