28 Oct, 2016

5 commits

  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.
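
    A minimal sketch of the idea (field and helper names assumed, not
    necessarily those of the final patch):

        struct kmem_cache_node {
                spinlock_t list_lock;
                struct list_head slabs_partial;
                struct list_head slabs_full;
                struct list_head slabs_free;
                unsigned long total_slabs;      /* all slabs on this node */
        };

        /* called under n->list_lock when a slab page is created/destroyed */
        static void update_slab_count(struct kmem_cache_node *n, long delta)
        {
                n->total_slabs += delta;
        }

        /* slabinfo no longer walks slabs_full; it derives the count */
        static unsigned long nr_full_slabs(struct kmem_cache_node *n,
                                           unsigned long partial,
                                           unsigned long free)
        {
                return n->total_slabs - partial - free;
        }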

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     
  • Recent changes to printk require KERN_CONT uses to continue logging
    messages. So add KERN_CONT where necessary.
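
    An illustrative instance of the pattern (not a specific hunk from the
    patch):

        /* before: the second printk now starts a new message */
        printk(KERN_INFO "Calibrating delay loop... ");
        printk("%lu BogoMIPS\n", bogomips);

        /* after: KERN_CONT marks the continuation explicitly */
        printk(KERN_INFO "Calibrating delay loop... ");
        printk(KERN_CONT "%lu BogoMIPS\n", bogomips);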

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
    Link: http://lkml.kernel.org/r/c7df37c8665134654a17aaeb8b9f6ace1d6db58b.1476239034.git.joe@perches.com
    Reported-by: Mark Rutland
    Signed-off-by: Joe Perches
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

    After some analysis, it seems that the problem is in alloc_super().
    In case list_lru_init_memcg() fails, it goes into destroy_super(),
    which calls list_lru_destroy().

    And in list_lru_init() we see that in case memcg_init_list_lru()
    fails, lru->node is freed but not set to NULL, which then leads
    list_lru_destroy() to believe it is initialized and to call
    memcg_destroy_list_lru(). memcg_destroy_list_lru() in turn can access
    lru->node[i].memcg_lrus, which is NULL. The fix is sketched below.
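
    In list_lru_init()'s error path (hedged reconstruction of the fix):

        err = memcg_init_list_lru(lru, memcg_aware);
        if (err) {
                kfree(lru->node);
                /* so a later list_lru_destroy() sees it as uninitialized */
                lru->node = NULL;
                goto out;
        }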

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Alexander Polakov
    Acked-by: Vladimir Davydov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Polakov
     
  • There is a bug report that SLAB causes an extreme load average due to
    over 2000 kworker threads.

    https://bugzilla.kernel.org/show_bug.cgi?id=172981

    This issue is caused by the kmemcg feature, which tries to create a
    new set of kmem_caches for each memcg. Recently, kmem_cache creation
    has been slowed by synchronize_sched(), and further kmem_cache
    creation is also delayed since kmem_cache creation is serialized by
    the global slab_mutex lock. So the number of kworkers trying to
    create kmem_caches keeps growing.

    synchronize_sched() protects lockless access to a node's shared
    array, but it is not needed when a new kmem_cache is created. So
    this patch rules out that case, roughly as follows.
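
    The condition (names as recalled from the fix; treat as a sketch):

        /*
         * In setup_kmem_cache_node(), only synchronize when an old
         * shared array could still have lockless readers; a freshly
         * created cache has none.
         */
        if (old_shared && force_change)
                synchronize_sched();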

    Fixes: 801faf0db894 ("mm/slab: lockless decision to grow cache")
    Link: http://lkml.kernel.org/r/1475734855-4837-1-git-send-email-iamjoonsoo.kim@lge.com
    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Oct, 2016

1 commit

  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to
    be passed through get_user_pages(), which eliminates the need for
    access to this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
      get_user_pages(start, 1, flags, page, NULL)
        __get_user_pages_locked(current, current->mm, start, 1, page,
                                NULL, NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
                           flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
      get_user_pages(addr, 1, flags, NULL, NULL)
        __get_user_pages_locked(current, current->mm, addr, 1, NULL,
                                NULL, NULL, false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1,
                           flags | FOLL_TOUCH, NULL, NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

13 commits

  • Merge the gup_flags cleanups from Lorenzo Stoakes:
    "This patch series adjusts functions in the get_user_pages* family
    such that the desired FOLL_* flags are passed as an argument rather
    than implied by separate write/force parameters.

    The purpose of this change is to make the use of FOLL_FORCE explicit
    so it is easier to grep for and clearer to callers that this flag is
    being used. The use of FOLL_FORCE is an issue as it overrides missing
    VM_READ/VM_WRITE flags for the VMA whose pages we are reading
    from/writing to, which can result in surprising behaviour.

    The patch series came out of the discussion around commit 38e088546522
    ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
    which addressed a BUG_ON() being triggered when a page was faulted in
    with PROT_NONE set but having been overridden by FOLL_FORCE.
    do_numa_page() was run on the assumption the page _must_ be one marked
    for NUMA node migration as an actual PROT_NONE page would have been
    dealt with prior to this code path, however FOLL_FORCE introduced a
    situation where this assumption did not hold.

    See

    https://marc.info/?l=linux-mm&m=147585445805166

    for the patch proposal"

    Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
    FOLL_WRITE by me.

    [ This branch was rebased recently to add a few more acked-by's and
    reviewed-by's ]
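
    The shape of the interface change, shown on get_user_pages() (hedged;
    see the individual patches below for the exact prototypes):

        /* before: two booleans, with FOLL_FORCE implied for callers */
        long get_user_pages(unsigned long start, unsigned long nr_pages,
                            int write, int force, struct page **pages,
                            struct vm_area_struct **vmas);

        /* after: callers spell out the FOLL_* flags they actually want */
        long get_user_pages(unsigned long start, unsigned long nr_pages,
                            unsigned int gup_flags, struct page **pages,
                            struct vm_area_struct **vmas);

        /* a caller that really must override VMA protections now says so */
        get_user_pages(start, 1, FOLL_WRITE | FOLL_FORCE, pages, NULL);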

    * gup_flag-cleanups:
    mm: replace access_process_vm() write parameter with gup_flags
    mm: replace access_remote_vm() write parameter with gup_flags
    mm: replace __access_remote_vm() write parameter with gup_flags
    mm: replace get_user_pages_remote() write/force parameters with gup_flags
    mm: replace get_user_pages() write/force parameters with gup_flags
    mm: replace get_vaddr_frames() write/force parameters with gup_flags
    mm: replace get_user_pages_locked() write/force parameters with gup_flags
    mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
    mm: remove write/force parameters from __get_user_pages_unlocked()
    mm: remove write/force parameters from __get_user_pages_locked()
    mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

    Linus Torvalds
     
  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Signed-off-by: Wei Yongjun
    Cc: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/1476719259-6214-1-git-send-email-weiyj.lk@gmail.com
    Signed-off-by: Thomas Gleixner

    Wei Yongjun
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages_remote() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_vaddr_frames() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_locked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This is an ancient bug that was actually attempted to be fixed once
    (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
    get_user_pages() race for write access") but that was then undone due to
    problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

    In the meantime, the s390 situation has long been fixed, and we can now
    fix it by checking the pte_dirty() bit properly (and do it better). The
    s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
    software dirty bits") which made it into v3.9. Earlier kernels will
    have to look at the page state itself.

    Also, the VM has become more scalable, and what used to be a purely
    theoretical race back then has become easier to trigger.

    To fix it, we introduce a new internal FOLL_COW flag to mark the
    "yes, we already did a COW" case, rather than playing racy games
    with FOLL_WRITE (which is very fundamental), and then use the pte
    dirty flag to validate that the FOLL_COW flag is still valid.
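
    The core of the check in mm/gup.c (reproduced from memory; treat as
    a sketch):

        /*
         * FOLL_FORCE can write to even unwritable pte's, but only
         * after we've gone through a COW cycle and they are dirty.
         */
        static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
        {
                return pte_write(pte) ||
                        ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
                         pte_dirty(pte));
        }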

    Reported-and-tested-by: Phil "not Paul" Oester
    Acked-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Oleg Nesterov
    Cc: Willy Tarreau
    Cc: Nick Piggin
    Cc: Greg Thelen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Oct, 2016

2 commits

  • I observed false KASAN positives in the sctp code, when
    sctp uses jprobe_return() in jsctp_sf_eat_sack().

    The stray 0xf4 bytes in shadow memory are stack redzones:

    [ ] ==================================================================
    [ ] BUG: KASAN: stack-out-of-bounds in memcmp+0xe9/0x150 at addr ffff88005e48f480
    [ ] Read of size 1 by task syz-executor/18535
    [ ] page:ffffea00017923c0 count:0 mapcount:0 mapping: (null) index:0x0
    [ ] flags: 0x1fffc0000000000()
    [ ] page dumped because: kasan: bad access detected
    [ ] CPU: 1 PID: 18535 Comm: syz-executor Not tainted 4.8.0+ #28
    [ ] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ ] ffff88005e48f2d0 ffffffff82d2b849 ffffffff0bc91e90 fffffbfff10971e8
    [ ] ffffed000bc91e90 ffffed000bc91e90 0000000000000001 0000000000000000
    [ ] ffff88005e48f480 ffff88005e48f350 ffffffff817d3169 ffff88005e48f370
    [ ] Call Trace:
    [ ] [] dump_stack+0x12e/0x185
    [ ] [] kasan_report+0x489/0x4b0
    [ ] [] __asan_report_load1_noabort+0x19/0x20
    [ ] [] memcmp+0xe9/0x150
    [ ] [] depot_save_stack+0x176/0x5c0
    [ ] [] save_stack+0xb1/0xd0
    [ ] [] kasan_slab_free+0x72/0xc0
    [ ] [] kfree+0xc8/0x2a0
    [ ] [] skb_free_head+0x79/0xb0
    [ ] [] skb_release_data+0x37a/0x420
    [ ] [] skb_release_all+0x4f/0x60
    [ ] [] consume_skb+0x138/0x370
    [ ] [] sctp_chunk_put+0xcb/0x180
    [ ] [] sctp_chunk_free+0x58/0x70
    [ ] [] sctp_inq_pop+0x68f/0xef0
    [ ] [] sctp_assoc_bh_rcv+0xd6/0x4b0
    [ ] [] sctp_inq_push+0x131/0x190
    [ ] [] sctp_backlog_rcv+0xe9/0xa20
    [ ... ]
    [ ] Memory state around the buggy address:
    [ ] ffff88005e48f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] >ffff88005e48f480: f4 f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ^
    [ ] ffff88005e48f500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ==================================================================

    KASAN stack instrumentation poisons stack redzones on function entry
    and unpoisons them on function exit. If a function exits abnormally
    (e.g. with a longjmp like jprobe_return()), stack redzones are left
    poisoned. Later this leads to random KASAN false reports.

    Unpoison stack redzones in the frames we are going to jump over
    before doing actual longjmp in jprobe_return().
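
    Conceptually (helper name assumed; the actual patch may structure
    this differently):

        /*
         * Unpoison the shadow for the stack region between the current
         * stack pointer and the one saved at jprobe entry, i.e. the
         * frames the longjmp is about to skip.
         */
        static void jprobe_unpoison_stack(const void *saved_sp)
        {
                const void *sp = __builtin_frame_address(0);

                kasan_unpoison_shadow(sp, saved_sp - sp);
        }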

    Signed-off-by: Dmitry Vyukov
    Acked-by: Masami Hiramatsu
    Reviewed-by: Mark Rutland
    Cc: Mark Rutland
    Cc: Catalin Marinas
    Cc: Andrey Ryabinin
    Cc: Lorenzo Pieralisi
    Cc: Alexander Potapenko
    Cc: Will Deacon
    Cc: Andrew Morton
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Masami Hiramatsu
    Cc: kasan-dev@googlegroups.com
    Cc: surovegin@google.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1476454043-101898-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     
  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

15 Oct, 2016

1 commit

  • Pull percpu updates from Tejun Heo:

    - Nick improved generic implementations of percpu operations which
    modify the variable and return so that they calculate the physical
    address only once.

    - percpu_ref percpu atomic mode switching improvements. The
    patchset was originally posted about a year ago but fell through the
    cracks.

    - misc non-critical fixes.
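
    For reference, the improved generic modify-return pattern computes
    the percpu address once (hedged reconstruction, names approximate):

        /* old shape: address arithmetic done twice, once per accessor */
        #define this_cpu_generic_add_return_old(pcp, val) \
        ({ \
                raw_cpu_add(pcp, val); \
                raw_cpu_read(pcp); \
        })

        /* new shape: the percpu pointer is computed a single time */
        #define raw_cpu_generic_add_return(pcp, val) \
        ({ \
                typeof(pcp) *__p = raw_cpu_ptr(&(pcp)); \
 \
                *__p += val; \
                *__p; \
        })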

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
    mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
    percpu: eliminate two sparse warnings
    percpu: improve generic percpu modify-return implementation
    percpu-refcount: init ->confirm_switch member properly
    percpu_ref: allow operation mode switching operations to be called concurrently
    percpu_ref: restructure operation mode switching
    percpu_ref: unify staggered atomic switching wait behavior
    percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic()
    percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation

    Linus Torvalds
     

13 Oct, 2016

1 commit

  • This effectively reverts commit 377ccbb48373 ("Makefile: Mute warning
    for __builtin_return_address(>0) for tracing only") because it turns out
    that it really isn't tracing only - it's all over the tree.

    We also had the warning disabled separately for mm/usercopy.c
    (which this commit removes), and it turns out that we will also
    want to disable it for get_lock_parent_ip(), which is used for at least
    TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.

    Steven Rostedt had a patch that tried to limit it to just the config
    options that actually triggered this, but quite frankly, the extra
    complexity and abstraction just isn't worth it. We have never actually
    had a case where the warning is actually useful, so let's just disable
    it globally and not worry about it.

    Acked-by: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Oct, 2016

1 commit

  • Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a
    physical address to a virtual one using __va(). However, such physical
    addresses may sometimes be located in highmem and using __va() is
    incorrect, leading to inconsistent object tracking in kmemleak.

    The following functions have been added to the kmemleak API and they take
    a physical address as the object pointer. They only perform the
    corresponding action if the address has a lowmem mapping:

    kmemleak_alloc_phys
    kmemleak_free_part_phys
    kmemleak_not_leak_phys
    kmemleak_ignore_phys

    The affected calling places have been updated to use the new kmemleak
    API.
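
    A converted call site then looks roughly like this (the new function
    takes a physical address directly):

        /* before: broken for highmem, __va() is only valid for lowmem */
        kmemleak_alloc(__va(phys), size, 0, 0);

        /* after: kmemleak checks for a lowmem mapping and otherwise
         * ignores the object */
        kmemleak_alloc_phys(phys, size, 0, 0);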

    Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Reported-by: Vignesh R
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

11 Oct, 2016

9 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check
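
    For context, the handler shape filesystems implement instead of the
    removed inode operations (hedged sketch; prototypes as recalled from
    include/linux/xattr.h):

        static int foo_xattr_get(const struct xattr_handler *handler,
                                 struct dentry *dentry, struct inode *inode,
                                 const char *name, void *buffer, size_t size)
        {
                /* copy the named attribute's value into buffer */
                return -EOPNOTSUPP;     /* placeholder */
        }

        static const struct xattr_handler foo_xattr_handler = {
                .prefix = "user.",      /* matched before ->get/->set run */
                .get    = foo_xattr_get,
        };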

    Linus Torvalds
     
  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provide some level of
    latent entropy.
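
    Illustrative uses (hypothetical declarations, not hunks from the
    patch):

        /* variable: initialized with build-time random contents */
        static u32 mix_pool[8] __latent_entropy;

        /* function: instrumented to accumulate control-flow entropy */
        static int __init __latent_entropy early_setup(void)
        {
                return 0;
        }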

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty from a running system at boot time as
    possible, hoping to capitalize on any possible variation in CPU
    operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example for
    how to manipulate kernel code using the gcc plugin internals.

    The need for very-early boot entropy tends to be very architecture or
    system design specific, so this plugin is more suited for those sorts
    of special cases. The existing kernel RNG already attempts to extract
    entropy from reliable runtime variation, but this plugin takes the idea to
    a logical extreme by permuting a global variable based on any variation
    in code execution (e.g. a different value (and permutation function)
    is used to permute the global based on loop count, case statement,
    if/then/else branching, etc).

    To do this, the plugin starts by inserting a local variable in every
    marked function. The plugin then adds logic so that the value of this
    variable is modified by randomly chosen operations (add, xor and rol) and
    random values (gcc generates separate static values for each location at
    compile time and also injects the stack pointer at runtime). The resulting
    value depends on the control flow path (e.g., loops and branches taken).

    Before the function returns, the plugin mixes this local variable into
    the latent_entropy global variable. The value of this global variable
    is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
    though it does not credit any bytes of entropy to the pool; the contents
    of the global are just used to mix the pool.

    Additionally, the plugin can pre-initialize arrays with build-time
    random contents, so that two different kernel builds running on identical
    hardware will not have the same starting values.
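
    A hand-written approximation of what the instrumentation amounts to
    (the plugin operates on GCC's internal representation; the constants
    and operations below are invented for illustration):

        u64 latent_entropy;     /* global, mixed into the pool later */

        void __latent_entropy example(int n)
        {
                u64 local_entropy = (unsigned long)&local_entropy;
                int i;

                for (i = 0; i < n; i++) {
                        local_entropy ^= 0x7f3c9a2b5d8e146bULL;
                        local_entropy = rol64(local_entropy, 11);
                }
                if (n & 1)
                        local_entropy += 0x2b992ddfa23249d6ULL;

                latent_entropy ^= local_entropy;        /* before return */
        }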

    Signed-off-by: Emese Revfy
    [kees: expanded commit message and code comments]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     
  • Pull protection keys syscall interface from Thomas Gleixner:
    "This is the final step of Protection Keys support which adds the
    syscalls so user space can actually allocate keys and protect memory
    areas with them. Details and usage examples can be found in the
    documentation.

    The mm side of this has been acked by Mel"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/pkeys: Update documentation
    x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
    x86/pkeys: Fix pkeys build breakage for some non-x86 arches
    x86/pkeys: Add self-tests
    x86/pkeys: Allow configuration of init_pkru
    x86/pkeys: Default to a restrictive init PKRU
    pkeys: Add details of system call use to Documentation/
    generic syscalls: Wire up memory protection keys syscalls
    x86: Wire up protection keys system calls
    x86/pkeys: Allocation/free syscalls
    x86/pkeys: Make mprotect_key() mask off additional vm_flags
    mm: Implement new pkey_mprotect() system call
    x86/pkeys: Add fault handling for PF_PK page fault bit
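
    A hedged user-space sketch of the new interface (wrapper names as in
    the documentation; early user space may need to go through syscall(2)):

        #include <sys/mman.h>

        static int write_protect_region(void *addr, size_t len)
        {
                int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);

                if (pkey < 0)
                        return -1;
                if (pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey)) {
                        pkey_free(pkey);
                        return -1;
                }
                /* reads proceed; writes fault with PF_PK until the
                 * thread's PKRU rights for this key are changed */
                return pkey;
        }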

    Linus Torvalds
     
  • This fixes the ITER_PIPE interaction with direct_IO by making sure
    we call iov_iter_advance() on the original iov_iter even if
    direct_IO (done on its copy) has returned 0. It's a no-op for old
    iov_iter flavours and does the right thing (i.e. truncation of the
    stuff we'd allocated but not filled) in the ITER_PIPE case. Failures
    (e.g. -EIO) get caught and dealt with by cleanup in
    generic_file_read_iter().

    Signed-off-by: Al Viro

    Al Viro
     

08 Oct, 2016

5 commits

  • Al Viro
     
  • Merge updates from Andrew Morton:

    - fsnotify updates

    - ocfs2 updates

    - all of MM

    * emailed patches from Andrew Morton : (127 commits)
    console: don't prefer first registered if DT specifies stdout-path
    cred: simpler, 1D supplementary groups
    CREDITS: update Pavel's information, add GPG key, remove snail mail address
    mailmap: add Johan Hovold
    .gitattributes: set git diff driver for C source code files
    uprobes: remove function declarations from arch/{mips,s390}
    spelling.txt: "modeled" is spelt correctly
    nmi_backtrace: generate one-line reports for idle cpus
    arch/tile: adopt the new nmi_backtrace framework
    nmi_backtrace: do a local dump_stack() instead of a self-NMI
    nmi_backtrace: add more trigger_*_cpu_backtrace() methods
    min/max: remove sparse warnings when they're nested
    Documentation/filesystems/proc.txt: add more description for maps/smaps
    mm, proc: fix region lost in /proc/self/smaps
    proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
    proc: add LSM hook checks to /proc//timerslack_ns
    proc: relax /proc//timerslack_ns capability requirements
    meminfo: break apart a very long seq_printf with #ifdefs
    seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
    proc: faster /proc/*/status
    ...

    Linus Torvalds
     
  • These inode operations are no longer used; remove them.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Al Viro

    Andreas Gruenbacher
     
  • Modify seq_put_decimal_[u]ll to take a string rather than a single
    character as the delimiter, allowing some seq_puts() calls to be
    removed.
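
    The change in shape (illustrative):

        /* before: char delimiter, string output needs its own call */
        seq_puts(m, " anon ");
        seq_put_decimal_ull(m, ' ', val);

        /* after: the delimiter is a string, folding the seq_puts() in */
        seq_put_decimal_ull(m, " anon ", val);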

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Every current KDE system has a process named ksysguardd polling the
    files below every few seconds:

    $ strace -e trace=open -p $(pidof ksysguardd)
    Process 1812 attached
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/proc/net/dev", O_RDONLY) = 8
    open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/proc/stat", O_RDONLY) = 8
    open("/proc/vmstat", O_RDONLY) = 8

    Hell knows what it is doing, but this speeds up reading /proc/vmstat
    by 33%!

    The benchmark is open+read+close repeated 1,000,000 times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% )
    15 context-switches # 0.001 K/sec ( +- 1.41% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    104 page-faults # 0.008 K/sec ( +- 0.57% )
    45,489,799,349 cycles # 3.460 GHz ( +- 0.03% )
    9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% )
    2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% )
    79,241,190,850 instructions # 1.74 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% )
    176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% )

    13.691078109 seconds time elapsed ( +- 0.03% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% )
    10 context-switches # 0.001 K/sec ( +- 2.13% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.012 K/sec ( +- 0.56% )
    30,384,010,730 cycles # 3.497 GHz ( +- 0.07% )
    12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% )
    3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% )
    28,969,052,879 instructions # 0.95 insn per cycle
    # 0.42 stalled cycles per insn ( +- 0.01% )
    6,308,245,891 branches # 726.058 M/sec ( +- 0.00% )
    214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% )

    9.146081052 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    vsnprintf() is slow because:

    1. format_decode() is busy looking for format specifiers: 2 branches
    per character (not in this case, but in others).

    2. there are approximately a million branches while parsing the
    format mini-language, and everywhere else.

    3. just look at what string() does. /proc/vmstat is a good case
    because most of its content is strings.

    Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan