11 Jul, 2017

1 commit

  • wait4(-2147483648, 0x20, 0, 0xdd0000) triggers:
    UBSAN: Undefined behaviour in kernel/exit.c:1651:9

    The related calltrace is as follows:

    negation of -2147483648 cannot be represented in type 'int':
    CPU: 9 PID: 16482 Comm: zj Tainted: G B ---- ------- 3.10.0-327.53.58.71.x86_64+ #66
    Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA , BIOS CTSAV036 04/27/2011
    Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SyS_wait4+0x1cb/0x1e0
    system_call_fastpath+0x16/0x1b

    Exclude the overflow to avoid the UBSAN warning.

    Link: http://lkml.kernel.org/r/1497264618-20212-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhongjiang
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhongjiang
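
The overflow described above can be sketched in userspace C. This is only an illustration: the enum, the function name, and the choice of -ESRCH as the rejection value are assumptions of the sketch, since the commit text says only that the overflow is excluded.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical stand-ins for the kernel's pid type selectors. */
enum wait_type { WAIT_ANY, WAIT_PID, WAIT_PGID_SELF, WAIT_PGID };

/*
 * Classify a wait4()-style pid argument.
 * Returns 0 and fills *type / *id on success, or -ESRCH for INT_MIN,
 * whose negation cannot be represented in type int.
 */
static int classify_wait_pid(int upid, enum wait_type *type, int *id)
{
	if (upid == INT_MIN)
		return -ESRCH;	/* reject before "-upid" can overflow */

	if (upid == -1) {
		*type = WAIT_ANY;
		*id = 0;
	} else if (upid < 0) {
		*type = WAIT_PGID;
		*id = -upid;	/* safe: upid > INT_MIN here */
	} else if (upid == 0) {
		*type = WAIT_PGID_SELF;
		*id = 0;
	} else {
		*type = WAIT_PID;
		*id = upid;
	}
	return 0;
}
```

Checking for INT_MIN before the negation keeps every remaining negative value in the range where -upid is well defined.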
     

08 Jul, 2017

1 commit

  • We lose the distinction between "found a PID" and "nothing, but that's not
    an error" a bit too early in waitid(). Easily fixed, fortunately...

    Reported-by: Markus Trippelsdorf
    Fixes: 67d7ddded322 ("waitid(2): leave copyout of siginfo to syscall itself")
    Tested-by: Markus Trippelsdorf
    Signed-off-by: Al Viro

    Al Viro
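
A minimal sketch of the distinction this fix restores, assuming a worker that returns a positive value when a child was found, 0 when there was nothing to report, and a negative errno on failure. The struct and function names are hypothetical and do not mirror the kernel's siginfo layout.

```c
#include <signal.h>
#include <string.h>

/* Hypothetical, simplified result record; not the kernel's siginfo layout. */
struct wait_report {
	int signo;	/* SIGCHLD if a child was reported, else 0 */
	int pid;	/* reported child, else 0 */
};

/*
 * Translate a worker result into (syscall return value, report).
 * worker_ret > 0: a child was found; == 0: nothing to report, not an error;
 * < 0: negative errno.
 */
static long finish_waitid(long worker_ret, int found_pid, struct wait_report *out)
{
	memset(out, 0, sizeof(*out));
	if (worker_ret < 0)
		return worker_ret;
	if (worker_ret > 0) {
		/* Only the "found" case fills in the report. */
		out->signo = SIGCHLD;
		out->pid = found_pid;
	}
	return 0;	/* both non-error cases return 0 to the caller */
}
```

Collapsing the positive result to 0 only after the report has been filled in is what keeps the two non-error cases apart.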
     

07 Jul, 2017

1 commit

  • Commit dd0db88d8094 ("userfaultfd: non-cooperative: rollback
    userfaultfd_exit") removed userfaultfd callback from exit() which makes
    the include of <linux/userfaultfd_k.h> unnecessary.

    Link: http://lkml.kernel.org/r/1494930907-3060-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

06 Jul, 2017

1 commit

  • Pull wait syscall updates from Al Viro:
    "Consolidating sys_wait* and compat counterparts.

    Gets rid of set_fs()/double-copy mess, simplifies the whole thing
    (lifting the copyouts to the syscalls means less headache in the part
    that does actual work - fewer failure exits, to start with), gets rid
    of the overhead of field-by-field __put_user()"

    * 'work.sys_wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    osf_wait4: switch to kernel_wait4()
    waitid(): switch copyout of siginfo to unsafe_put_user()
    wait_task_zombie: consolidate info logics
    kill wait_noreap_copyout()
    lift getrusage() from wait_noreap_copyout()
    waitid(2): leave copyout of siginfo to syscall itself
    kernel_wait4()/kernel_waitid(): delay copying status to userland
    wait4(2)/waitid(2): separate copying rusage to userland
    move compat wait4 and waitid next to native variants

    Linus Torvalds
     

20 Jun, 2017

2 commits

  • This function was introduced by:

    150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")

    ... to allow easier usage of task_rcu_dereference(), however no users
    were ever added. Drop the helper.

    Signed-off-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/20170615023730.22827-1-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

22 May, 2017

9 commits


10 Mar, 2017

1 commit

  • Patch series "userfaultfd non-cooperative further update for 4.11 merge
    window".

    Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
    more testing. I've been doing testing before and this was also tested
    by kbuild bot and exercised by the selftest, but this bug never
    reproduced before.

    I dropped userfaultfd_exit as a result. I dropped it because of the
    implementation difficulty in receiving signals in __mmput and because I
    think -ENOSPC as a result from the background UFFDIO_COPY should be
    enough already.

    Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
    wasn't exercised by the selftest and when I tried to exercise it, after
    moving it to a more correct place in __mmput where it would make more
    sense and where the vma list is stable, it resulted in the
    event_wait_completion in D state. So I added the second patch to make
    sure that even if we call userfaultfd_event_wait_completion too late
    during task exit(), we don't risk generating tasks in D state. The
    same check exists in handle_userfault() for the same reason, except that
    it makes a difference there, while here it is just a robustness check
    run under WARN_ON_ONCE.

    While looking at the userfaultfd_event_wait_completion() function I
    also looked back at its callers, and I think it's not ok to
    stop executing dup_fctx on the fcs list because we rely on
    userfaultfd_event_wait_completion to execute
    userfaultfd_ctx_put(fctx->orig) which is paired against
    userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
    list_add(fcs). This change only takes care of fctx->orig but this area
    also needs further review looking for similar problems in fctx->new.

    The only patch that is urgent is the first, because it fixes a
    use-after-free during an SMP race condition that affects all processes if
    CONFIG_USERFAULTFD=y. It is very hard to reproduce though, and probably
    impossible without SLUB poisoning enabled.

    This patch (of 3):

    I once reproduced this oops with the userfaultfd selftest; it's not
    easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[] [] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    do_exit+0x297/0xd10
    SyS_exit+0x17/0x20
    tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [] userfaultfd_exit+0x29/0xa0
    RSP
    ---[ end trace 9fecd6dcb442846a ]---

    In the debugger I located the "mm" pointer in the stack, and walking
    mm->mmap->vm_next through the end shows the vma->vm_next list is fully
    consistent and is a null-terminated list as expected. So this has to
    be an SMP race condition where userfaultfd_exit was running while the
    vma list was being modified by another CPU.

    When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
    SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

    The reason is that it's not running in __mmput but while there are still
    other threads running, and it's not holding the mmap_sem (it can't, as it
    has to wait for the event to be received by the manager). So this is a
    use-after-free that was happening for all processes.

    One more implementation problem aside from the race condition:
    userfaultfd_exit really has to check a flag in mm->flags before walking
    the vma or it's going to slow down the exit() path for regular tasks.

    One more implementation problem: at that point signals can't be
    delivered so it would also create a task in D state if the manager
    doesn't read the event.

    The major design issue: it overall looks superfluous as the manager can
    check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
            [..]
    } else {
            return -ENOSPC;
    }

    It's safer to roll it back and re-introduce it later if at all.

    [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
    Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mike Rapoport
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
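
The -ENOSPC check quoted in this commit relies on a "take a reference unless the count is already zero" operation. The following is a userspace sketch of that pattern using C11 atomics. The names fake_mm, fake_mmget_not_zero, fake_mmput and background_copy are hypothetical stand-ins, not kernel APIs.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an mm with a user count. */
struct fake_mm {
	atomic_int users;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool fake_mmget_not_zero(struct fake_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by fake_mmget_not_zero(). */
static void fake_mmput(struct fake_mm *mm)
{
	atomic_fetch_sub(&mm->users, 1);
}

/* Background copy: report -ENOSPC when the target address space is gone. */
static int background_copy(struct fake_mm *mm)
{
	if (!fake_mmget_not_zero(mm))
		return -ENOSPC;
	/* ... the copy itself would happen here ... */
	fake_mmput(mm);
	return 0;
}
```

With this shape the manager learns that the target has exited from the return value of the copy, without a separate exit event.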
     

02 Mar, 2017

6 commits

  • …linux/sched/cputime.h>

    Introduce a trivial, mostly empty <linux/sched/cputime.h> header
    to prepare for the moving of cputime functionality out of sched.h.

    Update all code that relies on these facilities.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split a header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split a header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split a header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split a header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
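
The two sed expressions above cover two call shapes: a pointer (`x->mm_count`) and an embedded object (`x.mm_count`). A userspace sketch of both conversions follows, assuming the helper simply increments the count. The fake_ names are hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in; the real mm_struct has many more fields. */
struct fake_mm {
	atomic_int mm_count;
};

/* Helper that names the intent instead of open-coding the increment. */
static inline void fake_mmgrab(struct fake_mm *mm)
{
	atomic_fetch_add(&mm->mm_count, 1);
}

/* Pointer form: atomic_inc(&mm->mm_count) becomes mmgrab(mm). */
static void pin_by_pointer(struct fake_mm *mm)
{
	fake_mmgrab(mm);
}

/* Embedded form: atomic_inc(&obj.mm_count) becomes mmgrab(&obj). */
static int pin_embedded(void)
{
	struct fake_mm obj;

	atomic_init(&obj.mm_count, 1);
	fake_mmgrab(&obj);
	return atomic_load(&obj.mm_count);
}
```

Routing every increment through one helper also gives a later patch a single place to hook into, which is the motivation the commit message gives.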
     

25 Feb, 2017

1 commit

  • Allow the userfaultfd monitor to track termination of the processes that
    have memory backed by the uffd.

    [rppt@linux.vnet.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20170202135448.GB19804@rapoport-lnx
    Link: http://lkml.kernel.org/r/1485542673-24387-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that when a
    network namespace exits purges its sysctl entries from the dcache. As
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec. Allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm. Allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount. As the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace make it possible to
    discuss how to fix the worst case complexity of umount. The stacked filesystem
    fixes pave the way for adding multiple mappings for the filesystem
    uids so that efficient and safer containers can be implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

22 Feb, 2017

1 commit

  • Pull security layer updates from James Morris:
    "Highlights:

    - major AppArmor update: policy namespaces & lots of fixes

    - add /sys/kernel/security/lsm node for easy detection of loaded LSMs

    - SELinux cgroupfs labeling support

    - SELinux context mounts on tmpfs, ramfs, devpts within user
    namespaces

    - improved TPM 2.0 support"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits)
    tpm: declare tpm2_get_pcr_allocation() as static
    tpm: Fix expected number of response bytes of TPM1.2 PCR Extend
    tpm xen: drop unneeded chip variable
    tpm: fix misspelled "facilitate" in module parameter description
    tpm_tis: fix the error handling of init_tis()
    KEYS: Use memzero_explicit() for secret data
    KEYS: Fix an error code in request_master_key()
    sign-file: fix build error in sign-file.c with libressl
    selinux: allow changing labels for cgroupfs
    selinux: fix off-by-one in setprocattr
    tpm: silence an array overflow warning
    tpm: fix the type of owned field in cap_t
    tpm: add securityfs support for TPM 2.0 firmware event log
    tpm: enhance read_log_of() to support Physical TPM event log
    tpm: enhance TPM 2.0 PCR extend to support multiple banks
    tpm: implement TPM 2.0 capability to get active PCR banks
    tpm: fix RC value check in tpm2_seal_trusted
    tpm_tis: fix iTPM probe via probe_itpm() function
    tpm: Begin the process to deprecate user_read_timer
    tpm: remove tpm_read_index and tpm_write_index from tpm.h
    ...

    Linus Torvalds
     

21 Feb, 2017

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Implement wraparound-safe refcount_t and kref_t types based on
    generic atomic primitives (Peter Zijlstra)

    - Improve and fix the ww_mutex code (Nicolai Hähnle)

    - Add self-tests to the ww_mutex code (Chris Wilson)

    - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
    Bueso)

    - Micro-optimize the current-task logic all around the core kernel
    (Davidlohr Bueso)

    - Tidy up after recent optimizations: remove stale code and APIs,
    clean up the code (Waiman Long)

    - ... plus misc fixes, updates and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
    fork: Fix task_struct alignment
    locking/spinlock/debug: Remove spinlock lockup detection code
    lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
    lkdtm: Convert to refcount_t testing
    kref: Implement 'struct kref' using refcount_t
    refcount_t: Introduce a special purpose refcount type
    sched/wake_q: Clarify queue reinit comment
    sched/wait, rcuwait: Fix typo in comment
    locking/mutex: Fix lockdep_assert_held() fail
    locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
    locking/rwsem: Reinit wake_q after use
    locking/rwsem: Remove unnecessary atomic_long_t casts
    jump_labels: Move header guard #endif down where it belongs
    locking/atomic, kref: Implement kref_put_lock()
    locking/ww_mutex: Turn off __must_check for now
    locking/atomic, kref: Avoid more abuse
    locking/atomic, kref: Use kref_get_unless_zero() more
    locking/atomic, kref: Kill kref_sub()
    locking/atomic, kref: Add kref_read()
    locking/atomic, kref: Add KREF_INIT()
    ...

    Linus Torvalds
     

01 Feb, 2017

2 commits

  • Now that most cputime readers use the transition API, which returns the
    task cputime in old style cputime_t, we can safely store the cputime in
    nsecs. This will eventually make cputime statistics less opaque and more
    granular. Back and forth conversions between cputime_t and nsecs in order
    to deal with cputime_t random granularity won't be needed anymore.

    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • find_new_reaper() checks same_thread_group(reaper, child_reaper) to
    prevent the cross-namespace reparenting but this is not enough if the
    exiting parent was injected by setns() + fork().

    Suppose we have a process P in the root namespace and some namespace X.
    P does setns() to enter the X namespace, and forks the child C.
    C forks a grandchild G and exits.

    The grandchild G should be re-parented to X->child_reaper, but in this
    case the ->real_parent chain does not lead to ->child_reaper, so it will
    be wrongly reparented to P's sub-reaper or a global init.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman

    Oleg Nesterov
     

14 Jan, 2017

3 commits

  • rcuwait provides support for (single) RCU-safe task wait/wake functionality,
    with the caveat that it must not be called after exit_notify(), such that
    we avoid racing with rcu delayed_put_task_struct callbacks, task_struct
    being rcu unaware in this context -- for which we similarly have
    task_rcu_dereference() magic, but with different return semantics, which
    can conflict with the wakeup side.

    The interfaces are quite straightforward:

    rcuwait_wait_event()
    rcuwait_wake_up()

    More details are in the comments, but it's perhaps worth mentioning at least,
    that users must provide proper serialization when waiting on a condition, and
    avoid corrupting a concurrent waiter. Also care must be taken between the task
    and the condition for when calling the wakeup -- we cannot miss wakeups. When
    porting users, this is for example, a given when using waitqueues in that
    everything is done under the q->lock. As such, it can remove sources of non
    preemptable unbounded work for realtime.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1484148146-14210-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • This is a nasty interface and setting the state of a foreign task must
    not be done. As of the following commit:

    be628be0956 ("bcache: Make gc wakeup sane, remove set_task_state()")

    ... everyone in the kernel calls set_task_state() with current, allowing
    the helper to be removed.

    However, as the comment indicates, it is still around for those archs
    where computing current is more expensive than using a pointer, at least
    in theory. An important arch that is affected is arm64, however this has
    been addressed now [1] and performance is up to par making no difference
    with either calls.

    Of all the callers, if any, it's the locking bits that would care most
    about this -- ie: we end up passing a tsk pointer to a lot of the lock
    slowpath, and setting ->state on that. The following numbers are based
    on two tests. The first is a custom ad-hoc microbenchmark that just
    measures latencies (for ~65 million calls) of set_task_state() vs
    set_current_state().

    Secondly for a higher overview, an unlink microbenchmark was used,
    which pounds on a single file with open, close, unlink combos with
    increasing thread counts (up to 4x ncpus). While the workload is quite
    unrealistic, it does contend a lot on the inode mutex or now rwsem.

    [1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com

    == 1. x86-64 ==

    Avg runtime set_task_state(): 601 msecs
    Avg runtime set_current_state(): 552 msecs

                                        vanilla              dirty
    Hmean  unlink1-processes-2      36089.26 (  0.00%)   38977.33 (  8.00%)
    Hmean  unlink1-processes-5      28555.01 (  0.00%)   29832.55 (  4.28%)
    Hmean  unlink1-processes-8      37323.75 (  0.00%)   44974.57 ( 20.50%)
    Hmean  unlink1-processes-12     43571.88 (  0.00%)   44283.01 (  1.63%)
    Hmean  unlink1-processes-21     34431.52 (  0.00%)   38284.45 ( 11.19%)
    Hmean  unlink1-processes-30     34813.26 (  0.00%)   37975.17 (  9.08%)
    Hmean  unlink1-processes-48     37048.90 (  0.00%)   39862.78 (  7.59%)
    Hmean  unlink1-processes-79     35630.01 (  0.00%)   36855.30 (  3.44%)
    Hmean  unlink1-processes-110    36115.85 (  0.00%)   39843.91 ( 10.32%)
    Hmean  unlink1-processes-141    32546.96 (  0.00%)   35418.52 (  8.82%)
    Hmean  unlink1-processes-172    34674.79 (  0.00%)   36899.21 (  6.42%)
    Hmean  unlink1-processes-203    37303.11 (  0.00%)   36393.04 ( -2.44%)
    Hmean  unlink1-processes-224    35712.13 (  0.00%)   36685.96 (  2.73%)

    == 2. ppc64le ==

    Avg runtime set_task_state(): 938 msecs
    Avg runtime set_current_state(): 940 msecs

                                        vanilla              dirty
    Hmean  unlink1-processes-2      19269.19 (  0.00%)   30704.50 ( 59.35%)
    Hmean  unlink1-processes-5      20106.15 (  0.00%)   21804.15 (  8.45%)
    Hmean  unlink1-processes-8      17496.97 (  0.00%)   17243.28 ( -1.45%)
    Hmean  unlink1-processes-12     14224.15 (  0.00%)   17240.21 ( 21.20%)
    Hmean  unlink1-processes-21     14155.66 (  0.00%)   15681.23 ( 10.78%)
    Hmean  unlink1-processes-30     14450.70 (  0.00%)   15995.83 ( 10.69%)
    Hmean  unlink1-processes-48     16945.57 (  0.00%)   16370.42 ( -3.39%)
    Hmean  unlink1-processes-79     15788.39 (  0.00%)   14639.27 ( -7.28%)
    Hmean  unlink1-processes-110    14268.48 (  0.00%)   14377.40 (  0.76%)
    Hmean  unlink1-processes-141    14023.65 (  0.00%)   16271.69 ( 16.03%)
    Hmean  unlink1-processes-172    13417.62 (  0.00%)   16067.55 ( 19.75%)
    Hmean  unlink1-processes-203    15293.08 (  0.00%)   15440.40 (  0.96%)
    Hmean  unlink1-processes-234    13719.32 (  0.00%)   16190.74 ( 18.01%)
    Hmean  unlink1-processes-265    16400.97 (  0.00%)   16115.22 ( -1.74%)
    Hmean  unlink1-processes-296    14388.60 (  0.00%)   16216.13 ( 12.70%)
    Hmean  unlink1-processes-320    15771.85 (  0.00%)   15905.96 (  0.85%)

    x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
    and ppc64 (with paca) show similar improvements in the unlink microbenches.
    The small delta for ppc64 (2ms) does not represent the gains on the unlink
    runs. In the case of x86, there was a decent amount of variation in the
    latency runs (but always within a 20 to 50ms increase); ppc was more constant.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: mark.rutland@arm.com
    Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
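
The percentages in parentheses in the tables above are consistent with the relative change of the "dirty" column over the "vanilla" column. A minimal sketch of that arithmetic follows; the function name is hypothetical and the formula is inferred from the published numbers rather than taken from the benchmark tool.

```c
/* Relative change of "dirty" over "vanilla", in percent. */
static double gain_percent(double vanilla, double dirty)
{
	return (dirty - vanilla) / vanilla * 100.0;
}
```

For example, the x86-64 row for unlink1-processes-2 gives (38977.33 - 36089.26) / 36089.26, which is about 8.00%, and the row for unlink1-processes-203 comes out negative because the patched kernel was slower in that run.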
     
  • This patch effectively replaces the tsk pointer dereference (which is
    obviously == current), to directly use get_current() macro. In this
    case, do_exit() always passes current to exit_mm(), hence we can
    simply get rid of the argument. This is also a performance win on some
    archs such as x86-64 and ppc64 -- arm64 is no longer an issue.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: mark.rutland@arm.com
    Link: http://lkml.kernel.org/r/1483479794-14013-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

13 Jan, 2017

1 commit

  • As reported by yangshukui, a permission denial from security_task_wait()
    can lead to a soft lockup in zap_pid_ns_processes() since it only expects
    sys_wait4() to return 0 or -ECHILD. Further, security_task_wait() can
    in general lead to zombies; in the absence of some way to automatically
    reparent a child process upon a denial, the hook is not useful. Remove
    the security hook and its implementations in SELinux and Smack. Smack
    already removed its check from its hook.

    Reported-by: yangshukui
    Signed-off-by: Stephen Smalley
    Acked-by: Casey Schaufler
    Acked-by: Oleg Nesterov
    Signed-off-by: Paul Moore

    Stephen Smalley
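
The lockup described above follows from a reaping loop that only terminates on -ECHILD. The sketch below models that loop in userspace. The callback type, the function names, the iteration guard, and the use of -EACCES for the denial are all assumptions made for illustration.

```c
#include <errno.h>

/* Hypothetical wait callback: returns a reaped pid (> 0), 0, or a negative errno. */
typedef int (*wait_fn)(void *ctx);

/*
 * Reap until the wait call says there are no children left.
 * max_iters is a guard added only for this sketch; returns the number of
 * calls made, or -1 if the guard tripped.
 */
static int reap_all(wait_fn wait_one, void *ctx, int max_iters)
{
	int calls = 0;
	int rc;

	do {
		if (calls == max_iters)
			return -1;	/* without the guard this would spin forever */
		rc = wait_one(ctx);
		calls++;
	} while (rc != -ECHILD);

	return calls;
}

/* Well-behaved: reports each remaining child once, then -ECHILD. */
static int wait_ok(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return -ECHILD;
	return (*remaining)--;
}

/* Denied by a policy hook: the same error on every call. */
static int wait_denied(void *ctx)
{
	(void)ctx;
	return -EACCES;
}
```

A wait call that keeps returning an error other than -ECHILD never satisfies the exit condition, which is why a hook that can veto the wait does not fit this loop.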
     

25 Dec, 2016

1 commit


13 Dec, 2016

1 commit

  • Pull timer updates from Thomas Gleixner:
    "The time/timekeeping/timer folks deliver with this update:

    - Fix a reintroduced signed/unsigned issue and clean up the whole
    signed/unsigned mess in the timekeeping core so this won't happen
    accidentally again.

    - Add a new trace clock based on boot time

    - Prevent injection of random sleep times when PM tracing abuses the
    RTC for storage

    - Make posix timers configurable for real tiny systems

    - Add tracepoints for the alarm timer subsystem so timer based
    suspend wakeups can be instrumented

    - The usual pile of fixes and updates to core and drivers"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    timekeeping: Use mul_u64_u32_shr() instead of open coding it
    timekeeping: Get rid of pointless typecasts
    timekeeping: Make the conversion call chain consistently unsigned
    timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
    alarmtimer: Add tracepoints for alarm timers
    trace: Update documentation for mono, mono_raw and boot clock
    trace: Add an option for boot clock as trace clock
    timekeeping: Add a fast and NMI safe boot clock
    timekeeping/clocksource_cyc2ns: Document intended range limitation
    timekeeping: Ignore the bogus sleep time if pm_trace is enabled
    selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
    clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
    clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
    arm64: dts: rockchip: Arch counter doesn't tick in system suspend
    clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
    posix-timers: Make them configurable
    posix_cpu_timers: Move the add_device_randomness() call to a proper place
    timer: Move sys_alarm from timer.c to itimer.c
    ptp_clock: Allow for it to be optional
    Kconfig: Regenerate *.c_shipped files after previous changes
    ...
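
    For readers unfamiliar with the mul_u64_u32_shr() helper named in the
    shortlog, here is a rough user-space sketch of the arithmetic such a
    helper performs: (a * mul) >> shift, computed in two 32-bit halves so
    the upper bits of the intermediate product are not lost. This is an
    illustration and not the kernel's implementation; the _sketch suffix is
    a hypothetical name, and it assumes shift is at most 32 and that the
    final result fits in 64 bits.

```c
#include <stdint.h>

/*
 * Compute (a * mul) >> shift without a 128-bit type.
 * Assumptions for this sketch: 0 <= shift <= 32, result fits in 64 bits.
 */
static uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul,
                                       unsigned int shift)
{
    uint32_t al = (uint32_t)a;          /* low 32 bits of a  */
    uint32_t ah = (uint32_t)(a >> 32);  /* high 32 bits of a */

    /* Low partial product, shifted down. */
    uint64_t ret = ((uint64_t)al * mul) >> shift;

    /* High partial product carries a weight of 2^32, so shift it up. */
    if (ah)
        ret += ((uint64_t)ah * mul) << (32 - shift);

    return ret;
}
```

    With a = 2^40, mul = 2^30 and shift = 30, a plain 64-bit multiply would
    overflow before the shift, whereas the split version yields 2^40.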

    Linus Torvalds
     

22 Nov, 2016

1 commit

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

16 Nov, 2016

2 commits

  • Some embedded systems have no use for them. This removes about
    25KB from the kernel binary size when configured out.

    Corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls which should be enough of a clue if they were
    disabled without proper consideration. They are: timer_create,
    timer_gettime, timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
    CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
    majority of use cases with very little code.
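
    The idea of a minimal wrapper that accepts only a few clock IDs can be
    sketched in user space as follows. This is an assumption-laden
    illustration, not the kernel's stub: stub_clock_gettime() is a
    hypothetical name, it forwards to the C library's clock_gettime(), and
    CLOCK_BOOTTIME is accepted only where the headers define it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * Hypothetical wrapper: serve only the three clocks named in the commit
 * message and reject everything else with -EINVAL.
 */
static int stub_clock_gettime(clockid_t which, struct timespec *ts)
{
    switch (which) {
    case CLOCK_REALTIME:
    case CLOCK_MONOTONIC:
#ifdef CLOCK_BOOTTIME
    case CLOCK_BOOTTIME:
#endif
        /* Forward to the C library; report failure as a negative errno. */
        return clock_gettime(which, ts) == 0 ? 0 : -errno;
    default:
        return -EINVAL;
    }
}
```

    A caller asking for CLOCK_MONOTONIC gets a timestamp, while a request
    for CLOCK_PROCESS_CPUTIME_ID is rejected with -EINVAL in this sketch.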

    Signed-off-by: Nicolas Pitre
    Acked-by: Richard Cochran
    Acked-by: Thomas Gleixner
    Acked-by: John Stultz
    Reviewed-by: Josh Triplett
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     
  • There is no logical relation between add_device_randomness() and
    posix_cpu_timers_exit(). Let's move the former to where the latter
    is called. This way, when posix-cpu-timers.c is compiled out, there
    is no need to worry about losing a call to add_device_randomness().

    Signed-off-by: Nicolas Pitre
    Acked-by: John Stultz
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Richard Cochran
    Cc: Josh Triplett
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-6-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     

08 Oct, 2016

1 commit

  • There are no users of exit_oom_victim on !current task anymore so enforce
    the API to always work on the current.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Sep, 2016

1 commit

  • Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
    context switch, we can avoid the TASK_DEAD special case currently in
    __schedule() because that avoids the extra preempt_disable() from
    schedule().

    In order to facilitate this, create a do_task_dead() helper which we
    place in the scheduler code, such that it can access __schedule().

    Also add some __noreturn annotations to the functions; there is no
    coming back from do_exit().

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Cheng Chao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: chris@chris-wilson.co.uk
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra