13 Jul, 2017

3 commits

  • Use the ascii-armor canary to prevent unterminated C string overflows
    from being able to successfully overwrite the canary, even if they
    somehow obtain the canary value.

    Inspired by execshield ascii-armor and Daniel Micay's linux-hardened
    tree.
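
    The masking idea can be sketched in user space as follows. This is only an
    illustration, assuming a 64-bit little-endian machine; the names
    CANARY_MASK_DEMO, armor_canary() and canary_has_nul() are hypothetical and
    are not taken from the patch.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: 64-bit little-endian, so the low-order byte is the one
 * stored at the lowest address, i.e. the first byte a string copy hits. */
#define CANARY_MASK_DEMO UINT64_C(0xffffffffffffff00)

/* Force one byte of a random canary value to zero. */
static uint64_t armor_canary(uint64_t random_value)
{
    return random_value & CANARY_MASK_DEMO;
}

/* Return 1 if the in-memory form of the canary contains a NUL byte.
 * A copy that stops at NUL cannot write such a value and keep going. */
static int canary_has_nul(uint64_t canary)
{
    unsigned char bytes[sizeof(canary)];

    memcpy(bytes, &canary, sizeof(canary));
    return memchr(bytes, 0, sizeof(bytes)) != NULL;
}
```

    Because the armored value always contains a NUL byte, an unterminated
    string overflow that knows the canary still cannot reproduce it and
    continue writing past it.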

    Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Kees Cook
    Cc: Daniel Micay
    Cc: "Theodore Ts'o"
    Cc: H. Peter Anvin
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Catalin Marinas
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Add /proc/self/task/<tid>/fail-nth file that allows failing
    0-th, 1-st, 2-nd and so on calls systematically.
    Excerpt from the added documentation:

    "Write to this file of integer N makes N-th call in the current task
    fail (N is 0-based). Read from this file returns a single char 'Y' or
    'N' that says if the fault setup with a previous write to this file
    was injected or not, and disables the fault if it wasn't yet injected.
    Note that this file enables all types of faults (slab, futex, etc).
    This setting takes precedence over all other generic settings like
    probability, interval, times, etc. But per-capability settings (e.g.
    fail_futex/ignore-private) take precedence over it. This feature is
    intended for systematic testing of faults in a single system call. See
    an example below"

    Why add a new setting:
    1. Existing settings are global rather than per-task.
    So parallel testing is not possible.
    2. attr->interval is close but it depends on attr->count
    which is not reset to 0, so interval does not work as expected.
    3. Trying to model this with existing settings requires manipulations
    of all of probability, interval, times, space, task-filter and
    unexposed count and per-task make-it-fail files.
    4. Existing settings are per-failure-type, and the set of failure
    types is potentially expanding.
    5. make-it-fail can't be changed by an unprivileged user, and aggressive
    stress testing is better done from an unprivileged user.
    Similarly, this would require opening the debugfs files to the
    unprivileged user, who would need to reopen at least the times file
    (not possible to pre-open before dropping privs).

    The proposed interface solves all of the above (see the example).
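
    A minimal user-space sketch of driving the interface described above. It
    assumes a kernel built with fault injection support so that the file
    exists; the helper names fail_nth_path() and arm_fail_nth() are
    hypothetical, and N is treated as 0-based as stated in the documentation
    excerpt.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Build "/proc/self/task/<tid>/fail-nth" into buf. */
static int fail_nth_path(char *buf, size_t len, long tid)
{
    return snprintf(buf, len, "/proc/self/task/%ld/fail-nth", tid);
}

/* Ask the kernel to fail the n-th call made by thread `tid`.
 * Returns 0 on success, -1 if the file is missing or the write fails. */
static int arm_fail_nth(long tid, int n)
{
    char path[64];
    char val[16];
    int fd, len;
    ssize_t written;

    fail_nth_path(path, sizeof(path), tid);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    len = snprintf(val, sizeof(val), "%d", n);
    written = write(fd, val, (size_t)len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

    A test harness would typically call arm_fail_nth() with increasing N
    before repeating the same system call, then read the file back to see
    whether the fault was injected.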

    We want to integrate this into the syzkaller fuzzer. A prototype found
    10 bugs in the kernel in its first day of use:

    https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

    I've made the current interface work with all types of our sandboxes.
    For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
    make /proc entries non-root owned. So I am fine with the current
    version of the code.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Akinobu Mita
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • The reason to disable interrupts seems to be to avoid switching to a
    different processor while handling per cpu data using individual loads and
    stores. If we use per cpu RMW (read-modify-write) primitives we will not
    have to disable interrupts.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

10 Jul, 2017

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "This scheduler update provides:

    - The (hopefully) final fix for the vtime accounting issues which
    were around for quite some time

    - Use types known to user space in UAPI headers to unbreak user space
    builds

    - Make load balancing respect the current scheduling domain again
    instead of evaluating unrelated CPUs"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/headers/uapi: Fix linux/sched/types.h userspace compilation errors
    sched/fair: Fix load_balance() affinity redo path
    sched/cputime: Accumulate vtime on top of nsec clocksource
    sched/cputime: Move the vtime task fields to their own struct
    sched/cputime: Rename vtime fields
    sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
    vtime, sched/cputime: Remove vtime_account_user()
    Revert "sched/cputime: Refactor the cputime_adjust() code"

    Linus Torvalds
     

07 Jul, 2017

1 commit


05 Jul, 2017

2 commits

  • We are about to add vtime accumulation fields to the task struct. Let's
    avoid more bloatification and gather the vtime information into its own
    struct.

    Tested-by: Luiz Capitulino
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The current "snapshot" based naming on vtime fields suggests we record
    some past event but that's a low level picture of their actual purpose
    which comes out blurry. The real point of these fields is to run a basic
    state machine that tracks down cputime entry while switching between
    contexts.

    So let's reflect that with more meaningful names.

    Tested-by: Luiz Capitulino
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

27 May, 2017

1 commit


23 May, 2017

1 commit

  • If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
    fails in copy_process() between calling dup_task_struct() and setting
    p->set_child_tid, then the value of p->set_child_tid will be inherited
    from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

    The problem started showing up with commit 1da5c46fa965 since it reused
    ->set_child_tid for the kthread worker data.

    A better long-term solution might be to get rid of the ->set_child_tid
    abuse. The comment in set_kthread_struct() also looks slightly wrong.
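
    The failure mode can be modelled in user space as below. This is a
    simplified sketch, not the kernel code: struct task_demo and the helper
    names are hypothetical, and clearing the inherited pointer before any
    error path is shown only as one plausible way to avoid the premature
    free, since the fix itself is not quoted in this log.

```c
#include <stdlib.h>

struct task_demo {
    void *set_child_tid;  /* reused as the kthread data pointer */
    int is_kthread;
};

/* Like dup_task_struct(): the child starts as a copy of the parent,
 * so it inherits the parent's set_child_tid pointer. */
static struct task_demo *dup_task(const struct task_demo *parent)
{
    struct task_demo *t = malloc(sizeof(*t));

    if (t)
        *t = *parent;
    return t;
}

/* Like free_task(): for kthreads, also frees the data pointer. */
static void free_task_demo(struct task_demo *t)
{
    if (t->is_kthread)
        free(t->set_child_tid);
    free(t);
}

/* Simulate a fork that fails after the task struct was duplicated.
 * With clear_early == 0 the parent's data would be freed here. */
static void failed_fork(const struct task_demo *parent, int clear_early)
{
    struct task_demo *child = dup_task(parent);

    if (!child)
        return;
    if (clear_early)
        child->set_child_tid = NULL;  /* drop the inherited pointer */
    free_task_demo(child);            /* error path */
}
```

    With clear_early set, the error path frees only the child; without it,
    the parent is left holding a dangling pointer, which corresponds to the
    use-after-free in the call chain above.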

    Debugged-by: Jamie Iles
    Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
    Signed-off-by: Vegard Nossum
    Acked-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Greg Kroah-Hartman
    Cc: Andy Lutomirski
    Cc: Frederic Weisbecker
    Cc: Jamie Iles
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
    Signed-off-by: Thomas Gleixner

    Vegard Nossum
     

14 May, 2017

1 commit

  • Imagine we have a pid namespace and a task from its parent's pid_ns
    that has called setns() into the pid namespace. The task is doing
    fork() while the pid namespace's child reaper is dying. There is a
    race between them:

    Task from parent pid_ns             Child reaper
    copy_process()                      ..
      alloc_pid()                       ..
      ..                                zap_pid_ns_processes()
      ..                                  disable_pid_allocation()
      ..                                  read_lock(&tasklist_lock)
      ..                                  iterate over pids in pid_ns
      ..                                    kill tasks linked to pids
      ..                                  read_unlock(&tasklist_lock)
      write_lock_irq(&tasklist_lock);   ..
      attach_pid(p, PIDTYPE_PID);       ..
      ..                                ..

    So the newly created task p won't receive the SIGKILL signal,
    and the pid namespace will be in a contradictory state.
    Only a manual kill will help there, but does userspace
    care about this? I suppose most users just inject
    a task into a pid namespace and wait for a SIGCHLD from it.

    The patch fixes the problem. It simply checks for
    (pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
    We do it under the tasklist_lock, and can't skip
    PIDNS_HASH_ADDING as noted by Oleg:

    "zap_pid_ns_processes() does disable_pid_allocation()
    and then takes tasklist_lock to kill the whole namespace.
    Given that copy_process() checks PIDNS_HASH_ADDING
    under write_lock(tasklist) they can't race;
    if copy_process() takes this lock first, the new child will
    be killed, otherwise copy_process() can't miss
    the change in ->nr_hashed."

    If allocation is disabled, we just return -ENOMEM,
    as alloc_pid() does in such cases.
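
    The check described above can be sketched as follows. This is a
    single-threaded model only: struct pid_ns_demo, the flag value and the
    helper names are assumptions for illustration, and the locking is reduced
    to comments.

```c
#include <errno.h>

/* Assumption: the "allocation still allowed" flag lives in the top bit
 * of the counter, as the (nr_hashed & PIDNS_HASH_ADDING) test suggests. */
#define PIDNS_HASH_ADDING_DEMO (1U << 31)

struct pid_ns_demo {
    unsigned int nr_hashed;
};

/* Reaper side: run before taking tasklist_lock to kill the namespace. */
static void disable_pid_allocation_demo(struct pid_ns_demo *ns)
{
    ns->nr_hashed &= ~PIDNS_HASH_ADDING_DEMO;
}

/* Fork side: in the kernel this runs with tasklist_lock write-held,
 * so it either sees the cleared flag or the reaper later sees the task. */
static int attach_new_task_demo(struct pid_ns_demo *ns)
{
    if (!(ns->nr_hashed & PIDNS_HASH_ADDING_DEMO))
        return -ENOMEM;  /* namespace is being torn down */

    ns->nr_hashed++;     /* account for the newly attached pid */
    return 0;
}
```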

    v2: Do not move disable_pid_allocation(), do not
    introduce a new variable in copy_process() and simplify
    the patch as suggested by Oleg Nesterov.
    Account for the double irq enabling problem
    found by Eric W. Biederman.

    Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies")
    Signed-off-by: Kirill Tkhai
    CC: Andrew Morton
    CC: Ingo Molnar
    CC: Peter Zijlstra
    CC: Oleg Nesterov
    CC: Mike Rapoport
    CC: Michal Hocko
    CC: Andy Lutomirski
    CC: "Eric W. Biederman"
    CC: Andrei Vagin
    CC: Cyrill Gorcunov
    CC: Serge Hallyn
    Cc: stable@vger.kernel.org
    Acked-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman

    Kirill Tkhai
     

13 May, 2017

1 commit


11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

09 May, 2017

2 commits

  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to pass __GFP_HIGHMEM along with other flags, because there is no
    reason to consume precious low memory on CONFIG_HIGHMEM systems for pages
    which are mapped into the kernel vmalloc space. About half of the users
    don't use this flag, though, which suggests the API is unnecessarily
    complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.
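
    In effect, the allocator now adds the flag on behalf of the caller. A toy
    sketch of that idea, with hypothetical flag values and names that do not
    match the kernel's definitions:

```c
/* Hypothetical stand-ins for gfp_t and two of its flag bits. */
typedef unsigned int gfp_demo_t;

#define DEMO_GFP_KERNEL  0x01u
#define DEMO_GFP_HIGHMEM 0x02u

/* Flags used for pages that back a vmalloc mapping: whatever the
 * caller passed, plus the highmem bit added implicitly. */
static gfp_demo_t vmalloc_page_gfp(gfp_demo_t caller_flags)
{
    return caller_flags | DEMO_GFP_HIGHMEM;
}
```

    Callers that already passed the flag get the same result as before, which
    is why they can simply drop it.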

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • With virtually mapped stacks, kernel stacks are allocated via vmalloc.

    In the current implementation, two stacks per CPU can be cached when
    tasks are freed, and the cached stacks are reused when tasks are
    duplicated. But the cached stacks may remain unfreed even when a CPU
    is offline. Adding a CPU hotplug callback that frees the cached stacks
    when a CPU goes offline ensures their pages are not wasted.
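
    A user-space model of the per-CPU cache and the offline callback. The
    array size, NR_CPUS_DEMO and the function names are hypothetical; malloc()
    and free() stand in for the vmalloc-based stack allocation.

```c
#include <stdlib.h>

#define NR_CPUS_DEMO     4  /* hypothetical CPU count */
#define NR_CACHED_STACKS 2  /* "two stacks per cpu" from the text */

static void *cached_stacks[NR_CPUS_DEMO][NR_CACHED_STACKS];

/* Park a freed stack in this CPU's cache.
 * Returns 1 if cached, 0 if the cache is full and the caller must free. */
static int cache_stack(int cpu, void *stack)
{
    for (int i = 0; i < NR_CACHED_STACKS; i++) {
        if (!cached_stacks[cpu][i]) {
            cached_stacks[cpu][i] = stack;
            return 1;
        }
    }
    return 0;
}

/* Hotplug-style callback: release everything cached for a CPU that
 * goes offline. Returns the number of stacks freed. */
static int free_cached_stacks(int cpu)
{
    int freed = 0;

    for (int i = 0; i < NR_CACHED_STACKS; i++) {
        if (cached_stacks[cpu][i]) {
            free(cached_stacks[cpu][i]);
            cached_stacks[cpu][i] = NULL;
            freed++;
        }
    }
    return freed;
}
```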

    Link: http://lkml.kernel.org/r/1487076043-17802-1-git-send-email-hoeun.ryu@gmail.com
    Signed-off-by: Hoeun Ryu
    Reviewed-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Mateusz Guzik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hoeun Ryu
     

05 May, 2017

1 commit

  • …o 64 bits on 64-bit platforms

    The stack canary is an 'unsigned long' and should be fully initialized to
    random data rather than only 32 bits of random data.

    Signed-off-by: Daniel Micay <danielmicay@gmail.com>
    Acked-by: Arjan van de Ven <arjan@linux.intel.com>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Kees Cook <keescook@chromium.org>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: kernel-hardening@lists.openwall.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Daniel Micay
     

03 May, 2017

2 commits

  • Pull security subsystem updates from James Morris:
    "Highlights:

    IMA:
    - provide ">" and "<" ...

    * ... of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
    tpm: Fix reference count to main device
    tpm_tis: convert to using locality callbacks
    tpm: fix handling of the TPM 2.0 event logs
    tpm_crb: remove a cruft constant
    keys: select CONFIG_CRYPTO when selecting DH / KDF
    apparmor: Make path_max parameter readonly
    apparmor: fix parameters so that the permission test is bypassed at boot
    apparmor: fix invalid reference to index variable of iterator line 836
    apparmor: use SHASH_DESC_ON_STACK
    security/apparmor/lsm.c: set debug messages
    apparmor: fix boolreturn.cocci warnings
    Smack: Use GFP_KERNEL for smk_netlbl_mls().
    smack: fix double free in smack_parse_opts_str()
    KEYS: add SP800-56A KDF support for DH
    KEYS: Keyring asymmetric key restrict method with chaining
    KEYS: Restrict asymmetric key linkage using a specific keychain
    KEYS: Add a lookup_restriction function for the asymmetric key type
    KEYS: Add KEYCTL_RESTRICT_KEYRING
    KEYS: Consistent ordering for __key_link_begin and restrict check
    KEYS: Add an optional lookup_restriction hook to key_type
    ...

    Linus Torvalds
     
  • Pull livepatch updates from Jiri Kosina:

    - a per-task consistency model is being added for architectures that
    support reliable stack dumping (extending this, currently rather
    trivial set, is currently in the works).

    This extends the nature of the types of patches that can be applied
    by live patching infrastructure. The code stems from the design
    proposal made [1] back in November 2014. It's a hybrid of SUSE's
    kGraft and RH's kpatch, combining advantages of both: it uses
    kGraft's per-task consistency and syscall barrier switching combined
    with kpatch's stack trace switching. There are also a number of
    fallback options which make it quite flexible.

    Most of the heavy lifting done by Josh Poimboeuf with help from
    Miroslav Benes and Petr Mladek

    [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

    - module load time patch optimization from Zhou Chengming

    - a few assorted small fixes

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: add missing printk newlines
    livepatch: Cancel transition a safe way for immediate patches
    livepatch: Reduce the time of finding module symbols
    livepatch: make klp_mutex proper part of API
    livepatch: allow removal of a disabled patch
    livepatch: add /proc/<pid>/patch_state
    livepatch: change to a per-task consistency model
    livepatch: store function sizes
    livepatch: use kstrtobool() in enabled_store()
    livepatch: move patching functions into patch.c
    livepatch: remove unnecessary object loaded check
    livepatch: separate enabled and patched states
    livepatch/s390: add TIF_PATCH_PENDING thread flag
    livepatch/s390: reorganize TIF thread flag bits
    livepatch/powerpc: add TIF_PATCH_PENDING thread flag
    livepatch/x86: add TIF_PATCH_PENDING thread flag
    livepatch: create temporary klp_update_patch_state() stub
    x86/entry: define _TIF_ALLWORK_MASK flags explicitly
    stacktrace/x86: add function for detecting reliable stack traces

    Linus Torvalds
     

02 May, 2017

1 commit

  • Pull perf updates from Ingo Molnar:
    "The main changes in this cycle were:

    Kernel side changes:

    - Kprobes and uprobes changes:
      - Make their trampolines read-only while they are used
      - Make UPROBES_EVENTS default-y which is the distro practice
      - Apply misc fixes and robustization to probe point insertion.

    - add support for AMD IOMMU events

    - extend hw events on Intel Goldmont CPUs

    - ... plus misc fixes and updates.

    Tooling side changes:

    - support s390 jump instructions in perf annotate (Christian
    Borntraeger)

    - vendor hardware events updates (Andi Kleen)

    - add argument support for SDT events in powerpc (Ravi Bangoria)

    - beautify the statx syscall arguments in 'perf trace' (Arnaldo
    Carvalho de Melo)

    - handle inline functions in callchains (Jin Yao)

    - enable sorting by srcline as key (Milian Wolff)

    - add 'brstackinsn' field in 'perf script' to reuse the x86
    instruction decoder used in the Intel PT code to study hot paths to
    samples (Andi Kleen)

    - add PERF_RECORD_NAMESPACES so that the kernel can record
    information required to associate samples to namespaces, helping in
    container problem characterization. (Hari Bathini)

    - allow sorting by symbol_size in 'perf report' and 'perf top'
    (Charles Baylis)

    - in perf stat, make system wide (-a) the default option if no target
      was specified and one of following conditions is met:
      - no workload specified (current behaviour)
      - a workload is specified but all requested events are system wide
        ones, like uncore ones. (Jiri Olsa)

    - ... plus lots of other updates, enhancements, cleanups and fixes"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (235 commits)
    perf tools: Fix the code to strip command name
    tools arch x86: Sync cpufeatures.h
    tools arch: Sync arch/x86/lib/memcpy_64.S with the kernel
    tools: Update asm-generic/mman-common.h copy from the kernel
    perf tools: Use just forward declarations for struct thread where possible
    perf tools: Add the right header to obtain PERF_ALIGN()
    perf tools: Remove poll.h and wait.h from util.h
    perf tools: Remove string.h, unistd.h and sys/stat.h from util.h
    perf tools: Remove stale prototypes from builtin.h
    perf tools: Remove string.h from util.h
    perf tools: Remove sys/ioctl.h from util.h
    perf tools: Remove a few more needless includes from util.h
    perf tools: Include sys/param.h where needed
    perf callchain: Move callchain specific routines from util.[ch]
    perf tools: Add compress.h for the *_decompress_to_file() headers
    perf mem: Fix display of data source snoop indication
    perf debug: Move dump_stack() and sighandler_dump_stack() to debug.h
    perf kvm: Make function only used by 'perf kvm' static
    perf tools: Move timestamp routines from util.h to time-utils.h
    perf tools: Move units conversion/formatting routines to separate object
    ...

    Linus Torvalds
     

19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.
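
    Type safety without an existence guarantee means a reader must revalidate
    an object after pinning it, because the memory may have been reused for
    another object of the same type. A simplified, single-threaded sketch of
    that pattern; struct obj_demo and the helpers are hypothetical, and real
    code would use atomic reference counts inside an RCU read-side critical
    section.

```c
#include <stddef.h>

struct obj_demo {
    int key;       /* identity of the object */
    int refcount;  /* 0 means the slot is currently free */
};

/* Take a reference unless the object has already been freed. */
static int get_ref_unless_zero(struct obj_demo *o)
{
    if (o->refcount == 0)
        return 0;
    o->refcount++;
    return 1;
}

static void put_ref(struct obj_demo *o)
{
    o->refcount--;
}

/* Pin a candidate found by a lockless lookup, then check that it is
 * still the object we wanted. Returns NULL if it was freed or reused. */
static struct obj_demo *pin_and_validate(struct obj_demo *o, int wanted_key)
{
    if (!get_ref_unless_zero(o))
        return NULL;
    if (o->key != wanted_key) {  /* memory reused for another object */
        put_ref(o);
        return NULL;
    }
    return o;
}
```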

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc:
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

04 Apr, 2017

1 commit

  • A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

    This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
    are protected only by pi_lock when operating on pi waiters, while
    rt_mutex_get_top_task() accesses them with the rq lock held but
    without holding pi_lock.

    To fix this, introduce a new "pi_top_task" pointer cached in
    task_struct, and add a new rt_mutex_update_top_task() to update
    its value. It can be called by rt_mutex_setprio(), which holds
    both the owner's pi_lock and the rq lock. Thus "pi_top_task"
    can be safely accessed by enqueue_task_dl() under the rq lock.

    Originally-From: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

28 Mar, 2017

1 commit

  • We switched from "struct task_struct"->security to "struct cred"->security
    in Linux 2.6.29. But not all LSM modules were happy with that change.
    The TOMOYO LSM module is one example that wants a per "struct task_struct"
    security blob, because TOMOYO's security context is defined based on
    "struct task_struct" rather than "struct cred". The AppArmor LSM module is
    another example, because AppArmor is currently abusing the cred a little
    bit to store the change_hat and setexeccon info. Although the
    security_task_free() hook was revived in Linux 3.4 because the Yama LSM
    module wanted to release a per "struct task_struct" security blob, the
    security_task_alloc() hook and the "struct task_struct"->security field
    were not revived. Nowadays, we are getting proposals for lightweight LSM
    modules that want to use a per "struct task_struct" security blob.

    We already allow multiple concurrent LSM modules (up to one fully
    armored module which uses the "struct cred"->security field or exclusive
    hooks like security_xfrm_state_pol_flow_match(), plus an unlimited number
    of lightweight modules which use neither "struct cred"->security nor
    exclusive hooks) as long as they are built into the kernel. But this patch
    does not implement a variable length "struct task_struct"->security field,
    which will become necessary when multiple LSM modules want to use the
    "struct task_struct"->security field. Although it won't be difficult to
    implement a variable length "struct task_struct"->security field, let's
    think about it after this patch is merged.

    Signed-off-by: Tetsuo Handa
    Acked-by: John Johansen
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Tested-by: Djalal Harouni
    Acked-by: José Bollo
    Cc: Paul Moore
    Cc: Stephen Smalley
    Cc: Eric Paris
    Cc: Kees Cook
    Cc: James Morris
    Cc: José Bollo
    Signed-off-by: James Morris

    Tetsuo Handa
     

14 Mar, 2017

1 commit

  • With the advent of container technologies like docker, which depend on
    namespaces for isolation, there is a need for tracing support for
    namespaces. This patch introduces a new PERF_RECORD_NAMESPACES event for
    recording namespace-related info. By recording info for every
    namespace, it is left to userspace to decide on the definition of a
    container and to trace containers by updating the perf tool accordingly.

    Each namespace has a combination of device and inode numbers. Though
    every namespace has the same device number currently, that may change in
    the future to avoid the need for a namespace of namespaces. Considering
    that possibility, record both device and inode numbers separately for
    each namespace.
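
    From user space, the same (device, inode) pair can be observed by calling
    stat() on a namespace link. A small sketch; struct ns_link_info_demo and
    read_ns_link_info() are hypothetical names chosen for this example.

```c
#include <sys/stat.h>

/* Device and inode numbers that together identify a namespace. */
struct ns_link_info_demo {
    unsigned long long dev;
    unsigned long long ino;
};

/* Fill `out` from the file at `path`.
 * Returns 0 on success, -1 if stat() fails. */
static int read_ns_link_info(const char *path, struct ns_link_info_demo *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    out->dev = (unsigned long long)st.st_dev;
    out->ino = (unsigned long long)st.st_ino;
    return 0;
}
```

    On Linux it would typically be called with a path such as
    "/proc/self/ns/pid"; two tasks share a namespace when both numbers match.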

    Signed-off-by: Hari Bathini
    Acked-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Ananth N Mavinakayanahalli
    Cc: Aravinda Prasad
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: Eric Biederman
    Cc: Sargun Dhillon
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Hari Bathini
     

08 Mar, 2017

1 commit

  • Change livepatch to use a basic per-task consistency model. This is the
    foundation which will eventually enable us to patch those ~10% of
    security patches which change function or data semantics. This is the
    biggest remaining piece needed to make livepatch more generally useful.

    This code stems from the design proposal made by Vojtech [1] in November
    2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
    consistency and syscall barrier switching combined with kpatch's stack
    trace switching. There are also a number of fallback options which make
    it quite flexible.

    Patches are applied on a per-task basis, when the task is deemed safe to
    switch over. When a patch is enabled, livepatch enters into a
    transition state where tasks are converging to the patched state.
    Usually this transition state can complete in a few seconds. The same
    sequence occurs when a patch is disabled, except the tasks converge from
    the patched state to the unpatched state.

    An interrupt handler inherits the patched state of the task it
    interrupts. The same is true for forked tasks: the child inherits the
    patched state of the parent.

    Livepatch uses several complementary approaches to determine when it's
    safe to patch tasks:

    1. The first and most effective approach is stack checking of sleeping
    tasks. If no affected functions are on the stack of a given task,
    the task is patched. In most cases this will patch most or all of
    the tasks on the first try. Otherwise it'll keep trying
    periodically. This option is only available if the architecture has
    reliable stacks (HAVE_RELIABLE_STACKTRACE).

    2. The second approach, if needed, is kernel exit switching. A
    task is switched when it returns to user space from a system call, a
    user space IRQ, or a signal. It's useful in the following cases:

    a) Patching I/O-bound user tasks which are sleeping on an affected
    function. In this case you have to send SIGSTOP and SIGCONT to
    force it to exit the kernel and be patched.
    b) Patching CPU-bound user tasks. If the task is highly CPU-bound
    then it will get patched the next time it gets interrupted by an
    IRQ.
    c) In the future it could be useful for applying patches for
    architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
    this case you would have to signal most of the tasks on the
    system. However this isn't supported yet because there's
    currently no way to patch kthreads without
    HAVE_RELIABLE_STACKTRACE.

    3. For idle "swapper" tasks, since they don't ever exit the kernel, they
    instead have a klp_update_patch_state() call in the idle loop which
    allows them to be patched before the CPU enters the idle state.

    (Note there's not yet such an approach for kthreads.)
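
    The stack check in approach 1 boils down to asking whether any return
    address on a task's stack falls inside a function that is being patched.
    A user-space sketch of that test; struct patched_func_demo and
    task_stack_is_safe() are hypothetical names, and the addresses are plain
    integers rather than a real stack trace.

```c
#include <stddef.h>

/* Address range occupied by one function affected by the patch. */
struct patched_func_demo {
    unsigned long addr;
    unsigned long size;
};

/* Return 1 if no entry of the stack trace lies inside any affected
 * function (the task can be switched now), 0 otherwise (retry later). */
static int task_stack_is_safe(const unsigned long *trace, size_t depth,
                              const struct patched_func_demo *funcs,
                              size_t nfuncs)
{
    for (size_t i = 0; i < depth; i++) {
        for (size_t j = 0; j < nfuncs; j++) {
            if (trace[i] >= funcs[j].addr &&
                trace[i] < funcs[j].addr + funcs[j].size)
                return 0;
        }
    }
    return 1;
}
```

    This only works when the trace itself can be trusted, which is why the
    approach depends on HAVE_RELIABLE_STACKTRACE.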

    All the above approaches may be skipped by setting the 'immediate' flag
    in the 'klp_patch' struct, which will disable per-task consistency and
    patch all tasks immediately. This can be useful if the patch doesn't
    change any function or data semantics. Note that, even with this flag
    set, it's possible that some tasks may still be running with an old
    version of the function, until that function returns.

    There's also an 'immediate' flag in the 'klp_func' struct which allows
    you to specify that certain functions in the patch can be applied
    without per-task consistency. This might be useful if you want to patch
    a common function like schedule(), and the function change doesn't need
    consistency but the rest of the patch does.

    For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
    must set patch->immediate which causes all tasks to be patched
    immediately. This option should be used with care, only when the patch
    doesn't change any function or data semantics.

    In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
    may be allowed to use per-task consistency if we can come up with
    another way to patch kthreads.

    The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
    is in transition. Only a single patch (the topmost patch on the stack)
    can be in transition at a given time. A patch can remain in transition
    indefinitely, if any of the tasks are stuck in the initial patch state.

    A transition can be reversed and effectively canceled by writing the
    opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
    the transition is in progress. Then all the tasks will attempt to
    converge back to the original patch state.

    [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

    Signed-off-by: Josh Poimboeuf
    Acked-by: Miroslav Benes
    Acked-by: Ingo Molnar # for the scheduler changes
    Signed-off-by: Jiri Kosina

    Josh Poimboeuf
     

03 Mar, 2017

1 commit


02 Mar, 2017

11 commits

  • …linux/sched/cputime.h>

    Introduce a trivial, mostly empty <linux/sched/cputime.h> header
    to prepare for the moving of cputime functionality out of sched.h.

    Update all code that relies on these facilities.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Fix up missing #includes in other places that rely on sched.h doing that for them.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • threadgroup_change_begin()/end() is a pointless wrapper around
    cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
    in the !CONFIG_CGROUPS=y case.

    Remove the wrappery, move the might_sleep() (the down_read()
    already has a might_sleep() check).

    This debloats a bit and simplifies this API.

    Update all call sites.

    No change in functionality.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
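
    For readers who want to check what the two sed expressions do to a
    line of C, the same substitutions can be sketched in Python. The
    function name convert_line is hypothetical; the regular expressions
    are direct translations of the sed patterns above (sed's \( \) groups
    become plain parentheses, and the literal parentheses are escaped).

```python
import re

# atomic_inc(&X->mm_users);  ->  mmget(X);
ARROW = re.compile(r"atomic_inc\(&(.*)->mm_users\);")
# atomic_inc(&X.mm_users);   ->  mmget(&X);
DOT = re.compile(r"atomic_inc\(&(.*)\.mm_users\);")


def convert_line(line: str) -> str:
    """Apply both substitutions to one line, first match only (no sed 'g' flag)."""
    line = ARROW.sub(r"mmget(\1);", line, count=1)
    line = DOT.sub(r"mmget(&\1);", line, count=1)
    return line


if __name__ == "__main__":
    print(convert_line("\tatomic_inc(&mm->mm_users);"))
    print(convert_line("\tatomic_inc(&init_mm.mm_users);"))
```

    The pointer form drops the address-of operator because the helper
    takes the pointer itself, while the struct form keeps it.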

    Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that, when a
    network namespace exits, purges its sysctl entries from the dcache, as
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec, allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm, allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount, as the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace make it possible
    to discuss how to fix the worst case complexity of umount. The stacked
    filesystem fixes pave the way for adding multiple mappings for the
    filesystem uids so that efficient and safer containers can be
    implemented"
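
    The PR_SET_CHILD_SUBREAPER fix described in the message above can be
    pictured with a toy process tree. This is only an illustrative model,
    not kernel code: the Task class and the fork/set_child_subreaper
    functions are hypothetical, and the only names borrowed from the
    commit list below are the has_child_subreaper flag and the idea of
    walking the process tree.

```python
class Task:
    """A toy task with just the fields needed for the illustration."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.is_child_subreaper = False   # this task called the prctl
        self.has_child_subreaper = False  # some ancestor is a subreaper


def fork(parent, name):
    """Create a child; it inherits 'an ancestor is a subreaper' from the parent."""
    child = Task(name, parent)
    child.has_child_subreaper = (
        parent.has_child_subreaper or parent.is_child_subreaper
    )
    parent.children.append(child)
    return child


def set_child_subreaper(task):
    """Mark task as subreaper and propagate the flag to every descendant,
    including those forked before the call."""
    task.is_child_subreaper = True
    stack = list(task.children)
    while stack:
        t = stack.pop()
        t.has_child_subreaper = True
        stack.extend(t.children)
```

    Without the walk in set_child_subreaper(), only children created by
    fork() after the call would see the flag, which is the behavior the
    message describes as the long standing bug.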

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

23 Feb, 2017

1 commit

  • When the mm with uffd-ed vmas fork()-s, the respective vmas notify their
    uffds with an event which contains a descriptor for the new uffd. This
    new descriptor can then be used to get events from the child and
    populate its mm with data. Note that there can be different uffd-s
    controlling different vmas within one mm, so first we should collect all
    those uffds (and ctx-s) in a list and then notify them all one by one,
    but only once per fork().

    The context is created at fork() time but the descriptor, file struct
    and anon inode object are created at event read time. So some trickery
    is added to the userfaultfd_ctx_read() to handle the ctx queues' locking
    vs file creation.

    Another thing worth noticing is that the task that fork()-s waits for
    the uffd event to get processed WITHOUT the mmap sem.
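
    The "collect first, then notify once per fork()" step can be sketched
    as below. This is a simplified model rather than the kernel code: vmas
    are represented as (name, context) pairs, contexts are plain strings,
    and both function names are hypothetical.

```python
def collect_fork_contexts(vmas):
    """Return each distinct uffd context once, in first-seen order.

    vmas is an iterable of (vma_name, ctx) pairs; ctx is None for a vma
    that is not registered with any uffd.
    """
    seen = set()
    contexts = []
    for _name, ctx in vmas:
        if ctx is None or ctx in seen:
            continue
        seen.add(ctx)
        contexts.append(ctx)
    return contexts


def notify_fork(vmas, notify):
    """Call notify(ctx) exactly once per distinct context."""
    for ctx in collect_fork_contexts(vmas):
        notify(ctx)
```

    Collecting before notifying is what keeps a context that covers
    several vmas from receiving more than one event for the same fork().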

    [aarcange@redhat.com: build warning fix]
    Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com
    Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

21 Feb, 2017

2 commits

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Implement wraparound-safe refcount_t and kref_t types based on
    generic atomic primitives (Peter Zijlstra)

    - Improve and fix the ww_mutex code (Nicolai Hähnle)

    - Add self-tests to the ww_mutex code (Chris Wilson)

    - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
    Bueso)

    - Micro-optimize the current-task logic all around the core kernel
    (Davidlohr Bueso)

    - Tidy up after recent optimizations: remove stale code and APIs,
    clean up the code (Waiman Long)

    - ... plus misc fixes, updates and cleanups"
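
    As a rough illustration of what "wraparound-safe" means for a
    reference count, the toy class below saturates instead of wrapping.
    It is a simplified model under stated assumptions, not the kernel's
    refcount_t implementation: the class name, method names and the
    32-bit limit are chosen for the example only.

```python
UINT_MAX = 0xFFFFFFFF  # assumed 32-bit counter for the illustration


class Refcount:
    """Toy saturating reference count."""

    def __init__(self, value=1):
        self.value = value

    def inc_not_zero(self):
        """Take a reference unless the object is already dead (count 0)."""
        if self.value == 0:
            return False          # never resurrect a freed object
        if self.value == UINT_MAX:
            return True           # saturated: stay pinned, do not wrap to 0
        self.value += 1
        return True

    def dec_and_test(self):
        """Drop a reference; return True only when the count reaches 0."""
        if self.value == UINT_MAX:
            return False          # saturated counts are never released
        if self.value == 0:
            return False          # refuse to underflow
        self.value -= 1
        return self.value == 0
```

    The trade-off in this model is that a saturated object is leaked
    rather than freed, which is preferable to a wrapped counter that
    reaches zero while references are still held.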

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
    fork: Fix task_struct alignment
    locking/spinlock/debug: Remove spinlock lockup detection code
    lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
    lkdtm: Convert to refcount_t testing
    kref: Implement 'struct kref' using refcount_t
    refcount_t: Introduce a special purpose refcount type
    sched/wake_q: Clarify queue reinit comment
    sched/wait, rcuwait: Fix typo in comment
    locking/mutex: Fix lockdep_assert_held() fail
    locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
    locking/rwsem: Reinit wake_q after use
    locking/rwsem: Remove unnecessary atomic_long_t casts
    jump_labels: Move header guard #endif down where it belongs
    locking/atomic, kref: Implement kref_put_lock()
    locking/ww_mutex: Turn off __must_check for now
    locking/atomic, kref: Avoid more abuse
    locking/atomic, kref: Use kref_get_unless_zero() more
    locking/atomic, kref: Kill kref_sub()
    locking/atomic, kref: Add kref_read()
    locking/atomic, kref: Add KREF_INIT()
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this (fairly busy) cycle were:

    - There was a class of scheduler bugs related to forgetting to update
    the rq-clock timestamp, which can cause weird and hard to debug
    problems, so there's a new debug facility for this. It uncovered
    a whole lot of bugs, which convinced us that we want to keep the
    debug facility.

    (Peter Zijlstra, Matt Fleming)

    - Various cputime related updates: eliminate cputime and use u64
    nanoseconds directly, simplify and improve the arch interfaces,
    implement delayed accounting more widely, etc. (Frederic
    Weisbecker)

    - Move code around for better structure plus cleanups (Ingo Molnar)

    - Move IO schedule accounting deeper into the scheduler plus related
    changes to improve the situation (Tejun Heo)

    - ... plus a round of sched/rt and sched/deadline fixes, plus other
    fixes, updates and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
    sched/core: Remove unlikely() annotation from sched_move_task()
    sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
    sched/topology: Split out scheduler topology code from core.c into topology.c
    sched/core: Remove unnecessary #include headers
    sched/rq_clock: Consolidate the ordering of the rq_clock methods
    delayacct: Include
    sched/core: Clean up comments
    sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
    sched/clock: Add dummy clear_sched_clock_stable() stub function
    sched/cputime: Remove generic asm headers
    sched/cputime: Remove unused nsec_to_cputime()
    s390, sched/cputime: Remove unused cputime definitions
    powerpc, sched/cputime: Remove unused cputime definitions
    s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
    ia64, sched/cputime: Remove unused cputime definitions
    ia64: Convert vtime to use nsec units directly
    ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
    sched/cputime: Remove jiffies based cputime
    sched/cputime, vtime: Return nsecs instead of cputime_t to account
    sched/cputime: Complete nsec conversion of tick based accounting
    ...

    Linus Torvalds