02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
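
    For illustration, a tagged file simply starts with the one-line tag in
    place of the multi-paragraph GPL notice (the file name here is
    hypothetical):

    // SPDX-License-Identifier: GPL-2.0
    /*
     * example.c - the tag above replaces the full GPL-2.0 boilerplate
     */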

    This patch is based on work done by Thomas Gleixner, Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files, created by Philippe Ombredanne.
    Philippe prepared the base worksheet and did an initial spot review
    of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    The criteria used to select files for SPDX license identifier tagging
    were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
      >5 lines of source.
    - File already had some variant of a license header in it (even if <5
      lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Jul, 2017

1 commit

  • Back in the dim distant past, the task_struct structure's RCU-related
    fields optionally included those needed for CONFIG_RCU_BOOST, even in
    CONFIG_PREEMPT_RCU builds. The INIT_TASK_RCU_TREE_PREEMPT() macro was
    used to provide initializers for those optional CONFIG_RCU_BOOST fields.
    However, the CONFIG_RCU_BOOST fields are now included unconditionally
    in CONFIG_PREEMPT_RCU builds, so there is no longer any need for the
    INIT_TASK_RCU_TREE_PREEMPT() macro. This commit therefore removes it
    in favor of initializing the ->rcu_blocked_node field directly in the
    INIT_TASK_RCU_PREEMPT() macro.
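
    Roughly, the consolidated initializer then looks like this (a sketch
    based on the description above, not the exact upstream source):

    #define INIT_TASK_RCU_PREEMPT(tsk) \
            .rcu_read_lock_nesting = 0, \
            .rcu_read_unlock_special.s = 0, \
            .rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry), \
            .rcu_blocked_node = NULL,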

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

05 Jul, 2017

2 commits

  • We are about to add vtime accumulation fields to the task struct. To
    avoid bloating it further, let's gather the vtime information into its
    own struct.
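
    A minimal sketch of the idea (member names assumed for illustration,
    not taken verbatim from the patch):

    struct vtime {
            seqcount_t              seqcount;   /* guards the fields below */
            unsigned long long      starttime;  /* when accounting started */
            u64                     utime;      /* accumulated user time */
            u64                     stime;      /* accumulated system time */
    };

    struct task_struct {
            /* ... */
            struct vtime            vtime;      /* replaces scattered fields */
            /* ... */
    };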

    Tested-by: Luiz Capitulino
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The current "snapshot"-based naming of the vtime fields suggests that
    we record some past event, but that is a blurry, low-level picture of
    their actual purpose. The real point of these fields is to run a basic
    state machine that tracks cputime entries while switching between
    contexts.

    So let's reflect that with more meaningful names.
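
    For example (the old snapshot-style names versus the new state-machine
    names; exact identifiers assumed from the description):

    /* before: "when did we last snapshot cputime?" */
    unsigned long long      vtime_snap;
    int                     vtime_snap_whence;

    /* after: "which context are we accounting, and since when?" */
    unsigned long long      vtime_starttime;
    int                     vtime_state;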

    Tested-by: Luiz Capitulino
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

03 May, 2017

2 commits

  • Pull security subsystem updates from James Morris:
    "Highlights:

    IMA:
    - provide ">" and "<" operators for fowner/uid/euid rules"

    * git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
    tpm: Fix reference count to main device
    tpm_tis: convert to using locality callbacks
    tpm: fix handling of the TPM 2.0 event logs
    tpm_crb: remove a cruft constant
    keys: select CONFIG_CRYPTO when selecting DH / KDF
    apparmor: Make path_max parameter readonly
    apparmor: fix parameters so that the permission test is bypassed at boot
    apparmor: fix invalid reference to index variable of iterator line 836
    apparmor: use SHASH_DESC_ON_STACK
    security/apparmor/lsm.c: set debug messages
    apparmor: fix boolreturn.cocci warnings
    Smack: Use GFP_KERNEL for smk_netlbl_mls().
    smack: fix double free in smack_parse_opts_str()
    KEYS: add SP800-56A KDF support for DH
    KEYS: Keyring asymmetric key restrict method with chaining
    KEYS: Restrict asymmetric key linkage using a specific keychain
    KEYS: Add a lookup_restriction function for the asymmetric key type
    KEYS: Add KEYCTL_RESTRICT_KEYRING
    KEYS: Consistent ordering for __key_link_begin and restrict check
    KEYS: Add an optional lookup_restriction hook to key_type
    ...

    Linus Torvalds
     
  • Pull livepatch updates from Jiri Kosina:

    - a per-task consistency model is being added for architectures that
    support reliable stack dumping (extending this currently rather
    trivial set is in the works).

    This extends the types of patches that can be applied by the live
    patching infrastructure. The code stems from the design
    proposal made [1] back in November 2014. It's a hybrid of SUSE's
    kGraft and RH's kpatch, combining advantages of both: it uses
    kGraft's per-task consistency and syscall barrier switching combined
    with kpatch's stack trace switching. There are also a number of
    fallback options which make it quite flexible.

    Most of the heavy lifting was done by Josh Poimboeuf, with help from
    Miroslav Benes and Petr Mladek.

    [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

    - module load time patch optimization from Zhou Chengming

    - a few assorted small fixes

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: add missing printk newlines
    livepatch: Cancel transition a safe way for immediate patches
    livepatch: Reduce the time of finding module symbols
    livepatch: make klp_mutex proper part of API
    livepatch: allow removal of a disabled patch
    livepatch: add /proc/<pid>/patch_state
    livepatch: change to a per-task consistency model
    livepatch: store function sizes
    livepatch: use kstrtobool() in enabled_store()
    livepatch: move patching functions into patch.c
    livepatch: remove unnecessary object loaded check
    livepatch: separate enabled and patched states
    livepatch/s390: add TIF_PATCH_PENDING thread flag
    livepatch/s390: reorganize TIF thread flag bits
    livepatch/powerpc: add TIF_PATCH_PENDING thread flag
    livepatch/x86: add TIF_PATCH_PENDING thread flag
    livepatch: create temporary klp_update_patch_state() stub
    x86/entry: define _TIF_ALLWORK_MASK flags explicitly
    stacktrace/x86: add function for detecting reliable stack traces

    Linus Torvalds
     

04 Apr, 2017

1 commit

  • A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

    This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
    are only protected by pi_lock when operating on pi waiters, while
    rt_mutex_get_top_task() will access them with the rq lock held but
    without holding pi_lock.

    In order to tackle it, we introduce a new "pi_top_task" pointer
    cached in task_struct, and add a new rt_mutex_update_top_task() to
    update its value. It can be called by rt_mutex_setprio(), which holds
    both the owner's pi_lock and the rq lock, so "pi_top_task" can be
    safely accessed by enqueue_task_dl() under the rq lock.
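
    A sketch of the new helper, following the description above (the
    exact body is assumed):

    /*
     * Called with both the owner's pi_lock and the rq lock held, e.g.
     * from rt_mutex_setprio(), so readers holding only the rq lock see
     * a stable pointer.
     */
    void rt_mutex_update_top_task(struct task_struct *p)
    {
            p->pi_top_task = task_has_pi_waiters(p)
                             ? task_top_pi_waiter(p)->task
                             : NULL;
    }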

    Originally-From: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

28 Mar, 2017

1 commit

  • We switched from "struct task_struct"->security to "struct cred"->security
    in Linux 2.6.29, but not all LSM modules were happy with that change.
    The TOMOYO LSM module is an example which wants to use a per "struct
    task_struct" security blob, for TOMOYO's security context is defined
    based on "struct task_struct" rather than "struct cred". The AppArmor
    LSM module is another example which wants to use it, for AppArmor is
    currently abusing the cred a little bit to store the change_hat and
    setexeccon info. Although the security_task_free() hook was revived in
    Linux 3.4 because the Yama LSM module wanted to release a per "struct
    task_struct" security blob, the security_task_alloc() hook and the
    "struct task_struct"->security field were not revived. Nowadays, we are
    getting proposals of lightweight LSM modules which want to use a per
    "struct task_struct" security blob.

    We already allow multiple concurrent LSM modules (up to one fully
    armored module which uses the "struct cred"->security field or
    exclusive hooks like security_xfrm_state_pol_flow_match(), plus an
    unlimited number of lightweight modules which use neither "struct
    cred"->security nor exclusive hooks) as long as they are built into
    the kernel. But this patch does not implement a variable-length
    "struct task_struct"->security field, which will become needed once
    multiple LSM modules want to use it. Although it won't be difficult
    to implement a variable-length field, let's think about that after
    this patch is merged.
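
    A lightweight module can then manage its blob from the new hook pair,
    roughly like this (module name and blob layout are made up for
    illustration):

    struct example_task_blob {
            u32 flags;
    };

    static int example_task_alloc(struct task_struct *task,
                                  unsigned long clone_flags)
    {
            task->security = kzalloc(sizeof(struct example_task_blob),
                                     GFP_KERNEL);
            return task->security ? 0 : -ENOMEM;
    }

    static void example_task_free(struct task_struct *task)
    {
            kfree(task->security);
            task->security = NULL;
    }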

    Signed-off-by: Tetsuo Handa
    Acked-by: John Johansen
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Tested-by: Djalal Harouni
    Acked-by: José Bollo
    Cc: Paul Moore
    Cc: Stephen Smalley
    Cc: Eric Paris
    Cc: Kees Cook
    Cc: James Morris
    Cc: José Bollo
    Signed-off-by: James Morris

    Tetsuo Handa
     

08 Mar, 2017

1 commit

  • Change livepatch to use a basic per-task consistency model. This is the
    foundation which will eventually enable us to patch those ~10% of
    security patches which change function or data semantics. This is the
    biggest remaining piece needed to make livepatch more generally useful.

    This code stems from the design proposal made by Vojtech [1] in November
    2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
    consistency and syscall barrier switching combined with kpatch's stack
    trace switching. There are also a number of fallback options which make
    it quite flexible.

    Patches are applied on a per-task basis, when the task is deemed safe to
    switch over. When a patch is enabled, livepatch enters into a
    transition state where tasks are converging to the patched state.
    Usually this transition state can complete in a few seconds. The same
    sequence occurs when a patch is disabled, except the tasks converge from
    the patched state to the unpatched state.

    An interrupt handler inherits the patched state of the task it
    interrupts. The same is true for forked tasks: the child inherits the
    patched state of the parent.

    Livepatch uses several complementary approaches to determine when it's
    safe to patch tasks:

    1. The first and most effective approach is stack checking of sleeping
    tasks. If no affected functions are on the stack of a given task,
    the task is patched. In most cases this will patch most or all of
    the tasks on the first try. Otherwise it'll keep trying
    periodically. This option is only available if the architecture has
    reliable stacks (HAVE_RELIABLE_STACKTRACE).

    2. The second approach, if needed, is kernel exit switching. A
    task is switched when it returns to user space from a system call, a
    user space IRQ, or a signal. It's useful in the following cases:

    a) Patching I/O-bound user tasks which are sleeping on an affected
    function. In this case you have to send SIGSTOP and SIGCONT to
    force it to exit the kernel and be patched.
    b) Patching CPU-bound user tasks. If the task is highly CPU-bound
    then it will get patched the next time it gets interrupted by an
    IRQ.
    c) In the future it could be useful for applying patches for
    architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
    this case you would have to signal most of the tasks on the
    system. However this isn't supported yet because there's
    currently no way to patch kthreads without
    HAVE_RELIABLE_STACKTRACE.

    3. For idle "swapper" tasks, since they don't ever exit the kernel, they
    instead have a klp_update_patch_state() call in the idle loop which
    allows them to be patched before the CPU enters the idle state.

    (Note there's not yet such an approach for kthreads.)

    All the above approaches may be skipped by setting the 'immediate' flag
    in the 'klp_patch' struct, which will disable per-task consistency and
    patch all tasks immediately. This can be useful if the patch doesn't
    change any function or data semantics. Note that, even with this flag
    set, it's possible that some tasks may still be running with an old
    version of the function, until that function returns.

    There's also an 'immediate' flag in the 'klp_func' struct which allows
    you to specify that certain functions in the patch can be applied
    without per-task consistency. This might be useful if you want to patch
    a common function like schedule(), and the function change doesn't need
    consistency but the rest of the patch does.
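
    As a sketch, a patch module might mix the two flags like this (the
    replacement function and symbol names are illustrative only):

    /* replacement implementation provided elsewhere in the module */
    static void livepatch_schedule(void);

    static struct klp_func funcs[] = {
            {
                    .old_name  = "schedule",
                    .new_func  = livepatch_schedule,
                    .immediate = true,  /* this function needs no consistency */
            },
            { }
    };

    static struct klp_object objs[] = {
            { .funcs = funcs },         /* NULL .name means vmlinux */
            { }
    };

    static struct klp_patch patch = {
            .mod       = THIS_MODULE,
            .objs      = objs,
            .immediate = false,         /* the rest uses per-task consistency */
    };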

    For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
    must set patch->immediate which causes all tasks to be patched
    immediately. This option should be used with care, only when the patch
    doesn't change any function or data semantics.

    In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
    may be allowed to use per-task consistency if we can come up with
    another way to patch kthreads.

    The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
    is in transition. Only a single patch (the topmost patch on the stack)
    can be in transition at a given time. A patch can remain in transition
    indefinitely, if any of the tasks are stuck in the initial patch state.

    A transition can be reversed and effectively canceled by writing the
    opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
    the transition is in progress. Then all the tasks will attempt to
    converge back to the original patch state.

    [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

    Signed-off-by: Josh Poimboeuf
    Acked-by: Miroslav Benes
    Acked-by: Ingo Molnar # for the scheduler changes
    Signed-off-by: Jiri Kosina

    Josh Poimboeuf
     

16 Sep, 2016

1 commit

  • We currently keep every task's stack around until the task_struct
    itself is freed. This means that we keep the stack allocation alive
    for longer than necessary and that, under load, we free stacks in
    big batches whenever RCU drops the last task reference. Neither of
    these is good for reuse of cache-hot memory, and freeing in batches
    prevents us from usefully caching small numbers of vmalloced stacks.

    On architectures that have thread_info on the stack, we can't easily
    change this, but on architectures that set THREAD_INFO_IN_TASK, we
    can free it as soon as the task is dead.
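
    One way to picture the change (helper and field names assumed, not
    the exact patch):

    /*
     * With THREAD_INFO_IN_TASK, nothing on the stack is needed once the
     * task is dead, so the stack can be returned to the allocator (or a
     * small cache) without waiting for the RCU-delayed task_struct free.
     */
    static void release_task_stack_sketch(struct task_struct *tsk)
    {
            free_thread_stack(tsk->stack);      /* assumed helper */
            tsk->stack = NULL;
    }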

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

15 Sep, 2016

1 commit

  • If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK_STRUCT,
    then thread_info is defined as a single 'u32 flags' and is the first
    entry of task_struct. thread_info::task is removed (it serves no
    purpose if thread_info is embedded in task_struct), and
    thread_info::cpu gets its own slot in task_struct.
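
    Conceptually (a sketch, not the exact upstream definitions):

    struct thread_info {
            u32 flags;                          /* the only remaining member */
    };

    struct task_struct {
    #ifdef CONFIG_THREAD_INFO_IN_TASK_STRUCT
            struct thread_info thread_info;     /* must stay the first entry */
    #endif
            /* ... */
            unsigned int cpu;                   /* previously thread_info::cpu */
            /* ... */
    };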

    This is heavily based on a patch written by Linus.

    Originally-from: Linus Torvalds
    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

25 Jun, 2016

1 commit

  • The INIT_TASK() initializer had the same confusion about the stack vs
    thread_info allocation that the allocators had, which was fixed in
    commit b235beea9e99 ("Clarify naming of thread info/stack allocators").

    The task ->stack pointer only incidentally ends up having the same value
    as the thread_info, and in fact that will change.

    So fix the initial task struct initializer to point to 'init_stack'
    instead of 'init_thread_info', and make sure the ia64 definition for
    that exists.

    This actually makes the ia64 tsk->stack pointer be sensible for the
    initial task, but not for any other task. As mentioned in commit
    b235beea9e99, that whole pointer isn't actually used on ia64, since
    task_stack_page() there just points to the (single) allocation.

    All the other architectures seem to have copied the 'init_stack'
    definition, even though it tended to be generally unused.
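
    In other words, only the stack field of the initializer changes (a
    sketch with unrelated fields elided):

    #define INIT_TASK(tsk)                                          \
    {                                                               \
            /* ... */                                               \
            .stack = init_stack,    /* was: &init_thread_info */    \
            /* ... */                                               \
    }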

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Dec, 2015

1 commit

  • The cputime can only be updated by the current task itself, even in
    the vtime case, so we can safely use a seqcount instead of a seqlock,
    as there is no writer concurrency involved.
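
    The single writer then serializes updates with the cheaper seqcount
    API, and readers retry instead of locking (a sketch; vtime_seqcount
    is the field this patch introduces):

    /* writer: only the task itself, so no writer-vs-writer exclusion */
    write_seqcount_begin(&tsk->vtime_seqcount);
    tsk->vtime_snap = jiffies;
    write_seqcount_end(&tsk->vtime_seqcount);

    /* reader: lockless retry loop */
    unsigned int seq;
    unsigned long long snap;

    do {
            seq  = read_seqcount_begin(&tsk->vtime_seqcount);
            snap = tsk->vtime_snap;
    } while (read_seqcount_retry(&tsk->vtime_seqcount, seq));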

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Hiroshi Shimamoto
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

15 Oct, 2015

2 commits

  • It was found while running a database workload on large systems that
    significant time was spent trying to acquire the sighand lock.

    The issue was that whenever an itimer expired, many threads ended up
    simultaneously trying to send the signal. Most of the time, nothing
    happened after acquiring the sighand lock because another thread
    had already sent the signal and updated the "next expire" time.
    fastpath_timer_check() didn't help much, since the "next expire"
    time was only updated after the threads exit fastpath_timer_check().

    This patch addresses this by having the thread_group_cputimer structure
    maintain a boolean to signify when a thread in the group is already
    checking for process-wide timers, and adds extra logic in the fastpath
    to check the boolean.
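
    The fastpath test then becomes a cheap unlocked read (a sketch of the
    added logic; the set/clear of the flag under sighand lock is omitted):

    /* in fastpath_timer_check(): */
    if (READ_ONCE(sig->cputimer.checking_timer))
            return false;   /* another thread is already checking */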

    Signed-off-by: Jason Low
    Reviewed-by: Oleg Nesterov
    Reviewed-by: George Spelvin
    Cc: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: hideaki.kimura@hpe.com
    Cc: terry.rudd@hpe.com
    Cc: scott.norton@hpe.com
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.com
    Signed-off-by: Thomas Gleixner

    Jason Low
     
  • In the next patch in this series, a new field 'checking_timer' will
    be added to 'struct thread_group_cputimer'. Both this and the
    existing 'running' integer field are just used as boolean values. To
    save space in the structure, we can make both of these fields booleans.

    This is a preparatory patch to convert the existing running integer
    field to a boolean.

    Suggested-by: George Spelvin
    Signed-off-by: Jason Low
    Reviewed-by: George Spelvin
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: hideaki.kimura@hpe.com
    Cc: terry.rudd@hpe.com
    Cc: scott.norton@hpe.com
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Thomas Gleixner

    Jason Low
     

17 Sep, 2015

1 commit

  • Note: This commit was originally committed as d59cfc09c32a but got
    reverted by 0c986253b939 due to the performance regression from
    the percpu_rwsem write down/up operations added to the cgroup task
    migration path. percpu_rwsem changes which alleviate the
    performance issue are pending for the v4.4-rc1 merge window.
    Re-apply.

    The cgroup side of threadgroup locking uses signal_struct->group_rwsem
    to synchronize against threadgroup changes. This per-process rwsem
    adds small overhead to thread creation, exit and exec paths, forces
    cgroup code paths to do lock-verify-unlock-retry dance in a couple
    places and makes it impossible to atomically perform operations across
    multiple processes.

    This patch replaces signal_struct->group_rwsem with a global
    percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
    side and contained in cgroups proper. This patch converts one-to-one.

    This does make the writer side heavier and lowers the granularity;
    however, cgroup process migration is a fairly cold path, we do want
    to optimize thread operations over it, and cgroup migration
    operations don't take enough time for the lower granularity to
    matter.
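
    Roughly, the substitution looks like this (a sketch using the names
    from the description):

    /* global, replaces the per-process signal_struct->group_rwsem */
    static struct percpu_rw_semaphore cgroup_threadgroup_rwsem;

    /* hot paths (fork/exit/exec) take the cheap reader side */
    percpu_down_read(&cgroup_threadgroup_rwsem);
    /* ... create or tear down a thread ... */
    percpu_up_read(&cgroup_threadgroup_rwsem);

    /* the cold path (cgroup migration) takes the expensive writer side */
    percpu_down_write(&cgroup_threadgroup_rwsem);
    /* ... migrate all threads of a process atomically ... */
    percpu_up_write(&cgroup_threadgroup_rwsem);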

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

16 Sep, 2015

1 commit

  • This reverts commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14.

    d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
    a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
    threadgroup locking") changed how cgroup synchronizes against task
    fork and exits so that it uses global percpu_rwsem instead of
    per-process rwsem; unfortunately, the write [un]lock paths of
    percpu_rwsem always involve synchronize_rcu_expedited() which turned
    out to be too expensive.

    Improvements for percpu_rwsem are scheduled to be merged in the coming
    v4.4-rc1 merge window which alleviates this issue. For now, revert
    the two commits to restore per-process rwsem. They will be re-applied
    for the v4.4-rc1 merge window.

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
    Reported-by: Christian Borntraeger
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Paolo Bonzini
    Cc: stable@vger.kernel.org # v4.2+

    Tejun Heo
     

03 Aug, 2015

1 commit

  • While the current code guarantees monotonicity for stime and utime
    independently of one another, it does not guarantee that the sum of
    both is equal to the total time we started out with.

    This confuses tools (and people) that look at this sum, like top,
    which will report >100% usage followed by a matching period of 0%.

    Rework the code to provide both individual monotonicity and a coherent
    sum.
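
    A sketch of the reworked adjustment ('prev' holds the values handed
    out last time; scale_stime() splits rtime in the observed stime:utime
    ratio):

    stime = scale_stime(stime, rtime, stime + utime);
    if (stime < prev->stime)
            stime = prev->stime;        /* keep stime monotonic */
    utime = rtime - stime;              /* keep the sum equal to rtime */
    if (utime < prev->utime) {
            utime = prev->utime;        /* keep utime monotonic */
            stime = rtime - utime;      /* ... and re-fix the sum */
    }
    prev->stime = stime;
    prev->utime = utime;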

    Suggested-by: Fredrik Markstrom
    Reported-by: Fredrik Markstrom
    Tested-by: Fredrik Markstrom
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Cc: jason.low2@hp.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Jun, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     

27 May, 2015

1 commit

  • The cgroup side of threadgroup locking uses signal_struct->group_rwsem
    to synchronize against threadgroup changes. This per-process rwsem
    adds small overhead to thread creation, exit and exec paths, forces
    cgroup code paths to do lock-verify-unlock-retry dance in a couple
    places and makes it impossible to atomically perform operations across
    multiple processes.

    This patch replaces signal_struct->group_rwsem with a global
    percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
    side and contained in cgroups proper. This patch converts one-to-one.

    This does make the writer side heavier and lowers the granularity;
    however, cgroup process migration is a fairly cold path, we do want
    to optimize thread operations over it, and cgroup migration
    operations don't take enough time for the lower granularity to
    matter.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

08 May, 2015

2 commits

  • Recent optimizations were made to thread_group_cputimer to improve its
    scalability by keeping track of cputime stats without a lock. However,
    the values were open-coded in the structure, placing them at a
    different abstraction level from the regular task_cputime structure.
    Furthermore, any subsequent similar optimizations would not be able to
    share the new code, since they are specific to thread_group_cputimer.

    This patch adds the new task_cputime_atomic data structure (introduced in
    the previous patch in the series) to thread_group_cputimer for keeping
    track of the cputime atomically, which also helps generalize the code.
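
    Per the description, the new structure is simply the atomic twin of
    task_cputime (a sketch):

    struct task_cputime_atomic {
            atomic64_t utime;
            atomic64_t stime;
            atomic64_t sum_exec_runtime;
    };

    struct thread_group_cputimer {
            struct task_cputime_atomic cputime_atomic;
            int running;
            /* ... */
    };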

    Suggested-by: Ingo Molnar
    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • While running a database workload, we found a scalability issue with itimers.

    Much of the problem was caused by the thread_group_cputimer spinlock.
    Each time we account for group system/user time, we need to obtain a
    thread_group_cputimer's spinlock to update the timers. On larger
    systems (such as a 16-socket machine), more than 30% of total time was
    spent trying to obtain this kernel lock to update these group timer
    stats.

    This patch converts the timers to 64-bit atomic variables and uses
    atomic add to update them without a lock. With this patch, the percent
    of total time spent updating thread group cputimer timers was reduced
    from 30% down to less than 1%.

    Note: On 32-bit systems using the generic 64-bit atomics, this causes
    sample_group_cputimer() to take locks three times instead of just once.
    However, we tested this patch on a 32-bit ARM system using the
    generic atomics and did not find the overhead to be much of an issue.
    An explanation for why this isn't an issue is that 32-bit systems
    usually have small numbers of CPUs, and cacheline contention from
    extra spinlocks taken periodically is not really apparent on smaller
    systems.
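
    The accounting update then reduces to a lock-free addition (a sketch;
    the helper name is made up):

    static void account_group_user_time_sketch(struct task_struct *tsk,
                                               u64 cputime)
    {
            struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

            if (!READ_ONCE(cputimer->running))
                    return;
            atomic64_add(cputime, &cputimer->cputime_atomic.utime);
    }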

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Scott J Norton
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

14 Feb, 2015

1 commit

  • Stack instrumentation allows detecting out-of-bounds memory accesses
    to variables allocated on the stack. The compiler adds redzones around
    every variable on the stack and poisons the redzones in the function's
    prologue.

    Such an approach significantly increases stack usage, so all in-kernel
    stack sizes were doubled.
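
    On x86-64, for instance, the doubling is expressed by bumping the
    stack order when KASAN is enabled (a sketch of the arrangement):

    #ifdef CONFIG_KASAN
    #define KASAN_STACK_ORDER 1
    #else
    #define KASAN_STACK_ORDER 0
    #endif

    #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
    #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)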

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

13 Feb, 2015

1 commit

  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.
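
    The net effect is that the block lives in task_struct and is reached
    through current rather than through the stack-adjacent thread_info (a
    sketch):

    struct task_struct {
            /* ... */
            struct restart_block restart_block;
            /* ... */
    };

    /* arch code now does, e.g.: */
    current->restart_block.fn = do_no_restart_syscall;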

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

10 Dec, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

    This instruments might_sleep() checks to catch places that nest
    blocking primitives - such as mutex usage in a wait loop; a sketch
    of this pattern follows below. Such bugs can result in hard-to-debug
    races/hangs.

    Another category of invalid nesting that this facility will detect
    is the calling of blocking functions from within schedule() ->
    sched_submit_work() -> blk_schedule_flush_plug().

    There's some potential for false positives (if secondary blocking
    primitives themselves are not ready yet for this facility), but the
    kernel will warn once about such bugs per bootup, so the warning
    isn't much of a nuisance.

    This feature comes with a number of fixes, for problems uncovered
    with it, so no messages are expected normally.

    - Another round of sched/numa optimizations and refinements, for
    CONFIG_NUMA_BALANCING=y.

    - Another round of sched/dl fixes and refinements.

    Plus various smaller fixes and cleanups"
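
    The classic shape of the nested-sleep bug mentioned above looks like
    this (a minimal sketch):

    /* a wait loop that has already set a sleeping task state ... */
    set_current_state(TASK_UNINTERRUPTIBLE);
    while (!done) {
            /*
             * ... then calls a blocking primitive: might_sleep() in
             * mutex_lock() now warns, because sleeping here silently
             * resets the task state and can lose the wakeup.
             */
            mutex_lock(&lock);
            /* ... */
            mutex_unlock(&lock);
            schedule();
    }
    __set_current_state(TASK_RUNNING);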

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched: Add missing rcu protection to wake_up_all_idle_cpus
    sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
    sched/numa: Init numa balancing fields of init_task
    sched/deadline: Remove unnecessary definitions in cpudeadline.h
    sched/cpupri: Remove unnecessary definitions in cpupri.h
    sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
    sched/fair: Fix stale overloaded status in the busiest group finding logic
    sched: Move p->nr_cpus_allowed check to select_task_rq()
    sched/completion: Document when to use wait_for_completion_io_*()
    sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
    sched/fair: Kill task_struct::numa_entry and numa_group::task_list
    sched: Refactor task_struct to use numa_faults instead of numa_* pointers
    sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
    sched/deadline: Reschedule from switched_from_dl() after a successful pull
    sched/deadline: Push task away if the deadline is equal to curr during wakeup
    sched/deadline: Add deadline rq status print
    sched/deadline: Fix artificial overrun introduced by yield_task_dl()
    sched/rt: Clean up check_preempt_equal_prio()
    sched/core: Use dl_bw_of() under rcu_read_lock_sched()
    sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
    ...

    Linus Torvalds
     

16 Nov, 2014

1 commit

  • We do not initialize init_task.numa_preferred_nid, but this value
    is inherited by the userspace "init" process:

    rest_init()->kernel_thread(kernel_init)->do_fork(CLONE_VM);

    __sched_fork()
    {
            if (clone_flags & CLONE_VM)
                    p->numa_preferred_nid = current->numa_preferred_nid;
            else
                    p->numa_preferred_nid = -1;
    }

    kernel_init() becomes the userspace "init" process.

    So we propagate a garbage nid to userspace, and it may be used
    during NUMA balancing.

    Currently, we have no reports of this causing a problem, but it
    seems we should set it to be sure.

    Even if init_task.numa_preferred_nid is zero, we may meet a weird
    configuration without nid#0. On sparc64, where processors are
    numbered physically, I saw a machine without cpu#1, while cpu#2
    existed. Possibly something similar may happen with NUMA nodes.
    So, let's initialize it and be sure we're safe.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eric Paris
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Sergey Dyasly
    Link: http://lkml.kernel.org/r/1415699189.15631.6.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

08 Sep, 2014

3 commits

  • The rcu_preempt_note_context_switch() function is on a scheduling fast
    path, so it would be good to avoid disabling irqs. The reason that irqs
    are disabled is to synchronize process-level and irq-handler access to
    the task_struct ->rcu_read_unlock_special bitmask. This commit
    therefore makes ->rcu_read_unlock_special instead be a union of bools
    overlaid with a short, allowing single-access checks in RCU's
    __rcu_read_unlock(). This results
    in the process-level and irq-handler accesses being simple loads and
    stores, so that irqs need no longer be disabled. This commit therefore
    removes the irq disabling from rcu_preempt_note_context_switch().
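
    Per the description, the bitmask becomes something like (a sketch):

    union rcu_special {
            struct {
                    bool blocked;   /* blocked in an RCU read-side section */
                    bool need_qs;   /* a quiescent state is needed */
            } b;
            short s;                /* one load/store checks/clears both */
    };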

    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
    usermode execution. There would be neither a context switch nor a
    scheduling-clock interrupt to tell TASKS_RCU that the task in question
    had passed through a quiescent state. The grace period would therefore
    extend indefinitely. This commit therefore makes RCU's dyntick-idle
    subsystem record the task_struct structure of the task that is running
    in dyntick-idle mode on each CPU. The TASKS_RCU grace period can
    then access this information and record a quiescent state on
    behalf of any CPU running in dyntick-idle usermode.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit adds a new RCU-tasks flavor of RCU, which provides
    call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
    context switch (not preemption!) and userspace execution (not the idle
    loop -- use some sort of schedule_on_each_cpu() if you need to handle
    the idle tasks). Note that unlike other RCU flavors, these quiescent
    states occur in tasks, not necessarily CPUs. Includes fixes from
    Steven Rostedt.

    This RCU flavor is assumed to have very infrequent latency-tolerant
    updaters. This assumption permits significant simplifications, including
    a single global callback list protected by a single global lock, along
    with a single task-private linked list containing all tasks that have not
    yet passed through a quiescent state. If experience shows this assumption
    to be incorrect, the required additional complexity will be added.
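
    Usage mirrors the other RCU flavors (a sketch; the object type and
    callback are made up):

    static void my_free_cb(struct rcu_head *rhp)
    {
            struct my_obj *obj = container_of(rhp, struct my_obj, rh);

            kfree(obj);
    }

    /* ... after unlinking obj so no new references can be taken: */
    call_rcu_tasks(&obj->rh, my_free_cb);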

    Suggested-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

10 Jul, 2014

1 commit

  • RCU priority boosting currently checks for boosting via a pointer in
    task_struct. However, this is not needed: As Oleg noted, if the
    rt_mutex is placed in the rcu_node instead of on the booster's stack,
    the boostee can simply check it to see if it owns the lock. This commit
    makes this change, shrinking task_struct by one pointer and the kernel
    by thirteen lines.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

24 Jan, 2014

1 commit

  • Pull audit update from Eric Paris:
    "Again we stayed pretty well contained inside the audit system.
    Venturing out was fixing a couple of function prototypes which were
    inconsistent (didn't hurt anything, but we used the same value as an
    int, uint, u32, and I think even a long in a couple of places).

    We also made a couple of minor changes to when a couple of LSMs called
    the audit system. We hoped to add aarch64 audit support this go
    round, but it wasn't ready.

    I'm disappearing on vacation on Thursday. I should have internet
    access, but it'll be spotty. If anything goes wrong please be sure to
    cc rgb@redhat.com. He'll make fixing things his top priority"

    * git://git.infradead.org/users/eparis/audit: (50 commits)
    audit: whitespace fix in kernel-parameters.txt
    audit: fix location of __net_initdata for audit_net_ops
    audit: remove pr_info for every network namespace
    audit: Modify a set of system calls in audit class definitions
    audit: Convert int limit uses to u32
    audit: Use more current logging style
    audit: Use hex_byte_pack_upper
    audit: correct a type mismatch in audit_syscall_exit()
    audit: reorder AUDIT_TTY_SET arguments
    audit: rework AUDIT_TTY_SET to only grab spin_lock once
    audit: remove needless switch in AUDIT_SET
    audit: use define's for audit version
    audit: documentation of audit= kernel parameter
    audit: wait_for_auditd rework for readability
    audit: update MAINTAINERS
    audit: log task info on feature change
    audit: fix incorrect set of audit_sock
    audit: print error message when fail to create audit socket
    audit: fix dangling keywords in audit_log_set_loginuid() output
    audit: log on errors from filter user rules
    ...

    Linus Torvalds
     

22 Jan, 2014

1 commit

  • while_each_thread() and next_thread() should die; almost every lockless
    usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits; next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader, since it can exec.

    2. Even if while_each_thread() itself was correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T; even the first next_thread()
    can point to already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.
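
    The new helper is then a plain RCU list walk off the stable head (a
    sketch matching the description):

    #define __for_each_thread(signal, t) \
            list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)

    #define for_each_thread(p, t) \
            __for_each_thread((p)->signal, t)

    /* typical use: */
    rcu_read_lock();
    for_each_thread(g, t)
            do_something(t);
    rcu_read_unlock();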

    Note: of course it is ugly to have both task_struct->thread_node and the
    old task_struct->thread_group, we will kill it later, after we change
    the users of while_each_thread() to use for_each_thread().

    Perhaps we can kill it even before we convert all users; we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because it would lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(): currently its semantics are not
    clear unless thread_group_leader(p), and we need to audit the callers
    before we can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

14 Jan, 2014

1 commit