07 Jul, 2017

2 commits

  • When updating a task's mems_allowed and rebinding its mempolicy due to
    a cpuset's mems being changed, we currently only take the seqlock for
    writing when either the task has a mempolicy or the new mems has no
    intersection with the old mems.

    This should be enough to prevent a parallel allocation seeing no
    available nodes, but the optimization is IMHO unnecessary (cpuset
    updates should not be frequent), and we still potentially risk issues if
    the intersection of the new and old nodes has a limited amount of
    free/reclaimable memory.

    Let's just use the seqlock for all tasks.
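
    A minimal sketch of the resulting update path (simplified from the
    cpuset code, not verbatim):

    static void cpuset_change_task_nodemask(struct task_struct *tsk,
                                            nodemask_t *newmems)
    {
        task_lock(tsk);

        local_irq_disable();
        write_seqcount_begin(&tsk->mems_allowed_seq);

        /* grow the mask first so readers never see it empty */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        mpol_rebind_task(tsk, newmems);
        tsk->mems_allowed = *newmems;

        write_seqcount_end(&tsk->mems_allowed_seq);
        local_irq_enable();

        task_unlock(tsk);
    }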

    Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has introduced a two-step protocol when
    rebinding a task's mempolicy due to a cpuset update, in order to avoid a
    parallel allocation seeing an empty effective nodemask and failing.

    Later, commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory
    barrier related damage v3") introduced a seqlock protection and removed
    the synchronization point between the two update steps. At that point
    (or perhaps later), the two-step rebinding became unnecessary.

    Currently it only makes sure that the update first adds new nodes in
    step 1 and then removes nodes in step 2. Without memory barriers the
    effects are questionable, and even then this cannot prevent a parallel
    zonelist iteration that checks the nodemask at each step from observing
    all nodes as unusable for allocation. We now fully rely on the seqlock to
    prevent premature OOMs and allocation failures.

    We can thus remove the two-step update parts and simplify the code.
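
    For reference, the reader side that carries the protection follows the
    usual seqcount retry pattern; a simplified sketch of what the
    read_mems_allowed_begin()/read_mems_allowed_retry() helpers do for the
    page allocator:

    unsigned int seq;
    nodemask_t nodes;

    do {
        seq = read_seqcount_begin(&current->mems_allowed_seq);
        nodes = current->mems_allowed;
        /* ... attempt the allocation restricted to 'nodes' ... */
    } while (read_seqcount_retry(&current->mems_allowed_seq, seq));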

    Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

29 Jun, 2017

2 commits

  • Currently, cgroup only supports delegation to !root users, and cgroup
    namespaces don't get any special treatment. This limits the
    usefulness of cgroup namespaces as they by themselves can't be safe
    delegation boundaries. A process inside a cgroup can change the
    resource control knobs of the parent in the namespace root and may
    move processes in and out of the namespace if cgroups outside its
    namespace are visible somehow.

    This patch adds a new mount option "nsdelegate" which makes cgroup
    namespaces delegation boundaries. If set, cgroup behaves as if write
    permission based delegation took place at namespace boundaries -
    writes to the resource control knobs from the namespace root are
    denied and migrations crossing the namespace boundary aren't allowed
    from inside the namespace.

    This allows a cgroup namespace to function as a delegation boundary by
    itself.

    v2: Silently ignore nsdelegate specified on !init mounts.
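
    For illustration, the option can be applied from the init namespace by
    remounting the v2 hierarchy; a sketch using mount(2), where the mount
    point path is an assumption:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* remount the cgroup2 hierarchy with nsdelegate enabled;
         * /sys/fs/cgroup is the conventional mount point (assumed) */
        if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT, "nsdelegate") < 0) {
            perror("remount");
            return 1;
        }
        return 0;
    }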

    Signed-off-by: Tejun Heo
    Cc: Aravind Anbudurai
    Cc: Serge Hallyn
    Cc: Eric Biederman

    Tejun Heo
     
  • Restructure cgroup_procs_write_permission() to make extending
    permission logic easier.

    This patch doesn't cause any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

15 Jun, 2017

7 commits

  • The debug controller grabs cgroup_mutex from interface file show
    functions, which can deadlock and trigger lockdep warnings. Fix it by
    using cgroup_kn_lock_live()/cgroup_kn_unlock() instead.
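
    A sketch of the pattern the fix switches to (simplified; 'of' is the
    interface file's kernfs_open_file):

    struct cgroup *cgrp = cgroup_kn_lock_live(of->kn, false);

    if (!cgrp)
        return -ENODEV;         /* the cgroup is gone, bail out */
    /* ... dump the debug state for cgrp ... */
    cgroup_kn_unlock(of->kn);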

    Signed-off-by: Tejun Heo
    Cc: Waiman Long

    Tejun Heo
     
  • Factor cgroup_masks_read_one() out of cgroup_masks_read() for
    simplicity.

    Signed-off-by: Tejun Heo
    Cc: Waiman Long

    Tejun Heo
     
  • Make debug an implicit controller on cgroup2, enabled by the
    "cgroup_debug" boot param.

    Signed-off-by: Tejun Heo
    Cc: Waiman Long

    Tejun Heo
     
  • Besides supporting cgroup v2 and thread mode, the following changes
    are also made:
    1) The current_* cgroup files now reside only at the root, as we don't
    need duplicated files of the same function all over the cgroup
    hierarchy.
    2) The cgroup_css_links_read() function is modified to report
    the number of tasks that are skipped because of overflow.
    3) The number of extra unaccounted references is displayed.
    4) The current_css_set_read() function now prints out the addresses of
    the css'es associated with the current css_set.
    5) A new cgroup_subsys_states file is added to display the css objects
    associated with a cgroup.
    6) A new cgroup_masks file is added to display the various controller
    bit masks in the cgroup.

    tj: Dropped thread mode related information for now so that debug
    controller changes aren't blocked on the thread mode.

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • Make the Kconfig prompt and description of the debug cgroup controller
    more accurate by saying that it is for debugging purposes only and its
    interfaces are unstable.

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • The debug cgroup currently resides within cgroup-v1.c and is enabled
    only for v1 cgroups. To enable the debug cgroup also for v2, it makes
    sense to put the code into its own file as it will no longer be v1
    specific. There is no change to the debug cgroup specific code.

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     
  • The reference count in the css_set data structure was used as a
    proxy for the number of tasks attached to that css_set. However, that
    count is not an accurate measure, especially with thread mode
    support. So a new variable nr_tasks is added to the css_set to keep
    track of the actual task count. This new variable is protected by
    the css_set_lock. Functions that require the actual task count are
    updated to use the new variable.

    tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
    Refreshed on top of cgroup/for-v4.13, which dropped the
    css_set_populated() -> nr_tasks conversion.
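
    A sketch of the resulting structure (simplified; only the relevant
    fields are shown):

    struct css_set {
        refcount_t refcount;    /* pure refcount, no longer a task-count proxy */
        int nr_tasks;           /* attached tasks, protected by css_set_lock */
        /* ... */
    };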

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo

    Waiman Long
     

25 May, 2017

1 commit

  • In most cases, a cgroup controller doesn't care about the lifetimes of
    cgroups. For the controller, a css becomes online when ->css_online()
    is called on it and offline when ->css_offline() is called.

    However, cpuset is special in that the user interface it exposes cares
    whether certain cgroups exist or not. Combined with the RCU delay
    between cgroup removal and css offlining, this can lead to user
    visible behavior oddities where operations which should succeed after
    cgroup removals fail for some time period. The effects of cgroup
    removals are delayed when seen from userland.

    This patch adds css_is_dying() which tests whether offline is pending
    and updates is_cpuset_online() so that the function returns false also
    while offline is pending. This gets rid of the userland visible
    delays.
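
    The new test and the updated cpuset check look roughly like this
    (lightly simplified):

    static inline bool css_is_dying(struct cgroup_subsys_state *css)
    {
        /* the kill was initiated even though ->css_offline() hasn't run */
        return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
    }

    static inline bool is_cpuset_online(const struct cpuset *cs)
    {
        return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
    }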

    Signed-off-by: Tejun Heo
    Reported-by: Daniel Jordan
    Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Tejun Heo

    Tejun Heo
     

18 May, 2017

1 commit

  • The kill_css() function may be called more than once under the condition
    that the css was killed but not physically removed yet, followed by the
    removal of the cgroup that is hosting the css. This patch prevents any
    harm from being done when that happens.
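
    A sketch of the guard (the fix records the kill in a css flag so a
    repeat call becomes a no-op):

    static void kill_css(struct cgroup_subsys_state *css)
    {
        if (css->flags & CSS_DYING)
            return;             /* already killed, nothing to do */

        css->flags |= CSS_DYING;
        /* ... proceed with the actual kill ... */
    }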

    Signed-off-by: Waiman Long
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v4.5+

    Waiman Long
     

02 May, 2017

2 commits

  • Pull cgroup updates from Tejun Heo:
    "Nothing major. Two notable fixes are Li's second stab at fixing the
    long-standing race condition in the mount path and suppression of
    spurious warning from cgroup_get(). All other changes are trivial"

    * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark cgroup_get() with __maybe_unused
    cgroup: avoid attaching a cgroup root to two different superblocks, take 2
    cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
    cgroup: move cgroup_subsys_state parent field for cache locality
    cpuset: Remove cpuset_update_active_cpus()'s parameter.
    cgroup: switch to BUG_ON()
    cgroup: drop duplicate header nsproxy.h
    kernel: convert css_set.refcount from atomic_t to refcount_t
    kernel: convert cgroup_namespace.count from atomic_t to refcount_t

    Linus Torvalds
     
  • a590b90d472f ("cgroup: fix spurious warnings on cgroup_is_dead() from
    cgroup_sk_alloc()") converted most cgroup_get() usages to
    cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
    cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
    an unused-function warning for cgroup_get().

    Silence the warning by adding __maybe_unused to cgroup_get().
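
    The annotation itself is a one-liner; roughly:

    static void __maybe_unused cgroup_get(struct cgroup *cgrp)
    {
        css_get(&cgrp->self);   /* no warning if all callers are compiled out */
    }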

    Reported-by: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
    Signed-off-by: Tejun Heo

    Tejun Heo
     

29 Apr, 2017

2 commits

  • Commit bfb0b80db5f9 ("cgroup: avoid attaching a cgroup root to two
    different superblocks") is broken. Now we try to fix the race by
    delaying the initialization of cgroup root refcnt until a superblock
    has been allocated.
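
    A plausible sketch of the approach (details simplified): start the
    refcount dead so a racing mount can't percpu_ref_tryget_live() it, and
    bring it live only once this mount owns a superblock:

    percpu_ref_init(&root->cgrp.self.refcnt, css_release,
                    PERCPU_REF_INIT_DEAD, GFP_KERNEL);
    /* ... superblock allocated successfully ... */
    percpu_ref_reinit(&root->cgrp.self.refcnt);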

    Reported-by: Dmitry Vyukov
    Reported-by: Andrei Vagin
    Tested-by: Andrei Vagin
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • cgroup_get() is expected to be called only on live cgroups and triggers
    a warning on a dead cgroup; however, cgroup_sk_alloc() may be called
    while cloning a socket which is left in an empty and removed cgroup
    and thus may legitimately duplicate its reference on a dead cgroup.
    This currently triggers the following warning spuriously.

    WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
    ...
    [] __warn+0xd3/0xf0
    [] warn_slowpath_null+0x1e/0x20
    [] cgroup_get+0x55/0x60
    [] cgroup_sk_alloc+0x51/0xe0
    [] sk_clone_lock+0x2db/0x390
    [] inet_csk_clone_lock+0x16/0xc0
    [] tcp_create_openreq_child+0x23/0x4b0
    [] tcp_v6_syn_recv_sock+0x91/0x670
    [] tcp_check_req+0x3a6/0x4e0
    [] tcp_v6_rcv+0x693/0xa00
    [] ip6_input_finish+0x59/0x3e0
    [] ip6_input+0x32/0xb0
    [] ip6_rcv_finish+0x57/0xa0
    [] ipv6_rcv+0x318/0x4d0
    [] __netif_receive_skb_core+0x2d7/0x9a0
    [] __netif_receive_skb+0x16/0x70
    [] netif_receive_skb_internal+0x23/0x80
    [] napi_gro_frags+0x208/0x270
    [] mlx4_en_process_rx_cq+0x74c/0xf40
    [] mlx4_en_poll_rx_cq+0x30/0x90
    [] net_rx_action+0x210/0x350
    [] __do_softirq+0x106/0x2c7
    [] irq_exit+0x9d/0xa0 [] do_IRQ+0x54/0xd0
    [] common_interrupt+0x7f/0x7f
    [] cpuidle_enter+0x17/0x20
    [] cpu_startup_entry+0x2a9/0x2f0
    [] start_secondary+0xf1/0x100

    This patch renames the existing cgroup_get() with the dead cgroup
    warning to cgroup_get_live() after cgroup_kn_lock_live() and
    introduces the new cgroup_get() which doesn't check whether the cgroup
    is live or dead.

    All existing cgroup_get() users except for cgroup_sk_alloc() are
    converted to use cgroup_get_live().
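
    A sketch of the resulting pair (simplified):

    static void cgroup_get(struct cgroup *cgrp)
    {
        css_get(&cgrp->self);                   /* no liveness assumption */
    }

    static void cgroup_get_live(struct cgroup *cgrp)
    {
        WARN_ON_ONCE(cgroup_is_dead(cgrp));     /* callers must hold a live cgroup */
        css_get(&cgrp->self);
    }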

    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Cc: stable@vger.kernel.org # v4.5+
    Cc: Johannes Weiner
    Reported-by: Chris Mason
    Signed-off-by: Tejun Heo

    Tejun Heo
     

12 Apr, 2017

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "This contains fixes for two long standing subtle bugs:

    - kthread_bind() on a new kthread binds it to specific CPUs and
    prevents userland from messing with the affinity or cgroup
    membership. Unfortunately, for cgroup membership, there's a window
    between kthread creation and kthread_bind*() invocation where the
    kthread can be moved into a non-root cgroup by userland.

    Depending on what controllers are in effect, this can assign the
    kthread unexpected attributes. For example, in the reported case,
    workqueue workers ended up in a non-root cpuset cgroups and had
    their CPU affinities overridden. This broke workqueue invariants
    and led to workqueue stalls.

    Fixed by closing the window between kthread creation and
    kthread_bind() as suggested by Oleg.

    - There was a bug in cgroup mount path which could allow two
    competing mount attempts to attach the same cgroup_root to two
    different superblocks.

    This was caused by mishandling return value from kernfs_pin_sb().

    Fixed"

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: avoid attaching a cgroup root to two different superblocks
    cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups

    Linus Torvalds
     

11 Apr, 2017

2 commits

  • Run this:

    touch file0
    for ((; ;))
    {
        mount -t cpuset xxx file0
    }

    And this concurrently:

    touch file1
    for ((; ;))
    {
        mount -t cpuset xxx file1
    }

    We'll trigger a warning like this:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
    percpu_ref_kill_and_confirm called more than once on css_release!
    CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
    Call Trace:
    dump_stack+0x63/0x84
    __warn+0xd1/0xf0
    warn_slowpath_fmt+0x5f/0x80
    percpu_ref_kill_and_confirm+0x92/0xb0
    cgroup_kill_sb+0x95/0xb0
    deactivate_locked_super+0x43/0x70
    deactivate_super+0x46/0x60
    ...
    ---[ end trace a79f61c2a2633700 ]---

    Here's a race:

    Thread A                                Thread B

    cgroup1_mount()
      # alloc a new cgroup root
      cgroup_setup_root()
                                            cgroup1_mount()
                                              # no sb yet, returns NULL
                                              kernfs_pin_sb()

                                              # but succeeds in getting the refcnt,
                                              # so re-use cgroup root
                                              percpu_ref_tryget_live()
      # alloc sb with cgroup root
      cgroup_do_mount()

    cgroup_kill_sb()
                                              # alloc another sb with same root
                                              cgroup_do_mount()

                                            cgroup_kill_sb()

    We end up using the same cgroup root for two different superblocks,
    so percpu_ref_kill() will be called twice on the same root when the
    two superblocks are destroyed.

    We should fix this by making sure the superblock pinning really succeeded.

    Cc: stable@vger.kernel.org # 3.16+
    Reported-by: Dmitry Vyukov
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • In cpuset_update_active_cpus(), the cpu_online parameter isn't used
    anymore. Remove it.

    Signed-off-by: Rakib Mullick
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Rakib Mullick
     

17 Mar, 2017

1 commit

  • Creation of a kthread goes through a couple of interlocked stages between
    the kthread itself and its creator. Once the new kthread starts
    running, it initializes itself and wakes up the creator. The creator
    then can further configure the kthread and then let it start doing its
    job by waking it up.

    In this configuration-by-creator stage, the creator is the only one
    that can wake it up but the kthread is visible to userland. When
    altering the kthread's attributes from userland is allowed, this is
    fine; however, for cases where CPU affinity is critical,
    kthread_bind() is used to first disable affinity changes from userland
    and then set the affinity. This also prevents the kthread from being
    migrated into non-root cgroups as that can affect the CPU affinity and
    many other things.

    Unfortunately, the cgroup side of protection is racy. While the
    PF_NO_SETAFFINITY flag prevents further migrations, userland can win
    the race before the creator sets the flag with kthread_bind() and move
    the kthread into a non-root cgroup, which can lead to all sorts of
    problems including incorrect CPU affinity and starvation.

    This bug got triggered by userland which periodically tries to migrate
    all processes in the root cpuset cgroup to a non-root one. Per-cpu
    workqueue workers got caught while being created and ended up with
    incorrect CPU affinity, breaking concurrency management and sometimes
    stalling workqueue execution.

    This patch adds task->no_cgroup_migration, which prevents the task from
    being migrated by userland. kthreadd starts with the flag set, making
    every child kthread start in the root cgroup with migration
    disallowed. The flag is cleared after the kthread finishes
    initialization by which time PF_NO_SETAFFINITY is set if the kthread
    should stay in the root cgroup.
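
    A sketch of the mechanism (simplified; the exact hook points in the
    migration and kthread code are elided):

    /* task_struct gains a bit field: */
    unsigned no_cgroup_migration:1;

    /* the userland-initiated migration path rejects flagged tasks: */
    if (tsk->no_cgroup_migration)
        return -EINVAL;         /* kthread hasn't finished initializing */

    /* in the kthread, once initialization is done: */
    current->no_cgroup_migration = 0;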

    It'd be better to wait for the initialization instead of failing but I
    couldn't think of a way of implementing that without adding either a
    new PF flag, or sleeping and retrying from the waiting side. Even if
    userland depends on changing cgroup membership of a kthread, it either
    has to be synchronized with kthread_create() or periodically repeat,
    so it's unlikely that this would break anything.

    v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

    Signed-off-by: Tejun Heo
    Suggested-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Reported-and-debugged-by: Chris Mason
    Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
    Signed-off-by: Tejun Heo

    Tejun Heo
     

15 Mar, 2017

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Three cgroup fixes. Nothing critical:

    - the pids controller could trigger a suspicious RCU usage warning
    spuriously. Fixed.

    - in the debug controller, %p -> %pK to protect kernel pointers
    from getting exposed.

    - documentation formatting fix"

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroups: censor kernel pointer in debug files
    cgroup/pids: remove spurious suspicious RCU usage warning
    cgroup: Fix indenting in PID controller documentation

    Linus Torvalds
     

10 Mar, 2017

1 commit

  • Fix typos and add the following to scripts/spelling.txt:

    disble||disable
    disbled||disabled

    I kept the TSL2563_INT_DISBLED in drivers/iio/light/tsl2563.c
    untouched. The macro is not referenced at all, but this commit is
    touching only comment blocks just in case.

    Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

09 Mar, 2017

1 commit

  • The refcount_t type and its corresponding API should be used
    instead of atomic_t when the variable is used as a reference
    counter. This avoids accidental refcounter overflows that might
    lead to use-after-free situations.
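
    The conversion pattern, sketched on an illustrative structure (the
    'foo' type and helpers are placeholders, not the converted cgroup
    code):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct foo {
        refcount_t count;               /* was: atomic_t count; */
    };

    static inline void foo_get(struct foo *f)
    {
        refcount_inc(&f->count);        /* saturates and WARNs on overflow */
    }

    static inline void foo_put(struct foo *f)
    {
        if (refcount_dec_and_test(&f->count))
            kfree(f);
    }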

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Tejun Heo

    Elena Reshetova
     

07 Mar, 2017

3 commits

  • As found in grsecurity, this avoids exposing a kernel pointer through
    the cgroup debug entries.
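
    An illustration of the change (sketch): %pK consults kptr_restrict and
    shows zeroes to unprivileged readers instead of the raw address:

    seq_printf(seq, "css_set %pK\n", cset);     /* was: "%p" */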

    Signed-off-by: Kees Cook
    Signed-off-by: Tejun Heo

    Kees Cook
     
  • pids_can_fork() is special in that the css association is guaranteed
    to be stable throughout the function and thus doesn't need RCU
    protection around task_css access. When determining the css to charge
    the pid, task_css_check() is used to override the RCU sanity check.

    While adding a warning message on fork rejection from pids limit,
    135b8b37bd91 ("cgroup: Add pids controller event when fork fails
    because of pid limit") incorrectly added a task_css access which is
    neither RCU protected nor explicitly annotated. This triggers the
    following suspicious RCU usage warning when RCU debugging is enabled.

    cgroup: fork rejected by pids controller in

    ===============================
    [ ERR: suspicious RCU usage. ]
    4.10.0-work+ #1 Not tainted
    -------------------------------
    ./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 0
    1 lock held by bash/1748:
    #0: (&cgroup_threadgroup_rwsem){+++++.}, at: [] _do_fork+0xe6/0x6e0

    stack backtrace:
    CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
    Call Trace:
    dump_stack+0x68/0x93
    lockdep_rcu_suspicious+0xd7/0x110
    pids_can_fork+0x1c7/0x1d0
    cgroup_can_fork+0x67/0xc0
    copy_process.part.58+0x1709/0x1e90
    _do_fork+0xe6/0x6e0
    SyS_clone+0x19/0x20
    do_syscall_64+0x5c/0x140
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f7853fab93a
    RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
    RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
    R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
    R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
    /asdf

    There's no reason to dereference task_css again here when the
    associated css is already available. Fix it by replacing the
    task_cgroup() call with css->cgroup.
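
    A sketch of the fix (simplified): inside pids_can_fork(), where css is
    pinned and stable, derive the cgroup from the css itself:

    struct cgroup *cgrp = css->cgroup;  /* was: task_cgroup(current, pids_cgrp_id) */

    pr_info("cgroup: fork rejected by pids controller in ");
    pr_cont_cgroup_path(cgrp);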

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Fixes: 135b8b37bd91 ("cgroup: Add pids controller event when fork fails because of pid limit")
    Cc: Kenny Yu
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • The refcount_t type and its corresponding API should be used
    instead of atomic_t when the variable is used as a reference
    counter. This avoids accidental refcounter overflows that might
    lead to use-after-free situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: Tejun Heo

    Elena Reshetova
     

03 Mar, 2017

3 commits

  • It's not used by any of the scheduler methods, but
    <linux/sched/task_stack.h> needs it to pick up STACK_END_MAGIC.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The task_lock()/task_unlock() APIs are not related to core scheduling;
    they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.

    Move them.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Move the task_struct::signal and task_struct::sighand types and
    accessors into <linux/sched/signal.h>.

    task_struct::signal and task_struct::sighand are pointers, which would normally make it
    straightforward to not define those types in sched.h.

    That is not so, because the types are accompanied by a myriad of APIs (macros and inline
    functions) that dereference them.

    Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.

    With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore;
    trying to put accessors into sched.h as a test fails the following way:

    ./include/linux/sched.h: In function ‘test_signal_types’:
    ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
    ^

    This reduces the size and complexity of sched.h significantly.

    Update all headers and .c code that relied on getting the signal handling
    functionality from <linux/sched.h> to include <linux/sched/signal.h>.
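
    For example, a file that previously picked up the signal APIs
    transitively now includes the new header directly (illustrative):

    #include <linux/sched/signal.h>     /* signal_pending(), struct signal_struct, ... */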

    The list of affected files in the preparatory patch was partly generated by
    grepping for the APIs, and partly by doing coverage build testing, both
    all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
    cross-architecture builds.

    Nevertheless some (trivial) build breakage is still expected related to rare
    Kconfig combinations and in-flight patches to various kernel code, but most
    of it should be handled by this patch.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

02 Mar, 2017

4 commits

  • But first update the code that uses these facilities with the
    new header.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split a new header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file for the new header that just
    maps to <linux/sched.h>, to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • threadgroup_change_begin()/end() is a pointless wrapper around
    cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
    in the !CONFIG_CGROUPS=y case.

    Remove the wrappery, move the might_sleep() (the down_read()
    already has a might_sleep() check).

    This debloats a bit and simplifies this API.

    Update all call sites.

    No change in functionality.
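
    Call sites now use the cgroup API directly; a sketch:

    cgroup_threadgroup_change_begin(tsk);   /* was: threadgroup_change_begin(tsk) */
    /* ... work that must not race with thread-group changes ... */
    cgroup_threadgroup_change_end(tsk);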

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Several noteworthy changes.

    - Parav's rdma controller is finally merged. It is very
    straightforward and can limit the absolute numbers of common rdma
    constructs used by different cgroups.

    - kernel/cgroup.c got too chubby and disorganized. Created
    kernel/cgroup/ subdirectory and moved all cgroup related files
    under kernel/ there and reorganized the core code. This hurts for
    backporting patches but was long overdue.

    - cgroup v2 process listing reimplemented so that it no longer
    depends on allocating a buffer large enough to cache the entire
    result to sort and uniq the output. v2 has always mangled the sort
    order to ensure that users don't depend on the sorted output, so
    this shouldn't surprise anybody. This makes the pid listing
    functions use the same iterators that are used internally, which
    have to have the same iterating capabilities anyway.

    - perf cgroup filtering now works automatically on cgroup v2. This
    patch was posted a long time ago but somehow fell through the
    cracks.

    - misc fixes and documentation updates"

    * 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
    kernfs: fix locking around kernfs_ops->release() callback
    cgroup: drop the matching uid requirement on migration for cgroup v2
    cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
    cgroup: misc cleanups
    cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
    cgroup: track migration context in cgroup_mgctx
    cgroup: cosmetic update to cgroup_taskset_add()
    rdmacg: Fixed uninitialized current resource usage
    cgroup: Add missing cgroup-v2 PID controller documentation.
    rdmacg: Added documentation for rdmacg
    IB/core: added support to use rdma cgroup controller
    rdmacg: Added rdma cgroup controller
    cgroup: fix a comment typo
    cgroup: fix RCU related sparse warnings
    cgroup: move namespace code to kernel/cgroup/namespace.c
    cgroup: rename functions for consistency
    cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
    cgroup: separate out cgroup1_kf_syscall_ops
    cgroup: refactor mount path and clearly distinguish v1 and v2 paths
    cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
    ...

    Linus Torvalds
     

03 Feb, 2017

2 commits

  • Merge in order to resolve conflicts in Documentation/cgroup-v2.txt. The
    conflicts are from multiple section additions and are trivial to resolve.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Along with the write access to the cgroup.procs or tasks file, cgroup
    has required the writer's euid, unless root, to match [s]uid of the
    target process or task. On cgroup v1, this is necessary because
    there's nothing preventing a delegatee from pulling in tasks or
    processes from all over the system.

    If a user has a cgroup subdirectory delegated to it, the user would
    have write access to the cgroup.procs or tasks file. If there were no
    further checks beyond the file write access check, the user would be able to
    pull processes from all over the system into its subhierarchy which is
    clearly not the intended behavior. The matching [s]uid requirement
    partially prevents this problem by allowing a delegatee to pull in the
    processes that belong to it. This isn't sufficient protection,
    however, because a user would still be able to jump processes across
    two disjoint sub-hierarchies that have been delegated to them.

    cgroup v2 resolves the issue by requiring the writer to have access to
    the common ancestor of the cgroup.procs file of the source and target
    cgroups. This confines each delegatee to their own sub-hierarchy
    proper and bases all permission decisions on the cgroup filesystem
    rather than having to pull in explicit uid matching.

    cgroup v2 has still been applying the matching [s]uid requirement just
    for historical reasons. On cgroup2, the requirement doesn't serve any
    purpose while unnecessarily complicating the permission model. Let's
    drop it.
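
    A sketch of the v2 permission walk (simplified; the permission helper
    named below is hypothetical shorthand for the actual cgroup.procs
    write-access check):

    struct cgroup *com_cgrp = src_cgrp;

    /* walk up until we reach an ancestor of the destination too */
    while (!cgroup_is_descendant(dst_cgrp, com_cgrp))
        com_cgrp = cgroup_parent(com_cgrp);

    if (!may_write_cgroup_procs(com_cgrp))  /* hypothetical helper */
        return -EACCES;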

    Signed-off-by: Tejun Heo

    Tejun Heo