10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignement out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

21 Dec, 2011

1 commit

  • Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
    nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
    new set of allowed cpuset nodes where the two nodemasks, as a result of
    the remap, are now disjoint.

    c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
    cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
    nodes from changing for a thread. This causes any update to a set of
    allowed nodes to stall until put_mems_allowed() is called.

    This stall is unnecessary, however, if at least one node remains unchanged
    in the update to the set of allowed nodes. This was addressed by
    89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
    node remains set"), but it's still possible that an empty nodemask may be
    read from a mempolicy because the old nodemask may be remapped to the new
    nodemask during rebind. To prevent this, only avoid the stall if there is
    no mempolicy for the thread being changed.

    This is a temporary solution until all reads from mempolicy nodemasks can
    be guaranteed to not be empty without the get_mems_allowed()
    synchronization.

    Also moves the check for nodemask intersection inside task_lock() so that
    tsk->mems_allowed cannot change. This ensures that nothing can set this
    tsk's mems_allowed out from under us and also protects tsk->mempolicy.
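
    As a rough illustration of that last point (a sketch only, not the exact
    mainline code; the variable name need_stall is made up here), the decision
    is made while holding the task lock:

    task_lock(tsk);
    /* tsk->mems_allowed cannot change while task_lock() is held */
    need_stall = tsk->mempolicy ||
                 !nodes_intersects(*newmems, tsk->mems_allowed);
    /* ... perform the nodemask update, stalling only if need_stall ... */
    task_unlock(tsk);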

    Reported-by: Miao Xie
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Dec, 2011

3 commits

  • ->pre_attach() is supposed to be called before migration, which is
    observed during process migration but task migration does it the other
    way around. The only ->pre_attach() user is cpuset which can do the
    same operations in ->can_attach(). Collapse cpuset_pre_attach() into
    cpuset_can_attach().

    -v2: Patch contamination from later patch removed. Spotted by Paul
    Menage.

    Signed-off-by: Tejun Heo
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Cc: Li Zefan

    Tejun Heo
     
  • Now that subsys->can_attach() and attach() take @tset instead of
    @task, they can handle per-task operations. Convert
    ->can_attach_task() and ->attach_task() users to use ->can_attach()
    and attach() instead. Most conversions are straightforward.
    Noteworthy changes are:

    * In cgroup_freezer, remove unnecessary NULL assignments to unused
    methods. It's useless and very prone to get out of sync, which
    already happened.

    * In cpuset, the PF_THREAD_BOUND test is now applied to each task. This
    doesn't make any practical difference but is conceptually cleaner.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     
  • Currently, there's no way to pass multiple tasks to cgroup_subsys
    methods necessitating the need for separate per-process and per-task
    methods. This patch introduces cgroup_taskset which can be used to
    pass multiple tasks and their associated cgroups to cgroup_subsys
    methods.

    Three methods - can_attach(), cancel_attach() and attach() - are
    converted to use cgroup_taskset. This unifies passed parameters so
    that all methods have access to all information. Conversions in this
    patchset are identical and don't introduce any behavior change.

    -v2: documentation updated as per Paul Menage's suggestion.
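
    For illustration, a minimal sketch of a converted method (foo_* is a
    hypothetical subsystem; the iterator names follow the
    cgroup_taskset_first()/cgroup_taskset_next() helpers this patch
    introduces, and the exact signatures may differ):

    static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                              struct cgroup_taskset *tset)
    {
            struct task_struct *task;

            /* every migrating task is visible here, not just the group leader */
            for (task = cgroup_taskset_first(tset); task;
                 task = cgroup_taskset_next(tset)) {
                    if (task->flags & PF_THREAD_BOUND)  /* illustrative per-task check */
                            return -EINVAL;
            }
            return 0;
    }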

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris

    Tejun Heo
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • {get,put}_mems_allowed() exist so that general kernel code may locklessly
    access a task's set of allowable nodes without having the chance that a
    concurrent write will cause the nodemask to be empty on configurations
    where MAX_NUMNODES > BITS_PER_LONG.

    This could incur a significant delay, however, especially in low memory
    conditions because the page allocator is blocking and reclaim requires
    get_mems_allowed() itself. It is not atypical to see writes to
    cpuset.mems take over 2 seconds to complete, for example. In low memory
    conditions, this is problematic because it's one of the most important
    times to change cpuset.mems in the first place!

    The only way a task's set of allowable nodes may change is through cpusets:
    by writing to cpuset.mems or by attaching the task to a different cpuset.
    To keep generic code that reads the nodemask under get_mems_allowed() from
    ever observing an empty mask, the writer sets all the new nodes first and
    only then clears all the old nodes. This prevents the possibility that a
    reader will see an empty nodemask at the same time the writer is storing a
    new nodemask.

    If at least one node remains unchanged, though, it's possible to simply
    set all new nodes and then clear all the old nodes. Changing a task's
    nodemask is protected by cgroup_mutex so it's guaranteed that two threads
    are not changing the same task's nodemask at the same time, so the
    nodemask is guaranteed to be stored before another thread changes it and
    determines whether a node remains set or not.
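
    For reference, the read-side pattern these helpers protect looks roughly
    like this (a sketch; in this era both helpers take no arguments):

    get_mems_allowed();
    /* any allocation here sees a stable, never-empty current->mems_allowed */
    page = alloc_pages(GFP_KERNEL, 0);
    put_mems_allowed();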

    Signed-off-by: David Rientjes
    Cc: Miao Xie
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

27 Jul, 2011

2 commits

    This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • [ This patch has already been accepted as commit 0ac0c0d0f837 but later
    reverted (commit 35926ff5fba8) because it introduced arch-specific
    __node_random which was defined only for x86 code so it broke other
    archs. This is a followup without any arch specific code. Other than
    that there are no functional changes.]

    Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is
    that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
    at node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    of the cpuset.

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    [mhocko@suse.cz: Make it arch independent]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
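
    The net effect, sketched (the lazy first-use initialization shown here is
    an assumption about where the randomization happens):

    /* start the spread rotor at a random allowed node instead of node 0 */
    if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
            current->cpuset_mem_spread_rotor =
                    node_random(&current->mems_allowed);
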
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Jack Steiner
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Michal Hocko
    Cc: Paul Menage
    Cc: Pekka Enberg
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 May, 2011

2 commits

  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handling a lot of
    namespaces without falling into exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal. Since that
    time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as replaced by
    this. All subsystems are modified for the new interface - of note is
    cpuset, which requires from/to nodemasks for attach to be globally scoped
    (though per-cpuset would work too) to persist from its pre_attach to
    attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
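
    A sketch of the added callbacks as described above (signatures are
    approximate and shown only to make the shape of the interface concrete):

    struct cgroup_subsys {
            /* ... whole-threadgroup callbacks: can_attach(), attach(), ... */
            int  (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
            void (*pre_attach)(struct cgroup *cgrp);
            void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
            /* ... */
    };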

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

11 Apr, 2011

1 commit

  • Remove the SD_LV_ enum and use dynamic level assignments.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.969433965@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Mar, 2011

4 commits

    Changing cpuset->mems/cpuset->cpus should be protected under
    callback_mutex.

    cpuset_clone() doesn't follow this rule. It's ok because it's
    called when creating and initializing a cgroup, but we'd better
    hold the lock to avoid subtle breakage in the future.
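
    The fix amounts to a guarded copy along these lines (a sketch; the exact
    fields copied may differ):

    mutex_lock(&callback_mutex);
    cs->mems_allowed = parent->mems_allowed;
    cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
    mutex_unlock(&callback_mutex);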

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
    Those functions that use NODEMASK_ALLOC() can't propagate errno
    to userspace; they just fail silently.

    Fix it by using a static nodemask_t variable for each function; those
    variables are protected by cgroup_mutex.

    [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment]
    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
    have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
    to cpuset_migrate_mm().

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
    NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

    As spotted by Paul, a side effect is that we fix a bug where the function
    could return -ENOMEM although the caller doesn't expect a negative return
    value.
    Therefore change the return value of cpuset_sprintf_cpulist() and
    cpuset_sprintf_memlist() from int to size_t.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

21 Oct, 2010

1 commit

    No security module should change the sched_param parameter passed to
    security_task_setscheduler(). Doing so is not only meaningless, but can
    also produce a harmful result if the caller passes a static variable.

    This patch removes the policy and sched_param parameters from
    security_task_setscheduler() because no security module uses them.
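
    The resulting interface change, in short (the "before" prototype is
    reconstructed from the description):

    /* before */
    int security_task_setscheduler(struct task_struct *p, int policy,
                                   struct sched_param *lp);
    /* after */
    int security_task_setscheduler(struct task_struct *p);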

    Cc: James Morris
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: James Morris

    KOSAKI Motohiro
     

07 Aug, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
    sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
    sched: No need for bootmem special cases
    sched: Revert nohz_ratelimit() for now
    sched: Reduce update_group_power() calls
    sched: Update rq->clock for nohz balanced cpus
    sched: Fix spelling of sibling
    sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
    sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
    sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
    sched: thread_group_cputime: Simplify, document the "alive" check
    sched: Remove the obsolete exit_state/signal hacks
    sched: task_tick_rt: Remove the obsolete ->signal != NULL check
    sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
    sched: Fix comments to make them DocBook happy
    sched: Fix fix_small_capacity
    powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
    powerpc: Enable asymmetric SMT scheduling on POWER7
    sched: Add asymmetric group packing option for sibling domain
    sched: Fix capacity calculations for SMT4
    sched: Change nohz idle load balancing logic to push model
    ...

    Linus Torvalds
     

22 Jun, 2010

1 commit

  • Commit 3a101d05 (sched: adjust when cpu_active and cpuset
    configurations are updated during cpu on/offlining) added
    hotplug notifiers marked with __cpuexit; however, ia64 drops
    text in __cpuexit during link unlike x86.

    This means that functions which are referenced during init but used
    only for cpu hot unplugging afterwards shouldn't be marked with
    __cpuexit. Drop __cpuexit from those functions.

    Reported-by: Tony Luck
    Signed-off-by: Tejun Heo
    Acked-by: Tony Luck
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

09 Jun, 2010

1 commit

  • Currently, when a cpu goes down, cpu_active is cleared before
    CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
    default priority cpu notifier. When a cpu is coming up, it's set
    before CPU_ONLINE but cpuset configuration again is updated from the
    same cpu notifier.

    For cpu notifiers, this presents an inconsistent state. Threads which
    a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
    migrated to other cpus because the cpu is no longer active.

    Fix it by updating cpu_active in the highest priority cpu notifier and
    cpuset configuration in the second highest when a cpu is coming up.
    Down path is updated similarly. This guarantees that all other cpu
    notifiers see consistent cpu_active and cpuset configuration.

    cpuset_track_online_cpus() notifier is converted to
    cpuset_update_active_cpus() which just updates the configuration and
    now called from cpuset_cpu_[in]active() notifiers registered from
    sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
    degenerates into partition_sched_domains() making separate notifier
    for !CONFIG_CPUSETS unnecessary.
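
    A sketch of the cpuset side of the new arrangement (simplified; the
    notifier priority constants and the down path are elided):

    static int cpuset_cpu_active(struct notifier_block *nfb,
                                 unsigned long action, void *hcpu)
    {
            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_ONLINE:
            case CPU_DOWN_FAILED:
                    /* cpu_active was already set by a higher-priority notifier */
                    cpuset_update_active_cpus();
                    return NOTIFY_OK;
            default:
                    return NOTIFY_DONE;
            }
    }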

    This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
    callback creates a kthread and kthread_bind()s it to the target cpu,
    and the thread is expected to run on that cpu.

    * Ingo's test discovered __cpuinit/exit markups were incorrect.
    Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Ingo Molnar
    Cc: Paul Menage

    Tejun Heo
     

28 May, 2010

1 commit

  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
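
    Roughly, the change boils down to sharing one rotor helper between two
    per-task rotors (a sketch that follows the description; field names are
    assumed):

    static int cpuset_spread_node(int *rotor)
    {
            int node;

            node = next_node(*rotor, current->mems_allowed);
            if (node == MAX_NUMNODES)
                    node = first_node(current->mems_allowed);
            *rotor = node;
            return node;
    }

    int cpuset_mem_spread_node(void)
    {
            return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
    }

    int cpuset_slab_spread_node(void)
    {
            return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
    }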

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

2 commits

    Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). A variable tells the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task finishes its current memory
    allocation.
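
    The store ordering that makes this safe, in sketch form (simplified; the
    real code also has to synchronize with readers via the variable mentioned
    above):

    /* step 1: grow - a concurrent reader sees old|new, never an empty mask */
    nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
    /* ... wait until no reader is in the middle of an allocation ... */
    /* step 2: shrink - drop the nodes that are no longer allowed */
    nodes_and(tsk->mems_allowed, tsk->mems_allowed, *newmems);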

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
    Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores. The problem is that the allocator can't find a node to
    allocate a page from when changing cpuset's mems, even though there is a
    lot of free memory. The reason is like this:

    (mpol: mempolicy)
    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    Several hours later, OOM will happen even though there is a lot of free
    memory.

    This patchset fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). A variable tells the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task finishes its current memory
    allocation.

    This patch:

    In order to fix the "no node to allocate memory from" problem, when we want
    to update mempolicy and mems_allowed, we expand the set of nodes first (set
    all the newly allowed nodes) and shrink it lazily (clear the newly
    disallowed nodes). But the mempolicy's rebind functions may break the
    expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems. The 1st step:
    expand the set of the mempolicy's nodes. The 2nd step: shrink the set of
    the mempolicy's nodes. This is used when there is no real lock to protect
    the mempolicy on the read side; otherwise we can do the rebind work at
    once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,
            MPOL_REBIND_STEP1,
            MPOL_REBIND_STEP2,
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated by two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Or we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
    MPOL_REBIND_STEP2 to do the second step work.

    Besides that, it may be a long time between these two steps, and we have to
    release the lock that protects mempolicy and mems_allowed. When we take the
    lock again, we must check whether the current mempolicy is in the middle of
    a rebind (the first step has been done), because the task may have
    allocated a new mempolicy while we didn't hold the lock. So we define the
    following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.
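
    Usage of the two-step interface, as a sketch (locking elided; a caller
    that does hold a proper lock simply passes MPOL_REBIND_ONCE instead):

    mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); /* expand: set new nodes */
    /* readers may run here; MPOL_F_REBINDING marks the half-rebound state */
    mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); /* shrink: clear old nodes */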

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

03 Apr, 2010

2 commits

  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change makes the code much more correct compared to the deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T; then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever because the CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 Dec, 2009

2 commits

  • Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
    sched domain managment) we have cpu_active_mask which is supposed to rule
    scheduler migration and load-balancing, except it never (fully) did.

    The particular problem being solved here is a crash in try_to_wake_up()
    where select_task_rq() ends up selecting an offline cpu because
    select_task_rq_fair() trusts the sched_domain tree to reflect the
    current state of affairs, similarly select_task_rq_rt() trusts the
    root_domain.

    However, the sched_domains are updated from CPU_DEAD, which is after the
    cpu is taken offline and after stop_machine is done. Therefore it can
    race perfectly well with code assuming the domains are right.

    Cure this by building the domains from cpu_active_mask on
    CPU_DOWN_PREPARE.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7 ("cpumask:
    Partition_sched_domains takes array of cpumask_var_t") changed
    the function signature of generate_sched_domains() for the
    CONFIG_SMP=y case, but forgot to update the corresponding
    function for the CONFIG_SMP=n case, causing:

    kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type

    Signed-off-by: Geert Uytterhoeven
    Cc: Rusty Russell
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Geert Uytterhoeven
     

04 Nov, 2009

1 commit

  • Currently partition_sched_domains() takes a 'struct cpumask
    *doms_new' which is a kmalloc'ed array of cpumask_t. You can't
    have such an array if 'struct cpumask' is undefined, as we plan
    for CONFIG_CPUMASK_OFFSTACK=y.

    So, we make this an array of cpumask_var_t instead: this is the
    same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
    multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
    Hence we add alloc_sched_domains() and free_sched_domains()
    functions.
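
    The helpers' shape, roughly (prototypes as described; details may differ):

    cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
    void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);

    doms_new = alloc_sched_domains(ndoms_new);  /* one mask per domain; each mask
                                                   is a separate allocation when
                                                   CONFIG_CPUMASK_OFFSTACK=y */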

    Signed-off-by: Rusty Russell
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

24 Sep, 2009

1 commit

  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the subsystem
    about the old cgroup of the threadgroup leader. No subsystem currently
    needs that information for each thread that's being moved, but if one were
    to be added (for example, one that counts tasks within a group) this bit
    would need to be reworked a bit to tell the subsystem the right
    information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

21 Sep, 2009

1 commit

    The Cpus_allowed fields in /proc/<pid>/status are currently only
    shown in case of CONFIG_CPUSETS. However their contents are also
    useful for the !CONFIG_CPUSETS case.

    So change the current behaviour and always show these fields.

    Signed-off-by: Heiko Carstens
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

17 Jun, 2009

1 commit

  • Fix allocating page cache/slab object on the unallowed node when memory
    spread is set by updating tasks' mems_allowed after its cpuset's mems is
    changed.

    In order to update tasks' mems_allowed in time, we must modify the code of
    memory policy. Because the memory policy is applied in the process's
    context originally. After applying this patch, one task directly
    manipulates another's mems_allowed, and we use alloc_lock in the
    task_struct to protect mems_allowed and memory policy of the task.

    But in the fast path, we didn't use a lock to protect them, because adding
    a lock may lead to performance regression. But if we don't add a lock, the
    task might see no nodes when changing cpuset's mems_allowed to some
    non-overlapping set. In order to avoid it, we set all new allowed nodes,
    then clear newly disallowed ones.

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and empty
    node mask to mpol_new_mpolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie