07 Nov, 2015

1 commit

  • There is a seqcounter that protects against spurious allocation failures
    when a task is changing the allowed nodes in a cpuset. There is no need
    to check the seqcounter until a cpuset exists.
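
    A minimal sketch of what such a check looks like, assuming the
    read_mems_allowed_begin()/cpusets_enabled() helper names of this era
    (treat them as assumptions, not the literal patch):

        static inline unsigned int read_mems_allowed_begin(void)
        {
                /* No cpuset exists yet, so nothing can rebind our nodemask;
                 * skip the seqcount read entirely. */
                if (!cpusets_enabled())
                        return 0;

                return read_seqcount_begin(&current->mems_allowed_seq);
        }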

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

1 commit

  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NUL-terminated, so the worst race scenario would
    only be during an update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.
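
    The change amounts to dropping the lock/unlock pair around the print; a
    before/after sketch (illustrative call site, not one of the actual ones):

        /* before */
        task_lock(p);
        pr_err("Killed process %d (%s)\n", task_pid_nr(p), p->comm);
        task_unlock(p);

        /* after: comm is a fixed-size, always NUL-terminated buffer, so an
         * unlocked read can at worst print a mix of old and new contents,
         * never overrun the buffer */
        pr_err("Killed process %d (%s)\n", task_pid_nr(p), p->comm);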

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to be
    allocated from looks rather awkward. We have hardwall and softwall
    versions of cpuset_node_allowed, with the softwall version doing literally
    the same as the hardwall version if __GFP_HARDWALL is passed to it in the
    gfp flags. If it isn't, the softwall version may check the given node
    against the enclosing hardwall cpuset, which requires taking the callback
    lock.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
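
    The unified check then has roughly this shape (a sketch assuming the
    post-patch cpuset_node_allowed() name; details may differ):

        /* One entry point: callers that must never escape their own cpuset
         * pass __GFP_HARDWALL; everyone else may be allowed to fall back to
         * the nearest hardwall ancestor. */
        bool cpuset_node_allowed(int node, gfp_t gfp_mask);

        /* typical usage in a zonelist walk */
        if (!cpuset_node_allowed(zone_to_nid(zone), gfp_mask))
                continue;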

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
    PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently it isn't, which
    is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
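
    The breakage is a classic lost update on a plain word; a sketch of the
    race and of the direction of the fix (tsk->atomic_flags and
    PFA_SPREAD_PAGE are the names remembered from the patch, so treat them as
    assumptions):

        /* Both statements are non-atomic read-modify-write on tsk->flags:
         *
         *   CPU0: tsk->flags |=  PF_SPREAD_PAGE;
         *   CPU1: tsk->flags &= ~PF_USED_MATH;
         *
         * CPU1's store can overwrite CPU0's, silently dropping the
         * PF_SPREAD_PAGE bit. Keeping the cpuset-spread bits in a separate
         * word and flipping them with atomic bitops avoids that: */
        set_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);
        clear_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);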

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

05 Jun, 2014

1 commit

  • If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
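
    The pattern is a static key that is enabled only once a cpuset is
    actually created, so the common case is a patched-out branch; a rough
    sketch (the names are approximate, not a quote of the patch):

        struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

        static inline bool cpusets_enabled(void)
        {
                return static_key_false(&cpusets_enabled_key);
        }

        /* page allocator fast path: the cpuset check costs nothing unless
         * a cpuset has ever been created */
        if (cpusets_enabled() && !cpuset_zone_allowed(zone, gfp_mask))
                continue;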

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Apr, 2014

1 commit

  • Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb, some loads and comparisons on some
    relatively fast paths.

    Since the naming get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
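
    The allocator-side result is the usual seqcount read loop; sketched here
    in simplified form (attempt_allocation() is a hypothetical stand-in for
    the real allocation attempt):

        unsigned int seq;
        struct page *page;

        do {
                seq = read_mems_allowed_begin();        /* cheap snapshot */
                page = attempt_allocation(gfp_mask, order);
                /* retry only if we failed *and* mems_allowed changed meanwhile */
        } while (!page && read_mems_allowed_retry(seq));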

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2013

1 commit

  • After adding lockdep support to seqlock/seqcount structures,
    I started seeing the following warning:

    [ 1.070907] ======================================================
    [ 1.072015] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
    [ 1.073181] 3.11.0+ #67 Not tainted
    [ 1.073801] ------------------------------------------------------
    [ 1.074882] kworker/u4:2/708 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 1.076088] (&p->mems_allowed_seq){+.+...}, at: [] new_slab+0x5f/0x280
    [ 1.077572]
    [ 1.077572] and this task is already holding:
    [ 1.078593] (&(&q->__queue_lock)->rlock){..-...}, at: [] blk_execute_rq_nowait+0x53/0xf0
    [ 1.080042] which would create a new lock dependency:
    [ 1.080042] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
    [ 1.080042]
    [ 1.080042] but this new dependency connects a SOFTIRQ-irq-safe lock:
    [ 1.080042] (&(&q->__queue_lock)->rlock){..-...}
    [ 1.080042] ... which became SOFTIRQ-irq-safe at:
    [ 1.080042] [] __lock_acquire+0x5b9/0x1db0
    [ 1.080042] [] lock_acquire+0x95/0x130
    [ 1.080042] [] _raw_spin_lock+0x41/0x80
    [ 1.080042] [] scsi_device_unbusy+0x7e/0xd0
    [ 1.080042] [] scsi_finish_command+0x32/0xf0
    [ 1.080042] [] scsi_softirq_done+0xa1/0x130
    [ 1.080042] [] blk_done_softirq+0x73/0x90
    [ 1.080042] [] __do_softirq+0x110/0x2f0
    [ 1.080042] [] run_ksoftirqd+0x2d/0x60
    [ 1.080042] [] smpboot_thread_fn+0x156/0x1e0
    [ 1.080042] [] kthread+0xd6/0xe0
    [ 1.080042] [] ret_from_fork+0x7c/0xb0
    [ 1.080042]
    [ 1.080042] to a SOFTIRQ-irq-unsafe lock:
    [ 1.080042] (&p->mems_allowed_seq){+.+...}
    [ 1.080042] ... which became SOFTIRQ-irq-unsafe at:
    [ 1.080042] ... [] __lock_acquire+0x613/0x1db0
    [ 1.080042] [] lock_acquire+0x95/0x130
    [ 1.080042] [] kthreadd+0x82/0x180
    [ 1.080042] [] ret_from_fork+0x7c/0xb0
    [ 1.080042]
    [ 1.080042] other info that might help us debug this:
    [ 1.080042]
    [ 1.080042] Possible interrupt unsafe locking scenario:
    [ 1.080042]
    [ 1.080042]         CPU0                    CPU1
    [ 1.080042]         ----                    ----
    [ 1.080042]    lock(&p->mems_allowed_seq);
    [ 1.080042]                                 local_irq_disable();
    [ 1.080042]                                 lock(&(&q->__queue_lock)->rlock);
    [ 1.080042]                                 lock(&p->mems_allowed_seq);
    [ 1.080042]    <Interrupt>
    [ 1.080042]      lock(&(&q->__queue_lock)->rlock);
    [ 1.080042]
    [ 1.080042] *** DEADLOCK ***

    The issue stems from the kthreadd() function calling set_mems_allowed
    with irqs enabled. While it's possibly unlikely for the actual deadlock
    to trigger, a fix is fairly simple: disable irqs before taking the
    mems_allowed_seq lock.
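
    The core of the fix is wrapping the write-side seqcount section in
    local_irq_save()/local_irq_restore(); roughly (a from-memory sketch of
    set_mems_allowed(), surrounding locking elided):

        static inline void set_mems_allowed(nodemask_t nodemask)
        {
                unsigned long flags;

                local_irq_save(flags);          /* the added piece */
                write_seqcount_begin(&current->mems_allowed_seq);
                current->mems_allowed = nodemask;
                write_seqcount_end(&current->mems_allowed_seq);
                local_irq_restore(flags);
        }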

    Signed-off-by: John Stultz
    Signed-off-by: Peter Zijlstra
    Acked-by: Li Zefan
    Cc: Mathieu Desnoyers
    Cc: Steven Rostedt
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/1381186321-4906-4-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro:

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.
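
    In practice this is a mechanical substitution of the node state used in
    iteration and membership tests, e.g. (do_something() is only a
    placeholder):

        /* before */
        for_each_node_state(nid, N_HIGH_MEMORY)
                do_something(nid);

        /* after: cover every node that has memory of any kind */
        for_each_node_state(nid, N_MEMORY)
                do_something(nid);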

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

24 Jul, 2012

1 commit

  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

27 Mar, 2012

1 commit

  • Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
    supposed to finally sort the cpu_active mess, instead uncovered more.

    Since CPU_STARTING is run before setting the cpu online, there's a
    (small) window where the cpu is active,!online.

    If during this time there's a wakeup of a task that used to reside on
    that cpu, select_task_rq() will use select_fallback_rq() to compute an
    alternative cpu to run on, since we find !online.

    select_fallback_rq(), however, will compute the new cpu against
    cpu_active; this means that it can return the same cpu it started out
    with, the !online one, since that cpu is in fact marked active.

    This results in us trying to schedule a task on an offline cpu and
    triggering a WARN in the IPI code.

    The solution proposed by Chuansheng Liu of setting cpu_active in
    set_cpu_online() is buggy: firstly, not all archs actually use
    set_cpu_online(); secondly, not all archs call set_cpu_online() with
    IRQs disabled. This means we would introduce either the same race or
    the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
    wrong CPU") -- albeit a much narrower one.

    [ By setting online first and active later we have a window of
    online,!active, fresh and bound kthreads have task_cpu() of 0 and
    since cpu0 isn't in tsk_cpus_allowed() we end up in
    select_fallback_rq() which excludes !active, resulting in a reset
    of ->cpus_allowed and the thread running all over the place. ]

    The solution is to re-work select_fallback_rq() to require active
    _and_ online. This makes the active,!online case work as expected,
    OTOH archs running CPU_STARTING after setting online are now
    vulnerable to the issue from fd8a7de17 -- these are alpha and
    blackfin.
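
    The reworked fallback selection therefore skips any cpu that does not have
    both bits set; a condensed sketch (not the full select_fallback_rq()):

        for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                if (!cpu_online(dest_cpu))
                        continue;
                if (!cpu_active(dest_cpu))      /* the added requirement */
                        continue;
                return dest_cpu;                /* first cpu that is online AND active */
        }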

    Reported-by: Chuansheng Liu
    Signed-off-by: Peter Zijlstra
    Cc: Mike Frysinger
    Cc: linux-alpha@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Mar, 2012

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                                     3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla     nobarrier-v2r1
    Clients 1 UserTime            0.07 (  0.00%)      0.08 (-14.19%)
    Clients 2 UserTime            0.07 (  0.00%)      0.07 (  2.72%)
    Clients 4 UserTime            0.08 (  0.00%)      0.07 (  3.29%)
    Clients 1 SysTime             0.70 (  0.00%)      0.65 (  6.65%)
    Clients 2 SysTime             0.85 (  0.00%)      0.82 (  3.65%)
    Clients 4 SysTime             1.41 (  0.00%)      1.41 (  0.32%)
    Clients 1 WallTime            0.77 (  0.00%)      0.74 (  4.19%)
    Clients 2 WallTime            0.47 (  0.00%)      0.45 (  3.73%)
    Clients 4 WallTime            0.38 (  0.00%)      0.37 (  1.58%)
    Clients 1 Flt/sec/cpu    497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu    414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu    257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients 1 Flt/sec        495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients 2 Flt/sec        820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients 4 Flt/sec       1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)           135.68    132.17
    User+Sys Time Running Test (seconds)      164.2     160.13
    Total Elapsed Time (seconds)              123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.
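
    On the update side the idea is that the sequence is bumped only when the
    nodemask change could make a concurrent allocation fail spuriously; a
    sketch of that writer (field and helper names assumed, not quoted):

        /* could a parallel allocator briefly see an empty/shrinking mask? */
        bool risky = !nodes_intersects(*newmems, tsk->mems_allowed);

        if (risky)
                write_seqcount_begin(&tsk->mems_allowed_seq);

        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); /* grow first */
        /* ... rebind mempolicy ... */
        tsk->mems_allowed = *newmems;                             /* then shrink */

        if (risky)
                write_seqcount_end(&tsk->mems_allowed_seq);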

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Jun, 2010

1 commit

  • Currently, when a cpu goes down, cpu_active is cleared before
    CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
    default priority cpu notifier. When a cpu is coming up, it's set
    before CPU_ONLINE but cpuset configuration again is updated from the
    same cpu notifier.

    For cpu notifiers, this presents an inconsistent state. Threads which
    a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
    migrated to other cpus because the cpu is no longer active.

    Fix it by updating cpu_active in the highest priority cpu notifier and
    cpuset configuration in the second highest when a cpu is coming up.
    Down path is updated similarly. This guarantees that all other cpu
    notifiers see consistent cpu_active and cpuset configuration.

    The cpuset_track_online_cpus() notifier is converted to
    cpuset_update_active_cpus(), which just updates the configuration and
    is now called from the cpuset_cpu_[in]active() notifiers registered from
    sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
    degenerates into partition_sched_domains(), making a separate notifier
    for !CONFIG_CPUSETS unnecessary.

    This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
    callback creates a kthread and kthread_bind()s it to the target cpu,
    and the thread is expected to run on that cpu.

    * Ingo's test discovered __cpuinit/exit markups were incorrect.
    Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Ingo Molnar
    Cc: Paul Menage

    Tejun Heo
     

28 May, 2010

1 commit

  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick check seems to confirm that this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
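
    The fix is simply to give each interleaving its own per-task rotor; a
    sketch of the resulting pair of helpers (the rotor-advance helper and
    field names are assumptions based on the description above):

        int cpuset_mem_spread_node(void)
        {
                return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
        }

        int cpuset_slab_spread_node(void)
        {
                /* independent rotor: slab allocations no longer advance the
                 * page-cache rotor, and vice versa */
                return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
        }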

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and the
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old disallowed bits later. But even so, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                          task1's mpol    task2
    alloc page                     1
      alloc on node0? NO           1
                                   1               change mems from 1 to 0
                                   1               rebind task1's mpol
                                   0-1               set new bits
                                   0                 clear disallowed bits
      alloc on node1? NO           0
      ...
      can't alloc page
        goto oom

    This patch fixes this problem by expanding the node range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). We use a variable to tell the write-side task that a read-side task
    is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task ends its current memory
    allocation.
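
    The ordering can be pictured as a two-step update with a hand-off to any
    in-flight reader; a sketch (the "variable" above became a read-side
    marker, shown here with the get/put_mems_allowed() names this patch
    introduces):

        /* writer (cpuset side) */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); /* step 1: grow */
        /* ... wait until the task is not between get/put_mems_allowed() ... */
        tsk->mems_allowed = *newmems;                             /* step 2: shrink */

        /* reader (allocator side) */
        get_mems_allowed();     /* tell writers "I'm using the mask" */
        /* ... allocate; the visible mask is never empty ... */
        put_mems_allowed();     /* writer may now clear the old bits */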

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

03 Apr, 2010

2 commits

  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change make the code much more correct compared to deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact that the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever because the CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

17 Jun, 2009

1 commit

  • Fix allocating page cache/slab objects on a disallowed node when memory
    spread is set, by updating a task's mems_allowed as soon as its cpuset's
    mems is changed.

    In order to update tasks' mems_allowed in time, we must modify the memory
    policy code, because the memory policy was originally applied in the
    process's own context. After applying this patch, one task directly
    manipulates another's mems_allowed, and we use alloc_lock in the
    task_struct to protect the task's mems_allowed and memory policy.

    But in the fast path we don't use a lock to protect them, because adding a
    lock may lead to a performance regression. If we don't add a lock, though,
    the task might see no nodes when the cpuset's mems_allowed is changed to
    some non-overlapping set. In order to avoid that, we set all newly allowed
    nodes first, then clear the newly disallowed ones.

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

09 Jan, 2009

1 commit

  • Impact: cleanups, use new cpumask API

    Final trivial cleanups: mainly s/cpumask_t/struct cpumask

    Note there is a FIXME in generate_sched_domains(). A future patch will
    change struct cpumask *doms to struct cpumask *doms[].
    (I suppose Rusty will do this.)

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

07 Jan, 2009

1 commit

  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Nov, 2008

1 commit

  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changes. This is wrong because
    N_ONLINE just means the node has a pgdat; when a node has (or gains)
    memory, we use N_HIGH_MEMORY. So we should invoke the update function
    after node_states[N_HIGH_MEMORY] changes, just as its commit message says.

    This patch fixes it. And we use notifier of memory hotplug instead of
    direct calling of cpuset_track_online_nodes().
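
    Registration thus moves to a memory-hotplug notifier, roughly of this
    shape (the priority value is illustrative):

        /* run cpuset_track_online_nodes() when memory goes on/offline,
         * i.e. when node_states[N_HIGH_MEMORY] actually changes */
        hotplug_memory_notifier(cpuset_track_online_nodes, 10);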

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     

07 Sep, 2008

1 commit

  • What I realized recently is that calling rebuild_sched_domains() in
    arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
    partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
    and will not actually rebuild anything if new domain masks match the old ones.

    What this means is that doing
    echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
    on a system with cpusets enabled will not take effect until something changes
    in the cpuset setup (i.e. new sets created or deleted).

    This patch restores the correct behaviour: domains must be rebuilt in
    order to enable the MC power-saving flags.

    Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
    Also tested on a dual-core Core2 laptop. Lockdep is happy and things are working
    as expected.

    Signed-off-by: Max Krasnyansky
    Tested-by: Vaidyanathan Srinivasan
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     

18 Jul, 2008

1 commit

  • This is based on Linus' idea of creating cpu_active_map that prevents
    scheduler load balancer from migrating tasks to the cpu that is going
    down.

    It allows us to simplify domain management code and avoid unnecessary
    domain rebuilds during cpu hotplug event handling.

    Please ignore the cpusets part for now. It needs some more work in order
    to avoid crazy lock nesting. Although I did simplify and unify the domain
    reinitialization logic. We now simply call partition_sched_domains() in
    all the cases. This means that we're using the exact same code paths as in
    the cpusets case, and hence the test below covers cpusets too.
    Cpuset changes to make rebuild_sched_domains() callable from various
    contexts are in the separate patch (right next after this one).

    This not only boots but also easily handles
    while true; do make clean; make -j 8; done
    and
    while true; do on-off-cpu 1; done
    at the same time.
    (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

    Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
    this on it right now in gnome-terminal and things are moving just fine.

    Also this is running with most of the debug features enabled (lockdep,
    mutex, etc) no BUG_ONs or lockdep complaints so far.

    I believe I addressed all of Dmitry's comments for the original Linus'
    version. I changed both fair and rt balancer to mask out non-active cpus.
    And replaced cpu_is_offline() with !cpu_active() in the main scheduler
    code where it made sense (to me).

    Signed-off-by: Max Krasnyanskiy
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    Cc: dmitry.adamushko@gmail.com
    Cc: pj@sgi.com
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     

28 Apr, 2008

1 commit

  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
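
    The new entry point simply threads an optional nodemask down into the
    zonelist walk; its remembered signature and the gist of the filtering
    (both to be taken as a sketch):

        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                            struct zonelist *zonelist,
                                            nodemask_t *nodemask);

        /* inside the zonelist walk: skip zones whose node is not in the
         * policy's nodemask; a NULL nodemask means "no extra filtering" */
        if (nodemask && !node_isset(zone_to_nid(zone), *nodemask))
                continue;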

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2008

1 commit

  • * Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value.

    * Use new set_cpus_allowed_ptr function.

    * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     

12 Feb, 2008

1 commit

  • KOSAKI Motohiro noted that "numactl --interleave=all ..." failed in the
    presence of memoryless nodes. This patch attempts to fix that problem.

    Some background:

    numactl --interleave=all calls set_mempolicy(2) with a fully populated
    [out to MAXNUMNODES] nodemask. set_mempolicy() [in do_set_mempolicy()]
    calls contextualize_policy() which requires that the nodemask be a
    subset of the current task's mems_allowed; else EINVAL will be returned.

    A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
    i.e., nodes with memory. So, a fully populated nodemask will be
    declared invalid if it includes memoryless nodes.

    NOTE: the same thing will occur when running in a cpuset
    with a restricted mems_allowed--for the same reason:
    the node mask contains disallowed nodes.

    mbind(2), on the other hand, just masks off any nodes in the nodemask
    that are not included in the caller's mems_allowed.

    In each case [mbind() and set_mempolicy()], mpol_check_policy() will
    complain [again, resulting in EINVAL] if the nodemask contains any
    memoryless nodes. This is somewhat redundant as mpol_new() will remove
    memoryless nodes for interleave policy, as will bind_zonelist()--called
    by mpol_new() for BIND policy.

    Proposed fix:

    1) modify contextualize_policy logic to:
    a) remember whether the incoming node mask is empty.
    b) if not, restrict the nodemask to allowed nodes, as is
    currently done in-line for mbind(). This guarantees
    that the resulting mask includes only nodes with memory.

    NOTE: this is a [benign, IMO] change in behavior for
    set_mempolicy(). Dis-allowed nodes will be
    silently ignored, rather than returning an error.

    c) fold this code into mpol_check_policy(), replace 2 calls to
    contextualize_policy() to call mpol_check_policy() directly
    and remove contextualize_policy().

    2) In existing mpol_check_policy() logic, after "contextualization":
    a) MPOL_DEFAULT: require that the incoming mask "was_empty"
    b) MPOL_{BIND|INTERLEAVE}: require that the contextualized nodemask
    contains at least one node.
    c) add a case for MPOL_PREFERRED: if the incoming mask was not empty
    and the resulting mask IS empty, the user specified invalid nodes.
    Return EINVAL.
    d) remove the now redundant check for memoryless nodes

    3) remove the now redundant masking of policy nodes for interleave
    policy from mpol_new().

    4) Now that mpol_check_policy() contextualizes the nodemask, remove
    the in-line nodes_and() from sys_mbind(). I believe that this
    restores mbind() to the behavior before the memoryless-nodes
    patch series. E.g., we'll no longer treat an invalid nodemask
    with MPOL_PREFERRED as local allocation.

    [ Patch history:

    v1 -> v2:
    - Communicate whether or not incoming node mask was empty to
    mpol_check_policy() for better error checking.
    - As suggested by David Rientjes, remove the now unused
    cpuset_nodes_subset_current_mems_allowed() from cpuset.h

    v2 -> v3:
    - As suggested by KOSAKI Motohiro, fold the "contextualization"
    of policy nodemask into mpol_check_policy(). Looks a little
    cleaner. ]

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

09 Feb, 2008

1 commit

  • Currently we possibly look up the pid in the wrong pid namespace. So
    convert proc_pid_status to seq_file, which ensures the proper pid
    namespace is passed in.
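
    The conversion follows the standard seq_file pattern, where the open
    routine captures the relevant pid namespace and the show routine prints
    pids relative to it; a generic sketch (names hypothetical, not the actual
    proc_pid_status code):

        static int example_status_show(struct seq_file *m, void *v)
        {
                struct pid_namespace *ns = m->private;  /* stashed at open time */

                /* print the pid as seen from that namespace */
                seq_printf(m, "Pid:\t%d\n", pid_nr_ns(task_pid(current), ns));
                return 0;
        }

        static int example_status_open(struct inode *inode, struct file *file)
        {
                /* assumption: procfs keeps its pid namespace in s_fs_info */
                return single_open(file, example_status_show,
                                   inode->i_sb->s_fs_info);
        }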

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: s390 build fix]
    [akpm@linux-foundation.org: fix task_name() output]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morgan
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

20 Oct, 2007

2 commits

  • When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
    been running on that cpu.

    Currently, such a task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any cpu which is both online and among that task's cpus_allowed

    It is typical of a multithreaded application running on a large NUMA system to
    have its tasks confined to a cpuset so as to cluster them near the memory that
    they share. Furthermore, it is typical to explicitly place such a task on a
    specific cpu in that cpuset. And in that case the task's cpus_allowed
    includes only a single cpu.

    This patch would insert a preference to migrate such a task to some cpu within
    its cpuset (and set its cpus_allowed to its entire cpuset).

    With this patch, migrate the task to:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any online cpu within the task's cpuset
    3) to any cpu which is both online and among that task's cpus_allowed

    In order to do this, move_task_off_dead_cpu() must make a call to
    cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
    not block. (name change - per Oleg's suggestion)

    Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
    the cpuset mutex during the whole migrate_live_tasks() and
    migrate_dead_tasks() procedure.

    [akpm@linux-foundation.org: build fix]
    [pj@sgi.com: Fix indentation and spacing]
    Signed-off-by: Cliff Wickman
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

2 commits

  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better to
    simply test for intersection between the tasks' mems_allowed in their task
    descriptors. This does not require taking callback_mutex since it is only
    used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.
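
    In the badness scoring this reduces to a plain nodemask intersection test
    between the two task structs; sketched below (the /= 8 discount mirrors
    the long-standing cpuset heuristic there, quoted from memory):

        /* no callback_mutex needed: this is only a scoring hint */
        if (!nodes_intersects(p->mems_allowed, current->mems_allowed))
                points /= 8;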

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in a continuous series of oom-kills and
    an apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return an error if the admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Feb, 2007

1 commit

  • Many struct file_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.
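
    The change is purely declarative; the typical pattern looks like this
    (hypothetical example table, not one of the converted instances):

        static int example_show(struct seq_file *m, void *v)
        {
                seq_puts(m, "hello\n");
                return 0;
        }

        static int example_open(struct inode *inode, struct file *file)
        {
                return single_open(file, example_show, NULL);
        }

        /* const: the table lands in .rodata, is shared read-only, and any
         * accidental write is caught at compile time */
        static const struct file_operations example_fops = {
                .open    = example_open,
                .read    = seq_read,
                .llseek  = seq_lseek,
                .release = single_release,
        };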

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven