09 Jun, 2010

1 commit

  • Currently, when a cpu goes down, cpu_active is cleared before
    CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
    default priority cpu notifier. When a cpu is coming up, it's set
    before CPU_ONLINE but cpuset configuration again is updated from the
    same cpu notifier.

    For cpu notifiers, this presents an inconsistent state. Threads which
    a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
    migrated to other cpus because the cpu is no longer active.

    Fix it by updating cpu_active in the highest priority cpu notifier and
    cpuset configuration in the second highest when a cpu is coming up.
    Down path is updated similarly. This guarantees that all other cpu
    notifiers see consistent cpu_active and cpuset configuration.

    cpuset_track_online_cpus() notifier is converted to
    cpuset_update_active_cpus() which just updates the configuration and
    is now called from cpuset_cpu_[in]active() notifiers registered from
    sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
    degenerates into partition_sched_domains(), making a separate notifier
    for !CONFIG_CPUSETS unnecessary.
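
    A minimal sketch of the intended registration, assuming illustrative
    priority values and callback names (not necessarily those used by the
    actual patch):

        #include <linux/cpu.h>
        #include <linux/cpuset.h>
        #include <linux/init.h>
        #include <linux/notifier.h>

        /* Higher priority runs earlier on CPU_ONLINE, so cpu_active is
         * flipped before the cpuset/sched-domain update looks at it. */
        #define PRI_SCHED_ACTIVE   20
        #define PRI_CPUSET_ACTIVE  10

        static int sched_cpu_active(struct notifier_block *nb,
                                    unsigned long action, void *hcpu)
        {
                unsigned int cpu = (unsigned long)hcpu;

                if ((action & ~CPU_TASKS_FROZEN) != CPU_ONLINE)
                        return NOTIFY_DONE;
                set_cpu_active(cpu, true);      /* highest priority: runs first */
                return NOTIFY_OK;
        }

        static int cpuset_cpu_active(struct notifier_block *nb,
                                     unsigned long action, void *hcpu)
        {
                if ((action & ~CPU_TASKS_FROZEN) != CPU_ONLINE)
                        return NOTIFY_DONE;
                cpuset_update_active_cpus();    /* sees cpu_active already set */
                return NOTIFY_OK;
        }

        void __init sched_init_smp(void)
        {
                /* ... */
                hotcpu_notifier(sched_cpu_active, PRI_SCHED_ACTIVE);
                hotcpu_notifier(cpuset_cpu_active, PRI_CPUSET_ACTIVE);
        }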

    This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
    callback creates a kthread and kthread_bind()s it to the target cpu,
    and the thread is expected to run on that cpu.

    * Ingo's test discovered __cpuinit/exit markups were incorrect.
    Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Ingo Molnar
    Cc: Paul Menage

    Tejun Heo
     

28 May, 2010

1 commit

  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test seems to confirm this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.
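
    A rough sketch of the change being described, assuming the second rotor is
    a new per-task field next to the existing one (field and helper names are
    illustrative):

        #include <linux/nodemask.h>
        #include <linux/sched.h>

        /* Advance a rotor to the next allowed node, wrapping around. */
        static int cpuset_spread_node(short *rotor)
        {
                int node;

                node = next_node(*rotor, current->mems_allowed);
                if (node == MAX_NUMNODES)
                        node = first_node(current->mems_allowed);
                *rotor = node;
                return node;
        }

        int cpuset_mem_spread_node(void)        /* page cache pages */
        {
                return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
        }

        int cpuset_slab_spread_node(void)       /* slab objects, e.g. buffer_heads */
        {
                return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
        }

    With separate rotors, writing a file no longer advances the page-cache
    rotor on behalf of the buffer_head slab allocation, so data pages land on
    consecutive nodes again.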

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node from which to allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes on which the allocator can allocate pages, for example:

    (mpol: mempolicy)
    task1                       task1's mpol   task2
    alloc page                  1
      alloc on node0? NO        1
                                1              change mems from 1 to 0
                                1              rebind task1's mpol
                                0-1              set new bits
                                0                clear disallowed bits
      alloc on node1? NO        0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the nodes range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). So we use a variable to tell the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes after the read-side task ends its current memory
    allocation.
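
    One way to realize the handshake described above, as a hedged sketch; the
    counter field on task_struct and the busy-wait on the write side are
    illustrative, not necessarily how the actual patch defers the shrink:

        #include <linux/nodemask.h>
        #include <linux/sched.h>

        /* Read side, around an allocation that walks current->mems_allowed. */
        static inline void get_mems_allowed(void)
        {
                current->mems_allowed_change_disable++;   /* assumed new field */
                smp_mb();       /* count increment before the nodemask reads  */
        }

        static inline void put_mems_allowed(void)
        {
                smp_mb();       /* nodemask reads before the count decrement  */
                current->mems_allowed_change_disable--;
        }

        /* Write side (cpuset update of another task): grow, then shrink lazily. */
        static void change_task_nodemask(struct task_struct *tsk,
                                         nodemask_t *newmems)
        {
                /* set newly allowed bits first, so the mask never goes empty */
                nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
                smp_mb();
                while (ACCESS_ONCE(tsk->mems_allowed_change_disable))
                        cpu_relax();            /* wait out the current allocation */
                tsk->mems_allowed = *newmems;   /* now the old bits go away */
        }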

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

03 Apr, 2010

2 commits

  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change makes the code much more correct compared to the deadlocks we
    currently have.
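
    A hedged sketch of the kind of call site this enables; the surrounding
    helper is illustrative, not the actual scheduler code:

        #include <linux/cpumask.h>
        #include <linux/cpuset.h>
        #include <linux/sched.h>

        /* Pick some usable cpu for a task whose ->cpus_allowed currently
         * contains no online cpu.  Safe from any context: no cpuset locks,
         * no task_lock(). */
        static int pick_fallback_cpu(struct task_struct *p)
        {
                int cpu;

                cpuset_cpus_allowed_fallback(p);        /* widen p->cpus_allowed */
                cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
                if (cpu >= nr_cpu_ids)                  /* still nothing usable */
                        cpu = cpumask_any(cpu_active_mask);
                return cpu;
        }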

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever because CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

17 Jun, 2009

1 commit

  • Fix allocating page cache/slab object on the unallowed node when memory
    spread is set by updating tasks' mems_allowed after its cpuset's mems is
    changed.

    In order to update tasks' mems_allowed in time, we must modify the memory
    policy code, because memory policy was originally applied in the process's
    own context. After applying this patch, one task directly manipulates
    another's mems_allowed, and we use alloc_lock in the task_struct to protect
    the task's mems_allowed and memory policy.

    But in the fast path, we don't take the lock to protect them, because adding
    a lock may lead to a performance regression. Without the lock, though, the
    task might see no nodes at all when its cpuset's mems_allowed is changed to
    some non-overlapping set. In order to avoid this, we set all the newly
    allowed nodes first, then clear the newly disallowed ones.

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the 'RELATIVE_NODES' flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

09 Jan, 2009

1 commit

  • Impact: cleanups, use new cpumask API

    Final trivial cleanups: mainly s/cpumask_t/struct cpumask

    Note there is a FIXME in generate_sched_domains(). A future patch will
    change struct cpumask *doms to struct cpumask *doms[].
    (I suppose Rusty will do this.)

    Signed-off-by: Li Zefan
    Cc: Ingo Molnar
    Cc: Rusty Russell
    Acked-by: Mike Travis
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

07 Jan, 2009

1 commit

  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Nov, 2008

1 commit

  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changes. That is wrong because
    N_ONLINE just means the node has a pgdat; when a node has (or gains) memory,
    we use N_HIGH_MEMORY. So we should invoke the update function after
    node_states[N_HIGH_MEMORY] changes, just as its commit message says.

    This patch fixes it. And we use notifier of memory hotplug instead of
    direct calling of cpuset_track_online_nodes().
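
    A minimal sketch of the registration being described; the refresh helper
    here is hypothetical:

        #include <linux/init.h>
        #include <linux/memory.h>
        #include <linux/notifier.h>

        extern void top_cpuset_refresh_mems(void);      /* hypothetical helper */

        static int cpuset_track_online_nodes(struct notifier_block *self,
                                             unsigned long action, void *arg)
        {
                /* refresh top_cpuset.mems_allowed from node_states[N_HIGH_MEMORY] */
                top_cpuset_refresh_mems();
                return NOTIFY_OK;
        }

        void __init cpuset_init_smp(void)
        {
                /* ... existing cpu hotplug registration ... */
                hotplug_memory_notifier(cpuset_track_online_nodes, 10);
        }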

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     

07 Sep, 2008

1 commit

  • What I realized recently is that calling rebuild_sched_domains() in
    arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
    The partition_sched_domains() code tries to avoid unnecessary domain rebuilds
    and will not actually rebuild anything if the new domain masks match the old
    ones.

    What this means is that doing
    echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
    on a system with cpusets enabled will not take effect until something changes
    in the cpuset setup (i.e. new sets created or deleted).

    This patch restores the correct behaviour: domains must be rebuilt in
    order to enable the MC powersaving flags.

    Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
    Also tested on a dual-core Core2 laptop. Lockdep is happy and things are
    working as expected.

    Signed-off-by: Max Krasnyansky
    Tested-by: Vaidyanathan Srinivasan
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     

18 Jul, 2008

1 commit

  • This is based on Linus' idea of creating cpu_active_map that prevents
    scheduler load balancer from migrating tasks to the cpu that is going
    down.

    It allows us to simplify the domain management code and avoid unnecessary
    domain rebuilds during cpu hotplug event handling.

    Please ignore the cpusets part for now. It needs some more work in order
    to avoid crazy lock nesting, although I did simplify and unify the domain
    reinitialization logic. We now simply call partition_sched_domains() in
    all the cases. This means that we're using the exact same code paths as in
    the cpusets case and hence the tests below cover cpusets too.
    Cpuset changes to make rebuild_sched_domains() callable from various
    contexts are in a separate patch (right next after this one).

    This not only boots but also easily handles
    while true; do make clean; make -j 8; done
    and
    while true; do on-off-cpu 1; done
    at the same time.
    (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

    Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
    this right now in gnome-terminal and things are moving just fine.

    Also this is running with most of the debug features enabled (lockdep,
    mutex, etc); no BUG_ONs or lockdep complaints so far.

    I believe I addressed all of Dmitry's comments on the original Linus
    version. I changed both the fair and rt balancers to mask out non-active cpus.
    And replaced cpu_is_offline() with !cpu_active() in the main scheduler
    code where it made sense (to me).

    Signed-off-by: Max Krasnyanskiy
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    Cc: dmitry.adamushko@gmail.com
    Cc: pj@sgi.com
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
     

28 Apr, 2008

1 commit

  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
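
    A hedged sketch of the resulting allocation call, assuming the
    nodemask-aware entry point added by this series; the wrapper itself is
    illustrative:

        #include <linux/gfp.h>
        #include <linux/nodemask.h>
        #include <linux/topology.h>

        /* Allocate an order-0 page constrained to 'allowed', walking the local
         * node's distance-ordered zonelist and skipping zones on other nodes,
         * instead of building a custom MPOL_BIND zonelist. */
        static struct page *alloc_bound_page(gfp_t gfp, nodemask_t *allowed)
        {
                return __alloc_pages_nodemask(gfp, 0,
                                node_zonelist(numa_node_id(), gfp), allowed);
        }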

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2008

1 commit

  • * Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value.

    * Use new set_cpus_allowed_ptr function.

    * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     

12 Feb, 2008

1 commit

  • KOSAKI Motohiro noted that "numactl --interleave=all ..." failed in the
    presence of memoryless nodes. This patch attempts to fix that problem.

    Some background:

    numactl --interleave=all calls set_mempolicy(2) with a fully populated
    [out to MAXNUMNODES] nodemask. set_mempolicy() [in do_set_mempolicy()]
    calls contextualize_policy() which requires that the nodemask be a
    subset of the current task's mems_allowed; else EINVAL will be returned.

    A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
    i.e., nodes with memory. So, a fully populated nodemask will be
    declared invalid if it includes memoryless nodes.

    NOTE: the same thing will occur when running in a cpuset
    with restricted mems_allowed--for the same reason:
    the node mask contains disallowed nodes.

    mbind(2), on the other hand, just masks off any nodes in the nodemask
    that are not included in the caller's mems_allowed.

    In each case [mbind() and set_mempolicy()], mpol_check_policy() will
    complain [again, resulting in EINVAL] if the nodemask contains any
    memoryless nodes. This is somewhat redundant as mpol_new() will remove
    memoryless nodes for interleave policy, as will bind_zonelist()--called
    by mpol_new() for BIND policy.

    Proposed fix:

    1) modify contextualize_policy logic to:
    a) remember whether the incoming node mask is empty.
    b) if not, restrict the nodemask to allowed nodes, as is
    currently done in-line for mbind(). This guarantees
    that the resulting mask includes only nodes with memory.

    NOTE: this is a [benign, IMO] change in behavior for
    set_mempolicy(). Dis-allowed nodes will be
    silently ignored, rather than returning an error.

    c) fold this code into mpol_check_policy(), replace the 2 calls to
    contextualize_policy() with direct calls to mpol_check_policy(),
    and remove contextualize_policy().

    2) In existing mpol_check_policy() logic, after "contextualization":
    a) MPOL_DEFAULT: require that the incoming mask "was_empty"
    b) MPOL_{BIND|INTERLEAVE}: require that the contextualized nodemask
    contains at least one node.
    c) add a case for MPOL_PREFERRED: if the incoming mask was not empty
    and the resulting mask IS empty, the user specified invalid nodes.
    Return EINVAL.
    d) remove the now redundant check for memoryless nodes

    3) remove the now redundant masking of policy nodes for interleave
    policy from mpol_new().

    4) Now that mpol_check_policy() contextualizes the nodemask, remove
    the in-line nodes_and() from sys_mbind(). I believe that this
    restores mbind() to the behavior before the memoryless-nodes
    patch series. E.g., we'll no longer treat an invalid nodemask
    with MPOL_PREFERRED as local allocation.

    [ Patch history:

    v1 -> v2:
    - Communicate whether or not incoming node mask was empty to
    mpol_check_policy() for better error checking.
    - As suggested by David Rientjes, remove the now unused
    cpuset_nodes_subset_current_mems_allowed() from cpuset.h

    v2 -> v3:
    - As suggested by KOSAKI Motohiro, fold the "contextualization"
    of policy nodemask into mpol_check_policy(). Looks a little
    cleaner. ]

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

09 Feb, 2008

1 commit

  • Currently we possibly look up the pid in the wrong pid namespace. So
    convert proc_pid_status to seq_file, which ensures the proper pid
    namespace is passed in.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: s390 build fix]
    [akpm@linux-foundation.org: fix task_name() output]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morgan
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

20 Oct, 2007

2 commits

  • When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
    been running on that cpu.

    Currently, such a task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any cpu which is both online and among that task's cpus_allowed

    It is typical of a multithreaded application running on a large NUMA system to
    have its tasks confined to a cpuset so as to cluster them near the memory that
    they share. Furthermore, it is typical to explicitly place such a task on a
    specific cpu in that cpuset. And in that case the task's cpus_allowed
    includes only a single cpu.

    This patch inserts a preference to migrate such a task to some cpu within
    its cpuset (and sets its cpus_allowed to its entire cpuset).

    With this patch, the task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any online cpu within the task's cpuset
    3) to any cpu which is both online and among that task's cpus_allowed

    In order to do this, move_task_off_dead_cpu() must make a call to
    cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
    not block. (name change - per Oleg's suggestion)

    Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
    the cpuset mutex during the whole migrate_live_tasks() and
    migrate_dead_tasks() procedure.

    [akpm@linux-foundation.org: build fix]
    [pj@sgi.com: Fix indentation and spacing]
    Signed-off-by: Cliff Wickman
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

2 commits

  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better to
    simply test for intersection between the task's mems_allowed in their task
    descriptors. This does not require taking callback_mutex since it is only
    used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.
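
    A minimal sketch of the hint as described, assuming it sits in the badness
    scoring path:

        #include <linux/nodemask.h>
        #include <linux/sched.h>

        /* Advisory only: no callback_mutex is taken, and a task whose
         * mems_allowed no longer intersects ours may still hold memory it
         * allocated under an earlier mask. */
        static int candidate_mems_disjoint(struct task_struct *p)
        {
                return !nodes_intersects(p->mems_allowed, current->mems_allowed);
        }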

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in a continuous series of oom-kills and
    an apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return error if admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless-nodes.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Feb, 2007

1 commit

  • Many struct file_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

31 Dec, 2006

1 commit

  • fs/proc/base.c:1869: warning: initialization discards qualifiers from pointer target type
    fs/proc/base.c:2150: warning: initialization discards qualifiers from pointer target type

    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Dec, 2006

1 commit

  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt or that
    __GFP_THISNODE is set or that you're on a one cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.
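
    A hedged sketch of what an explicit caller looks like after this change;
    the wrapper function is illustrative:

        #include <linux/cpuset.h>
        #include <linux/gfp.h>
        #include <linux/hardirq.h>

        static int zone_ok_for_this_alloc(struct zone *zone, gfp_t gfp_mask)
        {
                if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
                        /* may not sleep: only the non-blocking check is safe */
                        return cpuset_zone_allowed_hardwall(zone, gfp_mask);

                /* may sleep: the softwall check can take callback_mutex */
                return cpuset_zone_allowed_softwall(zone, gfp_mask);
        }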

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but seems worth it, to reduce this chance of hitting a
    sleeping with irq off complaint again.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

2 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Optimize the critical zonelist scanning for free pages in the kernel memory
    allocator by caching the zones that were found to be full recently, and
    skipping them.

    It remembers the zones in a zonelist that were short of free memory in the
    last second, and it stashes a zone-to-node table in the zonelist struct,
    to optimize that conversion (minimize its cache footprint.)
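
    A hedged sketch of the cached state being described; field and constant
    names are illustrative:

        #include <linux/bitmap.h>
        #include <linux/mmzone.h>

        /* Per-zonelist cache: which zones were recently found full, plus a
         * small zone-index -> node-id table so the conversion stays cheap. */
        struct zonelist_cache {
                unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];      /* zone -> node */
                DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);  /* zones full   */
                unsigned long last_full_zap;  /* jiffies when fullzones was cleared */
        };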

    Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago. Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last week's patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results. After all that
    discussion with David on why I didn't need them, I went and got
    some ;). I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably. At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real'). Good. See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct. That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

    There are two reasons I pursued this alternative, over some earlier
    proposals that would have focused on optimizing the fake numa
    emulation case by caching the last useful zone:

    1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use. Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past. So this is not guaranteed. Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

    2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

    See comment in the patch below for a description of what it actually does.

    Technical details of note (or controversy):

    - See the use of "zlc_active" and "did_zlc_setup" below, to delay
    adding any work for this new mechanism until we've looked at the
    first zone in zonelist. I figured the odds of the first zone
    having the memory we needed were high enough that we should just
    look there, first, then get fancy only if we need to keep looking.

    - Some odd hackery was needed to add items to struct zonelist, while
    not tripping up the custom zonelists built by the mm/mempolicy.c
    code for MPOL_BIND. My usual wordy comments below explain this.
    Search for "MPOL_BIND".

    - Some per-node data in the struct zonelist is now modified frequently,
    with no locking. Multiple CPU cores on a node could hit and mangle
    this data. The theory is that this is just performance hint data,
    and the memory allocator will work just fine despite any such mangling.
    The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
    (a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
    all be self correcting after at most a one second delay.

    - This still does a linear scan of the same lengths as before. All
    I've optimized is making the scan faster, not algorithmically
    shorter. It is now able to scan a compact array of 'unsigned
    short' in the case of many full nodes, so one cache line should
    cover quite a few nodes, rather than each node hitting another
    one or two new and distinct cache lines.

    - If both Andi and Nick don't find this too complicated, I will be
    (pleasantly) flabbergasted.

    - I removed the comment claiming we only use one cacheline's worth of
    zonelist. We seem, at least in the fake numa case, to have put the
    lie to that claim.

    - I pay no attention to the various watermarks and such in this performance
    hint. A node could be marked full for one watermark, and then skipped
    over when searching for a page using a different watermark. I think
    that's actually quite ok, as it will tend to slightly increase the
    spreading of memory over other nodes, away from a memory stressed node.

    ===============

    Performance - some benchmark results and analysis:

    This benchmark runs a memory hog program that uses multiple
    threads to touch a lot of memory as quickly as it can.

    Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
    the total 96 GBytes on the system, and using 1, 19, 37, or 55
    threads (on a 56 CPU system.) System, user and real (elapsed)
    timings were recorded for each run, shown in units of seconds,
    in the table below.

    Two kernels were tested - 2.6.18-mm3 and the same kernel with
    this zonelist caching patch added. The table also shows the
    percentage improvement the zonelist caching sys time is over
    (lower than) the stock *-mm kernel.

    number                 2.6.18-mm3          zonelist-cache       delta (< 0 good)   percent
    GBs    N             ------------          --------------       ----------------   systime
    mem  threads          sys user real         sys user real        sys  user  real    better
     12     1             153   24  177         151   24  176         -2     0    -1      1%
     12    19              99   22    8          99   22    8          0     0     0      0%
     12    37             111   25    6         112   25    6          1     0     0     -0%
     12    55             115   25    5         110   23    5         -5    -2     0      4%
     38     1             502   74  576         497   73  570         -5    -1    -6      0%
     38    19             426   78   48         373   76   39        -53    -2    -9     12%
     38    37             544   83   36         547   82   36          3    -1     0     -0%
     38    55             501   77   23         511   80   24         10     3     1     -1%
     64     1             917  125 1042         890  124 1014        -27    -1   -28      2%
     64    19            1118  138  119         965  141  103       -153     3   -16     13%
     64    37            1202  151   94        1136  150   81        -66    -1   -13      5%
     64    55            1118  141   61        1072  140   58        -46    -1    -3      4%
     90     1            1342  177 1519        1275  174 1450        -67    -3   -69      4%
     90    19            2392  199  192        2116  189  176       -276   -10   -16     11%
     90    37            3313  238  175        2972  225  145       -341   -13   -30     10%
     90    55            1948  210  104        1843  213  100       -105     3    -4      5%

    Notes:
    1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

    2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass. The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass. For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

    3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

    4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

    5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest. I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory a long ways away. Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- let's hope not.

    6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect. Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallelizable workload.

    7) The intermediate thread count passes, when asking for a lot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

    Conclusions:
    * This zonelist cache patch probably makes little difference one way or the
    other for most workloads on real numa hardware, if those workloads avoid
    heavy off node allocations.
    * For memory intensive workloads requiring substantial off-node allocations
    on real numa hardware, this patch improves both kernel and elapsed timings
    up to ten percent.
    * For fake numa systems, I'm optimistic, but will have to leave that up to
    Rohit Seth to actually test (once I get him a 2.6.18 backport.)

    Signed-off-by: Paul Jackson
    Cc: Rohit Seth
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

30 Sep, 2006

1 commit

  • Change the list of memory nodes allowed to tasks in the top (root) cpuset
    to dynamically track which memory nodes are online, using a call to a
    cpuset hook from the memory hotplug code. Make this top mems file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support memory hotplug, then these tasks cannot make
    use of memory nodes that are added after system boot, because the memory
    nodes are not allowed in the top cpuset. This is a surprising regression
    over earlier kernels that didn't have cpusets enabled.

    One key motivation for this change is to remain consistent with the
    behaviour for the top_cpuset's 'cpus', which is also read-only, and which
    automatically tracks the cpu_online_map.

    This change also has the minor benefit that it fixes a long standing,
    little noticed, minor bug in cpusets. The cpuset performance tweak to
    short circuit the cpuset_zone_allowed() check on systems with just a single
    cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
    changing the 'mems' of the top_cpuset had no effect, even though the change
    (the write system call) appeared to succeed. With the following change,
    that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
    refuses to be changed via user space writes. Thus no one should be misled
    into thinking they've changed the top_cpuset's 'mems' when in fact they
    haven't.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'mems' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of node_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged memory nodes
    allowed by their cpuset.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Paul Jackson
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

24 Mar, 2006

1 commit

  • This patch provides the implementation and cpuset interface for an alternative
    memory allocation policy that can be applied to certain kinds of memory
    allocations, such as the page cache (file system buffers) and some slab caches
    (such as inode caches).

    The policy is called "memory spreading." If enabled, it spreads out these
    kinds of memory allocations over all the nodes allowed to a task, instead of
    preferring to place them on the node where the task is executing.

    All other kinds of allocations, including anonymous pages for a task's stack
    and data regions, are not affected by this policy choice, and continue to be
    allocated preferring the node local to execution, as modified by the NUMA
    mempolicy.

    There are two boolean flag files per cpuset that control where the kernel
    allocates pages for the file system buffers and related in kernel data
    structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

    If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
    kernel will spread the file system buffers (page cache) evenly over all the
    nodes that the faulting task is allowed to use, instead of preferring to put
    those pages on the node where the task is running.

    If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
    kernel will spread some file system related slab caches, such as for inodes
    and dentries evenly over all the nodes that the faulting task is allowed to
    use, instead of preferring to put those pages on the node where the task is
    running.

    The implementation is simple. Setting the cpuset flags 'memory_spread_page'
    or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
    PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
    subsequently joins that cpuset. In subsequent patches, the page allocation
    calls for the affected page cache and slab caches are modified to perform an
    inline check for these flags, and if set, a call to a new routine
    cpuset_mem_spread_node() returns the node to prefer for the allocation.

    The cpuset_mem_spread_node() routine is also simple. It uses the value of a
    per-task rotor cpuset_mem_spread_rotor to select the next node in the current
    task's mems_allowed to prefer for the allocation.
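
    A hedged sketch of the inline check in a page cache allocation path; the
    exact call sites may differ:

        #include <linux/cpuset.h>
        #include <linux/gfp.h>

        static struct page *page_cache_alloc_spread(gfp_t gfp)
        {
                if (cpuset_do_page_mem_spread()) {          /* tests PF_SPREAD_PAGE  */
                        int nid = cpuset_mem_spread_node(); /* rotor picks the node  */
                        return alloc_pages_node(nid, gfp, 0);
                }
                return alloc_pages(gfp, 0);
        }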

    This policy can provide substantial improvements for jobs that need to place
    thread local data on the corresponding node, but that need to access large
    file system data sets that need to be spread across the several nodes in the
    job's cpuset in order to fit. Without this patch, especially for jobs that
    might have one thread reading in the data set, the memory allocation across
    the nodes in the job's cpuset can become very uneven.

    A couple of Copyright year ranges are updated as well. And a couple of email
    addresses that can be found in the MAINTAINERS file are removed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

15 Jan, 2006

1 commit

  • The problem, reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=5859

    and by various other email messages and lkml posts is that the cpuset hook
    in the oom (out of memory) code can try to take a cpuset semaphore while
    holding the tasklist_lock (a spinlock).

    One must not sleep while holding a spinlock.

    The fix seems easy enough - move the cpuset semaphore region outside the
    tasklist_lock region.

    This required a few lines of mechanism to implement. The oom code where
    the locking needs to be changed does not have access to the cpuset locks,
    which are internal to kernel/cpuset.c only. So I provided a couple more
    cpuset interface routines, available to the rest of the kernel, which
    simply take and drop the lock needed here (cpuset's callback_sem).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

09 Jan, 2006

6 commits

  • Remove a couple of more lines of code from the cpuset hooks in the page
    allocation code path.

    There was a check for a NULL cpuset pointer in the routine
    cpuset_update_task_memory_state() that was only needed during system boot,
    after the memory subsystem was initialized, before the cpuset subsystem was
    initialized, to catch a NULL task->cpuset pointer.

    Add a cpuset_init_early() routine, just before the mem_init() call in
    init/main.c, that sets up just enough of the init task's cpuset structure to
    render cpuset_update_task_memory_state() calls harmless.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Easy little optimization hack to avoid actually having to call
    cpuset_zone_allowed() and check mems_allowed, in the main page allocation
    routine, __alloc_pages(). This saves several CPU cycles per page allocation
    on systems not using cpusets.

    A counter is updated each time a cpuset is created or removed, and whenever
    there is only one cpuset in the system, it must be the root cpuset, which
    contains all CPUs and all Memory Nodes. In that case, when the counter is
    one, all allocations are allowed.
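
    A minimal sketch of the short circuit, assuming a wrapper in the cpuset
    header; names are illustrative:

        #include <linux/gfp.h>

        extern int number_of_cpusets;   /* bumped on cpuset create/remove */
        extern int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);

        static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
        {
                if (number_of_cpusets <= 1)     /* only the all-allowing root */
                        return 1;
                return __cpuset_zone_allowed(z, gfp_mask);
        }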

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needed, to obtain the mems_allowed vector of a cpuset, and replace the
    workaround in sys_migrate_pages() with a call to this new method.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took the task lock, got the task's cpuset
    pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    that the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that are
    trying to use more memory than allowed on the nodes assigned them, and with
    tightly coupled, long running, massively parallel scientific computing jobs
    that will dramatically fail to meet required performance goals if they
    start to use more memory than allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.
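
    A hedged sketch of the per-cpuset filter state implied above (a spinlock
    plus three words of data); names and comments are illustrative:

        #include <linux/spinlock.h>
        #include <linux/time.h>

        /* Frequency meter: turns a stream of reclaim events into a decaying
         * rate, read back as events per second times 1000, ~10 s half-life. */
        struct fmeter {
                int cnt;          /* events noted since the last recomputation */
                int val;          /* filter output: reclaims/sec * 1000        */
                time_t time;      /* seconds timestamp of the last update      */
                spinlock_t lock;  /* guards the three fields above             */
        };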

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSETS, returns (1).
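
    A minimal sketch of the replacement macro, assuming the usual config
    guard:

        #include <linux/nodemask.h>
        #include <linux/sched.h>

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current_mems_allowed(nodes) \
                nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
        #endif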

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc point of view we replaced unsigned int with
    typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

08 Sep, 2005

2 commits

  • Now the real motivation for this cpuset mem_exclusive patch series seems
    trivial.

    This patch keeps a task in or under one mem_exclusive cpuset from provoking an
    oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
    interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
    containment, there is little to gain from oom killing a task under a
    non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
    allocation must come from disjoint memory nodes.

    This patch enables configuring a system so that a runaway job under one
    mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
    that might be using very high compute and memory resources for a prolonged
    time.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current tasks cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset (same or within) of the previous.

    Layer (2) above is new, with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed. The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.
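
    A hedged sketch of the layer (2) lookup described above; the helper names
    follow the description, the task->cpuset link is as it was at the time,
    and locking (callback_sem) is elided:

        #include <linux/gfp.h>
        #include <linux/hardirq.h>
        #include <linux/mmzone.h>
        #include <linux/nodemask.h>
        #include <linux/sched.h>

        /* Walk up to the nearest mem_exclusive ancestor; the top cpuset is
         * mem_exclusive, so the loop terminates. */
        static const struct cpuset *
        nearest_exclusive_ancestor(const struct cpuset *cs)
        {
                while (!is_mem_exclusive(cs))
                        cs = cs->parent;
                return cs;
        }

        int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
        {
                int node = z->zone_pgdat->node_id;

                if (in_interrupt())
                        return 1;                          /* layer (1) */
                if (node_isset(node, current->mems_allowed))
                        return 1;                          /* layer (3) */
                if (gfp_mask & __GFP_HARDWALL)             /* GFP_USER stops here */
                        return 0;
                /* layer (2): nearest enclosing mem_exclusive cpuset */
                return node_isset(node,
                        nearest_exclusive_ancestor(current->cpuset)->mems_allowed);
        }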

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

17 Apr, 2005

1 commit

  • gcc-4 warns with
    include/linux/cpuset.h:21: warning: type qualifiers ignored on function
    return type

    cpuset_cpus_allowed is declared with const
    extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);

    First const should be __attribute__((const)), but the gcc manual
    explains that:

    "Note that a function that has pointer arguments and examines the data
    pointed to must not be declared const. Likewise, a function that calls a
    non-const function usually must not be const. It does not make sense for
    a const function to return void."

    The following patch removes const from the function declaration.

    Signed-off-by: Benoit Boissinot
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benoit Boissinot