20 Oct, 2008

1 commit

  • Describe why we need the freezer subsystem and how to use it in a
    documentation file. Since the cgroups.txt file is focused on the
    subsystem-agnostic portions of cgroups make a directory and move the old
    cgroups.txt file at the same time.

    Signed-off-by: Matt Helsley
    Cc: Paul Menage
    Cc: containers@lists.linux-foundation.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     

14 Sep, 2008

1 commit

  • If all the cpus in a cpuset are offlined, the tasks in it will be moved to
    the nearest ancestor with non-empty cpus.

    Signed-off-by: Li Zefan
    Acked-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

05 Jul, 2008

2 commits


19 Jun, 2008

1 commit


07 Jun, 2008

1 commit


29 Apr, 2008

1 commit

  • This flag provides the hardwalling properties of mem_exclusive, without
    enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
    to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
    nodes.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

20 Apr, 2008

1 commit

  • This patch introduces new feature of cpuset - sched domain customization.

    This version provides a per-cpuset file 'sched_relax_domain_level' that
    enable us to change the searching range of scheduler, which used to limit
    how many cpus the scheduler searches at some schedule events, such as
    wakening task and running out of runqueue.

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

24 Feb, 2008

1 commit


08 Feb, 2008

1 commit

  • Update cpuset documentation to match the October 2007 "Fix cpusets
    update_cpumask" changes that now apply changes to a cpusets 'cpus' allowed
    mask immediately to the cpus_allowed of the tasks in that cpuset.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

20 Oct, 2007

2 commits

  • Add a new per-cpuset flag called 'sched_load_balance'.

    When enabled in a cpuset (the default value) it tells the kernel scheduler
    that the scheduler should provide the normal load balancing on the CPUs in
    that cpuset, sometimes moving tasks from one CPU to a second CPU if the
    second CPU is less loaded and if that task is allowed to run there.

    When disabled (write "0" to the file) then it tells the kernel scheduler
    that load balancing is not required for the CPUs in that cpuset.

    Now even if this flag is disabled for some cpuset, the kernel may still
    have to load balance some or all the CPUs in that cpuset, if some
    overlapping cpuset has its sched_load_balance flag enabled.

    If there are some CPUs that are not in any cpuset whose sched_load_balance
    flag is enabled, the kernel scheduler will not load balance tasks to those
    CPUs.

    Moreover the kernel will partition the 'sched domains' (non-overlapping
    sets of CPUs over which load balancing is attempted) into the finest
    granularity partition that it can find, while still keeping any two CPUs
    that are in the same shed_load_balance enabled cpuset in the same element
    of the partition.

    This serves two purposes:
    1) It provides a mechanism for real time isolation of some CPUs, and
    2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

    This mechanism replaces the earlier overloading of the per-cpuset
    flag 'cpu_exclusive', which overloading was removed in an earlier
    patch: cpuset-remove-sched-domain-hooks-from-cpusets

    See further the Documentation and comments in the code itself.

    [akpm@linux-foundation.org: don't be weird]
    Signed-off-by: Paul Jackson
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Remove the filesystem support logic from the cpusets system and makes cpusets
    a cgroup subsystem

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

2 commits

  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpusets siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get go.

    Fortunately, it was sufficiently borked that so far as I know, almost no
    successful use has been made of this. One real time group did use it to
    affectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as someway to manipulate
    the list of isolated CPUs on a running system. They can do without this
    present cpu_exclusive based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up a partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side affects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any using code.
    Unless they have real insight to the scheduler load balancing choices, they
    will be unable to detect that this change has been made in the kernel's
    behaviour.

    Initial discussion on lkml of this patch has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough concensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no concensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the schedulers
    load balancing, we should remove it while we can, as we proceed to work the
    replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in continuous series of oom-kill and
    apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return error if admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless-nodes.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

03 Apr, 2007

1 commit

  • It seems that there must be at least one node in mems and at least one CPU
    in cpus in order to be able to assign tasks to a cpuset. This makes sense.
    And I think it would also make sense to include a mems setting in the
    basic usage section of the documentation.

    I also wonder if something logged to dmsg, explaining why a write failed,
    would be a good enhancement. I ended up having rummage arround in cpuset.c
    in order to work out why my configuration was failing.

    Signed-off-by: Simon Horman
    Acked-by: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Horman
     

30 Sep, 2006

1 commit

  • Change the list of memory nodes allowed to tasks in the top (root) nodeset
    to dynamically track what cpus are online, using a call to a cpuset hook
    from the memory hotplug code. Make this top cpus file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support memory hotplug, then these tasks cannot make
    use of memory nodes that are added after system boot, because the memory
    nodes are not allowed in the top cpuset. This is a surprising regression
    over earlier kernels that didn't have cpusets enabled.

    One key motivation for this change is to remain consistent with the
    behaviour for the top_cpuset's 'cpus', which is also read-only, and which
    automatically tracks the cpu_online_map.

    This change also has the minor benefit that it fixes a long standing,
    little noticed, minor bug in cpusets. The cpuset performance tweak to
    short circuit the cpuset_zone_allowed() check on systems with just a single
    cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
    changing the 'mems' of the top_cpuset had no affect, even though the change
    (the write system call) appeared to succeed. With the following change,
    that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
    refuses to be changed via user space writes. Thus no one should be mislead
    into thinking they've changed the top_cpusets's 'mems' when in affect they
    haven't.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'mems' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of node_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged memory nodes
    allowed by their cpuset.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Paul Jackson
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

28 Aug, 2006

1 commit

  • Change the list of cpus allowed to tasks in the top (root) cpuset to
    dynamically track what cpus are online, using a CPU hotplug notifier. Make
    this top cpus file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support CPU hotplug, then these tasks cannot make use
    of CPUs that are added after system boot, because the CPUs are not allowed
    in the top cpuset. This is a surprising regression over earlier kernels
    that didn't have cpusets enabled.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'cpus' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of cpu_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
    by their cpuset.

    Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
    driving the fix, and earlier versions of this patch.

    Signed-off-by: Paul Jackson
    Cc: Nathan Lynch
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

24 Mar, 2006

1 commit

  • This patch provides the implementation and cpuset interface for an alternative
    memory allocation policy that can be applied to certain kinds of memory
    allocations, such as the page cache (file system buffers) and some slab caches
    (such as inode caches).

    The policy is called "memory spreading." If enabled, it spreads out these
    kinds of memory allocations over all the nodes allowed to a task, instead of
    preferring to place them on the node where the task is executing.

    All other kinds of allocations, including anonymous pages for a tasks stack
    and data regions, are not affected by this policy choice, and continue to be
    allocated preferring the node local to execution, as modified by the NUMA
    mempolicy.

    There are two boolean flag files per cpuset that control where the kernel
    allocates pages for the file system buffers and related in kernel data
    structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

    If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
    kernel will spread the file system buffers (page cache) evenly over all the
    nodes that the faulting task is allowed to use, instead of preferring to put
    those pages on the node where the task is running.

    If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
    kernel will spread some file system related slab caches, such as for inodes
    and dentries evenly over all the nodes that the faulting task is allowed to
    use, instead of preferring to put those pages on the node where the task is
    running.

    The implementation is simple. Setting the cpuset flags 'memory_spread_page'
    or 'memory_spread_cache' turns on the per-process flags PF_SPREAD_PAGE or
    PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
    subsequently joins that cpuset. In subsequent patches, the page allocation
    calls for the affected page cache and slab caches are modified to perform an
    inline check for these flags, and if set, a call to a new routine
    cpuset_mem_spread_node() returns the node to prefer for the allocation.

    The cpuset_mem_spread_node() routine is also simple. It uses the value of a
    per-task rotor cpuset_mem_spread_rotor to select the next node in the current
    tasks mems_allowed to prefer for the allocation.

    This policy can provide substantial improvements for jobs that need to place
    thread local data on the corresponding node, but that need to access large
    file system data sets that need to be spread across the several nodes in the
    jobs cpuset in order to fit. Without this patch, especially for jobs that
    might have one thread reading in the data set, the memory allocation across
    the nodes in the jobs cpuset can become very uneven.

    A couple of Copyright year ranges are updated as well. And a couple of email
    addresses that can be found in the MAINTAINERS file are removed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

15 Mar, 2006

1 commit

  • Update the documentation for page migration.

    - Fix up bits and pieces in cpusets.txt

    - Rework text in vm/page-migration to be clearer and reflect the final
    version of page migration in 2.6.16. Mention Andi Kleen's numactl
    package that contains user space tools for page migration via
    libnuma. Add reference to numa_maps and to the manpage in numactl.

    - Add todo list for outstanding issues

    Signed-off-by: Christoph Lameter
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Jan, 2006

1 commit


09 Jan, 2006

3 commits

  • Remove documentation for the cpuset 'marker_pid' feature, that was in the
    patch "cpuset: change marker for relative numbering" That patch was previously
    pulled from *-mm at my (pj) request.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Document the additional cpuset features:
    notify_on_release
    marker_pid
    memory_pressure
    memory_pressure_enabled

    Rearrange and improve formatting of existing documentation for
    cpu_exclusive and mem_exclusive features.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add a boolean "memory_migrate" to each cpuset, represented by a file
    containing "0" or "1" in each directory below /dev/cpuset.

    It defaults to false (file contains "0"). It can be set true by writing
    "1" to the file.

    If true, then anytime that a task is attached to the cpuset so marked, the
    pages of that task will be moved to that cpuset, preserving, to the extent
    practical, the cpuset-relative placement of the pages.

    Also anytime that a cpuset so marked has its memory placement changed (by
    writing to its "mems" file), the tasks in that cpuset will have their pages
    moved to the cpusets new nodes, preserving, to the extent practical, the
    cpuset-relative placement of the moved pages.

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

31 Oct, 2005

1 commit


11 Sep, 2005

1 commit

  • The attached patch fixes the following spelling errors in Documentation/
    - double "the"
    - Several misspellings of function/functionality
    - infomation
    - memeory
    - Recieved
    - wether
    and possibly others which I forgot ;-)
    Trailing whitespaces on the same line as the typo are also deleted.

    Signed-off-by: Tobias Klauser
    Signed-off-by: Domen Puncer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

08 Sep, 2005

1 commit

  • This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution. With this patch, there are now the following four layers of
    memory placement available:

    1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
    2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
    3) The current tasks cpuset (GFP_USER allocations constrained to here), and
    4) Specific node placement, using mbind and set_mempolicy.

    These nest - each layer is a subset (same or within) of the previous.

    Layer (2) above is new, with this patch. The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed. The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.

    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.

    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such. Swapper and oom_kill activity is also constrained to Layer (2). A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset. Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.

    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.

    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

26 Jun, 2005

1 commit


21 May, 2005

1 commit

  • This patch removes the entwining of cpusets and hotplug code in the "No
    more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().

    Since the hotplug code is holding a spinlock at this point, we cannot take
    the cpuset semaphore, cpuset_sem, as would seem to be required either to
    update the tasks cpuset, or to scan up the nested cpuset chain, looking for
    the nearest cpuset ancestor that still has some CPUs that are online. So
    we just punt and blast the tasks cpus_allowed with all bits allowed.

    This reverts these lines of code to what they were before the cpuset patch.
    And it updates the cpuset Doc file, to match.

    The one known alternative to this that seems to work came from Dinakar
    Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
    much earlier in its processing. So far as we know, the increased locking
    entanglement between cpusets and hot plug of this alternative approach is
    not worth doing in this case.

    Signed-off-by: Paul Jackson
    Acked-by: Nathan Lynch
    Acked-by: Dinakar Guniguntala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds