07 Jun, 2008

1 commit

  • Adding a nonexistent cpu to a cpuset is quietly ignored. It should
    return -EINVAL.

    Example: (real_nr_cpus < NR_CPUS or cpu#4 was just offline)

    # cat cpus
    0-1
    # /bin/echo 4 > cpus
    # /bin/echo $?
    0
    # cat cpus

    #

    The same occurs when adding a nonexistent mem.
    This patch fixes the bug.
    And when *buf == "", the check is unneeded.
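
    A minimal userspace sketch of the intended check (the names and the flat
    bitmask model are illustrative, not the kernel's):

        #include <errno.h>
        #include <stdio.h>

        /* Model: cpu masks as flat bitmaps of cpu ids. */
        static int validate_cpus(unsigned long requested, unsigned long present)
        {
                if (requested == 0)             /* empty write clears the mask */
                        return 0;
                if (requested & ~present)       /* a nonexistent cpu requested */
                        return -EINVAL;
                return 0;
        }

        int main(void)
        {
                unsigned long present = 0x3;    /* cpus 0-1, as in the example */
                printf("%d\n", validate_cpus(1UL << 4, present)); /* prints -22 */
                return 0;
        }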

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

09 May, 2008

1 commit

  • Due to a merge conflict, the sched_relax_domain_level control file was marked
    as being handled by cpuset_read/write_u64, but the code to handle it was
    actually in cpuset_common_file_read/write.

    Since the value being written/read is in fact a signed integer, it should be
    treated as such; this patch adds cpuset_read/write_s64 functions, and uses
    them to handle the sched_relax_domain_level file.

    With this patch, the sched_relax_domain_level can be read and written, and the
    correct contents seen/updated.
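
    A sketch of the signed write path this patch needs (names are stand-ins;
    the kernel routes this through the cgroup s64 handlers, and its bounds
    checking may differ):

        #include <errno.h>
        #include <stdlib.h>

        static long long relax_level = -1;      /* -1 means system default */

        /* A u64 handler would mangle or reject "-1"; an s64 one must not. */
        static int write_relax_level_s64(const char *buf)
        {
                char *end;
                long long val = strtoll(buf, &end, 10);

                if (end == buf || val < -1)
                        return -EINVAL;
                relax_level = val;
                return 0;
        }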

    Signed-off-by: Paul Menage
    Cc: Hidetoshi Seto
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Reviewed-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

29 Apr, 2008

5 commits

  • This flag provides the hardwalling properties of mem_exclusive, without
    enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
    to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
    nodes.
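
    The resulting check is a simple disjunction; a sketch of the semantics
    (flag names modeled on the cpuset flag bits, simplified here):

        /* Either flag hardwalls GFP_KERNEL allocations to the cpuset. */
        enum {
                CS_MEM_EXCLUSIVE = 1 << 0,
                CS_MEM_HARDWALL  = 1 << 1,
        };

        static int is_hardwalled(unsigned int flags)
        {
                return flags & (CS_MEM_EXCLUSIVE | CS_MEM_HARDWALL);
        }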

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently the cpusets mem_exclusive flag is overloaded to mean both
    "no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".

    These patches add a new mem_hardwall flag with just the allocation restriction
    part of the mem_exclusive semantics, without breaking backwards-compatibility
    for those who continue to use just mem_exclusive. Additionally, the cgroup
    control file registration for cpusets is cleaned up to reduce boilerplate.

    This patch:

    This change tidies up the cpusets control file definitions, and reduces the
    amount of boilerplate required to add/change control files in the future.
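
    The cleanup is essentially table-driven registration; a simplified sketch
    of the idea (the handler names here are stand-ins, not the kernel's):

        struct ctl_file {
                const char *name;
                unsigned long long (*read_u64)(void);
        };

        static unsigned long long read_cpu_exclusive(void) { return 0; }
        static unsigned long long read_mem_exclusive(void) { return 0; }

        /* Adding or changing a control file becomes a one-line table edit,
         * registered in a loop instead of one hand-written block per file. */
        static const struct ctl_file cpuset_files[] = {
                { "cpu_exclusive", read_cpu_exclusive },
                { "mem_exclusive", read_mem_exclusive },
        };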

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Make the following needlessly global functions static:

    - cpuset_test_cpumask()
    - cpuset_change_cpumask()
    - cpuset_do_move_task()

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Many of the cpusets control files are simple integer values, which don't
    require the overhead of memory allocations for reads and writes.

    Move the handlers for these control files into cpuset_read_u64() and
    cpuset_write_u64().
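
    A sketch of the shape of these handlers, assuming a per-file type tag as
    the dispatch key (the [akpm] fixup below restores a `break' of exactly
    this kind):

        enum cpuset_filetype { FILE_CPU_EXCLUSIVE, FILE_MEM_EXCLUSIVE };

        /* One integer handler for many files: no buffer allocation needed. */
        static unsigned long long cpuset_read_u64_sketch(unsigned int flags,
                                                enum cpuset_filetype type)
        {
                unsigned long long val = 0;

                switch (type) {
                case FILE_CPU_EXCLUSIVE:
                        val = flags & 1;
                        break;  /* a missing break here would fall through */
                case FILE_MEM_EXCLUSIVE:
                        val = (flags >> 1) & 1;
                        break;
                }
                return val;
        }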

    [akpm@linux-foundation.org: add missing `break']
    Signed-off-by: Paul Menage
    Cc: "Li Zefan"
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: "YAMAMOTO Takashi"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer
    kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Reviewed-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

28 Apr, 2008

3 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
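
    A userspace model of the filtering walk (struct and helper names are
    illustrative; the kernel threads a nodemask_t through the allocator):

        #include <stddef.h>
        #include <stdbool.h>

        struct zone { int node; };

        static bool node_allowed(const struct zone *z, unsigned long nodemask)
        {
                return nodemask & (1UL << z->node);
        }

        /* Walk the local node's distance-ordered zonelist and return the
         * first zone on an allowed node; no custom zonelist required. */
        static const struct zone *first_allowed_zone(const struct zone *zl,
                                        size_t n, unsigned long nodemask)
        {
                for (size_t i = 0; i < n; i++)
                        if (node_allowed(&zl[i], nodemask))
                                return &zl[i];
                return NULL;
        }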

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if accessing zone->node proved significant, which
    may be the case on workloads where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
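
    The struct is small; a sketch close to what the patch introduces (the
    helper name is abbreviated here):

        struct zone;                    /* opaque for this sketch */

        struct zoneref {
                struct zone *zone;      /* pointer to the actual zone */
                int zone_idx;           /* cached zone_idx(zone) */
        };

        /* The index is read from the zoneref itself: no dereference of
         * zone and no subtraction, which is the cost being avoided. */
        static inline int zoneref_zone_idx(const struct zoneref *z)
        {
                return z->zone_idx;
        }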

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2008

3 commits

  • [rebased for sched-devel/latest]

    - Add a new cpuset file, having levels:
    sched_relax_domain_level

    - Modify partition_sched_domains() and build_sched_domains()
    to take an attributes parameter passed from cpuset (sketched below).

    - Fill in newidle_idx for node domains; it is currently unused but
    might be required if sched_relax_domain_level becomes higher.

    - We can change the default level by boot option 'relax_domain_level='.
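
    The attributes parameter mentioned above boils down to a small per-domain
    struct; a sketch close to the interface this adds (exact signature details
    may differ):

        /* Attributes passed from cpuset into the sched domain rebuild. */
        struct sched_domain_attr {
                int relax_domain_level;     /* -1 requests the default */
        };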

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • * Cleaned up references to cpumask_scnprintf() and added new
    cpulist_scnprintf() interfaces where appropriate.

    * Fix some small bugs (and make some code efficiency improvements) for
    various uses of cpumask_scnprintf.

    * Clean up some checkpatch errors.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • * Modify cpuset_cpus_allowed to return the currently allowed cpuset
    via a pointer argument instead of as the function return value.

    * Use new set_cpus_allowed_ptr function.

    * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

    Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     

06 Mar, 2008

1 commit

  • mm migration is no longer done in cpuset_update_task_memory_state(), so it
    no longer needs to take current->mm->mmap_sem; fix the obsolete comment.

    [ This changed in commit 04c19fa6f16047abff2288ddbc1f0798ede5a849
    ("cpuset: migrate all tasks in cpuset at once") when the mm migration
    was moved from cpuset_update_task_memory_state() to update_nodemask() ]

    Signed-off-by: David Rientjes
    Cc: Paul Jackson
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Feb, 2008

1 commit

  • Currently we may look up the pid in the wrong pid namespace. So convert
    proc_pid_status to the seq_file interface, which ensures the proper pid
    namespace is passed in.
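
    A userspace model of the seq_file idea: the namespace is captured when the
    file is opened and carried to show time, instead of being guessed from the
    reader (all names here are illustrative):

        #include <stdio.h>

        struct pidns { int id; };

        struct status_ctx {
                struct pidns *ns;   /* opener's pid namespace, fixed at open */
                int pid;            /* the pid number valid in that namespace */
        };

        static int status_show(const struct status_ctx *ctx, char *buf, int len)
        {
                /* ctx->ns scopes any pid translation; here it tags output */
                return snprintf(buf, len, "NSpid:\t%d:%d\n",
                                ctx->ns->id, ctx->pid);
        }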

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: s390 build fix]
    [akpm@linux-foundation.org: fix task_name() output]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Eric W. Biederman
    Cc: Andrew Morgan
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Pavel Emelyanov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

08 Feb, 2008

5 commits

  • - Narrow the scope of callback_mutex in scan_for_empty_cpusets().

    - Avoid rewriting the cpus, mems of cpusets except when it is likely that
    we'll be changing them.

    - Have remove_tasks_in_empty_cpuset() also check for empty mems.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Various minor formatting and comment tweaks to Cliff Wickman's
    [PATCH_3_of_3]_cpusets__update_cpumask_revision.patch

    I had had "iff", meaning "if and only if" in a comment. However, except for
    ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
    changed it to "if", presumably figuring that the "iff" was a typo. However,
    it was the "only if" half of the conjunction that was most interesting.
    Reword to emphasize the "only if" aspect.

    The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
    callback_mutex had to be held on entry. The opposite is true.

    Several mentions of attach_task() in comments needed to be
    changed to cgroup_attach_task().

    A comment about notify_on_release was no longer relevant,
    as the line of code it had commented, namely:
    set_bit(CS_RELEASED_RESOURCE, &parent->flags);
    is no longer present in that place in the cpuset.c code.

    Similarly a comment about notify_on_release before the
    scan_for_empty_cpusets() routine was no longer relevant.

    Removed extra parentheses and unnecessary return statement.

    Renamed attach_task() to cpuset_attach() in various comments.

    Removed comment about not needing memory migration, as it seems the migration
    is done anyway, via the cpuset_attach() callback from cgroup_attach_task().

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Some of the comments in kernel/cpuset.c were stale following the
    transition to control groups; this patch updates them to more closely
    match reality.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Use the new function cgroup_scan_tasks() to step through all tasks in a
    cpuset.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the cpuset which is the parent of
    their current cpuset. Or if the parent cpuset has no cpus, to its parent,
    etc.

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.
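
    The ancestor walk is simple; a sketch of the rule (flat bitmap model,
    illustrative names):

        struct cpuset_node {
                struct cpuset_node *parent;
                unsigned long cpus;     /* cpus_allowed as a flat bitmap */
        };

        /* Find the nearest ancestor that still has cpus; tasks of the
         * emptied cpuset are moved there. */
        static struct cpuset_node *nearest_nonempty(struct cpuset_node *cs)
        {
                struct cpuset_node *p = cs->parent;

                while (p && p->cpus == 0)
                        p = p->parent;
                return p;
        }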

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     

26 Jan, 2008

1 commit

  • Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
    get_online_cpus and put_online_cpus instead as it highlights the
    refcount semantics in these operations.

    The new API guarantees protection against the cpu-hotplug operation, but
    it doesn't guarantee serialized access to any of the local data
    structures. Hence the changes need to be reviewed.

    In case of pseries_add_processor/pseries_remove_processor, use
    cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
    cpu_present_map there.
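
    The typical conversion looks like this (the API names are the ones the
    message describes; the bodies are placeholders, kernel context assumed):

        #include <linux/cpu.h>

        static void reader_example(void)
        {
                get_online_cpus();      /* refcounted: holds off cpu hotplug */
                /* ... safely walk cpu_online_map / per-cpu data ... */
                put_online_cpus();
        }

        static void writer_example(void)
        {
                cpu_maps_update_begin(); /* serializes cpu_present_map updates */
                /* ... add or remove a cpu from cpu_present_map ... */
                cpu_maps_update_done();
        }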

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     

20 Oct, 2007

6 commits

  • When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
    been running on that cpu.

    Currently, such a task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any cpu which is both online and among that task's cpus_allowed

    It is typical of a multithreaded application running on a large NUMA system to
    have its tasks confined to a cpuset so as to cluster them near the memory that
    they share. Furthermore, it is typical to explicitly place such a task on a
    specific cpu in that cpuset. And in that case the task's cpus_allowed
    includes only a single cpu.

    This patch inserts a preference to migrate such a task to some cpu within
    its cpuset (and sets its cpus_allowed to its entire cpuset).

    With this patch, the task is migrated:
    1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
    2) to any online cpu within the task's cpuset
    3) to any cpu which is both online and among that task's cpus_allowed

    In order to do this, move_task_off_dead_cpu() must make a call to
    cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
    not block. (name change - per Oleg's suggestion)

    Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
    the cpuset mutex during the whole migrate_live_tasks() and
    migrate_dead_tasks() procedure.

    [akpm@linux-foundation.org: build fix]
    [pj@sgi.com: Fix indentation and spacing]
    Signed-off-by: Cliff Wickman
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: Ingo Molnar
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:

    - collect batches of tasks under tasklist_lock and then call
    set_cpus_allowed() on them outside the lock (since this can sleep).

    - add a simple generic priority heap type (sketched below) to allow
    efficient collection of batches of tasks to be processed without
    duplicating or missing any tasks in subsequent batches.

    - make "cpus" file update a no-op if the mask hasn't changed

    - fix race between update_cpumask() and sched_setaffinity() by making
    sched_setaffinity() post-check that it's not running on any cpus outside
    cpuset_cpus_allowed().
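
    A userspace sketch of the heap trick referenced in the second item above:
    keep the N smallest keys (think task start times) seen so far, so each
    batch resumes exactly where the previous one stopped. The kernel's version
    is generic over pointers; plain ints here for brevity.

        /* Bounded max-heap of the N smallest keys seen so far. */
        struct heap {
                int *data;
                int size, max;
        };

        static void heap_sift_down(struct heap *h, int i)
        {
                for (;;) {
                        int l = 2 * i + 1, r = l + 1, big = i, t;

                        if (l < h->size && h->data[l] > h->data[big])
                                big = l;
                        if (r < h->size && h->data[r] > h->data[big])
                                big = r;
                        if (big == i)
                                return;
                        t = h->data[i];
                        h->data[i] = h->data[big];
                        h->data[big] = t;
                        i = big;
                }
        }

        /* Keep the N smallest keys: the root is the largest kept, so a new
         * smaller key evicts it; batches processed in key order then never
         * miss or repeat an element. */
        static void heap_insert(struct heap *h, int key)
        {
                int i, p, t;

                if (h->size < h->max) {
                        i = h->size++;
                        h->data[i] = key;
                        while (i > 0) {
                                p = (i - 1) / 2;
                                if (h->data[p] >= h->data[i])
                                        break;
                                t = h->data[p];
                                h->data[p] = h->data[i];
                                h->data[i] = t;
                                i = p;
                        }
                } else if (key < h->data[0]) {
                        h->data[0] = key;
                        heap_sift_down(h, 0);
                }
        }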

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Balbir Singh
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Decrustify the kernel/cpuset.c 'cpus' and 'mems' updating code.

    Other than subtle improvements in the consistency of identifying
    white space at the beginning and end of passed in masks, this
    doesn't make any visible difference in behaviour. But it's
    one or two hundred kernel text bytes smaller, and easier to
    understand.

    [akpm@linux-foundation.org: coding-style fix]
    Signed-off-by: Paul Jackson
    Reviewed-by: Paul Menage
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add a new per-cpuset flag called 'sched_load_balance'.

    When enabled in a cpuset (the default value) it tells the kernel scheduler
    that the scheduler should provide the normal load balancing on the CPUs in
    that cpuset, sometimes moving tasks from one CPU to a second CPU if the
    second CPU is less loaded and if that task is allowed to run there.

    When disabled (write "0" to the file) then it tells the kernel scheduler
    that load balancing is not required for the CPUs in that cpuset.

    Now even if this flag is disabled for some cpuset, the kernel may still
    have to load balance some or all the CPUs in that cpuset, if some
    overlapping cpuset has its sched_load_balance flag enabled.

    If there are some CPUs that are not in any cpuset whose sched_load_balance
    flag is enabled, the kernel scheduler will not load balance tasks to those
    CPUs.

    Moreover the kernel will partition the 'sched domains' (non-overlapping
    sets of CPUs over which load balancing is attempted) into the finest
    granularity partition that it can find, while still keeping any two CPUs
    that are in the same sched_load_balance enabled cpuset in the same element
    of the partition.

    This serves two purposes:
    1) It provides a mechanism for real time isolation of some CPUs, and
    2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

    This mechanism replaces the earlier overloading of the per-cpuset flag
    'cpu_exclusive'; that overloading was removed in an earlier patch:
    cpuset-remove-sched-domain-hooks-from-cpusets

    See further the Documentation and comments in the code itself.
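
    The partitioning rule reduces to a pairwise test; a sketch of the idea
    (flat bitmask model, illustrative names):

        struct cs {
                unsigned long cpus;     /* cpus_allowed as a flat bitmap */
                int sched_load_balance;
        };

        /* Two cpusets must land in the same sched domain element iff both
         * ask for load balancing and their cpus overlap. */
        static int must_share_domain(const struct cs *a, const struct cs *b)
        {
                return a->sched_load_balance && b->sched_load_balance &&
                       (a->cpus & b->cpus) != 0;
        }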

    [akpm@linux-foundation.org: don't be weird]
    Signed-off-by: Paul Jackson
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • The cpuset code to present a list of tasks using a cpuset to user space could
    write to an array that it had kmalloc'd, after a kmalloc request of zero size.

    The problem was that the code didn't check for writes past the allocated end
    of the array until -after- the first write.

    This is a race condition that is likely rare -- it would only show up if a
    cpuset went from being empty to having a task in it, during the brief time
    between the allocation and the first write.

    Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
    zero kmalloc returned a few usable bytes anyway, and no harm was done with the
    bogus write.

    With the 2.6.22 kernel changes that issue a warning if code tries to write
    to the location returned from a zero-size allocation, this problem is no
    longer benign. This cpuset code would occasionally trigger that warning.

    The fix is trivial -- check before storing into the array, not after, whether
    the array is big enough to hold the store.
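
    The shape of the fix, in a simplified sketch (names illustrative):

        /* Return the next free slot, or -1 when the array is full.
         * The bounds test runs before the store, never after it. */
        static int store_pid(int *pids, int npids, int n, int pid)
        {
                if (n >= npids)
                        return -1;
                pids[n] = pid;
                return n + 1;
        }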

    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Balbir Singh
    Cc: Dave Hansen
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Paul Menage
    Cc: Srivatsa Vaddagiri
    Cc: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

17 Oct, 2007

4 commits

  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better to
    simply test for intersection between the tasks' mems_allowed in their task
    descriptors. This does not require taking callback_mutex since it is only
    used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.
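
    The hint is just an intersection test on the two masks; sketched with flat
    bitmaps (the kernel uses nodemask_t and its intersection helper):

        /* Lockless hint for OOM badness: do the two tasks' mems overlap? */
        static int mems_intersect(unsigned long a_mems, unsigned long b_mems)
        {
                return (a_mems & b_mems) != 0;
        }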

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpuset's siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get go.

    Fortunately, it was sufficiently borked that so far as I know, almost no
    successful use has been made of this. One real time group did use it to
    effectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as some way to manipulate
    the list of isolated CPUs on a running system. They can do without this
    present cpu_exclusive based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side effects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any code using
    it. Unless they have real insight into the scheduler's load balancing
    choices, users will be unable to detect that this change has been made in
    the kernel's behaviour.

    Initial discussion on lkml of this patch has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough consensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no consensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the scheduler's
    load balancing, we should remove it while we can, as we proceed to work on
    the replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch marks a number of allocations that are either short-lived such as
    network buffers or are reclaimable such as inode allocations. When something
    like updatedb is called, long-lived and unmovable kernel allocations tend to
    be spread throughout the address space which increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.
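
    Sketched as the allocator-side dispatch (simplified; the real series
    derives the type from a gfp flag set at the marked call sites):

        enum migratetype {
                MIGRATE_UNMOVABLE,      /* long-lived kernel allocations */
                MIGRATE_RECLAIMABLE,    /* freeable on demand, not movable */
                MIGRATE_MOVABLE,        /* user pages that can be migrated */
        };

        /* Group pages of one type into the same blocks so they fragment
         * each other rather than the rest of memory. */
        static enum migratetype alloc_migratetype(int gfp_reclaimable)
        {
                return gfp_reclaimable ? MIGRATE_RECLAIMABLE
                                       : MIGRATE_UNMOVABLE;
        }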

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cpusets try to ensure that any node added to a cpuset's mems_allowed is
    on-line and contains memory. The assumption was that online nodes contained
    memory. Thus, it is possible to add memoryless nodes to a cpuset and then add
    tasks to this cpuset. This results in a continuous series of oom-kills and
    an apparent system hang.

    Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in
    place of node_online_map when vetting memories. Return error if admin
    attempts to write a non-empty mems_allowed node mask containing only
    memoryless-nodes.
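
    A flat-bitmap sketch of the vetting rule (illustrative names; the kernel
    checks against node_states[N_HIGH_MEMORY]):

        #include <errno.h>

        /* Reject a non-empty mems mask that names only memoryless nodes. */
        static int validate_mems(unsigned long requested,
                                 unsigned long has_memory)
        {
                if (requested != 0 && (requested & has_memory) == 0)
                        return -EINVAL;
                return 0;
        }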

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Bob Picco
    Signed-off-by: Nishanth Aravamudan
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

18 Jul, 2007

2 commits

  • Rather than using a tri-state integer for the wait flag in
    call_usermodehelper_exec, define a proper enum, and use that. I've
    preserved the integer values so that any callers I've missed should
    still work OK.
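
    The resulting enum, with the old tri-state integer values preserved
    (close to what the patch adds):

        enum umh_wait {
                UMH_NO_WAIT   = -1,     /* don't wait at all */
                UMH_WAIT_EXEC = 0,      /* wait for the exec only */
                UMH_WAIT_PROC = 1,      /* wait for the process to complete */
        };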

    Signed-off-by: Jeremy Fitzhardinge
    Cc: James Bottomley
    Cc: Randy Dunlap
    Cc: Christoph Hellwig
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Johannes Berg
    Cc: Ralf Baechle
    Cc: Bjorn Helgaas
    Cc: Joel Becker
    Cc: Tony Luck
    Cc: Kay Sievers
    Cc: Srivatsa Vaddagiri
    Cc: Oleg Nesterov
    Cc: David Howells

    Jeremy Fitzhardinge
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits)
    KVM: Use CPU_DYING for disabling virtualization
    KVM: Tune hotplug/suspend IPIs
    KVM: Keep track of which cpus have virtualization enabled
    SMP: Allow smp_call_function_single() to current cpu
    i386: Allow smp_call_function_single() to current cpu
    x86_64: Allow smp_call_function_single() to current cpu
    HOTPLUG: Adapt thermal throttle to CPU_DYING
    HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
    HOTPLUG: Add CPU_DYING notifier
    KVM: Clean up #includes
    KVM: Remove kvmfs in favor of the anonymous inodes source
    KVM: SVM: Reliably detect if SVM was disabled by BIOS
    KVM: VMX: Remove unnecessary code in vmx_tlb_flush()
    KVM: MMU: Fix Wrong tlb flush order
    KVM: VMX: Reinitialize the real-mode tss when entering real mode
    KVM: Avoid useless memory write when possible
    KVM: Fix x86 emulator writeback
    KVM: Add support for in-kernel pio handlers
    KVM: VMX: Fix interrupt checking on lightweight exit
    KVM: Adds support for in-kernel mmio handlers
    ...

    Linus Torvalds
     

17 Jun, 2007

1 commit

  • The cpuset code to present a list of tasks using a cpuset to user space could
    write to an array that it had kmalloc'd, after a kmalloc request of zero size.

    The problem was that the code didn't check for writes past the allocated end
    of the array until -after- the first write.

    This is a race condition that is likely rare -- it would only show up if a
    cpuset went from being empty to having a task in it, during the brief time
    between the allocation and the first write.

    Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
    zero kmalloc returned a few usable bytes anyway, and no harm was done with the
    bogus write.

    With the 2.6.22 kernel changes that issue a warning if code tries to write
    to the location returned from a zero-size allocation, this problem is no
    longer benign. This cpuset code would occasionally trigger that warning.

    The fix is trivial -- check before storing into the array, not after, whether
    the array is big enough to hold the store.

    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Balbir Singh
    Cc: Dave Hansen
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Paul Menage
    Cc: Srivatsa Vaddagiri
    Cc: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
