18 Jul, 2007

2 commits

  • Rather than using a tri-state integer for the wait flag in
    call_usermodehelper_exec, define a proper enum, and use that (a sketch
    of the resulting enum follows below). I've preserved the integer values
    so that any callers I've missed should still work OK.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: James Bottomley
    Cc: Randy Dunlap
    Cc: Christoph Hellwig
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Cc: Johannes Berg
    Cc: Ralf Baechle
    Cc: Bjorn Helgaas
    Cc: Joel Becker
    Cc: Tony Luck
    Cc: Kay Sievers
    Cc: Srivatsa Vaddagiri
    Cc: Oleg Nesterov
    Cc: David Howells

    Jeremy Fitzhardinge
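
    A minimal sketch of the resulting enum, following the commit text; the
    exact names and comments are assumptions, not the verbatim kernel source:

        enum umh_wait {
                UMH_NO_WAIT   = -1,  /* don't wait at all */
                UMH_WAIT_EXEC = 0,   /* wait for the exec, but not the process */
                UMH_WAIT_PROC = 1,   /* wait for the process to complete */
        };

    Each enumerator keeps the integer value of the old tri-state flag, so
    unconverted callers passing raw integers still behave the same.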
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (80 commits)
    KVM: Use CPU_DYING for disabling virtualization
    KVM: Tune hotplug/suspend IPIs
    KVM: Keep track of which cpus have virtualization enabled
    SMP: Allow smp_call_function_single() to current cpu
    i386: Allow smp_call_function_single() to current cpu
    x86_64: Allow smp_call_function_single() to current cpu
    HOTPLUG: Adapt thermal throttle to CPU_DYING
    HOTPLUG: Adapt cpuset hotplug callback to CPU_DYING
    HOTPLUG: Add CPU_DYING notifier
    KVM: Clean up #includes
    KVM: Remove kvmfs in favor of the anonymous inodes source
    KVM: SVM: Reliably detect if SVM was disabled by BIOS
    KVM: VMX: Remove unnecessary code in vmx_tlb_flush()
    KVM: MMU: Fix Wrong tlb flush order
    KVM: VMX: Reinitialize the real-mode tss when entering real mode
    KVM: Avoid useless memory write when possible
    KVM: Fix x86 emulator writeback
    KVM: Add support for in-kernel pio handlers
    KVM: VMX: Fix interrupt checking on lightweight exit
    KVM: Adds support for in-kernel mmio handlers
    ...

    Linus Torvalds
     

17 Jun, 2007

1 commit

  • The cpuset code to present a list of tasks using a cpuset to user space could
    write to an array that it had kmalloc'd, after a kmalloc request of zero size.

    The problem was that the code didn't check for writes past the allocated end
    of the array until -after- the first write.

    This is a race condition that is likely rare -- it would only show up if a
    cpuset went from being empty to having a task in it, during the brief time
    between the allocation and the first write.

    Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
    zero kmalloc returned a few usable bytes anyway, and no harm was done with the
    bogus write.

    With the 2.6.22 kernel changes that issue a warning if code tries to write
    to the location returned from a zero-size allocation, this problem is no
    longer benign. This cpuset code would occasionally trigger that warning.

    The fix is trivial -- check before storing into the array, not after,
    whether the array is big enough to hold the store (see the sketch below).

    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Balbir Singh
    Cc: Dave Hansen
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Paul Menage
    Cc: Srivatsa Vaddagiri
    Cc: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
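
    A hypothetical sketch of the before/after pattern described above;
    'pidarray', 'npids', and 'n' are illustrative names, not the exact
    kernel code:

        /* Buggy: store first, check afterwards -- the first store can
         * land past the end of a zero-size allocation. */
        pidarray[n++] = pid;
        if (n == npids)
                break;

        /* Fixed: check capacity before the store. */
        if (n == npids)
                break;
        pidarray[n++] = pid;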
     

08 May, 2007

1 commit

  • OOM killed tasks have access to memory reserves, as specified by the
    TIF_MEMDIE flag, in the hope that they will quickly exit. If such a task
    has memory allocations constrained by cpusets, we may encounter a
    deadlock: the task cannot exit because it cannot allocate the memory it
    needs to do so.

    We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
    including outside their cpuset restrictions, so that they can die quickly
    regardless of whether __GFP_HARDWALL is set (see the sketch below).

    Cc: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
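
    A sketch of the check described above, assuming it sits at the top of a
    cpuset_zone_allowed()-style function; the exact placement is an
    assumption:

        /* An OOM-killed task may allocate anywhere so it can exit quickly,
         * whether or not __GFP_HARDWALL is set. */
        if (unlikely(test_thread_flag(TIF_MEMDIE)))
                return 1;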
     

13 Feb, 2007

2 commits

  • Many struct inode_operations in the kernel can be "const". Marking them
    const moves them to the .rodata section, which avoids false sharing with
    potentially dirty data. In addition, it will catch accidental writes to
    these shared resources at compile time (see the sketch after this entry).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
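
    A sketch of the change, using a made-up filesystem name; marking the
    ops table const lets the linker place it in .rodata:

        /* Before: writable, lives in .data */
        static struct inode_operations foo_inode_operations = {
                .lookup = foo_lookup,
        };

        /* After: read-only, lives in .rodata; accidental writes now
         * fail to compile. */
        static const struct inode_operations foo_inode_operations = {
                .lookup = foo_lookup,
        };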
     
  • Many struct file_operations in the kernel can be "const". Marking them
    const moves them to the .rodata section, which avoids false sharing with
    potentially dirty data. In addition, it will catch accidental writes to
    these shared resources at compile time.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

31 Dec, 2006

1 commit

  • fs/proc/base.c:1869: warning: initialization discards qualifiers from pointer target type
    fs/proc/base.c:2150: warning: initialization discards qualifiers from pointer target type

    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Dec, 2006

1 commit

  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt, or that
    __GFP_THISNODE is set, or that you're on a one-cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but it seems worth it, to reduce the chance of hitting
    a sleeping-with-irqs-off complaint again (a sketch of the two call styles
    follows below).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
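
    A sketch of the two call styles, following the commit text; the call
    sites are illustrative, not the exact kernel code:

        /* Before: behaviour chosen implicitly by a gfp flag. */
        if (!cpuset_zone_allowed(zone, gfp_mask | __GFP_HARDWALL))
                continue;

        /* After: the caller chooses explicitly. */
        if (!cpuset_zone_allowed_hardwall(zone, gfp_mask)) /* never sleeps */
                continue;
        if (!cpuset_zone_allowed_softwall(zone, gfp_mask)) /* may sleep */
                continue;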
     

08 Dec, 2006

4 commits

  • When using fake NUMA setup, the number of memory nodes can greatly exceed
    the number of CPUs. So the current limit in cpuset_common_file_write() is
    insufficient.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
    prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, thus
    generating compiler warnings about unused symbols and forcing people to
    add #ifdefs. The compiler can skip truly unused functions just fine, as
    the unchanged kernel image size shows (a sketch of the stub pattern
    follows below):

    text data bss dec hex filename
    1624412 728710 3674856 6027978 5bfaca vmlinux.before
    1624412 728710 3674856 6027978 5bfaca vmlinux.after

    [akpm@osdl.org: topology.c fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
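
    A sketch of the stub pattern described above; the macro body is an
    assumption, the point being that the notifier is still referenced so
    the compiler considers its callback used:

        #ifdef CONFIG_HOTPLUG_CPU
        #define register_hotcpu_notifier(nb)  register_cpu_notifier(nb)
        #else
        /* Referencing nb (then discarding it) keeps gcc from warning
         * about the otherwise-unused callback function. */
        #define register_hotcpu_notifier(nb)  do { (void)(nb); } while (0)
        #endif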
     
  • A couple of minor code simplifications to the kernel/cpuset.c code. No
    functional change. Just a little less code and a little more readable.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

30 Sep, 2006

4 commits

  • Fix obscure race condition in kernel/cpuset.c attach_task() code.

    There is basically zero chance of anyone accidentally being harmed by this
    race.

    It requires a special 'micro-stress' load and special timing-loop hacks
    in the kernel to hit in less than an hour, and even then you'd have to hit
    it hundreds or thousands of times, followed by some unusual and senseless
    cpuset configuration requests, including removing the top cpuset, to cause
    any visibly harmful effects.

    One could, with perhaps a few days or weeks of such effort, get the
    reference count on the top cpuset below zero, and manage to crash the
    kernel by asking to remove the top cpuset.

    I found it by code inspection.

    The race was introduced when 'the_top_cpuset_hack' was introduced, and one
    piece of code was not updated. An old check for a possibly null task
    cpuset pointer needed to be changed to a check for a task marked
    PF_EXITING. The pointer can't be null anymore, thanks to
    the_top_cpuset_hack (documented in kernel/cpuset.c). But the task could
    have gone into PF_EXITING state after it was found in the task_list scan.

    If a task is PF_EXITING in this code, it is possible that its task->cpuset
    pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
    than because the top_cpuset was that task's last valid cpuset. In that
    case, the wrong cpuset reference counter would be decremented.

    The fix is trivial. Instead of failing the system call if the task's
    cpuset pointer is null here, fail it if the task is in the PF_EXITING
    state.

    The code for 'the_top_cpuset_hack' that changes an exiting task's cpuset
    to the top_cpuset is done without locking, so it could happen at any time.
    But it is done during exit handling, after the PF_EXITING flag is set. So
    if we verify that a task is still not PF_EXITING after we copy out its
    cpuset pointer (into 'oldcs', below), we know that 'oldcs' is not one of
    these hack references to the top_cpuset (see the sketch below).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
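
    A hedged sketch of the fix, with assumed surrounding code from
    attach_task(); the exact error handling is an assumption:

        task_lock(tsk);
        /* Was: if (tsk->cpuset == NULL) -- the pointer can no longer be
         * NULL, so test for an exiting task instead. */
        if (tsk->flags & PF_EXITING) {
                task_unlock(tsk);
                return -ESRCH;
        }
        oldcs = tsk->cpuset;   /* now known not to be a hack reference */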
     
  • The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
    it could remove a CPU or Node from the top cpuset, while leaving it still
    in some child cpusets.

    One basic rule of cpusets is that each cpuset's cpus and mems are subsets
    of its parent's. The cpuset hot-unplug code violated this rule.

    So the cpuset hot-unplug handler must walk down the tree, removing any
    removed CPU or Node from all cpusets.

    However, it is not allowed to make a cpuset's cpus or mems become empty;
    they can only transition from empty to non-empty, not back.

    So if the above walk would remove the last CPU or Node from a cpuset, we
    scan back up the cpuset hierarchy, find the nearest ancestor that still
    has something online, and copy its CPU or Memory placement (see the
    sketch below).

    Signed-off-by: Paul Jackson
    Cc: Nathan Lynch
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
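
    An illustrative-only sketch of the repair step for CPUs (mems are
    analogous); the names and helpers are assumptions:

        /* Drop offline CPUs from this cpuset ... */
        cpus_and(cs->cpus_allowed, cs->cpus_allowed, cpu_online_map);

        /* ... but never leave it empty: copy placement from the nearest
         * ancestor that still has something online. */
        if (cpus_empty(cs->cpus_allowed)) {
                struct cpuset *parent = cs->parent;

                while (parent && cpus_empty(parent->cpus_allowed))
                        parent = parent->parent;
                if (parent)
                        cs->cpus_allowed = parent->cpus_allowed;
        }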
     
  • Change the list of memory nodes allowed to tasks in the top (root) cpuset
    to dynamically track what memory nodes are online, using a call to a
    cpuset hook from the memory hotplug code (sketched below). Make this top
    'mems' file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support memory hotplug, then these tasks cannot make
    use of memory nodes that are added after system boot, because the memory
    nodes are not allowed in the top cpuset. This is a surprising regression
    over earlier kernels that didn't have cpusets enabled.

    One key motivation for this change is to remain consistent with the
    behaviour for the top_cpuset's 'cpus', which is also read-only, and which
    automatically tracks the cpu_online_map.

    This change also has the minor benefit that it fixes a long-standing,
    little-noticed, minor bug in cpusets. The cpuset performance tweak to
    short-circuit the cpuset_zone_allowed() check on systems with just a
    single cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that
    simply changing the 'mems' of the top_cpuset had no effect, even though
    the change (the write system call) appeared to succeed. With the following
    change, that write to the 'mems' file fails -EACCES, and the 'mems' file
    stubbornly refuses to be changed via user-space writes. Thus no one should
    be misled into thinking they've changed the top_cpuset's 'mems' when in
    fact they haven't.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'mems' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of node_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged memory nodes
    allowed by their cpuset.

    [akpm@osdl.org: build fix]
    [bunk@stusta.de: build fix]
    Signed-off-by: Paul Jackson
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
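
    A sketch, with assumed names and locking, of the hook described above,
    called from the memory hotplug code:

        void cpuset_track_online_nodes(void)
        {
                mutex_lock(&callback_mutex);
                top_cpuset.mems_allowed = node_online_map;
                mutex_unlock(&callback_mutex);
        }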
     
  • This is an updated version of Eric Biederman's is_init() patch.
    (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
    replaces a few more instances of ->pid == 1 with is_init().

    Further, is_init() checks pid and thus removes dependency on Eric's other
    patches for now.

    Eric's original description:

    There are a lot of places in the kernel where we test for init
    because we give it special properties. Most significantly init
    must not die. This results in code all over the kernel that tests
    ->pid == 1.

    Introduce is_init() to capture this case (see the sketch below).

    With multiple pid spaces for all of the cases affected we are
    looking for only the first process on the system, not some other
    process that has pid == 1.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc:
    Acked-by: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
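
    The helper, as described (this version checks the pid directly):

        static inline int is_init(struct task_struct *tsk)
        {
                return tsk->pid == 1;
        }

    Callers then replace open-coded tests such as (p->pid == 1) with
    is_init(p).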
     

27 Sep, 2006

1 commit

  • This eliminates the i_blksize field from struct inode. Filesystems that
    want to provide a per-inode st_blksize can do so by providing their own
    getattr routine instead of using the generic_fillattr() function (see
    the sketch below).

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
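
    A hypothetical example of the replacement: a filesystem-specific getattr
    that overrides the default blksize after generic_fillattr(); 'foo' and
    FOO_BLOCK_SIZE are made-up names:

        static int foo_getattr(struct vfsmount *mnt, struct dentry *dentry,
                               struct kstat *stat)
        {
                generic_fillattr(dentry->d_inode, stat);
                stat->blksize = FOO_BLOCK_SIZE; /* per-fs choice */
                return 0;
        }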
     

26 Sep, 2006

2 commits

  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function (sketched below) and use it throughout
    the VM. Maybe we can find a way to optimize the lookup in the future.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
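
    The kind of helper described above; the body is an assumption based on
    the usual pointer chain it replaces:

        static inline int zone_to_nid(struct zone *zone)
        {
                return zone->zone_pgdat->node_id;
        }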
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node (an example use follows below).

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
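
    An example use of the new flag (illustrative call): allocate one page on
    node 'nid' with no fallback to other nodes:

        struct page *page = alloc_pages_node(nid,
                                GFP_KERNEL | __GFP_THISNODE, 0);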
     

28 Aug, 2006

2 commits

  • cpuset_excl_nodes_overlap always returns 0 if current is exiting. This
    caused customers' systems to panic in the OOM killer when processes were
    having trouble getting memory for the final put_user in mm_release, even
    though there were lots of processes to kill.

    Change it to return 1 in this case (see the sketch below). This achieves
    parity with the !CONFIG_CPUSETS case, and was observed to fix the problem.

    Signed-off-by: Nick Piggin
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
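
    A sketch of the change, with assumed surrounding code:

        /* If current is exiting, its cpuset is going away; report an
         * overlap (as the !CONFIG_CPUSETS stub effectively does) rather
         * than shielding every candidate from the OOM killer. */
        if (current->flags & PF_EXITING)
                return 1;       /* was: return 0 */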
     
  • Change the list of cpus allowed to tasks in the top (root) cpuset to
    dynamically track what cpus are online, using a CPU hotplug notifier
    (sketched below). Make this top 'cpus' file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support CPU hotplug, then these tasks cannot make use
    of CPUs that are added after system boot, because the CPUs are not allowed
    in the top cpuset. This is a surprising regression over earlier kernels
    that didn't have cpusets enabled.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'cpus' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of cpu_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
    by their cpuset.

    Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
    driving the fix, and earlier versions of this patch.

    Signed-off-by: Paul Jackson
    Cc: Nathan Lynch
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
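
    An assumed shape for the notifier described above; the real code may
    differ in locking and return-value details:

        static int cpuset_handle_cpuhp(struct notifier_block *nb,
                                       unsigned long phase, void *cpu)
        {
                /* Keep the top cpuset tracking the online map. */
                mutex_lock(&manage_mutex);
                top_cpuset.cpus_allowed = cpu_online_map;
                mutex_unlock(&manage_mutex);
                return 0;
        }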
     

24 Jul, 2006

1 commit

  • Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
    callback_mutex lock.

    It only happens on cpu_exclusive cpusets, due to the dynamic
    sched domain code trying to take the cpu hotplug lock inside
    the cpuset callback_mutex lock.

    This bug has apparently been here for several months, but didn't
    get hit until the right customer load on a large system.

    This fix appears right from inspection, but it will take a few
    more days running it on that customer's workload to be confident
    we nailed it. We don't have any other reproducible test case.

    The cpu hotplug lock tends to be held across large runs of code.
    The other places that hold both that lock and the cpuset callback
    mutex lock always nest the cpuset lock inside the hotplug lock.
    This place tries to do the reverse, risking an ABBA deadlock.

    This is in the cpuset_rmdir() code, where we:
    * take the callback_mutex lock
    * mark the cpuset CS_REMOVED
    * call update_cpu_domains for cpu_exclusive cpusets
    * in that call, take the cpu_hotplug lock if the
    cpuset is marked for removal.

    Thanks to Jack Steiner for identifying this deadlock.

    The fix is to tear down the dynamic sched domain before we grab
    the cpuset callback_mutex lock (see the outline below). This way,
    the two locks are serialized, with the hotplug lock taken and
    released before trying for the cpuset lock.

    I suspect that this bug was introduced when I changed the
    cpuset locking from one lock to two. The dynamic sched domain
    dependency on cpu_exclusive cpusets and its hotplug hooks were
    added to this code earlier, when cpusets had only a single lock.
    It may well have been fine then.

    Signed-off-by: Paul Jackson
    Signed-off-by: Linus Torvalds

    Paul Jackson
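
    The fix in outline; this code shape is an assumption, not the exact
    patch:

        /* Tear down the dynamic sched domain first; this path may take
         * the cpu hotplug lock internally. */
        update_cpu_domains(cs);

        /* Only then take the cpuset lock to finish the rmdir, so the
         * hotplug lock is never acquired inside callback_mutex. */
        mutex_lock(&callback_mutex);
        /* ... mark CS_REMOVED, unlink the cpuset ... */
        mutex_unlock(&callback_mutex);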
     

27 Jun, 2006

2 commits

  • Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
    that they work with struct pid instead of struct task_ref.

    Mostly this is a straight 1-1 substitution.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits, this
    pinning continues. With 8K stacks on a 32-bit machine, the amount pinned
    per file descriptor is about 10K.

    Normally I would figure a reasonable per-user process limit is about 100
    processes. With 80 processes, each holding 1000 open file descriptors, I
    can trigger the OOM killer on a 32-bit kernel, because I have pinned about
    800MB of useless data.

    This patch replaces the struct task_struct pointer with a pointer to a
    struct task_ref, which holds a struct task_struct pointer, so the pinning
    of dead tasks does not happen (a sketch of the indirection follows below).

    The code now has to contend with the fact that the task may exit at any
    time, which makes it a little, but not much, more complicated.

    With this change it takes about 1000 processes, each opening 1000 file
    descriptors, before I can trigger the OOM killer. Much better.

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
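
    A sketch of the indirection described above; the field layout is an
    assumption based on the commit text:

        struct task_ref {
                struct task_struct *task;  /* cleared once the task is gone */
                atomic_t count;            /* proc inodes pin this small
                                            * struct, not the task itself */
        };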
     

23 Jun, 2006

2 commits

  • Add a security hook call to enable security modules to control the ability
    to attach a task to a cpuset. While limited control over this operation is
    possible via permission checks on the pseudo fs interface, those checks are
    not sufficient to control access to the target task, which is looked up in
    this function. The existing task_setscheduler hook is re-used for this
    operation (see the sketch below), since it falls under the same class of
    operations.

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
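
    A sketch of the hook call described above, reusing the task_setscheduler
    hook; the exact arguments and placement are assumptions:

        /* Let the security module veto attaching 'tsk' to the cpuset. */
        retval = security_task_setscheduler(tsk, 0, NULL);
        if (retval)
                return retval;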
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer, as there's no longer any need to
    return the superblock pointer (an illustrative implementation follows
    after this entry).

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
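
    An illustrative implementation of the new operation shape for a simple
    filesystem; the 'foo' names are made up:

        static int foo_get_sb(struct file_system_type *fs_type, int flags,
                              const char *dev_name, void *data,
                              struct vfsmount *mnt)
        {
                /* The convenience helper fills in mnt->mnt_sb and
                 * mnt->mnt_root and returns an integer. */
                return get_sb_single(fs_type, flags, data,
                                     foo_fill_super, mnt);
        }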
     

22 May, 2006

2 commits

  • It's too easy to incorrectly call cpuset_zone_allowed() in an atomic
    context without __GFP_HARDWALL set, and when that happens, the mistake is
    not noticed until a tight memory situation forces allocations to be tried
    outside the current cpuset.

    Add a 'might_sleep_if()' check (shown below), to catch this earlier on,
    instead of waiting for a similar check in the mutex_lock() code, which is
    only rarely invoked.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
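
    The check, in outline; its exact placement within cpuset_zone_allowed()
    is an assumption:

        /* Softwall checks may block on callback_mutex; complain loudly
         * and early if called atomically without __GFP_HARDWALL. */
        might_sleep_if(!(gfp_mask & __GFP_HARDWALL));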
     
  • Update the kernel/cpuset.c:cpuset_zone_allowed() comment.

    The rule for when mm/page_alloc.c should call cpuset_zone_allowed()
    was intended to be:

    Don't call cpuset_zone_allowed() if you can't sleep, unless you
    pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
    the code that might scan up ancestor cpusets and sleep.

    The explanation of this rule in the comment above cpuset_zone_allowed() was
    stale, as a result of a restructuring of some __alloc_pages() code in
    November 2005.

    Rewrite that comment ...

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

01 Apr, 2006

1 commit

  • Fix memory migration so that it works regardless of what cpuset the invoking
    task is in.

    If a task invoked a memory migration, by doing one of:

    1) writing a different nodemask to a cpuset 'mems' file, or

    2) writing a task's pid to a different cpuset's 'tasks' file,
    where the cpuset had its 'memory_migrate' option turned on, then the
    allocation of the new pages for the migrated task(s) was constrained
    by the invoking task's cpuset.

    If this task wasn't in a cpuset that allowed the requested memory nodes,
    the memory migration would happen to some other nodes that were in that
    invoking task's cpuset. This was usually surprising and puzzling
    behaviour: Why didn't the pages move? Why did the pages move -there-?

    To fix this, temporarily change the invoking task's 'mems_allowed'
    task_struct field to the nodes the migrating tasks are moving to, so that
    new pages can be allocated there (see the outline below).

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
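
    An outline of the trick described above (assumed code shape); the
    invoking task's mems_allowed is widened only for the duration of the
    migration:

        nodemask_t saved = current->mems_allowed;

        current->mems_allowed = to;   /* allow allocating on target nodes */
        do_migrate_pages(mm, &from, &to, MPOL_MF_MOVE_ALL);
        current->mems_allowed = saved;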