07 Jan, 2009

40 commits

  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.
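
    The replacement is small. A rough sketch of the new helper, from memory
    rather than quoted from the patch:

    int try_to_free_swap(struct page *page)
    {
            VM_BUG_ON(!PageLocked(page));

            if (!PageSwapCache(page))
                    return 0;
            if (PageWriteback(page))
                    return 0;
            if (page_swapcount(page))       /* swap still referenced? */
                    return 0;

            delete_from_swap_cache(page);
            SetPageDirty(page);
            return 1;
    }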

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.
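
    Roughly what reuse_swap_page() ends up looking like (a sketch, not the
    exact committed code): it answers "may this process reuse the page in
    place?", and performs the delete_from_swap_cache() side-effect described
    above when it safely can:

    int reuse_swap_page(struct page *page)
    {
            int count;

            VM_BUG_ON(!PageLocked(page));
            count = page_mapcount(page);
            if (count <= 1 && PageSwapCache(page)) {
                    count += page_swapcount(page);
                    if (count == 1 && !PageWriteback(page)) {
                            /* about to redirty: the swap copy is stale */
                            delete_from_swap_cache(page);
                            SetPageDirty(page);
                    }
            }
            return count == 1;
    }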

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • An application may rely on get_user_pages() to give it pages writable from
    userspace and shared with a driver, GUP breaking COW if necessary. It may
    mprotect() the pages' writability, off and on, from time to time.

    Normally this works fine (so long as the app does not fork); but just
    occasionally, under memory pressure, a readonly pte in a newly writable
    area is COWed unnecessarily, breaking the link with the driver: because
    do_wp_page() does trylock_page, and falls back to COW whenever that fails.

    For reliable behaviour in the unshared case, when the trylock_page fails,
    now unlock pagetable, lock page and relock pagetable, before deciding
    whether Copy-On-Write is really necessary.
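
    In do_wp_page() the fallback amounts to something like this when
    trylock_page() fails (a sketch of the shape of the fix, using the
    2.6.28-era names from mm/memory.c):

    if (!trylock_page(old_page)) {
            page_cache_get(old_page);
            pte_unmap_unlock(page_table, ptl);

            lock_page(old_page);
            page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
            if (!pte_same(*page_table, orig_pte)) {
                    /* pte changed while we slept: back out, no COW */
                    unlock_page(old_page);
                    page_cache_release(old_page);
                    goto unlock;
            }
            page_cache_release(old_page);
    }
    reuse = reuse_swap_page(old_page);
    unlock_page(old_page);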

    Reported-by: Zhou Yingchao
    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that
    COW has been done if necessary, though it may be leaving the pte without
    write permission - for the odd case of forced writing to a readonly vma
    for ptrace. At present GUP then retries the follow_page() without asking
    for write permission, to escape an endless loop when forced.

    But an application may be relying on GUP to guarantee a writable page
    which won't be COWed again when written from userspace, whereas a race
    here might leave a readonly pte in place? Change the VM_FAULT_WRITE
    handling to ask follow_page() for write permission again, except in that
    odd case of forced writing to a readonly vma.
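
    The change in __get_user_pages() is essentially one extra condition:
    after a VM_FAULT_WRITE, only drop FOLL_WRITE when we really are forcing
    a write into a readonly vma (sketched, not a verbatim quote):

    if (ret & VM_FAULT_WRITE) {
            /*
             * COW has been done; keep asking follow_page() for a
             * writable pte unless this is ptrace forcing a write
             * into a readonly vma, where no writable pte can exist.
             */
            if (!(vma->vm_flags & VM_WRITE))
                    foll_flags &= ~FOLL_WRITE;
    }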

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8096, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.
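
    Internally both pairs of knobs collapse to the same page-count
    thresholds. Roughly how get_dirty_limits() can derive them after this
    change (a sketch using its existing variable names, not the literal
    patch):

    unsigned long background, dirty;

    if (vm_dirty_bytes)
            dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
    else
            dirty = (vm_dirty_ratio * available_memory) / 100;

    if (dirty_background_bytes)
            background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
    else
            background = (dirty_background_ratio * available_memory) / 100;

    if (background >= dirty)
            background = dirty / 2;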

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The background dirty and dirty limits are better defined with type
    specifiers of unsigned long since negative writeback thresholds are not
    possible.

    These values, as returned by get_dirty_limits(), are normally compared
    with ZVC values to determine whether writeback shall commence or be
    throttled. Such page counts cannot be negative, so declaring the page
    limits as signed is unnecessary.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • As noted by Akinobu Mita in patch b1fceac2b9e04d278316b2faddf276015fc06e3b,
    alloc_bootmem and related functions never return NULL and always return a
    zeroed region of memory. Thus a NULL test or memset after calls to these
    functions is unnecessary.

    This was fixed using the following semantic patch.
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @@
    expression E;
    statement S;
    @@

    E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
    ... when != E
    (
    - BUG_ON (E == NULL);
    |
    - if (E == NULL) S
    )

    @@
    expression E,E1;
    @@

    E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...)
    ... when != E
    - memset(E,0,E1);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap()
    was good but stupid: we can and should SetPageSwapBacked() there too; and
    we know for sure that this anonymous, swap-backed page is not file cache.
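
    With this fix and the fold described further down both applied, the
    function reads approximately like this (a sketch, not the exact
    committed code):

    void page_add_new_anon_rmap(struct page *page,
            struct vm_area_struct *vma, unsigned long address)
    {
            VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
            SetPageSwapBacked(page);
            atomic_set(&page->_mapcount, 0);        /* mapcount starts at -1 */
            __page_set_anon_rmap(page, vma, address);
            if (page_evictable(page, vma))
                    lru_cache_add_lru(page, LRU_ACTIVE_ANON);
            else
                    add_page_to_unevictable_list(page);
    }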

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_lock_anon_vma() and page_unlock_anon_vma() were made available to
    show_page_path() in vmscan.c; but now that has been removed, make them
    static in rmap.c again: they're better kept private if possible.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code is over-provisioned with BUG_ONs on assorted page flags,
    mostly dating back to 2.3. They're good documentation, and guard against
    developer error, but a waste of space on most systems: change them to
    VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
    ones: they're later, from 2.5.69, but even less interesting now.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.
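
    In page-flags.h terms that is just (a sketch reusing the existing
    *_NOOP helper macros there):

    #ifdef CONFIG_SWAP
    PAGEFLAG(SwapCache, swapcache)
    #else
    PAGEFLAG_FALSE(SwapCache)
            SETPAGEFLAG_NOOP(SwapCache)
            CLEARPAGEFLAG_NOOP(SwapCache)
    #endif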

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
    that harder to track down: remove it, and its out-of-work brothers
    GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.

    Since we're making that improvement to hotremove_migrate_alloc(), I think
    we can now also remove one of the "o"s from its comment.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • cgroup_mm_owner_callbacks() was brought in to support the memrlimit
    controller, but sneaked into mainline ahead of it. That controller has
    now been shelved, and the mm_owner_changed() args were inadequate for it
    anyway (they needed an mm pointer instead of a task pointer).

    Remove the dead code, and restore mm_update_next_owner() locking to how it
    was before: taking mmap_sem there does nothing for memcontrol.c, now the
    only user of mm->owner.

    Signed-off-by: Hugh Dickins
    Cc: Paul Menage
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It is known that buffer_mapped() is false in this code path.

    Signed-off-by: Franck Bui-Huu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
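
    Concretely, the pte loop in apply_to_pte_range() gets bracketed by the
    lazy mmu hooks, roughly (sketch):

    arch_enter_lazy_mmu_mode();

    token = pmd_pgtable(*pmd);
    do {
            err = fn(pte, token, addr, data);
            if (err)
                    break;
    } while (pte++, addr += PAGE_SIZE, addr != end);

    arch_leave_lazy_mmu_mode();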

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Lazy unmapping in the vmalloc code has now opened the possibility for use
    after free bugs to go undetected. We can catch those by forcing an unmap
    and flush (which is going to be slow, but that's what happens).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The vmalloc purge lock can be a mutex so we can sleep while a purge is
    going on (purge involves a global kernel TLB invalidate, so it can take
    quite a while).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If helpers such as vmalloc_32() or vmalloc_user() simply allocate through
    __vmalloc(), the output of files like /proc/vmallocinfo will show
    "vmalloc_32", "vmalloc_user", or whichever helper was used as the caller.
    This info is not as useful as the real caller of the allocation.

    So call __vmalloc_node() directly from these helpers, with matching
    parameters, to preserve the caller information.
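
    For example, vmalloc_32() then becomes roughly (a sketch; GFP_VMALLOC32
    is vmalloc.c's local gfp mask for 32-bit-addressable memory):

    void *vmalloc_32(unsigned long size)
    {
            return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL,
                                  -1, __builtin_return_address(0));
    }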

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • If we can't service a vmalloc allocation, show size of the allocation that
    actually failed. Useful for debugging.

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references, their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.
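
    In page_referenced_one() the weak-reference behaviour is a small filter
    on the accessed bit, approximately (a sketch; VM_SequentialReadHint()
    tests VM_SEQ_READ, which madvise(MADV_SEQUENTIAL) sets):

    if (ptep_clear_flush_young_notify(vma, address, pte)) {
            /*
             * Don't count a reference through a sequentially read
             * mapping; if the page is in use through another mapping,
             * that mapping will keep it alive.
             */
            if (likely(!VM_SequentialReadHint(vma)))
                    referenced++;
    }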

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • #ifdefs in *.c files decrease source readability a bit; removing them is
    better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The speculative page references patch (commit
    e286781d5f2e9c846e012a39653a166e9d31777d) removed the last
    pagevec_release_nonlru() caller.

    So this function can now be removed.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Don't print the size of the zone's memmap array if it does not have one.

    Impact: cleanup

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with the
    maximum possible amount of physical location information for resources
    that can be hot-added and/or hot-removed, the following are some (but
    likely not all) of the user benefits provided by this change.

    Immediate:
    - Provides information needed to determine the specific node on which a
      defective DIMM is located. This will reduce system downtime when the
      node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was previously
      offlined due to a defective DIMM. This could happen during node hot-add
      when the user or node hot-add assist script onlines _all_ offlined
      sections due to user or script inability to identify the specific
      memory sections located on the hot-added node. The consequences of
      reintroducing the defective memory could be ugly.
    - Provides information needed to vary the amount and distribution of
      memory on specific nodes for testing or debugging purposes.

    Future:
    - Will provide information needed to identify the memory sections that
      need to be offlined prior to physical removal of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
     
  • Chris Mason noticed that do_sync_mapping_range didn't actually ask for data
    integrity writeout. Unfortunately, it is advertised as being usable for
    data integrity operations.

    This is a data integrity bug.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first page past
    end, without locking the page. Pages cannot have their index changed while
    we hold a reference on them (truncate, e.g. truncate_inode_pages_range,
    performs the same check without the page lock).

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if the page was redirtied during the long wait), then
    wait for that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).
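
    The relevant part of the write_cache_pages() loop then looks roughly
    like this (a sketch of the shape after this series, not a verbatim
    quote):

    if (PageWriteback(page)) {
            if (wbc->sync_mode != WB_SYNC_NONE)
                    wait_on_page_writeback(page);
            else
                    goto continue_unlock;
    }

    BUG_ON(PageWriteback(page));
    if (!clear_page_dirty_for_io(page))
            goto continue_unlock;       /* already clean: nothing to do */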

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could break data integrity
    semantics easily in practice. For example, nr_to_write can be set to
    mapping->nrpages * 2; however, if a file has a single dirty page and then
    fsync is called, subsequent pages might be concurrently added and dirtied,
    and write_cache_pages might write out two of these newly dirty pages while
    not writing out the old page that should have been written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary amount of pages are
    written, whether or not we have provided the data-integrity semantics that
    the caller has asked for. Even this doesn't actually fix all stall cases
    completely: in the above situation, if the file has a huge number of pages
    in pagecache (but not dirty), then mapping->nrpages is going to be huge,
    even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger, and
    that's not a good thing, but lying about data integrity is even worse. We
    have to either perform the sync, or return -ELINUXISLAME so at least the
    caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.
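
    The nr_to_write accounting in write_cache_pages() then terminates the
    loop only for non-integrity writeback, roughly (sketch):

    if (nr_to_write > 0) {
            nr_to_write--;
            if (nr_to_write == 0 && wbc->sync_mode == WB_SYNC_NONE) {
                    /*
                     * Stop writing only if this is not a data integrity
                     * sync; for integrity we must keep going until every
                     * dirty page in the range has been written.
                     */
                    done = 1;
                    break;
            }
    }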

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if ret signals a real error, but we still have some
    pages left in the pagevec, done would be set to 1, but the remaining pages
    would continue to be processed and ret would be overwritten in the
    process.

    It could easily be overwritten with success, and thus success would be
    returned even if there was an error. Thus the caller is told all writes
    succeeded, whereas in reality some did not.

    Fix this by bailing immediately if there is an error, and retaining the
    first error code.

    This is a data integrity bug.
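
    The writepage error handling then becomes roughly (sketch):

    ret = (*writepage)(page, wbc, data);
    if (unlikely(ret)) {
            if (ret == AOP_WRITEPAGE_ACTIVATE) {
                    unlock_page(page);
                    ret = 0;
            } else {
                    /* real error: stop and report it to the caller */
                    done = 1;
                    break;
            }
    }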

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • We'd like to break out of the loop early in many situations; however, the
    existing code has been setting mapping->writeback_index past the final
    page in the pagevec lookup for cyclic writeback. This is a problem if we
    don't process all pages up to the final page.

    Currently the code mostly keeps writeback_index reasonable and hacks
    around this by not breaking out of the loop or writing pages outside the
    range in these cases. Keep track of a real "done index" that enables us
    to terminate the loop in a much more flexible manner.

    Needed by the subsequent patch to preserve writepage errors, and then
    further patches to break out of the loop early for other reasons. However
    there are no functional changes with this patch alone.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, scanned == 1 is supposed to mean that cyclic
    writeback has circled through zero, thus we should not circle again.
    However it gets set to 1 after the first successful pagevec lookup. This
    leads to cases where not enough data gets written.

    Counterexample: file with first 10 pages dirty, writeback_index == 5,
    nr_to_write == 10. Then the last 5 pages will be found, and scanned will
    be set to 1; after writing those out, we will not cycle back to get the
    first 5.

    Rework this logic: now we'll always cycle unless we started off from index
    0. When cycling, only write out as far as 1 page before the start page
    from the first cycle (so we don't write parts of the file twice).
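
    The reworked cyclic logic looks approximately like this (a sketch of
    the shape of the fix):

    if (wbc->range_cyclic) {
            writeback_index = mapping->writeback_index; /* previous offset */
            index = writeback_index;
            cycled = (index == 0);
            end = -1;
    } else {
            index = wbc->range_start >> PAGE_CACHE_SHIFT;
            end = wbc->range_end >> PAGE_CACHE_SHIFT;
    }
    retry:
    /* ... write out dirty pages from index up to end ... */

    if (wbc->range_cyclic && !cycled && !done) {
            /* wrap back to the start of the file, but only once */
            cycled = 1;
            index = 0;
            end = writeback_index - 1;
            goto retry;
    }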

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • While tracing I/O patterns with blktrace (a great tool) a few weeks ago, I
    identified a minor issue in fs/mpage.c.

    As the comment above mpage_readpages() says, a fs's get_block function
    will set BH_Boundary when it maps a block just before a block for which
    extra I/O is required.

    Since get_block() can map a range of pages, for all these pages the
    BH_Boundary flag will be set. But we only need to push what I/O we have
    accumulated at the last block of this range.

    This makes do_mpage_readpage() send out the largest possible bio instead
    of a bunch of page-sized ones in the BH_Boundary case.

    Signed-off-by: Miquel van Smoorenburg
    Cc: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miquel van Smoorenburg
     
  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.
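
    On the architecture side the conversion is tiny; for the x86 fault
    handler it is roughly (sketch):

    out_of_memory:
            /*
             * We ran out of memory: call into the OOM killer and return
             * to userspace, which will retry the fault (or be killed if
             * we were selected).
             */
            up_read(&mm->mmap_sem);
            pagefault_out_of_memory();
            return;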

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct page_to_node
    that are passed to do_move_page_to_node_array(). We now only have to
    allocate a single page instead of a possibly very large vmalloc area to
    all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hiding
    much of the overall sys_move_pages() overhead.

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin