07 Jan, 2009

33 commits

  • Signed-off-by: Alexey Dobriyan
    Cc: Gabor Gombas
    Cc: Jan Beulich
    Cc: Andi Kleen
    Cc: Ingo Molnar ,
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The atomic_t type cannot currently be used in some header files because it
    would create an include loop with asm/atomic.h. Move the type definition
    to linux/types.h to break the loop.

    Signed-off-by: Matthew Wilcox
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses
    mm->hiwater_xxx directly, this leads to 2 problems:

    - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any
    moment before the task exits, so we should check the current values of
    rss/vm anyway.

    - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can
    be preempted right before mm->hiwater_xxx = new_val, and another thread
    can use A_LOT of memory and exit in between. When the first thread
    resumes it can be the last thread in the thread group, in that case we
    report the wrong hiwater_xxx values which do not take A_LOT into
    account.

    Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change
    xacct_add_tsk() to use them. The first helper will also be used by
    rusage->ru_maxrss accounting.

    Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to
    decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody
    can look at this mm_struct when exit_mmap() actually unmaps the memory.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • s_syncing livelock avoidance was breaking data integrity guarantee of
    sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
    if there is a concurrent sys_sync happening.

    This livelock avoidance is much less important now that we don't have the
    get_super_to_sync() call after every sb that we sync. This was replaced
    by __put_super_and_need_restart.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove WB_SYNC_HOLD. The primary motiviation is the design of my
    anti-starvation code for fsync. It requires taking an inode lock over the
    sync operation, so we could run into lock ordering problems with multiple
    inodes. It is possible to take a single global lock to solve the ordering
    problem, but then that would prevent a future nice implementation of "sync
    multiple inodes" based on lock order via inode address.

    Seems like a backward step to remove this, but actually it is busted
    anyway: we can't use the inode lists for data integrity wait: an inode can
    be taken off the dirty lists but still be under writeback. In order to
    satisfy data integrity semantics, we should wait for it to finish
    writeback, but if we only search the dirty lists, we'll miss it.

    It would be possible to have a "writeback" list, for sys_sync, I suppose.
    But why complicate things by prematurely optimise? For unmounting, we
    could avoid the "livelock avoidance" code, which would be easier, but
    again premature IMO.

    Fixing the existing data integrity problem will come next.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Complete zap_pte_range()'s coverage of bad pagetable entries by calling
    print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
    That needs free_swap_and_cache() to tell it, which will also have shown
    one of those "swap_free" errors (but with much less information).

    Similar checks in fork's copy_one_pte()? No, that would be more noisy
    than helpful: we'll see them when parent and child exec or exit.

    Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR
    case, that could only be a bug in sys_remap_file_pages(), not a bad pte.
    VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
    with what happens if do_swap_page() operates a bad swap entry; but don't
    we have patches to be more careful about killing when VM_FAULT_OOM?

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
    a page: check the same flags as before when freeing, clear ALL the flags
    (unless PageReserved) when freeing, check ALL flags off when allocating.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap allocation has always started from the beginning of the swap area;
    but if we're dealing with a solidstate swap device which can only remap
    blocks within limited zones, that would sooner wear out the first zone.

    Therefore sys_swapon() test whether blk_queue is non-rotational, and if so
    randomize the cluster_next starting position for allocation.

    If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it
    with an "SS" at the right end of the kernel's "Adding ... swap" message
    (so that if it's both nonrot and discardable, "SSD" will be shown there).
    Perhaps something should be shown in /proc/swaps (swapon -s), but we have
    to be more cautious before making any addition to that format.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When scan_swap_map() finds a free cluster of swap pages to allocate,
    discard the old contents of the cluster if the device supports discard.
    But don't bother when swap is so fragmented that we allocate single pages.

    Be careful about racing allocations made while we're scanning for a
    cluster; and hold up allocations made while we're discarding.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When adding swap, all the old data on swap can be forgotten: sys_swapon()
    discard all but the header page of the swap partition (or every extent but
    the header of the swap file), to give a solidstate swap device the
    opportunity to optimize its wear-levelling.

    If that succeeds, note SWP_DISCARDABLE for later use, and report it with a
    "D" at the right end of the kernel's "Adding ... swap" message. Perhaps
    something should be shown in /proc/swaps (swapon -s), but we have to be
    more cautious before making any addition to that format.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Joern Engel
    Cc: James Bottomley
    Cc: Donjun Shin
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before making functional changes, rearrange scan_swap_map() to simplify
    subsequent diffs. Actually, there is one functional change in there:
    leave cluster_nr negative while scanning for a new cluster - resetting it
    early increased the likelihood that when we have difficulty finding a free
    cluster, another task may come in and try doing exactly the same - just a
    waste of cpu.

    Before making functional changes, rearrange struct swap_info_struct
    slightly: flags will be needed as an unsigned long (for wait_on_bit), next
    is a good int to pair with prio, old_block_size is uninteresting so shift
    it to the end.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sparse output following warnings.

    mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
    mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?

    However, it is used by /dev/kmem. fixed here.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc
    optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual
    declaration to swapfile.c: it never belonged in page_alloc.c.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If we add a failing stub for add_to_swap(), then we can remove the #ifdef
    CONFIG_SWAP from mm/vmscan.c.

    This was intended as a source cleanup, but looking more closely, it turns
    out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
    page, whereas now it goes to the more suitable activate_locked, like the
    CONFIG_SWAP nr_swap_pages 0 case.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove gfp_mask argument from add_to_swap(): it's misleading because its
    only caller, shrink_page_list(), is not atomic at that point; and in due
    course (implementing discard) we'll sometimes want to allocate some memory
    with GFP_NOIO (as is used in swap_writepage) when allocating swap.

    No change to the gfp_mask passed down to add_to_swap_cache(): still use
    __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
    though it's not obvious if that's the best combination to ask for here.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_count(),
    but assume that it's worth holding on to the swap if page is mapped and
    swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()?
    It would be consistent, but I think we probably have enough for now.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A good place to free up old swap is where do_wp_page(), or do_swap_page(),
    is about to redirty the page: the data on disk is then stale and won't be
    read again; and if we do decide to write the page out later, using the
    previous swap location makes an unnecessary disk seek very likely.

    So give can_share_swap_page() the side-effect of delete_from_swap_cache()
    when it safely can. And can_share_swap_page() was always a misleading
    name, the more so if it has a side-effect: rename it reuse_swap_page().

    Irrelevant cleanup nearby: remove swap_token_default_timeout definition
    from swap.h: it's used nowhere.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Acked-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This change introduces two new sysctls to /proc/sys/vm:
    dirty_background_bytes and dirty_bytes.

    dirty_background_bytes is the counterpart to dirty_background_ratio and
    dirty_bytes is the counterpart to dirty_ratio.

    With growing memory capacities of individual machines, it's no longer
    sufficient to specify dirty thresholds as a percentage of the amount of
    dirtyable memory over the entire system.

    dirty_background_bytes and dirty_bytes specify quantities of memory, in
    bytes, that represent the dirty limits for the entire system. If either
    of these values is set, its value represents the amount of dirty memory
    that is needed to commence either background or direct writeback.

    When a `bytes' or `ratio' file is written, its counterpart becomes a
    function of the written value. For example, if dirty_bytes is written to
    be 8096, 8K of memory is required to commence direct writeback.
    dirty_ratio is then functionally equivalent to 8K / the amount of
    dirtyable memory:

    dirtyable_memory = free pages + mapped pages + file cache

    dirty_background_bytes = dirty_background_ratio * dirtyable_memory
    -or-
    dirty_background_ratio = dirty_background_bytes / dirtyable_memory

    AND

    dirty_bytes = dirty_ratio * dirtyable_memory
    -or-
    dirty_ratio = dirty_bytes / dirtyable_memory

    Only one of dirty_background_bytes and dirty_background_ratio may be
    specified at a time, and only one of dirty_bytes and dirty_ratio may be
    specified. When one sysctl is written, the other appears as 0 when read.

    The `bytes' files operate on a page size granularity since dirty limits
    are compared with ZVC values, which are in page units.

    Prior to this change, the minimum dirty_ratio was 5 as implemented by
    get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
    written value between 0 and 100. This restriction is maintained, but
    dirty_bytes has a lower limit of only one page.

    Also prior to this change, the dirty_background_ratio could not equal or
    exceed dirty_ratio. This restriction is maintained in addition to
    restricting dirty_background_bytes. If either background threshold equals
    or exceeds that of the dirty threshold, it is implicitly set to half the
    dirty threshold.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The background dirty and dirty limits are better defined with type
    specifiers of unsigned long since negative writeback thresholds are not
    possible.

    These values, as returned by get_dirty_limits(), are normally compared
    with ZVC values to determine whether writeback shall commence or be
    throttled. Such page counts cannot be negative, so declaring the page
    limits as signed is unnecessary.

    Acked-by: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Cc: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • page_lock_anon_vma() and page_unlock_anon_vma() were made available to
    show_page_path() in vmscan.c; but now that has been removed, make them
    static in rmap.c again, they're better kept private if possible.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always
    appear together. Save some symbol table space and some jumping around by
    removing lru_cache_add_active_or_unevictable(), folding its code into
    page_add_new_anon_rmap(): like how we add file pages to lru just after
    adding them to page cache.

    Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and
    change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as
    originally intended.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then
    we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
    that harder to track down: remove it, and its out-of-work brothers
    GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.

    Since we're making that improvement to hotremove_migrate_alloc(), I think
    we can now also remove one of the "o"s from its comment.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • cgroup_mm_owner_callbacks() was brought in to support the memrlimit
    controller, but sneaked into mainline ahead of it. That controller has
    now been shelved, and the mm_owner_changed() args were inadequate for it
    anyway (they needed an mm pointer instead of a task pointer).

    Remove the dead code, and restore mm_update_next_owner() locking to how it
    was before: taking mmap_sem there does nothing for memcontrol.c, now the
    only user of mm->owner.

    Signed-off-by: Hugh Dickins
    Cc: Paul Menage
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • #ifdef in *.c file decrease source readability a bit. removing is better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • speculative page references patch (commit:
    e286781d5f2e9c846e012a39653a166e9d31777d) removed last
    pagevec_release_nonlru() caller.

    So this function can be removed now.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.
    Immediate:
    - Provides information needed to determine the specific node
    on which a defective DIMM is located. This will reduce system
    downtime when the node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was
    previously offlined due to a defective DIMM. This could happen
    during node hot-add when the user or node hot-add assist script
    onlines _all_ offlined sections due to user or script inability
    to identify the specific memory sections located on the hot-added
    node. The consequences of reintroducing the defective memory
    could be ugly.
    - Provides information needed to vary the amount and distribution
    of memory on specific nodes for testing or debugging purposes.
    Future:
    - Will provide information needed to identify the memory
    sections that need to be offlined prior to physical removal
    of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
     
  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels
    whereby a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processor. To distinguish, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Jan, 2009

7 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
    dm snapshot: extend exception store functions
    dm snapshot: split out exception store implementations
    dm snapshot: rename struct exception_store
    dm snapshot: separate out exception store interface
    dm mpath: move trigger_event to system workqueue
    dm: add name and uuid to sysfs
    dm table: rework reference counting
    dm: support barriers on simple devices
    dm request: extend target interface
    dm request: add caches
    dm ioctl: allow dm_copy_name_and_uuid to return only one field
    dm log: ensure log bitmap fits on log device
    dm log: move region_size validation
    dm log: avoid reinitialising io_req on every operation
    dm: consolidate target deregistration error handling
    dm raid1: fix error count
    dm log: fix dm_io_client leak on error paths
    dm snapshot: change yield to msleep
    dm table: drop reference at unbind

    Linus Torvalds
     
  • Implement barrier support for single device DM devices

    This patch implements barrier support in DM for the common case of dm linear
    just remapping a single underlying device. In this case we can safely
    pass the barrier through because there can be no reordering between
    devices.

    NB. Any DM device might cease to support barriers if it gets
    reconfigured so code must continue to allow for a possible
    -EOPNOTSUPP on every barrier bio submitted. - agk

    Signed-off-by: Andi Kleen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Andi Kleen
     
  • This patch adds the following target interfaces for request-based dm.

    map_rq : for mapping a request

    rq_end_io : for finishing a request

    busy : for avoiding performance regression from bio-based dm.
    Target can tell dm core not to map requests now, and
    that may help requests in the block layer queue to be
    bigger by I/O merging.
    In bio-based dm, this behavior is done by device
    drivers managing the block layer queue.
    But in request-based dm, dm core has to do that
    since dm core manages the block layer queue.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Change dm_unregister_target to return void and use BUG() for error
    reporting.

    dm_unregister_target can only fail because of programming bug in the
    target driver. It can't fail because of user's behavior or disk errors.

    This patch changes unregister_target to return void and use BUG if
    someone tries to unregister non-registered target or unregister target
    that is in use.

    This patch removes code duplication (testing of error codes in all dm
    targets) and reports bugs in just one place, in dm_unregister_target. In
    some target drivers, these return codes were ignored, which could lead
    to a situation where bugs could be missed.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • * 'for-next' of git://git.o-hand.com/linux-mfd: (30 commits)
    mfd: Fix section mismatch in da903x
    mfd: move drivers/i2c/chips/menelaus.c to drivers/mfd
    mfd: move drivers/i2c/chips/tps65010.c to drivers/mfd
    mfd: dm355evm msp430 driver
    mfd: Add missing break from wm3850-core
    mfd: Add WM8351 support
    mfd: Support configurable numbers of DCDCs and ISINKs on WM8350
    mfd: Handle missing WM8350 platform data
    mfd: Add WM8352 support
    mfd: Use irq_to_desc in twl4030 code
    power_supply: Add Dialog DA9030 battery charger driver
    mfd: Dialog DA9030 battery charger MFD driver
    mfd: Register WM8400 codec device
    mfd: Pass driver_data onto child devices
    mfd: Fix twl4030-core.c build error
    mfd: twl4030 regulator bug fixes
    mfd: twl4030: create some regulator devices
    mfd: twl4030: cleanup symbols and OMAP dependency
    mfd: twl4030: simplified child creation code
    power_supply: Add battery health reporting for WM8350
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    module: convert to stop_machine_create/destroy.
    stop_machine: introduce stop_machine_create/destroy.
    parisc: fix module loading failure of large kernel modules
    module: fix module loading failure of large kernel modules for parisc
    module: fix warning of unused function when !CONFIG_PROC_FS
    kernel/module.c: compare symbol values when marking symbols as exported in /proc/kallsyms.
    remove CONFIG_KMOD

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    swiotlb: Don't include linux/swiotlb.h twice in lib/swiotlb.c
    intel-iommu: fix build error with INTR_REMAP=y and DMAR=n
    swiotlb: add missing __init annotations

    Linus Torvalds