29 Jun, 2006

1 commit


28 Jun, 2006

15 commits

  • Runtime debugging functionality for rt-mutexes.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add debug_check_no_locks_freed(), as a central inline to add
    bad-lock-free-debugging functionality to.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
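
    A minimal sketch of the intended use, assuming the inline takes the start
    and length of the region being freed (the surrounding free helper is a
    hypothetical example, not taken from the patch):

        static void my_free(void *objp, unsigned long size)
        {
                /* complain if a currently-held lock lives inside this region */
                debug_check_no_locks_freed(objp, size);
                kfree(objp);
        }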
     
  • Mark notifier_calls associated with cpu_notifier as __cpuinit.

    __cpuinit makes sure that the function is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    [akpm@osdl.org: section fix]
    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
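
    A minimal sketch of the annotation pattern, with a hypothetical callback
    name:

        static int __cpuinit my_cpu_callback(struct notifier_block *nb,
                                             unsigned long action, void *hcpu)
        {
                /* text is discarded after init unless CONFIG_HOTPLUG_CPU is set */
                return NOTIFY_OK;
        }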
     
  • Mark notifier_blocks associated with cpu_notifier as __cpuinitdata.

    __cpuinitdata makes sure that the data is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
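
    The matching data-side annotation, again with hypothetical names:

        /* the callback itself is marked __cpuinit (see the previous entry) */
        static int my_cpu_callback(struct notifier_block *nb,
                                   unsigned long action, void *hcpu);

        static struct notifier_block __cpuinitdata my_cpu_notifier = {
                .notifier_call = my_cpu_callback,
        };

        static int __init my_driver_init(void)
        {
                register_cpu_notifier(&my_cpu_notifier);
                return 0;
        }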
     
  • In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
    band-aid solution to solve that problem. In the process, I undid all the
    changes you both were making to ensure that these notifiers were available
    only at init time (unless CONFIG_HOTPLUG_CPU is defined).

    We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
    XFS problem cleanly and makes the cpu notifiers available only at init time
    (unless CONFIG_HOTPLUG_CPU is defined).

    If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
    time.

    This patch reverts the notifier_call changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • generic_file_buffered_write() prefaults in user pages in order to avoid
    deadlock on copying from the same page as write goes to.

    However, it looks like there is a problem when the write is vectored:
    fault_in_pages_readable brings in the current segment or part of it (maxlen).
    OTOH, filemap_copy_from_user_iovec is called to copy a number of bytes
    (bytes) which may exceed the current segment, so filemap_copy_from_user_iovec
    switches to the next segment, which has not been brought in yet. A page fault
    is generated, which causes a deadlock if the fault is for the same page the
    write goes to: the page being written is locked and not uptodate, so the
    fault handler deadlocks trying to lock the already-locked page.

    [akpm@osdl.org: somewhat rewritten]
    Cc: Neil Brown
    Cc: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir V. Saveliev
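
    A hedged sketch of the idea behind the fix: never copy more than what
    fault_in_pages_readable() prefaulted, so the copy cannot wander into a
    segment that has not been brought in yet (variable names follow the
    description above and are approximate):

        /* inside the generic_file_buffered_write() copy loop (sketch) */
        size_t maxlen = cur_iov->iov_len - iov_offset;  /* left in this segment */

        if (bytes > maxlen)
                bytes = maxlen;         /* stay within the prefaulted segment */
        fault_in_pages_readable(buf, bytes);
        copied = filemap_copy_from_user_iovec(page, offset,
                                              cur_iov, iov_offset, bytes);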
     
  • locking init cleanups:

    - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
    - convert rwlocks in a similar manner

    This patch was generated automatically.

    Motivation:

    - cleanliness
    - lockdep needs control of lock initialization, which the open-coded
    variants do not give
    - it's also useful for -rt and for lock debugging in general

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
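
    The conversion in a nutshell (illustrative names):

        /* before: open-coded initializers, invisible to the lock debugging code */
        static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
        static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;

        /* after: initialization goes through the proper macros/helpers */
        static DEFINE_SPINLOCK(my_lock);
        static DEFINE_RWLOCK(my_rwlock);

        /* for locks embedded in dynamically allocated objects */
        spin_lock_init(&obj->lock);
        rwlock_init(&obj->rwlock);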
     
  • Localize poison values into one header file for better documentation and
    easier/quicker debugging and so that the same values won't be used for
    multiple purposes.

    Use these constants in core arch., mm, driver, and fs code.

    Signed-off-by: Randy Dunlap
    Acked-by: Matt Mackall
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
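
    A hedged sketch of the kind of definitions being centralized (the names
    and values shown are the usual list/slab poison constants; the header's
    exact contents may differ):

        /* include/linux/poison.h (illustrative excerpt) */
        #define LIST_POISON1  ((void *) 0x00100100)  /* ->next of a deleted entry */
        #define LIST_POISON2  ((void *) 0x00200200)  /* ->prev of a deleted entry */

        #define POISON_INUSE  0x5a  /* slab: allocated but not yet initialized */
        #define POISON_FREE   0x6b  /* slab: object has been freed */
        #define POISON_END    0xa5  /* last byte of a poisoned region */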
     
  • When a new node becomes enabled by hot-add, a new sysfs file must be created
    for it. So, if a new node is enabled by add_memory(), register_one_node() is
    called to create it. In addition, i386's arch_register_node() and part of
    powerpc's register_nodes() are consolidated into register_one_node() as
    generic code.

    This was tested on Tiger4 (IPF) with node hot-plug emulation.

    Signed-off-by: Keiichiro Tokunaga
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
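
    A minimal sketch of the call site described above (the surrounding hot-add
    path and error handling are approximations):

        /* in the memory hot-add path, once the new node's pgdat exists */
        if (new_node) {
                /* creates /sys/devices/system/node/nodeN for the new node */
                ret = register_one_node(nid);
        }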
     
  • Fix "undefined reference to `arch_add_memory'" on sparc64 allmodconfig.

    sparc64 doesn't support memory hotplug. But we want it to support
    sparsemem.

    Signed-off-by: Yasunori Goto
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • This patch allows hot-adding memory which is not aligned to a section.

    Currently, hot-added memory has to be aligned to the section size.
    Considering archs with large sections, this is not useful.

    When hot-added memory is registered as an iomem resource by the iomem
    resource patch, we can make use of that information to detect the valid
    memory range.

    Note: With this, non-aligned memory can be registered. To allow hot-adding
    memory with holes, we have to do more work around add_memory(). (It doesn't
    allow adding memory to an already existing memory section.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Register hot-added memory to iomem_resource. With this, /proc/iomem can
    show hot-added memory.

    Note: kdump uses /proc/iomem to determine the memory range when it is
    installed, so kdump should be re-installed after /proc/iomem changes.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Vivek Goyal
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
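
    A hedged sketch of the registration, following the usual "System RAM"
    resource pattern (the function name and error handling are approximations):

        static struct resource *register_memory_resource(u64 start, u64 size)
        {
                struct resource *res;

                res = kzalloc(sizeof(struct resource), GFP_KERNEL);
                if (!res)
                        return NULL;

                res->name  = "System RAM";
                res->start = start;
                res->end   = start + size - 1;
                res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
                if (request_resource(&iomem_resource, res)) {
                        kfree(res);     /* conflicts with an existing region */
                        res = NULL;
                }
                return res;
        }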
     
  • Add node-hot-add support to add_memory().

    Node hot-add uses this sequence:
    1. allocate pgdat.
    2. refresh NODE_DATA().
    3. call free_area_init_node() to initialize it.
    4. create a sysfs entry.
    5. add memory (the old add_memory()).
    6. set the node online.
    7. run kswapd for the new node.
    (8). update the zonelist after pages are onlined. (This is already merged in
    -mm because its update phase differs.)

    Note:
    To share as much common code as possible, there are two changes from v2.
    - The old add_memory(), which is defined by each arch, is renamed to
    arch_add_memory(). The new add_memory() becomes common code that calls
    the arch-dependent function.

    - This patch changes add_memory()'s interface
    from: add_memory(start, end)
    to:   add_memory(nid, start, end).
    The reason is that similar code to find the node id from the physical
    address was duplicated inside the old add_memory() on each arch.

    In addition, the acpi memory hotplug driver can find the node id more
    easily. In v2, it had to walk the DSDT's _CRS, matching the physical
    address to get the handle of its memory device and then its _PXM and node
    id, because the input was just a physical address. In v3, the acpi driver
    can use the handle to get the _PXM and node id for the new memory device
    and pass just the node id to add_memory().

    The fix to arch_add_memory()'s interface is in the next patch.

    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
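
    A condensed, hedged sketch of the new common add_memory() built from the
    steps above (helper names such as hotadd_new_pgdat() are assumptions, and
    error handling is trimmed):

        int add_memory(int nid, u64 start, u64 size)
        {
                pg_data_t *pgdat = NULL;
                int new_pgdat = 0, ret;

                if (!node_online(nid)) {
                        pgdat = hotadd_new_pgdat(nid, start);   /* steps 1-3 */
                        if (!pgdat)
                                return -ENOMEM;
                        new_pgdat = 1;
                        kswapd_run(nid);                        /* step 7 */
                }

                ret = arch_add_memory(nid, start, size);        /* step 5 */
                if (ret < 0)
                        return ret;

                node_set_online(nid);                           /* step 6 */
                if (new_pgdat)
                        ret = register_one_node(nid);           /* step 4 */
                return ret;
        }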
     
  • When a node is hot-added, kswapd for the node should start. This exports the
    kswapd start function as kswapd_run() for use in add_memory().

    [akpm@osdl.org: daemonize() isn't needed when using the kthread API]
    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
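
    A hedged sketch of kswapd_run() on top of the kthread API, as the bracketed
    note above suggests (details are approximate):

        void kswapd_run(int nid)
        {
                pg_data_t *pgdat = NODE_DATA(nid);

                if (pgdat->kswapd)
                        return;         /* already running for this node */

                pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
                if (IS_ERR(pgdat->kswapd))
                        pgdat->kswapd = NULL;
        }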
     
  • Change the name of the old add_memory() to arch_add_memory(), and use the
    node id to get the node's pgdat via NODE_DATA().

    Note: powerpc's old add_memory() is defined as __devinit. However,
    add_memory() is usually called only after bootup, so I suppose the
    annotation may be redundant. But I'm not well versed in powerpc, so I keep
    it. (Though __meminit would be better, at least.)

    Signed-off-by: Yasunori Goto
    Cc: Dave Hansen
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
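
    A hedged sketch of the per-arch side after the rename, roughly in the shape
    of a flat-memory architecture (the zone choice and helpers are assumptions):

        int arch_add_memory(int nid, u64 start, u64 size)
        {
                struct pglist_data *pgdat = NODE_DATA(nid);
                struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
                unsigned long start_pfn = start >> PAGE_SHIFT;
                unsigned long nr_pages = size >> PAGE_SHIFT;

                return __add_pages(zone, start_pfn, nr_pages);
        }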
     

27 Jun, 2006

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
    typo fixes
    Clean up 'inline is not at beginning' warnings for usb storage
    Storage class should be first
    i386: Trivial typo fixes
    ixj: make ixj_set_tone_off() static
    spelling fixes
    fix paniced->panicked typos
    Spelling fixes for Documentation/atomic_ops.txt
    move acknowledgment for Mark Adler to CREDITS
    remove the bouncing email address of David Campbell

    Linus Torvalds
     
  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits, this
    pinning continues. With 8K stacks on a 32-bit machine the amount pinned per
    file descriptor is about 10K.

    Normally I would figure a reasonable per-user process limit is about 100
    processes. With 80 processes, each with 1000 file descriptors, I can trigger
    the OOM killer on a 32-bit kernel, because I have pinned about 800MB of
    useless data.

    This patch replaces the struct task_struct pointer with a pointer to a struct
    task_ref, which in turn holds a struct task_struct pointer, so the pinning of
    dead tasks does not happen.

    The code now has to contend with the fact that the task may exit at any
    time, which is a little, but not much, more complicated.

    With this change it takes about 1000 processes each opening up 1000 file
    descriptors before I can trigger the OOM killer. Much better.

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
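
    A heavily hedged sketch of the indirection being described (the struct
    layout is illustrative only, not the code that was merged):

        /* small, cheap object that /proc inodes can pin instead of the task */
        struct task_ref {
                struct task_struct *task;  /* cleared when the task exits */
                atomic_t count;
        };

        /* proc code must now tolerate tref->task becoming NULL at any time */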
     
  • This patch converts the combination of list_del(A) and list_add(A, B) to
    list_move(A, B).

    Cc: Greg Kroah-Hartman
    Cc: Ram Pai
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
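
    The conversion pattern:

        /* before */
        list_del(&entry->list);
        list_add(&entry->list, &new_head);

        /* after: one call, same effect */
        list_move(&entry->list, &new_head);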
     
  • acquired (aquired)
    contiguous (contigious)
    successful (succesful, succesfull)
    surprise (suprise)
    whether (weather)
    some other misspellings

    Signed-off-by: Andreas Mohr
    Signed-off-by: Adrian Bunk

    Andreas Mohr
     

26 Jun, 2006

9 commits

  • * git://git.linux-nfs.org/pub/linux/nfs-2.6: (51 commits)
    nfs: remove nfs_put_link()
    nfs-build-fix-99
    git-nfs-build-fixes
    Merge branch 'odirect'
    NFS: alloc nfs_read/write_data as direct I/O is scheduled
    NFS: Eliminate nfs_get_user_pages()
    NFS: refactor nfs_direct_free_user_pages
    NFS: remove user_addr, user_count, and pos from nfs_direct_req
    NFS: "open code" the NFS direct write rescheduler
    NFS: Separate functions for counting outstanding NFS direct I/Os
    NLM: Fix reclaim races
    NLM: sem to mutex conversion
    locks.c: add the fl_owner to nlm_compare_locks
    NFS: Display the chosen RPCSEC_GSS security flavour in /proc/mounts
    NFS: Split fs/nfs/inode.c
    NFS: Fix typo in nfs_do_clone_mount()
    NFS: Fix compile errors introduced by referrals patches
    NFSv4: Ensure that referral mounts bind to a reserved port
    NFSv4: A root pathname is sent as a zero component4
    NFSv4: Follow a referral
    ...

    Linus Torvalds
     
  • Backoff readahead size exponentially on I/O error.

    Michael Tokarev described the problem as:

    [QUOTE]
    Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
    In order to "fix" it, one has to read it and write to another CD-rom,
    or something.. or just ignore the error (if it's just a skip in a video
    stream). Let's assume the unreadable block is number U.

    But current behavior is just insane. An application requests block
    number N, which is before U. Kernel tries to read-ahead blocks N..U.
    Cdrom drive tries to read it, re-read it.. for some time. Finally,
    when all the N..U-1 blocks are read, kernel returns block number N
    (as requested) to an application, successfully.

    Now an app requests block number N+1, and kernel tries to read
    blocks N+1..U+1. Retrying again as in previous step.

    And so on, up to when an app requests block number U-1. And when,
    finally, it requests block U, it receives read error.

    So, the kernel currently tries to re-read the same failing block as
    many times as the current readahead value (256 by default?).

    This whole process already killed my cdrom drive (I posted about it
    to LKML several months ago) - literally, the drive has fried, and
    does not work anymore. Of course that problem was a bug in the firmware
    (or whatever) of the drive *too*, but.. main problem with that is
    current readahead logic as described above.
    [/QUOTE]

    Which was confirmed by Jens Axboe:

    [QUOTE]
    For ide-cd, it tends to only end the first part of the request on a
    medium error. So you may see a lot of repeats :/
    [/QUOTE]

    With this patch, retries are expected to be reduced from, say, 256, to 5.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
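
    A hedged sketch of the backoff itself: shrink the readahead window whenever
    a read fails, so each retry covers far fewer blocks (the function and field
    names are close to, but not guaranteed to match, the actual patch):

        /* called when ->readpage() fails with an I/O error */
        static void shrink_readahead_size_eio(struct file *filp,
                                              struct file_ra_state *ra)
        {
                if (!ra->ra_pages)
                        return;

                ra->ra_pages /= 4;      /* exponential backoff across repeated errors */
        }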
     
  • Put short function description for read_cache_pages() on one line as needed
    by kernel-doc.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • The problem is that when we write to a file, the copy from userspace to
    pagecache is first done with preemption disabled, so if the source address is
    not immediately available the copy fails *and* *zeros* *the* *destination*.

    This is a problem because a concurrent read (which admittedly is an odd thing
    to do) might see zeros rather than what was there before the write, or what was
    there after, or some mixture of the two (any of these being a reasonable thing
    to see).

    If the copy did fail, it will immediately be retried with preemption
    re-enabled so any transient problem with accessing the source won't cause an
    error.

    The first copying does not need to zero any uncopied bytes, and doing so
    causes the problem. It uses copy_from_user_atomic rather than copy_from_user
    so the simple expedient is to change copy_from_user_atomic to *not* zero out
    bytes on failure.

    The first of these two patches prepares for the change by fixing two places
    which assume copy_from_user_atomic does zero the tail. The two usages are
    very similar pieces of code which copy from a userspace iovec into one or more
    page-cache pages. These are changed to remove the assumption.

    The second patch changes __copy_from_user_inatomic* to not zero the tail.
    Once these are accepted, I will look at similar patches of other architectures
    where this is important (ppc, mips and sparc being the ones I can find).

    This patch:

    There is a problem with __copy_from_user_inatomic zeroing the tail of the
    buffer in the case of an error. As it is called in atomic context, the error
    may be transient, so it results in zeros being written where maybe they
    shouldn't be.

    In the usage in filemap, this opens a window for a well timed read to see data
    (zeros) which is not consistent with any ordering of reads and writes.

    In most cases where __copy_from_user_inatomic is called, a failure results in
    __copy_from_user being called immediately. As long as the latter zeros the
    tail, the former doesn't need to. However in *copy_from_user_iovec
    implementations (in both filemap and ntfs/file), it is assumed that
    copy_from_user_inatomic will zero the tail.

    This patch removes that assumption, so that after this patch it will
    be safe for copy_from_user_inatomic to not zero the tail.

    This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.

    After this patch, all architectures that might disable preempt when
    kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
    This includes
    - powerpc
    - i386
    - mips
    - sparc

    Signed-off-by: Neil Brown
    Cc: David Howells
    Cc: Anton Altaparmakov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
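
    A hedged sketch of the copy pattern these patches adjust: try the atomic
    copy first, and let only the non-atomic retry worry about zeroing the tail
    (the wrapper name and kmap slot are illustrative):

        static size_t copy_into_pagecache(struct page *page, unsigned long offset,
                                          const char __user *buf, unsigned bytes)
        {
                char *kaddr;
                int left;

                kaddr = kmap_atomic(page, KM_USER0);
                left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
                kunmap_atomic(kaddr, KM_USER0);

                if (left != 0) {
                        /* the fault was probably transient: retry with page
                         * faults allowed; this slow path may zero the tail,
                         * so the atomic path above no longer has to */
                        kaddr = kmap(page);
                        left = __copy_from_user(kaddr + offset, buf, bytes);
                        kunmap(page);
                }
                return bytes - left;
        }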
     
  • This is redundant with the check in wakeup_kswapd.

    Signed-off-by: Chris Wright
    Acked-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
     
  • pdflush is carefully designed to ensure that all wakeups have some
    corresponding work to do - if a woken-up pdflush thread discovers that it
    hasn't been given any work to do then this is considered an error.

    That all broke when swsusp came along - because a timer-delivered wakeup to a
    frozen pdflush thread will just get lost. This causes the pdflush thread to
    get lost as well: the writeback timer is supposed to be re-armed by pdflush in
    process context, but pdflush doesn't execute the callout which does this.

    Fix that up by ignoring the return value from try_to_freeze(): just proceed,
    see if we have any work pending and only go back to sleep if that is not the
    case.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
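
    The change in miniature (a hedged reconstruction of the loop body):

        /* before: a wakeup delivered while the thread was frozen is simply lost */
        if (try_to_freeze())
                continue;

        /* after: thaw, then fall through and check whether any work is pending */
        try_to_freeze();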
     
  • Hugh clarified the role of VM_LOCKED. So we can now implement page
    migration for mlocked pages.

    Allow the migration of mlocked pages. This means that try_to_unmap must
    unmap mlocked pages in the migration case.

    Signed-off-by: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
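
    A hedged sketch of the try_to_unmap() side of this: the VM_LOCKED bailout
    is skipped when unmapping for migration (the parameter name is an
    assumption):

        /* inside try_to_unmap_one(), which now takes a 'migration' flag */
        if (!migration && (vma->vm_flags & VM_LOCKED))
                return SWAP_FAIL;  /* keep mlocked pages mapped, except for migration */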
     
  • Hooks for calling vma specific migration functions

    With this patch a vma may define a vma->vm_ops->migrate function. That
    function may perform page migration on its own (some vmas may not contain page
    structs and therefore cannot be handled by regular page migration. Pages in a
    vma may require special preparatory treatment before migration is possible
    etc.). Only mmap_sem is held when the migration function is called. The
    migrate() function gets passed two sets of nodemasks describing the source and
    the target of the migration. The flags parameter either contains

    MPOL_MF_MOVE which means that only pages used exclusively by
    the specified mm should be moved

    or

    MPOL_MF_MOVE_ALL which means that pages shared with other processes
    should also be moved.

    The migration function returns 0 on success or an error condition. An error
    condition will prevent regular page migration from occurring.

    On its own this patch cannot be included since there are no users for this
    functionality. But it seems that the uncached allocator will need this
    functionality at some point.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
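
    A hedged sketch of the hook as described (the exact signature is assumed
    from the text above):

        struct vm_operations_struct {
                /* ... existing methods ... */
                int (*migrate)(struct vm_area_struct *vma,
                               const nodemask_t *from, const nodemask_t *to,
                               unsigned long flags);  /* MPOL_MF_MOVE[_ALL] */
        };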
     
  • AOP_TRUNCATED_PAGE victims in read_pages() belong in the LRU

    Nick Piggin rightly pointed out that the introduction of AOP_TRUNCATED_PAGE
    to read_pages() was wrong to leave A_T_P victim pages in the page cache but
    not put them in the LRU. Failing to do so hid them from the VM.

    A_T_P just means that the aop method unlocked the page rather than
    performing IO. It would be very rare that the page was truncated between
    the unlock and testing A_T_P. So we leave the pages in the LRU for likely
    reuse soon rather than backing them back out of the page cache. We do this
    by matching the behaviour before the A_T_P introduction which added pages
    to the LRU regardless of what ->readpage() did.

    This doesn't include the unrelated cleanup in Nick's initial fix which
    changed read_pages() to return void to match its only caller's behaviour of
    ignoring errors.

    Signed-off-by: Nick Piggin
    Signed-off-by: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
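
    A hedged before/after sketch of the read_pages() loop body (simplified):

        /* before: an AOP_TRUNCATED_PAGE return skipped the LRU add, hiding the
         * still-cached page from the VM */
        ret = mapping->a_ops->readpage(filp, page);
        if (ret != AOP_TRUNCATED_PAGE)
                if (!pagevec_add(&lru_pvec, page))
                        __pagevec_lru_add(&lru_pvec);

        /* after: add the page to the LRU regardless of what ->readpage() did */
        mapping->a_ops->readpage(filp, page);
        if (!pagevec_add(&lru_pvec, page))
                __pagevec_lru_add(&lru_pvec);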
     

25 Jun, 2006

1 commit


23 Jun, 2006

10 commits

  • A process flag to indicate whether we are doing sync io is incredibly
    ugly. It also causes performance problems when one does a lot of async
    io and then proceeds to sync it. Part of the io will go out as async,
    and the other part as sync. This causes a disconnect between the
    previously submitted io and the synced io. For io schedulers such as CFQ,
    this will cost us lost merges and suboptimal scheduling behaviour.

    Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
    the O_DIRECT path just directly indicate that the writes are sync
    by using WRITE_SYNC instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
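
    A hedged before/after sketch of the two sides of the change (the call
    sites shown are approximations):

        /* before: fsync()/msync() bracketed writeback with a process flag */
        current->flags |= PF_SYNCWRITE;
        ret = do_fsync(fd, datasync);           /* illustrative call */
        current->flags &= ~PF_SYNCWRITE;

        /* after: the O_DIRECT write path tags its bios as synchronous directly */
        submit_bio(WRITE_SYNC, bio);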
     
  • Signed-off-by: Eric Sesterhenn
    Signed-off-by: Alexey Dobriyan
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Alan Cox
    Cc: James Bottomley
    Acked-by: "Salyzyn, Mark"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sesterhenn
     
  • If invalidate_mapping_pages is called to invalidate a very large mapping
    (e.g. a very large block device) and if the only active page in that
    device is near the end (or at least, at a very large index), such as, say,
    the superblock of an md array, and if that page happens to be locked when
    invalidate_mapping_pages is called, then

    pagevec_lookup will return this page and
    as it is locked, 'next' will be incremented and pagevec_lookup
    will be called again. and again. and again.
    while we count from 0 up to a very large number.

    We should really always set 'next' to 'page->index+1' before going around
    the loop again, not just if the page isn't locked.

    Cc: "Steinar H. Gunderson"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
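
    The essence of the fix, as a hedged sketch of the scan loop:

        /* always advance 'next' past the page we just looked at, even when we
         * have to skip it because it is locked */
        next = page->index + 1;
        if (TestSetPageLocked(page))
                continue;       /* previously 'next' was not advanced on this path */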
     
  • - Move percpu_counter routines from mm/swap.c to lib/percpu_counter.c

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Add read_mapping_page() which is used for callers that pass
    mapping->a_ops->readpage as the filler for read_cache_page. This removes
    some duplication from filesystem code.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
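
    The helper is essentially a one-line wrapper around read_cache_page(); a
    sketch of the definition and the duplication it removes:

        static inline struct page *read_mapping_page(struct address_space *mapping,
                                                     unsigned long index, void *data)
        {
                filler_t *filler = (filler_t *)mapping->a_ops->readpage;
                return read_cache_page(mapping, index, filler, data);
        }

        /* callers previously spelled this out at every site: */
        page = read_cache_page(mapping, n,
                               (filler_t *)mapping->a_ops->readpage, NULL);
        /* and can now write: */
        page = read_mapping_page(mapping, n, NULL);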
     
  • Use the x86 cache-bypassing copy instructions for copy_from_user().

    Some performance data:

    Total of GLOBAL_POWER_EVENTS (CPU cycle samples)

    2.6.12.4.orig 1921587
    2.6.12.4.nt 1599424
    1599424/1921587=83.23% (16.77% reduction)

    BSQ_CACHE_REFERENCE (L3 cache miss)
    2.6.12.4.orig 57427
    2.6.12.4.nt 20858
    20858/57427=36.32% (63.7% reduction)

    L3 cache miss reduction of __copy_from_user_ll
    samples %
    37408 65.1412 vmlinux __copy_from_user_ll
    23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
    23/37408=0.061% (99.94% reduction)

    Top 5 of 2.6.12.4.nt
    Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
    samples % app name symbol name
    128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
    64206 4.0143 vmlinux journal_add_journal_head
    59746 3.7355 vmlinux do_get_write_access
    47674 2.9807 vmlinux journal_put_journal_head
    46021 2.8774 vmlinux journal_dirty_metadata
    pattern9-0-cpu4-0-09011728/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
    samples % app name symbol name
    69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
    55685 3.4215 vmlinux journal_add_journal_head
    52371 3.2179 vmlinux __find_get_block
    45504 2.7960 vmlinux journal_put_journal_head
    36005 2.2123 vmlinux journal_stop
    pattern9-0-cpu4-0-09011744/summary.out

    Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
    samples % app name symbol name
    1147 5.4994 vmlinux journal_add_journal_head
    881 4.2240 vmlinux journal_dirty_data
    872 4.1809 vmlinux blk_rq_map_sg
    734 3.5192 vmlinux journal_commit_transaction
    617 2.9582 vmlinux radix_tree_delete
    pattern9-0-cpu4-0-09011731/summary.out

    iozone results are

    original 2.6.12.4 CPU time = 207.768 sec
    cache aware CPU time = 184.783 sec
    (three times run)
    184.783/207.768=88.94% (11.06% reduction)

    original:
    pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
    pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
    pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %

    cache aware:
    pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
    pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
    pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %

    Signed-off-by: Hiro Yoshioka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiro Yoshioka
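
    A hedged sketch of how the non-temporal variant slots in on i386 (the
    threshold, feature test and helper names follow the usual pattern and the
    profile above, but are not guaranteed to match the patch exactly):

        unsigned long __copy_from_user_ll_nocache(void *to, const void __user *from,
                                                  unsigned long n)
        {
        #ifdef CONFIG_X86_INTEL_USERCOPY
                /* large copies use movnti so the copied data does not push
                 * useful lines out of the CPU caches */
                if (n > 64 && cpu_has_xmm2)
                        n = __copy_user_zeroing_intel_nocache(to, from, n);
                else
                        __copy_user_zeroing(to, from, n);
        #else
                __copy_user_zeroing(to, from, n);
        #endif
                return n;
        }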
     
  • This patch inserts security_task_movememory hook calls into memory management
    code to enable security modules to mediate this operation between tasks.

    Since the last posting, the hook has been renamed following feedback from
    Christoph Lameter.

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Acked-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • move_pages() is used to move individual pages of a process. The function can
    be used to determine the location of pages and to move them onto the desired
    node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move,
                    addresses_of_pages[],
                    nodes[] or NULL,
                    status[],
                    flags);

    The addresses of the pages are passed as an array of void * pointing to
    the pages to be moved.

    The nodes array contains the node numbers that the pages should be moved
    to. If a NULL is passed instead of an array then no pages are moved but
    the status array is updated. The status request may be used to determine
    the page state before issuing another move_pages() to move pages.

    The status array will contain the state of all individual page migration
    attempts when the function terminates. The status array is only valid if
    move_pages() completed successfully.

    Possible page states in status[]:

    0..MAX_NUMNODES The page is now on the indicated node.

    -ENOENT Page is not present

    -EACCES Page is mapped by multiple processes and can only
    be moved if MPOL_MF_MOVE_ALL is specified.

    -EPERM The page has been mlocked by a process/driver and
    cannot be moved.

    -EBUSY Page is busy and cannot be moved. Try again later.

    -EFAULT Invalid address (no VMA or zero page).

    -ENOMEM Unable to allocate memory on target node.

    -EIO Unable to write back page. The page must be written
    back in order to move it since the page is dirty and the
    filesystem does not provide a migration function that
    would allow the moving of dirty pages.

    -EINVAL A dirty page cannot be moved. The filesystem does not provide
    a migration function and has no ability to write back pages.

    The flags parameter indicates what types of pages to move:

    MPOL_MF_MOVE Move pages that are only mapped by the process.

    MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
    Requires sufficient capabilities.

    Possible return codes from move_pages()

    -ENOENT No pages found that would require moving. All pages
    are either already on the target node, not present, had an
    invalid address or could not be moved because they were
    mapped by multiple processes.

    -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
    to migrate pages in a kernel thread.

    -EPERM MPOL_MF_MOVE_ALL specified without sufficient privileges,
    or an attempt to move a process belonging to another user.

    -EACCES One of the target nodes is not allowed by the current cpuset.

    -ENODEV One of the target nodes is not online.

    -ESRCH Process does not exist.

    -E2BIG Too many pages to move.

    -ENOMEM Not enough memory to allocate control array.

    -EFAULT Parameters could not be accessed.

    A test program for move_pages() may be found with the patches
    on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

    From: Christoph Lameter

    Detailed results for sys_move_pages()

    Pass a pointer to an integer to get_new_page() that may be used to
    indicate where the completion status of a migration operation should be
    placed. This allows sys_move_pages() to report back exactly what happened to
    each page.

    Wish there would be a better way to do this. Looks a bit hacky.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
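
    A hedged userspace usage sketch based on the prototype above (the libnuma
    wrapper and header shown are assumptions; error handling is omitted):

        #include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
        #include <unistd.h>     /* getpid() */

        /* move two page-aligned buffers of the calling process to node 1 */
        static long move_two_pages(void *buf1, void *buf2)
        {
                void *pages[2] = { buf1, buf2 };
                int nodes[2] = { 1, 1 };        /* desired target node per page */
                int status[2];                  /* per-page result, as listed above */

                return move_pages(getpid(), 2, pages, nodes, status, MPOL_MF_MOVE);
        }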
     
  • Instead of passing a list of new pages, pass a function to allocate a new
    page. This allows the correct placement of MPOL_INTERLEAVE pages during page
    migration. It also further simplifies the callers of migrate_pages().
    migrate_pages() becomes similar to migrate_pages_to() so drop
    migrate_pages_to(). The batching of new page allocations becomes unnecessary.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
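
    A hedged sketch of the reworked interface (the typedef and argument order
    are assumptions):

        /* callback: allocate the replacement page for 'page' */
        typedef struct page *new_page_t(struct page *page, unsigned long private,
                                        int **result);

        /* callers now pass the allocator instead of a pre-built list of new pages */
        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          unsigned long private);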
     
  • Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter