14 Jan, 2011

40 commits

  • IS_ERR() already implies unlikely(), so it can be omitted here.

    Signed-off-by: Tobias Klauser
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
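
    A minimal illustration of the cleanup described above (not the literal
    hunk from the commit; the variable name is hypothetical, helpers come
    from <linux/err.h>):

        /* Before: the branch hint is stated twice. */
        if (unlikely(IS_ERR(page)))
                return PTR_ERR(page);

        /* After: IS_ERR() already expands to unlikely(IS_ERR_VALUE(...)),
         * so the explicit unlikely() can simply be dropped. */
        if (IS_ERR(page))
                return PTR_ERR(page);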
     
  • Today, tasklist_lock in migrate_pages() doesn't protect anything;
    rcu_read_lock() provides enough protection for the pid hash walk.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
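
    A hedged sketch of the RCU-protected pid lookup pattern that makes
    tasklist_lock unnecessary here (generic pid API names; not necessarily
    the exact lines touched by the commit):

        struct task_struct *task;

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (task)
                get_task_struct(task);  /* pin it before leaving the RCU section */
        rcu_read_unlock();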
     
  • __get_user_pages gets a new 'nonblocking' parameter to signal that the
    caller is prepared to re-acquire mmap_sem and retry the operation if
    needed. This is used to split off long operations if they are going to
    block on a disk transfer, or when we detect contention on the mmap_sem.

    [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
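
    A heavily hedged caller-side sketch of the retry protocol described
    above (the surrounding loop is hypothetical and error handling is
    simplified; the parameter effectively doubles as an "mmap_sem still
    held" indicator):

        int locked = 1;         /* passed as the new 'nonblocking' argument */
        long ret;

        down_read(&mm->mmap_sem);
        while (nr_pages > 0) {
                if (!locked) {
                        /* previous call dropped mmap_sem to wait on disk */
                        down_read(&mm->mmap_sem);
                        locked = 1;
                }
                ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags,
                                       NULL, NULL, &locked);
                if (ret <= 0)           /* error handling simplified */
                        break;
                start += ret * PAGE_SIZE;
                nr_pages -= ret;
        }
        if (locked)
                up_read(&mm->mmap_sem);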
     
  • Use a single code path for faulting in pages during mlock.

    The reason to have it in this patch series is that I did not want to
    update both code paths in a later change that releases mmap_sem when
    blocking on disk.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Move the code to mlock pages from __mlock_vma_pages_range() to
    follow_page().

    This allows __mlock_vma_pages_range() to avoid breaking the work into
    16-page batches.

    An additional motivation for doing this within the present patch series
    is that it will make it easier for a later change to drop mmap_sem when
    blocking on disk (we'd like to be able to resume at the page that was
    read from disk instead of at the start of a 16-page batch).

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Currently mlock() holds mmap_sem in exclusive mode while the pages get
    faulted in. In the case of a large mlock, this can potentially take a
    very long time, during which various commands such as 'ps auxw' will
    block. This makes sysadmins unhappy:

    real 14m36.232s
    user 0m0.003s
    sys 0m0.015s
    (output from 'time ps auxw' while a 20GB file was being mlocked without
    being previously preloaded into page cache)

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages, in small batches, releasing mmap_sem when we
    block on disk access or when we detect some contention.

    This patch:

    Before this change, mlock() holds mmap_sem in exclusive mode while the
    pages get faulted in. In the case of a large mlock, this can potentially
    take a very long time. Various things will block while mmap_sem is held,
    including 'ps auxw'. This can make sysadmins angry.

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages with mmap_sem held for reads only. We need to
    recheck the vma flags after we re-acquire mmap_sem, but this is easy.

    In the case where a vma has been munlocked before mlock completes, pages
    that were already marked as PageMlocked() are handled by the munlock()
    call, and mlock() is careful to not mark new page batches as PageMlocked()
    after the munlock() call has cleared the VM_LOCKED vma flags. So, the end
    result will be identical to what'd happen if munlock() had executed after
    the mlock() call.

    In a later change, I will allow the second pass to release mmap_sem when
    blocking on disk accesses or when it is otherwise contended, so that it
    won't be held for long periods of time even in shared mode.

    Signed-off-by: Michel Lespinasse
    Tested-by: Valdis Kletnieks
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
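
    A small userspace reproducer for the stall described above (the file
    path is a placeholder; run 'time ps auxw' from another shell while this
    program faults in a large, uncached file; a suitable RLIMIT_MEMLOCK or
    root is needed):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                const char *path = argc > 1 ? argv[1] : "/tmp/bigfile";
                struct stat st;
                int fd = open(path, O_RDONLY);

                if (fd < 0 || fstat(fd, &st) < 0) {
                        perror(path);
                        return 1;
                }

                void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /* Before this series the whole fault-in below ran with
                 * mmap_sem held exclusively, blocking /proc readers like ps. */
                if (mlock(p, st.st_size) != 0)
                        perror("mlock");

                munlock(p, st.st_size);
                munmap(p, st.st_size);
                close(fd);
                return 0;
        }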
     
  • When faulting in pages for mlock(), we want to break COW for anonymous or
    file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
    need to write-fault into VM_SHARED vmas since shared file pages can be
    mlocked first and dirtied later, when/if they actually get written to.
    Skipping the write fault is desirable, as we don't want to unnecessarily
    cause these pages to be dirtied and queued for writeback.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
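
    A hedged sketch of the flag selection described above (close to, but
    not claimed to be, the committed hunk): only request a write fault when
    it is needed to break COW, i.e. for writable private mappings:

        int gup_flags = FOLL_TOUCH;

        /*
         * Writable private mapping: write-fault to break COW.
         * Shared mappings are faulted in read-only so that mlock does
         * not dirty their pages or queue them for writeback.
         */
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                gup_flags |= FOLL_WRITE;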
     
  • Reorganize the code so that dirty pages are handled closer to the place
    that makes them dirty (handling write fault into shared, writable VMAs).
    No behavior changes.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • mlocking a shared, writable vma currently causes the corresponding pages
    to be marked as dirty and queued for writeback. This seems rather
    unnecessary given that the pages are not being actually modified during
    mlock. It is understood that for non-shared mappings (file or anon) we
    want to use a write fault in order to break COW, but there is just no such
    need for shared mappings.

    The first two patches in this series do not introduce any behavior change.
    The intent there is to make it obvious that dirtying file pages is only
    done in the (writable, shared) case. I think this clarifies the code, but
    I wouldn't mind dropping these two patches if there is no consensus about
    them.

    The last patch is where we actually avoid dirtying shared mappings during
    mlock. Note that as a side effect of this, we won't call page_mkwrite()
    for the mappings that define it, and won't be pre-allocating data blocks
    at the FS level if the mapped file was sparsely allocated. My
    understanding is that mlock does not need to provide such guarantee, as
    evidenced by the fact that it never did for the filesystems that don't
    define page_mkwrite() - including some common ones like ext3. However, I
    would like to gather feedback on this from filesystem people as a
    precaution. If this turns out to be a showstopper, maybe block
    preallocation can be added back on using a different interface.

    Large shared mlocks are getting significantly (>2x) faster in my tests, as
    the disk can be fully used for reading the file instead of having to share
    between this and writeback.

    This patch:

    Reorganize the code to remove the 'reuse' flag. No behavior changes.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Temporary IO failures, e.g. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync or
    fsync returning -EIO over and over again, even if IO is now getting to
    the disk correctly.

    We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the
    filemap_fdatawait_range function. Also clearing the PageError bit on the
    page allows subsequent msync or fsync calls on this file to return
    without an error, if the subsequent IO succeeds.

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
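
    A userspace illustration of the visible effect (hypothetical helper;
    note the caveat above that data from the failed write-out may still
    have been lost):

        #include <errno.h>
        #include <unistd.h>

        /* Retry fsync() after a transient I/O error.  Before this change a
         * stale PageError bit could keep returning -EIO even after the
         * device recovered; afterwards a successful write-out lets later
         * calls return without an error. */
        static int fsync_with_retry(int fd, int max_tries)
        {
                while (max_tries-- > 0) {
                        if (fsync(fd) == 0)
                                return 0;
                        if (errno != EIO)
                                return -1;
                        sleep(1);       /* give multipath time to recover */
                }
                return -1;
        }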
     
  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternative considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch has been updated from the original based on feedback from
    David Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
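
    A userspace sketch of the intended usage (values are illustrative): a
    process raises its own badness while in the background and, with this
    patch, may lower it back to its inherited value without
    CAP_SYS_RESOURCE:

        #include <stdio.h>

        static int set_own_oom_score_adj(int value)
        {
                FILE *f = fopen("/proc/self/oom_score_adj", "w");

                if (!f)
                        return -1;
                fprintf(f, "%d\n", value);
                return fclose(f);
        }

        int main(void)
        {
                /* backgrounded: make ourselves a preferred OOM victim */
                set_own_oom_score_adj(500);
                /* ... later, on entering the foreground ...
                 * allowed without CAP_SYS_RESOURCE as long as we do not go
                 * below the inherited/previously permitted value */
                set_own_oom_score_adj(0);
                return 0;
        }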
     
  • Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
    module_alloc(). Much of the code is duplicated and can be generalized
    into a globally accessible function, __vmalloc_node_range().

    __vmalloc_node() now calls into __vmalloc_node_range() with a range of
    [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.

    Each architecture may then use __vmalloc_node_range() directly to remove
    the duplication of code.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
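
    A hedged sketch of what an architecture's module_alloc() can look like
    on top of the new helper (the argument order is as assumed from this
    series; MODULES_VADDR, MODULES_END and MODULES_LEN come from the
    architecture):

        void *module_alloc(unsigned long size)
        {
                if (PAGE_ALIGN(size) > MODULES_LEN)
                        return NULL;
                return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
                                            GFP_KERNEL | __GFP_HIGHMEM,
                                            PAGE_KERNEL_EXEC, -1,
                                            __builtin_return_address(0));
        }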
     
  • pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
    formal and use the mask internally.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • get_vm_area_node() is unused in the kernel and can thus be removed.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With compaction being used instead of lumpy reclaim, the name lumpy_mode
    and the associated variables are a bit misleading. Rename lumpy_mode to
    reclaim_mode, which is a better fit. There is no functional change.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • try_to_compact_pages() is initially called to only migrate pages
    asynchronously, and kswapd always compacts asynchronously. Both are
    being optimistic, so it is important to complete the work as quickly as
    possible to minimise stalls.

    This patch alters the scanner when asynchronous to only consider
    MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls
    when allocating huge pages while not impairing allocation success rates as
    a full scan will be performed if necessary after direct reclaim.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
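
    A hedged, simplified sketch of the migration scanner restriction
    described above (cc->sync is the compaction control's sync flag; the
    real code also avoids rechecking the same pageblock):

        /* Asynchronous passes are optimistic: only bother scanning
         * MIGRATE_MOVABLE pageblocks for migration candidates. */
        if (!cc->sync &&
            get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
                /* skip ahead to the next pageblock */
                low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
                continue;
        }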
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial pass fails.
    Callers of memory compaction do not necessarily want this behaviour if
    the caller is latency sensitive or expects that synchronous migration is
    not going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, called reclaim/compaction, used when compaction is available.
    The basic operation is very simple - instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    compaction is then performed, either by kswapd (compact_zone_order())
    or by direct compaction (__alloc_pages_direct_compact()).

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently lumpy_mode is an enum and determines whether lumpy reclaim is
    off, synchronous or asynchronous. In preparation for using compaction
    instead of lumpy reclaim, this patch converts the flags into a bitmap.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In preparation for patches promoting the use of memory compaction over
    lumpy reclaim, this patch adds trace points for memory compaction
    activity. Using them, we can monitor the scanning activity of the
    migration and free page scanners, as well as the number and success
    rates of pages passed to page migration.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently there is no way to find out whether a process has locked its
    pages in memory, or which of its memory regions are locked.

    Add a new field "Locked" to export this information via the smaps file.

    Signed-off-by: Nikanth Karthikesan
    Acked-by: Balbir Singh
    Acked-by: Wu Fengguang
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
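
    A quick userspace check of the new field, dumping the "Locked:" lines
    for the current process:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                FILE *f = fopen("/proc/self/smaps", "r");
                char line[256];

                if (!f) {
                        perror("smaps");
                        return 1;
                }
                while (fgets(line, sizeof(line), f))
                        if (strncmp(line, "Locked:", 7) == 0)
                                fputs(line, stdout);
                fclose(f);
                return 0;
        }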
     
  • Signed-off-by: Joe Perches
    Acked-by: Pekka Enberg
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to
    eliminate code duplication.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hai Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hai Shan
     
  • Testing ->mapping and ->index without a ref is not stable as the page
    may have been reused at this point.

    Signed-off-by: Nick Piggin
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Currently, kswapd() has deep nesting and is slightly hard to read. Clean
    this up.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • __set_page_dirty_no_writeback() should return true if it actually
    transitioned the page from a clean to dirty state although it seems nobody
    uses its return value at present.

    Signed-off-by: Bob Liu
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
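
    A hedged sketch of the corrected helper (essentially what the text
    above describes; page-flag helpers from <linux/page-flags.h>):

        static int __set_page_dirty_no_writeback(struct page *page)
        {
                /* report whether the page really went from clean to dirty */
                if (!PageDirty(page))
                        return !TestSetPageDirty(page);
                return 0;
        }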
     
  • Use correct function name, remove incorrect apostrophe

    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
    usually set to LONG_MAX. The logic in wb_writeback() then calls
    __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
    easily end up with non-positive nr_to_write after the function returns, if
    the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

    When nr_to_write drops to zero or below, wb_writeback() decides that
    another writeback round is needed, which can loop forever on an inode
    that is being continuously redirtied. Keeping nr_to_write at LONG_MAX in
    WB_SYNC_ALL mode avoids this; livelock avoidance for data-integrity
    writeback is already handled by page tagging in write_cache_pages().

    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Background writeback is easily livelocked in a loop in wb_writeback() by
    a process continuously re-dirtying pages (or continuously appending to a
    file). This is in fact intended, as the goal of background writeback is
    to write out whatever dirty pages it can find as long as we are over
    dirty_background_threshold.

    But the above behavior gets inconvenient at times because no other work
    queued in the flusher thread's queue gets processed. In particular, since
    e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1)
    can hang forever waiting for flusher thread to do the work.

    Generally, when a flusher thread has some work queued, someone submitted
    the work to achieve a goal more specific than what background writeback
    does. Moreover, by working on the specific work, we also reduce the
    amount of dirty pages, which is exactly the target of background
    writeout. So it makes sense to give specific work priority over generic
    page cleaning.

    Thus we interrupt background writeback if there is some other work to do.
    We return to the background writeback after completing all the queued
    work.

    This may delay the writeback of expired inodes for a while, but the
    expired inodes will eventually be flushed to disk as long as the other
    work does not itself livelock.

    [fengguang.wu@intel.com: update comment]
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
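
    A hedged sketch of the check added to the wb_writeback() loop (the
    condition is simplified and field names are as assumed from the text):

        /*
         * Background writeout and kupdate-style writeback can run
         * forever.  Stop them if there is other work to do so that
         * e.g. sync(1) can make progress; they will be restarted once
         * the queued work is done.
         */
        if ((work->for_background || work->for_kupdate) &&
            !list_empty(&wb->bdi->work_list))
                break;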
     
  • This tracks when balance_dirty_pages() tries to wake up the flusher
    thread for background writeback (if it was not started already).

    Suggested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Check whether background writeback is needed after finishing each work.

    When the bdi flusher thread finishes doing some work, check whether any
    kind of background writeback needs to be done (either because
    dirty_background_ratio is exceeded or because we need to start flushing
    old inodes). If so, just do background writeback.

    This way, bdi_start_background_writeback() just needs to wake up the
    flusher thread. It will do background writeback as soon as there is no
    other work.

    This is a preparatory patch for the next patch which stops background
    writeback as soon as there is other work to do.

    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
    to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
    errors due to counter drift. The functions duplicate some code so this
    patch replaces them with a single set_pgdat_percpu_threshold() that takes
    a callback function to calculate the desired threshold as a parameter.

    [akpm@linux-foundation.org: readability tweak]
    [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
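
    A hedged usage sketch around kswapd's sleep (the callback names are as
    assumed from this series):

        /* about to sleep: restore the normal, larger thresholds */
        set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
        schedule();
        /* awake again: tighten thresholds to limit NR_FREE_PAGES drift */
        set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);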
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a
    per-cpu basis and drained both periodically and when the per-cpu delta
    exceeds a threshold. On large CPU systems, the difference between the
    estimate and the real value of NR_FREE_PAGES can be very high. The
    system can get into a state where pages are allocated far below the min
    watermark, potentially causing livelock issues. The commit solved the
    problem by taking a better reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can
    consume a large amount of CPU time on systems with many sockets, due to
    cache line bouncing. This patch takes a different approach. For large
    machines where counter drift might be unsafe, and while kswapd is awake,
    the per-cpu thresholds for the target pgdat are reduced to limit the
    level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was one recorded sleep and wake event at least.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report shows the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot() and zone_page_state().

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
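
    A hedged sketch of the "safe" watermark check described above (close
    to, but not claimed to be identical to, the committed code):

        bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                                    int classzone_idx, int alloc_flags)
        {
                long free_pages = zone_page_state(z, NR_FREE_PAGES);

                /* Pay for the exact, per-cpu-summed reading only when the
                 * cheap estimate is close enough to the drift mark for the
                 * error to matter. */
                if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
                        free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

                return __zone_watermark_ok(z, order, mark, classzone_idx,
                                           alloc_flags, free_pages);
        }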
     
  • This warning was added in commit bdff746a3915 ("clone: prepare to recycle
    CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
    no-one is actually using CLONE_STOPPED.

    Signed-off-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • When working in RS485 mode, the atmel_serial driver keeps RTS high after
    the initialization of the serial port. It goes low only after the first
    character has been sent.

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Claudio Scordino
    Signed-off-by: Arkadiusz Bubala
    Tested-by: Arkadiusz Bubala
    Cc: Nicolas Ferre
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
     
  • Use the modern per_cpu API to increment {soft|hard}irq counters, and use
    per_cpu allocation for (struct irq_desc)->kstat_irqs instead of an
    array.

    This gives better SMP/NUMA locality and saves a few instructions per
    irq.

    With small nr_cpu_ids values (8 for example), kstat_irqs was a small
    array (less than L1_CACHE_BYTES), a potential source of false sharing.

    In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
    kstat_irqs_all[NR_IRQS][NR_CPUS] array.

    Note: we still populate kstat_irqs for all possible irqs in
    early_irq_init(). We could probably use on-demand allocation (code
    included in alloc_descs()); the problem is that not all IRQs are used
    with a prior alloc_descs() call.

    kstat_irqs_this_cpu() is not used anymore, remove it.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Andi Kleen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
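
    A hedged sketch of the allocation and accounting pattern described
    above (simplified; not the literal hunks):

        /* allocate per-cpu counters instead of a [nr_cpu_ids] array */
        desc->kstat_irqs = alloc_percpu(unsigned int);
        if (!desc->kstat_irqs)
                return -ENOMEM;

        /* account an interrupt on the local CPU: no cache-line bouncing */
        __this_cpu_inc(*desc->kstat_irqs);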
     
  • Since the original maintainer, Joseph Chan (josephchan@via.com.tw), no
    longer handles the Linux drivers for VIA, I would like to request an
    update of the maintainer for the VIA SD/MMC CARD CONTROLLER DRIVER and
    the VIA UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER until we find a better
    one.

    Signed-off-by: Bruce Chang
    Signed-off-by: Florian Tobias Schandinat
    Cc: Joseph Chan
    Cc: Geert Uytterhoeven
    Cc: Harald Welte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bruce Chang
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
    dm: raid456 basic support
    dm: per target unplug callback support
    dm: introduce target callbacks and congestion callback
    dm mpath: delay activate_path retry on SCSI_DH_RETRY
    dm: remove superfluous irq disablement in dm_request_fn
    dm log: use PTR_ERR value instead of ENOMEM
    dm snapshot: avoid storing private suspended state
    dm snapshot: persistent make metadata_wq multithreaded
    dm: use non reentrant workqueues if equivalent
    dm: convert workqueues to alloc_ordered
    dm stripe: switch from local workqueue to system_wq
    dm: dont use flush_scheduled_work
    dm snapshot: remove unused dm_snapshot queued_bios_work
    dm ioctl: suppress needless warning messages
    dm crypt: add loop aes iv generator
    dm crypt: add multi key capability
    dm crypt: add post iv call to iv generator
    dm crypt: use io thread for reads only if mempool exhausted
    dm crypt: scale to multiple cpus
    dm crypt: simplify compatible table output
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix removal of extra drives when converting RAID6 to RAID5
    md: range check slot number when manually adding a spare.
    md/raid5: handle manually-added spares in start_reshape.
    md: fix sync_completed reporting for very large drives (>2TB)
    md: allow suspend_lo and suspend_hi to decrease as well as increase.
    md: Don't let implementation detail of curr_resync leak out through sysfs.
    md: separate meta and data devs
    md-new-param-to_sync_page_io
    md-new-param-to-calc_dev_sboffset
    md: Be more careful about clearing flags bit in ->recovery
    md: md_stop_writes requires mddev_lock.
    md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
    md: Ensure no IO request to get md device before it is properly initialised.
    md: Fix single printks with multiple KERN_s
    md: fix regression resulting in delays in clearing bits in a bitmap
    md: fix regression with re-adding devices to arrays with no metadata

    Linus Torvalds