25 May, 2010

31 commits

  • The unusable free space index measures how much of the available free
    memory cannot be used to satisfy an allocation of a given size and is a
    value between 0 and 1. The higher the value, the more of the free memory
    is unusable and, by implication, the worse the external fragmentation is.
    For the most part, the huge page size will be the size of interest, but
    not necessarily, so the index is exported on a per-order and per-zone
    basis via /sys/kernel/debug/extfrag/unusable_index.

    > cat /sys/kernel/debug/extfrag/unusable_index
    Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
    Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
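
    The same index can be approximated from userspace using the per-order
    free counts in /proc/buddyinfo: for a target order, it is the fraction of
    free memory sitting in blocks too small to satisfy an allocation of that
    order. The following standalone sketch illustrates the calculation (it is
    not the kernel implementation; the target order of 9 is an assumption
    picked to match x86 huge pages):

    #include <stdio.h>

    #define MAX_ORDERS 11

    int main(void)
    {
        char zone[32], line[512];
        int node, target_order = 9;     /* assumed order of interest */
        FILE *fp = fopen("/proc/buddyinfo", "r");

        if (!fp)
            return 1;

        while (fgets(line, sizeof(line), fp)) {
            char *p = line;
            int consumed, order;
            unsigned long count;
            double total = 0, usable = 0;

            if (sscanf(p, "Node %d, zone %31s%n", &node, zone, &consumed) < 2)
                continue;
            p += consumed;

            for (order = 0; order < MAX_ORDERS; order++) {
                if (sscanf(p, "%lu%n", &count, &consumed) != 1)
                    break;
                p += consumed;
                total += (double)count * (1UL << order);
                if (order >= target_order)
                    usable += (double)count * (1UL << order);
            }
            if (total > 0)
                printf("Node %d, zone %-8s unusable_index(order %d) = %.3f\n",
                       node, zone, target_order, 1.0 - usable / total);
        }
        fclose(fp);
        return 0;
    }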

    [akpm@linux-foundation.org: Fix allnoconfig]
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …tion by not migrating temporary stacks

    Page migration requires rmap to be able to find all ptes mapping a page
    at all times, otherwise the migration entry can be instantiated, but it
    is possible to leave one behind if the second rmap_walk fails to find
    the page. If this page is later faulted, migration_entry_to_page() will
    call BUG because the page is not locked, indicating the page was migrated
    but the migration PTE was not cleaned up. For example:

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
    [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
    [<ffffffff813099b5>] page_fault+0x25/0x30
    [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
    [<ffffffff8111329b>] search_binary_handler+0x173/0x313
    [<ffffffff81114896>] do_execve+0x219/0x30a
    [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
    [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

    There is a race between shift_arg_pages() and migration that triggers
    this bug. A temporary stack is set up during exec and later moved. If
    migration moves a page in the temporary stack and the VMA is then removed
    before migration completes, the migration PTE may not be found, leading
    to a BUG when the stack is faulted.

    This patch causes pages within the temporary stack during exec to be
    skipped by migration. It does this by marking the VMA covering the
    temporary stack with an otherwise impossible combination of VMA flags.
    These flags are cleared when the temporary stack is moved to its final
    location.
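
    As a rough standalone illustration of the trick (the flag values below
    are made up for the example; the kernel keeps its real flags in the
    vm_flags word of struct vm_area_struct), the temporary stack carries a
    combination no normal VMA ever has, and the migration path simply skips
    any VMA with that combination:

    #include <stdbool.h>
    #include <stdio.h>

    #define VM_GROWSDOWN            0x1UL
    #define VM_GROWSUP              0x2UL
    /* "Impossible" combination: a stack cannot grow both ways. */
    #define VM_STACK_INCOMPLETE     (VM_GROWSDOWN | VM_GROWSUP)

    struct vma { unsigned long flags; };

    static bool is_temporary_stack(const struct vma *vma)
    {
        return (vma->flags & VM_STACK_INCOMPLETE) == VM_STACK_INCOMPLETE;
    }

    int main(void)
    {
        struct vma exec_stack = { VM_STACK_INCOMPLETE }; /* during exec   */
        struct vma final_stack = { VM_GROWSDOWN };       /* after the move */

        printf("skip exec stack:  %d\n", is_temporary_stack(&exec_stack));
        printf("skip final stack: %d\n", is_temporary_stack(&final_stack));
        return 0;
    }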

    [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
    being able to hot-remove memory. The main users of page migration, such
    as sys_move_pages(), sys_migrate_pages() and cpuset process migration,
    are only beneficial on NUMA, so the dependency makes sense.

    As memory compaction will operate within a zone and is useful on both NUMA
    and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
    user selects CONFIG_COMPACTION as an option.

    [akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • PageAnon pages that are unmapped may or may not have an anon_vma, so they
    are not currently migrated. However, a swap cache page can be migrated
    and fits this description. This patch identifies swap cache pages and
    allows them to be migrated, but ensures that no attempt is made to remap
    the pages in a way that would potentially access an already freed
    anon_vma.

    Signed-off-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • rmap_walk_anon() was triggering errors in memory compaction that look like
    use-after-free errors. The problem is that between the page being
    isolated from the LRU and rcu_read_lock() being taken, the mapcount of
    the page drops to 0 and the anon_vma gets freed. This can happen during
    memory compaction if pages being migrated belong to a process that exits
    before migration completes. Hence, the use-after-free race looks like:

    1. Page isolated for migration
    2. Process exits
    3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
    4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
    5. try_to_unmap() is called, looks up the anon_vma and "locks" it but the
    lock is garbage.

    This patch checks the mapcount after the rcu lock is taken. If the
    mapcount is zero, the anon_vma is assumed to be freed and no further
    action is taken.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patchset is a memory compaction mechanism that reduces external
    fragmentation by moving GFP_MOVABLE pages to a smaller number of
    pageblocks. The term "compaction" was chosen as there are a number of
    mechanisms, not mutually exclusive, that can be used to defragment
    memory. For example, lumpy reclaim is a form of defragmentation, as was
    slub "defragmentation" (really a form of targeted reclaim). Hence, this
    is called "compaction" to distinguish it from other forms of
    defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.
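
    A rough standalone model of the run just described (illustrative only,
    not the kernel code): a zone is an array of page states, the migration
    scanner advances forward one area at a time, the free scanner searches
    backward for free targets, and the run ends when the two meet.

    #include <stdio.h>

    enum page_state { FREE, MOVABLE, UNMOVABLE };
    #define ZONE_PAGES 64
    #define AREA_PAGES 8        /* stands in for pageblock_nr_pages */

    int main(void)
    {
        enum page_state zone[ZONE_PAGES];
        int i, migrate = 0, free_scan = ZONE_PAGES;

        /* Fragmented start: unmovable every 7th page, movable every 3rd. */
        for (i = 0; i < ZONE_PAGES; i++)
            zone[i] = (i % 7 == 0) ? UNMOVABLE : (i % 3 == 0) ? MOVABLE : FREE;

        while (migrate < free_scan) {
            int area_end = migrate + AREA_PAGES;

            /* Migration scanner: walk one area, moving each movable page. */
            for (; migrate < area_end && migrate < free_scan; migrate++) {
                if (zone[migrate] != MOVABLE)
                    continue;
                /* Free scanner: find a free target from the zone's tail. */
                while (free_scan > migrate && zone[--free_scan] != FREE)
                    ;
                if (free_scan <= migrate)
                    break;
                zone[free_scan] = MOVABLE;  /* "migrate" the page       */
                zone[migrate] = FREE;       /* the source becomes free  */
            }
        }

        for (i = 0; i < ZONE_PAGES; i++)
            putchar(zone[i] == FREE ? '.' : zone[i] == MOVABLE ? 'M' : 'U');
        putchar('\n');
        return 0;
    }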

    This method is a bit primitive, but it is easy to understand and greater
    sophistication would require maintaining counters on a per-pageblock
    basis. That would have a big impact on allocator fast paths just to
    improve compaction, which is a poor trade-off.

    It also does not try to relocate virtually contiguous pages to be
    physically contiguous. However, assuming transparent hugepages were in
    use, a hypothetical khugepaged might reuse compaction code to isolate
    free pages, split them and relocate userspace pages for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory,
    which compacts all of memory. It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is the
    node ID to be compacted. When a process fails to allocate a high-order
    page, it may compact memory in an attempt to satisfy the allocation
    instead of entering direct reclaim. Explicit compaction does not finish
    until the two scanners meet and direct compaction ends if a suitable page
    becomes available that would meet watermarks.

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports an "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    The first test represents one of the easiest cases that can be faced by
    lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create, in parallel, X files of pagesize*128 in size. Wait
    until the files are created. By parallel, I mean that 4096 instances
    of dd were launched, one after the other using &. The crude
    objective is to mix filesystem metadata allocations with
    the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how many compactions. Note that
    the compaction figures might not fully add up, as compactions
    can take place for orders other than the hugepage size.

                                     vanilla    compaction
    X86     Final page count             913           916   (attempted 1002)
            pages reclaimed            68296          9791

    X86-64  Final page count             901           902   (attempted 1002)
            Total pages reclaimed     112599         53234

    PPC64   Final page count              93            94   (attempted 110)
            Total pages reclaimed     103216         61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

                                              vanilla    compaction
    Percentage of request allocated X86            98            99
    Percentage of request allocated X86-64         95            98
    Percentage of request allocated PPC64          55            70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are two types of zonelist ordering methodologies:

    - node order, preferring that allocations on a node stay local to that
    node, and

    - zone order, preferring that allocations come from a higher zone to
    avoid allocating in lowmem zones even though they may not be local.

    The ordering technique used by the kernel is configurable on the command
    line, but also has some logic to determine what the default should be.

    This logic currently lacks knowledge of systems where a node may only have
    lowmem. For such systems, it is necessary to use node order so that
    GFP_KERNEL allocations may be satisfied by nodes consisting of only
    lowmem.

    If zone order is used, GFP_KERNEL allocations to such nodes are actually
    allocated on a node with local affinity that includes ZONE_NORMAL.

    This change defaults to node zonelist ordering if any node lacks
    ZONE_NORMAL.

    To force zone order, append 'numa_zonelist_order=zone' to the kernel
    command line.
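
    The default decision described above reduces to a simple check; a
    standalone toy version (the node topology data here is invented purely
    for illustration) might look like this:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NODES 4

    /* Toy topology: node 1 is a lowmem-only node (no ZONE_NORMAL). */
    static bool node_has_normal[MAX_NODES] = { true, false, true, true };

    /* Default to node ordering if any node lacks ZONE_NORMAL. */
    static const char *default_zonelist_order(void)
    {
        for (int nid = 0; nid < MAX_NODES; nid++)
            if (!node_has_normal[nid])
                return "node";
        return "zone";
    }

    int main(void)
    {
        printf("numa_zonelist_order defaults to: %s order\n",
               default_zonelist_order());
        return 0;
    }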

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called. So put
    it (and its calling function) into an #ifdef block.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Do page table walks with the well-known nested loops we use in several
    other places already.

    This avoids doing full page table walks after every pte range and also
    allows handling unmapped areas bigger than one pte range in one go.
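
    A standalone model of why the nested form helps (illustrative only; real
    page tables have more levels and kernel-specific types): the outer loop
    over directory slots skips a whole unmapped range in one step, while the
    inner loop handles the ptes that are present.

    #include <stdio.h>

    #define DIR_ENTRIES  4
    #define PTES_PER_DIR 4

    /* Tiny two-level "page table"; a NULL directory slot is a hole. */
    static int ptes[DIR_ENTRIES][PTES_PER_DIR];
    static int *dir[DIR_ENTRIES] = { ptes[0], NULL, NULL, ptes[3] };

    static void walk(unsigned long start, unsigned long end)
    {
        unsigned long addr = start;

        while (addr < end) {
            unsigned long d = addr / PTES_PER_DIR;
            unsigned long next = (d + 1) * PTES_PER_DIR;

            if (next > end)
                next = end;

            if (!dir[d]) {
                /* Whole unmapped range handled in one step. */
                printf("hole %2lu-%2lu\n", addr, next - 1);
            } else {
                for (; addr < next; addr++)
                    printf("pte  %2lu -> %d\n", addr,
                           dir[d][addr % PTES_PER_DIR]);
            }
            addr = next;
        }
    }

    int main(void)
    {
        walk(0, DIR_ENTRIES * PTES_PER_DIR);
        return 0;
    }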

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a start address and a number of pages into the helper
    functions, convert them to use a start and an end address.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Split out functions to handle hugetlb ranges, pte ranges and unmapped
    ranges, to improve readability but also to prepare the file structure for
    nested page table walks.

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This fixes some minor issues that bugged me while going over the code:

    o adjust argument order of do_mincore() to match the syscall
    o simplify range length calculation
    o drop superfluous shift in huge tlb calculation, address is page aligned
    o drop dead nr_huge calculation
    o check pte_none() before pte_present()
    o comment and whitespace fixes

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages from; for example:

    (mpol: mempolicy)
    task1                          task1's mpol      task2
    alloc page                     1
      alloc on node0? NO           1
                                   1                 change mems from 1 to 0
                                   1                 rebind task1's mpol
                                   0-1                 set new bits
                                   0                   clear disallowed bits
      alloc on node1? NO           0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). We use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task finishes its
    current memory allocation.
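
    A standalone illustration of the ordering (plain unsigned ints stand in
    for nodemask_t; purely illustrative): a reader sampling the mask at any
    point sees the old nodes, the union, or the new nodes, but never an
    empty mask.

    #include <stdio.h>

    int main(void)
    {
        unsigned int mems = 0x2;        /* old mems_allowed: node 1 only */
        unsigned int new_mems = 0x1;    /* new mems_allowed: node 0 only */

        /* Step 1: set the newly allowed bits -- readers now see {0,1}. */
        mems |= new_mems;
        printf("after expand: 0x%x\n", mems);

        /*
         * Step 2, deferred until in-flight allocations have finished:
         * clear the newly disallowed bits -- readers now see {0}.
         */
        mems &= new_mems;
        printf("after shrink: 0x%x\n", mems);
        return 0;
    }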

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores. The problem is that the allocator can't find a node to
    allocate pages from when changing cpuset's mems, even though there is a
    lot of free memory. The reason is like this:

    (mpol: mempolicy)
    task1                          task1's mpol      task2
    alloc page                     1
      alloc on node0? NO           1
                                   1                 change mems from 1 to 0
                                   1                 rebind task1's mpol
                                   0-1                 set new bits
                                   0                   clear disallowed bits
      alloc on node1? NO           0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    Several hours later, OOM will happen even though there is a lot of free
    memory.

    This patchset fixes the problem by expanding the node range first
    (setting the newly allowed bits) and shrinking it lazily (clearing the
    newly disallowed bits). We use a variable to tell the write-side task
    that a read-side task is reading the nodemask, and the write-side task
    clears the newly disallowed nodes only after the read-side task finishes
    its current memory allocation.

    This patch:

    In order to fix the "no node to allocate memory from" problem, when we
    want to update the mempolicy and mems_allowed, we expand the set of nodes
    first (set all the newly allowed nodes) and shrink the set of nodes
    lazily (clear the disallowed nodes). But the mempolicy's rebind functions
    may break the expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems: the first
    step expands the set of the mempolicy's nodes, and the second step
    shrinks it. The two-step form is used when there is no real lock to
    protect the mempolicy on the read side; otherwise we can do the rebind
    work at once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,       /* do the whole rebind in one pass */
            MPOL_REBIND_STEP1,      /* first step: expand the nodes */
            MPOL_REBIND_STEP2,      /* second step: shrink the nodes */
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated by two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Or we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
    MPOL_REBIND_STEP2 to do the second step work.

    Besides that, a long time may pass between the two steps, and we have to
    release the lock that protects the mempolicy and mems_allowed. When we
    take the lock again, we must check whether the current mempolicy is in
    the middle of a rebind (the first step has been done), because the task
    may have allocated a new mempolicy while we did not hold the lock. So we
    define the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)       /* step 1 done, step 2 pending */

    The new functions will be used in the next patch.

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Update Documentation/filesystems/tmpfs.txt to describe the interaction of
    tmpfs mount option memory policy with tasks' cpuset mems_allowed.

    Note: the mount(8) man page [in the util-linux-ng package] requires
    similar updates.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Factor out duplicate put/frees in mpol_shared_policy_init() to a common
    return path.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rename 'policy_types[]' to 'policy_modes[]' to better match the array
    contents.

    Use designated initializer syntax for policy_modes[].

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • We don't really need the extra variable 'i' in mpol_parse_str(). Its only
    use is as the loop variable, and then it's assigned to 'mode'. Just use
    'mode', and lose the 'uninitialized_var()' macro.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • No need to call mpol_set_nodemask() when we have no context for the
    mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option.
    Just save the raw nodemask in the mempolicy's w.user_nodemask member for
    use when a tmpfs/shmem file is created. mpol_shared_policy_init() will
    "contextualize" the policy for the new file based on the creating task's
    context.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default policy"
    has made MPOL_DEFAULT only used in the memory policy APIs, so there is no
    need to check for it in __mpol_equal() as well. Also get rid of
    mpol_match_intent() and move its logic directly into __mpol_equal().

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: Andi Kleen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • In policy_zonelist(), the MPOL_INTERLEAVE mode shouldn't happen, so fall
    through to BUG() instead of breaking out to return. I also fixed the
    comment.

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: Andi Kleen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • 1. In the function is_valid_nodemask(), the variable k is initialised to
    0 in the following loop, so it needn't be initialised to policy_zone
    anymore.

    2. (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) is already defined as
    MPOL_MODE_FLAGS in mempolicy.h.

    Signed-off-by: Bob Liu
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • putback_lru_page() can never fail, so counting "the number of pages put
    back" doesn't matter.

    In addition, users of this function don't use the return value.

    Let's remove the unnecessary code.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • prep_new_page() will call set_page_private(page, 0) to initialise the
    page, so the code is redundant.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We need to put mem_map high when virtual memmap is not used.

    before this patch
    free mem pfn range on first node:
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3b80 3b8b
    [ 0.000000] 73b8b 73bc0 - 73c00 73c00
    [ 0.000000] 73ddd - 73e00
    [ 0.000000] 73fdd - 74000
    [ 0.000000] 741dd - 74200
    [ 0.000000] 743dd - 74400
    [ 0.000000] 745dd - 74600
    [ 0.000000] 747dd - 74800
    [ 0.000000] 749dd - 74a00
    [ 0.000000] 74bdd - 74c00
    [ 0.000000] 74ddd - 74e00
    [ 0.000000] 74fdd - 75000
    [ 0.000000] 751dd - 75200
    [ 0.000000] 753dd - 75400
    [ 0.000000] 755dd - 75600
    [ 0.000000] 757dd - 75800
    [ 0.000000] 759dd - 75a00
    [ 0.000000] 79bdd 79c00 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 10000b 100040 - 2080000 2080000
    so only 79c00 - 7d540 is the major free block under 4g...

    after this patch, we will get
    [ 0.000000] 19 - 1f
    [ 0.000000] 28 40 - 80 95
    [ 0.000000] 702 740 - 1000 1000
    [ 0.000000] 347c - 347e
    [ 0.000000] 34e7 3500 - 3600 3600
    [ 0.000000] 37dd - 3800
    [ 0.000000] 39dd - 3a00
    [ 0.000000] 3bdd - 3c00
    [ 0.000000] 3ddd - 3e00
    [ 0.000000] 3fdd - 4000
    [ 0.000000] 41dd - 4200
    [ 0.000000] 43dd - 4400
    [ 0.000000] 45dd - 4600
    [ 0.000000] 47dd - 4800
    [ 0.000000] 49dd - 4a00
    [ 0.000000] 4bdd - 4c00
    [ 0.000000] 4ddd - 4e00
    [ 0.000000] 4fdd - 5000
    [ 0.000000] 51dd - 5200
    [ 0.000000] 53dd - 5400
    [ 0.000000] 95dd 9600 - 7d540 7d550
    [ 0.000000] 7f745 - 7f750
    [ 0.000000] 17000b 170040 - 2080000 2080000
    we will have 9600 - 7d540 as the major free block...

    sparse-vmemmap path already used __alloc_bootmem_node_high()

    Signed-off-by: Yinghai Lu
    Cc: Jiri Slaby
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Christoph Lameter
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • …re merging to the tail of the free lists

    In order to reduce fragmentation, this patch classifies freed pages into
    two groups according to their probability of being part of a high-order
    merge. Pages belonging to a compound whose next-highest buddy is free are
    more likely to be part of a high-order merge in the near future, so they
    will be added at the tail of the freelist. The remaining pages are put at
    the front of the freelist.

    In this way, the pages that are more likely to cause a big merge are kept
    free longer. Consequently there is a tendency to aggregate the
    long-living allocations on a subset of the compounds, reducing the
    fragmentation.
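
    A standalone sketch of the placement decision (plain block indices stand
    in for struct page, and the free-block lookup is a stub invented for the
    example): the freed block goes to the tail when the buddy of its would-be
    parent block at the next order is already free, since a bigger merge is
    then likely soon.

    #include <stdbool.h>
    #include <stdio.h>

    /* Buddy of block 'idx' at 'order', and its parent index after a merge. */
    static unsigned long buddy_index(unsigned long idx, unsigned int order)
    {
        return idx ^ (1UL << order);
    }

    static unsigned long parent_index(unsigned long idx, unsigned int order)
    {
        return idx & ~(1UL << order);
    }

    /* Stub for the allocator's free-block information: pretend only block 8
     * at order 3 is free. */
    static bool block_is_free(unsigned long idx, unsigned int order)
    {
        return order == 3 && idx == 8;
    }

    /* Tail if the parent's buddy at the next order is free, else head. */
    static bool queue_at_tail(unsigned long idx, unsigned int order)
    {
        unsigned long parent = parent_index(idx, order);

        return block_is_free(buddy_index(parent, order + 1), order + 1);
    }

    int main(void)
    {
        printf("free block  4, order 2 -> %s\n",
               queue_at_tail(4, 2) ? "tail" : "head");
        printf("free block 16, order 2 -> %s\n",
               queue_at_tail(16, 2) ? "tail" : "head");
        return 0;
    }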

    This heuristic was tested on three machines, x86, x86-64 and ppc64 with
    3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
    and STREAM for performance and a high-order stress test for huge page
    allocations.

    KernBench X86
    Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
    User mean 649.53 ( 0.00%) 650.44 (-0.14%)
    System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
    CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)

    KernBench X86-64
    Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
    User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
    System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
    CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)

    KernBench PPC64
    Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
    User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
    System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
    CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)

    Nothing notable for kernbench.

    NetPerf UDP X86
    64 42.68 ( 0.00%) 42.77 ( 0.21%)
    128 85.62 ( 0.00%) 85.32 (-0.35%)
    256 170.01 ( 0.00%) 168.76 (-0.74%)
    1024 655.68 ( 0.00%) 652.33 (-0.51%)
    2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
    3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
    4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
    8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
    16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*

    NetPerf UDP X86-64
    64 148.82 ( 0.00%) 154.92 ( 3.94%)
    128 298.96 ( 0.00%) 312.95 ( 4.47%)
    256 583.67 ( 0.00%) 626.39 ( 6.82%)
    1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
    2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
    3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
    4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
    8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
    16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
    1.64% 2.73%

    NetPerf UDP PPC64
    64 49.98 ( 0.00%) 50.25 ( 0.54%)
    128 98.66 ( 0.00%) 100.95 ( 2.27%)
    256 197.33 ( 0.00%) 191.03 (-3.30%)
    1024 761.98 ( 0.00%) 785.07 ( 2.94%)
    2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
    3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
    4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
    8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
    16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)

    The tests are run to have confidence limits within 1%. Results marked
    with a * were not confident although in this case, it's only outside by
    small amounts. Even with some results that were not confident, the
    netperf UDP results were generally positive.

    NetPerf TCP X86
    64 652.25 ( 0.00%)* 648.12 (-0.64%)*
    23.80% 22.82%
    128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
    21.03% 18.90%
    256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
    1.00% 16.46%
    1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
    13.37% 11.39%
    2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
    9.76% 12.48%
    3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
    6.49% 8.75%
    4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
    9.85% 8.50%
    8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
    9.13% 13.04%
    16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
    7.73% 8.68%

    NETPERF TCP X86-64
    netperf-tcp-vanilla-netperf netperf-tcp
    tcp-vanilla pgalloc-delay
    64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
    5.79% 4.78%
    128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
    3.68% 6.06%
    256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
    3.02% 2.10%
    1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
    5.89% 6.55%
    2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
    7.08% 7.44%
    3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
    6.87% 7.33%
    4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
    6.86% 8.18%
    8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
    7.49% 5.55%
    16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
    7.36% 6.49%

    NETPERF TCP PPC64
    netperf-tcp-vanilla-netperf netperf-tcp
    tcp-vanilla pgalloc-delay
    64 594.17 ( 0.00%) 596.04 ( 0.31%)*
    1.00% 2.29%
    128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
    1.30% 1.40%
    256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
    1.25% 1.00%
    1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
    1.02% 1.00%
    2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
    1.15% 1.04%
    3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
    4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
    8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
    16384 9618.65 ( 0.00%) 9490.10 (-1.35%)

    There was a distinct lack of confidence in the X86* figures so I included
    what the deviation was where the results were not confident. Many of the
    results, whether gains or losses were within the standard deviation so no
    solid conclusion can be reached on performance impact. Looking at the
    figures, only the X86-64 ones look suspicious with a few losses that were
    outside the noise. However, the results were so unstable that without
    knowing why they vary so much, a solid conclusion cannot be reached.

    SYSBENCH X86
    sysbench-vanilla pgalloc-delay
    1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
    2 14901.11 ( 0.00%) 13683.44 (-8.90%)
    3 15171.71 ( 0.00%) 14888.25 (-1.90%)
    4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
    5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
    6 14870.33 ( 0.00%) 14845.57 (-0.17%)
    7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
    8 14354.35 ( 0.00%) 14362.31 ( 0.06%)

    SYSBENCH X86-64
    1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
    2 34276.39 ( 0.00%) 34251.00 (-0.07%)
    3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
    4 66667.10 ( 0.00%) 66174.69 (-0.74%)
    5 66003.91 ( 0.00%) 65685.25 (-0.49%)
    6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
    7 64933.16 ( 0.00%) 64379.23 (-0.86%)
    8 63353.30 ( 0.00%) 63281.22 (-0.11%)
    9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
    10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
    11 62092.81 ( 0.00%) 61787.75 (-0.49%)
    12 61330.11 ( 0.00%) 61036.34 (-0.48%)
    13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
    14 62304.48 ( 0.00%) 62064.90 (-0.39%)
    15 63296.48 ( 0.00%) 62875.16 (-0.67%)
    16 63951.76 ( 0.00%) 63769.09 (-0.29%)

    SYSBENCH PPC64
    -sysbench-pgalloc-delay-sysbench
    sysbench-vanilla pgalloc-delay
    1 7645.08 ( 0.00%) 7467.43 (-2.38%)
    2 14856.67 ( 0.00%) 14558.73 (-2.05%)
    3 21952.31 ( 0.00%) 21683.64 (-1.24%)
    4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
    5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
    6 27477.10 ( 0.00%) 27337.45 (-0.51%)
    7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
    8 26642.91 ( 0.00%) 25274.33 (-5.41%)
    9 25137.27 ( 0.00%) 24810.06 (-1.32%)
    10 24451.99 ( 0.00%) 24275.85 (-0.73%)
    11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
    12 24234.81 ( 0.00%) 23640.89 (-2.51%)
    13 24577.75 ( 0.00%) 24433.50 (-0.59%)
    14 25640.19 ( 0.00%) 25116.52 (-2.08%)
    15 26188.84 ( 0.00%) 26181.36 (-0.03%)
    16 26782.37 ( 0.00%) 26255.99 (-2.00%)

    Again, there is little to conclude here. While there are a few losses,
    the results vary by +/- 8% in some cases. These are the results of most
    concern as there are some large losses, but they are also within the
    variance typically seen between kernel releases.

    The STREAM results varied so little and are so verbose that I didn't
    include them here.

    The final test stressed how many huge pages can be allocated. The
    absolute number of huge pages allocated is the same with or without the
    patch. However, the "unusable free space index", which is a measure of
    external fragmentation, was slightly lower (lower is better) throughout
    the lifetime of the system. I also measured the latency of how long it
    took to successfully allocate a huge page. The latency was slightly
    lower, and on X86 and PPC64 more huge pages were allocated almost
    immediately from the free lists. The improvement is slight but there.

    [mel@csn.ul.ie: Tested, reworked for less branches]
    [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
    Cc: Corrado Zoccolo <czoccolo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Corrado Zoccolo
     
  • Shaohua Li reported that parallel file copies on tmpfs can lead to the OOM
    killer. This is a regression caused by commit 9ff473b9a7 ("vmscan: evict
    streaming IO first"). Wow, it is a two-year-old patch!

    Currently, tmpfs file cache is inserted on the active list at first. This
    means that the insertion not only increases the number of pages on the
    anon LRU, but also reduces the anon scanning ratio. Therefore, vmscan
    gets totally confused. It scans almost only the file LRU even though the
    system has plenty of unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons:
    1) to prioritize shmem pages over regular file cache, and
    2) to avoid reclaim priority inversion of used-once pages.

    But we've lost both motivations because (1) we now have separate anon and
    file LRU lists, so inserting on the active list doesn't provide such
    prioritization, and (2) in the past, one pte access bit would cause page
    activation, so inserting on the inactive list with the pte access bit set
    meant a higher priority than inserting on the active list. That priority
    inversion could lead to unintended LRU churn, but it was already solved
    by commit 645747462 ("vmscan: detect mapped file pages used only once").
    (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • fix the following 'make includecheck' warnings:

    arch/xtensa/kernel/vectors.S: asm/processor.h is included more than once.
    arch/xtensa/kernel/vectors.S: asm/ptrace.h is included more than once.

    Signed-off-by: Jaswinder Singh Rajput
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaswinder Singh Rajput
     
  • Also remove lots of unused irq_cpustat fields.

    Signed-off-by: Christoph Hellwig
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Architectures that handle DMA-non-coherent memory need to set
    ARCH_KMALLOC_MINALIGN to make sure that a kmalloc'ed buffer is DMA-safe:
    the buffer doesn't share a cache line with others.

    Signed-off-by: FUJITA Tomonori
    Cc: Chris Zankel
    Acked-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
     

24 May, 2010

9 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-2.6:
    cmd640: fix kernel oops in test_irq() method
    pdc202xx_old: ignore "FIFO empty" bit in test_irq() method
    pdc202xx_old: wire test_irq() method for PDC2026x
    IDE: pass IRQ flags to the IDE core
    ide: fix comment typo in ide.h

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vapier/blackfin: (30 commits)
    Blackfin: SMP: fix continuation lines
    Blackfin: acvilon: fix timeout usage for I2C
    Blackfin: fix typo in BF537 IRQ comment
    Blackfin: unify duplicate MEM_MT48LC32M8A2_75 kconfig options
    Blackfin: set ARCH_KMALLOC_MINALIGN
    Blackfin: use atomic kmalloc in L1 alloc so it too can be atomic
    Blackfin: another year of changes (update copyright in boot log)
    Blackfin: optimize strncpy a bit
    Blackfin: isram: clean up ITEST_COMMAND macro and improve the selftests
    Blackfin: move string functions to normal lib/ assembly
    Blackfin: SIC: cut down on IAR MMR reads a bit
    Blackfin: bf537-minotaur: fix build errors due to header changes
    Blackfin: kgdb: pass up the CC register instead of a 0 stub
    Blackfin: handle HW errors in the new "FAULT" printing code
    Blackfin: show the whole accumulator in the pseudo DBG insn
    Blackfin: support all possible registers in the pseudo instructions
    Blackfin: add support for the DBG (debug output) pseudo insn
    Blackfin: change the BUG opcode to an unused 16-bit opcode
    Blackfin: allow NMI watchdog to be used w/RETN as a scratch reg
    Blackfin: add support for the DBGA (debug assert) pseudo insn
    ...

    Linus Torvalds
     
  • * 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
    uml: Pushdown the bkl from harddog_kern ioctl
    sunrpc: Pushdown the bkl from sunrpc cache ioctl
    sunrpc: Pushdown the bkl from ioctl
    autofs4: Pushdown the bkl from ioctl
    uml: Convert to unlocked_ioctls to remove implicit BKL
    ncpfs: BKL ioctl pushdown
    coda: Clean-up whitespace problems in pioctl.c
    coda: BKL ioctl pushdown
    drivers: Push down BKL into various drivers
    isdn: Push down BKL into ioctl functions
    scsi: Push down BKL into ioctl functions
    dvb: Push down BKL into ioctl functions
    smbfs: Push down BKL into ioctl function
    coda/psdev: Remove BKL from ioctl function
    um/mmapper: Remove BKL usage
    sn_hwperf: Kill BKL usage
    hfsplus: Push down BKL into ioctl function

    Linus Torvalds
     
  • * git://git.infradead.org/battery-2.6:
    ds2760_battery: Document ABI change
    ds2760_battery: Make charge_now and charge_full writeable
    power_supply: Add support for writeable properties
    power_supply: Use attribute groups
    power_supply: Add test_power driver
    tosa_battery: Fix build error due to direct driver_data usage
    wm97xx_battery: Quieten sparse warning (bat_set_pdata not declared)
    ds2782_battery: Get rid of magic numbers in driver_data
    ds2782_battery: Add support for ds2786 battery gas gauge
    pda_power: Add function callbacks for suspend and resume
    wm831x_power: Use genirq
    Driver for Zipit Z2 battery chip
    ds2782_battery: Fix clientdata on removal

    Linus Torvalds
     
  • …nel/git/tip/linux-2.6-tip

    * 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timers: Fix slack calculation for expired timers
    timekeeping: Fix timezone update

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (25 commits)
    sh: fix up sh7785lcr_32bit_defconfig.
    arch/sh/lib/strlen.S: Checkpatch cleanup
    sh: fix up sh7786 dmaengine build.
    sh: guard cookie consistency across termination in the DMA driver
    sh: prevent the DMA driver from unloading, while in use
    sh: fix Oops in the serial SCI driver
    sh: allow platforms to specify SD-card supported voltages
    mmc: let MFD's provide supported Vdd card voltages to tmio_mmc
    sh: disable SD-card write-protection detection on kfr2r09
    mfd: pass platform flags down to the tmio_mmc driver
    tmio: add a platform flag to disable card write-protection detection
    sh: Add SDHI DMA support to migor
    sh: Add SDHI DMA support to kfr2r09
    sh: Add SDHI DMA support to ms7724se
    sh: Add SDHI DMA support to ecovec
    mmc: add DMA support to tmio_mmc driver, when used on SuperH
    sh: prepare the SDHI MFD driver to pass DMA configuration to tmio_mmc.c
    mmc: prepare tmio_mmc for passing of DMA configuration from the MFD cell
    sh: add DMA slave definitions to sh7724
    sh: add DMA slaves for two SDHI controllers to sh7722
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.open-osd.org/linux-open-osd:
    exofs: confusion between kmap() and kmap_atomic() api
    exofs: Add default address_space_operations

    Linus Torvalds
     
  • This reverts commit 03ceedea972a82d343fa5c2528b3952fa9e615d5, since it
    breaks resume from suspend-to-ram on Rafael's Acer Ferrari One.
    NetworkManager thinks everything is ok, but it can't connect to the AP
    to get an IP address after the resume.

    In fact, it even breaks resume for non-ath9k chipsets: reverting it also
    fixes Rafael's Toshiba Protege R500 with the iwlagn driver. As Johannes
    says:

    "Indeed, this patch needs to be reverted. That mac80211 change is wrong
    and completely unnecessary."

    Reported-and-requested-by: Rafael J. Wysocki
    Acked-by: Johannes Berg
    Cc: Daniel Yingqiang Ma
    Cc: John W. Linville
    Cc: David Miller
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
    fat: convert to unlocked_ioctl
    fat: Cleanup nls_unload() usage
    fat: use pack_hex_byte() instead of custom one

    Linus Torvalds