08 Mar, 2011

1 commit


05 Mar, 2011

3 commits

  • Pass down the correct node for a transparent hugepage allocation. Most
    callers continue to use the current node; however, the khugepaged daemon
    now uses the node where the first page to be collapsed resides instead.
    This ensures that khugepaged does not mess up local memory for an
    existing process which uses local policy.

    The choice of node is somewhat primitive currently: it just uses the
    node of the first page in the pmd range. An alternative would be to
    look at multiple pages and use the most popular node. I used the
    simplest variant for now which should work well enough for the case of
    all pages being on the same node.
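
    A hedged sketch of the node choice described above (illustrative, not
    the exact khugepaged code):

        /* use the node of the first page in the pmd range being collapsed */
        int target_node = page_to_nid(page);
        /* ... pass target_node down to the huge page allocation ... */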

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes a difference for the LOCAL policy, where the node cannot be
    determined from the policy itself but must be obtained from the original
    page.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but it will be needed for
    follow-on patches.
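
    A hedged before/after sketch of the interface change (exact argument
    list approximate):

        /* before: the local node was implied by the LOCAL policy */
        page = alloc_pages_vma(gfp, order, vma, addr);
        /* after: the caller passes the policy node explicitly */
        page = alloc_pages_vma(gfp, order, vma, addr, numa_node_id());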

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

26 Feb, 2011

5 commits

  • It seems odd that truncate_inode_pages_range(), called not only when
    truncating but also when evicting inodes, has mem_cgroup_uncharge_start()
    and mem_cgroup_uncharge_end() batching in its second loop, which clears
    up a few leftovers, but not in its first loop, which does almost all the
    work: add them there too.
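
    A hedged sketch of the batching pattern now applied to the first loop
    as well:

        mem_cgroup_uncharge_start();
        /* ... lock, truncate and release each page in the range ... */
        mem_cgroup_uncharge_end();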

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The THP code didn't pass the correct interleaving shift to the memory
    policy code. Fix this here by adjusting for the order.
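
    A hedged sketch of the adjustment (simplified from the huge page
    allocation path in the mempolicy code):

        /* account for the huge page order when picking the interleave node */
        nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);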

    Signed-off-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When pfn_valid_within() failed 'iter' was incremented twice.

    Signed-off-by: Namhyung Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • should_continue_reclaim() for reclaim/compaction allows scanning to
    continue even if pages are not being reclaimed until the full list is
    scanned. In terms of allocation success, this makes sense but potentially
    it introduces unwanted latency for high-order allocations such as
    transparent hugepages and network jumbo frames that would prefer to fail
    the allocation attempt and fall back to order-0 pages. Worse, there is a
    potential that the full LRU scan will clear all the young bits, distort
    page aging information and potentially push pages into swap that would
    have otherwise remained resident.

    This patch will stop reclaim/compaction if no pages were reclaimed in the
    last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
    hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
    LRU list may still be scanned.

    Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
    is not set so the following avoids the gfp_mask being examined:

        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                return false;
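
    For the cases described above, a hedged sketch of the new progress
    check (simplified, not verbatim):

        if (sc->gfp_mask & __GFP_REPEAT) {
                /* e.g. hugetlbfs: keep scanning, possibly the full LRU list */
        } else if (!nr_reclaimed) {
                /* no progress in the last SWAP_CLUSTER_MAX pages considered */
                return false;
        }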

    A tool was developed based on ftrace that tracked the latency of
    high-order allocations while transparent hugepage support was enabled and
    three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
    Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
    applied.

    STREAM Highorder Allocation Latency Statistics
                       fix-infinite    break-early
    1 :: Count                10298          10229
    1 :: Min                 0.4560         0.4640
    1 :: Mean                1.0589         1.0183
    1 :: Max                14.5990        11.7510
    1 :: Stddev              0.5208         0.4719
    2 :: Count                    2              1
    2 :: Min                 1.8610         3.7240
    2 :: Mean                3.4325         3.7240
    2 :: Max                 5.0040         3.7240
    2 :: Stddev              1.5715         0.0000
    9 :: Count               111696         111694
    9 :: Min                 0.5230         0.4110
    9 :: Mean               10.5831        10.5718
    9 :: Max                38.4480        43.2900
    9 :: Stddev              1.1147         1.1325

    Mean time for order-1 allocations is reduced. order-2 looks increased but
    with so few allocations, it's not particularly significant. THP mean
    allocation latency is also reduced. That said, allocation time varies so
    significantly that the reductions are within noise.

    Max allocation time is reduced by a significant amount for low-order
    allocations but increased for THP allocations, which presumably are now
    breaking before reclaim has done enough work.

    SysBench Highorder Allocation Latency Statistics
                       fix-infinite    break-early
    1 :: Count                15745          15677
    1 :: Min                 0.4250         0.4550
    1 :: Mean                1.1023         1.0810
    1 :: Max                14.4590        10.8220
    1 :: Stddev              0.5117         0.5100
    2 :: Count                    1              1
    2 :: Min                 3.0040         2.1530
    2 :: Mean                3.0040         2.1530
    2 :: Max                 3.0040         2.1530
    2 :: Stddev              0.0000         0.0000
    9 :: Count                 2017           1931
    9 :: Min                 0.4980         0.7480
    9 :: Mean               10.4717        10.3840
    9 :: Max                24.9460        26.2500
    9 :: Stddev              1.1726         1.1966

    Again, mean time for order-1 allocations is reduced while order-2
    allocations are too few to draw conclusions from. The mean time for THP
    allocations is also slightly reduced, albeit the reductions are within
    the variance.

    Once again, our maximum allocation time is significantly reduced for
    low-order allocations and slightly increased for THP allocations.

    Anon stream mmap reference Highorder Allocation Latency Statistics
                       fix-infinite    break-early
    1 :: Count                 1376           1790
    1 :: Min                 0.4940         0.5010
    1 :: Mean                1.0289         0.9732
    1 :: Max                 6.2670         4.2540
    1 :: Stddev              0.4142         0.2785
    2 :: Count                    1              -
    2 :: Min                 1.9060              -
    2 :: Mean                1.9060              -
    2 :: Max                 1.9060              -
    2 :: Stddev              0.0000              -
    9 :: Count                11266          11257
    9 :: Min                 0.4990         0.4940
    9 :: Mean            27250.4669     24256.1919
    9 :: Max          11439211.0000   6008885.0000
    9 :: Stddev         226427.4624    186298.1430

    This benchmark creates one thread per CPU which references an amount of
    anonymous memory 1.5 times the size of physical RAM. This pounds swap
    quite heavily and is intended to exercise THP a bit.

    Mean allocation time for order-1 is reduced as before. It's also reduced
    for THP allocations but the variations here are pretty massive due to
    swap. As before, maximum allocation times are significantly reduced.

    Overall, the patch reduces the mean and maximum allocation latencies for
    the smaller high-order allocations. This was with SLAB configured, so
    the effect would be expected to be more significant with SLUB, which
    uses allocations of these sizes more aggressively.

    The mean allocation times for THP allocations are also slightly reduced.
    The maximum latency was slightly increased as predicted by the comments
    due to reclaim/compaction breaking early. However, workloads care more
    about the latency of lower-order allocations than THP so it's an
    acceptable trade-off.

    Signed-off-by: Mel Gorman
    Acked-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Cc: Kent Overstreet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
    prevent free_pid() from reclaiming the pid.
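
    A hedged sketch of the required pattern (condensed from the
    move_pages() path; permission checks omitted):

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        mm = get_task_mm(task);         /* take what we need under RCU */
        rcu_read_unlock();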

    Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
    with:

    CONFIG_LOCKUP_DETECTOR=y
    CONFIG_PREEMPT=y
    CONFIG_LOCKDEP=y
    CONFIG_PROVE_LOCKING=y
    CONFIG_PROVE_RCU=y

    Previously, migrate_pages() went through a similar transformation
    replacing usage of tasklist_lock with rcu read lock:

    commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
    Author: Zeng Zhaoming
    Date: Thu Dec 2 14:31:13 2010 -0800

    mm/mempolicy.c: add rcu read lock to protect pid structure

    commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
    Author: KOSAKI Motohiro
    Date: Thu Jan 13 15:46:14 2011 -0800

    mempolicy: remove tasklist_lock from migrate_pages

    Signed-off-by: Greg Thelen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Oleg Nesterov
    Cc: Zeng Zhaoming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

25 Feb, 2011

1 commit

  • Grab a reference to bdev before calling blkdev_get(), which expects
    the refcount to be already incremented and either returns success or
    decrements the refcount and returns an error.

    The bug was introduced by e525fd89 (block: make blkdev_get/put()
    handle exclusive access), which didn't take into account this behavior
    of blkdev_get().

    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

24 Feb, 2011

2 commits

  • Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
    a hole with madvise(,, MADV_REMOVE). That path is under mutex, and
    cannot be explained by lack of serialization in unmap_mapping_range().

    Reviewing the code, I found one place where vm_truncate_count handling
    should have been updated, when I switched at the last minute from one
    way of managing the restart_addr to another: mremap move changes the
    virtual addresses, so it ought to adjust the restart_addr.

    But rather than exporting the notion of restart_addr from memory.c, or
    converting to restart_pgoff throughout, simply reset vm_truncate_count
    to 0 to force a rescan if mremap move races with preempted truncation.
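
    A hedged, illustrative one-liner of the fix described above (the real
    patch applies it conditionally in the mremap path):

        /* mremap move: force a truncation rescan of the moved vma */
        new_vma->vm_truncate_count = 0;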

    We have no confirmation that this fixes Robert's BUG,
    but it is a fix that's worth making anyway.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Michael Leun reported that running parallel opens on a fuse filesystem
    can trigger a "kernel BUG at mm/truncate.c:475"

    Gurudas Pai reported the same bug on NFS.

    The reason is, unmap_mapping_range() is not prepared for more than
    one concurrent invocation per inode. For example:

    thread1: going through a big range, stops in the middle of a vma and
    stores the restart address in vm_truncate_count.

    thread2: comes in with a small (e.g. single page) unmap request on
    the same vma, somewhere before restart_address, finds that the
    vma was already unmapped up to the restart address and happily
    returns without doing anything.

    Another scenario would be two big unmap requests, both having to
    restart the unmapping and each one setting vm_truncate_count to its
    own value. This could go on forever without any of them being able to
    finish.

    Truncate and hole punching already serialize with i_mutex. Other
    callers of unmap_mapping_range() do not, and it's difficult to get
    i_mutex protection for all callers. In particular ->d_revalidate(),
    which calls invalidate_inode_pages2_range() in fuse, may be called
    with or without i_mutex.

    This patch adds a new mutex to 'struct address_space' to prevent
    running multiple concurrent unmap_mapping_range() on the same mapping.
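
    A hedged sketch of the serialization (the field name follows the
    description and may differ from the final patch):

        mutex_lock(&mapping->unmap_mutex);
        /* ... walk i_mmap / i_mmap_nonlinear and unmap the range ... */
        mutex_unlock(&mapping->unmap_mutex);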

    [ We'll hopefully get rid of all this with the upcoming mm
    preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
    lockbreak" patch in particular. But that is for 2.6.39 ]

    Signed-off-by: Miklos Szeredi
    Reported-by: Michael Leun
    Reported-by: Gurudas Pai
    Tested-by: Gurudas Pai
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

16 Feb, 2011

1 commit

  • Transparent hugepages can only be created if rmap is fully
    functional. So we must prevent hugepages from being created while
    is_vma_temporary_stack() is true.

    This also optimizes away some harmless but unnecessary setting of
    khugepaged_scan.address, and it switches some BUG_ONs to VM_BUG_ONs.
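
    A hedged sketch of the guard described in the first paragraph
    (illustrative placement):

        if (is_vma_temporary_stack(vma))
                return;         /* rmap isn't fully functional yet; no THP here */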

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Feb, 2011

5 commits

  • mem_cgroup_uncharge_page() should be called in all failure cases after
    mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()
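
    A hedged sketch of the intended failure-path pairing (labels, the
    illustrative condition and the mem_cgroup_newpage_charge() spelling are
    what the message appears to refer to, not verbatim collapse_huge_page()
    code):

        if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
                return;                         /* nothing charged yet       */
        /* ... */
        if (some_later_step_failed)             /* illustrative condition    */
                mem_cgroup_uncharge_page(new_page);     /* undo the charge   */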

    [ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
    [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
    [ 4209.078674] page flags: 0x40000000004000(head)
    [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
    [ 4209.082177] (/A)
    [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
    [ 4209.083412] Call Trace:
    [ 4209.083678] [] ? bad_page+0xe4/0x140
    [ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
    [ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
    [ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
    [ 4209.086110] [] ? free_compound_page+0x1b/0x20
    [ 4209.086699] [] ? __put_compound_page+0x1c/0x30
    [ 4209.087333] [] ? put_compound_page+0x4d/0x200
    [ 4209.087935] [] ? put_page+0x45/0x50
    [ 4209.097361] [] ? khugepaged+0x9e9/0x1430
    [ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
    [ 4209.099121] [] ? khugepaged+0x0/0x1430
    [ 4209.099780] [] ? kthread+0x96/0xa0
    [ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
    [ 4209.101214] [] ? kthread+0x0/0xa0
    [ 4209.101842] [] ? kernel_thread_helper+0x0/0x10

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 3e7d34497067 ("mm: vmscan: reclaim order-0 and use compaction
    instead of lumpy reclaim") introduced an indefinite loop in
    shrink_zone().

    It meant to break out of this loop when no pages had been reclaimed and
    not a single page was even scanned. The way it would detect the latter
    is by taking a snapshot of sc->nr_scanned at the beginning of the
    function and comparing it against the new sc->nr_scanned after the scan
    loop. But it would re-iterate without updating that snapshot, looping
    forever if sc->nr_scanned changed at least once since shrink_zone() was
    invoked.

    This is not the sole condition that would exit that loop, but it
    requires other processes to change the zone state, as the reclaimer that
    is stuck obviously can not anymore.

    This is only happening for higher-order allocations, where reclaim is
    run back to back with compaction.

    Signed-off-by: Johannes Weiner
    Reported-by: Michal Hocko
    Tested-by: Kent Overstreet
    Reported-by: Kent Overstreet
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If the page is going to be written to, __do_fault() needs to break COW.

    However, the old page (before breaking COW) was never mapped into
    the current pte (__do_fault is only called when the pte is not present),
    so vmscan can't have marked the old page as PageMlocked due to being
    mapped in __do_fault's VMA. Therefore, __do_fault() does not need to
    worry about clearing PageMlocked() on the old page.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
    set the PageMlocked bit on these pages, transferring them onto the
    unevictable list. When do_wp_page() breaks COW within a VM_LOCKED vma,
    it may need to clear PageMlocked on the old page and set it on the new
    page instead.

    This change fixes an issue where do_wp_page() was clearing PageMlocked
    on the old page while the pte was still pointing to it (as well as
    rmap). Therefore, we were not protected against vmscan immediately
    transferring the old page back onto the unevictable list. This could
    cause pages to get stranded there forever.

    I propose to move the corresponding code to the end of do_wp_page(),
    after the pte (and rmap) have been pointed to the new page.
    Additionally, we can use munlock_vma_page() instead of
    clear_page_mlock(), so that the old page stays mlocked if there are
    still other VM_LOCKED vmas mapping it.
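
    A hedged sketch of the reordering (heavily simplified from
    do_wp_page()):

        page_add_new_anon_rmap(new_page, vma, address);
        set_pte_at_notify(mm, address, page_table, entry);  /* pte -> new page */
        /* only afterwards deal with the old page: */
        if (old_page && (vma->vm_flags & VM_LOCKED)) {
                lock_page(old_page);            /* munlock_vma_page() needs it */
                munlock_vma_page(old_page);     /* stays mlocked if other
                                                   VM_LOCKED vmas still map it */
                unlock_page(old_page);
        }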

    Signed-off-by: Michel Lespinasse
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • While applying a patch to use memblock to find the aperture for 64bit
    x86, Ingo found a system with 1g + force_iommu:

    > No AGP bridge found
    > Node 0: aperture @ 38000000 size 32 MB
    > Aperture pointing to e820 RAM. Ignoring.
    > Your BIOS doesn't leave a aperture memory hole
    > Please enable the IOMMU option in the BIOS setup
    > This costs you 64 MB of RAM
    > Cannot allocate aperture memory hole (0,65536K)

    the corresponding code:

        addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
        if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
                printk(KERN_ERR
                        "Cannot allocate aperture memory hole (%lx,%uK)\n",
                        addr, aper_size>>10);
                return 0;
        }
        memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")

    fails because the memblock core code aligns the size to 512M. That can
    make the size way too big.

    So don't align the size in that case.

    Actually __memblock_alloc_base(), the other caller, already aligns the
    size before calling that function.

    BTW, x86 does not use __memblock_alloc_base...

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

03 Feb, 2011

11 commits

  • Changes in e401f1761 ("memcg: modify accounting function for supporting
    THP better") added nr_pages to support multiple page sizes in
    mem_cgroup_charge_statistics().

    But counting the number of events needs abs(nr_pages) when increasing
    the counters. This patch fixes the event counting.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Huge page coverage should obviously have less priority than the continued
    execution of a process.

    Never kill a process when charging it a huge page fails. Instead, give up
    after the first failed reclaim attempt and fall back to regular pages.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If reclaim after a failed charging was unsuccessful, the limits are
    checked again, just in case they settled by means of other tasks.

    This is all fine as long as every charge is of size PAGE_SIZE, because in
    that case, being below the limit means having at least PAGE_SIZE bytes
    available.

    But with transparent huge pages, we may end up in an endless loop where
    charging and reclaim fail, but we keep going because the limits are not
    yet exceeded, although not allowing for a huge page.

    Fix this up by explicitly checking for enough room, not just whether we
    are within limits.
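
    A hedged sketch of the "enough room" idea with a hypothetical helper
    (the actual patch may structure this differently):

        static bool memcg_has_room(struct res_counter *cnt, unsigned long bytes)
        {
                u64 limit = res_counter_read_u64(cnt, RES_LIMIT);
                u64 usage = res_counter_read_u64(cnt, RES_USAGE);

                /* "below the limit" is not enough; require bytes of headroom */
                return usage <= limit && limit - usage >= bytes;
        }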

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The charging code can encounter a charge size that is bigger than a
    regular page in two situations: one is a batched charge to fill the
    per-cpu stocks, the other is a huge page charge.

    This code is distributed over two functions, however, and only the outer
    one is aware of huge pages. In case the charging fails, the inner
    function will tell the outer function to retry if the charge size is
    bigger than regular pages--assuming batched charging is the only case.
    And the outer function will retry forever charging a huge page.

    This patch makes sure the inner function can distinguish between batch
    charging and a single huge page charge. It will only signal another
    attempt if batch charging failed, and go into regular reclaim when it is
    called on behalf of a huge page.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When a tail page of a THP is poisoned, memory-failure will do nothing
    except setting PG_hwpoison, while the expected behavior is that the
    process using the poisoned tail page should be killed.

    The above problem is caused by the LRU check on the poisoned tail page
    of the THP. Because the PG_lru flag is only set on the head page of a
    THP, the check always considers the poisoned tail page a non-LRU page.

    So the LRU check for the tail page of a THP should be avoided, just as
    is done for hugetlb.

    This patch adds !PageTransCompound() before the LRU check for THP;
    because of the check (!PageHuge() && !PageTransCompound()), the whole
    branch can be optimized away at build time when both hugetlbfs and THP
    are set to "N" (or on archs not supporting either of those).
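
    A hedged sketch of the adjusted check (simplified from the
    memory-failure path):

        if (!PageHuge(p) && !PageTransCompound(p) && !PageLRU(p))
                shake_page(p, 0);       /* try to move the page onto the LRU */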

    [akpm@linux-foundation.org: fix unrelated typo in shake_page() comment]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     
  • When the tail page of a THP is poisoned, the head page will be poisoned
    too. And the wrong address, the address of the head page, is always
    reported with the SIGBUS.

    So when the poisoned page is used by a guest OS running on KVM, after
    the address translation (hva->gpa) by qemu, an unexpected process on the
    guest OS will be killed by SIGBUS.

    What we expected is that the process using the poisoned tail page could be
    killed on Guest OS, but not that the process using the healthy head page
    is killed.

    Since it is not good to poison a healthy page, avoid poisoning any page
    other than the one which is really poisoned.
    (While we poison all pages of a huge page in the hugetlb case,
    we can do this for THP thanks to split_huge_page().)

    Here we fix two parts:
    1. Isolate only the poisoned page, to make sure
    the reported address is the address of the poisoned page.
    2. Make the poisoned page work as a poisoned regular page.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     
  • The poisoned THP is now split with split_huge_page() in
    collect_procs_anon(). If kmalloc() fails in collect_procs(),
    split_huge_page() is not called, and the work after split_huge_page()
    for collecting the processes using the poisoned page is not done
    either. So the processes using the poisoned page cannot be killed.

    The situation becomes worse when CONFIG_DEBUG_VM is "Y": because the
    poisoned THP could not be split, a system panic is caused by
    VM_BUG_ON(PageTransHuge(page)) in try_to_unmap().

    This patch does:
    1. Move split_huge_page() to before collect_procs().
    This makes sure that a failure to split the THP is caused by the split
    itself.
    2. When splitting the THP fails, stop the operations after it.
    This avoids an unexpected system panic or pointless work.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     
  • If migrate_huge_page(), called by memory-failure, fails, it calls
    put_page() itself to drop the page reference, and the caller of
    migrate_huge_page() also calls putback_lru_pages(). This can double-free
    the page and cause page corruption for the page holder.

    In addition, cleaning up the pages in the caller is consistent with the
    behaviour of migrate_pages() since cf608ac19c ("mm: compaction: fix
    COMPACTPAGEFAILED counting").

    Signed-off-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In some cases migrate_pages() could return zero while still leaving a
    few pages in the pagelist (and some callers wouldn't notice they have to
    call putback_lru_pages() after commit cf608ac19c9 ("mm: compaction: fix
    COMPACTPAGEFAILED counting")).

    Add one missing putback_lru_pages() call not added by commit cf608ac19c95
    ("mm: compaction: fix COMPACTPAGEFAILED counting").

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • noswapaccount couldn't be used to control memsw for both the on and off
    cases, so we have added a swapaccount[=0|1] parameter. This way the
    feature can be turned off in two ways: noswapaccount resp.
    swapaccount=0. We have kept the original noswapaccount, but I think we
    should remove it after some time, as it just adds another command line
    parameter without any advantage and also makes the parameter-handling
    code uglier if we want to support both.

    Signed-off-by: Michal Hocko
    Requested-by: KAMEZAWA Hiroyuki
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __setup-based kernel command line parameter handlers which are handled
    in obsolete_checksetup() are provided with the parameter value including
    the '=' (more precisely, everything right after the parameter name).

    This means that the current implementation of swapaccount[=1|0] doesn't
    work at all, because if there is a value for the parameter then we are
    testing for "0" resp. "1" but we are getting "=0" resp. "=1", and if
    there is no parameter value we are getting an empty string rather than
    NULL.

    The original noswapaccount parameter, which doesn't care about the
    value, works correctly.
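
    A hedged sketch of a handler that copes with this behaviour (function
    name illustrative; really_do_swap_account is the existing knob):

        static int __init enable_swap_account(char *s)
        {
                /* the handler sees "=1"/"=0", or "" for a bare "swapaccount" */
                if (!strcmp(s, "=0"))
                        really_do_swap_account = 0;
                else if (!strcmp(s, "=1") || !*s)
                        really_do_swap_account = 1;
                return 1;
        }
        __setup("swapaccount", enable_swap_account);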

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Feb, 2011

2 commits

  • As Tao Ma noticed, change 5ecfda0 breaks blktrace. This is because
    blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ,
    so my attempt to not unnecessarily break COW during mlock ended up
    causing mlock to fail with a permission problem.

    I am proposing to let mlock ignore vma protection in all cases except
    PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions
    (as in the blktrace case, which broke at 5ecfda0) or for PROT_EXEC
    regions (which seem to me like they were always broken).

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • SELinux would like to implement a new labeling behavior of newly created
    inodes. We currently label new inodes based on the parent and the creating
    process. This new behavior would also take into account the name of the
    new object when deciding the new label. This is not the (supposed) full path,
    just the last component of the path.

    This is very useful because creating /etc/shadow is different from
    creating /etc/passwd, but the kernel hooks are unable to differentiate
    these operations. We currently require that userspace realize it is
    doing some difficult operation like that and then jump through SELinux
    hoops to get things set up correctly. This patch does not implement the
    new behavior, which is obviously contained in a separate SELinux patch,
    but it does pass the needed name down to the correct LSM hook. If no
    such name exists it is fine to pass NULL.

    Signed-off-by: Eric Paris

    Eric Paris
     

31 Jan, 2011

1 commit


28 Jan, 2011

2 commits

  • This patch adds __GFP_NORETRY and __GFP_NOMEMALLOC flags to the kmemleak
    metadata allocations so that it has a smaller effect on the users of the
    kernel slab allocator. Since kmemleak allocations can now fail more
    often, this patch also reduces the verbosity by passing __GFP_NOWARN and
    not dumping the stack trace when a kmemleak allocation fails.

    Signed-off-by: Catalin Marinas
    Reported-by: Toralf Förster
    Acked-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Ted Ts'o

    Catalin Marinas
     
  • We don't need to memset if we just use kzalloc() rather than kmalloc() in
    kmemleak_test_init().

    Signed-off-by: Jesper Juhl
    Reviewed-by: Minchan Kim
    Signed-off-by: Catalin Marinas

    Jesper Juhl
     

26 Jan, 2011

6 commits

  • Fix up mem_cgroup_move_parent(), which uses compound_order() in an
    asynchronous manner. This compound_order() may return an unknown value
    because we don't take the lock. Use PageTransHuge() and HPAGE_SIZE
    instead.

    Also clean up mem_cgroup_move_parent():
    - remove an unnecessary initialization of a local variable.
    - rename charge_size -> page_size
    - remove an unnecessary (wrong) comment.
    - add a comment about THP.

    Note:
    The current design takes compound_page_lock() in the caller of
    move_account(). This should be revisited when we implement direct
    move_task of hugepages without splitting.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_disabled() should be checked at splitting. If disabled, no
    heavy work is necessary.
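
    A hedged sketch of the early bail-out (simplified; the signature
    reflects the split fixup hook of this era):

        void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
        {
                if (mem_cgroup_disabled())
                        return;         /* no page_cgroup bookkeeping needed */
                /* ... copy the head page's page_cgroup state to the tail ... */
        }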

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 4b53433468 ("memcg: clean up try_charge main loop") removed the
    cancelling of a charge in the case: memory charge -> success, mem+swap
    charge -> failure.

    This leaks memory usage. Fix it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Callers of migrate_pages() should call putback_lru_pages() to return
    isolated pages to the LRU or free list. The current comment is rather
    confusing: it says the caller always has to call it.

    It is clearer to point out that the caller only has to call it if
    migrate_pages()'s return value isn't zero.

    Signed-off-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit 5d6892407 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
    enabled") causes this warning during the configuration process:

    warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
    direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)

    COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP
    either, it is also useful for regular alloc_pages(order > 0) including
    the very kernel stack during fork (THREAD_ORDER = 1). It's always
    better to enable COMPACTION.

    The warning should be an error because we would end up with MIGRATION
    not selected, and COMPACTION wouldn't work without migration (despite it
    seems to build with an inline migrate_pages returning -ENOSYS).

    I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
    for some releases (for full safety the default remains disabled which I
    think is enough).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Luca Tettamanti
    Tested-by: Luca Tettamanti
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
    to the 'put_back' label

        ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
        if (ret || !parent)
                goto put_back;

    where we'll

        if (charge > PAGE_SIZE)
                compound_unlock_irqrestore(page, flags);

    but, we have not assigned anything to 'flags' at this point, nor have we
    called 'compound_lock_irqsave()' (which is what sets 'flags'). The
    'put_back' label should be moved below the call to
    compound_unlock_irqrestore() as per this patch.

    Signed-off-by: Jesper Juhl
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl