21 Sep, 2010

1 commit

  • Commit 4969c1192d15 ("mm: fix swapin race condition") is now agreed to
    be incomplete. There's a race, not very much less likely than the
    original race envisaged, in which it is further necessary to check that
    the swapcache page's swap has not changed.

    Here's the reasoning: cast in terms of reuse_swap_page(), but probably
    could be reformulated to rely on try_to_free_swap() instead, or on
    swapoff+swapon.

    A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
    comes through the lock_page(page1).

    B, a racing thread of the same process, faults on the same address: does
    page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
    for whatever reason is unlucky not to get the lock any time soon.

    A carries on through do_swap_page(), a write fault, but cannot reuse the
    swap page1 (another reference to swap1). Unlocks the page1 (but B
    doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.

    C, perhaps the parent of A+B, comes in and write faults the same swap
    page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.

    kswapd comes in after some time (B still unlucky) and swaps out some
    pages from A+B and C: it allocates the original swap1 to page2 in A+B,
    and some other swap2 to the original page1 now in C. But does not
    immediately free page1 (actually it couldn't: B holds a reference),
    leaving it in swap cache for now.

    B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
    Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
    given the swap1 which page1 used to have. So B proceeds to insert page1
    into A+B's page_table, though its content now belongs to C, quite
    different from what A wrote there.

    B ought to have checked that page1's swap was still swap1.
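
    A minimal sketch of the extra check this implies in do_swap_page(), using
    the names from mm/memory.c (the placement is illustrative, not the final
    patch):

        entry = pte_to_swp_entry(orig_pte);
        page = lookup_swap_cache(entry);
        ...
        lock_page(page);
        /*
         * While we slept on the lock the page may have been freed from
         * swapcache and reused for a different swap slot: check that it is
         * still the swapcache page for this entry before touching the pte.
         */
        if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
                goto out_page;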

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

10 Sep, 2010

1 commit

  • The pte_same check is reliable only if the swap entry remains pinned (by
    the page lock on swapcache). We also have to ensure the swapcache isn't
    removed before we take the page lock, as try_to_free_swap won't care
    about the page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.
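
    A sketch of the check this commit adds once the page lock is taken (the
    follow-up described in the entry above extends it to also compare the
    page's swap entry):

        lock_page(page);
        /*
         * Make sure the swapcache was not released from under us while we
         * waited for the lock: try_to_free_swap ignores the extra reference
         * we hold on the page.
         */
        if (unlikely(!PageSwapCache(page)))
                goto out_page;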

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Aug, 2010

1 commit

  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page().
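
    The flag trick looks roughly like this (values roughly as in the
    include/linux/mm.h of that era):

        #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
        #define VM_GROWSUP      0x00000200
        #else
        #define VM_GROWSUP      0x00000000
        #endif

    With VM_GROWSUP defined as 0 elsewhere, a test like
    "vma->vm_flags & VM_GROWSUP" is compile-time false on those
    architectures, so check_stack_guard_page() can carry the growsup branch
    unconditionally.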

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

21 Aug, 2010

1 commit

  • Like the mlock() change previously, this makes the stack guard check
    code use vma->vm_prev to see what the mapping below the current stack
    is, rather than have to look it up with find_vma().

    Also, accept an abutting stack segment, since that happens naturally if
    you split the stack with mlock or mprotect.
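
    The guard check then looks roughly like this (simplified sketch of the
    mm/memory.c helper after this change):

        static inline int check_stack_guard_page(struct vm_area_struct *vma,
                                                 unsigned long address)
        {
                address &= PAGE_MASK;
                if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start) {
                        struct vm_area_struct *prev = vma->vm_prev;

                        /*
                         * A mapping abutting us from below is only ok if it
                         * is the same stack mapping that has gotten split.
                         */
                        if (prev && prev->vm_end == address)
                                return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;

                        expand_stack(vma, address - PAGE_SIZE);
                }
                return 0;
        }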

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Aug, 2010

1 commit

  • We do in fact need to unmap the page table _before_ doing the whole
    stack guard page logic, because if it is needed (mainly 32-bit x86 with
    PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
    will do a kmap_atomic/kunmap_atomic.

    And those kmaps will create an atomic region that we cannot do
    allocations in. However, the whole stack expand code will need to do
    anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
    atomic region.

    Now, a better model might actually be to do the anon_vma_prepare() when
    _creating_ a VM_GROWSDOWN segment, and not have to worry about any of
    this at page fault time. But in the meantime, this is the
    straightforward fix for the issue.

    See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.
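
    The resulting ordering in do_anonymous_page() is roughly:

        /* pte_offset_map() may have used kmap_atomic(): drop it first */
        pte_unmap(page_table);

        /* Check if we need to add a guard page to the stack */
        if (check_stack_guard_page(vma, address) < 0)
                return VM_FAULT_SIGBUS;

        /* only now is it safe to sleep: allocation, anon_vma_prepare(), ... */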

    Reported-by: Wylda
    Reported-by: Sedat Dilek
    Reported-by: Mike Pagano
    Reported-by: François Valenduc
    Tested-by: Ed Tomlinson
    Cc: Pekka Enberg
    Cc: Greg KH
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Aug, 2010

1 commit


13 Aug, 2010

1 commit

  • This is a rather minimally invasive patch to solve the problem of the
    user stack growing into a memory mapped area below it. Whenever we fill
    the first page of the stack segment, expand the segment down by one
    page.

    Now, admittedly some odd application might _want_ the stack to grow down
    into the preceding memory mapping, and so we may at some point need to
    make this a process tunable (some people might also want to have more
    than a single page of guarding), but let's try the minimal approach
    first.

    Tested with trivial application that maps a single page just below the
    stack, and then starts recursing. Without this, we will get a SIGSEGV
    _after_ the stack has smashed the mapping. With this patch, we'll get a
    nice SIGBUS just as the stack touches the page just above the mapping.
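
    The trivial test could look something like the sketch below (user-space,
    illustrative only; the 64-page offset is just an arbitrary way of landing
    the mapping a little below the current stack):

        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static int recurse(int depth)
        {
                volatile char pad[1024];        /* force real stack growth */

                pad[0] = (char)depth;
                return recurse(depth + 1) + pad[0];
        }

        int main(void)
        {
                char dummy;
                long pagesz = sysconf(_SC_PAGESIZE);
                unsigned long addr =
                        ((unsigned long)&dummy - 64UL * pagesz) & ~(pagesz - 1);

                /* map a single page a little below the current stack */
                if (mmap((void *)addr, pagesz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                         -1, 0) == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                return recurse(0);
        }

    Without the guard logic the recursion scribbles over the mapping and only
    dies with SIGSEGV afterwards; with it, the process should get SIGBUS as
    soon as the stack reaches the page just above the mapping.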

    Requested-by: Keith Packard
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Aug, 2010

4 commits

  • It is not appropriate for apply_to_page_range() to directly call any mmu
    notifiers, because it is a general purpose function whose effect depends
    on what context it is called in and what the callback function does.

    In particular, if it is being used as part of an mmu notifier
    implementation, the recursive calls can be particularly problematic.

    It is up to apply_to_page_range's caller to do any notifier calls if
    necessary. It does not affect any in-tree users because they all operate
    on init_mm, and mmu notifiers only pertain to usermode mappings.
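
    A notifier-aware caller (hypothetical names; the pte_fn_t callback
    signature is the one apply_to_page_range() expected at the time) would
    now do something like:

        static int my_pte_update_fn(pte_t *pte, pgtable_t token,
                                    unsigned long addr, void *data);

        static int remap_my_range(struct mm_struct *mm, unsigned long start,
                                  unsigned long end, void *data)
        {
                int err;

                mmu_notifier_invalidate_range_start(mm, start, end);
                err = apply_to_page_range(mm, start, end - start,
                                          my_pte_update_fn, data);
                mmu_notifier_invalidate_range_end(mm, start, end);
                return err;
        }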

    [stefano.stabellini@eu.citrix.com: remove unused local `start']
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Stefano Stabellini
    Cc: Andrea Arcangeli
    Cc: Stefano Stabellini
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Set the flag if do_swap_page is decowing the page the same way do_wp_page
    would too.

    Signed-off-by: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • On swapin it is fairly common for a page to be owned exclusively by one
    process. In that case we want to add the page to the anon_vma of that
    process's VMA, instead of to the root anon_vma.

    This will reduce the amount of rmap searching that the swapout code needs
    to do.
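
    Roughly, the swapin path can now tell rmap when the page is exclusively
    ours (sketch; "exclusive" is the write-fault case where reuse_swap_page()
    succeeds):

        int exclusive = 0;
        ...
        if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
                pte = maybe_mkwrite(pte_mkdirty(pte), vma);
                flags &= ~FAULT_FLAG_WRITE;
                exclusive = 1;
        }
        ...
        do_page_add_anon_rmap(page, vma, address, exclusive);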

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • No real bugs, just some dead code and some fixups.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

31 Jul, 2010

1 commit

  • Debian's ia64 autobuilders have been seeing kernel freeze or reboot
    when running the gdb testsuite (Debian bug 588574): dannf bisected to
    2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without
    PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.

    I'd missed updating the gate_vma handling in __get_user_pages(): that
    happens to use vm_normal_page() (nowadays failing on the zero page),
    yet reported success even when it failed to get a page - boom when
    access_process_vm() tried to copy that to its intermediate buffer.

    Fix this, resisting cleanups: in particular, leave it for now reporting
    success when not asked to get any pages - very probably safe to change,
    but let's not risk it without testing exposure.

    Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
    Because setup_gate() pads each 64kB of its gate area with zero pages.

    Reported-by: Andreas Barth
    Bisected-by: dann frazier
    Signed-off-by: Hugh Dickins
    Tested-by: dann frazier
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 May, 2010

1 commit


07 Apr, 2010

1 commit

  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648
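
    A sketch of the fork-time zeroing (placement in copy_process() is
    illustrative):

        #if defined(SPLIT_RSS_COUNTING)
                memset(&tsk->rss_stat, 0, sizeof(tsk->rss_stat));
        #endif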

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit

  • In 2.6.34-rc1, removing vhost_net module causes an oops in sync_mm_rss
    (called from do_exit) when workqueue is destroyed. This does not happen
    on net-next, or with vhost on top of 2.6.33.

    The issue seems to be introduced by
    34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing of
    mm_counter") which added sync_mm_rss() that is passed task->mm, and
    dereferences it without checking. If task is a kernel thread, mm might be
    NULL. I think this might also happen e.g. with aio.

    This patch fixes the oops by calling sync_mm_rss at the point where
    task->mm is set to NULL. I also added a BUG_ON to detect any other cases
    where counters get
    incremented while mm is NULL.

    The oops I observed looks like this:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
    IP: [] sync_mm_rss+0x33/0x6f
    PGD 0
    Oops: 0002 [#1] SMP
    last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
    CPU 2
    Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]

    Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]-
    RIP: 0010:[] [] sync_mm_rss+0x33/0x6f
    RSP: 0018:ffff8802379b7e60 EFLAGS: 00010202
    RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000
    RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0
    RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000
    R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0)
    Stack:
    ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d
    0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98
    ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78
    Call Trace:
    [] do_exit+0x147/0x6c4
    [] ? handle_rx_net+0x0/0x17 [vhost_net]
    [] ? autoremove_wake_function+0x0/0x39
    [] ? worker_thread+0x0/0x229
    [] kthreadd+0x0/0xf2
    [] kernel_thread_helper+0x4/0x10
    [] ? kthread+0x0/0x87
    [] ? kernel_thread_helper+0x0/0x10
    Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74
    RIP [] sync_mm_rss+0x33/0x6f
    RSP
    CR2: 00000000000002a8
    ---[ end trace 41603ba922beddd2 ]---
    Fixing recursive fault but reboot is needed!

    (note: handle_rx_net is a work item using workqueue in question).
    sync_mm_rss+0x33/0x6f gave me a hint. I also tried reverting
    34e55232e59f7b19050267a05ff1226e5cd122a5 and the oops goes away.

    The module in question calls use_mm and later unuse_mm from a kernel
    thread. It is when this kernel thread is destroyed that the crash
    happens.
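
    A sketch of the kind of change described: flush the per-task counters
    while the mm pointer is still valid, at the place where task->mm is
    cleared, e.g. in unuse_mm():

        void unuse_mm(struct mm_struct *mm)
        {
                struct task_struct *tsk = current;

                task_lock(tsk);
                sync_mm_rss(tsk, mm);   /* flush loose counters before mm goes away */
                tsk->mm = NULL;
                /* active_mm is still 'mm' */
                enter_lazy_tlb(mm, tsk);
                task_unlock(tsk);
        }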

    Signed-off-by: Michael S. Tsirkin
    Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)
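
    Typical usage is a one-liner in an error path, for example (the
    surrounding check is illustrative):

        if (unlikely(page_mapcount(page) < 0)) {
                printk(KERN_ALERT "unexpected negative mapcount\n");
                dump_page(page);  /* count, mapcount, mapping, index, flag names */
        }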

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Commit 34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing
    of mm_counter") added sync_mm_rss() for syncing loosely accounted rss
    counters. It's for CONFIG_MMU, but sync_mm_rss is called even in NOMMU
    environments (kernel/exit.c, fs/exec.c), which the above commit doesn't
    handle well.

    This patch makes
    SPLIT_RSS_COUNTING depend on SPLIT_PTLOCKS && CONFIG_MMU

    and, to avoid unnecessary function calls, sync_mm_rss becomes an inline
    no-op function in the header file.
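
    The shape of the change, roughly:

        /* include/linux/mm_types.h */
        #if USE_SPLIT_PTLOCKS && defined(CONFIG_MMU)
        #define SPLIT_RSS_COUNTING
        ...
        #endif

        /* include/linux/mm.h */
        #if defined(SPLIT_RSS_COUNTING)
        void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
        #else
        static inline void sync_mm_rss(struct task_struct *task,
                                       struct mm_struct *mm)
        {
        }
        #endif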

    Reported-by: David Howells
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Frysinger
    Signed-off-by: Michal Simek
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Mar, 2010

5 commits

  • When the parent process breaks the COW on a page, both the original page,
    which stays mapped in the child, and the new page, which is mapped in the
    parent, end up in that same anon_vma. Generally this won't be a problem,
    but for some workloads it could preserve the O(N) rmap scanning complexity.

    A simple fix is to ensure that, when a page mapped in a child gets reused
    in do_wp_page because we already are the exclusive owner, the page gets
    moved to the child's own exclusive anon_vma.
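
    Roughly, the reuse path in do_wp_page() becomes (sketch):

        if (reuse_swap_page(old_page)) {
                /*
                 * The page is all ours: move it to our anon_vma so the rmap
                 * code no longer needs to search the parent or siblings.
                 * Protected against the rmap code by the page lock.
                 */
                page_move_anon_rmap(old_page, vma, address);
                unlock_page(old_page);
                goto reuse;
        }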

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.
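
    The linking object introduced here looks roughly like this:

        struct anon_vma_chain {
                struct vm_area_struct *vma;
                struct anon_vma *anon_vma;
                struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
                struct list_head same_anon_vma; /* locked by anon_vma->lock */
        };

    Each VMA keeps a list of these chains, one per anon_vma it is associated
    with, and each anon_vma keeps a list of the chains that reference it;
    the latter is what the rmap code now walks.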

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • A frequent question from users about memory management is how many swap
    entries are used by each process, and this information would also give
    useful hints to the oom-killer.

    Although we can count the number of swap entries per process by scanning
    /proc/<pid>/smaps, this is very slow and not suitable for the usual
    process-information handlers such as 'ps' or 'top' (which are already
    slow enough).

    This patch adds a counter of swap entries to mm_counter and updates it at
    each swap event. The information is exported via the /proc/<pid>/status
    file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
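
    The accounting itself is a matter of bumping the new counter wherever a
    pte changes to or from a swap entry, roughly:

        /* swap-out (try_to_unmap_one): present pte becomes a swap entry */
        dec_mm_counter(mm, MM_ANONPAGES);
        inc_mm_counter(mm, MM_SWAPENTS);

        /* swap-in (do_swap_page): swap entry becomes a present pte */
        inc_mm_counter(mm, MM_ANONPAGES);
        dec_mm_counter(mm, MM_SWAPENTS);
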
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per-mm stats, the counters are objects shared
    among threads and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter: RSS values are
    accumulated in a structure in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is a count of calls to handle_mm_fault: the
    per-thread values are folded back into the mm every 64 calls.

    A rough estimate with a small benchmark on two parallel threads shows
    [before]
    4.5 cache-misses/fault
    [after]
    4.0 cache-misses/fault
    Anyway, the most contended object is mmap_sem once the number of threads grows.
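
    The synchronization point looks roughly like this (the threshold of 64 is
    the one described above):

        #define TASK_RSS_EVENTS_THRESH  (64)

        static void check_sync_rss_stat(struct task_struct *task)
        {
                if (unlikely(task != current))
                        return;
                if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
                        __sync_task_rss_stat(task, task->mm);
        }

    handle_mm_fault() calls this once per fault, so the per-thread deltas are
    folded back into the shared mm counters every 64 faults.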

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names.

    This patch is for reducing the size of a future patch that modifies the
    implementation of the per-mm counters.
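
    The shape of the change, roughly:

        enum {
                MM_FILEPAGES,
                MM_ANONPAGES,
                NR_MM_COUNTERS
        };

        /* include/linux/mm.h */
        static inline void add_mm_counter(struct mm_struct *mm, int member,
                                          long value)
        {
                atomic_long_add(value, &mm->rss_stat.count[member]);
        }

        #define inc_mm_counter(mm, member) add_mm_counter(mm, member, 1)
        #define dec_mm_counter(mm, member) add_mm_counter(mm, member, -1)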

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

21 Feb, 2010

1 commit

  • On VIVT ARM, when we have multiple shared mappings of the same file
    in the same MM, we need to ensure that we have coherency across all
    copies. We do this via make_coherent() by making the pages
    uncacheable.

    This used to work fine, until we allowed highmem with highpte - we
    now have a page table which is mapped as required, and is not available
    for modification via update_mmu_cache().

    Ralf Baechle suggested getting rid of the PTE value passed to
    update_mmu_cache():

    On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
    to construct a pointer to the pte again. Passing a pte_t * is much
    more elegant. Maybe we might even replace the pte argument with the
    pte_t?

    Ben Herrenschmidt would also like the pte pointer for PowerPC:

    Passing the ptep in there is exactly what I want. I want that
    -instead- of the PTE value, because I have issue on some ppc cases,
    for I$/D$ coherency, where set_pte_at() may decide to mask out the
    _PAGE_EXEC.

    So, pass in the mapped page table pointer into update_mmu_cache(), and
    remove the PTE value, updating all implementations and call sites to
    suit.
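
    The interface change, roughly:

        /* before */
        void update_mmu_cache(struct vm_area_struct *vma,
                              unsigned long address, pte_t pte);

        /* after: pass the mapped page table entry pointer instead */
        void update_mmu_cache(struct vm_area_struct *vma,
                              unsigned long address, pte_t *ptep);

    Implementations that only need the value can still read *ptep, while the
    ARM, MIPS and PowerPC cases quoted above get at the page table entry
    itself without re-walking or re-mapping the page tables.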

    Includes a fix from Stephen Rothwell:

    sparc: fix fallout from update_mmu_cache API change

    Signed-off-by: Stephen Rothwell

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Russell King

    Russell King
     

17 Dec, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

5 commits

  • In a massively parallel environment, res_counter can be a performance
    bottleneck. One strong technique to reduce lock contention is to reduce
    the number of calls by coalescing several of them into one.

    Considering the charge/uncharge characteristics:
    - charge is done one by one via demand-paging.
    - uncharge is done
      - in chunks at munmap, truncate, exit, execve...
      - one by one via vmscan/paging.

    It seems we have a chance to coalesce uncharges for improving scalability
    at unmap/truncation.

    This patch is for coalescing uncharges. To avoid scattering memcg's
    structure into functions under /mm, this patch adds memcg batch-uncharge
    information to the task. The reason for per-task batching is to make use
    of the caller's context information: we do batched (delayed) uncharge
    when truncation/unmap occurs, but direct uncharge when the uncharge is
    called by memory reclaim (vmscan.c).

    The degree of coalescing depends on the caller:
    - at invalidate/truncate... pagevec size
    - at unmap ... ZAP_BLOCK_SIZE
    (memory itself is freed in chunks of this size),
    so we will not coalesce too much.
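
    Callers bracket the bulk-free paths with the new begin/end hooks, roughly:

        mem_cgroup_uncharge_start();
        /* ... unmap or truncate a range, uncharging many pages ... */
        mem_cgroup_uncharge_end();      /* flush the coalesced uncharge */

    Between start and end, uncharges against the same memcg are accumulated
    in the per-task batch information and applied to the res_counter in one
    go; reclaim still uncharges directly.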

    On an x86-64 8-CPU server, I tested the overhead of memcg at page fault
    by running a program which does map/fault/unmap in a loop, running one
    task per CPU with taskset and summing the number of page faults over
    60 seconds.

    [without memcg config]
    40156968 page-faults # 0.085 M/sec ( +- 0.046% )
    27.67 cache-miss/faults
    [root cgroup]
    36659599 page-faults # 0.077 M/sec ( +- 0.247% )
    31.58 miss/faults
    [in a child cgroup]
    18444157 page-faults # 0.039 M/sec ( +- 0.133% )
    69.96 miss/faults
    [child with this patch]
    27133719 page-faults # 0.057 M/sec ( +- 0.155% )
    47.16 miss/faults

    We can see some improvement.
    (The root cgroup isn't affected by this patch.)
    Another patch for "charge" will follow this and improve the above further.

    Changelog (since 2009/10/02):
    - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
    - some clean up and commentary/description updates.
    - added initialization code to copy_process(). (possible bug fix)

    Changelog (old):
    - fixed the !CONFIG_MEM_CGROUP case.
    - rebased onto the latest mmotm + softlimit fix patches.
    - unified patch for callers.
    - added comments.
    - made ->do_batch a bool.
    - removed css_get() et al. We don't need it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • AK: Improve comment

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Initial implementation for swapping out KSM's shared pages: add
    page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
    faced with a PageKsm page.

    Most of what's needed can be got from the rmap_items listed from the
    stable_node of the ksm page, without discovering the actual vma: so in
    this patch just fake up a struct vma for page_referenced_one() or
    try_to_unmap_one(), then refine that in the next patch.

    Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
    implicit there (being only set with VM_SHARED, already excluded), but
    let's make it explicit, to help justify the lack of nonlinear unmap.

    Rely on the page lock to protect against concurrent modifications to that
    page's node of the stable tree.

    The awkward part is not swapout but swapin: do_swap_page() and
    page_add_anon_rmap() now have to allow for new possibilities - perhaps a
    ksm page still in swapcache, perhaps a swapcache page associated with one
    location in one anon_vma now needed for another location or anon_vma.
    (And the vma might even be no longer VM_MERGEABLE when that happens.)

    ksm_might_need_to_copy() checks for that case, and supplies a duplicate
    page when necessary, simply leaving it to a subsequent pass of ksmd to
    rediscover the identity and merge them back into one ksm page.
    Disappointingly primitive: but the alternative would have to accumulate
    unswappable info about the swapped out ksm pages, limiting swappability.

    Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
    particular case it was handling, so just use it instead.
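
    In do_swap_page() the awkward swapin case is handled with something like
    (sketch):

        page = ksm_might_need_to_copy(page, vma, address);
        if (!page) {
                ret = VM_FAULT_OOM;
                goto out;
        }

    i.e. if the swapcache page is a KSM page that cannot simply be mapped at
    this address/anon_vma, a freshly copied page is returned instead and ksmd
    may re-merge it later.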

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When do_nonlinear_fault() realizes that the page table must have been
    corrupted for it to have been called, it does print_bad_pte() and returns
    ... VM_FAULT_OOM, which is hard to understand.

    It made some sense when I did it for 2.6.15, when do_page_fault() just
    killed the current process; but nowadays it lets the OOM killer decide who
    to kill - so page table corruption in one process would be liable to kill
    another.

    Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
    the process will be killed, but is good enough for such a rare
    abnormality, accompanied as it is by the "BUG: Bad page map" message.

    And recent HWPOISON work has copied that code into do_swap_page(), when it
    finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
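
    The affected branch of do_swap_page() ends up looking roughly like:

        entry = pte_to_swp_entry(orig_pte);
        if (unlikely(non_swap_entry(entry))) {
                if (is_migration_entry(entry)) {
                        migration_entry_wait(mm, pmd, address);
                } else if (is_hwpoison_entry(entry)) {
                        ret = VM_FAULT_HWPOISON;
                } else {
                        print_bad_pte(vma, address, orig_pte, NULL);
                        ret = VM_FAULT_SIGBUS;  /* was VM_FAULT_OOM */
                }
                goto out;
        }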

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this is used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.
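
    Callers of swap_duplicate() cope with an overflowing count by allocating
    a continuation and retrying; simplified (the real code has to drop the
    page table lock before allocating):

        if (swap_duplicate(entry) < 0) {
                /* count overflowed: add a continuation page, then retry */
                if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
                        return -ENOMEM;
        }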

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Oct, 2009

2 commits

  • * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    HWPOISON: fix invalid page count in printk output
    HWPOISON: Allow schedule_on_each_cpu() from keventd
    HWPOISON: fix /proc/meminfo alignment
    HWPOISON: fix oops on ksm pages
    HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page
    HWPOISON: return early on non-LRU pages
    HWPOISON: Add brief hwpoison description to Documentation
    HWPOISON: Clean up PR_MCE_KILL interface

    Linus Torvalds
     
  • There are some places where we do like:

        pte = pte_map();
        do {
                (do break in some conditions)
        } while (pte++, ...);
        pte_unmap(pte - 1);

    But if the loop breaks on the first iteration, pte_unmap() unmaps an
    invalid pte.

    This patch fixes this problem.
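
    A sketch of the fix pattern: remember the pointer that was actually
    mapped and unmap that one (some_condition() is just a placeholder):

        pte_t *pte, *mapped_pte;

        mapped_pte = pte = pte_offset_map(pmd, addr);
        do {
                if (some_condition(*pte))
                        break;  /* may leave before any increment */
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap(mapped_pte); /* not pte - 1: wrong if we broke out first time */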

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

19 Oct, 2009

1 commit


24 Sep, 2009

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

4 commits

  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
    those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

    Contrary to how I'd imagined it, there's nothing ugly about this, just a
    zero_pfn test built into one or another block of vm_normal_page().

    But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
    my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
    ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
    that: it would have to take vm_flags to weed out some cases.
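
    The generic versions are simple inlines that an architecture like MIPS
    can override (sketch):

        #ifndef is_zero_pfn
        static inline int is_zero_pfn(unsigned long pfn)
        {
                return pfn == zero_pfn;
        }
        #endif

        #ifndef my_zero_pfn
        static inline unsigned long my_zero_pfn(unsigned long addr)
        {
                return zero_pfn;
        }
        #endif

    MIPS supplies its own versions so that the correctly coloured ZERO_PAGE
    is chosen for the faulting address.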

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
    advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
    using in 2.6.24. And there were a couple of regression reports on LKML.

    Following suggestions from Linus, reinstate do_anonymous_page() use of
    the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
    with (map)count updates - let vm_normal_page() regard it as abnormal.

    Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
    most powerpc): that's not essential, but minimizes additional branches
    (keeping them in the unlikely pte_special case); and incidentally
    excludes mips (some models of which needed eight colours of ZERO_PAGE
    to avoid costly exceptions).

    Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
    callers won't want to make exceptions for it, so increment its count
    there. Changes to mlock and migration? happily seems not needed.

    In most places it's quicker to check pfn than struct page address:
    prepare a __read_mostly zero_pfn for that. Does get_dump_page()
    still need its ZERO_PAGE check? probably not, but keep it anyway.
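
    The read-fault side of do_anonymous_page() then becomes roughly:

        if (!(flags & FAULT_FLAG_WRITE)) {
                /* map the zero page, pte_special so that vm_normal_page()
                 * treats it as abnormal and never touches its struct page */
                entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
                                              vma->vm_page_prot));
                ptl = pte_lockptr(mm, pmd);
                spin_lock(ptl);
                if (!pte_none(*page_table))
                        goto unlock;
                goto setpte;
        }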

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins