02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

3 commits

  • If a process creates a large hugetlbfs mapping that is eligible for page
    table sharing and forks heavily with children some of whom fault and
    others which destroy the mapping then it is possible for page tables to
    get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
    output a message to the kernel log. The final teardown will trigger a
    BUG_ON in mm/filemap.c.

    This was reproduced in 3.4 but is known to have existed for a long time
    and goes back at least as far as 2.6.37. It was probably introduced in
    2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
    look like this:

    [ ..........] Lots of bad pmd messages followed by this
    [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
    [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
    [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
    [ 127.186778] ------------[ cut here ]------------
    [ 127.186781] kernel BUG at mm/filemap.c:134!
    [ 127.186782] invalid opcode: 0000 [#1] SMP
    [ 127.186783] CPU 7
    [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
    [ 127.186801]
    [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
    [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
    [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
    [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
    [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
    [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
    [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
    [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
    [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
    [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
    [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
    [ 127.186821] Stack:
    [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
    [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
    [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
    [ 127.186827] Call Trace:
    [ 127.186829] [] delete_from_page_cache+0x3b/0x80
    [ 127.186832] [] truncate_hugepages+0x115/0x220
    [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
    [ 127.186837] [] evict+0xa7/0x1b0
    [ 127.186839] [] iput_final+0xd3/0x1f0
    [ 127.186840] [] iput+0x39/0x50
    [ 127.186842] [] d_kill+0xf8/0x130
    [ 127.186843] [] dput+0xd2/0x1a0
    [ 127.186845] [] __fput+0x170/0x230
    [ 127.186848] [] ? rb_erase+0xce/0x150
    [ 127.186849] [] fput+0x1d/0x30
    [ 127.186851] [] remove_vma+0x37/0x80
    [ 127.186853] [] do_munmap+0x2d2/0x360
    [ 127.186855] [] sys_shmdt+0xc9/0x170
    [ 127.186857] [] system_call_fastpath+0x16/0x1b
    [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
    [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
    [ 127.186870] RSP
    [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

    The bug is a race and not always easy to reproduce. To reproduce it I was
    doing the following on a single socket I7-based machine with 16G of RAM.

    $ hugeadm --pool-pages-max DEFAULT:13G
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
    $ for i in `seq 1 9000`; do ./hugetlbfs-test; done

    On my particular machine, it usually triggers within 10 minutes but
    enabling debug options can change the timing such that it never hits.
    Once the bug is triggered, the machine is in trouble and needs to be
    rebooted. The machine will respond but processes accessing proc like "ps
    aux" will hang due to the BUG_ON. shutdown will also hang and needs a
    hard reset or a sysrq-b.

    The basic problem is a race between page table sharing and teardown. For
    the most part page table sharing depends on i_mmap_mutex. In some cases,
    it is also taking the mm->page_table_lock for the PTE updates but with
    shared page tables, it is the i_mmap_mutex that is more important.

    Unfortunately it appears to also be insufficient. Consider the following
    situation:

    Process A                          Process B
    ---------                          ---------
    hugetlb_fault                      shmdt
                                       LockWrite(mmap_sem)
                                         do_munmap
                                           unmap_region
                                             unmap_vmas
                                               unmap_single_vma
                                                 unmap_hugepage_range
                                                   Lock(i_mmap_mutex)
                                                   Lock(mm->page_table_lock)
                                                   huge_pmd_unshare/unmap tables
                                                   Unlock(mm->page_table_lock)
                                                   Unlock(i_mmap_mutex)
    huge_pte_alloc                     ...
      Lock(i_mmap_mutex)               ...
      vma_prio_walk, find svma, spte   ...
      Lock(mm->page_table_lock)        ...
      share spte                       ...
      Unlock(mm->page_table_lock)      ...
      Unlock(i_mmap_mutex)             ...
    hugetlb_no_page                    ...

    The hugetlbfs-test program used in the reproduction loop above is:

    /* Preamble sketched in so that the program below compiles; the huge page
     * size and the pool split between the two segments are illustrative. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    static size_t huge_page_size = 2UL * 1024 * 1024;
    static size_t nr_huge_page_A = 512;
    static size_t nr_huge_page_B = 5632;

    static void play(void *addr, size_t size)
    {
        unsigned char *start = addr, *end = start + size, *a;

        /* touch every base page so the children fault the shared mapping in */
        for (a = start; a < end; a += 4096)
            *a = 0;
    }

    int main(int argc, char **argv)
    {
        key_t key = IPC_PRIVATE;
        size_t sizeA = nr_huge_page_A * huge_page_size;
        size_t sizeB = nr_huge_page_B * huge_page_size;
        int shmidA, shmidB;
        void *addrA = NULL, *addrB = NULL;
        int nr_children = 300, n = 0;

        if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }
        if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }

    fork_child:
        switch (fork()) {
        case 0:
            switch (n%3) {
            case 0:
                play(addrA, sizeA);
                break;
            case 1:
                play(addrB, sizeB);
                break;
            case 2:
                break;
            }
            break;
        case -1:
            perror("fork:");
            break;
        default:
            if (++n < nr_children)
                goto fork_child;
            play(addrA, sizeA);
            break;
        }
        shmdt(addrA);
        shmdt(addrB);
        do {
            wait(NULL);
        } while (--n > 0);
        shmctl(shmidA, IPC_RMID, NULL);
        shmctl(shmidB, IPC_RMID, NULL);
        return 0;
    }

    [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Call up_read(&mm->mmap_sem) directly since we already have mm via
    current->mm at the beginning of print_vma_addr().

    Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • Use an mmu_gather instead of a temporary linked list for accumulating
    pages when we unmap a hugepage range.
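
    For illustration, the new shape is roughly the following sketch (the
    helper name and the exact __unmap_hugepage_range() signature here are
    assumptions based on this description, not the literal patch):

    void example_unmap_hugepage_range(struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end,
                                      struct page *ref_page)
    {
        struct mm_struct *mm = vma->vm_mm;
        struct mmu_gather tlb;

        /* batch the unmapped pages in an mmu_gather instead of a local list */
        tlb_gather_mmu(&tlb, mm, 0);
        __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
        /* flush the TLB and free the batched pages in one go */
        tlb_finish_mmu(&tlb, start, end);
    }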

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

31 Jul, 2012

1 commit

  • Filesystems wanting to properly support freezing need to have control
    over when file_update_time() is called. After pushing file_update_time()
    to all relevant .page_mkwrite implementations we can just stop calling
    file_update_time() when the filesystem implements .page_mkwrite.
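
    As an illustration, a .page_mkwrite written along these lines might look
    as follows; example_page_mkwrite() and example_get_block() are
    placeholders rather than any real filesystem's code, and the
    sb_start_pagefault()/sb_end_pagefault() helpers come from this freezing
    series:

    /* placeholder get_block_t helper, declared for completeness */
    static int example_get_block(struct inode *inode, sector_t iblock,
                                 struct buffer_head *bh_result, int create);

    static int example_page_mkwrite(struct vm_area_struct *vma,
                                    struct vm_fault *vmf)
    {
        struct inode *inode = vma->vm_file->f_mapping->host;
        int ret;

        sb_start_pagefault(inode->i_sb);
        /* the filesystem now updates the timestamp itself, under freeze
         * protection, instead of relying on the generic fault code */
        file_update_time(vma->vm_file);
        ret = block_page_mkwrite(vma, vmf, example_get_block);
        sb_end_pagefault(inode->i_sb);
        return ret;
    }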

    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

27 Jul, 2012

1 commit

  • Pull x86/mm changes from Peter Anvin:
    "The big change here is the patchset by Alex Shi to use INVLPG to flush
    only the affected pages when we only need to flush a small page range.

    It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
    vectors!) and replaces them with an ordinary IPI function call."
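
    The core idea can be sketched roughly as follows (illustrative only; the
    real logic lives in arch/x86/mm/tlb.c and the threshold is the per-CPU
    tlb_flushall_shift knob, not a fixed constant):

    #define EXAMPLE_FLUSHALL_SHIFT 6    /* assumed threshold for the sketch */

    static void example_flush_tlb_range(unsigned long start, unsigned long end)
    {
        unsigned long addr;
        unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

        if (nr_pages > (1UL << EXAMPLE_FLUSHALL_SHIFT)) {
            local_flush_tlb();              /* too big: flush everything */
            return;
        }
        for (addr = start; addr < end; addr += PAGE_SIZE)
            __flush_tlb_single(addr);       /* INVLPG on this page only */
    }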

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
    to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix build warning and crash when building for !SMP
    x86/tlb: do flush_tlb_kernel_range by 'invlpg'
    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
    x86/tlb: enable tlb flush range support for x86
    mm/mmu_gather: enable tlb flush range in generic mmu_gather
    x86/tlb: add tlb_flushall_shift knob into debugfs
    x86/tlb: add tlb_flushall_shift for specific CPU
    x86/tlb: fall back to flush all when meet a THP large page
    x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
    x86/tlb_info: get last level TLB entry number of CPU
    x86: Add read_mostly declaration/definition to variables from smp.h
    x86: Define early read-mostly per-cpu macros

    Linus Torvalds
     

28 Jun, 2012

1 commit

  • This patch enables TLB flush range support in the generic mmu layer.

    Most architectures have their own TLB flush range support, like ARM/IA64
    etc. x86 has no such support in hardware yet, but the 'invlpg'
    instruction can implement it to some degree. So enable this feature in
    the generic layer for x86 now; it may also be useful for other
    architectures in the future.

    The generic mmu_gather struct is protected by the macro
    HAVE_GENERIC_MMU_GATHER. Other architectures that have their own flush
    range support use their own mmu_gather struct, so this change is safe
    for them.
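
    The idea can be sketched like this (field and helper names are
    illustrative, not the exact patch): the generic mmu_gather remembers the
    lowest and highest address it unmapped so tlb_flush() can flush just
    that range.

    struct example_mmu_gather {
        struct mm_struct *mm;
        unsigned long start;    /* lowest address whose pte was cleared */
        unsigned long end;      /* one past the highest such address */
        /* ... page batches, fullmm flag, etc. ... */
    };

    static inline void example_tlb_track(struct example_mmu_gather *tlb,
                                         unsigned long addr)
    {
        if (addr < tlb->start)
            tlb->start = addr;
        if (addr + PAGE_SIZE > tlb->end)
            tlb->end = addr + PAGE_SIZE;
    }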

    In the future we may unify this struct and related functions across
    architectures.

    Thanks to Peter Zijlstra for his time and repeated reminders about
    keeping the code safe across multiple architectures!

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     

21 Jun, 2012

2 commits


30 May, 2012

2 commits

  • On COW, a new hugepage is allocated and charged to the memcg. If the
    system is oom or the charge to the memcg fails, however, the fault
    handler will return VM_FAULT_OOM which results in an oom kill.

    Instead, it's possible to fall back to splitting the hugepage so that the
    COW results only in an order-0 page being allocated and charged to the
    memcg, which has a higher likelihood of succeeding. This is expensive
    because the hugepage must be split in the page fault handler, but it is
    much better than unnecessarily oom killing a process.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This suggests
    we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Signed-off-by: Rik van Riel
    Cc: Konstantin Khlebnikov
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Bob Picco
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top', or you can get a list of all functions that can be
    probed within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    to it.

    Injecting a probe point is a privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

07 May, 2012

2 commits

  • The VM accounting makes no sense at this level, and half of the callers
    didn't ever actually use the end result. The only time we want to
    unaccount the memory is when we actually remove the vma, so do the
    accounting at that point instead.

    This simplifies the interfaces (no need to pass down that silly page
    counter to functions that really don't care), and also makes it much
    more obvious what is actually going on: we do vm_[un]acct_memory() when
    adding or removing the vma, not on random page walking.
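
    For illustration, the direction this moves in is roughly the following
    (a sketch, not the literal mm/mmap.c diff):

    static struct vm_area_struct *example_remove_vma(struct vm_area_struct *vma)
    {
        struct vm_area_struct *next = vma->vm_next;

        /* unaccount here, where the vma actually goes away */
        if (vma->vm_flags & VM_ACCOUNT)
            vm_unacct_memory(vma_pages(vma));
        /* ... fput(vma->vm_file), mpol_put(), kmem_cache_free() ... */
        return next;
    }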

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • None of the callers want to pass in 'zap_details', and it doesn't even
    make sense for the case of actually unmapping vma's. So remove the
    argument, and clean up the interface.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Apr, 2012

1 commit

  • Uprobes has a callback (uprobe_munmap()) in the unmap path to
    maintain the uprobes count.

    In the exit path this callback gets called in unlink_file_vma().
    However by the time unlink_file_vma() is called, the pages would
    have been unmapped (in unmap_vmas()) and the task->rss_stat counts
    accounted (in zap_pte_range()).

    If the exiting process has probepoints, uprobe_munmap() checks if
    the breakpoint instruction was around before decrementing the probe
    count.

    This results in a file backed page being reread by uprobe_munmap()
    and hence it does not find the breakpoint.

    This patch fixes this problem by moving the callback to
    unmap_single_vma(). Since unmap_single_vma() may not unmap the
    complete vma, add start and end parameters to uprobe_munmap().
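
    A sketch of where the callback now runs (trimmed to the part this patch
    is about; the rest of unmap_single_vma() is elided):

    static void example_unmap_single_vma(struct mmu_gather *tlb,
                                         struct vm_area_struct *vma,
                                         unsigned long start, unsigned long end)
    {
        if (start == end)
            return;

        /* the pages are still mapped here, so the breakpoint check works */
        if (vma->vm_file)
            uprobe_munmap(vma, start, end);

        /* ... untrack_pfn, hugetlb handling, unmap_page_range(...) ... */
    }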

    This bug became apparent courtesy of commit c3f0327f8e9d
    ("mm: add rss counters consistency check").

    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Anton Arapov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

24 Mar, 2012

1 commit

  • The motivation for this patchset was that I was looking at a way for a
    qemu-kvm process to exclude the guest memory from its core dump, which
    can be quite large. There are already a number of filter flags in
    /proc/<pid>/coredump_filter; however, these allow one to specify 'types'
    of kernel memory, not specific address ranges (which is needed in this
    case).

    Since there are no more vma flags available, the first patch eliminates
    the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
    the kernel to mark vdso and vsyscall pages. However, it is simple
    enough to check if a vma covers a vdso or vsyscall page without the need
    for this flag.

    The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
    'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
    'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
    continue to work the same as before unless 'MADV_DONTDUMP' is set on the
    region.

    The qemu code which implements this feature is at:

    http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

    In my testing the qemu core dump shrunk from 383MB -> 13MB with this
    patch.

    I also believe that the 'MADV_DONTDUMP' flag might be useful for
    security sensitive apps, which might want to select which areas are
    dumped.
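
    From userspace the new flags are used like this (a minimal sketch;
    guest_ram and size stand in for whatever region should be excluded):

    #include <stdio.h>
    #include <sys/mman.h>

    static void exclude_from_coredump(void *guest_ram, size_t size)
    {
        if (madvise(guest_ram, size, MADV_DONTDUMP))
            perror("madvise(MADV_DONTDUMP)");
    }

    static void include_in_coredump_again(void *guest_ram, size_t size)
    {
        if (madvise(guest_ram, size, MADV_DODUMP))
            perror("madvise(MADV_DODUMP)");
    }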

    This patch:

    The VM_ALWAYSDUMP flag is currently used by the coredump code to
    indicate that a vma is part of a vsyscall or vdso section. However, we
    can determine if a vma is in one of these sections by checking it against
    the gate_vma and checking for a non-NULL return value from
    arch_vma_name(). This frees up a valuable vma bit.

    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton : (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

5 commits

  • There's no difference between sync_mm_rss() and __sync_task_rss_stat(),
    so fold the latter into the former.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • sync_mm_rss() can only be used for current to avoid race conditions in
    iterating and clearing its per-task counters. Remove the task argument
    for it and its helper function, __sync_task_rss_stat(), to avoid thinking
    it can be used safely for anything other than current.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make get_mm_counter() always static inline; it is simple enough for that.
    Also remove the unused set_mm_counter().

    bloat-o-meter:

    add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
    function                     old     new   delta
    try_to_unmap_one             886     952     +66
    sys_remap_file_pages        1214    1230     +16
    dup_mm                      1684    1700     +16
    do_exit                     2277    2278      +1
    zap_page_range               208     205      -3
    unmap_region                 304     296      -8
    static.oom_kill_process      554     546      -8
    try_to_unmap_file           1716    1700     -16
    getrusage                    925     909     -16
    flush_old_exec              1704    1688     -16
    static.dump_header           416     390     -26
    acct_update_integrals        218     187     -31
    do_task_stat                2986    2954     -32
    get_mm_counter                34       -     -34
    xacct_add_tsk                371     334     -37
    task_statm                   172     118     -54
    task_mem                     383     323     -60

    try_to_unmap_one() grows because update_hiwater_rss() is now completely inlined.
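
    For reference, the always-inline get_mm_counter() is essentially the
    following (a sketch; the real helper in <linux/mm.h> guards the
    negative-value clamp behind the split-RSS-counting config):

    static inline unsigned long example_get_mm_counter(struct mm_struct *mm,
                                                       int member)
    {
        long val = atomic_long_read(&mm->rss_stat.count[member]);

        /* per-task deltas may not be folded in yet, so the sum can dip below 0 */
        if (val < 0)
            val = 0;
        return (unsigned long)val;
    }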

    Signed-off-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad(), which does not expect to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem: khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code review
    it seems vm86 mode on 32bit kernels requires that too, unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd become a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
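
    The fix itself boils down to a helper along these lines (a simplified
    rendition of pmd_none_or_trans_huge_or_clear_bad(); the real version in
    asm-generic/pgtable.h compiles the barrier in only for THP builds):

    static inline int example_pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
        pmd_t pmdval = *pmd;

        barrier();      /* all checks below operate on this one snapshot */
        if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
            return 1;   /* caller must not walk into zap_pte_range() */
        if (unlikely(pmd_bad(pmdval))) {
            pmd_clear_bad(pmd);
            return 1;
        }
        return 0;
    }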

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                 virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
    // Acquire the semaphore in shared mode.
    down_read(&current->mm->mmap_sem)
    ...
    madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
    madvise_dontneed
    zap_page_range
    unmap_vmas
    unmap_page_range
    zap_pud_range
    zap_pmd_range
    //
    // Assume that this huge page has never been accessed.
    // I.e. content of the PMD entry is zero (not mapped).
    //
    if (pmd_trans_huge(*pmd)) {
    // We don't get here due to the above assumption.
    }
    //
    // Assume that Thread B incurred a page fault and
    .---------> // sneaks in here as shown below.
    | //
    | if (pmd_none_or_clear_bad(pmd))
    | {
    | if (unlikely(pmd_bad(*pmd)))
    | pmd_clear_bad
    | {
    | pmd_ERROR
    | // Log "bad pmd ..." message here.
    | pmd_clear
    | // Clear the page's PMD entry.
    | // Thread B incremented the map count
    | // in page_add_new_anon_rmap(), but
    | // now the page is no longer mapped
    | // by a PMD entry (-> inconsistency).
    | }
    | }
    |
    v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
    __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
    // We get here due to the above assumption (PMD entry is zero).
    do_huge_pmd_anonymous_page
    alloc_hugepage_vma
    // Allocate a new transparent huge page here.
    ...
    __do_huge_pmd_anonymous_page
    ...
    spin_lock(&mm->page_table_lock)
    ...
    page_add_new_anon_rmap
    // Here we increment the page's map count (starts at -1).
    atomic_set(&page->_mapcount, 0)
    set_pmd_at
    // Here we set the page's PMD entry which will be cleared
    // when Thread A calls pmd_clear_bad().
    ...
    spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Pull munmap/truncate race fixes from Al Viro:
    "Fixes for racy use of unmap_vmas() on truncate-related codepaths"

    * 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VM: make zap_page_range() callers that act on a single VMA use separate helper
    VM: make unmap_vmas() return void
    VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap()
    VM: make zap_page_range() return void
    VM: can't go through the inner loop in unmap_vmas() more than once...
    VM: unmap_page_range() can return void

    Linus Torvalds
     

21 Mar, 2012

5 commits


20 Mar, 2012

1 commit


24 Jan, 2012

1 commit

  • Memory migration fills a pte with a migration entry and it doesn't
    update the rss counters. Then it replaces the migration entry with the
    new page (or the old one if migration failed). But between these two
    passes this pte can be unmapped, or a task can fork a child and it will
    get a copy of this migration entry. Nobody accounts for this in the rss
    counters.

    This patch properly adjusts the rss counters for migration entries in
    zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic
    operations on the migration fast-path.
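
    A sketch of the accounting added on the zap side (simplified; the real
    change also mirrors this in copy_one_pte() for the fork case, and the
    helper name here is illustrative):

    static void example_account_nonpresent_pte(pte_t ptent, int *rss)
    {
        swp_entry_t entry = pte_to_swp_entry(ptent);

        if (is_migration_entry(entry)) {
            struct page *page = migration_entry_to_page(entry);

            /* the migration entry still pins a page; drop its rss share */
            if (PageAnon(page))
                rss[MM_ANONPAGES]--;
            else
                rss[MM_FILEPAGES]--;
        }
        /* ... then free_swap_and_cache(entry) as before ... */
    }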

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

1 commit

  • We have tlb_remove_tlb_entry to indicate that a pte tlb entry should be
    flushed, but no corresponding API for a pmd entry. This isn't a
    problem so far because THP is only for x86 currently and tlb_flush()
    under x86 will flush the entire TLB. But this is confusing and could be
    missed if THP is ported to another arch.

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page() as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry(), so we can catch any misuse.
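
    A sketch of the intended usage, assuming the new hook is named
    tlb_remove_pmd_tlb_entry() to mirror tlb_remove_tlb_entry() (the
    surrounding code is simplified, not the literal huge_memory.c):

    static void example_zap_huge_pmd(struct mmu_gather *tlb,
                                     struct vm_area_struct *vma,
                                     pmd_t *pmd, unsigned long addr)
    {
        pmd_t orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);

        /* record that a pmd-level TLB entry must be flushed */
        tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

        /* ... release the huge page that orig_pmd mapped ... */
        (void)orig_pmd;
        (void)vma;
    }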

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
    wasn't safe if the pfn ended up being a tail page of a transparent
    hugepage under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() in SMP, if the radix tree page is freed
    and reallocated and get_user_pages is called on it before
    page_cache_get_speculative has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed get_page
    is called by direct-io.c on pages returned by get_user_pages. That wasn't
    entirely safe because the two atomic_inc operations in get_page weren't
    atomic with respect to each other. Other get_user_pages users, like a
    secondary-MMU page fault establishing shadow pagetables, would never call
    any superfluous get_page after get_user_pages returns. It's safer to make
    get_page universally safe for tail pages and to use get_page_foll()
    within follow_page (inside get_user_pages()). get_page_foll() is safe to
    do the refcounting for tail pages without taking any locks because it is
    run within PT lock protected critical sections (PT lock for pte and
    page_table_lock for pmd_trans_huge).

    The standard get_page(), as invoked by direct-io, will instead now take
    the compound_lock, but still only for tail pages. The direct-io paths
    are usually I/O bound and the compound_lock is per THP so very
    fine-grained, so there's no risk of scalability issues with it. A simple
    direct-io benchmark with all lockdep prove-locking and spinlock
    debugging infrastructure enabled shows identical performance and no
    overhead. So it's worth it. Ideally direct-io should stop calling
    get_page() on pages returned by get_user_pages(). The spinlock in
    get_page() is already optimized away for no-THP builds, but doing
    get_page() on tail pages returned by GUP is generally a rare operation
    and usually only run in I/O paths.

    This new refcounting on page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with the tail page
    refcounting with THP.
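
    The gist of get_page_foll() can be sketched as follows (illustrative
    only; the real helper in mm/internal.h carries extra VM_BUG_ON checks
    that are omitted here):

    static inline void example_get_page_foll(struct page *page)
    {
        if (unlikely(PageTail(page))) {
            /* under the PT lock: tail references live in _mapcount,
             * while the head page keeps the real _count */
            atomic_inc(&page->first_page->_count);
            atomic_inc(&page->_mapcount);
        } else {
            atomic_inc(&page->_count);
        }
    }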

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

31 Oct, 2011

1 commit


26 Jul, 2011

3 commits

  • I haven't reproduced it myself but the fail scenario is that on such
    machines (notably ARM and some embedded powerpc), if you manage to hit
    that futex path on a writable page whose dirty bit has gone from the PTE,
    you'll livelock inside the kernel from what I can tell.

    It will go in a loop of trying the atomic access, failing, trying gup to
    "fix it up", getting success from gup, going back to the atomic access,
    failing again because dirty wasn't fixed, etc...

    So I think you essentially hang in the kernel.

    The scenario is probably rare-ish because the affected architectures are
    embedded and tend to not swap much (if at all), so we probably rarely hit
    the case where dirty is missing or young is missing, but I think Shan has
    a piece of SW that can reliably reproduce it using a shared writable
    mapping & fork or something like that.

    On archs which use SW tracking of dirty & young, a page without dirty is
    effectively mapped read-only and a page without young is inaccessible in
    the PTE.

    Additionally, some architectures might lazily flush the TLB when relaxing
    write protection (by doing only a local flush), and expect a fault to
    invalidate the stale entry if it's still present on another processor.

    The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
    "fix it up" by causing get_user_pages() which would then be equivalent to
    taking the fault.

    However that isn't the case. get_user_pages() will not call
    handle_mm_fault() in the case where the PTE seems to have the right
    permissions, regardless of the dirty and young state. It will eventually
    update those bits ... in the struct page, but not in the PTE.

    Additionally, it will not handle the lazy TLB flushing that can be
    required by some architectures in the fault case.

    Basically, gup is the wrong interface for the job. The patch provides a
    more appropriate one which boils down to just calling handle_mm_fault()
    since what we are trying to do is simulate a real page fault.

    The futex code currently attempts to write to user memory within a
    pagefault disabled section, and if that fails, tries to fix it up using
    get_user_pages().

    This doesn't work on archs where the dirty and young bits are maintained
    by software, since they will gate access permission in the TLB, and will
    not be updated by gup().

    In addition, there's an expectation on some archs that a spurious write
    fault triggers a local TLB flush, and that is missing from the picture as
    well.

    I decided that adding those "features" to gup() would be too much for this
    already too complex function, and instead added a new simpler
    fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
    which the futex code can call.
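
    The futex side then reduces to something like the following (a sketch of
    the usage described above; only fixup_user_fault() is the new interface,
    the wrapper name is illustrative):

    static int example_fault_in_user_writeable(u32 __user *uaddr)
    {
        struct mm_struct *mm = current->mm;
        int ret;

        down_read(&mm->mmap_sem);
        /* simulate a real write fault: updates dirty/young and does any
         * lazy TLB flushing the architecture needs */
        ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
                               FAULT_FLAG_WRITE);
        up_read(&mm->mmap_sem);

        return ret < 0 ? ret : 0;
    }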

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
    Signed-off-by: Benjamin Herrenschmidt
    Reported-by: Shan Hai
    Tested-by: Shan Hai
    Cc: David Laight
    Acked-by: Peter Zijlstra
    Cc: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Currently we are keeping the faulted page locked throughout the whole
    __do_fault call (except for the page_mkwrite code path) after calling the
    file system's fault code. If we do early COW, we allocate a new page
    which has to be charged to a memcg (mem_cgroup_newpage_charge).

    This function, however, might block for unbounded amount of time if memcg
    oom killer is disabled or fork-bomb is running because the only way out of
    the OOM situation is either an external event or OOM-situation fix.

    In the end we are keeping the faulted page locked and blocking other
    processes from faulting it in which is not good at all because we are
    basically punishing potentially an unrelated process for OOM condition in
    a different group (I have seen stuck system because of ld-2.11.1.so being
    locked).

    We can test this easily.

    % cgcreate -g memory:A
    % cgset -r memory.limit_in_bytes=64M A
    % cgset -r memory.memsw.limit_in_bytes=64M A
    % cd kernel_dir; cgexec -g memory:A make -j

    Then, the whole system will be live-locked until you kill 'make -j'
    by hand (or push reboot...). This is because some important pages in a
    shared library are locked.

    Thinking about it again, the new page does not need to be allocated
    with lock_page() held. And the usual page allocation may dive into a
    long memory-reclaim loop while holding lock_page(), which can cause
    very long latency.

    There are 3 ways.
    1. do allocation/charge before lock_page()
    Pros. - simple and can handle page allocation in the same manner.
    This will reduce holding time of lock_page() in general.
    Cons. - we do page allocation even if ->fault() returns error.

    2. do charge after unlock_page(). Even if charge fails, it's just OOM.
    Pros. - no impact to non-memcg path.
    Cons. - implementation requires special care of LRU and we need to modify
    page_add_new_anon_rmap()...

    3. do unlock->charge->lock again method.
    Pros. - no impact to non-memcg path.
    Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to release
    lock and get it again...

    This patch moves the "charge" and memory allocation for the COW page
    before lock_page(). Then, we can avoid scanning the LRU while holding
    a lock on a page, and latency under lock_page() will be reduced.

    Then, the above livelock disappears.
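
    The reordering amounts to roughly this shape in the fault path (an
    illustrative fragment, not the literal mm/memory.c diff; the local
    variable names are placeholders):

    /* allocate and charge the COW page up front, before ->fault() hands us
     * a locked page, so memcg reclaim never runs under lock_page() */
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
        cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
        if (!cow_page)
            return VM_FAULT_OOM;
        if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
            page_cache_release(cow_page);
            return VM_FAULT_OOM;
        }
    }
    ret = vma->vm_ops->fault(vma, &vmf);    /* the page comes back locked */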

    [akpm@linux-foundation.org: fix code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Lutz Vieweg
    Original-idea-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • ZAP_BLOCK_SIZE became unused in the preemptible-mmu_gather work ("mm:
    Remove i_mmap_lock lockbreak"). So zap it.

    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Jul, 2011

1 commit


28 Jun, 2011

1 commit


16 Jun, 2011

2 commits

  • Running a ktest.pl test, I hit the following bug on x86_32:

    ------------[ cut here ]------------
    WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
    Hardware name:
    Modules linked in:
    Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x91
    [] ? __kunmap_atomic+0x64/0xc1
    [] ? __kunmap_atomic+0x64/0xc1^M
    [] warn_slowpath_null+0x22/0x24
    [] __kunmap_atomic+0x64/0xc1
    [] unmap_vmas+0x43a/0x4e0
    [] exit_mmap+0x91/0xd2
    [] mmput+0x43/0xad
    [] exit_mm+0x111/0x119
    [] do_exit+0x1ff/0x5fa
    [] ? set_current_blocked+0x3c/0x40
    [] ? sigprocmask+0x7e/0x8e
    [] do_group_exit+0x65/0x88
    [] sys_exit_group+0x18/0x1c
    [] sysenter_do_call+0x12/0x38
    ---[ end trace 8055f74ea3c0eb62 ]---

    Running a ktest.pl git bisect, found the culprit: commit e303297e6c3a
    ("mm: extended batches for generic mmu_gather")

    But although this was the commit triggering the bug, it was not the one
    originally responsible for the bug. That was commit d16dfc550f53 ("mm:
    mmu_gather rework").

    The code in zap_pte_range() has something that looks like the following:

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    do {
            [...]
    } while (pte++, addr += PAGE_SIZE, addr != end);
    pte_unmap_unlock(pte - 1, ptl);

    The pte starts off pointing at the first element in the page table
    directory that was returned by the pte_offset_map_lock(). When it's done
    with the page, pte will be pointing to anything between the next entry and
    the first entry of the next page inclusive. By doing a pte - 1, this puts
    the pte back onto the original page, which is all that pte_unmap_unlock()
    needs.

    In most archs (64 bit), this is not an issue as the pte is ignored in the
    pte_unmap_unlock(). But on 32 bit archs, where things may be kmapped, it
    is essential that the pte passed to pte_unmap_unlock() resides on the same
    page that was given by pte_offset_map_lock().

    The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
    a "break;" from the while loop. This alone did not seem to easily trigger
    the bug. But the modifications made by e303297e6 caused that "break;" to
    be hit on the first iteration, before the pte++.

    The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
    be pointing to the previous page. This will cause the wrong page to be
    unmapped, and also trigger the warning above.

    The simple solution is to just save the pointer given by
    pte_offset_map_lock() and use it in the unlock.
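
    In code, the fix amounts to the following shape (a sketch of the
    described change):

    pte_t *start_pte, *pte;

    start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    pte = start_pte;
    do {
        /* ... may 'break' out before pte is ever incremented ... */
    } while (pte++, addr += PAGE_SIZE, addr != end);
    /* unlock/unmap with the saved pointer, which is guaranteed to be on
     * the kmapped page returned by pte_offset_map_lock() */
    pte_unmap_unlock(start_pte, ptl);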

    Signed-off-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Fix new kernel-doc warnings in mm/memory.c:

    Warning(mm/memory.c:1327): No description found for parameter 'tlb'
    Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap