27 Apr, 2012

1 commit

  • Merge fixes from Andrew Morton:
    "13 fixes. The acerhdf patches aren't (really) fixes. But they've
    been stuck in my tree for up to two years, sent to Matthew multiple
    times and the developers are unhappy."

    * emailed from Andrew Morton: (13 patches)
    mm: fix NULL ptr dereference in move_pages
    mm: fix NULL ptr dereference in migrate_pages
    revert "proc: clear_refs: do not clear reserved pages"
    drivers/rtc/rtc-ds1307.c: fix BUG shown with lock debugging enabled
    arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file
    hugetlbfs: lockdep annotate root inode properly
    acerhdf: lowered default temp fanon/fanoff values
    acerhdf: add support for new hardware
    acerhdf: add support for Aspire 1410 BIOS v1.3314
    fs/buffer.c: remove BUG() in possible but rare condition
    mm: fix up the vmscan stat in vmstat
    epoll: clear the tfile_check_list on -ELOOP
    mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

    Linus Torvalds
     

26 Apr, 2012

6 commits

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it would get dereferenced anyway by mmput()ing it.
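
    A minimal sketch of the fixed pattern (not the literal diff; the error
    value is illustrative): bail out before the put rather than jumping to
    a cleanup label that still calls mmput(mm).

    mm = get_task_mm(task);
    if (!mm)
            return -EINVAL;     /* previously fell through to mmput(mm) */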

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it would get dereferenced anyway by mmput()ing it.

    This would lead to the following NULL ptr deref and BUG() when calling
    migrate_pages() with a pid that has no mm struct:

    [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    [25904.194235] IP: [] mmput+0x27/0xf0
    [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
    [25904.194235] Oops: 0002 [#1] PREEMPT SMP
    [25904.194235] CPU 2
    [25904.194235] Pid: 31608, comm: trinity Tainted: G W 3.4.0-rc2-next-20120412-sasha #69
    [25904.194235] RIP: 0010:[] [] mmput+0x27/0xf0
    [25904.194235] RSP: 0018:ffff880077d49e08 EFLAGS: 00010202
    [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
    [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
    [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
    [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
    [25904.194235] FS: 00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
    [25904.194235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
    [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
    [25904.194235] Stack:
    [25904.194235] ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
    [25904.194235] ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
    [25904.194235] 00000000000003ff 0000000000000000 0000000000000000 0000000000000000
    [25904.194235] Call Trace:
    [25904.194235] [] sys_migrate_pages+0x340/0x3a0
    [25904.194235] [] ? sys_migrate_pages+0xb1/0x3a0
    [25904.194235] [] system_call_fastpath+0x16/0x1b
    [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
    [25904.194235] RIP [] mmput+0x27/0xf0
    [25904.194235] RSP
    [25904.194235] CR2: 0000000000000050
    [25904.348999] ---[ end trace a307b3ed40206b4b ]---

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • The "pgsteal" stat is confusing because it counts both direct reclaim as
    well as background reclaim. However, we have "kswapd_steal" which also
    counts background reclaim value.

    This patch fixes it and also makes it match the existing "pgscan_" stats.

    Test:
    pgsteal_kswapd_dma32 447623
    pgsteal_kswapd_normal 42272677
    pgsteal_kswapd_movable 0
    pgsteal_direct_dma32 2801
    pgsteal_direct_normal 44353270
    pgsteal_direct_movable 0

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Dan Magenheimer
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
    large amounts of memory barrier related damage v3")

    Local variable "page" can be uninitialized if the nodemask from the vma
    policy does not intersect with the nodemask from the cpuset. Even if
    that never happens, it is better to initialize this variable explicitly
    than to introduce a kernel oops in a weird corner case.

    mm/hugetlb.c: In function `alloc_huge_page':
    mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
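
    A sketch of why the explicit initializer matters (the loop shape is
    illustrative, not the exact alloc_huge_page() code):

    struct page *page = NULL;   /* may never be assigned below */

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(gfp_mask), nodemask) {
            /* iterates zero times if the cpuset and vma nodemasks
             * don't intersect, leaving 'page' untouched */
            page = dequeue_huge_page_node(h, zone_to_nid(zone));
            if (page)
                    break;
    }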

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • None of the callsites actually need the page_cgroup descriptor
    themselves, so just pass the page and do the look up in there.

    We already had two bugs (6568d4a 'mm: memcg: update the correct soft
    limit tree during migration' and 'memcg: fix Bad page state after
    replace_page_cache') where the passed page and pc were not referring
    to the same page frame.

    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The comments above __alloc_bootmem_node() claim that the code will
    first try the allocation using 'goal' and if that fails it will
    try again but with the 'goal' requirement dropped.

    Unfortunately, this is not what the code does, so fix it to do so.

    This is important for nobootmem conversions to architectures such
    as sparc where MAX_DMA_ADDRESS is infinity.

    On such architectures all of the allocations done by generic spots,
    such as the sparse-vmemmap implementation, will pass in:

    __pa(MAX_DMA_ADDRESS)

    as the goal, and with the limit given as "-1" this will always fail
    unless we add the appropriate fallback logic here.
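
    A hedged sketch of the fallback being added; alloc_region() is a
    hypothetical stand-in for the underlying bootmem allocation:

    ptr = alloc_region(pgdat, size, align, goal, limit);
    if (ptr)
            return ptr;

    /* retry with the 'goal' requirement dropped, as the comments promise */
    return alloc_region(pgdat, size, align, 0, limit);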

    Signed-off-by: David S. Miller
    Acked-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    David Miller
     

24 Apr, 2012

1 commit

  • Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
    3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
    tries to transfer dirty flag from s390 storage key to struct page and
    radix_tree.

    That would be because of reclaim's shrink_page_list() calling
    add_to_swap() on this page at the same time: first PageSwapCache is set
    (causing page_mapping(page) to appear as &swapper_space), then
    page->private set, then tree_lock taken, then page inserted into
    radix_tree - so there's an interval before taking the lock when the
    radix_tree slot is empty.

    We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
    before the SetPageSwapCache. But a better fix is simply to do what's
    five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
    (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
    overhead, and swap is just the same - it ignores the radix_tree tag, and
    does not participate in dirty page accounting, so should be using
    __set_page_dirty_no_writeback() too.

    s390 testing now confirms that this does indeed fix the problem.
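
    The change amounts to rewiring swap's address_space operations, roughly
    (sketch of mm/swap_state.c; other ops unchanged):

    static const struct address_space_operations swap_aops = {
            .writepage      = swap_writepage,
            .set_page_dirty = __set_page_dirty_no_writeback,
            /* ... */
    };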

    Reported-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rik van Riel
    Cc: Ken Chen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Apr, 2012

5 commits

  • it's always current->mm

    Signed-off-by: Al Viro

    Al Viro
     
  • This continues the theme started with vm_brk() and vm_munmap():
    vm_mmap() does the same thing as do_mmap(), but additionally does the
    required VM locking.

    This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
    duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
    to export our internal do_mmap_pgoff() function.

    Some day we hopefully don't have to export do_mmap() either, if all
    modular users can become the simpler vm_mmap() instead. We're actually
    very close to that already, with the notable exception of the (broken)
    use in i810, and a couple of stragglers in binfmt_elf.
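
    The wrapper pattern shared by vm_brk(), vm_munmap() and vm_mmap() looks
    roughly like this (sketch; the argument sanity checks are elided):

    unsigned long vm_mmap(struct file *file, unsigned long addr,
                          unsigned long len, unsigned long prot,
                          unsigned long flag, unsigned long offset)
    {
            unsigned long ret;
            struct mm_struct *mm = current->mm;

            down_write(&mm->mmap_sem);
            ret = do_mmap(file, addr, len, prot, flag, offset);
            up_write(&mm->mmap_sem);

            return ret;
    }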

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Like the vm_brk() function, this is the same as "do_munmap()", except it
    does the VM locking for the caller.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It does the same thing as "do_brk()", except it handles the VM locking
    too.

    It turns out that all external callers want that anyway, so we can make
    do_brk() static to just mm/mmap.c while at it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 24aa07882b ("memblock, x86: Replace memblock_x86_reserve/
    free_range() with generic ones") replaced x86 specific memblock
    operations with the generic ones; unfortunately, it lost zero length
    operation handling in the process making the kernel panic if somebody
    tries to reserve zero length area.

    There isn't much to be gained by being cranky about zero length
    operations, and panicking is almost the worst response. Drop the
    BUG_ON() in memblock_reserve() and update
    memblock_add_region/isolate_range() so that all zero length operations
    are handled as noops.
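
    The shape of the noop handling (sketch, not the literal diff):

    static int memblock_add_region(struct memblock_type *type,
                                   phys_addr_t base, phys_addr_t size)
    {
            if (!size)
                    return 0;       /* zero-length: nothing to do */
            /* ... normal range insertion follows ... */
    }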

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reported-by: Valere Monseur
    Bisected-by: Joseph Freeman
    Tested-by: Joseph Freeman
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

19 Apr, 2012

1 commit

  • My 9ce70c0240d0 "memcg: fix deadlock by inverting lrucare nesting" put a
    nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(),
    sometimes used for FUSE. Replacing __mem_cgroup_commit_charge_lrucare()
    by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier:
    but it's for oldpage, and needs now to be for newpage. Once oldpage was
    freed, its PageCgroupUsed bit (cleared above but set again here) caused
    "Bad page state" messages - and perhaps worse, being missed from newpage.
    (I didn't find this by using FUSE, but in reusing the function for tmpfs.)
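
    The essence of the fix, roughly (the commit-charge call's arguments are
    abbreviated here):

    -   /* pc still refers to oldpage's page_cgroup */
    +   pc = lookup_page_cgroup(newpage);
        __mem_cgroup_commit_charge(memcg, newpage, ..., pc, ...);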

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [v3.3 only]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Apr, 2012

4 commits

  • This reverts commit c38446cc65e1f2b3eb8630c53943b94c4f65f670.

    Before the commit, the code made sense to me, but not after it. The
    "nr_reclaimed" is the number of pages reclaimed by scanning through the
    memcg's lru lists. The "nr_to_reclaim" is the target value for the
    whole function: for example, we want to break out of reclaim early once
    32 pages have been reclaimed under direct reclaim (when not at
    DEF_PRIORITY).

    With the reverted commit applied, the target "nr_to_reclaim" was
    decremented each time by "nr_reclaimed", yet still used to compare
    against "nr_reclaimed". It just doesn't make sense to me.
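
    For reference, the comparison as it reads after the revert (sketch):

    /* break out of the lru scan loop early under direct reclaim
     * once the fixed target has been met */
    if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
            break;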

    Signed-off-by: Ying Han
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The race is as follows:

    Suppose a multi-threaded task forks a new process (on cpu A), thus
    bumping up the ref count on all the pages. While the fork is occurring
    (and thus we have marked all the PTEs as read-only), another thread in
    the original process (on cpu B) tries to write to a huge page, taking an
    access violation from the write-protect and calling hugetlb_cow(). Now,
    suppose the fork() fails. It will undo the COW and decrement the ref
    count on the pages, so the ref count on the huge page drops back to 1.
    Meanwhile hugetlb_cow() also decrements the ref count by one on the
    original page, since the original address space doesn't need it any
    more, having copied a new page to replace the original page. This
    leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                          fault on CPU B
    =============                          ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
      if error
        break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);           ...
                                           down_read(&parent->mmap_sem);
                                           ...
                                           lock_page(page);
                                           handle COW
                                           page_mapcount(old_page) == 2
                                           alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
    ...
                                           fold new_page into pte
                                           page_remove_rmap(page);
                                           put_page(page);
                                           ...
                                  oops ==> unlock_page(page);
                                           up_read(&parent->mmap_sem);

    The solution is to take an extra reference to the page while we are
    holding the lock on it.
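
    A sketch of that extra reference in hugetlb_cow() (shape only):

    get_page(old_page);     /* extra ref: a failing fork() can no longer
                             * drop the last reference under us */
    /* ... COW handling, other refs may be dropped here ... */
    unlock_page(old_page);  /* now safe: not the final put */
    put_page(old_page);     /* release our extra reference */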

    Signed-off-by: Chris Metcalf
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • We should use the accessor res_counter_read_u64 for that.

    Although a purely cosmetic change is sometimes better delayed, to avoid
    conflicting with other people's work, we are starting to have people
    touching this code as well, and reproducing the open-coded behavior
    because that's the standard =)

    Time to fix it, then.
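
    The change is essentially (sketch):

    -   val = counter->usage;
    +   val = res_counter_read_u64(counter, RES_USAGE);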

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • action != CPU_DEAD || action != CPU_DEAD_FROZEN is always true.
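
    The fix is the obvious one (sketch of the shape):

    -   if (action != CPU_DEAD || action != CPU_DEAD_FROZEN)
    +   if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
                return NOTIFY_OK;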

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 Mar, 2012

10 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton: (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with
    brand-new radix-tree direct iterating. This avoids the double-scanning
    and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row; it
    continues the lookup to the end of the radix-tree. Thus we can safely
    remove these restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0. This
    corner case does not fit into the new code, so the patch adds an extra
    check at the beginning.
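
    The new iteration style looks roughly like this (sketch; retry and
    exceptional-entry handling elided):

    struct radix_tree_iter iter;
    void **slot;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            struct page *page = radix_tree_deref_slot(slot);
            /* ... speculative page_cache_get, retry handling ... */
    }
    rcu_read_unlock();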

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs asking to drain per-cpu pages in case
    of severe memory pressure that leads to OOM. In these cases multiple,
    possibly concurrent, allocation requests end up in the direct reclaim
    code path, so the per-cpu pages end up reclaimed on the first
    allocation failure. For most of the subsequent allocation attempts,
    until the memory pressure is off (possibly via the OOM killer), there
    are no per-cpu pages left on most CPUs (and there can easily be
    hundreds of them).

    This also has the side effect of shortening the average latency of direct
    reclaim by 1 or more order of magnitude since waiting for all the CPUs to
    ACK the IPI takes a long time.
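
    The selective drain looks roughly like this (sketch; a single zone is
    shown, while the patch checks every zone per CPU):

    static cpumask_t cpus_with_pcps;
    int cpu;

    for_each_online_cpu(cpu) {
            struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);

            if (pcp->pcp.count)
                    cpumask_set_cpu(cpu, &cpus_with_pcps);
            else
                    cpumask_clear_cpu(cpu, &cpus_with_pcps);
    }
    on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);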

    Tested by running "hackbench 400" on an 8 CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than 1/2 of the online CPUs had
    any per-cpu page in them, using the vmstat counters introduced in the
    next patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global IPIs
    after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path, as a patch from Mel Gorman currently suggests, this should be
    replaced with an on_each_cpu_cond invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache
    being destroyed dynamically ends up sending an IPI to each CPU in the
    system, regardless of whether the cache has ever been used there.

    For example, if you close the Infiniband ipath driver char device file,
    the close file ops calls kmem_cache_destroy(). So running some
    infiniband config tool on a single CPU dedicated to system tasks might
    interrupt the other 127 CPUs dedicated to some CPU intensive or latency
    sensitive task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch attempts to rectify this issue by sending an IPI to flush the
    per cpu objects back to the free lists only to CPUs that seem to have such
    objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking
    a CPU without per cpu objects to flush does no damage, and as far as I
    can tell the flush_all by itself is racy against allocs on remote CPUs
    anyway, so if you required the flush_all to be deterministic, you had
    to arrange for locking regardless.
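
    A sketch of the conditional flush (close to the shape described):

    static bool has_cpu_slab(int cpu, void *info)
    {
            struct kmem_cache *s = info;
            struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

            return c->page || c->partial;
    }

    static void flush_all(struct kmem_cache *s)
    {
            on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
    }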

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The code path of memory allocation failure for the CPUMASK_OFFSTACK=y
    config was tested using the fault injection framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
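
    The up-front validation amounts to (sketch; SWAP_FLAGS_VALID collects
    the currently-known flag bits):

    if (swap_flags & ~SWAP_FLAGS_VALID)
            return -EINVAL;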

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The size of coredump files is limited by RLIMIT_CORE, however, allocating
    large amounts of memory results in three negative consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.
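
    In the allocator slowpath, after reclaim has made no progress, the
    check looks roughly like this (sketch):

    /* coredumps can quickly deplete all memory reserves */
    if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL))
            goto nopage;    /* fail instead of retrying or OOM killing */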

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pmd_trans_unstable() should be called before pte_offset_map() in the
    locations where the mmap_sem is held for reading.
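
    The ordering in question, roughly (sketch of a reader-side walker):

    if (pmd_trans_unstable(pmd))
            return 0;       /* THP may be splitting under us */
    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);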

    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Ulrich Obergfell
    Cc: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().
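
    The helper looks roughly like this (sketch; the page-size rounding of
    the unmap boundaries is elided):

    void truncate_pagecache_range(struct inode *inode,
                                  loff_t lstart, loff_t lend)
    {
            struct address_space *mapping = inode->i_mapping;

            unmap_mapping_range(mapping, lstart, 1 + lend - lstart, 0);
            truncate_inode_pages_range(mapping, lstart, lend);
    }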

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull SLAB changes from Pekka Enberg:
    "There's the new kmalloc_array() API, minor fixes and performance
    improvements, but quite honestly, nothing terribly exciting."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: SLAB Out-of-memory diagnostics
    slab: introduce kmalloc_array()
    slub: per cpu partial statistics change
    slub: include include for prefetch
    slub: Do not hold slub_lock when calling sysfs_slab_add()
    slub: prefetch next freelist pointer in slab_alloc()
    slab, cleanup: remove unneeded return
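
    For reference, the new kmalloc_array() listed above is an
    overflow-checked counterpart to kmalloc(n * size); a usage sketch
    ('struct foo' and 'n' are illustrative):

    struct foo *arr = kmalloc_array(n, sizeof(*arr), GFP_KERNEL);
    if (!arr)
            return -ENOMEM;     /* also NULL if n * size would overflow */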

    Linus Torvalds
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended up dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

25 Mar, 2012

1 commit


24 Mar, 2012

4 commits

  • Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
    for the 'VM_NODUMP' flag. The idea is to add a new madvise() flag:
    MADV_DONTDUMP, which can be set by applications to specifically request
    memory regions which should not dump core.

    The specific application I have in mind is qemu: we can add a flag there
    that wouldn't dump all of guest memory when qemu dumps core. This flag
    might also be useful for security sensitive apps that want to absolutely
    make sure that parts of memory are not dumped. To clear the flag use:
    MADV_DODUMP.
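
    Userspace usage sketch ('guest_mem'/'guest_size' are illustrative):

    if (madvise(guest_mem, guest_size, MADV_DONTDUMP))
            perror("madvise(MADV_DONTDUMP)");
    /* ... and to include the region in dumps again: */
    madvise(guest_mem, guest_size, MADV_DODUMP);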

    [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
    [akpm@linux-foundation.org: fix up the architectures which broke]
    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • The motivation for this patchset was that I was looking at a way for a
    qemu-kvm process, to exclude the guest memory from its core dump, which
    can be quite large. There are already a number of filter flags in
    /proc/<pid>/coredump_filter; however, these allow one to specify 'types'
    of kernel memory, not specific address ranges (which is needed in this
    case).

    Since there are no more vma flags available, the first patch eliminates
    the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
    the kernel to mark vdso and vsyscall pages. However, it is simple
    enough to check if a vma covers a vdso or vsyscall page without the need
    for this flag.

    The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
    'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
    'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
    continue to work the same as before unless 'MADV_DONTDUMP' is set on the
    region.

    The qemu code which implements this features is at:

    http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

    In my testing the qemu core dump shrank from 383MB -> 13MB with this
    patch.

    I also believe that the 'MADV_DONTDUMP' flag might be useful for
    security sensitive apps, which might want to select which areas are
    dumped.

    This patch:

    The VM_ALWAYSDUMP flag is currently used by the coredump code to
    indicate that a vma is part of a vsyscall or vdso section. However, we
    can determine if a vma is in one of these sections by checking it
    against the gate_vma and checking for a non-NULL return value from
    arch_vma_name(). Thus, freeing a valuable vma bit.
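
    The replacement check takes roughly this shape (sketch):

    static bool always_dump_vma(struct vm_area_struct *vma)
    {
            /* the gate vma is always dumped */
            if (vma == get_gate_vma(vma->vm_mm))
                    return true;
            /* vdso and similar arch-named mappings are always dumped */
            if (arch_vma_name(vma))
                    return true;
            return false;
    }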

    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
    of force_sig(SIGKILL). With the recent changes we do not need force_ to
    kill the CLONE_NEWPID tasks.

    And this is more correct. force_sig() can race with the exiting thread
    even if oom_kill_task() checks p->mm != NULL, while
    do_send_sig_info(group => true) kills the whole process.
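
    The change boils down to (sketch):

    -   force_sig(SIGKILL, p);
    +   do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); /* group kill */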

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Fix code duplication in __unmap_hugepage_range(), such as pte_page() and
    huge_pte_none().

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

23 Mar, 2012

4 commits

  • Adjusting cc715d99e529 "mm: vmscan: forcibly scan highmem if there are
    too many buffer_heads pinning highmem" for -stable reveals that it was
    slightly wrong once on top of fe2c2a106663 "vmscan: reclaim at order 0
    when compaction is enabled", which specifically adds testorder for the
    zone_watermark_ok_safe() test.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     
  • Pull MCE changes from Ingo Molnar.

    * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
    x86/mce: Convert static array of pointers to per-cpu variables
    x86/mce: Replace hard coded hex constants with symbolic defines
    x86/mce: Recognise machine check bank signature for data path error
    x86/mce: Handle "action required" errors
    x86/mce: Add mechanism to safely save information in MCE handler
    x86/mce: Create helper function to save addr/misc when needed
    HWPOISON: Add code to handle "action required" errors.
    HWPOISON: Clean up memory_failure() vs. __memory_failure()

    Linus Torvalds
     
  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

3 commits

  • Export 'dirty_writeback_interval' to make it visible to
    file-systems. We are going to push superblock management down to
    file-systems and get rid of the 'sync_supers' kernel thread completely.

    Signed-off-by: Artem Bityutskiy
    Cc: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Artem Bityutskiy
     
  • Currently we can't do task migration among memory cgroups without THP
    split, which means processes heavily using THP experience large overhead
    in task migration. This patch introduces the code for moving charge of
    THP and makes THP more valuable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • These macros will be used in a later patch, where all usages are expected
    to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
    detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
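
    The stubs take roughly this shape:

    #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
    #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
    #define HPAGE_PMD_MASK  ({ BUILD_BUG(); 0; })
    #define HPAGE_PMD_SIZE  ({ BUILD_BUG(); 0; })
    #endif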

    [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi