24 May, 2014

5 commits

  • When a memory error happens on an in-use page or (free and in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to free page pool (buddy or
    free hugepage list). However, if another memory error occurs on the
    page being unpoisoned, memory_failure() returns without releasing
    the refcount it took at the start of the same call, which results in
    a memory leak and inconsistent num_poisoned_pages statistics. This
    patch fixes it.
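
    A sketch of the shape of the fix (not the verbatim patch): the early
    return taken when the page turns out to be "just unpoisoned" must
    drop the reference pinned earlier in the call and undo the
    accounting:

        lock_page(hpage);
        if (!PageHWPoison(p)) {
                printk(KERN_ERR "MCE %#lx: just unpoisoned\n", pfn);
                atomic_long_sub(nr_pages, &num_poisoned_pages);
                put_page(hpage);        /* the fix: don't leak the pin */
                res = 0;
                goto out;
        }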

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Commit 284f39afeaa4 ("mm: memcg: push !mm handling out to page cache
    charge function") explicitly checks for page cache charges without any
    mm context (from kernel thread context[1]).

    This seemed to be the only possible case where memory could be charged
    without mm context so commit 03583f1a631c ("memcg: remove unnecessary
    !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
    get_mem_cgroup_from_mm(). This however caused another NULL ptr
    dereference during early boot when loopback kernel thread splices to
    tmpfs as reported by Stephan Kulow:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
    IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Oops: 0000 [#1] SMP
    Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
    CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x40/0xe0
    mem_cgroup_charge_file+0x8b/0xd0
    shmem_getpage_gfp+0x66b/0x7b0
    shmem_file_splice_read+0x18f/0x430
    splice_direct_to_actor+0xa2/0x1c0
    do_lo_receive+0x5a/0x60 [loop]
    loop_thread+0x298/0x720 [loop]
    kthread+0xc6/0xe0
    ret_from_fork+0x7c/0xb0

    Branimir Maksimovic also reported the following oops, triggered on the
    swapcache charge path from the accounting code for kernel threads:

    CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P OE 3.15.0-rc5-core2-custom #159
    Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
    task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
    RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x45/0xf0
    mem_cgroup_charge_file+0x9c/0xe0
    shmem_getpage_gfp+0x62c/0x770
    shmem_write_begin+0x38/0x40
    generic_perform_write+0xc5/0x1c0
    __generic_file_aio_write+0x1d1/0x3f0
    generic_file_aio_write+0x4f/0xc0
    do_sync_write+0x5a/0x90
    do_acct_process+0x4b1/0x550
    acct_process+0x6d/0xa0
    do_exit+0x827/0xa70
    kthread+0xc3/0xf0

    This patch fixes the issue by reintroducing the mm check into
    get_mem_cgroup_from_mm. We could do the same trick in
    __mem_cgroup_try_charge_swapin as we do for the regular page cache
    path, but it is not worth the trouble. The check is not that
    expensive, and it makes get_mem_cgroup_from_mm more robust.

    [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2
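
    A sketch of the reintroduced check (abridged; treat as a sketch of
    the era's code rather than the verbatim patch):

        static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
                do {
                        /*
                         * Page cache insertions can happen without an
                         * actual mm context, e.g. during disk probing
                         * on boot, loopback IO, acct() writes etc.
                         */
                        if (unlikely(!mm))
                                memcg = root_mem_cgroup;
                        else {
                                memcg = mem_cgroup_from_task(
                                                rcu_dereference(mm->owner));
                                if (unlikely(!memcg))
                                        memcg = root_mem_cgroup;
                        }
                } while (!css_tryget(&memcg->css));
                rcu_read_unlock();
                return memcg;
        }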

    Fixes: 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
    Reported-and-tested-by: Stephan Kulow
    Reported-by: Branimir Maksimovic
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • MADV_WILLNEED currently does not read swapped out shmem pages back in.

    Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
    cache radix trees") made find_get_page() filter exceptional radix tree
    entries but failed to convert all find_get_page() callers that WANT
    exceptional entries over to find_get_entry(). One of them is shmem swap
    readahead in madvise, which now skips over any swap-out records.

    Convert it to find_get_entry().
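
    A sketch of the converted readahead loop in madvise's shmem path
    (abridged; helper names from that era's page cache API):

        page = find_get_entry(mapping, index);
        if (!radix_tree_exceptional_entry(page)) {
                /* an ordinary page or NULL: nothing swapped out here */
                if (page)
                        page_cache_release(page);
                continue;
        }
        /* an exceptional entry is a swap record: read the page back in */
        swap = radix_to_swp_entry(page);
        page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE, NULL, 0);
        if (page)
                page_cache_release(page);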

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

    blkdev_aio_read()
        generic_file_aio_read()
            filemap_write_and_wait_range()
                if (!mapping->nr_pages)
                    filemap_check_errors()

    and filemap_check_errors() always attempts two test_and_clear_bit()
    operations on the mapping flags, thus dirtying the cacheline on every
    single invocation. This patch tests each of these bits before clearing
    them, avoiding the issue. In my test case (4-socket box), performance
    went from 1.7M IOPS to 4.0M IOPS.
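
    The fix boils down to a test-before-clear pattern, so the common
    no-error case only reads the flags word (a sketch of the fixed
    function):

        static int filemap_check_errors(struct address_space *mapping)
        {
                int ret = 0;

                /* Only pay for test_and_clear when the bit is set */
                if (test_bit(AS_ENOSPC, &mapping->flags) &&
                    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                        ret = -ENOSPC;
                if (test_bit(AS_EIO, &mapping->flags) &&
                    test_and_clear_bit(AS_EIO, &mapping->flags))
                        ret = -EIO;
                return ret;
        }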

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
    For handling a free hugepage in memory failure, a race can happen if
    another thread hwpoisons the hugepage concurrently, so we need to
    check PageHWPoison instead of !PageHWPoison.

    If hwpoison_filter(p) returns true or a race happens, then we need to
    unlock_page(hpage).
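
    A condensed sketch of the corrected flow (not the verbatim patch):

        lock_page(hpage);
        if (PageHWPoison(hpage)) {
                /* lost the race: another thread poisoned it first */
                unlock_page(hpage);
                put_page(hpage);
                return 0;
        }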

    Signed-off-by: Chen Yucong
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

20 May, 2014

1 commit

  • Pull Metag architecture and related fixes from James Hogan:
    "Mostly fixes for metag and parisc relating to upgrowing stacks.

    - Fix missing compiler barriers in metag memory barriers.
    - Fix BUG_ON on metag when RLIMIT_STACK hard limit is increased
      beyond safe value.
    - Make maximum stack size configurable. This reduces the default
      user stack size back to 80MB (especially on parisc after their
      removal of _STK_LIM_MAX override). This only affects metag and
      parisc.
    - Remove metag _STK_LIM_MAX override to match other arches and follow
      parisc, now that it is safe to do so (due to the BUG_ON fix
      mentioned above).
    - Finally now that both metag and parisc _STK_LIM_MAX overrides have
      been removed, it makes sense to remove _STK_LIM_MAX altogether"

    * tag 'metag-for-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
    asm-generic: remove _STK_LIM_MAX
    metag: Remove _STK_LIM_MAX override
    parisc,metag: Do not hardcode maximum userspace stack size
    metag: Reduce maximum stack size to 256MB
    metag: fix memory barriers

    Linus Torvalds
     

15 May, 2014

1 commit

  • This patch affects only architectures where the stack grows upwards
    (currently parisc and metag only). On those do not hardcode the maximum
    initial stack size to 1GB for 32-bit processes, but make it configurable
    via a config option.

    The main problem with the hardcoded stack size is, that we have two
    memory regions which grow upwards: stack and heap. To keep most of the
    memory available for heap in a flexmap memory layout, it makes no sense
    to hard allocate up to 1GB of the memory for stack which can't be used
    as heap then.

    This patch makes the stack size for 32-bit processes configurable and
    uses 80MB as the default value, which has been in use on parisc for
    the last few years and hasn't shown any problems yet.
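
    On the affected architectures, the stack ceiling then derives from the
    new option rather than a hardcoded value; roughly (a sketch, with the
    option name taken from the series title):

        /* upward-growing stack ceiling on parisc/metag */
        #define STACK_SIZE_MAX  (CONFIG_MAX_STACK_SIZE_MB * 1024 * 1024)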

    Signed-off-by: Helge Deller
    Signed-off-by: James Hogan
    Cc: "James E.J. Bottomley"
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-metag@vger.kernel.org
    Cc: John David Anglin

    Helge Deller
     

13 May, 2014

1 commit

  • Pull a percpu fix from Tejun Heo:
    "Fix for a percpu allocator bug where it could try to kfree() a memory
    region allocated using vmalloc(). The bug has been there for years
    now and is unlikely to have ever triggered given the size of struct
    pcpu_chunk. It's still theoretically possible and the fix is simple
    and safe enough, so the patch is marked with -stable"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()

    Linus Torvalds
     

11 May, 2014

2 commits

    It's critical for split_huge_page() (and migration) to catch and freeze
    all PMDs on rmap walk. This gets tricky if there's concurrent fork() or
    mremap() since we usually copy/move page table entries on dup_mm() or
    move_page_tables() without the rmap lock taken. To make it work we rely
    on rmap walk order to not miss any entry: we expect to see the
    destination VMA after the source one.

    But after switching rmap implementation to interval tree it's not always
    possible to preserve expected walk order.

    It works fine for dup_mm() since the new VMA has the same
    vma_start_pgoff()/vma_last_pgoff() and we explicitly insert the dst
    VMA after the src one with vma_interval_tree_insert_after().

    But on move_vma() the destination VMA can be merged into an adjacent
    one and, as a result, shifted left in the interval tree. Fortunately,
    we can detect this situation and prevent the race with the rmap walk
    by moving page table entries under the rmap lock. See commit
    38a76013ad80.

    The problem is that we skip that lock when moving a transhuge PMD.
    Most likely this bug caused the crash[1].

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473
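
    A sketch of the fix's shape in move_page_tables(), reusing the
    need_rmap_locks logic introduced by commit 38a76013ad80 for the
    transhuge-PMD case (abridged):

        /* when moving a whole transhuge PMD */
        if (need_rmap_locks)
                anon_vma_lock_write(vma->anon_vma);
        moved = move_huge_pmd(vma, new_vma, old_addr, new_addr,
                              old_end, old_pmd, new_pmd);
        if (need_rmap_locks)
                anon_vma_unlock_write(vma->anon_vma);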

    Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Rik van Riel
    Acked-by: Michel Lespinasse
    Cc: Dave Jones
    Cc: David Miller
    Acked-by: Johannes Weiner
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 8910ae896c8c ("kmemleak: change some global variables to int"),
    in addition to the atomic -> int conversion, moved the disabling of
    kmemleak_early_log to the beginning of the kmemleak_init() function,
    before full kmemleak tracing is actually enabled. In this small
    window, kmem_cache_create() is called by kmemleak, which triggers
    additional memory allocations that are not traced. This patch restores
    the original logic, disabling kmemleak_early_log only once kmemleak is
    fully functional.
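
    A sketch of the restored ordering in kmemleak_init() (abridged):

        void __init kmemleak_init(void)
        {
                /* create the caches kmemleak itself needs... */
                object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE);
                scan_area_cache = KMEM_CACHE(kmemleak_scan_area,
                                             SLAB_NOLEAKTRACE);

                /*
                 * ...and only then stop early logging, so allocations
                 * made in the window above are still recorded.
                 */
                kmemleak_early_log = 0;
        }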

    Fixes: 8910ae896c8c ("kmemleak: change some global variables to int")

    Signed-off-by: Catalin Marinas
    Cc: Sasha Levin
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

07 May, 2014

10 commits

  • Merge misc fixes from Andrew Morton:
    "13 fixes"

    * emailed patches from Andrew Morton:
    agp: info leak in agpioc_info_wrap()
    fs/affs/super.c: bugfix / double free
    fanotify: fix -EOVERFLOW with large files on 64-bit
    slub: use sysfs'es release mechanism for kmem_cache
    revert "mm: vmscan: do not swap anon pages just because free+file is low"
    autofs: fix lockref lookup
    mm: filemap: update find_get_pages_tag() to deal with shadow entries
    mm/compaction: make isolate_freepages start at pageblock boundary
    MAINTAINERS: zswap/zbud: change maintainer email address
    mm/page-writeback.c: fix divide by zero in pos_ratio_polynom
    hugetlb: ensure hugepage access is denied if hugepages are not supported
    slub: fix memcg_propagate_slab_attrs
    drivers/rtc/rtc-pcf8523.c: fix month definition

    Linus Torvalds
     
  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.
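
    The mechanism is the standard kobject release callback; a sketch of
    the slub side of the change:

        static void kmem_cache_release(struct kobject *k)
        {
                slab_kmem_cache_release(to_slab(k));
        }

        static struct kobj_type slab_ktype = {
                .sysfs_ops = &slab_sysfs_ops,
                .release = kmem_cache_release,
        };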

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safe guard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Dave Jones reports the following crash when find_get_pages_tag() runs
    into an exceptional entry:

    kernel BUG at mm/filemap.c:1347!
    RIP: find_get_pages_tag+0x1cb/0x220
    Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

    1343 /*
    1344 * This function is never used on a shmem/tmpfs
    1345 * mapping, so a swap entry won't be found here.
    1346 */
    1347 BUG();

    After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
    page cache radix trees") this comment and BUG() are out of date because
    exceptional entries can now appear in all mappings - as shadows of
    recently evicted pages.

    However, as Hugh Dickins notes,

    "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
    any other PAGECACHE_TAG_*) to appear on an exceptional entry.

    I expect it comes down to an occasional race in RCU lookup of the
    radix_tree: lacking absolute synchronization, we might sometimes
    catch an exceptional entry, with the tag which really belongs with
    the unexceptional entry which was there an instant before."

    And indeed, not only is the tree walk lockless, the tags are also read
    in chunks, one radix tree node at a time. There is plenty of time for
    page reclaim to swoop in and replace a page that was already looked up
    as tagged with a shadow entry.

    Remove the BUG() and update the comment. While reviewing all other
    lookup sites for whether they properly deal with shadow entries of
    evicted pages, update all the comments and fix memcg file charge moving
    to not miss shmem/tmpfs swapcache pages.
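
    In outline, the updated lookup loop skips exceptional entries instead
    of hitting the BUG() (abridged):

        page = radix_tree_deref_slot(slot);
        if (unlikely(!page))
                continue;
        if (radix_tree_exception(page)) {
                if (radix_tree_deref_retry(page))
                        goto restart;
                /*
                 * A shadow entry of a recently evicted page. The tag we
                 * saw belonged to the page that was here an instant ago;
                 * just skip the entry.
                 */
                continue;
        }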

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The compaction freepage scanner implementation in isolate_freepages()
    starts by taking the current cc->free_pfn value as the first pfn. In a
    for loop, it scans from this first pfn to the end of the pageblock, and
    then subtracts pageblock_nr_pages from the first pfn to obtain the first
    pfn for the next for loop iteration.

    This means that when cc->free_pfn starts at offset X rather than being
    aligned to a pageblock boundary, the scanner will start at offset X in
    every scanned pageblock, ignoring potentially many free pages.
    Currently this can happen when

    a) zone's end pfn is not pageblock aligned, or

    b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

    This patch fixes the problem by aligning the initial pfn in
    isolate_freepages() to pageblock boundary. This also permits replacing
    the end-of-pageblock alignment within the for loop with a simple
    pageblock_nr_pages increment.
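
    A sketch of the alignment change at the top of isolate_freepages():

        /*
         * Round the starting pfn down to a pageblock boundary so that
         * no part of a block is silently skipped when cc->free_pfn
         * sits at an offset inside it.
         */
        pfn = cc->free_pfn & ~(pageblock_nr_pages - 1);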

    Signed-off-by: Vlastimil Babka
    Reported-by: Heesub Shin
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • It is possible for "limit - setpoint + 1" to equal zero, after getting
    truncated to a 32 bit variable, and resulting in a divide by zero error.

    Using the fully 64-bit divide functions avoids this problem. It will
    also cause pos_ratio_polynom() to return the correct value when
    (setpoint - limit) exceeds 2^32.

    Also uninline pos_ratio_polynom, at Andrew's request.
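
    The core change does the division in full 64 bits, so the divisor is
    no longer truncated to 32 bits where it could become zero (a sketch):

        /* div_s64() takes a 32-bit divisor; div64_s64() keeps all 64 bits */
        x = div64_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                      limit - setpoint + 1);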

    Signed-off-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Luiz Capitulino
    Cc: Masayoshi Mizuma
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think
    it's related to the fact that hugetlbfs is probably not correctly
    setting itself up in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
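
    A sketch of the extracted helper and one of its call sites (shape per
    the description; the generic fallback is a simple HPAGE_SHIFT test):

        #ifndef hugepages_supported
        /* some platforms set HPAGE_SHIFT to 0 when there is no support */
        #define hugepages_supported() (HPAGE_SHIFT != 0)
        #endif

        /* e.g. in the hugetlbfs init path */
        if (!hugepages_supported()) {
                pr_info("hugetlbfs: disabling because there are no supported hugepage sizes\n");
                return -ENOTSUPP;
        }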

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • After creating a cache for a memcg we should initialize its sysfs attrs
    with the values from its parent. That's what memcg_propagate_slab_attrs
    is for. Currently it's broken - we clearly muddled root-vs-memcg caches
    there. Let's fix it up.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull vfs fixes from Al Viro:
    "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl
    bugfix from hch"

    The dcache fixes are for a subtle LRU list corruption bug reported by
    Miklos Szeredi, where people inside IBM saw list corruptions with the
    LTP/host01 test.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nick kvfree() from apparmor
    posix_acl: handle NULL ACL in posix_acl_equiv_mode
    dcache: don't need rcu in shrink_dentry_list()
    more graceful recovery in umount_collect()
    don't remove from shrink list in select_collect()
    dentry_kill(): don't try to remove from shrink list
    expand the call of dentry_lru_del() in dentry_kill()
    new helper: dentry_free()
    fold try_prune_one_dentry()
    fold d_kill() and d_free()
    fix races between __d_instantiate() and checks of dentry flags

    Linus Torvalds
     
  • too many places open-code it
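
    The helper itself, as it lands in mm/util.c, is essentially:

        void kvfree(const void *addr)
        {
                if (is_vmalloc_addr(addr))
                        vfree(addr);
                else
                        kfree(addr);
        }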

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2014

2 commits

  • If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
    likewise if freelist_idx_t is a short, then it should be 65535 not
    65536.

    This was leading to all kinds of random crashes on sparc64 where
    PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was
    enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
    same cpu already holding a lock it shouldn't hold, or the lock belonging
    to a completely unrelated process.
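
    The corrected macro subtracts one (a sketch):

        /* a byte-sized index can address 255 objects, not 256 */
        #define SLAB_OBJ_MAX_NUM \
                ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)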

    Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab")
    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Commit a41adfaa23df ("slab: introduce byte sized index for the freelist
    of a slab") changes the size of freelist index and also changes
    prototype of accessor function to freelist index. And there was a
    mistake.

    The mistake is that although it changes the size of the freelist index
    correctly, it changes the size of the index into the freelist
    incorrectly. With the patch, a freelist index can be 1 byte or 2
    bytes, which means the number of objects on a slab can exceed 255. So
    we need more than 1 byte for the index used to find a free object on
    the freelist. But the above patch made that index type 1 byte, so
    slabs with more than 255 objects cannot work properly, and in
    consequence the system cannot boot.

    This issue was reported by Steven King on m68knommu, which uses a
    2-byte freelist index:

    https://lkml.org/lkml/2014/4/16/433

    The fix is easy: changing the type of the index parameter in the
    accessor functions is enough. Although 2 bytes would suffice, I use 4
    bytes since it has no bad effect and makes things easier. This fix was
    suggested and tested by Steven in his original report.
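
    A sketch of the corrected accessors: the position argument is wider
    than the stored per-object index type:

        /* 'idx' is a position in the freelist and can exceed 255 */
        static inline freelist_idx_t get_free_obj(struct page *page,
                                                  unsigned int idx)
        {
                return ((freelist_idx_t *)page->freelist)[idx];
        }

        static inline void set_free_obj(struct page *page,
                                        unsigned int idx, freelist_idx_t val)
        {
                ((freelist_idx_t *)page->freelist)[idx] = val;
        }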

    Signed-off-by: Joonsoo Kim
    Reported-and-acked-by: Steven King
    Acked-by: Christoph Lameter
    Tested-by: James Hogan
    Tested-by: David Miller
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

29 Apr, 2014

1 commit

  • BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.

    The trivial sanity check in the vmacache code is *not* such a fatal
    error. Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.

    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.

    BUG_ON() must *die*. Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.
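
    The replacement keeps the sanity check but makes it unconditional and
    survivable; a sketch of the vmacache lookup after the change:

        if (vma) {
                /* warn and bail out instead of killing the machine */
                if (WARN_ON_ONCE(vma->vm_mm != mm))
                        break;
                if (vma->vm_start <= addr && vma->vm_end > addr)
                        return vma;
        }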

    Reported-by: Srivatsa S. Bhat
    Cc: Davidlohr Bueso
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Apr, 2014

1 commit

  • The mmu-gather operation 'tlb_flush_mmu()' has done two things: the
    actual tlb flush operation, and the batched freeing of the pages that
    the TLB entries pointed at.

    This splits the operation into separate phases, so that the forced
    batched flushing done by zap_pte_range() can now do the actual TLB flush
    while still holding the page table lock, but delay the batched freeing
    of all the pages to after the lock has been dropped.

    This in turn allows us to avoid a race condition between
    set_page_dirty() (as called by zap_pte_range() when it finds a dirty
    shared memory pte) and page_mkclean(): because we now flush all the
    dirty page data from the TLB's while holding the pte lock,
    page_mkclean() will be held up walking the (recently cleaned) page
    tables until after the TLB entries have been flushed from all CPU's.
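
    After the split, the old entry point is simply the two phases in
    sequence (a sketch):

        void tlb_flush_mmu(struct mmu_gather *tlb)
        {
                tlb_flush_mmu_tlbonly(tlb);     /* the actual TLB flush */
                tlb_flush_mmu_free(tlb);        /* batched page freeing */
        }

    zap_pte_range() can then call tlb_flush_mmu_tlbonly() while still
    holding the page table lock and defer the freeing half until after
    the lock is dropped.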

    Reported-by: Benjamin Herrenschmidt
    Tested-by: Dave Hansen
    Acked-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Apr, 2014

1 commit

  • fixup_user_fault() is used by the futex code when the direct user access
    fails, and the futex code wants it to either map in the page in a usable
    form or return an error. It relied on handle_mm_fault() to map the
    page, and correctly checked the error return from that, but while that
    does map the page, it doesn't actually guarantee that the page will be
    mapped with sufficient permissions to be then accessed.

    So do the appropriate tests of the vma access rights by hand.
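
    A sketch of the added check (shape per the description):

        vma = find_extend_vma(mm, address);
        if (!vma || address < vma->vm_start)
                return -EFAULT;

        /* the new by-hand access test */
        vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
        if (!(vm_flags & vma->vm_flags))
                return -EFAULT;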

    [ Side note: arguably handle_mm_fault() could just do that itself, but
    we have traditionally done it in the caller, because some callers -
    notably get_user_pages() - have been able to access pages even when
    they are mapped with PROT_NONE. Maybe we should re-visit that design
    decision, but in the meantime this is the minimal patch. ]

    Found by Dave Jones running his trinity tool.

    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Apr, 2014

4 commits

    Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
    have the same root cause. Let's look at them one by one.

    The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
    BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
    testing I see that page_mapcount() is higher than mapcount here.

    I think it happens due to race between zap_huge_pmd() and
    page_check_address_pmd(). page_check_address_pmd() misses PMD which is
    under zap:

    CPU0                        CPU1
    zap_huge_pmd()
      pmdp_get_and_clear()
                                __split_huge_page()
                                  anon_vma_interval_tree_foreach()
                                    __split_huge_page_splitting()
                                      page_check_address_pmd()
                                        mm_find_pmd()
                                          /*
                                           * We check if PMD is present
                                           * without taking ptl: no
                                           * serialization against
                                           * zap_huge_pmd(). We miss this
                                           * PMD; it's not accounted to
                                           * 'mapcount' in
                                           * __split_huge_page().
                                           */
                                          pmd_present(pmd) == 0

                                  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

      page_remove_rmap(page)
        atomic_add_negative(-1, &page->_mapcount)

    The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
    It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

    This happens in similar way:

    CPU0                        CPU1
    zap_huge_pmd()
      pmdp_get_and_clear()
      page_remove_rmap(page)
        atomic_add_negative(-1, &page->_mapcount)
                                __split_huge_page()
                                  anon_vma_interval_tree_foreach()
                                    __split_huge_page_splitting()
                                      page_check_address_pmd()
                                        mm_find_pmd()
                                          pmd_present(pmd) == 0
                                          /* The same comment as above */
                                  /*
                                   * No crash this time since we already
                                   * decremented page->_mapcount in
                                   * zap_huge_pmd().
                                   */
                                  BUG_ON(mapcount != page_mapcount(page))

                                  /*
                                   * We split the compound page here into
                                   * small pages without serialization
                                   * against zap_huge_pmd()
                                   */
                                  __split_huge_page_refcount()
                                    VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

    So my understanding is that the problem is the pmd_present() check in
    mm_find_pmd() done without taking the page table lock.

    The bug was introduced by me in commit 117b0791ac42. Sorry for
    that. :(

    Let's open code mm_find_pmd() in page_check_address_pmd() and do the
    check under page table lock.

    Note that __page_check_address() does the same for PTE entries if
    sync != 0.
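
    A sketch of the open-coded walk, with the present check moved under
    the page table lock (abridged):

        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
                return NULL;
        pud = pud_offset(pgd, address);
        if (!pud_present(*pud))
                return NULL;
        pmd = pmd_offset(pud, address);

        *ptl = pmd_lock(mm, pmd);
        if (!pmd_present(*pmd))   /* now serialized against zap_huge_pmd() */
                goto unlock;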

    I've stress tested the split and zap code paths for 36+ hours by now
    and don't see crashes with the patch applied. Before the patch, the
    first bug triggered within about 20 minutes and the second within a
    few hours.

    [2] https://lkml.kernel.org/g/

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Dave Jones
    Cc: Vlastimil Babka
    Cc: [3.13+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fix new kernel-doc warning in mm/filemap.c:

    Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    The soft lockup in freeing gigantic hugepages fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed.") can also happen in
    return_unused_surplus_pages(), so let's fix it there too.
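
    A sketch of the fix, mirroring what 55f67141a892 did elsewhere:

        /* in return_unused_surplus_pages(), freeing under hugetlb_lock */
        while (nr_pages--) {
                if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
                        break;
                cond_resched_lock(&hugetlb_lock);
        }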

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
    This code seems to be called with preemption enabled, and therefore
    must use mod_zone_page_state() instead.
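
    In spirit, the change is a one-liner (a sketch; the summary does not
    name the surrounding function):

        /*
         * __mod_zone_page_state() assumes preemption is disabled;
         * mod_zone_page_state() is safe with preemption enabled.
         */
        mod_zone_page_state(zone, item, delta);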

    Signed-off-by: Christoph Lameter
    Reported-by: Grygorii Strashko
    Tested-by: Grygorii Strashko
    Cc: Tejun Heo
    Cc: Santosh Shilimkar
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

15 Apr, 2014

1 commit

  • pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
    BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)

    It could hardly ever be bigger than PAGE_SIZE even on a large-scale
    machine, but for consistency with its counterpart pcpu_mem_zalloc(),
    use pcpu_mem_free() instead.

    Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
    pcpu_mem_free() instead of kfree()") addressed this problem, but
    missed this one.
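
    A sketch of the corrected error path in pcpu_alloc_chunk():

        chunk = pcpu_mem_zalloc(pcpu_chunk_struct_size);
        if (!chunk)
                return NULL;

        chunk->map = pcpu_mem_zalloc(PCPU_DFL_MAP_ALLOC *
                                     sizeof(chunk->map[0]));
        if (!chunk->map) {
                /* was kfree(chunk): wrong if the chunk came from vmalloc() */
                pcpu_mem_free(chunk, pcpu_chunk_struct_size);
                return NULL;
        }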

    tj: commit message updated

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo
    Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online")
    Cc: stable@vger.kernel.org

    Jianyu Zhan
     

14 Apr, 2014

2 commits

  • Some versions of gcc even warn about it:

    mm/shmem.c: In function ‘shmem_file_aio_read’:
    mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function

    If the loop is aborted during the first iteration by one of the two
    first break statements, error will be uninitialized.

    Introduced by commit 6e58e79db8a1 ("introduce copy_page_to_iter, kill
    loop over iovec in generic_file_aio_read()").

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

13 Apr, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

11 Apr, 2014

1 commit

  • 'struct page' has two list_head fields: 'lru' and 'list'. Conveniently,
    they are unioned together. This means that code can use them
    interchangably, which gets horribly confusing like with this nugget from
    slab.c:

    > list_del(&page->lru);
    > if (page->active == cachep->num)
    > list_add(&page->list, &n->slabs_full);

    This patch makes the slab and slub code use page->lru universally instead
    of mixing ->list and ->lru.

    So, the new rule is: page->lru is what you use if you want to keep
    your page on a list. Don't like the fact that it's not called ->list?
    Too bad.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Dave Hansen
     

09 Apr, 2014

1 commit

  • Page reclaim force-scans / swaps anonymous pages when file cache drops
    below the high watermark of a zone in order to prevent what little cache
    remains from thrashing.

    However, on bigger machines the high watermark value can be quite large
    and when the workload is dominated by a static anonymous/shmem set, the
    file set might just be a small window of used-once cache. In such
    situations, the VM starts swapping heavily when instead it should be
    recycling the no longer used cache.

    This is a longer-standing problem, but it's more likely to trigger after
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy")
    because file pages can no longer accumulate in a single zone and are
    dispersed into smaller fractions among the available zones.

    To resolve this, do not force scan anon when file pages are low but
    instead rely on the scan/rotation ratios to make the right prediction.
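
    For reference, the check being removed is the force-scan branch in
    get_scan_count() (a sketch):

        if (global_reclaim(sc)) {
                free = zone_page_state(zone, NR_FREE_PAGES);
                if (unlikely(file + free <= high_wmark_pages(zone))) {
                        /* removed: forced anon scanning when cache is low */
                        scan_balance = SCAN_ANON;
                        goto out;
                }
        }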

    Signed-off-by: Johannes Weiner
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2014

3 commits

  • Merge second patch-bomb from Andrew Morton:
    - the rest of MM
    - zram updates
    - zswap updates
    - exit
    - procfs
    - exec
    - wait
    - crash dump
    - lib/idr
    - rapidio
    - adfs, affs, bfs, ufs
    - cris
    - Kconfig things
    - initramfs
    - small amount of IPC material
    - percpu enhancements
    - early ioremap support
    - various other misc things

    * emailed patches from Andrew Morton: (156 commits)
    MAINTAINERS: update Intel C600 SAS driver maintainers
    fs/ufs: remove unused ufs_super_block_third pointer
    fs/ufs: remove unused ufs_super_block_second pointer
    fs/ufs: remove unused ufs_super_block_first pointer
    fs/ufs/super.c: add __init to init_inodecache()
    doc/kernel-parameters.txt: add early_ioremap_debug
    arm64: add early_ioremap support
    arm64: initialize pgprot info earlier in boot
    x86: use generic early_ioremap
    mm: create generic early_ioremap() support
    x86/mm: sparse warning fix for early_memremap
    lglock: map to spinlock when !CONFIG_SMP
    percpu: add preemption checks to __this_cpu ops
    vmstat: use raw_cpu_ops to avoid false positives on preemption checks
    slub: use raw_cpu_inc for incrementing statistics
    net: replace __this_cpu_inc in route.c with raw_cpu_inc
    modules: use raw_cpu_write for initialization of per cpu refcount.
    mm: use raw_cpu ops for determining current NUMA node
    percpu: add raw_cpu_ops
    slub: fix leak of 'name' in sysfs_slab_add
    ...

    Linus Torvalds
     
    This patch creates a generic implementation of early_ioremap() support
    based on the existing x86 implementation. early_ioremap() is useful
    for early boot code which needs to temporarily map I/O or memory
    regions before normal mapping functions such as ioremap() are
    available.

    Some architectures have optional MMU. In the no-MMU case, the remap
    functions simply return the passed in physical address and the unmap
    functions do nothing.
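
    A sketch of the no-MMU stubs described above:

        #ifndef CONFIG_MMU
        void __init __iomem *
        early_ioremap(resource_size_t phys_addr, unsigned long size)
        {
                /* no MMU: the "mapping" is the physical address itself */
                return (__force void __iomem *)phys_addr;
        }

        void __init early_iounmap(void __iomem *addr, unsigned long size)
        {
                /* nothing to undo without an MMU */
        }
        #endif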

    Signed-off-by: Mark Salter
    Acked-by: Catalin Marinas
    Acked-by: H. Peter Anvin
    Cc: Borislav Petkov
    Cc: Dave Young
    Cc: Will Deacon
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
  • Statistics are not critical to the operation of the allocation but
    should also not cause too much overhead.

    When __this_cpu_inc is altered to check whether preemption is
    disabled, this code triggers the check. Use raw_cpu_inc to avoid it.
    Using this_cpu ops may cause interrupt disable/enable sequences on
    various arches, which may significantly impact allocator performance.
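
    A sketch of the slub statistics helper after the change:

        static inline void stat(const struct kmem_cache *s, enum stat_item si)
        {
        #ifdef CONFIG_SLUB_STATS
                /*
                 * The rmw is racy on a preemptible kernel, but this is
                 * acceptable: occasionally misattributing an event to
                 * another cpu only skews the statistics slightly.
                 */
                raw_cpu_inc(s->cpu_slab->stat[si]);
        #endif
        }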

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Christoph Lameter
    Cc: Fengguang Wu
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter