05 Aug, 2014

3 commits

  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     
  • Pull percpu updates from Tejun Heo:

    - Major reorganization of percpu header files which I think makes
    things a lot more readable and logical than before.

    - percpu-refcount is updated so that it requires explicit destruction
    and can be reinitialized if necessary. This was pulled into the
    block tree to replace the custom percpu refcnting implemented in
    blk-mq.

    - In the process, percpu and percpu-refcount got cleaned up a bit

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
    percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
    percpu-refcount: require percpu_ref to be exited explicitly
    percpu-refcount: use unsigned long for pcpu_count pointer
    percpu-refcount: add helpers for ->percpu_count accesses
    percpu-refcount: one bit is enough for REF_STATUS
    percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
    workqueue: stronger test in process_one_work()
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    percpu: Use ALIGN macro instead of hand coding alignment calculation
    percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
    percpu: preffity percpu header files
    percpu: use raw_cpu_*() to define __this_cpu_*()
    percpu: reorder macros in percpu header files
    percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
    percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
    percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
    percpu: reorganize include/linux/percpu-defs.h
    percpu: move accessors from include/linux/percpu.h to percpu-defs.h
    percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
    percpu: introduce arch_raw_cpu_ptr()
    ...

    Linus Torvalds
     

31 Jul, 2014

8 commits

  • PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe1
    ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs
    another symbol to filter *hugetlbfs* pages.

    If a user hopes to filter user pages, makedumpfile tries to exclude them by
    checking whether the page is anonymous, but hugetlbfs pages aren't
    anonymous even though they are user pages.

    We know it's possible to detect them in the same way as PageHuge(),
    so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
            if (!PageCompound(page))
                    return 0;

            page = compound_head(page);
            return get_compound_page_dtor(page) == free_huge_page;
    }

    For that reason, this patch makes free_huge_page() public so that it can
    be exported to VMCOREINFO.
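
    With the symbol public, the export itself is a one-liner; a minimal
    sketch, assuming it lands next to the existing entries in
    crash_save_vmcoreinfo_init():

    /* Record free_huge_page's address in the vmcoreinfo note so that
     * makedumpfile can compare it against a page's compound dtor, just
     * as PageHuge() does at runtime. */
    VMCOREINFO_SYMBOL(free_huge_page);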

    Signed-off-by: Atsushi Kumagai
    Acked-by: Baoquan He
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     
  • Fix kernel-doc warnings in mm/filemap.c: pagecache_get_page():

    Warning(..//mm/filemap.c:1054): No description found for parameter 'cache_gfp_mask'
    Warning(..//mm/filemap.c:1054): No description found for parameter 'radix_gfp_mask'
    Warning(..//mm/filemap.c:1054): Excess function parameter 'gfp_mask' description in 'pagecache_get_page'

    Fixes: 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible")

    [mgorman@suse.de: change everything]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Randy Dunlap
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • do_fault_around() expects fault_around_bytes rounded down to the nearest
    page order. Instead of calling rounddown_pow_of_two() every time in
    fault_around_pages()/fault_around_mask(), we can round down when the user
    changes fault_around_bytes via the debugfs interface.

    This also fixes a bug when the user sets fault_around_bytes to 0: the
    result of rounddown_pow_of_two(0) is not defined, therefore
    fault_around_bytes == 0 doesn't work without this patch.

    Let's set fault_around_bytes to PAGE_SIZE if the user sets it to something
    less than PAGE_SIZE.
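
    A hedged sketch of rounding at write time rather than at each fault
    (the debugfs setter below is approximate, reconstructed from the
    description above):

    static int fault_around_bytes_set(void *data, u64 val)
    {
            if (val / PAGE_SIZE > PTRS_PER_PTE)
                    return -EINVAL;
            if (val > PAGE_SIZE)
                    fault_around_bytes = rounddown_pow_of_two(val);
            else
                    /* rounddown_pow_of_two(0) is undefined */
                    fault_around_bytes = PAGE_SIZE;
            return 0;
    }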

    [akpm@linux-foundation.org: tweak code layout]
    Fixes: a9b0f861 ("mm: nominate faultaround area in bytes rather than page order")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Paul Furtado has reported the following GPF:

    general protection fault: 0000 [#1] SMP
    Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
    CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
    task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
    RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
    RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283
    RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
    RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
    RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
    R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
    FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
    CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
    Call Trace:
    pagefault_out_of_memory+0x18/0x90
    mm_fault_error+0xa9/0x1a0
    __do_page_fault+0x478/0x4c0
    do_page_fault+0x2c/0x40
    page_fault+0x28/0x30
    Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
    RIP mem_cgroup_oom_synchronize+0x140/0x240

    Commit fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and
    wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
    assuming it is protected by the hierarchical OOM-lock.

    Although this is true for the notification part, the protection doesn't
    cover unregistration of events, which can now happen in parallel, so
    mem_cgroup_oom_notify can see an already unlinked and/or freed
    mem_cgroup_eventfd_list.

    Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
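
    A minimal sketch of the fix, assuming memcg_oom_lock is the spinlock
    that already guards the notification side:

    static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *iter;

            /* Holding memcg_oom_lock serializes against event
             * unregistration, so the eventfd list cannot be unlinked
             * or freed while we walk it. */
            spin_lock(&memcg_oom_lock);
            for_each_mem_cgroup_tree(iter, memcg)
                    mem_cgroup_oom_notify_cb(iter);
            spin_unlock(&memcg_oom_lock);
    }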

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881

    Fixes: fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup)
    Signed-off-by: Michal Hocko
    Reported-by: Paul Furtado
    Tested-by: Paul Furtado
    Acked-by: Johannes Weiner
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • hwpoison_user_mappings() could fail for various reasons, so printk()s to
    print out the reasons should be done in each failure check inside
    hwpoison_user_mappings().

    Also, we currently don't call action_result() when hwpoison_user_mappings()
    fails, which is inconsistent with the other exit points of the memory
    error handler. So this patch fixes these messaging problems.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
    handling path of the victimized page frame that belong to non-LRU"),
    rejects entering the unmapping operation for hugetlbfs/thp pages, which
    results in error containment failing on such pages. This patch fixes it.

    With this patch, hwpoison functional tests in mce-test testsuite pass.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
    should be set in allocflags. ALLOC_CPUSET controls if a page allocation
    should be restricted only to the set of allowed cpuset mems.

    Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
    the fault path from using memory compaction or direct reclaim. Thus, it
    is unfairly able to allocate outside of its cpuset mems restriction as a
    side-effect.

    This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
    truly GFP_ATOMIC by verifying it is also not a thp allocation.
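
    A hedged sketch of the check in gfp_to_alloc_flags() (details
    approximate): THP faults mark themselves with __GFP_NO_KSWAPD, so
    that flag distinguishes them from truly atomic allocations.

    /* alloc_flags starts out with ALLOC_CPUSET set; only a truly
     * atomic allocation - no __GFP_WAIT and not a THP fault - may
     * drop it and ignore the cpuset mems restriction. */
    const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));

    if (atomic)
            alloc_flags &= ~ALLOC_CPUSET;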

    Signed-off-by: David Rientjes
    Reported-by: Alex Thorlton
    Tested-by: Alex Thorlton
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: Hedi Berriche
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Srivatsa S. Bhat
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Under memory pressure, it is possible for dirty_thresh, calculated by
    global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if
    strictlimit is true, bdi_dirty_limits() tries to resolve the proportion:

    bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh

    by dividing by zero.
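
    A minimal sketch of the guard (its exact placement inside
    bdi_dirty_limits() is an assumption): compute the proportion only
    when dirty_thresh is non-zero.

    /* bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh */
    *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
                                            background_thresh,
                                            dirty_thresh) : 0;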

    Signed-off-by: Maxim Patlasov
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     

30 Jul, 2014

1 commit

  • Fix kernel-doc warnings and function name in mm/page_alloc.c:

    Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
    Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
    Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
    Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
    Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
    Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'

    Signed-off-by: Randy Dunlap
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

28 Jul, 2014

1 commit


27 Jul, 2014

1 commit

  • Shortly before 3.16-rc1, Dave Jones reported:

    WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
    xfs_vm_writepage+0x5ce/0x630 [xfs]()
    CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
    Call Trace:
    xfs_vm_writepage+0x5ce/0x630 [xfs]
    shrink_page_list+0x8f9/0xb90
    shrink_inactive_list+0x253/0x510
    shrink_lruvec+0x563/0x6c0
    shrink_zone+0x3b/0x100
    shrink_zones+0x1f1/0x3c0
    try_to_free_pages+0x164/0x380
    __alloc_pages_nodemask+0x822/0xc90
    alloc_pages_vma+0xaf/0x1c0
    handle_mm_fault+0xa31/0xc50
    etc.

    970         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
    971                          PF_MEMALLOC))

    I did not respond at the time, because a glance at the PageDirty block
    in shrink_page_list() quickly shows that this is impossible: we don't do
    writeback on file pages (other than tmpfs) from direct reclaim nowadays.
    Dave was hallucinating, but it would have been disrespectful to say so.

    However, my own /var/log/messages now shows similar complaints

    WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
    WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()

    from stressing some mmotm trees during July.

    Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
    so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
    to mapping->a_ops->writepage()?

    Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
    page freeing callback") has provided such a way to compaction: if
    migrating a SwapBacked page fails, its newpage may be put back on the
    list for later use with PageSwapBacked still set, and nothing will clear
    it.

    Whether that can do anything worse than issue WARN_ON_ONCEs, and get
    some statistics wrong, is unclear: easier to fix than to think through
    the consequences.

    Fixing it here, before the put_new_page(), addresses the bug directly,
    but is probably the worst place to fix it. Page migration is doing too
    many parts of the job on too many levels: fixing it in
    move_to_new_page() to complement its SetPageSwapBacked would be
    preferable, except why is it (and newpage->mapping and newpage->index)
    done there, rather than down in migrate_page_move_mapping(), once we are
    sure of success? Not a cleanup to get into right now, especially not
    with memcg cleanups coming in 3.17.

    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Jul, 2014

7 commits

  • Pull slab fix from Mike Snitzer:
    "This fixes the broken duplicate slab name check in
    kmem_cache_sanity_check() that has been repeatedly reported (as
    recently as today against Fedora rawhide).

    Pekka seemed to have it staged for a late 3.15-rc in his 'slab/urgent'
    branch but never sent a pull request, see:
    https://lkml.org/lkml/2014/5/23/648"

    * tag 'urgent-slab-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    slab_common: fix the check for duplicate slab names

    Linus Torvalds
     
  • Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry") changed the order of
    huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
    in some workloads like hugepage-backed heap allocation via libhugetlbfs.
    This patch fixes it.

    The test program for the problem is shown below:

    $ cat heap.c
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define HPS 0x200000

    int main() {
            int i;
            char *p = malloc(HPS);
            memset(p, '1', HPS);
            for (i = 0; i < 5; i++) {
                    if (!fork()) {
                            memset(p, '2', HPS);
                            p = malloc(HPS);
                            memset(p, '3', HPS);
                            free(p);
                            return 0;
                    }
            }
            sleep(1);
            free(p);
            return 0;
    }

    $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap

    Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry"), so is applicable to -stable kernels which
    include it.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Guillaume Morin
    Suggested-by: Guillaume Morin
    Acked-by: Hugh Dickins
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • I wanted to revert my v3.1 commit d0823576bf4b ("mm: pincer in
    truncate_inode_pages_range"), to keep truncate_inode_pages_range() in
    synch with shmem_undo_range(); but have stepped back - a change to
    hole-punching in truncate_inode_pages_range() is a change to
    hole-punching in every filesystem (except tmpfs) that supports it.

    If there's a logical proof why no filesystem can depend for its own
    correctness on the pincer guarantee in truncate_inode_pages_range() - an
    instant when the entire hole is removed from pagecache - then let's
    revisit later. But the evidence is that only tmpfs suffered from the
    livelock, and we have no intention of extending hole-punch to ramfs. So
    for now just add a few comments (to match or differ from those in
    shmem_undo_range()), and fix one silliness noticed in d0823576bf4b...

    Its "index == start" addition to the hole-punch termination test was
    incomplete: it opened a way for the end condition to be missed, and the
    loop go on looking through the radix_tree, all the way to end of file.
    Fix that pessimization by resetting index when detected in inner loop.

    Note that it's actually hard to hit this case, without the obsessive
    concurrent faulting that trinity does: normally all pages are removed in
    the initial trylock_page() pass, and this loop finds nothing to do. I
    had to "#if 0" out the initial pass to reproduce bug and test fix.

    Signed-off-by: Hugh Dickins
    Cc: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Lukas Czerner
    Cc: Dave Jones
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_fault() is the actual culprit in trinity's hole-punch starvation,
    and the most significant cause of such problems: since a page faulted is
    one that then appears page_mapped(), needing unmap_mapping_range() and
    i_mmap_mutex to be unmapped again.

    But it is not the only way in which a page can be brought into a hole in
    the radix_tree while that hole is being punched; and Vlastimil's testing
    implies that if enough other processors are busy filling in the hole,
    then shmem_undo_range() can be kept from completing indefinitely.

    shmem_file_splice_read() is the main other user of SGP_CACHE, which can
    instantiate shmem pagecache pages in the read-only case (without holding
    i_mutex, so perhaps concurrently with a hole-punch). Probably it's
    silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
    ought to be safe, but might bring surprises - not a change to be rushed.

    shmem_read_mapping_page_gfp() is an internal interface used by
    drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
    shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
    called internally by the kernel (perhaps for a stacking filesystem,
    which might rely on holes to be reserved): it's unclear whether it could
    be provoked to keep hole-punch busy or not.

    We could apply the same umbrella as now used in shmem_fault() to
    shmem_file_splice_read() and the others; but it looks ugly, and use over
    a range raises questions - should it actually be per page? can these get
    starved themselves?

    The origin of this part of the problem is my v3.1 commit d0823576bf4b
    ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
    into shmem.c. It seemed like a nice idea at the time, to ensure
    (barring RCU lookup fuzziness) that there's an instant when the entire
    hole is empty; but the indefinitely repeated scans to ensure that make
    it vulnerable.

    Revert that "enhancement" to hole-punch from shmem_undo_range(), but
    retain the unproblematic rescanning when it's truncating; add a couple
    of comments there.

    Remove the "indices[0] >= end" test: that is now handled satisfactorily
    by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
    to be worth avoiding here.

    But if we do not always loop indefinitely, we do need to handle the case
    of swap swizzled back to page before shmem_free_swap() gets it: add a
    retry for that case, as suggested by Konstantin Khlebnikov; and for the
    case of page swizzled back to swap, as suggested by Johannes Weiner.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Suggested-by: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
    punched") was buggy: Sasha sent a lockdep report to remind us that
    grabbing i_mutex in the fault path is a no-no (write syscall may already
    hold i_mutex while faulting user buffer).

    We tried a completely different approach (see following patch) but that
    proved inadequate: good enough for a rational workload, but not good
    enough against trinity - which forks off so many mappings of the object
    that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
    into serious starvation when concurrent faults force the puncher to fall
    back to single-page unmap_mapping_range() searches of the i_mmap tree.

    So return to the original umbrella approach, but keep away from i_mutex
    this time. We really don't want to bloat every shmem inode with a new
    mutex or completion, just to protect this unlikely case from trinity.
    So extend the original with wait_queue_head on stack at the hole-punch
    end, and wait_queue item on the stack at the fault end.

    This involves further use of i_lock to guard against the races: lockdep
    has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
    i_lock around wake_up_bit(), which is comparable to what we do here.
    i_lock is more convenient, but we could switch to shmem's info->lock.

    This issue has been tagged with CVE-2014-4171, which will require commit
    f00cdc6df7d7 and this and the following patch to be backported: we
    suggest to 3.1+, though in fact the trinity forkbomb effect might go
    back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
    not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
    Anyone running trinity on 3.0 and earlier? I don't think we need care.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Ingo Korb reported that "repeated mapping of the same file on tmpfs
    using remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when
    the process exits".

    He bisected the bug to d7c1755179b8 ("mm: implement ->map_pages for
    shmem/tmpfs"), although the bug was actually added by commit
    8c6e50b0290c ("mm: introduce vm_ops->map_pages()").

    The problem is caused by calling do_fault_around for a _non-linear_
    fault. In this case pgoff is shifted and might become negative during
    calculation.

    Faulting around non-linear page-fault makes no sense and breaks the
    logic in do_fault_around because pgoff is shifted.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Ingo Korb
    Tested-by: Ingo Korb
    Cc: Hugh Dickins
    Cc: Sasha Levin
    Cc: Dave Jones
    Cc: Ning Qu
    Cc: "Kirill A. Shutemov"
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
    anonymous hugepage with mbind() in the kernel v3.16-rc3. This is
    because pgoff's calculation in rmap_walk_anon() fails to take
    compound_order() into account, and so produces an incorrect value.

    This patch introduces page_to_pgoff(), which gets the page's offset in
    units of PAGE_CACHE_SIZE.

    Kirill pointed out that the page cache tree should natively handle
    hugepages, and in order to make hugetlbfs fit it, page->index of a
    hugetlbfs page should be in units of PAGE_CACHE_SIZE. This is beyond this
    patch, but page_to_pgoff() contains the point to be fixed in a single
    function.
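
    A hedged sketch of the helper (close to what the patch adds to
    include/linux/pagemap.h, though details may differ): hugetlbfs keeps
    page->index in huge-page units, so scale it by compound_order().

    static inline pgoff_t page_to_pgoff(struct page *page)
    {
            if (unlikely(PageHeadHuge(page)))
                    return page->index << compound_order(page);
            else
                    return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    }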

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

23 Jul, 2014

1 commit


16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().
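
    A minimal sketch of those two standard waiters, roughly as they
    appear in kernel/sched/wait.c after this change:

    __sched int bit_wait(void *word)
    {
            /* Plain wait: the caller's 'mode' argument, not this
             * function, decides whether a pending signal errors out. */
            schedule();
            return 0;
    }

    __sched int bit_wait_io(void *word)
    {
            /* Same, but accounted as IO wait. */
            io_schedule();
            return 0;
    }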

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack will. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS now also
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

15 Jul, 2014

3 commits

  • Until now, cftype arrays carried files for both the default and legacy
    hierarchies and the files which needed to be used on only one of them
    were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
    gets confusing very quickly and we may end up exposing interface files
    to the default hierarchy without thinking it through.

    This patch makes cgroup core provide separate sets of interfaces for
    cftype handling so that the cftypes for the default and legacy
    hierarchies are clearly distinguished. The previous two patches
    renamed the existing ones so that they clearly indicate that they're
    for the legacy hierarchies. This patch adds the interfaces for the
    default hierarchy and applies them selectively depending on the
    hierarchy type.

    * cftypes added through cgroup_subsys->dfl_cftypes and
    cgroup_add_dfl_cftypes() only show up on the default hierarchy.

    * cftypes added through cgroup_subsys->legacy_cftypes and
    cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

    * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
    same array for the cases where the interface files are identical on
    both types of hierarchies (see the sketch after this list).

    * This makes all the existing subsystem interface files legacy-only by
    default and all subsystems will have no interface file created when
    enabled on the default hierarchy. Each subsystem should explicitly
    review and compose the interface for the default hierarchy.

    * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
    makes subsystems that haven't decided on the interface files for the
    default hierarchy present their legacy files on the default
    hierarchy, so that their behavior there can be tested. As the
    awkward name suggests, this is for development only.

    * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
    array isn't used on the default hierarchy. The flag is removed.
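
    A hedged illustration with a hypothetical subsystem ("foo" and its
    files are made up): when both interfaces are identical, one cftype
    array serves both fields.

    static struct cftype foo_files[] = {
            /* ... file definitions ... */
            { }     /* terminator */
    };

    struct cgroup_subsys foo_cgrp_subsys = {
            .dfl_cftypes    = foo_files,    /* default hierarchy only */
            .legacy_cftypes = foo_files,    /* legacy hierarchies only */
    };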

    v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

    v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
    as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

11 Jul, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Mostly fixes for the fallouts from the recent cgroup core changes.

    The decoupled nature of cgroup dynamic hierarchy management
    (hierarchies are created dynamically on mount but may or may not be
    reused once unmounted depending on remaining usages) led to more
    ugliness being added to kernfs.

    Hopefully, this is the last of it"

    * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: break kernfs active protection in cpuset_write_resmask()
    cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
    kernfs: introduce kernfs_pin_sb()
    cgroup: fix mount failure in a corner case
    cpuset,mempolicy: fix sleeping function called from invalid context
    cgroup: fix broken css_has_online_children()

    Linus Torvalds
     

09 Jul, 2014

2 commits

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are going to be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from the block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.
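
    A minimal sketch of the declaration (the CONFIG_MEMCG wrapping is
    the v2 fix noted below):

    struct cgroup_subsys blkio_cgrp_subsys = {
            /* ... existing callbacks ... */
    #ifdef CONFIG_MEMCG
            /* Enabling blkcg on the default hierarchy implicitly
             * enables memcg, so writeback IOs can be attributed. */
            .depends_on = 1 << memory_cgrp_id,
    #endif
    };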

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

04 Jul, 2014

5 commits

  • Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
    in isolate_lru_pages() at mm/vmscan.c:1281!

    Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") looks like interrupted work-in-progress.

    mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
    - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
    when the page is always visible in radix_tree, and often already on LRU.

    Revert change to shmem_write_begin(), and use init_page_accessed() or
    mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().

    SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
    before; but since many other filesystems use [__]page_symlink(), which did
    and does mark the page accessed, consider this as rectifying an oversight.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Until now, the kernel has had the same policy for handling victimized page
    frames that belong to kernel-space (reserved/slab-subsystem) or
    non-LRU (unknown page state). In other words, the result of handling
    either of these victimized page frames is (IGNORED | FAILED), and the
    return value of memory_failure() is -EBUSY.

    This patch avoids having memory_failure() return very soon due to
    (!PageLRU(p)) being "true", and it also ensures that
    action_result() can report more precise information ("reserved kernel",
    "kernel slab", and "unknown page state") instead of "non LRU",
    especially for memory errors which are detected by memory-scrubbing.

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Fix a regression caused by 7fc34a62ca44 ("mm/msync.c: sync only the
    requested range in msync()").

    An xfstests generic/075 failure occurred on ext4 in data=journal mode
    because the intended range was not synced, due to an incorrect fstart
    calculation.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Acked-by: Matthew Wilcox
    Reviewed-by: Lukas Czerner
    Tested-by: Lukas Czerner
    Reviewed-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namjae Jeon
     
  • min_partial means the minimum number of slabs cached in the node partial
    list. So, if nr_partial is less than it, we keep a newly empty slab on the
    node partial list rather than freeing it. But if nr_partial is equal or
    greater, it means that we have enough partial slabs and should free the
    newly empty slab. The current implementation missed the equal case, so if
    we set min_partial to 0, at least one slab could remain cached. This is a
    critical problem for the kmemcg destruction logic, which doesn't work
    properly if any slabs are cached. This patch fixes this problem.
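
    A hedged sketch of the off-by-one, with the surrounding slub code
    elided: discard an empty slab when we already hold at least
    min_partial partial slabs.

    /* The old test used '>', which wrongly kept one empty slab cached
     * when nr_partial == min_partial (e.g. both zero). */
    if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
            goto slab_empty;        /* free it instead of caching it */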

    Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
    immediately").

    Signed-off-by: Joonsoo Kim
    Acked-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.
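
    A hedged sketch of the reworked splitting (close in spirit to the
    actual fix; details may vary):

    if (pageblock_order >= MAX_ORDER) {
            unsigned i = pageblock_nr_pages;
            struct page *p = page;

            /* A pageblock exceeds the largest buddy order: hand it to
             * the allocator in MAX_ORDER-1 sized chunks. */
            do {
                    set_page_refcounted(p);
                    __free_pages(p, MAX_ORDER - 1);
                    p += MAX_ORDER_NR_PAGES;
            } while (i -= MAX_ORDER_NR_PAGES);
    } else {
            set_page_refcounted(page);
            __free_pages(page, pageblock_order);
    }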

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except for ia64, powerpc and tile at the moment, the
    "pageblock_order > MAX_ORDER" condition will be optimised out since both
    sides of the operator are constants. In cases where pageblock size is
    variable, the performance degradation should not be significant anyway,
    since init_cma_reserved_pageblock() is called only at boot time, at most
    MAX_CMA_AREAS times, which by default is eight.

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     

25 Jun, 2014

1 commit

  • When running with the kernel (3.15-rc7+), the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which was held in
    __mpol_dup()). And in cpuset_mems_allowed(), the access to the cpuset is
    under rcu_read_lock, so in __mpol_dup() we can reduce the rcu_read_lock
    protection region to protect the access to the cpuset only in
    current_cpuset_is_being_rebound(), avoiding this bug.
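
    A minimal sketch of the narrowed region (roughly what the fix does):
    the rcu_read_lock moves inside current_cpuset_is_being_rebound(), so
    __mpol_dup() no longer holds it across cpuset_mems_allowed().

    int current_cpuset_is_being_rebound(void)
    {
            int ret;

            /* Only the task_cs() dereference needs RCU protection;
             * cpuset_mems_allowed() may then sleep safely. */
            rcu_read_lock();
            ret = task_cs(current) == cpuset_being_rebound;
            rcu_read_unlock();

            return ret;
    }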

    This patch is a temporary solution that just addresses the bug
    mentioned above, can not fix the long-standing issue about cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable

    Gu Zheng
     

24 Jun, 2014

5 commits

  • In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). This detector works as described below:

    1. traverse all objects on all the slabs.
    2. determine whether it is active or not.
    3. if active, print who allocate this object.

    but that commit changed how free objects are managed, so the logic
    determining whether an object is active also changed. Before, we
    regarded objects in the cpu caches as inactive, but with this commit we
    mistakenly regard them as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts
    free memory in the slab: it unmaps the page table mapping if an object
    is free and maps it if the object is active. When the slab leak detector
    checks an object in the cpu caches, it mistakenly thinks the object is
    active and tries to access the object's memory to retrieve the caller of
    the allocation. At this point, no page table mapping to this object
    exists, so an oops occurs.

    Following is oops message reported from Dave.

    It blew up when something tried to read /proc/slab_allocators
    (Just cat it, and you should see the oops below)

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector no longer accesses freed objects and no kernel oops occurs. The
    memory overhead caused by this fix is imposed only on
    CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging, so the
    overhead isn't a big problem.

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush in
    there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's easily
    handled with an ACCESS_ONCE() before testing both conditions.

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.
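
    A hedged sketch of the new tail of mm_find_pmd() (approximate):
    read the pmd once so both tests see the same value, and fail on the
    transient states.

    pmd_t pmde;

    /* An rmap walk wants a page-table pmd only: read *pmd once so a
     * concurrent THP fault cannot change it between the two tests. */
    pmde = ACCESS_ONCE(*pmd);
    if (!pmd_present(pmde) || pmd_trans_huge(pmde))
            pmd = NULL;
    return pmd;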

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
    in copy_page_rep() called from copy_user_huge_page() called from
    do_huge_pmd_wp_page().

    I believe this is a DEBUG_PAGEALLOC false positive, due to the source
    page being split, and a tail page freed, while copy is in progress; and
    not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
    prevent a miscopy from being made visible.

    Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
    the usual get_page() and put_page() on head page in the usual config;
    but get and put references to all of the tail pages when
    DEBUG_PAGEALLOC.
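
    A hedged sketch of the get side (hypothetical shape; the actual patch
    may differ, and put_user_huge_page() mirrors it with put_page()):

    static void get_user_huge_page(struct page *page)
    {
            if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
                    struct page *endpage = page + HPAGE_PMD_NR;

                    /* Pin every subpage so a racing split cannot free a
                     * tail page while the copy is still reading it. */
                    while (page < endpage)
                            get_page(page++);
            } else {
                    get_page(page);
            }
    }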

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins