15 Aug, 2015

2 commits

  • Hugetlbfs pages get a refcount in get_any_page() or
    madvise_hwpoison() when soft offlining through madvise. The refcount
    held by the soft-offline path should be released if we fail to
    isolate hugetlbfs pages.

    Fix it by dropping the refcount on both isolation success and
    failure.

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • After trying to drain pages from the pagevec/pageset, we take a
    reference on the page again; however, that reference is not dropped
    if the page is still not on the LRU list.

    Fix it by adding a put_page() to drop the page reference taken in
    __get_any_page().

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

07 Aug, 2015

12 commits

  • The initial value of global_wb_domain.dirty_limit set by
    writeback_set_ratelimit() is zeroed out by the memset in
    wb_domain_init().

    Signed-off-by: Rabin Vincent
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
     
  • Now that the page freeing code no longer considers PageHWPoison a
    bad page, setting the flag before completing page containment
    prevents the error page from being reused right after a successful
    page migration.

    I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
    page table entry is transformed into a migration entry, not a
    hwpoison entry.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because it can happen not only for soft-offline but also
    for hard-offline. Consider that a slab page is about to be freed
    into the buddy pool, and an uncorrected memory error hits the page
    just after it enters __free_one_page(); then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) triggers, even though it need not, because
    the data on the affected page is never consumed.

    To solve it, this patch drops __PG_HWPOISON from the page flag
    checks at allocation/free time. I think this is justified because
    the __PG_HWPOISON flag is defined to prevent the page from being
    reused, and setting it outside the page's alloc-free cycle is
    designed behavior (not a bug).

    In recent months I was annoyed by a BUG_ON that fires when a
    soft-offlined page remains on an LRU cache list for a while; it is
    avoided by calling put_page() instead of putback_lru_page() in page
    migration's success path. This means that this patch reverts a major
    change of commit add05cecef80, the new refcounting rule for
    soft-offlined pages, so the "reuse window" revives. It will be
    closed by a subsequent patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The "non anonymous thp" case is still racy with thp freeing, which
    causes a panic due to put_page() on a refcount-0 page. Closing this
    race looks hard (and/or not worth doing), so let's give up error
    handling for this case.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When memory_failure() is called on a page which was just freed after
    page migration from soft offlining, the counter num_poisoned_pages
    is raised twice. So let's fix it by using TestSetPageHWPoison.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Recently I addressed a few hwpoison race problems and the patches
    were merged in v4.2-rc1. That made progress, but unfortunately some
    problems remain due to gaps in my test coverage. So I'm trying to
    fix or avoid them in this series.

    One point I expect to discuss is that patch 4/5 changes the set of
    page flags checked at free time. In the current behavior,
    __PG_HWPOISON is not supposed to be set when the page is freed. I
    think there is no strong reason for this behavior, and it causes a
    problem that is hard to fix in the error handler alone (because
    __PG_HWPOISON can be set at arbitrary times). So I suggest changing
    it.

    With this patchset, hwpoison stress testing in official mce-test
    testsuite (which previously failed) passes.

    This patch (of 5):

    In the "just unpoisoned" path, we do put_page() and then
    unlock_page(), which is the wrong order and causes a "freeing locked
    page" bug. So let's fix it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • Commit 92923ca3aace ("mm: meminit: only set page reserved in the
    memblock region") broke memory hotplug which expects the memmap for
    newly added sections to be reserved until onlined by
    online_pages_range(). This patch marks hotplugged pages as reserved
    when adding new zones.

    Signed-off-by: Mel Gorman
    Reported-by: David Vrabel
    Tested-by: David Vrabel
    Cc: Nathan Zimmer
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch fixes creation of new kmem caches after sanity_checks is
    enabled at runtime for an existing mergeable kmem cache: before this
    patch, creation fails because the unique sysfs name is already taken
    by the existing kmem cache.

    Unlike other debug options, this one doesn't change object layout
    and can be enabled and disabled at any time.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
    before basic setup") introduced a rwsem to signal completion of the
    initialization workers.

    Lockdep complains about possible recursive locking:
    =============================================
    [ INFO: possible recursive locking detected ]
    4.1.0-12802-g1dc51b8 #3 Not tainted
    ---------------------------------------------
    swapper/0/1 is trying to acquire lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0xc7/0xe6

    but task is already holding lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0x3e/0xe6

    Replace the rwsem by a completion together with an atomic
    "outstanding work counter".

    [peterz@infradead.org: Barrier removal on the grounds of being pointless]
    [mgorman@suse.de: Applied review feedback]
    Signed-off-by: Nicolai Stange
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
  • early_pfn_to_nid() was historically not SMP-safe, but it was only
    used during boot, which is inherently single-threaded, or during
    hotplug, which is protected by a giant mutex.

    With deferred memory initialisation a thread-safe version was
    introduced, and early_pfn_to_nid() triggers a BUG_ON if used
    unsafely. Memory hotplug hit that check. This patch introduces a
    lock in early_pfn_to_nid() to make it safe to use during hotplug.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Ng
    Tested-by: Alex Ng
    Acked-by: Peter Zijlstra (Intel)
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Aug, 2015

1 commit

  • Nikolay has reported a hang when a memcg reclaim got stuck with the
    following backtrace:

    PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
    #0 __schedule at ffffffff815ab152
    #1 schedule at ffffffff815ab76e
    #2 schedule_timeout at ffffffff815ae5e5
    #3 io_schedule_timeout at ffffffff815aad6a
    #4 bit_wait_io at ffffffff815abfc6
    #5 __wait_on_bit at ffffffff815abda5
    #6 wait_on_page_bit at ffffffff8111fd4f
    #7 shrink_page_list at ffffffff81135445
    #8 shrink_inactive_list at ffffffff81135845
    #9 shrink_lruvec at ffffffff81135ead
    #10 shrink_zone at ffffffff811360c3
    #11 shrink_zones at ffffffff81136eff
    #12 do_try_to_free_pages at ffffffff8113712f
    #13 try_to_free_mem_cgroup_pages at ffffffff811372be
    #14 try_charge at ffffffff81189423
    #15 mem_cgroup_try_charge at ffffffff8118c6f5
    #16 __add_to_page_cache_locked at ffffffff8112137d
    #17 add_to_page_cache_lru at ffffffff81121618
    #18 pagecache_get_page at ffffffff8112170b
    #19 grow_dev_page at ffffffff811c8297
    #20 __getblk_slow at ffffffff811c91d6
    #21 __getblk_gfp at ffffffff811c92c1
    #22 ext4_ext_grow_indepth at ffffffff8124565c
    #23 ext4_ext_create_new_leaf at ffffffff81246ca8
    #24 ext4_ext_insert_extent at ffffffff81246f09
    #25 ext4_ext_map_blocks at ffffffff8124a848
    #26 ext4_map_blocks at ffffffff8121a5b7
    #27 mpage_map_one_extent at ffffffff8121b1fa
    #28 mpage_map_and_submit_extent at ffffffff8121f07b
    #29 ext4_writepages at ffffffff8121f6d5
    #30 do_writepages at ffffffff8112c490
    #31 __filemap_fdatawrite_range at ffffffff81120199
    #32 filemap_flush at ffffffff8112041c
    #33 ext4_alloc_da_blocks at ffffffff81219da1
    #34 ext4_rename at ffffffff81229b91
    #35 ext4_rename2 at ffffffff81229e32
    #36 vfs_rename at ffffffff811a08a5
    #37 SYSC_renameat2 at ffffffff811a3ffc
    #38 sys_renameat2 at ffffffff811a408e
    #39 sys_rename at ffffffff8119e51e
    #40 system_call_fastpath at ffffffff815afa89

    Dave Chinner correctly pointed out that this is a deadlock in the
    reclaim code, because ext4 doesn't submit pages marked PG_writeback
    right away.

    The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
    with too many dirty pages") and it was applied only when may_enter_fs
    was specified. The code has been changed by c3b94f44fcb0 ("memcg:
    further prevent OOM with too many dirty pages") which has removed the
    __GFP_FS restriction with a reasoning that we do not get into the fs
    code. But this is not sufficient apparently because the fs doesn't
    necessarily submit pages marked PG_writeback for IO right away.

    ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
    submit the bio. Instead it tries to map more pages into the bio and
    mpage_map_one_extent might trigger memcg charge which might end up
    waiting on a page which is marked PG_writeback but hasn't been submitted
    yet so we would end up waiting for something that never finishes.

    Fix this issue by replacing the __GFP_IO check with a may_enter_fs
    check (for case 2) before we wait on the writeback. The page fault
    path, which is the only path that has triggered the memcg OOM
    killer since 3.12, shouldn't require GFP_NOFS, so we shouldn't
    reintroduce the premature OOM killer issue that the heuristic
    originally addressed.

    As per Dave Chinner, xfs has been doing a similar thing since 2.6.15
    already, so ext4 is not the only affected filesystem. Moreover he
    notes:

    : For example: IO completion might require unwritten extent conversion
    : which executes filesystem transactions and GFP_NOFS allocations. The
    : writeback flag on the pages can not be cleared until unwritten
    : extent conversion completes. Hence memory reclaim cannot wait on
    : page writeback to complete in GFP_NOFS context because it is not
    : safe to do so, memcg reclaim or otherwise.

    Cc: stable@vger.kernel.org # 3.9+
    [tytso@mit.edu: corrected the control flow]
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Reported-by: Nikolay Borisov
    Signed-off-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Jul, 2015

5 commits

  • In CMA, one bit in the bitmap covers 1 << order_per_bit pages, so
    the size of the bitmap is cma->count >> order_per_bit rather than
    just cma->count. This patch fixes it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Sasha Levin
    Cc: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA has an alloc/free interface for debugging. It is intended that
    alloc/free occurs in a specific CMA region, but currently the
    alloc/free interface sits in the root dir due to a bug, so we can't
    select the CMA region where the alloc/free happens.

    This patch fixes the problem by making the alloc/free interface
    per CMA region.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Cc: Sasha Levin
    Cc: Stefan Strogin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we set the wrong gfp_mask in the page_owner info for
    freepages isolated by compaction and for split pages. This causes an
    incorrect mixed-pageblock report in '/proc/pagetypeinfo'. This
    metric is really useful for measuring fragmentation effects, so it
    should be accurate. This patch fixes it by recording the correct
    information.

    Without this patch, after kernel build workload is finished, number of
    mixed pageblock is 112 among roughly 210 movable pageblocks.

    But, with this fix, output shows that mixed pageblock is just 57.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When I tested my new patches, I found that the page pointer used for
    setting page_owner information had changed. This is because the page
    pointer is advanced in a loop to set the new migratetype; after that
    work it can point out of bounds. If this stale pointer is used for
    page_owner, an access violation happens. Below is the error message
    I got.

    BUG: unable to handle kernel paging request at 0000000000b00018
    IP: [] save_stack_address+0x30/0x40
    PGD 1af2d067 PUD 166e0067 PMD 0
    Oops: 0002 [#1] SMP
    ...snip...
    Call Trace:
    print_context_stack+0xcf/0x100
    dump_trace+0x15f/0x320
    save_stack_trace+0x2f/0x50
    __set_page_owner+0x46/0x70
    __isolate_free_page+0x1f7/0x210
    split_free_page+0x21/0xb0
    isolate_freepages_block+0x1e2/0x410
    compaction_alloc+0x22d/0x2d0
    migrate_pages+0x289/0x8b0
    compact_zone+0x409/0x880
    compact_zone_order+0x6d/0x90
    try_to_compact_pages+0x110/0x210
    __alloc_pages_direct_compact+0x3d/0xe6
    __alloc_pages_nodemask+0x6cd/0x9a0
    alloc_pages_current+0x91/0x100
    runtest_store+0x296/0xa50
    simple_attr_write+0xbd/0xe0
    __vfs_write+0x28/0xf0
    vfs_write+0xa9/0x1b0
    SyS_write+0x46/0xb0
    system_call_fastpath+0x16/0x75

    This patch fixes this error by moving up set_page_owner().

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The kbuild test robot reported the following

    tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
    head: 14a6f1989dae9445d4532941bdd6bbad84f4c8da
    commit: 3b242c66ccbd60cf47ab0e8992119d9617548c23 x86: mm: enable deferred struct page initialisation on x86-64
    date: 3 days ago
    config: x86_64-randconfig-x006-201527 (attached as .config)
    reproduce:
    git checkout 3b242c66ccbd60cf47ab0e8992119d9617548c23
    # save the attached .config to linux build tree
    make ARCH=x86_64

    All warnings (new ones prefixed by >>):

    mm/page_alloc.c: In function 'early_page_uninitialised':
    >> mm/page_alloc.c:247:6: warning: unused variable 'nid' [-Wunused-variable]
    int nid = early_pfn_to_nid(pfn);

    It's due to the NODE_DATA macro ignoring the nid parameter on !NUMA
    configurations. This patch avoids the warning by not declaring nid.

    Signed-off-by: Mel Gorman
    Reported-by: Wu Fengguang
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Jul, 2015

1 commit

  • Reading the page fault handler code I noticed that under the right
    circumstances the kernel would map anonymous pages into file
    mappings: if the VMA doesn't have vm_ops->fault() and the VMA wasn't
    fully populated on ->mmap(), the kernel would handle a page fault on
    an unpopulated pte with do_anonymous_page().

    Let's change the page fault handler to use do_anonymous_page() only
    on anonymous VMAs (->vm_ops == NULL) and make sure that the VMA is
    not shared.

    For file mappings without vm_ops->fault(), or shared VMAs without
    vm_ops, a page fault on a pte_none() entry now leads to SIGBUS.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "Mainly sending this off now for the writeback fixes, since they fix a
    real regression introduced with the cgroup writeback changes. The
    NVMe fix could wait for next pull for this series, but it's simple
    enough that we might as well include it.

    This contains:

    - two cgroup writeback fixes from Tejun, fixing a user reported issue
    with luks crypt devices hanging when being closed.

    - NVMe error cleanup fix from Jon Derrick, fixing a case where we'd
    attempt to free an unregistered IRQ"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    NVMe: Fix irq freeing when queue_request_irq fails
    writeback: don't drain bdi_writeback_congested on bdi destruction
    writeback: don't embed root bdi_writeback_congested in bdi_writeback

    Linus Torvalds
     

03 Jul, 2015

1 commit

  • …scm/linux/kernel/git/paulg/linux

    Pull module_init replacement part two from Paul Gortmaker:
    "Replace module_init with appropriate alternate initcall in non
    modules.

    This series converts non-modular code that is using the module_init()
    call to hook itself into the system to instead use one of our
    alternate priority initcalls.

    Unlike the previous series that used device_initcall and hence was a
    runtime no-op, these commits change to one of the alternate initcalls,
    because (a) we have them and (b) it seems like the right thing to do.

    For example, it would seem logical to use arch_initcall for arch
    specific setup code and fs_initcall for filesystem setup code.

    This does mean however, that changes in the init ordering will be
    taking place, and so there is a small risk that some kind of implicit
    init ordering issue may lie uncovered. But I think it is still better
    to give these ones sensible priorities than to just assign them all to
    device_initcall in order to exactly preserve the old ordering.

    That said, we have already made similar changes in core kernel code in
    commit c96d6660dc65 ("kernel: audit/fix non-modular users of
    module_init in core code") without any regressions reported, so this
    type of change isn't without precedent. It has also got the same
    local testing and linux-next coverage as all the other pull requests
    that I'm sending for this merge window have got.

    Once again, there is an unused module_exit function removal that shows
    up as an outlier upon casual inspection of the diffstat"

    * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
    x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
    mm/page_owner.c: use late_initcall to hook in enabling
    lib/list_sort: use late_initcall to hook in self tests
    arm: use subsys_initcall in non-modular pl320 IPC code
    powerpc: don't use module_init for non-modular core hugetlb code
    powerpc: use subsys_initcall for Freescale Local Bus
    x86: don't use module_init for non-modular core bootflag code
    netfilter: don't use module_init/exit in core IPV4 code
    fs/notify: don't use module_init for non-modular inotify_user code
    mm: replace module_init usages with subsys_initcall in nommu.c

    Linus Torvalds
     

02 Jul, 2015

4 commits

  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because wb's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    This patch fixes the bug by updating bdi destruction to not wait for
    cong's to drain. A cong is unlinked from bdi->cgwb_congested_tree on
    bdi destruction regardless of its reference count, as the bdi may go
    away at any point after destruction. wb_congested_put() checks
    whether the cong is already unlinked on release.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
    Tested-by: Jon Christopherson
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because wb's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    To fix the bug, bdi destruction will be updated to not wait for cong's
    to drain, which naturally means that cong's may outlive the associated
    bdi. This is fine for non-root cong's but is problematic for the root
    cong's which are embedded in their bdi's as they may end up getting
    dereferenced after the containing bdi's are freed.

    This patch makes root cong's behave the same as non-root cong's. They
    are no longer embedded in their bdi's but allocated separately during
    bdi initialization, indexed and reference counted the same way.

    * As cong handling is the same for all wb's, wb->congested
    initialization is moved into wb_init().

    * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
    bdi->wb_congested is now a pointer pointing to the root cong
    allocated during bdi init and minimal refcnting operations are
    implemented.

    * The above makes root wb init paths diverge depending on
    CONFIG_CGROUP_WRITEBACK. root wb init is moved to cgwb_bdi_init().

    This patch in itself shouldn't cause any consequential behavior
    differences but prepares for the actual fix.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Tested-by: Jon Christopherson

    Added include to backing-dev.h for kfree() definition.

    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Merge third patchbomb from Andrew Morton:

    - the rest of MM

    - scripts/gdb updates

    - ipc/ updates

    - lib/ updates

    - MAINTAINERS updates

    - various other misc things

    * emailed patches from Andrew Morton : (67 commits)
    genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
    genalloc: rename dev_get_gen_pool() to gen_pool_get()
    x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
    MAINTAINERS: add zpool
    MAINTAINERS: BCACHE: Kent Overstreet has changed email address
    MAINTAINERS: move Jens Osterkamp to CREDITS
    MAINTAINERS: remove unused nbd.h pattern
    MAINTAINERS: update brcm gpio filename pattern
    MAINTAINERS: update brcm dts pattern
    MAINTAINERS: update sound soc intel patterns
    MAINTAINERS: remove website for paride
    MAINTAINERS: update Emulex ocrdma email addresses
    bcache: use kvfree() in various places
    libcxgbi: use kvfree() in cxgbi_free_big_mem()
    target: use kvfree() in session alloc and free
    IB/ehca: use kvfree() in ipz_queue_{cd}tor()
    drm/nouveau/gem: use kvfree() in u_free()
    drm: use kvfree() in drm_free_large()
    cxgb4: use kvfree() in t4_free_mem()
    cxgb3: use kvfree() in cxgb_free_mem()
    ...

    Linus Torvalds
     
  • Avoid the warning:

    WARNING: mm/built-in.o(.text.unlikely+0xc22): Section mismatch in reference from the function .new_kmalloc_cache() to the variable .init.rodata:kmalloc_info
    The function .new_kmalloc_cache() references
    the variable __initconst kmalloc_info.

    Signed-off-by: Christoph Lameter
    Reported-by: Stephen Rothwell
    Tested-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2015

12 commits

  • Waiman Long reported that 24TB machines hit OOM during basic setup
    when struct page initialisation was deferred. One approach is to
    initialise memory on demand, but it interferes with page allocator
    paths. This patch creates dedicated threads to initialise memory
    before basic setup. It then blocks on a rw_semaphore until
    completion, as a wait_queue plus counter would be overkill. This may
    be slower to boot, but it's simpler overall and also gets rid of the
    section mangling that existed so kswapd could do the initialisation.

    [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
    Signed-off-by: Mel Gorman
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Scott Norton
    Tested-by: Daniel J Blueman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
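
    The scheme above -- per-node worker threads initialising the memmap while
    the boot thread blocks until they finish -- can be sketched as follows. This
    is a minimal Python analogue of the approach, not kernel code; all names
    (`deferred_init_memmap`, `page_alloc_init_late`) are borrowed for
    illustration only, and `threading.Thread.join()` stands in for waiting on
    the rw_semaphore.

```python
import threading

def deferred_init_memmap(memmap, start_pfn, end_pfn):
    """Worker: initialise one node's range of the simulated memmap."""
    for pfn in range(start_pfn, end_pfn):
        memmap[pfn] = {"pfn": pfn, "initialised": True}

def page_alloc_init_late(memmap, node_ranges):
    """Spawn one initialisation thread per node, then block until all
    of them complete -- the analogue of waiting on the rw_semaphore."""
    threads = [
        threading.Thread(target=deferred_init_memmap, args=(memmap, s, e))
        for s, e in node_ranges
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # basic setup proceeds only once every range is done
```

    Joining every worker before returning gives the same guarantee as the
    rw_semaphore in the patch: nothing after `page_alloc_init_late()` can
    observe a partially initialised memmap.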
     
  • mminit_verify_page_links() is an extremely paranoid check that was
    introduced when memory initialisation was being heavily reworked.
    Profiles indicated that up to 10% of parallel memory initialisation was
    spent on checking this for every page. The cost could be reduced but in
    practice this check only found problems very early during the
    initialisation rewrite and has found nothing since. This patch removes an
    expensive unnecessary check.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    During parallel struct page initialisation, ranges are checked for every
    PFN unnecessarily, which increases boot times. This patch alters when the
    ranges are checked.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
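
    The optimisation can be illustrated with a small sketch: instead of asking
    "is this PFN in a valid range?" once per page, clip each valid range
    against the span being initialised once, then walk the intersection with
    no per-PFN checks. The function names below are hypothetical; this is an
    algorithmic analogue, not the kernel implementation.

```python
def pfn_valid_naive(pfn, valid_ranges):
    """Per-PFN membership test: what the old code paid for every page."""
    return any(s <= pfn < e for s, e in valid_ranges)

def init_pfns_naive(start, end, valid_ranges):
    """Check the ranges once per PFN (the slow pattern)."""
    return [pfn for pfn in range(start, end)
            if pfn_valid_naive(pfn, valid_ranges)]

def init_pfns_hoisted(start, end, valid_ranges):
    """Clip each valid range against [start, end) once, then walk the
    intersection -- no membership test inside the hot loop."""
    out = []
    for s, e in sorted(valid_ranges):
        lo, hi = max(start, s), min(end, e)
        out.extend(range(lo, hi))
    return out
```

    Both functions visit the same PFNs; the hoisted version does O(ranges)
    comparisons instead of O(pages * ranges).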
     
    Parallel struct page initialisation frees pages one at a time. Try to
    free pages as single large pages where possible.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
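
    The idea of freeing in large chunks rather than page by page amounts to
    covering a PFN range with the largest naturally aligned power-of-two
    blocks, each of which can then be handed to the buddy allocator wholesale.
    A hedged sketch of that splitting logic (hypothetical helper names, not
    the kernel's code):

```python
def alignment_order(pfn, max_order):
    """Largest order such that pfn is naturally aligned to 1 << order."""
    if pfn == 0:
        return max_order
    return min(max_order, (pfn & -pfn).bit_length() - 1)

def split_into_blocks(start_pfn, nr_pages, max_order=10):
    """Cover [start_pfn, start_pfn + nr_pages) with the largest aligned
    power-of-two blocks, so each block can be freed in one operation
    instead of one page at a time."""
    blocks, pfn, end = [], start_pfn, start_pfn + nr_pages
    while pfn < end:
        order = alignment_order(pfn, max_order)
        while (1 << order) > end - pfn:
            order -= 1  # shrink to fit the remaining span
        blocks.append((pfn, order))
        pfn += 1 << order
    return blocks
```

    For example, a 13-page span starting at PFN 0 becomes one order-3 block,
    one order-2 block, and one single page, i.e. 3 free operations instead
    of 13.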
     
  • Deferred struct page initialisation is using pfn_to_page() on every PFN
    unnecessarily. This patch minimises the number of lookups and scheduler
    checks.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
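
    The pattern being fixed can be sketched as: resolve the struct page once
    at the start of the range, then advance a cursor in lockstep with the PFN,
    and poll the scheduler only periodically instead of per page. This is a
    Python stand-in with hypothetical names, not the kernel code;
    `pfn_to_page()` here is just an indexed lookup playing the role of the
    comparatively costly real one.

```python
def pfn_to_page(memmap, pfn):
    """Stand-in for the real pfn_to_page(); imagine this is costly."""
    return memmap[pfn]

def init_range(memmap, start_pfn, end_pfn, resched_interval=2048):
    """One pfn_to_page() call for the whole range: resolve the first
    struct page, then walk an index alongside the PFN. Scheduler checks
    happen only every `resched_interval` pages, not on each iteration."""
    _first = pfn_to_page(memmap, start_pfn)  # the single expensive lookup
    idx = start_pfn                          # cursor advanced with pfn
    polls = 0
    for pfn in range(start_pfn, end_pfn):
        memmap[idx] = {"pfn": pfn, "initialised": True}
        idx += 1
        if pfn % resched_interval == 0:
            polls += 1                       # cond_resched() analogue
    return polls
```

    Over a 100-page range with an interval of 32 the loop polls only 4 times
    rather than 100, while still initialising every page.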
     
    Only a subset of struct pages are initialised at the moment. When this
    patch is applied, kswapd initialises the remaining struct pages in
    parallel.

    This should boot faster by spreading the work to multiple CPUs and
    initialising data that is local to the CPU. The user-visible effect on
    large machines is that free memory will appear to rapidly increase early
    in the lifetime of the system until kswapd reports that all memory is
    initialised in the kernel log. Once initialised, there should be no
    other user-visible effects.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    This patch initialises all low memory struct pages and 2G of the highest
    zone on each node during memory initialisation if
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set
    yet but will be available in a later patch. Parallel initialisation of
    struct pages depends on some features from memory hotplug and it is
    necessary to alter section annotations.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
    unnecessarily visible outside memory initialisation. Besides the
    unnecessary visibility, they add function call overhead when initialising
    pages. This patch moves the helpers inline.

    [akpm@linux-foundation.org: fix build]
    [mhocko@suse.cz: fix build]
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    __early_pfn_to_nid() uses static variables to cache recent lookups, as
    memblock lookups are very expensive, but it assumes that memory
    initialisation is single-threaded. Parallel initialisation of struct
    pages will break that assumption, so this patch makes __early_pfn_to_nid()
    SMP-safe by requiring the caller to cache recent search information.
    early_pfn_to_nid() keeps the same interface but is only safe to use early
    in boot due to the use of a global static variable. meminit_pfn_in_nid()
    is an SMP-safe version for which callers must maintain their own state.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
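
    Moving the cache from function-level statics into caller-owned state is
    what makes the lookup SMP-safe: each thread keeps its own last-hit range,
    so there is no shared mutable cache to race on. A minimal Python sketch of
    that shape (the `memblock_search` helper and the dict-based state are
    illustrative stand-ins, not the kernel's data structures):

```python
def memblock_search(pfn, nid_ranges):
    """Stand-in for the expensive memblock lookup: find which node's
    range contains this PFN, returning (start, end, nid) or None."""
    for start, end, nid in nid_ranges:
        if start <= pfn < end:
            return start, end, nid
    return None

def meminit_pfn_in_nid(pfn, nid, state, nid_ranges):
    """SMP-safe variant: the recent-lookup cache lives in the
    caller-owned `state` dict, not in function-level statics, so each
    thread of initialisation carries its own cache."""
    if state["start"] <= pfn < state["end"]:
        return state["nid"] == nid           # cache hit, no search
    found = memblock_search(pfn, nid_ranges)
    if found is None:
        return False
    state["start"], state["end"], state["nid"] = found
    return state["nid"] == nid
```

    A caller initialises its state to an empty sentinel range and reuses it
    across calls; consecutive PFNs from the same node then skip the search
    entirely.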
     
    __free_pages_bootmem() prepares a page for release to the buddy allocator
    and assumes that the struct page is initialised. Parallel initialisation
    defers struct page initialisation, so __free_pages_bootmem() can be called
    for struct pages whose struct page-to-PFN mapping is not yet valid. This
    patch passes the PFN to __free_pages_bootmem() with no other functional
    change.

    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Nate Zimmer
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently each page struct is set as reserved upon initialization. This
    patch leaves the reserved bit clear and only sets the reserved bit when it
    is known the memory was allocated by the bootmem allocator. This makes it
    easier to distinguish between uninitialised struct pages and reserved
    struct pages in later patches.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • Currently, memmap_init_zone() has all the smarts for initializing a single
    page. A subset of this is required for parallel page initialisation and
    so this patch breaks up the monolithic function in preparation.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Signed-off-by: Mel Gorman
    Tested-by: Nate Zimmer
    Tested-by: Waiman Long
    Tested-by: Daniel J Blueman
    Acked-by: Pekka Enberg
    Cc: Robin Holt
    Cc: Dave Hansen
    Cc: Waiman Long
    Cc: Scott Norton
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
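
    The shape of the split can be sketched briefly: pull the per-page work out
    into a helper, and leave the zone loop as a thin driver that deferred
    initialisation can later bypass or call piecemeal. The field names below
    are invented for illustration; only the idea of the extraction mirrors the
    patch.

```python
def init_single_page(memmap, pfn, zone, nid):
    """Extracted per-page helper: everything required to initialise
    exactly one struct page lives here."""
    memmap[pfn] = {"pfn": pfn, "zone": zone, "nid": nid, "refcount": 1}

def memmap_init_zone(memmap, start_pfn, end_pfn, zone, nid):
    """After the split, the zone loop just drives the per-page helper,
    which parallel initialisation can also invoke on its own ranges."""
    for pfn in range(start_pfn, end_pfn):
        init_single_page(memmap, pfn, zone, nid)
```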