15 Jan, 2020

25 commits

  • The only possible values are nfs_fill_super and nfs_clone_super. The
    latter is used only when crossing into a submount and it is almost
    identical to the former; the only differences are
    * ->s_time_gran unconditionally set to 1 (even for v2 mounts).
    Regression dating back to 2012, actually.
    * ->s_blocksize/->s_blocksize_bits set to that of parent.

    Rather than messing with the method, stash ->s_blocksize_bits in
    mount_info in submount case and after the (now unconditional)
    call of nfs_fill_super() override ->s_blocksize/->s_blocksize_bits
    if that has been set.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • pick it from mount_info

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Make it static, even. And remove a stale extern of (long-gone)
    nfs_xdev_mount_common() from internal.h, while we are at it.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • they are identical now...

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • That will allow to get rid of passing those references around in
    quite a few places. Moreover, that will allow to merge xdev and
    remote file_system_type.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Do it in nfs_do_submount() instead. As a side benefit, nfs_clone_data
    doesn't need ->fh and ->fattr anymore.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • nothing in it will be looking at that thing anyway

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • They are identical now.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Do that (fhandle allocation, setting struct server up) in
    nfs4_referral_mount() and nfs4_try_mount() resp. and pass the
    server and pointer to mount_info into nfs_do_root_mount() so that
    nfs4_remote_referral_mount()/nfs_remote_mount() could be merged.

    Since we are moving stuff from ->mount() instances to the points
    prior to vfs_kern_mount() that would trigger those, we need to
    make sure that nfs_do_root_mount() will do the corresponding
    cleanup itself if it doesn't trigger those ->mount() instances.

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Allow it to take ERR_PTR() for server and return ERR_CAST() of it in
    such case. All callers used to open-code that...

    Reviewed-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Anna Schumaker

    Al Viro
     
  • Pull NFS client bugfixes from Anna Schumaker:
    "Three NFS over RDMA fixes for bugs Chuck found that can be hit during
    device removal:

    - Fix create_qp crash on device unload

    - Fix completion wait during device removal

    - Fix oops in receive handler after device removal"

    * tag 'nfs-for-5.5-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    xprtrdma: Fix oops in Receive handler after device removal
    xprtrdma: Fix completion wait during device removal
    xprtrdma: Fix create_qp crash on device unload

    Linus Torvalds
     
  • Since v5.4, a device removal occasionally triggered this oops:

    Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
    Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
    Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
    Dec 2 17:13:53 manet kernel: PGD 0 P4D 0
    Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
    Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883
    Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
    Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
    Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
    Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
    Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
    Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
    Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
    Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
    Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
    Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
    Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
    Dec 2 17:13:53 manet kernel: Call Trace:
    Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
    Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
    Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
    Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
    Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9
    Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
    Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

    The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
    is still pointing to the old ib_device, which has been freed. The
    only way that is possible is if this rpcrdma_rep was not destroyed
    by rpcrdma_ia_remove.

    Debugging showed that was indeed the case: this rpcrdma_rep was
    still in use by a completing RPC at the time of the device removal,
    and thus wasn't on the rep free list. So, it was not found by
    rpcrdma_reps_destroy().

    The fix is to introduce a list of all rpcrdma_reps so that they all
    can be found when a device is removed. That list is used to perform
    only regbuf DMA unmapping, replacing that call to
    rpcrdma_reps_destroy().

    Meanwhile, to prevent corruption of this list, I've moved the
    destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
    rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
    not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
    protecting the rb_all_reps list.
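The difference between walking only the free list and walking a list of every rep can be illustrated with a small userspace model. This is a sketch in Python, not the xprtrdma code: `Rep`, `RepPool`, and the method names are hypothetical, and "DMA mapping" is reduced to a boolean flag.

```python
class Rep:
    """Hypothetical stand-in for an rpcrdma_rep; only tracks DMA mapping state."""

    def __init__(self, name):
        self.name = name
        self.dma_mapped = True


class RepPool:
    """Toy pool that keeps both a free list and a list of every rep."""

    def __init__(self, names):
        self.all_reps = [Rep(n) for n in names]  # every rep ever created
        self.free_reps = list(self.all_reps)     # reps not held by an RPC

    def get(self):
        """Hand a rep to a (pretend) in-flight RPC, removing it from the free list."""
        return self.free_reps.pop()

    def unmap_free_only(self):
        """Old behaviour in this model: walk the free list only."""
        for rep in self.free_reps:
            rep.dma_mapped = False

    def unmap_all(self):
        """New behaviour in this model: walk the list of all reps."""
        for rep in self.all_reps:
            rep.dma_mapped = False


pool = RepPool(["rep0", "rep1", "rep2"])
in_flight = pool.get()

pool.unmap_free_only()
print(in_flight.dma_mapped)   # still mapped: the free-list walk missed it

pool.unmap_all()
print(in_flight.dma_mapped)   # now unmapped
```

In this model the rep held by the in-flight RPC is only reached once the walk covers the list of all reps, which is the situation the commit message describes.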

    Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
    I've found that on occasion, "rmmod" will hang if an NFS mount is
    under load.

    Ensure that ri_remove_done is initialized only just before the
    transport is woken up to force a close. This avoids the completion
    possibly getting initialized again while the CM event handler is
    waiting for a wake-up.

    Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • On device re-insertion, the RDMA device driver crashes trying to set
    up a new QP:

    Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
    Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
    Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
    Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
    Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
    Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852
    Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
    Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
    Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
    Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
    Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
    Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
    Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
    Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
    Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
    Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
    Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
    Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
    Nov 27 16:32:06 manet kernel: Call Trace:
    Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
    Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
    Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
    Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
    Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

    The fix is to copy the qp_init_attr struct that was just created by
    rpcrdma_ep_create() instead of using the one from the previous
    connection instance.

    Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect")
    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Pull parisc fixes from Helge Deller:
    "A boot crash fix by Mike Rapoport and a printk fix by Krzysztof
    Kozlowski"

    * 'parisc-5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc: fix map_pages() to actually populate upper directory
    parisc: Use proper printk format for resource_size_t

    Linus Torvalds
     
  • Pull asm-generic fixes from Arnd Bergmann:
    "Here are two bugfixes from Mike Rapoport, both fixing compile-time
    errors for the nds32 architecture that were recently introduced"

    * tag 'asm-generic-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    nds32: fix build failure caused by page table folding updates
    asm-generic/nds32: don't redefine cacheflush primitives

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Two simple fixes in the upper drivers (so both fairly core), one in
    enclosures, which fixes replugging a device into an enclosure slot and
    one in the disk driver which fixes revalidating a drive with
    protection information (PI) to make it a non-PI drive ... previously
    we were still remembering the old PI state.

    Both fixed issues are quite rare in the field"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: enclosure: Fix stale device oops with hot replug
    scsi: sd: Clear sdkp->protection_type if disk is reformatted without PI

    Linus Torvalds
     
  • Merge misc fixes from David Howells.

    Two afs fixes and a key refcounting fix.

    * dhowells:
    afs: Fix afs_lookup() to not clobber the version on a new dentry
    afs: Fix use-after-loss-of-ref
    keys: Fix request_key() cache

    Linus Torvalds
     
  • Fix afs_lookup() to not clobber the version set on a new dentry by
    afs_do_lookup() - especially as it's using the wrong version of the
    version (we need to use the one given to us by whatever op the dir
    contents correspond to rather than what's in the afs_vnode).

    Fixes: 9dd0b82ef530 ("afs: Fix missing dentry data version updating")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • afs_lookup() has a tracepoint to indicate the outcome of
    d_splice_alias(), passing it the inode to retrieve the fid from.
    However, the function gave up its ref on that inode when it called
    d_splice_alias(), which may have failed and dropped the inode.

    Fix this by caching the fid.

    Fixes: 80548b03991f ("afs: Add more tracepoints")
    Reported-by: Al Viro
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • When the key cached by request_key() and co. is cleaned up on exit(),
    the code looks in the wrong task_struct, and so clears the wrong cache.
    This leads to anomalies in key refcounting when doing, say, a kernel
    build on an afs volume, that then trigger kasan to report a
    use-after-free when the key is viewed in /proc/keys.

    Fix this by making exit_creds() look in the passed-in task_struct rather
    than in current (the task_struct cleanup code is deferred by RCU and
    potentially run in another task).
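The "wrong task_struct" problem can be shown with a toy model. This is a Python sketch under stated assumptions: `Task`, `exit_creds_buggy`, and `exit_creds_fixed` are hypothetical stand-ins, and `current` is passed explicitly to make it visible which task is running the deferred cleanup.

```python
class Task:
    """Hypothetical stand-in for task_struct with a one-entry key cache."""

    def __init__(self, name, cached_key=None):
        self.name = name
        self.cached_requested_key = cached_key


def exit_creds_buggy(tsk, current):
    """Clears the cache of whichever task happens to be running."""
    current.cached_requested_key = None


def exit_creds_fixed(tsk, current):
    """Clears the cache of the task that is actually being cleaned up."""
    tsk.cached_requested_key = None


dying = Task("dying", cached_key="key-A")
runner = Task("runner", cached_key="key-B")   # task that runs the deferred cleanup

exit_creds_buggy(dying, current=runner)
print(dying.cached_requested_key, runner.cached_requested_key)

dying = Task("dying", cached_key="key-A")
runner = Task("runner", cached_key="key-B")
exit_creds_fixed(dying, current=runner)
print(dying.cached_requested_key, runner.cached_requested_key)
```

In the buggy variant the dying task keeps its cached key while an unrelated task loses its own, which matches the refcounting anomaly described above.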

    Fixes: 7743c48e54ee ("keys: Cache result of request_key*() temporarily in task_struct")
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Merge misc fixes from Andrew Morton:
    "11 mm fixes"

    * emailed patches from Andrew Morton:
    mm: khugepaged: add trace status description for SCAN_PAGE_HAS_PRIVATE
    mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
    mm/page-writeback.c: improve arithmetic divisions
    mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide
    mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
    mm, debug_pagealloc: don't rely on static keys too early
    mm: memcg/slab: fix percpu slab vmstats flushing
    mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment
    mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
    mm/memory_hotplug: don't free usage map when removing a re-added early section
    mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations

    Linus Torvalds
     

14 Jan, 2020

14 commits

  • The commit d96885e277b5 ("parisc: use pgtable-nopXd instead of
    4level-fixup") converted PA-RISC to use folded page tables, but it missed
    the conversion of pgd_populate() to pud_populate() in the map_pages()
    function. This caused the upper page table directory to remain empty and
    the system would crash as a result.

    Using pud_populate() that actually populates the page table instead of
    dummy pgd_populate() fixes the issue.

    Fixes: d96885e277b5 ("parisc: use pgtable-nopXd instead of 4level-fixup")
    Reported-by: Meelis Roos
    Reported-by: Jeroen Roovers
    Reported-by: Mikulas Patocka
    Tested-by: Jeroen Roovers
    Tested-by: Mikulas Patocka
    Signed-off-by: Mike Rapoport
    Signed-off-by: Helge Deller

    Mike Rapoport
     
  • resource_size_t should be printed with its own size-independent format
    to fix warnings when compiling on 64-bit platform (e.g. with
    COMPILE_TEST):

    arch/parisc/kernel/drivers.c: In function 'print_parisc_device':
    arch/parisc/kernel/drivers.c:892:9: warning:
    format '%p' expects argument of type 'void *',
    but argument 4 has type 'resource_size_t {aka unsigned int}' [-Wformat=]

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Helge Deller

    Krzysztof Kozlowski
     
  • Merge Intel Gen9 graphics fix from Akeem Abodunrin:
    "Insufficient control flow in certain data structures for some Intel
    Processors with Intel Processor Graphics may allow an unauthenticated
    user to potentially enable information disclosure via local access

    This provides mitigation for Gen9 hardware. Note that Gen8 is not
    impacted due to a previously implemented workaround.

    The mitigation involves using an existing hardware feature to forcibly
    clear down all EU state at each context switch"

    * tag 'Intel-CVE-2019-14615' of emailed bundle from Akeem G Abodunrin:
    drm/i915/gen9: Clear residual context state on context switch

    Linus Torvalds
     
  • Commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem)
    FS") introduced a new khugepaged scan result: SCAN_PAGE_HAS_PRIVATE, but
    the corresponding description for trace events was not added.

    Link: http://lkml.kernel.org/r/1574793844-2914-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Yang Shi
    Cc: Song Liu
    Cc: Kirill A. Shutemov
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When booting with amd_iommu=off, the following WARNING message
    appears:

    AMD-Vi: AMD IOMMU disabled on kernel command-line
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
    Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
    RIP: 0010:flush_workqueue+0x42e/0x450
    Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
    Call Trace:
    kmem_cache_destroy+0x69/0x260
    iommu_go_to_state+0x40c/0x5ab
    amd_iommu_prepare+0x16/0x2a
    irq_remapping_prepare+0x36/0x5f
    enable_IR_x2apic+0x21/0x172
    default_setup_apic_routing+0x12/0x6f
    apic_intr_mode_init+0x1a1/0x1f1
    x86_late_time_init+0x17/0x1c
    start_kernel+0x480/0x53f
    secondary_startup_64+0xb6/0xc0
    ---[ end trace 30894107c3749449 ]---
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled

    The warning is caused by the calling of 'kmem_cache_destroy()'
    in free_iommu_resources(). Here is the call path:

    free_iommu_resources
    kmem_cache_destroy
    flush_memcg_workqueue
    flush_workqueue

    The root cause is that the IOMMU subsystem runs before the workqueue
    subsystem is up, so the variable 'wq_online' is still 'false'. As a
    result, the check 'if (WARN_ON(!wq_online))' in flush_workqueue()
    triggers.

    Since the variable 'memcg_kmem_cache_wq' is not yet allocated at that
    point, it is unnecessary to call flush_memcg_workqueue(). Skipping the
    call prevents the WARNING message triggered by flush_workqueue().

    Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
    Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Adrian Huang
    Reported-by: Xiaochun Lee
    Reviewed-by: Shakeel Butt
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Huang
     
  • Use div64_ul() instead of do_div() if the divisor is unsigned long, to
    avoid truncation to 32-bit on 64-bit platforms.

    Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
    The two variables 'numerator' and 'denominator' are declared as long,
    but they should actually be unsigned long (according to the
    implementation of the fprop_fraction_percpu() function).

    And do_div() does a 64-by-32 division, while the divisor 'denominator'
    is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
    function to call is div64_ul().

    Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
    by zero in avg_atom() calculation"). Then, while reviewing recently
    analyzed mm code, we found this suspicious place:

    201 if (min) {
    202 min *= this_bw;
    203 do_div(min, tot_bw);
    204 }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The divisor 'tot_bw' is unsigned long, but do_div() truncates it to
    32 bits, which means it can test non-zero and still be truncated to
    zero for the division. Fix this issue by using div64_ul() instead.
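The truncation described above can be modeled in userspace. This is a minimal Python sketch, not kernel code: `do_div_model` and `div64_ul_model` are hypothetical names that mimic a 64-by-32 division (divisor masked to its low 32 bits) and a full 64-by-64 division, and the sample values are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def do_div_model(dividend, divisor):
    """Model a 64-by-32 divide: only the low 32 bits of the divisor are used."""
    truncated = divisor & MASK32
    if truncated == 0:
        raise ZeroDivisionError("divisor truncated to zero")
    return (dividend & MASK64) // truncated


def div64_ul_model(dividend, divisor):
    """Model a 64-by-64 divide: the full divisor is used."""
    return (dividend & MASK64) // (divisor & MASK64)


tot_bw = 1 << 32             # non-zero as a 64-bit value, zero in its low 32 bits
scaled_min = 5 * (1 << 33)   # stands in for min * this_bw

print(div64_ul_model(scaled_min, tot_bw))   # full-width divide succeeds
try:
    do_div_model(scaled_min, tot_bw)
except ZeroDivisionError as exc:
    print("do_div model:", exc)
```

A divisor with non-zero bits only above bit 31 passes a "not zero" test as a 64-bit value but becomes zero once truncated. A divisor with bits set both above and below bit 31 silently yields a wrong quotient instead.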

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but same is not
    true for e.g. ppc64 and s390 where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently slab percpu vmstats are flushed twice: during the memcg
    offlining and just before freeing the memcg structure. Each time percpu
    counters are summed, added to the atomic counterparts and propagated up
    by the cgroup tree.

    The second flushing is required due to how recursive vmstats are
    implemented: counters are batched in percpu variables on a local level,
    and once a percpu value is crossing some predefined threshold, it spills
    over to atomic values on the local and each ascendant levels. It means
    that without flushing some numbers cached in percpu variables will be
    dropped on floor each time a cgroup is destroyed. And with uptime the
    error on upper levels might become noticeable.

    The first flushing aims to make counters on ancestor levels more
    precise. Dying cgroups may remain in the dying state for a long time.
    After kmem_cache reparenting which is performed during the offlining
    slab counters of the dying cgroup don't have any chances to be updated,
    because any slab operations will be performed on the parent level. It
    means that the inaccuracy caused by percpu batching will not decrease up
    to the final destruction of the cgroup. By the original idea flushing
    slab counters during the offlining should minimize the visible
    inaccuracy of slab counters on the parent level.

    The problem is that percpu counters are not zeroed after the first
    flushing. So every cached percpu value is summed twice. It creates a
    small error (up to 32 pages per cpu, but usually less) which accumulates
    on parent cgroup level. After creating and destroying thousands of
    child cgroups, the slab counter on the parent level can be way off the
    real value.

    For now, let's just stop flushing slab counters on memcg offlining. It
    can't be done correctly without scheduling a work on each cpu: reading
    and zeroing it during css offlining can race with an asynchronous
    update, which doesn't expect values to be changed underneath.

    With this change, slab counters on parent level will become eventually
    consistent. Once all dying children are gone, values are correct. And
    if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
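The double counting can be reproduced with a small model. This is a Python sketch under simplifying assumptions: `flush` and `destroy_cgroup` are hypothetical names, the per-cpu deltas are made-up numbers, and the batching threshold and the cgroup hierarchy are left out.

```python
def flush(percpu, parent_total):
    """Add the sum of per-cpu cached deltas to the parent's total.

    The per-cpu values are deliberately left untouched, mirroring the
    problem described above (no zeroing after the flush).
    """
    return parent_total + sum(percpu)


def destroy_cgroup(percpu, parent_total, flush_on_offline):
    """Return the parent's total after a child cgroup is torn down."""
    if flush_on_offline:
        parent_total = flush(percpu, parent_total)  # first flush (offlining)
    return flush(percpu, parent_total)              # final flush (freeing)


cached = [3, 0, 7, 1]    # hypothetical per-cpu deltas still below the threshold
true_total = sum(cached)

print(destroy_cgroup(cached, 0, flush_on_offline=True) - true_total)   # over-count
print(destroy_cgroup(cached, 0, flush_on_offline=False) - true_total)  # no error
```

With the offlining flush in place, every cached per-cpu value reaches the parent twice. Dropping that flush leaves only the final one, so the parent ends up with the correct sum once the child is freed.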

    It's not perfect, as slabs are reparented, so any updates after the
    reparenting will happen on the parent level. It means that if a slab
    page was allocated, a counter on child level was bumped, then the page
    was reparented and freed, the annihilation of positive and negative
    counter values will not happen until the child cgroup is released. It
    makes slab counters different from others, and we might want to
    implement flushing in a correct form again. But it's also a question of
    performance: scheduling a work on each cpu isn't free, and it's an open
    question if the benefit of having more accurate counters is worth it.

    We might also consider flushing all counters on offlining, not only slab
    counters.

    So let's fix the main problem now: make the slab counters eventually
    consistent, so at least the error won't grow with uptime (or more
    precisely the number of created and destroyed cgroups). And think about
    the accuracy of counters separately.

    Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
    Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
    enabled. But it doesn't work well with above-47bit hint address.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
    shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
    *any* hint address is specified.

    This can be fixed by requesting the aligned area if we failed to
    allocate at the user-specified hint address. The request with inflated
    length will also take the user-specified hint address. This way we will
    not lose an allocation request from the full address space.

    [kirill@shutemov.name: fold in a fixup]
    Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
    Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Cc: "Willhalm, Thomas"
    Cc: Dan Williams
    Cc: "Bruggeman, Otto G"
    Cc: "Aneesh Kumar K . V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Patch series "Fix two above-47bit hint address vs. THP bugs".

    The two get_unmapped_area() implementations have to be fixed to provide
    THP-friendly mappings if above-47bit hint address is specified.

    This patch (of 2):

    Filesystems use thp_get_unmapped_area() to provide THP-friendly
    mappings. For DAX in particular.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks thp_get_unmapped_area(): the function
    would not try to allocate PMD-aligned area if *any* hint address is
    specified.

    Modify the routine to handle it correctly:

    - Try to allocate the space at the specified hint address with length
    padding required for PMD alignment.
    - If failed, retry without length padding (but with the same hint
    address);
    - If the returned address matches the hint address return it.
    - Otherwise, align the address as required for THP and return.

    The user-specified hint address is passed down to get_unmapped_area(),
    so an above-47bit hint address is taken into account without breaking
    alignment requirements.
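
    The four steps above can be modelled in plain userspace C. This is a
    simplified sketch, not the kernel code: it assumes a file offset of 0
    and a 2 MiB PMD, and thp_area_sketch(), area_fn and the three toy
    allocators are hypothetical names standing in for get_unmapped_area().

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE ((uint64_t)2 << 20)	/* assumed PMD size: 2 MiB */

/* Hypothetical stand-in for get_unmapped_area(); returns 0 on failure. */
typedef uint64_t (*area_fn)(uint64_t hint, uint64_t len);

static uint64_t thp_area_sketch(area_fn get_area, uint64_t hint, uint64_t len)
{
	uint64_t len_pad = len + PMD_SIZE;
	uint64_t ret = 0;

	assert(len > 0);

	/* Step 1: try the hint with padding (skipped if the length overflows). */
	if (len_pad > len)
		ret = get_area(hint, len_pad);

	/* Step 2: on failure, retry at the same hint without padding. */
	if (ret == 0)
		return get_area(hint, len);

	/* Step 3: the hint was honoured, so return it unchanged. */
	if (ret == hint)
		return ret;

	/* Step 4: round up to a PMD boundary inside the padded area. */
	return (ret + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
}

/* Toy allocators, used only to exercise the sketch. */
static uint64_t honour_hint(uint64_t hint, uint64_t len)
{
	(void)len;
	return hint;
}

static uint64_t ignore_hint(uint64_t hint, uint64_t len)
{
	(void)hint;
	(void)len;
	return 0x7f0000001000ULL;	/* page-aligned, not PMD-aligned */
}

static uint64_t small_only(uint64_t hint, uint64_t len)
{
	(void)hint;
	/* Pretend only requests of up to 4 MiB can be satisfied. */
	return len <= 2 * PMD_SIZE ? 0x7f0000001000ULL : 0;
}
```

    With honour_hint() the hint comes back untouched even when it is not
    PMD-aligned; with ignore_hint() the result is rounded up to the next
    2 MiB boundary, which still fits because of the padding.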

    Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Thomas Willhalm
    Tested-by: Dan Williams
    Cc: "Aneesh Kumar K . V"
    Cc: "Bruggeman, Otto G"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (in particular,
    the memmap is freed). When we re-add that section, the usage map is
    reused, but the section is no longer marked as early. When removing
    that section again, we try to kfree() a usage map that was allocated
    during early boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.
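
    The decision can be illustrated with a small userspace model. This is
    only a sketch under stated assumptions: the structs and
    release_usage_map() are hypothetical, the page_reserved field stands in
    for PageReserved() on the page backing the usage map, and free() stands
    in for kfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a section usage map. */
struct usage_map {
	bool page_reserved;	/* stand-in for PageReserved() on its page */
};

/* Hypothetical model of a memory section. */
struct section {
	struct usage_map *usage;
};

/*
 * Free the usage map only if it was not allocated during boot.
 * Returns true if the map was freed.
 */
static bool release_usage_map(struct section *ms)
{
	assert(ms);

	/* Boot-time maps share a page with other sections: keep them. */
	if (!ms->usage || ms->usage->page_reserved)
		return false;

	free(ms->usage);	/* stand-in for kfree() */
	ms->usage = NULL;
	return true;
}
```

    In this model a section that was removed and re-added still points at
    its boot-time map, so the reserved check keeps it alive no matter how
    the section itself is flagged.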

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • THP page faults now attempt a __GFP_THISNODE allocation first, which
    should only compact existing free memory, followed by another attempt
    that can allocate from any node using reclaim/compaction effort
    specified by global defrag setting and madvise.

    This patch makes the following changes to the scheme:

    - Before the patch, the first allocation relies on a check for
    pageblock order and __GFP_IO to prevent excessive reclaim. However,
    this also affects the second attempt, which is not limited to a
    single node.

    Instead of that, reuse the existing check for costly order
    __GFP_NORETRY allocations, and make sure the first THP attempt uses
    __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
    allocations will bail out if compaction needs reclaim, while
    previously they only bailed out when compaction was deferred due to
    previous failures.

    This should still be acceptable within the __GFP_NORETRY semantics.

    - Before the patch, the second allocation attempt (on all nodes) was
    passing __GFP_NORETRY. This is redundant as the check for pageblock
    order (discussed above) was stronger. It also contradicts
    madvise(MADV_HUGEPAGE), which requests some effort to allocate a THP.

    After this patch, the second attempt passes neither __GFP_THISNODE nor
    __GFP_NORETRY.

    To sum up, THP page faults now make the following attempts:

    1. local-node-only THP allocation with no reclaim, just compaction
    2. for madvised VMAs, or when synchronous compaction is always enabled:
    THP allocation from any node with effort determined by the global
    defrag setting and VMA madvise
    3. fall back to base pages on any node
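
    The ordering of these attempts can be modelled as a small decision
    function. This is a sketch of the summary above, not the page
    allocator: the struct, the enum and thp_fault_sketch() are hypothetical,
    and each boolean simply states whether that step would succeed or apply.

```c
#include <assert.h>
#include <stdbool.h>

enum thp_outcome { THP_LOCAL_NODE, THP_ANY_NODE, BASE_PAGES };

/* Hypothetical inputs describing one THP page fault. */
struct thp_fault_model {
	bool local_compaction_succeeds;	/* attempt 1 would succeed */
	bool vma_madvised;		/* VMA has MADV_HUGEPAGE */
	bool defrag_always;		/* synchronous compaction always on */
	bool any_node_succeeds;		/* attempt 2 would succeed */
};

static enum thp_outcome thp_fault_sketch(const struct thp_fault_model *m)
{
	assert(m);

	/* 1. Local node only: no reclaim, just compaction. */
	if (m->local_compaction_succeeds)
		return THP_LOCAL_NODE;

	/* 2. Any node, only for madvised VMAs or when defrag is "always". */
	if ((m->vma_madvised || m->defrag_always) && m->any_node_succeeds)
		return THP_ANY_NODE;

	/* 3. Fall back to base pages on any node. */
	return BASE_PAGES;
}
```

    The model makes one point of the summary explicit: without madvise or
    an "always" defrag setting, a failed local attempt goes straight to
    base pages.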

    Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
    Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 Jan, 2020

1 commit