30 Jan, 2020

1 commit


29 Jan, 2020

4 commits

  • Pull btrfs updates from David Sterba:
    "Features, highlights:

    - async discard
    - "mount -o discard=async" to enable it
    - freed extents are not discarded immediately, but grouped
    together and trimmed later, with IO rate limiting
    - the "sync" mode submits short extents that could have been
    ignored completely by the device; for SATA prior to 3.1 the
    requests are unqueued and have a big impact on performance
    - the actual discard IO requests have been moved out of
    transaction commit to a worker thread, improving commit latency
    - IO rate and request size can be tuned by sysfs files, for now
    enabled only with CONFIG_BTRFS_DEBUG as we might need to
    add/delete the files and don't have a stable-ish ABI for
    general use, defaults are conservative

    - export device state info in sysfs, eg. missing, writeable

    - no discard of extents known to be untouched on disk (eg. after
    reservation)

    - device stats reset is logged with process name and PID that called
    the ioctl

    Fixes:

    - fix missing hole after hole punching and fsync when using NO_HOLES

    - writeback: range cyclic mode could miss some dirty pages and lead
    to OOM

    - two more corner cases for metadata_uuid change after power loss
    during the change

    - fix infinite loop during fsync after mix of rename operations

    Core changes:

    - qgroup assign returns ENOTCONN when quotas not enabled, used to
    return EINVAL that was confusing

    - device closing does not need to allocate memory anymore

    - snapshot aware code got removed, disabled for years due to
    performance problems; a reimplementation will allow selecting whether
    defrag breaks or does not break COW on shared extents

    - tree-checker:
    - check leaf chunk item size, cross check against number of
    stripes
    - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items

    - new self test for physical -> logical mapping code, used for super
    block range exclusion

    - assertion helpers/macros updated to avoid objtool "unreachable
    code" reports on older compilers or config option combinations"

    * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
    btrfs: free block groups after free'ing fs trees
    btrfs: Fix split-brain handling when changing FSID to metadata uuid
    btrfs: Handle another split brain scenario with metadata uuid feature
    btrfs: Factor out metadata_uuid code from find_fsid.
    btrfs: Call find_fsid from find_fsid_inprogress
    Btrfs: fix infinite loop during fsync after rename operations
    btrfs: set trans->drity in btrfs_commit_transaction
    btrfs: drop log root for dropped roots
    btrfs: sysfs, add devid/dev_state kobject and device attributes
    btrfs: Refactor btrfs_rmap_block to improve readability
    btrfs: Add self-tests for btrfs_rmap_block
    btrfs: selftests: Add support for dummy devices
    btrfs: Move and unexport btrfs_rmap_block
    btrfs: separate definition of assertion failure handlers
    btrfs: device stats, log when stats are zeroed
    btrfs: fix improper setting of scanned for range cyclic write cache pages
    btrfs: safely advance counter when looking up bio csums
    btrfs: remove unused member btrfs_device::work
    btrfs: remove unnecessary wrapper get_alloc_profile
    btrfs: add correction to handle -1 edge case in async discard
    ...

    Linus Torvalds
     
  • Pull misc x86 updates from Ingo Molnar:
    "Misc changes:

    - Enhance #GP fault printouts by distinguishing between canonical and
    non-canonical address faults, and also add KASAN fault decoding.

    - Fix/enhance the x86 NMI handler by putting the duration check into
    a direct function call instead of an irq_work which we know to be
    broken in some cases.

    - Clean up do_general_protection() a bit"

    * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/nmi: Remove irq_work from the long duration NMI handler
    x86/traps: Cleanup do_general_protection()
    x86/kasan: Print original address on #GP
    x86/dumpstack: Introduce die_addr() for die() with #GP fault address
    x86/traps: Print address on #GP
    x86/insn-eval: Add support for 64-bit kernel mode

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "These were the main changes in this cycle:

    - More -rt motivated separation of CONFIG_PREEMPT and
    CONFIG_PREEMPTION.

    - Add more low level scheduling topology sanity checks and warnings
    to filter out nonsensical topologies that break scheduling.

    - Extend uclamp constraints to influence wakeup CPU placement

    - Make the RT scheduler more aware of asymmetric topologies and CPU
    capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

    - Make idle CPU selection more consistent

    - Various fixes, smaller cleanups, updates and enhancements - please
    see the git log for details"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
    sched/fair: Define sched_idle_cpu() only for SMP configurations
    sched/topology: Assert non-NUMA topology masks don't (partially) overlap
    idle: fix spelling mistake "iterrupts" -> "interrupts"
    sched/fair: Remove redundant call to cpufreq_update_util()
    sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
    sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
    sched/fair: calculate delta runnable load only when it's needed
    sched/cputime: move rq parameter in irqtime_account_process_tick
    stop_machine: Make stop_cpus() static
    sched/debug: Reset watchdog on all CPUs while processing sysrq-t
    sched/core: Fix size of rq::uclamp initialization
    sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
    sched/fair: Load balance aggressively for SCHED_IDLE CPUs
    sched/fair : Improve update_sd_pick_busiest for spare capacity case
    watchdog: Remove soft_lockup_hrtimer_cnt and related code
    sched/rt: Make RT capacity-aware
    sched/fair: Make EAS wakeup placement consider uclamp restrictions
    sched/fair: Make task_fits_capacity() consider uclamp restrictions
    sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
    sched/uclamp: Make uclamp util helpers use and return UL values
    ...

    Linus Torvalds
     
  • Pull EFI updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Cleanup of the GOP [graphics output] handling code in the EFI stub

    - Complete refactoring of the mixed mode handling in the x86 EFI stub

    - Overhaul of the x86 EFI boot/runtime code

    - Increase robustness for mixed mode code

    - Add the ability to disable DMA at the root port level in the EFI
    stub

    - Get rid of RWX mappings in the EFI memory map and page tables,
    where possible

    - Move the support code for the old EFI memory mapping style into its
    only user, the SGI UV1+ support code.

    - plus misc fixes, updates, smaller cleanups.

    ... and due to interactions with the RWX changes, another round of PAT
    cleanups make a guest appearance via the EFI tree - with no side
    effects intended"

    * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    efi/x86: Disable instrumentation in the EFI runtime handling code
    efi/libstub/x86: Fix EFI server boot failure
    efi/x86: Disallow efi=old_map in mixed mode
    x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
    efi/x86: avoid KASAN false positives when accessing the 1:1 mapping
    efi: Fix handling of multiple efi_fake_mem= entries
    efi: Fix efi_memmap_alloc() leaks
    efi: Add tracking for dynamically allocated memmaps
    efi: Add a flags parameter to efi_memory_map
    efi: Fix comment for efi_mem_type() wrt absent physical addresses
    efi/arm: Defer probe of PCIe backed efifb on DT systems
    efi/x86: Limit EFI old memory map to SGI UV machines
    efi/x86: Avoid RWX mappings for all of DRAM
    efi/x86: Don't map the entire kernel text RW for mixed mode
    x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
    efi/libstub/x86: Fix unused-variable warning
    efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
    efi/libstub/x86: Use const attribute for efi_is_64bit()
    efi: Allow disabling PCI busmastering on bridges during boot
    efi/x86: Allow translating 64-bit arguments for mixed mode calls
    ...

    Linus Torvalds
     

28 Jan, 2020

3 commits

  • Pull core SMP updates from Thomas Gleixner:
    "A small set of SMP core code changes:

    - Rework the smp function call core code to avoid the allocation of
    an additional cpumask

    - Remove the no longer required GFP argument from on_each_cpu_cond()
    and on_each_cpu_cond_mask() and fixup the callers"
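
    For illustration, a hedged before/after sketch of a call site affected by
    that change (cond_func and do_flush are hypothetical names; the pre-change
    prototype took a gfp mask as its last argument):

        /* before: callers had to pass an allocation mask */
        on_each_cpu_cond(cond_func, do_flush, info, true, GFP_ATOMIC);

        /* after: the mask is gone, nothing is allocated internally */
        on_each_cpu_cond(cond_func, do_flush, info, true);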

    * tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smp: Remove allocation mask from on_each_cpu_cond.*()
    smp: Add a smp_cond_func_t argument to smp_call_function_many()
    smp: Use smp_cond_func_t as type for the conditional function

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "The timekeeping and timers departement provides:

    - Time namespace support:

    If a container migrates from one host to another then it expects
    that clocks based on MONOTONIC and BOOTTIME are not subject to
    disruption. Due to different boot time and non-suspended runtime
    these clocks can differ significantly on two hosts, in the worst
    case time goes backwards which is a violation of the POSIX
    requirements.

    The time namespace addresses this problem. It allows setting offsets
    for the MONOTONIC and BOOTTIME clocks once, after creation and before
    tasks are associated with the namespace. These offsets are taken
    into account by timers and timekeeping, including the VDSO.

    Offsets for wall clock based clocks (REALTIME/TAI) are not provided
    by this mechanism. While in theory possible, the overhead and code
    complexity would be immense and not justified by the esoteric
    potential use cases which were discussed at Plumbers '18.

    The overhead for tasks in the root namespace (ie where host time
    offsets = 0) is in the noise and great effort was made to ensure
    that especially in the VDSO. If time namespace is disabled in the
    kernel configuration the code is compiled out.

    Kudos to Andrei Vagin and Dmitry Safonov who implemented this
    feature and kept at it for more than a year, addressing review
    comments and finding better solutions. A pleasant experience.
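
    A minimal usage sketch of that interface, assuming the documented
    CLONE_NEWTIME flag and /proc/<pid>/timens_offsets write format (error
    handling omitted; on older libcs the flag may need to be defined by hand):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            /* create a new time namespace; tasks forked after this join it */
            if (unshare(CLONE_NEWTIME))
                return 1;

            /* offsets may only be written before the first task joins:
             * shift MONOTONIC by one day and BOOTTIME by one week */
            FILE *f = fopen("/proc/self/timens_offsets", "w");
            fprintf(f, "monotonic 86400 0\nboottime 604800 0\n");
            fclose(f);

            if (fork() == 0)
                execlp("sleep", "sleep", "10", (char *)NULL);
            return 0;
        }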

    - Overhaul of the alarmtimer device dependency handling to ensure
    that the init/suspend/resume ordering is correct.

    - A new clocksource/event driver for Microchip PIT64

    - Suspend/resume support for the Hyper-V clocksource

    - The usual pile of fixes, updates and improvements mostly in the
    driver code"

    * tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
    alarmtimer: Use wakeup source from alarmtimer platform device
    alarmtimer: Make alarmtimer platform device child of RTC device
    alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
    hrtimer: Add missing sparse annotation for __run_timer()
    lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
    MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
    clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
    clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
    clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
    clocksource/drivers/exynos_mct: Rename Exynos to lowercase
    clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
    clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
    clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
    clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
    clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
    clocksource/drivers/bcm2835_timer: Fix memory leak of timer
    clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
    clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
    clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup2 interface for hugetlb controller. I think this was the last
    remaining bit which was missing from cgroup2

    - fixes for race and a spurious warning in threaded cgroup handling

    - other minor changes

    * 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    iocost: Fix iocost_monitor.py due to helper type mismatch
    cgroup: Prevent double killing of css when enabling threaded cgroup
    cgroup: fix function name in comment
    mm: hugetlb controller for cgroups v2

    Linus Torvalds
     

25 Jan, 2020

1 commit


20 Jan, 2020

3 commits


14 Jan, 2020

11 commits

  • If a task belongs to a time namespace then the VVAR page which contains
    the system wide VDSO data is replaced with a namespace specific page
    which has the same layout as the VVAR page.

    Co-developed-by: Andrei Vagin
    Signed-off-by: Andrei Vagin
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com

    Dmitry Safonov
     
  • When booting with amd_iommu=off, the following WARNING message
    appears:

    AMD-Vi: AMD IOMMU disabled on kernel command-line
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
    Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
    RIP: 0010:flush_workqueue+0x42e/0x450
    Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
    Call Trace:
    kmem_cache_destroy+0x69/0x260
    iommu_go_to_state+0x40c/0x5ab
    amd_iommu_prepare+0x16/0x2a
    irq_remapping_prepare+0x36/0x5f
    enable_IR_x2apic+0x21/0x172
    default_setup_apic_routing+0x12/0x6f
    apic_intr_mode_init+0x1a1/0x1f1
    x86_late_time_init+0x17/0x1c
    start_kernel+0x480/0x53f
    secondary_startup_64+0xb6/0xc0
    ---[ end trace 30894107c3749449 ]---
    x2apic: IRQ remapping doesn't support X2APIC mode
    x2apic disabled

    The warning is caused by the calling of 'kmem_cache_destroy()'
    in free_iommu_resources(). Here is the call path:

    free_iommu_resources
    kmem_cache_destroy
    flush_memcg_workqueue
    flush_workqueue

    The root cause is that the IOMMU subsystem runs before the workqueue
    subsystem, so the variable 'wq_online' is still 'false'. This causes
    the check 'if (WARN_ON(!wq_online))' in flush_workqueue() to evaluate to
    'true'.

    Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
    point, it is unnecessary to call flush_memcg_workqueue(). Skipping the
    call prevents the WARNING message triggered by flush_workqueue().
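
    A hedged sketch of the guard this description implies (placed at the top
    of flush_memcg_workqueue(); the exact form and placement in the patch may
    differ):

        /* Too early in boot: the memcg kmem_cache workqueue has not been
         * allocated yet, so there is nothing to flush. */
        if (!memcg_kmem_cache_wq)
                return;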

    Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
    Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Adrian Huang
    Reported-by: Xiaochun Lee
    Reviewed-by: Shakeel Butt
    Cc: Joerg Roedel
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Huang
     
  • Use div64_ul() instead of do_div() if the divisor is unsigned long, to
    avoid truncation to 32-bit on 64-bit platforms.

    Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
    The two variables 'numerator' and 'denominator' are declared as long,
    but they should actually be unsigned long (according to the
    implementation of the fprop_fraction_percpu() function).

    And do_div() does a 64-by-32 division, while the divisor 'denominator'
    is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
    function to call is div64_ul().

    Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
    by zero in avg_atom() calculation"); then, looking at the recently analyzed
    mm code, we found this suspicious place.

    201         if (min) {
    202                 min *= this_bw;
    203                 do_div(min, tot_bw);
    204         }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are unsigned long and do_div truncates
    them to 32 bits, which means it can test non-zero and be truncated to
    zero for division. Fix this issue by using div64_ul() instead.
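
    A hedged sketch of the before/after, using the variable names from the
    snippet quoted above (min is held in a 64-bit value, tot_bw is unsigned
    long):

        /* before: do_div() is a 64-by-32 division, so the unsigned long
         * divisor tot_bw is silently truncated to its low 32 bits (the
         * "mov %r9d,%r10d" visible in the disassembly above) */
        do_div(min, tot_bw);

        /* after: div64_ul() keeps the full unsigned long divisor */
        min = div64_ul(min, tot_bw);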

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but same is not
    true for e.g. ppc64 and s390 where the kernel would not boot with
    debug_pagealloc=on as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
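
    A hedged sketch of the two variants described above (condensed; the exact
    definitions live in include/linux/mm.h and may differ in detail):

        extern bool _debug_pagealloc_enabled_early;
        DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

        /* safe from init/arch code: a plain bool, valid at any time */
        static inline bool debug_pagealloc_enabled(void)
        {
                return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                        _debug_pagealloc_enabled_early;
        }

        /* fast path in mm: static key, enabled in mm_init() once
         * jump_label_init() is guaranteed to have run */
        static inline bool debug_pagealloc_enabled_static(void)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                        return false;
                return static_branch_unlikely(&_debug_pagealloc_enabled);
        }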

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently slab percpu vmstats are flushed twice: during the memcg
    offlining and just before freeing the memcg structure. Each time percpu
    counters are summed, added to the atomic counterparts and propagated up
    by the cgroup tree.

    The second flushing is required due to how recursive vmstats are
    implemented: counters are batched in percpu variables on a local level,
    and once a percpu value crosses some predefined threshold, it spills
    over to the atomic values on the local and each ancestor level. It means
    that without flushing, some numbers cached in percpu variables will be
    dropped on the floor each time a cgroup is destroyed. And with uptime the
    error on upper levels might become noticeable.

    The first flushing aims to make counters on ancestor levels more
    precise. Dying cgroups may remain in the dying state for a long time.
    After kmem_cache reparenting, which is performed during the offlining,
    slab counters of the dying cgroup don't have any chance to be updated,
    because any slab operations will be performed on the parent level. It
    means that the inaccuracy caused by percpu batching will not decrease up
    to the final destruction of the cgroup. The original idea was that
    flushing slab counters during the offlining should minimize the visible
    inaccuracy of slab counters on the parent level.

    The problem is that percpu counters are not zeroed after the first
    flushing. So every cached percpu value is summed twice. It creates a
    small error (up to 32 pages per cpu, but usually less) which accumulates
    on parent cgroup level. After creating and destroying of thousands of
    child cgroups, slab counter on parent level can be way off the real
    value.

    For now, let's just stop flushing slab counters on memcg offlining. It
    can't be done correctly without scheduling a work on each cpu: reading
    and zeroing it during css offlining can race with an asynchronous
    update, which doesn't expect values to be changed underneath.

    With this change, slab counters on parent level will become eventually
    consistent. Once all dying children are gone, values are correct. And
    if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

    It's not perfect, as slabs are reparented, so any updates after the
    reparenting will happen on the parent level. It means that if a slab
    page was allocated, a counter on the child level was bumped, then the page
    was reparented and freed, the annihilation of positive and negative
    counter values will not happen until the child cgroup is released. It
    makes slab counters different from others, and we might want to
    implement flushing in a correct form again. But it's also a question of
    performance: scheduling a work on each cpu isn't free, and it's an open
    question if the benefit of having more accurate counters is worth it.

    We might also consider flushing all counters on offlining, not only slab
    counters.

    So let's fix the main problem now: make the slab counters eventually
    consistent, so at least the error won't grow with uptime (or more
    precisely the number of created and destroyed cgroups). And think about
    the accuracy of counters separately.

    Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
    Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
    enabled. But it doesn't work well with above-47bit hint address.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
    shmem_get_unmapped_area() would not try to allocate a PMD-aligned area if
    *any* hint address is specified.

    This can be fixed by requesting the aligned area if we failed to
    allocate at the user-specified hint address. The request with inflated
    length will also take the user-specified hint address. This way we will
    not lose an allocation request from the full address space.

    [kirill@shutemov.name: fold in a fixup]
    Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
    Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Cc: "Willhalm, Thomas"
    Cc: Dan Williams
    Cc: "Bruggeman, Otto G"
    Cc: "Aneesh Kumar K . V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Patch series "Fix two above-47bit hint address vs. THP bugs".

    The two get_unmapped_area() implementations have to be fixed to provide
    THP-friendly mappings if above-47bit hint address is specified.

    This patch (of 2):

    Filesystems use thp_get_unmapped_area() to provide THP-friendly
    mappings. For DAX in particular.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks thp_get_unmapped_area(): the function
    would not try to allocate a PMD-aligned area if *any* hint address is
    specified.

    Modify the routine to handle it correctly:

    - Try to allocate the space at the specified hint address with length
    padding required for PMD alignment.
    - If failed, retry without length padding (but with the same hint
    address);
    - If the returned address matches the hint address return it.
    - Otherwise, align the address as required for THP and return.

    The user specified hint address is passed down to get_unmapped_area() so
    above-47bit hint address will be taken into account without breaking
    alignment requirements.
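
    A hedged, condensed sketch of that retry scheme (the real code is
    thp_get_unmapped_area()/__thp_get_unmapped_area() in mm/huge_memory.c and
    handles more corner cases; off is the byte offset into the file):

        /* 1: ask for the range with PMD_SIZE padding so it can be aligned */
        ret = get_unmapped_area(filp, addr, len + PMD_SIZE, off >> PAGE_SHIFT, flags);
        if (IS_ERR_VALUE(ret))
                /* 2: the padding may be why it failed - retry unpadded,
                 *    keeping the same hint address */
                return get_unmapped_area(filp, addr, len, off >> PAGE_SHIFT, flags);
        if (ret == addr)
                /* 3: the hint was honoured exactly - return it untouched */
                return ret;
        /* 4: otherwise round up inside the padded area so the result is
         *    PMD-aligned with respect to the file offset */
        ret += (off - ret) & (PMD_SIZE - 1);
        return ret;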

    Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Thomas Willhalm
    Tested-by: Dan Williams
    Cc: "Aneesh Kumar K . V"
    Cc: "Bruggeman, Otto G"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • When we remove an early section, we don't free the usage map, as the
    usage maps of other sections are placed into the same page. Once the
    section is removed, it is no longer an early section (especially, the
    memmap is freed). When we re-add that section, the usage map is reused,
    however, it is no longer an early section. When removing that section
    again, we try to kfree() a usage map that was allocated during early
    boot - bad.

    Let's check against PageReserved() to see if we are dealing with a
    usage map that was allocated during boot. We could also check against
    !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
    cleaner.
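
    A hedged sketch of that check (simplified from section_deactivate() in
    mm/sparse.c; ms is the struct mem_section being torn down):

        if (ms->usage) {
                struct page *usage_page = virt_to_page(ms->usage);

                /* PageReserved => allocated from bootmem and shared with
                 * other sections: never kfree() it */
                if (!PageReserved(usage_page)) {
                        kfree(ms->usage);
                        ms->usage = NULL;
                }
        }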

    Can be triggered using memtrace under ppc64/powernv:

    $ mount -t debugfs none /sys/kernel/debug/
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
    ------------[ cut here ]------------
    kernel BUG at mm/slub.c:3969!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE PAGE_SIZE=3D64K MMU=3DHash SMP NR_CPUS=3D2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
    NIP kfree+0x338/0x3b0
    LR section_deactivate+0x138/0x200
    Call Trace:
    section_deactivate+0x138/0x200
    __remove_pages+0x114/0x150
    arch_remove_memory+0x3c/0x160
    try_remove_memory+0x114/0x1a0
    __remove_memory+0x20/0x40
    memtrace_enable_set+0x254/0x850
    simple_attr_write+0x138/0x160
    full_proxy_write+0x8c/0x110
    __vfs_write+0x38/0x70
    vfs_write+0x11c/0x2a0
    ksys_write+0x84/0x140
    system_call+0x5c/0x68
    ---[ end trace 4b053cbd84e0db62 ]---

    The first invocation will offline+remove memory blocks. The second
    invocation will first add+online them again, in order to offline+remove
    them again (usually we are lucky and the exact same memory blocks will
    get "reallocated").

    Tested on powernv with boot memory: The usage map will not get freed.
    Tested on x86-64 with DIMMs: The usage map will get freed.

    Using Dynamic Memory under a Power DLPAR can trigger it easily.

    Triggering removal (I assume after previously removed+re-added) of
    memory from the HMC GUI can crash the kernel with the same call trace
    and is fixed by this patch.

    Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
    Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
    Signed-off-by: David Hildenbrand
    Tested-by: Pingfan Liu
    Cc: Dan Williams
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • THP page faults now attempt a __GFP_THISNODE allocation first, which
    should only compact existing free memory, followed by another attempt
    that can allocate from any node using reclaim/compaction effort
    specified by global defrag setting and madvise.

    This patch makes the following changes to the scheme:

    - Before the patch, the first allocation relies on a check for
    pageblock order and __GFP_IO to prevent excessive reclaim. This
    however affects also the second attempt, which is not limited to
    single node.

    Instead of that, reuse the existing check for costly order
    __GFP_NORETRY allocations, and make sure the first THP attempt uses
    __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
    allocations will bail out if compaction needs reclaim, while
    previously they only bailed out when compaction was deferred due to
    previous failures.

    This should be still acceptable within the __GFP_NORETRY semantics.

    - Before the patch, the second allocation attempt (on all nodes) was
    passing __GFP_NORETRY. This is redundant as the check for pageblock
    order (discussed above) was stronger. It's also contrary to
    madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
    requested.

    After this patch, the second attempt doesn't pass __GFP_THISNODE nor
    __GFP_NORETRY.

    To sum up, THP page faults now try the following attempts:

    1. local node only THP allocation with no reclaim, just compaction.
    2. for madvised VMAs or when synchronous compaction is enabled always - THP
    allocation from any node with effort determined by global defrag setting
    and VMA madvise
    3. fallback to base pages on any node
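
    A hedged sketch of that order (gfp composition simplified; the real logic
    spans alloc_hugepage_direct_gfpmask() and alloc_pages_vma(), and gfp_thp /
    thp_effort_requested are illustrative names):

        struct page *page;

        /* 1: local node, compaction only, bail out early instead of reclaiming */
        page = __alloc_pages_node(node, gfp_thp | __GFP_THISNODE | __GFP_NORETRY,
                                  HPAGE_PMD_ORDER);

        /* 2: madvised / sync defrag: any node, no __GFP_THISNODE, no __GFP_NORETRY */
        if (!page && thp_effort_requested)
                page = alloc_pages(gfp_thp, HPAGE_PMD_ORDER);

        /* 3: if both fail, the caller falls back to base (4k) pages */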

    Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
    Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 Jan, 2020

1 commit


11 Jan, 2020

1 commit


07 Jan, 2020

1 commit

    The ARMv8 64-bit architecture supports execute-only user permissions by
    clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
    privileged mapping from which user code running at EL0 can still execute.

    The downside, however, is that the kernel at EL1 inadvertently reading
    such a mapping would not trip over the PAN (privileged access never)
    protection.

    Revert the relevant bits from commit cab15ce604e5 ("arm64: Introduce
    execute-only page access permissions") so that PROT_EXEC implies
    PROT_READ (and therefore PTE_USER) until the architecture gains proper
    support for execute-only user mappings.
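
    An illustrative (hypothetical) userspace view of the consequence:

        #include <sys/mman.h>

        /* request an execute-only mapping */
        void *p = mmap(NULL, 4096, PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* with this revert, PROT_EXEC implies PROT_READ, so reading *p from
         * EL0 succeeds - the mapping is no longer truly execute-only */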

    Fixes: cab15ce604e5 ("arm64: Introduce execute-only page access permissions")
    Cc: # 4.9.x-
    Acked-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

06 Jan, 2020

1 commit


05 Jan, 2020

6 commits

  • The following lockdep splat was observed when a certain hugetlbfs test
    was run:

    ================================
    WARNING: inconsistent lock state
    4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - -
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
    {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
    irq event stamp: 691296
    hardirqs last enabled at (691296): [] _raw_spin_unlock_irqrestore+0x4b/0x60
    hardirqs last disabled at (691295): [] _raw_spin_lock_irqsave+0x22/0x81
    softirqs last enabled at (691284): [] irq_enter+0xc3/0xe0
    softirqs last disabled at (691285): [] irq_exit+0x23e/0x2b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(hugetlb_lock);

    lock(hugetlb_lock);

    *** DEADLOCK ***
    :
    Call Trace:

    __lock_acquire+0x146b/0x48c0
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    free_huge_page+0x36f/0xaa0
    bio_check_pages_dirty+0x2fc/0x5c0
    clone_endio+0x17f/0x670 [dm_mod]
    blk_update_request+0x276/0xe50
    scsi_end_request+0x7b/0x6a0
    scsi_io_completion+0x1c6/0x1570
    blk_done_softirq+0x22e/0x350
    __do_softirq+0x23d/0xad8
    irq_exit+0x23e/0x2b0
    do_IRQ+0x11a/0x200
    common_interrupt+0xf/0xf

    Both the hugetlb_lock and the subpool lock can be acquired in
    free_huge_page(). One way to solve the problem is to make both locks
    irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
    held for a linear scan of ALL hugetlb pages during a cgroup reparenting
    operation. So it is just too long to have irqs disabled unless we can
    break hugetlb_lock down into finer-grained locks with shorter lock hold
    times.

    Another alternative is to defer the freeing to a workqueue job. This
    patch implements the deferred freeing by adding a free_hpage_workfn()
    work function to do the actual freeing. The free_huge_page() call in a
    non-task context saves the page to be freed in the hpage_freelist linked
    list in a lockless manner using the llist APIs.

    The generic workqueue is used to process the work, but a dedicated
    workqueue can be used instead if it is desirable to have the huge page
    freed ASAP.

    Thanks to Kirill Tkhai for suggesting the use of
    the llist APIs, which simplify the code.
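
    A hedged sketch of that deferral, condensed from the description above
    (names follow the commit text; details of the real mm/hugetlb.c change
    may differ):

        static LLIST_HEAD(hpage_freelist);

        static void free_hpage_workfn(struct work_struct *work)
        {
                struct llist_node *node = llist_del_all(&hpage_freelist);

                while (node) {
                        struct page *page = container_of((void *)node,
                                                         struct page, mapping);
                        node = node->next;
                        __free_huge_page(page); /* takes hugetlb_lock, task context */
                }
        }
        static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

        void free_huge_page(struct page *page)
        {
                if (!in_task()) {
                        /* softirq/irq context: lockless push, kick a worker */
                        if (llist_add((struct llist_node *)&page->mapping,
                                      &hpage_freelist))
                                schedule_work(&free_hpage_work);
                        return;
                }
                __free_huge_page(page);         /* original freeing path */
        }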

    Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Mike Kravetz
    Acked-by: Davidlohr Bueso
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • In the implementation of __gup_benchmark_ioctl() the allocated pages
    should be released before returning in case of an invalid cmd. Release
    pages via kvfree().
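
    A hedged sketch of that path (the default branch of the cmd switch in
    __gup_benchmark_ioctl(); surrounding code omitted):

        default:
                kvfree(pages);          /* was leaked on this path before */
                return -EINVAL;         /* per the note below: -EINVAL, not -1 */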

    [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
    Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
    Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods")
    Signed-off-by: Navid Emamdoost
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Reviewed-by: John Hubbard
    Cc: Keith Busch
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Navid Emamdoost
     
  • pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
    As everything else is printed in kB, I chose to fix the value rather than
    the string.

    Before:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1878] 1000 1878 217253 151144 1269760 0 0 python
    ...
    Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

    After:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1436] 1000 1436 217253 151890 1294336 0 0 python
    ...
    Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
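
    For reference, the conversion this implies, checked against the "After"
    output above (p and pgtables_kb are illustrative names):

        /* bytes -> kB, matching the other fields of the OOM kill message */
        pgtables_kb = mm_pgtables_bytes(p->mm) >> 10;   /* 1294336 >> 10 == 1264 */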

    Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
    Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Edward Chron
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Dryomov
     
    Felix Abecassis reports that move_pages() would return random status
    values if the pages are already on the target node, as shown by the
    test program below:

    int main(void)
    {
            const long node_id = 1;
            const long page_size = sysconf(_SC_PAGESIZE);
            const int64_t num_pages = 8;

            unsigned long nodemask = 1 << node_id;
            long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
            if (ret < 0)
                    return (EXIT_FAILURE);

            void **pages = malloc(sizeof(void*) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
                                    MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
                                    -1, 0);
                    if (pages[i] == MAP_FAILED)
                            return (EXIT_FAILURE);
            }

            ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
            if (ret < 0)
                    return (EXIT_FAILURE);

            int *nodes = malloc(sizeof(int) * num_pages);
            int *status = malloc(sizeof(int) * num_pages);
            for (int i = 0; i < num_pages; ++i) {
                    nodes[i] = node_id;
                    status[i] = 0xd0; /* simulate garbage values */
            }

            ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
            printf("move_pages: %ld\n", ret);
            for (int i = 0; i < num_pages; ++i)
                    printf("status[%d] = %d\n", i, status[i]);
    }

    Then running the program would return nonsense status values:

    $ ./move_pages_bug
    move_pages: 0
    status[0] = 208
    status[1] = 208
    status[2] = 208
    status[3] = 208
    status[4] = 208
    status[5] = 208
    status[6] = 208
    status[7] = 208

    This is because the status is not set if the page is already on the
    target node, but move_pages() should return valid status as long as it
    succeeds. The valid status may be errno or node id.

    We can't simply initialize the status array to zero since the pages may
    not be on node 0. Fix it by updating status with the node id of the node
    the page is already on.

    Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
    Signed-off-by: Yang Shi
    Reported-by: Felix Abecassis
    Tested-by: Felix Abecassis
    Suggested-by: Michal Hocko
    Reviewed-by: John Hubbard
    Acked-by: Christoph Lameter
    Acked-by: Michal Hocko
    Reviewed-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: [4.17+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    When a zspage is migrated to another zone, the zone page state should be
    updated as well; otherwise NR_ZSPAGES for each zone shows wrong counts,
    visible e.g. in /proc/zoneinfo.
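
    A hedged sketch of the accounting move described above (the change lands
    in the zsmalloc page migration path; exact placement may differ):

        /* the zspage leaves the old zone and arrives in the new one */
        dec_zone_page_state(page, NR_ZSPAGES);
        inc_zone_page_state(newpage, NR_ZSPAGES);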

    Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
    Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat")
    Signed-off-by: Chanho Min
    Signed-off-by: Jinsuk Choi
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chanho Min
     
  • We currently try to shrink a single zone when removing memory. We use
    the zone of the first page of the memory we are removing. If that
    memmap was never initialized (e.g., memory was never onlined), we will
    read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __remove_pages+0x4b/0x640
    arch_remove_memory+0x63/0x8d
    try_remove_memory+0xdb/0x130
    __remove_memory+0xa/0x11
    acpi_memory_device_remove+0x70/0x100
    acpi_bus_trim+0x55/0x90
    acpi_device_hotplug+0x227/0x3a0
    acpi_hotplug_work_fn+0x1a/0x30
    process_one_work+0x221/0x550
    worker_thread+0x50/0x3b0
    kthread+0x105/0x140
    ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

    Instead, shrink the zones when offlining memory or when onlining failed.
    Introduce and use remove_pfn_range_from_zone() for that. We now
    properly shrink the zones, even if we have DIMMs where:

    - Some memory blocks fall into no zone (never onlined)

    - Some memory blocks fall into multiple zones (offlined+re-onlined)

    - Multiple memory blocks that fall into different zones

    Drop the zone parameter (with a potential dubious value) from
    __remove_pages() and __remove_section().

    Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
    Signed-off-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Aneesh Kumar K.V"
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Dan Williams
    Cc: Logan Gunthorpe
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

31 Dec, 2019

1 commit

  • Make #GP exceptions caused by out-of-bounds KASAN shadow accesses easier
    to understand by computing the address of the original access and
    printing that. More details are in the comments in the patch.

    This turns an error like this:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault, probably for non-canonical address
    0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI

    into this:

    general protection fault, probably for non-canonical address
    0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
    KASAN: maybe wild-memory-access in range
    [0x00badbeefbadbee8-0x00badbeefbadbeef]

    The hook is placed in architecture-independent code, but is currently
    only wired up to the X86 exception handler because I'm not sufficiently
    familiar with the address space layout and exception handling mechanisms
    on other architectures.
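
    A hedged sketch of the reconstruction for generic KASAN, where one shadow
    byte covers 8 bytes of memory (bad_shadow_addr is an illustrative name):

        /* invert shadow = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET */
        unsigned long start = (bad_shadow_addr - KASAN_SHADOW_OFFSET)
                                << KASAN_SHADOW_SCALE_SHIFT;
        unsigned long end = start + (1UL << KASAN_SHADOW_SCALE_SHIFT) - 1;
        /* e.g. the [0x00badbeefbadbee8-0x00badbeefbadbeef] range shown above */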

    Signed-off-by: Jann Horn
    Signed-off-by: Borislav Petkov
    Reviewed-by: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrew Morton
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: kasan-dev@googlegroups.com
    Cc: linux-mm
    Cc: Peter Zijlstra
    Cc: Sean Christopherson
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/20191218231150.12139-4-jannh@google.com

    Jann Horn
     

29 Dec, 2019

1 commit


25 Dec, 2019

1 commit


18 Dec, 2019

4 commits

  • Since commit 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on
    memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of
    CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace
    with CONFIG_MEMCG_KMEM.

    And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
    one shrinker, the deferred_split_shrinker. But it is never actually
    called, since idr_replace() is never compiled due to the wrong #ifdef.
    The deferred_split_shrinker stays in a half-registered state all the
    time, and it's never called for subordinate mem cgroups.

    Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 0a432dcbeb32 ("mm: shrinker: make shrinker not depend on memcg kmem")
    Signed-off-by: Yang Shi
    Reviewed-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Cc: [5.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • syzkaller and the fault injector showed that I was wrong to assume that
    we could ignore percpu shadow allocation failures.

    Handle failures properly. Merge all the allocated areas back into the
    free list and release the shadow, then clean up and return NULL. The
    shadow is released unconditionally, which relies upon the fact that the
    release function is able to tolerate pages not being present.

    Also clean up shadows in the recovery path - currently they are not
    released, which leaks a bit of memory.

    Link: http://lkml.kernel.org/r/20191205140407.1874-3-dja@axtens.net
    Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
    Signed-off-by: Daniel Axtens
    Reported-by: syzbot+82e323920b78d54aaed5@syzkaller.appspotmail.com
    Reported-by: syzbot+59b7daa4315e07a994f1@syzkaller.appspotmail.com
    Reviewed-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Qian Cai
    Cc: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Axtens
     
  • kasan_release_vmalloc uses apply_to_page_range to release vmalloc
    shadow. Unfortunately, apply_to_page_range can allocate memory to fill
    in page table entries, which is not what we want.

    Also, kasan_release_vmalloc is called under free_vmap_area_lock, so if
    apply_to_page_range does allocate memory, we get a sleep in atomic bug:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:4681
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 15087, name:

    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x199/0x216 lib/dump_stack.c:118
    ___might_sleep.cold.97+0x1f5/0x238 kernel/sched/core.c:6800
    __might_sleep+0x95/0x190 kernel/sched/core.c:6753
    prepare_alloc_pages mm/page_alloc.c:4681 [inline]
    __alloc_pages_nodemask+0x3cd/0x890 mm/page_alloc.c:4730
    alloc_pages_current+0x10c/0x210 mm/mempolicy.c:2211
    alloc_pages include/linux/gfp.h:532 [inline]
    __get_free_pages+0xc/0x40 mm/page_alloc.c:4786
    __pte_alloc_one_kernel include/asm-generic/pgalloc.h:21 [inline]
    pte_alloc_one_kernel include/asm-generic/pgalloc.h:33 [inline]
    __pte_alloc_kernel+0x1d/0x200 mm/memory.c:459
    apply_to_pte_range mm/memory.c:2031 [inline]
    apply_to_pmd_range mm/memory.c:2068 [inline]
    apply_to_pud_range mm/memory.c:2088 [inline]
    apply_to_p4d_range mm/memory.c:2108 [inline]
    apply_to_page_range+0x77d/0xa00 mm/memory.c:2133
    kasan_release_vmalloc+0xa7/0xc0 mm/kasan/common.c:970
    __purge_vmap_area_lazy+0xcbb/0x1f30 mm/vmalloc.c:1313
    try_purge_vmap_area_lazy mm/vmalloc.c:1332 [inline]
    free_vmap_area_noflush+0x2ca/0x390 mm/vmalloc.c:1368
    free_unmap_vmap_area mm/vmalloc.c:1381 [inline]
    remove_vm_area+0x1cc/0x230 mm/vmalloc.c:2209
    vm_remove_mappings mm/vmalloc.c:2236 [inline]
    __vunmap+0x223/0xa20 mm/vmalloc.c:2299
    __vfree+0x3f/0xd0 mm/vmalloc.c:2356
    __vmalloc_area_node mm/vmalloc.c:2507 [inline]
    __vmalloc_node_range+0x5d5/0x810 mm/vmalloc.c:2547
    __vmalloc_node mm/vmalloc.c:2607 [inline]
    __vmalloc_node_flags mm/vmalloc.c:2621 [inline]
    vzalloc+0x6f/0x80 mm/vmalloc.c:2666
    alloc_one_pg_vec_page net/packet/af_packet.c:4233 [inline]
    alloc_pg_vec net/packet/af_packet.c:4258 [inline]
    packet_set_ring+0xbc0/0x1b50 net/packet/af_packet.c:4342
    packet_setsockopt+0xed7/0x2d90 net/packet/af_packet.c:3695
    __sys_setsockopt+0x29b/0x4d0 net/socket.c:2117
    __do_sys_setsockopt net/socket.c:2133 [inline]
    __se_sys_setsockopt net/socket.c:2130 [inline]
    __x64_sys_setsockopt+0xbe/0x150 net/socket.c:2130
    do_syscall_64+0xfa/0x780 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Switch to using the apply_to_existing_page_range() helper instead, which
    won't allocate memory.

    [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
    Link: http://lkml.kernel.org/r/20191205140407.1874-2-dja@axtens.net
    Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
    Signed-off-by: Daniel Axtens
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Qian Cai
    Cc: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Axtens
     
  • apply_to_page_range() takes an address range, and if any parts of it are
    not covered by the existing page table hierarchy, it allocates memory to
    fill them in.

    In some use cases, this is not what we want - we want to be able to
    operate exclusively on PTEs that are already in the tables.

    Add apply_to_existing_page_range() for this. Adjust the walker
    functions for apply_to_page_range to take 'create', which switches them
    between the old and new modes.

    This will be used in KASAN vmalloc.
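
    A hedged usage sketch of the new helper (the callback signature follows
    pte_fn_t; clear_one_pte is an illustrative name):

        static int clear_one_pte(pte_t *ptep, unsigned long addr, void *data)
        {
                pte_clear(&init_mm, addr, ptep);
                return 0;
        }

        /* operate only on PTEs that already exist - never allocate tables */
        apply_to_existing_page_range(&init_mm, start, size, clear_one_pte, NULL);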

    [akpm@linux-foundation.org: reduce code duplication]
    [akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
    [akpm@linux-foundation.org: initialize __apply_to_page_range::err]
    Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
    Signed-off-by: Daniel Axtens
    Cc: Dmitry Vyukov
    Cc: Uladzislau Rezki (Sony)
    Cc: Alexander Potapenko
    Cc: Daniel Axtens
    Cc: Qian Cai
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Axtens