21 Oct, 2019

1 commit


20 Oct, 2019

2 commits


19 Oct, 2019

1 commit

  • Attaching uprobe to text section in THP splits the PMD mapped page table
    into PTE mapped entries. On uprobe detach, we would like to regroup PMD
    mapped page table entry to regain performance benefit of THP.

    However, the regroup is broken for perf_event based trace_uprobe. This
    is because perf_event based trace_uprobe calls uprobe_unregister twice
    on close: first in TRACE_REG_PERF_CLOSE, then in
    TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped
    page table entry, which is not the desired behavior.

    Fix this by only using FOLL_SPLIT_PMD for the uprobe register case.

    Add a WARN() to confirm that uprobe unregister never works on huge pages,
    and abort the operation when this WARN() triggers.
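
    The register/unregister distinction described above can be sketched in
    user space as follows. This is only an illustration, not the kernel
    patch: the flag values are placeholders, and uprobe_gup_flags() and
    uprobe_write_may_proceed() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag values, for illustration only. */
#define FOLL_FORCE      0x10u
#define FOLL_SPLIT_PMD  0x20000u

/* Only the register path asks for a PMD-mapped page to be split. */
unsigned int uprobe_gup_flags(bool is_register)
{
    unsigned int gup_flags = FOLL_FORCE;

    if (is_register)
        gup_flags |= FOLL_SPLIT_PMD;
    return gup_flags;
}

/*
 * Stand-in for the WARN(): unregister should never see a huge page.
 * Returns true when the write may proceed, false when it must abort.
 */
bool uprobe_write_may_proceed(bool is_register, bool page_is_huge)
{
    return is_register || !page_is_huge;
}
```

    With this split, the second unregister call on close no longer requests
    a PMD split, and an unexpected huge page on unregister aborts the write.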

    Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
    Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
    Signed-off-by: Song Liu
    Reviewed-by: Srikar Dronamraju
    Cc: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Cc: Matthew Wilcox (Oracle)
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

18 Oct, 2019

2 commits

  • Pull power management fixes from Rafael Wysocki:
    "These include a fix for a recent regression in the ACPI CPU
    performance scaling code, a PCI device power management fix,
    a system shutdown fix related to cpufreq, a removal of an ACPI
    suspend-to-idle blacklist entry and a build warning fix.

    Specifics:

    - Fix possible NULL pointer dereference in the ACPI processor scaling
    initialization code introduced by a recent cpufreq update (Rafael
    Wysocki).

    - Fix possible deadlock due to suspending cpufreq too late during
    system shutdown (Rafael Wysocki).

    - Make the PCI device system resume code path be more consistent with
    its PM-runtime counterpart to fix an issue with missing delay on
    transitions from D3cold to D0 during system resume from
    suspend-to-idle on some systems (Rafael Wysocki).

    - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
    use suspend-to-idle by default (Mario Limonciello).

    - Fix build warning in the core system suspend support code (Ben
    Dooks)"

    * tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI: processor: Avoid NULL pointer dereferences at init time
    PCI: PM: Fix pci_power_up()
    PM: sleep: include <linux/pm_runtime.h> for pm_wq
    cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown
    ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist

    Linus Torvalds
     
  • * pm-cpufreq:
    ACPI: processor: Avoid NULL pointer dereferences at init time
    cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown

    * pm-sleep:
    PM: sleep: include <linux/pm_runtime.h> for pm_wq
    ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist

    Rafael J. Wysocki
     

17 Oct, 2019

3 commits

  • Both multi_cpu_stop() and set_state() access multi_stop_data::state
    racily using plain accesses. These are subject to compiler
    transformations which could break the intended behaviour of the code,
    and this situation is detected by KCSAN on both arm64 and x86 (splats
    below).

    Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the
    compiler cannot elide, replay, or tear loads and stores.

    In multi_cpu_stop() the two loads of multi_stop_data::state are expected to
    be a consistent value, so snapshot the value into a temporary variable to
    ensure this.

    The state transitions are serialized by atomic manipulation of
    multi_stop_data::num_threads, and other fields in multi_stop_data are not
    modified while subject to concurrent reads.
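
    A minimal user-space sketch of the snapshot pattern described above.
    The READ_ONCE()/WRITE_ONCE() macros here are simplified stand-ins built
    on volatile casts (assuming GCC or Clang for __typeof__), and
    poll_state() is a hypothetical helper that models one iteration of the
    polling loop.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros (assumes GCC/Clang). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

enum multi_stop_state {
    MULTI_STOP_NONE,
    MULTI_STOP_PREPARE,
    MULTI_STOP_RUN,
    MULTI_STOP_EXIT,
};

struct multi_stop_data {
    enum multi_stop_state state;
};

/* Writer side: a single store that the compiler may not tear. */
void set_state(struct multi_stop_data *msdata, enum multi_stop_state newstate)
{
    WRITE_ONCE(msdata->state, newstate);
}

/*
 * Reader side: load the shared state exactly once into a local, then use
 * only the local for both the comparison and the new current state.
 */
enum multi_stop_state poll_state(struct multi_stop_data *msdata,
                                 enum multi_stop_state curstate,
                                 int *changed)
{
    enum multi_stop_state newstate;

    assert(msdata && changed);
    newstate = READ_ONCE(msdata->state);
    *changed = (newstate != curstate);
    return newstate;
}
```

    Because the comparison and the returned value come from the same local,
    the two uses cannot disagree even if another CPU updates the state in
    between.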

    KCSAN splat on arm64:

    | BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
    |
    | write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
    | set_state+0x80/0xb0
    | multi_cpu_stop+0x16c/0x198
    | cpu_stopper_thread+0x170/0x298
    | smpboot_thread_fn+0x40c/0x560
    | kthread+0x1a8/0x1b0
    | ret_from_fork+0x10/0x18
    |
    | read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
    | multi_cpu_stop+0xa8/0x198
    | cpu_stopper_thread+0x170/0x298
    | smpboot_thread_fn+0x40c/0x560
    | kthread+0x1a8/0x1b0
    | ret_from_fork+0x10/0x18
    |
    | Reported by Kernel Concurrency Sanitizer on:
    | CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
    | Hardware name: linux,dummy-virt (DT)

    KCSAN splat on x86:

    | write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2:
    | set_state kernel/stop_machine.c:170 [inline]
    | ack_state kernel/stop_machine.c:177 [inline]
    | multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227
    | cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
    | smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
    | kthread+0x1b5/0x200 kernel/kthread.c:255
    | ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
    |
    | read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7:
    | multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213
    | cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
    | smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
    | kthread+0x1b5/0x200 kernel/kthread.c:255
    | ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
    |
    | Reported by Kernel Concurrency Sanitizer on:
    | CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014

    Signed-off-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Acked-by: Marco Elver
    Link: https://lkml.kernel.org/r/20191007104536.27276-1-mark.rutland@arm.com

    Mark Rutland
     
  • The option --sort=ORDER was only introduced in tar 1.28 (2014), which
    is rather new and might not be available in some setups.

    This patch tries to replicate the previous behaviour as closely as
    possible to fix the kheaders build for older environments. It does
    not produce identical archives compared to the previous version due
    to minor sorting differences but produces reproducible results itself
    in my tests.

    Reported-by: Andreas Schwab
    Signed-off-by: Dmitry Goldin
    Tested-by: Andreas Schwab
    Tested-by: Quentin Perret
    Signed-off-by: Masahiro Yamada

    Dmitry Goldin
     
  • __kthread_queue_delayed_work() is not exported, so
    make it static to avoid the following sparse warning:

    kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static?

    Signed-off-by: Ben Dooks
    Signed-off-by: Linus Torvalds

    Ben Dooks
     

16 Oct, 2019

1 commit

  • Pull parisc fixes from Helge Deller:

    - Fix a parisc-specific fallout of Christoph's
    dma_set_mask_and_coherent() patches (Sven)

    - Fix a vmap memory leak in ioremap()/iounmap() (Helge)

    - Some minor cleanups and documentation updates (Nick, Helge)

    * 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc: Remove 32-bit DMA enforcement from sba_iommu
    parisc: Fix vmap memory leak in ioremap()/iounmap()
    parisc: prefer __section from compiler_attributes.h
    parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa__ define
    MAINTAINERS: Add hp_sdc drivers to parisc arch

    Linus Torvalds
     

15 Oct, 2019

1 commit


14 Oct, 2019

2 commits

  • Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access
    to timer->base")

    lock_hrtimer_base() fetches timer->base without lock exclusion.

    Compiler is allowed to read timer->base twice (even if considered dumb)
    which could end up trying to lock migration_base and return
    &migration_base.

        base = timer->base;
        if (likely(base != &migration_base)) {

            /* compiler reads timer->base again, and now (base == &migration_base) */

            raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
            if (likely(base == timer->base))
                return base; /* == &migration_base ! */

    Similarly the write sides must use WRITE_ONCE() to avoid store tearing.
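
    A user-space sketch of the corrected read side, assuming simplified
    stand-ins: READ_ONCE() is approximated with a volatile cast (GCC/Clang),
    and the spinlock is replaced by a plain "locked" field so the example
    stays self-contained. It is not the kernel implementation.

```c
#include <assert.h>

/* Simplified stand-in for the kernel macro (assumes GCC/Clang). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct hrtimer_clock_base {
    int locked;                     /* stand-in for cpu_base->lock */
};

struct hrtimer {
    struct hrtimer_clock_base *base;
};

struct hrtimer_clock_base migration_base;

struct hrtimer_clock_base *lock_hrtimer_base(struct hrtimer *timer)
{
    assert(timer);
    for (;;) {
        /* One load; every later use refers to this snapshot. */
        struct hrtimer_clock_base *base = READ_ONCE(timer->base);

        if (base != &migration_base) {
            base->locked = 1;       /* raw_spin_lock_irqsave() in the kernel */
            if (base == READ_ONCE(timer->base))
                return base;        /* can never be &migration_base here */
            base->locked = 0;       /* timer migrated, retry */
        }
        /* The kernel calls cpu_relax() here while a migration is pending. */
    }
}
```

    Since the check against migration_base and the locked pointer come from
    the same snapshot, the function cannot return &migration_base.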

    Signed-off-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20191008173204.180879-1-edumazet@google.com

    Eric Dumazet
     
  • Pull tracing fixes from Steven Rostedt:
    "A few tracing fixes:

    - Remove lockdown from tracefs itself and move it to the trace
    directory. Have the open functions there do the lockdown checks.

    - Fix a few races with opening an instance file and the instance
    being deleted (Discovered during the lockdown updates). Kept
    separate from the clean up code such that they can be backported to
    stable more easily.

    - Clean up and consolidate the checks done when opening a trace
    file, as there were multiple checks that need to be done, and it
    did not make sense having them done in each open instance.

    - Fix a regression in the record mcount code.

    - Small hw_lat detector tracer fixes.

    - A trace_pipe read fix due to not initializing trace_seq"

    * tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
    tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
    tracing/hwlat: Report total time spent in all NMIs during the sample
    recordmcount: Fix nop_mcount() function
    tracing: Do not create tracefs files if tracefs lockdown is in effect
    tracing: Add locked_down checks to the open calls of files created for tracefs
    tracing: Add tracing_check_open_get_tr()
    tracing: Have trace events system open call tracing_open_generic_tr()
    tracing: Get trace_array reference for available_tracers files
    ftrace: Get a reference counter for the trace_array on filter files
    tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")

    Linus Torvalds
     

13 Oct, 2019

10 commits

  • A customer reported the following softlockup:

    [899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
    [899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
    [899688.160002] RIP: 0010:up_write+0x1a/0x30
    [899688.160002] Kernel panic - not syncing: softlockup: hung tasks
    [899688.160002] RIP: 0010:up_write+0x1a/0x30
    [899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
    [899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
    [899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
    [899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
    [899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
    [899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
    [899688.160002] tracing_read_pipe+0x336/0x3c0
    [899688.160002] __vfs_read+0x26/0x140
    [899688.160002] vfs_read+0x87/0x130
    [899688.160002] SyS_read+0x42/0x90
    [899688.160002] do_syscall_64+0x74/0x160

    It caught the process in the middle of trace_access_unlock(). There is
    no loop. So, it must be looping in the caller tracing_read_pipe()
    via the "waitagain" label.

    Crashdump analysis uncovered that iter->seq was completely zeroed
    at this point, including iter->seq.seq.size. It means that
    print_trace_line() was never able to print anything and
    there was no forward progress.

    The culprit seems to be in the code:

        /* reset all but tr, trace, and overruns */
        memset(&iter->seq, 0,
               sizeof(struct trace_iterator) -
               offsetof(struct trace_iterator, seq));

    It was added by commit 53d0aa773053ab182877 ("ftrace: add logic to
    record overruns") in v2.6.27-rc1. At that time, iter->seq looked like:

        struct trace_seq {
            unsigned char buffer[PAGE_SIZE];
            unsigned int len;
        };

    There was no "size" variable and zeroing was perfectly fine.

    The solution is to reinitialize the structure after or without
    zeroing.
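
    A minimal sketch of the "reinitialize after zeroing" approach. The
    structures are heavily simplified models (SEQ_BUF_SIZE is a hypothetical
    stand-in for PAGE_SIZE, and most fields are omitted); only the ordering
    of memset() and trace_seq_init() is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEQ_BUF_SIZE 64             /* hypothetical; stands in for PAGE_SIZE */

struct seq_buf {
    char   *buffer;
    size_t  size;
    size_t  len;
};

struct trace_seq {
    char           buffer[SEQ_BUF_SIZE];
    struct seq_buf seq;
};

struct trace_iterator {
    void            *tr;            /* preserved across the reset */
    void            *trace;         /* preserved across the reset */
    unsigned long    overruns;      /* preserved across the reset */
    struct trace_seq seq;           /* everything from here on is reset */
    long             pos;
};

void trace_seq_init(struct trace_seq *s)
{
    s->seq.buffer = s->buffer;
    s->seq.size = sizeof(s->buffer);
    s->seq.len = 0;
}

void reset_iterator(struct trace_iterator *iter)
{
    assert(iter);
    /* reset all but tr, trace, and overruns */
    memset(&iter->seq, 0,
           sizeof(struct trace_iterator) -
           offsetof(struct trace_iterator, seq));
    /* zeroing wiped seq.size as well, so set the buffer up again */
    trace_seq_init(&iter->seq);
}
```

    Without the final trace_seq_init() call, seq.size stays 0 and nothing
    can ever be printed into the buffer, which matches the lockup above.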

    Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

    Signed-off-by: Petr Mladek
    Signed-off-by: Steven Rostedt (VMware)

    Petr Mladek
     
  • max_latency is intended to record the maximum ever observed hardware
    latency, which may occur in either part of the loop (inner/outer). So
    we need to also consider the outer-loop sample when updating
    max_latency.
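
    The update rule can be sketched as below. update_max_latency() and its
    parameters are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Take the larger of the inner and outer samples before comparing. */
void update_max_latency(uint64_t *max_latency, uint64_t inner, uint64_t outer)
{
    uint64_t latency = inner > outer ? inner : outer;

    assert(max_latency);
    if (latency > *max_latency)
        *max_latency = latency;
}
```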

    Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu

    Fixes: e7c15cd8a113 ("tracing: Added hardware latency tracer")
    Cc: stable@vger.kernel.org
    Signed-off-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Steven Rostedt (VMware)

    Srivatsa S. Bhat (VMware)
     
  • nmi_total_ts is supposed to record the total time spent in *all* NMIs
    that occur on the given CPU during the (active portion of the)
    sampling window. However, the code seems to be overwriting this
    variable for each NMI, thereby only recording the time spent in the
    most recent NMI. Fix it by accumulating the duration instead.
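
    A small model of the accumulation, assuming hypothetical nmi_enter()
    and nmi_exit() hooks that receive the current timestamp. The only point
    is the "+=" in the exit path.

```c
#include <assert.h>
#include <stdint.h>

struct hwlat_nmi {
    uint64_t nmi_ts_start;
    uint64_t nmi_total_ts;
    int      nmi_count;
};

void nmi_enter(struct hwlat_nmi *d, uint64_t now)
{
    assert(d);
    d->nmi_ts_start = now;
    d->nmi_count++;
}

void nmi_exit(struct hwlat_nmi *d, uint64_t now)
{
    assert(d);
    /* accumulate; a plain assignment would keep only the last NMI */
    d->nmi_total_ts += now - d->nmi_ts_start;
}
```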

    Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu

    Fixes: 7b2c86250122 ("tracing: Add NMI tracing in hwlat detector")
    Cc: stable@vger.kernel.org
    Signed-off-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Steven Rostedt (VMware)

    Srivatsa S. Bhat (VMware)
     
  • Added various checks on open tracefs calls to see if tracefs is in lockdown
    mode, and if so, to return -EPERM.

    Note, the event format files (which are basically standard on all machines)
    as well as the enabled_functions file (which shows what is currently being
    traced) are not locked down. Perhaps they should be, but it seems
    counterintuitive to lock down information that helps you know if the
    system has been modified.

    Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com

    Suggested-by: Linus Torvalds
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Currently, most files in the tracefs directory test if tracing_disabled is
    set. If so, they return -ENODEV. tracing_disabled is set when tracing is
    found to be broken. Originally this was done in case the ring buffer was
    found to be corrupted, and we wanted to prevent reading it from crashing
    the kernel. But it is also set if a tracing selftest fails on
    boot. It's a one way switch. That is, once it is triggered, tracing is
    disabled until reboot.

    As most tracefs files can also be used by instances in the tracefs
    directory, they need to be carefully done. Each instance has a trace_array
    associated to it, and when the instance is removed, the trace_array is
    freed. But if an instance is opened with a reference to the trace_array,
    then it requires looking up the trace_array to get its ref counter (as there
    could be a race with it being deleted and the open itself). Once it is
    found, a reference is added to prevent the instance from being removed (and
    the trace_array associated with it freed).

    Combine the two checks (tracing_disabled and trace_array_get()) into a
    single helper function. This will also make it easier to add lockdown to
    tracefs later.
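
    A user-space sketch of what such a combined helper might look like. The
    instance registry is modelled as a fixed array (an assumption made for
    brevity), locking is omitted, and the structures are reduced to a
    reference count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_INSTANCES 4             /* hypothetical registry size */

struct trace_array {
    int ref;
};

struct trace_array *instances[MAX_INSTANCES];
int tracing_disabled;

/* Look the instance up first; only touch it if it is still registered. */
int trace_array_get(struct trace_array *this_tr)
{
    size_t i;

    for (i = 0; i < MAX_INSTANCES; i++) {
        if (instances[i] == this_tr) {
            this_tr->ref++;
            return 0;
        }
    }
    return -ENODEV;
}

/* One helper for both checks done at open time. */
int tracing_check_open_get_tr(struct trace_array *tr)
{
    if (tracing_disabled)
        return -ENODEV;
    if (tr && trace_array_get(tr) < 0)
        return -ENODEV;
    return 0;
}
```

    Callers that succeed hold a reference and must drop it again on release
    or on a later error path.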

    Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Instead of having the trace events system open call open code the taking of
    the trace_array descriptor (with trace_array_get()) and then calling
    tracing_open_generic(), have it use the tracing_open_generic_tr() that does
    the combination of the two. This requires making tracing_open_generic_tr()
    global.

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • As instances may have different tracers available, we need to look at the
    trace_array descriptor that shows the list of the available tracers for the
    instance. But there's a race between opening the file and an admin
    deleting the instance. The trace_array_get() needs to be called before
    accessing the trace_array.

    Cc: stable@vger.kernel.org
    Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
    an instance now. They need to take a reference to the instance otherwise
    there could be a race between accessing the files and deleting the instance.

    It wasn't until the :mod: caching where these file operations started
    referencing the trace_array directly.

    Cc: stable@vger.kernel.org
    Fixes: 673feb9d76ab3 ("ftrace: Add :mod: caching infrastructure to trace_array")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Pull scheduler fixes from Ingo Molnar:
    "Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
    quota precision fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/vtime: Fix guest/system mis-accounting on task switch
    sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also a couple of updates for new Intel
    models (which are technically hw-enablement, but to users it's a fix
    to perf behavior on those new CPUs - hope this is fine), an AUX
    inheritance fix, event time-sharing fix, and a fix for lost non-perf
    NMI events on AMD systems"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    perf/x86/cstate: Add Tiger Lake CPU support
    perf/x86/msr: Add Tiger Lake CPU support
    perf/x86/intel: Add Tiger Lake CPU support
    perf/x86/cstate: Update C-state counters for Ice Lake
    perf/x86/msr: Add new CPU model numbers for Ice Lake
    perf/x86/cstate: Add Comet Lake CPU support
    perf/x86/msr: Add Comet Lake CPU support
    perf/x86/intel: Add Comet Lake CPU support
    perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp
    perf/core: Fix corner case in perf_rotate_context()
    perf/core: Rework memory accounting in perf_mmap()
    perf/core: Fix inheritance of aux_output groups
    perf annotate: Don't return -1 for error when doing BPF disassembly
    perf annotate: Return appropriate error code for allocation failures
    perf annotate: Fix arch specific ->init() failure errors
    perf annotate: Propagate the symbol__annotate() error return
    perf annotate: Fix the signedness of failure returns
    perf annotate: Propagate perf_env__arch() error
    perf evsel: Fall back to global 'perf_env' in perf_evsel__env()
    perf tools: Propagate get_cpuid() error
    ...

    Linus Torvalds
     

11 Oct, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Fix wbt performance regression introduced with the blk-rq-qos
    refactoring (Harshad)

    - Fix io_uring fileset removal inadvertently killing the workqueue (me)

    - Fix io_uring typo in linked command nonblock submission (Pavel)

    - Remove spurious io_uring wakeups on request free (Pavel)

    - Fix null_blk zoned command error return (Keith)

    - Don't use freezable workqueues for backing_dev, also means we can
    revert a previous libata hack (Mika)

    - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

    * tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
    nbd: fix possible sysfs duplicate warning
    null_blk: Fix zoned command return code
    io_uring: only flush workqueues on fileset removal
    io_uring: remove wait loop spurious wakeups
    blk-wbt: fix performance regression in wbt scale_up/scale_down
    Revert "libata, freezer: avoid block device removal while system is frozen"
    bdi: Do not use freezable workqueue
    io_uring: fix reversed nonblock flag for link submission

    Linus Torvalds
     

10 Oct, 2019

1 commit


09 Oct, 2019

4 commits

  • In perf_rotate_context(), when the first cpu flexible event fails to
    schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
    NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
    cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
    perf_event_sched_in() will skip all cpu flexible events because of the
    EVENT_FLEXIBLE bit.

    In the next call of perf_rotate_context(), cpu_rotate stays 1, and
    cpu_event stays NULL, so this process repeats. The end result is, flexible
    events on this cpu will not be scheduled (until another event is added
    to the cpuctx).

    Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
    could only use one counter, run one pinned event for ref-cycles, one
    flexible event for ref-cycles, and one flexible event for cycles. The
    flexible ref-cycles is never scheduled, which is expected. However,
    because of this issue, the cycles event is never scheduled either.

    $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

               time             counts unit events
        1.000152973         15,412,480      ref-cycles:D
        1.000152973      <not counted>      ref-cycles     (0.00%)
        1.000152973      <not counted>      cycles         (0.00%)
        2.000486957         18,263,120      ref-cycles:D
        2.000486957      <not counted>      ref-cycles     (0.00%)
        2.000486957      <not counted>      cycles         (0.00%)

    To fix this, when the flexible_active list is empty, try to rotate the
    first event in the flexible_groups. Also, rename ctx_first_active() to
    ctx_event_to_rotate(), which is more accurate.
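
    The selection rule can be sketched as follows. The context is modelled
    with plain arrays instead of the kernel's list and rbtree, so the
    structure layout here is purely an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8                /* hypothetical capacity */

struct perf_event {
    int id;
};

struct event_ctx {
    struct perf_event *flexible_active[MAX_EVENTS];
    size_t             nr_active;
    struct perf_event *flexible_groups[MAX_EVENTS];
    size_t             nr_groups;
};

/* Pick the event to rotate: first active one, else first group member. */
struct perf_event *ctx_event_to_rotate(const struct event_ctx *ctx)
{
    assert(ctx);
    if (ctx->nr_active > 0)
        return ctx->flexible_active[0];
    /* nothing was scheduled: fall back so rotation still makes progress */
    if (ctx->nr_groups > 0)
        return ctx->flexible_groups[0];
    return NULL;
}
```

    With the fallback, a flexible event that can never be scheduled gets
    rotated to the back, giving the next flexible event a chance.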

    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Thomas Gleixner
    Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
    Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar

    Song Liu
     
  • perf_mmap() always increases user->locked_vm. As a result, "extra" could
    grow bigger than "user_extra", which doesn't make sense. Here is an
    example case:

    (Note: Assume "user_lock_limit" is very small.)

    | # of perf_mmap calls | vma->vm_mm->pinned_vm | user->locked_vm |
    | 0                    | 0                     | 0               |
    | 1                    | user_extra            | user_extra      |
    | 2                    | 3 * user_extra        | 2 * user_extra  |
    | 3                    | 6 * user_extra        | 3 * user_extra  |
    | 4                    | 10 * user_extra       | 4 * user_extra  |

    Fix this by maintaining proper user_extra and extra.
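
    A simplified model of the accounting, reproducing the table above with
    user_lock_limit set to 0. The "old" function mirrors the behaviour that
    the table describes; the "fixed" function is one possible way to keep
    user_extra and extra consistent and is an assumption of this sketch, not
    a copy of the patch. struct mmap_acct is a hypothetical container.

```c
#include <assert.h>

struct mmap_acct {
    long locked_vm;                 /* models user->locked_vm */
    long pinned_vm;                 /* models vma->vm_mm->pinned_vm */
};

/* Old behaviour: locked_vm is always charged in full. */
void perf_mmap_account_old(struct mmap_acct *a, long user_extra,
                           long user_lock_limit)
{
    long extra = 0;
    long user_locked = a->locked_vm + user_extra;

    assert(a);
    if (user_locked > user_lock_limit)
        extra = user_locked - user_lock_limit;
    a->locked_vm += user_extra;
    a->pinned_vm += extra;
}

/* Sketch of a fix: every page is charged to exactly one counter. */
void perf_mmap_account_fixed(struct mmap_acct *a, long user_extra,
                             long user_lock_limit)
{
    long extra = 0;
    long user_locked = a->locked_vm + user_extra;

    assert(a);
    if (user_locked <= user_lock_limit) {
        /* everything fits under the per-user limit */
    } else if (a->locked_vm >= user_lock_limit) {
        /* limit already reached: charge everything to pinned_vm */
        extra = user_extra;
        user_extra = 0;
    } else {
        /* fill locked_vm up to the limit, charge the rest to pinned_vm */
        extra = user_locked - user_lock_limit;
        user_extra -= extra;
    }
    a->locked_vm += user_extra;
    a->pinned_vm += extra;
}
```

    In the old model pinned_vm grows as 1, 3, 6, 10 times user_extra, as in
    the table. In the fixed model the two counters together grow by exactly
    user_extra per call.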

    Reviewed-By: Hechao Li
    Reported-by: Hechao Li
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Jie Meng
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar

    Song Liu
     
  • vtime_account_system() assumes that the target task to account cputime
    to is always the current task. This is most often true indeed except on
    task switch where we call:

    vtime_common_task_switch(prev)
    vtime_account_system(prev)

    Here prev is the scheduling-out task where we account the cputime to. It
    doesn't match current that is already the scheduling-in task at this
    stage of the context switch.

    So we end up checking the wrong task flags to determine if we are
    accounting guest or system time to the previous task.

    As a result the wrong task is used to check if the target is running in
    guest mode. We may then spuriously account or leak either system or
    guest time on task switch.

    Fix this assumption and also turn vtime_guest_enter/exit() to use the
    task passed in parameter as well to avoid future similar issues.
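
    A minimal sketch of the idea: the guest/system decision is made from
    the flags of the task passed in, never from a global "current". The
    PF_VCPU value is a placeholder and struct task is a hypothetical,
    reduced stand-in for the kernel's task structure.

```c
#include <assert.h>
#include <stdint.h>

#define PF_VCPU 0x1u                /* placeholder value for illustration */

struct task {
    unsigned int flags;
    uint64_t     stime;             /* accumulated system time */
    uint64_t     gtime;             /* accumulated guest time */
};

/* Account delta to tsk, choosing guest or system time from tsk's own flags. */
void vtime_account_system(struct task *tsk, uint64_t delta)
{
    assert(tsk);
    if (tsk->flags & PF_VCPU)
        tsk->gtime += delta;
    else
        tsk->stime += delta;
}
```

    On a task switch the caller passes prev, so a vCPU task that is being
    scheduled out still has its time accounted as guest time.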

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
    Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The quota/period ratio is used to ensure a child task group won't get
    more bandwidth than the parent task group, and is calculated as:

    normalized_cfs_quota() = [(quota_us << 20) / period_us]

    If the quota/period ratio was changed during this scaling due to
    precision loss, it will cause inconsistency between parent and child
    task groups.

    See below example:

    A userspace container manager (kubelet) does three operations:

    1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
    2) Create a few children cgroups.
    3) Set quota to 1,000us and period to 10,000us on a child cgroup.

    These operations are expected to succeed. However, if the scaling of
    147/128 happens before step 3, quota and period of the parent cgroup
    will be changed:

    new_quota: 1148437ns, 1148us
    new_period: 11484375ns, 11484us

    And when step 3 comes in, the ratio of the child cgroup will be
    104857, which will be larger than the parent cgroup ratio (104821),
    and will fail.

    Scaling them by a factor of 2 will fix the problem.
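
    The arithmetic in the example can be checked with a short sketch. The
    helper scale_147_128() is a hypothetical name for the scaling step;
    normalized_cfs_quota() follows the formula quoted above, using
    microsecond inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio used to compare child and parent bandwidth. */
uint64_t normalized_cfs_quota(uint64_t quota_us, uint64_t period_us)
{
    assert(period_us > 0);
    return (quota_us << 20) / period_us;
}

/* Scaling by 147/128 in nanoseconds, with integer truncation. */
uint64_t scale_147_128(uint64_t ns)
{
    return ns * 147 / 128;
}
```

    Scaling 1,000,000ns and 10,000,000ns gives 1148437ns and 11484375ns,
    which truncate to 1148us and 11484us and a ratio of 104821. The
    unscaled child ratio is 104857, so the child appears to exceed the
    parent. Doubling both values instead (2,000us and 20,000us) keeps the
    ratio at 104857.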

    Tested-by: Phil Auld
    Signed-off-by: Xuewei Zhang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Phil Auld
    Cc: Anton Blanchard
    Cc: Ben Segall
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
    Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
    Signed-off-by: Ingo Molnar

    Xuewei Zhang
     

08 Oct, 2019

3 commits

  • Merge misc fixes from Andrew Morton:
    "The usual shower of hotfixes.

    Chris's memcg patches aren't actually fixes - they're mature but a few
    niggling review issues were late to arrive.

    The ocfs2 fixes are quite old - those took some time to get reviewer
    attention.

    Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
    mm/slab-generic"

    * emailed patches from Andrew Morton:
    mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
    mm, sl[ou]b: improve memory accounting
    mm, memcg: make scan aggression always exclude protection
    mm, memcg: make memory.emin the baseline for utilisation determination
    mm, memcg: proportional memory.{low,min} reclaim
    mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
    mm/page_alloc.c: fix a crash in free_pages_prepare()
    mm/z3fold.c: claim page in the beginning of free
    kernel/sysctl.c: do not override max_threads provided by userspace
    memcg: only record foreign writebacks with dirty pages when memcg is not disabled
    mm: fix -Wmissing-prototypes warnings
    writeback: fix use-after-free in finish_writeback_work()
    mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
    panic: ensure preemption is disabled during panic()
    fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
    fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
    fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
    ocfs2: clear zero in unaligned direct IO

    Linus Torvalds
     
  • Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe
    limits") because the patch is causing a regression to any workload which
    needs to override the auto-tuning of the limit provided by kernel.

    set_max_threads is implementing a boot time guesstimate to provide a
    sensible limit of the concurrently running threads so that runaways will
    not deplete all the memory. This is a good thing in general but there
    are workloads which might need to increase this limit for an application
    to run (reportedly WebSphere MQ is affected) and that is simply not
    possible after the mentioned change. It is also very dubious to
    override an admin decision by an estimation that doesn't have any direct
    relation to correctness of the kernel operation.

    Fix this by dropping set_max_threads from sysctl_max_threads so any
    value is accepted as long as it fits into MAX_THREADS which is important
    to check because allowing more threads could break internal robust futex
    restriction. While at it, do not use MIN_THREADS as the lower boundary
    because it is also only a heuristic for automatic estimation and the admin
    might have a good reason to stop new threads from being created even when
    below this limit.
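
    A sketch of the resulting range check. set_threads_max() is a
    hypothetical helper, MAX_THREADS is shown with the futex TID mask value,
    and the lower bound of 1 is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>

#define MAX_THREADS 0x3fffffff      /* futex TID mask; upper bound that must hold */

/* Accept any admin-provided value in [1, MAX_THREADS]; no auto-tuning. */
int set_threads_max(int *max_threads, long requested)
{
    assert(max_threads);
    if (requested < 1 || requested > MAX_THREADS)
        return -EINVAL;
    *max_threads = (int)requested;
    return 0;
}
```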

    This became more severe when we switched x86 from 4k to 8k kernel
    stacks. Since 6538b8ea886e ("x86_64: expand kernel stack to
    16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
    value.

    In the particular case

    3.12
    kernel.threads-max = 515561

    4.4
    kernel.threads-max = 200000

    Neither of the two values is really insane on a 32GB machine.

    I am not sure we want/need to tune the max_thread value further. If
    anything the tuning should be removed altogether if proven not useful in
    general. But we definitely need a way to override this auto-tuning.

    Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
    Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits")
    Signed-off-by: Michal Hocko
    Reviewed-by: "Eric W. Biederman"
    Cc: Heinrich Schuchardt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
    calling CPU in an infinite loop, but with interrupts and preemption
    enabled. From this state, userspace can continue to be scheduled,
    despite the system being "dead" as far as the kernel is concerned.

    This is easily reproducible on arm64 when booting with "nosmp" on the
    command line; a couple of shell scripts print out a periodic "Ping"
    message whilst another triggers a crash by writing to
    /proc/sysrq-trigger:

    | sysrq: Trigger a crash
    | Kernel panic - not syncing: sysrq triggered crash
    | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
    | Hardware name: linux,dummy-virt (DT)
    | Call trace:
    | dump_backtrace+0x0/0x148
    | show_stack+0x14/0x20
    | dump_stack+0xa0/0xc4
    | panic+0x140/0x32c
    | sysrq_handle_reboot+0x0/0x20
    | __handle_sysrq+0x124/0x190
    | write_sysrq_trigger+0x64/0x88
    | proc_reg_write+0x60/0xa8
    | __vfs_write+0x18/0x40
    | vfs_write+0xa4/0x1b8
    | ksys_write+0x64/0xf0
    | __arm64_sys_write+0x14/0x20
    | el0_svc_common.constprop.0+0xb0/0x168
    | el0_svc_handler+0x28/0x78
    | el0_svc+0x8/0xc
    | Kernel Offset: disabled
    | CPU features: 0x0002,24002004
    | Memory Limit: none
    | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
    | Ping 2!
    | Ping 1!
    | Ping 1!
    | Ping 2!

    The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
    otherwise local interrupts are disabled in 'smp_send_stop()'.

    Disable preemption in 'panic()' before re-enabling interrupts.

    Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
    Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
    Signed-off-by: Will Deacon
    Reported-by: Xogium
    Reviewed-by: Kees Cook
    Cc: Russell King
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Petr Mladek
    Cc: Feng Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

07 Oct, 2019

2 commits

  • Commit:

    ab43762ef010 ("perf: Allow normal events to output AUX data")

    forgets to configure the aux_output relation in the inherited groups,
    which results in child PEBS events forever failing to schedule.

    Fix this by setting up the AUX output link in the inheritance path.

    Signed-off-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Pull dma-mapping regression fix from Christoph Hellwig:
    "Revert an incorrect hunk from a patch that caused problems on various
    arm boards (Andrey Smirnov)"

    * tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
    dma-mapping: fix false positive warnings in dma_common_free_remap()

    Linus Torvalds
     

06 Oct, 2019

2 commits

  • This reverts commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557.

    The commit was added as a quick band-aid for a hang that happened when a
    block device was removed during system suspend. Now that bdi_wq is not
    freezable anymore the hang should not be possible and we can get rid of
    this hack by reverting it.

    Acked-by: Rafael J. Wysocki
    Signed-off-by: Mika Westerberg
    Signed-off-by: Jens Axboe

    Mika Westerberg
     
  • …asahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - remove unneeded ar-option and KBUILD_ARFLAGS

    - remove long-deprecated SUBDIRS

    - fix modpost to suppress false-positive warnings for UML builds

    - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}

    - make setlocalversion work for /bin/sh

    - make header archive reproducible

    - fix some Makefiles and documents

    * tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kheaders: make headers archive reproducible
    kbuild: update compile-test header list for v5.4-rc2
    kbuild: two minor updates for Documentation/kbuild/modules.rst
    scripts/setlocalversion: clear local variable to make it work for sh
    namespace: fix namespace.pl script to support relative paths
    video/logo: do not generate unneeded logo C files
    video/logo: remove unneeded *.o pattern from clean-files
    integrity: remove pointless subdir-$(CONFIG_...)
    integrity: remove unneeded, broken attempt to add -fshort-wchar
    modpost: fix static EXPORT_SYMBOL warnings for UML build
    kbuild: correct formatting of header in kbuild module docs
    kbuild: remove SUBDIRS support
    kbuild: remove ar-option and KBUILD_ARFLAGS

    Linus Torvalds
     

05 Oct, 2019

4 commits

  • Commit 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages
    helper") changed invalid input check in dma_common_free_remap() from:

    if (!area || area->flags != VM_DMA_COHERENT)

    to

    if (!area || area->flags != VM_DMA_COHERENT || !area->pages)

    which seems to produce false positives for memory obtained via
    dma_common_contiguous_remap().

    This triggers the following warning message when doing "reboot" on ZII
    VF610 Dev Board Rev B:

    WARNING: CPU: 0 PID: 1 at kernel/dma/remap.c:112 dma_common_free_remap+0x88/0x8c
    trying to free invalid coherent area: 9ef82980
    Modules linked in:
    CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.3.0-rc6-next-20190820 #119
    Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
    Backtrace:
    [] (dump_backtrace) from [] (show_stack+0x20/0x24)
    r7:8015ed78 r6:00000009 r5:00000000 r4:9f4d9b14
    [] (show_stack) from [] (dump_stack+0x24/0x28)
    [] (dump_stack) from [] (__warn.part.3+0xcc/0xe4)
    [] (__warn.part.3) from [] (warn_slowpath_fmt+0x78/0x94)
    r6:00000070 r5:808e540c r4:81c03048
    [] (warn_slowpath_fmt) from [] (dma_common_free_remap+0x88/0x8c)
    r3:9ef82980 r2:808e53e0
    r7:00001000 r6:a0b1e000 r5:a0b1e000 r4:00001000
    [] (dma_common_free_remap) from [] (remap_allocator_free+0x60/0x68)
    r5:81c03048 r4:9f4d9b78
    [] (remap_allocator_free) from [] (__arm_dma_free.constprop.3+0xf8/0x148)
    r5:81c03048 r4:9ef82900
    [] (__arm_dma_free.constprop.3) from [] (arm_dma_free+0x24/0x2c)
    r5:9f563410 r4:80110120
    [] (arm_dma_free) from [] (dma_free_attrs+0xa0/0xdc)
    [] (dma_free_attrs) from [] (dma_pool_destroy+0xc0/0x154)
    r8:9efa8860 r7:808f02f0 r6:808f02d0 r5:9ef82880 r4:9ef82780
    [] (dma_pool_destroy) from [] (ehci_mem_cleanup+0x6c/0x150)
    r7:9f563410 r6:9efa8810 r5:00000000 r4:9efd0148
    [] (ehci_mem_cleanup) from [] (ehci_stop+0xac/0xc0)
    r5:9efd0148 r4:9efd0000
    [] (ehci_stop) from [] (usb_remove_hcd+0xf4/0x1b0)
    r7:9f563410 r6:9efd0074 r5:81c03048 r4:9efd0000
    [] (usb_remove_hcd) from [] (host_stop+0x48/0xb8)
    r7:9f563410 r6:9efd0000 r5:9f5f4040 r4:9f5f5040
    [] (host_stop) from [] (ci_hdrc_host_destroy+0x34/0x38)
    r7:9f563410 r6:9f5f5040 r5:9efa8800 r4:9f5f4040
    [] (ci_hdrc_host_destroy) from [] (ci_hdrc_remove+0x50/0x10c)
    [] (ci_hdrc_remove) from [] (platform_drv_remove+0x34/0x4c)
    r7:9f563410 r6:81c4f99c r5:9efa8810 r4:9efa8810
    [] (platform_drv_remove) from [] (device_release_driver_internal+0xec/0x19c)
    r5:00000000 r4:9efa8810
    [] (device_release_driver_internal) from [] (device_release_driver+0x20/0x24)
    r7:9f563410 r6:81c41ed0 r5:9efa8810 r4:9f4a1dac
    [] (device_release_driver) from [] (bus_remove_device+0xdc/0x108)
    [] (bus_remove_device) from [] (device_del+0x150/0x36c)
    r7:9f563410 r6:81c03048 r5:9efa8854 r4:9efa8810
    [] (device_del) from [] (platform_device_del.part.2+0x20/0x84)
    r10:9f563414 r9:809177e0 r8:81cb07dc r7:81c78320 r6:9f563454 r5:9efa8800
    r4:9efa8800
    [] (platform_device_del.part.2) from [] (platform_device_unregister+0x28/0x34)
    r5:9f563400 r4:9efa8800
    [] (platform_device_unregister) from [] (ci_hdrc_remove_device+0x1c/0x30)
    r5:9f563400 r4:00000001
    [] (ci_hdrc_remove_device) from [] (ci_hdrc_imx_remove+0x38/0x118)
    r7:81c78320 r6:9f563454 r5:9f563410 r4:9f541010
    [] (ci_hdrc_imx_shutdown) from [] (platform_drv_shutdown+0x2c/0x30)
    [] (platform_drv_shutdown) from [] (device_shutdown+0x158/0x1f0)
    [] (device_shutdown) from [] (kernel_restart_prepare+0x44/0x48)
    r10:00000058 r9:9f4d8000 r8:fee1dead r7:379ce700 r6:81c0b280 r5:81c03048
    r4:00000000
    [] (kernel_restart_prepare) from [] (kernel_restart+0x1c/0x60)
    [] (kernel_restart) from [] (__do_sys_reboot+0xe0/0x1d8)
    r5:81c03048 r4:00000000
    [] (__do_sys_reboot) from [] (sys_reboot+0x18/0x1c)
    r8:80101204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
    [] (sys_reboot) from [] (ret_fast_syscall+0x0/0x54)
    Exception stack(0x9f4d9fa8 to 0x9f4d9ff0)
    9fa0: 00000000 00000000 fee1dead 28121969 01234567 379ce700
    9fc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00016d04
    9fe0: 00028e0c 7ec87c64 000135ec 76c1f410

    Restore original invalid input check in dma_common_free_remap() to
    avoid this problem.
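
    The difference between the two checks can be modelled outside the
    kernel. The following is a minimal Python sketch, not the kernel code:
    the Area class, the function names and the VM_DMA_COHERENT value are
    placeholders, and it assumes that an area created by
    dma_common_contiguous_remap() carries no pages array.

```python
from dataclasses import dataclass
from typing import Optional

VM_DMA_COHERENT = 0x10  # placeholder flag value for this sketch only


@dataclass
class Area:
    """Hypothetical stand-in for the fields of struct vm_struct used here."""
    flags: int
    pages: Optional[list] = None  # assumed unset for contiguous remaps


def original_check_rejects(area: Optional[Area]) -> bool:
    # Restored check: only a missing area or a flag mismatch is invalid.
    return area is None or area.flags != VM_DMA_COHERENT


def stricter_check_rejects(area: Optional[Area]) -> bool:
    # Check with the extra "no pages" condition added to it.
    return original_check_rejects(area) or not area.pages


contiguous = Area(flags=VM_DMA_COHERENT)  # valid area, but no pages array
print(original_check_rejects(contiguous))
print(stricter_check_rejects(contiguous))
```

    Under that assumption the stricter check rejects a valid area, which
    matches the false-positive warning in the trace above.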

    Fixes: 5cf4537975bb ("dma-mapping: introduce a dma_common_find_pages helper")
    Signed-off-by: Andrey Smirnov
    [hch: just revert the offending hunk instead of creating a new helper]
    Signed-off-by: Christoph Hellwig

    Andrey Smirnov
     
  • In commit 43d8ce9d65a5 ("Provide in-kernel headers to make
    extending kernel easier") a new mechanism was introduced, for kernels
    >=5.2, which embeds the kernel headers in the kernel image or a module
    and exposes them in procfs for use by userland tools.

    The archive containing the header files is nondeterministic because of
    header file metadata. This patch normalizes the metadata and uses
    KBUILD_BUILD_TIMESTAMP if provided, otherwise falling back to the
    default behaviour.

    In commit f7b101d33046 ("kheaders: Move from proc to sysfs") the
    mechanism was modified to use sysfs, and the script that generates the
    archive was renamed to the one being patched here.
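
    The general idea of metadata normalization can be illustrated with
    Python's tarfile module. This is a sketch of the technique only, not
    the kheaders script: build_archive() and its parameters are
    hypothetical, and the timestamp is taken as an integer epoch value for
    simplicity.

```python
import io
import tarfile
from typing import Mapping


def build_archive(files: Mapping[str, bytes], timestamp: int = 0) -> bytes:
    """Pack files into an uncompressed tar with normalized metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sort the names so that member order does not depend on the input.
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = timestamp       # fixed instead of the file's mtime
            info.uid = info.gid = 0      # no builder-specific ownership
            info.uname = info.gname = ""
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


first = build_archive({"b.h": b"2", "a.h": b"1"}, timestamp=1570233600)
second = build_archive({"a.h": b"1", "b.h": b"2"}, timestamp=1570233600)
print(first == second)
```

    With order, ownership and timestamps pinned, two builds of the same
    inputs produce byte-identical archives.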

    Signed-off-by: Dmitry Goldin
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Masahiro Yamada

    Dmitry Goldin
     
  • …/kernel/git/brauner/linux

    Pull copy_struct_from_user() helper from Christian Brauner:
    "This contains the copy_struct_from_user() helper which got split out
    from the openat2() patchset. It is a generic interface designed to
    copy a struct from userspace.

    The helper will be especially useful for structs versioned by size, of
    which we have quite a few. This allows for backwards compatibility,
    i.e. an extended struct can be passed to an older kernel, or a legacy
    struct can be passed to a newer kernel. For the first case (extended
    struct, older kernel) the new fields in an extended struct can be set
    to zero and the struct safely passed to an older kernel.

    The most obvious benefit is that this helper lets us get rid of
    duplicate code present in at least sched_setattr(), perf_event_open(),
    and clone3(). More importantly it will also help to ensure that users
    implementing versioning-by-size end up with the same core semantics.

    This point is especially crucial since we have at least one case where
    versioning-by-size is used but with slightly different semantics:
    sched_setattr(), perf_event_open(), and clone3() all do similar
    checks to copy_struct_from_user() while rt_sigprocmask(2) always
    rejects differently-sized struct arguments.

    With this pull request we also switch over sched_setattr(),
    perf_event_open(), and clone3() to use the new helper"

    * tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    usercopy: Add parentheses around assignment in test_copy_struct_from_user
    perf_event_open: switch to copy_struct_from_user()
    sched_setattr: switch to copy_struct_from_user()
    clone3: switch to copy_struct_from_user()
    lib: introduce copy_struct_from_user() helper
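
    The versioning-by-size semantics described above can be modelled in
    userspace. The following Python sketch reflects my understanding of the
    helper's behaviour and is not the kernel implementation: the function
    reuses the helper's name for readability, but its signature is
    hypothetical, and the E2BIG error for non-zero unknown fields is an
    assumption not stated in the pull message.

```python
import errno


def copy_struct_from_user(ksize: int, src: bytes) -> bytes:
    """Model of copying a size-versioned struct.

    ksize is the struct size the kernel knows about; src is the raw
    struct from userspace, and len(src) plays the role of its size.
    """
    usize = len(src)
    if usize > ksize:
        # Extended struct, older kernel: every field the kernel does not
        # know about must be zero, otherwise the request is rejected.
        if any(src[ksize:]):
            raise OSError(errno.E2BIG, "unknown trailing fields are set")
        return src[:ksize]
    # Legacy struct, newer kernel (or equal sizes): copy what was given
    # and zero-fill the fields userspace does not know about.
    return src + bytes(ksize - usize)


print(copy_struct_from_user(8, b"\x01\x02\x03\x04"))
```

    A caller that wants rt_sigprocmask(2)-style behaviour would instead
    reject any src whose length differs from ksize.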

    Linus Torvalds
     
  • Pull clone3/pidfd fixes from Christian Brauner:
    "This contains a couple of fixes:

    - Fix pidfd selftest compilation (Shuah Khan)

    Due to an incorrect linking instruction in the Makefile, compilation
    of the pidfd selftests would fail on some systems.

    - Fix compilation for glibc on RISC-V systems (Seth Forshee)

    In some scenarios linux/uapi/linux/sched.h is included where
    __ASSEMBLY__ is defined, causing a build failure because struct
    clone_args was not guarded by an #ifndef __ASSEMBLY__.

    - Add missing clone3() and struct clone_args kernel-doc (Christian Brauner)

    clone3() and struct clone_args were missing kernel-docs. (The goal
    is to use kernel-doc for any function or type where it's worth it.)
    For struct clone_args this also contains a comment about the fact
    that it's versioned by size"

    * tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    sched: add kernel-doc for struct clone_args
    fork: add kernel-doc for clone3
    selftests: pidfd: Fix undefined reference to pthread_create()
    sched: Add __ASSEMBLY__ guards around struct clone_args

    Linus Torvalds