05 Mar, 2020

13 commits

  • [ Upstream commit c780e86dd48ef6467a1146cf7d0fe1e05a635039 ]

    KASAN is reporting that __blk_add_trace() has a use-after-free issue
    when accessing q->blk_trace. Indeed the switching of block tracing (and
    thus eventual freeing of q->blk_trace) is completely unsynchronized with
    the currently running tracing and thus it can happen that the blk_trace
    structure is being freed just while __blk_add_trace() works on it.
    Protect accesses to q->blk_trace by RCU during tracing and make sure we
    wait for the end of the RCU grace period when shutting down tracing.
    Luckily that is a rare enough event that we can afford it. Note that
    postponing the freeing of blk_trace to an RCU callback is better avoided,
    as it could have unexpected user-visible side effects: the debugfs files
    would still exist for a short while after block tracing has been shut
    down.
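
    A minimal sketch of the resulting pattern, assuming simplified helper
    names (not the literal upstream diff): the tracing hot path dereferences
    q->blk_trace under rcu_read_lock(), and teardown waits for readers before
    freeing.

    /* tracing hot path */
    rcu_read_lock();
    bt = rcu_dereference(q->blk_trace);
    if (bt)
            record_blktrace_event(bt, ...);   /* hypothetical stand-in */
    rcu_read_unlock();

    /* shutdown path */
    bt = xchg(&q->blk_trace, NULL);
    synchronize_rcu();        /* wait for in-flight __blk_add_trace() users */
    free_blk_trace(bt);       /* hypothetical stand-in for the real free */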

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
    CC: stable@vger.kernel.org
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Ming Lei
    Tested-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reported-by: Tristan Madani
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • commit a030f9767da1a6bbcec840fc54770eb11c2414b6 upstream.

    It was found that two lines in the output of /proc/lockdep_stats have
    indentation problem:

    # cat /proc/lockdep_stats
    :
    in-process chains: 25057
    stack-trace entries: 137827 [max: 524288]
    number of stack traces: 7973
    number of stack hash chains: 6355
    combined max dependencies: 1356414598
    hardirq-safe locks: 57
    hardirq-unsafe locks: 1286
    :

    All the numbers displayed in /proc/lockdep_stats except the two stack
    trace numbers are formatted with a field width of 11. To properly align
    all the numbers, a field width of 11 is now applied to the two stack
    trace numbers as well.
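
    A hedged sketch of what such a change amounts to (the exact counter
    helpers in kernel/locking/lockdep_proc.c may differ):

    seq_printf(m, " number of stack traces:       %11llu\n",
               lockdep_stack_trace_count());
    seq_printf(m, " number of stack hash chains:  %11llu\n",
               lockdep_stack_hash_count());

    The "%11llu" field width matches the other counters in the file, so the
    columns line up.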

    Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics")
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • commit 4bc6b745e5cbefed92c48071e28a5f41246d0470 upstream.

    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase. However, it is possible for a long
    preemption to result in the waiting task never sleeping. For example,
    consider the following sequence of events:

    1. Task A starts an expedited grace period by invoking
    synchronize_rcu_expedited(). It proceeds normally up to the
    wait_event() near the end of that function, and is then preempted
    (or interrupted or whatever).

    2. The expedited grace period completes, and a kworker task starts
    the awaken phase, having incremented the counter and acquired
    the rcu_state structure's .exp_wake_mutex. This kworker task
    is then preempted or interrupted or whatever.

    3. Task A resumes and enters wait_event(), which notes that the
    expedited grace period has completed, and thus doesn't sleep.

    4. Task B starts an expedited grace period exactly as did Task A,
    complete with the preemption (or whatever delay) just before
    the call to wait_event().

    5. The expedited grace period completes, and another kworker
    task starts the awaken phase, having incremented the counter.
    However, it blocks when attempting to acquire the rcu_state
    structure's .exp_wake_mutex because step 2's kworker task has
    not yet released it.

    6. Steps 4 and 5 repeat, resulting in overflow of the rcu_node
    structure's ->exp_wq[] array.

    In theory, this is harmless. Tasks waiting on the various ->exp_wq[]
    arrays will just be spuriously awakened, but they will simply sleep again
    on noting that the rcu_state structure's ->expedited_sequence value has
    not advanced far enough.

    In practice, this wastes CPU time and is an accident waiting to happen.
    This commit therefore moves the rcu_exp_gp_seq_end() call that officially
    ends the expedited grace period (along with the associated tracing) to
    after the ->exp_wake_mutex has been acquired. This prevents Task A from
    awakening prematurely, thus preventing more than one expedited grace
    period from being in flight during a previous expedited grace period's
    wakeup phase.
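
    Sketched, the reordering described above looks roughly like this inside
    the wait/wake path (abbreviated, not the exact upstream code):

    static void rcu_exp_wait_wake(unsigned long s)
    {
            /* ... wait for the expedited grace period to complete ... */

            mutex_lock(&rcu_state.exp_wake_mutex);
            rcu_exp_gp_seq_end();  /* only now officially end the GP, so the
                                      next expedited GP cannot start while
                                      this wakeup phase is still pending */
            /* ... wake tasks on the rcu_node ->exp_wq[] arrays ... */
            mutex_unlock(&rcu_state.exp_wake_mutex);
    }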

    Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
    Signed-off-by: Neeraj Upadhyay
    [ paulmck: Added updated comment. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Neeraj Upadhyay
     
  • commit 9f24c540f7f8eb3a981528da9a9a636a5bdf5987 upstream.

    The low resolution parts of the VDSO, i.e.:

    clock_gettime(CLOCK_*_COARSE), clock_getres(), time()

    can be used even if there is no VDSO capable clocksource.

    But if an architecture opts out of the VDSO data update then this
    information becomes stale. This affects ARM when there is no architected
    timer available. The lack of update causes userspace to use stale data
    forever.

    Make the update of the low resolution parts unconditional and only skip
    the update of the high resolution parts if the architecture requests it.
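
    A rough sketch of the intended structure of the generic update, with
    hypothetical helper names for the two halves of the data:

    void update_vsyscall(struct timekeeper *tk)
    {
            struct vdso_data *vdata = __arch_get_k_vdso_data();

            /* Coarse clocks and time(): always refresh */
            update_vdso_coarse_data(vdata, tk);        /* hypothetical helper */

            /* High resolution clocks: only if the arch did not opt out */
            if (__arch_update_vdso_data())
                    update_vdso_hires_data(vdata, tk); /* hypothetical helper */
    }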

    Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200114185946.765577901@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 9a6b55ac4a44060bcb782baf002859b2a2c63267 upstream.

    The function name suggests that this is a boolean checking whether the
    architecture asks for an update of the VDSO data, but it works the other
    way round. To spare further confusion invert the logic.

    Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200114185946.656652824@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit f66c0447cca1281116224d474cdb37d6a18e4b5b upstream.

    Set the unoptimized flag after confirming the code is completely
    unoptimized. Without this fix, when a kprobe hits the intermediate
    modified instruction (the first byte is replaced by an INT3, but
    later bytes can still be a jump address operand) while unoptimizing,
    it can return to the middle byte of the modified code, which causes
    an invalid instruction exception in the kernel.

    Usually, this is a rare case, but if we put a probe on the function
    call while text patching, it always causes a kernel panic as below:

    # echo p text_poke+5 > kprobe_events
    # echo 1 > events/kprobes/enable
    # echo 0 > events/kprobes/enable

    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    RIP: 0010:text_poke+0x9/0x50
    Call Trace:
    arch_unoptimize_kprobe+0x22/0x28
    arch_unoptimize_kprobes+0x39/0x87
    kprobe_optimizer+0x6e/0x290
    process_one_work+0x2a0/0x610
    worker_thread+0x28/0x3d0
    ? process_one_work+0x610/0x610
    kthread+0x10d/0x130
    ? kthread_park+0x80/0x80
    ret_from_fork+0x3a/0x50

    text_poke() is used for patching the code in optprobes.

    This can happen even if we blacklist text_poke() and other functions,
    because there is a small time window during which we show the intermediate
    code to other CPUs.

    [ mingo: Edited the changelog. ]

    Tested-by: Alexei Starovoitov
    Signed-off-by: Masami Hiramatsu
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
    Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.

    select_idle_cpu() will scan the LLC domain for idle CPUs, which is always
    expensive, so the commit:

    1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")

    introduced a way to limit how many CPUs we scan.

    But the scan can spend some of those 'nr' attempts on CPUs that are not
    allowed for the task, wasting them. The function may then always return
    nr_cpumask_bits without ever finding a CPU on which our task is allowed
    to run.

    The cpumask may be too big to put on the stack, so, as in
    select_idle_core(), use the per-CPU 'select_idle_mask' to prevent stack
    overflow.
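
    The idea, sketched with the helpers mentioned above (not the exact
    upstream hunk): restrict the scan mask to the task's allowed CPUs up
    front, so every one of the 'nr' attempts is spent on a usable CPU.

    struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

    for_each_cpu_wrap(cpu, cpus, target) {
            if (!--nr)
                    return -1;              /* scan budget exhausted */
            if (available_idle_cpu(cpu))
                    break;                  /* found an allowed idle CPU */
    }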

    Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
    Signed-off-by: Cheng Jian
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Cheng Jian
     
  • commit 78041c0c9e935d9ce4086feeff6c569ed88ddfd4 upstream.

    The tracing selftests check various aspects of the tracing infrastructure,
    and one is filtering. If trace_printk() is active during a self test, it can
    cause the filtering to fail, which will disable that part of the trace.

    To keep the selftests from failing because of trace_printk() calls,
    trace_printk() checks the variable tracing_selftest_running, and if set, it
    does not write to the tracing buffer.

    As some tracers were registered earlier in boot, the selftest they triggered
    would fail because not all the infrastructure was set up for the full
    selftest. Thus, some of the tests were postponed to when their
    infrastructure was ready (namely the file system code). The postponement
    code did not set the tracing_selftest_running variable, and could fail if
    a trace_printk() was added and executed during their run.
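
    The shape of the fix, sketched (the deferred-selftest runner brackets its
    loop the same way the early boot path does; names follow the changelog
    rather than the exact upstream code):

    tracing_selftest_running = true;
    /* ... run each postponed tracer selftest ... */
    tracing_selftest_running = false;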

    Cc: stable@vger.kernel.org
    Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 756125289285f6e55a03861bf4b6257aa3d19a93 upstream.

    This patch ensures that we always check the netlink payload length
    in audit_receive_msg() before we take any action on the payload
    itself.
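
    The general pattern, sketched with a hypothetical message type (not the
    literal upstream diff): validate nlmsg_len(nlh) before interpreting the
    payload.

    len = nlmsg_len(nlh);
    if (len < sizeof(struct some_audit_payload))   /* hypothetical struct */
            return -EINVAL;
    payload = nlmsg_data(nlh);   /* safe to use only after the length check */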

    Cc: stable@vger.kernel.org
    Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
    Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • commit 2ad3e17ebf94b7b7f3f64c050ff168f9915345eb upstream.

    Commit 219ca39427bf ("audit: use union for audit_field values since
    they are mutually exclusive") combined a number of separate fields in
    the audit_field struct into a single union. Generally this worked
    just fine because they are generally mutually exclusive.
    Unfortunately in audit_data_to_entry() the overlap can be a problem
    when a specific error case is triggered that causes the error path
    code to attempt to cleanup an audit_field struct and the cleanup
    involves attempting to free a stored LSM string (the lsm_str field).
    Currently the code always has a non-NULL value in the
    audit_field.lsm_str field as the top of the for-loop transfers a
    value into audit_field.val (both .lsm_str and .val are part of the
    same union); if audit_data_to_entry() fails and the audit_field
    struct is specified to contain a LSM string, but the
    audit_field.lsm_str has not yet been properly set, the error handling
    code will attempt to free the bogus audit_field.lsm_str value that
    was set with audit_field.val at the top of the for-loop.

    This patch corrects this by ensuring that the audit_field.val is only
    set when needed (it is cleared when the audit_field struct is
    allocated with kcalloc()). It also corrects a few other issues to
    ensure that in case of error the proper error code is returned.

    Cc: stable@vger.kernel.org
    Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
    Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • [ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]

    When a running task is moved onto a throttled task group and there is no
    other task enqueued on the CPU, the task can keep running using 100% CPU
    regardless of the bandwidth allocated to the group, even though its cfs_rq
    is throttled. Furthermore, the group entity of the cfs_rq and its parents
    are not enqueued but only set as curr on their respective cfs_rqs.

    We have the following sequence:

    sched_move_task
    -dequeue_task: dequeue task and group_entities.
    -put_prev_task: put task and group entities.
    -sched_change_group: move task to new group.
    -enqueue_task: enqueue only task but not group entities because cfs_rq is
    throttled.
    -set_next_task : set task and group_entities as current sched_entity of
    their cfs_rq.

    Another impact is that the runnable_load_avg at the root rq stays null
    because the group entities are not enqueued. This situation persists
    until an "external" event triggers a reschedule. Let's trigger it
    immediately instead.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Ben Segall
    Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • [ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]

    The way loadavg is tracked during nohz only pays attention to the load
    upon entering nohz. This can be particularly noticeable if full nohz is
    entered while non-idle, and then the cpu goes idle and stays that way for
    a long time.

    Use the remote tick to ensure that full nohz cpus report their deltas
    within a reasonable time.

    [ swood: Added changelog and removed recheck of stopped tick. ]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Peter Zijlstra (Intel)
     
  • [ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]

    This will be used in the next patch to get a loadavg update from
    nohz cpus. The delta check is skipped because idle_sched_class
    doesn't update se.exec_start.

    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Scott Wood
     

29 Feb, 2020

3 commits

  • commit e20d3a055a457a10a4c748ce5b7c2ed3173a1324 upstream.

    This 'if' guards whether user space wants a copy of the offload-jited
    bytecode and whether this bytecode exists. Erroneously doing a bitwise
    AND instead of a logical AND on the user- and kernel-space buffer sizes
    can lead to no data being copied to user space, especially when the
    user-space size is a power of two and bigger than the kernel-space buffer.
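
    In essence (variable names here are illustrative, not the exact fields):

    if (kernel_len & user_len)    /* buggy: bitwise AND of two sizes can be 0
                                     even when both are non-zero, e.g. 2048 & 4096 */
            /* copy to user space */;

    if (kernel_len && user_len)   /* intended: copy whenever both are non-zero */
            /* copy to user space */;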

    Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
    Signed-off-by: Johannes Krude
    Signed-off-by: Daniel Borkmann
    Acked-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
    Signed-off-by: Greg Kroah-Hartman

    Johannes Krude
     
  • commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream.

    Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
    which catches cases where the affinity setting is not done on the CPU which
    is the current target of the interrupt, triggers during CPU hotplug stress
    testing.

    It turns out that the warning which was added with the commit addressing
    the MSI affinity race unearthed yet another long standing bug.

    If user space writes a bogus affinity mask, i.e. one that contains no online
    CPUs, then the kernel calls irq_select_affinity_usr(). This was introduced
    for ALPHA in

    eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")

    and subsequently made available for all architectures in

    18404756765c ("genirq: Expose default irq affinity mask (take 3)")

    which introduced the circumvention of the affinity setting restrictions
    for interrupts which cannot be moved in process context.

    The whole exercise is bogus in various aspects:

    1) If the interrupt is already started up then there is absolutely
    no point to honour a bogus interrupt affinity setting from user
    space. The interrupt is already assigned to an online CPU and it
    does not make any sense to reassign it to some other randomly
    chosen online CPU.

    2) If the interrupt is not yet started up then there is no point
    either. A subsequent startup of the interrupt will invoke
    irq_setup_affinity() anyway which will choose a valid target CPU.

    So the only correct solution is to just return -EINVAL in case user space
    wrote an affinity mask which does not contain any online CPUs, except for
    ALPHA, which has its own magic sauce for this.
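
    Sketched, the user-space write path then does something like this
    (abbreviated, not the exact upstream hunk):

    if (!cpumask_intersects(new_value, cpu_online_mask)) {
            /* was: irq_select_affinity_usr(irq); */
            err = -EINVAL;
            goto free_cpumask;
    }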

    Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
    Reported-by: Qian Cai
    Signed-off-by: Thomas Gleixner
    Tested-by: Qian Cai
    Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.

    Issuing write() with count parameter set to 0 on any file under
    /proc/pressure/ will cause an OOB write because of the access to
    buf[buf_size-1] when NUL-termination is performed. Fix this by checking
    for buf_size to be non-zero.
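
    A minimal sketch of the fixed write handler prologue (simplified from the
    changelog description):

    if (!nbytes)
            return -EINVAL;

    buf_size = min(nbytes, sizeof(buf));
    if (copy_from_user(buf, user_buf, buf_size))
            return -EFAULT;

    buf[buf_size - 1] = '\0';   /* safe now that buf_size is known non-zero */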

    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

24 Feb, 2020

19 commits

  • [ Upstream commit 6722b23e7a2ace078344064a9735fb73e554e9ef ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    Without patch:
    # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
    dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
    n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    # Available triggers:
    # traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    6+1 records in
    6+1 records out
    206 bytes copied, 0.00027916 s, 738 kB/s

    Notice the printing of "# Available triggers:..." after the line.

    With the patch:
    # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
    dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
    n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    2+1 records in
    2+1 records out
    88 bytes copied, 0.000526867 s, 167 kB/s

    It only prints the end of the file, and does not restart.
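
    The underlying seq_file rule, sketched as a generic .next implementation
    (illustrative only): the position index must advance even on the last
    call that returns NULL, otherwise a subsequent read replays output.

    static void *t_next(struct seq_file *m, void *v, loff_t *pos)
    {
            (*pos)++;                 /* always advance, even when done */
            return get_item(m, *pos); /* hypothetical lookup; NULL at EOF */
    }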

    Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit e4075e8bdffd93a9b6d6e1d52fabedceeca5a91b ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    Without patch:
    # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
    dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
    id
    no pid
    2+1 records in
    2+1 records out
    10 bytes copied, 0.000213285 s, 46.9 kB/s

    Notice the "id" followed by "no pid".

    With the patch:
    # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
    dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
    id
    0+1 records in
    0+1 records out
    3 bytes copied, 0.000202112 s, 14.8 kB/s

    Notice that it only prints "id" and not the "no pid" afterward.

    Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit 90435a7891a2259b0f74c5a1bc5600d0d64cba8f ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

    v1 -> v2: removed missed increment at the end of the function

    Signed-off-by: Vasily Averin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit c79108bd19a8490315847e0c95ac6526fcd8e770 ]

    The alarmtimer_suspend() function will fail if an RTC device is on a bus
    such as SPI or i2c and that RTC device registers and probes after
    alarmtimer_init() registers and probes the 'alarmtimer' platform device.

    This is because system wide suspend suspends devices in the reverse order
    of their probe. When alarmtimer_suspend() attempts to program the RTC for a
    wakeup it will try to program an RTC device on a bus that has already been
    suspended.

    Move the alarmtimer device registration to happen when the RTC which is
    used for wakeup is registered. Register the 'alarmtimer' platform device as
    a child of the RTC device too, so that it can be guaranteed that the RTC
    device won't be suspended when alarmtimer_suspend() is called.

    Reported-by: Douglas Anderson
    Signed-off-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Douglas Anderson
    Link: https://lore.kernel.org/r/20200124055849.154411-2-swboyd@chromium.org
    Signed-off-by: Sasha Levin

    Stephen Boyd
     
  • [ Upstream commit 708e0ada1916be765b7faa58854062f2bc620bbf ]

    In setup_load_info(), info->name (which contains the name of the module,
    mostly used for early logging purposes before the module gets set up)
    gets unconditionally assigned if .modinfo is missing despite the fact
    that there is an if (!info->name) check near the end of the function.
    Avoid assigning a placeholder string to info->name if .modinfo doesn't
    exist, so that we can fall back to info->mod->name later on.

    Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()")
    Reviewed-by: Miroslav Benes
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Jessica Yu
     
  • [ Upstream commit 11e31f608b499f044f24b20be73f1dcab3e43f8a ]

    Robert reported that during boot the watchdog timestamp is set to 0 for one
    second which is the indicator for a watchdog reset.

    The reason for this is that the timestamp is in seconds and the time is
    taken from sched clock and divided by ~1e9. sched clock starts at 0 which
    means that for the first second during boot the watchdog timestamp is 0,
    i.e. reset.

    Use ULONG_MAX as the reset indicator value so the watchdog works correctly
    right from the start. ULONG_MAX would only conflict with a real timestamp
    if the system reaches an uptime of 136 years on 32bit and almost eternity
    on 64bit.

    Reported-by: Robert Richter
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     
  • [ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]

    topology.c::get_group() relies on the assumption that non-NUMA domains do
    not partially overlap. Zeng Tao pointed out in [1] that such topology
    descriptions, while completely bogus, can end up being exposed to the
    scheduler.

    In his example (8 CPUs, 2-node system), we end up with:
    MC span for CPU3 == 3-7
    MC span for CPU4 == 4-7

    The first pass through get_group(3, sdd@MC) will result in the following
    sched_group list:

    3 -> 4 -> 5 -> 6 -> 7 -> (back to 3)

    And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

    3 -> 4 -> 5 -> 6 -> 7 -> (back to 4, so the loop never returns to 3)

    which will completely break things like 'while (sg != sd->groups)' when
    using CPU3's base sched_domain.

    There already are some architecture-specific checks in place such as
    x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
    in the core scheduler, so it seems worthwhile to do so.

    Warn and abort the construction of the sched domains if such a broken
    topology description is detected. Note that this is somewhat
    expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
    gated under SCHED_DEBUG if deemed necessary.
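
    A rough sketch of such a core-scheduler check (simplified; the real
    helper iterates the topology levels): for every pair of CPUs at a
    non-NUMA level, their spans must be either identical or disjoint.

    static bool topology_span_sane(struct sched_domain_topology_level *tl,
                                   const struct cpumask *cpu_map, int cpu)
    {
            int i;

            for_each_cpu(i, cpu_map) {
                    if (i == cpu)
                            continue;
                    if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
                        cpumask_intersects(tl->mask(cpu), tl->mask(i)))
                            return false;   /* partial overlap: bogus topology */
            }
            return true;
    }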

    Testing
    =======

    Dietmar managed to reproduce this using the following qemu incantation:

    $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
    -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
    cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
    node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

    alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
    needed if '-smp cores=X, sockets=Y' would work with qemu):

    if (cpuid_topo->package_id != cpu_topo->package_id)
            continue;

    + if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
    +         continue;
    +
    cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
    cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]

    rq::uclamp is an array of struct uclamp_rq, make sure we clear the
    whole thing.
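
    In other words (sketched; the upstream patch spells the size out with
    UCLAMP_CNT), the per-rq reset must cover the whole array:

    memset(&cpu_rq(cpu)->uclamp, 0, sizeof(cpu_rq(cpu)->uclamp));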

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Li Guanglei
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Qais Yousef
    Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
    Signed-off-by: Sasha Levin

    Li Guanglei
     
  • [ Upstream commit 894c9ef9780c5cf2f143415e867ee39a33ecb75d ]

    Configuring an instance's parallel mask without any online CPUs...

    echo 2 > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
    echo 0 > /sys/devices/system/cpu/cpu1/online

    ...makes tcrypt mode=215 crash like this:

    divide error: 0000 [#1] SMP PTI
    CPU: 4 PID: 283 Comm: modprobe Not tainted 5.4.0-rc8-padata-doc-v2+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
    RIP: 0010:padata_do_parallel+0x114/0x300
    Call Trace:
    pcrypt_aead_encrypt+0xc0/0xd0 [pcrypt]
    crypto_aead_encrypt+0x1f/0x30
    do_mult_aead_op+0x4e/0xdf [tcrypt]
    test_mb_aead_speed.constprop.0.cold+0x226/0x564 [tcrypt]
    do_test+0x28c2/0x4d49 [tcrypt]
    tcrypt_mod_init+0x55/0x1000 [tcrypt]
    ...

    cpumask_weight() in padata_cpu_hash() returns 0 because the mask has no
    CPUs. The problem is __padata_remove_cpu() checks for valid masks too
    early and so doesn't mark the instance PADATA_INVALID as expected, which
    would have made padata_do_parallel() return error before doing the
    division.

    Fix by introducing a second padata CPU hotplug state before
    CPUHP_BRINGUP_CPU so that __padata_remove_cpu() sees the online mask
    without @cpu. No need for the second argument to padata_replace() since
    @cpu is now already missing from the online mask.

    Fixes: 33e54450683c ("padata: Handle empty padata cpumasks")
    Signed-off-by: Daniel Jordan
    Cc: Eric Biggers
    Cc: Herbert Xu
    Cc: Sebastian Andrzej Siewior
    Cc: Steffen Klassert
    Cc: Thomas Gleixner
    Cc: linux-crypto@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Herbert Xu
    Signed-off-by: Sasha Levin

    Daniel Jordan
     
  • [ Upstream commit bf08949cc8b98b7d1e20cfbba169a5938d42dae8 ]

    While running kprobe module test, find_module_all() caused
    a suspicious RCU usage warning.

    -----
    =============================
    WARNING: suspicious RCU usage
    5.4.0-next-20191202+ #63 Not tainted
    -----------------------------
    kernel/module.c:619 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by rmmod/642:
    #0: ffffffff8227da80 (module_mutex){+.+.}, at: __x64_sys_delete_module+0x9a/0x230

    stack backtrace:
    CPU: 0 PID: 642 Comm: rmmod Not tainted 5.4.0-next-20191202+ #63
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x71/0xa0
    find_module_all+0xc1/0xd0
    __x64_sys_delete_module+0xac/0x230
    ? do_syscall_64+0x12/0x1f0
    do_syscall_64+0x50/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4b6d49
    -----

    This is because list_for_each_entry_rcu(modules) is called
    without rcu_read_lock(). This is safe because the module_mutex
    is locked.

    Pass lockdep_is_held(&module_mutex) to list_for_each_entry_rcu()
    to suppress this warning. This also fixes a similar issue in
    mod_find() and each_symbol_section().
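
    The resulting traversal looks roughly like this (abbreviated):

    list_for_each_entry_rcu(mod, &modules, list,
                            lockdep_is_held(&module_mutex)) {
            /* ... */
    }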

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit b527b638fd63ba791dc90a0a6e9a3035b10df52b ]

    In the process of adding better error messages for sorting, I realized
    that strsep was being used incorrectly and some of the error paths I
    was expecting to be hit weren't and just fell through to the common
    invalid key error case.

    It also became obvious that for keyword assignments, it wasn't
    necessary to save the full assignment and reparse it later, and having
    a common empty-assignment check would also make more sense in terms of
    error processing.

    Change the code to fix these problems and simplify it for new error
    message changes in a subsequent patch.

    Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org

    Fixes: e62347d24534 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
    Fixes: 7ef224d1d0e3 ("tracing: Add 'hist' event trigger command")
    Fixes: a4072fe85ba3 ("tracing: Add a clock attribute for hist triggers")
    Reported-by: Masami Hiramatsu
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Tom Zanussi
     
  • [ Upstream commit dfb6cd1e654315168e36d947471bd2a0ccd834ae ]

    Looking through old emails in my INBOX, I came across a patch from Luis
    Henriques that attempted to fix a race of two stat tracers registering the
    same stat trace (extremely unlikely, as this is done in the kernel, and
    probably doesn't even exist). The submitted patch wasn't quite right as it
    needed to deal with clean up a bit better (if two stat tracers were the
    same, it would have the same files).

    But to make the code cleaner, all we needed to do is to keep the
    all_stat_sessions_mutex held for most of the registering function.

    Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com

    Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
    Reported-by: Luis Henriques
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Steven Rostedt (VMware)
     
  • [ Upstream commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 ]

    tracing_stat_init() was always returning '0', even on the error paths. It
    now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
    to create the 'trace_stat' debugfs directory.

    Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com

    Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
    Signed-off-by: Luis Henriques
    [ Pulled from the archeological digging of my INBOX ]
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Luis Henriques
     
  • [ Upstream commit f6d061d617124abbd55396a3bc37b9bf7d33233c ]

    In module_add_modinfo_attrs() if sysfs_create_file() fails
    on the first iteration of the loop (so i = 0), we forget to
    free the modinfo_attrs.

    Fixes: bc6f2a757d52 ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
    Reviewed-by: Miroslav Benes
    Signed-off-by: YueHaibing
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    YueHaibing
     
  • [ Upstream commit def97da136515cb289a14729292c193e0a93bc64 ]

    Commit f92b070f2dc8 ("printk: Do not miss new messages when replaying
    the log") introduced a new variable @exclusive_console_stop_seq to
    store when an exclusive console should stop printing. It should be
    set to the @console_seq value at registration. However, @console_seq
    is previously set to @syslog_seq so that the exclusive console knows
    where to begin. This results in the exclusive console immediately
    reactivating all the other consoles and thus repeating the messages
    for those consoles.

    Set @console_seq after @exclusive_console_stop_seq has stored the
    current @console_seq value.
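
    Sketched, the ordering change in the console registration path amounts to
    (abbreviated, not the exact upstream hunk):

    /* buggy order: the stop point records the already-rewound value */
    console_seq = syslog_seq;
    exclusive_console_stop_seq = console_seq;

    /* fixed order: record the stop point first, then rewind for replay */
    exclusive_console_stop_seq = console_seq;
    console_seq = syslog_seq;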

    Fixes: f92b070f2dc8 ("printk: Do not miss new messages when replaying the log")
    Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
    Cc: Steven Rostedt
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: John Ogness
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    John Ogness
     
  • [ Upstream commit 45178ac0cea853fe0e405bf11e101bdebea57b15 ]

    Paul reported a very sporadic, rcutorture induced, workqueue failure.
    When the planets align, the workqueue rescuer's self-migrate fails and
    then triggers a WARN for running a work on the wrong CPU.

    Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
    could be ignored! When stopper->enabled is false, stop_machine will
    instantly complete the work without actually doing it. Worse, it
    will not WARN about this (we really should fix this).

    It turns out there is a small window where a freshly online'ed CPU is
    marked 'online' but doesn't yet have the stopper task running:

    BP                                  AP

    bringup_cpu()
      __cpu_up(cpu, idle)      -->      start_secondary()
                                        ...
                                        cpu_startup_entry()
      bringup_wait_for_ap()
        wait_for_ap_thread()            enable = true;

    Close this by moving the stop_machine_unpark() into
    cpuhp_online_idle(), such that the stopper thread is ready before we
    start the idle loop and schedule.

    Reported-by: "Paul E. McKenney"
    Debugged-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: "Paul E. McKenney"
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]

    This fixes a data-race where `atomic_t dynticks` is copied by value. The
    copy is performed non-atomically, resulting in a data-race if `dynticks`
    is updated concurrently.

    This data-race was found with KCSAN:
    ==================================================================
    BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

    write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
    atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
    rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
    dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
    force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
    rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
    rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
    rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
    kthread+0x1b5/0x200 kernel/kthread.c:255

    read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
    rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
    rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
    irq_enter+0x5/0x50 kernel/softirq.c:347

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    Workqueue: kblockd blk_mq_run_work_fn
    ==================================================================

    Signed-off-by: Marco Elver
    Cc: Paul E. McKenney
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Joel Fernandes
    Cc: Ingo Molnar
    Cc: Dmitry Vyukov
    Cc: rcu@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Marco Elver
     
  • [ Upstream commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f ]

    Tasks waiting within exp_funnel_lock() for an expedited grace period to
    elapse can be starved due to the following sequence of events:

    1. Tasks A and B both attempt to start an expedited grace
    period at about the same time. This grace period will have
    completed when the lower four bits of the rcu_state structure's
    ->expedited_sequence field are 0b'0100', for example, when the
    initial value of this counter is zero. Task A wins, and thus
    does the actual work of starting the grace period, including
    acquiring the rcu_state structure's .exp_mutex and setting the
    counter to 0b'0001'.

    2. Because task B lost the race to start the grace period, it
    waits on ->expedited_sequence to reach 0b'0100' inside of
    exp_funnel_lock(). This task therefore blocks on the rcu_node
    structure's ->exp_wq[1] field, keeping in mind that the
    end-of-grace-period value of ->expedited_sequence (0b'0100')
    is shifted down two bits before indexing the ->exp_wq[] field.

    3. Task C attempts to start another expedited grace period,
    but blocks on ->exp_mutex, which is still held by Task A.

    4. The aforementioned expedited grace period completes, so that
    ->expedited_sequence now has the value 0b'0100'. A kworker task
    therefore acquires the rcu_state structure's ->exp_wake_mutex
    and starts awakening any tasks waiting for this grace period.

    5. One of the first tasks awakened happens to be Task A. Task A
    therefore releases the rcu_state structure's ->exp_mutex,
    which allows Task C to start the next expedited grace period,
    which causes the lower four bits of the rcu_state structure's
    ->expedited_sequence field to become 0b'0101'.

    6. Task C's expedited grace period completes, so that the lower four
    bits of the rcu_state structure's ->expedited_sequence field now
    become 0b'1000'.

    7. The kworker task from step 4 above continues its wakeups.
    Unfortunately, the wake_up_all() refetches the rcu_state
    structure's .expedited_sequence field:

    wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

    This results in the wakeup being applied to the rcu_node
    structure's ->exp_wq[2] field, which is unfortunate given that
    Task B is instead waiting on ->exp_wq[1].

    On a busy system, no harm is done (or at least no permanent harm is done).
    Some later expedited grace period will redo the wakeup. But on a quiet
    system, such as many embedded systems, it might be a good long time before
    there was another expedited grace period. On such embedded systems,
    this situation could therefore result in a system hang.

    This issue manifested as DPM device timeout during suspend (which
    usually qualifies as a quiet time) due to a SCSI device being stuck in
    synchronize_rcu_expedited(), with the following stack trace:

    schedule()
    synchronize_rcu_expedited()
    synchronize_rcu()
    scsi_device_quiesce()
    scsi_bus_suspend()
    dpm_run_callback()
    __device_suspend()

    This commit therefore prevents such delays, timeouts, and hangs by
    making rcu_exp_wait_wake() use its "s" argument consistently instead of
    refetching from rcu_state.expedited_sequence.
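
    Concretely, the wakeup indexes the waitqueue with the snapshot "s" that
    was passed in (sketched from the line quoted in step 7 above):

    wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);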

    Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Neeraj Upadhyay
     
  • [ Upstream commit 610dea36d3083a977e4f156206cbe1eaa2a532f0 ]

    Commit 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if
    dump_tree") added print statements to rcu_organize_nocb_kthreads for
    debugging, but incorrectly guarded them, causing the function to always
    spew out its message.

    This patch fixes it by guarding both pr_alert statements with dump_tree,
    while also changing the second pr_alert to a pr_cont, to print the
    hierarchy in a single line (assuming that's how it was supposed to
    work).

    Fixes: 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
    Signed-off-by: Stefan Reiter
    [ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Stefan Reiter
     

20 Feb, 2020

2 commits

  • commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.

    The check to ensure that the newly written value into cpu.uclamp.{min,max}
    is within range, [0:100], wasn't working because of the signed
    comparison:

    if (req.percent > UCLAMP_PERCENT_SCALE) {
            req.ret = -ERANGE;
            return req;
    }

    # echo -1 > cpu.uclamp.min
    # cat cpu.uclamp.min
    42949671.96

    Cast req.percent into u64 to force the comparison to be unsigned and
    work as intended in capacity_from_percent().

    # echo -1 > cpu.uclamp.min
    sh: write error: Numerical result out of range
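
    The cast itself is small (sketched from the snippet quoted above):

    if ((u64)req.percent > UCLAMP_PERCENT_SCALE) {
            req.ret = -ERANGE;
            return req;
    }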

    Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     
  • commit e3728b50cd9be7d4b1469447cdf1feb93e3b7adb upstream.

    It is theoretically possible for the ACPI EC GPE to be set after the
    s2idle_ops->wake() called from s2idle_loop() has returned and before
    the subsequent pm_wakeup_pending() check is carried out. If that
    happens, the resulting wakeup event will cause the system to resume
    even though it may be a spurious one.

    To avoid that race, first make the ->wake() callback in struct
    platform_s2idle_ops return a bool value indicating whether or not
    to let the system resume and rearrange s2idle_loop() to use that
    value instead of the direct pm_wakeup_pending() call if ->wake() is
    present.

    Next, rework acpi_s2idle_wake() to process EC events and check
    pm_wakeup_pending() before re-arming the SCI for system wakeup
    to prevent it from triggering prematurely and add comments to
    that function to explain the rationale for the new code flow.

    Fixes: 56b991849009 ("PM: sleep: Simplify suspend-to-idle control flow")
    Cc: 5.4+ # 5.4+
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

15 Feb, 2020

1 commit

  • commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.

    When a new cgroup is created, the effective uclamp value wasn't updated
    with a call to cpu_util_update_eff(), which looks at the hierarchy and
    updates to the most restrictive values.

    Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
    becomes online.

    Without this change, the newly created cgroup uses the default
    root_task_group uclamp values, which are 1024 for both uclamp_{min,max}.
    This causes the rq to be clamped to max, and hence the system to run at
    max frequency.

    The problem was observed on Ubuntu server and was reproduced on Debian
    and Buildroot rootfs.

    By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
    and add all tasks to it - which creates enough noise to keep the rq
    uclamp value at max most of the time. Imitating this behavior makes the
    problem visible in Buildroot too which otherwise looks fine since it's a
    minimal userspace.

    Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
    Reported-by: Doug Smythies
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Doug Smythies
    Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

11 Feb, 2020

2 commits

  • commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream.

    Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
    a perf ring buffer may lead to an integer underflow in locked memory
    accounting. This may lead to undesired behaviors, such as failures in
    BPF map creation.

    Address this by adjusting the accounting logic to take into account the
    possibility that the amount of already locked memory may exceed the
    current limit.

    Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again")
    Suggested-by: Alexander Shishkin
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc:
    Acked-by: Alexander Shishkin
    Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
    Signed-off-by: Greg Kroah-Hartman

    Song Liu
     
  • commit febac332a819f0e764aa4da62757ba21d18c182b upstream.

    Kernel crashes inside QEMU/KVM are observed:

    kernel BUG at kernel/time/timer.c:1154!
    BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().

    At the same time another cpu got:

    general protection fault: 0000 [#1] SMP PTI on poison pointer 0xdead000000000200 in:

    __hlist_del at include/linux/list.h:681
    (inlined by) detach_timer at kernel/time/timer.c:818
    (inlined by) expire_timers at kernel/time/timer.c:1355
    (inlined by) __run_timers at kernel/time/timer.c:1686
    (inlined by) run_timer_softirq at kernel/time/timer.c:1699

    Unfortunately kernel logs are badly scrambled, stacktraces are lost.

    Printing the timer->function before the BUG_ON() pointed to
    clocksource_watchdog().

    The execution of clocksource_watchdog() can race with a sequence of
    clocksource_stop_watchdog() .. clocksource_start_watchdog():

    expire_timers()
      detach_timer(timer, true);
        timer->entry.pprev = NULL;
      raw_spin_unlock_irq(&base->lock);
      call_timer_fn
        clocksource_watchdog()
                                        clocksource_watchdog_kthread() or
                                        clocksource_unbind()

                                        spin_lock_irqsave(&watchdog_lock, flags);
                                        clocksource_stop_watchdog();
                                          del_timer(&watchdog_timer);
                                          watchdog_running = 0;
                                        spin_unlock_irqrestore(&watchdog_lock, flags);

                                        spin_lock_irqsave(&watchdog_lock, flags);
                                        clocksource_start_watchdog();
                                          add_timer_on(&watchdog_timer, ...);
                                          watchdog_running = 1;
                                        spin_unlock_irqrestore(&watchdog_lock, flags);

        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
          BUG_ON(timer_pending(timer) || !timer->function);
            timer_pending() -> true
            BUG()

    I.e. inside clocksource_watchdog() watchdog_timer could be already armed.

    Check timer_pending() before calling add_timer_on(). This is sufficient as
    all operations are synchronized by watchdog_lock.
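
    A sketch of that guard inside clocksource_watchdog() (abbreviated):

    spin_lock(&watchdog_lock);
    if (!timer_pending(&watchdog_timer)) {
            watchdog_timer.expires += WATCHDOG_INTERVAL;
            add_timer_on(&watchdog_timer, next_cpu);
    }
    spin_unlock(&watchdog_lock);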

    Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov