10 Feb, 2020

6 commits

  • Pull more Kbuild updates from Masahiro Yamada:

    - fix randconfig to generate a sane .config

    - rename hostprogs-y / always to hostprogs / always-y, which is a more
    natural syntax.

    - optimize scripts/kallsyms

    - fix yes2modconfig and mod2yesconfig

    - make multiple directory targets ('make foo/ bar/') work

    * tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: make multiple directory targets work
    kconfig: Invalidate all symbols after changing to y or m.
    kallsyms: fix type of kallsyms_token_table[]
    scripts/kallsyms: change table to store (strcut sym_entry *)
    scripts/kallsyms: rename local variables in read_symbol()
    kbuild: rename hostprogs-y/always to hostprogs/always-y
    kbuild: fix the document to use extra-y for vmlinux.lds
    kconfig: fix broken dependency in randconfig-generated .config

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "A set of fixes for X86:

    - Ensure that the PIT is set up when the local APIC is disabled or
    configured in legacy mode. This is caused by an ordering issue
    introduced in the recent changes which skip PIT initialization when
    the TSC and APIC frequencies are already known.

    - Handle malformed SRAT tables during early ACPI parsing which caused
    an infinite loop and a boot hang.

    - Fix a long standing race in the affinity setting code which affects
    PCI devices with non-maskable MSI interrupts. The problem is caused
    by the non-atomic writes of the MSI address (destination APIC id)
    and data (vector) fields which the device uses to construct the MSI
    message. The non-atomic writes are mandated by PCI.

    If both fields change and the device raises an interrupt after
    writing address and before writing data, then the MSI block
    constructs an inconsistent message which causes interrupts to be
    lost and subsequent malfunction of the device.

    The fix is to redirect the interrupt to the new vector on the
    current CPU first and then switch it over to the new target CPU.
    This allows an interrupt raised in the transitional stage (old CPU,
    new vector) to be observed in the APIC IRR and retriggered on the
    new target CPU and the new vector.

    The potential spurious interrupts caused by this are harmless and
    can in the worst case expose a buggy driver (all handlers have to
    be able to deal with spurious interrupts as they can and do happen
    for various reasons).

    - Add the missing suspend/resume mechanism for the HYPERV hypercall
    page, whose absence prevents resume from hibernation on HYPERV
    guests. This change got lost before the merge window.

    - Mask the IOAPIC before disabling the local APIC to prevent
    potentially stale IOAPIC remote IRR bits which cause stale
    interrupt lines after resume"

    * tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/apic: Mask IOAPIC entries when disabling the local APIC
    x86/hyperv: Suspend/resume the hypercall page for hibernation
    x86/apic/msi: Plug non-maskable MSI affinity race
    x86/boot: Handle malformed SRAT tables during early ACPI parsing
    x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode

    Linus Torvalds
     
  • Pull SMP fixes from Thomas Gleixner:
    "Two fixes for the SMP related functionality:

    - Make the UP version of smp_call_function_single() match SMP
    semantics when called for a not available CPU. Instead of emitting
    a warning and assuming that the function call target is CPU0,
    return a proper error code like the SMP version does.

    - Remove a superfluous check in smp_call_function_many_cond()"

    * tag 'smp-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smp/up: Make smp_call_function_single() match SMP semantics
    smp: Remove superfluous cond_func check in smp_call_function_many_cond()

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of fixes and improvements for the perf subsystem:

    Kernel fixes:

    - Install cgroup events to the correct CPU context to prevent a
    potential list double add

    - Prevent an integer underflow in the perf mlock accounting

    - Add a missing prototype for arch_perf_update_userpage()

    Tooling:

    - Add a missing unlock in the error path of maps__insert() in perf
    maps.

    - Fix the build with the latest libbfd

    - Fix the perf parser so it does not delete parse event terms, which
    caused a regression for using perf with the ARM CoreSight as the
    sink configuration was missing due to the deletion.

    - Fix the double free in the perf CPU map merging test case

    - Add the missing ustring support for the perf probe command"

    * tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf maps: Add missing unlock to maps__insert() error case
    perf probe: Add ustring support for perf probe command
    perf: Make perf able to build with latest libbfd
    perf test: Fix test case Merge cpu map
    perf parse: Copy string to perf_evsel_config_term
    perf parse: Refactor 'struct perf_evsel_config_term'
    kernel/events: Add a missing prototype for arch_perf_update_userpage()
    perf/cgroups: Install cgroup events to correct cpuctx
    perf/core: Fix mlock accounting in perf_mmap()

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "Two small fixes for the time(r) subsystem:

    - Handle a subtle race between the clocksource watchdog and a
    concurrent clocksource watchdog stop/start sequence correctly to
    prevent a timer double add bug.

    - Fix the file path for the core time namespace file"

    * tag 'timers-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: Prevent double add_timer_on() for watchdog_timer
    MAINTAINERS: Correct path to time namespace source file

    Linus Torvalds
     
  • Pull interrupt fixes from Thomas Gleixner:
    "A set of fixes for the interrupt subsystem:

    - Provision only ACPI enabled redistributors on GICv3

    - Use the proper command columns when building the INVALL command for
    the GICv3-ITS

    - Ensure the allocation of the L2 vPE table for GICv4.1

    - Correct the GICv4.1 VPROBASER programming so it uses the proper
    size

    - A set of small GICv4.1 tidy up patches

    - Configuration cleanup for C-SKY interrupt chip

    - Clarify the function documentation for irq_set_wake() to document
    that the wakeup functionality is orthogonal to the irq
    disable/enable mechanism"

    * tag 'irq-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/gic-v3-its: Rename VPENDBASER/VPROPBASER accessors
    irqchip/gic-v3-its: Remove superfluous WARN_ON
    irqchip/gic-v4.1: Drop 'tmp' in inherit_vpe_l1_table_from_rd()
    irqchip/gic-v4.1: Ensure L2 vPE table is allocated at RD level
    irqchip/gic-v4.1: Set vpe_l1_base for all redistributors
    irqchip/gic-v4.1: Fix programming of GICR_VPROPBASER_4_1_SIZE
    genirq: Clarify that irq wake state is orthogonal to enable/disable
    irqchip/gic-v3-its: Reference to its_invall_cmd descriptor when building INVALL
    irqchip: Some Kconfig cleanup for C-SKY
    irqchip/gic-v3: Only provision redistributors that are enabled in ACPI

    Linus Torvalds
     

09 Feb, 2020

2 commits

  • Pull networking fixes from David Miller:

    1) Unbalanced locking in mwifiex_process_country_ie, from Brian Norris.

    2) Fix thermal zone registration in iwlwifi, from Andrei
    Otcheretianski.

    3) Fix double free_irq in sgi ioc3 eth, from Thomas Bogendoerfer.

    4) Use after free in mptcp, from Florian Westphal.

    5) Use after free in wireguard's root_remove_peer_lists, from Eric
    Dumazet.

    6) Properly access packets heads in bonding alb code, from Eric
    Dumazet.

    7) Fix data race in skb_queue_len(), from Qian Cai.

    8) Fix regression in r8169 on some chips, from Heiner Kallweit.

    9) Fix XDP program ref counting in hv_netvsc, from Haiyang Zhang.

    10) Certain kinds of set link netlink operations can cause a NULL deref
    in the ipv6 addrconf code. Fix from Eric Dumazet.

    11) Don't cancel uninitialized work queue in drop monitor, from Ido
    Schimmel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
    net: thunderx: use proper interface type for RGMII
    mt76: mt7615: fix max_nss in mt7615_eeprom_parse_hw_cap
    bpf: Improve bucket_log calculation logic
    selftests/bpf: Test freeing sockmap/sockhash with a socket in it
    bpf, sockhash: Synchronize_rcu before free'ing map
    bpf, sockmap: Don't sleep while holding RCU lock on tear-down
    bpftool: Don't crash on missing xlated program instructions
    bpf, sockmap: Check update requirements after locking
    drop_monitor: Do not cancel uninitialized work item
    mlxsw: spectrum_dpipe: Add missing error path
    mlxsw: core: Add validation of hardware device types for MGPIR register
    mlxsw: spectrum_router: Clear offload indication from IPv6 nexthops on abort
    selftests: mlxsw: Add test cases for local table route replacement
    mlxsw: spectrum_router: Prevent incorrect replacement of local table routes
    net: dsa: microchip: enable module autoprobe
    ipv6/addrconf: fix potential NULL deref in inet6_set_link_af()
    dpaa_eth: support all modes with rate adapting PHYs
    net: stmmac: update pci platform data to use phy_interface
    net: stmmac: xgmac: fix missing IFF_MULTICAST checki in dwxgmac2_set_filter
    net: stmmac: fix missing IFF_MULTICAST check in dwmac4_set_filter
    ...

    Linus Torvalds
     
  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"
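
    The idea of syntax types living in the namespace of functions can be
    sketched in user space as follows. This is a simplified model, not the
    kernel's fs_parse() code: param_spec, parse_param(), param_is_int() and
    param_is_timeout() are hypothetical names, and the real helpers take
    different arguments. The point is only that adding a "15s or 2h" style
    type requires a new function and a table entry, not a change to the
    parser core.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A syntax type is just a function: parse value, store result, 0 on success. */
typedef int (*param_type_fn)(const char *value, long *result);

/* One table entry per mount parameter (hypothetical layout). */
struct param_spec {
    const char   *name;
    param_type_fn type;
    int           opt;   /* non-negative option id handed back to the caller */
};

/* Generic helper that could be shared between filesystems. */
static int param_is_int(const char *value, long *result)
{
    char *end;
    long v = strtol(value, &end, 10);

    if (end == value || *end != '\0')
        return -1;
    *result = v;
    return 0;
}

/* Filesystem-private type: timeouts such as "15s" or "2h", in seconds. */
static int param_is_timeout(const char *value, long *result)
{
    char *end;
    long v = strtol(value, &end, 10);

    /* Require digits followed by exactly one unit character. */
    if (end == value || v < 0 || end[0] == '\0' || end[1] != '\0')
        return -1;

    switch (end[0]) {
    case 's': *result = v;        return 0;
    case 'm': *result = v * 60;   return 0;
    case 'h': *result = v * 3600; return 0;
    default:  return -1;
    }
}

/*
 * The parser core only walks the table and calls the type function.
 * Returns the option id, -1 for a bad value, -2 for an unknown key.
 */
static int parse_param(const struct param_spec *specs, const char *key,
                       const char *value, long *result)
{
    assert(specs && key && value && result);

    for (; specs->name; specs++) {
        if (strcmp(specs->name, key) == 0)
            return specs->type(value, result) ? -1 : specs->opt;
    }
    return -2;
}
```

    A filesystem would list { "timeout", param_is_timeout, ... } in its own
    table; parse_param() itself never needs to learn about the new type.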

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

6 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-02-07

    The following pull-request contains BPF updates for your *net* tree.

    We've added 15 non-merge commits during the last 10 day(s) which contain
    a total of 12 files changed, 114 insertions(+), 31 deletions(-).

    The main changes are:

    1) Various BPF sockmap fixes related to RCU handling in the map's tear-
    down code, from Jakub Sitnicki.

    2) Fix macro state explosion in BPF sk_storage map when calculating its
    bucket_log on allocation, from Martin KaFai Lau.

    3) Fix potential BPF sockmap update race by rechecking socket's established
    state under lock, from Lorenz Bauer.

    4) Fix crash in bpftool on missing xlated instructions when kptr_restrict
    sysctl is set, from Toke Høiland-Jørgensen.

    5) Fix i40e's XSK wakeup code to return proper error in busy state and
    various misc fixes in xdpsock BPF sample code, from Maciej Fijalkowski.

    6) Fix the way modifiers are skipped in BTF in the verifier while walking
    pointers to avoid program rejection, from Alexei Starovoitov.

    7) Fix Makefile for runqslower BPF tool to i) rebuild on libbpf changes and
    ii) to fix undefined reference linker errors for older gcc version due to
    order of passed gcc parameters, from Yulia Kartseva and Song Liu.

    8) Fix a trampoline_count BPF kselftest warning about missing braces around
    initializer, from Andrii Nakryiko.

    9) Fix up redundant "HAVE" prefix from large INSN limit kernel probe in
    bpftool, from Michal Rostecki.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • There's some confusion about whether an irq that's disabled with
    disable_irq() can still wake the system from sleep states such as
    "suspend to RAM".

    Clarify this in the kernel documentation for irq_set_irq_wake() so that
    it's clear that an irq can be disabled and still wake the system if it has
    been marked for wakeup.

    Signed-off-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Douglas Anderson
    Link: https://lkml.kernel.org/r/20200206191521.94559-1-swboyd@chromium.org

    Stephen Boyd
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • The former contains nothing but a pointer to an array of the latter...

    Signed-off-by: Al Viro

    Al Viro
     
  • Unused now.

    Signed-off-by: Eric Sandeen
    Acked-by: David Howells
    Signed-off-by: Al Viro

    Eric Sandeen
     
  • pointless alias for invalf()...

    Signed-off-by: Al Viro

    Al Viro
     

07 Feb, 2020

2 commits

  • In CONFIG_SMP=y kernels, smp_call_function_single() returns -ENXIO when
    invoked for a non-existent CPU. In contrast, in CONFIG_SMP=n kernels,
    a splat is emitted and smp_call_function_single() otherwise silently
    ignores its "cpu" argument, instead pretending that the caller intended
    to have something happen on CPU 0. Given that there is now code that
    expects smp_call_function_single() to return an error if a bad CPU was
    specified, this difference in semantics needs to be addressed.

    Bring the semantics of the CONFIG_SMP=n version of
    smp_call_function_single() into alignment with its CONFIG_SMP=y
    counterpart.
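
    The aligned semantics can be illustrated with a small user-space model.
    This is a sketch, not the kernel source: up_call_function_single() and
    bump() are hypothetical names, and details of the real function such as
    interrupt handling are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Callback type, modelled on a function that takes an opaque argument. */
typedef void (*call_func_t)(void *info);

/*
 * Hypothetical model of the uniprocessor (CONFIG_SMP=n) variant.
 * Only CPU 0 exists, so any other CPU number is rejected with -ENXIO
 * instead of being silently treated as CPU 0.
 */
static int up_call_function_single(int cpu, call_func_t func, void *info)
{
    assert(func != NULL);

    if (cpu != 0)
        return -ENXIO;  /* same error code the SMP version returns */

    func(info);         /* run the callback locally */
    return 0;
}

/* Example callback: increments the counter passed through info. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

    With this model, a call for CPU 0 runs bump() and returns 0, while a call
    for CPU 3 returns -ENXIO and leaves the counter untouched.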

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200205143409.GA7021@paulmck-ThinkPad-P72

    Paul E. McKenney
     
  • Pull kgdb fix from Daniel Thompson:
    "One of the simplifications added for 5.6-rc1 has caused build
    regressions on some platforms (it was reported for sparc64).

    This fixes it with a revert"

    * tag 'kgdb-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
    Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no regs"

    Linus Torvalds
     

06 Feb, 2020

6 commits

  • This reverts commit bbfceba15f8d1260c328a254efc2b3f2deae4904.

    When DBG_MAX_REG_NUM is zero then a number of symbols are conditionally
    defined. It is therefore not possible to check it using C expressions.
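
    A minimal sketch of why a C expression cannot replace the preprocessor
    check. dbg_reg_names and dbg_reg_name() are hypothetical stand-ins for
    whichever symbols are conditionally defined; only DBG_MAX_REG_NUM comes
    from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an architecture header that exposes no debug registers. */
#define DBG_MAX_REG_NUM 0

#if DBG_MAX_REG_NUM > 0
/* Only defined on architectures that actually have registers to show. */
static const char *const dbg_reg_names[DBG_MAX_REG_NUM] = { "r0" };
#endif

/* Returns the name of register i, or NULL if there is no such register. */
static const char *dbg_reg_name(int i)
{
    assert(i >= 0);
#if DBG_MAX_REG_NUM > 0
    /* Compiled only when dbg_reg_names exists. */
    if (i < DBG_MAX_REG_NUM)
        return dbg_reg_names[i];
#else
    /*
     * Writing "if (DBG_MAX_REG_NUM > 0) return dbg_reg_names[i];" here
     * would not compile: the branch is dead code, but the undeclared
     * symbol is still an error.
     */
#endif
    return NULL;
}
```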

    Reported-by: Anatoly Pugachev
    Acked-by: Doug Anderson
    Signed-off-by: Daniel Thompson

    Daniel Thompson
     
  • Pull tracing updates from Steven Rostedt:

    - Added new "bootconfig".

    This looks for a file appended to initrd to add boot config options,
    and has been discussed thoroughly at Linux Plumbers.

    Very useful for adding kprobes at bootup.

    Only enabled if "bootconfig" is on the real kernel command line.

    - Created dynamic event creation.

    Merges common code between creating synthetic events and kprobe
    events.

    - Rename perf "ring_buffer" structure to "perf_buffer"

    - Rename ftrace "ring_buffer" structure to "trace_buffer"

    Had to rename existing "trace_buffer" to "array_buffer"

    - Allow trace_printk() to work within (some) tracing code.

    - Sort the tracing configs to be a little better organized

    - Fixed bug where ftrace_graph hash was not being protected properly

    - Various other small fixes and clean ups

    * tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
    bootconfig: Show the number of nodes on boot message
    tools/bootconfig: Show the number of bootconfig nodes
    bootconfig: Add more parse error messages
    bootconfig: Use bootconfig instead of boot config
    ftrace: Protect ftrace_graph_hash with ftrace_sync
    ftrace: Add comment to why rcu_dereference_sched() is open coded
    tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
    tracing: Annotate ftrace_graph_hash pointer with __rcu
    bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
    tracing: Use seq_buf for building dynevent_cmd string
    tracing: Remove useless code in dynevent_arg_pair_add()
    tracing: Remove check_arg() callbacks from dynevent args
    tracing: Consolidate some synth_event_trace code
    tracing: Fix now invalid var_ref_vals assumption in trace action
    tracing: Change trace_boot to use synth_event interface
    tracing: Move tracing selftests to bottom of menu
    tracing: Move mmio tracer config up with the other tracers
    tracing: Move tracing test module configs together
    tracing: Move all function tracing configs together
    tracing: Documentation for in-kernel synthetic event API
    ...

    Linus Torvalds
     
  • As the function_graph tracer can run when RCU is not "watching", it cannot
    be protected by synchronize_rcu(); it requires running a task on each CPU
    before it can be freed. schedule_on_each_cpu(ftrace_sync) needs to be used
    instead.

    Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72

    Cc: stable@vger.kernel.org
    Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables")
    Reported-by: "Paul E. McKenney"
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Because the function graph tracer can execute in sections where RCU is not
    "watching", the rcu_dereference_sched() for the hash needs to be open coded.
    This is fine because the RCU "flavor" of the ftrace hash is protected by
    its own RCU handling (it does its own little synchronization on every CPU
    and does not rely on RCU sched).

    Acked-by: Joel Fernandes (Google)
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Fix following instances of sparse error
    kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison
    kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison
    kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison
    kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison

    Use rcu_dereference_protected to dereference the newly annotated pointer.

    Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com

    Signed-off-by: Amol Grover
    Signed-off-by: Steven Rostedt (VMware)

    Amol Grover
     
  • Fix following instances of sparse error
    kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison
    kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison
    kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison
    kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison

    Use rcu_dereference_protected to access the __rcu annotated pointer.

    Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com

    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Amol Grover
    Signed-off-by: Steven Rostedt (VMware)

    Amol Grover
     

05 Feb, 2020

3 commits


04 Feb, 2020

3 commits

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line
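
    The conversion rule can be illustrated with stand-in types. This sketch
    assumes the target is a proc-specific operations table; file_ops_model
    and proc_ops_model are hypothetical, cut-down structures defined here so
    the example compiles on its own, and they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic operations table (small subset of fields). */
struct file_ops_model {
    const void *owner;
    long (*llseek)(long pos);
    long (*unlocked_ioctl)(unsigned int cmd);
};

/* Stand-in for the proc-specific table: proc_ prefix, no owner field. */
struct proc_ops_model {
    long (*proc_lseek)(long pos);
    long (*proc_ioctl)(unsigned int cmd);
};

static long demo_lseek(long pos)         { return pos; }
static long demo_ioctl(unsigned int cmd) { return (long)cmd; }

/* Before the conversion. */
static const struct file_ops_model demo_fops = {
    .owner          = NULL,       /* was ".owner = THIS_MODULE" */
    .llseek         = demo_lseek,
    .unlocked_ioctl = demo_ioctl,
};

/* After the conversion: same handlers, renamed slots, owner line deleted. */
static const struct proc_ops_model demo_proc_ops = {
    .proc_lseek = demo_lseek,     /* llseek         => proc_lseek */
    .proc_ioctl = demo_ioctl,     /* unlocked_ioctl => proc_ioctl */
};

/* Dispatch a seek through the converted table. */
static long proc_model_lseek(const struct proc_ops_model *ops, long pos)
{
    assert(ops && ops->proc_lseek);
    return ops->proc_lseek(pos);
}
```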

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Fix the way modifiers are skipped while walking pointers. Otherwise second
    level dereferences of 'const struct foo *' will be rejected by the verifier.
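
    As a rough illustration of what skipping modifiers while walking a
    pointer means, here is a toy type graph. It is not the verifier code:
    type_node, skip_modifiers() and deref_one() are hypothetical names, and
    the actual fix in the verifier may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

enum type_kind { K_INT, K_STRUCT, K_PTR, K_CONST, K_VOLATILE };

/* Toy type description: modifiers and pointers refer to another type. */
struct type_node {
    enum type_kind          kind;
    const struct type_node *next;
};

/* Step over const/volatile wrappers until a real type is reached. */
static const struct type_node *skip_modifiers(const struct type_node *t)
{
    while (t && (t->kind == K_CONST || t->kind == K_VOLATILE))
        t = t->next;
    return t;
}

/*
 * Follow one pointer level. Modifiers are skipped both before looking at
 * the pointer and on the pointee, so 'const struct foo *' resolves to
 * 'struct foo'. Returns NULL if t is not a pointer type.
 */
static const struct type_node *deref_one(const struct type_node *t)
{
    assert(t != NULL);

    t = skip_modifiers(t);
    if (!t || t->kind != K_PTR)
        return NULL;
    return skip_modifiers(t->next);
}
```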

    Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200201000314.261392-1-ast@kernel.org

    Alexei Starovoitov
     
  • Pull kgdb updates from Daniel Thompson:
    "Everything for kgdb this time around is either simplifications or
    clean ups.

    In particular Douglas Anderson's modifications to the backtrace
    machine in the *last* dev cycle have enabled Doug to tidy up some MIPS
    specific backtrace code and stop sharing certain data structures
    across the kernel. Note that the MIPS folks were on Cc: for the MIPS
    patch and reacted positively (but without an explicit Acked-by).

    Doug also got rid of the implicit switching between tasks and register
    sets during some but not all of kdb's backtrace actions (because the
    implicit switching was either confusing for users, pointless or both).

    Finally there is a coverity fix and patch to replace open coded
    console traversal with the proper helper function"

    * tag 'kgdb-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
    kdb: Use for_each_console() helper
    kdb: remove redundant assignment to pointer bp
    kdb: Get rid of confusing diag msg from "rd" if current task has no regs
    kdb: Gid rid of implicit setting of the current task / regs
    kdb: kdb_current_task shouldn't be exported
    kdb: kdb_current_regs should be private
    MIPS: kdb: Remove old workaround for backtracing on other CPUs

    Linus Torvalds
     

02 Feb, 2020

3 commits

  • The dynevent_cmd commands that build up the command string don't need
    to do that themselves - there's a seq_buf facility that does pretty
    much the same thing those commands are doing manually, so use it
    instead.
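
    The benefit of a shared string-building facility can be sketched with a
    small bounded buffer. This is a user-space analogue written for
    illustration only: cmd_buf, cmd_buf_init() and cmd_buf_printf() are
    hypothetical names and do not mirror the kernel's seq_buf interface or
    its exact overflow behaviour.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Bounded string builder that remembers whether an append did not fit. */
struct cmd_buf {
    char  *data;
    size_t size;        /* capacity, including the terminating NUL */
    size_t len;         /* bytes used so far, excluding the NUL */
    bool   overflowed;
};

static void cmd_buf_init(struct cmd_buf *b, char *storage, size_t size)
{
    assert(storage != NULL && size > 0);
    b->data = storage;
    b->size = size;
    b->len = 0;
    b->overflowed = false;
    storage[0] = '\0';
}

/* Append formatted text; returns 0 on success, -1 once the buffer is full. */
static int cmd_buf_printf(struct cmd_buf *b, const char *fmt, ...)
{
    size_t room = b->size - b->len;   /* always >= 1: space for the NUL */
    va_list ap;
    int n;

    if (b->overflowed)
        return -1;

    va_start(ap, fmt);
    n = vsnprintf(b->data + b->len, room, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t)n >= room) {
        b->overflowed = true;
        b->data[b->len] = '\0';       /* drop the partial append */
        return -1;
    }
    b->len += (size_t)n;
    return 0;
}
```

    Callers then chain appends and check for overflow once, instead of each
    command tracking remaining space by hand.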

    Link: http://lkml.kernel.org/r/eb8a6e835c964d0ab8a38cbf5ffa60746b54a465.1580506712.git.zanussi@kernel.org

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     
  • The final addition to q is unnecessary, since q isn't ever used
    afterwards.

    Link: http://lkml.kernel.org/r/7880a1268217886cdba7035526650195668da856.1580506712.git.zanussi@kernel.org

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     
  • It's kind of strange to have check_arg() callbacks as part of the arg
    objects themselves; it makes more sense to just pass these in when the
    args are added instead.

    Remove the check_arg() callbacks from those objects which also means
    removing the check_arg() args from the init functions, adding them to
    the add functions and fixing up existing callers.
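
    The shape of that change can be sketched as follows. The names dyn_arg,
    dyn_cmd, dyn_arg_init(), dyn_cmd_add_arg() and check_nonempty() are
    hypothetical and only model the idea: the arg object carries data, and
    the checker is handed to the add function by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Validation callback: returns 0 when the argument is acceptable. */
typedef int (*check_arg_fn)(const char *arg);

/* After the change the arg object holds data only, no callback. */
struct dyn_arg {
    const char *str;
};

struct dyn_cmd {
    const char *args[8];
    size_t      n_args;
};

/* The init function no longer takes a check_arg parameter. */
static void dyn_arg_init(struct dyn_arg *arg, const char *str)
{
    arg->str = str;
}

/* The checker is supplied at add time; NULL means no validation. */
static int dyn_cmd_add_arg(struct dyn_cmd *cmd, const struct dyn_arg *arg,
                           check_arg_fn check_arg)
{
    assert(cmd && arg);

    if (check_arg && check_arg(arg->str))
        return -1;
    if (cmd->n_args == sizeof(cmd->args) / sizeof(cmd->args[0]))
        return -1;                      /* command is full */

    cmd->args[cmd->n_args++] = arg->str;
    return 0;
}

/* Example checker: reject NULL or empty strings. */
static int check_nonempty(const char *arg)
{
    return (arg && arg[0] != '\0') ? 0 : -1;
}
```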

    Link: http://lkml.kernel.org/r/c7708d6f177fcbe1a36b6e4e8e150907df0fa5d2.1580506712.git.zanussi@kernel.org

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     

01 Feb, 2020

9 commits

  • Kernel crashes inside QEMU/KVM are observed:

    kernel BUG at kernel/time/timer.c:1154!
    BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().

    At the same time another cpu got:

    general protection fault: 0000 [#1] SMP PTI of poison pointer 0xdead000000000200 in:

    __hlist_del at include/linux/list.h:681
    (inlined by) detach_timer at kernel/time/timer.c:818
    (inlined by) expire_timers at kernel/time/timer.c:1355
    (inlined by) __run_timers at kernel/time/timer.c:1686
    (inlined by) run_timer_softirq at kernel/time/timer.c:1699

    Unfortunately kernel logs are badly scrambled, stacktraces are lost.

    Printing the timer->function before the BUG_ON() pointed to
    clocksource_watchdog().

    The execution of clocksource_watchdog() can race with a sequence of
    clocksource_stop_watchdog() .. clocksource_start_watchdog():

    expire_timers()
     detach_timer(timer, true);
      timer->entry.pprev = NULL;
     raw_spin_unlock_irq(&base->lock);
     call_timer_fn
      clocksource_watchdog()

                            clocksource_watchdog_kthread() or
                            clocksource_unbind()

                            spin_lock_irqsave(&watchdog_lock, flags);
                            clocksource_stop_watchdog();
                             del_timer(&watchdog_timer);
                             watchdog_running = 0;
                            spin_unlock_irqrestore(&watchdog_lock, flags);

                            spin_lock_irqsave(&watchdog_lock, flags);
                            clocksource_start_watchdog();
                             add_timer_on(&watchdog_timer, ...);
                             watchdog_running = 1;
                            spin_unlock_irqrestore(&watchdog_lock, flags);

       spin_lock(&watchdog_lock);
       add_timer_on(&watchdog_timer, ...);
        BUG_ON(timer_pending(timer) || !timer->function);
         timer_pending() -> true
         BUG()

    I.e. inside clocksource_watchdog() watchdog_timer could be already armed.

    Check timer_pending() before calling add_timer_on(). This is sufficient as
    all operations are synchronized by watchdog_lock.
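
    The guard can be modelled with a toy timer. This is a simplified sketch,
    not the clocksource code: toy_timer, toy_add_timer() and
    toy_watchdog_rearm() are hypothetical, and the caller is assumed to hold
    the lock that serializes all start/stop operations.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy timer: only tracks whether it is currently armed. */
struct toy_timer {
    bool pending;
    int  arm_count;   /* how many times it was actually armed */
};

static bool toy_timer_pending(const struct toy_timer *t)
{
    return t->pending;
}

/* Arming an already pending timer is the bug; report it instead of BUG(). */
static bool toy_add_timer(struct toy_timer *t)
{
    assert(t);
    if (t->pending)
        return false;     /* this is where the kernel's BUG_ON() fired */
    t->pending = true;
    t->arm_count++;
    return true;
}

/* Re-arm at the end of the watchdog callback, with the lock held. */
static bool toy_watchdog_rearm(struct toy_timer *t)
{
    if (toy_timer_pending(t))
        return true;      /* a concurrent stop/start already re-armed it */
    return toy_add_timer(t);
}
```

    Calling toy_watchdog_rearm() twice arms the timer only once, whereas an
    unguarded second toy_add_timer() reports the double add.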

    Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz

    Konstantin Khlebnikov
     
  • Evan tracked down a subtle race between the update of the MSI message and
    the device raising an interrupt internally on PCI devices which do not
    support MSI masking. The update of the MSI message is non-atomic and
    consists of either 2 or 3 sequential 32bit wide writes to the PCI config
    space.

    - Write address low 32bits
    - Write address high 32bits (If supported by device)
    - Write data

    When an interrupt is migrated then both address and data might change, so
    the kernel attempts to mask the MSI interrupt first. But for MSI, masking
    is optional, so there exist devices which do not provide it. That means
    that if the device raises an interrupt internally between the writes then
    an MSI message is sent built from half updated state.

    On x86 this can lead to spurious interrupts on the wrong interrupt
    vector when the affinity setting changes both address and data. As a
    consequence the device interrupt can be lost causing the device to
    become stuck or malfunctioning.

    Evan tried to handle that by disabling MSI across an MSI message
    update. That's not feasible because disabling MSI has issues on its own:

    If MSI is disabled the PCI device is routing an interrupt to the legacy
    INTx mechanism. The INTx delivery can be disabled, but the disablement is
    not working on all devices.

    Some devices lose interrupts when both MSI and INTx delivery are disabled.

    Another way to solve this would be to enforce the allocation of the same
    vector on all CPUs in the system for this kind of screwed devices. That
    could be done, but it would bring back the vector space exhaustion problems
    which got solved a few years ago.

    Fortunately the high address (if supported by the device) is only relevant
    when X2APIC is enabled which implies interrupt remapping. In the interrupt
    remapping case the affinity setting is happening at the interrupt remapping
    unit and the PCI MSI message is programmed only once when the PCI device is
    initialized.

    That makes it possible to solve it with a two step update:

    1) Target the MSI msg to the new vector on the current target CPU

    2) Target the MSI msg to the new vector on the new target CPU

    In both cases writing the MSI message is only changing a single 32bit word
    which prevents the issue of inconsistency.
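
    The two step update can be modelled in a few lines. This is a sketch
    under simplifying assumptions: msi_msg_model reduces the message to one
    address word standing in for the destination APIC id and one data word
    standing in for the vector, all names are hypothetical, and the IRR
    check and retrigger described in the commit are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified MSI message: real hardware packs more bits into each word. */
struct msi_msg_model {
    uint32_t address_lo;  /* stands in for the destination APIC id */
    uint32_t data;        /* stands in for the interrupt vector */
};

/* Number of 32-bit words that differ between two messages. */
static int words_changed(struct msi_msg_model a, struct msi_msg_model b)
{
    return (a.address_lo != b.address_lo) + (a.data != b.data);
}

/* Step 1: new vector, still targeting the current CPU. */
static struct msi_msg_model msi_step1(struct msi_msg_model cur,
                                      uint32_t new_vector)
{
    assert(new_vector <= 0xff);   /* x86 vectors are 8 bits wide */
    cur.data = new_vector;
    return cur;
}

/* Step 2: keep the new vector, switch to the new target CPU. */
static struct msi_msg_model msi_step2(struct msi_msg_model mid,
                                      uint32_t new_dest)
{
    mid.address_lo = new_dest;
    return mid;
}
```

    Each step changes exactly one word, so the device can never observe a
    combination of new destination and old vector.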

    After writing the final destination it is necessary to check whether the
    device issued an interrupt while the intermediate state #1 (new vector,
    current CPU) was in effect.

    This is possible because the affinity change is always happening on the
    current target CPU. The code runs with interrupts disabled, so the
    interrupt can be detected by checking the IRR of the local APIC. If the
    vector is pending in the IRR then the interrupt is retriggered on the new
    target CPU by sending an IPI for the associated vector on the target CPU.

    This can cause spurious interrupts on both the local and the new target
    CPU.

    1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is suppressed once.

    2) If the new vector is in use already on the local CPU then the IRR check
    might see a pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt.

    3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

    These spurious interrupts are harmless and can in the worst case
    expose issues in device driver interrupt handlers which are not prepared to
    handle a spurious interrupt correctly. This is not a regression, it's just
    exposing something which was already broken as spurious interrupts can
    happen for a lot of reasons and all driver handlers need to be able to deal
    with them.

    Reported-by: Evan Green
    Debugged-by: Evan Green
    Signed-off-by: Thomas Gleixner
    Tested-by: Evan Green
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de

    Thomas Gleixner
     
  • The synth_event trace code contains some almost identical functions
    and some small functions that are called only once - consolidate the
    common code into single functions and fold in the small functions to
    simplify the code overall.

    Link: http://lkml.kernel.org/r/d1c8d8ad124a653b7543afe801d38c199ca5c20e.1580506712.git.zanussi@kernel.org

    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     
  • Pull updates from Andrew Morton:
    "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
    ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

    MM is fairly quiet this time. Holidays, I assume"

    * emailed patches from Andrew Morton : (118 commits)
    kcov: ignore fault-inject and stacktrace
    include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
    execve: warn if process starts with executable stack
    reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
    init/main.c: fix misleading "This architecture does not have kernel memory protection" message
    init/main.c: fix quoted value handling in unknown_bootoption
    init/main.c: remove unnecessary repair_env_string in do_initcall_level
    init/main.c: log arguments and environment passed to init
    fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
    fs/binfmt_elf.c: coredump: delete duplicated overflow check
    fs/binfmt_elf.c: coredump: allocate core ELF header on stack
    fs/binfmt_elf.c: make BAD_ADDR() unlikely
    fs/binfmt_elf.c: better codegen around current->mm
    fs/binfmt_elf.c: don't copy ELF header around
    fs/binfmt_elf.c: fix ->start_code calculation
    fs/binfmt_elf.c: smaller code generation around auxv vector fill
    lib/find_bit.c: uninline helper _find_next_bit()
    lib/find_bit.c: join _find_next_bit{_le}
    uapi: rename ext2_swab() to swab() and share globally in swab.h
    lib/scatterlist.c: adjust indentation in __sg_alloc_table
    ...

    Linus Torvalds
     
  • Pull module updates from Jessica Yu:
    "Summary of modules changes for the 5.6 merge window:

    - Add "MS" (SHF_MERGE|SHF_STRINGS) section flags to __ksymtab_strings
    to indicate to the linker that it can perform string deduplication
    (i.e., duplicate strings are reduced to a single copy in the string
    table). This means any repeated namespace string would be merged to
    just one entry in __ksymtab_strings.

    - Various code cleanups and small fixes (fix small memleak in error
    path, improve moduleparam docs, silence rcu warnings, improve error
    logging)"

    * tag 'modules-for-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    module.h: Annotate mod_kallsyms with __rcu
    module: avoid setting info->name early in case we can fall back to info->mod->name
    modsign: print module name along with error message
    kernel/module: Fix memleak in module_add_modinfo_attrs()
    export.h: reduce __ksymtab_strings string duplication by using "MS" section flags
    moduleparam: fix kerneldoc
    modules: lockdep: Suppress suspicious RCU usage warning

    Linus Torvalds
     
  • Don't instrument 3 more files that contain debugging facilities and
    produce large amounts of uninteresting coverage for every syscall.

    The following snippets are sprinkled all over the place in kcov traces
    in a debugging kernel. We already try to disable instrumentation of
    stack unwinding code and of most debug facilities. I guess we did not
    use fault-inject.c at the time, and stacktrace.c was somehow missed (or
    something has changed in kernel/configs). This change both speeds up
    kcov (kernel doesn't need to store these PCs, user-space doesn't need to
    process them) and frees trace buffer capacity for more useful coverage.

    should_fail
    lib/fault-inject.c:149
    fail_dump
    lib/fault-inject.c:45

    stack_trace_save
    kernel/stacktrace.c:124
    stack_trace_consume_entry
    kernel/stacktrace.c:86
    stack_trace_consume_entry
    kernel/stacktrace.c:89
    ... a hundred frames skipped ...
    stack_trace_consume_entry
    kernel/stacktrace.c:93
    stack_trace_consume_entry
    kernel/stacktrace.c:86

    Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • The patch 'tracing: Fix histogram code when expression has same var as
    value' added code to return an existing variable reference when
    creating a new variable reference, which resulted in var_ref_vals
    slots being reused instead of being duplicated.

    The implementation of the trace action assumes that the end of the
    var_ref_vals array starting at action_data.var_ref_idx corresponds to
    the values that will be assigned to the trace params. The patch
    mentioned above invalidates that assumption, which means that each
    param needs to explicitly specify its index into var_ref_vals.

    This fix changes action_data.var_ref_idx to an array of var ref
    indexes to account for that.
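
    The difference between the two schemes can be sketched in plain C. This
    is a simplified model, not the tracing code itself: the function names
    and signatures are hypothetical, and only the names var_ref_vals and
    var_ref_idx are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Old assumption: the params are the contiguous run of var_ref_vals
 * starting at a single index.
 */
void demo_collect_contiguous(const uint64_t *var_ref_vals, size_t start_idx,
                             size_t n_params, uint64_t *out)
{
    assert(var_ref_vals != NULL && out != NULL);
    for (size_t i = 0; i < n_params; i++)
        out[i] = var_ref_vals[start_idx + i];
}

/*
 * After the fix: each param carries its own index, so a reused slot can
 * be referenced more than once and in any order.
 */
void demo_collect_indexed(const uint64_t *var_ref_vals,
                          const size_t *var_ref_idx, size_t n_params,
                          uint64_t *out)
{
    assert(var_ref_vals != NULL && var_ref_idx != NULL && out != NULL);
    for (size_t i = 0; i < n_params; i++)
        out[i] = var_ref_vals[var_ref_idx[i]];
}
```

    With two slots holding { 10, 20 } and three params that refer to slots
    1, 0 and 1, only the indexed variant can produce { 20, 10, 20 }; a
    contiguous run has no way to name the same slot twice.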

    Link: https://lore.kernel.org/r/1580335695.6220.8.camel@kernel.org

    Fixes: 8bcebc77e85f ("tracing: Fix histogram code when expression has same var as value")
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     
  • Have trace_boot_add_synth_event() use the synth_event interface.

    Also, rename synth_event_run_cmd() to synth_event_run_command() now
    that trace_boot's version is gone.

    Link: http://lkml.kernel.org/r/94f1fa0e31846d0bddca916b8663404b20559e34.1580323897.git.zanussi@kernel.org

    Acked-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi
     
  • Replace the open-coded singly linked list iteration loop with the
    for_each_console() helper.
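
    For readers unfamiliar with the pattern, the following userspace sketch
    shows the kind of substitution involved. struct demo_console,
    demo_console_drivers and demo_for_each_console() are hypothetical
    stand-ins modelled on the kernel helper, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a console entry; only the link matters here. */
struct demo_console {
    const char *name;
    struct demo_console *next;
};

/* Head of the singly linked list of registered consoles. */
struct demo_console *demo_console_drivers;

/* Userspace analog of the iteration helper. */
#define demo_for_each_console(con) \
    for ((con) = demo_console_drivers; (con) != NULL; (con) = (con)->next)

/* Push a console onto the head of the list. */
void demo_register_console(struct demo_console *con)
{
    assert(con != NULL);
    con->next = demo_console_drivers;
    demo_console_drivers = con;
}

/* Open-coded traversal, the form being replaced. */
size_t demo_count_open_coded(void)
{
    struct demo_console *con;
    size_t n = 0;

    for (con = demo_console_drivers; con != NULL; con = con->next)
        n++;
    return n;
}

/* The same traversal expressed through the helper. */
size_t demo_count_with_helper(void)
{
    struct demo_console *con;
    size_t n = 0;

    demo_for_each_console(con)
        n++;
    return n;
}
```

    Both functions visit the same entries; the helper keeps the traversal
    details in one place instead of repeating them at every call site.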

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Daniel Thompson

    Andy Shevchenko