02 Dec, 2018

1 commit

  • Pull STIBP fallout fixes from Thomas Gleixner:
    "The performance destruction department finally got it's act together
    and came up with a cure for the STIPB regression:

    - Provide a command line option to control the spectre v2 user space
    mitigations. Default is either seccomp or prctl (if seccomp is
    disabled in Kconfig). prctl allows mitigation opt-in (sketched after
    this quote), seccomp enables the mitigation for sandboxed processes.

    - Rework the code to handle the conditional STIBP/IBPB control and
    remove the now unused ptrace_may_access_sched() optimization
    attempt

    - Disable STIBP automatically when SMT is disabled

    - Optimize the switch_to() logic to avoid MSR writes and invocations
    of __switch_to_xtra().

    - Make the asynchronous speculation TIF updates synchronous to
    prevent stale mitigation state.

    As a general cleanup this also makes retpoline directly depend on
    compiler support and removes the 'minimal retpoline' option which just
    pretended to provide some form of security while providing none"
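
    As a rough, hedged illustration of the prctl opt-in: the constants below
    are the uapi values added by this series, defined as a fallback in case
    older headers lack them.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_SPECULATION_CTRL
    #define PR_GET_SPECULATION_CTRL 52
    #define PR_SET_SPECULATION_CTRL 53
    #endif
    #ifndef PR_SPEC_INDIRECT_BRANCH
    #define PR_SPEC_INDIRECT_BRANCH 1       /* added by this series */
    #endif
    #ifndef PR_SPEC_DISABLE
    #define PR_SPEC_DISABLE (1UL << 2)
    #endif

    int main(void)
    {
        /* Opt this task in to the indirect branch speculation
         * mitigation (enables STIBP/IBPB for it in the conditional
         * prctl/seccomp modes). */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                  PR_SPEC_DISABLE, 0, 0))
            perror("PR_SET_SPECULATION_CTRL");

        /* Query the resulting state. */
        printf("ib spec state: %d\n",
               prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                     0, 0, 0));
        return 0;
    }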

    * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
    x86/speculation: Provide IBPB always command line options
    x86/speculation: Add seccomp Spectre v2 user space protection mode
    x86/speculation: Enable prctl mode for spectre_v2_user
    x86/speculation: Add prctl() control for indirect branch speculation
    x86/speculation: Prepare arch_smt_update() for PRCTL mode
    x86/speculation: Prevent stale SPEC_CTRL msr content
    x86/speculation: Split out TIF update
    ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS
    x86/speculation: Prepare for conditional IBPB in switch_mm()
    x86/speculation: Avoid __switch_to_xtra() calls
    x86/process: Consolidate and simplify switch_to_xtra() code
    x86/speculation: Prepare for per task indirect branch speculation control
    x86/speculation: Add command line control for indirect branch speculation
    x86/speculation: Unify conditional spectre v2 print functions
    x86/speculataion: Mark command line parser data __initdata
    x86/speculation: Mark string arrays const correctly
    x86/speculation: Reorder the spec_v2 code
    x86/l1tf: Show actual SMT state
    x86/speculation: Rework SMT state change
    sched/smt: Expose sched_smt_present static key
    ...

    Linus Torvalds
     

01 Dec, 2018

8 commits

  • Merge misc fixes from Andrew Morton:
    "31 fixes"

    * emailed patches from Andrew Morton: (31 commits)
    ocfs2: fix potential use after free
    mm/khugepaged: fix the xas_create_range() error path
    mm/khugepaged: collapse_shmem() do not crash on Compound
    mm/khugepaged: collapse_shmem() without freezing new_page
    mm/khugepaged: minor reorderings in collapse_shmem()
    mm/khugepaged: collapse_shmem() remember to clear holes
    mm/khugepaged: fix crashes due to misaccounted holes
    mm/khugepaged: collapse_shmem() stop if punched or truncated
    mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
    mm/huge_memory: splitting set mapping+index before unfreeze
    mm/huge_memory: rename freeze_page() to unmap_page()
    initramfs: clean old path before creating a hardlink
    kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
    psi: make disabling/enabling easier for vendor kernels
    proc: fixup map_files test on arm
    debugobjects: avoid recursive calls with kmemleak
    userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
    userfaultfd: shmem: add i_size checks
    userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
    userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
    ...

    Linus Torvalds
     
  • Pull stackleak plugin fix from Kees Cook:
    "Fix crash by not allowing kprobing of stackleak_erase() (Alexander
    Popov)"

    * tag 'gcc-plugins-v4.20-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    stackleak: Disable function tracing and kprobes for stackleak_erase()

    Linus Torvalds
     
  • Since __sanitizer_cov_trace_pc() is marked as notrace, function calls in
    __sanitizer_cov_trace_pc() shouldn't be traced either.
    ftrace_graph_caller() gets called for each function that isn't marked
    'notrace', like canonicalize_ip(). This is the call trace from a run:

    [ 139.644550] ftrace_graph_caller+0x1c/0x24
    [ 139.648352] canonicalize_ip+0x18/0x28
    [ 139.652313] __sanitizer_cov_trace_pc+0x14/0x58
    [ 139.656184] sched_clock+0x34/0x1e8
    [ 139.659759] trace_clock_local+0x40/0x88
    [ 139.663722] ftrace_push_return_trace+0x8c/0x1f0
    [ 139.667767] prepare_ftrace_return+0xa8/0x100
    [ 139.671709] ftrace_graph_caller+0x1c/0x24

    Rework so that check_kcov_mode() and canonicalize_ip(), which are called
    from __sanitizer_cov_trace_pc(), are also marked as notrace.
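
    A minimal sketch of the shape of the fix ('notrace' expands to
    __attribute__((no_instrument_function)) and keeps ftrace from hooking
    the function; the bodies follow the kcov code only loosely):

    /* Everything reachable from __sanitizer_cov_trace_pc() must itself
     * be notrace, or ftrace_graph_caller() recurses back into kcov. */
    static notrace unsigned long canonicalize_ip(unsigned long ip)
    {
    #ifdef CONFIG_RANDOMIZE_BASE
        ip -= kaslr_offset();       /* report KASLR-independent PCs */
    #endif
        return ip;
    }

    static notrace bool check_kcov_mode(enum kcov_mode needed_mode,
                                        struct task_struct *t)
    {
        /* ... mode/task checks, likewise exempt from tracing ... */
        return READ_ONCE(t->kcov_mode) == needed_mode;
    }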

    Link: http://lkml.kernel.org/r/20181128081239.18317-1-anders.roxell@linaro.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Anders Roxell
    Co-developed-by: Arnd Bergmann
    Acked-by: Steven Rostedt (VMware)
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anders Roxell
     
  • Mel Gorman reports a hackbench regression with psi that would prohibit
    shipping the suse kernel with it default-enabled, but he'd still like
    users to be able to opt in at little to no cost to others.

    With the current combination of CONFIG_PSI and the psi_disabled bool set
    from the commandline, this is a challenge. Do the following things to
    make it easier:

    1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
    to enable CONFIG_PSI in their kernel but leave the feature disabled
    unless a user requests it at boot-time.

    To avoid double negatives, rename psi_disabled= to psi=.

    2. Make psi_disabled a static branch to eliminate any branch costs
    when the feature is disabled.
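
    A sketch of what the static-branch form looks like (the key is flipped
    at most once during early boot based on CONFIG_PSI_DEFAULT_DISABLED
    and the psi= parameter, so the disabled case costs only a patched-out
    NOP in the scheduler hot paths; this follows the psi code loosely):

    static bool psi_enable = !IS_ENABLED(CONFIG_PSI_DEFAULT_DISABLED);

    DEFINE_STATIC_KEY_FALSE(psi_disabled);

    void __init psi_init(void)
    {
        if (!psi_enable) {
            static_branch_enable(&psi_disabled);
            return;
        }
        /* ... normal psi setup ... */
    }

    void psi_task_change(struct task_struct *task, int clear, int set)
    {
        if (static_branch_likely(&psi_disabled))
            return;
        /* ... accounting work runs only when psi is enabled ... */
    }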

    In terms of numbers before and after this patch, Mel says:

    : The following is a comparison using CONFIG_PSI=n as a baseline against
    : your patch and a vanilla kernel
    :
    :                            4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
    :                   kconfigdisable-v1r1                vanilla        psidisable-v1r1
    : Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
    : Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
    : Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
    : Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
    : Amean    12      16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
    : Amean    18      26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
    : Amean    24      34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
    : Amean    30      40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
    : Amean    32      40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
    :
    : It's showing that the vanilla kernel takes a hit (as the bisection
    : indicated it would) and that disabling PSI by default is reasonably
    : close in terms of performance for this particular workload on this
    : particular machine so;

    Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Tested-by: Mel Gorman
    Reported-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull perf fixes from Ingo Molnar:
    "Misc fixes:

    - counter freezing related regression fix

    - uprobes race fix

    - Intel PMU unusual event combination fix

    - ... and diverse tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    uprobes: Fix handle_swbp() vs. unregister() + register() race once more
    perf/x86/intel: Disallow precise_ip on BTS events
    perf/x86/intel: Add generic branch tracing check to intel_pmu_has_bts()
    perf/x86/intel: Move branch tracing setup to the Intel-specific source file
    perf/x86/intel: Fix regression by default disabling perfmon v4 interrupt handling
    perf tools beauty ioctl: Support new ISO7816 commands
    tools uapi asm-generic: Synchronize ioctls.h
    tools arch x86: Update tools's copy of cpufeatures.h
    tools headers uapi: Synchronize i915_drm.h
    perf tools: Restore proper cwd on return from mnt namespace
    tools build feature: Check if get_current_dir_name() is available
    perf tools: Fix crash on synthesizing the unit

    Linus Torvalds
     
  • Pull more tracing fixes from Steven Rostedt:
    "Two more fixes:

    - Change idx variable in DO_TRACE macro to __idx to avoid name
    conflicts. A kvm event had "idx" as a parameter and it confused the
    macro (illustrated in the sketch after this quote).

    - Fix a race where interrupts would be traced when set_graph_function
    was set. The previous patch set increased a race window that tricked
    the function graph tracer into thinking it should trace interrupts
    when it really should not have.

    The bug has been there before, but was seldom hit. Only the last
    patch series made it more common"
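
    The idx clash is ordinary C macro hygiene; a hedged, simplified
    illustration (do_callback and sp are made-up names, not the real
    DO_TRACE internals):

    /* Broken: the macro-local 'idx' shadows any 'idx' the caller passes
     * in through args, so the callback reads the SRCU index instead of
     * the caller's variable. */
    #define DO_TRACE_BROKEN(args...)                    \
        do {                                            \
            int idx = srcu_read_lock(&sp);              \
            do_callback(args);                          \
            srcu_read_unlock(&sp, idx);                 \
        } while (0)

    /* Fixed: a reserved-style name callers will not collide with. */
    #define DO_TRACE_FIXED(args...)                     \
        do {                                            \
            int __idx = srcu_read_lock(&sp);            \
            do_callback(args);                          \
            srcu_read_unlock(&sp, __idx);               \
        } while (0)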

    * tag 'trace-v4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/fgraph: Fix set_graph_function from showing interrupts
    tracepoint: Use __idx instead of idx in DO_TRACE macro to make it unique

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:
    "While rewriting the function graph tracer, I discovered a design flaw
    that was introduced by a patch that tried to fix one bug, but by doing
    so created another bug.

    As both bugs corrupt the output (but they do not crash the kernel), I
    decided to fix the design such that it could have both bugs fixed. The
    original fix, fixed time reporting of the function graph tracer when
    doing a max_depth of one. This was code that can test how much the
    kernel interferes with userspace. But in doing so, it could corrupt
    the time keeping of the function profiler.

    The issue is that the curr_ret_stack variable was being used for two
    different purposes. One was to keep track of the stack pointer on the
    ret_stack (the shadow stack used by the function graph tracer), and
    the other was the graph call depth. Although the two may be closely
    related, where they got updated was the issue that led to the two
    different bugs, which required the two use cases to be updated
    differently.

    The big issue with this fix is that it requires changing each
    architecture. The good news is, I was able to remove a lot of code
    that was duplicated within the architectures and place it into a
    single location. Then I could make the fix in one place.

    I pushed this code into linux-next to let it settle over a week, and
    before doing so, I cross compiled all the affected architectures to
    make sure that they built fine.

    In the meantime, I also pulled in a patch that fixes the sched_switch
    previous task's state output, which was not actually correct"

    * tag 'trace-v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    sched, trace: Fix prev_state output in sched_switch tracepoint
    function_graph: Have profiler use curr_ret_stack and not depth
    function_graph: Reverse the order of pushing the ret_stack and the callback
    function_graph: Move return callback before update of curr_ret_stack
    function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
    function_graph: Make ftrace_push_return_trace() static
    sparc/function_graph: Simplify with function_graph_enter()
    sh/function_graph: Simplify with function_graph_enter()
    s390/function_graph: Simplify with function_graph_enter()
    riscv/function_graph: Simplify with function_graph_enter()
    powerpc/function_graph: Simplify with function_graph_enter()
    parisc: function_graph: Simplify with function_graph_enter()
    nds32: function_graph: Simplify with function_graph_enter()
    MIPS: function_graph: Simplify with function_graph_enter()
    microblaze: function_graph: Simplify with function_graph_enter()
    arm64: function_graph: Simplify with function_graph_enter()
    ARM: function_graph: Simplify with function_graph_enter()
    x86/function_graph: Simplify with function_graph_enter()
    function_graph: Create function_graph_enter() to consolidate architecture code

    Linus Torvalds
     
  • The stackleak_erase() function is called on the trampoline stack at the
    end of syscall. This stack is not big enough for ftrace and kprobes
    operations, e.g. it can be exhausted if we use kprobe_events for
    stackleak_erase().

    So let's disable function tracing and kprobes of stackleak_erase().
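
    The shape of the fix, roughly (notrace removes the ftrace hook, and
    NOKPROBE_SYMBOL() blacklists the function for kprobes):

    #include <linux/kprobes.h>

    asmlinkage void notrace stackleak_erase(void)
    {
        /* ... erase the used portion of the thread stack; this runs on
         * the small trampoline stack, so no instrumentation allowed ... */
    }
    NOKPROBE_SYMBOL(stackleak_erase);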

    Reported-by: kernel test robot
    Fixes: 10e9ae9fabaf ("gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack")
    Signed-off-by: Alexander Popov
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Kees Cook

    Alexander Popov
     

30 Nov, 2018

1 commit

  • The tracefs file set_graph_function is used to make the function graph
    tracer trace only the functions that are listed in that file (or all
    functions if the file is empty). The way this is implemented is that the
    function graph tracer looks at every function, and if the current depth
    is zero and the function matches something in the file then it will
    trace that function. When other functions are called, the depth will be
    greater than zero (because the original function will be at depth zero),
    and all functions will be traced where the depth is greater than zero.

    The issue is that when a function is first entered, and the handler that
    checks this logic is called, the depth is set to zero. If an interrupt comes
    in and a function in the interrupt handler is traced, its depth will be
    greater than zero and it will automatically be traced, even if the original
    function was not. But because the logic only looks at depth, it may
    trace interrupts when it should not.

    The recent design change of the function graph tracer to fix other bugs
    caused the depth to be zero while the function graph callback handler is
    being called for a longer time, widening the race of this happening. This
    bug was actually there for a longer time, but because the race window was so
    small it seldom happened. The Fixes tag below is for the commit that
    widened the race window, because that commit belongs to a series that
    will also help fix the original bug.

    Cc: stable@kernel.org
    Fixes: 39eb456dacb5 ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack")
    Reported-by: Joe Lawrence
    Tested-by: Joe Lawrence
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

29 Nov, 2018

1 commit

  • Pull networking fixes from David Miller:

    1) ARM64 JIT fixes for subprog handling from Daniel Borkmann.

    2) Various sparc64 JIT bug fixes (fused branch convergence, frame
    pointer usage detection logic, PSEUDO call argument handling).

    3) Fix to use BH locking in nf_conncount, from Taehee Yoo.

    4) Fix race of TX skb freeing in ipheth driver, from Bernd Eckstein.

    5) Handle return value of TX NAPI completion properly in lan743x
    driver, from Bryan Whitehead.

    6) MAC filter deletion in i40e driver clears wrong state bit, from
    Lihong Yang.

    7) Fix use after free in rionet driver, from Pan Bian.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits)
    s390/qeth: fix length check in SNMP processing
    net: hisilicon: remove unexpected free_netdev
    rapidio/rionet: do not free skb before reading its length
    i40e: fix kerneldoc for xsk methods
    ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
    i40e: Fix deletion of MAC filters
    igb: fix uninitialized variables
    netfilter: nf_tables: deactivate expressions in rule replecement routine
    lan743x: Enable driver to work with LAN7431
    tipc: fix lockdep warning during node delete
    lan743x: fix return value for lan743x_tx_napi_poll
    net: via: via-velocity: fix spelling mistake "alignement" -> "alignment"
    qed: fix spelling mistake "attnetion" -> "attention"
    net: thunderx: fix NULL pointer dereference in nic_remove
    sctp: increase sk_wmem_alloc when head->truesize is increased
    firestream: fix spelling mistake: "Inititing" -> "Initializing"
    net: phy: add workaround for issue where PHY driver doesn't bind to the device
    usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2
    sparc: Adjust bpf JIT prologue for PSEUDO calls.
    bpf, doc: add entries of who looks over which jits
    ...

    Linus Torvalds
     

28 Nov, 2018

9 commits

  • The IBPB control code in x86 removed the usage of
    ptrace_may_access_sched(). Remove the functionality which was
    introduced for it.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de

    Thomas Gleixner
     
  • arch_smt_update() is only called when the sysfs SMT control knob is
    changed. This means that when SMT is enabled in the sysfs control knob the
    system is considered to have SMT active even if all siblings are offline.

    To allow fine-grained control of the speculation mitigations, the actual SMT
    state is more interesting than the fact that siblings could be enabled.

    Rework the code, so arch_smt_update() is invoked from each individual CPU
    hotplug function, and simplify the update function while at it.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de

    Thomas Gleixner
     
  • Make the scheduler's 'sched_smt_present' static key globally available, so
    it can be used in the x86 speculation control code.

    Provide a query function and a stub for the CONFIG_SMP=n case.
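
    A sketch consistent with that description (per this series, the query
    helper is named sched_smt_active()):

    /* include/linux/sched/smt.h */
    #ifdef CONFIG_SMP
    extern struct static_key_false sched_smt_present;

    static __always_inline bool sched_smt_active(void)
    {
        return static_branch_likely(&sched_smt_present);
    }
    #else
    static inline bool sched_smt_active(void) { return false; }
    #endif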

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de

    Thomas Gleixner
     
  • Currently the 'sched_smt_present' static key is enabled when SMT
    topology is observed at CPU bringup, but it is never disabled. However
    there is demand
    to also disable the key when the topology changes such that there is no SMT
    present anymore.

    Implement this by making the key count the number of cores that have SMT
    enabled.

    In particular, the SMT topology bits are set before interrupts are enabled
    and similarly, are cleared after interrupts are disabled for the last time
    and the CPU dies.
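
    A sketch of the counting-key idea (the key behaves like a reference
    count of SMT cores; the branch only flips when the count crosses
    zero):

    /* CPU bringup path: this CPU is the second sibling of its core,
     * so the core becomes an SMT core. */
    if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
        static_branch_inc_cpuslocked(&sched_smt_present);

    /* CPU teardown path: after this CPU goes, only one sibling
     * remains, so the core stops being an SMT core. */
    if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
        static_branch_dec_cpuslocked(&sched_smt_present);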

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de

    Peter Zijlstra (Intel)
     
  • The profiler uses trace->depth to find its entry on the ret_stack, but the
    depth may not match the actual location of where its entry is (if an
    interrupt were to preempt the processing of the profiler for another
    function, the depth and the curr_ret_stack will be different).

    Have it use the curr_ret_stack as the index to find its ret_stack entry
    instead of using the depth variable, as that is no longer guaranteed to be
    the same.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The function graph profiler uses the ret_stack to store the "subtime"
    and reuses it for nested functions and also on the return. But the
    current logic has the profiler callback called before the ret_stack is
    updated, and it is just modifying the ret_stack entry that will later
    be allocated (it's just lucky that the "subtime" is not touched when
    it is allocated).

    This could also cause a crash if we are at the end of the ret_stack when
    this happens.

    By reversing the order of allocating the ret_stack and calling the
    callbacks attached to a function being traced, the ret_stack entry is
    no longer used before it is allocated.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • In the past, curr_ret_stack had two functions. One was to denote the
    depth of the call graph, the other was to keep track of where on the
    ret_stack the data is used. Although they may be slightly related, there
    are two cases where they need to be used differently.

    One case is keeping the ret_stack data from being corrupted by an
    interrupt coming in and overwriting the data still in use. The other is
    simply knowing the current depth of the stack.

    The function profiler uses the ret_stack to save a "subtime" variable that
    is part of the data on the ret_stack. If curr_ret_stack is modified too
    early, then this variable can be corrupted.

    The "max_depth" option, when set to 1, will record the first functions going
    into the kernel. To see all top functions (when dealing with timings), the
    depth variable needs to be lowered before calling the return hook. But by
    lowering the curr_ret_stack, it makes the data on the ret_stack still being
    used by the return hook susceptible to being overwritten.

    Now that there are two variables to handle both cases (curr_ret_depth
    for the depth), the updates can be moved to the locations where each
    handles its case.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Currently, the depth of the ret_stack is determined by curr_ret_stack index.
    The issue is that there's a race between setting of the curr_ret_stack and
    calling of the callback attached to the return of the function.

    Commit 03274a3ffb44 ("tracing/fgraph: Adjust fgraph depth before calling
    trace return callback") moved the calling of the callback to after the
    setting of the curr_ret_stack, even stating that it was safe to do so, when
    in fact, it was the reason there was a barrier() there (yes, I should have
    commented that barrier()).

    Not only does the curr_ret_stack keep track of the current call graph depth,
    it also keeps the ret_stack content from being overwritten by new data.

    The function profiler uses the "subtime" variable of the ret_stack
    structure, and by moving the curr_ret_stack early, interrupts are
    allowed to use the same structure the profiler was still using,
    corrupting the data and breaking the profiler.

    To fix this, there needs to be two variables to handle the call stack depth
    and the pointer to where the ret_stack is being used, as they need to change
    at two different locations.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • As all architectures now call function_graph_enter() to do the entry work,
    no architecture should ever call ftrace_push_return_trace(). Make it static.

    This is needed to prepare for a fix of a design bug on how the curr_ret_stack
    is used.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

27 Nov, 2018

2 commits

  • Make fetching of the BPF call address from ppc64 JIT generic. ppc64
    was using a slightly different variant rather than through the insns'
    imm field encoding as the target address would not fit into that space.
    Therefore, the target subprog number was encoded into the insns' offset
    and fetched through fp->aux->func[off]->bpf_func instead. Given there
    are other JITs with this issue and the mechanism of fetching the address
    is JIT-generic, move it into the core as a helper instead. On the JIT
    side, we get information on whether the retrieved address is a fixed
    one, that is, not changing through JIT passes, or a dynamic one. For
    the former, JITs can optimize their imm emission because this doesn't
    change jump offsets throughout JIT process.
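
    A hedged sketch of the core helper as described (simplified; names
    other than __bpf_call_base, BPF_PSEUDO_CALL and the prog fields are
    illustrative, and the real signature may differ):

    static int jit_get_func_addr(const struct bpf_prog *fp,
                                 const struct bpf_insn *insn,
                                 bool extra_pass, u64 *func_addr,
                                 bool *fixed)
    {
        if (insn->src_reg == BPF_PSEUDO_CALL) {
            /* bpf-to-bpf call: resolve via the subprog table. The
             * address only exists once subprogs are JITed, so it is
             * dynamic until the extra pass. */
            if (!extra_pass || !fp->aux->func ||
                insn->off >= fp->aux->func_cnt)
                return -EINVAL;
            *func_addr = (u64)(unsigned long)
                         fp->aux->func[insn->off]->bpf_func;
            *fixed = false;
        } else {
            /* helper call: imm is relative to __bpf_call_base and
             * does not change across JIT passes. */
            *func_addr = (u64)(unsigned long)__bpf_call_base + insn->imm;
            *fixed = true;
        }
        return 0;
    }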

    Signed-off-by: Daniel Borkmann
    Reviewed-by: Sandipan Das
    Tested-by: Sandipan Das
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Currently all the architectures do basically the same thing in preparing the
    function graph tracer on entry to a function. This code can be pulled into a
    generic location and then this will allow the function graph tracer to be
    fixed, as well as extended.

    Create a new function graph helper function_graph_enter() that will call the
    hook function (ftrace_graph_entry) and the shadow stack operation
    (ftrace_push_return_trace), and remove the need of the architecture code to
    manage the shadow stack.

    This is needed to prepare for a fix of a design bug on how the curr_ret_stack
    is used.
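
    What each architecture's entry stub reduces to, roughly (error
    handling per architecture may differ):

    /* arch/<arch>/kernel/ftrace.c */
    void prepare_ftrace_return(unsigned long self_addr,
                               unsigned long *parent,
                               unsigned long frame_pointer)
    {
        unsigned long old = *parent;

        /* The generic core now runs the entry hook and pushes the
         * shadow-stack entry; only hijack the return address on
         * success. */
        if (!function_graph_enter(old, self_addr, frame_pointer, parent))
            *parent = (unsigned long)&return_to_handler;
    }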

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

26 Nov, 2018

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2018-11-25

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Fix an off-by-one bug when adjusting subprog start offsets after
    patching, from Edward.

    2) Fix several bugs such as overflow in size allocation in queue /
    stack map creation, from Alexei.

    3) Fix wrong IPv6 destination port byte order in bpf_sk_lookup_udp
    helper, from Andrey.

    4) Fix several bugs in bpftool such as preventing an infinite loop
    in get_fdinfo, error handling and man page references, from Quentin.

    5) Fix a warning in bpf_trace_printk() that wasn't catching an
    invalid format string, from Martynas.

    6) Fix a bug in BPF cgroup local storage where non-atomic allocation
    was used in atomic context, from Roman.

    7) Fix a NULL pointer dereference bug in bpftool from reallocarray()
    error handling, from Jakub and Wen.

    8) Add a copy of pkt_cls.h and tc_bpf.h uapi headers to the tools
    include infrastructure so that bpftool compiles on older RHEL7-like
    user space which does not ship these headers, from Yonghong.

    9) Fix BPF kselftests to get the ping tests working with both ping6
    and ping -6, from Li.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Nov, 2018

1 commit

  • A format string consisting of "%p" or "%s" followed by an invalid
    specifier (e.g. "%p%\n" or "%s%") could pass the check, which would
    cause format_decode() (lib/vsprintf.c) to warn.

    Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
    Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com
    Signed-off-by: Martynas Pumputis
    Signed-off-by: Daniel Borkmann

    Martynas Pumputis
     

23 Nov, 2018

2 commits

  • Commit:

    142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")

    added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb()
    memory barriers, to ensure that handle_swbp() uses fully-initialized
    uprobes only.

    However, the smp_rmb() is mis-placed: this barrier should be placed
    after handle_swbp() has tested for the flag, thus guaranteeing that
    (program-order) subsequent loads from the uprobe can see the initial
    stores performed by prepare_uprobe().

    Move the smp_rmb() accordingly. Also amend the comments associated
    to the two memory barriers to indicate their actual locations.
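
    This is the standard publish/consume barrier pairing (the helper
    names here are illustrative, not the actual uprobes functions):

    /* Publisher, prepare_uprobe(): initialize, then publish the flag. */
    init_uprobe_insn(uprobe);                  /* initial stores */
    smp_wmb();                                 /* pairs with smp_rmb() below */
    set_bit(UPROBE_COPY_INSN, &uprobe->flags);

    /* Consumer, handle_swbp(): test the flag first, THEN the barrier. */
    if (!test_bit(UPROBE_COPY_INSN, &uprobe->flags))
        return;
    smp_rmb();                  /* orders the flag load before later loads */
    handle_uprobe_insn(uprobe); /* guaranteed to see the initial stores */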

    Signed-off-by: Andrea Parri
    Acked-by: Oleg Nesterov
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@kernel.org
    Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")
    Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar

    Andrea Parri
     
  • Fix the following issues:

    - allow queue_stack_map for root only
    - fix u32 max_entries overflow
    - disallow value_size == 0
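
    A sketch of the kinds of checks involved (close to, but not verbatim,
    the actual fix):

    static int queue_stack_map_alloc_check(union bpf_attr *attr)
    {
        if (!capable(CAP_SYS_ADMIN))            /* root only */
            return -EPERM;

        if (attr->max_entries == 0 || attr->value_size == 0)
            return -EINVAL;

        return 0;
    }

    /* In the allocation path, widen before the +1 and the multiply so
     * a u32 max_entries cannot wrap: */
    u64 size = (u64) attr->max_entries + 1;
    u64 cost = sizeof(struct bpf_queue_stack) + size * attr->value_size;

    if (cost >= U32_MAX - PAGE_SIZE)
        return ERR_PTR(-E2BIG);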

    Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
    Reported-by: Wei Wu
    Signed-off-by: Alexei Starovoitov
    Cc: Mauricio Vasquez B
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

22 Nov, 2018

1 commit

  • If swiotlb_bounce_page() failed, calling arch_sync_dma_for_device() may
    lead to such delights as performing cache maintenance on whatever
    address phys_to_virt(SWIOTLB_MAP_ERROR) looks like, which is typically
    outside the kernel memory map and goes about as well as expected.

    Don't do that.

    Fixes: a4a4330db46a ("swiotlb: add support for non-coherent DMA")
    Tested-by: John Stultz
    Signed-off-by: Robin Murphy
    Signed-off-by: Christoph Hellwig

    Robin Murphy
     

19 Nov, 2018

3 commits

  • Merge misc fixes from Andrew Morton:
    "16 fixes"

    * emailed patches from Andrew Morton:
    mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
    mm, page_alloc: check for max order in hot path
    scripts/spdxcheck.py: make python3 compliant
    tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
    lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
    mm/vmstat.c: fix NUMA statistics updates
    mm/gup.c: fix follow_page_mask() kerneldoc comment
    ocfs2: free up write context when direct IO failed
    scripts/faddr2line: fix location of start_kernel in comment
    mm: don't reclaim inodes with many attached pages
    mm, memory_hotplug: check zone_movable in has_unmovable_pages
    mm/swapfile.c: use kvzalloc for swap_info_struct allocation
    MAINTAINERS: update OMAP MMC entry
    hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
    kernel/sched/psi.c: simplify cgroup_move_task()
    z3fold: fix possible reclaim races

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Fix an exec() related scalability/performance regression, which was
    caused by incorrectly calculating load and migrating tasks on exec()
    when they shouldn't be"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Fix cpu_util_wake() for 'execl' type workloads

    Linus Torvalds
     
  • The existing code triggered an invalid warning about 'rq' possibly being
    used uninitialized. Instead of doing the silly warning suppression by
    initializing it to NULL, refactor the code to bail out early instead.

    Warning was:

    kernel/sched/psi.c: In function `cgroup_move_task':
    kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]

    Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
    Fixes: 2ce7135adc9ad ("psi: cgroup support")
    Signed-off-by: Olof Johansson
    Reviewed-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Olof Johansson
     

17 Nov, 2018

2 commits

  • Naresh reported an issue with the non-atomic memory allocation of
    cgroup local storage buffers:

    [ 73.047526] BUG: sleeping function called from invalid context at
    /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
    [ 73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto
    [ 73.068342] INFO: lockdep is turned off.
    [ 73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
    4.20.0-rc2-next-20181113 #1
    [ 73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
    2.0b 07/27/2017
    [ 73.088018] Call Trace:
    [ 73.090463] dump_stack+0x70/0xa5
    [ 73.093783] ___might_sleep+0x152/0x240
    [ 73.097619] __might_sleep+0x4a/0x80
    [ 73.101191] __kmalloc_node+0x1cf/0x2f0
    [ 73.105031] ? cgroup_storage_update_elem+0x46/0x90
    [ 73.109909] cgroup_storage_update_elem+0x46/0x90

    cgroup_storage_update_elem() (as well as other map update callbacks)
    is called with disabled preemption, so GFP_ATOMIC allocation should
    be used: e.g. alloc_htab_elem() in hashtab.c.
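
    The shape of the fix, roughly (the update callback runs with
    preemption disabled, so the allocation must not sleep):

    /* Before: GFP_KERNEL can sleep, triggering the splat above. */
    new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
                       map->value_size, GFP_KERNEL, map->numa_node);

    /* After: atomic allocation is safe from this context. */
    new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
                       map->value_size,
                       __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
                       map->numa_node);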

    Reported-by: Naresh Kamboju
    Tested-by: Naresh Kamboju
    Signed-off-by: Roman Gushchin
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

    Roman Gushchin
     
  • When patching in a new sequence for the first insn of a subprog, the start
    of that subprog does not change (it's the first insn of the sequence), so
    adjust_subprog_starts should check 'start < off'.
    Also added a test to test_verifier.c (it's essentially the syz reproducer).

    Fixes: cc8b0b92a169 ("bpf: introduce function calls (function boundaries)")
    Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com
    Signed-off-by: Edward Cree
    Acked-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov

    Edward Cree
     

14 Nov, 2018

6 commits

  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.

    Notice that in this particular case, I replaced the code comments with
    a proper "fall through" annotation, which is what GCC is expecting
    to find.
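
    Schematically, the warning wants an annotation comment (matching
    GCC's fallthrough regex) right before the next label; the case names
    here are made up:

    switch (cmd) {
    case OP_PREPARE:
        setup_state();
        /* fall through */      /* recognized by GCC; warning silenced */
    case OP_RUN:
        run_state();
        break;
    default:
        return -EINVAL;
    }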

    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson

    Gustavo A. R. Silva
     
  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.

    Notice that in this particular case, I replaced the code comments with
    a proper "fall through" annotation, which is what GCC is expecting
    to find.

    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson

    Gustavo A. R. Silva
     
  • Replace the whole switch statement with a for loop. This makes the
    code clearer and easy to read.

    This also addresses the following Coverity warnings:

    Addresses-Coverity-ID: 115090 ("Missing break in switch")
    Addresses-Coverity-ID: 115091 ("Missing break in switch")
    Addresses-Coverity-ID: 114700 ("Missing break in switch")

    Suggested-by: Daniel Thompson
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Daniel Thompson
    [daniel.thompson@linaro.org: Tiny grammar change in description]
    Signed-off-by: Daniel Thompson

    Gustavo A. R. Silva
     
  • gcc 8.1.0 warns with:

    kernel/debug/kdb/kdb_support.c: In function ‘kallsyms_symbol_next’:
    kernel/debug/kdb/kdb_support.c:239:4: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
    strncpy(prefix_name, name, strlen(name)+1);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    kernel/debug/kdb/kdb_support.c:239:31: note: length computed here

    Use strscpy() with the destination buffer size, and use ellipses when
    displaying truncated symbols.

    v2: Use strscpy()
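
    The pattern of the fix (strscpy() is bounded by the destination,
    always NUL-terminates, and returns -E2BIG on truncation; the buffer
    size and output shape here are illustrative):

    char prefix_name[128];

    /* Before: the bound tracks the source, so a long symbol overflows
     * the destination. */
    strncpy(prefix_name, name, strlen(name) + 1);

    /* After: bounded by the destination; show truncation explicitly. */
    if (strscpy(prefix_name, name, sizeof(prefix_name)) < 0)
        kdb_printf("%s... ", prefix_name);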

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Toppins
    Cc: Jason Wessel
    Cc: Daniel Thompson
    Cc: kgdb-bugreport@lists.sourceforge.net
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson

    Prarit Bhargava
     
  • Since commit ad67b74d2469 ("printk: hash addresses printed with %p"),
    all pointers printed with %p are printed with hashed addresses
    instead of real addresses in order to avoid leaking addresses in
    dmesg and syslog. But this applies to kdb too, which is unfortunate:

    Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry
    kdb> ps
    15 sleeping system daemon (state M) processes suppressed,
    use 'ps A' to see all.
    Task Addr        Pid   Parent [*] cpu State Thread      Command
    0x(ptrval)       329      328  1    0   R   0x(ptrval)  *sh

    0x(ptrval)         1        0  0    0   S   0x(ptrval)  init
    0x(ptrval)         3        2  0    0   D   0x(ptrval)  rcu_gp
    0x(ptrval)         4        2  0    0   D   0x(ptrval)  rcu_par_gp
    0x(ptrval)         5        2  0    0   D   0x(ptrval)  kworker/0:0
    0x(ptrval)         6        2  0    0   D   0x(ptrval)  kworker/0:0H
    0x(ptrval)         7        2  0    0   D   0x(ptrval)  kworker/u2:0
    0x(ptrval)         8        2  0    0   D   0x(ptrval)  mm_percpu_wq
    0x(ptrval)        10        2  0    0   D   0x(ptrval)  rcu_preempt

    The whole purpose of kdb is to debug, and for debugging real addresses
    need to be known. In addition, data displayed by kdb doesn't go into
    dmesg.

    This patch replaces all %p by %px in kdb in order to display real
    addresses.
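
    The conversion itself is mechanical; %px bypasses the hashing that
    ad67b74d2469 applied to %p (the output shape below is illustrative):

    /* Before: prints a hashed value, or "(ptrval)" before the hashing
     * key is initialized -- useless for debugging. */
    kdb_printf("0x%p %d %d ...\n", (void *)p, p->pid, parent_pid);

    /* After: prints the raw kernel address. */
    kdb_printf("0x%px %d %d ...\n", (void *)p, p->pid, parent_pid);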

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Signed-off-by: Daniel Thompson

    Christophe Leroy
     
  • On a powerpc 8xx, 'btc' fails as follows:

    Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0x0

    When booting the kernel with 'debug_boot_weak_hash', it fails as well:

    Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0xba99ad80

    On other platforms, Oopses have been observed too, see
    https://github.com/linuxppc/linux/issues/139

    This is due to btc calling 'btt' with a %p-formatted pointer as an argument.

    This patch replaces %p by %px to get the real pointer value, as
    expected by 'btt'.

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson

    Christophe Leroy
     

12 Nov, 2018

2 commits

  • A ~10% regression has been reported for UnixBench's execl throughput
    test by Aaron Lu and Ye Xiaolong:

    https://lkml.org/lkml/2018/10/30/765

    That test is pretty simple, it does a "recursive" execve() syscall on the
    same binary. Starting from the syscall, this sequence is possible:

    do_execve()
    do_execveat_common()
    __do_execve_file()
    sched_exec()
    select_task_rq_fair()
    Reported-by: Ye Xiaolong
    Tested-by: Aaron Lu
    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths)
    Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • Pull timer fix from Thomas Gleixner:
    "Just the removal of a redundant call into the sched deadline overrun
    check"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    posix-cpu-timers: Remove useless call to check_dl_overrun()

    Linus Torvalds