23 Jan, 2017

1 commit


19 Jan, 2017

2 commits

  • Pull SMP hotplug update from Thomas Gleixner:
    "This contains a trivial typo fix and an extension to the core code for
    dynamically allocating states in the prepare stage.

    The extension is necessary right now because we need a proper way to
    unbreak LTTNG, which is currently non-functional due to the removal of
    the notifiers. Sure, it's out of tree, but it's widely used by
    distros.

    The simple solution would have been to reserve a state for LTTNG, but
    I'm not fond of unused crap in the kernel, and the dynamic range,
    which we admittedly should have done right away, allows us to remove
    quite a few of the hardcoded states, i.e. those which have no ordering
    requirements. So doing the right thing now is better than having a
    smaller intermediate solution which needs to be reworked anyway"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Provide dynamic range for prepare stage
    perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug

    Linus Torvalds
     
  • Pull RCU fixes from Ingo Molnar:
    "This fixes sporadic ACPI related hangs in synchronize_rcu() that were
    caused by the ACPI code mistakenly relying on an aspect of RCU that
    was neither promised to work nor reliable but which happened to work -
    until in v4.9 we changed the RCU implementation, which made the hangs
    more prominent.

    Since the mis-use of the RCU facility wasn't properly detected and
    prevented either, these fixes make the RCU side work reliably instead
    of working around the problem in the ACPI code.

    Hence the slightly larger diffstat that goes beyond the normal scope
    of RCU fixes in -rc kernels"

    * 'rcu-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Narrow early boot window of illegal synchronous grace periods
    rcu: Remove cond_resched() from Tiny synchronize_sched()

    Linus Torvalds
     

18 Jan, 2017

4 commits

  • After the recent removal of the hotplug notifiers the variable 'hasdied' in
    _cpu_down() is set but no longer read, leading to the following GCC warning
    when building with 'make W=1':

    kernel/cpu.c:767:7: warning: variable ‘hasdied’ set but not used [-Wunused-but-set-variable]

    Fix it by removing the variable.

    Fixes: 530e9b76ae8f ("cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions")
    Signed-off-by: Tobias Klauser
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20170117143501.20893-1-tklauser@distanz.ch
    Signed-off-by: Thomas Gleixner

    Tobias Klauser
     
  • Pull modules fix from Jessica Yu:

    - fix out-of-tree module breakage when it supplies its own definitions
    of true and false

    * tag 'modules-for-v4.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    taint/module: Fix problems when out-of-kernel driver defines true or false

    Linus Torvalds
     
  • Commit 7fd8329ba502 ("taint/module: Clean up global and module taint
    flags handling") used the key words true and false as character members
    of a new struct. These names cause problems when out-of-kernel modules
    such as VirtualBox include their own definitions of true and false.

    Fixes: 7fd8329ba502 ("taint/module: Clean up global and module taint flags handling")
    Signed-off-by: Larry Finger
    Cc: Petr Mladek
    Cc: Jessica Yu
    Cc: Rusty Russell
    Reported-by: Valdis Kletnieks
    Reviewed-by: Petr Mladek
    Acked-by: Rusty Russell
    Signed-off-by: Jessica Yu

    Larry Finger
     
  • Pull networking fixes from David Miller:

    1) Handle multicast packets properly in fast-RX path of mac80211, from
    Johannes Berg.

    2) Because of a logic bug, the user can't actually force SW
    checksumming on r8152 devices. This makes diagnosis of hw
    checksumming bugs really annoying. Fix from Hayes Wang.

    3) VXLAN route lookup does not take the source and destination ports
    into account, which means IPSEC policies cannot be matched properly.
    Fix from Martynas Pumputis.

    4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.

    5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.

    6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
    construction properly in fib_dump_info(), from David Ahern.

    7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
    Schmidt.

    8) Openvswitch conntrack actions need to maintain a correct checksum,
    fix from Lance Richardson.

    9) ax25_disconnect() needs a check for ax25->sk being NULL; the check
    exists, but not in all of the necessary spots. Fix from Basil Gunn.

    10) Action GET operations in the packet scheduler can erroneously bump
    the reference count of the entry, making it unreleasable. Fix from
    Jamal Hadi Salim. Jamal gives a great set of example command lines
    that trigger this in the commit message.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    net sched actions: fix refcnt when GETing of action after bind
    net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
    net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
    net/mlx4_core: Fix racy CQ (Completion Queue) free
    net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
    net/mlx5e: Fix a -Wmaybe-uninitialized warning
    ax25: Fix segfault after sock connection timeout
    bpf: rework prog_digest into prog_tag
    tipc: allocate user memory with GFP_KERNEL flag
    net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
    ip6_tunnel: Account for tunnel header in tunnel MTU
    mld: do not remove mld souce list info when set link down
    be2net: fix MAC addr setting on privileged BE3 VFs
    be2net: don't delete MAC on close on unprivileged BE3 VFs
    be2net: fix status check in be_cmd_pmac_add()
    cpmac: remove hopeless #warning
    ravb: do not use zero-length alignment DMA descriptor
    mlx4: do not call napi_schedule() without care
    openvswitch: maintain correct checksum state in conntrack actions
    tcp: fix tcp_fastopen unaligned access complaints on sparc
    ...

    Linus Torvalds
     

17 Jan, 2017

1 commit

  • Commit 7bd509e311f4 ("bpf: add prog_digest and expose it via
    fdinfo/netlink") was recently discussed, partially due to
    admittedly suboptimal name of "prog_digest" in combination
    with sha1 hash usage, thus inevitably and rightfully concerns
    about its security in terms of collision resistance were
    raised with regards to use-cases.

    The intended use cases are debugging and introspection only, providing
    a stable "tag" over the instruction sequence that both kernel and user
    space can calculate independently.
    It's not usable at all for making a security relevant decision.
    So collisions where two different instruction sequences generate
    the same tag can happen, but ideally at a rather low rate. The
    "tag" will be dumped in hex and is short enough to introspect
    in tracepoints or kallsyms output along with other data such
    as stack trace, etc. Thus, this patch performs a rename into
    prog_tag and truncates the tag to a short output (64 bits) to
    make it obvious it's not collision-free.

    Should in future a hash or facility be needed with a security
    relevant focus, then we can think about requirements, constraints,
    etc that would fit to that situation. For now, rework the exposed
    parts for the current use cases as long as nothing has been
    released yet. Tested on x86_64 and s390x.

    Fixes: 7bd509e311f4 ("bpf: add prog_digest and expose it via fdinfo/netlink")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Andy Lutomirski
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jan, 2017

5 commits

  • Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
    the removal of the cpu hotplug notifiers.

    Usually I don't care much about out of tree modules, but LTTNG is widely
    used in distros. There are two ways to solve that:

    1) Reserve a hotplug state for LTTNG

    2) Add a dynamic range for the prepare states.

    While #1 is the simplest solution, #2 is the proper one as we can convert
    in tree users, which do not care about ordering, to the dynamic range as
    well.

    Add a dynamic range which allows LTTNG to request states in the prepare
    stage.

    Reported-and-tested-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Sebastian Sewior
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • …ck/linux-rcu into rcu/urgent

    Pull an urgent RCU fix from Paul E. McKenney:

    "This series contains a pair of commits that permit RCU synchronous grace
    periods (synchronize_rcu() and friends) to work correctly throughout boot.
    This eliminates the current "dead time" starting when the scheduler spawns
    its first task and ending when the last of RCU's kthreads is spawned
    (this last happens during early_initcall() time). Although RCU's
    synchronous grace periods have long been documented as not working
    during this time, prior to 4.9, the expedited grace periods worked by
    accident, and some ACPI code came to rely on this unintentional behavior.
    (Note that this unintentional behavior was -not- reliable. For example,
    failures from ACPI could occur on !SMP systems and on systems booting
    with the rcu_normal kernel boot parameter.)

    Either way, there is a bug that needs fixing, and the 4.9 switch of RCU's
    expedited grace periods to workqueues could be considered to have caused
    a regression. This series therefore makes RCU's expedited grace periods
    operate correctly throughout the boot process. This has been demonstrated
    to fix the problems ACPI was encountering, and has the added longer-term
    benefit of simplifying RCU's behavior."

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Pull namespace fixes from Eric Biederman:
    "This tree contains 4 fixes.

    The first is a fix for a race that can cause oopses under the right
    circumstances, and that someone just recently encountered.

    Past that are several small trivial fixes. A real issue that
    was blocking development of an out-of-tree driver, but does not appear
    to have caused any actual problems for in-tree code. A potential
    deadlock that was reported by lockdep. And a deadlock people have
    experienced and took the time to track down, caused by a cleanup that
    removed the code to drop a reference count"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    sysctl: Drop reference added by grab_header in proc_sys_readdir
    pid: fix lockdep deadlock warning due to ucount_lock
    libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
    mnt: Protect the mountpoint hashtable with mount_lock

    Linus Torvalds
     
  • Pull NOHZ fix from Ingo Molnar:
    "This fixes an old NOHZ race where we incorrectly calculate the next
    timer interrupt in certain circumstances where hrtimers are pending,
    that can cause hard to reproduce stalled-values artifacts in
    /proc/stat"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    nohz: Fix collision between tick and other hrtimers

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
    driver fixes, plus miscellaneous tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Reject non sampling events with precise_ip
    perf/x86/intel: Account interrupts for PEBS errors
    perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
    perf/core: Fix sys_perf_event_open() vs. hotplug
    perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
    perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
    perf/x86: Set pmu->module in Intel PMU modules
    perf probe: Fix to probe on gcc generated symbols for offline kernel
    perf probe: Fix --funcs to show correct symbols for offline module
    perf symbols: Robustify reading of build-id from sysfs
    perf tools: Install tools/lib/traceevent plugins with install-bin
    tools lib traceevent: Fix prev/next_prio for deadline tasks
    perf record: Fix --switch-output documentation and comment
    perf record: Make __record_options static
    tools lib subcmd: Add OPT_STRING_OPTARG_SET option
    perf probe: Fix to get correct modname from elf header
    samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
    samples/bpf sock_example: Avoid getting ethhdr from two includes
    perf sched timehist: Show total scheduling time

    Linus Torvalds
     

15 Jan, 2017

2 commits

    The current preemptible RCU implementation goes through three phases
    during bootup. In the first phase, there is only one CPU that is running
    with preemption disabled, so that a synchronous grace period is a no-op.
    In the second mid-boot phase, the scheduler is running, but RCU has
    not yet gotten its kthreads spawned (and, for expedited grace periods,
    workqueues are not yet running). During this time, any attempt to do
    a synchronous grace period will hang the system (or complain bitterly,
    depending). In the third and final phase, RCU is fully operational and
    everything works normally.

    This has been OK for some time, but recently some synchronous grace
    periods have been showing up during the second mid-boot phase. This
    code worked "by accident" for a while, but started failing as soon
    as expedited RCU grace periods switched over to workqueues in commit
    8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
    Note that the code was buggy even before this commit, as it was subject
    to failure on real-time systems that forced all expedited grace periods
    to run as normal grace periods (for example, using the rcu_normal ksysfs
    parameter). The callchain from the failure case is as follows:

    early_amd_iommu_init()
    |-> acpi_put_table(ivrs_base);
        |-> acpi_tb_put_table(table_desc);
            |-> acpi_tb_invalidate_table(table_desc);
                |-> acpi_tb_release_table(...)
                    |-> acpi_os_unmap_memory
                        |-> acpi_os_unmap_iomem
                            |-> acpi_os_map_cleanup
                                |-> synchronize_rcu_expedited

    The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
    which caused the code to try using workqueues before they were
    initialized, which did not go well.

    This commit therefore reworks RCU to permit synchronous grace periods
    to proceed during this mid-boot phase. It is a fix for a regression
    introduced in v4.9, and is therefore being put forward post-merge-window
    in v4.10.

    This commit sets a flag from the existing rcu_scheduler_starting()
    function which causes all synchronous grace periods to take the expedited
    path. The expedited path now checks this flag, using the requesting task
    to drive the expedited grace period forward during the mid-boot phase.
    Finally, this flag is updated by a core_initcall() function named
    rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

    Note that this arrangement assumes that tasks are not sent POSIX signals
    (or anything similar) from the time that the first task is spawned
    through core_initcall() time.

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Reported-by: "Zheng, Lv"
    Reported-by: Borislav Petkov
    Signed-off-by: Paul E. McKenney
    Tested-by: Stan Kain
    Tested-by: Ivan
    Tested-by: Emanuel Castelo
    Tested-by: Bruno Pesavento
    Tested-by: Borislav Petkov
    Tested-by: Frederic Bezies
    Cc: # 4.9.0-

    Paul E. McKenney
     
  • It is now legal to invoke synchronize_sched() at early boot, which causes
    Tiny RCU's synchronize_sched() to emit spurious splats. This commit
    therefore removes the cond_resched() from Tiny RCU's synchronize_sched().

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Signed-off-by: Paul E. McKenney
    Cc: # 4.9.0-

    Paul E. McKenney
     

14 Jan, 2017

5 commits

    It's possible to set up PEBS events so that they produce only errors
    and no data, as on SNB-X (model 45) and IVB-EP (model 62), via two
    perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

    This leads to a soft lockup, because the error path of
    intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
    for error PEBS interrupts, so in case you're getting ONLY
    errors you don't have a way to stop the event when it's over
    the max_samples_per_tick limit:

    NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
    ...
    RIP: 0010:[] [] smp_call_function_single+0xe2/0x140
    ...
    Call Trace:
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    ? perf_cgroup_attach+0x70/0x70
    perf_install_in_context+0x199/0x1b0
    ? ctx_resched+0x90/0x90
    SYSC_perf_event_open+0x641/0xf90
    SyS_perf_event_open+0x9/0x10
    do_syscall_64+0x6c/0x1f0
    entry_SYSCALL64_slow_path+0x25/0x25

    Add perf_event_account_interrupt(), which does the interrupt
    and frequency checks, and call it from intel_pmu_drain_pebs_nhm()'s
    error path.

    We keep the pending_kill and pending_wakeup logic only in the
    __perf_event_overflow() path, because they make sense only if
    there's any data to deliver.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • Di Shen reported a race between two concurrent sys_perf_event_open()
    calls where both try and move the same pre-existing software group
    into a hardware context.

    The problem is exactly that described in commit:

    f63a8daa5812 ("perf: Fix event->ctx locking")

    ... where, while we wait for a ctx->mutex acquisition, the event->ctx
    relation can have changed under us.

    That very same commit failed to recognise sys_perf_event_open() as an
    external access vector to the events and thereby didn't apply the
    established locking rules correctly.

    So while one sys_perf_event_open() call is stuck waiting on
    mutex_lock_double(), the other (which owns said locks) moves the group
    about. So by the time the former sys_perf_event_open() acquires the
    locks, the context we've acquired is stale (and possibly dead).

    Apply the established locking rules as per perf_event_ctx_lock_nested()
    to the mutex_lock_double() for the 'move_group' case. This obviously means
    we need to validate state after we acquire the locks.

    Reported-by: Di Shen (Keen Lab)
    Tested-by: John Dias
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Min Chong
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
    Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    There is a problem with installing an event in a task that is 'stuck' on
    an offline CPU.

    Blocked tasks are not disassociated from offlined CPUs; after all, a
    blocked task doesn't run and doesn't require a CPU etc. Only on
    wakeup do we amend the situation and place the task on an available
    CPU.

    If we hit such a task with perf_install_in_context() we'll loop until
    either that task wakes up or the CPU comes back online, if the task
    waking depends on the event being installed, we're stuck.

    While looking into this issue, I also spotted another problem, if we
    hit a task with perf_install_in_context() that is in the middle of
    being migrated, that is we observe the old CPU before sending the IPI,
    but run the IPI (on the old CPU) while the task is already running on
    the new CPU, things also go sideways.

    Rework things to rely on task_curr() -- outside of rq->lock -- which
    is rather tricky. Imagine the following scenario where we're trying to
    install the first event into our task 't':

    CPU0                           CPU1                           CPU2

                                   (current == t)

    t->perf_event_ctxp[] = ctx;
    smp_mb();
    cpu = task_cpu(t);

                                   switch(t, n);
                                                                  migrate(t, 2);
                                   switch(p, t);

                                                                  ctx = t->perf_event_ctxp[]; // must not be NULL

    smp_function_call(cpu, ..);

                                   generic_exec_single()
                                     func();
                                       spin_lock(ctx->lock);
                                       if (task_curr(t)) // false

                                       add_event_to_ctx();
                                       spin_unlock(ctx->lock);

                                                                  perf_event_context_sched_in();
                                                                    spin_lock(ctx->lock);
                                                                    // sees event
    So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
    Because if CPU2's load of that variable were to observe NULL, it would
    not try to schedule the ctx and we'd have a task running without its
    counter, which would be 'bad'.

    As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
    first and not see the event yet, then CPU0 must observe task_curr()
    and retry. If the install happens first, then we must see the event on
    sched-in and all is well.

    I think we can translate the first part (until the 'must not be NULL')
    of the scenario to a litmus test like:

    C C-peterz

    {}

    P0(int *x, int *y)
    {
        int r1;

        WRITE_ONCE(*x, 1);
        smp_mb();
        r1 = READ_ONCE(*y);
    }

    P1(int *y, int *z)
    {
        WRITE_ONCE(*y, 1);
        smp_store_release(z, 1);
    }

    P2(int *x, int *z)
    {
        int r1;
        int r2;

        r1 = smp_load_acquire(z);
        smp_mb();
        r2 = READ_ONCE(*x);
    }

    exists
    (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

    Where:
      x is perf_event_ctxp[],
      y is our task's CPU, and
      z is our task being placed on the rq of CPU2.

    The P0 smp_mb() is the one added by this patch, ordering the store to
    perf_event_ctxp[] from find_get_context() and the load of task_cpu()
    in task_function_call().

    The smp_store_release/smp_load_acquire model the RCpc locking of the
    rq->lock and the smp_mb() of P2 is the context switch switching from
    whatever CPU2 was running to our task 't'.

    This litmus test evaluates into:

    Test C-peterz Allowed
    States 7
    0:r1=0; 2:r1=0; 2:r2=0;
    0:r1=0; 2:r1=0; 2:r2=1;
    0:r1=0; 2:r1=1; 2:r2=1;
    0:r1=1; 2:r1=0; 2:r2=0;
    0:r1=1; 2:r1=0; 2:r2=1;
    0:r1=1; 2:r1=1; 2:r2=0;
    0:r1=1; 2:r1=1; 2:r2=1;
    No
    Witnesses
    Positive: 0 Negative: 7
    Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
    Observation C-peterz Never 0 7
    Hash=e427f41d9146b2a5445101d3e2fcaa34

    And the strong and weak model agree.

    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: jeremy.linton@arm.com
    Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull VFIO fixes from Alex Williamson:

    - Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)

    - Export and make use of has_capability() to fix incorrect use of
    ns_capable() for testing task capabilities (Jike Song)

    * tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
    vfio/type1: Remove pid_namespace.h include
    vfio iommu type1: fix the testing of capability for remote task
    capability: export has_capability
    vfio-mdev: remove some dead code
    vfio-mdev: buffer overflow in ioctl()
    vfio-mdev: return -EFAULT if copy_to_user() fails

    Linus Torvalds
     
  • Pull KVM fixes from Paolo Bonzini:

    - fix for module unload vs deferred jump labels (note: there might be
    other buggy modules!)

    - two NULL pointer dereferences from syzkaller

    - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
    made worse during this merge window, "just" kernel memory leak on
    releases

    - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: fix emulation of "MOV SS, null selector"
    KVM: x86: fix NULL deref in vcpu_scan_ioapic
    KVM: eventfd: fix NULL deref irqbypass consumer
    KVM: x86: Introduce segmented_write_std
    KVM: x86: flush pending lapic jump label updates on module unload
    jump_labels: API for flushing deferred jump label updates

    Linus Torvalds
     

12 Jan, 2017

2 commits

    has_capability() is sometimes needed by modules to test the capabilities
    of a specified task other than current, so export it.

    Cc: Kirti Wankhede
    Signed-off-by: Jike Song
    Acked-by: Serge Hallyn
    Acked-by: James Morris
    Signed-off-by: Alex Williamson

    Jike Song
     
  • Modules that use static_key_deferred need a way to synchronize with
    any delayed work that is still pending when the module is unloaded.
    Introduce static_key_deferred_flush() which flushes any pending
    jump label updates.

    Signed-off-by: David Matlack
    Cc: stable@vger.kernel.org
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini

    David Matlack
     

11 Jan, 2017

4 commits

  • When the tick is stopped and an interrupt occurs afterward, we check on
    that interrupt exit if the next tick needs to be rescheduled. If it
    doesn't need any update, we don't want to do anything.

    In order to check if the tick needs an update, we compare it against the
    clockevent device deadline. Now that's a problem because the clockevent
    device is at a lower level than the tick itself if it is implemented
    on top of hrtimer.

    Every hrtimer shares this clockevent device. So comparing the next tick
    deadline against the clockevent device deadline is wrong, because the
    device may be programmed for another hrtimer whose deadline collides
    with the tick's. As a result we may accidentally fail to reprogram the
    tick.

    In a worst-case scenario under full dynticks mode, the tick stops
    firing at its expected 1 Hz rate, leaving /proc/stat stalled:

    Task in a full dynticks CPU
    ----------------------------

    * hrtimer A is queued 2 seconds ahead
    * the tick is stopped, scheduled 1 second ahead
    * tick fires 1 second later
    * on tick exit, nohz schedules the tick 1 second ahead but sees that
    the clockevent device is already programmed to that deadline; fooled
    by hrtimer A, it doesn't reschedule the tick.
    * hrtimer A is cancelled before its deadline
    * tick never fires again until an interrupt happens...

    In order to fix this, store the next tick deadline to the tick_sched
    local structure and reuse that value later to check whether we need to
    reprogram the clock after an interrupt.

    On the other hand, ts->sleep_length still wants to know about the next
    clock event and not just the tick, so we want to improve the related
    comment to avoid confusion.

    Reported-by: James Hartsock
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Wanpeng Li
    Acked-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Frederic Weisbecker
     
  • Since commit 00cd5c37afd5 ("ptrace: permit ptracing of /sbin/init") we
    can now trace init processes. init is initially protected with
    SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
    there are a number of paths during tracing where SIGNAL_UNKILLABLE can
    be implicitly cleared.

    This can result in init becoming stoppable/killable after tracing. For
    example, running:

    while true; do kill -STOP 1; done &
    strace -p 1

    and then stopping strace and the kill loop will result in init being
    left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
    init will now respond to future SIGSTOP signals rather than ignoring
    them.

    Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
    that we don't clear SIGNAL_UNKILLABLE.

    Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
    Signed-off-by: Jamie Iles
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jamie Iles
     
    Both arch_add_memory() and arch_remove_memory() expect a single-threaded
    context.

    For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
    not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
            pud = (pud_t *)pgd_page_vaddr(*pgd);
            paddr_last = phys_pud_init(pud, __pa(vaddr),
                                       __pa(vaddr_end),
                                       page_size_mask);
            continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
                               page_size_mask);

    The result is that two threads calling devm_memremap_pages()
    simultaneously can end up colliding on pgd initialization. This leads
    to crash signatures like the following where the loser of the race
    initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
    Commit 01b3f52157ff ("bpf: fix allocation warnings in bpf maps and
    integer overflow") added checks for the maximum allocatable size.
    It (ab)used KMALLOC_SHIFT_MAX for that purpose.

    While this is not incorrect, it is not very clean, because we already
    have KMALLOC_MAX_SIZE for this very purpose, so let's change both checks
    to use KMALLOC_MAX_SIZE instead.

    The original motivation for using KMALLOC_SHIFT_MAX was to work around
    an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
    but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
    will fit into MAX_ORDER".

    Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christoph Lameter
    Cc: Alexei Starovoitov
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Jan, 2017

1 commit

  • =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G W
    ---------------------------------------------------------
    swapper/1/0 just changed the state of lock:
    (&(&sighand->siglock)->rlock){-.....}, at: [] __lock_task_sighand+0xb6/0x2c0
    but this lock took another, HARDIRQ-unsafe lock in the past:
    (ucounts_lock){+.+...}
    and interrupts could create inverse lock ordering between them.
    other info that might help us debug this:
    Chain exists of: &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
    Possible interrupt unsafe locking scenario:
         CPU0                    CPU1
         ----                    ----
    lock(ucounts_lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&(&tty->ctrl_lock)->rlock);
    <Interrupt>
      lock(&(&sighand->siglock)->rlock);

    *** DEADLOCK ***

    This patch removes a dependency between rlock and ucounts_lock.
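    Stripped of the interrupt angle (which user space cannot model), the
    hazard lockdep reports is two locks acquirable in opposite orders on
    two CPUs. A hypothetical pthread sketch of the safe shape, where every
    path agrees on one acquisition order so the ABBA cycle cannot form:

    ```c
    #include <pthread.h>

    /* Stand-ins for the two locks in the report; names are illustrative,
     * and real siglock is irq-disabling, which pthreads cannot express. */
    static pthread_mutex_t demo_siglock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t demo_ucounts = PTHREAD_MUTEX_INITIALIZER;
    static int work_done;

    /* Both code paths take demo_siglock before demo_ucounts, so no
     * inverse ordering between them can ever arise. */
    static void *path(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&demo_siglock);
        pthread_mutex_lock(&demo_ucounts);
        work_done++;
        pthread_mutex_unlock(&demo_ucounts);
        pthread_mutex_unlock(&demo_siglock);
        return NULL;
    }

    int demo_run_paths(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, path, NULL);
        pthread_create(&b, NULL, path, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return work_done;
    }
    ```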

    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrei Vagin
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman

    Andrei Vagin
     

06 Jan, 2017

1 commit

  • Pull audit fixes from Paul Moore:
    "Two small fixes relating to audit's use of fsnotify.

    The first patch plugs a leak and the second fixes some lock
    shenanigans. The patches are small and I banged on this for an
    afternoon with our testsuite and didn't see anything odd"

    * 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
    audit: Fix sleep in atomic
    fsnotify: Remove fsnotify_duplicate_mark()

    Linus Torvalds
     

04 Jan, 2017

1 commit

  • Audit tree code was happily adding new notification marks while holding
    spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
    lead to sleeping while holding a spinlock, deadlocks due to lock
    inversion, and probably other fun. Fix the problem by acquiring
    group->mark_mutex earlier.
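    The shape of the fix, taking the sleeping lock before rather than
    inside the non-sleeping critical section, can be sketched in user
    space (hypothetical names; pthread mutexes stand in for both the
    kernel mutex and the spinlock):

    ```c
    #include <pthread.h>

    /* demo_mark_mutex stands in for group->mark_mutex (a sleeping lock);
     * demo_hash_lock stands in for the spinlock the audit tree code held. */
    static pthread_mutex_t demo_mark_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t demo_hash_lock  = PTHREAD_MUTEX_INITIALIZER;
    static int marks_added;

    int demo_add_mark(void)
    {
        /* Fixed ordering: take the mutex first, while sleeping is legal... */
        pthread_mutex_lock(&demo_mark_mutex);
        /* ...and only then enter the spinlock-protected section, so no
         * sleeping lock is ever acquired under the spinlock. */
        pthread_mutex_lock(&demo_hash_lock);
        marks_added++;
        pthread_mutex_unlock(&demo_hash_lock);
        pthread_mutex_unlock(&demo_mark_mutex);
        return marks_added;
    }
    ```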

    CC: Paul Moore
    Signed-off-by: Jan Kara
    Signed-off-by: Paul Moore

    Jan Kara
     

27 Dec, 2016

1 commit

  • The attempt to prevent overwriting an active state resulted in a
    disaster which effectively disables all dynamically allocated hotplug
    states.

    Cleanup the mess.

    Fixes: dc280d936239 ("cpu/hotplug: Prevent overwriting of callbacks")
    Reported-by: Markus Trippelsdorf
    Reported-by: Boris Ostrovsky
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

26 Dec, 2016

4 commits

  • Pull timer type cleanups from Thomas Gleixner:
    "This series does a tree-wide cleanup of types related to
    timers/timekeeping.

    - Get rid of cycles_t and use a plain u64. The type is not really
    helpful and caused more confusion than clarity

    - Get rid of the ktime union. The union has become useless as we use
    the scalar nanoseconds storage unconditionally now. The 32-bit
    timespec-like storage was removed due to the Y2038 limitations some
    time ago.

    That leaves the odd union access around for no reason. Clean it up.

    Both changes have been done with coccinelle and a small amount of
    manual mopping up"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ktime: Get rid of ktime_equal()
    ktime: Cleanup ktime_set() usage
    ktime: Get rid of the union
    clocksource: Use a plain u64 instead of cycle_t

    Linus Torvalds
     
  • Pull SMP hotplug notifier removal from Thomas Gleixner:
    "This is the final cleanup of the hotplug notifier infrastructure. The
    series has been reintegrated in the last two days because a new driver
    using the old infrastructure came in via the SCSI tree.

    Summary:

    - convert the last leftover drivers utilizing notifiers

    - fixup for a completely broken hotplug user

    - prevent setup of already used states

    - removal of the notifiers

    - treewide cleanup of hotplug state names

    - consolidation of state space

    There is a sphinx based documentation pending, but that needs review
    from the documentation folks"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/armada-xp: Consolidate hotplug state space
    irqchip/gic: Consolidate hotplug state space
    coresight/etm3/4x: Consolidate hotplug state space
    cpu/hotplug: Cleanup state names
    cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
    staging/lustre/libcfs: Convert to hotplug state machine
    scsi/bnx2i: Convert to hotplug state machine
    scsi/bnx2fc: Convert to hotplug state machine
    cpu/hotplug: Prevent overwriting of callbacks
    x86/msr: Remove bogus cleanup from the error path
    bus: arm-ccn: Prevent hotplug callback leak
    perf/x86/intel/cstate: Prevent hotplug callback leak
    ARM/imx/mmcd: Fix broken cpu hotplug handling
    scsi: qedi: Convert to hotplug state machine

    Linus Torvalds
     
    ktime_set(S,N) was required for the timespec storage type and is still
    useful in situations where a seconds and nanoseconds pair needs to be
    converted into a time value. For anything where the seconds argument is
    0, it is pointless and can be replaced with a simple assignment.
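    A hedged sketch of the simplification (demo types and names, not the
    kernel headers): once ktime_t is plain scalar nanoseconds, a zero
    seconds argument makes the helper a no-op wrapper around the
    nanoseconds value, so a direct assignment expresses the same thing.

    ```c
    /* Simplified stand-ins for the kernel's type and helper. */
    typedef long long demo_ktime_t;
    #define DEMO_NSEC_PER_SEC 1000000000LL

    /* Combine a seconds part and a nanoseconds part into one value. */
    demo_ktime_t demo_ktime_set(long long secs, unsigned long nsecs)
    {
        return secs * DEMO_NSEC_PER_SEC + nsecs;
    }

    /* With secs == 0 the helper adds nothing:
     * kt = demo_ktime_set(0, n) is just kt = n. */
    ```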

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     
    ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized
    timespec variant for 32-bit machines. The Y2038 cleanup removed the
    timespec variant and switched everything to scalar nanoseconds. The
    union remained, but became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type
    s64.

    The conversion was done with coccinelle and some manual mopping up.
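    In simplified form (demo types, not the actual header), the change
    replaces a single-member union with a bare typedef, and every access
    through the old member becomes a plain use of the value:

    ```c
    /* Before: a union whose only remaining member was scalar nanoseconds,
     * so every read and write had to go through .tv64. */
    union demo_ktime_old { long long tv64; };

    /* After: a bare scalar typedef; the value is used directly. */
    typedef long long demo_ktime_new;

    long long demo_old_read(union demo_ktime_old t) { return t.tv64; }
    long long demo_new_read(demo_ktime_new t)       { return t; }
    ```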

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

25 Dec, 2016

4 commits

  • There is no point in having an extra type for extra confusion. u64 is
    unambiguous.

    Conversion was done with the following coccinelle script:

    @rem@
    @@
    -typedef u64 cycle_t;

    @fix@
    typedef cycle_t;
    @@
    -cycle_t
    +u64

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: John Stultz

    Thomas Gleixner
     
  • hotcpu_notifier(), cpu_notifier(), __hotcpu_notifier(), __cpu_notifier(),
    register_hotcpu_notifier(), register_cpu_notifier(),
    __register_hotcpu_notifier(), __register_cpu_notifier(),
    unregister_hotcpu_notifier(), unregister_cpu_notifier(),
    __unregister_hotcpu_notifier(), __unregister_cpu_notifier()

    are unused now. Remove them and all related code.

    Remove also the now pointless cpu notifier error injection mechanism. The
    states can be executed step by step and error rollback is the same as cpu
    down, so any state transition can be tested w/o requiring the notifier
    error injection.

    Some CPU hotplug states are kept as they are (ab)used for hotplug state
    tracking.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161221192112.005642358@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Developers manage to overwrite states blindly without thought. That's fatal
    and hard to debug. Add sanity checks to make it fail.

    This requires restructuring the code so that the dynamic state allocation
    happens in the same lock-protected section as the actual store. Otherwise
    the previous assignment of 'Reserved' to the name field would trigger the
    overwrite check.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Link: http://lkml.kernel.org/r/20161221192111.675234535@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • This was entirely automated, using the script by Al:

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
    sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

    to do the replacement at the end of the merge window.

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Dec, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "On the kernel side there's two x86 PMU driver fixes and a uprobes fix,
    plus on the tooling side there's a number of fixes and some late
    updates"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    perf sched timehist: Fix invalid period calculation
    perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
    perf sched timehist: Enlarge default 'comm_width'
    perf sched timehist: Honour 'comm_width' when aligning the headers
    perf/x86: Fix overlap counter scheduling bug
    perf/x86/pebs: Fix handling of PEBS buffer overflows
    samples/bpf: Move open_raw_sock to separate header
    samples/bpf: Remove perf_event_open() declaration
    samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
    tools lib bpf: Add bpf_prog_{attach,detach}
    samples/bpf: Switch over to libbpf
    perf diff: Do not overwrite valid build id
    perf annotate: Don't throw error for zero length symbols
    perf bench futex: Fix lock-pi help string
    perf trace: Check if MAP_32BIT is defined (again)
    samples/bpf: Make perf_event_read() static
    uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
    samples/bpf: Make samples more libbpf-centric
    tools lib bpf: Add flags to bpf_create_map()
    tools lib bpf: use __u32 from linux/types.h
    ...

    Linus Torvalds
     
    There are only two call sites of fsnotify_duplicate_mark(). Those are
    in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
    for audit tree, inode pointer and group gets set in
    fsnotify_add_mark_locked() later anyway, mask and free_mark are already
    set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
    actively harmful because following fsnotify_add_mark_locked() will leak
    group reference by overwriting the group pointer. So just remove the two
    calls to fsnotify_duplicate_mark() and the function.

    Signed-off-by: Jan Kara
    [PM: line wrapping to fit in 80 chars]
    Signed-off-by: Paul Moore

    Jan Kara