03 Apr, 2017

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "This update provides:

    - make the scheduler clock switch to unstable mode smooth so the
    timestamps stay at microsecond granularity instead of switching to
    tick granularity.

    - unbreak perf test tsc by taking the new offset into account which
    was added in order to provide better sched clock continuity

    - switching sched clock to unstable mode runs all clock-related
    computations which affect the sched clock output itself from a
    workqueue. In case of preemption, sched clock uses half-updated data
    and provides wrong timestamps. Keep the math in the protected context
    and delegate only the static key switch to workqueue context.

    - remove a duplicate header include"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/headers: Remove duplicate #include line
    sched/clock: Fix broken stable to unstable transfer
    sched/clock, x86/perf: Fix "perf test tsc"
    sched/clock: Fix clear_sched_clock_stable() preempt wobbly

    Linus Torvalds
     

01 Apr, 2017

1 commit

  • Pull crypto fixes from Herbert Xu:
    "This fixes the following issues:

    - memory corruption when kmalloc fails in xts/lrw

    - mark some CCP DMA channels as private

    - fix reordering race in padata

    - regression in omap-rng DT description"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
    crypto: ccp - Make some CCP DMA channels private
    padata: avoid race in reordering
    dt-bindings: rng: clocks property on omap_rng not always mandatory

    Linus Torvalds
     

27 Mar, 2017

1 commit

  • When it is determined that the clock is actually unstable, and
    we switch from stable to unstable, the __clear_sched_clock_stable()
    function is eventually called.

    In this function we set gtod_offset so the following holds true:

    sched_clock() + raw_offset == ktime_get_ns() + gtod_offset

    But instead of getting the latest timestamps, we use the last values
    from scd, so instead of sched_clock() we use scd->tick_raw, and
    instead of ktime_get_ns() we use scd->tick_gtod.

    However, later, when we use gtod_offset in sched_clock_local(), we do
    not add it to scd->tick_gtod when calculating the clock value used to
    determine the min/max clock boundaries.

    This can result in tick-granularity sched_clock() values, so fix it.

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hpa@zytor.com
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Ingo Molnar

    Pavel Tatashin
     

26 Mar, 2017

1 commit

  • Pull audit fix from Paul Moore:
    "We've got an audit fix, and unfortunately it is big.

    While I'm not excited that we need to be sending you something this
    large during the -rcX phase, it does fix some very real, and very
    tangled, problems relating to locking, backlog queues, and the audit
    daemon connection.

    This code has passed our testsuite without problem and it has held up
    to my ad-hoc stress tests (arguably better than the existing code),
    please consider pulling this as fix for the next v4.11-rcX tag"

    * 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix auditd/kernel connection state tracking

    Linus Torvalds
     

24 Mar, 2017

3 commits

  • Under extremely heavy use of padata, crashes occur, and with list
    debugging turned on, this happens instead:

    [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
    __list_add+0xae/0x130
    [87487.301868] list_add corruption. prev->next should be next
    (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
    [87487.339011] [] dump_stack+0x68/0xa3
    [87487.342198] [] ? console_unlock+0x281/0x6d0
    [87487.345364] [] __warn+0xff/0x140
    [87487.348513] [] warn_slowpath_fmt+0x4a/0x50
    [87487.351659] [] __list_add+0xae/0x130
    [87487.354772] [] ? _raw_spin_lock+0x64/0x70
    [87487.357915] [] padata_reorder+0x1e6/0x420
    [87487.361084] [] padata_do_serial+0xa5/0x120

    padata_reorder calls list_add_tail with the list to which it's adding
    locked, which seems correct:

    spin_lock(&squeue->serial.lock);
    list_add_tail(&padata->list, &squeue->serial.list);
    spin_unlock(&squeue->serial.lock);

    This therefore leaves only one place where such an inconsistency could
    occur: if padata->list is added at the same time on two different
    threads. This padata pointer comes from the function call to
    padata_get_next(pd), which has in it the following block:

    next_queue = per_cpu_ptr(pd->pqueue, cpu);
    padata = NULL;
    reorder = &next_queue->reorder;
    if (!list_empty(&reorder->list)) {
            padata = list_entry(reorder->list.next,
                                struct padata_priv, list);
            spin_lock(&reorder->lock);
            list_del_init(&padata->list);
            atomic_dec(&pd->reorder_objects);
            spin_unlock(&reorder->lock);

            pd->processed++;

            goto out;
    }
    out:
    return padata;

    I strongly suspect that the problem here is that two threads can race
    on the reorder list. Even though the deletion is locked, the call to
    list_entry is not, which means it is feasible that two threads pick up
    the same padata object and subsequently call list_add_tail on it at
    the same time. The fix is thus to hoist that lock outside of that
    block.

    Signed-off-by: Jason A. Donenfeld
    Acked-by: Steffen Klassert
    Signed-off-by: Herbert Xu

    Jason A. Donenfeld
     
  • Pull power management fixes from Rafael Wysocki:
    "One of these is an intel_pstate regression fix and it is not a small
    change, but it mostly removes code that shouldn't be there. That code
    was acquired by mistake and has been a source of constant pain since
    then, so the time has come to get rid of it finally. We have not seen
    problems with this change in the lab, so fingers crossed.

    The rest is more usual: one more intel_pstate commit removing useless
    code, a cpufreq core fix to make it restore policy limits on CPU
    online (which prevents the limits from being reset over system
    suspend/resume), a schedutil cpufreq governor initialization fix to
    make it actually work as advertised on all systems and an extra sanity
    check in the cpuidle core to prevent crashes from happening if the
    arch code messes things up.

    Specifics:

    - Make intel_pstate use one set of global P-state limits in the
    active mode regardless of the scaling_governor settings for
    individual CPUs instead of switching back and forth between two of
    them in a way that is hard to control (Rafael Wysocki).

    - Drop a useless function from intel_pstate to prevent it from
    modifying the maximum supported frequency value unexpectedly which
    may confuse the cpufreq core (Rafael Wysocki).

    - Fix the cpufreq core to restore policy limits on CPU online so that
    the limits are not reset over system suspend/resume, among other
    things (Viresh Kumar).

    - Fix the initialization of the schedutil cpufreq governor to make
    the IO-wait boosting mechanism in it actually work on systems with
    one CPU per cpufreq policy (Rafael Wysocki).

    - Add a sanity check to the cpuidle core to prevent crashes from
    happening if the architecture code initialization fails to set up
    things as expected (Vaidyanathan Srinivasan)"

    * tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: Restore policy min/max limits on CPU online
    cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
    cpufreq: intel_pstate: Fix policy data management in passive mode
    cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
    cpufreq: intel_pstate: One set of global limits in active mode

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Several netfilter fixes from Pablo and the crew:
    - Handle fragmented packets properly in netfilter conntrack, from
    Florian Westphal.
    - Fix SCTP ICMP packet handling, from Ying Xue.
    - Fix big-endian bug in nftables, from Liping Zhang.
    - Fix alignment of fake conntrack entry, from Steven Rostedt.

    2) Fix feature flags setting in fjes driver, from Taku Izumi.

    3) Openvswitch ipv6 tunnel source address not set properly, from Or
    Gerlitz.

    4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.

    5) sk->sk_frag.page not released properly in some cases, from Eric
    Dumazet.

    6) Fix RTNL deadlocks in nl80211, from Johannes Berg.

    7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.

    8) Cure improper inflight handling during AF_UNIX GC, from Andrey
    Ulanov.

    9) sch_dsmark doesn't write to packet headers properly, from Eric
    Dumazet.

    10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
    Yeganeh.

    11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.

    12) Fix nametbl deadlock in tipc, from Ying Xue.

    13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
    Pressman.

    14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.

    15) Fix hashmap allocation handling, from Alexei Starovoitov.

    16) nl_fib_input() needs stronger netlink message length checking, from
    Eric Dumazet.

    17) Fix double-free of sk->sk_filter during sock clone, from Daniel
    Borkmann.

    18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
    net:ethernet:aquantia: Fix for RX checksum offload.
    amd-xgbe: Fix the ECC-related bit position definitions
    sfc: cleanup a condition in efx_udp_tunnel_del()
    Bluetooth: btqcomsmd: fix compile-test dependency
    inet: frag: release spinlock before calling icmp_send()
    tcp: initialize icsk_ack.lrcvtime at session start time
    genetlink: fix counting regression on ctrl_dumpfamily()
    socket, bpf: fix sk_filter use after free in sk_clone_lock
    ipv4: provide stronger user input validation in nl_fib_input()
    bpf: fix hashmap extra_elems logic
    enic: update enic maintainers
    net: bcmgenet: remove bcmgenet_internal_phy_setup()
    ipv6: make sure to initialize sockc.tsflags before first use
    fjes: Do not load fjes driver if extended socket device is not power on.
    fjes: Do not load fjes driver if system does not have extended socket device.
    net/mlx5e: Count LRO packets correctly
    net/mlx5e: Count GSO packets correctly
    net/mlx5: Increase number of max QPs in default profile
    net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
    net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
    ...

    Linus Torvalds
     

23 Mar, 2017

3 commits

  • People reported that commit:

    5680d8094ffa ("sched/clock: Provide better clock continuity")

    broke "perf test tsc".

    That commit added another offset to the reported clock value; so
    take that into account when computing the provided offset values.

    Reported-by: Adrian Hunter
    Reported-by: Arnaldo Carvalho de Melo
    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Paul reported a problem with clear_sched_clock_stable(). Since we run
    all of __clear_sched_clock_stable() from workqueue context, there's a
    preempt problem.

    Solve it by only running the static_key_disable() from workqueue.

    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In both kmalloc and prealloc modes, bpf_map_update_elem() uses
    per-cpu extra_elems to do an atomic update when the map is full.
    There are two issues with it. The logic can be misused, since it allows
    max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
    at map creation time can fail percpu alloc for large map values with a warn:
    WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
    illegal size (32824) or align (8) for percpu allocation

    The fixes for these two issues differ between the kmalloc and prealloc
    modes. For prealloc mode, allocate num_possible_cpus extra elements
    and store pointers to them in the extra_elems array instead of the
    actual elements. Hence we can use these hidden (spare) elements not
    only when the map is full, but also during bpf_map_update_elem() calls
    that replace an existing element. That also improves performance,
    since pcpu_freelist_pop/push is avoided. Unfortunately this approach
    cannot be used for kmalloc mode, which needs to kfree elements after
    an RCU grace period. Therefore switch kmalloc mode back to plain
    kmalloc even when the map is full and an old element exists, as it
    was prior to commit 6c9059817432 ("bpf: pre-allocate hash map elements").

    Add tests to check for over max_entries and large map values.

    Reported-by: Dave Jones
    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

21 Mar, 2017

2 commits

  • What started as a rather straightforward race condition reported by
    Dmitry using the syzkaller fuzzer ended up revealing some major
    problems with how the audit subsystem managed its netlink sockets and
    its connection with the userspace audit daemon. Fixing this properly
    had quite the cascading effect and what we are left with is this rather
    large and complicated patch. My initial goal was to try and decompose
    this patch into multiple smaller patches, but the way these changes
    are intertwined makes it difficult to split these changes into
    meaningful pieces that don't break or somehow make things worse for
    the intermediate states.

    The patch makes a number of changes, but the most significant are
    highlighted below:

    * The auditd tracking variables, e.g. audit_sock, are now gone and
    replaced by an RCU/spinlock-protected variable, auditd_conn, a
    structure containing all of the auditd tracking information.

    * We no longer track the auditd sock directly, instead we track it
    via the network namespace in which it resides and we use the audit
    socket associated with that namespace. In spirit, this is what the
    code was trying to do prior to this patch (at least I think that is
    what the original authors intended), but it was done rather poorly
    and added a layer of obfuscation that only masked the underlying
    problems.

    * Big backlog queue cleanup, again. In v4.10 we made some pretty big
    changes to how the audit backlog queues work, here we haven't changed
    the queue design so much as cleaned up the implementation. Brought
    about by the locking changes, we've simplified kauditd_thread() quite
    a bit by consolidating the queue handling into a new helper function,
    kauditd_send_queue(), which allows us to eliminate a lot of very
    similar code and makes the looping logic in kauditd_thread() clearer.

    * All netlink messages sent to auditd are now sent via
    auditd_send_unicast_skb(). Other than just making sense, this makes
    the lock handling easier.

    * Change the audit_log_start() sleep behavior so that we never sleep
    on auditd events (unchanged) or if the caller is holding the
    audit_cmd_mutex (changed). Previously we didn't sleep if the caller
    was auditd or if the message type fell within a certain range; the
    type check was a poor effort of doing what the cmd_mutex check now
    does. Richard Guy Briggs originally proposed not sleeping the
    cmd_mutex owner several years ago but his patch wasn't acceptable
    at the time. At least the idea lives on here.

    * A problem with the lost record counter has been resolved. Steve
    Grubb and I both happened to notice this problem and according to
    some quick testing by Steve, this problem goes back quite some time.
    It's largely a harmless problem, although it may have left some
    careful sysadmins quite puzzled.

    Cc: # 4.10.x-
    Reported-by: Dmitry Vyukov
    Signed-off-by: Paul Moore

    Paul Moore
     
  • sugov_start() only initializes struct sugov_cpu per-CPU structures
    for shared policies, but it should do that for single-CPU policies too.

    That in particular makes the IO-wait boost mechanism work in the
    cases when cpufreq policies correspond to individual CPUs.

    Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Cc: 4.9+ # 4.9+

    Rafael J. Wysocki
     

18 Mar, 2017

4 commits

  • Pull CPU hotplug fix from Thomas Gleixner:
    "A single fix preventing the concurrent execution of the CPU hotplug
    callback install/invocation machinery. Long standing bug caused by a
    massive brain slip of that Gleixner dude, which went unnoticed for
    almost a year"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Serialize callback invocations proper

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of perf related fixes:

    - fix a CR4.PCE propagation issue caused by usage of mm instead of
    active_mm, which propagated the wrong value.

    - perf core fixes, which plug a use-after-free issue and make the
    event inheritance on fork more robust.

    - a tooling fix for symbol handling"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf symbols: Fix symbols__fixup_end heuristic for corner cases
    x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
    x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
    perf/core: Better explain the inherit magic
    perf/core: Simplify perf_event_free_task()
    perf/core: Fix event inheritance on fork()
    perf/core: Fix use-after-free in perf_release()

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "From the scheduler department:

    - a bunch of sched deadline related fixes which deal with various
    buglets and corner cases.

    - two fixes for the loadavg spikes which are caused by the delayed
    NOHZ accounting"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Use deadline instead of period when calculating overflow
    sched/deadline: Throttle a constrained deadline task activated after the deadline
    sched/deadline: Make sure the replenishment timer fires in the next period
    sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
    sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
    sched/deadline: Add missing update_rq_clock() in dl_task_timer()

    Linus Torvalds
     
  • Pull locking fixes from Thomas Gleixner:
    "Three fixes related to locking:

    - fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
    for the XCHGADD variant already

    - plug a potential use after free in the futex code

    - prevent leaking a held spinlock in a futex error handling code
    path"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
    futex: Add missing error handling to FUTEX_REQUEUE_PI
    futex: Fix potential use-after-free in FUTEX_REQUEUE_PI

    Linus Torvalds
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    The commit however failed to introduce functions that would serialize
    memory hotplug operations like it is done for cpu hotplug with
    cpu_maps_update_begin/done().

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition with commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") a sanity check was added to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics it better should not have (serialize memory hotplug
    operations).

    Instead add a new memory_add_remove_lock which has semantics similar
    to cpu_add_remove_lock for cpu hotplug.

    To keep things hopefully a bit easier the lock will be locked and unlocked
    within the mem_hotplug_begin/end() functions.

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

16 Mar, 2017

11 commits

  • While going through the event inheritance code Oleg got confused.

    Add some comments to better explain the silent disappearance of
    orphaned events.

    So what happens is that at perf_event_release_kernel() time, when an
    event loses its connection to userspace (and ceases to exist from the
    user's perspective), we can still have an arbitrary number of inherited
    copies of the event. We want to synchronously find and remove all
    these child events.

    Since that requires a bit of lock juggling, there is the possibility
    that concurrent clone()s will create new child events. Therefore we
    first mark the parent event as DEAD, which marks all the extant child
    events as orphaned.

    We then avoid copying orphaned events; in order to avoid getting more
    of them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We have ctx->event_list that contains all events; no need to
    repeatedly iterate the group lists to find them all.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While hunting for clues to a use-after-free, Oleg spotted that
    perf_event_init_context() can lose an error value, with the result
    that fork() can succeed even though we did not fully inherit the perf
    event context.

    Spotted-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: oleg@redhat.com
    Cc: stable@vger.kernel.org
    Fixes: 889ff0150661 ("perf/core: Split context's event group list into pinned and non-pinned lists")
    Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Dmitry reported that syzkaller tripped a use-after-free in perf_release().

    After much puzzlement Oleg spotted the below scenario:

    Task1                                Task2

    fork()
      perf_event_init_task()
        /* ... */
        goto bad_fork_$foo;
        /* ... */
        perf_event_free_task()
          mutex_lock(ctx->lock)
          perf_free_event(B)

                                         perf_event_release_kernel(A)
                                           mutex_lock(A->child_mutex)
                                           list_for_each_entry(child, ...) {
                                             /* child == B */
                                             ctx = B->ctx;
                                             get_ctx(ctx);
                                             mutex_unlock(A->child_mutex);

            mutex_lock(A->child_mutex)
            list_del_init(B->child_list)
            mutex_unlock(A->child_mutex)

            /* ... */

          mutex_unlock(ctx->lock);
            put_ctx() /* >0 */
        free_task();
                                         mutex_lock(ctx->lock);
                                         mutex_lock(A->child_mutex);
                                         /* ... */
                                         mutex_unlock(A->child_mutex);
                                         mutex_unlock(ctx->lock)
                                         put_ctx() /* 0 */
                                         ctx->task && !TOMBSTONE
                                         put_task_struct() /* UAF */

    This patch closes the hole by making perf_event_free_task() destroy the
    task ctx relation such that perf_event_release_kernel() will no longer
    observe the now dead task.

    Spotted-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Cc: oleg@redhat.com
    Cc: stable@vger.kernel.org
    Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events")
    Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
    Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • I was testing Daniel's changes with his test case, and tweaked it a
    little. Instead of having the runtime equal to the deadline, I
    increased the deadline ten fold.

    Daniel's test case had:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    To make it more interesting, I changed it to:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    The results were rather surprising. The behavior that Daniel's patch
    was fixing came back. The task started using much more than .1% of the
    CPU. More like 20%.

    Looking into this I found that it was due to the dl_entity_overflow()
    constantly returning true. That's because it uses the relative period
    against relative runtime vs the absolute deadline against absolute
    runtime.

    runtime / (deadline - t) > dl_runtime / dl_period

    There's even a comment mentioning this, and saying that when relative
    deadline equals relative period, that the equation is the same as using
    deadline instead of period. That comment is backwards! What we really
    want is:

    runtime / (deadline - t) > dl_runtime / dl_deadline

    We care about if the runtime can make its deadline, not its period. And
    then we can say "when the deadline equals the period, the equation is
    the same as using dl_period instead of dl_deadline".

    After correcting this, now when the task gets enqueued, it can throttle
    correctly, and Daniel's fix to the throttling of sleeping deadline
    tasks works even when the runtime and deadline are not the same.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     
  • During the activation, CBS checks if it can reuse the current task's
    runtime and period. If the deadline of the task is in the past, CBS
    cannot use the runtime, and so it replenishes the task. This rule
    works fine for implicit deadline tasks (deadline == period), and the
    CBS was designed for implicit deadline tasks. However, a task with
    constrained deadline (deadline < period) might be awakened after the
    deadline, but before the next period. In this case, replenishing the
    task would allow it to run for runtime / deadline. As in this case
    deadline < period, CBS enables a task to run for more than the
    runtime / period. In a very loaded system, this can cause a domino
    effect, making other tasks miss their deadlines.

    To avoid this problem, in the activation of a constrained deadline
    task after the deadline but before the next period, throttle the
    task and set the replenishing timer to the beginning of the next
    period, unless it is boosted.

    Reproducer:

    --------------- %< ---------------
    int main (int argc, char **argv)
    {
            int ret;
            int flags = 0;
            unsigned long l = 0;
            struct timespec ts;
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);

            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_deadline = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_period   = 2 * 1000 * 1000 * 1000;   /* 2 s */

            ts.tv_sec = 0;
            ts.tv_nsec = 2000 * 1000;                       /* 2 ms */

            ret = sched_setattr(0, &attr, flags);

            if (ret < 0) {
                    perror("sched_setattr");
                    exit(-1);
            }

            for (;;) {
                    /* XXX: you may need to adjust the loop */
                    for (l = 0; l < 150000; l++);
                    /*
                     * The idea is to go to sleep right before the deadline
                     * and then wake up before the next period to receive
                     * a new replenishment.
                     */
                    nanosleep(&ts, NULL);
            }

            exit(0);
    }
    --------------- >% ---------------

    On my box, this reproducer uses almost 50% of the CPU time, which is
    obviously wrong for a task with 2/2000 reservation.

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
  • Currently, the replenishment timer is set to fire at the deadline
    of a task. Although that works for implicit deadline tasks, because the
    deadline equals the beginning of the next period, it is not correct
    for constrained deadline tasks (deadline < period).

    For instance:

    f.c:
    --------------- %< ---------------
    int main (void)
    {
    for(;;);
    }
    --------------- >% ---------------

    # gcc -o f f.c

    # trace-cmd record -e sched:sched_switch \
    -e syscalls:sys_exit_sched_setattr \
    chrt -d --sched-runtime 490000000 \
    --sched-deadline 500000000 \
    --sched-period 1000000000 0 ./f

    # trace-cmd report | grep "{pid of ./f}"

    After setting the parameters, the task is replenished and continues
    running until it is throttled:

    f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

    The task is throttled after running 492318 us, as expected:

    f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]

    But then, the task is replenished 500719 us after the first
    replenishment:

    -0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]

    Running for 490277 us:

    f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]

    Hence, in the first period, the task runs 2 * runtime, and that is a bug.

    During the first replenishment, the next deadline is set one period away.
    So the runtime / period starts to be respected. However, as the second
    replenishment took place at the wrong instant, every subsequent
    replenishment is also held at the wrong instant. Rather than occurring
    n periods after the first activation, it takes place at
    (n periods - relative deadline).
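
    The difference can be sketched with hypothetical helpers (values from the
    chrt example above, in ns; not the kernel's actual timer code):

```c
#include <assert.h>
#include <stdint.h>

/* From the chrt example: runtime 490 ms, deadline 500 ms, period 1000 ms. */
#define RUNTIME   490000000ULL
#define DEADLINE  500000000ULL
#define PERIOD   1000000000ULL

/* Buggy behavior: the replenishment timer is armed at the absolute
 * deadline of the task. */
static uint64_t replenish_at_deadline(uint64_t activation)
{
    return activation + DEADLINE;
}

/* Fixed behavior: the timer is armed at the start of the next period. */
static uint64_t replenish_at_next_period(uint64_t activation)
{
    return activation + PERIOD;
}

/* Runtime consumable inside the first period: one budget at activation,
 * plus a second one if a replenishment fires before the period ends. */
static uint64_t runtime_in_first_period(uint64_t activation, uint64_t replenish)
{
    uint64_t budget = RUNTIME;

    if (replenish < activation + PERIOD)
        budget += RUNTIME;
    return budget;
}
```

    Arming the timer at the deadline lets a second budget land inside the
    first period, which is exactly the 2 * runtime observed in the trace.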

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luca Abeni
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
  • We hang if SIGKILL has been sent, but the task is stuck in down_read()
    (after do_exit()), even though no task is doing down_write() on the
    rwsem in question:

    INFO: task libupnp:21868 blocked for more than 120 seconds.
    libupnp D 0 21868 1 0x08100008
    ...
    Call Trace:
    __schedule()
    schedule()
    __down_read()
    do_exit()
    do_group_exit()
    __wake_up_parent()

    This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
    the following commit:

    04cafed7fc19 ("locking/rwsem: Fix down_write_killable()")

    ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.

    Signed-off-by: Niklas Cassel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Niklas Cassel
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: d47996082f52 ("locking/rwsem: Introduce basis for down_write_killable()")
    Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
    Signed-off-by: Ingo Molnar

    Niklas Cassel
     
  • 'calc_load_update' is accessed without any kind of locking and there's
    a clear assumption in the code that only a single value is read or
    written.

    Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
    unintentionally seeing multiple values, or having the load/stores
    split.

    Technically the loads in calc_global_*() don't require this since
    those are the only functions that update 'calc_load_update', but I've
    added the READ_ONCE() for consistency.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
    the pending sample window time on exit, setting the next update not
    one window into the future, but two.

    This situation on exiting NO_HZ is described by:

    this_rq->calc_load_update < jiffies < calc_load_update

    In this scenario, what we should be doing is:

    this_rq->calc_load_update = calc_load_update [ next window ]

    But what we actually do is:

    this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]

    This has the effect of delaying load average updates for potentially
    up to ~9 seconds.

    This can result in huge spikes in the load average values due to
    per-cpu uninterruptible task counts being out of sync when accumulated
    across all CPUs.

    It's safe to update the per-cpu active count if we wake between sample
    windows because any load that we left in 'calc_load_idle' will have
    been zeroed when the idle load was folded in calc_global_load().
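
    A minimal model of the exit-path arithmetic above (hypothetical helpers
    with an illustrative LOAD_FREQ value; not the kernel's load-average code):

```c
#include <assert.h>

/* Illustrative; the kernel's LOAD_FREQ is 5 seconds worth of jiffies. */
#define LOAD_FREQ 1250

/* Buggy exit path: always step one full window past the global boundary. */
static unsigned long next_update_buggy(unsigned long calc_load_update)
{
    return calc_load_update + LOAD_FREQ;
}

/*
 * Fixed exit path: adopt the global boundary as-is; only add LOAD_FREQ
 * when jiffies has already passed it, i.e. the window was truly crossed.
 */
static unsigned long next_update_fixed(unsigned long jiffies,
                                       unsigned long calc_load_update)
{
    unsigned long next = calc_load_update;

    if (jiffies >= calc_load_update)
        next += LOAD_FREQ;
    return next;
}
```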

    This issue is easy to reproduce before,

    commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    just by forking short-lived process pipelines built from ps(1) and
    grep(1) in a loop. I'm unable to reproduce the spikes after that
    commit, but the bug still seems to be present from code review.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
    Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • The following warning can be triggered by hot-unplugging the CPU
    on which an active SCHED_DEADLINE task is running on:

    ------------[ cut here ]------------
    WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
    rq->clock_update_flags < RQCF_ACT_SKIP
    CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:

    dump_stack+0x85/0xc4
    __warn+0x172/0x1b0
    warn_slowpath_fmt+0xb4/0xf0
    ? __warn+0x1b0/0x1b0
    ? debug_check_no_locks_freed+0x2c0/0x2c0
    ? cpudl_set+0x3d/0x2b0
    replenish_dl_entity+0x71e/0xc40
    enqueue_task_dl+0x2ea/0x12e0
    ? dl_task_timer+0x777/0x990
    ? __hrtimer_run_queues+0x270/0xa50
    dl_task_timer+0x316/0x990
    ? enqueue_task_dl+0x12e0/0x12e0
    ? enqueue_task_dl+0x12e0/0x12e0
    __hrtimer_run_queues+0x270/0xa50
    ? hrtimer_cancel+0x20/0x20
    ? hrtimer_interrupt+0x119/0x600
    hrtimer_interrupt+0x19c/0x600
    ? trace_hardirqs_off+0xd/0x10
    local_apic_timer_interrupt+0x74/0xe0
    smp_apic_timer_interrupt+0x76/0xa0
    apic_timer_interrupt+0x93/0xa0

    The DL task will be migrated to a suitable later deadline rq once the DL
    timer fires and the current rq is offline. The rq clock of the new rq should
    be updated. This patch fixes it by updating the rq clock after holding
    the new rq's rq lock.
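
    The lock-then-update ordering can be sketched with a mutex-guarded
    stand-in for the runqueue (hypothetical types and a placeholder time
    source; not kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical rq-like struct; a pthread mutex stands in for the rq lock. */
struct rq {
    pthread_mutex_t lock;
    uint64_t clock;    /* only meaningful right after an update */
    int clock_fresh;   /* stand-in for the RQCF_* validity flags */
};

static uint64_t read_clock_source(void)
{
    return 42;  /* placeholder time source */
}

/*
 * After taking the new rq's lock, update its clock *before* any code
 * (like replenish_dl_entity() in the warning above) reads it; reading
 * a stale clock is what trips the rq->clock_update_flags check.
 */
static uint64_t enqueue_on(struct rq *later_rq)
{
    uint64_t now;

    pthread_mutex_lock(&later_rq->lock);
    later_rq->clock = read_clock_source();  /* update_rq_clock() step */
    later_rq->clock_fresh = 1;
    now = later_rq->clock;                  /* safe: freshly updated */
    pthread_mutex_unlock(&later_rq->lock);
    return now;
}
```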

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Matt Fleming
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

15 Mar, 2017

6 commits

  • Pull networking fixes from David Miller:

    1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
    from Steffen Klassert.

    2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
    Alexei Starovoitov.

    3) Fix detection of VLAN filtering feature for bnx2x VF devices, from
    Michal Schmidt.

    4) We can get a divide by zero when TCP sockets are morphed into
    listening state, fix from Eric Dumazet.

    5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
    skb_complete_tx_timestamp(). From Eric Dumazet.

    6) Use after free in dccp_feat_activate_values(), also from Eric
    Dumazet.

    7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
    Jarod Wilson.

    8) Fix use after free in vrf_xmit(), from David Ahern.

    9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
    Alexey Kodanev.

    10) Properly check napi_complete_done() return value in order to decide
    whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
    Lendacky.

    11) Fix double free of hwmon device in marvell phy driver, from Andrew
    Lunn.

    12) Don't crash on malformed netlink attributes in act_connmark, from
    Etienne Noss.

    13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
    from Sabrina Dubroca.

    14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
    Florian Westphal.

    15) Fix routing redirect races in dccp and tcp: the ICMP handler
    can't modify the socket's cached route if it's locked by the user at
    that moment. From Jon Maxwell.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
    qed: Enable iSCSI Out-of-Order
    qed: Correct out-of-bound access in OOO history
    qed: Fix interrupt flags on Rx LL2
    qed: Free previous connections when releasing iSCSI
    qed: Fix mapping leak on LL2 rx flow
    qed: Prevent creation of too-big u32-chains
    qed: Align CIDs according to DORQ requirement
    mlxsw: reg: Fix SPVMLR max record count
    mlxsw: reg: Fix SPVM max record count
    net: Resend IGMP memberships upon peer notification.
    dccp: fix memory leak during tear-down of unsuccessful connection request
    tun: fix premature POLLOUT notification on tun devices
    dccp/tcp: fix routing redirect race
    ucc/hdlc: fix two little issue
    vxlan: fix ovs support
    net: use net->count to check whether a netns is alive or not
    bridge: drop netfilter fake rtable unconditionally
    ipv6: avoid write to a possibly cloned skb
    net: wimax/i2400m: fix NULL-deref at probe
    isdn/gigaset: fix NULL-deref at probe
    ...

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Three cgroup fixes. Nothing critical:

    - the pids controller could trigger suspicious RCU warning
    spuriously. Fixed.

    - in the debug controller, %p -> %pK to protect kernel pointer
    from getting exposed.

    - documentation formatting fix"

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroups: censor kernel pointer in debug files
    cgroup/pids: remove spurious suspicious RCU usage warning
    cgroup: Fix indenting in PID controller documentation

    Linus Torvalds
     
  • Pull workqueue fix from Tejun Heo:
    "If a delayed work is queued with NULL @wq, workqueue code explodes
    after the timer expires, at which point it's difficult to tell who the
    culprit was.

    This actually happened and the offender was net/smc this time.

    Add an explicit sanity check for it in the queueing path"
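
    The check amounts to failing loudly at queueing time, while the culprit
    is still on the stack, instead of crashing later when the timer fires.
    A sketch with hypothetical user-space stand-ins (not the kernel's
    queue_delayed_work() itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the workqueue types. */
struct workqueue { const char *name; };
struct delayed_work { int pending; };

static bool queue_delayed_work_checked(struct workqueue *wq,
                                       struct delayed_work *dwork,
                                       unsigned long delay)
{
    /* Equivalent of the added sanity check on a NULL @wq: report it
     * here, where the caller is identifiable, and refuse to queue. */
    if (wq == NULL) {
        fprintf(stderr, "queue_delayed_work: NULL workqueue\n");
        return false;
    }
    dwork->pending = 1;
    (void)delay;
    return true;
}
```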

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq

    Linus Torvalds
     
  • Thomas spotted that fixup_pi_state_owner() can return errors and we
    fail to unlock the rt_mutex in that case.
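
    The error-path rule the fix restores can be sketched like this, with a
    pthread mutex standing in for the rt_mutex and a stub for the fallible
    fixup step (hypothetical names, not the kernel's futex code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for fixup_pi_state_owner(), which can return an error. */
static int fixup_owner_stub(bool fail)
{
    return fail ? -1 : 0;
}

/* Whether the fixup succeeds or fails, the lock taken on entry must be
 * released before returning; the bug was skipping the unlock on error. */
static int do_fixup(bool fail)
{
    int ret;

    pthread_mutex_lock(&pi_lock);
    ret = fixup_owner_stub(fail);
    pthread_mutex_unlock(&pi_lock);  /* runs on the error path too */
    return ret;
}
```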

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • While working on the futex code, I stumbled over this potential
    use-after-free scenario. Dmitry triggered it later with syzkaller.

    pi_mutex is a pointer into pi_state, which we drop the reference on in
    unqueue_me_pi(). So any access to that pointer after that is bad.

    Since other sites already do rt_mutex_unlock() with hb->lock held, see
    for example futex_lock_pi(), simply move the unlock before
    unqueue_me_pi().
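
    The bug class can be sketched with hypothetical stand-ins for the
    refcounted pi_state (not the kernel code): the pointer into the object
    must be used before the reference is dropped, never after.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the refcounted pi_state. */
struct pi_state {
    int refcount;
    int mutex_locked;  /* models the embedded rt_mutex */
};

static bool state_freed;

static void put_pi_state(struct pi_state *s)
{
    if (--s->refcount == 0) {
        state_freed = true;  /* object may be reused from here on */
        free(s);
    }
}

/*
 * The fixed ordering: unlock the embedded mutex while the reference is
 * still held (with hb->lock held in the real code, as futex_lock_pi()
 * already does), and only then drop the reference. Swapping these two
 * calls is exactly the use-after-free described above.
 */
static void unlock_and_unqueue(struct pi_state *s)
{
    s->mutex_locked = 0;  /* rt_mutex_unlock() equivalent */
    put_pi_state(s);      /* unqueue_me_pi() drops the reference */
}
```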

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The setup/remove_state/instance() functions in the hotplug core code are
    serialized against concurrent CPU hotplug, but unfortunately not serialized
    against themselves.

    As a consequence, a concurrent invocation of these functions results in
    corruption of the callback machinery, because two instances try to invoke
    callbacks on remote CPUs at the same time. This results in missing callback
    invocations and initiator threads waiting forever on the completion.

    The obvious solution of replacing get_online_cpus() with cpu_hotplug_begin()
    is not possible, because at least one call site calls into these functions
    from a get_online_cpus() locked region.

    Extend the protection scope of the cpuhp_state_mutex from solely protecting
    the state arrays to cover the callback invocation machinery as well.
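
    The extended lock scope can be modeled like this (hypothetical user-space
    stand-ins for cpuhp_state_mutex and the callback machinery, not the
    kernel's hotplug code):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
static int state_table[8];
static int callbacks_in_flight;  /* must never exceed 1 under the lock */

/* Returns how many invocations were running concurrently. */
static int invoke_callbacks(void)
{
    int concurrent = ++callbacks_in_flight;

    --callbacks_in_flight;
    return concurrent;
}

static int setup_state(int state, int value)
{
    int concurrent;

    pthread_mutex_lock(&state_mutex);
    state_table[state] = value;       /* previously the only guarded part */
    concurrent = invoke_callbacks();  /* now also inside the lock */
    pthread_mutex_unlock(&state_mutex);
    return concurrent;
}
```

    With the callback invocation inside the same critical section as the
    state-table update, two setup/remove calls can no longer interleave
    their callbacks.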

    Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface")
    Reported-and-tested-by: Bart Van Assche
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: hpa@zytor.com
    Cc: mingo@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

13 Mar, 2017

1 commit

  • Pull x86 fixes from Thomas Gleixner:

    - a fix for the kexec/purgatory regression which was introduced in the
    merge window via an innocent sparse fix. We could have reverted that
    commit, but on deeper inspection it turned out that the whole
    machinery is neither documented nor robust. So a proper cleanup was
    done instead

    - the fix for the TLB flush issue which was discovered recently

    - a simple typo fix for a reboot quirk

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix tlb flushing when lguest clears PGE
    kexec, x86/purgatory: Unbreak it and clean it up
    x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk

    Linus Torvalds
     

11 Mar, 2017

2 commits

  • The purgatory code defines global variables which are referenced via a
    symbol lookup in the kexec code (core and arch).

    A recent commit addressing sparse warnings made these static and thereby
    broke kexec_file.

    Why did this happen? Simply because the whole machinery is undocumented and
    lacks any form of forward declarations. The variable names are unspecific
    and lack a prefix, so adding forward declarations creates shadow variables
    in the core code. Aside of that the code relies on magic constants and
    duplicate struct definitions with no way to ensure that these things stay
    in sync. The section placement of the purgatory variables happened by
    chance and not by design.

    Unbreak kexec and cleanup the mess:

    - Add proper forward declarations and document the usage
    - Use common struct definition
    - Use the proper common defines instead of magic constants
    - Add a purgatory_ prefix to have a proper name space
    - Use ARRAY_SIZE() instead of a homebrewn reimplementation
    - Add proper sections to the purgatory variables [ From Mike ]
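
    The declaration pattern the cleanup establishes can be sketched as
    follows (hypothetical symbol placement; in the real tree the
    declarations live in shared kexec/purgatory headers):

```c
#include <assert.h>
#include <stddef.h>

/* A shared header would carry the forward declaration, so the core code
 * and the purgatory agree on the name and type: */
extern unsigned char purgatory_sha256_digest[32];

/* Definition site: non-static with a purgatory_ prefix. Making this
 * static is exactly what broke the symbol lookup in kexec_file. */
unsigned char purgatory_sha256_digest[32];

/* The common ARRAY_SIZE() instead of a homebrewn reimplementation. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static size_t digest_len(void)
{
    return ARRAY_SIZE(purgatory_sha256_digest);
}
```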

    Fixes: 72042a8c7b01 ("x86/purgatory: Make functions and variables static")
    Reported-by: Mike Galbraith
    Signed-off-by: Thomas Gleixner
    Cc: Nicholas Mc Guire
    Cc: Borislav Petkov
    Cc: Vivek Goyal
    Cc: "Tobin C. Harding"
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Merge fixes from Andrew Morton:
    "26 fixes"

    * emailed patches from Andrew Morton : (26 commits)
    userfaultfd: remove wrong comment from userfaultfd_ctx_get()
    fat: fix using uninitialized fields of fat_inode/fsinfo_inode
    sh: cayman: IDE support fix
    kasan: fix races in quarantine_remove_cache()
    kasan: resched in quarantine_remove_cache()
    mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
    thp: fix another corner case of munlock() vs. THPs
    rmap: fix NULL-pointer dereference on THP munlocking
    mm/memblock.c: fix memblock_next_valid_pfn()
    userfaultfd: selftest: vm: allow to build in vm/ directory
    userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
    userfaultfd: non-cooperative: fix fork fctx->new memleak
    mm/cgroup: avoid panic when init with low memory
    drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
    mm/vmstats: add thp_split_pud event for clarity
    include/linux/fs.h: fix unsigned enum warning with gcc-4.2
    userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
    userfaultfd: non-cooperative: robustness check
    userfaultfd: non-cooperative: rollback userfaultfd_exit
    x86, mm: unify exit paths in gup_pte_range()
    ...

    Linus Torvalds
     

10 Mar, 2017

3 commits

  • Patch series "userfaultfd non-cooperative further update for 4.11 merge
    window".

    Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
    more testing. I've been doing testing before and this was also tested
    by kbuild bot and exercised by the selftest, but this bug never
    reproduced before.

    I dropped userfaultfd_exit as a result. I dropped it because of
    implementation difficulty in receiving signals in __mmput and because I
    think -ENOSPC as result from the background UFFDIO_COPY should be enough
    already.

    Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
    wasn't exercised by the selftest and when I tried to exercise it, after
    moving it to a more correct place in __mmput where it would make more
    sense and where the vma list is stable, it resulted in the
    event_wait_completion in D state. So then I added the second patch to
    be sure that even if we call userfaultfd_event_wait_completion too late
    during task exit(), we won't risk generating tasks in D state. The
    same check exists in handle_userfault() for the same reason, except it
    makes a difference there, while here it is just a robustness check and
    it's run under WARN_ON_ONCE.

    While looking at the userfaultfd_event_wait_completion() function I
    looked back at its callers too while at it and I think it's not ok to
    stop executing dup_fctx on the fcs list because we rely on
    userfaultfd_event_wait_completion to execute
    userfaultfd_ctx_put(fctx->orig) which is paired against
    userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
    list_add(fcs). This change only takes care of fctx->orig but this area
    also needs further review looking for similar problems in fctx->new.

    The only patch that is urgent is the first, because it's a use after
    free during an SMP race condition that affects all processes if
    CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
    impossible without SLUB poisoning enabled.

    This patch (of 3):

    I once reproduced this oops with the userfaultfd selftest, it's not
    easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[] [] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    do_exit+0x297/0xd10
    SyS_exit+0x17/0x20
    tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [] userfaultfd_exit+0x29/0xa0
    RSP
    ---[ end trace 9fecd6dcb442846a ]---

    In the debugger I located the "mm" pointer in the stack and walking
    mm->mmap->vm_next through the end shows the vma->vm_next list is fully
    consistent and is a null-terminated list, as expected. So this has to
    be an SMP race condition where userfaultfd_exit was running while the
    vma list was being modified by another CPU.

    When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
    SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

    The reason is that it's not running in __mmput but while there are still
    other threads running, and it's not holding the mmap_sem (it can't, as it
    has to wait for the event to be received by the manager). So this is a
    use after free that was happening for all processes.

    One more implementation problem aside from the race condition:
    userfaultfd_exit really has to check a flag in mm->flags before walking
    the vmas, or it's going to slow down the exit() path for regular tasks.

    One more implementation problem: at that point signals can't be
    delivered so it would also create a task in D state if the manager
    doesn't read the event.

    The major design issue: it overall looks superfluous as the manager can
    check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
            [..]
    } else {
            return -ENOSPC;
    }

    It's safer to roll it back and re-introduce it later if at all.

    [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
    Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mike Rapoport
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Fix typos and add the following to the scripts/spelling.txt:

    overide||override

    While we are here, fix the doubled "address" in the touched line
    Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.

    Also, fix the comment block style in the touched hunks in
    drivers/media/dvb-frontends/drx39xyj/drx_driver.h.

    Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fix typos and add the following to the scripts/spelling.txt:

    disble||disable
    disbled||disabled

    I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
    untouched. The macro is not referenced at all, but this commit is
    touching only comment blocks just in case.

    Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada