16 Jul, 2016

1 commit


15 Jul, 2016

3 commits

  • Merge misc fixes from Andrew Morton:
    "20 fixes"

    * emailed patches from Andrew Morton :
    m32r: fix build warning about putc
    mm: workingset: printk missing log level, use pr_info()
    mm: thp: refix false positive BUG in page_move_anon_rmap()
    mm: rmap: call page_check_address() with sync enabled to avoid racy check
    mm: thp: move pmd check inside ptl for freeze_page()
    vmlinux.lds: account for destructor sections
    gcov: add support for gcc version >= 6
    mm, meminit: ensure node is online before checking whether pages are uninitialised
    mm, meminit: always return a valid node from early_pfn_to_nid
    kasan/quarantine: fix bugs on qlist_move_cache()
    uapi: export lirc.h header
    madvise_free, thp: fix madvise_free_huge_pmd return value after splitting
    Revert "scripts/gdb: add documentation example for radix tree"
    Revert "scripts/gdb: add a Radix Tree Parser"
    scripts/gdb: Perform path expansion to lx-symbol's arguments
    scripts/gdb: add constants.py to .gitignore
    scripts/gdb: rebuild constants.py on dependancy change
    scripts/gdb: silence 'nothing to do' message
    kasan: add newline to messages
    mm, compaction: prevent VM_BUG_ON when terminating freeing scanner

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Fix a CPU hotplug related corruption of the load average that got
    introduced in this merge window"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Correct off by one bug in load migration calculation

    Linus Torvalds
     
  • Link: http://lkml.kernel.org/r/20160701130914.GA23225@styxhp
    Signed-off-by: Florian Meier
    Reviewed-by: Peter Oberparleiter
    Tested-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Florian Meier
     

14 Jul, 2016

1 commit

  • …t.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull perf and timer fixes from Ingo Molnar:
    "A fix for a posix CPU timers bug, and a perf printk message fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix bogus kernel printk, again

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    posix_cpu_timer: Exit early when process has been reaped

    Linus Torvalds
     

13 Jul, 2016

2 commits

  • The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
    account that the function is now called from a thread running on the outgoing
CPU. As a result, a CPU unplug leaks a load of 1 into the global load
    accounting mechanism.

    Fix it by adjusting for the currently running thread which calls
    calc_load_migrate().

    Reported-by: Anton Blanchard
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Vaidyanathan Srinivasan
    Cc: rt@linutronix.de
    Cc: shreyas@linux.vnet.ibm.com
    Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Xiaolong Ye reported lock debug warnings triggered by the following commit:

    8de4a0066106 ("perf/x86: Convert the core to the hotplug state machine")

    The bug is the following: the cpuhp_bp_states[] array is cut short when
    CONFIG_SMP=n, but the dynamically registered callbacks are stored nevertheless
    and happily scribble outside of the array bounds...

    We need to store them in case the state is unregistered, so that we can
    invoke the teardown function. That's independent of CONFIG_SMP. Make sure
    the array is large enough.

    Reported-by: kernel test robot
    Signed-off-by: Thomas Gleixner
    Cc: Adam Borowski
    Cc: Alexander Shishkin
    Cc: Anna-Maria Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Cc: lkp@01.org
    Cc: stable@vger.kernel.org
    Cc: tipbuild@zytor.com
    Fixes: cff7d378d3fd "cpu/hotplug: Convert to a state machine for the control processor"
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607122144560.4083@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

11 Jul, 2016

1 commit

Variable "now" seems to be genuinely used uninitialized if the branch

    if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

    is not taken and the branch

    if (unlikely(sighand == NULL)) {

    is taken. In this case the process has been reaped and the timer is marked as
    disarmed anyway. So none of the postprocessing of the sample is
    required. Return right away.

    Signed-off-by: Alexey Dobriyan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
    Signed-off-by: Thomas Gleixner

    Alexey Dobriyan
     

09 Jul, 2016

1 commit


07 Jul, 2016

1 commit

  • The following commit:

    66eb579e66ec ("perf: allow for PMU-specific event filtering")

    added the pmu::filter_match() callback. This was intended to
    avoid HW constraints on events from resulting in extremely
    pessimistic scheduling.

    However, pmu::filter_match() is only called for the leader of each event
    group. When the leader is a SW event, we do not filter the groups, and
    may fail at pmu::add() time, and when this happens we'll give up on
    scheduling any event groups later in the list until they are rotated
    ahead of the failing group.

    This can result in extremely sub-optimal event scheduling behaviour,
    e.g. if running the following on a big.LITTLE platform:

    $ taskset -c 0 ./perf stat \
    -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
    -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
    ls

    <not counted> context-switches (0.00%)
    <not counted> armv8_cortex_a57/config=0x11/ (0.00%)
    24 context-switches (37.36%)
    57589154 armv8_cortex_a53/config=0x11/ (37.36%)

    Here the 'a53' event group was always eligible to be scheduled, but
    the 'a57' group never eligible to be scheduled, as the task was always
    affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
    group was eligible, but the HW event failed at pmu::add() time,
    resulting in ctx_flexible_sched_in giving up on scheduling further
    groups with HW events.

    One way of avoiding this is to check pmu::filter_match() on siblings
    as well as the group leader. If any of these fail their
    pmu::filter_match() call, we must skip the entire group before
    attempting to add any events.

    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 66eb579e66ec ("perf: allow for PMU-specific event filtering")
    Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
    [ Small readability edits. ]
    Signed-off-by: Ingo Molnar

    Mark Rutland
     

30 Jun, 2016

3 commits

  • Pull audit fixes from Paul Moore:
    "Two small patches to fix audit problems in 4.7-rcX: the first fixes a
    potential kref leak, the second removes some header file noise.

    The first is an important bug fix that really should go in before 4.7
    is released, the second is not critical, but falls into the very-nice-
    to-have category so I'm including in the pull request.

    Both patches are straightforward, self-contained, and pass our
    testsuite without problem"

    * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
    audit: move audit_get_tty to reduce scope and kabi changes
    audit: move calcs after alloc and check when logging set loginuid

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "I've been traveling, so this accumulates more than a week or so of bug
    fixing. It perhaps looks a little worse than it really is.

    1) Fix deadlock in ath10k driver, from Ben Greear.

    2) Increase scan timeout in iwlwifi, from Luca Coelho.

    3) Unbreak STP by properly reinjecting STP packets back into the
    stack. Regression fix from Ido Schimmel.

    4) Mediatek driver fixes (missing malloc failure checks, leaking of
    scratch memory, wrong indexing when mapping TX buffers, etc.) from
    John Crispin.

    5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
    Sowa.

    6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su.

    7) Fix netlink notifications in ovs for tunnels, delete link messages
    are never emitted because of how the device registry state is
    handled. From Nicolas Dichtel.

    8) Conntrack module leaks kmemcache on unload, from Florian Westphal.

    9) Prevent endless jump loops in nft rules, from Liping Zhang and
    Pablo Neira Ayuso.

    10) Not early enough spinlock initialization in mlx4, from Eric
    Dumazet.

    11) Bind refcount leak in act_ipt, from Cong WANG.

    12) Missing RCU locking in HTB scheduler, from Florian Westphal.

    13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
    barrier, using heap for SG and IV, and erroneous use of async flag
    when allocating AEAD context.)

    14) RCU handling fix in TIPC, from Ying Xue.

    15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
    SIT driver, from Simon Horman.

    16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.

    17) Fix potential deadlock in team enslave, from Ido Schimmel.

    18) Memory leak in KCM procfs handling, from Jiri Slaby.

    19) ESN generation fix in ipv4 ESP, from Herbert Xu.

    20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
    WANG.

    21) Use after free in netem, from Eric Dumazet.

    22) Uninitialized last assert time in multicast router code, from Tom
    Goff.

    23) Skip raw sockets in sock_diag destruction broadcast, from Willem
    de Bruijn.

    24) Fix link status reporting in thunderx, from Sunil Goutham.

    25) Limit resegmentation of retransmit queue so that we do not
    retransmit too large GSO frames. From Eric Dumazet.

    26) Delay bpf program release after grace period, from Daniel
    Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
    openvswitch: fix conntrack netlink event delivery
    qed: Protect the doorbell BAR with the write barriers.
    neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
    e1000e: keep VLAN interfaces functional after rxvlan off
    cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
    qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
    bpf, perf: delay release of BPF prog after grace period
    net: bridge: fix vlan stats continue counter
    tcp: do not send too big packets at retransmit time
    ibmvnic: fix to use list_for_each_safe() when delete items
    net: thunderx: Fix TL4 configuration for secondary Qsets
    net: thunderx: Fix link status reporting
    net/mlx5e: Reorganize ethtool statistics
    net/mlx5e: Fix number of PFC counters reported to ethtool
    net/mlx5e: Prevent adding the same vxlan port
    net/mlx5e: Check for BlueFlame capability before allocating SQ uar
    net/mlx5e: Change enum to better reflect usage
    net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
    net/mlx5: Update command strings
    net: marvell: Add separate config ANEG function for Marvell 88E1111
    ...

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Three fix patches. Two are for cgroup / css init failure path. The
    last one makes css_set_lock irq-safe as the deadline scheduler ends up
    calling put_css_set() from irq context"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Disable IRQs while holding css_set_lock
    cgroup: set css->id to -1 during init
    cgroup: remove redundant cleanup in css_create

    Linus Torvalds
     

29 Jun, 2016

3 commits

  • Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
    destruction of BPF program from free_event_rcu() callback to __free_event(),
    which is problematic if used with tail calls: if prog A is attached as
    trace event directly, but at the same time present in a tail call map used
    by another trace event program elsewhere, then we need to delay destruction
    via RCU grace period since it can still be in use by the program doing the
    tail call (the prog first needs to be dropped from the tail call map, then
    trace event with prog A attached destroyed, so we get immediate destruction).

    Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Jann Horn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The only users of audit_get_tty and audit_put_tty are internal to
    audit, so move it out of include/linux/audit.h to kernel.h and create
    a proper function rather than inlining it. This also reduces kABI
    changes.

    Suggested-by: Paul Moore
    Signed-off-by: Richard Guy Briggs
    [PM: line wrapped description]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     
  • Move the calculations of values after the allocation in case the
    allocation fails. This avoids wasting effort in the rare case that it
    fails, but more importantly saves us extra logic to release the tty
    ref.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

27 Jun, 2016

2 commits

  • Commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    did something non-obvious, and also introduced a bug that remained latent.

    The problem was exposed for real by a later commit in the v4.7 merge window:

    2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

    ... after which tg->load_avg and cfs_rq->load.weight had different
    units (10 bit fixed point and 20 bit fixed point resp.).

    Add a comment to explain the use of cfs_rq->load.weight over the
    'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
    for the difference in unit.

    Since this is (now, as per a previous commit) the only user of
    calc_tg_weight(), collapse it.

    The effects of this bug should be randomly inconsistent SMP-balancing
    of cgroups workloads.

    Reported-by: Jirka Hladky
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Starting with the following commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    calc_tg_weight() doesn't compute the right value as expected by effective_load().

    The difference is in the 'correction' term. In order to ensure \Sum
    rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
    lagging a correction on the current cfs_rq->avg.load_avg value.
    Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
    cfs_rq->avg.load_avg.

    Now, per the referenced commit, calc_tg_weight() doesn't use
    cfs_rq->avg.load_avg, as is later used in @w, but uses
    cfs_rq->load.weight instead.

    So stop using calc_tg_weight() and do it explicitly.

    The effects of this bug are wake_affine() making randomly
    poor choices in cgroup-intense workloads.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Jun, 2016

6 commits

  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - force watchdog reset while processing sysrq-w

    - fix a deadlock when enabling trace events in the scheduler

    - fixes to the throttled next buddy logic

    - fixes for the average accounting (missing serialization and
    underflow handling)

    - allow kernel threads for fallback to online but not active cpus"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Allow kthreads to fall back to online && !active cpus
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Fix cfs_rq avg tracking underflow
    kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
    sched/debug: Fix deadlock when enabling sched events
    sched/fair: Fix post_init_entity_util_avg() serialization

    Linus Torvalds
     
  • Pull locking fix from Thomas Gleixner:
    "A single fix to address a race in the static key logic"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/static_key: Fix concurrent static_key_slow_inc()

    Linus Torvalds
     
  • Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
    breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

    kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
    kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
    ^

    Fix it by renaming free_stack() to free_thread_stack(), and updating the
    return type of alloc_thread_stack_node().

    Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton : (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't poot mempool objects in quarantine
    ...

    Linus Torvalds
     
  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kernel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim() itself.
    (7) oom_killer_disable() returns while P1 has not yet finished.
    (8) P1 performs IO / interferes with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable() after
    all the freezable kernel threads are frozen, because the oom victim might
    depend on some of those kthreads to make forward progress and exit, so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    not to call exit_oom_victim(), because then we would lose the forward
    progress guarantee of the OOM killer and oom_killer_disable(), since
    exit_mm->mmput might block and never call exit_oom_victim().

    It seems the easiest way forward is to work around this race by calling
    try_to_freeze_tasks() again after oom_killer_disable(). This makes sure
    that all tasks are frozen, or bails out otherwise.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. It would be a separate thing to clean that up, I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Jun, 2016

6 commits

  • During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
    online but not active. A CPU_ONLINE callback may create or bind a
    kthread so that its cpus_allowed mask only allows the CPU which is
    being brought online. The kthread may start executing before the CPU
    is made active and can end up in select_fallback_rq().

    In such cases, the expected behavior is selecting the CPU which is
    coming online; however, because select_fallback_rq() only chooses from
    active CPUs, it determines that the task doesn't have any viable CPU
    in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which
    is coming online. Update select_fallback_rq() so that it follows
    cpu_online() rather than cpu_active() for kthreads.

    Reported-by: Gautham R Shenoy
    Tested-by: Gautham R. Shenoy
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Abdul Haleem
    Cc: Aneesh Kumar
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
    The hierarchy could already be throttled at this point. A throttled next
    buddy could trigger a NULL pointer dereference in pick_next_task_fair().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
    A cgroup created inside a throttled group must inherit the current
    throttle_count. A broken throttle_count allows throttled entries to be
    nominated as the next buddy, which later leads to a NULL pointer
    dereference in pick_next_task_fair().

    This patch initializes cfs_rq->throttle_count at the first enqueue:
    laziness allows us to skip locking all runqueues at group creation. The
    lazy approach also allows skipping a full sub-tree scan when throttling
    a hierarchy (not in this patch).

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • The following scenario is possible:

    CPU 1                                   CPU 2

    static_key_slow_inc()
      atomic_inc_not_zero()
        -> key.enabled == 0, no increment
      jump_label_lock()
      atomic_inc_return()
        -> key.enabled == 1 now
                                            static_key_slow_inc()
                                              atomic_inc_not_zero()
                                                -> key.enabled == 1, inc to 2
                                              return
                                            ** static key is wrong!
      jump_label_update()
      jump_label_unlock()

    Testing the static key at the point marked by (**) will follow the
    wrong path for jumps that have not been patched yet. This can
    actually happen when creating many KVM virtual machines with userspace
    LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

    Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
    The static key's purpose is to skip NULL pointer checks and indeed one
    of the processes eventually dereferences NULL.

    As explained in the commit that introduced the bug:

    706249c222f6 ("locking/static_keys: Rework update logic")

    jump_label_update() needs key.enabled to be true. The solution adopted
    here is to temporarily make key.enabled == -1, and go down the slow
    path when key.enabled <= 0.

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 706249c222f6 ("locking/static_keys: Rework update logic")
    Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
    [ Small stylistic edits to the changelog and the code. ]
    Signed-off-by: Ingo Molnar

    Paolo Bonzini
     
  • While testing the deadline scheduler + cgroup setup I hit this
    warning.

    [ 132.612935] ------------[ cut here ]------------
    [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
    [ 132.612952] Modules linked in: (a ton of modules...)
    [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
    [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
    [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
    [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
    [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
    [ 132.612986] Call Trace:
    [ 132.612987] [] dump_stack+0x63/0x85
    [ 132.612994] [] __warn+0xcb/0xf0
    [ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.612999] [] warn_slowpath_null+0x1d/0x20
    [ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
    [ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
    [ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
    [ 132.613015] [] put_css_set+0x5c/0x60
    [ 132.613016] [] cgroup_free+0x7f/0xa0
    [ 132.613017] [] __put_task_struct+0x42/0x140
    [ 132.613018] [] dl_task_timer+0xca/0x250
    [ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.613030] [] __hrtimer_run_queues+0xee/0x270
    [ 132.613031] [] hrtimer_interrupt+0xa8/0x190
    [ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
    [ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
    [ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
    [ 132.613038] [] ? native_safe_halt+0x6/0x10
    [ 132.613043] [] default_idle+0x1e/0xd0
    [ 132.613044] [] arch_cpu_idle+0xf/0x20
    [ 132.613046] [] default_idle_call+0x2a/0x40
    [ 132.613047] [] cpu_startup_entry+0x2e7/0x340
    [ 132.613048] [] start_secondary+0x155/0x190
    [ 132.613049] ---[ end trace f91934d162ce9977 ]---

    The warning is the spin_(lock|unlock)_bh(&css_set_lock) in interrupt
    context. Convert the spin_lock_bh to spin_lock_irq(save) to avoid
    this problem - and other problems of sharing a spinlock with an
    interrupt.

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Cc: cgroups@vger.kernel.org
    Cc: stable@vger.kernel.org # 4.5+
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Rik van Riel
    Reviewed-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Daniel Bristot de Oliveira
     
  • None of the code actually wants a thread_info, it all wants a
    task_struct, and it's just converting back and forth between the two
    ("ti->task" to get the task_struct from the thread_info, and
    "task_thread_info(task)" to go the other way).

    No semantic change.

    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Two fixes for the tracing system:

    - When trace_printk() is used with a non constant format descriptor,
    it adds a NULL pointer into the trace format section, and the code
    isn't prepared to deal with it. This bug appeared by a change that
    was added in v3.5.

    - The ftracetest (selftests section) can't handle testing histograms
    when histograms are not configured. Currently it shows that they
    fail the test, when they should state that they are unsupported.
    This bug was added in the 4.7 merge window with the addition of the
    histogram code"

    * tag 'trace-v4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftracetest: Fix hist unsupported result in hist selftests
    tracing: Handle NULL formats in hold_module_trace_bprintk_format()

    Linus Torvalds
     

20 Jun, 2016

2 commits

  • If a task uses a non constant string for the format parameter in
    trace_printk(), then the trace_printk_fmt variable is set to NULL. This
    variable is then saved in the __trace_printk_fmt section.

    The function hold_module_trace_bprintk_format() checks whether duplicate
    formats are used by modules and reuses them if so (saving new ones to the
    list). But this function calls lookup_format(), which does a strcmp()
    against the value (which is now NULL) and can cause a kernel oops.

    This wasn't an issue until 3debb0a9ddb ("tracing: Fix trace_printk() to print
    when not using bprintk()"), which added "__used" to the trace_printk_fmt
    variable; before that, the kernel simply optimized it out (no NULL value
    was saved).

    The fix is simply to handle the NULL pointer in lookup_format() and have the
    caller ignore the value if it was NULL.

    Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

    Reported-by: xingzhen
    Acked-by: Namhyung Kim
    Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
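The fix described above can be illustrated with a small userspace sketch. The names below (lookup_format(), hold_format(), the fixed-size list) are simplified stand-ins for the kernel code, not the actual implementation; the point is only the NULL check before strcmp():

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's saved-format list. */
static const char *saved_fmts[16];
static int nr_saved;

/* NULL-safe lookup: a NULL fmt (from a non-constant trace_printk()
 * format) must never reach strcmp(), so bail out early. */
static const char *lookup_format(const char *fmt)
{
    if (!fmt)               /* the fix: tolerate NULL entries */
        return NULL;
    for (int i = 0; i < nr_saved; i++)
        if (strcmp(saved_fmts[i], fmt) == 0)
            return saved_fmts[i];
    return NULL;
}

/* Caller mirrors hold_module_trace_bprintk_format(): reuse a
 * duplicate if found, save new non-NULL formats, ignore NULL. */
static const char *hold_format(const char *fmt)
{
    const char *found = lookup_format(fmt);

    if (found)
        return found;
    if (fmt && nr_saved < 16)
        saved_fmts[nr_saved++] = fmt;
    return fmt;
}
```

Before the fix, the equivalent of lookup_format(NULL) would pass NULL straight into strcmp() and oops.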
     
  • As per commit:

    b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > removed_load = 1;
    > }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0 test %rax,%rax
    > ffffffff8108706e: 74 40 je ffffffff810870b0
    > ffffffff81087070: 4c 89 f8 mov %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar divide-by-zero from
    task_h_load(), which instantly triggered recollection of this problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the sign
    bit. This reduces the effective width of the variables; IOW, it's
    effectively the same as having these variables be of signed type.

    This patch switches to a different means of unsigned underflow
    detection that does not rely on the sign bit. This allows the
    variables to use the 'full' unsigned range. And it does so with an
    explicit LOAD - STORE to ensure any intermediate value will never be
    visible in memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
    maybe we should do clamping on add too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
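The scheme described above boils down to one explicit load, a subtract clamped at zero, and one explicit store. A minimal userspace sketch of the idea follows; the kernel's actual helper differs in detail and uses READ_ONCE()/WRITE_ONCE() for the load and store, which this plain-C version elides:

```c
/* Clamped subtract on an unsigned counter. Underflow is detected by
 * comparing the result against the starting value (unsigned wrap
 * makes the result larger) instead of relying on a sign bit, and
 * only the final value is published, so no intermediate (possibly
 * underflowed) value ever hits memory. */
static void sub_positive(unsigned long *ptr, unsigned long val)
{
    unsigned long var = *ptr;       /* single explicit LOAD  */
    unsigned long res = var - val;

    if (res > var)                  /* wrapped => underflow  */
        res = 0;

    *ptr = res;                     /* single explicit STORE */
}
```

Because the check is `res > var` rather than `(long)res < 0`, the counter keeps its full unsigned range, which is exactly the point made in the message above.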
     

17 Jun, 2016

2 commits

  • If percpu_ref initialization fails during css_create(), the free path
    can end up trying to free css->id of zero. As ID 0 is unused, it
    doesn't cause a critical breakage but it does trigger a warning
    message. Fix it by setting css->id to -1 from init_and_link_css().

    Signed-off-by: Tejun Heo
    Cc: Wenwei Tao
    Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Tejun Heo

    Tejun Heo
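The pattern behind this fix is to initialize the ID to an invalid sentinel so the error path can tell "never allocated" from "ID 0". A hedged sketch of that pattern (the struct, field, and function names are illustrative, not the cgroup code itself):

```c
#define INVALID_ID (-1)

struct obj {
    int id;
};

/* Mirror of init_and_link_css(): start from a sentinel so a
 * failure before ID allocation leaves id < 0, not 0. */
static void obj_init(struct obj *o)
{
    o->id = INVALID_ID;
}

/* Mirror of the free path: only release an ID that was really
 * allocated; returns 1 if an ID was released, 0 otherwise. */
static int obj_release_id(struct obj *o)
{
    if (o->id < 0)
        return 0;       /* nothing allocated, nothing to free */
    /* ... release o->id here ... */
    o->id = INVALID_ID;
    return 1;
}
```

With a zero-initialized struct instead, the free path would try to release ID 0 after an early failure, which is the warning the patch silences.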
     
  • With commit e9d867a67fd03ccc ("sched: Allow per-cpu kernel threads to
    run on online && !active"), __set_cpus_allowed_ptr() expects that only
    strict per-cpu kernel threads can have affinity to an online CPU which
    is not yet active.

    This assumption is currently broken in the CPU_ONLINE notification
    handler for the workqueues, where restore_unbound_workers_cpumask()
    calls set_cpus_allowed_ptr() when the first CPU in the unbound
    worker's pool->attr->cpumask comes online. Since
    set_cpus_allowed_ptr() is called with a pool->attr->cpumask in which
    the only online CPU is not yet active, we get the following
    WARN_ON during a CPU online operation.

    ------------[ cut here ]------------
    WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
    __set_cpus_allowed_ptr+0x228/0x2e0
    Modules linked in:
    CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4

    Call Trace:
    [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
    [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
    [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
    [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
    [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
    [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
    [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
    [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
    [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
    [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
    ---[ end trace 00f1456578b2a3b2 ]---

    This patch fixes this by limiting the mask to the intersection of
    the pool affinity and online CPUs.

    Changelog-cribbed-from: Gautham R. Shenoy
    Reported-by: Abdul Haleem
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Tejun Heo

    Peter Zijlstra
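The fix amounts to intersecting the pool's configured affinity mask with the set of online CPUs before calling set_cpus_allowed_ptr(); in the kernel this is a cpumask_and() on the pool's mask. A toy plain-C bitmask sketch of the same idea (the type and function names here are illustrative):

```c
typedef unsigned long cpumask_t;   /* one bit per CPU, toy version */

/* Intersect the unbound pool's configured affinity with the CPUs
 * that are actually online, as restore_unbound_workers_cpumask()
 * does before handing the mask to set_cpus_allowed_ptr(). */
static cpumask_t restrict_to_online(cpumask_t pool_mask,
                                    cpumask_t online_mask)
{
    return pool_mask & online_mask;
}
```

Passing the raw pool mask through would still include online-but-not-active CPUs, which is exactly what trips the WARN_ON in __set_cpus_allowed_ptr() above.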
     

16 Jun, 2016

2 commits

  • Similar to bpf_perf_event_output(), the bpf_perf_event_read() helper
    needs to check the type of the perf_event before reading the counter.

    Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
    Reported-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The ctx structure passed into bpf programs differs depending on the bpf
    program type. The verifier incorrectly marked ctx->data and ctx->data_end
    accesses based on ctx offset only. That caused loads in tracing programs,
    e.g.
    int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
    to be incorrectly marked as PTR_TO_PACKET, which later caused the verifier
    to reject a program that was actually valid in a tracing context.
    Fix this by doing program-type-specific matching of ctx offsets.

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Reported-by: Sasha Goldshtein
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
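The fix can be pictured as making the data/data_end recognition depend on program type, not on ctx offset alone: only program types whose context actually carries packet pointers may mark those offsets PTR_TO_PACKET. A simplified illustration follows; the enum values, offsets, and function are illustrative stand-ins, not the verifier's real code:

```c
#include <stdbool.h>

enum prog_type { PROG_SOCKET_FILTER, PROG_KPROBE };

/* Hypothetical offsets of data/data_end in a packet-carrying ctx. */
#define CTX_DATA_OFF      76
#define CTX_DATA_END_OFF  80

/* Before the fix, the verifier matched on offset alone, so a kprobe
 * program loading pt_regs fields at the same offsets was wrongly
 * typed PTR_TO_PACKET. Matching on program type as well fixes it. */
static bool is_packet_ptr_load(enum prog_type type, int off)
{
    if (type != PROG_SOCKET_FILTER)
        return false;
    return off == CTX_DATA_OFF || off == CTX_DATA_END_OFF;
}
```

Under this check, the `ctx->ax` load from the tracing example above is no longer misclassified, so the program verifies in its own context.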
     

15 Jun, 2016

1 commit

  • Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
    private data"), a debugfs file's file_operations methods get proxied
    through lifetime aware wrappers.

    However, only a certain subset of the file_operations members is supported
    by debugfs and ->mmap isn't among them -- it appears to be NULL from the
    VFS layer's perspective.

    This behaviour breaks the /sys/kernel/debug/kcov file introduced
    concurrently with commit 5c9a8750a640 ("kernel: add kcov code coverage").

    Since that file never gets removed, there is no file removal race and thus,
    a lifetime checking proxy isn't needed.

    Avoid the proxying for /sys/kernel/debug/kcov by creating it via
    debugfs_create_file_unsafe() rather than debugfs_create_file().

    Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
    Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
    Reported-by: Sasha Levin
    Signed-off-by: Nicolai Stange
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     

14 Jun, 2016

2 commits

  • Lengthy output of sysrq-w may take a lot of time on a slow serial
    console.

    Currently we reset the NMI watchdog on the current CPU to avoid spurious
    lockup messages. Sometimes this doesn't work, since the softlockup
    watchdog might trigger on another CPU which is waiting for an IPI to
    proceed. We reset softlockup watchdogs on all CPUs, but we do this only
    after listing all tasks, and this may be too late on a busy system.

    So, reset the watchdogs on all CPUs earlier, in the
    for_each_process_thread() loop.

    Signed-off-by: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

    The printk buffer shows:

    BUG: spinlock recursion on CPU#1, swapper/1/0
    lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    ...
    Call Trace:
    [] dump_stack+0x85/0xc2
    [] spin_dump+0x78/0xc0
    [] do_raw_spin_lock+0x11a/0x150
    [] _raw_spin_lock+0x61/0x80
    [] ? try_to_wake_up+0x256/0x4e0
    [] try_to_wake_up+0x256/0x4e0
    [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
    [] wake_up_process+0x15/0x20
    [] insert_work+0x84/0xc0
    [] __queue_work+0x18f/0x660
    [] queue_work_on+0x46/0x90
    [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
    [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
    [] soft_cursor+0x1ad/0x230
    [] bit_cursor+0x649/0x680
    [] ? update_attr.isra.2+0x90/0x90
    [] fbcon_cursor+0x14a/0x1c0
    [] hide_cursor+0x28/0x90
    [] vt_console_print+0x3bf/0x3f0
    [] call_console_drivers.constprop.24+0x183/0x200
    [] console_unlock+0x3d4/0x610
    [] vprintk_emit+0x3c5/0x610
    [] vprintk_default+0x29/0x40
    [] printk+0x57/0x73
    [] enqueue_entity+0xc2e/0xc70
    [] enqueue_task_fair+0x59/0xab0
    [] ? kvm_sched_clock_read+0x9/0x20
    [] ? sched_clock+0x9/0x10
    [] activate_task+0x5c/0xa0
    [] ttwu_do_activate+0x54/0xb0
    [] sched_ttwu_pending+0x7a/0xb0
    [] scheduler_ipi+0x61/0x170
    [] smp_trace_reschedule_interrupt+0x4f/0x2a0
    [] trace_reschedule_interrupt+0x96/0xa0
    [] ? native_safe_halt+0x6/0x10
    [] ? trace_hardirqs_on+0xd/0x10
    [] default_idle+0x20/0x1a0
    [] arch_cpu_idle+0xf/0x20
    [] default_idle_call+0x2f/0x50
    [] cpu_startup_entry+0x37e/0x450
    [] start_secondary+0x160/0x1a0

    Note the hang only occurs when echoing the above from a physical serial
    console, not from an ssh session.

    The bug is caused by a deadlock where the task is trying to grab the rq
    lock twice, because printk() calls aren't safe in scheduler code.

    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf