07 Dec, 2015

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "This update contains the following changes:

    - Fix a signal handling regression in the bit wait functions.

    - Avoid false positive warnings in the wakeup path.

    - Initialize the scheduler root domain properly.

    - Handle gtime calculations in /proc/$PID/stat properly.

    - Add more documentation for the barriers in try_to_wake_up().

    - Fix a subtle race in try_to_wake_up() which might cause a task to
    be scheduled on two CPUs.

    - Compile static helper function only when it is used"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
    sched/core: Better document the try_to_wake_up() barriers
    sched/cputime: Fix invalid gtime in proc
    sched/core: Clear the root_domain cpumasks in init_rootdomain()
    sched/core: Remove false-positive warning from wake_up_process()
    sched/wait: Fix signal handling in bit wait helpers
    sched/rt: Hide the push_irq_work_func() declaration

    Linus Torvalds
     

04 Dec, 2015

8 commits

  • Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
    that we'll prematurely continue with the wakeup and effectively run p on
    two CPUs at the same time.

    Even though the overlap is very limited (the task is in the middle of
    being scheduled out), it could still result in corruption of the
    scheduler data structures.

    CPU0                            CPU1

    set_current_state(...)

      context_switch(X, Y)
        prepare_lock_switch(Y)
          Y->on_cpu = 1;
        finish_lock_switch(X)
          store_release(X->on_cpu, 0);

                                    try_to_wake_up(X)
                                      LOCK(p->pi_lock);

                                      t = X->on_cpu; // 0

      context_switch(Y, X)
        prepare_lock_switch(X)
          X->on_cpu = 1;
        finish_lock_switch(Y)
          store_release(Y->on_cpu, 0);

    schedule();
      deactivate_task(X);
      X->on_rq = 0;

                                      if (X->on_rq) // false

                                      if (t) while (X->on_cpu)
                                        cpu_relax();

      context_switch(X, ..)
        finish_lock_switch(X)
          store_release(X->on_cpu, 0);

    Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
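
The required ordering of the two loads can be sketched in userspace C11 atomics. This is only an illustration, not the kernel code: `struct task_sketch` and `waker_may_proceed()` are hypothetical names, the function returns a flag instead of spinning, and `atomic_thread_fence(memory_order_acquire)` is an assumed stand-in for a read barrier placed between the two loads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two task fields discussed above. */
struct task_sketch {
    atomic_int on_rq;   /* task is queued on a runqueue */
    atomic_int on_cpu;  /* task is still executing on some CPU */
};

/*
 * Waker side: report whether it is safe to go on with the wakeup.
 * The on_rq load must be performed before the on_cpu load; the fence
 * keeps the second load from being satisfied ahead of the first.
 */
static bool waker_may_proceed(struct task_sketch *p)
{
    assert(p != NULL);

    if (atomic_load_explicit(&p->on_rq, memory_order_relaxed))
        return false;   /* still queued, nothing to migrate */

    /* Assumed stand-in for a read barrier between the two loads. */
    atomic_thread_fence(memory_order_acquire);

    /* Only now look at on_cpu; a non-zero value means "keep waiting". */
    return atomic_load_explicit(&p->on_cpu, memory_order_relaxed) == 0;
}
```

Without the fence, nothing would stop the on_cpu value from being sampled early, which is the stale `t = X->on_cpu; // 0` observation in the diagram.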

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Explain how the control dependency and smp_rmb() end up providing
    ACQUIRE semantics and pair with smp_store_release() in
    finish_lock_switch().
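
As a rough userspace analogy of that pairing, the sketch below uses C11 atomics: a release store on the scheduling-out side, and a relaxed spin followed by an acquire fence on the waker side. The names `struct switch_sketch`, `finish_switch_sketch()` and `wait_off_cpu_sketch()` are hypothetical, and the acquire fence is an assumed substitute for the control dependency plus smp_rmb() combination.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task: one plain field published via the on_cpu flag. */
struct switch_sketch {
    int saved_state;    /* plain data written before the release */
    atomic_int on_cpu;  /* 1 while the task is still on a CPU */
};

/* Scheduling-out side: publish all earlier stores, then clear on_cpu. */
static void finish_switch_sketch(struct switch_sketch *prev, int state)
{
    assert(prev != NULL);
    prev->saved_state = state;
    atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
}

/* Waker side: spin until on_cpu reads 0, then upgrade to acquire. */
static int wait_off_cpu_sketch(struct switch_sketch *p)
{
    assert(p != NULL);
    while (atomic_load_explicit(&p->on_cpu, memory_order_relaxed))
        ;   /* the kernel would call cpu_relax() here */

    /* Pairs with the release store above: later loads see its data. */
    atomic_thread_fence(memory_order_acquire);
    return p->saved_state;
}
```

The relaxed load followed by an acquire fence synchronizes with the release store, so the read of `saved_state` is guaranteed to observe the value written before `on_cpu` was cleared.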

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • /proc/$PID/stat shows an invalid gtime when the thread is running in guest mode.
    When vtime accounting is not enabled, we cannot get a valid delta.
    The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
    is only updated when vtime accounting is runtime enabled.

    This patch makes task_gtime() just return gtime without computing the
    buggy non-existing tickless delta when vtime accounting is not enabled.

    Use context_tracking_is_enabled() to check whether vtime accounting is
    active on some CPU; only in that case do we need to check the tickless
    delta. This way we fix the gtime value regression on machines not running
    nohz full.
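
The idea can be sketched in plain C as follows. This is a simplified model, not the kernel function: `struct gtime_sketch`, `task_gtime_sketch()` and the `vtime_enabled` parameter are hypothetical, with the flag standing in for the context_tracking_is_enabled() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the per-task accounting state. */
struct gtime_sketch {
    uint64_t gtime;       /* guest time accumulated so far */
    uint64_t vtime_snap;  /* last snapshot; stale unless vtime runs */
    bool in_guest;        /* task is currently running guest code */
};

/*
 * Return the guest time. The tickless delta (now - vtime_snap) is only
 * meaningful while vtime accounting is enabled, so skip it otherwise.
 */
static uint64_t task_gtime_sketch(const struct gtime_sketch *t,
                                  bool vtime_enabled, uint64_t now)
{
    assert(t != NULL);

    if (!vtime_enabled)
        return t->gtime;  /* vtime_snap was never updated */

    uint64_t gtime = t->gtime;
    if (t->in_guest)
        gtime += now - t->vtime_snap;
    return gtime;
}
```

With accounting disabled, a stale `vtime_snap` of zero no longer inflates the result by the whole value of `now`.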

    The test kernel was built with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
    CONFIG_NO_HZ_FULL_ALL=n, and booted without nohz_full.

    I ran and then stopped a busy loop in a VM and watched the gtime on the host.
    The 43rd field of the stat file is the gtime; dump it every second:

    # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
    S 4348
    R 7064566
    R 7064766
    R 7064967
    R 7065168
    S 4759
    S 4759

    While the busy loop is running, a bogus large value is returned.

    After applying this patch, the gtime is correct:

    # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
    S 5338
    R 5365
    R 5465
    R 5566
    R 5666
    S 5726
    S 5726

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     
  • root_domain::rto_mask allocated through alloc_cpumask_var()
    contains garbage data, which may cause problems. For instance,
    pull_rt_task() may do useless iterations if rto_mask retains
    some extra garbage bits. Worse still, this violates the
    isolated domain rule for clustered scheduling using cpuset,
    because tasks (with all the CPUs allowed) that belong to one
    root domain can be pulled away into another root domain.

    The patch cleans the garbage by using zalloc_cpumask_var()
    instead of alloc_cpumask_var() for root_domain::rto_mask
    allocation, thereby addressing the issues.

    Do the same thing for root_domain's other cpumask members:
    dlo_mask, span, and online.
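
A userspace sketch of why a zeroing allocator matters for a bitmask: `calloc()` plays the role of zalloc_cpumask_var() here. The helper names and the `NR_CPUS_SKETCH` size are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 256  /* hypothetical mask width */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS     ((NR_CPUS_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Zeroing allocation: every bit starts out cleared. */
static unsigned long *zalloc_mask_sketch(void)
{
    return calloc(MASK_WORDS, sizeof(unsigned long));
}

static void mask_set_sketch(unsigned long *mask, unsigned int cpu)
{
    assert(mask != NULL && cpu < NR_CPUS_SKETCH);
    mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/* Count the set bits, i.e. the CPUs an iteration would visit. */
static unsigned int mask_weight_sketch(const unsigned long *mask)
{
    unsigned int weight = 0;

    assert(mask != NULL);
    for (size_t w = 0; w < MASK_WORDS; w++)
        for (unsigned long v = mask[w]; v; v &= v - 1)
            weight++;
    return weight;
}
```

With a non-zeroing allocation the initial weight would be unpredictable, so a loop over the mask could visit CPUs that were never marked.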

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     
  • Because wakeups can (fundamentally) be late, a task might not be in
    the expected state. Therefore testing against a task's state is racy,
    and can yield false positives.

    Signed-off-by: Sasha Levin
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: oleg@redhat.com
    Fixes: 9067ac85d533 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
    Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
    Signed-off-by: Ingo Molnar

    Sasha Levin
     
  • Vladimir reported getting RCU stall warnings and bisected it back to
    commit:

    743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")

    That commit inadvertently reversed the calls to schedule() and signal_pending(),
    thereby not handling the case where the signal is received while we sleep.
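
The effect of the call order can be modelled with a small stub. Everything here is hypothetical: `struct waiter_sketch`, `fake_schedule()` and both `bit_wait_*_sketch()` functions only mimic the shape of the helpers, and the -EINTR return value is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter state for the model. */
struct waiter_sketch {
    bool signal_pending;
    bool signal_arrives_during_sleep;
};

/* Stand-in for schedule(): a signal may show up while we sleep. */
static void fake_schedule(struct waiter_sketch *w)
{
    if (w->signal_arrives_during_sleep)
        w->signal_pending = true;
}

/* Reversed order: the check runs before the sleep and misses the signal. */
static int bit_wait_buggy_sketch(struct waiter_sketch *w)
{
    assert(w != NULL);
    if (w->signal_pending)
        return -EINTR;
    fake_schedule(w);
    return 0;
}

/* Intended order: sleep first, then look for a pending signal. */
static int bit_wait_fixed_sketch(struct waiter_sketch *w)
{
    assert(w != NULL);
    fake_schedule(w);
    if (w->signal_pending)
        return -EINTR;
    return 0;
}
```

In the reversed variant the caller sees success and goes back to sleep with the signal still pending, which matches the kind of stall that was reported.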

    Reported-by: Vladimir Murzin
    Tested-by: Vladimir Murzin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: mark.rutland@arm.com
    Cc: neilb@suse.de
    Cc: oleg@redhat.com
    Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
    Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
    Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull networking fixes from David Miller:
    "A lot of Thanksgiving turkey leftovers accumulated, here goes:

    1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

    2) IDs for some new iwlwifi chips, from Oren Givon.

    3) Fix rtlwifi lockups on boot, from Larry Finger.

    4) Fix memory leak in fm10k, from Stephen Hemminger.

    5) We have a route leak in the ipv6 tunnel infrastructure, fix from
    Paolo Abeni.

    6) Fix buffer pointer handling in arm64 bpf JIT, from Zi Shen Lim.

    7) Wrong lockdep annotations in tcp md5 support, fix from Eric
    Dumazet.

    8) Work around some middle boxes which prevent proper handling of TCP
    Fast Open, from Yuchung Cheng.

    9) TCP repair can do huge kmalloc() requests, build paged SKBs
    instead. From Eric Dumazet.

    10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
    Borkmann.

    11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
    Nikolay Aleksandrov.

    12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
    Weikusat.

    13) Fix double free in VRF code, from Nikolay Aleksandrov.

    14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

    15) Fix ifup/ifdown crash in xgene driver, from Iyappan Subramanian.

    16) Fix clearing of persistent array maps in bpf, from Daniel
    Borkmann.

    17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
    early enough. From Eric Dumazet.

    18) Fix out of bounds accesses in bpf array implementation when
    updating elements, from Daniel Borkmann.

    19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
    Dumazet.

    20) When dumping proxy neigh entries, we have to accommodate NULL
    device pointers properly, from Konstantin Khlebnikov.

    21) SCTP doesn't release all ipv6 socket resources properly, fix from
    Eric Dumazet.

    22) Prevent underflows of sch->q.qlen for multiqueue packet
    schedulers, also from Eric Dumazet.

    23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
    Huang and Michael Chan.

    24) Don't actively scan radar channels, from Antonio Quartulli"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
    net: phy: reset only targeted phy
    bnxt_en: Setup uc_list mac filters after resetting the chip.
    bnxt_en: enforce proper storing of MAC address
    bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
    net: lpc_eth: remove irq > NR_IRQS check from probe()
    net_sched: fix qdisc_tree_decrease_qlen() races
    openvswitch: fix hangup on vxlan/gre/geneve device deletion
    ipv4: igmp: Allow removing groups from a removed interface
    ipv6: sctp: implement sctp_v6_destroy_sock()
    arm64: bpf: add 'store immediate' instruction
    ipv6: kill sk_dst_lock
    ipv6: sctp: add rcu protection around np->opt
    net/neighbour: fix crash at dumping device-agnostic proxy entries
    sctp: use GFP_USER for user-controlled kmalloc
    sctp: convert sack_needed and sack_generation to bits
    ipv6: add complete rcu protection around np->opt
    bpf: fix allocation warnings in bpf maps and integer overflow
    mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
    net: mvneta: enable setting custom TX IP checksum limit
    net: mvneta: fix error path for building skb
    ...

    Linus Torvalds
     
  • Pull tracing fix from Steven Rostedt:
    "During the merge window I added a new file that is used to filter
    trace events on pids. It filters out all events except those from tasks
    whose pid is listed in that file. It also handles the sched_switch and
    sched_wakeup trace events where the current task does not have its pid
    in the file, but the task either being switched to or awakened does.

    Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both
    of these tracepoints use the same class as the sched_wakeup
    tracepoint, and they too should be included in what gets filtered by
    the set_event_pid file"

    * tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter

    Linus Torvalds
     

03 Dec, 2015

1 commit

  • For large map->value_size the user space can trigger memory allocation warnings like:
    WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
    __alloc_pages_nodemask+0x695/0x14e0()
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x68/0x92 lib/dump_stack.c:50
    [] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
    [] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
    [< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
    [] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
    [] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
    [< inline >] alloc_pages include/linux/gfp.h:451
    [] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
    [] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
    [] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
    [< inline >] kmalloc_large include/linux/slab.h:390
    [] __kmalloc+0x234/0x250 mm/slub.c:3525
    [< inline >] kmalloc include/linux/slab.h:463
    [< inline >] map_update_elem kernel/bpf/syscall.c:288
    [< inline >] SYSC_bpf kernel/bpf/syscall.c:744

    To avoid a kmalloc with order >= MAX_ORDER that can never succeed, check that
    elem->value_size and computed elem_size are within limits for both hash and
    array type maps.
    Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
    Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Dec, 2015

2 commits

  • During our own review, and as also reported by Dmitry's syzkaller [1], it
    has been noticed that we trigger a heap out-of-bounds access on eBPF array
    maps when updating elements. This happens with each map whose map->value_size
    (specified during map creation time) is not a multiple of 8 bytes.

    In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
    used to align array map slots for faster access. However, in function
    array_map_update_elem(), we update the element as ...

    memcpy(array->value + array->elem_size * index, value, array->elem_size);

    ... where we access 'value' out-of-bounds, since it was allocated from
    map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
    and later on copied through copy_from_user(value, uvalue, map->value_size).
    Thus, we can access up to 7 bytes out-of-bounds.

    Same could happen from within an eBPF program, where in worst case we
    access beyond an eBPF program's designated stack.

    Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") hasn't hit an
    official release yet, it only affects privileged users.

    In case of array_map_lookup_elem(), the verifier prevents eBPF programs
    from accessing beyond map->value_size through check_map_access(). Also
    from syscall side map_lookup_elem() only copies map->value_size back to
    user, so nothing could leak.
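
The size mismatch in the update path described above can be sketched in userspace C. The struct and function names are hypothetical; the point is only that slots are `elem_size` bytes apart while the copy length has to be `value_size`, the size of the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ROUND_UP8(x) (((x) + 7u) & ~7u)

/* Hypothetical, simplified array map. */
struct array_map_sketch {
    uint32_t value_size;   /* size requested at creation time */
    uint32_t elem_size;    /* value_size rounded up to 8 bytes */
    uint32_t max_entries;
    unsigned char *value;  /* max_entries slots of elem_size bytes */
};

static int array_map_init_sketch(struct array_map_sketch *m,
                                 uint32_t value_size, uint32_t max_entries)
{
    assert(m != NULL);
    m->value_size = value_size;
    m->elem_size = ROUND_UP8(value_size);
    m->max_entries = max_entries;
    m->value = calloc(max_entries, m->elem_size);
    return m->value ? 0 : -1;
}

/*
 * 'value' points at a buffer of exactly value_size bytes, so copying
 * elem_size bytes would read up to 7 bytes past its end.
 */
static void array_map_update_sketch(struct array_map_sketch *m,
                                    uint32_t index, const void *value)
{
    assert(m != NULL && value != NULL && index < m->max_entries);
    memcpy(m->value + (size_t)m->elem_size * index, value, m->value_size);
}
```

Slot addressing still uses `elem_size`, so the alignment benefit is kept while the source buffer is never read beyond `value_size`.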

    [1] http://github.com/google/syzkaller

    Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The set_event_pid filter relies on attaching to the sched_switch and
    sched_wakeup tracepoints to see if it should filter the tracing on schedule
    tracepoints. By adding the callbacks to sched_wakeup, pids in the
    set_event_pid file will trace the wakeups of those tasks with those pids.

    But sched_wakeup_new and sched_waking were missed. These two should also be
    traced. Luckily, these tracepoints share the same class as sched_wakeup
    which means they can use the same pre and post callbacks as sched_wakeup
    does.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

01 Dec, 2015

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "I found two minor bugs while doing development on the ring buffer
    code.

    The first is something that's been there since its creation. If a
    reader reads a page out of the ring buffer before there's any events
    on it, it can get an out of date timestamp for that event. It may be
    off by a few microseconds, more if the first event gets discarded.
    The fix was to only update the reader time stamp when it actually sees
    an event on the page, instead of just reading the timestamp from the
    page even if it has no events on it. That timestamp is still volatile
    until an event is present.

    The second bug is more recent. Instead of passing around parameters a
    descriptor was made and the parameters are passed via a single
    descriptor. This simplified the code a bit. But there was one place
    that expected the parameter to be passed by value not reference (which
    a descriptor now does). And it added to the length of the event,
    which may be ignored later, but the length should not have been
    increased. The only real problem with this bug is that it may
    allocate more than was needed for the event"

    * tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Put back the length if crossed page with add_timestamp
    ring-buffer: Update read stamp with first real commit on page

    Linus Torvalds
     

26 Nov, 2015

1 commit

  • Currently, when having map file descriptors pointing to program arrays,
    there's still the issue that we unconditionally flush program array
    contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
    when such a file descriptor is released and is independent of the map's
    refcount.

    Having this flush independent of the refcount is for a reason: there
    can be arbitrarily complex dependency chains among tail calls, also circular
    ones (direct or indirect, nesting limit determined during runtime), and
    we need to make sure that the map drops all references to eBPF programs
    it holds, so that the map's refcount can eventually drop to zero and
    initiate its freeing. Btw, a walk of the whole dependency graph would
    not be possible for various reasons, one being complexity and another
    one inconsistency, i.e. new programs can be added to parts of the graph
    at any time, so there's no guaranteed consistent state for the time of
    such a walk.

    Now, the program array pinning itself works, but the issue is that each
    derived file descriptor on close would nevertheless call unconditionally
    into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
    this flush until the last reference to a user is dropped. As this only
    concerns a subset of references (f.e. a prog array could hold a program
    that itself has reference on the prog array holding it, etc), we need to
    track them separately.

    Short analysis on the refcounting: on map creation time usercnt will be
    one, so there's no change in behaviour for bpf_map_release(), if unpinned.
    If we already fail in map_create(), we are immediately freed, and no
    file descriptor has been made public yet. In bpf_obj_pin_user(), we need
    to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
    reference, so before we drop the reference on the fd with fdput().
    Therefore, if actual pinning fails, we need to drop that reference again
    in bpf_any_put(), otherwise we keep holding it. When last reference
    drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
    care of dropping the usercnt again. In the bpf_obj_get_user() case, the
    bpf_any_get() will grab a reference on the usercnt, still at a time when
    we have the reference on the path. Should we later on fail to grab a new
    file descriptor, bpf_any_put() will drop it, otherwise we hold it until
    bpf_map_release() time.
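
The two-counter scheme can be modelled roughly as follows. This is a single-threaded sketch with hypothetical names (`struct map_sketch`, `map_get_user_sketch()`, `map_put_user_sketch()`); a real implementation would need atomic counters and would free the map once `refcnt` reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical map with separate "all references" and "user" counts. */
struct map_sketch {
    int refcnt;       /* every reference, including ones from programs */
    int usercnt;      /* only fds and pinned inodes */
    int progs_held;   /* program references stored in the array */
};

/* Taken when a new fd is handed out or the map is pinned. */
static void map_get_user_sketch(struct map_sketch *m)
{
    assert(m != NULL);
    m->refcnt++;
    m->usercnt++;
}

/* Dropped on fd release or inode eviction. */
static void map_put_user_sketch(struct map_sketch *m)
{
    assert(m != NULL && m->usercnt > 0 && m->refcnt > 0);
    if (--m->usercnt == 0)
        m->progs_held = 0;  /* last user gone: flush the program array */
    m->refcnt--;
}
```

Closing one of several file descriptors therefore leaves the array contents alone; only the last user reference triggers the flush that breaks possible reference cycles.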

    Joint work with Alexei.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

25 Nov, 2015

1 commit

  • I got a crash during a "perf top" session that was caused by a race in
    __task_pid_nr_ns() :

    pid_nr_ns() was inlined, but apparently the compiler chose to read
    task->pids[type].pid twice, and the pid->level dereference crashed
    because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

    Just use RCU API properly to solve this race, and not worry about "perf
    top" crashing hosts :(

    get_task_pid() can benefit from same fix.
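
The underlying rule, loading the shared pointer exactly once and using only that local copy, can be sketched with C11 atomics. The types and `task_pid_nr_sketch()` are hypothetical, and the acquire load is an assumed userspace substitute for the RCU dereference primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, heavily simplified pid and task types. */
struct pid_sketch {
    int level;
    int nr;
};

struct task_pid_sketch {
    _Atomic(struct pid_sketch *) pid;  /* may be cleared concurrently */
};

/*
 * Load the pointer once into a local and use only the local. Testing
 * and dereferencing the shared field separately would allow the second
 * read to see NULL after the first one saw a valid pointer.
 */
static int task_pid_nr_sketch(struct task_pid_sketch *task, int ns_level)
{
    assert(task != NULL);

    struct pid_sketch *pid =
        atomic_load_explicit(&task->pid, memory_order_acquire);

    if (pid && ns_level <= pid->level)
        return pid->nr;
    return 0;
}
```

Because the atomic load cannot be split or repeated by the compiler, the NULL test and the `level` dereference are guaranteed to act on the same pointer value.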

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

24 Nov, 2015

2 commits

  • Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing
    data" added a descriptor that holds various data instead of passing around
    several variables through parameters. The problem was that one of the
    parameters was modified in a function and the code was designed not to have
    an effect on that modified parameter. Now that the parameter is a
    descriptor and any modifications to it are non-volatile, the size of the
    data could be unnecessarily expanded.

    Remove the extra space added if a timestamp was added and the event went
    across the page.

    Cc: stable@vger.kernel.org # 4.3+
    Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data"
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Do not update the read stamp after swapping out the reader page from the
    write buffer. If the reader page is swapped out of the buffer before an
    event is written to it, then the read_stamp may get an out of date
    timestamp, as the page timestamp is updated on the first commit to that
    page.

    rb_get_reader_page() only returns a page if it has an event on it, otherwise
    it will return NULL. At that point, check if the page being returned has
    events and has not been read yet, and if so update the read_stamp to match
    the time stamp of the reader page.

    Cc: stable@vger.kernel.org # 2.6.30+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

23 Nov, 2015

1 commit

  • The push_irq_work_func() function is conditionally defined only
    when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
    forward declaration remains visible without HAVE_RT_PUSH_IPI,
    causing a gcc warning in ARM64 allnoconfig:

    kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]

    This changes the code to use the same condition for both the
    declaration and the function definition, which gets rid of the
    warning.
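
The pattern is to guard the forward declaration, the definition and every use with the same preprocessor condition. In this sketch the two macros are used as plain preprocessor symbols, and `push_work_sketch()` and `rt_init_sketch()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* The declaration is guarded by the same condition as the definition. */
#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
static void push_work_sketch(int *pushes);
#endif

/* Hypothetical init path; it calls the helper only when it exists. */
static void rt_init_sketch(int *pushes)
{
    assert(pushes != NULL);
    *pushes = 0;
#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
    push_work_sketch(pushes);
#endif
}

#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
static void push_work_sketch(int *pushes)
{
    (*pushes)++;
}
#endif
```

When either symbol is undefined, no trace of the static helper remains in the translation unit, so the compiler has nothing to warn about.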

    As Peter Zijlstra noted, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
    thing after:

    8053871d0f7f ("smp: Fix smp_call_function_single_async() locking")

    Until that is done, this patch can be used to avoid the warning.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
    Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
    Signed-off-by: Ingo Molnar

    Arnd Bergmann
     

21 Nov, 2015

2 commits

  • Commit 08d78658f393 ("panic: release stale console lock to always get the
    logbuf printed out") introduced an unwanted bad unlock balance report when
    panic() is called directly and not from OOPS (e.g. from out_of_memory()).
    The difference is that in case of OOPS we disable locks debug in
    oops_enter() and on direct panic call nobody does that.

    Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out")
    Reported-by: kernel test robot
    Signed-off-by: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Masami Hiramatsu
    Cc: Jiri Kosina
    Cc: Baoquan He
    Cc: Prarit Bhargava
    Cc: Xie XiuQi
    Cc: Seth Jennings
    Cc: "K. Y. Srinivasan"
    Cc: Jan Kara
    Cc: Petr Mladek
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • sigsuspend() is not used anywhere except in signal.c itself, so we can mark
    it static so that it does not pollute the global namespace.

    But this patch is more than a boring cleanup patch, it fixes a real issue
    on UserModeLinux. UML has a special console driver to display ttys using
    xterm, or other terminal emulators, on the host side. Vegard reported
    that sometimes UML is unable to spawn a xterm and he's facing the
    following warning:

    WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()

    It turned out that this warning makes absolutely no sense, as the UML
    xterm code calls sigsuspend() on the host side, or at least tries to. But
    as the kernel itself offers a sigsuspend() symbol, the linker chose this
    one instead of the glibc wrapper. Interestingly, this code has always
    appeared to work, but it blocked signals on the wrong side. Some recent
    kernel change made the WARN_ON() trigger and uncovered the bug.

    It is a wonderful example of how much works by chance on computers. :-)

    Fixes: 68f3f16d9ad0f1 ("new helper: sigsuspend()")
    Signed-off-by: Richard Weinberger
    Reported-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Acked-by: Oleg Nesterov
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     

20 Nov, 2015

1 commit


16 Nov, 2015

3 commits

  • Pull perf updates from Thomas Gleixner:
    "Mostly updates to the perf tool plus two fixes to the kernel core code:

    - Handle tracepoint filters correctly for inherited events (Peter
    Zijlstra)

    - Prevent a deadlock in perf_lock_task_context (Paul McKenney)

    - Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
    Melo)

    - Print full source file paths when using 'perf annotate --print-line
    --full-paths' (Michael Petlan)

    - Fix 'perf probe -d' when just one out of uprobes and kprobes is
    enabled (Wang Nan)

    - Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
    tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)

    - Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
    the 'perf test' LLVM entries, when running it in-tree, to
    .gitignore (Yunlong Song)

    - libbpf error reporting improvements, using a strerror interface to
    more precisely tell the user about problems with the provided
    scriptlet, be it in C or as a ready made object file (Wang Nan)

    - Do not be case sensitive when searching for matching 'perf test'
    entries (Arnaldo Carvalho de Melo)

    - Inform the user about objdump failures in 'perf annotate' (Andi
    Kleen)

    - Improve the LLVM 'perf test' entry, introduce new ones for BPF
    and kbuild tests to check the environment used by clang to compile
    .c scriptlets (Wang Nan)"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
    tools include: Add compiler.h to list.h
    perf probe: Verify parameters in two functions
    perf session: Add missing newlines to some pr_err() calls
    perf annotate: Support full source file paths for srcline fix
    perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
    perf: Fix inherited events vs. tracepoint filters
    perf: Disable IRQs across RCU RS CS that acquires scheduler lock
    perf test: Do not be case sensitive when searching for matching tests
    perf test: Add 'perf test BPF'
    perf test: Enhance the LLVM tests: add kbuild test
    perf test: Enhance the LLVM test: update basic BPF test program
    perf bpf: Improve BPF related error messages
    perf tools: Make fetch_kernel_version() publicly available
    bpf tools: Add new API bpf_object__get_kversion()
    bpf tools: Improve libbpf error reporting
    perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
    perf annotate: Inform the user about objdump failures in --stdio
    perf stat: Make stat options global
    perf sched latency: Fix thread pid reuse issue
    ...

    Linus Torvalds
     
  • Pull scheduler fix from Thomas Gleixner:
    "A single fix to prevent math underflow in the numa balancing code"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix math underflow in task_tick_numa()

    Linus Torvalds
     
  • ….kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull irq and timer fixes from Thomas Gleixner:

    - An irq regression fix to restore the wakeup behaviour of chained
    interrupts.

    - A timer fix for a long standing race versus timers scheduled on a
    target cpu which got exposed by recent changes in the workqueue
    implementation.

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/PM: Restore system wake up from chained interrupts

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers: Use proper base migration in add_timer_on()

    Linus Torvalds
     

13 Nov, 2015

2 commits

  • Pull trace cleanups from Steven Rostedt:
    "This contains three more clean up patches.

    One patch is needed to make tracing work without debugfs now that
    tracing uses its own tracefs.

    The second is removing an unused variable.

    The third is fixing a warning about unused variables when MAX_TRACER
    is not configured. Note, this warning shows up in gcc 6.0, but does
    not show up in gcc 4.9, as it seems that gcc does not complain about
    constants not being used"

    * tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: #ifdef out uses of max trace when CONFIG_TRACER_MAX_TRACE is not set
    tracing: Remove unused ftrace_cpu_disabled per cpu variable
    tracing: Make tracing work when debugfs is not configured in

    Linus Torvalds
     
  • Pull second batch of kvm updates from Paolo Bonzini:
    "Four changes:

    - x86: work around two nasty cases where a benign exception occurs
    while another is being delivered. The endless stream of exceptions
    causes an infinite loop in the processor, which not even NMIs or
    SMIs can interrupt; in the virt case, there is no possibility to
    exit to the host either.

    - x86: support for Skylake per-guest TSC rate. Long supported by
    AMD, the patches mostly move things from there to common
    arch/x86/kvm/ code.

    - generic: remove local_irq_save/restore from the guest entry and
    exit paths when context tracking is enabled. The patches are a few
    months old, but we discussed them again at kernel summit. Andy
    will pick up from here and, in 4.5, try to remove it from the user
    entry/exit paths.

    - PPC: Two bug fixes, see merge commit 370289756becc for details"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
    KVM: x86: rename update_db_bp_intercept to update_bp_intercept
    KVM: svm: unconditionally intercept #DB
    KVM: x86: work around infinite loop in microcode when #AC is delivered
    context_tracking: avoid irq_save/irq_restore on guest entry and exit
    context_tracking: remove duplicate enabled check
    KVM: VMX: Dump TSC multiplier in dump_vmcs()
    KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
    KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
    KVM: VMX: Enable and initialize VMX TSC scaling
    KVM: x86: Use the correct vcpu's TSC rate to compute time scale
    KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
    KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
    KVM: x86: Replace call-back compute_tsc_offset() with a common function
    KVM: x86: Replace call-back set_tsc_khz() with a common function
    KVM: x86: Add a common TSC scaling function
    KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
    KVM: x86: Collect information for setting TSC scaling ratio
    KVM: x86: declare a few variables as __read_mostly
    KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
    KVM: PPC: Book3S HV: Don't dynamically split core when already split
    ...

    Linus Torvalds
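
The "common TSC scaling function" named in the commit subjects above can be pictured as a fixed-point multiply. The following is only a sketch under stated assumptions, not the KVM code: the function names `tsc_ratio` and `scale_tsc` are hypothetical, the number of fractional bits is passed in as a parameter, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: express guest_khz / host_khz as a fixed-point
 * number with frac_bits fractional bits. The result is truncated to
 * 64 bits, so extreme ratios are out of scope for this sketch.
 */
uint64_t tsc_ratio(uint32_t guest_khz, uint32_t host_khz, unsigned frac_bits)
{
	assert(host_khz != 0 && frac_bits < 64);
	return (uint64_t)(((unsigned __int128)guest_khz << frac_bits) / host_khz);
}

/*
 * Hypothetical helper: scale a host TSC reading by the fixed-point
 * ratio. The 128-bit intermediate keeps the product from overflowing.
 */
uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio, unsigned frac_bits)
{
	assert(frac_bits < 64);
	return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> frac_bits);
}
```

With an assumed 48 fractional bits, a guest running at half the host rate gets a ratio of 2^47, and a host reading of 1000 scales to 500.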
     

12 Nov, 2015

1 commit

  • With kASLR enabled, old_addr provided by the patch module is shifted
    accordingly so that the symbol lookups work. To have module relocations
    handled properly as well, the same transformation needs to be performed
    on relocation address information.

    [jkosina@suse.cz: extended / reworded changelog a bit]
    Reported-by: Cyril B.
    Signed-off-by: Zhou Chengming
    Acked-by: Josh Poimboeuf
    Signed-off-by: Jiri Kosina

    Zhou Chengming
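
The transformation described in the commit above amounts to adding the same load offset to relocation values that is already added to symbol addresses. A minimal sketch, assuming a hypothetical `struct reloc_entry` and hypothetical helper names (the real livepatch structures are not reproduced here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one relocation record. */
struct reloc_entry {
	uintptr_t val;	/* address recorded at link time */
};

/* Shift a link-time address by the randomization offset. */
uintptr_t apply_kaslr(uintptr_t link_addr, uintptr_t offset)
{
	return link_addr + offset;
}

/* Apply the same shift to every relocation value in the array. */
void shift_relocs(struct reloc_entry *relocs, size_t n, uintptr_t offset)
{
	assert(relocs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		relocs[i].val = apply_kaslr(relocs[i].val, offset);
}
```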
     

11 Nov, 2015

3 commits

  • Pull networking fixes from David Miller:

    1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.

    2) Several spots need to get to the original listener for SYN-ACK
    packets; most spots got this right but some did not. Whilst covering
    the remaining cases, create a helper to do this. From Eric Dumazet.

    3) Missing check of return value from alloc_netdev() in CAIF SPI code,
    from Rasmus Villemoes.

    4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.

    5) Use after free in mvneta driver, from Justin Maggard.

    6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.

    7) Add missing ZLIB_INFLATE dependency for new qed driver. From Arnd
    Bergmann.

    8) Fix multicast getsockopt deadlock, from WANG Cong.

    9) Fix deadlock in btusb, from Kuba Pawlak.

    10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
    counter state. From Sabrina Dubroca.

    11) Fix packet_bind() race, which can cause lost notifications, from
    Francesco Ruggeri.

    12) Fix MAC restoration in qlcnic driver during bonding mode changes,
    from Jarod Wilson.

    13) Revert bridging forward delay change which broke libvirt and other
    userspace things, from Vlad Yasevich.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    Revert "bridge: Allow forward delay to be cfgd when STP enabled"
    bpf_trace: Make dependent on PERF_EVENTS
    qed: select ZLIB_INFLATE
    net: fix a race in dst_release()
    net: mvneta: Fix memory use after free.
    net: Documentation: Fix default value tcp_limit_output_bytes
    macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
    mvneta: add FIXED_PHY dependency
    net: caif: check return value of alloc_netdev
    net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
    drivers: net: xgene: fix RGMII 10/100Mb mode
    netfilter: nft_meta: use skb_to_full_sk() helper
    net_sched: em_meta: use skb_to_full_sk() helper
    sched: cls_flow: use skb_to_full_sk() helper
    netfilter: xt_owner: use skb_to_full_sk() helper
    smack: use skb_to_full_sk() helper
    net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
    bpf: doc: correct arch list for supported eBPF JIT
    dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
    bonding: fix panic on non-ARPHRD_ETHER enslave failure
    ...

    Linus Torvalds
     
  • Arnd Bergmann reported:

    In my ARM randconfig tests, I'm getting a build error for
    newly added code in bpf_perf_event_read and bpf_perf_event_output
    whenever CONFIG_PERF_EVENTS is disabled:

    kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
    kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
    if (event->oncpu != smp_processor_id() ||
    ^
    kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
    event->pmu->count)

    This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
    is disabled. I'm not sure if that is a configuration we care
    about, otherwise we could prevent this case from occurring by
    adding Kconfig dependencies.

    Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
    By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.

    Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfel
    Reported-by: Arnd Bergmann
    Signed-off-by: Steven Rostedt
    Signed-off-by: David S. Miller

    Steven Rostedt
     
  • Pull libnvdimm updates from Dan Williams:
    "Outside of the new ACPI-NFIT hot-add support this pull request is more
    notable for what it does not contain, than what it does. There were a
    handful of development topics this cycle, dax get_user_pages, dax
    fsync, and raw block dax, that need more iteration and will wait
    for 4.5.

    The patches to make devm and the pmem driver NUMA aware have been in
    -next for several weeks. The hot-add support has not, but is
    contained to the NFIT driver and is passing unit tests. The coredump
    support is straightforward and was looked over by Jeff. All of it has
    received a 0day build success notification across 107 configs.

    Summary:

    - Add support for the ACPI 6.0 NFIT hot add mechanism to process
    updates of the NFIT at runtime.

    - Teach the coredump implementation how to filter out DAX mappings.

    - Introduce NUMA hints for allocations made by the pmem driver, and
    as a side effect all devm allocations now hint their NUMA node by
    default"

    * tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    coredump: add DAX filtering for FDPIC ELF coredumps
    coredump: add DAX filtering for ELF coredumps
    acpi: nfit: Add support for hot-add
    nfit: in acpi_nfit_init, break on a 0-length table
    pmem, memremap: convert to numa aware allocations
    devm_memremap_pages: use numa_mem_id
    devm: make allocations numa aware by default
    devm_memremap: convert to return ERR_PTR
    devm_memunmap: use devres_release()
    pmem: kill memremap_pmem()
    x86, mm: quiet arch_add_memory()

    Linus Torvalds
     

10 Nov, 2015

8 commits

  • tracing_max_lat_fops is used only when TRACER_MAX_TRACE is enabled, so
    also switch the related code. The related warning with defconfig under
    x86_64:

    CC kernel/trace/trace.o
    kernel/trace/trace.c:5466:37: warning: ‘tracing_max_lat_fops’ defined but not used [-Wunused-const-variable]
    static const struct file_operations tracing_max_lat_fops = {

    Signed-off-by: Chen Gang
    Signed-off-by: Steven Rostedt

    Chen Gang
     
  • Commit e509bd7da149 ("genirq: Allow migration of chained interrupts
    by installing default action") breaks PCS wake up IRQ behaviour on
    TI OMAP based platforms (dra7-evm).

    TI OMAP IRQ wake up configuration:
    GIC-irqchip->PCM_IRQ
      |- omap_prcm_register_chain_handler
         |- PRCM-irqchip -> PRCM_IO_IRQ
            |- pcs_irq_chain_handler
               |- pinctrl-irqchip -> PCS_uart1_wakeup_irq

    This happens because IRQ PM code (irq/pm.c) is expected to ignore
    chained interrupts by default:
    static bool suspend_device_irq(struct irq_desc *desc)
    {
            if (!desc->action || desc->no_suspend_depth)
                    return false;
    - it's expected !desc->action = true for chained interrupts;

    but, after the above change, all chained interrupt descriptors will
    have a default action handler installed - chained_action.
    As a result, chained interrupts will be silently disabled during system
    suspend.

    Hence, fix it by introducing a helper function irq_desc_is_chained()
    and using it in suspend_device_irq() to identify chained interrupts
    and skip them.

    Fixes: e509bd7da149 ("genirq: Allow migration of chained interrupts..")
    Signed-off-by: Grygorii Strashko
    Reviewed-by: Mika Westerberg
    Cc: Tony Lindgren
    Cc: Tony Lindgren
    Link: http://lkml.kernel.org/r/1447149492-20699-1-git-send-email-grygorii.strashko@ti.com
    Signed-off-by: Thomas Gleixner

    Grygorii Strashko
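
The check described in the commit above can be sketched as follows. This is a simplified user-space model, not the kernel source: the two structures are cut down to the fields the message mentions, and the function bodies are an assumption about how such a helper could look.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct irqaction {
	const char *name;
};

struct irq_desc {
	struct irqaction *action;
	unsigned int no_suspend_depth;
};

/* Default action installed on every chained descriptor. */
struct irqaction chained_action = { .name = "chained" };

/* A descriptor is chained if its action is the shared default one. */
bool irq_desc_is_chained(const struct irq_desc *desc)
{
	assert(desc != NULL);
	return desc->action != NULL && desc->action == &chained_action;
}

/* Returns true if the interrupt should be disabled for suspend. */
bool suspend_device_irq(const struct irq_desc *desc)
{
	if (!desc->action || irq_desc_is_chained(desc) ||
	    desc->no_suspend_depth)
		return false;
	return true;
}
```

Comparing against the address of the shared default action keeps the test cheap and avoids adding a new flag to the descriptor.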
     
  • guest_enter and guest_exit must be called with interrupts disabled,
    since they take the vtime_seqlock with write_seq{lock,unlock}.
    Therefore, it is not necessary to check for exceptions, nor to
    save/restore the IRQ state, when context tracking functions are
    called by guest_enter and guest_exit.

    Split the body of context_tracking_entry and context_tracking_exit
    out to __-prefixed functions, and use them from KVM.

    Rik van Riel has measured this to speed up a tight vmentry/vmexit
    loop by about 2%.

    Cc: Andy Lutomirski
    Cc: Frederic Weisbecker
    Cc: Paul McKenney
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • All calls to context_tracking_enter and context_tracking_exit
    are already checking context_tracking_is_enabled, except the
    context_tracking_user_enter and context_tracking_user_exit
    functions left in for the benefit of assembly calls.

    Pull the check up to those functions, by making them simple
    wrappers around the user_enter and user_exit inline functions.

    Cc: Frederic Weisbecker
    Cc: Paul McKenney
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Acked-by: Andy Lutomirski
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Merge third patch-bomb from Andrew Morton:
    "We're pretty much done over here - I'm still waiting for a nouveau
    merge so I can cleanly finish up Christoph's dma-mapping rework.

    - bunch of small misc stuff

    - fold abs64() into abs(), remove abs64()

    - new_valid_dev() cleanups

    - binfmt_elf_fdpic feature work"

    * emailed patches from Andrew Morton: (24 commits)
    fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries
    fs/stat.c: remove unnecessary new_valid_dev() check
    fs/reiserfs/namei.c: remove unnecessary new_valid_dev() check
    fs/nilfs2/namei.c: remove unnecessary new_valid_dev() check
    fs/ncpfs/dir.c: remove unnecessary new_valid_dev() check
    fs/jfs: remove unnecessary new_valid_dev() checks
    fs/hpfs/namei.c: remove unnecessary new_valid_dev() check
    fs/f2fs/namei.c: remove unnecessary new_valid_dev() check
    fs/ext2/namei.c: remove unnecessary new_valid_dev() check
    fs/exofs/namei.c: remove unnecessary new_valid_dev() check
    fs/btrfs/inode.c: remove unnecessary new_valid_dev() check
    fs/9p: remove unnecessary new_valid_dev() checks
    include/linux/kdev_t.h: old/new_valid_dev() can return bool
    include/linux/kdev_t.h: remove unused huge_valid_dev()
    kmap_atomic_to_page() has no users, remove it
    drivers/scsi/cxgbi: fix build with EXTRA_CFLAGS
    dma: remove external references to dma_supported
    Documentation/sysctl/vm.txt: fix misleading code reference of overcommit_memory
    remove abs64()
    kernel.h: make abs() work with 64-bit types
    ...

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "Nothing exciting, minor tweaks and cleanups"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    scripts: [modpost] add new sections to white list
    modpost: Add flag -E for making section mismatches fatal
    params: don't ignore the rest of cmdline if parse_one() fails
    modpost: abort if a module symbol is too long

    Linus Torvalds
     
  • Switch everything to the new and more capable implementation of abs().
    Mainly to give the new abs() a bit of a workout.

    Cc: Michal Nazarewicz
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Masami Hiramatsu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
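
To illustrate what a single abs() that also handles 64-bit types can look like, here is a sketch using C11 `_Generic`. It is not the kernel's macro, which is not reproduced here; the name `abs_any` and the helper `distance` are hypothetical, and unsigned arguments are deliberately out of scope.

```c
#include <assert.h>
#include <stdlib.h>

/* The sketch relies on long long being at least 64 bits wide. */
static_assert(sizeof(long long) >= 8, "long long must hold 64-bit values");

/*
 * Hypothetical type-generic absolute value: pick the libc function
 * that matches the argument's signed type. Narrower signed types
 * fall through to abs(). Unsigned arguments are not supported.
 */
#define abs_any(x) _Generic((x), \
	long long: llabs,        \
	long: labs,              \
	default: abs)(x)

/* Example use: distance between two 64-bit values. */
long long distance(long long a, long long b)
{
	return abs_any(a - b);
}
```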
     
  • Dan Williams
     

09 Nov, 2015

2 commits

  • The NUMA balancing code implements delays in scanning by
    advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

    With unsigned math, that creates an underflow, which results
    in task_numa_work being queued all the time, even when we
    don't want it to be.

    Avoiding the math underflow makes it possible to reduce CPU
    overhead in the NUMA balancing code.

    Reported-and-tested-by: Jan Stancek
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/1446756983-28173-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
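
The underflow described in the commit above is easy to reproduce in isolation. In this sketch the function names are hypothetical and the arguments stand in for sum_exec_runtime, node_stamp and the scan period; it shows the general pattern of replacing a subtraction with a comparison of sums, and is not a copy of the scheduler code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrapping form: once stamp has been advanced past now, now - stamp
 * wraps around to a huge unsigned value and the test is always true.
 */
bool scan_due_wrapping(uint64_t now, uint64_t stamp, uint64_t period)
{
	return now - stamp > period;
}

/*
 * Underflow-free form: compare against the sum instead of taking a
 * difference. The assert guards the (unlikely) overflow of the sum.
 */
bool scan_due_safe(uint64_t now, uint64_t stamp, uint64_t period)
{
	assert(period <= UINT64_MAX - stamp);
	return now > stamp + period;
}
```

With now = 100, stamp = 500 and period = 50, the wrapping form reports that a scan is due while the safe form does not; once now passes stamp + period both agree.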
     
  • Arnaldo reported that tracepoint filters seem to misbehave (i.e. not
    apply) on inherited events.

    The fix is obvious; filters are only set on the actual (parent)
    event, use the normal pattern of using this parent event for filters.
    This is safe because each child event has a reference to it.

    Reported-by: Arnaldo Carvalho de Melo
    Tested-by: Arnaldo Carvalho de Melo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Adrian Hunter
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Frédéric Weisbecker
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Wang Nan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20151102095051.GN17308@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
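
The pattern described in the commit above, where a child event defers to its parent for the filter, can be sketched like this. The `struct event` layout, the predicate type and the function names are hypothetical simplifications, not the perf structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified event. */
struct event {
	struct event *parent;		/* NULL for a top-level event */
	bool (*filter)(int record);	/* only set on top-level events */
};

/* Returns true if the record passes the filter that applies to ev. */
bool filter_match(const struct event *ev, int record)
{
	assert(ev != NULL);

	/* Inherited events carry no filter of their own: use the parent's. */
	if (ev->parent)
		ev = ev->parent;

	return ev->filter == NULL || ev->filter(record);
}

/* Example predicate: accept even records only. */
bool only_even(int record)
{
	return record % 2 == 0;
}
```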